Learning a regular language by inferring a DFA with the TTT algorithm

fanf21 pts0 comments

Learning Regular Languages with the TTT Algorithm

TLDR; This tutorial is a complete implementation of the TTT algorithm<br>for active automata learning in Python. TTT combines the discrimination<br>tree of Kearns and Vazirani with binary search counterexample analysis<br>from Rivest and Schapire, and adds prefix transformation and discriminator<br>finalization to eliminate all redundant membership queries. The Python<br>interpreter is embedded so that you can work through the implementation<br>steps.

Why learn an input language?

Suppose you are given a piece of software. For example, a network protocol<br>implementation, a parser, or a security filter. You want to understand what<br>inputs it accepts. You have no access to the source code, and can only run it<br>and observe whether it accepts or rejects a given input.<br>This is the blackbox setting.

A naive answer is to try to test it exhaustively. But the set of all strings<br>accepted by even simple grammars is infinite.<br>A better approach is to infer a finite model, that is a DFA, that captures the<br>input behaviour exactly. Such a model is useful on its own. You can inspect<br>it, verify properties, generate tests from it, or compare it against a<br>specification to find discrepancies.<br>Active automata learning is the discipline of constructing this model<br>efficiently, using as few queries as possible.

In my previous post,<br>I implemented Angluin’s L* algorithm for learning regular languages from<br>a blackbox oracle. L* uses a flat observation table to track state<br>distinctions, which leads to redundant membership queries: when a<br>counterexample arrives, all its suffixes are added as columns even though<br>most distinguish no new states.

TTT is the state-of-the art algorithm for regular language inference. Using<br>this algorithm, you can infer the input language of any blackbox program<br>up to its regular approximation. It is much more faster than L*, and the<br>number of membership queries it generates (that is, the number of inputs<br>it needs to test the blackbox with) is provably non-redundant.

Several independent contributions are incorporated in the TTT algorithm.<br>Rivest and Schapire1 contributed the binary search<br>counterexample analysis, which finds the single relevant suffix in a<br>counterexample in \(O(\log k)\) queries (rather than \(k\) queries).<br>The introduction of discrimination tree as a replacement for the observation<br>table is due Kearns and Vazirani2.

TTT by Isberner, Howar and Steffen3<br>adds two further refinements: prefix transformation,<br>which keeps access sequences minimal, and discriminator finalization,<br>which keeps the discrimination tree shallow.<br>TTT is provably redundancy-free. That is, it never makes a membership query<br>whose answer could have been derived from earlier queries.

Language inference can also be applied to hardware. There are however,<br>other considerations in such settings. For example, it may not be possible<br>or even expensive to restart a system.<br>ADT4 is a notable extension of TTT, which uses adaptive<br>distinguishing sequences, and can reduce resets in hardware settings.

Definitions

Alphabet \(A\): the set of input symbols the DFA reads.

Membership query: a string passed to the blackbox oracle. The oracle<br>answers yes (accepted) or no (rejected).

Equivalence query: a hypothesis grammar passed to the teacher. The<br>teacher answers yes, or returns a counterexample string where the<br>hypothesis and the target disagree.

PAC oracle: a probabilistic approximation to the equivalence oracle.<br>After \(N\) random tests without finding a counterexample, we declare<br>the hypothesis probably approximately correct.

Discrimination tree (DT): a binary tree whose inner nodes are<br>discriminator suffixes and whose leaves are states. Sifting a string<br>\(w\) through the tree classifies it to a state using one membership<br>query per level.

Access sequence \(reach(q)\): the shortest known string that reaches<br>state \(q\) in the target. This is called \(acc(q)\) in TTT literature,<br>but using \(reach(q)\) to avoid conflation with \(accept\) in DFA.

Spanning tree: a mapping from each known state to its access sequence.<br>In this implementation we use a dict (called State Table) rather than<br>a tree.

Open transition: a transition from state \(q\) on symbol \(a\) that<br>has not yet been sifted to determine its target state.

Counterexample decomposition: the process of finding the split point<br>in a counterexample, extracting a new discriminator, and splitting a<br>leaf in the DT.

Contents

Definitions

Prerequisites

From L* to TTT

The DFA Representation

The Oracle

The Discrimination Tree

The State Table<br>Sifting

Hypothesis Construction<br>Incremental Hypothesis Update

Counterexample Decomposition<br>The Split Point

Prefix Transformation

Splitting a Leaf

Discriminator Finalization

Finding the Split Point

Putting Decomposition Together

Worklist Growth in close_transitions

Non-Redundancy

A Note on the Equivalence Oracle

The Main Loop

Examples

Evaluating Model Accuracy

Comparison with...

tree counterexample state algorithm from queries

Related Articles