Learning Regular Languages with the TTT Algorithm
TLDR; This tutorial is a complete implementation of the TTT algorithm<br>for active automata learning in Python. TTT combines the discrimination<br>tree of Kearns and Vazirani with binary search counterexample analysis<br>from Rivest and Schapire, and adds prefix transformation and discriminator<br>finalization to eliminate all redundant membership queries. The Python<br>interpreter is embedded so that you can work through the implementation<br>steps.
Why learn an input language?
Suppose you are given a piece of software. For example, a network protocol<br>implementation, a parser, or a security filter. You want to understand what<br>inputs it accepts. You have no access to the source code, and can only run it<br>and observe whether it accepts or rejects a given input.<br>This is the blackbox setting.
A naive answer is to try to test it exhaustively. But the set of all strings<br>accepted by even simple grammars is infinite.<br>A better approach is to infer a finite model, that is a DFA, that captures the<br>input behaviour exactly. Such a model is useful on its own. You can inspect<br>it, verify properties, generate tests from it, or compare it against a<br>specification to find discrepancies.<br>Active automata learning is the discipline of constructing this model<br>efficiently, using as few queries as possible.
In my previous post,<br>I implemented Angluin’s L* algorithm for learning regular languages from<br>a blackbox oracle. L* uses a flat observation table to track state<br>distinctions, which leads to redundant membership queries: when a<br>counterexample arrives, all its suffixes are added as columns even though<br>most distinguish no new states.
TTT is the state-of-the art algorithm for regular language inference. Using<br>this algorithm, you can infer the input language of any blackbox program<br>up to its regular approximation. It is much more faster than L*, and the<br>number of membership queries it generates (that is, the number of inputs<br>it needs to test the blackbox with) is provably non-redundant.
Several independent contributions are incorporated in the TTT algorithm.<br>Rivest and Schapire1 contributed the binary search<br>counterexample analysis, which finds the single relevant suffix in a<br>counterexample in \(O(\log k)\) queries (rather than \(k\) queries).<br>The introduction of discrimination tree as a replacement for the observation<br>table is due Kearns and Vazirani2.
TTT by Isberner, Howar and Steffen3<br>adds two further refinements: prefix transformation,<br>which keeps access sequences minimal, and discriminator finalization,<br>which keeps the discrimination tree shallow.<br>TTT is provably redundancy-free. That is, it never makes a membership query<br>whose answer could have been derived from earlier queries.
Language inference can also be applied to hardware. There are however,<br>other considerations in such settings. For example, it may not be possible<br>or even expensive to restart a system.<br>ADT4 is a notable extension of TTT, which uses adaptive<br>distinguishing sequences, and can reduce resets in hardware settings.
Definitions
Alphabet \(A\): the set of input symbols the DFA reads.
Membership query: a string passed to the blackbox oracle. The oracle<br>answers yes (accepted) or no (rejected).
Equivalence query: a hypothesis grammar passed to the teacher. The<br>teacher answers yes, or returns a counterexample string where the<br>hypothesis and the target disagree.
PAC oracle: a probabilistic approximation to the equivalence oracle.<br>After \(N\) random tests without finding a counterexample, we declare<br>the hypothesis probably approximately correct.
Discrimination tree (DT): a binary tree whose inner nodes are<br>discriminator suffixes and whose leaves are states. Sifting a string<br>\(w\) through the tree classifies it to a state using one membership<br>query per level.
Access sequence \(reach(q)\): the shortest known string that reaches<br>state \(q\) in the target. This is called \(acc(q)\) in TTT literature,<br>but using \(reach(q)\) to avoid conflation with \(accept\) in DFA.
Spanning tree: a mapping from each known state to its access sequence.<br>In this implementation we use a dict (called State Table) rather than<br>a tree.
Open transition: a transition from state \(q\) on symbol \(a\) that<br>has not yet been sifted to determine its target state.
Counterexample decomposition: the process of finding the split point<br>in a counterexample, extracting a new discriminator, and splitting a<br>leaf in the DT.
Contents
Definitions
Prerequisites
From L* to TTT
The DFA Representation
The Oracle
The Discrimination Tree
The State Table<br>Sifting
Hypothesis Construction<br>Incremental Hypothesis Update
Counterexample Decomposition<br>The Split Point
Prefix Transformation
Splitting a Leaf
Discriminator Finalization
Finding the Split Point
Putting Decomposition Together
Worklist Growth in close_transitions
Non-Redundancy
A Note on the Equivalence Oracle
The Main Loop
Examples
Evaluating Model Accuracy
Comparison with...