Learning a regular language by inferring a DFA with the TTT algorithm

Learning Regular Languages with the TTT Algorithm

TL;DR: This tutorial is a complete Python implementation of the TTT algorithm for active automata learning. TTT combines the discrimination tree of Kearns and Vazirani with the binary search counterexample analysis of Rivest and Schapire, and adds prefix transformation and discriminator finalization to eliminate redundant membership queries. The Python interpreter is embedded so that you can work through the implementation steps.

Why learn an input language?

Suppose you are given a piece of software, for example a network protocol implementation, a parser, or a security filter. You want to understand what inputs it accepts. You have no access to the source code; you can only run it and observe whether it accepts or rejects a given input. This is the blackbox setting.

A naive answer is to try to test it exhaustively. But the set of all strings accepted by even simple grammars is infinite. A better approach is to infer a finite model, that is a DFA, that captures the input behaviour exactly. Such a model is useful on its own. You can inspect it, verify properties, generate tests from it, or compare it against a specification to find discrepancies. Active automata learning is the discipline of constructing this model efficiently, using as few queries as possible.

In my previous post, I implemented Angluin’s L* algorithm for learning regular languages from a blackbox oracle. L* uses a flat observation table to track state distinctions, which leads to redundant membership queries: when a counterexample arrives, all its suffixes are added as columns even though most distinguish no new states.

TTT is the state-of-the-art algorithm for regular language inference. Using this algorithm, you can infer the input language of any blackbox program up to its regular approximation. It is much faster than L*, and the membership queries it generates (that is, the inputs it needs to test the blackbox with) are provably non-redundant.

Several independent contributions are incorporated in the TTT algorithm. Rivest and Schapire [1] contributed the binary search counterexample analysis, which finds the single relevant suffix in a counterexample of length \(k\) in \(O(\log k)\) queries (rather than \(k\) queries). The introduction of the discrimination tree as a replacement for the observation table is due to Kearns and Vazirani [2].
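The binary search idea can be sketched in a few lines. This is only an illustration, not the implementation developed in this tutorial: the names `target_accepts`, `run_hypothesis`, `find_split`, `delta` and `reach` are all assumptions made for this example, and the blackbox is a hypothetical one that accepts strings whose number of `a`s is a multiple of 3. For a counterexample `w`, position `i` is probed by running the hypothesis on `w[:i]`, replacing that prefix by the access sequence of the state reached, and asking the blackbox about the result. The answers at positions 0 and `len(w)` differ when `w` is a genuine counterexample and the hypothesis agrees with the blackbox on its own access sequences, so a binary search can locate an index where the answer flips.

```python
def target_accepts(s):
    # Hypothetical blackbox: accepts strings whose number of 'a's is a multiple of 3.
    return s.count("a") % 3 == 0


def run_hypothesis(delta, start, s):
    # Follow the hypothesis transitions and return the state reached on s.
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q


def find_split(w, delta, start, reach, member):
    """Return an index i where the probe answer at i differs from the one at i + 1.

    Assumes the probe answers at 0 and len(w) differ (see the text above).
    """
    def probe(i):
        # Replace the prefix w[:i] by the access sequence of the state it reaches.
        q = run_hypothesis(delta, start, w[:i])
        return member(reach[q] + w[i:])

    first = probe(0)
    lo, hi = 0, len(w)
    # Invariant: probe(lo) == first and probe(hi) != first.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid) == first:
            lo = mid
        else:
            hi = mid
    return lo


# A one-state hypothesis that accepts everything; its only access sequence is "".
delta = {("q0", "a"): "q0", ("q0", "b"): "q0"}
reach = {"q0": ""}

w = "bbabb"  # rejected by the blackbox, accepted by the hypothesis
i = find_split(w, delta, "q0", reach, target_accepts)
print(i, w[i], repr(w[i + 1:]))  # 2 a 'bb'
```

In this run the search probes positions 0, 2 and 3 only. The suffix `w[i + 1:]`, here `bb`, is the candidate discriminator: it separates the state the hypothesis claims to reach after `w[:i + 1]` from the state the blackbox actually reaches.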

TTT, by Isberner, Howar and Steffen [3], adds two further refinements: prefix transformation, which keeps access sequences minimal, and discriminator finalization, which keeps the discrimination tree shallow. TTT is provably redundancy-free. That is, it never makes a membership query whose answer could have been derived from earlier queries.

Language inference can also be applied to hardware. There are, however, other considerations in such settings. For example, restarting a system may be expensive or even impossible. ADT [4] is a notable extension of TTT that uses adaptive distinguishing sequences and can reduce the number of resets in hardware settings.

Definitions

Alphabet \(A\): the set of input symbols the DFA reads.

Membership query: a string passed to the blackbox oracle. The oracle answers yes (accepted) or no (rejected).

Equivalence query: a hypothesis grammar passed to the teacher. The teacher answers yes, or returns a counterexample string where the hypothesis and the target disagree.

PAC oracle: a probabilistic approximation to the equivalence oracle. After \(N\) random tests without finding a counterexample, we declare the hypothesis probably approximately correct.

Discrimination tree (DT): a binary tree whose inner nodes are discriminator suffixes and whose leaves are states. Sifting a string \(w\) through the tree classifies it to a state using one membership query per level.

Access sequence \(reach(q)\): the shortest known string that reaches state \(q\) in the target. This is called \(acc(q)\) in the TTT literature, but we use \(reach(q)\) here to avoid confusion with \(accept\) in the DFA.

Spanning tree: a mapping from each known state to its access sequence. In this implementation we use a dict (called State Table) rather than a tree.

Open transition: a transition from state \(q\) on symbol \(a\) that has not yet been sifted to determine its target state.

Counterexample decomposition: the process of finding the split point in a counterexample, extracting a new discriminator, and splitting a leaf in the DT.
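To make the discrimination tree and sifting definitions concrete, here is a minimal sketch. It is not the data structure built later in this tutorial: the class names `Leaf` and `Inner`, the field names, the `sift` signature, and the hypothetical blackbox `target_accepts` (which accepts strings whose number of `a`s is a multiple of 3) are all assumptions made for illustration. Each inner node holds a discriminator suffix, and sifting a string `w` asks one membership query for `w` plus that suffix at each level until a leaf is reached.

```python
from dataclasses import dataclass
from typing import Union


def target_accepts(s):
    # Hypothetical blackbox: accepts strings whose number of 'a's is a multiple of 3.
    return s.count("a") % 3 == 0


@dataclass
class Leaf:
    state: str  # name of the hypothesis state this leaf stands for


@dataclass
class Inner:
    discriminator: str  # suffix appended to the sifted string
    on_reject: "Node"   # subtree taken when the blackbox rejects w + discriminator
    on_accept: "Node"   # subtree taken when the blackbox accepts w + discriminator


Node = Union[Leaf, Inner]


def sift(node, w, member):
    """Classify w to a state; returns (state, number of membership queries used)."""
    queries = 0
    while isinstance(node, Inner):
        queries += 1
        if member(w + node.discriminator):
            node = node.on_accept
        else:
            node = node.on_reject
    return node.state, queries


# The empty suffix separates q0 (accepting) from the rest;
# the suffix "a" separates q2 (two 'a's seen) from q1 (one 'a' seen).
tree = Inner(
    discriminator="",
    on_reject=Inner(discriminator="a", on_reject=Leaf("q1"), on_accept=Leaf("q2")),
    on_accept=Leaf("q0"),
)

print(sift(tree, "", target_accepts))     # ('q0', 1)
print(sift(tree, "ab", target_accepts))   # ('q1', 2)
print(sift(tree, "baa", target_accepts))  # ('q2', 2)
```

The query count returned by `sift` equals the depth of the leaf reached, which is why a shallow tree matters for the total number of membership queries.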

Contents

Definitions

Prerequisites

From L* to TTT

The DFA Representation

The Oracle

The Discrimination Tree

The State Table

Sifting

Hypothesis Construction

Incremental Hypothesis Update

Counterexample Decomposition

The Split Point

Prefix Transformation

Splitting a Leaf

Discriminator Finalization

Finding the Split Point

Putting Decomposition Together

Worklist Growth in close_transitions

Non-Redundancy

A Note on the Equivalence Oracle

The Main Loop

Examples

Evaluating Model Accuracy

Comparison with...
