Generative Unix CTF for RL

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

Authors: Geoffrey Bradway, Roger Creus Castanyer, Lorenz Wolf, Maxwill Lin, Matthew James Sargent, Augustine N. Mavor-Parker

Description: We introduce unix-ctf, a procedural generator of capture-the-flag environments for training and evaluating Unix competence in language-model shell agents.

External Link: https://arxiv.org/abs/2605.29115

Date: May 25, 2026

Affiliations: Vmax

Events: CTF

Terminal agents are usually described as models that can use a shell. But "using a shell" hides two different skills.

One skill is general programming through a terminal: writing a Python script, running tests, editing files, or compiling a small program. The shell is the interface, but the hard part is still ordinary programming.

The other skill is Unix competence. An agent needs to understand how the operating system, filesystem, shell, and file formats expose information. A flag stored in an extended attribute will not show up in ls -l, stat, file, or a recursive grep over file contents. In this case, writing a Python program is not the best approach. Instead, an agent should know that the data lives in an inode side channel and ask for it with getfattr.

Current terminal benchmarks are often biased toward programming tasks. If they do measure Unix competence, they often do so indirectly. A model that is strong at Python but weak at Unix can still solve a meaningful fraction of terminal tasks. The reverse skill profile is tested less often. This makes it hard to tell whether a training pipeline is teaching models to operate a Unix system, or simply teaching them to write code while standing inside one.

unix-ctf is our attempt to isolate the Unix side. It procedurally generates capture-the-flag tasks inside fresh Linux containers. Each task hides a short token, such as flag{a3b1c9...}, using a single Unix feature. The agent has to recover the flag by discovering and using that feature.

The flag can be verified mechanically, and because the hiding technique can be tied to a specific OS, shell, or file-format feature, the task surface targets Unix competence directly.

Terminal benchmarks often mix Unix competence, shell-flavored coding, and general programming. unix-ctf shifts the distribution toward Unix-specific skills.

Unix Competence

We say a task tests Unix competence when success depends on an OS, shell, or file-format feature with no clean analogue in ordinary general-purpose programming.

Extended attributes are one example. ELF build IDs, X.509 custom object identifiers, pre-epoch modification times, file capabilities, named pipes, Unix sockets, shell functions, systemd drop-ins, and /proc state have the same shape.

A programming language can wrap these features, but Unix knowledge is still required to complete the task. Python's os.getxattr, for example, is only a wrapper around the getxattr(2) system call. An agent has to know that the feature exists and that it is the right access path.
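As a rough illustration of the point about wrappers, the sketch below plants a token in an extended attribute and reads it back with os.getxattr. It is not taken from unix-ctf: the function name xattr_roundtrip and the attribute name user.note are hypothetical, and the code assumes Linux with a filesystem that permits user.* attributes. Where that assumption fails it returns None rather than raising.

```python
import os
import tempfile


def xattr_roundtrip(flag, directory=None):
    """Hide `flag` in an extended attribute and read it back.

    Returns (found_in_contents, recovered_flag), or None when the platform
    or filesystem does not support user.* extended attributes.
    """
    if not hasattr(os, "setxattr"):
        return None  # os.setxattr / os.getxattr are only available on Linux
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        tmp.write(b"nothing to see here\n")
        tmp.flush()
        try:
            # "user.note" is an illustrative attribute name, not one from the paper.
            os.setxattr(tmp.name, "user.note", flag.encode())
        except OSError:
            return None  # e.g. the filesystem rejects user.* attributes
        # Reading the file contents (what a recursive grep sees) misses the flag.
        with open(tmp.name, "rb") as fh:
            found_in_contents = flag.encode() in fh.read()
        # Shell equivalent: getfattr -n user.note --only-values FILE
        recovered = os.getxattr(tmp.name, "user.note").decode()
        return found_in_contents, recovered
```

On a supporting filesystem the call returns (False, flag): the token is absent from the file's bytes but present in its metadata. Using the wrapper still required knowing that the attribute namespace exists and which name to ask for.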

unix-ctf shifts the weight toward the parts of Unix that are easy to miss: filesystem metadata, binary formats, shell state, process and IPC primitives, service configuration, logs, certificates, archives, encodings, and serialization formats.

Building the Technique Library

Vmax's research agenda is concerned with agents setting their own objectives and tasks, rather than relying on manual prompts. We continue this with unix-ctf's technique library, in which each technique is a portable way to hide and recover a flag using a Unix feature.

Each technique enters the library through an offline harvest pipeline. A frontier model first explores a target technique inside a pre-built Linux container and produces a candidate hiding procedure plus a recovery command. Then the system checks two things mechanically.

First, the planted flag must not appear as plaintext anywhere on disk. A simple recursive search should fail to find it. Second, the recovery command must print the flag and exit successfully.
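The two checks could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's harness: the function names flag_hidden and recovery_succeeds are hypothetical, the byte-level scan of regular files stands in for whatever recursive search the pipeline actually runs, and the 30-second timeout is an arbitrary choice.

```python
import os
import subprocess
from typing import Sequence


def flag_hidden(root: str, flag: str) -> bool:
    """Return True if no regular file under `root` contains the flag as plaintext."""
    needle = flag.encode()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, FIFOs, and sockets: reading them could block or escape root.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    if needle in fh.read():
                        return False
            except OSError:
                continue  # unreadable file: nothing a plain search could see either
    return True


def recovery_succeeds(command: Sequence[str], flag: str, timeout: float = 30.0) -> bool:
    """Return True if `command` exits with status 0 and prints the flag on stdout."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and flag in result.stdout
```

A candidate would pass only when both functions return True for the same planted directory and flag.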

If a candidate task passes these checks, a smaller model rewrites the planting procedure into a parameterized script pair: plant.sh, which accepts a target directory and flag, and recovery.sh, which accepts a target directory and recovers the flag. Finally, those scripts are re-run in a fresh directory with a fresh flag. This catches a common failure mode where the model accidentally hardcodes the original path or original token.
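The fresh-directory re-run might look like the sketch below. Only the argument conventions come from the text (plant.sh takes a target directory and a flag, recovery.sh takes a target directory). The function name portability_recheck, the use of sh to run the scripts, and the flag{<hex>} token format are assumptions made for illustration.

```python
import secrets
import subprocess
import tempfile


def portability_recheck(plant_script: str, recovery_script: str) -> bool:
    """Re-run a plant/recovery script pair in a fresh directory with a fresh flag."""
    # A new random token: a script that hardcoded the original flag cannot reproduce it.
    fresh_flag = "flag{" + secrets.token_hex(8) + "}"
    # A new directory: a script that hardcoded the original path will not find its files.
    with tempfile.TemporaryDirectory() as fresh_dir:
        planted = subprocess.run(
            ["sh", plant_script, fresh_dir, fresh_flag],
            capture_output=True, text=True,
        )
        if planted.returncode != 0:
            return False
        recovered = subprocess.run(
            ["sh", recovery_script, fresh_dir],
            capture_output=True, text=True,
        )
        return recovered.returncode == 0 and fresh_flag in recovered.stdout
```

A plant.sh that ignores its second argument and writes the token from the original harvest run would fail here, because the recovered output would not contain the fresh flag.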

The final step canonicalizes surviving variants into distinct technique IDs. Across the run reported in the paper, 656 of 750 raw attempts survived to portable variants, an 87.5% end-to-end yield. After deduplication, those became 441 variants and 155 canonical technique identifiers.

Because tasks are mined from our technique library, candidate tasks convert into our full task set at a high yield. Alternative approaches ask a model to generate a Dockerfile, setup script, planting logic, and tests from...
