ISA_recovery: Auto-generates a Ghidra SLEIGH spec for undocumented ISAs

ilreb1 pts0 comments

GitHub - infobyte/isa_recovery: ISA Recovery · GitHub



ISA Recovery System

A reverse-engineering pipeline that turns a firmware binary and its (possibly wrong) disassembly into a working Ghidra processor specification. When you hit a proprietary processor with no documentation and no Ghidra support, this tool recovers the real encoding of each instruction (which bits are the opcode, which are registers, which are immediates) and writes out a SLEIGH spec you can load directly into Ghidra to decompile the firmware.

Under the hood it is an agentic workflow: a fixed pipeline where each step is a large language model prompted for a narrow job. The workflow is orchestrated by deterministic code, not by the LLMs themselves, and every SLEIGH constructor generated at the end is verified by compiling it with Ghidra's sleigh binary before being accepted. Failed compilations are fed back to the model for up to three repair attempts.
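The compile-and-repair loop described above can be sketched as follows. This is a minimal illustration, not the project's code: `verify_with_retries`, `compile_fn`, and `repair_fn` are hypothetical names, and the sketch assumes one initial compile plus up to three repairs. The compiler and the model are passed in as plain callables so the control flow stays deterministic.

```python
from typing import Callable, Optional, Tuple

MAX_REPAIR_ATTEMPTS = 3  # from the text: up to three repair attempts


def verify_with_retries(
    constructor: str,
    compile_fn: Callable[[str], Tuple[bool, str]],  # hypothetical: returns (ok, error text)
    repair_fn: Callable[[str, str], str],           # hypothetical: model call given errors
    max_repairs: int = MAX_REPAIR_ATTEMPTS,
) -> Optional[str]:
    """Return a constructor that compiles, or None if every repair fails."""
    candidate = constructor
    for attempt in range(max_repairs + 1):  # initial try, then the repairs
        ok, errors = compile_fn(candidate)
        if ok:
            return candidate
        if attempt < max_repairs:
            # Feed the compiler errors back and ask for a corrected version.
            candidate = repair_fn(candidate, errors)
    return None
```

A caller would wrap the real sleigh invocation in `compile_fn` and treat a `None` result as a constructor to leave unimplemented or escalate.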

How It Works

Objdump
Bootstrap ─── deterministic clustering (no LLM)
┌─ Processing Loop ──────────────────────────┐
│ Text Interpreter → Bit Interpreter ──┐     │
│                  → Knowledge Manager │     │
│                  → Supervisor        │     │
│                         │ split ─────┘     │
│                         └── next cluster ──┤
└────────────────────────────────────────────┘
Knowledge Base
SLEIGH Generator ─── compile-verify-retry loop
Ghidra .slaspec

Instructions are grouped into clusters by structure (byte size, token pattern, fixed-bit mask). Each cluster is then analyzed by a chain of specialized LLM steps:
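One way to picture this structural clustering is the sketch below. It is an assumption-laden illustration rather than the tool's bootstrap code: the function names are hypothetical, encodings are read big-endian, and the fixed-bit mask is computed per group instead of being part of the grouping key.

```python
import re
from collections import defaultdict

# Numbers become IMM; anything that looks like an identifier becomes REG.
# (A symbol operand such as a label would also become REG in this sketch.)
_TOKEN = re.compile(r"-?(?:0x[0-9a-fA-F]+|\d+)|[$%]?[A-Za-z_]\w*")


def token_pattern(text):
    """Reduce 'add $t0, $t1, $t2' to 'add REG, REG, REG'."""
    mnemonic, _, operands = text.strip().partition(" ")

    def classify(match):
        return "IMM" if match.group()[0] in "-0123456789" else "REG"

    return f"{mnemonic} {_TOKEN.sub(classify, operands)}".strip()


def fixed_bit_mask(encodings, size_bytes):
    """Mask of the bits that hold the same value in every encoding."""
    full = (1 << (8 * size_bytes)) - 1
    varying = 0
    for enc in encodings[1:]:
        varying |= enc ^ encodings[0]
    return full & ~varying


def cluster(instructions):
    """Group (raw_bytes, disassembly_text) pairs by byte size and token pattern."""
    groups = defaultdict(list)
    for raw, text in instructions:
        key = (len(raw), token_pattern(text))
        groups[key].append(int.from_bytes(raw, "big"))  # assumed byte order
    return {
        key: {
            "count": len(encs),
            "mask": fixed_bit_mask(encs, key[0]),
            "value": encs[0] & fixed_bit_mask(encs, key[0]),
        }
        for key, encs in groups.items()
    }
```

With only a few samples the mask is wider than the true opcode bits, because operand bits that happen not to vary still look fixed; more instructions per cluster narrow it.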

Text Interpreter extracts the text pattern (add {REG1}, {REG2}, {REG3}).

Bit Interpreter maps each placeholder to a bit range using field-correlation tools; can request a split if a cluster mixes encodings.

Knowledge Manager integrates per-cluster evidence into a typed knowledge base of registers, instructions, addressing modes, and architecture traits.

Supervisor is primarily a deterministic gatekeeper (structural checks on match rates, unmapped placeholders, opcode overlap). It invokes an LLM only when a check fails, and it can accept the result, re-run a specific agent with feedback, or escalate to the human via the TUI.
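The field-correlation idea behind the Bit Interpreter step can be sketched in a few lines. This is an illustrative assumption, not the project's tooling: `extract_field` and `candidate_fields` are hypothetical helpers, and the sketch presumes each operand has already been converted to a number (for example a register index).

```python
def extract_field(encoding, lsb, width):
    """Read `width` bits of `encoding` starting at bit `lsb`."""
    return (encoding >> lsb) & ((1 << width) - 1)


def candidate_fields(samples, total_bits, width):
    """Return every lsb where the field equals the operand value in all samples.

    samples: list of (encoding, operand_value) pairs from one cluster.
    """
    return [
        lsb
        for lsb in range(total_bits - width + 1)
        if all(extract_field(enc, lsb, width) == value for enc, value in samples)
    ]


# Two MIPS `add rd, rs, rt` encodings, paired with the rd register number.
rd_samples = [(0x012A4020, 8), (0x02328020, 16)]
rd_positions = candidate_fields(rd_samples, total_bits=32, width=5)
```

Here the first encoding alone matches at two positions (bit 11 and, by coincidence, bit 2 inside the function code); the second sample removes the coincidence and leaves bit 11. A cluster that mixes encodings would leave no position that matches every sample, which is the kind of evidence that could motivate a split.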

When the knowledge base is complete, a separate SLEIGH generator builds the Ghidra spec in two phases: first a deterministic skeleton of all constructors marked unimpl, then an LLM pass that fills in the p-code semantics one instruction at a time, compiling each against Ghidra's sleigh binary and retrying on failure.
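The two phases can be illustrated with plain string construction. The sketch below is hypothetical: the helper names, the field names (`op`, `funct`, `rd`, `rs`, `rt`), and the input shape are assumptions, since the knowledge-base schema is not shown here. It relies only on SLEIGH's `unimpl` keyword, which may stand in for a constructor's semantic section.

```python
def skeleton_constructor(mnemonic, operands, fixed_fields):
    """Phase one: emit a constructor whose semantics are marked unimpl."""
    display = f":{mnemonic} {', '.join(operands)}".rstrip()
    constraints = [f"{name}={value:#x}" for name, value in fixed_fields.items()]
    pattern = " & ".join(constraints + list(operands))
    return f"{display} is {pattern} unimpl"


def fill_semantics(constructor, pcode):
    """Phase two: swap the unimpl marker for a p-code body."""
    if not constructor.endswith(" unimpl"):
        raise ValueError("constructor is not a skeleton")
    return constructor[: -len("unimpl")] + "{ " + pcode + " }"


skeleton = skeleton_constructor("add", ["rd", "rs", "rt"], {"op": 0, "funct": 0x20})
filled = fill_semantics(skeleton, "rd = rs + rt;")
```

Keeping the skeleton deterministic means the spec stays loadable even if a semantics fill never compiles: the affected instruction simply remains `unimpl`.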

Designed as a co-pilot for the analyst, not a replacement: the TUI exposes every decision, the supervisor escalates ambiguous clusters to a human, and the full LLM conversation, tool-call, and token-usage history is written to disk.

Tested on LEGv8, MIPS, pi32v2, and x86.

Quick Start

# Docker (recommended)
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
./docker/run.sh integration_tests/mips

# Local
pip install -e ".[all]"
python -m main --config config.yaml

What You Need (and What You Get)

Input: a firmware binary and an objdump disassembly, even one produced against the wrong architecture. The tool does not solve the disassembly problem itself; output quality scales with input disassembly quality.

Output: a Ghidra .slaspec file plus a JSON knowledge base of registers, instruction encodings, addressing modes, and architecture traits.

Documentation

Full documentation (architecture, agent internals, worked examples, configuration reference) lives in the wiki:

pip install -e ".[docs]"
cd wiki && mkdocs serve

Then open...
