Reverse-engineering Codemasters' BIGF archive format in Ruby

Reading a Binary Game Format in Ruby | Davidslv

The Engineer's Notebook


When you say “I’m going to reverse-engineer a binary file format,” people picture C, or Python with struct, or Kaitai. Nobody pictures Ruby. Ruby is for web apps and DSLs and being pleasant; it is not, in the popular imagination, for byte-banging floats out of a 2003 racing game.

That popular imagination is wrong. The reader for Codemasters’ BIGF archive format — the container that holds the AI data in TOCA Race Driver — is pure, dependency-free Ruby, and it reads four different games’ archives. I should be upfront about how it came to be: this was reverse engineering done with an AI the whole way — me steering, deciding what to trust and verifying every claim against the bytes; the model drafting code, recalling the corners of the standard library, and proposing hypotheses I then tested. What follows is the part of Ruby that made that collaboration genuinely pleasant: Ruby strings are byte buffers, and String#unpack is a tiny, fast binary parser hiding in plain sight.

Strings are bytes

The first thing to internalise is that a Ruby String is not “text.” It’s a sequence of bytes with an encoding label attached. Read a file in binary mode and you get the raw bytes, indexable and sliceable like any string:

    data = File.binread("aib.big")   # the whole file as an ASCII-8BIT String
    data[0, 4]                       # => "BIGF", the first four bytes
    data.bytesize                    # => 3448832

File.binread is the key: it reads the file as binary (ASCII-8BIT / BINARY encoding), so no UTF-8 interpretation mangles your 0x80+ bytes. From there, data[offset, length] carves out byte ranges, and data.index(needle, from) finds a magic number or a marker anywhere in the file. That’s most of a parser already.
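You can try these operations without a game file by building a few bytes in memory. The sketch below uses an invented buffer; the `SAMPLE` constant and its contents are illustrative assumptions, not bytes from a real archive.

```ruby
# An invented buffer standing in for the result of File.binread.
# String#b returns a copy labelled with the binary encoding.
SAMPLE = "BIGF\x80\xFF\x00\x00junk.aib\x00".b

SAMPLE.encoding == Encoding::BINARY   # => true (BINARY is an alias of ASCII-8BIT)
SAMPLE.bytesize                       # => 17
SAMPLE[0, 4]                          # => "BIGF"
SAMPLE[4, 2].bytes                    # => [128, 255], the high bytes survive untouched
SAMPLE.index(".aib".b, 4)             # => 12, a byte offset, searching from offset 4
```

Because the string is binary, every index and length here is a byte count, which is what you want when the offsets come from a file header.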

unpack: the binary decoder you already have

The workhorse is String#unpack (and its single-value sibling unpack1). You hand it a format string of directives and it decodes the bytes. The two directives that did 90% of the work here:

V: an unsigned 32-bit integer, little-endian. Every count, block index, offset and size in BIGF is a V.

e: a little-endian single-precision float (32-bit). The AI data is arrays of these: the racing-line coordinates, the control values, the padding.

    data[4, 4].unpack1("V")     # => 39, the entry count, as a u32 LE
    data[12, 16].unpack("e4")   # => [0.0, 0.0, 137.0, 0.0], four float32s

Endianness lives in the directive, which is the whole game: V is little-endian u32, N is big-endian; e is little-endian float, g is big-endian. Codemasters’ PC games are little-endian, so it’s V and e throughout. (When we later looked at an Xbox 360 file, big-endian PowerPC, it would have been N and g — the format string is the only thing that changes.)
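To see the effect of the directive in isolation, here is a small sketch that decodes the same hand-written bytes both ways. The constants `RAW_INT` and `RAW_FLOAT` are made up for the illustration; only the pack and unpack directives come from Ruby itself.

```ruby
# Four hand-written bytes: 0x27 0x00 0x00 0x00.
RAW_INT = "\x27\x00\x00\x00".b

RAW_INT.unpack1("V")   # => 39         (little-endian u32)
RAW_INT.unpack1("N")   # => 654311424  (same bytes read big-endian: 0x27000000)

# Array#pack is the inverse of unpack: get the little-endian bytes of a float32.
RAW_FLOAT = [137.0].pack("e")

RAW_FLOAT.bytes                  # => [0, 0, 9, 67], i.e. 0x43090000 stored low byte first
RAW_FLOAT.unpack1("e")           # => 137.0
RAW_FLOAT.reverse.unpack1("g")   # => 137.0, byte-swapped input read with the big-endian directive
```

Reading little-endian data with a big-endian directive does not raise; it returns a wrong but plausible-looking number, which is why it pays to sanity-check the first few decoded values against something you know.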

unpack is implemented in C inside the interpreter, so decoding a few hundred thousand floats is not slow. You are not paying a “scripting language” tax here.

Walking the container

BIGF is a header, a directory, and a data section. The header check is a one-liner:

    MAGIC = "BIGF".b
    raise "not a BIGF archive" unless data[0, 4] == MAGIC

That .b is worth a footnote: it returns a binary copy of the string literal, so the comparison is byte-for-byte regardless of source-file encoding. I use it for every binary constant.
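A short sketch of why the encoding label matters. For a pure-ASCII constant like "BIGF" the comparison would succeed either way, so the example uses a non-ASCII value to show where the label starts to bite. The `E_ACUTE_*` names are invented for the illustration.

```ruby
MAGIC = "BIGF".b
MAGIC.encoding == Encoding::BINARY   # => true

# "\u00E9" is always a UTF-8 string holding the two bytes C3 A9.
E_ACUTE_UTF8 = "\u00E9"
E_ACUTE_BIN  = E_ACUTE_UTF8.b        # same two bytes, labelled binary

E_ACUTE_UTF8.bytes == E_ACUTE_BIN.bytes   # => true, identical bytes
E_ACUTE_UTF8 == E_ACUTE_BIN               # => false, the encodings are not comparable
```

Ruby's String#== takes the encoding into account once non-ASCII bytes are involved, so keeping every binary constant in the binary encoding removes one class of silent mismatch.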

BIGF has two directory layouts. One is a flat table of fixed 24-byte records — char name[16]; u32 size; u32 offset — which is a textbook unpack loop:

    count = data[4, 4].unpack1("V")
    base  = data[8, 4].unpack1("V")   # data-section base, read from the header (not assumed!)
    off   = 0x24                      # records start after the 0x20 header + a 4-byte pad

    count.times do
      rec = data[off, 24]
      name = rec[0, 16].split("\x00").first.to_s   # NUL-terminated name field
      size, offset = rec[16, 8].unpack("V2")       # two u32s in one go
      members << Entry.new(name:, offset: base + offset, size:)   # append to the members array
      off += 24
    end

Three small Ruby niceties are doing real work there. rec[0, 16].split("\x00").first turns a fixed-width, NUL-padded C string into a Ruby string. unpack("V2") pulls two integers at once (the count suffix). And — a hard-won detail — base is read from the header field at 0x08 rather than hard-coded, because measuring 1,371 real files showed it isn’t always the 0x800 everyone assumes.
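To exercise a loop like this without a real archive, you can build a tiny synthetic one with Array#pack and read it back. This is a sketch under stated assumptions: the `Entry` struct, `build_sample_archive`, and `read_flat_directory` are hypothetical names, the file contents are invented, and the name field is decoded with the `Z16` directive (a NUL-terminated string in a 16-byte field) as an alternative to the split approach.

```ruby
# Hypothetical value object; the real Entry class is not shown in the article.
Entry = Struct.new(:name, :offset, :size, keyword_init: true)

# Build a synthetic archive in the flat layout described above:
# "BIGF", u32 count, u32 base, zero padding up to 0x24, then 24-byte records.
def build_sample_archive(files, base: 0x100)
  header = "BIGF".b + [files.size, base].pack("V2")
  header << "\x00".b * (0x24 - header.bytesize)
  payload = "".b
  records = files.map do |name, body|
    rec = [name, body.bytesize, payload.bytesize].pack("a16V2") # name[16], size, offset
    payload << body.b
    rec
  end.join
  directory = header + records
  directory + "\x00".b * (base - directory.bytesize) + payload
end

# Read the directory back, taking the data-section base from the header.
def read_flat_directory(data)
  raise ArgumentError, "not a BIGF archive" unless data[0, 4] == "BIGF".b

  count, base = data[4, 8].unpack("V2")
  off = 0x24
  Array.new(count) do
    rec = data[off, 24]
    off += 24
    name, size, offset = rec.unpack("Z16V2") # Z16 stops at the first NUL
    Entry.new(name: name, offset: base + offset, size: size)
  end
end

ARCHIVE = build_sample_archive({ "line.aib" => "abcd", "ctrl.aib" => "xyz" })
ENTRIES = read_flat_directory(ARCHIVE)

ENTRIES.map(&:name)                               # => ["line.aib", "ctrl.aib"]
ARCHIVE[ENTRIES.last.offset, ENTRIES.last.size]   # => "xyz"
```

A round trip like this only proves the reader agrees with the writer, so it is no substitute for checking against real files, but it makes a cheap regression test for offset arithmetic.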

The other layout is variable-length: names interspersed with a 0x44 00 00 00 marker. That’s where String#index shines — you scan for the extension, walk back to the preceding NUL to find the name’s start, then look just past it for the marker:

while (idx =...
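As a rough sketch of the scan just described, here is the same idea run against an invented buffer. Everything about the layout below is an assumption made for the illustration: that a NUL byte precedes each name, that the 0x44 00 00 00 marker sits immediately after the extension, and that the extension is ".aib". The `scan_names` helper and `SCAN_BUFFER` are hypothetical and do not describe the real variable-length BIGF directory.

```ruby
MARKER = "\x44\x00\x00\x00".b   # the 0x44 00 00 00 marker mentioned in the text
EXT    = ".aib".b               # assumed extension to scan for

# Find every name that ends in ext and is immediately followed by marker.
def scan_names(data, ext: EXT, marker: MARKER)
  names = []
  pos = 0
  while (idx = data.index(ext, pos))
    name_end = idx + ext.bytesize
    pos = name_end                       # always advance, so the loop terminates
    # Walk back to the preceding NUL to find where the name starts.
    nul = idx.zero? ? nil : data.rindex("\x00".b, idx - 1)
    name_start = nul ? nul + 1 : 0
    # Keep the candidate only if the marker follows the name.
    next unless data[name_end, marker.bytesize] == marker
    names << data[name_start...name_end]
  end
  names
end

# Invented buffer: two names followed by the marker, one without it.
SCAN_BUFFER = "\x00\x00".b + "line.aib".b + MARKER +
              [16].pack("V") + "ctrl.aib".b + MARKER +
              "notes.aib".b + "\x01\x02".b

scan_names(SCAN_BUFFER)   # => ["line.aib", "ctrl.aib"]
```

Requiring the marker after the name is what keeps a stray ".aib" inside payload data from being mistaken for a directory entry.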
