Reading a Binary Game Format in Ruby | Davidslv
The Engineer's Notebook
When you say “I’m going to reverse-engineer a binary file format,” people picture C,<br>or Python with struct, or Kaitai. Nobody pictures Ruby. Ruby is for web apps and<br>DSLs and being pleasant; it is not, in the popular imagination, for byte-banging<br>floats out of a 2003 racing game.
That popular imagination is wrong. The reader for Codemasters’ BIGF archive format —<br>the container that holds the AI data in TOCA Race Driver — is pure, dependency-free<br>Ruby, and it reads four different games’ archives. I should be upfront about how it<br>came to be: this was reverse engineering done with an AI the whole way — me<br>steering, deciding what to trust and verifying every claim against the bytes; the<br>model drafting code, recalling the corners of the standard library, and proposing<br>hypotheses I then tested. What follows is the part of Ruby that made that<br>collaboration genuinely pleasant: Ruby strings are byte buffers, and<br>String#unpack is a tiny, fast binary parser hiding in plain sight.
Strings are bytes
The first thing to internalise is that a Ruby String is not “text.” It’s a<br>sequence of bytes with an encoding label attached. Read a file in binary mode and<br>you get the raw bytes, indexable and sliceable like any string:
data = File.binread("aib.big") # the whole file as an ASCII-8BIT String<br>data[0, 4] # => "BIGF" — the first four bytes<br>data.bytesize # => 3448832
File.binread is the key: it reads the file as binary (ASCII-8BIT / BINARY<br>encoding), so no UTF-8 interpretation mangles your 0x80+ bytes. From there,<br>data[offset, length] carves out byte ranges, and data.index(needle, from) finds<br>a magic number or a marker anywhere in the file. That’s most of a parser already.
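To make the slicing and searching concrete without needing a real archive, here is a minimal sketch on a synthetic buffer. The `sample_bytes` helper and its contents are made up for illustration; a real run would start from `File.binread`.

```ruby
# Hypothetical stand-in for File.binread: a magic number, a u32, two high
# bytes and a NUL-terminated name, all as one binary String.
def sample_bytes
  "BIGF".b + [3].pack("V") + "\x80\xFF".b + "car.ai\x00".b
end

data = sample_bytes
data.encoding           # binary (ASCII-8BIT), the same encoding File.binread returns
data[0, 4]              # => "BIGF"
data.bytesize           # => 17
data.getbyte(8)         # => 128 (0x80 survives untouched)
data.index(".ai".b, 8)  # => 13, the byte offset of the marker, searching from offset 8
```

Every offset here is a byte offset, because in a binary String one character is one byte.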
unpack: the binary decoder you already have
The workhorse is String#unpack (and its single-value sibling unpack1). You hand<br>it a format string of directives and it decodes the bytes. The two directives that<br>did 90% of the work here:
V — an unsigned 32-bit integer, little-endian. Every count, block index,<br>offset and size in BIGF is a V.
e — a little-endian single-precision float (32-bit). The AI data is<br>arrays of these: the racing-line coordinates, the control values, the padding.
data[4, 4].unpack1("V") # => 39 — the entry count, as a u32 LE<br>data[12, 16].unpack("e4") # => [0.0, 0.0, 137.0, 0.0] — four float32s
Endianness lives in the directive, which is the whole game: V is little-endian<br>u32, N is big-endian; e is little-endian float, g is big-endian. Codemasters’<br>PC games are little-endian, so it’s V and e throughout. (When we later looked at<br>an Xbox 360 file, big-endian PowerPC, it would have been N and g — the format<br>string is the only thing that changes.)
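A small sketch of that point, using bytes packed on the spot rather than a real game file. The `decode_u32_both_ways` helper is a hypothetical name for illustration.

```ruby
# Decode the same four bytes as a little-endian and as a big-endian u32.
def decode_u32_both_ways(bytes)
  { little: bytes.unpack1("V"), big: bytes.unpack1("N") }
end

le_bytes = [39].pack("V")        # the bytes 27 00 00 00
decode_u32_both_ways(le_bytes)   # => {little: 39, big: 654311424}

# Floats behave the same way: "e" and "g" differ only in byte order.
[137.0].pack("e") == [137.0].pack("g").reverse  # => true
[137.0].pack("g").unpack1("g")                  # => 137.0
```

Reading the little-endian bytes with N gives 0x27000000, which is the kind of implausibly large count that usually signals the wrong byte order.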
unpack is implemented in C inside the interpreter, so decoding a few hundred<br>thousand floats is not slow. You are not paying a “scripting language” tax here.
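If you want to check that on your own machine, a rough timing sketch follows. The `time_float_decode` helper and the synthetic data are assumptions for illustration, and no particular timing is claimed, since it depends on hardware and Ruby version.

```ruby
# Pack `count` synthetic float32 values, then time a single unpack call.
def time_float_decode(count)
  blob = Array.new(count) { |i| i * 0.5 }.pack("e*")
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  floats = blob.unpack("e*")
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [floats, elapsed]
end

floats, seconds = time_float_decode(300_000)
puts "decoded #{floats.size} floats in #{(seconds * 1000).round(2)} ms"
```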
Walking the container
BIGF is a header, a directory, and a data section. The header check is a one-liner:
MAGIC = "BIGF".b<br>raise "not a BIGF archive" unless data[0, 4] == MAGIC
That .b is worth a footnote: it returns a binary copy of the string literal, so<br>the comparison is byte-for-byte regardless of source-file encoding. I use it for<br>every binary constant.
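A short sketch of why the binary copy matters, assuming the source file uses Ruby's default UTF-8 encoding. The `bigf?` helper name is hypothetical.

```ruby
# True when the buffer starts with the BIGF magic number.
def bigf?(data)
  data.byteslice(0, 4) == "BIGF".b
end

"BIGF".encoding     # UTF-8 in a default-encoded source file
"BIGF".b.encoding   # binary (ASCII-8BIT)

bigf?("BIGF\x00\x00".b)  # => true
bigf?("RIFF".b)          # => false

# For pure-ASCII constants the comparison would succeed either way. With high
# bytes it does not: same bytes, different encodings, so == returns false.
"\xFF\xD8" == "\xFF\xD8".b    # => false
"\xFF\xD8".b == "\xFF\xD8".b  # => true
```

Tagging every constant with .b means you never have to remember which markers happen to contain bytes above 0x7F.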
BIGF has two directory layouts. One is a flat table of fixed 24-byte records —<br>char name[16]; u32 size; u32 offset — which is a textbook unpack loop:
count = data[4, 4].unpack1("V")<br>base = data[8, 4].unpack1("V") # data-section base, read from the header (not assumed!)<br>off = 0x24 # records start after the 0x20 header + a 4-byte pad
count.times do<br>rec = data[off, 24]<br>name = rec[0, 16].split("\x00").first.to_s # NUL-terminated name field<br>size, offset = rec[16, 8].unpack("V2") # two u32s in one go<br>members << Entry.new(name:, offset: base + offset, size:)<br>off += 24<br>end
Three small Ruby niceties are doing real work there. rec[0, 16].split("\x00").first<br>turns a fixed-width, NUL-padded C string into a Ruby string. unpack("V2") pulls<br>two integers at once (the count suffix). And — a hard-won detail — base is read<br>from the header field at 0x08 rather than hard-coded, because measuring 1,371 real<br>files showed it isn’t always the 0x800 everyone assumes.
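To try the flat-table loop end to end without a game file, here is a self-contained sketch that builds a tiny synthetic archive and reads it back. It assumes only the layout described above (magic at 0, count at 4, base at 8, 24-byte records from 0x24). The `Entry` struct, `build_archive` and `read_directory` are hypothetical names, and the reader uses unpack's Z16 directive in place of the split idiom.

```ruby
Entry = Struct.new(:name, :offset, :size, keyword_init: true)

# Build a synthetic archive: header, 24-byte records, padding, then payload.
def build_archive(files, base: 0x800)
  header = "BIGF".b + [files.size, base].pack("V2")
  header << "\x00".b * (0x24 - header.bytesize)   # pad up to the first record
  directory = "".b
  payload = "".b
  files.each do |name, bytes|
    # a16 NUL-pads the name to 16 bytes; then u32 size, u32 relative offset.
    directory << [name, bytes.bytesize, payload.bytesize].pack("a16VV")
    payload << bytes.b
  end
  archive = header + directory
  archive << "\x00".b * (base - archive.bytesize) # pad up to the data section
  archive + payload
end

# Read the flat directory back into Entry structs with absolute offsets.
def read_directory(data)
  raise ArgumentError, "not a BIGF archive" unless data[0, 4] == "BIGF".b
  count, base = data[4, 8].unpack("V2")
  off = 0x24
  Array.new(count) do
    # Z16 consumes 16 bytes and drops everything from the first NUL onward.
    name, size, offset = data[off, 24].unpack("Z16VV")
    off += 24
    Entry.new(name: name, offset: base + offset, size: size)
  end
end

archive = build_archive({ "track.ai" => "\x01\x02\x03".b, "car.ai" => "\xAA\xBB".b })
entries = read_directory(archive)
entries.map(&:name)                    # => ["track.ai", "car.ai"]
car = entries.last
archive[car.offset, car.size].unpack("C*")  # => [170, 187]
```

Passing a different `base:` to `build_archive` is a quick way to confirm the reader really takes the data-section base from the header instead of assuming 0x800.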
The other layout is variable-length: names interspersed with a 0x44 00 00 00<br>marker. That’s where String#index shines — you scan for the extension, walk back to<br>the preceding NUL to find the name’s start, then look just past it for the marker:
while (idx =...