What exactly is a file format? An interactive guide | growingSWE
I just realized one day that I didn't know what file formats really were. Why didn't I question this before?
You double-click a file and it opens in the "right" app. Usually we take that for granted.
But why did that work?
.txt, .png, and .pdf are just suffixes in a filename. The real question is what is inside the file, and how software turns those raw bytes into text, images, or audio.
A file is bytes. A format is the rulebook for interpreting those bytes.
A file is just bytes
Every file on your computer is a sequence of bytes. A byte is a number from 0 to 255. Documents, photos, songs, executables: all of them are number sequences on disk.
Those numbers do not carry meaning by themselves. Meaning comes from interpretation rules. The same bytes can be read as text, pixel values, audio samples, or machine instructions.
Try the views below and watch one byte stream show up in different forms:
Same bytes, different views (interactive demo with Hex, Text, Decimal, and Color views):
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A 42 79 74 65 73 20 61 72 65 20 6A 75 73 74 20 6E 75 6D
Hello, world!\nBytes are just num
The bytes stayed the same. Only the interpretation changed.
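A minimal Python sketch of the same idea, using the first 14 bytes of the stream above. Grouping bytes into (red, green, blue) triples is an assumption made for illustration; it is only one of many ways to read bytes as color.

```python
# The same bytes, viewed four ways. Nothing here modifies `data`.
data = bytes.fromhex("48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A")

as_hex = data.hex(" ").upper()   # two hex digits per byte
as_text = data.decode("ascii")   # bytes read as ASCII characters
as_decimal = list(data)          # each byte as a number from 0 to 255

# Read consecutive triples as (red, green, blue); drop a trailing partial triple.
usable = len(data) - len(data) % 3
as_colors = [tuple(data[i:i + 3]) for i in range(0, usable, 3)]

print(as_hex)
print(repr(as_text))
print(as_decimal)
print(as_colors)
```

Each view is computed from the same `data` object, which is never changed.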
Extensions are just labels
In report.txt, .txt is the extension. It is a hint to the operating system, not proof.
You can rename any file to almost anything. Rename a PNG image to document.txt and the bytes remain identical.
Extensions can lie. Compare what each filename claims with what the bytes actually say:
Extensions can lie (interactive demo; examples: document.txt, photo.jpg, music.mp3, data.csv)
document.txt: the extension says Text, the bytes say PNG image.
89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52
\x89PNG\r\n\x1a\n
If the OS trusted extensions alone, several of these files would open in the wrong program.
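The mismatch can be reproduced with a short Python sketch. The file names and the temporary directory are assumptions for the demo, and the bytes written are only the PNG signature plus the start of a chunk header, not a viewable image.

```python
import mimetypes
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "image.png"
    # Signature plus the first bytes of an IHDR chunk; not a complete PNG.
    original.write_bytes(PNG_SIGNATURE + b"\x00\x00\x00\x0d" + b"IHDR")

    # Renaming changes the label, not the contents.
    renamed = original.rename(Path(tmp) / "document.txt")
    contents = renamed.read_bytes()

    # What the name claims: an extension lookup that never opens the file.
    claimed_type, _ = mimetypes.guess_type(renamed.name)
    # What the bytes say.
    is_png = contents.startswith(PNG_SIGNATURE)

print(claimed_type, is_png)
```

`mimetypes.guess_type` looks only at the name, which is why it reports a text type for a file whose bytes begin with the PNG signature.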
Bytes that identify themselves
Many binary formats begin with a fixed byte pattern called a magic number (or file signature). Those first bytes identify the format regardless of filename.
Plain text files often lack a reliable signature. Most binary formats have one.
Here are common magic bytes and how they look at the start of a file:
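| Format | Extension | Magic bytes | ASCII |
| --- | --- | --- | --- |
| PNG | .png | 89 50 4E 47 0D 0A 1A 0A | \x89PNG\r\n\x1a\n |
| JPEG | .jpg | FF D8 | \xff\xd8 |
| GIF | .gif | 47 49 46 38 39 61 | GIF89a |
| BMP | .bmp | 42 4D | BM |
| PDF | .pdf | 25 50 44 46 2D | %PDF- |
| ZIP | .zip | 50 4B 03 04 | PK\x03\x04 |
| GZIP | .gz | 1F 8B | \x1f\x8b |
| MP3 (ID3v2 tag) | .mp3 | 49 44 33 | ID3 |
| ELF | (none) | 7F 45 4C 46 | \x7fELF |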
PNG includes "PNG" in its signature. PDF starts with "%PDF-" and then a version marker.
JPEG begins with FF D8 (SOI, Start of Image), then another marker that depends on subtype: JFIF commonly continues with FF E0, Exif with FF E1.
MP3 starts with ID3 only when an ID3v2 metadata tag exists. Without that tag, the file starts directly with an MPEG frame sync.
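A rough Python sketch of that two-way check. The function name is hypothetical, and the sync test treats the frame sync as 11 set bits, so unrelated data that happens to start with those bits would also pass.

```python
def mp3_start_kind(head: bytes) -> str:
    """Classify the first bytes of a possible MP3 file (hypothetical helper)."""
    if head.startswith(b"ID3"):
        return "id3v2-tag"
    # Frame sync: 0xFF followed by a byte whose top three bits are all set.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mpeg-frame-sync"
    return "unknown"

print(mp3_start_kind(b"ID3\x04\x00"))
print(mp3_start_kind(b"\xff\xfb\x90\x00"))
print(mp3_start_kind(b"\xff\xd8\xff\xe0"))  # JPEG start: 0xD8 lacks the sync bits
```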
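ZIP starts with 50 4B (ASCII "PK", from Phil Katz). Most ZIP archives begin with PK\x03\x04, but an empty archive starts with PK\x05\x06, a spanned archive with PK\x07\x08, and a self-extracting archive begins with an executable stub, so its PK signature appears later in the file.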
The Unix file command works this way: it checks a file's opening bytes against a signature database and ignores the filename.
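A minimal sketch of that kind of lookup in Python. The `SIGNATURES` list and the `sniff` function are hypothetical names, and the entries come only from the magic-bytes table above; a real signature database is far larger and also checks bytes at other offsets.

```python
# Longer signatures come first so short ones are less likely to match by accident.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"\x7fELF", "ELF"),
    (b"ID3", "MP3 (ID3v2 tag)"),
    (b"\xff\xd8", "JPEG"),
    (b"\x1f\x8b", "GZIP"),
    (b"BM", "BMP"),
]

def sniff(head: bytes) -> str:
    """Return the first format whose signature matches the start of `head`."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"

print(sniff(b"%PDF-1.7\n"))
print(sniff(b"Hello, world!\n"))
```

To use it on a real file, read a few bytes in binary mode, for example `sniff(open(path, "rb").read(16))`, where `path` is whatever file you want to inspect. Note that the filename never enters the decision.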
Magic numbers solve identification for many binary formats. For plain text, things are fuzzier, so encoding matters.
How text becomes bytes
Text needs a mapping from characters to numbers. ASCII (1963) is the classic example: 128 characters, including letters, digits, punctuation, and control characters such as newline.
Type text and see each character turn into bytes:
Characters as bytes (interactive demo): H = 0x48, e = 0x65, l = 0x6C, l = 0x6C, o = 0x6F, ! = 0x21
In ASCII, each character maps to one byte. "A" is 65 (hex 41), "a" is 97 (hex 61), and space is 32 (hex 20). Uppercase and lowercase letters differ by one bit.
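These mappings are easy to check in Python. The sketch below uses only built-ins; `CASE_BIT` is a name chosen here for the single bit (value 0x20) that separates uppercase from lowercase letters.

```python
text = "Hello!"
codes = [ord(ch) for ch in text]   # character -> number
raw = text.encode("ascii")         # one byte per character

# Uppercase and lowercase letters differ only in the bit with value 0x20.
CASE_BIT = 0x20
lower_a = chr(ord("A") ^ CASE_BIT)   # flip the bit: "A" becomes "a"
upper_z = chr(ord("z") ^ CASE_BIT)   # flip it back: "z" becomes "Z"

print(codes)
print(raw.hex(" "))
print(lower_a, upper_z)
```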
ASCII works for basic English text, but 128 symbols are not enough for global writing systems. It has no built-in space for characters like "é", "中", or "😀".
150,000 characters in one encoding
Unicode aims to assign a unique number, called a code point, to every character in every writing system, plus thousands of symbols and emoji. The current version defines over 150,000 characters.
Unicode is a numbering system, not a byte layout. You still need an encoding.
UTF-8 is the common one: one byte for ASCII characters, then two, three, or four bytes for other code points.
Walk through these examples and watch byte length change:
UTF-8 multi-byte encoding (interactive demo; examples: H, 1 byte (ASCII); é, 2 bytes; 中, 3 bytes; 😀, 4 bytes)
H: U+0048, 1 byte, 0x48, 01001000
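The same four examples can be reproduced in a few lines of Python. The characters are written as escapes so the code points are unambiguous; the `rows` list is just a convenient structure for the demo.

```python
# H, é, 中, 😀 written as escapes so each is exactly one code point.
examples = ["H", "\u00e9", "\u4e2d", "\U0001f600"]

rows = []
for ch in examples:
    encoded = ch.encode("utf-8")
    rows.append((f"U+{ord(ch):04X}", len(encoded), encoded.hex(" ")))

for code_point, length, hex_bytes in rows:
    print(code_point, length, hex_bytes)
```

The code point is the number Unicode assigns; the hex column is what UTF-8 actually stores for it.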
UTF-8 is self-synchronizing because the prefix bits carry structure. If you jump into the middle of a stream, you can scan forward until a valid start byte appears.
The leading bits of each byte give its role: 0 marks a one-byte character, 110 starts a two-byte sequence, 1110 starts three bytes, and 11110 starts four. Continuation bytes begin with 10.
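A small Python sketch of those prefix rules and of resynchronizing in the middle of a stream. Both function names are hypothetical. The check looks only at the prefix bits, so it accepts a few lead bytes (such as 0xC0) that a strict UTF-8 decoder rejects.

```python
def sequence_length(byte: int) -> int:
    """Length of the sequence this byte starts, or 0 if it cannot start one."""
    if byte >> 7 == 0b0:
        return 1
    if byte >> 5 == 0b110:
        return 2
    if byte >> 4 == 0b1110:
        return 3
    if byte >> 3 == 0b11110:
        return 4
    return 0  # 10xxxxxx continuation byte, or a pattern UTF-8 never uses

def next_start(data: bytes, pos: int) -> int:
    """Scan forward from `pos` to the next byte that can start a character."""
    while pos < len(data) and sequence_length(data[pos]) == 0:
        pos += 1
    return pos

data = bytes.fromhex("61 e4 b8 ad 62")   # "a", then 中 (3 bytes), then "b"
print(next_start(data, 2))               # landing inside 中 skips ahead to "b"
```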
These patterns let software guess UTF-8 from raw bytes, though short strings and pure ASCII are ambiguous because they are valid in several encodings.
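One simple way to make that guess in Python is to attempt a strict decode. This is a heuristic sketch with hypothetical function names, not how any particular tool detects encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8. A guess, not proof."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, so many encodings would agree on it."""
    return all(b < 0x80 for b in data)

utf8_cafe = b"caf\xc3\xa9"   # "café" in UTF-8
latin1_cafe = b"caf\xe9"     # "café" in Latin-1: 0xE9 starts a sequence that never finishes
plain = b"cafe"              # pure ASCII: valid UTF-8, but also valid Latin-1

print(looks_like_utf8(utf8_cafe), looks_like_utf8(latin1_cafe))
print(looks_like_utf8(plain), is_pure_ascii(plain))
```

The last case shows the ambiguity: the decode succeeds, but the bytes carry no evidence that UTF-8 was the intended encoding.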
In practice, UTF-8 won: it...