What Is a File Format?

What exactly is a file format? An interactive guide | growingSWE


I just realized one day that I didn't know what file formats really were. Why didn't I question this before?

You double-click a file and it opens in the "right" app. Usually we take that for granted.

But why did that work?

.txt, .png, and .pdf are just suffixes in a filename. The real question is what is inside the file, and how software turns those raw bytes into text, images, or audio.

A file is bytes. A format is the rulebook for interpreting those bytes.

A file is just bytes

Every file on your computer is a sequence of bytes. A byte is a number from 0 to 255. Documents, photos, songs, executables: all of them are number sequences on disk.

Those numbers do not carry meaning by themselves. Meaning comes from interpretation rules. The same bytes can be read as text, pixel values, audio samples, or machine instructions.

Try the views below and watch one byte stream show up in different forms:

Same bytes, different views (Hex, Text, Decimal, Color)

Hex: 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A 42 79 74 65 73 20 61 72 65 20 6A 75 73 74 20 6E 75 6D

Text: Hello, world!\nBytes are just num

The bytes stayed the same. Only the interpretation changed.
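As a rough sketch of the same idea in code, the snippet below takes the first 14 bytes from the demo and reads them four ways. The variable names are illustrative, and the "color" view is an assumption: it groups bytes in threes and treats each group as an RGB triple, which may not match how the original widget colors them.

```python
# First 14 bytes of the demo stream: "Hello, world!\n"
data = bytes.fromhex("48656C6C6F2C20776F726C64210A")

as_hex = data.hex(" ").upper()      # hex view (bytes.hex with a separator needs Python 3.8+)
as_text = data.decode("ascii")      # text view
as_decimal = list(data)             # decimal view: one integer from 0 to 255 per byte

# Color view (assumed scheme): each full group of three bytes becomes (R, G, B).
as_colors = [tuple(data[i:i + 3]) for i in range(0, len(data) - 2, 3)]

print(as_hex)
print(repr(as_text))
print(as_decimal)
print(as_colors)
```

Nothing in `data` changes between the four views. Only the code that reads it does.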

Extensions are just labels

In report.txt, .txt is the extension. It is a hint to the operating system, not proof.

You can rename any file to almost anything. Rename a PNG image to document.txt and the bytes remain identical.
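One way to check this claim yourself is sketched below. It writes a stand-in file that begins with the PNG signature (not a complete, valid PNG), renames it to document.txt, and compares a SHA-256 hash before and after. The file names and the `sha256_of` helper are made up for the illustration.

```python
import hashlib
import os
import tempfile

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    png_path = os.path.join(tmp, "image.png")
    txt_path = os.path.join(tmp, "document.txt")

    # Stand-in content: a real PNG would continue with chunks after the signature.
    with open(png_path, "wb") as f:
        f.write(PNG_SIGNATURE + b"stand-in for the rest of the image")

    before = sha256_of(png_path)
    os.rename(png_path, txt_path)   # only the name changes
    after = sha256_of(txt_path)

    with open(txt_path, "rb") as f:
        still_png_signature = f.read(8) == PNG_SIGNATURE

print(before == after, still_png_signature)
```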

Extensions can lie. Compare what each filename claims with what the bytes actually say:

Example 1 of 4: document.txt
extension says: Text
bytes say: PNG Image

89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52
\x89PNG\r\n\x1a\n

Files in the demo: document.txt, photo.jpg, music.mp3, data.csv

If the OS trusted extensions alone, several of these files would open in the wrong program.

Bytes that identify themselves

Many binary formats begin with a fixed byte pattern called a magic number (or file signature). Those first bytes identify the format regardless of filename.

Plain text files often lack a reliable signature. Binary formats usually have one.

Here are common magic bytes and how they look at the start of a file:

Format            Extension  Magic bytes              ASCII
PNG               .png       89 50 4E 47 0D 0A 1A 0A  \x89PNG\r\n\x1a\n
JPEG              .jpg       FF D8                    \xff\xd8
GIF               .gif       47 49 46 38 39 61        GIF89a
BMP               .bmp       42 4D                    BM
PDF               .pdf       25 50 44 46 2D           %PDF-
ZIP               .zip       50 4B 03 04              PK\x03\x04
GZIP              .gz        1F 8B                    \x1f\x8b
MP3 (ID3v2 tag)   .mp3       49 44 33                 ID3
ELF               (none)     7F 45 4C 46              \x7fELF

PNG includes "PNG" in its signature. PDF starts with "%PDF-" and then a version marker.

JPEG begins with FF D8 (SOI, Start of Image), then another marker that depends on subtype: JFIF commonly continues with FF E0, Exif with FF E1.
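A minimal sketch of that two-step check follows. The function name `jpeg_flavor` and its return strings are invented for illustration, and the result is only a guess from the first four bytes: a full parser would read the segment that follows the marker.

```python
def jpeg_flavor(head: bytes) -> str:
    """Classify the first four bytes of a file using the JPEG markers described above."""
    if head[:2] != b"\xff\xd8":       # SOI (Start of Image)
        return "not JPEG"
    marker = head[2:4]
    if marker == b"\xff\xe0":         # APP0 segment, commonly JFIF
        return "JPEG (likely JFIF)"
    if marker == b"\xff\xe1":         # APP1 segment, commonly Exif
        return "JPEG (likely Exif)"
    return "JPEG (other or unknown subtype)"

print(jpeg_flavor(b"\xff\xd8\xff\xe0"))
print(jpeg_flavor(b"\xff\xd8\xff\xe1"))
print(jpeg_flavor(b"\x89PNG"))
```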

MP3 starts with ID3 only when an ID3v2 metadata tag exists. Without that tag, the file starts directly with an MPEG frame sync.
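The two cases can be told apart with a short check, sketched below. One detail here is not from the article: the sketch assumes the MPEG frame sync is eleven set bits, meaning a first byte of 0xFF followed by a byte whose top three bits are all 1. The function name `mp3_start` is hypothetical, and matching the sync pattern alone does not prove the file is valid MP3.

```python
def mp3_start(head: bytes) -> str:
    """Guess how an MP3 file begins from its first few bytes."""
    if head[:3] == b"ID3":
        return "ID3v2 tag"
    # Assumed sync pattern: 0xFF, then a byte with its top three bits set.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "MPEG frame sync"
    return "unrecognized"

print(mp3_start(b"ID3\x04\x00"))
print(mp3_start(b"\xff\xfb\x90\x00"))
```

A JPEG's FF D8 does not match the sync test, because the top three bits of 0xD8 are 110 rather than 111.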

ZIP starts with 50 4B (ASCII "PK", from Phil Katz). Most ZIP archives begin with PK\x03\x04, but empty archives begin with PK\x05\x06 and spanned archives with PK\x07\x08. Self-extracting archives begin with an executable stub, so their first bytes are not a PK signature at all.

The Unix file command works largely this way: it checks a file's bytes against a signature database and ignores the filename.
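A toy version of signature matching might look like the sketch below. It covers only the signatures listed in this article, the names `SIGNATURES`, `sniff`, and `sniff_file` are invented here, and it is far simpler than the real file command and its database.

```python
# (signature, format name) pairs taken from the table in this article.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"\x7fELF", "ELF"),
    (b"ID3", "MP3 (ID3v2 tag)"),
    (b"\xff\xd8", "JPEG"),
    (b"\x1f\x8b", "GZIP"),
    (b"BM", "BMP"),
]

def sniff(head: bytes) -> str:
    """Return the first format whose signature matches the start of head."""
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    return "unknown"

def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and sniff them; the filename is never consulted."""
    with open(path, "rb") as f:
        return sniff(f.read(16))

print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff(b"just some text"))
```

Short signatures are weak evidence. Any text file that happens to start with "BM" would be reported as BMP here, which is one reason real tools check more than the first two bytes.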

Magic numbers solve identification for many binary formats. For plain text, things are fuzzier, so encoding matters.

How text becomes bytes

Text needs a mapping from characters to numbers. ASCII (1963) is the classic example: 128 characters, including letters, digits, punctuation, and control characters such as newline.

Type text and see each character turn into bytes:

Characters as bytes: H 0x48, e 0x65, l 0x6C, l 0x6C, o 0x6F, ! 0x21

In ASCII, each character maps to one byte. "A" is 65 (hex 41), "a" is 97 (hex 61), and space is 32 (hex 20). Each uppercase letter differs from its lowercase counterpart by a single bit (hex 20).
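These values are easy to confirm with Python's built-in `ord` and `chr`. The snippet is only a quick check, and the variable names are illustrative.

```python
# Code points for the characters mentioned above.
for ch in "Aa ":
    print(repr(ch), ord(ch), hex(ord(ch)))

# XOR shows which bits differ between "A" and "a".
case_bit = ord("A") ^ ord("a")
print(hex(case_bit))

# Flipping that one bit switches the case of an ASCII letter.
flipped = chr(ord("h") ^ case_bit)
print(flipped)

# Encoding text as ASCII yields one byte per character.
hello_bytes = "Hello!".encode("ascii")
print(hello_bytes.hex(" "))
```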

ASCII works for basic English text, but 128 symbols are not enough for global writing systems. It has no built-in space for characters like "é", "中", or "😀".

150,000 characters in one standard

Unicode aims to assign a unique number, called a code point, to every character in every writing system, plus thousands of symbols and emoji. The current version defines over 150,000 characters.

Unicode is a numbering system, not a byte layout. You still need an encoding.

UTF-8 is the common one: one byte for ASCII characters, then two, three, or four bytes for other code points.
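The variable length is easy to observe with Python's built-in `str.encode`. In this small sketch the characters are written as escape sequences so the example does not depend on how the source file itself is encoded, and "é" is assumed to be the single precomposed code point U+00E9.

```python
samples = [
    "H",            # U+0048, ASCII
    "\u00e9",       # é, precomposed form
    "\u4e2d",       # 中
    "\U0001F600",   # 😀
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
```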

Walk through these examples and watch byte length change:

UTF-8 multi-byte encoding, example 1 of 4: H is U+0048, 1 byte: 0x48 (01001000)

Examples: H (1 byte, ASCII), é (2 bytes), 中 (3 bytes), 😀 (4 bytes)

UTF-8 is self-synchronizing because the prefix bits carry structure. If you jump into the middle of a stream, you can scan forward until a valid start byte appears.

The leading bits of each byte give its role. A byte starting with 0 is a one-byte character. 110 starts a two-byte sequence, 1110 starts three bytes, and 11110 starts four. Continuation bytes begin with 10.
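The prefix rules translate directly into bit masks. The sketch below classifies a byte by its leading bits and shows the resynchronization scan: starting from an arbitrary position, skip continuation bytes until a start byte appears. The names `byte_role` and `next_start` are hypothetical, and the check looks only at prefixes, so a few byte values that strict UTF-8 decoders reject (such as 0xC0) still pass it.

```python
def byte_role(b: int) -> str:
    """Classify one byte by its UTF-8 prefix bits."""
    if (b & 0b10000000) == 0b00000000:
        return "1-byte character"
    if (b & 0b11000000) == 0b10000000:
        return "continuation"
    if (b & 0b11100000) == 0b11000000:
        return "start of 2-byte sequence"
    if (b & 0b11110000) == 0b11100000:
        return "start of 3-byte sequence"
    if (b & 0b11111000) == 0b11110000:
        return "start of 4-byte sequence"
    return "invalid in UTF-8"

def next_start(data: bytes, pos: int) -> int:
    """Scan forward from pos to the next byte that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos

data = "a\u4e2db".encode("utf-8")   # bytes: 61 E4 B8 AD 62
for b in data:
    print(f"{b:02X} {b:08b} {byte_role(b)}")

# Landing on index 2 (in the middle of the 3-byte character) resyncs at index 4.
resync_index = next_start(data, 2)
print(resync_index)
```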

These patterns let software guess UTF-8 from raw bytes, though short strings and pure ASCII are ambiguous because they are valid in several encodings.
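A crude form of that guess is to attempt a strict UTF-8 decode and see whether it fails. This is a sketch of the idea rather than how any particular detection library works, and the helper names are made up. Passing the check means the bytes could be UTF-8. It does not prove they are.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is below 0x80, which is valid in many encodings at once."""
    return all(b < 0x80 for b in data)

utf8_e_acute = "\u00e9".encode("utf-8")       # C3 A9
latin1_e_acute = "\u00e9".encode("latin-1")   # E9 on its own

print(looks_like_utf8(utf8_e_acute))     # valid two-byte sequence
print(looks_like_utf8(latin1_e_acute))   # a start byte with no continuation bytes
print(looks_like_utf8(b"hello"), is_pure_ascii(b"hello"))   # valid, but ambiguous
```

The last case is the ambiguity described above: `b"hello"` decodes to the same text as UTF-8, ASCII, or Latin-1, so the bytes alone cannot say which encoding was intended.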

In practice, UTF-8 won: it...
