Beyond File Extensions: Detecting File Types with Magic Bytes
A colleague once sent me a file with no extension. “What’s this?” he asked. I dropped it into a hex viewer, saw 89 50 4E 47 0D 0A 1A 0A at byte 0, and knew immediately: it’s a PNG. This technique, identifying a file type from its leading header bytes, relies on what are called magic bytes (or magic numbers).
I recently implemented this in a browser-based file type detector for JsonKit. Here’s how it works under the hood.
The Magic Behind Magic Bytes
Most binary file formats start with a fixed byte sequence, their “signature.” JPEG begins with FF D8 FF, PDF with 25 50 44 46 (ASCII: %PDF), and GIF with 47 49 46 38 (GIF8). Tools such as the Unix file command, and many applications, identify file types from these bytes rather than from the filename extension. Plain-text formats are the main exception: they usually have no signature at all.
Rename photo.jpg to photo.pdf, and any tool reading the magic bytes will still identify it as JPEG. The file header doesn’t lie.
The Matching Algorithm
The core logic is straightforward: read the file’s leading bytes and compare them against a predefined signature table.
Here’s the data structure:
interface MagicEntry {
  mime: string
  extension: string
  signatures: { offset: number; bytes: number[] }[]
}
Each format can have multiple signatures at different offsets. WebP is a good example — it uses the RIFF container format (first 4 bytes: RIFF), with the actual WebP identifier at offset 8:
{ mime: 'image/webp', extension: 'webp', signatures: [
  { offset: 0, bytes: [0x52, 0x49, 0x46, 0x46] }, // "RIFF"
  { offset: 8, bytes: [0x57, 0x45, 0x42, 0x50] }, // "WEBP"
] }
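Hand-typed byte arrays are easy to get wrong, so for ASCII-based signatures it can help to generate them from the readable tag. The sketch below is only an illustration: the `ascii` helper and the `SAMPLE_TABLE` name are assumptions of mine, not part of the tool, and the entries use only signatures quoted in this article.

```typescript
interface MagicEntry {
  mime: string
  extension: string
  signatures: { offset: number; bytes: number[] }[]
}

// Hypothetical helper: turn an ASCII tag such as "RIFF" into its byte values.
function ascii(tag: string): number[] {
  return Array.from(tag, (ch) => ch.charCodeAt(0))
}

// Illustrative table fragment; a real table would hold many more entries.
const SAMPLE_TABLE: MagicEntry[] = [
  { mime: 'image/png', extension: 'png', signatures: [
    // Not pure ASCII, so the bytes are spelled out.
    { offset: 0, bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  ] },
  { mime: 'application/pdf', extension: 'pdf', signatures: [
    { offset: 0, bytes: ascii('%PDF') },
  ] },
  { mime: 'image/webp', extension: 'webp', signatures: [
    { offset: 0, bytes: ascii('RIFF') },
    { offset: 8, bytes: ascii('WEBP') },
  ] },
]
```

The trade-off is readability versus explicitness: `ascii('WEBP')` is easier to review than four hex literals, but it only works for signatures that are printable ASCII.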
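Matching is a simple byte-by-byte comparison. An entry matches only when all of its signatures match; otherwise any RIFF file (WAV, AVI) would be reported as WebP:
// True when `bytes` appears in `data` starting at `offset`.
function matchBytes(
  data: Uint8Array, bytes: number[], offset: number
): boolean {
  for (let i = 0; i < bytes.length; i++) {
    if (offset + i >= data.length) return false
    if (data[offset + i] !== bytes[i]) return false
  }
  return true
}

// Return the first entry whose signatures all match, or null.
function detectFormat(data: Uint8Array): MagicEntry | null {
  for (const entry of MAGIC_TABLE) {
    const allMatch = entry.signatures.length > 0 &&
      entry.signatures.every((sig) => matchBytes(data, sig.bytes, sig.offset))
    if (allMatch) return entry
  }
  return null
}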
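WAV and AVI files are RIFF containers too, so for an entry like WebP the offset-0 signature on its own is not enough; both signatures have to match. Here is a minimal standalone sketch of that check. The names `matchesAt`, `matchesAll`, and `riffHeader` are illustrative and not taken from the tool.

```typescript
type Signature = { offset: number; bytes: number[] }

// True when the signature's bytes appear in `data` at the signature's offset.
function matchesAt(data: Uint8Array, sig: Signature): boolean {
  if (sig.offset + sig.bytes.length > data.length) return false
  return sig.bytes.every((b, i) => data[sig.offset + i] === b)
}

// A multi-signature entry matches only when every signature matches.
function matchesAll(data: Uint8Array, signatures: Signature[]): boolean {
  return signatures.length > 0 && signatures.every((sig) => matchesAt(data, sig))
}

const webpSignatures: Signature[] = [
  { offset: 0, bytes: [0x52, 0x49, 0x46, 0x46] }, // "RIFF"
  { offset: 8, bytes: [0x57, 0x45, 0x42, 0x50] }, // "WEBP"
]

// Build a 12-byte RIFF header with the given four-character form type.
function riffHeader(formType: string): Uint8Array {
  const header = new Uint8Array(12)
  header.set([0x52, 0x49, 0x46, 0x46], 0) // "RIFF"
  // Bytes 4..7 hold the chunk size; left as zero for this illustration.
  header.set(Array.from(formType, (ch) => ch.charCodeAt(0)), 8)
  return header
}

const isWebp = matchesAll(riffHeader('WEBP'), webpSignatures) // true
const isWav = matchesAll(riffHeader('WAVE'), webpSignatures)  // false
```

The WAVE header passes the RIFF check at offset 0 and fails at offset 8, which is exactly the case a "first signature that matches" rule would get wrong.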
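Time complexity is O(n × m), where n is the number of signatures in the table (30+ formats in ours) and m is the signature length in bytes. In practice the detector reads only a short header rather than the whole file, so the cost is negligible. The header does have to reach the deepest offset in the table: 64 bytes covers most formats, but not TAR, whose signature sits at offset 257.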
Edge Cases Worth Knowing
1. Extension Mismatch Detection
After detecting the real type, the tool compares it against the file’s extension. A mismatch often means the file was renamed, either deliberately (for example, malware disguised as another format) or by accident. We handle aliases like .jpg ↔ .jpeg to avoid false alarms.
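The comparison can be sketched roughly as follows. Everything here is an assumption for illustration: the alias list, the function names, and the choice not to flag files that have no extension at all.

```typescript
// Hypothetical alias table: alternative spelling -> canonical extension.
const EXTENSION_ALIASES = new Map<string, string>([
  ['jpeg', 'jpg'],
  ['tiff', 'tif'],
  ['htm', 'html'],
])

// Lowercase extension without the dot; '' when the name has none.
function extensionOf(filename: string): string {
  const dot = filename.lastIndexOf('.')
  // A leading dot (".bashrc") or trailing dot ("file.") is not an extension.
  if (dot <= 0 || dot === filename.length - 1) return ''
  return filename.slice(dot + 1).toLowerCase()
}

// Map an extension to its canonical spelling.
function canonical(ext: string): string {
  const lower = ext.toLowerCase()
  return EXTENSION_ALIASES.get(lower) ?? lower
}

// True when the filename's extension disagrees with the detected one.
// Files without an extension are not flagged: there is nothing to contradict.
function isExtensionMismatch(filename: string, detectedExtension: string): boolean {
  const claimed = extensionOf(filename)
  if (claimed === '') return false
  return canonical(claimed) !== canonical(detectedExtension)
}
```

With this sketch, `isExtensionMismatch('photo.JPEG', 'jpg')` is false, while `isExtensionMismatch('photo.pdf', 'jpg')` is true.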
2. Ambiguous Signatures
Some formats share the same magic bytes. Both .xls and .ppt files start with D0 CF 11 E0 A1 B1 1A E1 (the OLE2 container format). The first match in the table wins, but true disambiguation requires parsing deeper into the file structure.
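One alternative to "first match wins" is to return every entry that matches and let the caller decide, for example by using the filename extension as a tie-breaker or by reporting the container type. This is a sketch of that alternative, not how the tool behaves; `detectCandidates` and `OLE2_ENTRIES` are names I made up for the example.

```typescript
interface MagicEntry {
  mime: string
  extension: string
  signatures: { offset: number; bytes: number[] }[]
}

// The shared OLE2 container signature quoted above.
const OLE2 = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]

// Two illustrative entries that share one signature.
const OLE2_ENTRIES: MagicEntry[] = [
  { mime: 'application/vnd.ms-excel', extension: 'xls',
    signatures: [{ offset: 0, bytes: OLE2 }] },
  { mime: 'application/vnd.ms-powerpoint', extension: 'ppt',
    signatures: [{ offset: 0, bytes: OLE2 }] },
]

// True when `bytes` appears in `data` starting at `offset`.
function bytesAt(data: Uint8Array, bytes: number[], offset: number): boolean {
  if (offset + bytes.length > data.length) return false
  return bytes.every((b, i) => data[offset + i] === b)
}

// Return every entry whose signatures all match, instead of only the first.
function detectCandidates(data: Uint8Array, table: MagicEntry[]): MagicEntry[] {
  return table.filter((entry) =>
    entry.signatures.length > 0 &&
    entry.signatures.every((sig) => bytesAt(data, sig.bytes, sig.offset)))
}

// A header that starts with the OLE2 signature yields both candidates.
const candidates = detectCandidates(new Uint8Array([...OLE2, 0, 0, 0, 0]), OLE2_ENTRIES)
```

Returning a list makes the ambiguity visible to the caller instead of silently depending on table order.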
3. Non-Zero Offsets
Not all signatures sit at byte 0. TAR archives have the ustar identifier at offset 257. EOT fonts have their signature at offset 34. Your reader needs to read enough bytes up front — reading just the first 8 bytes isn’t sufficient.
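Rather than hard-coding how many bytes to read, the required length can be derived from the table itself. A minimal sketch, assuming the signature shape shown earlier (`requiredHeaderLength` is a hypothetical name):

```typescript
type Signature = { offset: number; bytes: number[] }

// Smallest number of leading bytes that covers every signature in the table.
function requiredHeaderLength(table: { signatures: Signature[] }[]): number {
  let needed = 0
  for (const entry of table) {
    for (const sig of entry.signatures) {
      needed = Math.max(needed, sig.offset + sig.bytes.length)
    }
  }
  return needed
}

// Illustrative entries: PNG at offset 0 and the TAR "ustar" tag at offset 257.
const sample = [
  { signatures: [{ offset: 0, bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] }] },
  { signatures: [{ offset: 257, bytes: [0x75, 0x73, 0x74, 0x61, 0x72] }] }, // "ustar"
]

const headerLength = requiredHeaderLength(sample) // 257 + 5 = 262
```

Computing the length this way means a newly added deep-offset signature cannot silently fall outside the bytes that were read.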
4. UTF-8 BOM
A UTF-8 byte order mark (EF BB BF) at the start of a text file can cause XML and SVG signature matching to fail. The detector should skip BOM bytes before attempting text format matching.
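Skipping the BOM can be as small as the sketch below. `skipUtf8Bom` is a hypothetical helper name, and the example input is a hand-built `<?xml` prefix.

```typescript
const UTF8_BOM = [0xef, 0xbb, 0xbf]

// Return a view of `data` without a leading UTF-8 BOM; unchanged if none is present.
function skipUtf8Bom(data: Uint8Array): Uint8Array {
  const hasBom = data.length >= UTF8_BOM.length &&
    UTF8_BOM.every((b, i) => data[i] === b)
  // subarray() creates a view on the same buffer, so nothing is copied.
  return hasBom ? data.subarray(UTF8_BOM.length) : data
}

// Example: "<?xml" preceded by a BOM.
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x3c, 0x3f, 0x78, 0x6d, 0x6c])
const textBytes = skipUtf8Bom(withBom) // now starts with 0x3c ("<")
```

Only the text-format signatures should be matched against the stripped view; binary signatures still apply to the original bytes.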
Why Pure Browser?
The entire implementation runs on FileReader + Uint8Array — no server uploads, no third-party libraries, no file command dependency. Advantages:
- Zero dependencies — loads instantly
- Privacy-first — files never leave your machine
- Real-time — drop a file and get results immediately
The tool currently supports 30+ formats including images, documents, audio, video, archives, fonts, and executables.
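For reference, reading just the header in the browser takes only a few lines. Note that this sketch swaps FileReader for the promise-based `Blob.arrayBuffer()`, which does the same job as `FileReader.readAsArrayBuffer()`. The 512-byte limit and the function names are assumptions for illustration.

```typescript
// Hypothetical upper bound on how many leading bytes the signature table needs.
const HEADER_BYTES = 512

// Take only the leading bytes of the file. Slicing a Blob is cheap:
// it creates a reference to the range without reading the contents.
function headerBlob(file: Blob, maxBytes: number = HEADER_BYTES): Blob {
  return file.slice(0, Math.min(maxBytes, file.size))
}

// Read the header into a Uint8Array ready for signature matching.
async function readHeader(
  file: Blob, maxBytes: number = HEADER_BYTES
): Promise<Uint8Array> {
  const buffer = await headerBlob(file, maxBytes).arrayBuffer()
  return new Uint8Array(buffer)
}
```

A `File` from a drop event or an `<input type="file">` element is a `Blob`, so it can be passed to `readHeader` directly. Because only the sliced range is read, even very large files are handled without loading them into memory.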
The Takeaway
Magic byte detection isn’t rocket science, but the implementation details — offset handling, multi-signature validation, extension cross-checking — determine whether the tool is genuinely useful or just a toy.
Need to verify a file’s true identity? Try File Type Detector — it tells you what a file really is, not just what its extension claims to be.