Linux tar Command Deep Dive: From File Packaging to Compression Algorithms
tar is the classic archiving tool in Linux. Most developers only know tar -czvf, but there’s more to its implementation than meets the eye.
Core Concept: Archiving vs Compression
First, a crucial distinction: tar is an archiver, not a compressor.
Archiving means packing multiple files into one. Think of it as putting scattered items into a box—the box size equals the sum of all items.
Compression reduces file size through algorithms. It’s like vacuum-sealing clothes in the box.
Typical usage:
tar -cvf archive.tar files/ # Archive only, no compression
tar -czvf archive.tar.gz files/ # Archive + gzip compression
tar -cjvf archive.tar.bz2 files/ # Archive + bzip2 compression
tar -cJvf archive.tar.xz files/ # Archive + xz compression
The -z, -j, and -J flags tell tar to stream the archive through the corresponding compression tool (gzip, bzip2, xz) as it writes it.
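The same separation shows up in Python's standard tarfile module, where the open mode selects archiving alone or archiving plus compression. A minimal sketch (the file names and scratch directory here are made up for the demo):

```python
import os
import tarfile
import tempfile

# Create two compressible sample files in a scratch directory
tmp = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write("hello tar\n" * 10_000)

plain = os.path.join(tmp, "archive.tar")
gzipped = os.path.join(tmp, "archive.tar.gz")

# 'w' = archive only; 'w:gz' = archive + gzip compression
for path, mode in ((plain, "w"), (gzipped, "w:gz")):
    with tarfile.open(path, mode) as tar:
        tar.add(os.path.join(tmp, "a.txt"), arcname="a.txt")
        tar.add(os.path.join(tmp, "b.txt"), arcname="b.txt")

raw = sum(os.path.getsize(os.path.join(tmp, n)) for n in ("a.txt", "b.txt"))
print("inputs:", raw)                      # sum of the input files
print("tar:   ", os.path.getsize(plain))   # at least as large as the inputs
print("tar.gz:", os.path.getsize(gzipped)) # much smaller for repetitive data
```

The uncompressed archive is the inputs plus header and padding overhead; only the gzip step actually shrinks anything.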
tar File Format Anatomy
The tar format is remarkably simple. Each file has a 512-byte header containing metadata (filename, size, permissions, timestamp), followed by file content (padded to 512-byte boundaries), ending with two 512-byte zero blocks.
struct tar_header {
    char name[100];     // Filename
    char mode[8];       // Permissions
    char uid[8];        // User ID
    char gid[8];        // Group ID
    char size[12];      // File size
    char mtime[12];     // Modification time
    char checksum[8];   // Checksum
    char typeflag;      // File type
    char linkname[100]; // Link target
    char magic[6];      // "ustar"
    char version[2];    // Version
    char uname[32];     // Username
    char gname[32];     // Group name
    char devmajor[8];   // Major device number
    char devminor[8];   // Minor device number
    char prefix[155];   // Path prefix
    char padding[12];   // Padding
};
Interesting design choices:
- All-text numeric fields: numbers are stored as octal ASCII strings for cross-platform compatibility (e.g., permission 755 is stored as 0000755\0)
- Fixed-length header: exactly 512 bytes, matching the traditional tape block size and keeping parsing trivial
- Checksum calculation: the checksum field is treated as eight spaces while summing every byte of the header
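The field widths can be sanity-checked with Python's struct module; the format string below mirrors the C struct field-for-field:

```python
import struct

# One 's' entry per field of the ustar header, in order:
# name, mode, uid, gid, size, mtime, checksum, typeflag, linkname,
# magic, version, uname, gname, devmajor, devminor, prefix, padding
TAR_HEADER_FMT = "100s8s8s8s12s12s8s1s100s6s2s32s32s8s8s155s12s"

print(struct.calcsize(TAR_HEADER_FMT))  # 512
```

The sizes sum to exactly one 512-byte block, which is why a tar reader can walk the archive header by header.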
Here’s a minimal tar packer:
import os

def create_tar(files, output):
    with open(output, 'wb') as f:
        for file_path in files:
            # Build a 512-byte header (ustar magic, uname, etc. omitted for brevity)
            header = bytearray(512)
            name = os.path.basename(file_path).encode('utf-8')[:100]
            header[0:len(name)] = name
            stat = os.stat(file_path)
            # Permissions (octal string, NUL-terminated, 8 bytes)
            header[100:108] = f'{stat.st_mode & 0o7777:07o}\0'.encode()
            # File size (octal string, 12 bytes)
            header[124:136] = f'{stat.st_size:011o}\0'.encode()
            # Modification time (octal string, 12 bytes)
            header[136:148] = f'{int(stat.st_mtime):011o}\0'.encode()
            # Type flag: '0' = regular file
            header[156:157] = b'0'
            # Checksum: the field counts as 8 spaces while summing
            header[148:156] = b' ' * 8
            checksum = sum(header)
            header[148:156] = f'{checksum:06o}\0 '.encode()
            # Write header
            f.write(header)
            # Write content, padded to a 512-byte boundary
            with open(file_path, 'rb') as infile:
                content = infile.read()
            f.write(content)
            padding = (512 - len(content) % 512) % 512
            f.write(b'\0' * padding)
        # End marker (two zero blocks)
        f.write(b'\0' * 1024)

create_tar(['file1.txt', 'file2.txt'], 'archive.tar')
Option Letter Meanings
tar’s option parsing is unusual: it accepts old-style bundled options, so you can omit the - prefix:
tar cvf archive.tar files/ # Without hyphen
tar -cvf archive.tar files/ # With hyphen
Common options:
- c (create): Create an archive
- x (extract): Extract an archive
- t (list): List archive contents
- v (verbose): Show files as they are processed
- f (file): Specify the archive filename
- z (gzip): gzip compression/decompression
- j (bzip2): bzip2 compression/decompression
- J (xz): xz compression/decompression
- C (directory): Change to the given directory before creating or extracting
Key tip: f must come last in a bundled option group, because tar treats the next argument as the archive filename. The other options can appear in any order.
Practical Techniques: Incremental Backup and Exclusions
Excluding Specific Files
# Exclude node_modules and .git directories
tar -czvf project.tar.gz \
--exclude='node_modules' \
--exclude='.git' \
project/
# Read exclusion list from file
tar -czvf project.tar.gz -X exclude.txt project/
exclude.txt contents:
node_modules
.git
*.log
.DS_Store
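The same exclusion logic can be expressed in Python through tarfile's filter callback, which drops a member (and, for directories, its whole subtree) when it returns None. A sketch using the same names as the exclude.txt above, with a throwaway project tree built for the demo:

```python
import os
import tarfile
import tempfile

EXCLUDE_DIRS = {"node_modules", ".git"}

def exclude_filter(tarinfo):
    """Return None to skip a member, mirroring tar's --exclude."""
    parts = tarinfo.name.split("/")
    if any(p in EXCLUDE_DIRS for p in parts):
        return None
    if tarinfo.name.endswith(".log"):
        return None
    return tarinfo

# Build a small project tree to demonstrate
tmp = tempfile.mkdtemp()
proj = os.path.join(tmp, "project")
os.makedirs(os.path.join(proj, "node_modules"))
os.makedirs(os.path.join(proj, "src"))
open(os.path.join(proj, "node_modules", "dep.js"), "w").close()
open(os.path.join(proj, "src", "main.py"), "w").close()
open(os.path.join(proj, "debug.log"), "w").close()

out = os.path.join(tmp, "project.tar.gz")
with tarfile.open(out, "w:gz") as tar:
    tar.add(proj, arcname="project", filter=exclude_filter)

with tarfile.open(out) as tar:
    names = tar.getnames()
print(names)
```

Because tarfile stops recursing into a directory once the filter rejects it, excluding node_modules skips everything inside it, just as --exclude does.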
Incremental Backups
tar supports time-based selection with -N (--newer), which includes only files modified after a given date. (GNU tar's snapshot-based incremental mode, --listed-incremental, is the more robust choice for real backup chains, but -N covers simple cases.)
# Full backup
tar -czvf full-backup.tar.gz /data
# Incremental backup (today's modifications only)
tar -czvf incremental-$(date +%Y%m%d).tar.gz \
-N "today" \
/data
# Backup files modified in last 7 days
tar -czvf week-changes.tar.gz \
-N "$(date -d '7 days ago' +%Y-%m-%d)" \
/data
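A time-based selection like -N can be sketched in Python with a tarfile filter that drops members older than a cutoff. The 7-day window mirrors the last example; the data directory and file names are invented for the demo:

```python
import os
import tarfile
import tempfile
import time

def newer_than(cutoff):
    """Build a filter that keeps only files modified after `cutoff`."""
    def _filter(tarinfo):
        if tarinfo.isdir():
            return tarinfo  # keep directories so paths stay intact
        return tarinfo if tarinfo.mtime > cutoff else None
    return _filter

tmp = tempfile.mkdtemp()
data = os.path.join(tmp, "data")
os.makedirs(data)

old_file = os.path.join(data, "old.txt")
new_file = os.path.join(data, "new.txt")
open(old_file, "w").close()
open(new_file, "w").close()
# Backdate one file by 30 days
month_ago = time.time() - 30 * 86400
os.utime(old_file, (month_ago, month_ago))

cutoff = time.time() - 7 * 86400  # "7 days ago"
out = os.path.join(tmp, "week-changes.tar.gz")
with tarfile.open(out, "w:gz") as tar:
    tar.add(data, arcname="data", filter=newer_than(cutoff))

with tarfile.open(out) as tar:
    names = tar.getnames()
print(names)
```

Only the recently modified file lands in the archive, which is exactly the -N behavior: a modification-time filter, not a true incremental snapshot.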
Compression Algorithm Comparison
| Algorithm | Flag | Compression Ratio | Speed | Use Case |
|---|---|---|---|---|
| gzip | -z | Medium | Fast | Daily use, network transfer |
| bzip2 | -j | High | Slow | Long-term archiving, space saving |
| xz | -J | Highest | Slowest | Large file archiving, release packages |
Real-world test (100MB project code):
time tar -czvf test.tar.gz project/ # 2.3s, 18MB
time tar -cjvf test.tar.bz2 project/ # 8.1s, 14MB
time tar -cJvf test.tar.xz project/ # 25s, 11MB
xz has the best ratio but is slowest—ideal for one-time compression of long-term archives.
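The trade-off is easy to reproduce, since Python's tarfile supports all three codecs through its mode string ('w:gz', 'w:bz2', 'w:xz'). Absolute numbers will differ from the measurements above; they depend entirely on the data:

```python
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "data.txt")
with open(src, "w") as f:
    f.write("some fairly repetitive project text\n" * 50_000)

sizes = {}
for ext, mode in (("gz", "w:gz"), ("bz2", "w:bz2"), ("xz", "w:xz")):
    out = os.path.join(tmp, f"test.tar.{ext}")
    with tarfile.open(out, mode) as tar:
        tar.add(src, arcname="data.txt")
    sizes[ext] = os.path.getsize(out)

raw = os.path.getsize(src)
for ext, size in sizes.items():
    print(f"{ext}: {size} bytes ({size / raw:.1%} of original)")
```

All three should come in well under the raw size for text like this; timing each loop iteration would reproduce the speed comparison as well.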
Extraction Tips and Common Issues
Extract to Specific Directory
# Extract to /opt
tar -xzvf archive.tar.gz -C /opt
# Extract single file
tar -xzvf archive.tar.gz path/to/file.txt
List Archive Contents (Without Extracting)
tar -tzvf archive.tar.gz
Output format:
-rw-r--r-- user/group 1234 2026-05-08 10:30 file1.txt
drwxr-xr-x user/group 0 2026-05-08 10:30 directory/
Absolute Path Pitfall
tar strips leading / by default to prevent overwriting system files:
tar -czvf backup.tar.gz /home/user/project/
# Internal path is home/user/project/, not /home/user/project/
To preserve absolute paths (dangerous):
tar -czvf backup.tar.gz -P /home/user/project/
Filename Encoding Issues
Archives that cross platforms or locales (e.g., created on a system using GBK) can extract with garbled filenames. tar itself does not transcode names; a common fix is to extract first and then convert the filenames with convmv:
# Convert extracted filenames from GBK to UTF-8
convmv -f gbk -t utf-8 -r --notest extracted/
Performance: Handling Large Files
For huge files, use pipes to avoid temporary files:
# Direct transfer via SSH
tar -czvf - /large/directory | ssh user@server "cat > backup.tar.gz"
# Split into multiple parts
tar -czvf - /large/directory | split -b 1G - backup.tar.gz.part
Reassemble and extract:
cat backup.tar.gz.part* | tar -xzvf -
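The pipe trick works because tar can write to any byte stream, not just a file. Python's tarfile mirrors this with its fileobj argument and streaming modes ('w|gz' / 'r|gz'); here an in-memory buffer stands in for the network pipe:

```python
import io
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big.bin")
with open(src, "wb") as f:
    f.write(os.urandom(1024) * 64)  # 64 KiB of sample data

# Stream the archive into a buffer instead of a file,
# the way `tar -czf -` streams to stdout
stream = io.BytesIO()
with tarfile.open(fileobj=stream, mode="w|gz") as tar:
    tar.add(src, arcname="big.bin")

# The receiving side reads it back from the same stream
stream.seek(0)
with tarfile.open(fileobj=stream, mode="r|gz") as tar:
    names = [member.name for member in tar]
print(names)
```

The `|` in the mode string tells tarfile the stream is not seekable, which is exactly the constraint a pipe imposes.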
Web Implementation: Browser-side tar Parsing
JavaScript can parse tar files in the browser:
async function parseTar(buffer) {
  const files = []
  const decoder = new TextDecoder()
  let offset = 0
  while (offset + 512 <= buffer.byteLength) {
    // Read filename (NUL-padded, 100 bytes at offset 0)
    const name = decoder
      .decode(new Uint8Array(buffer, offset, 100))
      .replace(/\0/g, '')
    if (!name) break // empty block: end of archive
    // Read file size (octal string, 12 bytes at offset 124)
    const sizeStr = decoder
      .decode(new Uint8Array(buffer, offset + 124, 12))
      .replace(/\0/g, '')
      .trim()
    const size = parseInt(sizeStr, 8) || 0
    // Extract content (starts right after the 512-byte header)
    const content = new Uint8Array(buffer, offset + 512, size)
    files.push({ name, size, content })
    // Advance to the next header (content is padded to 512 bytes)
    offset += 512 + Math.ceil(size / 512) * 512
  }
  return files
}
// Usage
const response = await fetch('archive.tar')
const buffer = await response.arrayBuffer()
const files = await parseTar(buffer)
console.log(files.map(f => f.name))
Practical Script: Quick Project Backup
#!/bin/bash
# project-backup.sh
PROJECT_NAME="my-project"
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d-%H%M%S)
tar -czvf "${BACKUP_DIR}/${PROJECT_NAME}-${DATE}.tar.gz" \
    --exclude='node_modules' \
    --exclude='.next' \
    --exclude='dist' \
    --exclude='*.log' \
    -C /home/user/projects "${PROJECT_NAME}"
# Keep only last 7 days of backups
find "${BACKUP_DIR}" -name "${PROJECT_NAME}-*.tar.gz" -mtime +7 -delete
echo "Backup created: ${PROJECT_NAME}-${DATE}.tar.gz"
Summary
tar embodies the Unix philosophy: do one thing well, and cooperate with other tools. It focuses on archiving, leaving compression to gzip/bzip2/xz, transfer to ssh, splitting to split.
Key takeaways:
- Understand the distinction between archiving and compression
- Remember that f must come last
- Use --exclude for selective backups
- Choose the compression algorithm to match the use case
Want more tar options? Check the docs: Linux tar Command Guide