Linux tar Command Deep Dive: From File Packaging to Compression Algorithms#

tar is the classic archiving tool in Linux. Most developers only know tar -czvf, but there’s more to its implementation than meets the eye.

Core Concept: Archiving vs Compression#

First, a crucial distinction: tar is an archiver, not a compressor.

Archiving means packing multiple files into one. Think of it as putting scattered items into a box—the box size equals the sum of all items.

Compression reduces file size through algorithms. It’s like vacuum-sealing clothes in the box.

Typical usage:

tar -cvf archive.tar files/      # Archive only, no compression
tar -czvf archive.tar.gz files/  # Archive + gzip compression
tar -cjvf archive.tar.bz2 files/ # Archive + bzip2 compression
tar -cJvf archive.tar.xz files/  # Archive + xz compression

The -z, -j, -J flags tell tar to invoke the corresponding compression tool after archiving.
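For scripting, the same flag combinations map onto mode strings in Python's standard tarfile module. A minimal in-memory sketch (filenames and sample data are illustrative):

```python
import io
import tarfile

# Mode strings correspond to tar's compression flags:
#   'w'     -> tar -cf   (no compression)
#   'w:gz'  -> tar -czf  (gzip)
#   'w:bz2' -> tar -cjf  (bzip2)
#   'w:xz'  -> tar -cJf  (xz)
data = b'hello tar\n' * 1000

sizes = {}
for mode in ['w', 'w:gz', 'w:bz2', 'w:xz']:
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as t:
        info = tarfile.TarInfo(name='hello.txt')
        info.size = len(data)
        t.addfile(info, io.BytesIO(data))
    sizes[mode] = len(buf.getvalue())

print(sizes)
```

Note that the uncompressed `'w'` archive is larger than the input: tar's 512-byte headers, padding, and end-of-archive blocks all add overhead.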

tar File Format Anatomy#

The tar format is remarkably simple. Each file is preceded by a 512-byte header containing metadata (filename, size, permissions, timestamp), followed by the file content padded to a 512-byte boundary; the archive ends with two 512-byte zero blocks.

struct tar_header {
  char name[100];     // Filename
  char mode[8];       // Permissions
  char uid[8];        // User ID
  char gid[8];        // Group ID
  char size[12];      // File size
  char mtime[12];     // Modification time
  char checksum[8];   // Checksum
  char typeflag;      // File type
  char linkname[100]; // Link target
  char magic[6];      // "ustar" followed by NUL
  char version[2];    // Version
  char uname[32];     // Username
  char gname[32];     // Group name
  char devmajor[8];   // Major device number
  char devminor[8];   // Minor device number
  char prefix[155];   // Path prefix
  char padding[12];   // Padding
};
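As a cross-check, the field widths above can be encoded as a Python struct format string; they sum to exactly 512 bytes:

```python
import struct

# ustar header layout as a struct format string (all fields are raw byte strings)
TAR_HEADER_FMT = ('100s'  # name
                  '8s'    # mode
                  '8s'    # uid
                  '8s'    # gid
                  '12s'   # size
                  '12s'   # mtime
                  '8s'    # checksum
                  '1s'    # typeflag
                  '100s'  # linkname
                  '6s'    # magic
                  '2s'    # version
                  '32s'   # uname
                  '32s'   # gname
                  '8s'    # devmajor
                  '8s'    # devminor
                  '155s'  # prefix
                  '12s')  # padding

print(struct.calcsize(TAR_HEADER_FMT))  # 512
```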

Interesting design choices:

  1. All ASCII text fields: Numbers stored as octal strings for cross-platform compatibility (e.g., permission 755 stored as 0000755\0)
  2. Fixed-length header: Exactly 512 bytes for easy random access
  3. Checksum calculation: the checksum is the sum of all 512 header bytes, computed with the checksum field itself treated as eight spaces
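The checksum rule can be verified against a real archive. A sketch using Python's tarfile to generate a header, then recomputing the sum with the checksum field replaced by spaces:

```python
import io
import tarfile

# Build a one-file archive in memory, then re-check the header checksum
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as t:
    info = tarfile.TarInfo(name='demo.txt')
    info.size = 4
    t.addfile(info, io.BytesIO(b'data'))

header = bytearray(buf.getvalue()[:512])
stored = int(header[148:156].split(b'\0')[0], 8)  # stored octal checksum

# Recompute: the checksum field itself counts as eight spaces
header[148:156] = b' ' * 8
assert sum(header) == stored
print('checksum ok:', stored)
```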

Here’s a minimal tar packer:

import os

def create_tar(files, output):
    with open(output, 'wb') as f:
        for file_path in files:
            # Build a 512-byte ustar header
            header = bytearray(512)
            name = os.path.basename(file_path).encode('utf-8')[:100]
            header[0:len(name)] = name

            stat = os.stat(file_path)
            # Permissions (octal, NUL-terminated, offset 100)
            header[100:108] = f'{stat.st_mode & 0o7777:07o}\0'.encode()
            # File size (octal, offset 124)
            header[124:136] = f'{stat.st_size:011o}\0'.encode()
            # Modification time (octal, offset 136)
            header[136:148] = f'{int(stat.st_mtime):011o}\0'.encode()
            # Type flag '0' (regular file) and ustar magic + version
            header[156] = ord('0')
            header[257:265] = b'ustar\x0000'

            # Checksum: computed with the checksum field set to 8 spaces
            header[148:156] = b' ' * 8
            checksum = sum(header)
            header[148:156] = f'{checksum:06o}\0 '.encode()

            f.write(header)

            # Write content, padded to a 512-byte boundary
            with open(file_path, 'rb') as infile:
                content = infile.read()
                f.write(content)
                padding = (512 - len(content) % 512) % 512
                f.write(b'\0' * padding)

        # End marker (two empty blocks)
        f.write(b'\0' * 1024)

create_tar(['file1.txt', 'file2.txt'], 'archive.tar')

Option Letter Meanings#

tar's option handling has a historical quirk: the leading - can be omitted (old-style options):

tar cvf archive.tar files/  # Without hyphen
tar -cvf archive.tar files/ # With hyphen

Common options:

  • c (create): Create archive
  • x (extract): Extract archive
  • t (list): List contents
  • v (verbose): Show process
  • f (file): Specify the archive filename (must come last in a bundled group, since the archive name follows it)
  • z (gzip): Use gzip compression/decompression
  • j (bzip2): Use bzip2 compression/decompression
  • J (xz): Use xz compression/decompression
  • C (directory): Change to the given directory before adding or extracting files

Key tip: f must come last in a bundled option group because the archive filename follows it immediately. The other options can appear in any order.

Practical Techniques: Incremental Backup and Exclusions#

Excluding Specific Files#

# Exclude node_modules and .git directories
tar -czvf project.tar.gz \
  --exclude='node_modules' \
  --exclude='.git' \
  project/

# Read exclusion list from file
tar -czvf project.tar.gz -X exclude.txt project/

exclude.txt contents:

node_modules
.git
*.log
.DS_Store
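In Python, the same exclusion logic can be sketched with tarfile's filter parameter (the pattern list mirrors exclude.txt above; the function name and directory are illustrative):

```python
import fnmatch
import os
import tarfile

EXCLUDE_PATTERNS = ['node_modules', '.git', '*.log', '.DS_Store']

def exclude_filter(tarinfo):
    """Return None to skip an entry, mirroring tar's --exclude."""
    base = os.path.basename(tarinfo.name)
    for pattern in EXCLUDE_PATTERNS:
        if fnmatch.fnmatch(base, pattern):
            return None
    return tarinfo

# Usage sketch (assumes a 'project/' directory exists):
# with tarfile.open('project.tar.gz', 'w:gz') as t:
#     t.add('project', filter=exclude_filter)
```

Returning None from the filter drops the entry and, for directories, everything beneath it.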

Incremental Backups#

GNU tar supports simple time-based incremental backups with -N (--newer), which includes only files newer than a given date. (True snapshot-based incrementals use --listed-incremental; -N is just a date cutoff.)

# Full backup
tar -czvf full-backup.tar.gz /data

# Incremental backup (today's modifications only)
tar -czvf incremental-$(date +%Y%m%d).tar.gz \
  -N "today" \
  /data

# Backup files modified in last 7 days
tar -czvf week-changes.tar.gz \
  -N "$(date -d '7 days ago' +%Y-%m-%d)" \
  /data
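A rough Python equivalent of the -N date filter (a sketch of the date-cutoff idea, not GNU tar's snapshot-based --listed-incremental; the function name is made up here):

```python
import os
import tarfile
import time

def backup_newer_than(src_dir, output, days):
    """Archive only files modified within the last `days` days,
    mirroring tar's -N date filter."""
    cutoff = time.time() - days * 86400
    with tarfile.open(output, 'w:gz') as t:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) >= cutoff:
                    t.add(path)

# backup_newer_than('/data', 'week-changes.tar.gz', days=7)
```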

Compression Algorithm Comparison#

Algorithm | Flag | Compression Ratio | Speed   | Use Case
----------|------|-------------------|---------|-----------------------------------
gzip      | -z   | Medium            | Fast    | Daily use, network transfer
bzip2     | -j   | High              | Slow    | Long-term archiving, space saving
xz        | -J   | Highest           | Slowest | Large file archiving, release packages

Real-world test (100MB project code):

time tar -czvf test.tar.gz project/    # 2.3s, 18MB
time tar -cjvf test.tar.bz2 project/   # 8.1s, 14MB
time tar -cJvf test.tar.xz project/    # 25s,  11MB

xz has the best ratio but is slowest—ideal for one-time compression of long-term archives.
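The ordering is easy to reproduce with the corresponding Python compression modules (zlib implements the gzip DEFLATE algorithm); exact sizes and timings are machine- and data-dependent:

```python
import bz2
import lzma
import time
import zlib

# Rough comparison on compressible sample data
data = (b'def handler(request):\n    return "ok"\n') * 50000

for name, compress in [('gzip (zlib)', zlib.compress),
                       ('bzip2', bz2.compress),
                       ('xz (lzma)', lzma.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f'{name}: {len(out)} bytes in {elapsed:.3f}s')
```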

Extraction Tips and Common Issues#

Extract to Specific Directory#

# Extract to /opt
tar -xzvf archive.tar.gz -C /opt

# Extract single file
tar -xzvf archive.tar.gz path/to/file.txt

List Archive Contents (Without Extracting)#

tar -tzvf archive.tar.gz

Output format:

-rw-r--r-- user/group 1234 2026-05-08 10:30 file1.txt
drwxr-xr-x user/group 0    2026-05-08 10:30 directory/
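Programmatically, the tarfile module exposes the same listing without extracting anything (in-memory example; the filename and payload are illustrative):

```python
import io
import tarfile

# Build a small gzip-compressed archive in memory
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as t:
    info = tarfile.TarInfo('file1.txt')
    payload = b'hello\n'
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))

# Equivalent of `tar -tzvf`: list members without extraction
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:gz') as t:
    for m in t.getmembers():
        print(m.name, m.size, m.mtime)
```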

Absolute Path Pitfall#

tar strips leading / by default to prevent overwriting system files:

tar -czvf backup.tar.gz /home/user/project/
# Internal path is home/user/project/, not /home/user/project/

To preserve absolute paths (dangerous):

tar -czvf backup.tar.gz -P /home/user/project/

Filename Encoding Issues#

tar stores filenames as raw bytes with no encoding metadata, so archives created under a different locale (for example, GBK-encoded names packed on Windows) can extract as mojibake on a UTF-8 system. GNU tar has no option to convert filename encodings; a common workaround is to fix the names after extraction with convmv (directory name below is illustrative):

# Convert GBK-encoded filenames to UTF-8 in place
convmv -f gbk -t utf-8 --notest -r extracted-dir/

Performance: Handling Large Files#

For huge files, use pipes to avoid temporary files:

# Direct transfer via SSH
tar -czvf - /large/directory | ssh user@server "cat > backup.tar.gz"

# Split into multiple parts
tar -czvf - /large/directory | split -b 1G - backup.tar.gz.part

Reassemble and extract:

cat backup.tar.gz.part* | tar -xzvf -

Web Implementation: Browser-side tar Parsing#

JavaScript can parse tar files in the browser:

async function parseTar(buffer) {
  let offset = 0
  const files = []

  while (offset < buffer.byteLength - 1024) {
    // Read filename
    const name = new TextDecoder().decode(
      new Uint8Array(buffer, offset, 100)
    ).replace(/\0/g, '')

    if (!name) break // Empty block, end

    // Read file size (octal)
    const sizeStr = new TextDecoder().decode(
      new Uint8Array(buffer, offset + 124, 11)
    ).trim()
    const size = parseInt(sizeStr, 8)

    // Extract content
    const content = new Uint8Array(buffer, offset + 512, size)

    files.push({ name, size, content })

    // Move to next file (512-byte aligned)
    offset += 512 + Math.ceil(size / 512) * 512
  }

  return files
}

// Usage
const response = await fetch('archive.tar')
const buffer = await response.arrayBuffer()
const files = await parseTar(buffer)
console.log(files.map(f => f.name))

Practical Script: Quick Project Backup#

#!/bin/bash
# project-backup.sh

PROJECT_NAME="my-project"
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d-%H%M%S)

tar -czvf "${BACKUP_DIR}/${PROJECT_NAME}-${DATE}.tar.gz" \
  --exclude='node_modules' \
  --exclude='.next' \
  --exclude='dist' \
  --exclude='*.log' \
  -C /home/user/projects "${PROJECT_NAME}"

# Keep only the last 7 days of backups
find "${BACKUP_DIR}" -name "${PROJECT_NAME}-*.tar.gz" -mtime +7 -delete

echo "Backup created: ${PROJECT_NAME}-${DATE}.tar.gz"

Summary#

tar embodies the Unix philosophy: do one thing well, and cooperate with other tools. It focuses on archiving, leaving compression to gzip/bzip2/xz, transfer to ssh, splitting to split.

Key takeaways:

  1. Understand archiving vs compression distinction
  2. Remember f must come last
  3. Use --exclude for selective backups
  4. Choose compression algorithm based on use case

Want more tar options? Check the docs: Linux tar Command Guide

