Linux du Command Deep Dive: From Disk Usage Statistics to Directory Space Analysis
Last week, our server triggered a disk space alert. I needed to quickly identify which directories were consuming the most space. I used to rely on ls -lh for a quick glance, but with deep directory structures, that approach is inefficient. This time I dug into the du command and discovered its design is more interesting than I thought.
du’s Core Principle: Recursive Traversal + Block Counting
du counts disk blocks allocated to files, not their logical size. This distinction is subtle but important.
Under the hood, du reads each file’s st_blocks field through the stat() family of system calls. The simplified sketch below uses lstat() so that symbolic links are measured rather than followed:
// Simplified sketch of du's core logic (not the actual GNU implementation)
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
off_t calculate_usage(const char *path) {
    struct stat st;
    if (lstat(path, &st) < 0) return 0;  // unreadable entry: count nothing
    // st_blocks is in 512-byte units; this covers files and the directory itself
    off_t total = (off_t)st.st_blocks * 512;
    // Directory: add the usage of every entry below it
    if (S_ISDIR(st.st_mode)) {
        DIR *dir = opendir(path);
        if (dir == NULL) return total;   // e.g. permission denied
        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {
            if (strcmp(entry->d_name, ".") == 0 ||
                strcmp(entry->d_name, "..") == 0) continue;
            char fullpath[PATH_MAX];
            snprintf(fullpath, sizeof(fullpath), "%s/%s", path, entry->d_name);
            total += calculate_usage(fullpath);
        }
        closedir(dir);
    }
    // Note: unlike real du, this sketch does not deduplicate hard links
    return total;
}
Key details:
- st_blocks is a block count, not bytes: actual usage = st_blocks * 512
- Sparse file handling: a 10GB sparse file might only use a few blocks
- Directory overhead: a directory occupies at least one filesystem block, typically 4KB on ext4
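To make the difference between logical size and allocated size concrete, here is a minimal TypeScript sketch. The `StatLike` shape and both function names are made up for illustration; the only thing carried over from the text is that st_blocks is counted in 512-byte units.

```typescript
// Hypothetical record holding the two stat fields that ls and du care about.
interface StatLike {
  size: number   // st_size: logical length in bytes (what ls -l shows)
  blocks: number // st_blocks: allocated 512-byte units (what du sums)
}

const STAT_BLOCK_BYTES = 512

// Bytes actually allocated on disk for one file.
function allocatedBytes(st: StatLike): number {
  return st.blocks * STAT_BLOCK_BYTES
}

// A file looks sparse when fewer bytes are allocated than its logical length.
function looksSparse(st: StatLike): boolean {
  return allocatedBytes(st) < st.size
}

// A 10 GiB sparse file with only 8 allocated units (4 KiB).
const sparse: StatLike = { size: 10 * 1024 ** 3, blocks: 8 }
// A 100-byte file that still occupies a full 4 KiB filesystem block.
const tiny: StatLike = { size: 100, blocks: 8 }

console.log(allocatedBytes(sparse), looksSparse(sparse)) // 4096 true
console.log(allocatedBytes(tiny), looksSparse(tiny))     // 4096 false
```

Both files cost the same 4 KiB on disk even though their logical sizes differ by eight orders of magnitude, which is exactly the gap between ls and du.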
Common Parameters Explained
1. -h Human-Readable Format
du -h /var/log
# 4.0K /var/log/apt
# 12M /var/log/journal
# 128M /var/log
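Implementation principle: divide by 1024 until the value fits the largest suitable unit. The version below approximates the look of the output above (no suffix for plain bytes, one decimal only for scaled values below 10); GNU du rounds up rather than to nearest, so the last digit may differ:
function humanReadable(bytes: number): string {
  const units = ['', 'K', 'M', 'G', 'T', 'P']
  let value = bytes
  let i = 0
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024
    i++
  }
  // One decimal for small scaled values (4.0K), none otherwise (12M, 512)
  const text = i > 0 && value < 10 ? value.toFixed(1) : String(Math.round(value))
  return `${text}${units[i]}`
}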
2. -s Show Only Total
du -sh /home/user
# 2.5G /home/user
du still walks the whole tree, but prints only the grand total for each argument instead of one line per subdirectory.
3. --max-depth Control Depth
du -h --max-depth=1 /var
# 4.0K /var/tmp
# 12M /var/log
# 128M /var/lib
# 145M /var
This is extremely useful for quickly locating large directories.
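Note that --max-depth limits which lines are printed, not how far du descends: deeper directories are still walked so their usage can be rolled into their ancestors. A rough sketch of that idea, using a hypothetical in-memory `DirNode` tree instead of a real filesystem:

```typescript
// Hypothetical in-memory tree; ownBytes is the space used by the directory's
// own files plus the directory entry itself.
interface DirNode {
  name: string
  ownBytes: number
  children: DirNode[]
}

// Walk the whole tree, but only record directories at depth <= maxDepth.
// Returns the subtree total; out collects [bytes, path] pairs in post-order
// (children before their parent), the same order du prints in.
function walk(node: DirNode, path: string, depth: number, maxDepth: number,
              out: Array<[number, string]>): number {
  let total = node.ownBytes
  for (const child of node.children) {
    total += walk(child, `${path}/${child.name}`, depth + 1, maxDepth, out)
  }
  if (depth <= maxDepth) out.push([total, path])
  return total
}

const tree: DirNode = {
  name: 'var', ownBytes: 4096, children: [
    { name: 'tmp', ownBytes: 4096, children: [] },
    { name: 'log', ownBytes: 8192, children: [
      { name: 'journal', ownBytes: 1048576, children: [] },
    ] },
  ],
}

const lines: Array<[number, string]> = []
walk(tree, '/var', 0, 1, lines)
for (const [bytes, path] of lines) console.log(`${bytes}\t${path}`)
// 4096	/var/tmp
// 1056768	/var/log
// 1064960	/var
```

The journal directory never gets its own line at depth 1, yet its megabyte still shows up in the totals for /var/log and /var.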
4. -a Show All Files
By default du counts every file but prints one line per directory only. -a makes it print a line for every file as well:
du -ah /tmp
# 4.0K /tmp/test.txt
# 8.0K /tmp/data.json
# 12K /tmp
5. --exclude Exclude Specific Files
du -h --exclude="*.log" /var
du -h --exclude="node_modules" /home/user/project
Supports wildcards and can be used multiple times.
Real-World Scenarios and Performance Optimization
Scenario 1: Locate Large Directories
du -h --max-depth=1 /var | sort -hr
# 145M /var
# 128M /var/lib
# 12M /var/log
# 4.0K /var/tmp
sort -hr sorts by human-readable format in reverse, showing the largest directories at a glance.
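For scripts that post-process du -h output, the same ordering can be reproduced by converting each size back to bytes. The sketch below is similar in spirit to sort -hr rather than a reimplementation of it; `parseHumanSize` and `sortLargestFirst` are illustrative names, and the input is assumed to be tab-separated "size, path" lines.

```typescript
// Binary suffixes in increasing order: K = 1024^1, M = 1024^2, and so on.
const SUFFIXES = 'KMGTPE'

// Convert a du -h style size such as "4.0K", "12M", or "0" into bytes.
// Returns NaN for strings that do not look like a size.
function parseHumanSize(text: string): number {
  const match = /^(\d+(?:\.\d+)?)([KMGTPE]?)$/.exec(text.trim())
  if (!match) return NaN
  const value = parseFloat(match[1])
  const exponent = match[2] ? SUFFIXES.indexOf(match[2]) + 1 : 0
  return value * 1024 ** exponent
}

// Sort "size<TAB>path" lines largest first without mutating the input.
function sortLargestFirst(lines: string[]): string[] {
  const sizeOf = (line: string) => parseHumanSize(line.split('\t')[0])
  return [...lines].sort((a, b) => sizeOf(b) - sizeOf(a))
}

const sample = ['4.0K\t/var/tmp', '12M\t/var/log', '128M\t/var/lib', '145M\t/var']
console.log(sortLargestFirst(sample).join('\n'))
```

A plain numeric sort would put 128M ahead of 1.5G, which is why the suffix has to be taken into account.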
Scenario 2: Exclude Specific Directories
du -sh --exclude="node_modules" --exclude=".git" ~/projects/*
Exclude dependencies and version control directories when analyzing project directories.
Scenario 3: Find Large Files
Combine with find:
find /var -type f -size +100M -exec du -h {} \;
Or use du directly:
du -ah /var | grep -E "^[0-9.]+G"
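The grep above only matches sizes printed with a G suffix, so a 500M file slips through. One alternative is to have du print plain numbers (du -ak reports KiB) and filter numerically. A small sketch of that post-processing step, assuming the usual "KiB, tab, path" line layout; the names `parseDuOutput` and `largerThan` are hypothetical:

```typescript
interface UsageLine { kib: number; path: string }

// Parse text in the shape produced by `du -ak`: "<KiB><TAB><path>" per line.
function parseDuOutput(text: string): UsageLine[] {
  const entries: UsageLine[] = []
  for (const line of text.split('\n')) {
    const tab = line.indexOf('\t')
    if (tab <= 0) continue // skip blank or malformed lines
    const kib = Number(line.slice(0, tab))
    if (isNaN(kib)) continue
    entries.push({ kib, path: line.slice(tab + 1) })
  }
  return entries
}

// Keep entries of at least minKib, largest first.
function largerThan(entries: UsageLine[], minKib: number): UsageLine[] {
  return entries.filter(e => e.kib >= minKib).sort((a, b) => b.kib - a.kib)
}

const duText = '204800\t/var/lib/big.img\n12\t/var/log/syslog\n512000\t/var/lib\n'
for (const e of largerThan(parseDuOutput(duText), 100 * 1024)) {
  console.log(`${e.kib}\t${e.path}`)
}
```

As with du -a itself, directories appear alongside files, so /var/lib shows up next to the image it contains.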
Scenario 4: Monitor Directory Growth
watch -n 60 "du -sh /var/log"
Check /var/log space usage every minute, useful for monitoring log growth.
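Two readings taken some time apart are enough for a rough growth estimate. The sketch below assumes the byte counts come from something like du -s --block-size=1 and that growth is roughly linear; the `UsageSample` type and both helpers are illustrative only.

```typescript
// One measurement of a directory's disk usage at a known time.
interface UsageSample { epochSeconds: number; bytes: number }

// Average growth in bytes per hour between two samples.
function growthPerHour(earlier: UsageSample, later: UsageSample): number {
  const hours = (later.epochSeconds - earlier.epochSeconds) / 3600
  if (hours <= 0) throw new Error('samples must be in chronological order')
  return (later.bytes - earlier.bytes) / hours
}

// Hours until freeBytes is used up at the given rate; Infinity if not growing.
function hoursUntilFull(freeBytes: number, bytesPerHour: number): number {
  return bytesPerHour > 0 ? freeBytes / bytesPerHour : Infinity
}

const first: UsageSample = { epochSeconds: 0, bytes: 1000 }
const second: UsageSample = { epochSeconds: 7200, bytes: 5000 }
const rate = growthPerHour(first, second)
console.log(rate, hoursUntilFull(10000, rate)) // 2000 5
```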
Performance Considerations
1. Large Directory Traversal Optimization
du traverses the entire directory tree, which is slow for directories with millions of files. Optimization approaches:
- --one-file-system (-x): Don’t cross filesystem boundaries; skip other mounted filesystems
- --threshold=SIZE: Only print entries of at least the given size (this trims the output; the tree is still fully traversed)
- Parallel processing: Use GNU parallel for subdirectories
# Parallel subdirectory statistics (-mindepth 1 keeps /var itself out of the list)
find /var -mindepth 1 -maxdepth 1 -type d | parallel du -sh
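The threshold option mentioned above is a filter on printed entries. In GNU du, a positive SIZE hides entries smaller than SIZE, and a negative SIZE hides entries larger than its absolute value. A tiny sketch of that rule, with `passesThreshold` as a made-up helper name:

```typescript
// Mirror of the documented --threshold rule: positive keeps entries of at
// least that size, negative keeps entries of at most the absolute value.
function passesThreshold(bytes: number, threshold: number): boolean {
  return threshold >= 0 ? bytes >= threshold : bytes <= -threshold
}

const sizes = [512, 2048, 1048576]
console.log(sizes.filter(s => passesThreshold(s, 1024)))  // [ 2048, 1048576 ]
console.log(sizes.filter(s => passesThreshold(s, -1024))) // [ 512 ]
```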
2. Hard Link Handling
GNU du counts a hard-linked file only once per run by default. Use -l (--count-links) to count it every time it is encountered:
du -lh /path/to/directory
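Counting an inode once comes down to remembering which (device, inode) pairs have already been seen. A minimal sketch over hypothetical in-memory records; the `FileEntry` shape and the `countLinks` flag are assumptions for illustration, not du internals:

```typescript
// Hypothetical per-path record; dev and ino together identify an inode.
interface FileEntry {
  path: string
  dev: number
  ino: number
  nlink: number  // number of hard links pointing at the inode
  blocks: number // st_blocks, in 512-byte units
}

// Sum allocated bytes. With countLinks = false each inode is counted once;
// with countLinks = true every path is counted, even if it shares an inode.
function totalBytes(entries: FileEntry[], countLinks: boolean): number {
  const seen = new Set<string>()
  let total = 0
  for (const e of entries) {
    if (!countLinks && e.nlink > 1) {
      const key = `${e.dev}:${e.ino}`
      if (seen.has(key)) continue // this inode was already counted
      seen.add(key)
    }
    total += e.blocks * 512
  }
  return total
}

const entries: FileEntry[] = [
  { path: 'a', dev: 1, ino: 100, nlink: 2, blocks: 8 },
  { path: 'b', dev: 1, ino: 100, nlink: 2, blocks: 8 }, // hard link to a
  { path: 'c', dev: 1, ino: 200, nlink: 1, blocks: 16 },
]
console.log(totalBytes(entries, false), totalBytes(entries, true)) // 12288 16384
```

Only files with a link count above 1 need to be tracked, which keeps the lookup set small on typical trees.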
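3. Cache Issues
Running du over a large tree pulls a lot of filesystem metadata into the kernel's caches. The nocache wrapper is sometimes suggested for this, though it targets the page cache for file contents, which du does not read, so its benefit here may be limited: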
nocache du -sh /large/directory
Difference from df
Many people confuse du and df:
- df: Filesystem-level space statistics, reads superblock
- du: Directory/file-level space statistics, traverses directory tree
Typical discrepancy scenario:
df -h /var
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 50G 45G 5.0G 90% /
du -sh /var
# 35G /var
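Why does df show 45G but du only 35G? Part of the gap is scope: df reports the whole filesystem mounted at /, while du only walked /var, so the fair comparison is du -shx /. Even then, df can come out higher:
- Deleted but still open files: space not freed until processes release them
- Metadata overhead: inode tables, journal, etc.
- Reserved space: ext4 reserves 5% for root by default; this lowers Avail rather than raising Used, so it explains why Used + Avail falls short of Size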
Find deleted but still open files:
lsof +L1 | grep deleted
Web Implementation: Browser-Side Directory Statistics
Using the File System Access API, you can implement similar functionality in browsers:
async function calculateDirectorySize(dirHandle: FileSystemDirectoryHandle): Promise<number> {
  let totalSize = 0
  for await (const entry of dirHandle.values()) {
    if (entry.kind === 'file') {
      const file = await entry.getFile()
      totalSize += file.size
    } else if (entry.kind === 'directory') {
      // entry is already a directory handle, so recurse into it directly
      totalSize += await calculateDirectorySize(entry)
    }
  }
  return totalSize
}
// Usage example (humanReadable is the helper from the -h section)
const dirHandle = await window.showDirectoryPicker()
const size = await calculateDirectorySize(dirHandle)
console.log(`Total size: ${humanReadable(size)}`)
Note: Browser API counts file size, not disk blocks, differing from native du.
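If a du-like figure is wanted anyway, one rough approximation is to round each file size up to a whole number of blocks. The sketch below assumes a 4096-byte block size, which the browser has no way to verify, and it cannot account for sparse files, compression, or other filesystem-specific behavior; `estimateAllocated` is a hypothetical helper.

```typescript
// Assumed block size; common on Linux filesystems, but only a guess here.
const ASSUMED_BLOCK_BYTES = 4096

// Round a logical file size up to whole blocks.
function estimateAllocated(sizeBytes: number, blockBytes: number = ASSUMED_BLOCK_BYTES): number {
  if (sizeBytes <= 0) return 0
  return Math.ceil(sizeBytes / blockBytes) * blockBytes
}

console.log(estimateAllocated(1))    // 4096
console.log(estimateAllocated(4097)) // 8192
```

Inside the directory walk, adding estimateAllocated(file.size) instead of file.size would give the block-rounded total.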
Common Pitfalls
1. Insufficient Permissions
du skips directories it cannot read, so the totals may be incomplete:
du: cannot read directory '/root': Permission denied
Use sudo for complete statistics.
2. Symbolic Links
By default du doesn’t follow symbolic links. Use -L to follow:
du -Lh /path/to/symlink
3. Sparse File Misunderstanding
# Create sparse file
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=10G
ls -lh sparse.img
# -rw-r--r-- 1 user user 10G May 9 22:00 sparse.img
du -h sparse.img
# 0 sparse.img # Actual usage is 0
ls shows file size, du shows disk usage.
Summary
Although the du command is simple, its design embodies the Unix philosophy: do one thing and do it well. From block counting to recursive traversal, from hard link handling to sparse file support, every detail is carefully designed.
Next time you encounter disk space issues, try these combinations:
# Quickly locate large directories
du -h --max-depth=1 | sort -hr | head -10
# Exclude interference
du -sh --exclude="node_modules" --exclude=".git" *
# Monitor directory growth
watch -n 60 "du -sh /var/log"
Hope this helps you understand the du command better. For detailed parameters, see: Linux du Command Reference