Linux uniq Command: From Adjacent Deduplication to Count Statistics#
When processing log files or analyzing data, deduplication is a common task. The uniq command does exactly that—but with a catch: it only removes adjacent duplicate lines. Let’s dive deep into how uniq works and its practical applications.
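A quick way to see the adjacent-only behavior is to pipe a few toy lines through it (the fruit names are made up purely for illustration):
# Only the adjacent pair of duplicates collapses
printf 'apple\nbanana\napple\napple\n' | uniq
# apple
# banana
# apple
# After sorting, all duplicates become adjacent and the output is fully deduplicated
printf 'apple\nbanana\napple\napple\n' | sort | uniq
# apple
# banana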
Core Algorithm: Adjacent Line Comparison#
The core of uniq is surprisingly simple. It remembers the previous line and compares each new line against it:
// Simplified uniq core logic
char *line, *prev_line = NULL;
while ((line = read_line()) != NULL) {
if (prev_line == NULL || strcmp(line, prev_line) != 0) {
print_line(line); // Output only if different
}
free(prev_line);
prev_line = strdup(line);
}
uniq runs in O(n) time with a single pass and needs only O(1) extra space, because it keeps nothing but the previous line. The flip side is that it only detects adjacent duplicates, which is why fully deduplicating arbitrary input requires sorting first; a one-pass global dedup would need O(n) extra space (a hash set, for example).
How the Key Options Are Implemented#
-c Count Occurrences#
int count = 1;
char *prev_line = NULL;
while ((line = read_line()) != NULL) {
if (prev_line != NULL && strcmp(line, prev_line) == 0) {
count++; // Same line, increment count
} else {
if (prev_line != NULL) {
printf("%7d %s\n", count, prev_line);
}
count = 1;
free(prev_line);
prev_line = strdup(line);
}
}
// Output last group
if (prev_line != NULL) {
printf("%7d %s\n", count, prev_line);
}
Output format: the count right-aligned in a 7-character field, followed by a space and the line content.
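A quick demo on toy input shows the counted groups and the padded format (data made up for illustration):
# Count occurrences after sorting so duplicates are adjacent
printf 'apple\nbanana\napple\napple\n' | sort | uniq -c
#       3 apple
#       1 banana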
-d Output Only Duplicate Lines#
if (prev_line != NULL && strcmp(line, prev_line) == 0) {
if (count == 1) { // Second occurrence, output now
print_line(line);
}
count++;
}
-u Output Only Unique Lines#
if (prev_line != NULL && strcmp(line, prev_line) == 0) {
count++;
} else {
if (prev_line != NULL && count == 1) {
print_line(prev_line); // Only output if appeared once
}
count = 1;
}
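The difference between the two filters is easiest to see side by side on the same sorted toy input:
# -d prints one copy of each line that repeats
printf 'apple\napple\nbanana\ncherry\ncherry\ndate\n' | uniq -d
# apple
# cherry
# -u prints only the lines that never repeat
printf 'apple\napple\nbanana\ncherry\ncherry\ndate\n' | uniq -u
# banana
# date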
-f N Skip Fields#
Skip the first N fields, compare only the rest:
char *skip_fields(char *line, int n) {
while (n > 0) {
while (*line && isspace(*line)) line++; // Skip spaces
while (*line && !isspace(*line)) line++; // Skip field
n--;
}
return line;
}
// Compare with fields skipped
if (strcmp(skip_fields(line, n), skip_fields(prev_line, n)) == 0) {
// Duplicate
}
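On the command line, -f is handy when every line starts with a field you want to ignore, such as a timestamp (the log lines below are invented for illustration):
# Skip the leading timestamp field and dedupe on the message that follows
printf '12:00:01 disk full\n12:00:02 disk full\n12:00:03 link down\n' | uniq -f1
# 12:00:01 disk full
# 12:00:03 link down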
-s N Skip Characters#
Skip the first N characters of each line (a real implementation clamps N to the line length, so the comparison never runs past the terminating '\0'):
if (strcmp(line + n, prev_line + n) == 0) {
// Duplicate
}
-w N Compare Only First N Characters#
if (strncmp(line, prev_line, n) == 0) {
// Duplicate
}
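Both character-based options can be tried directly in the shell (made-up lines again):
# -s5 ignores the first 5 characters (the ID prefix), so the two lines match
printf 'ID001 error\nID002 error\n' | uniq -s5
# ID001 error
# -w4 compares only the first 4 characters
printf 'HTTP/1.0 200\nHTTP/1.1 200\nFTP 226\n' | uniq -w4
# HTTP/1.0 200
# FTP 226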
Performance Optimization: Memory-Mapped Large Files#
For very large files, mapping the whole file with mmap can be noticeably faster than buffered line-by-line reading. The sketch below compares lines by length plus memcmp, so the read-only mapping never needs to be modified:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void uniq_large_file(const char *filename) {
    int fd = open(filename, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    fstat(fd, &st);
    size_t size = st.st_size;

    /* Map the whole file read-only; the kernel pages it in on demand */
    char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return; }

    const char *prev = NULL;   /* previous line, points into the mapping */
    size_t prev_len = 0;
    const char *line = data;
    const char *end = data + size;

    while (line < end) {
        const char *next = memchr(line, '\n', end - line);
        if (!next) next = end;               /* last line without a trailing newline */
        size_t len = (size_t)(next - line);

        /* Output only if this line differs from the previous one */
        if (prev == NULL || len != prev_len || memcmp(line, prev, len) != 0) {
            fwrite(line, 1, len, stdout);
            fputc('\n', stdout);
        }

        prev = line;       /* no copying needed: the mapped data stays valid */
        prev_len = len;
        line = next + 1;
    }

    munmap(data, size);
    close(fd);
}

int main(int argc, char **argv) {
    if (argc > 1) uniq_large_file(argv[1]);
    return 0;
}
Advantages:
- Avoids frequent read() system calls
- Leverages the OS page cache
- Minimal bookkeeping: only pointers to the current and previous lines are kept
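To try the sketch above, compile it and point it at an already sorted file (uniq_mmap.c is a placeholder name for the code shown):
# Build and run the mmap-based variant on a pre-sorted file
cc -O2 -o uniq_mmap uniq_mmap.c
./uniq_mmap sorted.log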
Practical Scenarios#
1. Log Deduplication#
# Sort then dedupe, count IP access frequency
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
2. Find Duplicate Files#
# Dedupe by MD5 hash
find . -type f -exec md5sum {} \; | sort | uniq -w32 -d
-w32 compares only the first 32 characters of each line, i.e. the MD5 hash, ignoring the filenames.
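This works because of md5sum's fixed output layout: the 32-character hash always comes first, so it alone decides equality. For example:
# md5sum prints "<32-char hash>  <path>"
md5sum /dev/null
# d41d8cd98f00b204e9800998ecf8427e  /dev/null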
3. CSV Column Deduplication#
# Find rows duplicated in column 2 (uniq fields are blank-separated, so convert the commas first)
tr ',' ' ' < data.csv | sort -k2,2 | uniq -f1 -d
-f1 skips the first field and compares from field 2 onward. uniq has no delimiter option; it only understands blank-separated fields, so the commas are translated to spaces here (which also changes the delimiter in the output).
4. Remove Consecutive Blank Lines#
# Keep at most one blank line
cat file.txt | uniq
Blank lines are “identical lines”, so multiple blanks collapse to one.
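A tiny check with cat -A (which marks each line end with $) confirms that a run of blank lines collapses to a single one (toy input):
printf 'a\n\n\n\nb\n' | uniq | cat -A
# a$
# $
# b$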
Common Pitfalls#
1. Forgetting to Sort First#
# Wrong: uniq only removes adjacent duplicates
cat data.txt | uniq
# Correct: must sort first
cat data.txt | sort | uniq
2. Confusing -d and -u#
- -d: output only duplicate lines (lines that appear 2 or more times, each printed once)
- -u: output only unique lines (lines that appear exactly once)
# Find duplicate IPs
awk '{print $1}' access.log | sort | uniq -d
# Find IPs that visited only once
awk '{print $1}' access.log | sort | uniq -u
3. Ignoring Case Sensitivity#
uniq is case-sensitive by default; use the -i flag to ignore case:
# "Hello" and "hello" are treated as the same line
sort data.txt | uniq -i
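A one-liner makes the effect visible (toy input):
# Adjacent lines that differ only in case collapse into one
printf 'Hello\nhello\nHELLO\n' | uniq -i
# Hello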
uniq vs sort -u#
sort -u also deduplicates, but with differences:
| Feature | uniq | sort -u |
|---|---|---|
| Time Complexity | O(n) | O(n log n) |
| Space Complexity | O(1) | O(n) |
| Requires Sorted Input | Yes | No |
| Count Occurrences | Yes (-c) | No |
| Filter Duplicates | Yes (-d/-u) | No |
When to use which:
- Data already sorted: use uniq (faster)
- Data unsorted and you only need deduplication: use sort -u (one step)
- Need occurrence counts: use uniq -c
Web Implementation: uniq in TypeScript#
function uniq(lines: string[], options: {
count?: boolean;
duplicates?: boolean;
unique?: boolean;
ignoreCase?: boolean;
skipFields?: number;
skipChars?: number;
compareChars?: number;
} = {}): string[] {
const result: Array<{ line: string; count: number; originalLine: string }> = []
for (let i = 0; i < lines.length; i++) {
let line = lines[i]
let compareLine = line
// Skip fields
if (options.skipFields) {
const fields = line.split(/\s+/)
compareLine = fields.slice(options.skipFields).join(' ')
}
// Skip characters
if (options.skipChars) {
compareLine = compareLine.slice(options.skipChars)
}
// Compare only first N characters
if (options.compareChars) {
compareLine = compareLine.slice(0, options.compareChars)
}
// Ignore case
if (options.ignoreCase) {
compareLine = compareLine.toLowerCase()
}
const last = result[result.length - 1]
if (!last || last.line !== compareLine) {
result.push({ line: compareLine, count: 1, originalLine: line })
} else {
last.count++
}
}
// Filter based on options
return result
.filter(item => {
if (options.duplicates) return item.count > 1
if (options.unique) return item.count === 1
return true
})
.map(item => {
if (options.count) {
return `${item.count.toString().padStart(7)} ${item.originalLine || item.line}`
}
return item.originalLine || item.line
})
}
Performance Test#
Processing a 100MB log file:
# Method 1: sort + uniq
time (sort access.log | uniq > /dev/null)
# real: 8.2s
# Method 2: sort -u
time (sort -u access.log > /dev/null)
# real: 7.9s
# Method 3: Already sorted file + uniq
time (uniq sorted.log > /dev/null)
# real: 0.3s
Conclusion: if the data is already sorted, plain uniq is roughly 26x faster here than running sort -u on the unsorted file.
Summary#
The core value of uniq:
- Efficient: O(n) time, O(1) space
- Flexible: Supports counting, filtering, field skipping
- Composable: works well with sort, awk, and other commands
Next time you need deduplication with statistics, don’t forget this “small but mighty” tool.
Related Tools: Linux sort Command | Linux awk Text Processing