Linux uniq Command: From Adjacent Deduplication to Count Statistics#

When processing log files or analyzing data, deduplication is a common task. The uniq command does exactly that—but with a catch: it only removes adjacent duplicate lines. Let’s dive deep into how uniq works and its practical applications.

Core Algorithm: Adjacent Line Comparison#

The implementation of uniq is surprisingly simple: it just remembers the previous line and compares each new line against it:

// Simplified uniq core logic (read_line/print_line are stand-in helpers)
char *line, *prev_line = NULL;
while ((line = read_line()) != NULL) {
  if (prev_line == NULL || strcmp(line, prev_line) != 0) {
    print_line(line);  // Output only when it differs from the previous line
  }
  free(prev_line);
  prev_line = strdup(line);  // Remember this line for the next comparison
}

Time complexity is O(n) with a single scan, and space is O(1) since only the previous line is kept. The flip side is that uniq only sees adjacent lines: to fully deduplicate arbitrary input you must sort it first, because removing non-adjacent duplicates cannot be done in O(1) space.
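
A quick way to see the adjacency rule in action (a minimal demo, assuming GNU coreutils):

# Unsorted: only the two adjacent "a" lines collapse
printf 'a\nb\na\na\n' | uniq
# Output: a, b, a

# Sorted first: all duplicates collapse
printf 'a\nb\na\na\n' | sort | uniq
# Output: a, b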

Key Parameters Implementation#

-c Count Occurrences#

int count = 1;
char *prev_line = NULL;
while ((line = read_line()) != NULL) {
  if (prev_line != NULL && strcmp(line, prev_line) == 0) {
    count++;  // Same line, increment count
  } else {
    if (prev_line != NULL) {
      printf("%7d %s\n", count, prev_line);
    }
    count = 1;
    free(prev_line);
    prev_line = strdup(line);
  }
}
// Output last group
if (prev_line != NULL) {
  printf("%7d %s\n", count, prev_line);
}

Output format: the count right-aligned in a 7-character field, followed by a space and the line content.
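
A minimal run showing that format (output as GNU uniq prints it):

printf 'alpha\nalpha\nbeta\n' | uniq -c
#       2 alpha
#       1 beta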

-d Output Only Duplicate Lines#

if (prev_line != NULL && strcmp(line, prev_line) == 0) {
  if (count == 1) {  // Second occurrence, output now
    print_line(line);
  }
  count++;
}
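
A short demo of -d on adjacent duplicates:

# Print one copy of each line that occurs at least twice in a row
printf 'a\na\nb\nc\nc\nc\n' | uniq -d
# Output: a, c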

-u Output Only Unique Lines#

if (prev_line != NULL && strcmp(line, prev_line) == 0) {
  count++;
} else {
  if (prev_line != NULL && count == 1) {
    print_line(prev_line);  // Only output if it appeared exactly once
  }
  count = 1;
}
// (after the loop, the last group is printed the same way if its count is 1)
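
And the mirror-image demo for -u:

# Print only the lines that never repeat
printf 'a\na\nb\nc\nc\n' | uniq -u
# Output: b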

-f N Skip Fields#

Skip the first N fields, compare only the rest:

#include <ctype.h>

char *skip_fields(char *line, int n) {
  while (n > 0) {
    while (*line && isspace((unsigned char)*line)) line++;  // Skip leading blanks
    while (*line && !isspace((unsigned char)*line)) line++; // Skip the field itself
    n--;
  }
  return line;
}

// Compare with fields skipped
if (strcmp(skip_fields(line, n), skip_fields(prev_line, n)) == 0) {
  // Duplicate
}
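
For example, ignoring a leading timestamp field (illustrative data):

# Compare lines with the first field skipped
printf '09:01 login ok\n09:02 login ok\n09:03 logout\n' | uniq -f1
# Output: "09:01 login ok", "09:03 logout"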

-s N Skip Characters#

Skip the first N characters of each line:

// (real uniq clamps N to the line length; this sketch assumes lines have at least N chars)
if (strcmp(line + n, prev_line + n) == 0) {
  // Duplicate
}
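
For example, ignoring a fixed-width ID prefix (illustrative data; "id-01 " is 6 characters):

printf 'id-01 apple\nid-02 apple\nid-03 pear\n' | uniq -s6
# Output: "id-01 apple", "id-03 pear"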

-w N Compare Only First N Characters#

if (strncmp(line, prev_line, n) == 0) {
  // Duplicate
}
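
For example, grouping lines by a 3-character prefix:

printf 'abc-1\nabc-2\nxyz-1\n' | uniq -w3
# Output: abc-1, xyz-1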

Performance Optimization: Memory-Mapped Large Files#

For large files, mapping the whole file with mmap is often faster than reading it line by line:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void uniq_large_file(const char *filename) {
  int fd = open(filename, O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  size_t size = st.st_size;

  // Keep the mapping read-only: compare lines by length + memcmp instead of
  // writing temporary '\0' terminators into the mapped file data.
  char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

  char *prev = NULL;
  size_t prev_len = 0;
  char *line = data;
  char *end = data + size;

  while (line < end) {
    char *next = memchr(line, '\n', end - line);
    if (!next) next = end;
    size_t len = next - line;

    if (prev == NULL || len != prev_len || memcmp(line, prev, len) != 0) {
      fwrite(line, 1, len, stdout);  // Print the line as-is
      fputc('\n', stdout);
    }

    prev = line;        // The previous line stays valid inside the mapping: no strdup needed
    prev_len = len;
    line = next + 1;
  }

  munmap(data, size);
  close(fd);
}

Advantages:

  • Avoid frequent read() system calls
  • Leverage OS page cache
  • Low memory overhead (only pointers to the current and previous lines are kept, no per-line copies)
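
To try it out, add a small main() that calls uniq_large_file(argv[1]); the file and binary names below are placeholders:

# Build and run (uniq_mmap.c / sorted.log are placeholder names)
gcc -O2 -o uniq_mmap uniq_mmap.c
./uniq_mmap sorted.log > /dev/null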

Practical Scenarios#

1. Log Deduplication#

# Sort then dedupe, count IP access frequency
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

2. Find Duplicate Files#

# Dedupe by MD5 hash
find . -type f -exec md5sum {} \; | sort | uniq -w32 -d

-w32 compares only the first 32 characters (the MD5 hash), so the differing filenames are ignored.

3. CSV Column Deduplication#

# Find rows that repeat from column 2 onward
# (uniq splits fields on whitespace, so convert the comma delimiter first)
sort -t, -k2,2 data.csv | tr ',' ' ' | uniq -f1 -d

-f1 skips the first field and compares everything after it. Note that uniq counts whitespace-separated fields, not comma-separated ones, which is why the tr step is needed.

4. Remove Consecutive Blank Lines#

# Collapse each run of blank lines to a single blank line
cat file.txt | uniq

Blank lines are identical to each other, so a run of blanks collapses to one. Keep in mind that any other run of identical lines collapses as well.
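
A quick demonstration:

# Three consecutive blank lines collapse to one
printf 'a\n\n\n\nb\n' | uniq
# Output: a, one blank line, b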

Common Pitfalls#

1. Forgetting to Sort First#

# Wrong: uniq only removes adjacent duplicates
cat data.txt | uniq

# Correct: must sort first
cat data.txt | sort | uniq

2. Confusing -d and -u#

  • -d: Output only duplicate lines (appear 2+ times)
  • -u: Output only unique lines (appear exactly once)
# Find duplicate IPs
awk '{print $1}' access.log | sort | uniq -d

# Find IPs that visited only once
awk '{print $1}' access.log | sort | uniq -u

3. Ignoring Case Sensitivity#

uniq is case-sensitive by default; use the -i flag to compare case-insensitively:

# Hello and hello treated as same
sort data.txt | uniq -i

uniq vs sort -u#

sort -u also deduplicates, but with differences:

Feature                  uniq          sort -u
Time Complexity          O(n)          O(n log n)
Space Complexity         O(1)          O(n)
Requires Sorted Input    Yes           No
Count Occurrences        Yes (-c)      No
Filter Duplicates        Yes (-d/-u)   No

When to use which:

  • Data already sorted: Use uniq (faster)
  • Data unsorted and only need dedupe: Use sort -u (one step)
  • Need count statistics: Use uniq -c
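
As a quick sanity check that the two pipelines agree, compare their outputs (bash process substitution, reusing the access.log example from above):

# sort | uniq and sort -u should produce identical output
diff <(sort access.log | uniq) <(sort -u access.log) && echo identical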

Web Implementation: TypeScript uniq#

function uniq(lines: string[], options: {
  count?: boolean;
  duplicates?: boolean;
  unique?: boolean;
  ignoreCase?: boolean;
  skipFields?: number;
  skipChars?: number;
  compareChars?: number;
} = {}): string[] {
  const result: Array<{ line: string; count: number; originalLine: string }> = []

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i]
    let compareLine = line

    // Skip fields
    if (options.skipFields) {
      const fields = line.trim().split(/\s+/)  // trim so leading blanks don't create an empty first field
      compareLine = fields.slice(options.skipFields).join(' ')
    }

    // Skip characters
    if (options.skipChars) {
      compareLine = compareLine.slice(options.skipChars)
    }

    // Compare only first N characters
    if (options.compareChars) {
      compareLine = compareLine.slice(0, options.compareChars)
    }

    // Ignore case
    if (options.ignoreCase) {
      compareLine = compareLine.toLowerCase()
    }

    const last = result[result.length - 1]
    if (!last || last.line !== compareLine) {
      result.push({ line: compareLine, count: 1, originalLine: line })
    } else {
      last.count++
    }
  }

  // Filter based on options
  return result
    .filter(item => {
      if (options.duplicates) return item.count > 1
      if (options.unique) return item.count === 1
      return true
    })
    .map(item => {
      if (options.count) {
        return `${item.count.toString().padStart(7)} ${item.originalLine || item.line}`
      }
      return item.originalLine || item.line
    })
}

Performance Test#

Processing a 100MB log file:

# Method 1: sort + uniq
time (sort access.log | uniq > /dev/null)
# real: 8.2s

# Method 2: sort -u
time (sort -u access.log > /dev/null)
# real: 7.9s

# Method 3: Already sorted file + uniq
time (uniq sorted.log > /dev/null)
# real: 0.3s

Conclusion: if the data is already sorted, running uniq alone is more than 20x faster than re-sorting with sort -u (0.3s vs 7.9s in this test).

Summary#

The core value of uniq:

  1. Efficient: O(n) time, O(1) space
  2. Flexible: Supports counting, filtering, field skipping
  3. Composable: Works perfectly with sort, awk, and other commands

Next time you need deduplication with statistics, don’t forget this “small but mighty” tool.


Related Tools: Linux sort Command | Linux awk Text Processing