Text Deduplication Algorithms: From Set to Efficient Implementation#

Duplicate data is everywhere. Log analysis, user list cleaning, keyword extraction… the scenarios are endless. Let’s dive into several deduplication approaches and their performance characteristics.

The Simplest Solution: Set Deduplication#

JavaScript’s Set is built for uniqueness:

function dedupeLines(text) {
  const lines = text.split('\n')
  return [...new Set(lines)].join('\n')
}

One line. Done. But there are caveats:

  1. Order: contrary to a common myth, Set does preserve insertion order — the ECMAScript spec guarantees it
  2. Case-sensitive: "Apple" and "apple" are treated as different items
  3. Empty lines: blank lines are deduplicated like anything else, so runs of blank lines collapse into one
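
A quick check of caveats 2 and 3, with the function above restated so the snippet runs standalone:

```typescript
function dedupeLines(text: string): string {
  return [...new Set(text.split('\n'))].join('\n')
}

// Caveat 2: comparison is case-sensitive, so both spellings survive
console.log(dedupeLines('Apple\napple'))  // "Apple\napple"

// Caveat 3: repeated blank lines collapse into a single one
console.log(dedupeLines('a\n\n\nb'))      // "a\n\nb"
```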

Order-Preserving Deduplication#

Order is already safe, but if you also want case-insensitive matching, blank-line filtering, and control over which spelling survives, combine a Set with an explicit loop:

function dedupePreserveOrder(text: string, caseSensitive = true): string {
  const lines = text.split('\n').filter(line => line.trim() !== '')
  const seen = new Set<string>()
  const result: string[] = []
  
  for (const line of lines) {
    const key = caseSensitive ? line : line.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      result.push(line)  // Preserve original case
    }
  }
  
  return result.join('\n')
}

The key insight: separate the key for comparison from the original value stored in result.
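
For example (the function is reproduced here so the snippet runs standalone):

```typescript
function dedupePreserveOrder(text: string, caseSensitive = true): string {
  const lines = text.split('\n').filter(line => line.trim() !== '')
  const seen = new Set<string>()
  const result: string[] = []

  for (const line of lines) {
    const key = caseSensitive ? line : line.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      result.push(line)  // preserve the original casing
    }
  }

  return result.join('\n')
}

console.log(dedupePreserveOrder('Apple\napple\nBanana', false))  // "Apple\nBanana"
console.log(dedupePreserveOrder('Apple\napple\nBanana'))         // "Apple\napple\nBanana"
```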

Three Granularities: Lines, Words, Characters#

Real-world needs vary:

Line-Level Deduplication#

case 'lines': {
  const lines = input.split('\n').filter(line => line.trim() !== '')
  const seenLines = new Set<string>()
  lines.forEach(line => {
    const key = caseSensitive ? line : line.toLowerCase()
    if (!seenLines.has(key)) {
      seenLines.add(key)
      uniqueItems.push(line)
    }
  })
  break
}

Word-Level Deduplication#

case 'words': {
  const words = input.split(/\s+/).filter(word => word !== '')
  const seenWords = new Set<string>()
  words.forEach(word => {
    const key = caseSensitive ? word : word.toLowerCase()
    if (!seenWords.has(key)) {
      seenWords.add(key)
      uniqueItems.push(word)
    }
  })
  break
}

The regex /\s+/ matches all whitespace: spaces, tabs, newlines.
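
One subtlety: leading or trailing whitespace yields empty strings at the edges of the split, which is exactly what the filter removes:

```typescript
const parts = '  foo\tbar\nbaz  '.split(/\s+/)
console.log(parts)  // [ '', 'foo', 'bar', 'baz', '' ]

const words = parts.filter(word => word !== '')
console.log(words)  // [ 'foo', 'bar', 'baz' ]
```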

Character-Level Deduplication#

case 'chars': {
  const chars = [...input]  // spread iterates code points; split('') would break emoji into surrogate halves
  const seenChars = new Set<string>()
  chars.forEach(char => {
    const key = caseSensitive ? char : char.toLowerCase()
    if (!seenChars.has(key)) {
      seenChars.add(key)
      uniqueItems.push(char)
    }
  })
  break
}
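
One caveat worth knowing here: String.prototype.split('') operates on UTF-16 code units, so anything outside the Basic Multilingual Plane (emoji, rare CJK) gets chopped into surrogate halves. Spreading the string iterates code points instead:

```typescript
const broken = '💡ab'.split('')  // surrogate halves: ['\ud83d', '\udca1', 'a', 'b']
const intact = [...'💡ab']       // code points: ['💡', 'a', 'b']

console.log(broken.length)  // 4
console.log(intact.length)  // 3
```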

Character deduplication is useful for analyzing character sets in text or simplifying strings.
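
The three case blocks above lean on uniqueItems and caseSensitive from their surrounding scope. As a self-contained sketch (names here are illustrative, not the tool's actual API), they combine into one function:

```typescript
type Granularity = 'lines' | 'words' | 'chars'

function dedupe(input: string, granularity: Granularity, caseSensitive = true): string[] {
  const items =
    granularity === 'lines' ? input.split('\n').filter(l => l.trim() !== '')
    : granularity === 'words' ? input.split(/\s+/).filter(w => w !== '')
    : [...input]  // code points, so emoji stay intact

  const seen = new Set<string>()
  const uniqueItems: string[] = []
  for (const item of items) {
    const key = caseSensitive ? item : item.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      uniqueItems.push(item)
    }
  }
  return uniqueItems
}

console.log(dedupe('Apple apple Banana', 'words', false))  // [ 'Apple', 'Banana' ]
console.log(dedupe('aab', 'chars'))                        // [ 'a', 'b' ]
```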

Performance Comparison: Set vs Object vs Array#

How do these approaches compare?

// Test data: 1 million random strings
const data = Array.from({ length: 1000000 }, 
  () => Math.random().toString(36).substring(7)
)

// Approach 1: Set
console.time('Set')
const result1 = [...new Set(data)]
console.timeEnd('Set')  // ~45ms

// Approach 2: Object
console.time('Object')
const obj = {}
data.forEach(item => obj[item] = true)
const result2 = Object.keys(obj)
console.timeEnd('Object')  // ~120ms

// Approach 3: Array.includes (anti-pattern)
console.time('Array')
const result3 = []
data.forEach(item => {
  if (!result3.includes(item)) {
    result3.push(item)
  }
})
console.timeEnd('Array')  // Didn't finish. Waited 5 minutes and gave up

Set wins. It’s designed for uniqueness and backed by a hash table with average O(1) lookup. The Object approach is slower here: plain objects aren’t optimized for use as dynamic hash maps (engines may fall back to dictionary mode), and Object.keys allocates a fresh array at the end. Array.includes is O(n²) — it explodes on large datasets.
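
The quadratic cost can be made visible without a stopwatch by counting comparisons — a deterministic sketch of the work Array.includes does internally:

```typescript
function includesDedupe(data: string[]): { unique: string[]; comparisons: number } {
  const unique: string[] = []
  let comparisons = 0
  for (const item of data) {
    let found = false
    for (const existing of unique) {
      comparisons++                    // one equality check per scanned element
      if (existing === item) { found = true; break }
    }
    if (!found) unique.push(item)
  }
  return { unique, comparisons }
}

// 1,000 distinct items: every item scans the whole unique list → n(n−1)/2 checks
const distinct = Array.from({ length: 1000 }, (_, i) => `item${i}`)
console.log(includesDedupe(distinct).comparisons)  // 499500

// A Set does 1,000 average-O(1) lookups instead.
```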

Case-Insensitive Implementation#

The key is the key variable:

const key = caseSensitive ? line : line.toLowerCase()

if (!seen.has(key)) {
  seen.add(key)
  result.push(line)  // Note: push original value, not key
}

Why separate them? Users usually want the original casing preserved in the output while comparing in lowercase.

Example:

Input:
Apple
apple
APPLE
Banana

Case-insensitive deduplication:
Apple
Banana

The first occurrence Apple is preserved; subsequent apple and APPLE are considered duplicates.

Edge Cases#

Empty Line Filtering#

const lines = text.split('\n').filter(line => line.trim() !== '')

Empty lines are usually noise in data cleaning—filter them out.

Unicode Characters#

toLowerCase() handles Unicode well:

'Café'.toLowerCase()  // 'café'
'MÜNCHEN'.toLowerCase()  // 'münchen'

But some languages need toLocaleLowerCase():

'İstanbul'.toLowerCase()  // 'i̇stanbul' (incorrect)
'İstanbul'.toLocaleLowerCase('tr')  // 'istanbul' (correct)

Statistics#

Give users feedback after deduplication:

setStats({
  original: originalCount,   // Original count
  unique: uniqueItems.length, // Unique count
  removed: originalCount - uniqueItems.length  // Removed count
})

Users can see the deduplication effect at a glance.
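
setStats above is the tool’s React state setter; a framework-free sketch of computing the same numbers (a hypothetical helper, not the tool’s source):

```typescript
function dedupeStats(input: string): { original: number; unique: number; removed: number } {
  const lines = input.split('\n').filter(line => line.trim() !== '')
  const unique = new Set(lines).size
  return { original: lines.length, unique, removed: lines.length - unique }
}

console.log(dedupeStats('a\nb\na\nc'))  // { original: 4, unique: 3, removed: 1 }
```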

Online Tool Implementation#

Based on these principles, I built an online text deduplication tool:

Features:

  • Three granularities: by line, word, character
  • Optional case sensitivity
  • Preserves original order
  • Real-time statistics: original count, unique count, removed count

Try it at: Text Deduplication Tool

The core logic isn’t complex, but attention to detail matters:

  • Edge cases (empty lines, Unicode)
  • User experience (statistics, order preservation)
  • Performance (Set vs alternatives)

Data cleaning is a fundamental skill. Deduplication is one of the most common requirements.
