Text Deduplication Algorithms: From Set to Efficient Implementation
Duplicate data is everywhere. Log analysis, user list cleaning, keyword extraction… the scenarios are endless. Let’s dive into several deduplication approaches and their performance characteristics.
The Simplest Solution: Set Deduplication
JavaScript’s Set is built for uniqueness:
function dedupeLines(text) {
  const lines = text.split('\n')
  return [...new Set(lines)].join('\n')
}
One line. Done. But there are caveats:
- Case-sensitive: "Apple" and "apple" are treated as different items
- Empty lines: blank lines get deduplicated too, so multiple blanks collapse into one
- No key control: there is no way to compare by a normalized key (e.g. lowercase) while keeping the original values

A common worry, order, is actually a non-issue: the ES2015 spec guarantees that Set preserves insertion order, so the one-liner keeps the first occurrence of each line in place.
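To make the caveats concrete, here is the one-liner (typed, as a sketch) exercised on small inputs:

```typescript
// Minimal Set-based line dedupe (same as the one-liner above, with types).
function dedupeLines(text: string): string {
  return [...new Set(text.split('\n'))].join('\n')
}

dedupeLines('Apple\napple\nApple') // → 'Apple\napple' (case matters)
dedupeLines('a\n\n\nb')            // → 'a\n\nb' (blanks collapse to a single blank)
```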
Order-Preserving Deduplication
To keep the original order while adding case-insensitive comparison and empty-line filtering, combine a Set with an explicit loop:
function dedupePreserveOrder(text: string, caseSensitive = true): string {
  const lines = text.split('\n').filter(line => line.trim() !== '')
  const seen = new Set<string>()
  const result: string[] = []
  for (const line of lines) {
    const key = caseSensitive ? line : line.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      result.push(line) // Preserve original case
    }
  }
  return result.join('\n')
}
The key insight: separate the key for comparison from the original value stored in result.
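That split generalizes to any normalization, not just lowercasing. A sketch with a pluggable key function (dedupeBy is a hypothetical helper, not part of the code above):

```typescript
// Generic order-preserving dedupe: compare by a caller-supplied key,
// but store the original value.
function dedupeBy<T>(items: T[], keyFn: (item: T) => string): T[] {
  const seen = new Set<string>()
  const result: T[] = []
  for (const item of items) {
    const key = keyFn(item)
    if (!seen.has(key)) {
      seen.add(key)
      result.push(item) // keep the original value, compare by key
    }
  }
  return result
}

// Case-insensitive and whitespace-insensitive line dedupe:
dedupeBy(['Apple', ' apple ', 'Banana'], l => l.trim().toLowerCase())
// → ['Apple', 'Banana']
```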
Three Granularities: Lines, Words, Characters
Real-world needs vary:
Line-Level Deduplication
case 'lines': {
  const lines = input.split('\n').filter(line => line.trim() !== '')
  const seenLines = new Set<string>()
  lines.forEach(line => {
    const key = caseSensitive ? line : line.toLowerCase()
    if (!seenLines.has(key)) {
      seenLines.add(key)
      uniqueItems.push(line)
    }
  })
  break
}
Word-Level Deduplication
case 'words': {
  const words = input.split(/\s+/).filter(word => word !== '')
  const seenWords = new Set<string>()
  words.forEach(word => {
    const key = caseSensitive ? word : word.toLowerCase()
    if (!seenWords.has(key)) {
      seenWords.add(key)
      uniqueItems.push(word)
    }
  })
  break
}
The regex /\s+/ matches all whitespace: spaces, tabs, newlines.
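Wrapped as a standalone function, the same case body might look like this (dedupeWords is my naming, not from the tool):

```typescript
// Order-preserving word dedupe; splits on any run of whitespace
// (spaces, tabs, newlines) and drops empty tokens.
function dedupeWords(input: string, caseSensitive = true): string[] {
  const words = input.split(/\s+/).filter(word => word !== '')
  const seen = new Set<string>()
  const unique: string[] = []
  for (const word of words) {
    const key = caseSensitive ? word : word.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      unique.push(word)
    }
  }
  return unique
}

dedupeWords('the The quick\tthe quick', false) // → ['the', 'quick']
```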
Character-Level Deduplication
case 'chars': {
  const chars = Array.from(input) // iterates code points; split('') would tear surrogate pairs (e.g. emoji)
  const seenChars = new Set<string>()
  chars.forEach(char => {
    const key = caseSensitive ? char : char.toLowerCase()
    if (!seenChars.has(key)) {
      seenChars.add(key)
      uniqueItems.push(char)
    }
  })
  break
}
Character deduplication is useful for analyzing character sets in text or simplifying strings.
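The three case bodies can be folded into one self-contained function. A sketch (dedupe and the Granularity type are names I am assuming, not the tool's actual API):

```typescript
type Granularity = 'lines' | 'words' | 'chars'

// Order-preserving dedupe across all three granularities.
function dedupe(input: string, granularity: Granularity, caseSensitive = true): string[] {
  let items: string[]
  switch (granularity) {
    case 'lines':
      items = input.split('\n').filter(line => line.trim() !== '')
      break
    case 'words':
      items = input.split(/\s+/).filter(word => word !== '')
      break
    case 'chars':
      items = Array.from(input) // code points, so emoji stay intact
      break
    default:
      throw new Error('unknown granularity')
  }
  const seen = new Set<string>()
  const unique: string[] = []
  for (const item of items) {
    const key = caseSensitive ? item : item.toLowerCase()
    if (!seen.has(key)) {
      seen.add(key)
      unique.push(item)
    }
  }
  return unique
}

dedupe('aabbc', 'chars')            // → ['a', 'b', 'c']
dedupe('Hi hi\nHi', 'words', false) // → ['Hi']
```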
Performance Comparison: Set vs Object vs Array
How do these approaches compare?
// Test data: 1 million random strings
const data = Array.from({ length: 1000000 },
  () => Math.random().toString(36).substring(7)
)
// Approach 1: Set
console.time('Set')
const result1 = [...new Set(data)]
console.timeEnd('Set') // ~45ms
// Approach 2: Object
console.time('Object')
const obj = {}
data.forEach(item => obj[item] = true)
const result2 = Object.keys(obj)
console.timeEnd('Object') // ~120ms
// Approach 3: Array.includes (anti-pattern)
console.time('Array')
const result3 = []
data.forEach(item => {
  if (!result3.includes(item)) {
    result3.push(item)
  }
})
console.timeEnd('Array') // Didn't finish. Waited 5 minutes and gave up
Set wins. It's built for uniqueness and backed by a hash table with average O(1) lookup. A plain object also works, but it's slower here: engines optimize objects for fixed shapes, not for accumulating a million dynamic keys, and Object.keys needs an extra pass at the end. Array.includes is a linear scan on every insert, making the whole loop O(n²), which explodes with large datasets.
Case-Insensitive Implementation
The key is the key variable:
const key = caseSensitive ? line : line.toLowerCase()
if (!seen.has(key)) {
  seen.add(key)
  result.push(line) // Note: push original value, not key
}
Why separate them? Users might want to preserve original case while using lowercase for comparison.
Example:
Input:
Apple
apple
APPLE
Banana
Case-insensitive deduplication:
Apple
Banana
The first occurrence Apple is preserved; subsequent apple and APPLE are considered duplicates.
Edge Cases
Empty Line Filtering
const lines = text.split('\n').filter(line => line.trim() !== '')
Empty lines are usually noise in data cleaning—filter them out.
Unicode Characters
toLowerCase() handles Unicode well:
'Café'.toLowerCase() // 'café'
'MÜNCHEN'.toLowerCase() // 'münchen'
But some languages, Turkish most famously, need toLocaleLowerCase():
'İstanbul'.toLowerCase() // 'i̇stanbul' (wrong for Turkish: dotted capital İ becomes i plus a combining dot)
'İstanbul'.toLocaleLowerCase('tr') // 'istanbul' (correct)
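If locale-correct folding matters, the key computation can accept an optional locale. A sketch (normalizeKey and its locale parameter are my additions, not the tool's API):

```typescript
// Locale-aware key normalization for case-insensitive comparison.
// Falls back to the locale-independent toLowerCase() when no locale is given.
function normalizeKey(value: string, caseSensitive: boolean, locale?: string): string {
  if (caseSensitive) return value
  return locale ? value.toLocaleLowerCase(locale) : value.toLowerCase()
}

normalizeKey('İstanbul', false, 'tr') // → 'istanbul' (Turkish casing rules)
```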
Statistics
Give users feedback after deduplication:
setStats({
original: originalCount, // Original count
unique: uniqueItems.length, // Unique count
removed: originalCount - uniqueItems.length // Removed count
})
Users can see the deduplication effect at a glance.
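The stats themselves are just arithmetic; a minimal sketch (computeStats and DedupeStats are hypothetical names):

```typescript
// Summary of a dedupe pass: how many items came in, survived, and were removed.
interface DedupeStats {
  original: number
  unique: number
  removed: number
}

function computeStats(originalCount: number, uniqueCount: number): DedupeStats {
  return {
    original: originalCount,
    unique: uniqueCount,
    removed: originalCount - uniqueCount,
  }
}

computeStats(10, 7) // → { original: 10, unique: 7, removed: 3 }
```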
Online Tool Implementation
Based on these principles, I built an online text deduplication tool:
Features:
- Three granularities: by line, word, character
- Optional case sensitivity
- Preserves original order
- Real-time statistics: original count, unique count, removed count
Try it at: Text Deduplication Tool
The core logic isn’t complex, but attention to detail matters:
- Edge cases (empty lines, Unicode)
- User experience (statistics, order preservation)
- Performance (Set vs alternatives)
Data cleaning is a fundamental skill. Deduplication is one of the most common requirements.