Word Counter: Unicode Handling and Regex Edge Cases in Mixed Chinese-English Text
Writing articles, tweets, documentation—you always need to count words. Many word counters exist, but when mixing Chinese and English, the results are often wrong. The culprit? Improper Unicode character handling and regex edge cases.
The Core Logic of Word Counting
1. Character Count
Seems simple—just text.length, right? Not quite.
const text = "Hello 世界"
console.log(text.length) // 8, correct
But here’s the catch—Emoji and special characters:
const emoji = "👍"
console.log(emoji.length) // 2, not 1!
const combined = "👨\u200D💻" // Developer emoji: man + ZWJ (U+200D) + laptop, joined into one glyph
console.log(combined.length) // 5!
This happens because JavaScript strings use UTF-16 encoding. Emoji are supplementary plane characters occupying 2 code units. Combined emoji are even more complex—multiple code points joined together.
The correct approach—use for...of or Array.from:
function countChars(text: string): number {
  return [...text].length
}
console.log(countChars("👍")) // 1, correct
console.log(countChars("👨\u200D💻")) // 3, better, but still not 1
for...of iterates over Unicode code points, not UTF-16 code units, so a surrogate pair like 👍 counts as one character. A ZWJ sequence is still several code points, though, so it remains over-counted.
2. Word Count
English words are space-separated. Chinese has no spaces. How to count?
function countWords(text: string): number {
  if (text.trim() === '') return 0
  // Count English words
  const words = text.trim().split(/\s+/).length
  // Count Chinese characters as individual "words"
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  return words + chineseChars
}
But this double-counts Chinese: split(/\s+/) already treats each run of Chinese characters as one "word", and then the same characters are counted again individually. Correct approach:
function countWords(text: string): number {
  if (text.trim() === '') return 0
  // Separate Chinese and English
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const nonChineseText = text.replace(/[\u4e00-\u9fa5]/g, ' ')
  const englishWords = nonChineseText.trim().split(/\s+/).filter(w => w).length
  return chineseChars + englishWords
}
console.log(countWords("Hello 世界")) // 3: Hello + 世 + 界
console.log(countWords("React 是一个优秀的框架")) // 9: React + 8 Chinese characters
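One caveat worth flagging: [\u4e00-\u9fa5] covers only the main CJK Unified Ideographs block, so characters from the CJK extension blocks slip through. If broader coverage matters, a Unicode property escape (ES2018) matches any Han-script character; a sketch:

```typescript
// \p{Script=Han} with the u flag matches every Han-script character,
// including the CJK extension blocks that [\u4e00-\u9fa5] misses.
const HAN_RE = /\p{Script=Han}/gu

function countHanChars(text: string): number {
  return (text.match(HAN_RE) || []).length
}

console.log(countHanChars('React 是一个优秀的框架')) // 8
```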
Sentence and Paragraph Counting Traps
Sentence Count
Common mistake—only using English punctuation:
// ❌ Only counts English sentences
const sentences = text.split(/[.!?]+/).filter(s => s.trim()).length
Mixed Chinese-English text needs both:
function countSentences(text: string): number {
  // Chinese and English punctuation: 。!?.!?
  return text.split(/[。!?!?.]+/).filter(s => s.trim() !== '').length
}
But there’s another trap: the ASCII ellipsis ... and the Chinese ellipsis ……:
const text = "他在想……这该怎么办。" // "He's wondering… what should we do?"
// … is U+2026, which the character class above doesn't match,
// so this counts as 1 sentence instead of 2
Pre-process ellipses:
function countSentences(text: string): number {
  // Replace ellipses with periods first
  const normalized = text.replace(/[.。]{2,}|…{1,}/g, '。')
  return normalized.split(/[。!?!?.]+/).filter(s => s.trim() !== '').length
}
Paragraph Count
Paragraphs are separated by consecutive newlines:
function countParagraphs(text: string): number {
  return text.split(/\n\s*\n/).filter(p => p.trim() !== '').length
}
\n\s*\n matches newline + any whitespace + newline, handling \n\n, \n \n, and variants.
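Since \s also matches \r, the same pattern handles Windows CRLF blank lines. A quick standalone check (the sample strings are mine):

```typescript
function countParagraphs(text: string): number {
  return text.split(/\n\s*\n/).filter(p => p.trim() !== '').length
}

console.log(countParagraphs('First.\n\nSecond.\n \nThird.')) // 3
console.log(countParagraphs('One.\r\n\r\nTwo.'))             // 2: CRLF blank line
```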
Character Frequency Algorithm
Want to know which characters appear most? Use reduce in a single pass:
function charFrequency(text: string): Map<string, number> {
  const freq = new Map<string, number>()
  for (const char of text) {
    if (char.trim()) { // Skip whitespace
      freq.set(char, (freq.get(char) || 0) + 1)
    }
  }
  return freq
}

// Get Top N
function topChars(text: string, n: number = 10): [string, number][] {
  const freq = charFrequency(text)
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
}
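One detail worth noting: Array.prototype.sort is stable in ES2019+, so characters with tied counts keep Map insertion order, i.e. their first appearance in the text. A standalone check (helpers repeated so the snippet runs on its own):

```typescript
function charFrequency(text: string): Map<string, number> {
  const freq = new Map<string, number>()
  for (const char of text) {
    if (char.trim()) { // skip whitespace
      freq.set(char, (freq.get(char) || 0) + 1)
    }
  }
  return freq
}

function topChars(text: string, n: number = 10): [string, number][] {
  return [...charFrequency(text).entries()]
    .sort((a, b) => b[1] - a[1]) // stable sort: ties keep insertion order
    .slice(0, n)
}

// 'i' and 's' both appear 4 times; 'i' appears first in the string
console.log(topChars('mississippi', 3)) // [["i", 4], ["s", 4], ["p", 2]]
```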
Performance note: Map’s get/set is O(1), overall complexity O(n).
Reading Time Estimation
Experience-based: Chinese reading speed ~300-400 characters/minute, English ~200-250 words/minute.
function estimateReadingTime(text: string): number {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length
  const chineseMinutes = chineseChars / 350 // Chinese reading speed
  const englishMinutes = englishWords / 225 // English reading speed
  return Math.ceil(chineseMinutes + englishMinutes)
}
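For example, with the speeds assumed above, 700 Chinese characters work out to 700 / 350 = 2 minutes (the function is repeated so the check runs standalone):

```typescript
function estimateReadingTime(text: string): number {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length
  // 350 chars/min for Chinese, 225 words/min for English, rounded up
  return Math.ceil(chineseChars / 350 + englishWords / 225)
}

console.log(estimateReadingTime('字'.repeat(700)))      // 2
console.log(estimateReadingTime('Hello world, short.')) // 1
```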
Complete Implementation
Combine everything into a stats object:
interface TextStats {
  characters: number         // Total characters
  charactersNoSpaces: number // Without spaces
  words: number              // Words (mixed Chinese-English)
  chineseChars: number       // Chinese characters
  sentences: number          // Sentences
  paragraphs: number         // Paragraphs
  lines: number              // Lines
  readingTime: number        // Estimated reading time (minutes)
}

function analyzeText(text: string): TextStats {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const nonChinese = text.replace(/[\u4e00-\u9fa5]/g, ' ')
  return {
    characters: [...text].length,
    charactersNoSpaces: [...text.replace(/\s/g, '')].length,
    words: chineseChars + nonChinese.trim().split(/\s+/).filter(w => w).length,
    chineseChars,
    sentences: text.replace(/[.。]{2,}|…{1,}/g, '。')
      .split(/[。!?!?.]+/).filter(s => s.trim()).length,
    paragraphs: text.split(/\n\s*\n/).filter(p => p.trim()).length,
    lines: text.split('\n').filter(l => l.trim()).length,
    readingTime: estimateReadingTime(text)
  }
}
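An end-to-end spot check on a made-up sample; the helpers are repeated in trimmed form (interface omitted, a few fields dropped) so the snippet runs standalone:

```typescript
function estimateReadingTime(text: string): number {
  const han = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const words = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length
  return Math.ceil(han / 350 + words / 225)
}

function analyzeText(text: string) {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const nonChinese = text.replace(/[\u4e00-\u9fa5]/g, ' ')
  return {
    characters: [...text].length,
    words: chineseChars + nonChinese.trim().split(/\s+/).filter(w => w).length,
    chineseChars,
    paragraphs: text.split(/\n\s*\n/).filter(p => p.trim()).length,
    lines: text.split('\n').filter(l => l.trim()).length,
    readingTime: estimateReadingTime(text),
  }
}

const stats = analyzeText('Hello world\n\n你好世界')
console.log(stats.characters)  // 17
console.log(stats.words)       // 6: 2 English words + 4 Chinese characters
console.log(stats.paragraphs)  // 2
console.log(stats.readingTime) // 1
```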
Performance Considerations
For large text (100K+ characters), a few optimizations:
- Avoid repeated traversal: count several metrics in a single pass over the text
- Pre-compile regexes: hoist patterns such as const CHINESE_REGEX = /[\u4e00-\u9fa5]/g to module scope so they are not rebuilt on every call
- Web Worker: move counting to a background thread
// worker.ts
// analyzeText must be imported or defined in this worker module
self.onmessage = (e) => {
  const stats = analyzeText(e.data)
  self.postMessage(stats)
}
// main.tsx
// With a bundler (e.g. Vite or webpack 5), reference the worker file
// relative to the current module so the path resolves after build
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
worker.postMessage(longText)
worker.onmessage = (e) => setStats(e.data)
Practical Application
Based on this implementation, I built: Word Counter
Features:
- Real-time character, word, sentence, paragraph counting
- Accurate Chinese-English mixed text detection
- Top 10 character frequency visualization
- Reading time estimation
The algorithm isn’t complex, but handling edge cases and supporting mixed Chinese-English requires deep understanding of Unicode and regular expressions.
Related tools: Text Diff | Text Replace