Word Counter: Unicode Handling and Regex Edge Cases in Mixed Chinese-English Text#

Writing articles, tweets, documentation—you always need to count words. Many word counters exist, but when mixing Chinese and English, the results are often wrong. The culprit? Improper Unicode character handling and regex edge cases.

The Core Logic of Word Counting#

1. Character Count#

Seems simple—just text.length, right? Not quite.

const text = "Hello 世界"
console.log(text.length)  // 8, correct

But here's the catch: emoji and other special characters behave differently:

const emoji = "👍"
console.log(emoji.length)  // 2, not 1!

const combined = "👨‍💻"  // Developer emoji (multiple code points combined)
console.log(combined.length)  // 5!

This happens because JavaScript strings use UTF-16: .length counts 16-bit code units, and emoji live in the supplementary planes, so each one takes two code units (a surrogate pair). Combined emoji are more complex still: multiple code points joined by zero-width joiners (U+200D).
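The mechanics are visible by comparing charCodeAt, which reads code units, with codePointAt, which reads code points:

```typescript
const thumbsUp = "👍"  // U+1F44D, outside the Basic Multilingual Plane

// UTF-16 stores it as a surrogate pair: two 16-bit code units
console.log(thumbsUp.charCodeAt(0).toString(16))   // "d83d" (high surrogate)
console.log(thumbsUp.charCodeAt(1).toString(16))   // "dc4d" (low surrogate)

// codePointAt reassembles the pair into the actual code point
console.log(thumbsUp.codePointAt(0)!.toString(16)) // "1f44d"
```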

The fix for surrogate pairs: use for...of or Array.from, both of which iterate over Unicode code points rather than UTF-16 code units:

function countChars(text: string): number {
  return [...text].length
}

console.log(countChars("👍"))  // 1, correct
console.log(countChars("👨‍💻"))  // 3, still not 1!

This handles surrogate pairs, but not combined emoji: 👨‍💻 is three code points (👨, a zero-width joiner, 💻), so code-point counting still reports 3. What a user perceives as one character is a grapheme cluster, and counting those needs a different tool.

2. Word Count#

English words are space-separated. Chinese has no spaces. How to count?

function countWords(text: string): number {
  if (text.trim() === '') return 0

  // Count English words
  const words = text.trim().split(/\s+/).length

  // Count Chinese characters as individual "words"
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length

  return words + chineseChars
}

But this double-counts: split(/\s+/) already counts each run of Chinese characters as one "word", and the Chinese-character match then counts every character in that run again. Remove the Chinese characters before counting English words:

function countWords(text: string): number {
  if (text.trim() === '') return 0

  // Separate Chinese and English
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const nonChineseText = text.replace(/[\u4e00-\u9fa5]/g, ' ')
  const englishWords = nonChineseText.trim().split(/\s+/).filter(w => w).length

  return chineseChars + englishWords
}

console.log(countWords("Hello 世界"))  // 3: Hello + 世 + 界
console.log(countWords("React 是一个优秀的框架"))  // 9: React + 8 Chinese characters
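One caveat worth flagging: [\u4e00-\u9fa5] covers only part of the original CJK Unified Ideographs block, missing U+9FA6 to U+9FFF and the extension blocks entirely. With the u flag, a Unicode property escape (ES2018+) matches all Han-script characters; a sketch:

```typescript
function countHanChars(text: string): number {
  // \p{Script=Han} requires the u flag; it covers the characters
  // that the \u4e00-\u9fa5 range misses
  return (text.match(/\p{Script=Han}/gu) || []).length
}

console.log(countHanChars("React 是一个优秀的框架"))  // 8

// U+20000 (CJK Extension B) slips through the classic range:
console.log(/[\u4e00-\u9fa5]/.test("\u{20000}"))  // false
console.log(/\p{Script=Han}/u.test("\u{20000}"))  // true
```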

Sentence and Paragraph Counting Traps#

Sentence Count#

Common mistake—only using English punctuation:

// ❌ Only counts English sentences
const sentences = text.split(/[.!?]+/).filter(s => s.trim()).length

Mixed Chinese-English text needs both:

function countSentences(text: string): number {
  // Chinese and English punctuation: 。!?.!?
  return text.split(/[。!?!?.]+/).filter(s => s.trim() !== '').length
}

But there's another trap: ellipses. The ASCII ellipsis ... is matched by the . inside the character class, so it incorrectly ends a sentence mid-thought, while the Chinese ellipsis …… (U+2026 repeated) is not in the class at all and gets silently ignored:

const text = "他在想……这该怎么办。"  // "He was thinking… what should I do."
console.log(text.split(/[。!?!?.]+/))
// ["他在想……这该怎么办", ""]: the ellipsis is never treated as punctuation

Normalize both ellipsis styles to a period first:

function countSentences(text: string): number {
  // Replace ellipses with periods first
  const normalized = text.replace(/[.。]{2,}|…{1,}/g, '。')
  return normalized.split(/[。!?!?.]+/).filter(s => s.trim() !== '').length
}
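A quick check of the normalized splitter (countSentences repeated so the snippet runs standalone); note that normalization treats an ellipsis as a sentence boundary, so both ellipsis styles behave consistently:

```typescript
function countSentences(text: string): number {
  // collapse ASCII and Chinese ellipses into a single period
  const normalized = text.replace(/[.。]{2,}|…{1,}/g, '。')
  return normalized.split(/[。!?!?.]+/).filter(s => s.trim() !== '').length
}

console.log(countSentences("他在想……这该怎么办。"))  // 2, ellipsis counts as a boundary
console.log(countSentences("Wait... what? Really!"))  // 3
```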

Paragraph Count#

Paragraphs are separated by consecutive newlines:

function countParagraphs(text: string): number {
  return text.split(/\n\s*\n/).filter(p => p.trim() !== '').length
}

\n\s*\n matches newline + any whitespace + newline, handling \n\n, \n \n, and variants.
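Because \s itself matches \r and \n, the same pattern also covers Windows line endings and longer runs of blank lines; a quick check (countParagraphs repeated so the snippet runs standalone):

```typescript
function countParagraphs(text: string): number {
  return text.split(/\n\s*\n/).filter(p => p.trim() !== '').length
}

// \r is whitespace, so \r\n\r\n separators need no special casing,
// and three consecutive newlines still count as one break
console.log(countParagraphs("para one\r\n\r\npara two"))  // 2
console.log(countParagraphs("a\n \nb\n\n\nc"))            // 3
```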

Character Frequency Algorithm#

Want to know which characters appear most? Build a frequency Map in a single pass:

function charFrequency(text: string): Map<string, number> {
  const freq = new Map<string, number>()

  for (const char of text) {
    if (char.trim()) {  // Skip whitespace
      freq.set(char, (freq.get(char) || 0) + 1)
    }
  }

  return freq
}

// Get Top N
function topChars(text: string, n: number = 10): [string, number][] {
  const freq = charFrequency(text)
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
}

Performance note: Map's get and set are O(1) on average, so building the frequency table is O(n); topChars adds an O(k log k) sort over the k distinct characters.
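A quick usage check (both functions repeated so the snippet runs standalone):

```typescript
function charFrequency(text: string): Map<string, number> {
  const freq = new Map<string, number>()
  for (const char of text) {
    if (char.trim()) {  // skip whitespace
      freq.set(char, (freq.get(char) || 0) + 1)
    }
  }
  return freq
}

function topChars(text: string, n: number = 10): [string, number][] {
  return [...charFrequency(text).entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
}

console.log(topChars("aab bbc", 2))  // [["b", 3], ["a", 2]], the space is skipped
```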

Reading Time Estimation#

As rules of thumb: Chinese is read at roughly 300-400 characters per minute, English at roughly 200-250 words per minute.

function estimateReadingTime(text: string): number {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length

  const chineseMinutes = chineseChars / 350  // Chinese reading speed
  const englishMinutes = englishWords / 225  // English reading speed

  return Math.ceil(chineseMinutes + englishMinutes)
}
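Worked through on a short sample with the midpoints above (350 characters/minute, 225 words/minute): 8 Chinese characters and one English word give 8/350 + 1/225 ≈ 0.03 minutes, which Math.ceil rounds up to 1; empty input stays at 0 (the function repeated so the snippet runs standalone):

```typescript
function estimateReadingTime(text: string): number {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length

  const chineseMinutes = chineseChars / 350
  const englishMinutes = englishWords / 225

  return Math.ceil(chineseMinutes + englishMinutes)
}

console.log(estimateReadingTime("React 是一个优秀的框架"))  // 1
console.log(estimateReadingTime(""))                        // 0
```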

Complete Implementation#

Combine everything into a stats object:

interface TextStats {
  characters: number          // Total characters
  charactersNoSpaces: number  // Without spaces
  words: number               // Words (mixed Chinese-English)
  chineseChars: number        // Chinese characters
  sentences: number           // Sentences
  paragraphs: number          // Paragraphs
  lines: number               // Lines
  readingTime: number         // Estimated reading time (minutes)
}

function analyzeText(text: string): TextStats {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const nonChinese = text.replace(/[\u4e00-\u9fa5]/g, ' ')

  return {
    characters: [...text].length,
    charactersNoSpaces: [...text.replace(/\s/g, '')].length,
    words: chineseChars + nonChinese.trim().split(/\s+/).filter(w => w).length,
    chineseChars,
    sentences: text.replace(/[.。]{2,}|…{1,}/g, '。')
      .split(/[。!?!?.]+/).filter(s => s.trim()).length,
    paragraphs: text.split(/\n\s*\n/).filter(p => p.trim()).length,
    lines: text.split('\n').filter(l => l.trim()).length,
    readingTime: estimateReadingTime(text)
  }
}
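A condensed sketch of the same logic, with the reading-time estimate inlined, for a quick smoke test:

```typescript
function analyzeText(text: string) {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = text.replace(/[\u4e00-\u9fa5]/g, ' ')
    .trim().split(/\s+/).filter(w => w).length
  return {
    characters: [...text].length,
    charactersNoSpaces: [...text.replace(/\s/g, '')].length,
    words: chineseChars + englishWords,
    chineseChars,
    sentences: text.replace(/[.。]{2,}|…{1,}/g, '。')
      .split(/[。!?!?.]+/).filter(s => s.trim()).length,
    paragraphs: text.split(/\n\s*\n/).filter(p => p.trim()).length,
    lines: text.split('\n').filter(l => l.trim()).length,
    readingTime: Math.ceil(chineseChars / 350 + englishWords / 225)
  }
}

const stats = analyzeText("Hello 世界\n\nReact 是一个优秀的框架")
console.log(stats.characters)  // 24 code points
console.log(stats.words)       // 12: 2 English words + 10 Chinese characters
console.log(stats.paragraphs)  // 2
```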

Performance Considerations#

For large text (100K+ characters), a few optimizations:

  1. Avoid repeated traversal: Count multiple metrics in one pass
  2. Pre-compile regex: const CHINESE_REGEX = /[\u4e00-\u9fa5]/g
  3. Web Worker: Move counting to a background thread

// worker.ts — runs the analysis off the main thread
self.onmessage = (e) => {
  const stats = analyzeText(e.data)
  self.postMessage(stats)
}

// main.tsx — with bundlers like Vite or webpack 5, create the worker from a
// URL so the file is resolved and bundled correctly
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
worker.postMessage(longText)
worker.onmessage = (e) => setStats(e.data)

Practical Application#

Based on this implementation, I built: Word Counter

Features:

  • Real-time character, word, sentence, paragraph counting
  • Accurate Chinese-English mixed text detection
  • Top 10 character frequency visualization
  • Reading time estimation

The algorithm isn’t complex, but handling edge cases and supporting mixed Chinese-English requires deep understanding of Unicode and regular expressions.


Related tools: Text Diff | Text Replace