From Regex to Unicode: Building an Accurate Text Statistics Tool#

Recently, while writing technical documentation, I needed to count words. I tried several online tools and found inconsistent handling of Chinese and English text—some counted Chinese by characters, others by “words.” I decided to implement my own and dive into the technical details behind text statistics.

Core Metrics for Text Statistics#

A complete text statistics tool needs at least these metrics:

interface TextStats {
  charCount: number          // Total characters
  charCountNoSpace: number   // Characters without spaces
  wordCount: number          // Word count
  lineCount: number          // Line count
  paragraphCount: number     // Paragraph count
  sentenceCount: number      // Sentence count
  chineseCount: number       // Chinese characters
  englishCount: number       // English characters
  numberCount: number        // Digit characters
  punctuationCount: number   // Punctuation marks
}

Seems simple, but each metric has pitfalls.

Character Counting: Do Spaces Count?#

The most basic character counting has two approaches:

const charCount = text.length
const charCountNoSpace = text.replace(/\s/g, '').length

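\s matches all whitespace: spaces, tabs, newlines, carriage returns, etc. In JavaScript it also covers the Chinese full-width space (U+3000) and the no-break space (U+00A0), because \s includes every character in the Unicode "space separator" category. Not every regex engine behaves this way, which is why U+3000 is often listed explicitly.

Listing U+3000 explicitly is redundant in JavaScript but harmless, and it makes the intent obvious:

const charCountNoSpace = text.replace(/[\s\u3000]/g, '').length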
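Whitespace coverage differs between regex engines, so it is worth probing directly rather than trusting memory. A small sketch that tests a few code points against \s; the helper name `stripWhitespace` and the list of probes are made up for illustration:

```typescript
// Hypothetical helper: remove everything the engine's \s class matches.
function stripWhitespace(text: string): string {
  return text.replace(/\s/g, '')
}

// Code points to probe against \s (an illustrative selection, not exhaustive).
const probes: Array<[string, string]> = [
  ['U+0020 space', '\u0020'],
  ['U+00A0 no-break space', '\u00a0'],
  ['U+3000 ideographic (full-width) space', '\u3000'],
  ['U+200B zero-width space', '\u200b'],
]

for (const [label, ch] of probes) {
  console.log(label, /\s/.test(ch))
}
```

In current JavaScript engines the first three report true and the zero-width space reports false, so invisible characters such as U+200B need explicit handling if they should not be counted.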

Unicode Character Pitfalls#

JavaScript’s String.length returns UTF-16 code units, not characters:

'𠮷'.length  // 2, but it's actually 1 character
'👍'.length  // 2, but it's actually 1 character

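This is because these characters lie outside the Basic Multilingual Plane, so UTF-16 encodes each one as a surrogate pair: two 16-bit code units (4 bytes). To count code points instead of code units, any of these three methods works:

// Method 1: Array.from iterates by code point
function getRealCharCount(text: string): number {
  return Array.from(text).length
}

// Method 2: for...of also iterates by code point
function countWithForOf(text: string): number {
  let count = 0
  for (const _ of text) {
    count++
  }
  return count
}

// Method 3: regex with the u flag (ES2018+)
function countWithRegex(text: string): number {
  return (text.match(/\p{Any}/gu) || []).length
}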

\p{Any} is a Unicode property escape that matches any character, and the u flag enables Unicode mode.
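Code points are still not quite what a reader perceives as one character: an accented letter written as a base letter plus a combining mark, or an emoji with a skin-tone modifier, spans several code points. A minimal sketch of counting grapheme clusters instead, assuming the runtime provides `Intl.Segmenter` (it should be feature-detected in older browsers); the function name `countGraphemes` is made up for illustration:

```typescript
// Count user-perceived characters (grapheme clusters) rather than code points.
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(text: string, locale: string = 'en'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' })
  let count = 0
  for (const _ of segmenter.segment(text)) {
    count++
  }
  return count
}

const accented = 'e\u0301' // "e" followed by a combining acute accent
console.log(Array.from(accented).length) // 2 code points
console.log(countGraphemes(accented))    // 1 grapheme cluster
```

Whether a tool should report code points or grapheme clusters is a product decision; the important part is to pick one and label it clearly.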

Word Counting: Chinese vs English Differences#

English words are separated by spaces—simple:

const wordCount = text.trim() ? text.trim().split(/\s+/).length : 0

But Chinese has no space delimiters—Chinese “words” require segmentation algorithms. A simple approach:

// Count Chinese by characters, English by words
const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishWords = text.match(/[a-zA-Z]+/g) || []
const wordCount = chineseChars + englishWords.length
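The letters-only pattern splits contractions ("It's") and hyphenated compounds into several words and skips numbers entirely. One possible refinement is sketched below; the function name `countMixedWords` and the choice to count digit runs as words are assumptions, not rules from any standard:

```typescript
// Hypothetical helper: each Han character counts as one word,
// each run of Latin letters/digits counts as one word.
function countMixedWords(text: string): number {
  const hanChars = text.match(/\p{Script=Han}/gu) || []
  // Letters/digits, optionally joined by an apostrophe or hyphen (it's, state-of-the-art)
  const latinWords = text.match(/[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*/g) || []
  return hanChars.length + latinWords.length
}

// "It's", "a", "state-of-the-art", four Han characters, "tool"
console.log(countMixedWords("It's a state-of-the-art 文本统计 tool")) // 8
```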

More Precise Segmentation#

To truly count Chinese words, you need a segmentation library like jieba:

import * as jieba from 'nodejieba'

function countChineseWords(text: string): number {
  const chineseText = text.match(/[\u4e00-\u9fa5]+/g) || []
  const words = chineseText.map(seg => jieba.cut(seg)).flat()
  return words.length
}

But segmentation introduces additional complexity:

  • Performance overhead: Segmentation requires dictionary matching
  • Accuracy issues: New words and proper nouns may be segmented incorrectly
  • Size issues: Dictionary files are several MB

For online tools, counting Chinese by characters is more practical.
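Newer JavaScript runtimes also ship a built-in segmenter, `Intl.Segmenter`, which avoids bundling a dictionary. A minimal sketch, assuming the runtime supports it; segmentation quality for Chinese depends on the runtime's ICU data, so results may differ from jieba. The function name is made up for illustration:

```typescript
// Sketch: count word-like segments with the built-in Intl.Segmenter.
function countWordsWithSegmenter(text: string, locale: string = 'zh'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  let count = 0
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments
    if (segment.isWordLike) count++
  }
  return count
}

console.log(countWordsWithSegmenter('Hello, world!', 'en')) // 2
console.log(countWordsWithSegmenter('我爱北京天安门'))        // depends on the runtime's dictionary
```

This trades accuracy control for zero bundle size, which can suit an online tool where counting by characters feels too coarse.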

Line Count and Paragraph Count#

Line counting seems straightforward:

const lineCount = text.split('\n').length

But there are edge cases:

1. Empty File#

''.split('\n').length  // 1, but actually 0 lines

Correct handling:

const lineCount = text.length === 0 ? 0 : text.split('\n').length

2. Trailing Newline#

'hello\nworld\n'.split('\n').length  // 3, but actually 2 lines

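Handling: a final newline terminates the last line rather than starting a new one, so subtract one when the text ends with a newline:

const lineCount = text.split('\n').length - (text.endsWith('\n') ? 1 : 0)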
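Both edge cases, plus Windows-style \r\n line endings, can be folded into one helper. This is a sketch; the name `countLines` and the decision to treat a lone \r as a line break are assumptions:

```typescript
// Hypothetical helper covering empty input, trailing line breaks, and \r\n.
function countLines(text: string): number {
  if (text.length === 0) return 0
  const lines = text.split(/\r\n|\r|\n/)
  // A trailing line break terminates the last line rather than starting a new one
  if (lines[lines.length - 1] === '') lines.pop()
  return lines.length
}

console.log(countLines(''))               // 0
console.log(countLines('hello\nworld\n')) // 2
console.log(countLines('a\r\nb'))         // 2
console.log(countLines('a\n\nb'))         // 3 (the blank line counts)
```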

Paragraph Counting#

Paragraphs are separated by blank lines:

const paragraphCount = text.trim() 
  ? text.split(/\n\s*\n/).filter(p => p.trim()).length 
  : 0

\n\s*\n matches blank lines with possible whitespace between two newlines.
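Because \s also matches \r and further newlines, the same pattern absorbs runs of several blank lines, whitespace-only lines, and \r\n line endings. A short sketch with a sample input (the function name and sample text are made up for illustration):

```typescript
function countParagraphs(text: string): number {
  if (!text.trim()) return 0
  // Separator: a newline, optional whitespace (including more newlines), then a newline
  return text.split(/\n\s*\n/).filter(p => p.trim()).length
}

const sample =
  'First paragraph,\nstill the first.\n   \n\n\nSecond paragraph.\r\n\r\nThird.'
console.log(countParagraphs(sample)) // 3
```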

Sentence Counting: Punctuation Pitfalls#

Sentences are separated by punctuation:

const sentenceCount = text.split(/[.!?。!?]+/).filter(s => s.trim()).length

But there are many more punctuation marks:

  • English: . ! ? ; ... (ellipsis)
  • Chinese: 。 ! ? ; …… (ellipsis)
  • Others: … (U+2026) ⋯ (U+22EF)

A more complete regex:

const sentenceEndings = /[.!?;。!?;…⋯]+/g
const sentenceCount = text.split(sentenceEndings).filter(s => s.trim()).length

Decimal Points and Abbreviations#

English . is also used for decimals and abbreviations, causing miscounts:

'The price is $3.14.'.split(/[.!?]+/).filter(s => s.trim()).length
// 2, but actually 1 sentence

Precise handling requires NLP (Natural Language Processing), or simple heuristic rules:

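function countSentences(text: string): number {
  // Protect decimal points so they are not treated as sentence endings
  let normalized = text.replace(/(\d)\.(?=\d)/g, '$1DOT')
  // Protect the periods inside and after common abbreviations
  const abbreviations = ['Mr', 'Mrs', 'Dr', 'Prof', 'etc', 'e.g', 'i.e']
  abbreviations.forEach(abbr => {
    // Escape the dots in "e.g" / "i.e" and use the g flag to catch every occurrence
    const escaped = abbr.replace(/\./g, '\\.')
    const pattern = new RegExp(`\\b${escaped}\\.`, 'g')
    normalized = normalized.replace(pattern, match => match.replace(/\./g, 'DOT'))
  })
  
  const sentences = normalized.split(/[.!?。!?]+/).filter(s => s.trim())
  return sentences.length
}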

Unicode Character Classification#

Counting Chinese, English, numbers, punctuation:

const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishCount = (text.match(/[a-zA-Z]/g) || []).length
const numberCount = (text.match(/\d/g) || []).length
const punctuationCount = (text.match(/[.,!?;:'"()()。,!?:;、]/g) || []).length

Unicode Range Details#

Chinese Unicode ranges:

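  • Basic: \u4e00-\u9fff (the CJK Unified Ideographs block, 20,992 code points; the widely used \u4e00-\u9fa5 covers the original 20,902 characters)
  • Extension A: \u3400-\u4dbf (6,592 rare characters)
  • Extensions B-F: More rare characters, all outside the BMP (U+20000 and above)
  • Compatibility Ideographs: \uf900-\ufaff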

More complete Chinese matching:

const chineseRegex = /[\u4e00-\u9fa5\u3400-\u4dbf\uf900-\ufaff]/g
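The BMP-only pattern cannot see Extensions B-F, because those characters are stored as surrogate pairs. Reaching them with explicit ranges requires the u flag and \u{...} escapes. A sketch that adds Extension B only (U+20000 to U+2A6DF); the constant and function names are made up for illustration:

```typescript
// BMP-only pattern: basic block, Extension A, compatibility ideographs
const bmpHan = /[\u4e00-\u9fa5\u3400-\u4dbf\uf900-\ufaff]/g
// Adds Extension B; \u{...} code point escapes require the u flag
const hanWithExtB = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u{20000}-\u{2a6df}]/gu

function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) || []).length
}

// 𠮷 is U+20BB7, which lies in Extension B
console.log(countMatches('𠮷野家', bmpHan))      // 2
console.log(countMatches('𠮷野家', hanWithExtB)) // 3
```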

Using Unicode Property Escapes#

ES2018 supports Unicode property escapes for more semantic matching:

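const chineseCount = (text.match(/\p{Script=Han}/gu) || []).length
// \p{Letter} would also match Han characters, so restrict to the Latin script
const englishCount = (text.match(/\p{Script=Latin}/gu) || []).length
// \p{Number} is broader than \d: it includes full-width digits, fractions, Roman numerals
const numberCount = (text.match(/\p{Number}/gu) || []).length
const punctuationCount = (text.match(/\p{Punctuation}/gu) || []).length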

But browser compatibility needs attention:

  • Chrome 64+, Firefox 78+, Safari 11.1+
  • IE not supported
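Where older engines still matter, one option is to feature-detect property escapes at runtime and fall back to explicit ranges. The pattern must be built with the RegExp constructor, because an unsupported regex literal would fail at parse time. A sketch, with a made-up function name and a BMP-only fallback:

```typescript
// Hypothetical helper: prefer \p{Script=Han}, fall back to explicit BMP ranges.
function buildHanRegex(): RegExp {
  try {
    // Throws a SyntaxError in engines without Unicode property escapes
    return new RegExp('\\p{Script=Han}', 'gu')
  } catch {
    // Fallback misses characters outside the BMP (Extensions B-F)
    return /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]/g
  }
}

const hanRegex = buildHanRegex()
console.log(('abc 中文'.match(hanRegex) || []).length) // 2
```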

Performance Optimization: Real-time Statistics#

Real-time statistics during user input requires performance optimization.

1. useMemo Caching#

const stats = useMemo(() => {
  return {
    charCount: text.length,
    wordCount: text.trim() ? text.trim().split(/\s+/).length : 0,
    // ...
  }
}, [text])

Only recalculate when text changes.

2. Debounced Input#

For large text, don’t calculate in real-time during input:

const debouncedText = useDebounce(text, 300)

const stats = useMemo(() => {
  // Calculate using debouncedText
}, [debouncedText])
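The `useDebounce` hook is not defined above. The timer logic that such a hook would typically wrap can be sketched without any framework; everything here (the `debounce` name, the `flush` and `cancel` methods) is an illustrative assumption rather than the code behind the tool:

```typescript
// Hypothetical framework-free debounce (trailing edge).
interface Debounced<A extends unknown[]> {
  (...args: A): void
  flush(): void   // run the pending call immediately, if any
  cancel(): void  // drop the pending call
}

function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number
): Debounced<A> {
  let timer: ReturnType<typeof setTimeout> | undefined
  let pendingArgs: A | undefined

  const run = () => {
    timer = undefined
    if (pendingArgs) {
      const args = pendingArgs
      pendingArgs = undefined
      fn(...args)
    }
  }

  const debounced = ((...args: A) => {
    // Each new call replaces the pending one and restarts the timer
    pendingArgs = args
    if (timer !== undefined) clearTimeout(timer)
    timer = setTimeout(run, delayMs)
  }) as Debounced<A>

  debounced.flush = () => {
    if (timer !== undefined) clearTimeout(timer)
    run()
  }
  debounced.cancel = () => {
    if (timer !== undefined) clearTimeout(timer)
    timer = undefined
    pendingArgs = undefined
  }
  return debounced
}
```

With a 300 ms delay, a burst of keystrokes triggers a single statistics pass for the final text instead of one pass per keystroke.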

3. Web Worker#

For very large text (>1MB), move computation to a Web Worker:

// worker.ts
self.onmessage = (e) => {
  const text = e.data
  const stats = {
    charCount: text.length,
    chineseCount: (text.match(/[\u4e00-\u9fa5]/g) || []).length,
    // ...
  }
  self.postMessage(stats)
}

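// main.tsx
// Browsers cannot load .ts files directly; this form assumes a bundler such as Vite or webpack 5
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
worker.onmessage = (e) => setStats(e.data)
worker.postMessage(largeText)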

Complete Implementation#

Based on the above analysis, here’s the complete statistics function:

function getTextStats(text: string): TextStats {
  // Basic statistics
  const charCount = Array.from(text).length  // code points, not UTF-16 code units
  const charCountNoSpace = Array.from(text.replace(/[\s\u3000]/g, '')).length
  
  // Word count (Chinese by characters, English by words)
  const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = (text.match(/[a-zA-Z]+/g) || []).length
  const wordCount = chineseCount + englishWords
  
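  // Line count (a trailing newline ends the last line, it does not start a new one)
  const lineCount = text.length === 0
    ? 0
    : text.split('\n').length - (text.endsWith('\n') ? 1 : 0)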
  
  // Paragraph count
  const paragraphCount = text.trim() 
    ? text.split(/\n\s*\n/).filter(p => p.trim()).length 
    : 0
  
  // Sentence count
  const sentenceCount = text.split(/[.!?;。!?;…⋯]+/).filter(s => s.trim()).length
  
  // Character classification
  const englishCount = (text.match(/[a-zA-Z]/g) || []).length
  const numberCount = (text.match(/\d/g) || []).length
  const punctuationCount = (text.match(/[.,!?;:'"()()。,!?:;、…]/g) || []).length
  
  return {
    charCount,
    charCountNoSpace,
    wordCount,
    lineCount,
    paragraphCount,
    sentenceCount,
    chineseCount,
    englishCount,
    numberCount,
    punctuationCount,
  }
}

Practical Application#

Based on these ideas, I built an online tool: Text Statistics

Key features:

  • Real-time statistics for 10 text metrics
  • Supports mixed Chinese and English text
  • Correctly handles Unicode characters
  • Performance optimized for large text

The implementation isn’t complex, but getting each detail right requires careful thought. Hope this helps!

