From Regex to Unicode: Building an Accurate Text Statistics Tool#

Recently, while writing technical documentation, I needed to count words. I tried several online tools and found inconsistent handling of Chinese and English text—some counted Chinese by characters, others by “words.” I decided to implement my own and dive into the technical details behind text statistics.

Core Metrics for Text Statistics#

A complete text statistics tool needs at least these metrics:

interface TextStats {
  charCount: number          // Total characters
  charCountNoSpace: number   // Characters without spaces
  wordCount: number          // Word count
  lineCount: number          // Line count
  paragraphCount: number     // Paragraph count
  sentenceCount: number      // Sentence count
  chineseCount: number       // Chinese characters
  englishCount: number       // English characters
  numberCount: number        // Digit characters
  punctuationCount: number   // Punctuation marks
}

Seems simple, but each metric has pitfalls.

Character Counting: Do Spaces Count?#

Basic character counting comes in two variants, with and without whitespace:

const charCount = text.length
const charCountNoSpace = text.replace(/\s/g, '').length

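\s matches all whitespace: spaces, tabs, newlines, carriage returns, etc. In JavaScript this already includes the Chinese full-width space (U+3000), because \s covers every Unicode space separator. Other regex engines can be stricter, and some treat \s as ASCII-only unless a Unicode option is set.

Spelling out U+3000 is therefore redundant in JavaScript, but it is harmless and keeps the intent visible if the pattern is ported to another engine: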

const charCountNoSpace = text.replace(/[\s\u3000]/g, '').length
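Which characters \s covers is easy to check directly in the target runtime. The sketch below is only an illustration; `countNonWhitespace` is a hypothetical helper name, not part of the original tool. It also shows that the zero-width space (U+200B) is a format character rather than whitespace, so it survives the replace.

```typescript
// Hypothetical helper: count UTF-16 code units that are not whitespace.
function countNonWhitespace(text: string): number {
  return text.replace(/\s/g, '').length
}

const ideographicSpace = '\u3000' // Chinese full-width space
const noBreakSpace = '\u00a0'
const zeroWidthSpace = '\u200b'   // a format character, not whitespace

console.log(/\s/.test(ideographicSpace)) // true
console.log(/\s/.test(noBreakSpace))     // true
console.log(/\s/.test(zeroWidthSpace))   // false

console.log(countNonWhitespace('你好\u3000世界')) // 4
```

If invisible characters such as U+200B should not count either, they have to be added to the character class explicitly.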

Unicode Character Pitfalls#

JavaScript’s String.length returns UTF-16 code units, not characters:

'𠮷'.length  // 2, but it's actually 1 character
'👍'.length  // 2, but it's actually 1 character

These characters lie outside the Basic Multilingual Plane, so UTF-16 stores each one as a surrogate pair: two 16-bit code units, 4 bytes in total. Three ways to count code points instead:

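// Method 1: Use Array.from (iterates by code point)
function getRealCharCount(text: string): number {
  return Array.from(text).length
}

// Method 2: Use for...of (also iterates by code point)
function getRealCharCountForOf(text: string): number {
  let count = 0
  for (const _ of text) {
    count++
  }
  return count
}

// Method 3: Use regex (ES2018+)
function getRealCharCountRegex(text: string): number {
  return (text.match(/\p{Any}/gu) || []).length
}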

\p{Any} is a Unicode property escape that matches any character, and the u flag enables Unicode mode.
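Counting code points fixes surrogate pairs, but one visible character can still span several code points, for example a base letter plus a combining accent, or an emoji sequence joined by U+200D. The sketch below only illustrates that gap; `countCodeUnits` and `countCodePoints` are hypothetical names. Counting user-perceived characters would need grapheme segmentation (for instance `Intl.Segmenter`, where the runtime provides it), which is not shown here.

```typescript
function countCodeUnits(text: string): number {
  return text.length
}

function countCodePoints(text: string): number {
  return Array.from(text).length
}

// sample                      code units   code points
// 'a'                         1            1
// '𠮷'                        2            1
// 'e' + U+0301 (é)            2            2
// man + ZWJ + woman + ZWJ + girl   8       5
const samples = ['a', '𠮷', 'e\u0301', '👨\u200d👩\u200d👧']
samples.forEach(s => {
  console.log(JSON.stringify(s), countCodeUnits(s), countCodePoints(s))
})
```

For a word-count tool the code point count is usually close enough, but the difference shows up on emoji-heavy or accented text.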

Word Counting: Chinese vs English Differences#

English words are separated by spaces—simple:

const wordCount = text.trim() ? text.trim().split(/\s+/).length : 0

But Chinese has no space delimiters—Chinese “words” require segmentation algorithms. A simple approach:

// Count Chinese by characters, English by words
const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishWords = text.match(/[a-zA-Z]+/g) || []
const wordCount = chineseChars + englishWords.length
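The character class [a-zA-Z]+ splits contractions and hyphenated words ("don't" becomes two words) and ignores numbers entirely. A slightly more forgiving sketch follows, under two assumptions that are not from the original tool: an apostrophe or hyphen between letters belongs to the word, and a run of digits (with an optional decimal part) counts as one word. `countMixedWords` is a hypothetical name.

```typescript
function countMixedWords(text: string): number {
  // Each Chinese character counts as one word
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  // Letter runs, optionally joined by ' or ’ or - (don't, well-known)
  const englishWords = (text.match(/[a-zA-Z]+(?:['’-][a-zA-Z]+)*/g) || []).length
  // Digit runs with an optional decimal part (3.14 counts once)
  const numbers = (text.match(/\d+(?:\.\d+)?/g) || []).length
  return chineseChars + englishWords + numbers
}

console.log(countMixedWords('Hello 世界'))      // 3
console.log(countMixedWords("don't panic"))    // 2
console.log(countMixedWords('版本 3.14 发布'))  // 5
```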

More Precise Segmentation#

To truly count Chinese words, you need a segmentation library like jieba:

import * as jieba from 'nodejieba'

function countChineseWords(text: string): number {
  const chineseText = text.match(/[\u4e00-\u9fa5]+/g) || []
  const words = chineseText.map(seg => jieba.cut(seg)).flat()
  return words.length
}

But segmentation introduces additional complexity:

Performance overhead: Segmentation requires dictionary matching
Accuracy issues: New words and proper nouns may be segmented incorrectly
Size issues: Dictionary files are several MB

For online tools, counting Chinese by characters is more practical.

Line Count and Paragraph Count#

Line counting seems straightforward:

const lineCount = text.split('\n').length

But there are edge cases:

1. Empty File#

''.split('\n').length  // 1, but actually 0 lines

Correct handling:

const lineCount = text.length === 0 ? 0 : text.split('\n').length

2. Trailing Newline#

'hello\nworld\n'.split('\n').length  // 3, but actually 2 lines

Handling:

// split() yields one extra empty string after a trailing newline, so subtract it
const lineCount = text.length === 0 ? 0 : text.split('\n').length - (text.endsWith('\n') ? 1 : 0)
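Files created on Windows use \r\n, and some old files use a bare \r, so splitting on \n alone can leave stray \r characters or miss lines. A sketch that treats all three as line breaks and ignores a single trailing one; `countLines` is a hypothetical name:

```typescript
function countLines(text: string): number {
  if (text.length === 0) return 0
  // \r\n comes first so it is consumed as one break, not two
  const lines = text.split(/\r\n|\r|\n/)
  // A trailing line break ends the last line; it does not start a new one
  if (lines[lines.length - 1] === '') lines.pop()
  return lines.length
}

console.log(countLines(''))                // 0
console.log(countLines('hello\nworld\n'))  // 2
console.log(countLines('a\r\nb\r\n'))      // 2
console.log(countLines('a\n\nb'))          // 3 (the blank line in the middle counts)
```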

Paragraph Counting#

Paragraphs are separated by blank lines:

const paragraphCount = text.trim() 
  ? text.split(/\n\s*\n/).filter(p => p.trim()).length 
  : 0

\n\s*\n matches blank lines with possible whitespace between two newlines.
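For illustration, here is the same expression wrapped in a function with a few inputs; `countParagraphs` is a hypothetical name. Because \s also matches \r, the pattern copes with Windows line endings without extra work.

```typescript
function countParagraphs(text: string): number {
  if (!text.trim()) return 0
  // Split on a newline, optional whitespace, then another newline
  return text.split(/\n\s*\n/).filter(p => p.trim()).length
}

console.log(countParagraphs('a\n\nb\n \n\nc'))       // 3
console.log(countParagraphs('one\r\n\r\ntwo'))       // 2
console.log(countParagraphs('single\nline break'))   // 1
```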

Sentence Counting: Punctuation Pitfalls#

Sentences are separated by punctuation:

const sentenceCount = text.split(/[.!?。！？]+/).filter(s => s.trim()).length

But there are many more punctuation marks:

English: . ! ? ; ...(ellipsis)
Chinese: 。 ！ ？ ； ……(ellipsis)
Others: …(U+2026) ⋯(U+22EF)

A more complete regex:

// Commas and colons are left out: they separate clauses, not sentences
const sentenceEndings = /[.!?;。！？；…⋯]+/g
const sentenceCount = text.split(sentenceEndings).filter(s => s.trim()).length
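A quick check of how such a split behaves on mixed text. The pattern below is an assumption made for illustration: it treats semicolons and ellipses as sentence boundaries, but not commas or colons. `countSentencesBasic` is a hypothetical name.

```typescript
const SENTENCE_ENDINGS = /[.!?;。！？；…⋯]+/

function countSentencesBasic(text: string): number {
  // Runs of terminators ("?!", "...", "……") collapse into one boundary
  return text.split(SENTENCE_ENDINGS).filter(s => s.trim()).length
}

console.log(countSentencesBasic('你好！今天天气不错。Is it? Yes.')) // 4
console.log(countSentencesBasic('Wait... what?!'))               // 2
console.log(countSentencesBasic('等等……好的。'))                   // 2
```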

Decimal Points and Abbreviations#

English . is also used for decimals and abbreviations, causing miscounts:

'The price is $3.14.'.split(/[.!?]+/).filter(s => s.trim()).length
// 2, but actually 1 sentence

Precise handling requires NLP (Natural Language Processing), or simple heuristic rules:

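function countSentences(text: string): number {
  // Protect decimal points so "3.14" does not split a sentence
  let normalized = text.replace(/(\d)\.(?=\d)/g, '$1DOT')
  // Protect common abbreviations (heuristic: "etc." at the end of a sentence is undercounted)
  const abbreviations = ['Mr', 'Mrs', 'Dr', 'Prof', 'etc', 'e.g', 'i.e']
  abbreviations.forEach(abbr => {
    // Escape the dot inside "e.g" / "i.e" and use the g flag to replace every occurrence
    const pattern = new RegExp(`\\b${abbr.replace(/\./g, '\\.')}\\.`, 'g')
    normalized = normalized.replace(pattern, abbr.replace(/\./g, 'DOT') + 'DOT')
  })
  
  const sentences = normalized.split(/[.!?。！？]+/).filter(s => s.trim())
  return sentences.length
}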

Unicode Character Classification#

Counting Chinese, English, numbers, punctuation:

const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishCount = (text.match(/[a-zA-Z]/g) || []).length
const numberCount = (text.match(/\d/g) || []).length
const punctuationCount = (text.match(/[.,!?;:'"()（）。，！？：；、]/g) || []).length

Unicode Range Details#

Chinese Unicode ranges:

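Basic: \u4e00-\u9fff (a block of 20,992 code points; the popular \u4e00-\u9fa5 covers only the first 20,902)
Extension A: \u3400-\u4dbf (6,592 rare characters)
Compatibility Ideographs: \uf900-\ufaff
Extensions B-F: More rare characters, starting at U+20000 (outside the BMP, so matching them needs the u flag and \u{...} syntax)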

More complete Chinese matching:

const chineseRegex = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]/g
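Characters in the supplementary planes cannot be reached with \uXXXX ranges, because each one is stored as a surrogate pair. With the u flag, the \u{...} form can address them. The sketch below assumes the span U+20000 to U+2EBEF as a rough cover for Extensions B through F (it includes a few unassigned code points); `HAN_REGEX` and `countHan` are hypothetical names.

```typescript
// BMP ranges plus U+20000-U+2EBEF (assumed cover for Extensions B-F)
const HAN_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u{20000}-\u{2ebef}]/gu

function countHan(text: string): number {
  return (text.match(HAN_REGEX) || []).length
}

// 𠮷 is U+20BB7 (Extension B)
console.log(countHan('𠮷野家'))                                  // 3
console.log(('𠮷野家'.match(/[\u4e00-\u9fa5]/g) || []).length)   // 2, the basic range misses 𠮷
```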

Using Unicode Property Escapes#

ES2018 supports Unicode property escapes for more semantic matching:

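const chineseCount = (text.match(/\p{Script=Han}/gu) || []).length
// \p{Letter} would also match Chinese characters, so restrict to the Latin script
const englishCount = (text.match(/\p{Script=Latin}/gu) || []).length
// \p{Number} also matches fractions and Roman numerals; Decimal_Number is digits only
const numberCount = (text.match(/\p{Decimal_Number}/gu) || []).length
const punctuationCount = (text.match(/\p{Punctuation}/gu) || []).length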

But browser compatibility needs attention:

Chrome 64+, Edge 79+, Firefox 78+, Safari 11.1+
IE not supported
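Where older browsers matter, one option is to detect support at runtime and fall back to explicit ranges. This is a sketch, with the fallback range chosen here as an assumption; `buildHanRegex` is a hypothetical name. Building the pattern with the RegExp constructor keeps the error catchable, since an unsupported regex literal would fail when the script is parsed.

```typescript
function buildHanRegex(): RegExp {
  try {
    // Throws a SyntaxError in engines without Unicode property escapes
    return new RegExp('\\p{Script=Han}', 'gu')
  } catch (_err) {
    // Fallback: BMP-only ranges (misses Extension B and later)
    return /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]/g
  }
}

const hanRegex = buildHanRegex()
const hanCount = ('中文 and English'.match(hanRegex) || []).length
console.log(hanCount) // 2
```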

Performance Optimization: Real-time Statistics#

Real-time statistics during user input requires performance optimization.

1. useMemo Caching#

const stats = useMemo(() => {
  return {
    charCount: text.length,
    wordCount: text.trim() ? text.trim().split(/\s+/).length : 0,
    // ...
  }
}, [text])

Only recalculate when text changes.

2. Debounced Input#

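For large text, don’t recalculate on every keystroke. Debounce the input first (useDebounce is not part of React; it is assumed here to be a custom hook or a library hook that returns the debounced value):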

const debouncedText = useDebounce(text, 300)

const stats = useMemo(() => {
  // Calculate using debouncedText
}, [debouncedText])

3. Web Worker#

For very large text (>1MB), move computation to a Web Worker:

// worker.ts
self.onmessage = (e) => {
  const text = e.data
  const stats = {
    charCount: text.length,
    chineseCount: (text.match(/[\u4e00-\u9fa5]/g) || []).length,
    // ...
  }
  self.postMessage(stats)
}

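// main.tsx
// Assumes a bundler (e.g. Vite or webpack 5) that compiles worker.ts; browsers cannot load .ts files directly
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
worker.onmessage = (e) => setStats(e.data)  // register the handler before sending work
worker.postMessage(largeText)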

Complete Implementation#

Based on the above analysis, here’s the complete statistics function:

function getTextStats(text: string): TextStats {
  // Basic statistics (counted in code points, so surrogate pairs count once)
  const charCount = Array.from(text).length
  const charCountNoSpace = Array.from(text.replace(/[\s\u3000]/g, '')).length
  
  // Word count (Chinese by characters, English by words)
  const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = (text.match(/[a-zA-Z]+/g) || []).length
  const wordCount = chineseCount + englishWords
  
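  // Line count (a trailing newline ends the last line rather than starting a new one)
  const lineCount = text.length === 0 ? 0 : text.split('\n').length - (text.endsWith('\n') ? 1 : 0)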
  
  // Paragraph count
  const paragraphCount = text.trim() 
    ? text.split(/\n\s*\n/).filter(p => p.trim()).length 
    : 0
  
  // Sentence count
  const sentenceCount = text.split(/[.!?;。！？；…⋯]+/).filter(s => s.trim()).length
  
  // Character classification
  const englishCount = (text.match(/[a-zA-Z]/g) || []).length
  const numberCount = (text.match(/\d/g) || []).length
  const punctuationCount = (text.match(/[.,!?;:'"()（）。，！？：；、…]/g) || []).length
  
  return {
    charCount,
    charCountNoSpace,
    wordCount,
    lineCount,
    paragraphCount,
    sentenceCount,
    chineseCount,
    englishCount,
    numberCount,
    punctuationCount,
  }
}

Practical Application#

Based on these ideas, I built an online tool: Text Statistics

Key features:

Real-time statistics for 10 text metrics
Supports mixed Chinese and English text
Correctly handles Unicode characters
Performance optimized for large text

The implementation isn’t complex, but getting each detail right requires careful thought. Hope this helps!
