From Regex to Unicode: Building an Accurate Text Statistics Tool
Recently, while writing technical documentation, I needed to count words. I tried several online tools and found inconsistent handling of Chinese and English text—some counted Chinese by characters, others by “words.” I decided to implement my own and dive into the technical details behind text statistics.
Core Metrics for Text Statistics#
A complete text statistics tool needs at least these metrics:
interface TextStats {
  charCount: number // Total characters
  charCountNoSpace: number // Characters without spaces
  wordCount: number // Word count
  lineCount: number // Line count
  paragraphCount: number // Paragraph count
  sentenceCount: number // Sentence count
  chineseCount: number // Chinese characters
  englishCount: number // English characters
  numberCount: number // Digit characters
  punctuationCount: number // Punctuation marks
}
Seems simple, but each metric has pitfalls.
Character Counting: Do Spaces Count?#
The most basic character counting has two approaches:
const charCount = text.length
const charCountNoSpace = text.replace(/\s/g, '').length
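\s matches all whitespace: spaces, tabs, newlines, carriage returns, and so on. In JavaScript it also covers the Unicode space separators, including the Chinese full-width (ideographic) space U+3000, so no special case is needed. Some other regex engines default to an ASCII-only \s, which is where the common advice to add \u3000 comes from.
Listing U+3000 explicitly is redundant in JavaScript but harmless, and it documents the intent:
const charCountNoSpace = text.replace(/[\s\u3000]/g, '').length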
Unicode Character Pitfalls#
JavaScript’s String.length returns UTF-16 code units, not characters:
'𠮷'.length // 2, but it's actually 1 character
'👍'.length // 2, but it's actually 1 character
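This is because these characters lie outside the Basic Multilingual Plane and are encoded in UTF-16 as a surrogate pair: two 16-bit code units, 4 bytes in total. Three ways to count code points instead:
// Method 1: Array.from iterates by code point
function getRealCharCount(text: string): number {
  return Array.from(text).length
}
// Method 2: for...of also iterates by code point
function getRealCharCountForOf(text: string): number {
  let count = 0
  for (const _ of text) {
    count++
  }
  return count
}
// Method 3: regex with a Unicode property escape (ES2018+)
function getRealCharCountRegex(text: string): number {
  return (text.match(/\p{Any}/gu) || []).length
}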
\p{Any} is a Unicode property escape that matches any code point, and the u flag enables Unicode mode, which property escapes require.
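Code points are still not always what a reader perceives as one character: an accented letter written with a combining mark, or an emoji with a skin-tone modifier, spans several code points. As a minimal sketch, assuming the runtime provides Intl.Segmenter (recent browsers and Node.js 16+), grapheme clusters could be counted as follows. countGraphemes is a hypothetical helper name, and it falls back to code points where the API is missing.

```typescript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available; falls back to code points otherwise.
function countGraphemes(text: string, locale: string = 'en'): number {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return Array.from(text).length
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' })
  return Array.from(segmenter.segment(text)).length
}

const accented = 'e\u0301' // "e" followed by a combining acute accent
const thumbsUp = '\u{1F44D}\u{1F3FD}' // thumbs up followed by a skin-tone modifier

// Columns: UTF-16 code units, code points, grapheme clusters
console.log(accented.length, Array.from(accented).length, countGraphemes(accented)) // 2 2 1 (with Intl.Segmenter)
console.log(thumbsUp.length, Array.from(thumbsUp).length, countGraphemes(thumbsUp)) // 4 2 1 (with Intl.Segmenter)
```

Which of the three numbers to report as "characters" is a product decision; for a word-count tool, code points are often close enough, while grapheme clusters match what the user sees.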
Word Counting: Chinese vs English Differences#
English words are separated by spaces—simple:
const wordCount = text.trim() ? text.trim().split(/\s+/).length : 0
But Chinese has no space delimiters—Chinese “words” require segmentation algorithms. A simple approach:
// Count Chinese by characters, English by words
const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishWords = text.match(/[a-zA-Z]+/g) || []
const wordCount = chineseChars + englishWords.length
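One limitation of [a-zA-Z]+ is that it splits "don't" into two words and ignores digits entirely. The variant below is a sketch under two assumptions that are not in the snippet above: apostrophes and hyphens inside a word do not split it, and digits count as word characters. countMixedWords is a hypothetical name.

```typescript
// Hypothetical helper: Chinese counted per character, English per word.
// Assumption: apostrophes and hyphens inside a word do not split it,
// and digits count as word characters.
function countMixedWords(text: string): number {
  const chineseChars = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = (text.match(/[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*/g) || []).length
  return chineseChars + englishWords
}

console.log(countMixedWords("I don't know 中文")) // 5: I, don't, know, 中, 文
console.log(countMixedWords('state-of-the-art 分词')) // 3: state-of-the-art, 分, 词
```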
More Precise Segmentation#
To truly count Chinese words, you need a segmentation library like jieba:
import * as jieba from 'nodejieba'
function countChineseWords(text: string): number {
  const chineseText = text.match(/[\u4e00-\u9fa5]+/g) || []
  const words = chineseText.map(seg => jieba.cut(seg)).flat()
  return words.length
}
But segmentation introduces additional complexity:
- Performance overhead: Segmentation requires dictionary matching
- Accuracy issues: New words and proper nouns may be segmented incorrectly
- Size issues: Dictionary files are several MB
For online tools, counting Chinese by characters is more practical.
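Where dictionary size is the main obstacle, the built-in Intl.Segmenter is one possible middle ground: it performs word-level segmentation with data that ships with the runtime. The sketch below assumes a runtime that supports it (recent browsers, Node.js 16+). Its output for Chinese depends on that runtime's data and may differ from a dedicated library such as jieba, so the counts should be treated as approximate. countWordsWithSegmenter is a hypothetical name.

```typescript
// Hypothetical helper: counts word-like segments using the runtime's own
// segmentation data. Assumes Intl.Segmenter is available.
function countWordsWithSegmenter(text: string, locale: string = 'zh'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  let count = 0
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments
    if (segment.isWordLike) count++
  }
  return count
}

console.log(countWordsWithSegmenter('Hello, world!', 'en')) // 2
console.log(countWordsWithSegmenter('我喜欢自然语言处理')) // result depends on the runtime's segmentation data
```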
Line Count and Paragraph Count#
Line counting seems straightforward:
const lineCount = text.split('\n').length
But there are edge cases:
1. Empty File#
''.split('\n').length // 1, but actually 0 lines
Correct handling:
const lineCount = text.length === 0 ? 0 : text.split('\n').length
2. Trailing Newline#
'hello\nworld\n'.split('\n').length // 3, but actually 2 lines
Handling:
// A trailing newline terminates the last line; it does not start a new one
const lineCount = text.length === 0
  ? 0
  : text.split('\n').length - (text.endsWith('\n') ? 1 : 0)
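Both edge cases can be folded into one function. The sketch below also accepts Windows (\r\n) and old Mac (\r) line endings, which is an added assumption about the input; countLines is a hypothetical name.

```typescript
// Hypothetical helper: counts lines, treating a trailing line terminator
// as the end of the last line rather than the start of a new one.
function countLines(text: string): number {
  if (text.length === 0) return 0
  const lines = text.split(/\r\n|\r|\n/)
  // A final empty element means the text ended with a line terminator
  if (lines[lines.length - 1] === '') lines.pop()
  return lines.length
}

console.log(countLines('')) // 0
console.log(countLines('hello\nworld')) // 2
console.log(countLines('hello\nworld\n')) // 2
console.log(countLines('a\r\nb\r\nc')) // 3
console.log(countLines('a\n\nb')) // 3, blank lines in the middle still count
```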
Paragraph Counting#
Paragraphs are separated by blank lines:
const paragraphCount = text.trim()
  ? text.split(/\n\s*\n/).filter(p => p.trim()).length
  : 0
\n\s*\n matches blank lines with possible whitespace between two newlines.
Sentence Counting: Punctuation Pitfalls#
Sentences are separated by punctuation:
const sentenceCount = text.split(/[.!?。！？]+/).filter(s => s.trim()).length
But there are many more punctuation marks:
- English: . ! ? ; ... (ellipsis)
- Chinese: 。 ！ ？ ； …… (ellipsis)
- Others: … (U+2026), ⋯ (U+22EF)
A more complete regex:
// Commas and colons are left out: they separate clauses, not sentences
const sentenceEndings = /[.!?;。！？；…⋯]+/g
const sentenceCount = text.split(sentenceEndings).filter(s => s.trim()).length
Decimal Points and Abbreviations#
English . is also used for decimals and abbreviations, causing miscounts:
'The price is $3.14.'.split(/[.!?]+/).filter(s => s.trim()).length
// 2, but actually 1 sentence
Precise handling requires NLP (Natural Language Processing), or simple heuristic rules:
function countSentences(text: string): number {
  // Protect decimal points so "3.14" is not split
  let normalized = text.replace(/(\d)\.(?=\d)/g, '$1DOT')
  // Protect common abbreviations (every occurrence, including the inner dots of "e.g" and "i.e")
  const abbreviations = ['Mr', 'Mrs', 'Dr', 'Prof', 'etc', 'e.g', 'i.e']
  abbreviations.forEach(abbr => {
    const pattern = new RegExp(`\\b${abbr.replace(/\./g, '\\.')}\\.`, 'g')
    normalized = normalized.replace(pattern, `${abbr.replace(/\./g, 'DOT')}DOT`)
  })
  const sentences = normalized.split(/[.!?。！？]+/).filter(s => s.trim())
  return sentences.length
}
Unicode Character Classification#
Counting Chinese, English, numbers, punctuation:
const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
const englishCount = (text.match(/[a-zA-Z]/g) || []).length
const numberCount = (text.match(/\d/g) || []).length
const punctuationCount = (text.match(/[.,!?;:'"()（）。，！？：；、]/g) || []).length
Unicode Range Details#
Chinese Unicode ranges:
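- Basic: \u4e00-\u9fa5 (20,902 common characters; the full CJK Unified Ideographs block runs to \u9fff, 20,992 code points)
- Extension A: \u3400-\u4dbf (6,592 rare characters)
- Extensions B-F: more rare characters, located in the supplementary planes (U+20000 and above), so matching them needs the u flag and \u{...} escapes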
More complete Chinese matching:
const chineseRegex = /[\u4e00-\u9fa5\u3400-\u4dbf\uf900-\ufaff]/g
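The range sizes are plain arithmetic on the code points, and a supplementary-plane character shows where a BMP-only pattern stops. This is a small sketch: rangeSize is a hypothetical helper, and the second pattern adds only Extension B (U+20000 to U+2A6DF) as an illustration, not full coverage of Extensions C-F.

```typescript
// Number of code points in an inclusive range
const rangeSize = (start: number, end: number): number => end - start + 1

console.log(rangeSize(0x4e00, 0x9fa5)) // 20902
console.log(rangeSize(0x4e00, 0x9fff)) // 20992
console.log(rangeSize(0x3400, 0x4dbf)) // 6592

// BMP-only ranges cannot see characters stored as surrogate pairs
const bmpOnly = /[\u4e00-\u9fa5\u3400-\u4dbf\uf900-\ufaff]/
// With the u flag, \u{...} escapes can address the supplementary planes
const withExtensionB = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u{20000}-\u{2a6df}]/u

const rareChar = '\u{20BB7}' // 𠮷, an Extension B character
console.log(bmpOnly.test(rareChar)) // false
console.log(withExtensionB.test(rareChar)) // true
```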
Using Unicode Property Escapes#
ES2018 supports Unicode property escapes for more semantic matching:
const chineseCount = (text.match(/\p{Script=Han}/gu) || []).length
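// \p{Letter} would also match Han characters, so restrict English to the Latin script
const englishCount = (text.match(/\p{Script=Latin}/gu) || []).length
// \p{Number} is broader than \d: it also matches full-width digits, Roman numerals, and superscripts
const numberCount = (text.match(/\p{Number}/gu) || []).length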
const punctuationCount = (text.match(/\p{Punctuation}/gu) || []).length
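One thing worth checking when picking properties: general categories such as \p{Letter} and \p{Number} are script-independent, so \p{Letter} also matches Han characters and \p{Number} matches more than ASCII digits. A small sketch (countMatches is a hypothetical helper) makes the difference visible on a mixed sample:

```typescript
// Hypothetical helper: number of matches of a global pattern in text
function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) || []).length
}

// Two Han characters, three Latin letters, two ASCII digits, two full-width digits
const sample = '中文 abc 12 \uff11\uff12'

console.log(countMatches(sample, /\p{Letter}/gu)) // 5, Han characters are letters too
console.log(countMatches(sample, /\p{Script=Latin}/gu)) // 3
console.log(countMatches(sample, /\p{Script=Han}/gu)) // 2
console.log(countMatches(sample, /\d/g)) // 2, ASCII digits only
console.log(countMatches(sample, /\p{Number}/gu)) // 4, includes the full-width digits
```

Script properties answer "which writing system", general categories answer "what kind of character"; for per-language counts the script property is usually the one that matches the intent.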
But browser compatibility needs attention:
- Chrome 64+, Edge 79+, Firefox 78+, Safari 11.1+
- IE not supported
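If older engines still matter, support can be detected at runtime and the explicit ranges used as a fallback. This is a minimal sketch; supportsUnicodePropertyEscapes and countHan are hypothetical names, and the fallback covers only the BMP ranges discussed earlier.

```typescript
// Returns true when the engine accepts \p{...} with the u flag.
function supportsUnicodePropertyEscapes(): boolean {
  try {
    // Built from a string so an unsupported engine throws here at runtime
    // instead of failing to parse the whole script.
    new RegExp('\\p{Script=Han}', 'u')
    return true
  } catch {
    return false
  }
}

// Property escape when available, explicit BMP ranges otherwise
const hanPattern: RegExp = supportsUnicodePropertyEscapes()
  ? new RegExp('\\p{Script=Han}', 'gu')
  : /[\u4e00-\u9fa5\u3400-\u4dbf\uf900-\ufaff]/g

function countHan(text: string): number {
  return (text.match(hanPattern) || []).length
}

console.log(countHan('中文 and English')) // 2
```

A regex literal such as /\p{Script=Han}/u would be a syntax error in an engine without support, which is why both the probe and the preferred pattern go through the RegExp constructor.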
Performance Optimization: Real-time Statistics#
Real-time statistics during user input requires performance optimization.
1. useMemo Caching#
const stats = useMemo(() => {
  return {
    charCount: text.length,
    wordCount: text.trim() ? text.trim().split(/\s+/).length : 0,
    // ...
  }
}, [text])
Only recalculate when text changes.
2. Debounced Input#
For large text, don’t calculate in real-time during input:
const debouncedText = useDebounce(text, 300)
const stats = useMemo(() => {
  // Calculate using debouncedText
}, [debouncedText])
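useDebounce is not part of React itself; it usually comes from a utility library or a small custom hook. The timer logic behind it, sketched here as a plain function (the name debounce and its signature are assumptions for illustration), looks roughly like this. A hook version would keep the same timer inside an effect and clear it on cleanup.

```typescript
// Hypothetical helper: delays calling fn until delayMs have passed
// without another call; only the arguments of the last call are used.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined
  return (...args: A) => {
    // Each new call cancels the previously scheduled one
    if (timer !== undefined) clearTimeout(timer)
    timer = setTimeout(() => fn(...args), delayMs)
  }
}

// Usage: statistics are recomputed once typing pauses for 300 ms
const updateStats = debounce((text: string) => {
  console.log('chars:', Array.from(text).length)
}, 300)

updateStats('h')
updateStats('he')
updateStats('hello') // only this call triggers a computation
```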
3. Web Worker#
For very large text (>1MB), move computation to a Web Worker:
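// worker.ts
self.onmessage = (e: MessageEvent<string>) => {
  const text = e.data
  const stats = {
    charCount: text.length,
    chineseCount: (text.match(/[\u4e00-\u9fa5]/g) || []).length,
    // ...
  }
  self.postMessage(stats)
}
// main.tsx
// Browsers cannot load .ts files directly; this URL form is resolved by bundlers such as Vite or webpack 5
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
worker.onmessage = (e) => setStats(e.data)
worker.postMessage(largeText)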
Complete Implementation#
Based on the above analysis, here’s the complete statistics function:
function getTextStats(text: string): TextStats {
  // Basic statistics (Array.from counts code points, not UTF-16 code units)
  const charCount = Array.from(text).length
  const charCountNoSpace = Array.from(text.replace(/[\s\u3000]/g, '')).length
  // Word count (Chinese by characters, English by words)
  const chineseCount = (text.match(/[\u4e00-\u9fa5]/g) || []).length
  const englishWords = (text.match(/[a-zA-Z]+/g) || []).length
  const wordCount = chineseCount + englishWords
  // Line count (a trailing newline does not start a new line)
  const lineCount = text.length === 0
    ? 0
    : text.split('\n').length - (text.endsWith('\n') ? 1 : 0)
  // Paragraph count
  const paragraphCount = text.trim()
    ? text.split(/\n\s*\n/).filter(p => p.trim()).length
    : 0
  // Sentence count (terminators: . ! ? ; and their full-width forms, plus the ellipsis character)
  const sentenceCount = text.split(/[.!?;。！？；…]+/).filter(s => s.trim()).length
  // Character classification
  const englishCount = (text.match(/[a-zA-Z]/g) || []).length
  const numberCount = (text.match(/\d/g) || []).length
  const punctuationCount = (text.match(/[.,!?;:'"()（）。，！？：；、…]/g) || []).length
  return {
    charCount,
    charCountNoSpace,
    wordCount,
    lineCount,
    paragraphCount,
    sentenceCount,
    chineseCount,
    englishCount,
    numberCount,
    punctuationCount,
  }
}
Practical Application#
Based on these ideas, I built an online tool: Text Statistics
Key features:
- Real-time statistics for 10 text metrics
- Supports mixed Chinese and English text
- Correctly handles Unicode characters
- Performance optimized for large text
The implementation isn’t complex, but getting each detail right requires careful thought. Hope this helps!