Regex Explainer: Building a Parser from String to Visual Breakdown
Published: April 29, 2026, 22:40
Ever stared at a complex regular expression for five minutes, completely lost? Or inherited someone else’s code with a “cryptic” regex that makes you question your career choices?
Last week, I was building a form validation feature and needed to understand this email validation regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. After staring blankly for five minutes, I thought: why not build a tool that “translates” regex into human-readable explanations?
That’s how the JsonKit Regex Explainer was born.
Lexical Analysis: Tokenizing Regex#
A regular expression is essentially a mini language. The first step in explaining it is lexical analysis—breaking the string into meaningful tokens.
For example, ^\d{2,4}$ should be split into:
- ^ - Start anchor
- \d - Digit character class
- {2,4} - Quantifier
- $ - End anchor
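The {2,4} token needs a small sub-parser of its own. Here is a minimal sketch, assuming a hypothetical helper named readBraceQuantifier (not taken from the original tool) that returns null when the braces do not form a valid quantifier:

```javascript
// Hypothetical helper: reads {n}, {n,} or {n,m} starting at index i.
// Returns null when the text at i is not a well-formed quantifier.
function readBraceQuantifier(pattern, i) {
  if (pattern[i] !== '{') return null
  let j = i + 1

  // Consume a run of digits at j; null means no digits were found
  const readNumber = () => {
    const start = j
    while (j < pattern.length && pattern[j] >= '0' && pattern[j] <= '9') j++
    return j > start ? Number(pattern.slice(start, j)) : null
  }

  const min = readNumber()
  if (min === null) return null // e.g. "{,4}" or "{x}"

  let max = min // "{n}" means exactly n
  if (pattern[j] === ',') {
    j++
    const upper = readNumber()
    max = upper === null ? Infinity : upper // "{n,}" has no upper bound
  }
  if (pattern[j] !== '}') return null

  let description
  if (max === Infinity) description = `Match at least ${min} times`
  else if (max === min) description = `Match exactly ${min} times`
  else description = `Match between ${min} and ${max} times`

  return { text: pattern.slice(i, j + 1), description, next: j + 1 }
}

console.log(readBraceQuantifier('^\\d{2,4}$', 3))
```

Returning next (the index just past the closing brace) lets the caller advance its own cursor without re-scanning the token.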
But there’s a catch: handling escape characters.
The Escape Character Trap#
The backslash \ in regex is a troublemaker. It can be followed by:
- Predefined character classes: \d, \w, \s
- Special character escapes: \., \*, \\
- Invalid but syntactically legal: \a
My initial implementation was:
// Wrong: only looking at current character
if (c === '\\') {
parts.push({ text: '\\', description: 'Escape character' })
i++
}
Result: ^\d+$ was split into ^ + \ + d + + + $, so \d was never recognized as a digit class. The correct approach is to treat the escape character and the following character as a unit:
if (c === '\\') {
  const next = pattern[i + 1]
  if (!next) {
    // String ends with \, syntax error
    parts.push({ text: '\\', description: 'Incomplete escape' })
    i++
    continue
  }
  const escapeMap = {
    'd': 'Match any digit [0-9]',
    'D': 'Match any non-digit',
    'w': 'Match any word character [a-zA-Z0-9_]',
    's': 'Match any whitespace',
    // ... other predefined classes
  }
  if (escapeMap[next]) {
    parts.push({ text: `\\${next}`, description: escapeMap[next] })
  } else {
    parts.push({ text: `\\${next}`, description: `Escaped character "${next}"` })
  }
  i += 2 // Skip both \ and the following character
}
Now ^\d+$ is correctly parsed as ^ + \d + + + $.
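To check the fix in isolation, the fragment can be wrapped in a loop. This is only a sketch: tokenizeEscapes and ESCAPES are hypothetical names, the W and S entries are my own additions, and every non-escape character is lumped into a generic "other" token that the real tool presumably handles in separate branches.

```javascript
// Assumed lookup table; the W and S entries are additions for this sketch
const ESCAPES = new Map([
  ['d', 'Match any digit [0-9]'],
  ['D', 'Match any non-digit'],
  ['w', 'Match any word character [a-zA-Z0-9_]'],
  ['W', 'Match any non-word character'],
  ['s', 'Match any whitespace'],
  ['S', 'Match any non-whitespace'],
])

function tokenizeEscapes(pattern) {
  const parts = []
  let i = 0
  while (i < pattern.length) {
    const c = pattern[i]
    if (c !== '\\') {
      // Anchors, quantifiers, literals: not distinguished in this sketch
      parts.push({ text: c, description: 'Other' })
      i++
      continue
    }
    const next = pattern[i + 1]
    if (next === undefined) {
      // Pattern ends with a lone backslash
      parts.push({ text: '\\', description: 'Incomplete escape' })
      i++
      continue
    }
    const description = ESCAPES.has(next)
      ? ESCAPES.get(next)
      : `Escaped character "${next}"`
    parts.push({ text: `\\${next}`, description })
    i += 2 // Skip both \ and the following character
  }
  return parts
}

console.log(tokenizeEscapes('^\\d+$').map((p) => p.text))
```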
Character Class Boundary Issues#
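Character classes [...] are another tricky syntax. Take [^]]: it looks weird, but in PCRE, Python, and POSIX-style engines it is legal and matches any character except ]. JavaScript is the exception: there [^] is already a complete class that matches any character, so [^]] means “any character followed by a literal ]”.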
When parsing character classes, you need to handle:
- Leading ^ for negation
- Leading ] as a literal character (not the closing bracket)
- \] in the middle as an escaped ]
Here’s my implementation:
if (c === '[') {
  let j = i + 1
  // Handle negation
  if (j < pattern.length && pattern[j] === '^') j++
  // Leading ] is a literal (PCRE/Python/POSIX rule; in JavaScript a leading ] closes the class)
  if (j < pattern.length && pattern[j] === ']') j++
  // Find the closing ]
  while (j < pattern.length && pattern[j] !== ']') {
    if (pattern[j] === '\\') j++ // Skip escaped character
    j++
  }
  if (j < pattern.length) j++ // Include ]
  const charClass = pattern.slice(i, j)
  const isNegated = charClass.startsWith('[^')
  parts.push({
    text: charClass,
    description: isNegated
      ? `Match any character NOT in "${charClass}"`
      : `Match any character in "${charClass}"`,
  })
  i = j
}
The key insight: don’t use regex to parse regex. As meta as it sounds, regex syntax has too many context-dependent edge cases, and a hand-written scanner is easier to control.
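Whether a leading ] counts as a literal depends on the regex flavor: PCRE, Python, and POSIX treat it as a literal, while JavaScript closes the class there. One way to keep the scanner honest about that is to make the rule a parameter. A sketch under that assumption (readCharClass and leadingBracketIsLiteral are hypothetical names):

```javascript
// Hypothetical helper: scans a character class that starts at index i.
// leadingBracketIsLiteral = true follows the PCRE/Python/POSIX rule,
// false follows JavaScript, where "[]" and "[^]" are complete classes.
function readCharClass(pattern, i, leadingBracketIsLiteral = false) {
  let j = i + 1
  const negated = pattern[j] === '^'
  if (negated) j++
  if (leadingBracketIsLiteral && pattern[j] === ']') j++

  while (j < pattern.length && pattern[j] !== ']') {
    if (pattern[j] === '\\') j++ // Skip the escaped character
    j++
  }

  const closed = j < pattern.length // false: the pattern ended before ]
  if (closed) j++ // Include the closing ]
  j = Math.min(j, pattern.length) // A trailing backslash can overshoot

  return { text: pattern.slice(i, j), negated, closed, next: j }
}

console.log(readCharClass('[^]]', 0, true).text)  // whole class
console.log(readCharClass('[^]]', 0, false).text) // stops at the first ]
```

The closed flag gives the caller a way to report an unterminated class instead of silently treating the rest of the pattern as its contents.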
Greedy vs Lazy Quantifiers#
Regex quantifiers have two modes:
- *, +, ? are greedy: match as much as possible
- *?, +?, ?? are lazy: match as little as possible
The ? suffix is easily overlooked. I initially only detected the quantifier itself:
// Wrong: ignoring lazy mode
if (c === '*') {
parts.push({ text: '*', description: 'Match 0 or more times' })
i++
}
Later I found that .*? was explained as . + * + ?, completely wrong. The correct logic:
if (c === '*' || c === '+' || c === '?') {
  const isGreedy = pattern[i + 1] !== '?'
  const suffix = !isGreedy ? '?' : ''
  const descMap = {
    '*': {
      greedy: 'Match 0 or more times (greedy)',
      lazy: 'Match 0 or more times (lazy)',
    },
    '+': {
      greedy: 'Match 1 or more times (greedy)',
      lazy: 'Match 1 or more times (lazy)',
    },
    '?': {
      greedy: 'Match 0 or 1 times (greedy)',
      lazy: 'Match 0 or 1 times (lazy)',
    },
  }
  parts.push({
    text: c + suffix,
    description: isGreedy ? descMap[c].greedy : descMap[c].lazy,
  })
  i += isGreedy ? 1 : 2 // Lazy mode skips the following ?
}
This detail matters because greedy and lazy behave very differently. For example, matching HTML tags:
const html = '<div>content</div>'
// Greedy: matches the whole string
html.match(/<.*>/) // ['<div>content</div>']
// Lazy: matches only the first tag
html.match(/<.*?>/) // ['<div>']
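The ? suffix also applies to brace quantifiers such as {2,4}, which the quantifier code in this section does not cover. A short sketch of how the check could be shared; lazySuffix is a hypothetical helper, and end is assumed to be the index just past any quantifier:

```javascript
// Hypothetical helper: `end` is the index just past a quantifier
// (*, +, ?, or a brace quantifier). Reports whether a lazy ? follows.
function lazySuffix(pattern, end) {
  return pattern[end] === '?'
    ? { lazy: true, next: end + 1 }
    : { lazy: false, next: end }
}

const text = 'aaaa'
const greedy = text.match(/a{2,4}/)[0] // 'aaaa': takes as many as allowed
const lazy = text.match(/a{2,4}?/)[0]  // 'aa': stops at the minimum

console.log(greedy, lazy, lazySuffix('a{2,4}?', 6))
```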
The Many Forms of Groups#
Regex grouping syntax is rich:
- (abc) - Capturing group
- (?:abc) - Non-capturing group
- (?=abc) - Positive lookahead
- (?!abc) - Negative lookahead
When parsing groups, you need to “peek” ahead:
if (c === '(') {
  let desc: string
  if (pattern[i + 1] === '?' && pattern[i + 2] === ':') {
    desc = 'Non-capturing group'
    parts.push({ text: '(?:', description: desc })
    i += 3
  } else if (pattern[i + 1] === '?' && pattern[i + 2] === '=') {
    desc = 'Positive lookahead'
    parts.push({ text: '(?=', description: desc })
    i += 3
  } else if (pattern[i + 1] === '?' && pattern[i + 2] === '!') {
    desc = 'Negative lookahead'
    parts.push({ text: '(?!', description: desc })
    i += 3
  } else {
    desc = 'Start capturing group'
    parts.push({ text: '(', description: desc })
    i++
  }
}
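JavaScript also supports lookbehind, (?<=abc) and (?<!abc), and named groups, (?<name>abc), which the branch chain does not cover. One possible way to extend it is a prefix table, sketched here with the hypothetical names GROUP_PREFIXES and readGroupStart; the caller is assumed to have already checked that pattern[i] is "(".

```javascript
// Checked in order; lookbehind prefixes come before the named-group test
// because all three start with "(?<".
const GROUP_PREFIXES = [
  ['(?<=', 'Positive lookbehind'],
  ['(?<!', 'Negative lookbehind'],
  ['(?:', 'Non-capturing group'],
  ['(?=', 'Positive lookahead'],
  ['(?!', 'Negative lookahead'],
]

// Hypothetical helper: assumes pattern[i] === '('
function readGroupStart(pattern, i) {
  for (const [prefix, description] of GROUP_PREFIXES) {
    if (pattern.startsWith(prefix, i)) {
      return { text: prefix, description, next: i + prefix.length }
    }
  }
  // Named group: (?<name>
  if (pattern.startsWith('(?<', i)) {
    const close = pattern.indexOf('>', i + 3)
    if (close !== -1) {
      const name = pattern.slice(i + 3, close)
      return {
        text: pattern.slice(i, close + 1),
        description: `Start capturing group named "${name}"`,
        next: close + 1,
      }
    }
  }
  return { text: '(', description: 'Start capturing group', next: i + 1 }
}

console.log(readGroupStart('(?<year>\\d{4})', 0))
```

A table keeps each new group form to a single line, at the cost of a linear scan that is negligible for five entries.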
Lookaheads are powerful for validation. For example, password strength:
// At least one uppercase, one lowercase, one digit
const passwordRegex = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/
// Explanation:
// (?=.*[A-Z]) - Positive lookahead: must have uppercase
// (?=.*[a-z]) - Positive lookahead: must have lowercase
// (?=.*\d) - Positive lookahead: must have digit
// .{8,} - At least 8 characters
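A quick way to see what each lookahead contributes is to run the pattern against a few strings and compare it with the same rules checked one at a time. The sample passwords and the failedRules helper are made up for this illustration:

```javascript
const passwordRegex = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/

// Hypothetical helper: the same four rules, checked one by one
function failedRules(password) {
  const failed = []
  if (!/[A-Z]/.test(password)) failed.push('uppercase')
  if (!/[a-z]/.test(password)) failed.push('lowercase')
  if (!/\d/.test(password)) failed.push('digit')
  if (password.length < 8) failed.push('length')
  return failed
}

const samples = ['Passw0rd', 'password1', 'PASSWORD1', 'Passwords', 'Pw0rd']
for (const sample of samples) {
  // The regex passes exactly when no individual rule fails
  console.log(sample, passwordRegex.test(sample), failedRules(sample))
}
```

Only the first sample passes; each of the others fails exactly one rule, which is the one its lookahead (or the length check) enforces.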
Performance: Pre-compiled Colors#
A detail in the explainer: each token needs a different color for distinction. I initially calculated colors dynamically during render:
// Inefficient: calculates every render
const getColor = (index) => {
const colors = ['cyan', 'purple', 'green', 'amber', 'pink', 'blue']
return colors[index % colors.length]
}
I later switched to assigning each color once, while parsing:
const colorCycle = [
'text-cyan-400', 'text-purple-400', 'text-green-400',
'text-amber-400', 'text-pink-400', 'text-blue-400',
]
let colorIdx = 0
const nextColor = () => colorCycle[colorIdx++ % colorCycle.length]
// Assign during parsing
parts.push({ text: '\\d', color: nextColor(), description: '...' })
Though the performance gain is minimal (regex patterns are usually short), the code is cleaner.
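One caveat: if colorIdx lives at module level, it keeps counting across parses, so the same pattern can come out in different colors the second time. A sketch that wraps the counter in a factory so each parse starts from the first color (createColorCycle is a hypothetical name, not part of the original tool):

```javascript
// Hypothetical factory: each call returns an independent color cycler
function createColorCycle(colors) {
  let idx = 0
  return () => colors[idx++ % colors.length]
}

const colorCycle = [
  'text-cyan-400', 'text-purple-400', 'text-green-400',
  'text-amber-400', 'text-pink-400', 'text-blue-400',
]

// One cycler per parse keeps the output deterministic
const nextColor = createColorCycle(colorCycle)
console.log(nextColor(), nextColor())
```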
Real-World Applications#
This explainer is genuinely useful in practice:
- Debugging regex: Wrote a complex regex? Explain it to verify
- Learning regex: Beginners can understand syntax through explanations
- Code review: Colleague’s regex too complex? Generate documentation
Recently, I was building a log parser and needed to match Nginx log format:
^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"$
The explainer instantly clarified:
- (\S+) - Matches IP address
- \[([^\]]+)\] - Matches timestamp
- "(\S+) (\S+) (\S+)" - Matches method, path, protocol
- (\d+) (\d+) - Matches status code and response size
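To see the capture groups in action, the pattern can be applied to a log line and the groups pulled out by position. The sample line below is made up for illustration, and parseLogLine is a hypothetical helper:

```javascript
const nginxLine =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"$/

// Hypothetical helper: returns null when the line does not match
function parseLogLine(line) {
  const m = nginxLine.exec(line)
  if (m === null) return null
  // Groups 2 and 3 are skipped; group 0 is the whole match
  const [, ip, , , timestamp, method, path, protocol, status, size] = m
  return {
    ip, timestamp, method, path, protocol,
    status: Number(status),
    size: Number(size),
  }
}

// Made-up sample line, not real traffic
const sample =
  '203.0.113.7 - - [29/Apr/2026:22:40:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "curl/8.0"'
console.log(parseLogLine(sample))
```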
Conclusion#
Building a regex explainer is about correctly handling edge cases in syntax:
- Escape characters must be processed with the following character
- Character classes have special rules for ] and ^
- Quantifiers have greedy/lazy modes to distinguish
- Groups have multiple forms requiring peek-ahead
This simple tool makes a dense regex much easier to read. Next time you encounter a “cryptic” regex, try the JsonKit Regex Explainer.
Related Tools:
- Regex Tester - Real-time regex matching
- Regex Cheat Sheet - Quick syntax reference
- Text Replace - Batch replacement with regex