Regex Explainer: Building a Parser from String to Visual Breakdown

Published: April 29, 2026, 22:40

Ever stared at a complex regular expression for five minutes, completely lost? Or inherited someone else’s code with a “cryptic” regex that makes you question your career choices?

Last week, I was building a form validation feature and needed to understand this email validation regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. After staring blankly for five minutes, I thought: why not build a tool that “translates” regex into human-readable explanations?

That’s how the JsonKit Regex Explainer was born.

Lexical Analysis: Tokenizing Regex

A regular expression is essentially a mini language. The first step in explaining it is lexical analysis—breaking the string into meaningful tokens.

For example, ^\d{2,4}$ should be split into:

  • ^ - Start anchor
  • \d - Digit character class
  • {2,4} - Quantifier
  • $ - End anchor
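
To make the tokenizing step concrete, here is a minimal sketch of a driver loop that produces exactly those four tokens. The names RegexToken, tokenize, and braceQuantifierEnd are assumptions made for illustration, not the explainer's actual code, and the loop already pairs a backslash with the character after it.

```typescript
// Hypothetical token shape: the text that was consumed plus a human-readable label.
interface RegexToken {
  text: string
  description: string
}

// Returns the index just past a {n}, {n,} or {n,m} quantifier that starts at
// `start`, or -1 when the braces do not form a valid quantifier.
function braceQuantifierEnd(pattern: string, start: number): number {
  const isDigit = (ch: string) => ch >= '0' && ch <= '9'
  let j = start + 1
  const firstDigit = j
  while (isDigit(pattern.charAt(j))) j++
  if (j === firstDigit) return -1 // at least one digit must follow {
  if (pattern.charAt(j) === ',') {
    j++
    while (isDigit(pattern.charAt(j))) j++
  }
  return pattern.charAt(j) === '}' ? j + 1 : -1
}

// Minimal driver loop: anchors, two-character escapes, brace quantifiers, literals.
function tokenize(pattern: string): RegexToken[] {
  const parts: RegexToken[] = []
  let i = 0
  while (i < pattern.length) {
    const c = pattern.charAt(i)
    if (c === '^' || c === '$') {
      parts.push({ text: c, description: c === '^' ? 'Start anchor' : 'End anchor' })
      i++
    } else if (c === '\\' && i + 1 < pattern.length) {
      // The backslash and the next character form one token.
      parts.push({ text: pattern.slice(i, i + 2), description: 'Escape sequence' })
      i += 2
    } else {
      const end = c === '{' ? braceQuantifierEnd(pattern, i) : -1
      if (end !== -1) {
        parts.push({ text: pattern.slice(i, end), description: 'Quantifier' })
        i = end
      } else {
        // Everything else (including a trailing lone backslash) is a literal here.
        parts.push({ text: c, description: 'Literal character' })
        i++
      }
    }
  }
  return parts
}
```

Calling tokenize('^\\d{2,4}$') yields the texts ^, \d, {2,4}, and $ in that order. A malformed brace such as a{x} falls back to four literal tokens.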

But there’s a catch: handling escape characters.

The Escape Character Trap

The backslash \ in regex is a troublemaker. It can be followed by:

  • Predefined character classes: \d, \w, \s
  • Special character escapes: \., \*, \\
  • Identity escapes with no special meaning: \a (accepted by JavaScript without the u flag, where it simply matches a literal a)

My initial implementation was:

// Wrong: only looking at current character
if (c === '\\') {
  parts.push({ text: '\\', description: 'Escape character' })
  i++
}

Result: ^\d+$ was tokenized as ^ + \ + d + + + $, with the backslash split from the d it modifies. The correct approach is to treat the backslash and the character after it as a single unit:

if (c === '\\') {
  const next = pattern[i + 1]
  if (!next) {
    // String ends with \, syntax error
    parts.push({ text: '\\', description: 'Incomplete escape' })
    i++
    continue
  }
  
  const escapeMap = {
    'd': 'Match any digit [0-9]',
    'D': 'Match any non-digit',
    'w': 'Match any word character [a-zA-Z0-9_]',
    's': 'Match any whitespace',
    // ... other predefined classes
  }
  
  if (escapeMap[next]) {
    parts.push({ text: `\\${next}`, description: escapeMap[next] })
  } else {
    parts.push({ text: `\\${next}`, description: `Escaped character "${next}"` })
  }
  i += 2  // Skip both \ and the following character
}

Now ^\d+$ is correctly parsed as ^ + \d + + + $.
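
The same logic can be packaged as a standalone function that is easy to test in isolation. This is a sketch under assumptions: readEscape, EscapeResult, and the extra map entries (\W, \S, \b, \B, \n, \t) are illustrative additions, not the explainer's actual code.

```typescript
// Descriptions for common JavaScript escapes; anything else is reported as a plain escaped character.
const ESCAPE_DESCRIPTIONS: Record<string, string> = {
  d: 'Match any digit [0-9]',
  D: 'Match any non-digit',
  w: 'Match any word character [a-zA-Z0-9_]',
  W: 'Match any non-word character',
  s: 'Match any whitespace',
  S: 'Match any non-whitespace',
  b: 'Word boundary',
  B: 'Non-word boundary',
  n: 'Newline',
  t: 'Tab',
}

interface EscapeResult {
  text: string
  description: string
  next: number // index of the first character after this token
}

// Hypothetical helper: reads the escape starting at pattern[i], which must be a backslash.
function readEscape(pattern: string, i: number): EscapeResult {
  const following = pattern.charAt(i + 1) // '' when the pattern ends with a backslash
  if (following === '') {
    return { text: '\\', description: 'Incomplete escape', next: i + 1 }
  }
  const known = ESCAPE_DESCRIPTIONS[following]
  return {
    text: '\\' + following,
    description: known !== undefined ? known : `Escaped character "${following}"`,
    next: i + 2, // skip both the backslash and the character after it
  }
}
```

Returning the next index instead of mutating a shared counter keeps the function free of side effects, so the calling loop decides how to advance.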

Character Class Boundary Issues

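Character classes [...] are another tricky piece of syntax. Take [^]]: it looks weird, but in PCRE, Python, and POSIX flavors it is legal and matches any character except ]. JavaScript differs: there, [^] is already a complete class that matches any character, and [] matches nothing, so the leading-] rule in the list that follows applies only when explaining the other flavors.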

When parsing character classes, you need to handle:

  1. Leading ^ for negation
  2. Leading ] as a literal character (not the closing bracket)
  3. \] in the middle as an escaped ]

Here’s my implementation:

if (c === '[') {
  let j = i + 1
  
  // Handle negation
  if (j < pattern.length && pattern[j] === '^') j++
  
  // Leading ] is a literal
  if (j < pattern.length && pattern[j] === ']') j++
  
  // Find the closing ]
  while (j < pattern.length && pattern[j] !== ']') {
    if (pattern[j] === '\\') j++  // Skip escaped character
    j++
  }
  
  if (j < pattern.length) j++  // Include ]
  
  const charClass = pattern.slice(i, j)
  const isNegated = charClass.startsWith('[^')
  
  parts.push({
    text: charClass,
    description: isNegated
      ? `Match any character NOT in "${charClass}"`
      : `Match any character in "${charClass}"`,
  })
  i = j
}

The key insight: don’t use regex to parse regex. As meta as it sounds, regex syntax has too many context-dependent edge cases (the same ] can close a class or be a literal), so a hand-written scanner is easier to control.
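
To show how the boundary rules interact, here is a sketch of the scan as a standalone function. The name scanCharClass and the leadingBracketIsLiteral flag are assumptions for illustration: the flag exists because flavors disagree, with PCRE and Python treating a leading ] as a literal while JavaScript closes the class there.

```typescript
// Returns the index just past the character class that opens at pattern[start] === '['.
// leadingBracketIsLiteral: true mimics PCRE/Python, false mimics JavaScript.
function scanCharClass(pattern: string, start: number, leadingBracketIsLiteral: boolean): number {
  let j = start + 1
  if (pattern.charAt(j) === '^') j++ // negation marker
  if (leadingBracketIsLiteral && pattern.charAt(j) === ']') j++ // literal ], not the end
  while (j < pattern.length && pattern.charAt(j) !== ']') {
    if (pattern.charAt(j) === '\\') j++ // skip the escaped character as well
    j++
  }
  // Include the closing ] when there is one; an unterminated class runs to the end.
  return Math.min(j + 1, pattern.length)
}
```

With the flag set, scanning [^]] from index 0 consumes all four characters. With it cleared, the scan stops after [^] and leaves the final ] to be reported as a literal.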

Greedy vs Lazy Quantifiers

Regex quantifiers have two modes:

  • *, +, ? are greedy—match as much as possible
  • *?, +?, ?? are lazy—match as little as possible

The ? suffix is easily overlooked. I initially only detected the quantifier itself:

// Wrong: ignoring lazy mode
if (c === '*') {
  parts.push({ text: '*', description: 'Match 0 or more times' })
  i++
}

Later I found that .*? was explained as . + * + ?, with the trailing ? reported as a separate “0 or 1” quantifier. The correct logic peeks at the next character:

if (c === '*' || c === '+' || c === '?') {
  const isGreedy = pattern[i + 1] !== '?'
  const suffix = !isGreedy ? '?' : ''
  
  const descMap = {
    '*': {
      greedy: 'Match 0 or more times (greedy)',
      lazy: 'Match 0 or more times (lazy)',
    },
    '+': {
      greedy: 'Match 1 or more times (greedy)',
      lazy: 'Match 1 or more times (lazy)',
    },
    '?': {
      greedy: 'Match 0 or 1 times (greedy)',
      lazy: 'Match 0 or 1 times (lazy)',
    },
  }
  
  parts.push({
    text: c + suffix,
    description: isGreedy ? descMap[c].greedy : descMap[c].lazy,
  })
  i += isGreedy ? 1 : 2  // Lazy mode skips the following ?
}

This detail matters because greedy and lazy behave very differently. For example, matching HTML tags:

const html = '<div>content</div>'

// Greedy: matches the whole string
html.match(/<.*>/)  // ['<div>content</div>']

// Lazy: matches only the first tag
html.match(/<.*?>/)  // ['<div>']
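
The difference is even more visible when collecting every match. A small sketch (extractTags is a hypothetical helper name) that uses the g flag with String.prototype.match:

```typescript
// Collects every substring matched by the greedy or lazy tag pattern.
function extractTags(source: string, lazy: boolean): string[] {
  const tagPattern = lazy ? /<.*?>/g : /<.*>/g
  // With the g flag, match() returns all matched substrings, or null when there are none.
  const found = source.match(tagPattern)
  return found === null ? [] : found.slice()
}

const lazyTags = extractTags('<div>content</div>', true)    // ['<div>', '</div>']
const greedyTags = extractTags('<div>content</div>', false) // ['<div>content</div>']
```

The lazy pattern finds both tags separately, while the greedy one swallows everything between the first < and the last >.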

The Many Forms of Groups

Regex grouping syntax is rich:

  • (abc) - Capturing group
  • (?:abc) - Non-capturing group
  • (?=abc) - Positive lookahead
  • (?!abc) - Negative lookahead

When parsing groups, you need to “peek” ahead:

if (c === '(') {
  let desc: string
  
  if (pattern[i + 1] === '?' && pattern[i + 2] === ':') {
    desc = 'Non-capturing group'
    parts.push({ text: '(?:', description: desc })
    i += 3
  } else if (pattern[i + 1] === '?' && pattern[i + 2] === '=') {
    desc = 'Positive lookahead'
    parts.push({ text: '(?=', description: desc })
    i += 3
  } else if (pattern[i + 1] === '?' && pattern[i + 2] === '!') {
    desc = 'Negative lookahead'
    parts.push({ text: '(?!', description: desc })
    i += 3
  } else {
    desc = 'Start capturing group'
    parts.push({ text: '(', description: desc })
    i++
  }
}
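
The branch chain grows quickly once more group forms are supported. One possible alternative, sketched here under assumptions, is a lookup table tried longest-prefix first. The names GROUP_OPENERS and readGroupOpener are hypothetical, and the lookbehind and named-group entries cover ES2018 syntax that the branches above would misread as a plain capturing group followed by a ? quantifier.

```typescript
// Fixed-text openers; the four-character ones come first so '(?<=' is tried before shorter prefixes.
const GROUP_OPENERS: Array<[string, string]> = [
  ['(?<=', 'Positive lookbehind'],
  ['(?<!', 'Negative lookbehind'],
  ['(?:', 'Non-capturing group'],
  ['(?=', 'Positive lookahead'],
  ['(?!', 'Negative lookahead'],
]

interface GroupOpener {
  text: string
  description: string
}

// Hypothetical helper: describes the group that opens at pattern[i] === '('.
function readGroupOpener(pattern: string, i: number): GroupOpener {
  for (const [prefix, description] of GROUP_OPENERS) {
    if (pattern.slice(i, i + prefix.length) === prefix) {
      return { text: prefix, description }
    }
  }
  // Named group (?<name>...): consume everything up to the > that ends the name.
  if (pattern.slice(i, i + 3) === '(?<') {
    const close = pattern.indexOf('>', i)
    if (close !== -1) {
      const groupName = pattern.slice(i + 3, close)
      return {
        text: pattern.slice(i, close + 1),
        description: `Start capturing group named "${groupName}"`,
      }
    }
  }
  return { text: '(', description: 'Start capturing group' }
}
```

The caller advances by the length of the returned text, so adding a new group form only means adding a table row.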

Lookaheads are powerful for validation. For example, password strength:

// At least one uppercase, one lowercase, one digit
const passwordRegex = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/

// Explanation:
// (?=.*[A-Z])  - Positive lookahead: must have uppercase
// (?=.*[a-z])  - Positive lookahead: must have lowercase
// (?=.*\d)     - Positive lookahead: must have digit
// .{8,}        - At least 8 characters
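
A quick way to check that reading is to run the pattern against a few candidates. The sample passwords are made up for illustration, and checkPassword is a hypothetical wrapper name.

```typescript
const passwordRegex = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/

// Thin wrapper so the rule can be reused and tested in one place.
function checkPassword(candidate: string): boolean {
  return passwordRegex.test(candidate)
}

checkPassword('Passw0rdX') // true: uppercase, lowercase, digit, 9 characters
checkPassword('password1') // false: no uppercase letter
checkPassword('PASSWORD1') // false: no lowercase letter
checkPassword('Password')  // false: no digit
checkPassword('Pw0rd')     // false: only 5 characters
```

Each lookahead starts from the beginning of the string and consumes nothing, which is why the three conditions can be satisfied in any order.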

Performance: Pre-compiled Colors

One detail in the explainer: adjacent tokens need different colors so they can be told apart. I initially computed the color on every render:

// Inefficient: calculates every render
const getColor = (index) => {
  const colors = ['cyan', 'purple', 'green', 'amber', 'pink', 'blue']
  return colors[index % colors.length]
}

I later switched to assigning each token’s color once, at parse time:

const colorCycle = [
  'text-cyan-400', 'text-purple-400', 'text-green-400',
  'text-amber-400', 'text-pink-400', 'text-blue-400',
]
let colorIdx = 0
const nextColor = () => colorCycle[colorIdx++ % colorCycle.length]

// Assign during parsing
parts.push({ text: '\\d', color: nextColor(), description: '...' })

Though the performance gain is minimal (regex patterns are usually short), the code is cleaner.
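
One thing to watch with a shared counter is that it keeps counting across parses, so the second pattern explained would not start at the first color. A sketch of one way around that, assuming a hypothetical createColorPicker factory that gives each parse its own counter:

```typescript
const COLOR_CYCLE = [
  'text-cyan-400', 'text-purple-400', 'text-green-400',
  'text-amber-400', 'text-pink-400', 'text-blue-400',
]

// Each call returns a fresh picker whose counter starts at zero.
function createColorPicker(): () => string {
  let colorIdx = 0
  return () => COLOR_CYCLE[colorIdx++ % COLOR_CYCLE.length]
}

// Hypothetical helper: pairs each token text with the next color in the cycle.
function assignColors(texts: string[]): Array<{ text: string; color: string }> {
  const nextColor = createColorPicker()
  return texts.map((text) => ({ text, color: nextColor() }))
}
```

Because the counter lives in a closure, explaining the same pattern twice always produces the same colors.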

Real-World Applications

This explainer is genuinely useful in practice:

  1. Debugging regex: Wrote a complex regex? Explain it to verify
  2. Learning regex: Beginners can understand syntax through explanations
  3. Code review: Colleague’s regex too complex? Generate documentation

Recently, I was building a log parser and needed to match the Nginx access log format:

^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"$

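The explainer made the structure clear at a glance:

  • ^(\S+) - The first group captures the client IP address
  • (\S+) (\S+) - The next two capture the identity field (a literal - in Nginx) and the remote user
  • \[([^\]]+)\] - Captures the timestamp between the square brackets
  • "(\S+) (\S+) (\S+)" - Captures method, path, and protocol
  • (\d+) (\d+) - Captures status code and response size
  • "([^"]*)" "([^"]*)" - Captures referer and user agent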
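
Once the groups are understood, mapping them to named fields is mechanical. The sketch below is an illustration only: parseAccessLogLine, the AccessLogEntry field names, and the sample line are all assumptions, not taken from the original log parser.

```typescript
const NGINX_COMBINED =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"$/

interface AccessLogEntry {
  ip: string
  timestamp: string
  method: string
  path: string
  protocol: string
  statusCode: number
  bytesSent: number
  referer: string
  userAgent: string
}

// Returns null when the line does not match the expected format.
function parseAccessLogLine(line: string): AccessLogEntry | null {
  const m = NGINX_COMBINED.exec(line)
  if (m === null) return null
  return {
    ip: m[1], // groups 2 and 3 (identity and remote user) are skipped here
    timestamp: m[4],
    method: m[5],
    path: m[6],
    protocol: m[7],
    statusCode: Number(m[8]),
    bytesSent: Number(m[9]),
    referer: m[10],
    userAgent: m[11],
  }
}

// Made-up sample line in the combined format.
const SAMPLE_LINE =
  '127.0.0.1 - - [29/Apr/2026:22:40:00 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "curl/8.0.1"'
const sampleEntry = parseAccessLogLine(SAMPLE_LINE)
```

For the sample line, sampleEntry has ip 127.0.0.1, method GET, path /index.html, statusCode 200, and bytesSent 612. Note that the pattern requires the request to have exactly three space-separated parts, so malformed requests return null rather than a partial entry.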

Conclusion

Building a regex explainer is about correctly handling edge cases in syntax:

  • Escape characters must be processed with the following character
  • Character classes have special rules for ] and ^
  • Quantifiers come in greedy and lazy forms that must be told apart
  • Groups have multiple forms requiring peek-ahead

A tool this simple makes an unfamiliar regex much easier to read. Next time you encounter a “cryptic” regex, try the JsonKit Regex Explainer.
