From Regex to AST: Building a Markdown to HTML Parser

I needed a Markdown to HTML converter for a developer toolbox project. My first thought was to use marked.js, but then I realized — the core logic isn’t that complex. Why not implement it myself and learn how parsers work?

Why Not Use an Existing Library?

marked.js is powerful, but has some drawbacks:

  1. Large bundle size: 40KB minified, overkill for basic features
  2. Over-engineered: Supports GFM, tables, footnotes — I just need headings, lists, code blocks
  3. Black box: Hard to debug when something goes wrong

A custom implementation is under 100 lines of core code, and you learn parser fundamentals.

The Regex Approach: Quick and Dirty

The most direct approach is regex replacement:

function markdownToHtml(markdown: string): string {
  return markdown
    // Code blocks (must be first to avoid inner content being matched)
    .replace(/```(\w*)\n([\s\S]*?)```/g, '<pre><code class="language-$1">$2</code></pre>')
    // Inline code
    .replace(/`([^`]+)`/g, '<code>$1</code>')
    // Headings (the \s after the hashes already keeps # from matching ##, so the order is only for readability)
    .replace(/^######\s(.+)$/gm, '<h6>$1</h6>')
    .replace(/^#####\s(.+)$/gm, '<h5>$1</h5>')
    .replace(/^####\s(.+)$/gm, '<h4>$1</h4>')
    .replace(/^###\s(.+)$/gm, '<h3>$1</h3>')
    .replace(/^##\s(.+)$/gm, '<h2>$1</h2>')
    .replace(/^#\s(.+)$/gm, '<h1>$1</h1>')
    // Bold
    .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
    // Italic
    .replace(/\*([^*]+)\*/g, '<em>$1</em>')
    // Links
    .replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>')
    // Line breaks
    .replace(/\n/g, '<br>')
}

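This approach has a fatal flaw: regex has no state. Each replacement pass sees only a flat string, so it cannot tell whether it is inside a code block or inside another element.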

The Regex Pitfalls

1. Nested Matching Fails

**bold with *italic* inside**

Expected: <strong>bold with <em>italic</em> inside</strong>

Actual: *<em>bold with </em>italic<em> inside</em>*

The bold pattern never matches, because [^*]+ cannot cross the inner *. The italic pattern then pairs up the leftover asterisks and wraps the wrong spans.
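
To check what the two emphasis rules really do with this input, a small snippet like the following can be run. It is only a sketch: applyEmphasis is a name invented here, and it contains just the bold and italic replacements from the function above.

```typescript
// Hypothetical helper: only the bold and italic rules from markdownToHtml above.
function applyEmphasis(markdown: string): string {
  return markdown
    .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
    .replace(/\*([^*]+)\*/g, '<em>$1</em>')
}

const nested = '**bold with *italic* inside**'
console.log(applyEmphasis(nested))
// prints: *<em>bold with </em>italic<em> inside</em>*

// Without nesting, the same rules behave as intended.
console.log(applyEmphasis('**plain bold**'))
// prints: <strong>plain bold</strong>
```

The failure comes from the [^*]+ group: it was written to stop bold from running past its closing marker, but that same restriction rules out any asterisk inside bold text.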

2. Code Block Content Gets Polluted

This should not be processed as bold

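The later replacements run over the whole string, including the <pre><code> output of the first step. Any ** inside a code block is turned into <strong> tags and every newline becomes <br>, which corrupts the code and breaks syntax highlighting.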

The solution is to extract code blocks first, then restore them after processing:

function markdownToHtml(markdown: string): string {
  const codeBlocks: string[] = []
  
  // Extract code blocks, replace with placeholders
  let result = markdown.replace(/```(\w*)\n([\s\S]*?)```/g, (_, lang, code) => {
    const index = codeBlocks.length
    codeBlocks.push(`<pre><code class="language-${lang}">${escapeHtml(code)}</code></pre>`)
    return `{{CODE_BLOCK_${index}}}`
  })
  
  // Process other Markdown syntax
  result = result
    .replace(/^###\s(.+)$/gm, '<h3>$1</h3>')
    .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
    // ...
  
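  // Restore code blocks. A replacer function is used so that "$" sequences
  // in the code (such as $& or $1) are not read as replacement patterns.
  codeBlocks.forEach((block, i) => {
    result = result.replace(`{{CODE_BLOCK_${i}}}`, () => block)
  })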
  
  return result
}
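
The function above calls escapeHtml without showing it. A minimal version could look like the sketch below. This is an assumption about what the helper does (escaping the five characters that matter in HTML text and attribute values), not the original implementation, and HTML_ESCAPES is a name made up for the example.

```typescript
// Assumed implementation of the escapeHtml helper; the article does not show it.
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
}

function escapeHtml(text: string): string {
  // A single pass, so the "&" produced by one escape is never escaped again.
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch)
}

console.log(escapeHtml('<a href="x">Tom & Jerry</a>'))
// prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```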

3. List Processing is Complex

Markdown lists can be nested:

- Item 1
  - Subitem 1.1
  - Subitem 1.2
- Item 2

Regex struggles with this hierarchy. You need a stack or recursion.
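
One way to use a stack for this is sketched below. It assumes the list lines have already been reduced to indent/content pairs and that content is already safe HTML. ListItem and renderNestedList are names invented for this example.

```typescript
// Hypothetical input shape: one entry per list line.
type ListItem = { indent: number; content: string }

function renderNestedList(items: ListItem[]): string {
  let html = ''
  const stack: number[] = [] // indent of every <ul> that is currently open
  const top = (): number => stack[stack.length - 1] ?? -1

  for (const item of items) {
    // Leaving one or more deeper levels: close their last <li> and the <ul>.
    while (stack.length > 0 && item.indent < top()) {
      html += '</li></ul>'
      stack.pop()
    }
    if (item.indent > top()) {
      // Deeper than the current level: open a nested list inside the open <li>.
      html += '<ul>'
      stack.push(item.indent)
    } else {
      // Same level: close the previous sibling.
      html += '</li>'
    }
    html += `<li>${item.content}` // content is assumed to be escaped already
  }

  // Close whatever is still open.
  while (stack.length > 0) {
    html += '</li></ul>'
    stack.pop()
  }
  return html
}

console.log(renderNestedList([
  { indent: 0, content: 'Item 1' },
  { indent: 2, content: 'Subitem 1.1' },
  { indent: 2, content: 'Subitem 1.2' },
  { indent: 0, content: 'Item 2' },
]))
// prints: <ul><li>Item 1<ul><li>Subitem 1.1</li><li>Subitem 1.2</li></ul></li><li>Item 2</li></ul>
```

The stack holds indent widths, not depth counters, so the sketch does not care whether the author indents by two or four spaces.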

A Better Approach: Tokenize + AST

Regex only works for simple cases. To handle nesting and complex structures, you need Tokenize → AST → HTML.

Step 1: Lexical Analysis (Tokenize)

Split Markdown text into tokens:

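type Token = {
  type: 'heading' | 'paragraph' | 'code' | 'list' | 'text'
  content?: string
  level?: number   // heading level (1 to 6)
  lang?: string    // language tag of a code block
  indent?: number  // leading whitespace width of a list item
  children?: Token[]
}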

function tokenize(markdown: string): Token[] {
  const tokens: Token[] = []
  const lines = markdown.split('\n')
  
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i]
    
    // Heading
    const headingMatch = line.match(/^(#{1,6})\s+(.+)$/)
    if (headingMatch) {
      tokens.push({
        type: 'heading',
        level: headingMatch[1].length,
        content: headingMatch[2]
      })
      continue
    }
    
    // Code block start
    if (line.startsWith('```')) {
      const lang = line.slice(3)
      let code = ''
      i++
      while (i < lines.length && !lines[i].startsWith('```')) {
        code += lines[i] + '\n'
        i++
      }
      tokens.push({ type: 'code', content: code, lang })
      continue
    }
    
    // List
    const listMatch = line.match(/^(\s*)[-*+]\s+(.+)$/)
    if (listMatch) {
      tokens.push({
        type: 'list',
        content: listMatch[2],
        indent: listMatch[1].length
      })
      continue
    }
    
    // Plain text
    if (line.trim()) {
      tokens.push({ type: 'paragraph', content: line })
    }
  }
  
  return tokens
}

Step 2: Build AST

Convert tokens into a tree structure:

interface ASTNode {
  type: string
  children?: ASTNode[]
  content?: string
  href?: string  // link target, set on 'a' nodes only
}

function buildAST(tokens: Token[]): ASTNode[] {
  const ast: ASTNode[] = []
  
  for (const token of tokens) {
    switch (token.type) {
      case 'heading':
        ast.push({
          type: `h${token.level}`,
          children: parseInline(token.content || '')
        })
        break
      case 'code':
        ast.push({
          type: 'pre',
          children: [{ type: 'code', content: token.content }]
        })
        break
      // ... other types
    }
  }
  
  return ast
}

// Parse inline elements (bold and links here; italic is not handled in this excerpt)
function parseInline(text: string): ASTNode[] {
  const nodes: ASTNode[] = []
  let remaining = text
  
  while (remaining) {
    // Bold
    const boldMatch = remaining.match(/\*\*([^*]+)\*\*/)
    if (boldMatch) {
      const before = remaining.slice(0, boldMatch.index)
      // Text before the bold may still contain links, so parse it too
      if (before) nodes.push(...parseInline(before))
      nodes.push({ type: 'strong', children: parseInline(boldMatch[1]) })
      remaining = remaining.slice(boldMatch.index! + boldMatch[0].length)
      continue
    }
    
    // Link
    const linkMatch = remaining.match(/\[([^\]]+)\]\(([^)]+)\)/)
    if (linkMatch) {
      const before = remaining.slice(0, linkMatch.index)
      if (before) nodes.push({ type: 'text', content: before })
      nodes.push({ type: 'a', href: linkMatch[2], children: parseInline(linkMatch[1]) })
      remaining = remaining.slice(linkMatch.index! + linkMatch[0].length)
      continue
    }
    
    // No special element matched, treat as plain text
    nodes.push({ type: 'text', content: remaining })
    break
  }
  
  return nodes
}
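
For the nested emphasis case that broke the regex version, a recursive inline parser could look like the sketch below. It is a simplified illustration and not part of the converter above: InlineNode, findClosingStar, parseEmphasis and renderInline are names invented here, only * and ** are handled, and edge cases such as *** or escaped asterisks are ignored.

```typescript
// Minimal node shape for this sketch (a trimmed-down ASTNode).
type InlineNode =
  | { type: 'text'; content: string }
  | { type: 'strong' | 'em'; children: InlineNode[] }

// Find the closing single '*' at or after `from`, stepping over any '**' pairs.
function findClosingStar(text: string, from: number): number {
  let j = from
  while (j < text.length) {
    if (text.startsWith('**', j)) j += 2
    else if (text.charAt(j) === '*') return j
    else j++
  }
  return -1
}

function parseEmphasis(text: string): InlineNode[] {
  const nodes: InlineNode[] = []
  let plain = '' // text collected since the last emphasis node
  let i = 0

  const flush = () => {
    if (plain) nodes.push({ type: 'text', content: plain })
    plain = ''
  }

  while (i < text.length) {
    const isBold = text.startsWith('**', i)
    const isItalic = !isBold && text.charAt(i) === '*'
    const width = isBold ? 2 : 1

    let close = -1
    if (isBold) close = text.indexOf('**', i + 2)
    else if (isItalic) close = findClosingStar(text, i + 1)

    if (close > i + width) {
      // Matched a non-empty span: recurse so nested markers become child nodes.
      flush()
      nodes.push({
        type: isBold ? 'strong' : 'em',
        children: parseEmphasis(text.slice(i + width, close)),
      })
      i = close + width
    } else {
      // Ordinary character or unmatched marker: keep it as text.
      plain += text.charAt(i)
      i++
    }
  }
  flush()
  return nodes
}

function renderInline(nodes: InlineNode[]): string {
  return nodes
    .map((n) => (n.type === 'text' ? n.content : `<${n.type}>${renderInline(n.children)}</${n.type}>`))
    .join('')
}

console.log(renderInline(parseEmphasis('**bold with *italic* inside**')))
// prints: <strong>bold with <em>italic</em> inside</strong>
```

The recursion is what the regex version lacked: the content between a pair of markers is parsed again on its own, so each level of nesting becomes one level of the tree.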

Step 3: Generate HTML

Traverse the AST to produce HTML:

function generateHTML(ast: ASTNode[]): string {
  return ast.map(node => {
    switch (node.type) {
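      case 'h1':
      case 'h2':
      case 'h3':
      case 'h4':
      case 'h5':
      case 'h6':
        return `<${node.type}>${generateHTML(node.children || [])}</${node.type}>`
      case 'strong':
        return `<strong>${generateHTML(node.children || [])}</strong>`
      case 'a':
        // Escape the URL (assuming escapeHtml also escapes quotes) so it cannot break out of the attribute
        return `<a href="${escapeHtml(node.href || '')}">${generateHTML(node.children || [])}</a>`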
      case 'text':
        return escapeHtml(node.content || '')
      case 'pre':
        return `<pre>${generateHTML(node.children || [])}</pre>`
      case 'code':
        return `<code>${escapeHtml(node.content || '')}</code>`
      default:
        return ''
    }
  }).join('')
}

Performance Comparison

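| Approach       | Code Size  | 10KB Text Processing | Supports Nesting |
|----------------|------------|----------------------|------------------|
| Regex          | ~50 lines  | 2-5ms                | No               |
| Tokenize + AST | ~200 lines | 10-20ms              | Yes              |
| marked.js      | 40KB       | 5-10ms               | Yes              |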

Regex is fastest but limited. AST is more robust for production use.
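
Timings like these depend heavily on the machine and the input, so it helps to measure locally. Below is a rough sketch of a timing harness, not the setup used for the table above. averageMs, sample and boldOnly are names invented for the example, and boldOnly is only a stand-in for whichever converter is being measured.

```typescript
// Hypothetical harness: average wall-clock time per run, in milliseconds.
function averageMs(convert: (markdown: string) => string, input: string, runs = 200): number {
  convert(input) // warm-up run so one-time start-up cost is not counted
  const start = Date.now()
  for (let i = 0; i < runs; i++) convert(input)
  return (Date.now() - start) / runs
}

// Build at least 10KB of sample Markdown by repeating a small snippet.
const snippet = '# Title\n\nSome **bold** text and a [link](https://example.com).\n\n'
const sample = snippet.repeat(Math.ceil((10 * 1024) / snippet.length))

// Stand-in converter for the demo: a single bold replacement.
const boldOnly = (md: string): string => md.replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')

console.log(`${sample.length} chars, ${averageMs(boldOnly, sample).toFixed(3)} ms per run`)
```

Date.now() only has millisecond resolution, which is why the sketch averages over many runs.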

Practical Application

Based on these ideas, I built: Markdown to HTML Converter

Features:

  • Live preview
  • Headings, lists, code blocks, links, images
  • One-click copy HTML
  • Download complete HTML file

Core code under 100 lines, zero dependencies, fast loading. Good enough.

Conclusion

Building a parser is about knowing your requirements:

  • Simple features only? Regex works
  • Need nesting and complex structures? Use AST
  • Want full feature support? Use marked.js or markdown-it

Don’t over-engineer, don’t underestimate complexity. Ship first, iterate later.

