From Regex to AST: Building a Markdown to HTML Parser
I needed a Markdown to HTML converter for a developer toolbox project. My first thought was to use marked.js, but then I realized — the core logic isn’t that complex. Why not implement it myself and learn how parsers work?
Why Not Use an Existing Library?
marked.js is powerful, but has some drawbacks:
- Large bundle size: 40KB minified, overkill for basic features
- Over-engineered: Supports GFM, tables, footnotes — I just need headings, lists, code blocks
- Black box: Hard to debug when something goes wrong
A custom implementation that covers just those basics fits in under 100 lines of core code, and writing it teaches you parser fundamentals.
The Regex Approach: Quick and Dirty
The most direct approach is regex replacement:
function markdownToHtml(markdown: string): string {
  return markdown
    // Code blocks (must be first to avoid inner content being matched)
    .replace(/```(\w*)\n([\s\S]*?)```/g, '<pre><code class="language-$1">$2</code></pre>')
    // Inline code
    .replace(/`([^`]+)`/g, '<code>$1</code>')
    // Headings (longest first to avoid # matching ##)
    .replace(/^######\s(.+)$/gm, '<h6>$1</h6>')
    .replace(/^#####\s(.+)$/gm, '<h5>$1</h5>')
    .replace(/^####\s(.+)$/gm, '<h4>$1</h4>')
    .replace(/^###\s(.+)$/gm, '<h3>$1</h3>')
    .replace(/^##\s(.+)$/gm, '<h2>$1</h2>')
    .replace(/^#\s(.+)$/gm, '<h1>$1</h1>')
    // Bold
    .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
    // Italic
    .replace(/\*([^*]+)\*/g, '<em>$1</em>')
    // Links
    .replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>')
    // Line breaks
    .replace(/\n/g, '<br>')
}
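This approach has a fatal flaw: regex has no state. Each replacement runs over the whole string on its own, with no record of whether a given character sits inside a code block or inside HTML emitted by an earlier step.
The Regex Pitfalls
1. Nested Matching Fails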
**bold with *italic* inside**
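Expected: <strong>bold with <em>italic</em> inside</strong>
Actual: *<em>bold with </em>italic<em> inside</em>*
The bold pattern's [^*]+ cannot cross the single * characters, so it never matches. The italic pattern then pairs each inner * with one of the outer * characters and wraps the wrong spans.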
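2. Code Block Content Gets Polluted
Take a fenced code block whose only line is:
**This should not be processed as bold**
The later replacements still run over the generated <pre><code> output, so regex will process the ** inside the code block, breaking syntax highlighting.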
The solution is to extract code blocks first, then restore them after processing:
function markdownToHtml(markdown: string): string {
  const codeBlocks: string[] = []
  // Extract code blocks, replace with placeholders
  // (escapeHtml is assumed to escape &, <, > and quotes; it is not defined in this snippet)
  let result = markdown.replace(/```(\w*)\n([\s\S]*?)```/g, (_, lang, code) => {
    const index = codeBlocks.length
    codeBlocks.push(`<pre><code class="language-${lang}">${escapeHtml(code)}</code></pre>`)
    return `{{CODE_BLOCK_${index}}}`
  })
  // Process other Markdown syntax
  result = result
    .replace(/^###\s(.+)$/gm, '<h3>$1</h3>')
    .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
    // ...
  // Restore code blocks. A function replacer keeps "$&" or "$1" inside the code
  // from being interpreted as replacement patterns.
  codeBlocks.forEach((block, i) => {
    result = result.replace(`{{CODE_BLOCK_${i}}}`, () => block)
  })
  return result
}
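The snippet above calls escapeHtml without showing it. Here is a minimal sketch of what such a helper could look like, assuming it only needs to neutralize the five characters that matter in HTML text and attribute values. The function name comes from the snippet; the body and the HTML_ESCAPES table are assumptions, not the original implementation.

```typescript
// Hypothetical helper: the article calls escapeHtml() but never shows its body.
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
}

function escapeHtml(text: string): string {
  // A single pass over the input, so the "&" of an emitted entity is never escaped twice.
  return text.replace(/[&<>"']/g, ch => HTML_ESCAPES[ch] ?? ch)
}
```

Escaping quotes as well as angle brackets means the same helper can be used for attribute values such as href, not only for text nodes.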
3. List Processing is Complex
Markdown lists can be nested:
- Item 1
  - Subitem 1.1
  - Subitem 1.2
- Item 2
Regex struggles with this hierarchy. You need a stack or recursion.
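To make the stack idea concrete, here is a minimal sketch that turns a flat sequence of list lines into a nested tree by tracking indentation. FlatItem, ListItem, nestListItems, and renderList are hypothetical names introduced for illustration and are not part of the article's code. The sketch assumes every item is an unordered-list item and that content has already been HTML-escaped.

```typescript
// Hypothetical shapes for illustration only.
interface FlatItem { indent: number; content: string }
interface ListItem { content: string; children: ListItem[] }

function nestListItems(items: FlatItem[]): ListItem[] {
  const roots: ListItem[] = []
  // Stack of currently open items, outermost first, each with the indent it was opened at.
  const stack: { indent: number; node: ListItem }[] = []
  for (const item of items) {
    const node: ListItem = { content: item.content, children: [] }
    // Close every open item that is not a strict ancestor of this line.
    while (stack.length > 0 && stack[stack.length - 1].indent >= item.indent) {
      stack.pop()
    }
    const parent = stack.length > 0 ? stack[stack.length - 1].node : null
    if (parent) parent.children.push(node)
    else roots.push(node)
    stack.push({ indent: item.indent, node })
  }
  return roots
}

// Assumes item.content is already escaped; a nested <ul> is emitted only when children exist.
function renderList(items: ListItem[]): string {
  const body = items
    .map(item => `<li>${item.content}${item.children.length ? renderList(item.children) : ''}</li>`)
    .join('')
  return `<ul>${body}</ul>`
}
```

Comparing indents rather than counting fixed steps means two-space and four-space nesting both work, as long as a child is indented more than its parent.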
A Better Approach: Tokenize + AST
Regex only works for simple cases. To handle nesting and complex structures, you need Tokenize → AST → HTML.
Step 1: Lexical Analysis (Tokenize)
Split Markdown text into tokens:
type Token = {
  type: 'heading' | 'paragraph' | 'code' | 'list' | 'text'
  content?: string
  level?: number   // heading level, 1-6
  lang?: string    // language tag of a code block
  indent?: number  // leading whitespace of a list item
  children?: Token[]
}
function tokenize(markdown: string): Token[] {
  const tokens: Token[] = []
  const lines = markdown.split('\n')
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i]
    // Heading
    const headingMatch = line.match(/^(#{1,6})\s+(.+)$/)
    if (headingMatch) {
      tokens.push({
        type: 'heading',
        level: headingMatch[1].length,
        content: headingMatch[2]
      })
      continue
    }
    // Code block start
    if (line.startsWith('```')) {
      const lang = line.slice(3).trim()
      let code = ''
      i++
      // Collect lines verbatim until the closing fence (or the end of the input)
      while (i < lines.length && !lines[i].startsWith('```')) {
        code += lines[i] + '\n'
        i++
      }
      tokens.push({ type: 'code', content: code, lang })
      continue
    }
    // List
    const listMatch = line.match(/^(\s*)[-*+]\s+(.+)$/)
    if (listMatch) {
      tokens.push({
        type: 'list',
        content: listMatch[2],
        indent: listMatch[1].length
      })
      continue
    }
    // Plain text
    if (line.trim()) {
      tokens.push({ type: 'paragraph', content: line })
    }
  }
  return tokens
}
Step 2: Build AST
Convert tokens into a tree structure:
interface ASTNode {
  type: string
  children?: ASTNode[]
  content?: string
  href?: string  // link target, set only on 'a' nodes
}
function buildAST(tokens: Token[]): ASTNode[] {
  const ast: ASTNode[] = []
  for (const token of tokens) {
    switch (token.type) {
      case 'heading':
        ast.push({
          type: `h${token.level}`,
          children: parseInline(token.content || '')
        })
        break
      case 'code':
        ast.push({
          type: 'pre',
          children: [{ type: 'code', content: token.content }]
        })
        break
      // ... other types
    }
  }
  return ast
}
// Parse inline elements (bold, italic, links)
function parseInline(text: string): ASTNode[] {
  const nodes: ASTNode[] = []
  let remaining = text
  while (remaining) {
    // Try every inline pattern, then take whichever match starts first.
    // The bold pattern is lazy (.+?) so it can span a nested *italic* run.
    const candidates = [
      { type: 'strong', match: remaining.match(/\*\*(.+?)\*\*/) },
      { type: 'em', match: remaining.match(/\*([^*]+)\*/) },
      { type: 'a', match: remaining.match(/\[([^\]]+)\]\(([^)]+)\)/) }
    ].filter(c => c.match !== null)
    // No special element matched, treat as plain text
    if (candidates.length === 0) {
      nodes.push({ type: 'text', content: remaining })
      break
    }
    const first = candidates.reduce((a, b) => (b.match!.index! < a.match!.index! ? b : a))
    const match = first.match!
    const before = remaining.slice(0, match.index)
    if (before) nodes.push({ type: 'text', content: before })
    // Recurse into the matched text so nested markup is parsed too
    const node: ASTNode = { type: first.type, children: parseInline(match[1]) }
    if (first.type === 'a') node.href = match[2]
    nodes.push(node)
    remaining = remaining.slice(match.index! + match[0].length)
  }
  return nodes
}
Step 3: Generate HTML
Traverse the AST to produce HTML:
function generateHTML(ast: ASTNode[]): string {
  return ast.map(node => {
    switch (node.type) {
      // buildAST can emit h1 through h6, so all six levels are handled
      case 'h1':
      case 'h2':
      case 'h3':
      case 'h4':
      case 'h5':
      case 'h6':
        return `<${node.type}>${generateHTML(node.children || [])}</${node.type}>`
      case 'strong':
      case 'em':
        return `<${node.type}>${generateHTML(node.children || [])}</${node.type}>`
      case 'a':
        // Escape the URL so a quote in it cannot break out of the attribute
        // (assumes escapeHtml also escapes quotes)
        return `<a href="${escapeHtml(node.href || '')}">${generateHTML(node.children || [])}</a>`
      case 'text':
        return escapeHtml(node.content || '')
      case 'pre':
        return `<pre>${generateHTML(node.children || [])}</pre>`
      case 'code':
        return `<code>${escapeHtml(node.content || '')}</code>`
      default:
        return ''
    }
  }).join('')
}
Performance Comparison
| Approach | Code Size | Time to Process 10KB of Text | Supports Nesting |
|---|---|---|---|
| Regex | ~50 lines | 2-5ms | ❌ |
| Tokenize + AST | ~200 lines | 10-20ms | ✅ |
| marked.js | 40KB | 5-10ms | ✅ |
Regex is fastest but limited. AST is more robust for production use.
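Timings like these depend heavily on the machine, the runtime, and the input, so it is worth measuring on your own setup. Below is a rough sketch of how such a measurement could be taken. averageMs and makeSample are hypothetical helpers, not the harness behind the table, and convert stands in for whichever converter is being measured. Date.now() has only millisecond resolution, which is why the total is averaged over many runs.

```typescript
// Hypothetical harness: `convert` stands in for any of the approaches in the table.
function averageMs(convert: (markdown: string) => string, input: string, runs = 200): number {
  // Warm-up so JIT compilation does not count against the first measured runs.
  for (let i = 0; i < 20; i++) convert(input)
  const start = Date.now()
  for (let i = 0; i < runs; i++) convert(input)
  return (Date.now() - start) / runs
}

// Build roughly 10KB of sample Markdown by repeating a small mixed snippet.
function makeSample(targetBytes = 10 * 1024): string {
  const snippet = '# Title\n\nSome **bold** text with a [link](https://example.com).\n\n- item\n'
  return snippet.repeat(Math.ceil(targetBytes / snippet.length))
}
```

A call such as averageMs(markdownToHtml, makeSample()) would then give a per-run average for one converter on one machine. Treat the result as a relative comparison rather than an absolute figure.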
Practical Application
Based on these ideas, I built: Markdown to HTML Converter
Features:
- Live preview
- Headings, lists, code blocks, links, images
- One-click copy HTML
- Download complete HTML file
Core code under 100 lines, zero dependencies, fast loading. Good enough.
Conclusion
Building a parser is about knowing your requirements:
- Simple features only? Regex works
- Need nesting and complex structures? Use AST
- Want full feature support? Use marked.js or markdown-it
Don’t over-engineer, don’t underestimate complexity. Ship first, iterate later.