From Regex to AST: Building a Code Formatter
Dealing with minified legacy code is painful. While Prettier is powerful, sometimes you just want to quickly format a snippet without configuring rules. So I built my own. Here’s how it works.
The Essence of Code Formatting#
Formatting boils down to two things: line breaks and indentation.
// Minified code
function add(a,b){return a+b;}
// Formatted
function add(a, b) {
  return a + b;
}
Looks simple, but it’s full of edge cases.
HTML Formatting: Tag Hierarchy#
The challenge with HTML is tracking tag nesting. Self-closing (void) tags such as <img> and <br> have no closing tag and must not increase the indent, while regular tags are indented according to their nesting depth.
Basic Implementation#
function formatHtml(code: string): string {
  let formatted = code
  let indent = 0
  const indentStr = '  ' // 2-space indent
  // Step 1: Add line breaks between tags
  formatted = formatted.replace(/></g, '>\n<')
  // Step 2: Process indentation line by line
  formatted = formatted.split('\n').map(line => {
    line = line.trim()
    // Closing tag: decrease indent first
    if (line.match(/^<\/\w/)) {
      indent = Math.max(0, indent - 1)
    }
    // Add current indentation
    const result = indentStr.repeat(indent) + line
    // Opening tag: increase indent (exclude self-closing)
    if (line.match(/^<\w[^>]*[^\/]>$/)) {
      indent++
    }
    return result
  }).join('\n')
  return formatted
}
Edge Case: Self-Closing Tag Trap#
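The regex /^<\w[^>]*[^\/]>$/ has two problems. It only recognizes the XML-style /> ending, so void elements written without the slash still increase the indent. And because its final [^\/] must consume a character, attribute-less single-letter tags such as <p> and <b> never match at all.
<!-- Wrong: treated as opening tags that increase indent -->
<img src="a.jpg">
<br>
<input type="text">
<!-- Correct: none should increase indent -->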
Improved solution: maintain a self-closing tag list
const selfClosingTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
])
function isSelfClosing(tag: string): boolean {
  const match = tag.match(/^<(\w+)/)
  return match ? selfClosingTags.has(match[1].toLowerCase()) : false
}
// Improved indent logic: also skip lines that already contain their
// closing tag, such as <p>text</p>
if (line.match(/^<\w/) && !isSelfClosing(line) && !line.endsWith('/>') && !line.includes('</')) {
  indent++
}
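Putting the pieces together, here is a minimal sketch of the line-based formatter with the void-element check folded in. The names VOID_TAGS, opensBlock, and formatHtmlVoidAware are my own for this illustration, and the sketch assumes the input has no comments, scripts, or attribute values containing > or <.

```typescript
// Void elements never have a closing tag, so they must not increase indent.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
])

// Hypothetical helper: true when the line opens a tag that is still open
// once the line ends.
function opensBlock(line: string): boolean {
  const match = line.match(/^<(\w+)/)
  if (!match) return false                                // text, comment, closing tag
  if (VOID_TAGS.has(match[1].toLowerCase())) return false // <br>, <img ...>
  if (line.endsWith('/>')) return false                   // XML-style self-closing
  if (line.includes('</')) return false                   // <p>text</p> on one line
  return true
}

function formatHtmlVoidAware(code: string): string {
  let indent = 0
  const indentStr = '  '
  return code
    .replace(/>\s*</g, '>\n<')          // break between adjacent tags
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => {
      // Closing tag: decrease indent before printing the line
      if (/^<\/\w/.test(line)) indent = Math.max(0, indent - 1)
      const result = indentStr.repeat(indent) + line
      if (opensBlock(line)) indent++
      return result
    })
    .join('\n')
}

console.log(formatHtmlVoidAware('<div><img src="a.jpg"><br><p>Hi</p><input type="text"></div>'))
```

Every child of the div lands at one indent level, and the closing </div> returns to column zero.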
Edge Case: Inline Elements#
Some tags shouldn’t break lines, like <span>, <a>, <strong>:
<!-- Don't want this -->
<p>
  This is
  <strong>
    bold
  </strong>
  text
</p>
<!-- Want this -->
<p>This is <strong>bold</strong> text</p>
This requires smarter detection: if a tag has text content directly before or after it, keep it inline. A single regex replace can’t see that context. You need at least a tokenizer, and ideally a real parser that builds a tree.
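To make the idea concrete, here is a rough token-level sketch. It is not an AST, just a tokenizer plus a merge pass. The INLINE_TAGS list is an assumed, incomplete set, and blockTagName and splitHtmlLines are names invented for this example. It only decides where the line breaks go; indentation can then be applied line by line as before.

```typescript
// Assumed, incomplete list of inline elements; extend as needed.
const INLINE_TAGS = new Set(['a', 'b', 'code', 'em', 'i', 'small', 'span', 'strong'])

// Returns the lowercase name for a block-level tag token, or null for
// text, comments, and inline tags.
function blockTagName(token: string): string | null {
  const match = token.match(/^<\/?(\w+)/)
  if (!match) return null
  const name = match[1].toLowerCase()
  return INLINE_TAGS.has(name) ? null : name
}

function splitHtmlLines(code: string): string[] {
  // Tokenize into tags and text runs
  const tokens = code.match(/<[^>]+>|[^<]+/g) ?? []
  const lines: string[] = []
  let run = ''
  const flush = () => {
    if (run.trim()) lines.push(run.trim())
    run = ''
  }
  for (const token of tokens) {
    if (blockTagName(token) !== null) {
      flush()
      lines.push(token)   // block tags get their own line
    } else {
      run += token        // text and inline tags stay glued together
    }
  }
  flush()

  // Second pass: an opening block tag, one inline run, and the matching
  // closing tag collapse onto a single line.
  const merged: string[] = []
  for (let i = 0; i < lines.length; i++) {
    const open = lines[i]
    const content: string | undefined = lines[i + 1]
    const close: string | undefined = lines[i + 2]
    const name = blockTagName(open)
    if (
      name !== null && !open.startsWith('</') &&
      content !== undefined && blockTagName(content) === null &&
      close !== undefined && close.toLowerCase() === `</${name}>`
    ) {
      merged.push(open + content + close)
      i += 2
    } else {
      merged.push(open)
    }
  }
  return merged
}

console.log(splitHtmlLines('<div><p>This is <strong>bold</strong> text</p></div>').join('\n'))
```

The paragraph stays on one line because its only child is a run of text and inline tags, while the surrounding div still gets its own lines.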
CSS Formatting: Braces and Semicolons#
CSS formatting is relatively simple because the syntax is regular: selector + braces + property list.
function formatCss(code: string): string {
  let formatted = code
  // Step 1: Add a line break after {
  formatted = formatted.replace(/\{/g, ' {\n  ')
  // Step 2: Add a line break after ;
  formatted = formatted.replace(/;/g, ';\n  ')
  // Step 3: Add line breaks around }
  formatted = formatted.replace(/\}/g, '\n}\n')
  // Step 4: Clean up blank and whitespace-only lines
  formatted = formatted.replace(/\n\s*\n/g, '\n')
  return formatted.trim()
}
Edge Case: Nested Rules and Media Queries#
The implementation works for simple CSS, but fails with nested rules:
/* Original */
.container{.item{color:red;}@media(max-width:600px){.item{color:blue;}}}
/* Wrong formatting */
.container {
.item {
color: red;
}
@media(max-width:600px) {
.item {
color: blue;
}
}
}
The problem: nested rules need extra indentation levels. The solution is to track brace nesting depth:
function formatCssWithNesting(code: string): string {
  let indent = 0
  const indentStr = '  '
  // Split at { } ; and keep the delimiters as tokens
  const tokens = code.split(/([{};])/)
  return tokens.map(token => {
    if (token === '{') {
      return ' {\n' + indentStr.repeat(++indent)
    } else if (token === '}') {
      indent = Math.max(0, indent - 1)
      // Re-emit the indent after the brace so a following sibling rule
      // starts at the current depth instead of column 0
      return '\n' + indentStr.repeat(indent) + '}\n' + indentStr.repeat(indent)
    } else if (token === ';') {
      return ';\n' + indentStr.repeat(indent)
    } else {
      return token.trim()
    }
  }).join('').replace(/\n\s*\n/g, '\n').trim()
}
SQL Formatting: Keyword Recognition#
SQL formatting focuses on making keywords (SELECT, FROM, WHERE) start on their own lines and converting them to uppercase.
function formatSql(code: string): string {
  const keywords = [
    'SELECT', 'FROM', 'WHERE', 'JOIN', 'ON', 'GROUP BY',
    'ORDER BY', 'HAVING', 'LIMIT', 'INSERT', 'UPDATE',
    'DELETE', 'CREATE', 'ALTER', 'DROP', 'AND', 'OR',
    'NOT', 'IN', 'LIKE'
  ]
  // Don't uppercase the whole query: that would also change identifiers
  // and string literals. The replacement below uppercases keywords only.
  let formatted = code
  // Add line break before each keyword
  keywords.forEach(keyword => {
    const regex = new RegExp(`\\b${keyword}\\b`, 'gi')
    formatted = formatted.replace(regex, `\n${keyword}`)
  })
  // Remove leading line break
  formatted = formatted.replace(/^\n/, '')
  return formatted
}
Edge Case: Keyword Order Problem#
There’s a trap: GROUP BY and ORDER BY contain BY. If BY is in the keyword list, it causes duplicate line breaks:
-- Wrong result
SELECT * FROM users
GROUP
BY age
ORDER
BY name
-- Correct result
SELECT * FROM users
GROUP BY age
ORDER BY name
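Solution: sorting alone is not enough when each keyword runs as its own replace pass, because the later BY pass still hits the BY inside GROUP BY. Either match all keywords in a single alternation with longer keywords first, or use a negative lookbehind:
// Sort by length descending and match everything in one pass
const sorted = [...keywords].sort((a, b) => b.length - a.length)
const regex = new RegExp(`\\b(?:${sorted.join('|')})\\b`, 'gi')
// Or use a negative lookbehind to skip a BY that belongs to GROUP BY / ORDER BY
const byRegex = /(?<!\b(?:GROUP|ORDER)\s+)\bBY\b/gi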
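Here is a small sketch of the single-pass variant. SQL_KEYWORDS is a shortened list and breakBeforeKeywords is a name I made up for this example. Because the regex engine tries the alternatives left to right, putting GROUP BY ahead of BY means the shorter keyword never gets a chance to split the longer one.

```typescript
// Shortened keyword list for illustration; BY is included on purpose.
const SQL_KEYWORDS = [
  'SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY', 'HAVING', 'LIMIT', 'BY'
]

function breakBeforeKeywords(sql: string): string {
  const alternatives = [...SQL_KEYWORDS]
    .sort((a, b) => b.length - a.length)            // longest first
    .map(keyword => keyword.replace(/ /g, '\\s+'))  // tolerate extra spaces
    .join('|')
  // Leading \s* swallows the whitespace that the line break replaces
  const regex = new RegExp(`\\s*\\b(${alternatives})\\b`, 'gi')
  return sql
    .replace(regex, (_match, keyword: string) =>
      '\n' + keyword.toUpperCase().replace(/\s+/g, ' '))
    .trim()
}

console.log(breakBeforeKeywords('select * from users group by age order by name'))
// SELECT *
// FROM users
// GROUP BY age
// ORDER BY name
```

Only the matched keywords are uppercased, so identifiers such as users keep their original case.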
Edge Case: Keywords in Strings#
Keywords inside strings shouldn’t be formatted:
SELECT 'WHERE is this' FROM users
A plain keyword regex can’t tell whether a match sits inside a string literal. Handling this properly needs a tokenizer that recognizes string literals before it looks for keywords.
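One lightweight way to approximate that tokenizer is to let the regex match string literals first and hand them back untouched. This is a sketch under narrow assumptions: only single-quoted literals with '' as the escape are recognized, comments and quoted identifiers are ignored, and formatSqlSkippingStrings is a name invented for this example.

```typescript
function formatSqlSkippingStrings(sql: string): string {
  const keywords = ['SELECT', 'FROM', 'WHERE', 'AND', 'OR']
  // First alternative: a single-quoted literal ('' is an escaped quote).
  // Second alternative: a keyword with optional leading whitespace.
  const regex = new RegExp(
    `'(?:[^']|'')*'|\\s*\\b(${keywords.join('|')})\\b`,
    'gi'
  )
  return sql
    .replace(regex, (match: string, keyword?: string) =>
      // When the literal alternative matched, the keyword group is undefined
      keyword === undefined ? match : '\n' + keyword.toUpperCase())
    .trim()
}

console.log(formatSqlSkippingStrings("select 'WHERE is this' from users where name = 'a or b'"))
// SELECT 'WHERE is this'
// FROM users
// WHERE name = 'a or b'
```

The literal alternative consumes the whole string in one match, so the keyword alternative never gets to look inside it.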
JavaScript Formatting: Brace Hell#
JavaScript formatting is the most complex due to diverse syntax: functions, objects, arrays, template strings…
Simple Implementation#
function formatJavaScript(code: string): string {
  let indent = 0
  const indentStr = '  '
  let formatted = code
  // Add line breaks at { } ;
  formatted = formatted.replace(/\{/g, ' {\n')
  formatted = formatted.replace(/\}/g, '\n}\n')
  formatted = formatted.replace(/;/g, ';\n')
  // Process indentation line by line
  formatted = formatted.split('\n').map(line => {
    line = line.trim()
    // Closing brace: decrease indent first
    if (line === '}') {
      indent = Math.max(0, indent - 1)
    }
    // Add current indentation
    const result = indentStr.repeat(indent) + line
    // Opening brace: increase indent
    if (line.endsWith('{')) {
      indent++
    }
    return result
  }).join('\n')
  return formatted.trim()
}
Edge Case: Object Literals#
The implementation incorrectly formats object literals:
// Original
const obj = {a:1,b:2}
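// Wrong formatting: stray double space, properties left on one line
const obj =  {
  a:1,b:2
}
// Correct formatting
const obj = {
  a: 1,
  b: 2
}
The problem: a regex can’t tell an object literal’s {} from a code block’s {}, so it has no way to know that the commas inside should start new lines while the commas in a parameter list should not. This needs AST parsing.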
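A real parser is the robust answer. As a lighter illustration of what "context" means here, the sketch below guesses the role of a brace from the text just before it. This is a heuristic of my own for this example, not an AST and not how Prettier works. It ignores strings and comments, and it gets cases like case 1: { wrong.

```typescript
type BraceKind = 'object' | 'block'

// Heuristic (an assumption, not a parser): a `{` that follows an operator,
// an opening bracket, or `return` starts an expression, so it is treated as
// an object literal. Anything else is treated as a block.
function classifyBrace(code: string, braceIndex: number): BraceKind {
  const before = code.slice(0, braceIndex).trimEnd()
  if (before.endsWith('=>')) return 'block'       // arrow function body
  if (/\breturn$/.test(before)) return 'object'   // return { ... }
  return /[=(\[,:?]$/.test(before) ? 'object' : 'block'
}

const samples = [
  'const obj = {a:1,b:2}',
  'function add(a,b){return a+b;}',
  'const f = () => {return 1}'
]
for (const sample of samples) {
  console.log(classifyBrace(sample, sample.indexOf('{')), '<-', sample)
}
```

A formatter could use the result to decide whether commas inside the braces should start new lines.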
Edge Case: Arrow Functions#
// Single-line arrow functions shouldn't break
const add = (a, b) => a + b
// Multi-line arrow functions need breaks
const add = (a, b) => {
  return a + b
}
A regex can spot => followed by {, but it can’t tell where an expression body ends or whether the => sits inside a string or comment.
Regex vs AST: When to Use Which?#
Regex Advantages#
- Fast: Pure string operations, no parsing
- No dependencies: Don’t need parser libraries
- Fault tolerant: Can format even with syntax errors
Regex Limitations#
- Can’t handle nesting on its own: Brace and tag nesting need a separate depth counter
- Can’t distinguish context: Keywords in strings, object literal vs block
- Many edge cases: Self-closing tags, inline elements, multiline strings
AST Advantages#
- Precise: Works from the actual syntax structure instead of guessing from text patterns
- Extensible: Supports complex formatting rules
- Configurable: Indent, line break, spacing rules are customizable
AST Limitations#
- Performance overhead: Building an AST is slower than a few regex passes
- Syntax errors: Most parsers reject code with syntax errors unless they offer an error-tolerant mode
- Bundle size: Parser libraries are usually large
Practical Advice#
Quick Formatting Scenarios#
If you just want to quickly format a code snippet, regex is enough:
// Regex implementation
function quickFormat(code: string, language: string): string {
  switch (language) {
    case 'html': return formatHtml(code)
    case 'css': return formatCss(code)
    case 'sql': return formatSql(code)
    case 'javascript': return formatJavaScript(code)
    default: return code
  }
}
Production-Grade Formatting#
For production-grade formatting quality, use mature tools:
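import prettier from 'prettier'
// html, css, and babel ship with Prettier. SQL is not built in: the 'sql'
// parser only exists once a community plugin (for example
// prettier-plugin-sql) is installed and passed via the `plugins` option.
const parsers: Record<string, string> = {
  html: 'html',
  css: 'css',
  sql: 'sql',
  javascript: 'babel'
}
async function productionFormat(code: string, language: string): Promise<string> {
  const parser = parsers[language]
  // Unknown language: return the input unchanged
  if (!parser) return code
  return prettier.format(code, { parser })
}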
The Result#
Based on these ideas, I built: Code Formatter
Features:
- Supports HTML, CSS, SQL, JavaScript
- Automatic indentation and line breaks
- SQL keywords auto-uppercase
- Perfect for quick code beautification
The implementation isn’t complex, but getting the details right takes effort. Hope this helps.