From Regex to AST: Building a Code Formatter#

Dealing with minified legacy code is painful. While Prettier is powerful, sometimes you just want to quickly format a snippet without configuring rules. So I built my own. Here’s how it works.

The Essence of Code Formatting#

Formatting boils down to two things: line breaks and indentation.

// Minified code
function add(a,b){return a+b;}

// Formatted
function add(a, b) {
  return a + b;
}

Looks simple, but it’s full of edge cases.

HTML Formatting: Tag Hierarchy#

The challenge with HTML is tracking tag nesting. Void elements such as <img> and <br> never have a closing tag, so they must not increase the indent level, while every other opening tag indents its children one level deeper.

Basic Implementation#

function formatHtml(code: string): string {
  let formatted = code
  let indent = 0
  const indentStr = '  '  // 2-space indent
  
  // Step 1: Add line breaks between tags
  formatted = formatted.replace(/></g, '>\n<')
  
  // Step 2: Process indentation line by line
  formatted = formatted.split('\n').map(line => {
    line = line.trim()
    
    // Closing tag: decrease indent first
    if (line.match(/^<\/\w/)) {
      indent = Math.max(0, indent - 1)
    }
    
    // Add current indentation
    const result = indentStr.repeat(indent) + line
    
    // Opening tag: increase indent (exclude self-closing)
    if (line.match(/^<\w[^>]*[^\/]>$/)) {
      indent++
    }
    
    return result
  }).join('\n')
  
  return formatted
}

Edge Case: Self-Closing Tag Trap#

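The regex /^<\w[^>]*[^\/]>$/ has two problems. It only excludes tags written with an explicit />, so void elements written without the slash still count as opening tags. It also needs at least two characters between < and >, so single-letter tags such as <p> and <a> never increase the indent.

<!-- Wrong: treated as opening tags, so indent increases -->
<br>
<input type="text">

<!-- Only the explicit /> form is excluded -->
<img src="a.jpg"/>

<!-- Correct: none of the three should increase indent -->

Improved solution: maintain a list of self-closing tags (the HTML spec calls them void elements)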

const selfClosingTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
])

function isSelfClosing(tag: string): boolean {
  const match = tag.match(/^<(\w+)/)
  return match ? selfClosingTags.has(match[1].toLowerCase()) : false
}

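// Improved indent logic: the line must be a single opening tag,
// not a void element, and not written with an explicit "/>"
if (/^<\w[^>]*>$/.test(line) && !isSelfClosing(line) && !line.endsWith('/>')) {
  indent++
}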
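
Putting the pieces together, here is a minimal sketch of the full HTML pass with the void-element check folded in. The name formatHtmlWithVoidElements is made up for this example, and the sketch assumes minified input with no <pre>, <script> or comment blocks that need special handling.

```typescript
// Void elements from the HTML spec, plus the obsolete `param` kept from the list above
const voidElements = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
])

// Hypothetical name: a sketch, not the formatter's actual source
function formatHtmlWithVoidElements(code: string): string {
  const indentStr = '  '
  let indent = 0

  return code
    .replace(/>\s*</g, '>\n<')          // break between adjacent tags
    .split('\n')
    .map(line => line.trim())
    .filter(line => line !== '')        // drop blank lines
    .map(line => {
      // Closing tag: step back out before printing the line
      if (/^<\/\w/.test(line)) {
        indent = Math.max(0, indent - 1)
      }
      const result = indentStr.repeat(indent) + line

      // Indent only after a line that is exactly one opening tag,
      // is not a void element, and is not written with "/>"
      const open = line.match(/^<(\w+)[^>]*>$/)
      if (open && !voidElements.has(open[1].toLowerCase()) && !line.endsWith('/>')) {
        indent++
      }
      return result
    })
    .join('\n')
}
```

A line such as <p>Hi</p> opens and closes on the same line, so it matches neither check and leaves the indent level unchanged.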

Edge Case: Inline Elements#

Some tags shouldn’t break lines, like <span>, <a>, <strong>:

<!-- Don't want this -->
<p>
  This is 
  <strong>
    bold
  </strong>
  text
</p>

<!-- Want this -->
<p>This is <strong>bold</strong> text</p>

This requires smarter detection: if a tag has text content directly before or after it, keep it inline. A per-line regex has no view of a tag’s siblings, so doing this reliably needs a parsed tree (AST).
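
Short of a full parser, a token-level heuristic covers the common case. The sketch below is an assumption-laden illustration: inlineTags and breakAroundBlockTags are hypothetical names, the inline list is deliberately incomplete, and the input is assumed to be minified (no whitespace between block-level tags). It inserts a line break only where two block-level tags sit directly next to each other.

```typescript
// Hypothetical, incomplete list of inline tags for illustration
const inlineTags = new Set(['a', 'b', 'code', 'em', 'i', 'small', 'span', 'strong'])

function breakAroundBlockTags(html: string): string {
  // Tokens are either a whole tag or a run of text
  const tokens = html.match(/<[^>]+>|[^<]+/g) ?? []
  let out = ''
  let prevWasBlockTag = false

  for (const token of tokens) {
    const name = token.match(/^<\/?(\w+)/)?.[1]?.toLowerCase()
    const isBlockTag = name !== undefined && !inlineTags.has(name)

    // Break only between two adjacent block-level tags;
    // text and inline tags stay on the current line
    if (isBlockTag && prevWasBlockTag) {
      out += '\n'
    }
    out += token
    prevWasBlockTag = isBlockTag
  }
  return out
}
```

The result can then go through the line-by-line indentation step. It is still a heuristic: it knows nothing about CSS display overrides or whitespace-sensitive elements such as <pre>.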

CSS Formatting: Braces and Semicolons#

CSS formatting is relatively simple because the syntax is regular: selector + braces + property list.

function formatCss(code: string): string {
  let formatted = code
  
  // Step 1: Add line breaks around {
  formatted = formatted.replace(/\{/g, ' {\n  ')
  
  // Step 2: Add line break after ;
  formatted = formatted.replace(/;/g, ';\n  ')
  
  // Step 3: Add line breaks around }
  formatted = formatted.replace(/\}/g, '\n}\n')
  
  // Step 4: Clean up multiple blank lines
  formatted = formatted.replace(/\n\s*\n/g, '\n')
  
  return formatted.trim()
}

Edge Case: Nested Rules and Media Queries#

The implementation works for simple CSS, but fails with nested rules:

/* Original */
.container{.item{color:red;}@media(max-width:600px){.item{color:blue;}}}

/* Wrong formatting */
.container {
  .item {
  color:red;
}
@media(max-width:600px) {
  .item {
  color:blue;
}
}
}

The problem: formatCss indents every line by the same fixed two spaces, but nested rules need one extra level per enclosing brace. The solution is to track brace nesting depth:

function formatCssWithNesting(code: string): string {
  let indent = 0
  const indentStr = '  '

  // Split at { } ; and keep the delimiters as their own tokens
  const tokens = code.split(/([{};])/)

  return tokens.map(token => {
    if (token === '{') {
      // Open a block: everything after it is one level deeper
      return ' {\n' + indentStr.repeat(++indent)
    } else if (token === '}') {
      // Close a block, then indent whatever follows at the outer level
      indent = Math.max(0, indent - 1)
      return '\n' + indentStr.repeat(indent) + '}\n' + indentStr.repeat(indent)
    } else if (token === ';') {
      return ';\n' + indentStr.repeat(indent)
    } else {
      return token.trim()
    }
  }).join('').replace(/\n\s*\n/g, '\n').trim()
}
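
The nesting-aware version still leaves declarations such as color:red without a space after the colon. A minimal sketch of that last step, assuming the caller only passes tokens that are declarations (the text in front of a ;) and never selectors like a:hover. The name spaceDeclaration is made up for this example.

```typescript
// Hypothetical helper: normalizes "property:value" to "property: value"
function spaceDeclaration(declaration: string): string {
  const colon = declaration.indexOf(':')
  if (colon === -1) {
    return declaration.trim()           // not a declaration, leave as-is
  }
  // Split at the first colon only, so values like url(http://...) survive
  const property = declaration.slice(0, colon).trim()
  const value = declaration.slice(colon + 1).trim()
  return `${property}: ${value}`
}
```

Splitting at the first colon matters: a value may itself contain colons, as in a URL.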

SQL Formatting: Keyword Recognition#

SQL formatting focuses on putting the main keywords (SELECT, FROM, WHERE) on their own lines and converting them to uppercase.

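function formatSql(code: string): string {
  const keywords = [
    'SELECT', 'FROM', 'WHERE', 'JOIN', 'ON', 'GROUP BY', 
    'ORDER BY', 'HAVING', 'LIMIT', 'INSERT', 'UPDATE', 
    'DELETE', 'CREATE', 'ALTER', 'DROP', 'AND', 'OR', 
    'NOT', 'IN', 'LIKE'
  ]

  // Don't uppercase the whole query: that would also change
  // identifiers and string literals such as 'alice'
  let formatted = code

  // Put each keyword on its own line and uppercase only the keyword
  keywords.forEach(keyword => {
    // Allow any whitespace between the words of a two-word keyword
    const pattern = keyword.replace(/ /g, '\\s+')
    // Swallow the whitespace in front so lines don't end in spaces
    const regex = new RegExp(`\\s*\\b${pattern}\\b`, 'gi')
    formatted = formatted.replace(regex, `\n${keyword}`)
  })

  // Remove the leading line break
  return formatted.trim()
}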

Edge Case: Keyword Order Problem#

There’s a trap: GROUP BY and ORDER BY contain BY. If BY is also in the keyword list as its own entry, the two-word keywords get split across lines:

-- Wrong result
SELECT * FROM users
GROUP
BY age
ORDER
BY name

-- Correct result
SELECT * FROM users
GROUP BY age
ORDER BY name

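Solution: sort the keywords longest first and join them into one alternation, so a single pass consumes GROUP BY as one match and never revisits the BY inside it. Sorting matters when one keyword is a prefix of another, such as ORDER and ORDER BY. Sorting alone doesn’t help while each keyword runs as its own replace pass, because the later BY pass still matches inside the already formatted GROUP BY. Alternatively, keep the separate passes and guard BY with a negative lookbehind:

// Sort by length descending and match all keywords in a single pass
const sorted = [...keywords].sort((a, b) => b.length - a.length)
const combined = new RegExp(`\\b(?:${sorted.join('|')})\\b`, 'gi')
formatted = formatted.replace(combined, match => `\n${match.toUpperCase()}`)

// Or skip a BY that directly follows GROUP or ORDER
const byRegex = /(?<!\b(?:GROUP|ORDER)\s+)\bBY\b/gi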

Edge Case: Keywords in Strings#

Keywords inside strings shouldn’t be formatted:

SELECT 'WHERE is this' FROM users

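A keyword regex applied to the whole query can’t tell whether a match sits inside or outside a string literal. Handling this needs at least a tokenizer that recognizes string literals (and comments) before any keyword is touched.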
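
One way to get there without a full parser is to split the query at string literals first and only format the parts in between. The sketch below assumes standard single-quoted SQL strings with '' as the escape and ignores comments and quoted identifiers. The names sqlKeywords and formatSqlOutsideStrings and the shortened keyword list are made up for this example.

```typescript
// Shortened, hypothetical keyword list for illustration
const sqlKeywords = ['SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY', 'AND', 'OR']

function formatSqlOutsideStrings(sql: string): string {
  // Longest first, combined into one alternation for a single pass
  const alternatives = [...sqlKeywords]
    .sort((a, b) => b.length - a.length)
    .map(keyword => keyword.replace(/ /g, '\\s+'))
    .join('|')
  const keywordRegex = new RegExp(`\\s*\\b(${alternatives})\\b`, 'gi')

  // The capturing group keeps string literals in the result, at odd indexes
  const parts = sql.split(/('(?:[^']|'')*')/)

  return parts
    .map((part, index) => {
      if (index % 2 === 1) {
        return part                     // string literal: copy untouched
      }
      return part.replace(keywordRegex, (_match, keyword: string) =>
        '\n' + keyword.toUpperCase().replace(/\s+/g, ' '))
    })
    .join('')
    .trim()
}
```

Because the literals are never passed to the keyword regex, 'WHERE is this' comes out exactly as it went in.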

JavaScript Formatting: Brace Hell#

JavaScript formatting is the most complex due to diverse syntax: functions, objects, arrays, template strings…

Simple Implementation#

function formatJavaScript(code: string): string {
  let indent = 0
  const indentStr = '  '

  let formatted = code

  // Add line breaks at { } ; (swallow spaces before { so they don't double up)
  formatted = formatted.replace(/\s*\{/g, ' {\n')
  formatted = formatted.replace(/\}/g, '\n}\n')
  formatted = formatted.replace(/;/g, ';\n')

  // Process indentation line by line, skipping the blank lines created above
  formatted = formatted.split('\n')
    .map(line => line.trim())
    .filter(line => line !== '')
    .map(line => {
      // Closing brace: decrease indent first
      if (line === '}') {
        indent = Math.max(0, indent - 1)
      }

      // Add current indentation
      const result = indentStr.repeat(indent) + line

      // Opening brace: increase indent
      if (line.endsWith('{')) {
        indent++
      }

      return result
    }).join('\n')

  return formatted
}
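
The replace calls above also fire inside string literals, so a ; or } in a string gets a line break too. A minimal sketch of a character scanner that applies the same break rules only outside strings. The name breakOutsideStrings is made up for this example, and the sketch assumes there are no comments, regex literals or ${...} expressions inside template strings, all of which would need extra state.

```typescript
// Hypothetical helper: inserts breaks at { } ; but never inside a string
function breakOutsideStrings(code: string): string {
  let out = ''
  let quote: string | null = null       // delimiter of the string we are in, if any

  for (let i = 0; i < code.length; i++) {
    const ch = code[i]

    if (quote !== null) {
      out += ch
      if (ch === '\\' && i + 1 < code.length) {
        out += code[++i]                // copy the escaped character untouched
      } else if (ch === quote) {
        quote = null                    // string closed
      }
    } else if (ch === '"' || ch === "'" || ch === '`') {
      quote = ch                        // string opened
      out += ch
    } else if (ch === '{') {
      out += '{\n'
    } else if (ch === '}') {
      out += '\n}\n'
    } else if (ch === ';') {
      out += ';\n'
    } else {
      out += ch
    }
  }
  return out
}
```

The output still contains blank lines and no indentation. It is meant to replace only the three replace calls, with the line-by-line indentation step running afterwards.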

Edge Case: Object Literals#

The implementation incorrectly formats object literals:

// Original
const obj = {a:1,b:2}

// Wrong formatting
const obj = {
  a:1,b:2
}

// Correct formatting
const obj = {
  a: 1,
  b: 2
}

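The problem: the formatter treats every { the same way. It can’t distinguish code blocks {} from object literals {}, so it never learns that the commas inside an object should become line breaks. Telling the two apart reliably needs AST parsing.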
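
A cheap approximation is to look at what comes right before the brace. The sketch below is a heuristic under stated assumptions, not a replacement for a parser: isLikelyObjectLiteral is a made-up name, the source before the brace is assumed to contain no trailing comment, and some cases are knowingly misjudged (a block after a case label, for example).

```typescript
// Hypothetical heuristic: guess whether the { at braceIndex starts an object literal
function isLikelyObjectLiteral(code: string, braceIndex: number): boolean {
  const before = code.slice(0, braceIndex).trimEnd()

  if (before === '') {
    return false                        // a brace at the very start reads as a block
  }
  if (/\breturn$/.test(before)) {
    return true                         // return { ... }
  }
  if (before.endsWith('=>')) {
    return false                        // arrow function body
  }
  // After an assignment, call paren, comma, colon, [ or ? an expression is expected
  return /[=(,:\[?]$/.test(before)
}
```

A formatter could use the result to decide whether commas inside the braces should become line breaks.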

Edge Case: Arrow Functions#

// Single-line arrow functions shouldn't break
const add = (a, b) => a + b

// Multi-line arrow functions need breaks
const add = (a, b) => {
  return a + b
}

A regex can spot => {, but it can’t reliably tell where an expression body ends, or that => ({ ... }) returns an object literal instead of opening a block.

Regex vs AST: When to Use Which?#

Regex Advantages#

  • Fast: Pure string operations, no parsing
  • No dependencies: Don’t need parser libraries
  • Fault tolerant: Can format even with syntax errors

Regex Limitations#

  • Can’t handle nesting on its own: Matching nested braces or tags needs extra state, such as a depth counter
  • Can’t distinguish context: Keywords in strings, object vs block
  • Many edge cases: Self-closing tags, inline elements, multiline strings

AST Advantages#

  • Precise: Understands syntax structure, no misjudgment
  • Extensible: Supports complex formatting rules
  • Configurable: Indent, line break, spacing rules are customizable

AST Limitations#

  • Performance overhead: Parsing into an AST is slower than a few regex passes
  • Syntax errors: A strict parser rejects code with syntax errors, so a broken snippet may not get formatted at all
  • Bundle size: Parser libraries are usually large
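
The two lists suggest a combination: try the strict, parser-based formatter first and fall back to the tolerant regex one when parsing fails. A minimal sketch of that wrapper follows. Formatter and formatWithFallback are hypothetical names, and both formatters are passed in as plain functions so the sketch stays independent of any particular library.

```typescript
// A formatter may be synchronous (regex) or asynchronous (parser-based)
type Formatter = (code: string) => string | Promise<string>

// Hypothetical wrapper: strict formatter first, tolerant one as the fallback
async function formatWithFallback(
  code: string,
  strict: Formatter,
  tolerant: Formatter
): Promise<string> {
  try {
    return await strict(code)
  } catch {
    // The strict formatter threw (for example on a syntax error):
    // best-effort output is better than none
    return tolerant(code)
  }
}
```

The trade-off is that the output style can differ between the two paths, so the fallback is best reserved for snippets the parser rejects outright.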

Practical Advice#

Quick Formatting Scenarios#

If you just want to quickly format a code snippet, regex is enough:

// Regex implementation
function quickFormat(code: string, language: string): string {
  switch (language) {
    case 'html': return formatHtml(code)
    case 'css': return formatCss(code)
    case 'sql': return formatSql(code)
    case 'javascript': return formatJavaScript(code)
    default: return code
  }
}

Production-Grade Formatting#

For production-grade formatting quality, use mature tools:

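import prettier from 'prettier'

// Prettier ships parsers for HTML, CSS and JavaScript. It has no built-in
// SQL parser, so 'sql' only works with a plugin installed
// (this assumes prettier-plugin-sql).
const parsers: Record<string, string> = {
  html: 'html',
  css: 'css',
  sql: 'sql',
  javascript: 'babel'
}

async function productionFormat(code: string, language: string): Promise<string> {
  const parser = parsers[language]
  if (!parser) return code  // unknown language: return the input unchanged

  const plugins = language === 'sql' ? ['prettier-plugin-sql'] : []
  return prettier.format(code, { parser, plugins })
}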

The Result#

Based on these ideas, I built: Code Formatter

Features:

  • Supports HTML, CSS, SQL, JavaScript
  • Automatic indentation and line breaks
  • SQL keywords auto-uppercase
  • Perfect for quick code beautification

The implementation isn’t complex, but getting the details right takes effort. Hope this helps.

