SQL Formatter: From Regex to AST Parser Implementation#

Recently during a code review, I came across a SQL statement compressed into a single line - hundreds of characters strung together, making it painful to read. I looked for an online tool to format it, but the output style didn’t match our team’s conventions. I decided to implement my own and document the approach.

Why SQL Formatting is Hard#

SQL formatting is much more complex than JSON formatting. JSON has JSON.stringify(obj, null, 2) - one line and you’re done. SQL has no native API, so you need to parse it yourself.

The core challenges:

  1. Keyword identification: SELECT, FROM, WHERE need to be on separate lines
  2. Indentation levels: Subqueries, CASE WHEN, nested parentheses all need proper indentation
  3. String handling: Keywords inside strings shouldn’t be recognized
  4. Comment preservation: Single-line -- and multi-line /* */ comments should be preserved

Approach 1: Regex Replacement (Quick and Dirty)#

The fastest implementation, works for 80% of cases:

function formatSqlSimple(sql: string): string {
  // First compress whitespace
  let formatted = sql
    .replace(/\s+/g, ' ')
    .replace(/\s*,\s*/g, ', ')
    .replace(/\s*\(\s*/g, ' (')
    .replace(/\s*\)\s*/g, ') ')
  
  // Newlines after keywords
  const keywords = [
    'SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY', 
    'HAVING', 'LIMIT', 'JOIN', 'LEFT JOIN', 'RIGHT JOIN'
  ]
  
  keywords.forEach(keyword => {
    const regex = new RegExp(`\\b${keyword}\\b`, 'gi')
    formatted = formatted.replace(regex, '\n' + keyword)
  })
  
  return formatted
}

Input:

select id,name,email from users where status='active' order by id desc limit 10

Output:

SELECT id, name, email 
FROM users 
WHERE status='active' 
ORDER BY id DESC 
LIMIT 10

Fast, but here are the problems:

Problem 1: Keywords inside strings get misidentified#

SELECT 'This is FROM test' FROM users

Becomes:

SELECT 'This is 
FROM test' 
FROM users

The FROM inside the string also gets a newline.

Problem 2: Subquery indentation is messy#

SELECT * FROM (SELECT id FROM users) AS t

Subqueries should be indented, but regex can’t handle nested levels.

Approach 2: Token Stream + State Machine#

A more reliable approach is to first tokenize the SQL, then process by token type:

enum TokenType {
  KEYWORD,      // SELECT, FROM, WHERE
  IDENTIFIER,   // table_name, column_name
  STRING,       // 'hello', "world"
  NUMBER,       // 123, 45.67
  OPERATOR,     // =, <, >, AND, OR
  PUNCTUATION,  // (, ), , ; 
  COMMENT,      // -- comment, /* block */
  WHITESPACE    // spaces, newlines
}

interface Token {
  type: TokenType
  value: string
  line: number
  column: number
}

function tokenize(sql: string): Token[] {
  const tokens: Token[] = []
  let pos = 0
  let line = 1
  let column = 1
  
  const KEYWORDS = new Set([
    'SELECT', 'FROM', 'WHERE', 'JOIN', 'LEFT', 'RIGHT', 
    'INNER', 'ON', 'GROUP', 'BY', 'ORDER', 'HAVING', 'LIMIT'
  ])
  
  while (pos < sql.length) {
    // Skip whitespace
    if (/\s/.test(sql[pos])) {
      let start = pos
      while (pos < sql.length && /\s/.test(sql[pos])) {
        if (sql[pos] === '\n') { line++; column = 1 }
        else column++
        pos++
      }
      tokens.push({ type: TokenType.WHITESPACE, value: sql.slice(start, pos), line, column })
      continue
    }
    
    // String literals
    if (sql[pos] === "'" || sql[pos] === '"') {
      const quote = sql[pos]
      const start = pos++
      while (pos < sql.length && sql[pos] !== quote) {
        if (sql[pos] === '\\') pos++ // Skip escape
        pos++
      }
      pos++ // End quote
      tokens.push({ type: TokenType.STRING, value: sql.slice(start, pos), line, column })
      column += pos - start
      continue
    }
    
    // Identifiers or keywords
    if (/[a-zA-Z_]/.test(sql[pos])) {
      const start = pos
      while (pos < sql.length && /[a-zA-Z0-9_]/.test(sql[pos])) pos++
      const value = sql.slice(start, pos)
      const isKeyword = KEYWORDS.has(value.toUpperCase())
      tokens.push({ 
        type: isKeyword ? TokenType.KEYWORD : TokenType.IDENTIFIER, 
        value, line, column 
      })
      column += pos - start
      continue
    }
    
    // Other characters (operators, punctuation)
    tokens.push({ type: TokenType.PUNCTUATION, value: sql[pos], line, column })
    column++
    pos++
  }
  
  return tokens
}

With a token stream, formatting becomes straightforward:

function formatFromTokens(tokens: Token[], indentSize = 2): string {
  const lines: string[] = []
  let currentLine: string[] = []
  let indent = 0
  const indentStr = ' '.repeat(indentSize)
  
  const NEWLINE_KEYWORDS = new Set([
    'SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY', 
    'HAVING', 'LIMIT', 'JOIN', 'LEFT JOIN', 'RIGHT JOIN'
  ])
  
  for (const token of tokens) {
    // Skip whitespace tokens - we control newlines ourselves
    if (token.type === TokenType.WHITESPACE) continue
    
    // Newline on keywords
    if (token.type === TokenType.KEYWORD && NEWLINE_KEYWORDS.has(token.value.toUpperCase())) {
      if (currentLine.length > 0) {
        lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
        currentLine = []
      }
      currentLine.push(token.value.toUpperCase())
      continue
    }
    
    // Left paren increases indent
    if (token.value === '(') {
      currentLine.push('(')
      lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
      currentLine = []
      indent++
      continue
    }
    
    // Right paren decreases indent
    if (token.value === ')') {
      if (currentLine.length > 0) {
        lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
        currentLine = []
      }
      indent--
      currentLine.push(')')
      continue
    }
    
    currentLine.push(token.value)
  }
  
  // Last line
  if (currentLine.length > 0) {
    lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
  }
  
  return lines.join('\n')
}

Approach 3: AST Parsing (Most Accurate)#

For complex SQL (stored procedures, window functions, CTEs), use a mature parser:

  • sql-formatter (npm): Supports multiple dialects, flexible configuration
  • prettier-plugin-sql: Integrates with Prettier ecosystem
  • node-sql-parser: Generates complete AST for complex analysis
import { format } from 'sql-formatter'

const formatted = format('SELECT * FROM users WHERE id=1', {
  language: 'mysql',
  indent: '  ',
  uppercase: true,
  linesBetweenQueries: 2
})

Output:

SELECT
  *
FROM
  users
WHERE
  id = 1

Keyword Case Handling#

Team conventions vary - some prefer uppercase SELECT, others lowercase. We support three modes:

function applyKeywordCase(sql: string, mode: 'upper' | 'lower' | 'original'): string {
  if (mode === 'original') return sql
  
  const KEYWORDS = ['SELECT', 'FROM', 'WHERE', 'JOIN', 'LEFT', 'RIGHT']
  
  KEYWORDS.forEach(keyword => {
    const regex = new RegExp(`\\b${keyword}\\b`, 'gi')
    sql = sql.replace(regex, mode === 'upper' ? keyword : keyword.toLowerCase())
  })
  
  return sql
}

Use \b word boundary to avoid replacing selection with SELECTION.

SQL Compression#

The reverse operation - compress SQL to a single line:

function compressSql(sql: string): string {
  return sql
    .replace(/\/\*[\s\S]*?\*\//g, ' ')  // Remove multi-line comments
    .replace(/--.*$/gm, ' ')             // Remove single-line comments
    .replace(/\s+/g, ' ')                // Merge whitespace
    .trim()
}

Before:

-- Query active users
SELECT 
  id,
  name
FROM users
WHERE status = 'active'

After:

SELECT id, name FROM users WHERE status = 'active'

Performance Optimization#

For large SQL (several KB), key optimizations:

  1. Avoid regex recompilation: Cache regex objects
  2. Single pass: Complete all recognition in one tokenization pass
  3. Lazy formatting: Debounce 300ms before processing
const debouncedFormat = useMemo(
  () => debounce((sql: string) => {
    const tokens = tokenize(sql)
    const formatted = formatFromTokens(tokens)
    setOutput(formatted)
  }, 300),
  []
)

Final Result#

Based on these approaches, I built an online tool: SQL Formatter

Features:

  • Format / Compress
  • Keyword uppercase / lowercase / preserve original
  • Indent with 2/4/8 spaces
  • Complex SQL support (subqueries, CASE WHEN)

The core algorithm progresses from regex replacement to token stream to AST parsing - choose the right approach for your needs. Simple formatting works with regex; complex scenarios need a parser.


Related tools: JSON Formatter | Code Formatter