SQL Formatter: From Regex to AST Parser Implementation
SQL Formatter: From Regex to AST Parser Implementation#
Recently during a code review, I came across a SQL statement compressed into a single line - hundreds of characters strung together, making it painful to read. I looked for an online tool to format it, but the output style didn’t match our team’s conventions. I decided to implement my own and document the approach.
Why SQL Formatting is Hard#
SQL formatting is much more complex than JSON formatting. JSON has JSON.stringify(obj, null, 2) - one line and you’re done. SQL has no native API, so you need to parse it yourself.
The core challenges:
- Keyword identification: SELECT, FROM, WHERE need to be on separate lines
- Indentation levels: Subqueries, CASE WHEN, nested parentheses all need proper indentation
- String handling: Keywords inside strings shouldn’t be recognized
- Comment preservation: Single-line
--and multi-line/* */comments should be preserved
Approach 1: Regex Replacement (Quick and Dirty)#
The fastest implementation, works for 80% of cases:
function formatSqlSimple(sql: string): string {
// First compress whitespace
let formatted = sql
.replace(/\s+/g, ' ')
.replace(/\s*,\s*/g, ', ')
.replace(/\s*\(\s*/g, ' (')
.replace(/\s*\)\s*/g, ') ')
// Newlines after keywords
const keywords = [
'SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY',
'HAVING', 'LIMIT', 'JOIN', 'LEFT JOIN', 'RIGHT JOIN'
]
keywords.forEach(keyword => {
const regex = new RegExp(`\\b${keyword}\\b`, 'gi')
formatted = formatted.replace(regex, '\n' + keyword)
})
return formatted
}
Input:
select id,name,email from users where status='active' order by id desc limit 10
Output:
SELECT id, name, email
FROM users
WHERE status='active'
ORDER BY id DESC
LIMIT 10
Fast, but here are the problems:
Problem 1: Keywords inside strings get misidentified#
SELECT 'This is FROM test' FROM users
Becomes:
SELECT 'This is
FROM test'
FROM users
The FROM inside the string also gets a newline.
Problem 2: Subquery indentation is messy#
SELECT * FROM (SELECT id FROM users) AS t
Subqueries should be indented, but regex can’t handle nested levels.
Approach 2: Token Stream + State Machine#
A more reliable approach is to first tokenize the SQL, then process by token type:
enum TokenType {
KEYWORD, // SELECT, FROM, WHERE
IDENTIFIER, // table_name, column_name
STRING, // 'hello', "world"
NUMBER, // 123, 45.67
OPERATOR, // =, <, >, AND, OR
PUNCTUATION, // (, ), , ;
COMMENT, // -- comment, /* block */
WHITESPACE // spaces, newlines
}
interface Token {
type: TokenType
value: string
line: number
column: number
}
function tokenize(sql: string): Token[] {
const tokens: Token[] = []
let pos = 0
let line = 1
let column = 1
const KEYWORDS = new Set([
'SELECT', 'FROM', 'WHERE', 'JOIN', 'LEFT', 'RIGHT',
'INNER', 'ON', 'GROUP', 'BY', 'ORDER', 'HAVING', 'LIMIT'
])
while (pos < sql.length) {
// Skip whitespace
if (/\s/.test(sql[pos])) {
let start = pos
while (pos < sql.length && /\s/.test(sql[pos])) {
if (sql[pos] === '\n') { line++; column = 1 }
else column++
pos++
}
tokens.push({ type: TokenType.WHITESPACE, value: sql.slice(start, pos), line, column })
continue
}
// String literals
if (sql[pos] === "'" || sql[pos] === '"') {
const quote = sql[pos]
const start = pos++
while (pos < sql.length && sql[pos] !== quote) {
if (sql[pos] === '\\') pos++ // Skip escape
pos++
}
pos++ // End quote
tokens.push({ type: TokenType.STRING, value: sql.slice(start, pos), line, column })
column += pos - start
continue
}
// Identifiers or keywords
if (/[a-zA-Z_]/.test(sql[pos])) {
const start = pos
while (pos < sql.length && /[a-zA-Z0-9_]/.test(sql[pos])) pos++
const value = sql.slice(start, pos)
const isKeyword = KEYWORDS.has(value.toUpperCase())
tokens.push({
type: isKeyword ? TokenType.KEYWORD : TokenType.IDENTIFIER,
value, line, column
})
column += pos - start
continue
}
// Other characters (operators, punctuation)
tokens.push({ type: TokenType.PUNCTUATION, value: sql[pos], line, column })
column++
pos++
}
return tokens
}
With a token stream, formatting becomes straightforward:
function formatFromTokens(tokens: Token[], indentSize = 2): string {
const lines: string[] = []
let currentLine: string[] = []
let indent = 0
const indentStr = ' '.repeat(indentSize)
const NEWLINE_KEYWORDS = new Set([
'SELECT', 'FROM', 'WHERE', 'GROUP BY', 'ORDER BY',
'HAVING', 'LIMIT', 'JOIN', 'LEFT JOIN', 'RIGHT JOIN'
])
for (const token of tokens) {
// Skip whitespace tokens - we control newlines ourselves
if (token.type === TokenType.WHITESPACE) continue
// Newline on keywords
if (token.type === TokenType.KEYWORD && NEWLINE_KEYWORDS.has(token.value.toUpperCase())) {
if (currentLine.length > 0) {
lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
currentLine = []
}
currentLine.push(token.value.toUpperCase())
continue
}
// Left paren increases indent
if (token.value === '(') {
currentLine.push('(')
lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
currentLine = []
indent++
continue
}
// Right paren decreases indent
if (token.value === ')') {
if (currentLine.length > 0) {
lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
currentLine = []
}
indent--
currentLine.push(')')
continue
}
currentLine.push(token.value)
}
// Last line
if (currentLine.length > 0) {
lines.push(indentStr.repeat(indent) + currentLine.join(' ').trim())
}
return lines.join('\n')
}
Approach 3: AST Parsing (Most Accurate)#
For complex SQL (stored procedures, window functions, CTEs), use a mature parser:
- sql-formatter (npm): Supports multiple dialects, flexible configuration
- prettier-plugin-sql: Integrates with Prettier ecosystem
- node-sql-parser: Generates complete AST for complex analysis
import { format } from 'sql-formatter'
const formatted = format('SELECT * FROM users WHERE id=1', {
language: 'mysql',
indent: ' ',
uppercase: true,
linesBetweenQueries: 2
})
Output:
SELECT
*
FROM
users
WHERE
id = 1
Keyword Case Handling#
Team conventions vary - some prefer uppercase SELECT, others lowercase. We support three modes:
function applyKeywordCase(sql: string, mode: 'upper' | 'lower' | 'original'): string {
if (mode === 'original') return sql
const KEYWORDS = ['SELECT', 'FROM', 'WHERE', 'JOIN', 'LEFT', 'RIGHT']
KEYWORDS.forEach(keyword => {
const regex = new RegExp(`\\b${keyword}\\b`, 'gi')
sql = sql.replace(regex, mode === 'upper' ? keyword : keyword.toLowerCase())
})
return sql
}
Use \b word boundary to avoid replacing selection with SELECTION.
SQL Compression#
The reverse operation - compress SQL to a single line:
function compressSql(sql: string): string {
return sql
.replace(/\/\*[\s\S]*?\*\//g, ' ') // Remove multi-line comments
.replace(/--.*$/gm, ' ') // Remove single-line comments
.replace(/\s+/g, ' ') // Merge whitespace
.trim()
}
Before:
-- Query active users
SELECT
id,
name
FROM users
WHERE status = 'active'
After:
SELECT id, name FROM users WHERE status = 'active'
Performance Optimization#
For large SQL (several KB), key optimizations:
- Avoid regex recompilation: Cache regex objects
- Single pass: Complete all recognition in one tokenization pass
- Lazy formatting: Debounce 300ms before processing
const debouncedFormat = useMemo(
() => debounce((sql: string) => {
const tokens = tokenize(sql)
const formatted = formatFromTokens(tokens)
setOutput(formatted)
}, 300),
[]
)
Final Result#
Based on these approaches, I built an online tool: SQL Formatter
Features:
- Format / Compress
- Keyword uppercase / lowercase / preserve original
- Indent with 2/4/8 spaces
- Complex SQL support (subqueries, CASE WHEN)
The core algorithm progresses from regex replacement to token stream to AST parsing - choose the right approach for your needs. Simple formatting works with regex; complex scenarios need a parser.
Related tools: JSON Formatter | Code Formatter