HTML Formatter Implementation: From String Parsing to Indentation Reconstruction
Tool link: https://jsokit.com/tools/html-format
Introduction
In web development, the readability of HTML code directly impacts team collaboration efficiency. Poorly indented code gives headaches to future maintainers. There are many HTML formatters on the market, but how do they work under the hood? This article dives deep into the core algorithms and implementation details of HTML formatting.
Core Principle: Two-Phase Processing
HTML formatting is essentially a “normalize → reconstruct” process:
- Minification Phase: Remove all unnecessary whitespace between tags (>\s+< → ><) to get a compact HTML string
- Reconstruction Phase: Traverse the string and add indentation and line breaks
Why compress first? Because input sources vary—it could be already formatted code with extra line breaks, or a single-line minified string. Compressing first ensures consistent output.
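A minimal sketch of why this normalization step matters. `collapseBetweenTags` is a hypothetical helper name that isolates the same regex; the sample inputs are made up for illustration. Two differently formatted inputs collapse to the same string, so the reconstruction phase always starts from the same state.

```typescript
// Hypothetical helper: the whitespace-collapsing step, isolated.
function collapseBetweenTags(html: string): string {
  return html.replace(/>\s+</g, '><').trim()
}

// The same markup, once pretty-printed and once on a single line.
const pretty = `
<ul>
    <li>One</li>
    <li>Two</li>
</ul>`
const oneLine = '<ul><li>One</li><li>Two</li></ul>'

// Both inputs normalize to the same string.
console.log(collapseBetweenTags(pretty) === collapseBetweenTags(oneLine)) // true
```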
Core Algorithm: State Machine Traversal
The most intuitive implementation is a state machine that processes character by character:
function beautifyHtml(html: string, spaces: number): string {
  let formatted = ''
  let indentLevel = 0
  const indentStr = ' '.repeat(spaces)

  // Preprocessing: remove whitespace between tags
  html = html.replace(/>\s+</g, '><').trim()

  // Set of void elements (they never have content or a closing tag)
  const voidElements = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param',
    'source', 'track', 'wbr'
  ])

  let i = 0
  while (i < html.length) {
    if (html[i] === '<') {
      if (html[i + 1] === '/') {
        // Closing tag: decrease indent first, then output
        indentLevel = Math.max(0, indentLevel - 1)
        formatted += '\n' + indentStr.repeat(indentLevel)
        // Copy entire tag
        while (i < html.length && html[i] !== '>') {
          formatted += html[i++]
        }
        formatted += '>'
        i++
      } else if (html[i + 1] === '!') {
        // Comment or DOCTYPE: maintain current indent
        formatted += '\n' + indentStr.repeat(indentLevel)
        if (html.startsWith('<!--', i)) {
          // A comment may contain '>', so copy through the closing '-->'
          const close = html.indexOf('-->', i + 4)
          const end = close === -1 ? html.length : close + 3
          formatted += html.slice(i, end)
          i = end
        } else {
          while (i < html.length && html[i] !== '>') {
            formatted += html[i++]
          }
          formatted += '>'
          i++
        }
      } else {
        // Opening tag: output first, then increase indent
        if (formatted.length > 0) {
          formatted += '\n' + indentStr.repeat(indentLevel)
        }
        // Extract tag name (stops at whitespace, '/', or '>', so <br/> yields "br")
        let tagName = ''
        let j = i + 1
        while (j < html.length && !/[\s\/>]/.test(html[j])) {
          tagName += html[j++].toLowerCase()
        }
        // Copy entire tag
        while (i < html.length && html[i] !== '>') {
          formatted += html[i++]
        }
        // A trailing '/' marks a self-closing tag such as <br/>
        const selfClosing = html[i - 1] === '/'
        formatted += '>'
        i++
        // Only increase indent for elements that can have children
        if (!voidElements.has(tagName) && !selfClosing) {
          indentLevel++
        }
      }
    } else {
      // Text content
      let text = ''
      while (i < html.length && html[i] !== '<') {
        text += html[i++]
      }
      text = text.trim()
      if (text) {
        formatted += text
      }
    }
  }
  return formatted.trim()
}
Key Insights
1. Handling Void Elements
Void elements such as <br> and <img> never have content or a closing tag, so they must not increase the indentation level. The formatter above lists 14 of them and stores them in a Set for fast lookup:
const voidElements = new Set(['br', 'img', 'input', 'meta', 'link', ...])
2. Indentation Timing
- Opening tag: Output first, then indentLevel++ (except void elements)
- Closing tag: indentLevel-- first, then output
This order ensures correct parent-child indentation relationships.
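The ordering can be seen in isolation with a small sketch. `indentTags` is a hypothetical function that takes tags already split into an array and, for brevity, assumes there are no void elements or text nodes in the input.

```typescript
// Hypothetical sketch: indent a list of pre-split tags to show
// when the level changes relative to when the tag is printed.
function indentTags(tags: string[], indent = '  '): string {
  let level = 0
  const lines: string[] = []
  for (const tag of tags) {
    if (tag.startsWith('</')) {
      level = Math.max(0, level - 1)          // closing tag: decrease first
      lines.push(indent.repeat(level) + tag)  // then output
    } else {
      lines.push(indent.repeat(level) + tag)  // opening tag: output first
      level++                                 // then increase
    }
  }
  return lines.join('\n')
}

console.log(indentTags(['<div>', '<p>', '</p>', '</div>']))
// <div>
//   <p>
//   </p>
// </div>
```

Swapping the two steps in either branch would print a closing tag one level deeper than its opening tag, or an opening tag one level deeper than its siblings.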
3. Text Content Handling
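Text between tags is passed through trim() so that line breaks and indentation from the original source do not leak into the output. The trade-off is that spaces next to inline tags are dropped as well, which the Nested Inline Tags case below addresses.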
Minification Algorithm: Regex to the Rescue
Compared to formatting, HTML minification is much simpler. The core idea is removing all unnecessary whitespace:
function minifyHtml(html: string): string {
  return html
    .replace(/<!--[\s\S]*?-->/g, '') // Remove comments
    .replace(/>\s+</g, '><') // Remove whitespace between tags
    .replace(/\s+/g, ' ') // Merge consecutive whitespace
    // Whitespace next to a single < or > is left alone: in
    // "Hello <b>World</b>" the space before <b> is rendered text
    .trim()
}
Why Not Remove All Whitespace?
Minification can’t break content. For example, the space in <p>Hello World</p> must be preserved. The regex >\s+< only removes whitespace between tags, not inside tag content.
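A quick check of this rule, using a made-up sample string. It compares the between-tags regex with a more aggressive pattern that strips whitespace around every `<`, to show where rendered text would change.

```typescript
const sample = '<p>Hello <b>World</b></p>\n  <p>Next</p>'

// Only the gap between "</p>" and "<p>" is removed; spaces in text survive.
const safe = sample.replace(/>\s+</g, '><')
console.log(safe) // <p>Hello <b>World</b></p><p>Next</p>

// Stripping whitespace around every "<" also eats the space before <b>,
// so the rendered text becomes "HelloWorld".
const unsafe = sample.replace(/\s*<\s*/g, '<')
console.log(unsafe) // <p>Hello<b>World</b></p><p>Next</p>
```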
Edge Cases and Pitfalls
1. Nested Inline Tags
<!-- Before formatting -->
<p>This is <strong>bold</strong> text</p>
<!-- Incorrect formatting -->
<p>
  This is
  <strong>bold</strong>
  text
</p>
<!-- Correct handling: inline tags don't break lines -->
<p>This is <strong>bold</strong> text</p>
Solution: Identify inline tags (span, strong, em, a, etc.) and avoid line breaks for them. This requires extending tag classification logic.
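One way to sketch the classification step. The tag list is an assumed, non-exhaustive selection, and `tagNameOf` and `startsNewLine` are hypothetical helper names rather than part of the formatter shown earlier.

```typescript
// Assumed, non-exhaustive list of inline-level tags; extend as needed.
const inlineTags = new Set(['a', 'span', 'strong', 'em', 'b', 'i', 'code', 'small'])

// Extract the lowercase tag name from a tag such as "<strong>" or "</a>".
function tagNameOf(tag: string): string {
  const match = /^<\/?([a-zA-Z][a-zA-Z0-9-]*)/.exec(tag)
  return match ? match[1].toLowerCase() : ''
}

// A formatter would emit "\n" plus indentation only when this returns true.
function startsNewLine(tag: string): boolean {
  return !inlineTags.has(tagNameOf(tag))
}

console.log(startsNewLine('<p>'))                 // true
console.log(startsNewLine('<strong>'))            // false
console.log(startsNewLine('</strong>'))           // false
console.log(startsNewLine('<A href="/docs">'))    // false
```

A formatter that uses this check would also need to stop trimming the text around inline tags, otherwise the spaces in "This is <strong>bold</strong> text" are still lost.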
2. <pre> and <code> Tags
Whitespace inside these tags must be preserved exactly. The correct approach is extracting their content before minification, then restoring after formatting.
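The extract-and-restore idea could look roughly like this. `protectBlocks` and `restoreBlocks` are hypothetical helpers, and the `@@BLOCK_n@@` placeholder is an assumption: it must not occur in the real input. The regex does not handle a protected tag nested inside the same tag.

```typescript
// Hypothetical helper: swap whitespace-sensitive blocks for placeholders.
function protectBlocks(html: string, tags: string[] = ['pre', 'code']) {
  const saved: string[] = []
  // Matches <pre ...>...</pre> (or <code>) lazily, case-insensitively.
  const pattern = new RegExp(`<(${tags.join('|')})\\b[^>]*>[\\s\\S]*?</\\1>`, 'gi')
  const masked = html.replace(pattern, (block) => {
    saved.push(block)
    return `@@BLOCK_${saved.length - 1}@@`
  })
  return { masked, saved }
}

// Hypothetical helper: put the original blocks back, byte for byte.
function restoreBlocks(html: string, saved: string[]): string {
  return html.replace(/@@BLOCK_(\d+)@@/g, (_, index) => saved[Number(index)])
}

const input = '<div>\n  <pre>line 1\n    line 2</pre>\n</div>'
const { masked, saved } = protectBlocks(input)
// Any whitespace processing runs on the masked string only.
const collapsed = masked.replace(/>\s+</g, '><').replace(/\s+/g, ' ')
const restored = restoreBlocks(collapsed, saved)
console.log(restored.includes('<pre>line 1\n    line 2</pre>')) // true
```

Using a replacer function in `restoreBlocks` avoids having `$` sequences in the saved content interpreted as replacement patterns.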
3. Quotes in Attributes
<div class="container" data-value="hello world">
Spaces in attribute values can’t be removed. The state machine needs to correctly identify content within quotes to avoid accidental deletion.
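A minimal sketch of quote tracking, assuming attribute values use single or double quotes. `findTagEnd` is a hypothetical helper that could replace the "copy until `>`" loops in a character scanner.

```typescript
// Hypothetical helper: index of the ">" that closes the tag starting at
// `start`, skipping any ">" inside quoted attribute values.
function findTagEnd(html: string, start: number): number {
  let quote: string | null = null
  for (let i = start; i < html.length; i++) {
    const ch = html[i]
    if (quote) {
      if (ch === quote) quote = null // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch // opening quote
    } else if (ch === '>') {
      return i
    }
  }
  return -1 // unterminated tag
}

const tag = '<a title="a > b" href="#">link</a>'
console.log(tag.slice(0, findTagEnd(tag, 0) + 1)) // <a title="a > b" href="#">
```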
Performance Optimization: Avoid Regex Backtracking
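When HTML files are large (e.g., 1MB+), regex-based processing can become a bottleneck. Patterns with nested quantifiers, such as (a+)+, can trigger catastrophic backtracking, and even a simple pattern can degrade to quadratic time on malformed input. For example:
// Usually linear, but quadratic on input with many unclosed "<":
// every failed match attempt scans to the end of the string
html.replace(/<[^>]*>/g, ...)
// Safer approach: a single character-by-character pass, as in the state machine above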
For extremely large HTML files, streaming is recommended—process while reading, with constant memory usage.
Real-World Use Cases
1. Development Phase
- Formatting compressed HTML from backend APIs to view structure
- Unifying indentation style when copying code snippets
2. Production Environment
- HTML minification can reduce file size by roughly 15-25%, depending on how much whitespace and how many comments the source contains
- Combined with Gzip compression, total transfer size can decrease by roughly 60-70%
3. SEO Optimization
Although search engines don’t care about code indentation, clean code helps with:
- Quickly locating meta tags and structured data
- Checking semantic tag usage
- Troubleshooting unclosed tags
Alternative: AST Parser
While the state machine approach is simple, it fails on complex cases (like <script> tags containing < characters in strings). Industrial-grade solutions use AST parsers:
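import { parseDocument } from 'htmlparser2'
import render from 'dom-serializer'

function formatWithAST(html: string): string {
  // parseDocument builds a DOM tree; htmlparser2 has no `parse` export
  const doc = parseDocument(html, {
    lowerCaseTags: true,
    recognizeSelfClosing: true
  })
  // dom-serializer writes the tree back out as-is. Indentation still has to
  // be added by walking doc.children, or by a dedicated formatter.
  return render(doc)
}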
Pros: A real parser correctly handles quoted attributes, comments, and <script> content, all of which trip up the character scanner
Cons: Depends on third-party library (~150KB), slightly lower performance
Conclusion
The core of an HTML formatter is the two-phase “compress + reconstruct” process. The state machine implementation is simple and efficient, and suitable for most scenarios; AST parsers offer higher accuracy for production environments. Understanding these underlying principles helps us quickly implement similar features and better comprehend the frontend toolchain.
Try the online tool: https://jsokit.com/tools/html-format to experience one-click HTML formatting.
Related Tools
- JSON Formatter - JSON code beautification and minification
- CSS Formatter - CSS code formatting and optimization
- JavaScript Formatter - JS/JSX code beautification