XML Formatter: Building a Parser with State Machine
Dealing with minified XML from third-party APIs is painful. Most online tools are either bloated or can’t handle special nodes like CDATA and comments. So I built my own. Here’s how it works.
The Core: No Native API for XML
Unlike JSON, where JSON.parse + JSON.stringify does the job in two lines, XML has no native formatting API. You have to parse the string yourself, identify tags, attributes, and text content, then reformat.
The core is a state machine: scan character by character and jump to the right branch based on the current character.
State Machine Implementation
function beautifyXml(xml: string, spaces: number): string {
  let formatted = ''
  let indentLevel = 0
  // True right after text or CDATA, so <tag>text</tag> stays on one line
  let afterInlineContent = false
  const indentStr = ' '.repeat(spaces)

  // No global />\s+</ pre-pass here: it would also rewrite whitespace inside
  // CDATA and comments. Whitespace-only text is dropped in the text branch.

  // Append xml from `start` through the next `terminator` (inclusive) and
  // return the index just past it. Unterminated input is copied to the end.
  const copyThrough = (start: number, terminator: string): number => {
    const found = xml.indexOf(terminator, start)
    const end = found === -1 ? xml.length : found + terminator.length
    formatted += xml.slice(start, end)
    return end
  }

  let i = 0
  while (i < xml.length) {
    if (xml[i] === '<') {
      if (xml[i + 1] === '/') {
        // Closing tag: decrease the indent level before output
        indentLevel = Math.max(0, indentLevel - 1)
        if (!afterInlineContent) {
          formatted += '\n' + indentStr.repeat(indentLevel)
        }
        i = copyThrough(i, '>')
        afterInlineContent = false
      } else if (xml[i + 1] === '?') {
        // XML declaration or processing instruction: <?xml version="1.0"?>
        i = copyThrough(i, '?>')
      } else if (xml.startsWith('<![CDATA[', i)) {
        // CDATA block: preserve content as-is
        i = copyThrough(i, ']]>')
        afterInlineContent = true
      } else if (xml.startsWith('<!--', i)) {
        // Comment: may contain '>', so scan to the full '-->' terminator
        i = copyThrough(i, '-->')
      } else if (xml[i + 1] === '!') {
        // Other declarations such as <!DOCTYPE ...> (internal subsets not handled)
        i = copyThrough(i, '>')
      } else {
        // Opening tag: output at the current level, then indent
        formatted += '\n' + indentStr.repeat(indentLevel)
        const start = i
        i = copyThrough(i, '>')
        // Self-closing <tag/> does not open a new level
        const isSelfClosing = xml.slice(start, i).endsWith('/>')
        if (!isSelfClosing) {
          indentLevel++
        }
        afterInlineContent = false
      }
    } else {
      // Text content: trim it and drop whitespace-only nodes
      const next = xml.indexOf('<', i)
      const end = next === -1 ? xml.length : next
      const text = xml.slice(i, end).trim()
      i = end
      if (text) {
        formatted += text
        afterInlineContent = true
      }
    }
  }
  return formatted.trim()
}
Key points:
- Closing tag: indent first: When seeing </tag>, decrease the indent level before outputting
- Self-closing tags don’t increase indent: <tag/> doesn’t create a new level
- CDATA preserved: <![CDATA[...]]> content must stay untouched, since it may contain special characters
XML Minification: Regex Does It
Minification is simpler - remove comments, collapse whitespace:
function minifyXml(xml: string): string {
  return xml
    .replace(/<!--[\s\S]*?-->/g, '') // Remove comments
    .replace(/>\s+</g, '><') // Remove whitespace between tags
    .replace(/\s+/g, ' ') // Collapse consecutive whitespace
    .replace(/\s*>\s*/g, '>') // Remove whitespace around >
    .replace(/\s*</g, '<') // Remove whitespace before <
    .trim()
}
Note: Comments use [\s\S]*? non-greedy matching since comments can span lines. These regexes run over the whole string, so whitespace inside text, attribute values, and CDATA sections is collapsed as well.
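Since the regex chain also touches CDATA, one possible workaround is to split the document on CDATA sections first and minify only the parts in between. The sketch below assumes well-formed input; minifyXmlKeepCdata is a hypothetical name, not the tool's actual code, and a comment that itself contains a CDATA-looking string would still confuse it.

```typescript
// Hypothetical CDATA-aware variant of minifyXml (a sketch, not the tool's code).
function minifyXmlKeepCdata(xml: string): string {
  // split() with a capture group keeps the CDATA sections in the result:
  // even indices are ordinary markup, odd indices are CDATA sections.
  const parts: string[] = xml.split(/(<!\[CDATA\[[\s\S]*?\]\]>)/)
  return parts
    .map((part: string, index: number): string => {
      if (index % 2 === 1) return part // CDATA: pass through untouched
      return part
        .replace(/<!--[\s\S]*?-->/g, '') // Remove comments
        .replace(/>\s+</g, '><') // Remove whitespace between tags
        .replace(/\s+/g, ' ') // Collapse consecutive whitespace
        .replace(/\s*>\s*/g, '>') // Remove whitespace around >
        .replace(/\s*</g, '<') // Remove whitespace before <
    })
    .join('')
    .trim()
}

const sample = '<a>\n  <!-- note -->\n  <b><![CDATA[ x   <y>  ]]></b>\n</a>'
console.log(minifyXmlKeepCdata(sample))
// <a><b><![CDATA[ x   <y>  ]]></b></a>
```

The comment is gone and the markup is collapsed, while the spaces inside the CDATA section survive.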
Edge Cases I Hit
1. Don’t Format Inside CDATA
CDATA content is literal - formatting breaks data:
<!-- Original -->
<msg><![CDATA[<script>alert('xss')</script>]]></msg>
<!-- Wrong formatting -->
<msg>
  <![CDATA[
    <script>alert('xss')</script>
  ]]>
</msg>
Newlines and indentation inside CDATA become part of the data. So scan from <![CDATA[ straight to ]]>, outputting everything untouched.
2. Quotes in Attribute Values
Attribute values are quoted, and a quoted value can legally contain the other quote character or a literal >:
<node attr='He said "hello"' note="a > b" />
So scanning tag content can’t just look for the next >; it has to track whether it is inside a quoted value. The implementation above simplifies this; for production, consider DOMParser.
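A quote-aware scan might look like the following minimal sketch. findTagEnd is an illustrative helper name, not part of the formatter above, and it assumes the tag is not a comment or CDATA section.

```typescript
// Illustrative helper: index of the '>' that ends the tag starting at `start`,
// skipping any '>' inside a quoted attribute value. Returns -1 if not found.
function findTagEnd(xml: string, start: number): number {
  let quote = '' // the currently open quote character, or '' when outside quotes
  for (let i = start; i < xml.length; i++) {
    const ch = xml.charAt(i)
    if (quote !== '') {
      if (ch === quote) quote = '' // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch // opening quote
    } else if (ch === '>') {
      return i
    }
  }
  return -1
}

const tag = `<node attr='He said "hello"' note="a > b" />`
const naiveEnd = tag.indexOf('>') // stops at the '>' inside note="a > b"
const realEnd = findTagEnd(tag, 0) // stops at the '>' that closes the tag
console.log(naiveEnd < realEnd) // true
console.log(tag.slice(0, realEnd + 1) === tag) // true
```

Swapping a helper like this in for the plain search for > would let the scanner survive attribute values that contain >.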
3. XML Declaration Position
<?xml version="1.0"?> must be the very first thing in the document, with no preceding spaces or newlines. Formatting should keep it first.
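One way to guarantee this, sketched here under the assumption that any leading whitespace in the input is accidental, is to peel the declaration off before formatting and emit it first. splitDeclaration is a hypothetical helper name.

```typescript
// Illustrative helper: separate a leading XML declaration from the rest.
function splitDeclaration(xml: string): { declaration: string; body: string } {
  // Nothing may precede the declaration, so drop leading whitespace first
  const trimmed = xml.replace(/^\s+/, '')
  // Require whitespace after '<?xml' so '<?xml-stylesheet ...?>' is not matched
  const match = /^<\?xml\s[\s\S]*?\?>/.exec(trimmed)
  if (!match) return { declaration: '', body: trimmed }
  return { declaration: match[0], body: trimmed.slice(match[0].length) }
}

const result = splitDeclaration('\n  <?xml version="1.0"?>\n<root/>')
console.log(result.declaration) // <?xml version="1.0"?>
console.log(JSON.stringify(result.body)) // "\n<root/>"
```

A formatter could then write result.declaration on the first line and run the rest of the pipeline on result.body only.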
4. Whitespace Text Nodes
Newlines and spaces between tags are text nodes:
<parent>
  <child>text</child>
</parent>
Between <parent> and <child> there’s a newline+indent text node. Use trim() to filter these whitespace nodes during formatting.
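To make the filtering concrete, here is a small sketch that splits markup into tag and text chunks and drops the whitespace-only ones. significantChunks is an illustrative name, and the regex is a deliberate simplification that assumes no < or > inside attribute values, comments, or CDATA.

```typescript
// Illustrative: split markup into tag and text chunks, dropping whitespace-only text.
// Simplification: assumes no '<' or '>' inside attribute values, comments, or CDATA.
function significantChunks(xml: string): string[] {
  const chunks: string[] = xml.match(/<[^>]*>|[^<]+/g) || []
  return chunks
    .map((chunk: string) => chunk.trim())
    .filter((chunk: string) => chunk.length > 0)
}

const input = '<parent>\n  <child>text</child>\n</parent>'
console.log(significantChunks(input))
// [ '<parent>', '<child>', 'text', '</child>', '</parent>' ]
```

The two whitespace runs (after <parent> and after </child>) disappear, while the real text node survives.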
Why Not DOMParser?
The browser’s native DOMParser can parse XML, but it has drawbacks for this use case:
- HTML vs XML behavior: DOMParser parses according to the MIME type passed to parseFromString; with text/html instead of application/xml, some XML features (self-closing tags) behave differently
- Losing original format: DOMParser loses some original string info (attribute order, quote type)
- CDATA handling: CDATA content may not come back unchanged in some browsers
For simple formatting, string parsing works fine and gives more control over output format.
Performance Tips
For large XML files (several MB):
- Avoid string concatenation: Use array push + join instead of +=
- Batch processing: Write to the result after identifying complete tags to reduce intermediate strings
- Web Worker: Move parsing to a background thread to avoid blocking the UI
// Array-optimized version
const parts: string[] = []
parts.push('\n', indentStr.repeat(indentLevel))
// ...
return parts.join('')
The Result
Based on these ideas: XML Formatter
Features:
- Beautify / Minify XML
- Supports CDATA, comments, XML declaration
- Indent options (2/4/8 spaces)
- Shows compression ratio
Not complex, but handling the CDATA and self-closing tag edge cases takes some thought. Hope this helps.