XML Formatter: Building a Parser with a State Machine

Dealing with minified XML from third-party APIs is painful. Most online tools are either bloated or can’t handle special nodes like CDATA and comments. So I built my own. Here’s how it works.

The Core: No Native API for XML

Unlike JSON, where JSON.parse plus JSON.stringify does the job in two lines, XML has no native formatting API. You have to parse the string yourself, identify tags, attributes, and text content, then re-emit them with the indentation you want.
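
For contrast, this is roughly what the JSON round trip looks like. The function name beautifyJson and the sample input are made up for illustration.

```typescript
// Pretty-print JSON using only built-ins: parse, then re-serialize with an indent.
// `beautifyJson` is a hypothetical name, not part of any library.
function beautifyJson(json: string, spaces: number): string {
  return JSON.stringify(JSON.parse(json), null, spaces)
}

const pretty = beautifyJson('{"a":1,"b":[true,null]}', 2)
```

There is no equivalent pair of built-ins that takes an XML string and returns an indented one, which is why the rest of this post scans the string by hand.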

The core is a small state machine: scan the input character by character and branch on what starts at the current position (closing tag, declaration, comment or CDATA, opening tag, or text).

State Machine Implementation

function beautifyXml(xml: string, spaces: number): string {
  let formatted = ''
  let indentLevel = 0
  const indentStr = ' '.repeat(spaces)

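  // Trim only the outer whitespace. Whitespace-only text between tags is
  // dropped by the trim() in the text branch below; a global />\s+</ replace
  // here would also rewrite the contents of CDATA sections and comments.
  xml = xml.trim()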

  let i = 0
  while (i < xml.length) {
    if (xml[i] === '<') {
      if (xml[i + 1] === '/') {
        // Closing tag - indent before output
        indentLevel = Math.max(0, indentLevel - 1)
        formatted += '\n' + indentStr.repeat(indentLevel)
        while (i < xml.length && xml[i] !== '>') {
          formatted += xml[i++]
        }
        formatted += '>'
        i++
      } else if (xml[i + 1] === '?') {
        // XML declaration or processing instruction: ends at '?>', not at the first '>'
        const end = xml.indexOf('?>', i)
        const stop = end === -1 ? xml.length : end + 2
        formatted += xml.slice(i, stop)
        i = stop
      } else if (xml[i + 1] === '!') {
        // Comment or CDATA
        if (xml.substring(i, i + 9) === '<![CDATA[') {
          // CDATA block: preserve content as-is
          while (i < xml.length && xml.substring(i, i + 3) !== ']]>') {
            formatted += xml[i++]
          }
          formatted += ']]>'
          i += 3
        } else {
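          // Comment <!-- -->: the body may contain '>', so scan to '-->'.
          // Other <! constructs (e.g. DOCTYPE) still end at the first '>'.
          const terminator = xml.startsWith('<!--', i) ? '-->' : '>'
          const end = xml.indexOf(terminator, i)
          const stop = end === -1 ? xml.length : end + terminator.length
          formatted += xml.slice(i, stop)
          i = stop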
        }
      } else {
        // Opening tag - output before indenting
        formatted += '\n' + indentStr.repeat(indentLevel)

        // Check if self-closing <tag/>
        let j = i + 1
        let tagContent = ''
        while (j < xml.length && xml[j] !== '>') {
          tagContent += xml[j++]
        }
        const isSelfClosing = tagContent.endsWith('/')

        while (i < xml.length && xml[i] !== '>') {
          formatted += xml[i++]
        }
        formatted += '>'
        i++

        if (!isSelfClosing) {
          indentLevel++
        }
      }
    } else {
      // Text content
      let text = ''
      while (i < xml.length && xml[i] !== '<') {
        text += xml[i++]
      }
      text = text.trim()
      if (text) {
        formatted += text
      }
    }
  }

  return formatted.trim()
}

Key points:

Closing tags dedent first: on </tag>, decrease the indent level before writing the tag
Self-closing tags don’t increase indent: <tag/> doesn’t open a new level
CDATA is preserved: everything from <![CDATA[ to ]]> is copied untouched, since it may contain characters such as < and & that would otherwise be markup
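
The branch dispatch can also be looked at in isolation, without the indentation logic. Below is a minimal sketch of a tokenizer built on the same idea. The names tokenizeXml, readUntil, Token, and TokenKind are hypothetical, and the sketch ignores quoted '>' characters inside attribute values.

```typescript
type TokenKind = 'open' | 'close' | 'selfClose' | 'pi' | 'cdata' | 'comment' | 'decl' | 'text'
interface Token { kind: TokenKind; raw: string }

// Copy from `start` through the first `terminator`, or to the end of input if it is missing.
function readUntil(xml: string, start: number, terminator: string): string {
  const end = xml.indexOf(terminator, start)
  return end === -1 ? xml.slice(start) : xml.slice(start, end + terminator.length)
}

// Hypothetical tokenizer: one branch per construct, most specific prefix first.
function tokenizeXml(xml: string): Token[] {
  const tokens: Token[] = []
  let i = 0
  while (i < xml.length) {
    let kind: TokenKind
    let raw: string
    if (xml[i] !== '<') {
      // Text runs up to (not including) the next '<'
      const end = xml.indexOf('<', i)
      raw = end === -1 ? xml.slice(i) : xml.slice(i, end)
      kind = 'text'
    } else if (xml.startsWith('<![CDATA[', i)) {
      raw = readUntil(xml, i, ']]>')
      kind = 'cdata'
    } else if (xml.startsWith('<!--', i)) {
      raw = readUntil(xml, i, '-->')
      kind = 'comment'
    } else if (xml[i + 1] === '?') {
      raw = readUntil(xml, i, '?>')
      kind = 'pi'
    } else if (xml[i + 1] === '!') {
      raw = readUntil(xml, i, '>')
      kind = 'decl'
    } else {
      raw = readUntil(xml, i, '>')
      kind = xml[i + 1] === '/' ? 'close' : raw.endsWith('/>') ? 'selfClose' : 'open'
    }
    i += raw.length // raw is never empty, so the loop always advances
    // Whitespace-only text is dropped; everything else is kept verbatim
    if (kind !== 'text' || raw.trim() !== '') tokens.push({ kind, raw })
  }
  return tokens
}

const demoTokens = tokenizeXml('<a><b/><![CDATA[<x>]]></a>')
```

Splitting tokenizing from printing like this is one possible design; the single-pass function above does both at once and avoids the intermediate array.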

XML Minification: Regex Does It

Minification is simpler - remove comments, collapse whitespace:

function minifyXml(xml: string): string {
  return xml
    .replace(/<!--[\s\S]*?-->/g, '')  // Remove comments
    .replace(/>\s+</g, '><')          // Remove whitespace between tags
    .replace(/\s+/g, ' ')             // Collapse consecutive whitespace
    .replace(/\s*>\s*/g, '>')         // Remove whitespace around >
    .replace(/\s*</g, '<')            // Remove whitespace before <
    .trim()
}

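Note: the comment pattern uses [\s\S]*? because comments can span lines. [\s\S] matches newlines (a plain . does not by default), and the lazy *? stops at the first --> instead of the last one.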
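
One caveat: regexes applied to the whole string cannot tell markup from CDATA, so whitespace inside a CDATA section gets collapsed too. Here is a minimal sketch of one way around that, assuming it is acceptable to split the input on CDATA sections first. The name minifyXmlKeepCdata is hypothetical, and the sketch deliberately applies only the two most conservative rules (comments and whitespace between tags) outside CDATA.

```typescript
// Hypothetical helper: minify everything except CDATA sections.
function minifyXmlKeepCdata(xml: string): string {
  // split() with a capturing group keeps the matched CDATA sections in the
  // result array, at the odd indices.
  const parts = xml.split(/(<!\[CDATA\[[\s\S]*?\]\]>)/)
  return parts
    .map((part, index) =>
      index % 2 === 1
        ? part // CDATA section: copy verbatim
        : part
            .replace(/<!--[\s\S]*?-->/g, '') // remove comments
            .replace(/>\s+</g, '><')         // remove whitespace between tags
    )
    .join('')
    .trim()
}

const minified = minifyXmlKeepCdata('<a>\n  <msg><![CDATA[ keep   this ]]></msg>\n</a>')
```

Whitespace directly next to a CDATA section is left alone here, since it belongs to the same text content. A comment that itself contains the text <![CDATA[ would confuse the split, so this is a sketch rather than a complete solution.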

Edge Cases I Hit

1. Don’t Format Inside CDATA

CDATA content is literal - formatting breaks data:

<!-- Original -->
<msg><![CDATA[<script>alert('xss')</script>]]></msg>

<!-- Wrong formatting -->
<msg>
  <![CDATA[
    <script>alert('xss')</script>
  ]]>
</msg>

Newlines and indentation inside CDATA become part of the data. So scan from <![CDATA[ straight to ]]>, outputting everything untouched.
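
The "scan straight to ]]>" step can be written with indexOf and a single slice. This is a minimal sketch; readCdata is a hypothetical helper name, and treating an unterminated section as running to the end of the input is an assumption about the desired behavior.

```typescript
// Hypothetical helper: copy one CDATA section verbatim.
// `start` must point at the '<' of '<![CDATA['.
// Returns the raw section and the index of the first character after it.
function readCdata(xml: string, start: number): { raw: string; next: number } {
  const close = xml.indexOf(']]>', start)
  // An unterminated section runs to the end of the input; no ']]>' is invented.
  const next = close === -1 ? xml.length : close + 3
  return { raw: xml.slice(start, next), next }
}

const sample = "<msg><![CDATA[<script>alert('xss')</script>]]></msg>"
const section = readCdata(sample, sample.indexOf('<![CDATA['))
```

Because the section is taken as one slice, nothing between the delimiters is ever inspected, trimmed, or re-indented.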

2. Quotes in Attribute Values

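Attribute values can contain the other kind of quote, and they can legally contain a literal >:

<node title='He said "hello"' expr="a > b" />

So scanning a tag can’t just look for the next >. It has to track whether it is inside a quoted value and which quote character opened it. (A double quote inside a double-quoted value is not well-formed XML; it has to be written as &quot;.) The implementation above skips this; for production, consider DOMParser.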
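
A quote-aware scan is short to write. The following is a minimal sketch, not the code the tool uses; findTagEnd is a hypothetical helper name and the sample tag is made up.

```typescript
// Hypothetical helper: index of the '>' that closes the tag starting at `start`,
// skipping any '>' inside a quoted attribute value. Returns -1 if the tag never closes.
function findTagEnd(xml: string, start: number): number {
  let quote: string | null = null
  for (let i = start; i < xml.length; i++) {
    const ch = xml[i]
    if (quote !== null) {
      if (ch === quote) quote = null // closing quote of the current value
    } else if (ch === '"' || ch === "'") {
      quote = ch // opening quote: remember which kind
    } else if (ch === '>') {
      return i
    }
  }
  return -1
}

const tag = `<node title='He said "hello"' expr="a > b" /><next/>`
const tagEnd = findTagEnd(tag, 0)
```

In beautifyXml, a helper like this could replace the inner while loops that stop at the first '>' in the opening-tag branch.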

3. XML Declaration Position

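The declaration <?xml version="1.0"?>, when present, must be the very first thing in the document, with no preceding spaces or newlines. The formatter has to keep it on the first line and must not emit a newline or indent before it.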

4. Whitespace Text Nodes

Newlines and spaces between tags are text nodes:

<parent>
  <child>text</child>
</parent>

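Between <parent> and <child> there’s a newline-plus-indent text node. The formatter drops these by calling trim() on each text run and skipping empty results. Note that this also strips leading and trailing whitespace from real text, which matters for mixed content or elements marked xml:space="preserve".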

Why Not DOMParser?

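The browser’s native DOMParser can parse XML, but it has drawbacks for this job:

HTML vs XML behavior: the parser is chosen by the MIME type passed to parseFromString. With text/html, XML features such as self-closing tags behave differently; XML input needs application/xml or text/xml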
Losing original format: DOMParser loses some original string info (attribute order, quote type)
CDATA handling: Some browsers’ DOMParser changes CDATA content

For simple formatting, string parsing works fine and gives more control over output format.

Performance Tips

For large XML files (several MB):

Avoid repeated string concatenation: push pieces onto an array and join once instead of using += per character
Batch processing: find the end of a complete tag first, then append it in one piece, which cuts down on intermediate strings
Web Worker: move parsing to a background thread so the UI isn’t blocked

// Array-optimized version (fragment from inside beautifyXml)
const parts: string[] = []
parts.push('\n', indentStr.repeat(indentLevel))
// ...
return parts.join('')
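
To make the push-and-join idea concrete, here is a minimal self-contained sketch. It assumes the input has already been split into lines with a known depth; the Line type, renderLines, and the indent cache are hypothetical additions, not part of the function above. Whether this beats += depends on the engine and the input, so it is worth measuring on real files.

```typescript
interface Line { depth: number; text: string }

// Hypothetical helper: render lines with one join at the end.
function renderLines(lines: Line[], spaces: number): string {
  const unit = ' '.repeat(spaces)
  // indentCache[d] holds unit.repeat(d), built the first time depth d is seen
  const indentCache: string[] = []
  const parts: string[] = []
  for (const { depth, text } of lines) {
    if (indentCache[depth] === undefined) indentCache[depth] = unit.repeat(depth)
    parts.push(indentCache[depth] + text)
  }
  return parts.join('\n')
}

const rendered = renderLines(
  [
    { depth: 0, text: '<a>' },
    { depth: 1, text: '<b/>' },
    { depth: 0, text: '</a>' },
  ],
  2
)
```

Caching the indent strings avoids calling repeat() once per line on deeply nested documents.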

The Result

Based on these ideas: XML Formatter

Features:

Beautify / Minify XML
Supports CDATA, comments, XML declaration
Indent options (2/4/8 spaces)
Shows compression ratio
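
The post doesn’t say how the compression ratio is computed. One plausible definition, shown here purely as an assumption, is the percentage of characters removed by minification. compressionRatio is a hypothetical name.

```typescript
// Hypothetical helper: percentage of characters removed by minification.
// Assumption: sizes are string lengths (UTF-16 code units), not encoded bytes.
function compressionRatio(original: string, minified: string): number {
  if (original.length === 0) return 0 // avoid dividing by zero on empty input
  return (1 - minified.length / original.length) * 100
}

const saved = compressionRatio('<a>  <b/>  </a>', '<a><b/></a>')
```

For a byte-accurate figure on non-ASCII documents, the lengths would need to come from the encoded output rather than from the strings.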

It’s not complex, but handling the CDATA and self-closing-tag edge cases takes some thought. Hope this helps.