HTML Formatter Implementation: From String Parsing to Indentation Reconstruction

Tool link: https://jsokit.com/tools/html-format

Introduction

In web development, the readability of HTML code directly impacts team collaboration efficiency. Poorly indented code gives headaches to future maintainers. There are many HTML formatters on the market, but how do they work under the hood? This article dives deep into the core algorithms and implementation details of HTML formatting.

Core Principle: Two-Phase Processing

HTML formatting is essentially a “normalize → reconstruct” process:

  1. Minification Phase: Remove whitespace between tags (replace >\s+< with ><) to get a compact HTML string
  2. Reconstruction Phase: Traverse the string and add indentation and line breaks based on tag nesting

Why minify first? Because input varies: it may be already-formatted code with extra line breaks, or a single-line minified string. Normalizing first means both produce the same output.
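As a minimal sketch of why the first phase matters, the snippet below runs the same preprocessing regex on a pretty-printed input and a minified one. The helper name normalizeHtml is hypothetical and used only for illustration.

```typescript
// Hypothetical helper: the same preprocessing step the formatter applies first.
function normalizeHtml(html: string): string {
  // Drop whitespace that sits strictly between a '>' and the next '<'
  return html.replace(/>\s+</g, '><').trim()
}

const pretty = `
<ul>
  <li>One</li>
  <li>Two</li>
</ul>
`
const minified = '<ul><li>One</li><li>Two</li></ul>'

// Both inputs collapse to the same canonical string
console.log(normalizeHtml(pretty) === normalizeHtml(minified)) // true
```

Because both inputs reduce to one canonical string, the reconstruction phase only ever has to deal with a single input shape.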

Core Algorithm: State Machine Traversal

The most intuitive implementation is a state machine that processes character by character:

function beautifyHtml(html: string, spaces: number): string {
  let formatted = ''
  let indentLevel = 0
  const indentStr = ' '.repeat(spaces)

  // Preprocessing: remove whitespace between tags
  html = html.replace(/>\s+</g, '><').trim()

  // Set of void (self-closing) elements
  const voidElements = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param',
    'source', 'track', 'wbr'
  ])

  let i = 0
  while (i < html.length) {
    if (html[i] === '<') {
      if (html[i + 1] === '/') {
        // Closing tag: decrease indent first, then output
        indentLevel = Math.max(0, indentLevel - 1)
        formatted += '\n' + indentStr.repeat(indentLevel)
        // Copy entire tag
        while (i < html.length && html[i] !== '>') {
          formatted += html[i++]
        }
        formatted += '>'
        i++
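      } else if (html[i + 1] === '!') {
        // Comment or DOCTYPE: maintain current indent
        formatted += '\n' + indentStr.repeat(indentLevel)
        // A comment ends at '-->' and may contain '>'; a DOCTYPE ends at '>'
        const close = html.startsWith('<!--', i) ? '-->' : '>'
        const end = html.indexOf(close, i)
        const stop = end === -1 ? html.length : end + close.length
        formatted += html.slice(i, stop)
        i = stop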
      } else {
        // Opening tag: output first, then increase indent
        if (formatted.length > 0) {
          formatted += '\n' + indentStr.repeat(indentLevel)
        }

        // Extract tag name (stops at whitespace, '/' or '>', so <br/> yields 'br')
        let tagName = ''
        let j = i + 1
        while (j < html.length && !/[\s\/>]/.test(html[j])) {
          tagName += html[j++].toLowerCase()
        }

        // Copy entire tag
        while (i < html.length && html[i] !== '>') {
          formatted += html[i++]
        }
        formatted += '>'
        i++

        // Only increase indent for non-void elements
        if (!voidElements.has(tagName)) {
          indentLevel++
        }
      }
    } else {
      // Text content
      let text = ''
      while (i < html.length && html[i] !== '<') {
        text += html[i++]
      }
      text = text.trim()
      if (text) {
        formatted += text
      }
    }
  }

  return formatted.trim()
}

Key Insights

1. Handling Void Elements

Void elements never contain content and have no closing tag, so they must not increase indentation. The set in the code above lists 14 of them; a Set gives fast lookup:

const voidElements = new Set(['br', 'img', 'input', 'meta', 'link', ...])

2. Indentation Timing

  • Opening tag: Output first, then indentLevel++ (except void elements)
  • Closing tag: indentLevel-- first, then output

This order ensures correct parent-child indentation relationships.
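The timing rule can be seen in isolation with a simplified sketch that works on pre-split tokens instead of raw characters. The Token type and indentTokens function are hypothetical names introduced for this illustration only.

```typescript
// Hypothetical token type; a real formatter would tokenize the HTML string itself.
type Token = { kind: 'open' | 'close' | 'void'; text: string }

function indentTokens(tokens: Token[], spaces = 2): string {
  const lines: string[] = []
  let level = 0
  for (const t of tokens) {
    // Closing tag: step back out before printing
    if (t.kind === 'close') level = Math.max(0, level - 1)
    lines.push(' '.repeat(spaces * level) + t.text)
    // Opening tag: print first, then step in (void tags never step in)
    if (t.kind === 'open') level++
  }
  return lines.join('\n')
}

const out = indentTokens([
  { kind: 'open', text: '<div>' },
  { kind: 'void', text: '<br>' },
  { kind: 'open', text: '<p>' },
  { kind: 'close', text: '</p>' },
  { kind: 'close', text: '</div>' },
])
console.log(out)
```

Swapping the order (incrementing before printing an opening tag, or decrementing after printing a closing tag) would shift every tag one level away from its partner.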

3. Text Content Handling

Text between tags is passed through trim() so that line breaks and indentation from the original source do not leak into the formatted output.

Minification Algorithm: Regex to the Rescue

Compared to formatting, HTML minification is much simpler. The core idea is removing all unnecessary whitespace:

function minifyHtml(html: string): string {
  // Whitespace next to a lone '<' or '>' is deliberately kept: stripping it
  // would turn "Hello <b>World</b>" into "Hello<b>World</b>"
  return html
    .replace(/<!--[\s\S]*?-->/g, '')      // Remove comments
    .replace(/\s+/g, ' ')                  // Collapse each whitespace run to one space
    .replace(/>\s+</g, '><')               // Remove whitespace between tags
    .trim()
}

Why Not Remove All Whitespace?

Minification must not change the rendered content. For example, the space in <p>Hello World</p> must be preserved. The regex >\s+< only matches whitespace that sits between a closing > and the next <, so text inside an element is left alone.
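A short sketch of what that regex does and does not touch. The helper name stripInterTagWhitespace is hypothetical. The second case is a caveat worth knowing: whitespace between two inline elements also counts as "between tags", so removing it can change how the page renders.

```typescript
// Hypothetical helper wrapping the single regex discussed above
function stripInterTagWhitespace(html: string): string {
  return html.replace(/>\s+</g, '><')
}

// The space inside text content survives; the newline between tags is removed
const a = stripInterTagWhitespace('<p>Hello World</p>\n<p>Bye</p>')
console.log(a) // <p>Hello World</p><p>Bye</p>

// Caveat: the space between two inline elements is removed as well
const b = stripInterTagWhitespace('<a>one</a> <a>two</a>')
console.log(b) // <a>one</a><a>two</a>
```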

Edge Cases and Pitfalls

1. Nested Inline Tags

<!-- Before formatting -->
<p>This is <strong>bold</strong> text</p>

<!-- Incorrect formatting -->
<p>
  This is
  <strong>bold</strong>
  text
</p>

<!-- Correct handling: inline tags don't break lines -->
<p>This is <strong>bold</strong> text</p>

Solution: Identify inline tags (span, strong, em, a, etc.) and avoid line breaks for them. This requires extending tag classification logic.
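One possible shape for that classification, as a sketch: a second Set next to the void-element Set, plus a predicate the formatter could consult before emitting a line break. The list is an assumed, non-exhaustive selection, and startsNewLine is a hypothetical helper name.

```typescript
// Assumed, non-exhaustive list of inline elements; extend as needed.
const inlineElements = new Set([
  'a', 'abbr', 'b', 'code', 'em', 'i', 'small', 'span', 'strong', 'sub', 'sup'
])

// Hypothetical helper: should the formatter start a new line before this tag?
function startsNewLine(tag: string): boolean {
  // Accepts '<strong>', '</strong>' or '<a href="#">' and extracts the name
  const match = /^<\/?([a-zA-Z][a-zA-Z0-9-]*)/.exec(tag)
  if (!match) return true // comments, DOCTYPE, malformed input: treat as block
  return !inlineElements.has(match[1].toLowerCase())
}

console.log(startsNewLine('<p>'))       // true
console.log(startsNewLine('<strong>'))  // false
```

Skipping the line break is only half of the fix: the formatter also has to stop trimming the text next to inline tags, otherwise the spaces around them are lost.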

2. <pre> and <code> Tags

Whitespace inside <pre> is rendered as written, and <code> blocks are commonly nested inside <pre>, so their content must be preserved exactly. One approach is to extract that content before minification, then restore it after formatting.
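A minimal sketch of the extract-and-restore idea, assuming <pre> blocks are not nested and that the transform leaves HTML comments intact. The helper name withPreservedPre and the PRE_n placeholder format are hypothetical.

```typescript
// Hypothetical helper: swap each <pre>...</pre> block for a placeholder,
// run a transform on the rest, then put the original blocks back.
function withPreservedPre(html: string, transform: (s: string) => string): string {
  const saved: string[] = []
  const masked = html.replace(/<pre\b[\s\S]*?<\/pre>/gi, (block) => {
    saved.push(block)
    // Comment-shaped token that the transform is assumed to leave untouched
    return `<!--PRE_${saved.length - 1}-->`
  })
  return transform(masked).replace(/<!--PRE_(\d+)-->/g, (_, n) => saved[Number(n)])
}

const input = '<div>\n  <pre>line 1\n    line 2</pre>\n  <p>text</p>\n</div>'
const result = withPreservedPre(input, (s) => s.replace(/>\s+</g, '><'))
console.log(result)
```

A transform that strips comments would also strip these placeholders, so in that case a different placeholder format is needed.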

3. Quotes in Attributes

<div class="container" data-value="hello world">

Spaces in attribute values can’t be removed. The state machine needs to track whether it is inside a quoted value: otherwise whitespace collapsing alters the value, and a > inside a quoted value ends the tag too early.
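Quote tracking can be sketched as a small scanner that finds the real end of a tag. The function name findTagEnd is hypothetical; it could replace the `html[i] !== '>'` loops in a character-by-character formatter.

```typescript
// Hypothetical helper: index of the '>' that closes the tag starting at `start`,
// skipping any '>' inside single- or double-quoted attribute values.
function findTagEnd(html: string, start: number): number {
  let quote: string | null = null
  for (let i = start; i < html.length; i++) {
    const ch = html[i]
    if (quote) {
      if (ch === quote) quote = null // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch // opening quote
    } else if (ch === '>') {
      return i
    }
  }
  return -1 // unterminated tag
}

const tag = `<a title="x > y" href='#'>link</a>`
const end = findTagEnd(tag, 0)
console.log(tag.slice(0, end + 1)) // <a title="x > y" href='#'>
```

A plain `indexOf('>')` on the same input would stop inside the title attribute.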

Performance Optimization: Avoid Regex Backtracking

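When HTML files are large (e.g., 1MB+), regex-based processing can become a bottleneck. Patterns with nested or overlapping quantifiers can trigger catastrophic backtracking, and even a simple pattern can degrade to quadratic time on malformed input. For example:

// Risky on malformed input: a long run of '<' with no closing '>'
// makes every match attempt scan to the end of the string
html.replace(/<[^>]*>/g, ...)

// Safer approach: use state machine for character-by-character parsing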

For extremely large HTML files, streaming is recommended: process each chunk as it is read, so memory usage stays roughly constant instead of growing with file size.
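As a sketch of the regex-free scan mentioned above, the function below splits HTML into tag and text chunks with a single forward pass built on indexOf. The name splitTags is hypothetical, and the sketch ignores quoted attributes and comments for brevity.

```typescript
// Hypothetical helper: split HTML into tag and text chunks in one forward pass.
function splitTags(html: string): string[] {
  const parts: string[] = []
  let i = 0
  while (i < html.length) {
    const lt = html.indexOf('<', i)
    if (lt === -1) { parts.push(html.slice(i)); break }   // only text remains
    if (lt > i) parts.push(html.slice(i, lt))             // text before the tag
    const gt = html.indexOf('>', lt)
    if (gt === -1) { parts.push(html.slice(lt)); break }  // no closing '>': stop scanning
    parts.push(html.slice(lt, gt + 1))
    i = gt + 1
  }
  return parts
}

console.log(splitTags('<p>Hi</p>')) // [ '<p>', 'Hi', '</p>' ]
```

When no closing > exists, the scan stops after one search instead of retrying from every < in the input, which is the case that hurts the regex version.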

Real-World Use Cases

1. Development Phase

  • Formatting compressed HTML from backend APIs to view structure
  • Unifying indentation style when copying code snippets

2. Production Environment

  • HTML minification can reduce file size by roughly 15-25%, depending on how much whitespace and how many comments the source contains
  • Combined with Gzip compression, total transfer size can decrease by roughly 60-70%

3. SEO Optimization

Although search engines don’t care about code indentation, clean code helps with:

  • Quickly locating meta tags and structured data
  • Checking semantic tag usage
  • Troubleshooting unclosed tags

Alternative: AST Parser

While the state machine approach is simple, it fails on complex cases (like <script> tags containing < characters in strings). Industrial-grade solutions use AST parsers:

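import { parseDocument } from 'htmlparser2'
import render from 'dom-serializer'

function formatWithAST(html: string): string {
  // htmlparser2 exposes parseDocument (not parse) to build a DOM tree
  const ast = parseDocument(html, {
    lowerCaseTags: true,
    recognizeSelfClosing: true
  })
  // To our knowledge dom-serializer has no pretty-print option: it only
  // re-serializes the tree. Indentation has to be added by walking the
  // tree and emitting one indent per nesting level.
  return render(ast)
}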

Pros: Far more robust; a real parser handles scripts, comments, and quoted attributes that trip up the simple state machine
Cons: Depends on third-party library (~150KB), slightly lower performance
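Once a parser has produced a tree, indentation becomes a recursive walk. The sketch below uses a hypothetical minimal node shape (HtmlNode) rather than any real library's types, and omits attributes and void elements to stay short.

```typescript
// Hypothetical minimal node shape; real parsers expose richer node types.
type HtmlNode =
  | { type: 'text'; data: string }
  | { type: 'element'; name: string; children: HtmlNode[] }

function printTree(nodes: HtmlNode[], spaces = 2, depth = 0): string {
  const pad = ' '.repeat(spaces * depth)
  return nodes
    .map((node) => {
      if (node.type === 'text') return pad + node.data.trim()
      if (node.children.length === 0) return `${pad}<${node.name}></${node.name}>`
      // Children are printed one level deeper than their parent
      const inner = printTree(node.children, spaces, depth + 1)
      return `${pad}<${node.name}>\n${inner}\n${pad}</${node.name}>`
    })
    .join('\n')
}

const tree: HtmlNode[] = [
  { type: 'element', name: 'ul', children: [
    { type: 'element', name: 'li', children: [{ type: 'text', data: 'One' }] },
  ] },
]
console.log(printTree(tree))
```

The depth comes from the tree itself, so stray < characters in the source can no longer affect indentation. That is the main advantage over counting tags in a flat string.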

Conclusion

The core of an HTML formatter is the two-phase “minify + reconstruct” process. The state machine implementation is simple and efficient, and suits most everyday input; parser-based solutions offer higher accuracy for production environments. Understanding these underlying principles helps us quickly implement similar features and better comprehend the frontend toolchain.

Try the online tool: https://jsokit.com/tools/html-format to experience one-click HTML formatting.
