HTML Entity Encoding: From XSS Prevention to Character Escaping Implementation

Building a content management system recently, I needed to safely display user-submitted text on pages. Direct rendering? XSS attacks await. Escape everything? Garbage output. That’s where HTML entity encoding comes in.

What Are HTML Entities

HTML entities represent special characters using specific character sequences. For example, < becomes &lt;, > becomes &gt;. These seemingly cryptic strings solve two core problems:

  1. Security: Prevent XSS attacks
  2. Compatibility: Correctly display special characters and Unicode

Core Encoding Implementation

The most basic approach uses character replacement:

const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
  '/': '&#x2F;',
}

function encodeHtml(text) {
  return text.replace(/[&<>"'/]/g, char => HTML_ENTITIES[char] || char)
}

Simple enough, but several details deserve attention:

1. & Must Be Escaped First

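With chained replacements, & has to be handled first. If it comes later, it re-escapes the & of the entities you just produced. The single-pass regex in encodeHtml above avoids the problem because each input character is visited only once.

// Wrong order
'<script>'.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;')
// Result: '&amp;lt;script&amp;gt;' - double encoded!

// Correct order
'<script>'.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
// Result: '&lt;script&gt;'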
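
To see the difference between the three strategies side by side, here is a minimal sketch. The helper names (chainedWrong, chainedRight, singlePass) and the ESCAPES table are made up for this illustration and only cover &, < and >.

```javascript
// Illustrative helpers (hypothetical names) comparing replacement strategies.
const ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;' }

// Chained, & last: the & inside freshly written entities gets escaped again.
function chainedWrong(text) {
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;')
}

// Chained, & first: later replacements never produce a bare & to re-escape.
function chainedRight(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

// Single pass: each input character is visited once, so order is irrelevant.
function singlePass(text) {
  return text.replace(/[&<>]/g, ch => ESCAPES[ch])
}

console.log(chainedWrong('a < b & c')) // 'a &amp;lt; b &amp; c'
console.log(chainedRight('a < b & c')) // 'a &lt; b &amp; c'
console.log(singlePass('a < b & c'))   // 'a &lt; b &amp; c'
```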

2. Why Escape / and '

Many developers only escape <>&", which isn’t enough:

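<!-- Why / is escaped as well -->
<script>
  var url = "USER_INPUT";
  // If USER_INPUT is "</script><script>alert(1)//"
  // the HTML parser ends the block at the first </script>,
  // even though it sits inside a JavaScript string.
  // Escaping < already breaks the tag; escaping / is defense in depth.
</script>

<!-- Risk of not escaping ' -->
<a href='USER_INPUT'>Click</a>
<!-- If USER_INPUT is "' onclick='alert(1)" -->
<!-- the quote closes href and onclick becomes a new attribute -->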
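
The attribute case can be reproduced with plain strings. This is a sketch only: escapeWithoutApostrophe and escapeWithApostrophe are hypothetical helpers written for the demonstration, and the payload is a shortened version of the one in the comment above.

```javascript
// Hypothetical helper that escapes only & < > " (no single quote).
function escapeWithoutApostrophe(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }
  return text.replace(/[&<>"]/g, ch => map[ch])
}

// Same helper with the single quote added afterwards.
function escapeWithApostrophe(text) {
  return escapeWithoutApostrophe(text).replace(/'/g, '&#x27;')
}

const payload = "' onclick='alert(1)"

// Single-quoted attribute, apostrophe left alone: the payload closes href
// and contributes its own onclick attribute.
const unsafe = `<a href='${escapeWithoutApostrophe(payload)}'>Click</a>`

// Apostrophe escaped: the payload stays inside the href value.
const safe = `<a href='${escapeWithApostrophe(payload)}'>Click</a>`

console.log(unsafe) // <a href='' onclick='alert(1)'>Click</a>
console.log(safe)   // <a href='&#x27; onclick=&#x27;alert(1)'>Click</a>
```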

Decoding Pitfalls

Decoding is trickier than encoding. Many write:

// DANGEROUS! Don't use in production
function decodeHtml(html) {
  const el = document.createElement('div')
  el.innerHTML = html
  return el.textContent
}

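This leans on the browser's HTML parser: simple but crude. It won’t work in Node.js, where there is no document object, and it introduces DOM XSS risk, because assigning untrusted markup to innerHTML can fire handlers such as <img src=x onerror=...> even when the element is never attached to the page.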

A safer approach builds a reverse mapping:

const ENTITY_TO_CHAR = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#x27;': "'",
  '&#x2F;': '/',
  '&nbsp;': '\u00A0', // non-breaking space (U+00A0), not a regular space
  '&copy;': '©',
  // ... more entities
}

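function decodeHtml(html) {
  // Names may contain digits (&frac12;, &sup2;), so allow them after the first letter
  return html.replace(/&[a-z][a-z0-9]*;|&#x?[0-9a-f]+;/gi, entity => {
    // Entities missing from the table are returned unchanged
    return ENTITY_TO_CHAR[entity] || entity
  })
}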

Handling Numeric Entities

HTML entities come in two forms:

  • Named entities: &lt;, &copy;
  • Numeric entities: &#60; (decimal), &#x3C; (hexadecimal)

Decoding numeric entities:

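function decodeNumericEntity(entity) {
  // Hex form (&#x3C;) is tried first, then decimal (&#60;)
  const hex = entity.match(/^&#x([0-9a-f]+);$/i)
  const dec = hex ? null : entity.match(/^&#(\d+);$/)
  if (!hex && !dec) {
    return entity
  }

  const codePoint = hex ? parseInt(hex[1], 16) : parseInt(dec[1], 10)
  // String.fromCodePoint throws a RangeError above U+10FFFF
  if (codePoint > 0x10ffff) {
    return entity
  }
  return String.fromCodePoint(codePoint)
}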

// Test
decodeNumericEntity('&#x3C;')  // '<'
decodeNumericEntity('&#60;')   // '<'
decodeNumericEntity('&#128512;') // '😀'
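
Named and numeric decoding can be combined into one pass. The following is a minimal sketch under a few assumptions: decodeEntities and the NAMED table are illustrative names, the table holds only a handful of entries (a real decoder needs the full HTML named-entity list), and edge cases such as code point 0 or lone surrogates are not treated specially.

```javascript
// Small illustrative table; a real decoder needs the full HTML named-entity list.
const NAMED = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'],
  ['apos', "'"], ['nbsp', '\u00A0'], ['copy', '\u00A9'],
])

function decodeEntities(html) {
  // One pass over the input, so '&amp;lt;' decodes to '&lt;' and not to '<'.
  return html.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X'
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
      // Leave out-of-range values untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match
    }
    // Named entities are case-sensitive (&Auml; vs &auml;), so no lowercasing.
    return NAMED.has(body) ? NAMED.get(body) : match
  })
}

console.log(decodeEntities('&lt;b&gt; &#169; &#x1F600; &amp;lt; &bogus;'))
// '<b> © 😀 &lt; &bogus;'
```

Unknown names and out-of-range numbers are passed through unchanged, which keeps the function from throwing on hostile input.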

XSS Prevention Edge Cases

Encoding prevents most XSS, but isn’t a silver bullet.

Context Matters

The same input needs different escaping in different positions:

<!-- HTML content context -->
<div>USER_INPUT</div>
<!-- Needs: &<>"' escaping -->

<!-- Attribute context -->
<input value="USER_INPUT">
<!-- Needs: &<>"' escaping -->

<!-- URL attribute context -->
<a href="USER_INPUT">Click</a>
<!-- Needs: protocol validation + URL encoding -->

<!-- JavaScript context -->
<script>var data = "USER_INPUT";</script>
<!-- Needs: JavaScript string escaping (\, quotes, <); HTML entities are not decoded inside <script> -->
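
For the JavaScript context, one common approach is to serialize the value with JSON.stringify and then neutralize the few characters the HTML parser still cares about. The sketch below assumes the value is JSON-serializable; toScriptLiteral is a hypothetical helper name, not a library function.

```javascript
// Hypothetical helper: serialize a value for embedding inside an inline <script>.
// Assumes the value is JSON-serializable (JSON.stringify(undefined) is not a string).
function toScriptLiteral(value) {
  return JSON.stringify(value)
    // Stop '</script>' and '<!--' from being seen by the HTML parser.
    .replace(/</g, '\\u003C')
    .replace(/>/g, '\\u003E')
    .replace(/&/g, '\\u0026')
    // U+2028/U+2029 were line terminators in older JavaScript engines.
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029')
}

const userInput = '</script><script>alert(1)//'
const html = `<script>var data = ${toScriptLiteral(userInput)};</script>`
console.log(html)
// <script>var data = "\u003C/script\u003E\u003Cscript\u003Ealert(1)//";</script>
```

The escaped literal still evaluates to the original string at runtime, so no decoding step is needed on the JavaScript side.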

Encoding Can’t Fix Injection

// User input
const input = 'javascript:alert(1)'

// After encoding
const encoded = encodeHtml(input) // 'javascript:alert(1)'
// None of these characters are special, so encoding changes nothing

// But used in href
const link = `<a href="${encoded}">Click</a>`
// Clicking still runs the script!

The correct approach validates protocols:

function safeUrl(url) {
  const allowed = ['http://', 'https://', 'mailto:']
  if (allowed.some(p => url.startsWith(p))) {
    return url
  }
  return '#'
}
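
A variant of the same idea parses the input with the standard URL constructor instead of comparing string prefixes, so mixed-case schemes and leading whitespace are normalized before the check. This is a sketch, not a complete URL policy: safeUrlParsed and ALLOWED_PROTOCOLS are illustrative names, and relative URLs are rejected here because they cannot be parsed without a base.

```javascript
// Hypothetical helper; ALLOWED_PROTOCOLS and the '#' fallback are illustrative choices.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:'])

function safeUrlParsed(url) {
  let parsed
  try {
    // The WHATWG URL parser trims surrounding whitespace and lowercases the scheme.
    parsed = new URL(url)
  } catch {
    return '#' // relative or malformed input
  }
  return ALLOWED_PROTOCOLS.has(parsed.protocol) ? parsed.href : '#'
}

console.log(safeUrlParsed('https://example.com/a?b=1')) // 'https://example.com/a?b=1'
console.log(safeUrlParsed('  JavaScript:alert(1)'))     // '#'
console.log(safeUrlParsed('/relative/path'))            // '#'
```

The returned value still has to be attribute-encoded before it is placed into an href.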

Performance Optimization

When processing large amounts of text, regex replacement can become a bottleneck.

Using Map for Speed

const ENTITY_MAP = new Map(Object.entries(HTML_ENTITIES))

function encodeHtmlFast(text) {
  let result = ''
  for (const char of text) {
    result += ENTITY_MAP.get(char) || char
  }
  return result
}

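Neither approach is reliably faster: results depend on the JavaScript engine, the text length, and how many characters need escaping, so benchmark with representative data before switching.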
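
A rough way to compare the two on your own data is a small timing harness. Everything here is a sketch: encodeWithRegex, encodeWithLoop and timeIt are names invented for the example, the sample string is arbitrary, and the numbers it prints will vary between engines and runs.

```javascript
const ENTITIES = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;',
  '"': '&quot;', "'": '&#x27;', '/': '&#x2F;',
}
const ENTITY_LOOKUP = new Map(Object.entries(ENTITIES))

// Regex-based single pass.
function encodeWithRegex(text) {
  return text.replace(/[&<>"'/]/g, ch => ENTITIES[ch])
}

// Character-by-character loop with a Map lookup.
function encodeWithLoop(text) {
  let result = ''
  for (const ch of text) result += ENTITY_LOOKUP.get(ch) || ch
  return result
}

// Rough timing helper: runs fn repeatedly and returns elapsed milliseconds.
function timeIt(fn, input, runs = 50) {
  const start = performance.now()
  for (let i = 0; i < runs; i++) fn(input)
  return performance.now() - start
}

const sample = '<p class="x">Tom & Jerry\'s</p>'.repeat(2000)
console.log('regex ms:', timeIt(encodeWithRegex, sample).toFixed(1))
console.log('loop ms: ', timeIt(encodeWithLoop, sample).toFixed(1))
```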

Batch Processing

function batchEncode(items) {
  const pattern = /[&<>"'/]/g
  return items.map(text => text.replace(pattern, c => HTML_ENTITIES[c]))
}

Practical Applications

1. Rich Text Editor Output

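Content from editors needs sanitization. The regex pass below is only illustrative: patterns like these are easy to bypass, and because the result is re-encoded at the end, the output is plain text with any remaining markup shown literally. To keep formatting, use a maintained allowlist-based sanitizer such as DOMPurify instead.

function sanitizeRichText(html) {
  // 1. Decode all entities so encoded payloads become visible
  const decoded = decodeHtml(html)

  // 2. Remove dangerous tags and inline event handlers (NOT exhaustive)
  const cleaned = decoded
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    // Covers double-quoted, single-quoted, and unquoted handler values
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '')

  // 3. Re-encode: every remaining tag is rendered as text
  return encodeHtml(cleaned)
}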
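
Step 3 is what actually makes the output safe. The script-stripping regex on its own is not enough, as this small sketch shows (stripScriptsOnce is a hypothetical name wrapping the same pattern used above):

```javascript
// The same script-stripping pattern used above, applied once.
function stripScriptsOnce(html) {
  return html.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
}

// Nested payload: removing the inner pairs stitches the outer pieces together.
const nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>'
console.log(stripScriptsOnce(nested)) // '<script>alert(1)</script>'
```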

2. HTML in JSON

APIs returning JSON with HTML content:

const data = {
  title: '<script>alert("xss")</script>',
  content: 'A & B < C'
}

// Encode before transmission
const safe = {
  title: encodeHtml(data.title),
  // '&lt;script&gt;alert(&quot;xss&quot;)&lt;&#x2F;script&gt;'
  content: encodeHtml(data.content)
  // 'A &amp; B &lt; C'
}

3. Special Symbol Display

Displaying copyright, currency, mathematical symbols:

const specialChars = {
  '©': '&copy;',
  '®': '&reg;',
  '™': '&trade;',
  '€': '&euro;',
  '¥': '&yen;',
  '±': '&plusmn;',
  '×': '&times;',
  '÷': '&divide;',
}

// In environments that don't support direct input of these characters
// (like legacy email clients), use HTML entities for correct display
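
When no named entity is available, or the target cannot be trusted with raw Unicode, numeric entities cover every character. A minimal sketch, where encodeNonAscii is a hypothetical helper and the choice of uppercase hex is arbitrary:

```javascript
// Hypothetical helper: replace every non-ASCII character with a hex numeric entity.
function encodeNonAscii(text) {
  // The u flag makes the regex match whole code points, including emoji.
  return text.replace(
    /[^\x00-\x7F]/gu,
    ch => `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`
  )
}

console.log(encodeNonAscii('© 5€ 😀'))
// '&#xA9; 5&#x20AC; &#x1F600;'
```

This does not escape &, < or >, so it complements the HTML encoder rather than replacing it.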

A Complete Tool

Based on these principles, I built: HTML Entity Encoder/Decoder

Features:

  • Encode / decode bidirectional conversion
  • Supports named and numeric entities
  • Quick selection of common entities
  • Safe handling of user input

Encoding and decoding seem simple, but doing security and compatibility right takes effort. Hope this helps.

