HTML Entity Encoding: From XSS Prevention to Character Escaping Implementation

Building a content management system recently, I needed to safely display user-submitted text on pages. Direct rendering? XSS attacks await. Escape everything? Garbage output. That’s where HTML entity encoding comes in.

What Are HTML Entities

HTML entities represent special characters using specific character sequences. For example, < becomes &lt; and > becomes &gt;. These seemingly cryptic strings solve two core problems:

Security: Prevent XSS attacks
Compatibility: Correctly display special characters and Unicode

Core Encoding Implementation

The most basic approach uses character replacement:

const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
  '/': '&#x2F;',
}

function encodeHtml(text) {
  return text.replace(/[&<>"'/]/g, char => HTML_ENTITIES[char] || char)
}

Simple enough, but several details deserve attention:

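1. `&` Must Be Escaped First

When replacements are chained one character at a time, & has to go first. If it comes later, it re-encodes the & of the entities produced by earlier steps. The single-pass regex in encodeHtml above is not affected, because each input character is visited only once.

// Wrong order
'<script>'.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;')
// Result: '&amp;lt;script&amp;gt;' - double encoded!

// Correct order
'<script>'.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
// Result: '&lt;script&gt;'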
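The ordering issue can be reproduced side by side. The sketch below is illustrative only: encodeOnce and encodeChainedWrong are hypothetical names, and the entity table is a reduced copy of the one above.

```javascript
// Single-pass encoder: each input character is visited once, so the "&"
// inside an emitted entity such as "&lt;" is never scanned again.
const ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' }

function encodeOnce(text) {
  return String(text).replace(/[&<>"']/g, ch => ENTITIES[ch])
}

// Chained replaces with "&" handled last, kept only for comparison.
function encodeChainedWrong(text) {
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;')
}

console.log(encodeOnce('<b>Tom & Jerry</b>'))
// &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
console.log(encodeChainedWrong('<b>'))
// &amp;lt;b&amp;gt;
```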

2. Why Escape `/` and `'`

Many developers only escape <>&", which isn’t enough:

<!-- Risk of not escaping / -->
<script>
  var url = "USER_INPUT";
  // If USER_INPUT is "</script><script>alert(1)//"
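  // the HTML parser ends the script block at the first </script>,
  // whatever the JavaScript string quotes say; escaping / breaks up that closing tag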
</script>

<!-- Risk of not escaping ' -->
<a href='USER_INPUT'>Click</a>
<!-- If USER_INPUT is "' onclick='alert(1)'" -->
<!-- Attribute injection happens -->
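To make the attribute case concrete, here is a minimal sketch of a link renderer that encodes both quote characters. The names encodeAttr and renderLink are assumptions for illustration, and the sketch covers only quote breakout; validating the URL itself is a separate step.

```javascript
const ATTR_ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' }

function encodeAttr(value) {
  return String(value).replace(/[&<>"']/g, ch => ATTR_ENTITIES[ch])
}

// Always quote the attribute: encoding cannot protect an unquoted value,
// where a plain space is enough to start a new attribute.
function renderLink(href, label) {
  return `<a href='${encodeAttr(href)}'>${encodeAttr(label)}</a>`
}

console.log(renderLink("' onclick='alert(1)", 'Click'))
// <a href='&#x27; onclick=&#x27;alert(1)'>Click</a>
```

With the quotes encoded, the payload stays inside the href value instead of becoming a new onclick attribute.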

Decoding Pitfalls

Decoding is trickier than encoding. Many write:

// DANGEROUS! Don't use in production
function decodeHtml(html) {
  const el = document.createElement('div')
  el.innerHTML = html
  return el.textContent
}

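This leverages the browser’s HTML parser: simple but crude. It won’t work in Node.js, where there is no document object, and it introduces DOM XSS risk: assigning to innerHTML parses the input as markup, so a payload such as <img src=x onerror=alert(1)> runs its handler even though the element is never attached to the page.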

A safer approach builds a reverse mapping:

const ENTITY_TO_CHAR = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#x27;': "'",
  '&#x2F;': '/',
  '&nbsp;': '\u00A0', // non-breaking space, not a regular space
  '&copy;': '©',
  // ... more entities
}

function decodeHtml(html) {
  // Entity names may contain digits (e.g. &frac12;), so allow them after the first letter
  return html.replace(/&[a-z][a-z0-9]*;|&#x?[0-9a-f]+;/gi, entity => {
    return ENTITY_TO_CHAR[entity] || entity
  })
}

Handling Numeric Entities

HTML entities come in two forms:

Named entities: &lt;, &copy;
Numeric entities: &#60; (decimal), &#x3C; (hexadecimal)

Decoding numeric entities:

function decodeNumericEntity(entity) {
  const hex = entity.match(/&#x([0-9a-f]+);/i)
  const dec = entity.match(/&#(\d+);/)
  if (!hex && !dec) return entity

  const code = hex ? parseInt(hex[1], 16) : parseInt(dec[1], 10)
  // String.fromCodePoint throws a RangeError above U+10FFFF,
  // so out-of-range references are returned unchanged
  return code <= 0x10FFFF ? String.fromCodePoint(code) : entity
}

// Test
decodeNumericEntity('&#x3C;')  // '<'
decodeNumericEntity('&#60;')   // '<'
decodeNumericEntity('&#128512;') // '😀'
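The named and numeric paths can be combined into one pass. What follows is a minimal sketch, not a complete decoder: decodeEntities and the small NAMED table are assumptions for illustration, and special cases that browsers handle differently (for example code point 0) are not modeled.

```javascript
// Deliberately tiny table; a real decoder needs the full named-entity list.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0', copy: '©' }

function decodeEntities(html) {
  return html.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] !== '#') {
      // Unknown names are left as written.
      return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match
    }
    const isHex = body[1] === 'x' || body[1] === 'X'
    const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
    // Out-of-range values and surrogates are left untouched (a simplification).
    if (code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF)) return match
    return String.fromCodePoint(code)
  })
}

console.log(decodeEntities('&lt;p&gt; &#60; &#x3C; &copy; &bogus;'))
// <p> < < © &bogus;
console.log(decodeEntities('&amp;lt;'))
// &lt;
```

Because the string is scanned once, '&amp;lt;' decodes to '&lt;' rather than '<', which avoids accidental double decoding.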

XSS Prevention Edge Cases

Encoding prevents most XSS, but isn’t a silver bullet.

Context Matters

The same input needs different escaping in different positions:

<!-- HTML content context -->
<div>USER_INPUT</div>
<!-- Needs: &<>"' escaping -->

<!-- Attribute context -->
<input value="USER_INPUT">
<!-- Needs: &<>"' escaping -->

<!-- URL attribute context -->
<a href="USER_INPUT">Click</a>
<!-- Needs: protocol validation + URL encoding -->

<!-- JavaScript context -->
<script>var data = "USER_INPUT";</script>
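<!-- Needs: JavaScript string escaping (quotes, backslashes, and < so the block cannot be closed early); HTML entities are not decoded inside a script element -->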
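One way to keep the contexts apart is a separate helper per context. The sketch below is an assumption-laden illustration, not a complete escaping library: escapeHtml, toScriptLiteral and toQueryValue are hypothetical names.

```javascript
const BASIC = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' }

// HTML content and quoted attribute values.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, ch => BASIC[ch])
}

// String literal inside an inline script: JSON.stringify handles quotes and
// backslashes; "<" is additionally written as \u003C so the text "</script>"
// never appears, and U+2028/U+2029 are escaped for older engines.
function toScriptLiteral(value) {
  return JSON.stringify(String(value))
    .replace(/</g, '\\u003C')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029')
}

// A single query-string value inside a URL.
function toQueryValue(value) {
  return encodeURIComponent(String(value))
}

console.log(escapeHtml('<b>'))              // &lt;b&gt;
console.log(toScriptLiteral('</script>'))   // "\u003C/script>"
console.log(toQueryValue('a b&c'))          // a%20b%26c
```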

Encoding Can’t Fix Injection

// User input
const input = 'javascript:alert(1)'

// After encoding
const encoded = encodeHtml(input) // 'javascript:alert(1)'
// Encoding doesn't change content

// But when used in an href
const link = `<a href="${encoded}">Click</a>`
// Clicking still triggers XSS!

The correct approach validates protocols:

function safeUrl(url) {
  const allowed = ['http://', 'https://', 'mailto:']
  if (allowed.some(p => url.startsWith(p))) {
    return url
  }
  return '#'
}
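A prefix check is case-sensitive and does not account for leading whitespace. As an alternative sketch, the standard URL parser (available as a global in browsers and current Node.js) can normalize the input first. The name safeUrlParsed is hypothetical, and this variant rejects relative URLs because they cannot be parsed without a base.

```javascript
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:'])

function safeUrlParsed(input) {
  let parsed
  try {
    parsed = new URL(String(input).trim())
  } catch {
    return '#' // relative or malformed input
  }
  // parsed.protocol is already lowercased and includes the trailing colon.
  return ALLOWED_PROTOCOLS.has(parsed.protocol) ? parsed.href : '#'
}

console.log(safeUrlParsed('JaVaScRiPt:alert(1)'))        // #
console.log(safeUrlParsed('HTTPS://Example.com/a?b=1'))  // https://example.com/a?b=1
console.log(safeUrlParsed('/relative/path'))             // #
```

The returned value still needs attribute encoding before it is placed in an href.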

Performance Optimization

When processing large amounts of text, regex replacement can become a bottleneck.

Using Map for Speed

const ENTITY_MAP = new Map(Object.entries(HTML_ENTITIES))

function encodeHtmlFast(text) {
  let result = ''
  for (const char of text) {
    result += ENTITY_MAP.get(char) || char
  }
  return result
}

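As a rule of thumb, regex is faster for short text and Map iteration wins for long text, but the crossover depends on the JavaScript engine, so benchmark with representative input before switching.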
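A rough way to check this on your own data is a small timing harness. This is a sketch only: timeIt, encodeRegex and encodeLoop are hypothetical names, it assumes a global performance object (browsers and recent Node.js), and micro-benchmarks like this are noisy.

```javascript
const HTML_ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;', '/': '&#x2F;' }
const ENTITY_MAP = new Map(Object.entries(HTML_ENTITIES))

function encodeRegex(text) {
  return text.replace(/[&<>"'/]/g, ch => HTML_ENTITIES[ch])
}

function encodeLoop(text) {
  let result = ''
  for (const ch of text) result += ENTITY_MAP.get(ch) || ch
  return result
}

// Average milliseconds per call, after one warm-up run.
function timeIt(fn, input, runs = 20) {
  fn(input)
  const start = performance.now()
  for (let i = 0; i < runs; i++) fn(input)
  return (performance.now() - start) / runs
}

const sample = '<a href="/x?a=1&b=2">it\'s</a> '.repeat(2000)
console.log('regex ms/run:', timeIt(encodeRegex, sample).toFixed(3))
console.log('loop  ms/run:', timeIt(encodeLoop, sample).toFixed(3))
```

Try several input sizes and entity densities, since both affect which version comes out ahead.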

Batch Processing

function batchEncode(items) {
  const pattern = /[&<>"'/]/g
  return items.map(text => text.replace(pattern, c => HTML_ENTITIES[c]))
}

Practical Applications

1. Rich Text Editor Output

Content from editors needs sanitization:

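function sanitizeRichText(html) {
  // 1. Decode all entities so encoded payloads cannot hide from the filters
  const decoded = decodeHtml(html)

  // 2. Remove script blocks and inline event handlers (quoted or unquoted).
  //    Regex filtering is easy to bypass, so treat this as a first pass only
  const cleaned = decoded
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '')

  // 3. Re-encode; any remaining markup is then shown as text, not rendered
  return encodeHtml(cleaned)
}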

2. HTML in JSON

APIs returning JSON with HTML content:

const data = {
  title: '<script>alert("xss")</script>',
  content: 'A & B < C'
}

// Encode before transmission
const safe = {
  title: encodeHtml(data.title),
  // '&lt;script&gt;alert(&quot;xss&quot;)&lt;&#x2F;script&gt;'
  content: encodeHtml(data.content)
  // 'A &amp; B &lt; C'
}
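An alternative, when you control the code that renders the JSON, is to keep the payload raw and encode at the point where the HTML is built. The tagged template below is a minimal sketch of that idea; html and escapeHtml are hypothetical helpers, and whether to encode before transmission or at render time depends on who consumes the API.

```javascript
const BASIC = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' }

function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, ch => BASIC[ch])
}

// Tagged template: literal parts pass through, interpolated values are encoded.
function html(strings, ...values) {
  return strings.reduce(
    (out, part, i) => out + part + (i < values.length ? escapeHtml(values[i]) : ''),
    ''
  )
}

const data = { title: '<script>alert("xss")</script>', content: 'A & B < C' }
console.log(html`<h1>${data.title}</h1><p>${data.content}</p>`)
// <h1>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</h1><p>A &amp; B &lt; C</p>
```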

3. Special Symbol Display

Displaying copyright, currency, mathematical symbols:

const specialChars = {
  '©': '&copy;',
  '®': '&reg;',
  '™': '&trade;',
  '€': '&euro;',
  '¥': '&yen;',
  '±': '&plusmn;',
  '×': '&times;',
  '÷': '&divide;',
}

// In environments that don't support direct input of these characters
// (like legacy email clients), use HTML entities for correct display
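When no named entity is at hand, numeric references work for any character. The sketch below converts every non-ASCII character to a hexadecimal reference; encodeNonAscii is a hypothetical helper and is not part of the table above.

```javascript
// Replace each code point above U+007F with a hexadecimal numeric reference.
// The "u" flag makes the regex match whole code points, including emoji.
function encodeNonAscii(text) {
  return text.replace(/[\u0080-\u{10FFFF}]/gu, ch =>
    `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`
  )
}

console.log(encodeNonAscii('© 2024 €5 😀'))
// &#xA9; 2024 &#x20AC;5 &#x1F600;
```

The output is plain ASCII, so it survives transports that mangle other encodings. Note that this helper leaves &, < and > untouched, so it complements encodeHtml rather than replacing it.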

A Complete Tool

Based on these principles, I built: HTML Entity Encoder/Decoder

Features:

Encode / decode bidirectional conversion
Supports named and numeric entities
Quick selection of common entities
Safe handling of user input

Encoding and decoding seem simple, but doing security and compatibility right takes effort. Hope this helps.
