HTML Entity Encoding: From XSS Prevention to Character Escaping Implementation
Building a content management system recently, I needed to safely display user-submitted text on pages. Direct rendering? XSS attacks await. Escape everything? Garbage output. That’s where HTML entity encoding comes in.
What Are HTML Entities
HTML entities represent special characters using specific character sequences. For example, < becomes &lt; and > becomes &gt;. These seemingly cryptic strings solve two core problems:
- Security: Prevent XSS attacks
- Compatibility: Correctly display special characters and Unicode
Core Encoding Implementation
The most basic approach uses character replacement:
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
  '/': '&#x2F;',
}
function encodeHtml(text) {
  // Single pass: each input character is looked up once, so the output is never re-scanned
  return text.replace(/[&<>"'/]/g, char => HTML_ENTITIES[char] || char)
}
Simple enough, but several details deserve attention:
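1. & Must Be Escaped First
When replacements are chained one character at a time, & has to go first. If it comes later, it re-encodes the ampersands of entities that were already produced:
// Wrong order
'<script>'.replace(/</g, '&lt;').replace(/&/g, '&amp;')
// Result: '&amp;lt;script>' - double encoded!
// Correct order
'<script>'.replace(/&/g, '&amp;').replace(/</g, '&lt;')
// Result: '&lt;script>'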
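To make the ordering rule concrete, here is a minimal sketch of a chained encoder next to a single-pass one. The names encodeChained, encodeSinglePass, and ESCAPES are mine, chosen for illustration. The single-pass version sidesteps the ordering issue because it never looks at its own output.

```javascript
// Chained replacements: '&' is handled first, otherwise the '&' introduced
// by the later entities would be encoded a second time.
function encodeChained(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
}

// Single pass: every input character is inspected exactly once,
// so the order of the table entries does not matter.
const ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }
function encodeSinglePass(text) {
  return text.replace(/[&<>"']/g, ch => ESCAPES[ch])
}

console.log(encodeChained('Tom & "Jerry" <3'))
// → Tom &amp; &quot;Jerry&quot; &lt;3
```

Both functions should produce the same output for any input; the chained form is only correct as long as the & step stays at the top.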
2. Why Escape / and '
Many developers only escape <>&", which isn’t enough:
<!-- Risk of not escaping / -->
<script>
var url = "USER_INPUT";
// If USER_INPUT is "</script><script>alert(1)//"
// Not escaping / directly closes the script tag
</script>
<!-- Risk of not escaping ' -->
<a href='USER_INPUT'>Click</a>
<!-- If USER_INPUT is "' onclick='alert(1)'" -->
<!-- Attribute injection happens -->
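The attribute case can be checked with a few lines of code. This is a sketch under my own assumptions: escapeAttr and renderLink are hypothetical helpers, not part of any library, and the link is built as a plain string.

```javascript
const ATTR_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }

// Escape a value so it cannot terminate a quoted attribute.
function escapeAttr(value) {
  return String(value).replace(/[&<>"']/g, ch => ATTR_ESCAPES[ch])
}

// Hypothetical template that uses single quotes around the attribute.
function renderLink(href) {
  return `<a href='${escapeAttr(href)}'>Click</a>`
}

console.log(renderLink("' onclick='alert(1)"))
// → <a href='&#39; onclick=&#39;alert(1)'>Click</a>
```

With the single quote escaped, the payload stays inside the href value instead of becoming a new onclick attribute. The href itself still needs protocol validation.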
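Decoding Pitfalls
Decoding is trickier than encoding. Many developers write:
// DANGEROUS! Don't use in production
function decodeHtml(html) {
  const el = document.createElement('div')
  el.innerHTML = html
  return el.textContent
}
This leverages browser HTML parsing - simple but crude. It won’t work in Node.js, where there is no document, and assigning untrusted markup to innerHTML introduces DOM XSS risk: a payload such as <img src=x onerror=...> can fire its handler even though the element is never attached to the page.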
A safer approach builds a reverse mapping:
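const ENTITY_TO_CHAR = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&#x2F;': '/',
  '&nbsp;': '\u00A0', // non-breaking space
  '&copy;': '©',
  // ... more entities
}
function decodeHtml(html) {
  // Match named entities (&name;) and numeric ones (&#60; or &#x3C;)
  return html.replace(/&[a-z][a-z0-9]*;|&#x?[0-9a-f]+;/gi, entity => {
    // Entities missing from the table are returned unchanged
    return ENTITY_TO_CHAR[entity] || entity
  })
}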
Handling Numeric Entities
HTML entities come in two forms:
- Named entities: &lt;, &copy;
- Numeric entities: &#60; (decimal), &#x3C; (hexadecimal)
Decoding numeric entities:
function decodeNumericEntity(entity) {
  const hex = entity.match(/&#x([0-9a-f]+);/i)
  if (hex) {
    return String.fromCodePoint(parseInt(hex[1], 16))
  }
  const dec = entity.match(/&#(\d+);/)
  if (dec) {
    return String.fromCodePoint(parseInt(dec[1], 10))
  }
  return entity
}
// Test
decodeNumericEntity('&#60;') // '<'
decodeNumericEntity('&#x3C;') // '<'
decodeNumericEntity('&#128512;') // '😀'
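Named and numeric decoding can be folded into one pass. The sketch below is one possible way to do it, not a complete decoder: decodeEntities and NAMED_ENTITIES are names I made up, the table holds only a handful of entries, and out-of-range or surrogate code points are left untouched so that String.fromCodePoint never throws.

```javascript
// Small illustrative table; a real decoder would need the full named-entity list.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0', copy: '©' }

function decodeEntities(html) {
  return html.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X'
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
      // Leave NUL, surrogates, and values above U+10FFFF as they are.
      const valid = code > 0 && code <= 0x10FFFF && !(code >= 0xD800 && code <= 0xDFFF)
      return valid ? String.fromCodePoint(code) : match
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body) ? NAMED_ENTITIES[body] : match
  })
}

console.log(decodeEntities('&lt;b&gt; &#60; &#x3C; &copy; &unknown; &#x110000;'))
// → <b> < < © &unknown; &#x110000;
```

Because the input is scanned once, '&amp;lt;' decodes to '&lt;' and is not decoded a second time.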
XSS Prevention Edge Cases
Encoding prevents most XSS, but isn’t a silver bullet.
Context Matters
The same input needs different escaping in different positions:
<!-- HTML content context -->
<div>USER_INPUT</div>
<!-- Needs: &<>"' escaping -->
<!-- Attribute context -->
<input value="USER_INPUT">
<!-- Needs: &<>"' escaping -->
<!-- URL attribute context -->
<a href="USER_INPUT">Click</a>
<!-- Needs: protocol validation + URL encoding -->
<!-- JavaScript context -->
<script>var data = "USER_INPUT";</script>
<!-- Needs: JavaScript string escaping (\, quotes, line breaks, and < written as \u003C); entities are not decoded inside script elements -->
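For the JavaScript context, one common approach is to serialize the value with JSON.stringify and then escape the few characters that matter to the HTML parser. The helper name toScriptLiteral is my own, and this is a sketch for string data placed inside a script element, not a general-purpose escaper.

```javascript
// Produce a JavaScript string literal that is safe to embed in a <script> block.
function toScriptLiteral(value) {
  return JSON.stringify(String(value))
    .replace(/</g, '\\u003C')   // prevents an early </script>
    .replace(/>/g, '\\u003E')
    .replace(/&/g, '\\u0026')
    .replace(/\u2028/g, '\\u2028') // line separators, for older JavaScript engines
    .replace(/\u2029/g, '\\u2029')
}

console.log(toScriptLiteral('</script><script>alert(1)//'))
// → "\u003C/script\u003E\u003Cscript\u003Ealert(1)//"
```

The result is still valid JSON, so JSON.parse (or the JavaScript parser itself) recovers the original string.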
Encoding Can’t Fix Injection
// User input
const input = 'javascript:alert(1)'
// After encoding
const encoded = encodeHtml(input) // 'javascript:alert(1)'
// The input contains none of the escaped characters, so encoding leaves it unchanged
// But when used in an href
const link = `<a href="${encoded}">Click</a>`
// Clicking still triggers XSS!
The correct approach validates protocols:
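function safeUrl(url) {
  const allowed = ['http://', 'https://', 'mailto:']
  // Schemes are case-insensitive, so compare against a trimmed, lowercased copy
  const normalized = url.trim().toLowerCase()
  if (allowed.some(p => normalized.startsWith(p))) {
    return url.trim()
  }
  return '#'
}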
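An alternative to prefix matching is to let the platform's URL parser extract the scheme. The sketch below assumes the global URL class (available in browsers and in Node.js); safeHref and ALLOWED_PROTOCOLS are illustrative names. Note that it rejects relative URLs, since new URL throws without a base.

```javascript
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:'])

function safeHref(input) {
  let parsed
  try {
    // The parser trims surrounding whitespace and lowercases the scheme.
    parsed = new URL(input)
  } catch {
    return '#' // not an absolute URL
  }
  return ALLOWED_PROTOCOLS.has(parsed.protocol) ? parsed.href : '#'
}

console.log(safeHref('javascript:alert(1)'))    // → #
console.log(safeHref('HTTPS://Example.com/a'))  // → https://example.com/a
```

The returned href is the parser's normalized form, which should still be HTML-encoded before it goes into an attribute.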
Performance Optimization
When processing large amounts of text, regex replacement can become a bottleneck.
Using Map for Speed
const ENTITY_MAP = new Map(Object.entries(HTML_ENTITIES))
function encodeHtmlFast(text) {
  let result = ''
  for (const char of text) {
    result += ENTITY_MAP.get(char) || char
  }
  return result
}
Regex replacement tends to do well on short text and Map iteration may pull ahead on long text, but the crossover point depends on the JavaScript engine, so benchmark with representative input before switching.
Batch Processing
function batchEncode(items) {
  // Reuse one pattern across all items
  const pattern = /[&<>"'/]/g
  return items.map(text => text.replace(pattern, c => HTML_ENTITIES[c]))
}
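Since the faster variant depends on the engine and the input, a small timing harness helps to decide. This is a rough sketch rather than a rigorous benchmark: timeIt, encodeWithRegex, encodeWithMap, and the sample text are all my own, and a single run like this ignores warm-up and garbage-collection noise.

```javascript
const BENCH_ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;', '/': '&#x2F;' }
const BENCH_MAP = new Map(Object.entries(BENCH_ENTITIES))

function encodeWithRegex(text) {
  return text.replace(/[&<>"'/]/g, ch => BENCH_ENTITIES[ch])
}

function encodeWithMap(text) {
  let result = ''
  for (const ch of text) result += BENCH_MAP.get(ch) || ch
  return result
}

// Use the high-resolution timer when available, otherwise fall back to Date.now().
const now = typeof performance !== 'undefined' ? () => performance.now() : () => Date.now()

// Returns the total time in milliseconds for `runs` calls of fn(input).
function timeIt(fn, input, runs) {
  const start = now()
  for (let i = 0; i < runs; i++) fn(input)
  return now() - start
}

const sample = '<p class="note">Tom & Jerry\'s</p>'.repeat(1000)
console.log('regex:', timeIt(encodeWithRegex, sample, 50).toFixed(1), 'ms')
console.log('map:  ', timeIt(encodeWithMap, sample, 50).toFixed(1), 'ms')
```

Running it with both short and long samples gives a feel for where, if anywhere, the two approaches cross over on a given runtime.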
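Practical Applications
1. Rich Text Editor Output
Content from editors needs sanitization. The version below is a simplified illustration: regex-based stripping is easy to bypass, so production code should rely on a vetted HTML sanitizer.
function sanitizeRichText(html) {
  // 1. Decode all entities
  const decoded = decodeHtml(html)
  // 2. Remove script elements and inline event handlers (quoted or unquoted)
  const cleaned = decoded
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '')
  // 3. Re-encode, which also turns any remaining markup into plain text
  return encodeHtml(cleaned)
}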
2. HTML in JSON
APIs returning JSON with HTML content:
const data = {
  title: '<script>alert("xss")</script>',
  content: 'A & B < C'
}
// Encode before transmission
const safe = {
  title: encodeHtml(data.title),
  // '&lt;script&gt;alert(&quot;xss&quot;)&lt;&#x2F;script&gt;'
  content: encodeHtml(data.content)
  // 'A &amp; B &lt; C'
}
3. Special Symbol Display
Displaying copyright, currency, mathematical symbols:
const specialChars = {
  '©': '&copy;',
  '®': '&reg;',
  '™': '&trade;',
  '€': '&euro;',
  '¥': '&yen;',
  '±': '&plusmn;',
  '×': '&times;',
  '÷': '&divide;',
}
// In environments that don't support direct input of these characters
// (like legacy email clients), use HTML entities for correct display
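When no named entity is at hand, any character can be written as a numeric entity instead. The following sketch converts everything above the printable ASCII range into hexadecimal entities; encodeNonAscii is a name I picked for illustration.

```javascript
// Replace every character above U+007E with a hexadecimal numeric entity.
function encodeNonAscii(text) {
  let out = ''
  // for...of iterates by code point, so emoji are not split into surrogate halves.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0)
    out += codePoint > 0x7E ? `&#x${codePoint.toString(16).toUpperCase()};` : ch
  }
  return out
}

console.log(encodeNonAscii('© 2024 😀'))
// → &#xA9; 2024 &#x1F600;
```

This only handles non-ASCII characters; markup characters such as < and & still go through the regular encoder.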
A Complete Tool
Based on these principles, I built: HTML Entity Encoder/Decoder
Features:
- Encode / decode bidirectional conversion
- Supports named and numeric entities
- Quick selection of common entities
- Safe handling of user input
Encoding and decoding seem simple, but doing security and compatibility right takes effort. Hope this helps.
Related: URL Encoder/Decoder | Base64 Encoder/Decoder