Building a Document Format Converter: From TXT/Markdown/HTML to Any Format
Every time I need to embed a Markdown document into a webpage or extract plain text from HTML, I end up installing desktop software or writing scripts. But the core logic of document format conversion isn’t that complex — browser APIs can handle most cases. Let me walk through building a pure frontend document converter.
The Essence: Re-encoding Text#
Document format conversion sounds fancy, but it boils down to three steps: read source file → parse content structure → re-encode in target format.
Text-based documents (TXT, Markdown, HTML, JSON) are all plain text with different structuring rules:
- TXT: No structure, pure text
- Markdown: Uses #, *, > markers for structure
- HTML: Uses <tag> elements for structure
- JSON: Uses {}, [] for data structure
The core is understanding how these rules map to each other.
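The read → parse → re-encode pipeline can be sketched as a lookup table keyed by source and target format. This is a minimal sketch rather than the converter's actual code: `Format`, `converters`, and `convert` are hypothetical names, and only two routes are filled in.

```typescript
// Hypothetical names throughout: Format, converters and convert are
// illustrative and not taken from the article's converter.
type Format = 'txt' | 'md' | 'html' | 'json'
type Converter = (source: string) => string

// Each supported route maps a "from->to" key to a re-encoding function.
const converters: Partial<Record<string, Converter>> = {
  // TXT -> HTML: escape the markup characters, then map newlines to <br>
  'txt->html': (s) =>
    s
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
      .replace(/\n/g, '<br>'),
  // TXT -> JSON: wrap the raw text in an object
  'txt->json': (s) => JSON.stringify({ content: s }, null, 2),
}

function convert(source: string, from: Format, to: Format): string {
  if (from === to) return source // nothing to re-encode
  const route = converters[`${from}->${to}`]
  if (!route) throw new Error(`Unsupported conversion: ${from} -> ${to}`)
  return route(source)
}
```

Adding a new route then means adding one entry to the table, without touching `convert` itself.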
FileReader API: The Starting Point#
Browser-side file reading relies on FileReader:
const handleFile = (file: File) => {
  const reader = new FileReader()
  reader.onload = (e) => {
    const content = e.target?.result as string
    // content is now the raw file text
    processContent(content)
  }
  reader.readAsText(file) // UTF-8 by default
}
readAsText defaults to UTF-8. For GBK-encoded files:
reader.readAsText(file, 'GBK')
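The same encoding labels also work with TextDecoder when the file is read as raw bytes, which keeps the bytes around for a second decoding attempt. A minimal sketch, assuming the bytes are already in hand; `decodeBytes` is a hypothetical helper name:

```typescript
// decodeBytes is a hypothetical helper, not part of any browser API.
// TextDecoder accepts encoding labels such as 'utf-8' or 'gbk'.
function decodeBytes(bytes: Uint8Array, encoding: string = 'utf-8'): string {
  return new TextDecoder(encoding).decode(bytes)
}

// UTF-8 bytes for "héllo" (é is encoded as the two bytes 0xC3 0xA9)
const sample = new Uint8Array([0x68, 0xc3, 0xa9, 0x6c, 0x6c, 0x6f])
const decoded = decodeBytes(sample)
```

In the browser, the bytes would typically come from `new Uint8Array(await file.arrayBuffer())`.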
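Drag-and-drop with the onDrop event (the drop target also needs an onDragOver handler that calls preventDefault, otherwise the browser never fires drop):
const handleDrop = (e: React.DragEvent) => {
  e.preventDefault() // stop the browser from opening the file
  const file = e.dataTransfer.files[0] // only the first dropped file is used
  if (file) handleFile(file)
}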
Core Conversion Algorithms#
Plain Text → HTML#
This is the most common conversion. The key is HTML entity escaping + newline handling:
function textToHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;') // & must be first
    .replace(/</g, '&lt;') // prevent XSS
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
    .replace(/\r?\n/g, '<br>') // newlines (LF or CRLF) to <br>
}
Key points:
- & must be replaced first; otherwise the & in an already-escaped &lt; is escaped again and < ends up as &amp;lt;
- < and > must be escaped; otherwise tags in the text get parsed as HTML
- Newline mapping: \n needs <br> or wrapping in <pre>
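The ordering rule is easiest to see by running both orders on the same input. Both helpers below are illustrative and differ only in where the `&` replacement sits:

```typescript
// Both helpers are illustrative; only the order of the replacements differs.
function escapeAmpFirst(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

function escapeAmpLast(text: string): string {
  // Wrong order: the & produced by &lt; and &gt; is escaped a second time
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;')
}

const correct = escapeAmpFirst('a < b') // 'a &lt; b'
const doubled = escapeAmpLast('a < b') // 'a &amp;lt; b'
```

A browser renders the second result as the literal text `a &lt; b` instead of `a < b`.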
Markdown → HTML#
Markdown to HTML is much more complex, requiring a parser. A lightweight approach uses regex for common syntax:
function markdownToHtml(md: string): string {
  let html = md
    // Headers
    .replace(/^### (.+)$/gm, '<h3>$1</h3>')
    .replace(/^## (.+)$/gm, '<h2>$1</h2>')
    .replace(/^# (.+)$/gm, '<h1>$1</h1>')
    // Bold and italic
    .replace(/\*\*\*(.+?)\*\*\*/g, '<strong><em>$1</em></strong>')
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
    .replace(/\*(.+?)\*/g, '<em>$1</em>')
    // Code blocks
    .replace(/```(\w*)\n([\s\S]*?)```/g, '<pre><code class="$1">$2</code></pre>')
    // Inline code
    .replace(/`(.+?)`/g, '<code>$1</code>')
    // Links
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>')
    // Paragraphs
    .replace(/\n\n/g, '</p><p>')
  return `<p>${html}</p>`
}
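Regex has inherent limitations, though: it can’t handle nested syntax and misses many edge cases (the rules above, for example, also rewrite * inside fenced code blocks and pass raw HTML in the source through unescaped). For production, use a mature parser such as marked or markdown-it.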
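One common way to soften the code-block problem without a full parser is to pull fenced blocks out first, run the inline rules on the rest, and then put the blocks back. The sketch below illustrates that placeholder technique under simplifying assumptions; `convertWithProtectedCode` and its `inlineRules` parameter are hypothetical names, and NUL characters are assumed not to occur in the input.

```typescript
// Hypothetical helper: protects fenced code blocks before inline rules run.
const FENCE = '`'.repeat(3) // three backticks, built at runtime for readability

function convertWithProtectedCode(
  md: string,
  inlineRules: (text: string) => string
): string {
  const blocks: string[] = []
  const fencePattern = new RegExp(`${FENCE}(\\w*)\\n([\\s\\S]*?)${FENCE}`, 'g')

  // 1. Swap each fenced block for a placeholder and remember its HTML
  const withPlaceholders = md.replace(
    fencePattern,
    (_match, lang: string, code: string) => {
      const escaped = code
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
      blocks.push(`<pre><code class="${lang}">${escaped}</code></pre>`)
      return `\u0000${blocks.length - 1}\u0000`
    }
  )

  // 2. Run the inline rules on everything outside the code blocks
  const converted = inlineRules(withPlaceholders)

  // 3. Put the untouched code blocks back
  return converted.replace(
    /\u0000(\d+)\u0000/g,
    (_match, index: string) => blocks[Number(index)] ?? ''
  )
}

const bold = (text: string) => text.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
const result = convertWithProtectedCode(
  `**hi**\n${FENCE}js\nconst x = a ** b ** c\n${FENCE}`,
  bold
)
```

Here the `**` operators inside the fenced block survive intact, while `**hi**` outside it still becomes bold.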
Any Content → JSON#
Wrap text content into a JSON structure:
function toJson(content: string, filename?: string): string {
  const result = {
    content,
    metadata: {
      filename: filename || 'unknown',
      size: new Blob([content]).size,
      lines: content.split('\n').length,
      charset: 'UTF-8',
      convertedAt: new Date().toISOString()
    }
  }
  return JSON.stringify(result, null, 2)
}
Blob.size gives the size in bytes of the UTF-8 encoded text, which is more accurate than content.length (a count of UTF-16 code units).
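The gap between the two measures shows up as soon as the text leaves ASCII. The sketch below uses TextEncoder, which always produces UTF-8 (the same encoding a Blob uses for string parts); `utf8ByteLength` is a hypothetical helper name.

```typescript
// utf8ByteLength is a hypothetical helper; TextEncoder always encodes as UTF-8.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length
}

const samples = ['abc', '你好', '😀']
const report = samples.map((s) => ({
  text: s,
  codeUnits: s.length, // UTF-16 code units
  bytes: utf8ByteLength(s), // UTF-8 bytes
}))
// abc: 3 code units, 3 bytes
// 你好: 2 code units, 6 bytes
// 😀: 2 code units (a surrogate pair), 4 bytes
```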
HTML Entity Escaping: XSS Prevention#
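The most overlooked security issue in document conversion is XSS. If user-uploaded text contains <script>alert('xss')</script> or <img src=x onerror=alert('xss')>, inserting it into the page as HTML can execute it. (Script tags assigned through innerHTML do not run, but event-handler attributes such as onerror do.)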
Complete escaping function:
function escapeHtml(str: string): string {
  const escapeMap: Record<string, string> = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
    '/': '&#x2F;'
  }
  return str.replace(/[&<>"'/]/g, (char) => escapeMap[char])
}
Or use the browser’s built-in textContent assignment:
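// Safe for element content: the browser escapes on serialization
const div = document.createElement('div')
div.textContent = userInput
const escaped = div.innerHTML
Browser-native escaping avoids hand-rolled regex mistakes, but the serializer only escapes &, <, > (and non-breaking spaces) in text nodes. Quotes pass through unchanged, so the result is safe as element content but not inside an attribute value.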
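Where the escaped text lands matters as much as how it is escaped. Inside an attribute value the quote characters are the dangerous ones, which is why the escape map covers " and '. A small sketch, with `escapeAttr` and `titleAttr` as hypothetical helper names:

```typescript
// Hypothetical helpers for illustration only.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
}

// Builds a title attribute from untrusted text
function titleAttr(userInput: string): string {
  return `title="${escapeAttr(userInput)}"`
}

// Without quote escaping, this input would close the attribute early
// and smuggle in an event handler.
const payload = '" onmouseover="alert(1)'
const safe = titleAttr(payload)
// 'title="&quot; onmouseover=&quot;alert(1)"'
```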
File Download: Blob + URL.createObjectURL#
After conversion, trigger download:
function downloadFile(content: string, filename: string, mimeType: string) {
  const blob = new Blob([content], { type: mimeType })
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.href = url
  a.download = filename
  a.click()
  // Release memory to avoid leaks
  URL.revokeObjectURL(url)
}
URL.createObjectURL creates a temporary URL that keeps the Blob alive in memory until the URL is revoked or the document is unloaded. Release it with revokeObjectURL once the download has started; forgetting to do so leaks memory, especially in batch conversion scenarios.
MIME types for each format:
| Format | MIME Type |
|---|---|
| TXT | text/plain |
| Markdown | text/markdown |
| HTML | text/html |
| JSON | application/json |
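In code, the table can live next to the file extensions so that the download name and MIME type stay in sync. This is a sketch under assumed names (`TargetFormat`, `formatInfo`, and `outputName` are hypothetical); the extensions are the conventional ones for each format.

```typescript
// Hypothetical names: TargetFormat, formatInfo, outputName.
type TargetFormat = 'txt' | 'md' | 'html' | 'json'

const formatInfo: Record<TargetFormat, { mime: string; ext: string }> = {
  txt: { mime: 'text/plain', ext: '.txt' },
  md: { mime: 'text/markdown', ext: '.md' },
  html: { mime: 'text/html', ext: '.html' },
  json: { mime: 'application/json', ext: '.json' },
}

// Swap the source extension for the target one: "notes.md" -> "notes.html".
// The (?!^) guard keeps dot-files such as ".gitignore" from losing their name.
function outputName(sourceName: string, target: TargetFormat): string {
  const base = sourceName.replace(/(?!^)\.[^./\\]+$/, '')
  return base + formatInfo[target].ext
}
```

A call such as `downloadFile(html, outputName(file.name, 'html'), formatInfo.html.mime)` would then tie the pieces together.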
Encoding Detection and Garbled Text#
Encoding issues are the biggest headache in file conversion. Chinese documents often use GBK encoding, which turns into garbled text when decoded as UTF-8.
Browser-side encoding detection can use a detection library (encoding-japanese, for example, targets Japanese encodings such as Shift_JIS and EUC-JP), or a simpler approach: try UTF-8 first, then retry with GBK if garbled characters appear:
async function readFileWithEncoding(file: File): Promise<string> {
  // Try UTF-8 first
  const utf8Content = await readFileAsText(file, 'UTF-8')
  // Simple detection: many replacement characters suggest the file is not UTF-8
  const replacementCount = (utf8Content.match(/\uFFFD/g) || []).length
  if (replacementCount > utf8Content.length * 0.01) {
    // More than 1% replacement chars, try GBK
    return readFileAsText(file, 'GBK')
  }
  return utf8Content
}
function readFileAsText(file: File, encoding: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader()
    reader.onload = (e) => resolve(e.target?.result as string)
    reader.onerror = () => reject(reader.error) // surface read failures instead of hanging
    reader.readAsText(file, encoding)
  })
}
\uFFFD is the Unicode replacement character (�). When the decoder encounters a byte sequence that is invalid in the chosen encoding, it outputs this character, so many occurrences indicate the wrong encoding.
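A stricter variant of the same idea skips the counting: a TextDecoder created with `fatal: true` throws on the first invalid byte sequence instead of emitting U+FFFD. This swaps the 1% heuristic for an exact validity check and is only a sketch; `isValidUtf8` is a hypothetical helper name, and the bytes are assumed to have been read with `file.arrayBuffer()` or similar.

```typescript
// isValidUtf8 is a hypothetical helper. With { fatal: true } the decoder
// throws a TypeError on invalid input instead of emitting U+FFFD.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes)
    return true
  } catch {
    return false
  }
}

// "你好" encoded as UTF-8 (valid) and as GBK (not valid UTF-8)
const utf8Bytes = new Uint8Array([0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd])
const gbkBytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3])
```

Note that a GBK file containing only ASCII characters is also valid UTF-8, and decodes to the same text either way.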
The Result#
Based on these ideas, I built: Document Format Converter
Features:
- Drag-and-drop file upload
- Automatic HTML entity escaping
- Real-time preview of conversion result
- One-click download in target format
Supports conversion between TXT, Markdown, HTML, and JSON. All processing happens in the browser; files are never uploaded to a server.
The code isn’t complex, but encoding detection, XSS prevention, and memory release are details that cause production issues if overlooked.