Building a Document Format Converter: From TXT/Markdown/HTML to Any Format

Every time I need to embed a Markdown document into a webpage or extract plain text from HTML, I end up installing desktop software or writing scripts. But the core logic of document format conversion isn’t that complex — browser APIs can handle most cases. Let me walk through building a pure frontend document converter.

The Essence: Re-encoding Text

Document format conversion sounds fancy, but it boils down to three steps: read source file → parse content structure → re-encode in target format.

Text-based documents (TXT, Markdown, HTML, JSON) are all plain text with different structuring rules:

  • TXT: No structure, pure text
  • Markdown: Uses #, *, > markers for structure
  • HTML: Uses <tag> elements for structure
  • JSON: Uses {}, [] for data structure

The core is understanding how these rules map to each other.
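One way to picture the read → parse → re-encode idea is a small registry keyed by source and target format. This is only a sketch: the names Format, Converter, converters, and convert are hypothetical and do not come from the tool described here, and only two trivial pairs are registered.

```typescript
// Hypothetical names throughout; this registry is an illustration only.
type Format = 'txt' | 'markdown' | 'html' | 'json'
type Converter = (source: string) => string

// Key is "source->target"; a missing key means the pair is unsupported.
const converters = new Map<string, Converter>([
  // TXT has no structure, so moving to HTML only needs escaping and <br>
  ['txt->html', (s) =>
    s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/\n/g, '<br>')],
  // Any text can be wrapped as a JSON string value
  ['txt->json', (s) => JSON.stringify({ content: s })],
])

function convert(source: string, from: Format, to: Format): string {
  if (from === to) return source // same rules on both sides, nothing to re-encode
  const fn = converters.get(`${from}->${to}`)
  if (!fn) throw new Error(`No converter registered for ${from} -> ${to}`)
  return fn(source)
}
```

Adding a new pair then means registering one more function, and unsupported pairs fail loudly instead of returning the input unchanged.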

FileReader API: The Starting Point

Browser-side file reading relies on FileReader:

const handleFile = (file: File) => {
  const reader = new FileReader()
  reader.onload = (e) => {
    const content = e.target?.result as string
    // content is now the raw file text
    processContent(content)
  }
  reader.readAsText(file)  // UTF-8 by default
}

readAsText defaults to UTF-8. For GBK-encoded files:

reader.readAsText(file, 'GBK')

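Drag-and-drop goes through the onDrop event. The drop target also needs an onDragOver handler that calls e.preventDefault(); without it the browser does not fire drop: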

const handleDrop = (e: React.DragEvent) => {
  e.preventDefault()
  const file = e.dataTransfer.files[0]
  if (file) handleFile(file)
}

Core Conversion Algorithms

Plain Text → HTML

This is the most common conversion. The key is HTML entity escaping + newline handling:

function textToHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')    // & must be first
    .replace(/</g, '&lt;')     // prevent XSS
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
    .replace(/\n/g, '<br>')    // newlines to <br>
}

Key points:

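  1. & must be replaced first; otherwise the & inside an already-inserted &lt; is escaped again, producing &amp;lt;
  2. < and > must be escaped; otherwise tag-like text in the source is parsed as HTML
  3. Newlines need explicit handling because HTML collapses whitespace: convert \n to <br> or wrap the text in <pre>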

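As a variation on the points above, here is a minimal sketch that also normalizes Windows line endings and turns blank-line-separated blocks into paragraphs. The helper names escapeText and textToParagraphs are hypothetical, and the paragraph rule (one or more blank lines) is an assumption rather than something the converter necessarily does.

```typescript
// Hypothetical helpers: blank lines split paragraphs, single newlines become <br>.
function escapeText(text: string): string {
  // & first, so the entities produced for < and > are not escaped twice
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

function textToParagraphs(text: string): string {
  return text
    .replace(/\r\n?/g, '\n')        // normalize CRLF and lone CR to LF
    .replace(/^\n+|\n+$/g, '')      // drop leading and trailing blank lines
    .split(/\n{2,}/)                // one or more blank lines end a paragraph
    .filter((block) => block.trim() !== '')
    .map((block) => `<p>${escapeText(block).replace(/\n/g, '<br>')}</p>`)
    .join('\n')
}
```

For example, textToParagraphs('a & b\r\nc\r\n\r\n<d>') yields two paragraphs, the first containing a <br> and the second containing the escaped text &lt;d&gt;.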
Markdown → HTML

Markdown to HTML is much more complex, requiring a parser. A lightweight approach uses regex for common syntax:

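function markdownToHtml(md: string): string {
  // Rendered code is stashed behind placeholders so later rules leave it alone
  const stash: string[] = []
  const keep = (s: string) => `\u0000${stash.push(s) - 1}\u0000`

  const html = md
    // Escape raw HTML first so the source cannot inject markup
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    // Code blocks, then inline code (before emphasis, so * inside code survives)
    .replace(/```(\w*)\n([\s\S]*?)```/g, (_m, lang, code) =>
      keep(`<pre><code class="${lang}">${code}</code></pre>`))
    .replace(/`(.+?)`/g, (_m, code) => keep(`<code>${code}</code>`))
    // Headers
    .replace(/^### (.+)$/gm, '<h3>$1</h3>')
    .replace(/^## (.+)$/gm, '<h2>$1</h2>')
    .replace(/^# (.+)$/gm, '<h1>$1</h1>')
    // Bold and italic
    .replace(/\*\*\*(.+?)\*\*\*/g, '<strong><em>$1</em></strong>')
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
    .replace(/\*(.+?)\*/g, '<em>$1</em>')
    // Links
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>')
    // Paragraphs
    .replace(/\n\n/g, '</p><p>')
    // Put the stashed code back
    .replace(/\u0000(\d+)\u0000/g, (_m, i) => stash[Number(i)])

  return `<p>${html}</p>`
}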

But regex has inherent limitations: it cannot handle nested syntax such as lists inside blockquotes, and edge cases multiply quickly. For production, use a mature parser such as marked or markdown-it.
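One edge case a regex link rule does not cover is the link target itself: [click](javascript:alert(1)) produces a working javascript: URL. A minimal sketch of a scheme check follows, assuming an allowlist of http, https, and mailto. The names safeHref and ALLOWED_PROTOCOLS are hypothetical, and the base URL is only a placeholder so that relative links parse.

```typescript
// Hypothetical helper: allow only a few URL schemes in rendered links.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:'])

function safeHref(href: string): string {
  try {
    // The base is a placeholder so relative links parse; it is never output.
    const url = new URL(href, 'https://placeholder.invalid/')
    return ALLOWED_PROTOCOLS.has(url.protocol) ? href : '#'
  } catch {
    return '#' // unparseable target
  }
}
```

Using the URL parser rather than a prefix check means variants with odd casing or leading whitespace resolve to the same javascript: scheme and are rejected. The returned value still needs attribute escaping before it is placed inside href="...".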

Any Content → JSON

Wrap text content into a JSON structure:

function toJson(content: string, filename?: string): string {
  const result = {
    content,
    metadata: {
      filename: filename || 'unknown',
      size: new Blob([content]).size,
      lines: content.split('\n').length,
      charset: 'UTF-8',
      convertedAt: new Date().toISOString()
    }
  }
  return JSON.stringify(result, null, 2)
}

Blob.size gives the actual byte size (a Blob stores strings as UTF-8), which is more accurate than content.length, which counts UTF-16 code units.
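The gap between the two numbers is easy to see with a few strings. This sketch uses TextEncoder, which also encodes to UTF-8, instead of Blob; the helper name utf8ByteLength is hypothetical.

```typescript
// Compare UTF-16 code units (string.length) with UTF-8 bytes.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length
}

for (const s of ['abc', '中文', '😀']) {
  console.log(JSON.stringify(s), 'length =', s.length, 'bytes =', utf8ByteLength(s))
}
// "abc"  length = 3 bytes = 3
// "中文" length = 2 bytes = 6
// "😀"   length = 2 bytes = 4
```

ASCII text matches on both counts, CJK characters take three bytes each, and an emoji outside the Basic Multilingual Plane is two code units but four bytes.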

HTML Entity Escaping: XSS Prevention

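The most overlooked security issue in document conversion is XSS. If user-uploaded text contains markup such as <img src=x onerror="alert('xss')"> and is inserted into the DOM as HTML, the browser runs it. (A <script> tag assigned through innerHTML is not executed, but event-handler attributes like onerror are, so escaping is still required.)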

Complete escaping function:

function escapeHtml(str: string): string {
  const escapeMap: Record<string, string> = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#x27;',
    '/': '&#x2F;'
  }
  return str.replace(/[&<>"'/]/g, (char) => escapeMap[char])
}

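Or let the browser do the escaping through a textContent assignment:

// Safe for element content: the browser escapes when serializing
const div = document.createElement('div')
div.textContent = userInput
const escaped = div.innerHTML

This relies on the browser serializer rather than a hand-written map, but it escapes only &, <, and > (plus non-breaking spaces). Quotes are left as-is, so use escapeHtml when the value goes inside an attribute.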

File Download: Blob + URL.createObjectURL

After conversion, trigger download:

function downloadFile(content: string, filename: string, mimeType: string) {
  const blob = new Blob([content], { type: mimeType })
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.href = url
  a.download = filename
  a.click()
  // Release memory to avoid leaks
  URL.revokeObjectURL(url)
}

URL.createObjectURL creates a temporary URL that keeps the Blob in memory until the URL is revoked or the document is unloaded, so you must release it yourself. Forgetting revokeObjectURL leaks memory, especially in batch conversion scenarios.
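For batch conversions it can help to make the release impossible to forget. The sketch below wraps the create/revoke pair in try/finally; withObjectUrl is a hypothetical helper name, and it assumes the URL is only needed while the callback runs, as in the synchronous a.click() above.

```typescript
// Hypothetical wrapper: the object URL is revoked even if `use` throws.
function withObjectUrl<T>(blob: Blob, use: (url: string) => T): T {
  const url = URL.createObjectURL(blob)
  try {
    return use(url)
  } finally {
    URL.revokeObjectURL(url)
  }
}
```

A download then becomes withObjectUrl(blob, (url) => { a.href = url; a.click() }). If a target browser turns out to need the URL a little longer, the revoke can be deferred with a timer instead.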

MIME types for each format:

Format     MIME Type
TXT        text/plain
Markdown   text/markdown
HTML       text/html
JSON       application/json
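The table maps naturally onto a lookup by file extension. This is a sketch with hypothetical names (MIME_BY_EXTENSION, mimeFor); the application/octet-stream fallback for unknown extensions is an assumption, not something the converter is known to do.

```typescript
// Hypothetical lookup from output filename to MIME type, mirroring the table above.
const MIME_BY_EXTENSION = new Map<string, string>([
  ['txt', 'text/plain'],
  ['md', 'text/markdown'],
  ['html', 'text/html'],
  ['json', 'application/json'],
])

function mimeFor(filename: string): string {
  const dot = filename.lastIndexOf('.')
  const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase()
  // Fall back to a generic binary type for unknown or missing extensions
  return MIME_BY_EXTENSION.get(ext) ?? 'application/octet-stream'
}
```

With this in place, a call such as downloadFile(content, name, mimeFor(name)) keeps the filename and the MIME type from drifting apart.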

Encoding Detection and Garbled Text

Encoding issues are the biggest headache in file conversion. Chinese documents often use GBK encoding, which reads as garbled text in UTF-8.

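Browser-side encoding detection can use a library (encoding-japanese targets Japanese encodings such as Shift_JIS and EUC-JP; a general-purpose detector such as jschardet also covers Chinese encodings), or a simpler approach: try UTF-8 first and retry with GBK if garbled characters appear: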

async function readFileWithEncoding(file: File): Promise<string> {
  // Try UTF-8 first
  const utf8Content = await readFileAsText(file, 'UTF-8')

  // Simple detection: if many replacement characters, might not be UTF-8
  const replacementCount = (utf8Content.match(/\uFFFD/g) || []).length
  if (replacementCount > utf8Content.length * 0.01) {
    // More than 1% replacement chars, try GBK
    return readFileAsText(file, 'GBK')
  }

  return utf8Content
}

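function readFileAsText(file: File, encoding: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader()
    reader.onload = () => resolve(reader.result as string)
    // Surface read failures instead of leaving the promise pending forever
    reader.onerror = () => reject(reader.error)
    reader.readAsText(file, encoding)
  })
}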

\uFFFD is the Unicode replacement character (�). The decoder emits it for every byte sequence that is invalid in the chosen encoding, so a high count suggests the file is not UTF-8.
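An alternative to counting replacement characters is to ask the decoder to fail outright. The sketch below uses TextDecoder with fatal: true, which throws on invalid UTF-8 instead of emitting \uFFFD, and falls back to GBK. The function name decodeWithFallback is hypothetical, and treating every non-UTF-8 file as GBK is an assumption that only suits a mostly Chinese corpus.

```typescript
// Sketch: strict UTF-8 decode first, the fallback only if the bytes are not valid UTF-8.
function decodeWithFallback(
  bytes: Uint8Array,
  fallback = 'gbk',
): { text: string; encoding: string } {
  try {
    // fatal: true makes decode() throw a TypeError instead of emitting U+FFFD
    const text = new TextDecoder('utf-8', { fatal: true }).decode(bytes)
    return { text, encoding: 'utf-8' }
  } catch {
    return { text: new TextDecoder(fallback).decode(bytes), encoding: fallback }
  }
}

// In the browser: decodeWithFallback(new Uint8Array(await file.arrayBuffer()))
```

This reads the file once as bytes and needs no threshold, at the cost of rejecting files that are UTF-8 apart from a single corrupted byte.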

The Result

Based on these ideas, I built: Document Format Converter

Features:

  • Drag-and-drop file upload
  • Automatic HTML entity escaping
  • Real-time preview of conversion result
  • One-click download in target format

Supports conversion between TXT, Markdown, HTML, JSON. All processing happens in the browser — files never upload to a server.

The code isn’t complex, but encoding detection, XSS prevention, and memory release are details that cause production issues if overlooked.
