Building a Sitemap Generator: XML Structure and SEO Best Practices#

I was doing SEO work on my blog recently when Google Search Console kept complaining about my sitemap. When I opened sitemap.xml, the format itself was fine, but the configuration was a mess: every page had priority 1.0 and changefreq was always “always”, basically telling search engines “this whole site is spam.”

So I built my own sitemap generator and dug into the technical details and SEO best practices.

The XML Structure#

sitemap.xml is essentially an XML file following the sitemaps.org protocol. The minimal structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

Generating this in TypeScript is straightforward, essentially just string concatenation:

// Shape of a single sitemap entry
interface UrlEntry {
  loc: string
  lastmod?: string
  changefreq?: string
  priority: number
}

function generateSitemap(entries: UrlEntry[], baseUrl: string): string {
  const lines: string[] = []
  lines.push('<?xml version="1.0" encoding="UTF-8"?>')
  lines.push('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')

  entries.forEach(entry => {
    // Turn relative paths into absolute URLs, normalizing the slashes
    const fullUrl = entry.loc.startsWith('http')
      ? entry.loc
      : `${baseUrl.replace(/\/$/, '')}/${entry.loc.replace(/^\//, '')}`

    lines.push('  <url>')
    lines.push(`    <loc>${escapeXml(fullUrl)}</loc>`)
    if (entry.lastmod) lines.push(`    <lastmod>${entry.lastmod}</lastmod>`)
    if (entry.changefreq) lines.push(`    <changefreq>${entry.changefreq}</changefreq>`)
    lines.push(`    <priority>${entry.priority}</priority>`)
    lines.push('  </url>')
  })

  lines.push('</urlset>')
  return lines.join('\n')
}

One detail: the <loc> values must be fully qualified URLs (scheme and host included), not relative paths. So we need to join baseUrl and the path ourselves, taking care of trailing and leading slashes.
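As a quick illustration, the slash handling in generateSitemap gives the same absolute URL whether or not baseUrl ends with a slash or the path starts with one:

const baseUrl = 'https://example.com/'
const loc = '/about'
const fullUrl = `${baseUrl.replace(/\/$/, '')}/${loc.replace(/^\//, '')}`
// fullUrl === 'https://example.com/about', same as with 'https://example.com' + 'about'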

XML Escaping: An Overlooked Gotcha#

URLs may contain XML special characters like & in query parameters. Writing them directly causes XML parsing failures.

Standard XML escape function:

function escapeXml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

Example:

const url = 'https://example.com/search?q=hello&sort=desc'
escapeXml(url)
// Output: 'https://example.com/search?q=hello&amp;sort=desc'

Without escaping, the XML parser treats & as the start of an entity reference and throws an error.

Priority and Changefreq: Do Search Engines Even Care?#

This confuses many people. Official docs call these fields “suggestions,” but testing shows Google largely ignores them.

Google’s John Mueller explicitly stated: priority is only for relative ranking within a site, and changefreq is merely a “suggestion.” Search engines rely more on their own crawlers to determine update frequency.

But that doesn’t mean these fields are useless. The right approach:

// Set reasonable priority by page type
const pagePriority = {
  homepage: 1.0,
  mainCategory: 0.8,
  subCategory: 0.6,
  articlePage: 0.5,
  tagPage: 0.3,
  searchPage: 0.1  // Search results shouldn't be in sitemap
}

// Set changefreq based on actual update patterns
const pageFrequency = {
  news: 'hourly',
  blog: 'weekly',
  product: 'monthly',
  archive: 'yearly'
}

The key: be honest. If you set every page’s priority to 1.0, you’re telling search engines nothing about which pages actually matter, and the signal just gets ignored.

Lastmod: This Field Actually Matters#

Unlike priority and changefreq, lastmod (last modified time) is taken seriously. It tells search engines when a page was updated, helping crawlers decide whether to re-crawl.

Format must be W3C Datetime:

// Standard format: YYYY-MM-DD
const lastmod = '2026-05-05'

// Can include time: YYYY-MM-DDThh:mm:ss+TZD
const lastmodWithTime = '2026-05-05T14:30:00+08:00'

In practice, you can get this from file modification time:

import fs from 'fs'

function getLastMod(filePath: string): string {
  const stats = fs.statSync(filePath)
  return stats.mtime.toISOString().split('T')[0]
}
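
For a static blog, this combines naturally with a directory scan. A rough sketch, assuming a hypothetical layout of one Markdown file per post under content/posts/ and URLs of the form /posts/<slug>:

import path from 'path'

// Map each Markdown post to a sitemap entry stamped with its real modification date
function postEntries(postsDir: string): UrlEntry[] {
  return fs.readdirSync(postsDir)
    .filter(file => file.endsWith('.md'))
    .map(file => ({
      loc: `/posts/${path.basename(file, '.md')}`,
      lastmod: getLastMod(path.join(postsDir, file)),
      changefreq: 'monthly',
      priority: 0.5
    }))
}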

Sitemap Index: Breaking the 50,000 URL Limit#

A single sitemap file is limited to 50,000 URLs and 50 MB uncompressed. Beyond that, use a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>

Generating chunked sitemaps in code:

function chunkSitemap(entries: UrlEntry[], baseUrl: string, chunkSize = 50000): string[] {
  // Split the entries into groups of at most chunkSize
  const chunks: UrlEntry[][] = []
  for (let i = 0; i < entries.length; i += chunkSize) {
    chunks.push(entries.slice(i, i + chunkSize))
  }
  // Write one sitemap file per chunk and return the filenames
  return chunks.map((chunk, index) => {
    const filename = `sitemap-${index}.xml`
    fs.writeFileSync(filename, generateSitemap(chunk, baseUrl))
    return filename
  })
}
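
The chunk files then need to be listed in an index. A minimal sketch reusing the same string-building approach as generateSitemap (writing the result to sitemap-index.xml is just an assumption; name it whatever your site expects):

function generateSitemapIndex(filenames: string[], baseUrl: string): string {
  const today = new Date().toISOString().split('T')[0]
  const root = baseUrl.replace(/\/$/, '')
  const lines: string[] = []
  lines.push('<?xml version="1.0" encoding="UTF-8"?>')
  lines.push('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
  filenames.forEach(filename => {
    lines.push('  <sitemap>')
    lines.push(`    <loc>${escapeXml(`${root}/${filename}`)}</loc>`)
    lines.push(`    <lastmod>${today}</lastmod>`)
    lines.push('  </sitemap>')
  })
  lines.push('</sitemapindex>')
  return lines.join('\n')
}

// Usage: write the chunks, then the index that points at them
const files = chunkSitemap(allEntries, 'https://example.com')  // allEntries: UrlEntry[] built elsewhere
fs.writeFileSync('sitemap-index.xml', generateSitemapIndex(files, 'https://example.com'))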

robots.txt Configuration#

After generating a sitemap, declare it in robots.txt:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

This tells search engines where to find the sitemap. For large sites, declare multiple:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-images.xml
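
If a build script is already writing the sitemap files, it can emit robots.txt in the same pass. A small sketch, assuming the allow-everything policy from the example above:

// Emit a robots.txt that allows all crawling and lists each sitemap file
function generateRobotsTxt(sitemapFiles: string[], baseUrl: string): string {
  const root = baseUrl.replace(/\/$/, '')
  const sitemapLines = sitemapFiles.map(file => `Sitemap: ${root}/${file}`)
  return ['User-agent: *', 'Allow: /', '', ...sitemapLines].join('\n')
}

fs.writeFileSync('robots.txt', generateRobotsTxt(
  ['sitemap-pages.xml', 'sitemap-posts.xml', 'sitemap-images.xml'],
  'https://example.com'
))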

Bulk URL Import Implementation#

Manually entering URLs one by one is tedious. Bulk import via text parsing:

function parseBulkUrls(input: string): UrlEntry[] {
  return input
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(url => ({
      loc: url,
      lastmod: new Date().toISOString().split('T')[0],
      changefreq: 'monthly',
      priority: 0.5
    }))
}
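
For example, pasting a few lines produces one entry per URL with the defaults hard-coded above (today's date, monthly, 0.5):

const bulkEntries = parseBulkUrls(`
https://example.com/
https://example.com/about
https://example.com/posts/hello-world
`)
// bulkEntries.length === 3; empty lines are filtered out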

Going further, automatically crawl all site links:

import * as cheerio from 'cheerio'

async function crawlSitemap(baseUrl: string): Promise<string[]> {
  const visited = new Set<string>()
  const queue = [baseUrl]
  const urls: string[] = []

  while (queue.length > 0 && urls.length < 50000) {
    const url = queue.shift()!
    if (visited.has(url)) continue
    visited.add(url)

    let html: string
    try {
      html = await fetch(url).then(r => r.text())
    } catch {
      continue // skip pages that fail to load
    }
    const $ = cheerio.load(html)

    $('a[href]').each((_, el) => {
      const href = $(el).attr('href')
      if (!href) return
      try {
        // Resolve relative links against the current page and drop fragments
        const resolved = new URL(href, url)
        resolved.hash = ''
        if (resolved.href.startsWith(baseUrl) && !visited.has(resolved.href)) {
          queue.push(resolved.href)
        }
      } catch {
        // Ignore malformed hrefs
      }
    })

    urls.push(url)
  }

  return urls
}

This simple crawler auto-discovers internal links to build a complete sitemap.
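
Putting the pieces together, the crawler output can be mapped into entries and handed to generateSitemap. A rough end-to-end sketch; the changefreq and priority defaults here are placeholders you'd tune per page type as discussed earlier:

async function buildSitemap(baseUrl: string): Promise<void> {
  const discovered = await crawlSitemap(baseUrl)
  const entries: UrlEntry[] = discovered.map(loc => ({
    loc,
    lastmod: new Date().toISOString().split('T')[0],
    changefreq: 'weekly',  // placeholder default
    priority: 0.5          // placeholder default
  }))
  fs.writeFileSync('sitemap.xml', generateSitemap(entries, baseUrl))
}

buildSitemap('https://example.com')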

SEO Recommendations#

Based on experience:

  1. Update regularly: Content sites should update sitemap daily
  2. Include only canonical URLs: No duplicate pages that differ only in query parameters (see the filtering sketch after this list)
  3. Exclude low-value pages: Search results, login pages, and 404s shouldn’t appear
  4. Validate format: Check with an XML validator before submitting
  5. Submit to search consoles: Both Google Search Console and Bing Webmaster Tools support manual submission
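
For points 2 and 3, a simple pre-filter over the entries works well. A minimal sketch; the excluded paths and tracking parameters listed here are just examples, and it assumes entry.loc is already an absolute URL:

const EXCLUDED_PATHS = ['/search', '/login', '/404']  // example values, adjust per site
const TRACKING_PARAMS = ['utm_source', 'utm_medium', 'utm_campaign', 'ref']

function cleanEntries(entries: UrlEntry[]): UrlEntry[] {
  const seen = new Set<string>()
  const cleaned: UrlEntry[] = []
  for (const entry of entries) {
    const url = new URL(entry.loc)  // assumes absolute URLs
    // Drop low-value pages outright
    if (EXCLUDED_PATHS.some(prefix => url.pathname.startsWith(prefix))) continue
    // Strip tracking parameters so variants collapse to one canonical URL
    TRACKING_PARAMS.forEach(param => url.searchParams.delete(param))
    const canonical = url.href
    if (seen.has(canonical)) continue
    seen.add(canonical)
    cleaned.push({ ...entry, loc: canonical })
  }
  return cleaned
}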

Tool Recommendation#

If you don’t want to write code, try: Sitemap Generator

Supports bulk URL import, custom priority and changefreq, one-click XML download. Sufficient for small to medium sites.


Related: Robots.txt Generator | Meta Tag Generator