Building a Sitemap Generator: XML Structure and SEO Best Practices#
I was doing SEO work on my blog recently when Google Search Console kept complaining about my sitemap. When I opened sitemap.xml, the format was fine, but the configuration was a mess: every page had priority 1.0 and changefreq was always “always”, which basically tells search engines “this whole site is spam.”
So I built my own sitemap generator and dug into the technical details and SEO best practices.
The XML Structure#
sitemap.xml is essentially an XML file following the sitemaps.org protocol. The minimal structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
Generating this in TypeScript is straightforward, essentially string concatenation:
interface UrlEntry {
  loc: string
  lastmod?: string
  changefreq?: string
  priority?: number
}

function generateSitemap(entries: UrlEntry[], baseUrl: string): string {
  const lines: string[] = []
  lines.push('<?xml version="1.0" encoding="UTF-8"?>')
  lines.push('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
  entries.forEach(entry => {
    // Normalize trailing/leading slashes when joining base URL and path
    const fullUrl = entry.loc.startsWith('http')
      ? entry.loc
      : `${baseUrl.replace(/\/$/, '')}/${entry.loc.replace(/^\//, '')}`
    lines.push('  <url>')
    lines.push(`    <loc>${escapeXml(fullUrl)}</loc>`)
    if (entry.lastmod) lines.push(`    <lastmod>${entry.lastmod}</lastmod>`)
    if (entry.changefreq) lines.push(`    <changefreq>${entry.changefreq}</changefreq>`)
    if (entry.priority != null) lines.push(`    <priority>${entry.priority}</priority>`)
    lines.push('  </url>')
  })
  lines.push('</urlset>')
  return lines.join('\n')
}
One detail: URLs in a sitemap must be absolute, not relative. So we need to join baseUrl with each path, normalizing trailing and leading slashes.
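For readability, that slash handling can be pulled into a small helper. This is just the inline expression above given a name (joinUrl is my own label, not part of any API):

function joinUrl(baseUrl: string, path: string): string {
  // Strip a trailing slash from the base and a leading slash from the path
  return `${baseUrl.replace(/\/$/, '')}/${path.replace(/^\//, '')}`
}

joinUrl('https://example.com/', '/blog/post-1')
// => 'https://example.com/blog/post-1'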
XML Escaping: An Overlooked Gotcha#
URLs may contain XML special characters like & in query parameters. Writing them directly causes XML parsing failures.
Standard XML escape function:
function escapeXml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}
Example:
const url = 'https://example.com/search?q=hello&sort=desc'
escapeXml(url)
// Output: 'https://example.com/search?q=hello&amp;sort=desc'
Without escaping, the XML parser treats & as the start of an entity reference and throws an error.
Priority and Changefreq: Do Search Engines Even Care?#
This confuses many people. Official docs call these fields “suggestions,” but testing shows Google largely ignores them.
Google’s John Mueller explicitly stated: priority is only for relative ranking within a site, and changefreq is merely a “suggestion.” Search engines rely more on their own crawlers to determine update frequency.
But that doesn’t mean these fields are useless. The right approach:
// Set reasonable priority by page type
const pagePriority = {
  homepage: 1.0,
  mainCategory: 0.8,
  subCategory: 0.6,
  articlePage: 0.5,
  tagPage: 0.3,
  searchPage: 0.1 // Search results shouldn't be in the sitemap at all
}

// Set changefreq based on actual update patterns
const pageFrequency = {
  news: 'hourly',
  blog: 'weekly',
  product: 'monthly',
  archive: 'yearly'
}
The key: be honest. If you set every page’s priority to 1.0, search engines conclude your site has no important pages, actually lowering overall weight.
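As a sketch of how those maps might feed the generator (the makeEntry helper and its page types are illustrative, not from a real API):

// Illustrative: map a known page type to a sitemap entry
function makeEntry(loc: string, type: keyof typeof pagePriority): UrlEntry {
  return {
    loc,
    lastmod: new Date().toISOString().split('T')[0],
    changefreq: 'monthly', // or look up from pageFrequency by content type
    priority: pagePriority[type]
  }
}

makeEntry('/blog/sitemap-generator', 'articlePage')
// => { loc: '/blog/sitemap-generator', lastmod: '2026-...', changefreq: 'monthly', priority: 0.5 }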
Lastmod: This Field Actually Matters#
Unlike priority and changefreq, lastmod (last modified time) is taken seriously. It tells search engines when a page was updated, helping crawlers decide whether to re-crawl.
Format must be W3C Datetime:
// Standard format: YYYY-MM-DD
const lastmod = '2026-05-05'
// Can include time: YYYY-MM-DDThh:mm:ss+TZD
const lastmodWithTime = '2026-05-05T14:30:00+08:00'
In practice, you can get this from file modification time:
import fs from 'fs'

function getLastMod(filePath: string): string {
  const stats = fs.statSync(filePath)
  return stats.mtime.toISOString().split('T')[0]
}
Sitemap Index: Breaking the 50,000 URL Limit#
A single sitemap file can contain at most 50,000 URLs (and must stay under 50 MB uncompressed). Beyond that, use a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>
Generating chunked sitemaps in code:
function chunkSitemap(urls: string[], chunkSize = 50000): string[] {
const chunks: string[][] = []
for (let i = 0; i < urls.length; i += chunkSize) {
chunks.push(urls.slice(i, i + chunkSize))
}
return chunks.map((chunk, index) => {
const filename = `sitemap-${index}.xml`
fs.writeFileSync(filename, generateSitemap(chunk))
return filename
})
}
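The index file itself can be generated the same way. A minimal sketch (the helper name generateSitemapIndex is mine):

// Build the sitemap index that points at the chunk files
function generateSitemapIndex(filenames: string[], baseUrl: string): string {
  const today = new Date().toISOString().split('T')[0]
  const lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  ]
  for (const name of filenames) {
    lines.push('  <sitemap>')
    lines.push(`    <loc>${escapeXml(`${baseUrl.replace(/\/$/, '')}/${name}`)}</loc>`)
    lines.push(`    <lastmod>${today}</lastmod>`)
    lines.push('  </sitemap>')
  }
  lines.push('</sitemapindex>')
  return lines.join('\n')
}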
robots.txt Configuration#
After generating a sitemap, declare it in robots.txt:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This tells search engines where to find the sitemap. For large sites, declare multiple:
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-images.xml
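If your build script writes robots.txt anyway, appending these lines is trivial. A sketch, assuming the chunk filenames produced by chunkSitemap above:

// Append a Sitemap declaration for each generated chunk file
const sitemapLines = ['sitemap-0.xml', 'sitemap-1.xml']
  .map(name => `Sitemap: https://example.com/${name}`)
  .join('\n')
fs.appendFileSync('robots.txt', sitemapLines + '\n')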
Bulk URL Import Implementation#
Manually entering URLs one by one is tedious, so I added bulk import via simple text parsing:
function parseBulkUrls(input: string): UrlEntry[] {
  return input
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(url => ({
      loc: url,
      lastmod: new Date().toISOString().split('T')[0],
      changefreq: 'monthly',
      priority: 0.5
    }))
}
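Wiring it into the generator from earlier (the paths are just examples):

const entries = parseBulkUrls(`
  /about
  /blog/hello-world
  https://example.com/contact
`)
fs.writeFileSync('sitemap.xml', generateSitemap(entries, 'https://example.com'))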
Going further, you can automatically crawl the site's internal links:
import * as cheerio from 'cheerio'

async function crawlSitemap(baseUrl: string): Promise<string[]> {
  const visited = new Set<string>()
  const queue = [baseUrl]
  const urls: string[] = []
  while (queue.length > 0 && urls.length < 50000) {
    const url = queue.shift()!
    if (visited.has(url)) continue
    visited.add(url)
    const html = await fetch(url).then(r => r.text())
    const $ = cheerio.load(html)
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href')
      if (!href) return
      try {
        // Resolve relative links against the current page and drop fragments;
        // mailto:, javascript:, and external links fail the startsWith check
        const absolute = new URL(href, url)
        absolute.hash = ''
        if (absolute.href.startsWith(baseUrl) && !visited.has(absolute.href)) {
          queue.push(absolute.href)
        }
      } catch {
        // Skip hrefs that can't be parsed as URLs
      }
    })
    urls.push(url)
  }
  return urls
}
This simple crawler auto-discovers internal links to build a complete sitemap.
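End to end, the crawl output just needs to be mapped into entries for the generator above (a sketch, not production code):

// Crawl, convert the URL list to entries, and write the file
const urls = await crawlSitemap('https://example.com')
const entries: UrlEntry[] = urls.map(loc => ({
  loc,
  lastmod: new Date().toISOString().split('T')[0],
  priority: 0.5
}))
fs.writeFileSync('sitemap.xml', generateSitemap(entries, 'https://example.com'))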
SEO Recommendations#
Based on experience:
- Update regularly: Content sites should update sitemap daily
- Include only canonical URLs: No duplicate pages with query parameters
- Exclude low-value pages: Search results, login pages, 404s shouldn’t appear
- Validate format: Check with an XML validator before submitting (a quick programmatic check is sketched after this list)
- Submit to search consoles: Both Google Search Console and Bing Webmaster Tools support manual submission
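For the validation step, one option is a programmatic well-formedness check. This sketch assumes the fast-xml-parser package:

import { XMLValidator } from 'fast-xml-parser'

// Returns true for well-formed XML, or an error object describing the problem
const result = XMLValidator.validate(fs.readFileSync('sitemap.xml', 'utf8'))
if (result !== true) {
  console.error('Invalid sitemap XML:', result.err.msg)
}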
Tool Recommendation#
If you don’t want to write code, try: Sitemap Generator
It supports bulk URL import, custom priority and changefreq, and one-click XML download. Sufficient for small to medium sites.
Related: Robots.txt Generator | Meta Tag Generator