Understanding robots.txt from Scratch: The Gatekeeper for Search Engine Crawlers
Last week, while investigating indexing issues on our website, I discovered that Google had indexed our /api/ endpoints and /admin/ login pages. This was embarrassing—sensitive paths shouldn’t be crawled. The root cause? We didn’t have a robots.txt file configured.
This made me realize that many developers only have a superficial understanding of robots.txt. Today, let’s dive into this seemingly simple file that actually hides many nuances.
What Exactly is robots.txt?#
Simply put, robots.txt is a text file placed in your website’s root directory that tells search engine crawlers: “which pages you can crawl, and which you should avoid.”
It acts like an access control list, but with a critical caveat: it’s a gentleman’s agreement, not an enforced restriction. Mainstream search engines (Google, Bing, Baidu) will respect it, but malicious crawlers won’t care.
The basic structure looks like this:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
- User-agent: Specifies which crawler the rules apply to; * means all crawlers
- Disallow: Paths that are forbidden
- Allow: Paths that are allowed
- Sitemap: Tells crawlers where your sitemap is located
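To make the structure concrete, here’s a minimal sketch of a parser that turns a robots.txt file into structured data. It is not spec-complete (wildcards and Crawl-delay are ignored), just enough to show how the directives relate to each other:

interface RuleGroup {
  userAgents: string[]
  rules: { type: 'allow' | 'disallow'; path: string }[]
}

// Minimal parser sketch: groups Allow/Disallow rules under the preceding
// User-agent line(s) and collects Sitemap entries separately.
function parseRobotsTxt(content: string): { groups: RuleGroup[]; sitemaps: string[] } {
  const groups: RuleGroup[] = []
  const sitemaps: string[] = []
  let current: RuleGroup | null = null
  let lastWasUserAgent = false

  for (const rawLine of content.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments and whitespace
    if (!line) continue

    const colon = line.indexOf(':')
    if (colon === -1) continue
    const field = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      if (!lastWasUserAgent || !current) {
        current = { userAgents: [], rules: [] }
        groups.push(current)
      }
      current.userAgents.push(value)
      lastWasUserAgent = true
    } else {
      if (field === 'allow' || field === 'disallow') {
        if (current) current.rules.push({ type: field, path: value })
      } else if (field === 'sitemap') {
        sitemaps.push(value) // Sitemap applies to the whole file, not one group
      }
      lastWasUserAgent = false
    }
  }

  return { groups, sitemaps }
}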
The Priority Trap in Rule Matching#
This is where most people stumble. Consider this example:
User-agent: *
Disallow: /
Allow: /public/
You might think this means “disallow everything, but allow /public/”. Whether that’s what actually happens depends on the crawler.
Modern crawlers that follow RFC 9309 (Google, Bing) use longest path matching: when rules conflict, the most specific (longest) matching rule wins, and on a tie Allow wins. Here Allow: /public/ is longer than Disallow: /, so /public/ stays crawlable. Older and simpler crawlers, however, evaluate rules in file order with first match winning; for them Disallow: / has already matched every path, and the Allow rule never gets a chance.
The safe way to write it works for both styles: put the more specific Allow rules first, then the catch-all Disallow:
User-agent: *
Allow: /public/
Allow: /static/
Disallow: /
Longest-match crawlers pick the Allow rules because they are more specific; order-based crawlers hit them before the catch-all Disallow: /.
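You can see the longest-match rule in a few lines of code. This is a sketch of the decision a Google-style crawler makes for a single path (wildcards ignored for brevity; the rule shape mirrors the examples above):

type PathRule = { type: 'Allow' | 'Disallow'; path: string }

// Longest-match sketch: the most specific (longest) matching rule wins;
// on a tie between Allow and Disallow, Allow wins.
function isAllowed(rules: PathRule[], path: string): boolean {
  let best: PathRule | null = null
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === 'Allow')
    ) {
      best = rule
    }
  }
  return best === null || best.type === 'Allow' // no matching rule means allowed
}

const rules: PathRule[] = [
  { type: 'Disallow', path: '/' },
  { type: 'Allow', path: '/public/' }
]
console.log(isAllowed(rules, '/public/page.html'))  // true: Allow: /public/ is longer
console.log(isAllowed(rules, '/private/page.html')) // false: only Disallow: / matches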
Wildcards and Regular Expression Misconceptions#
robots.txt supports two wildcards: * and $.
- * matches any sequence of characters
- $ matches the end of the path
For example:
User-agent: *
Disallow: /*.pdf$
Disallow: /temp/
Disallow: /search?q=*
This will disallow:
- All files ending with .pdf
- Everything under /temp/
- Search result pages (/search?q=anything)
But note: This doesn’t support full regular expressions. You can’t write Disallow: /\d+/.
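Under the hood, these two wildcards map onto a tiny subset of regular expressions. Here’s a sketch of the translation, treating everything except * and a trailing $ as literal text:

// '*' becomes '.*', a trailing '$' becomes an end anchor, all else is escaped.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$')
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
    .join('.*')
  return new RegExp('^' + body + (anchored ? '$' : ''))
}

console.log(patternToRegExp('/*.pdf$').test('/docs/report.pdf'))     // true
console.log(patternToRegExp('/*.pdf$').test('/docs/report.pdf?v=2')) // false: $ anchors the end
console.log(patternToRegExp('/temp/').test('/temp/cache/file'))      // true: prefix match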
Multi User-agent Strategies#
In real projects, you might want different rules for different crawlers. For instance:
User-agent: Googlebot
Allow: /
Crawl-delay: 1
User-agent: Bingbot
Allow: /
Crawl-delay: 5
User-agent: *
Disallow: /admin/
Disallow: /api/
The key here: a crawler obeys only one User-agent block, the most specific one that matches its own name, and it does not merge rules across blocks. Googlebot follows only the Googlebot block, Bingbot only the Bingbot block, and every other crawler falls back to the * block.
Keeping User-agent: * at the end makes its fallback role obvious, although compliant crawlers pick their block by name rather than by position in the file.
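A sketch of how a compliant crawler picks its block: it takes the most specific User-agent value that matches its own name and only falls back to * when nothing else matches. (The substring matching here is a simplification of how real crawlers tokenize names.)

interface Group {
  userAgent: string
  rules: string[] // raw Allow/Disallow lines, kept opaque for this sketch
}

// A crawler obeys exactly one group: the one whose User-agent token is the
// longest match for its own name, otherwise the '*' group.
function selectGroup(groups: Group[], crawlerName: string): Group | undefined {
  const name = crawlerName.toLowerCase()
  let best: Group | undefined
  for (const group of groups) {
    const token = group.userAgent.toLowerCase()
    if (token !== '*' && !name.includes(token)) continue
    const moreSpecific =
      best === undefined ||
      (best.userAgent === '*' && token !== '*') ||
      (token !== '*' && token.length > best.userAgent.length)
    if (moreSpecific) best = group
  }
  return best
}

const groups: Group[] = [
  { userAgent: 'Googlebot', rules: ['Allow: /'] },
  { userAgent: 'Bingbot', rules: ['Allow: /'] },
  { userAgent: '*', rules: ['Disallow: /admin/', 'Disallow: /api/'] }
]
console.log(selectGroup(groups, 'Googlebot-Image')?.userAgent) // 'Googlebot'
console.log(selectGroup(groups, 'DuckDuckBot')?.userAgent)     // '*'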
The Crawl-delay Controversy#
Crawl-delay controls how frequently crawlers access your site, measured in seconds. But this field isn’t part of the standard protocol, and different search engines handle it differently:
- Google: Completely ignores it; uses “Crawl rate” settings in Search Console
- Bing: Supports it, but minimum value is 1 second
- Baidu: Supports it, but don’t set it too high (affects indexing speed)
- Yandex: Fully supports it
If you set Crawl-delay: 10 but Google still crawls aggressively, don’t be surprised—it doesn’t recognize this directive.
Implementation: Generating robots.txt Programmatically#
Writing robots.txt manually is error-prone. I built an online generator tool with this core logic:
interface Rule {
  type: 'Allow' | 'Disallow'
  path: string
}

interface UserAgentBlock {
  userAgent: string
  rules: Rule[]
}

function generateRobotsTxt(
  blocks: UserAgentBlock[],
  crawlDelay?: string,
  sitemapUrl?: string,
  host?: string
): string {
  const lines: string[] = []

  for (const block of blocks) {
    if (!block.userAgent.trim()) continue
    lines.push(`User-agent: ${block.userAgent}`)
    for (const rule of block.rules) {
      if (!rule.path.trim()) continue
      lines.push(`${rule.type}: ${rule.path}`)
    }
    lines.push('') // Empty line separates blocks
  }

  if (crawlDelay?.trim()) {
    lines.push(`Crawl-delay: ${crawlDelay}`)
  }
  if (sitemapUrl?.trim()) {
    lines.push(`Sitemap: ${sitemapUrl}`)
  }
  if (host?.trim()) {
    lines.push(`Host: ${host}`)
  }

  return lines.join('\n').trim()
}
Usage example:
const blocks: UserAgentBlock[] = [
  {
    userAgent: '*',
    rules: [
      { type: 'Disallow', path: '/admin/' },
      { type: 'Disallow', path: '/api/' },
      { type: 'Allow', path: '/public/' }
    ]
  },
  {
    userAgent: 'Googlebot',
    rules: [
      { type: 'Allow', path: '/' }
    ]
  }
]

const robotsTxt = generateRobotsTxt(
  blocks,
  '10',
  'https://example.com/sitemap.xml',
  'example.com'
)
Output:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/

User-agent: Googlebot
Allow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
Host: example.com
Common Framework Configurations#
WordPress#
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Crawl-delay: 5
Next.js#
User-agent: *
Disallow: /_next/
Disallow: /api/
Allow: /_next/static/
Allow: /api/sitemap
Note: Next.js’s /_next/ directory contains compiled resources, but /_next/static/ contains static assets (CSS, JS) that should be allowed for proper page rendering by search engines.
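If you’re on Next.js 13+ with the App Router, you can also generate the file from code instead of committing a static one. A minimal sketch of app/robots.ts using the built-in metadata route (paths taken from the example above; the sitemap URL is a placeholder):

// app/robots.ts: Next.js serves the returned object as /robots.txt
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: ['/_next/static/', '/api/sitemap'],
        disallow: ['/_next/', '/api/']
      }
    ],
    sitemap: 'https://example.com/sitemap.xml'
  }
}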
Verifying Your robots.txt#
Writing isn’t the end—you need to verify. Google Search Console provides a robots.txt testing tool:
- Log into Search Console
- Select your website property
- Find “robots.txt Tester” in the left menu
- Test URLs to see if they’re allowed
You can also use curl for quick checks:
curl -A "Googlebot" https://example.com/admin/
Note that this doesn’t test robots.txt itself; it tells you how the server treats a crawler user agent. If it returns 403 Forbidden, your server is blocking crawlers at the server level, an extra layer of protection on top of robots.txt.
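You can also script the same check. A sketch using Node 18+’s built-in fetch, run as an ES module (the domain and paths are placeholders):

// Check that robots.txt is reachable, and see how the server treats a
// crawler user agent on a sensitive path.
const base = 'https://example.com'

const robots = await fetch(`${base}/robots.txt`)
console.log('robots.txt status:', robots.status) // expect 200, served as text/plain

const asGooglebot = await fetch(`${base}/admin/`, {
  headers: { 'User-Agent': 'Googlebot' }
})
console.log('/admin/ as Googlebot:', asGooglebot.status) // 403 means server-level blocking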
Limitations of robots.txt#
Finally, some often-overlooked points:
- Not immediate: Search engines cache robots.txt (Google for up to roughly 24 hours); changes can take a while to be picked up everywhere
- Doesn’t prevent indexing: Even if crawling is disallowed, if other sites link to your pages, search engines may still index them via link text
- Not secure: robots.txt is a public file—anyone can see which paths you’re blocking. Don’t put sensitive paths there
For truly sensitive content, use server-side access controls (like HTTP Basic Auth, IP whitelists) instead of relying on robots.txt.
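For example, a response header or real authentication does what robots.txt can’t. A minimal sketch with Node’s built-in http module (paths and credentials are placeholders, not a production setup):

import { createServer } from 'node:http'

const server = createServer((req, res) => {
  const url = req.url ?? '/'

  if (url.startsWith('/internal/')) {
    // Keep crawlable-but-unlisted pages out of the index via a response header
    res.setHeader('X-Robots-Tag', 'noindex, nofollow')
  }

  if (url.startsWith('/admin/')) {
    // Truly sensitive paths get real access control, not a robots.txt entry
    const expected = 'Basic ' + Buffer.from('admin:change-me').toString('base64')
    if (req.headers.authorization !== expected) {
      res.writeHead(401, { 'WWW-Authenticate': 'Basic realm="admin"' })
      res.end('Authentication required')
      return
    }
  }

  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('ok')
})

server.listen(3000)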
Summary#
robots.txt seems simple but has many nuances:
- Rule matching is longest-path under the modern spec, but some crawlers go by file order, so list specific Allow rules first
- Wildcard support is limited; it’s not regex
- Crawl-delay isn’t standard; Google ignores it
- With multiple User-agent blocks, a crawler obeys only the single most specific block that matches it
- Verification is more important than writing
If you don’t want to write it manually, use this online tool: Robots.txt Generator with visual configuration and real-time preview.
Related tools: Sitemap Generator | Meta Tag Generator