Understanding robots.txt from Scratch: The Gatekeeper for Search Engine Crawlers#

Last week, while investigating indexing issues on our website, I discovered that Google had indexed our /api/ endpoints and /admin/ login pages. This was embarrassing—sensitive paths shouldn’t be crawled. The root cause? We didn’t have a robots.txt file configured.

This made me realize that many developers only have a superficial understanding of robots.txt. Today, let’s dive into this seemingly simple file that actually hides many nuances.

What Exactly is robots.txt?#

Simply put, robots.txt is a text file placed in your website’s root directory that tells search engine crawlers: “which pages you can crawl, and which you should avoid.”

It acts like an access control system, but with a critical caveat: it’s a gentleman’s agreement, not an enforced constraint. Mainstream search engines (Google, Bing, Baidu) will respect it, but malicious crawlers won’t care.

The basic structure looks like this:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/

Sitemap: https://example.com/sitemap.xml
  • User-agent: Which crawler the following rules apply to; * means all crawlers
  • Disallow: Path prefixes that crawlers should not fetch
  • Allow: Exceptions that re-permit paths under a broader Disallow
  • Sitemap: Tells crawlers where your sitemap is located

The Priority Trap in Rule Matching#

This is where most people stumble. Consider this example:

User-agent: *
Disallow: /
Allow: /public/

You might think this means “disallow everything, but allow /public/”. Whether that’s actually what you get depends on the crawler.

Modern crawlers that follow RFC 9309 (Google, Bing) resolve conflicts by longest path match: the more specific rule wins regardless of order, so /public/ (8 characters) beats / (1 character) and /public/ remains crawlable. But older and simpler crawlers follow the original 1994 convention of applying the first rule that matches, and since Disallow: / matches every path, they never reach the Allow line and block the entire site.

The defensive approach, which works under both semantics, is to list the specific Allow rules first:

User-agent: *
Allow: /public/
Disallow: /

Or with multiple exceptions:

User-agent: *
Allow: /public/
Allow: /static/
Disallow: /

Here the Allow rules are both more specific (so they win under longest-match) and listed first (so they win under first-match). The sketch below shows how longest-match resolution works.
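
To make longest-match resolution concrete, here is a minimal sketch of how an RFC 9309-style matcher resolves a path against a rule list (wildcards omitted for brevity; the Rule shape mirrors the generator code later in this post):

interface Rule {
  type: 'Allow' | 'Disallow'
  path: string
}

// Longest matching path wins, regardless of file order; ties go to Allow.
function isAllowed(urlPath: string, rules: Rule[]): boolean {
  let best: Rule | null = null
  for (const rule of rules) {
    if (!urlPath.startsWith(rule.path)) continue
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === 'Allow')
    ) {
      best = rule
    }
  }
  // No matching rule means the path is crawlable by default.
  return best === null || best.type === 'Allow'
}

// Same answer in either order under longest-match:
isAllowed('/public/page.html', [
  { type: 'Disallow', path: '/' },
  { type: 'Allow', path: '/public/' }
]) // true: '/public/' (8 chars) beats '/' (1 char)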

Wildcards and Regular Expression Misconceptions#

robots.txt supports two wildcards: * and $.

  • * matches any sequence of characters
  • $ matches the end of a path

For example:

User-agent: *
Disallow: /*.pdf$
Disallow: /temp/
Disallow: /search?q=*

This will disallow:

  • All URLs whose path ends with .pdf
  • Everything under /temp/
  • Search result pages (/search?q=anything; the trailing * here is technically redundant, since every rule is already a prefix match)

But note: this is not full regular expression support. You can’t write Disallow: /\d+/.
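
Under the hood, these two wildcards translate straightforwardly into a real regular expression, which is roughly how parsers implement them. A sketch (every other character is treated literally):

// Convert a robots.txt path pattern to a RegExp:
// '*' becomes '.*', a trailing '$' anchors the end, everything else is literal.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$')
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
    .join('.*')
  return new RegExp('^' + body + (anchored ? '$' : ''))
}

patternToRegExp('/*.pdf$').test('/files/report.pdf')     // true
patternToRegExp('/*.pdf$').test('/files/report.pdf?v=2') // false: '$' anchors the end
patternToRegExp('/temp/').test('/temp/cache.html')       // true: plain prefix match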

Multi User-agent Strategies#

In real projects, you might want different rules for different crawlers. For instance:

User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: Bingbot
Allow: /
Crawl-delay: 5

User-agent: *
Disallow: /admin/
Disallow: /api/

The key here: a crawler uses only the single block whose User-agent most specifically matches its name, not the union of every block that could apply. Googlebot uses the Googlebot block, Bingbot uses the Bingbot block, and everything else falls through to the * block. Once Googlebot has matched its own block, the rules under * no longer apply to it.

Spec-compliant crawlers select by specificity rather than file order, but placing User-agent: * at the end as a fallback is still a good convention, both for readability and for naive parsers that do scan top to bottom.
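
Here is a minimal sketch of that specificity-based group selection (substring matching on the crawler name is a simplification; real parsers compare product tokens):

interface Group {
  userAgent: string
  rules: string[]
}

// Pick the group whose User-agent most specifically matches the crawler:
// a longer name match beats a shorter one, and '*' matches anyone last.
function selectGroup(crawler: string, groups: Group[]): Group | undefined {
  const name = crawler.toLowerCase()
  let best: Group | undefined
  let bestLen = -1
  for (const group of groups) {
    const ua = group.userAgent.toLowerCase()
    const len = ua === '*' ? 0 : name.includes(ua) ? ua.length : -1
    if (len > bestLen) {
      best = group
      bestLen = len
    }
  }
  return best
}

const groups: Group[] = [
  { userAgent: 'Googlebot', rules: ['Allow: /'] },
  { userAgent: '*', rules: ['Disallow: /admin/', 'Disallow: /api/'] }
]
selectGroup('Googlebot', groups)?.userAgent   // 'Googlebot'
selectGroup('DuckDuckBot', groups)?.userAgent // '*'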

The Crawl-delay Controversy#

Crawl-delay asks crawlers to wait a given number of seconds between requests. But this field was never part of the standard protocol, and different search engines handle it differently:

  • Google: Completely ignores it; uses “Crawl rate” settings in Search Console
  • Bing: Supports it, but minimum value is 1 second
  • Baidu: Supports it, but don’t set it too high (affects indexing speed)
  • Yandex: Fully supports it

If you set Crawl-delay: 10 but Google still crawls aggressively, don’t be surprised—it doesn’t recognize this directive.
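
If you write crawlers yourself, honoring Crawl-delay takes only a few lines. A minimal sketch (Node 18+ for the global fetch; in practice the delay would come from the parsed robots.txt, and the bot name is a placeholder):

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms))

// Fetch a list of URLs, waiting crawlDelaySeconds between requests.
async function politeCrawl(urls: string[], crawlDelaySeconds: number): Promise<void> {
  for (const url of urls) {
    const res = await fetch(url, { headers: { 'User-Agent': 'MyBot/1.0' } })
    console.log(`${res.status} ${url}`)
    await sleep(crawlDelaySeconds * 1000) // be polite before the next request
  }
}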

Implementation: Generating robots.txt Programmatically#

Writing robots.txt manually is error-prone. I built an online generator tool with this core logic:

interface Rule {
  type: 'Allow' | 'Disallow'
  path: string
}

interface UserAgentBlock {
  userAgent: string
  rules: Rule[]
}

function generateRobotsTxt(
  blocks: UserAgentBlock[],
  crawlDelay?: string,
  sitemapUrl?: string,
  host?: string
): string {
  const lines: string[] = []

  for (const block of blocks) {
    if (!block.userAgent.trim()) continue // skip blocks with no User-agent
    lines.push(`User-agent: ${block.userAgent}`)

    for (const rule of block.rules) {
      if (!rule.path.trim()) continue // skip rules with an empty path
      lines.push(`${rule.type}: ${rule.path}`)
    }
    lines.push('') // empty line separates blocks
  }

  // Appended globally for simplicity. Strictly speaking, Crawl-delay belongs
  // inside a User-agent block; Sitemap is group-independent, and Host is a
  // legacy Yandex-specific directive that most engines ignore.
  if (crawlDelay?.trim()) {
    lines.push(`Crawl-delay: ${crawlDelay}`)
  }
  if (sitemapUrl?.trim()) {
    lines.push(`Sitemap: ${sitemapUrl}`)
  }
  if (host?.trim()) {
    lines.push(`Host: ${host}`)
  }

  return lines.join('\n').trim()
}

Usage example:

const blocks: UserAgentBlock[] = [
  {
    userAgent: '*',
    rules: [
      { type: 'Disallow', path: '/admin/' },
      { type: 'Disallow', path: '/api/' },
      { type: 'Allow', path: '/public/' }
    ]
  },
  {
    userAgent: 'Googlebot',
    rules: [
      { type: 'Allow', path: '/' }
    ]
  }
]

const robotsTxt = generateRobotsTxt(
  blocks,
  '10',
  'https://example.com/sitemap.xml',
  'example.com'
)

Output:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/

User-agent: Googlebot
Allow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
Host: example.com

Common Framework Configurations#

WordPress#

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/

Crawl-delay: 5

Next.js#

User-agent: *
Disallow: /_next/
Disallow: /api/
Allow: /_next/static/
Allow: /api/sitemap

Note: Next.js’s /_next/ directory contains build output, but /_next/static/ holds the hashed CSS and JS bundles that search engines need in order to render your pages properly, so it should stay allowed.
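
As an aside, if you’re on Next.js 13.3+ with the App Router, you don’t have to hand-maintain a static file at all: an app/robots.ts route can generate it. A sketch using the framework’s MetadataRoute API (the URL and paths are just this post’s examples; adjust for your project):

// app/robots.ts — Next.js serves this as /robots.txt
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      disallow: ['/_next/', '/api/'],
      allow: ['/_next/static/', '/api/sitemap'],
    },
    sitemap: 'https://example.com/sitemap.xml',
  }
}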

Verifying Your robots.txt#

Writing isn’t the end—you need to verify. Google Search Console provides a robots.txt testing tool:

  1. Log into Search Console
  2. Select your website property
  3. Find “robots.txt Tester” in the left menu
  4. Test URLs to see if they’re allowed

You can also use curl for quick checks:

curl -A "Googlebot" https://example.com/admin/

If it returns 403 Forbidden, your server is rejecting crawler user agents at the server level, which is a stronger guarantee than robots.txt (note that this checks your server configuration, not robots.txt compliance; robots.txt itself never causes an HTTP error).
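
Beyond manual spot checks, a few lines of script can catch regressions in the deployed file. A sketch (Node 18+ ESM, for top-level await and global fetch; the expected rules are just this post’s examples):

// Fetch the live robots.txt and assert the rules we expect are present.
const res = await fetch('https://example.com/robots.txt')
if (!res.ok) throw new Error(`robots.txt returned HTTP ${res.status}`)

const body = await res.text()
for (const line of ['Disallow: /admin/', 'Disallow: /api/']) {
  console.log(body.includes(line) ? `ok: ${line}` : `MISSING: ${line}`)
}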

Limitations of robots.txt#

Finally, some often-overlooked points:

  1. Not immediate: Search engines cache robots.txt (Google for up to about 24 hours, others longer), so updates don’t take effect instantly
  2. Doesn’t prevent indexing: Even if crawling is disallowed, a URL discovered through external links may still be indexed and shown with just its URL and anchor text. To keep a page out of the index, use a noindex meta tag or X-Robots-Tag header instead, and remember the crawler must be allowed to fetch the page to see that directive
  3. Not secure: robots.txt is a public file—anyone can see which paths you’re blocking. Don’t put sensitive paths there

For truly sensitive content, use server-side access controls (like HTTP Basic Auth, IP whitelists) instead of relying on robots.txt.
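
For example, a bare-bones HTTP Basic Auth gate in Node looks like this (a sketch with hardcoded credentials purely for illustration; in a real app, use your framework’s auth layer and keep secrets out of source):

import { createServer } from 'node:http'

// Protect /admin/ with HTTP Basic Auth instead of relying on robots.txt.
const EXPECTED = 'Basic ' + Buffer.from('admin:s3cret').toString('base64') // demo credentials only

createServer((req, res) => {
  if (req.url?.startsWith('/admin/') && req.headers.authorization !== EXPECTED) {
    res.writeHead(401, { 'WWW-Authenticate': 'Basic realm="admin"' })
    res.end('Unauthorized')
    return
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('Hello')
}).listen(3000)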

Summary#

robots.txt seems simple but has many nuances:

  • Modern crawlers resolve conflicting rules by longest path match; order still matters for older parsers, so list specific Allow rules first
  • Wildcard support is limited; it’s not regex
  • Crawl-delay isn’t standard; Google ignores it
  • With multiple User-agent blocks, a crawler uses only the single most specific matching block
  • Verification is more important than writing

If you don’t want to write it manually, use this online tool: Robots.txt Generator with visual configuration and real-time preview.


Related tools: Sitemap Generator | Meta Tag Generator