Understanding robots.txt from Scratch: The Gatekeeper for Search Engine Crawlers
Last week, while investigating indexing issues on our website, I discovered that Google had indexed our /api/ endpoints and /admin/ login pages. This was embarrassing—sensitive paths shouldn’t be crawled. The root cause? We didn’t have a robots.txt file configured.
This made me realize that many developers only have a superficial understanding of robots.txt. Today, let’s dive into this seemingly simple file that actually hides many nuances.
What Exactly is robots.txt?#
Simply put, robots.txt is a text file placed in your website’s root directory that tells search engine crawlers: “which pages you can crawl, and which you should avoid.”
It acts like an access control list, but with a critical caveat: it’s a gentleman’s agreement, not an enforced restriction. Mainstream search engines (Google, Bing, Baidu) will respect it, but malicious crawlers won’t care.
The basic structure looks like this:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
- User-agent: Specifies which crawler the rules apply to; * means all crawlers
- Disallow: Paths that are forbidden
- Allow: Paths that are allowed
- Sitemap: Tells crawlers where your sitemap is located
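To make the structure concrete, here’s a minimal sketch of a parser that turns a robots.txt file into structured data. It is not spec-complete (wildcards and Crawl-delay are ignored), just enough to show how the directives relate to each other:

interface RuleGroup {
  userAgents: string[]
  rules: { type: 'allow' | 'disallow'; path: string }[]
}

// Minimal parser sketch: groups Allow/Disallow rules under the preceding
// User-agent line(s) and collects Sitemap entries separately.
function parseRobotsTxt(content: string): { groups: RuleGroup[]; sitemaps: string[] } {
  const groups: RuleGroup[] = []
  const sitemaps: string[] = []
  let current: RuleGroup | null = null
  let lastWasUserAgent = false

  for (const rawLine of content.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments and whitespace
    if (!line) continue

    const colon = line.indexOf(':')
    if (colon === -1) continue
    const field = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      if (!lastWasUserAgent || !current) {
        current = { userAgents: [], rules: [] }
        groups.push(current)
      }
      current.userAgents.push(value)
      lastWasUserAgent = true
    } else {
      if (field === 'allow' || field === 'disallow') {
        if (current) current.rules.push({ type: field, path: value })
      } else if (field === 'sitemap') {
        sitemaps.push(value) // Sitemap applies to the whole file, not one group
      }
      lastWasUserAgent = false
    }
  }

  return { groups, sitemaps }
}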
The Priority Trap in Rule Matching#
This is where most people stumble. Consider this example:
User-agent: *
Disallow: /
Allow: /public/
You might think this means “disallow everything, but allow /public/”. Whether that’s what actually happens depends on the crawler.
Modern crawlers that follow RFC 9309 (Google, Bing) use longest path matching: when rules conflict, the most specific (longest) matching rule wins, and on a tie Allow wins. Here Allow: /public/ is longer than Disallow: /, so /public/ stays crawlable. Older and simpler crawlers, however, evaluate rules in file order with first match winning; for them Disallow: / has already matched every path, and the Allow rule never gets a chance.
The safe way to write it works for both styles: put the more specific Allow rules first, then the catch-all Disallow:
User-agent: *
Allow: /public/
Allow: /static/
Disallow: /
Longest-match crawlers pick the Allow rules because they are more specific; order-based crawlers hit them before the catch-all Disallow: /.
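You can see the longest-match rule in a few lines of code. This is a sketch of the decision a Google-style crawler makes for a single path (wildcards ignored for brevity; the rule shape mirrors the examples above):

type PathRule = { type: 'Allow' | 'Disallow'; path: string }

// Longest-match sketch: the most specific (longest) matching rule wins;
// on a tie between Allow and Disallow, Allow wins.
function isAllowed(rules: PathRule[], path: string): boolean {
  let best: PathRule | null = null
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === 'Allow')
    ) {
      best = rule
    }
  }
  return best === null || best.type === 'Allow' // no matching rule means allowed
}

const rules: PathRule[] = [
  { type: 'Disallow', path: '/' },
  { type: 'Allow', path: '/public/' }
]
console.log(isAllowed(rules, '/public/page.html'))  // true: Allow: /public/ is longer
console.log(isAllowed(rules, '/private/page.html')) // false: only Disallow: / matches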
Wildcards and Regular Expression Misconceptions#
robots.txt supports two wildcards: * and $.
- * matches any sequence of characters
- $ matches the end of the path
For example:
User-agent: *
Disallow: /*.pdf$
Disallow: /temp/
Disallow: /search?q=*
This will disallow:
- All files ending with .pdf
- Everything under /temp/
- Search result pages (/search?q=anything)
But note: This doesn’t support full regular expressions. You can’t write Disallow: /\d+/.
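Under the hood, these two wildcards map onto a tiny subset of regular expressions. Here’s a sketch of the translation, treating everything except * and a trailing $ as literal text:

// '*' becomes '.*', a trailing '$' becomes an end anchor, all else is escaped.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$')
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
    .join('.*')
  return new RegExp('^' + body + (anchored ? '$' : ''))
}

console.log(patternToRegExp('/*.pdf$').test('/docs/report.pdf'))     // true
console.log(patternToRegExp('/*.pdf$').test('/docs/report.pdf?v=2')) // false: $ anchors the end
console.log(patternToRegExp('/temp/').test('/temp/cache/file'))      // true: prefix match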
Multi User-agent Strategies#
In real projects, you might want different rules for different crawlers. For instance:
User-agent: Googlebot
Allow: /
Crawl-delay: 1
User-agent: Bingbot
Allow: /
Crawl-delay: 5
User-agent: *
Disallow: /admin/
Disallow: /api/
The key here: a crawler obeys only one User-agent block, the most specific one that matches its own name, and it does not merge rules across blocks. Googlebot follows only the Googlebot block, Bingbot only the Bingbot block, and every other crawler falls back to the * block.
Keeping User-agent: * at the end makes its fallback role obvious, although compliant crawlers pick their block by name rather than by position in the file.
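A sketch of how a compliant crawler picks its block: it takes the most specific User-agent value that matches its own name and only falls back to * when nothing else matches. (The substring matching here is a simplification of how real crawlers tokenize names.)

interface Group {
  userAgent: string
  rules: string[] // raw Allow/Disallow lines, kept opaque for this sketch
}

// A crawler obeys exactly one group: the one whose User-agent token is the
// longest match for its own name, otherwise the '*' group.
function selectGroup(groups: Group[], crawlerName: string): Group | undefined {
  const name = crawlerName.toLowerCase()
  let best: Group | undefined
  for (const group of groups) {
    const token = group.userAgent.toLowerCase()
    if (token !== '*' && !name.includes(token)) continue
    const moreSpecific =
      best === undefined ||
      (best.userAgent === '*' && token !== '*') ||
      (token !== '*' && token.length > best.userAgent.length)
    if (moreSpecific) best = group
  }
  return best
}

const groups: Group[] = [
  { userAgent: 'Googlebot', rules: ['Allow: /'] },
  { userAgent: 'Bingbot', rules: ['Allow: /'] },
  { userAgent: '*', rules: ['Disallow: /admin/', 'Disallow: /api/'] }
]
console.log(selectGroup(groups, 'Googlebot-Image')?.userAgent) // 'Googlebot'
console.log(selectGroup(groups, 'DuckDuckBot')?.userAgent)     // '*'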
The Crawl-delay Controversy#
Crawl-delay controls how frequently crawlers access your site, measured in seconds. But this field isn’t part of the standard protocol, and different search engines handle it differently:
- Google: Completely ignores it; uses “Crawl rate” settings in Search Console
- Bing: Supports it, but minimum value is 1 second
- Baidu: Supports it, but don’t set it too high (affects indexing speed)
- Yandex: Fully supports it
If you set Crawl-delay: 10 but Google still crawls aggressively, don’t be surprised—it doesn’t recognize this directive.
Implementation: Generating robots.txt Programmatically#
Writing robots.txt manually is error-prone. I built an online generator tool with this core logic:
interface Rule {
  type: 'Allow' | 'Disallow'
  path: string
}

interface UserAgentBlock {
  userAgent: string
  rules: Rule[]
}

function generateRobotsTxt(
  blocks: UserAgentBlock[],
  crawlDelay?: string,
  sitemapUrl?: string,
  host?: string
): string {
  const lines: string[] = []

  for (const block of blocks) {
    if (!block.userAgent.trim()) continue
    lines.push(`User-agent: ${block.userAgent}`)
    for (const rule of block.rules) {
      if (!rule.path.trim()) continue
      lines.push(`${rule.type}: ${rule.path}`)
    }
    lines.push('') // Empty line separates blocks
  }

  if (crawlDelay?.trim()) {
    lines.push(`Crawl-delay: ${crawlDelay}`)
  }
  if (sitemapUrl?.trim()) {
    lines.push(`Sitemap: ${sitemapUrl}`)
  }
  if (host?.trim()) {
    lines.push(`Host: ${host}`)
  }

  return lines.join('\n').trim()
}
Usage example:
const blocks: UserAgentBlock[] = [
  {
    userAgent: '*',
    rules: [
      { type: 'Disallow', path: '/admin/' },
      { type: 'Disallow', path: '/api/' },
      { type: 'Allow', path: '/public/' }
    ]
  },
  {
    userAgent: 'Googlebot',
    rules: [
      { type: 'Allow', path: '/' }
    ]
  }
]

const robotsTxt = generateRobotsTxt(
  blocks,
  '10',
  'https://example.com/sitemap.xml',
  'example.com'
)
Output:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /public/

User-agent: Googlebot
Allow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
Host: example.com
Common Framework Configurations#
WordPress#
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Crawl-delay: 5
Next.js#
User-agent: *
Disallow: /_next/
Disallow: /api/
Allow: /_next/static/
Allow: /api/sitemap
Note: Next.js’s /_next/ directory contains compiled resources, but /_next/static/ contains static assets (CSS, JS) that should be allowed for proper page rendering by search engines.
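If you’re on Next.js 13+ with the App Router, you can also generate the file from code instead of committing a static one. A minimal sketch of app/robots.ts using the built-in metadata route (paths taken from the example above; the sitemap URL is a placeholder):

// app/robots.ts: Next.js serves the returned object as /robots.txt
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: ['/_next/static/', '/api/sitemap'],
        disallow: ['/_next/', '/api/']
      }
    ],
    sitemap: 'https://example.com/sitemap.xml'
  }
}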
Verifying Your robots.txt#
Writing isn’t the end—you need to verify. Google Search Console provides a robots.txt testing tool:
- Log into Search Console
- Select your website property
- Find “robots.txt Tester” in the left menu
- Test URLs to see if they’re allowed
You can also use curl for quick checks:
curl -A "Googlebot" https://example.com/admin/
Note that this doesn’t test robots.txt itself; it tells you how the server treats a crawler user agent. If it returns 403 Forbidden, your server is blocking crawlers at the server level, an extra layer of protection on top of robots.txt.
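You can also script the same check. A sketch using Node 18+’s built-in fetch, run as an ES module (the domain and paths are placeholders):

// Check that robots.txt is reachable, and see how the server treats a
// crawler user agent on a sensitive path.
const base = 'https://example.com'

const robots = await fetch(`${base}/robots.txt`)
console.log('robots.txt status:', robots.status) // expect 200, served as text/plain

const asGooglebot = await fetch(`${base}/admin/`, {
  headers: { 'User-Agent': 'Googlebot' }
})
console.log('/admin/ as Googlebot:', asGooglebot.status) // 403 means server-level blocking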
Limitations of robots.txt#
Finally, some often-overlooked points:
- Not immediate: Search engines cache robots.txt (Google for up to roughly 24 hours); changes can take a while to be picked up everywhere
- Doesn’t prevent indexing: Even if crawling is disallowed, if other sites link to your pages, search engines may still index them via link text
- Not secure: robots.txt is a public file—anyone can see which paths you’re blocking. Don’t put sensitive paths there
For truly sensitive content, use server-side access controls (like HTTP Basic Auth, IP whitelists) instead of relying on robots.txt.
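For example, a response header or real authentication does what robots.txt can’t. A minimal sketch with Node’s built-in http module (paths and credentials are placeholders, not a production setup):

import { createServer } from 'node:http'

const server = createServer((req, res) => {
  const url = req.url ?? '/'

  if (url.startsWith('/internal/')) {
    // Keep crawlable-but-unlisted pages out of the index via a response header
    res.setHeader('X-Robots-Tag', 'noindex, nofollow')
  }

  if (url.startsWith('/admin/')) {
    // Truly sensitive paths get real access control, not a robots.txt entry
    const expected = 'Basic ' + Buffer.from('admin:change-me').toString('base64')
    if (req.headers.authorization !== expected) {
      res.writeHead(401, { 'WWW-Authenticate': 'Basic realm="admin"' })
      res.end('Authentication required')
      return
    }
  }

  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('ok')
})

server.listen(3000)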
Summary#
robots.txt seems simple but has many nuances:
- Rule matching is longest-path under the modern spec, but some crawlers go by file order, so list specific Allow rules first
- Wildcard support is limited; it’s not regex
- Crawl-delay isn’t standard; Google ignores it
- With multiple User-agent blocks, a crawler obeys only the single most specific block that matches it
- Verification is more important than writing
If you don’t want to write it manually, use this online tool: Robots.txt Generator with visual configuration and real-time preview.
Related tools: Sitemap Generator | Meta Tag Generator