Infrastructure

robots.txt is not a security tool, and it never was

A surprising number of people still believe robots.txt blocks malicious bots from crawling sensitive paths. It doesn't. It never has. It's a polite request, and the bots you most want to keep out are the ones that don't read it. Here's what robots.txt actually does, and where to put real blocks instead.

Note

**TL;DR — robots.txt is a guideline, not enforcement.** Well-behaved crawlers (Googlebot, Bingbot, ClaudeBot) read it and obey. Malicious bots and scrapers ignore it completely. If you need to actually prevent something from being accessed, the rule has to be enforced by something with the power to drop a connection: a WAF, a server firewall, fail2ban, or Nginx access rules. robots.txt is fine for telling Google not to index your staging site. It is not fine for protecting `/wp-admin/`.

Every couple of months I see someone post a screenshot of their `robots.txt` with a long list of `Disallow:` rules covering admin paths, login URLs, plugin directories, and various sensitive endpoints. The implication is always that this is a security measure — the site owner thinks they've blocked malicious bots from those paths.

It's a sympathetic mistake. The file is called "robots" something, the syntax is a list of paths to deny, the convention has been around forever. It feels like access control. It isn't. It has never been.

What robots.txt actually is

robots.txt is a public text file at the root of your domain that says "hey, crawlers, here's a list of paths I'd prefer you didn't index." That's the entire mechanism. There's no enforcement, no authentication, no validation, no logging, no rate limit, no consequence for ignoring it. It's a politely worded sign on the front of a building that says "Please don't go in here." If the visitor wants to ignore the sign, they ignore it.

The honest crawlers — Googlebot, Bingbot, the official Anthropic ClaudeBot, OpenAI's GPTBot, Yandex, the Internet Archive — all read robots.txt and respect it. They do this because their business model depends on being trusted citizens of the web. If they got caught crawling paths they'd been told not to, the long-term cost would be far worse than the short-term value of the data.

The dishonest crawlers — credential stuffers, vulnerability scanners, scrapers harvesting your content for spam farms, the kind of bot that's hammering `/xmlrpc.php` from a rotating subnet — do not read robots.txt. They have no incentive to. There's no consequence for ignoring it. They'll happily request `/wp-admin/admin-ajax.php` ten thousand times an hour even if your robots.txt has `Disallow: /wp-admin/` in big bold letters.

Why putting sensitive paths in robots.txt is actively worse

Here's the part that gets lost in the conversation. robots.txt is publicly readable. Anyone, including malicious actors, can fetch `https://yoursite.com/robots.txt` and read the entire list of paths you've told crawlers to avoid. If your robots.txt contains:

```text
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /old-staging/
Disallow: /private-api/v2/internal-only/
Disallow: /wp-config-backup.php
```

...you've just published a sitemap of exactly the paths an attacker should look at first. You've taken the things you want kept secret and announced them in a public file. The well-behaved bots stop crawling those paths (good) and the malicious ones now know they exist (very bad). It's the worst possible outcome for security.
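To make the point concrete: extracting that target list takes a few lines of Python. This is a toy sketch — real scanners fetch `https://target/robots.txt` as one of their first requests and do essentially this parse:

```python
# Sketch: how trivially a public robots.txt becomes a list of
# interesting paths. The robots_txt string here is a stand-in for
# the body an attacker would fetch from the target site.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /old-staging/
"""

def disallowed_paths(text: str) -> list[str]:
    """Extract every Disallow path from a robots.txt body."""
    paths = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

print(disallowed_paths(robots_txt))
# -> ['/admin/', '/backup/', '/old-staging/']
```

Every path you "hid" comes straight back out, in order, with zero effort.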

What to use instead

Real access control happens at one of these layers, depending on what you're trying to protect:

  • **Authentication.** Anything sensitive should require a login. `/wp-admin/` already does. If you have an admin path that doesn't require auth, that's the bug to fix.
  • **Web server access rules.** Nginx `allow`/`deny` blocks let you restrict paths to specific IPs or networks. Apache `<Location>` blocks do the same. Both happen before PHP even runs.
  • **WAF / CDN rules.** Cloudflare WAF, AWS WAF, or any commercial WAF can block requests to specific paths from specific countries, ASNs, or anything else, at the edge before they reach your server.
  • **fail2ban.** Watch your access log for repeated requests to known-attack paths and ban the source IP at the firewall level after N attempts. This catches the long tail of attackers your WAF misses.
  • **HTTP Basic Auth on staging.** If you're trying to hide a non-production site, put it behind Basic Auth. It's two lines of Nginx config, and without valid credentials every bot simply hits a 401 and moves on.
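As a concrete sketch of the web-server and Basic Auth options above — the IP range, hostname, and file paths are placeholders, not prescriptions:

```nginx
# Restrict an admin path to a known network; everyone else gets 403.
# 203.0.113.0/24 is a placeholder -- substitute your own range.
location /admin/ {
    allow 203.0.113.0/24;
    deny  all;
}

# Hide a staging vhost behind HTTP Basic Auth.
# Create the credentials file with: htpasswd -c /etc/nginx/.htpasswd someuser
server {
    server_name staging.example.com;

    auth_basic           "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;

    # ... rest of the staging config ...
}
```

Both rules are evaluated by Nginx itself, before the request reaches PHP or anything else — which is exactly the property robots.txt lacks.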

None of these involve robots.txt. They all involve something between the attacker and your application that has the actual power to refuse the request.
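The fail2ban approach, for instance, can be sketched like this — the filter name, paths, and thresholds are illustrative, and the regex assumes the standard combined access-log format:

```ini
; /etc/fail2ban/filter.d/attack-paths.conf  (hypothetical filter name)
[Definition]
failregex = ^<HOST> .* "(GET|POST) /(xmlrpc\.php|wp-login\.php|\.env)

; /etc/fail2ban/jail.d/attack-paths.conf
[attack-paths]
enabled  = true
filter   = attack-paths
logpath  = /var/log/nginx/access.log
maxretry = 5
findtime = 600
bantime  = 86400   ; ban the source IP at the firewall for 24 hours
```

After five matching requests within ten minutes, the source IP is banned at the firewall — the request never reaches Nginx again, let alone your application.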

What robots.txt is genuinely good for

I don't want to be totally dismissive — robots.txt does have legitimate uses. Just not security ones.

  • Telling Google not to index your staging environment so it doesn't accidentally show up in search results next to production.
  • Pointing search engine crawlers at your sitemap (`Sitemap: https://example.com/sitemap.xml`).
  • Asking GPTBot, ClaudeBot, and Google-Extended not to scrape your content for training data, if you care about that — most major AI crawlers honor this even though they're not legally obligated to.
  • Telling Googlebot not to waste crawl budget on faceted-search URL variants that produce duplicate content.
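Put together, a robots.txt built for those legitimate jobs looks something like this (domain and paths are illustrative):

```text
User-agent: *
Disallow: /search?     # faceted/duplicate URL variants, not secrets

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

Notice what's absent: nothing in this file is secret, and nothing bad happens if a bot ignores it.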

All of these are about cooperating with crawlers that already want to cooperate with you. None of them will stop a single attacker.

The one-line summary I give clients

robots.txt is for telling Google what you'd like; the firewall is for telling everyone else what they're allowed to do. Use the right tool for the right job. If a path needs to actually be blocked, block it somewhere that can drop a packet. If a path just needs to stay out of Google, that's what robots.txt is for. Mixing them up is the source of about half the WordPress security advice on the internet, and most of it is wrong.

Topics
robots.txt · WAF · bot management · Cloudflare · fail2ban · Linux server administration · WordPress security
Zunaid Amin

Manages Linux infrastructure at Rocket.net. WordPress Core Contributor since 6.3 and Hosting Team Representative for WordPress.org. Based in Dhaka, Bangladesh.

zunaid321@gmail.com