How to keep bad robots, spiders and web crawlers away
So-called web bots or web spiders are used for many different purposes on the
Internet. Examples include search engines that use them to catalog the web,
email marketers that harvest email addresses, and many more. For a description
of such robots check out The Web Robots FAQ.
Some of those robots are welcome, others are not. This page shows how
I catch the bad ones, and how I stop them from bothering me again.
I do not like robots that do one or more of the following:
- Use the entries in robots.txt to get at files they are supposed to ignore
- Ignore the file robots.txt
- Follow links through CGI scripts
- Traverse the whole web site in seconds, slowing it down for everyone else in the meantime
- Revisit the web site too often (more than once a week)
- Are known to harvest email addresses
- Collect data and sell it without my site getting anything in return
In every case I try to find out what the robot is used for and, if I
decide that I do not want it around anymore, I block access either for that
particular robot or for the site it comes from.
Most of the methods below rely on having access to the access logs on the
web server, which you need to check regularly for unwanted accesses.
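For reference, a line in Apache's combined log format looks roughly like the
made-up example below; the host, the requested path and the robot name are
invented, but the last quoted field is always the user agent string that the
rewrite rules further down match against:
192.0.2.17 - - [10/Oct/2000:13:55:36 -0700] "GET /botsv/index.shtml HTTP/1.0" 200 2326 "-" "BadBot/1.0 (+http://www.example.com/bot.html)"
The first field shows which host made the request, and the timestamps show
how quickly one and the same host requests page after page.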
- Robots that use entries in robots.txt to get at hidden files
I have an entry in my robots.txt file that points to a directory that
is not mentioned anywhere on the web site (an example robots.txt is
sketched after this list). Anyone who accesses it must have checked the
robots.txt file. Such a site will almost always be banned.
- Robots that ignore robots.txt
There is a special directory on this server, namely
/botsv/, which is listed in the file robots.txt.
Any access to it is made either by someone surfing the net or by a robot,
and any robot that accesses it thereby shows that it ignores robots.txt.
- Robots that follow links through CGI scripts
Visible in the log files. CGI scripts are usually not meant to be indexed,
because they are used to generate dynamic web pages that change very
frequently, and each access costs a certain amount of CPU time.
- Robots that traverse the whole web site in seconds
Visible in the log files.
- Robots that revisit the web site too often
Visible in the log files.
- Robots that are known to search for email addresses
They are sometimes mentioned in user groups and mailing lists. However,
by setting up a special web page that includes an email address that
changes whenever someone loads the page, it is relatively easy to spot
them: whenever an email is sent to one of those 'trap' email addresses,
one just has to search the log files to find out who actually got that
address. In order to do this you need access to your own domain, though.
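As a sketch of what the robots.txt side of the first two checks can look
like: /botsv/ is the directory actually used on this server, while
/not-linked-anywhere/ stands in for the hidden decoy directory whose real
name is, for obvious reasons, not published here:
User-agent: *
Disallow: /botsv/
Disallow: /not-linked-anywhere/
Any client that requests /not-linked-anywhere/ can only have learned the name
from this file, and any robot that requests /botsv/ despite this entry has
ignored it.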
Blocking is done with a few lines in the .htaccess file. This file contains
directives for the web server and is used in this case to redirect all
accesses from bad robots to a single page, which contains a
short explanation of why the robot has been banned from the site.
There are two ways to ban a robot: either ban all accesses from a
particular site, or ban all accesses that use a specific ID to access
the server. Most browsers and robots identify themselves whenever they request
a page; Internet Explorer, for example, uses "Mozilla/4.0 (compatible; MSIE
4.01; Windows 98)", which must be interpreted as "I'm a Netscape
browser - well, actually I'm just a compatible browser named MSIE 4.01,
running on Windows 98" (a Netscape browser identifies itself with
"Mozilla").
In both cases the following lines are used at the beginning of the .htaccess
file (note: this works with recent Apache web servers; other servers may need
other commands):
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
To ban all accesses from the IP addresses 209.133.111.* (this is the Imagelock
company), use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\.
RewriteRule ^.*$ X.html [L]
which means: if the remote host has an IP address that starts with 209.133.111,
rewrite the requested file name to X.html and stop further rewriting (that is
what the [L] flag does). The dots are escaped with backslashes because an
unescaped dot matches any character in a regular expression.
If you want to ban a particular robot or spider, you need its name
(check your access log). To ban the Inktomi spider (called Slurp), you can use
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
In order to ban several hosts and/or spiders, use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\. [OR]
RewriteCond %{HTTP_USER_AGENT} Spider [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
Note the "[OR]" flag after each RewriteCond except the last one.
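As a variation (this site prefers to serve the explanation page instead),
mod_rewrite can also turn such requests away with a plain 403 Forbidden
response by using the F flag instead of a substitution page; again with
Slurp as the example:
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ - [F]
The dash means "do not rewrite the URL", and [F] makes the server answer with
403 Forbidden (which also implies that no further rules are applied).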
Three traps are set on this web site:
- Trap to catch robots that ignore the robots.txt file
This site has a special directory that contains only one file.
The directory is mentioned in the robots.txt file, and therefore
no robot should ever access that specific file.
In order to annoy robots that read the file anyway, it contains
special links and commands that make a robot think there are
other important files in that directory. Thanks to a special .htaccess
file, all those other file names actually point to the same file.
In addition, the file always takes at least 20 seconds to load,
without tying up resources on the server.
The .htaccess file looks as follows:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^.*\.html /botsv/index.shtml
ErrorDocument 400 /botsv/index.shtml
ErrorDocument 402 /botsv/index.shtml
ErrorDocument 403 /botsv/index.shtml
ErrorDocument 404 /botsv/index.shtml
ErrorDocument 500 /botsv/index.shtml
The special file uses server-side includes; it is named index.shtml, and
its main parts are:
<html><head><title>You are a bad netizen if you are a web bot!</title></head>
<body><h1><b>You are a bad netizen if you are a web bot!</b></h1>
<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->
To give robots some work, here are some special links:
these are <a href=a<!--#echo var="DATE_GMT" -->.html>some links</a>
to this <a href=b<!--#echo var="DATE_GMT" -->.html>very page</a>
but with <a href=c<!--#echo var="DATE_GMT" -->.html>different names</a>
The effect is that each robot that hits this page sees new links and
requests the same page over and over again. Thanks to the 20-second delay,
the server should not get too busy (unless the robot makes many requests
at the same time, but that would be a very bad robot indeed).
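To illustrate, once the server has processed the includes, a robot fetching
this page on 11 November 1999 at 13:45:02 GMT would receive links looking
roughly like this (the date string changes on every request, so the "new"
pages never run out):
these are <a href=a99315134502.html>some links</a>
to this <a href=b99315134502.html>very page</a>
but with <a href=c99315134502.html>different names</a>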
- Trap to catch robots that misuse the robots.txt file
This site has a special directory with the same properties and files
as the one above, except that there is no link to it anywhere at all.
This directory is only mentioned in the robots.txt file, and therefore
no robot should ever access it unless it has read the robots.txt file
and is using exactly the entries it is supposed to ignore.
Marc has written a program that
will automatically ban access to sensitive directories for all clients that
access the robots.txt file. I have not tested it though.
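I have not written such a program myself either, but the server-side half of
the idea can be sketched with a RewriteMap in the main server configuration
(RewriteMap is not allowed in .htaccess files; the path and file name below
are made up). Whatever program watches the logs then only has to append a
line such as "209.133.111.5 banned" to the map file whenever it spots an
offender:
RewriteEngine on
# banned.txt holds one entry per line: an IP address followed by the word "banned"
RewriteMap banned txt:/usr/local/apache/conf/banned.txt
RewriteCond ${banned:%{REMOTE_ADDR}|ok} =banned
RewriteRule ^ - [F]
The lookup returns "banned" for listed addresses and falls back to "ok" for
everyone else, so only listed addresses get the 403 Forbidden answer.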
- Traps to catch robots that slurp up email addresses
Each of the two trap files above, plus an additional file which is plainly
visible, contains an email address that is generated anew for each robot.
If such an address is ever used, it is trivial to search the log files to
find out who slurped it, and then to block that client. To generate the
addresses I again use server-side includes; here is an email address you
had better not use:
<a href=mailto:bot.<!--#echo var="DATE_GMT" -->@ars.net>bot.<!--#echo var="DATE_GMT" -->@ars.net</a>
Each trap file simply contains a line like the one above.
If you want to install the traps, you can download them here. You will need to
customize them, as they will not work correctly right out of the box. Note
that they do work with the servers I am using, but depending on your server
software and configuration they may or may not work for you. Be sure to test
everything before leaving it in place.