How to keep bad robots, spiders and web crawlers away
So-called web bots or web spiders are used for many different purposes on the
Internet. Examples include search engines that use them to catalog the web,
email marketers that harvest email addresses, and many more. For a description
of such robots check out The Web Robots FAQ.
Some of those robots are welcome, others are not. This page shows how
I catch the bad ones, and how I stop them from bothering me again.
I do not like robots that do one or more of the following:
- Use the entries in robots.txt to get at files they are supposed to ignore
- Ignore the file robots.txt
- Follow links through CGI scripts
- Traverse the whole web site in seconds, slowing it down for everyone else in the meantime
- Revisit the web site too often (more than once a week)
- Are known to harvest email addresses
- Collect data and sell it without my site getting anything in return
In every case I try to find out what the robot is used for and, if I
decide that I do not want it around anymore, I block access either for that
particular robot or for the site it comes from.
Most of the methods below rely on having access to the access logs on the
web server, which you need to check regularly for unwanted accesses.
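For reference, a line in Apache's combined log format looks roughly like the
made-up example below; the host, the requested path and the robot name are
invented, but the last quoted field is always the user agent string that the
rewrite rules further down match against:
192.0.2.17 - - [10/Oct/2000:13:55:36 -0700] "GET /botsv/index.shtml HTTP/1.0" 200 2326 "-" "BadBot/1.0 (+http://www.example.com/bot.html)"
The first field shows which host made the request, and the timestamps show
how quickly one and the same host requests page after page.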
- Robots that use entries in robots.txt to get at hidden files
I have an entry in my robots.txt file that points to a directory that
is not mentioned anywhere on the web site (an example robots.txt is
sketched after this list). Anyone who accesses it must have checked the
robots.txt file. Such a site will almost always be banned.
- Robots that ignore robots.txt
There is a special directory on this server, namely
/botsv/, which is listed in the file robots.txt.
Any access to it is made either by someone surfing the net or by a robot,
and any robot that accesses it thereby shows that it ignores robots.txt.
- Robots that follow links through CGI scripts
Visible in the log files. CGI scripts are usually not meant to be indexed,
because they are used to generate dynamic web pages that change very
frequently, and each access costs a certain amount of CPU time.
- Robots that traverse the whole web site in seconds
Visible in the log files.
- Robots that revisit the web site too often
Visible in the log files.
- Robots that are known to search for email addresses
They are sometimes mentioned in user groups and mailing lists. However,
by setting up a special web page that includes an email address that
changes whenever someone loads the page, it is relatively easy to spot
them: whenever an email is sent to one of those 'trap' email addresses,
one just has to search the log files to find out who actually got that
address. In order to do this you need access to your own domain, though.
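As a sketch of what the robots.txt side of the first two checks can look
like: /botsv/ is the directory actually used on this server, while
/not-linked-anywhere/ stands in for the hidden decoy directory whose real
name is, for obvious reasons, not published here:
User-agent: *
Disallow: /botsv/
Disallow: /not-linked-anywhere/
Any client that requests /not-linked-anywhere/ can only have learned the name
from this file, and any robot that requests /botsv/ despite this entry has
ignored it.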
Blocking is done with a few lines in the .htaccess file. This file contains
directives for the web server and is used in this case to redirect all
accesses from bad robots to a single page, which contains a
short explanation of why the robot has been banned from the site.
There are two ways to ban a robot: either ban all accesses from a
particular site, or ban all accesses that use a specific ID to access
the server. Most browsers and robots identify themselves whenever they request
a page; Internet Explorer, for example, uses "Mozilla/4.0 (compatible; MSIE
4.01; Windows 98)", which must be interpreted as "I'm a Netscape
browser - well, actually I'm just a compatible browser named MSIE 4.01,
running on Windows 98" (a Netscape browser identifies itself with
"Mozilla").
In both cases the following lines are used at the beginning of the .htaccess
file (note: this works with recent Apache web servers; other servers may need
other commands):
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
To ban all accesses from the IP addresses 209.133.111.* (this is the Imagelock
company), use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\.
RewriteRule ^.*$ X.html [L]
which means: if the remote host has an IP address that starts with 209.133.111,
rewrite the requested file name to X.html and stop further rewriting (that is
what the [L] flag does). The dots are escaped with backslashes because an
unescaped dot matches any character in a regular expression.
If you want to ban a particular robot or spider, you need its name
(check your access log). To ban the Inktomi spider (called Slurp), you can use
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
In order to ban several hosts and/or spiders, use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\. [OR]
RewriteCond %{HTTP_USER_AGENT} Spider [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
Note the "[OR]" flag after each RewriteCond except the last one.
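As a variation (this site prefers to serve the explanation page instead),
mod_rewrite can also turn such requests away with a plain 403 Forbidden
response by using the F flag instead of a substitution page; again with
Slurp as the example:
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ - [F]
The dash means "do not rewrite the URL", and [F] makes the server answer with
403 Forbidden (which also implies that no further rules are applied).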
Three traps are set on this web site:
- Trap to catch robots that ignore the robots.txt file
This site has a special directory that contains only one file.
The directory is mentioned in the robots.txt file, and therefore
no robot should ever access that specific file.
In order to annoy robots that read the file anyway, it contains
special links and commands that make a robot think there are
other important files in that directory. Thanks to a special .htaccess
file, all those other file names actually point to the same file.
In addition, the file always takes at least 20 seconds to load,
without tying up resources on the server.
The .htaccess file looks as follows:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^.*\.html /botsv/index.shtml
ErrorDocument 400 /botsv/index.shtml
ErrorDocument 402 /botsv/index.shtml
ErrorDocument 403 /botsv/index.shtml
ErrorDocument 404 /botsv/index.shtml
ErrorDocument 500 /botsv/index.shtml
The special file uses server-side includes; it is named index.shtml, and
its main parts are:
<html><head><title>You are a bad netizen if you are a web bot!</title></head>
<body><h1><b>You are a bad netizen if you are a web bot!</b></h1>
<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->
To give robots some work, here are some special links:
these are <a href=a<!--#echo var="DATE_GMT" -->.html>some links</a>
to this <a href=b<!--#echo var="DATE_GMT" -->.html>very page</a>
but with <a href=c<!--#echo var="DATE_GMT" -->.html>different names</a>
The effect is that each robot that hits this page sees new links and
requests the same page over and over again. Thanks to the 20-second delay,
the server should not get too busy (unless the robot makes many requests
at the same time, but that would be a very bad robot indeed).
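To illustrate, once the server has processed the includes, a robot fetching
this page on 11 November 1999 at 13:45:02 GMT would receive links looking
roughly like this (the date string changes on every request, so the "new"
pages never run out):
these are <a href=a99315134502.html>some links</a>
to this <a href=b99315134502.html>very page</a>
but with <a href=c99315134502.html>different names</a>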
- Trap to catch robots that misuse the robots.txt file
This site has a special directory with the same properties and files
as the one above, except that there is no link to it anywhere at all.
This directory is only mentioned in the robots.txt file, and therefore
no robot should ever access it unless it has read the robots.txt file
and is using exactly the entries it is supposed to ignore.
Marc has written a program that
will automatically ban access to sensitive directories for all clients that
access the robots.txt file. I have not tested it though.
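I have not written such a program myself either, but the server-side half of
the idea can be sketched with a RewriteMap in the main server configuration
(RewriteMap is not allowed in .htaccess files; the path and file name below
are made up). Whatever program watches the logs then only has to append a
line such as "209.133.111.5 banned" to the map file whenever it spots an
offender:
RewriteEngine on
# banned.txt holds one entry per line: an IP address followed by the word "banned"
RewriteMap banned txt:/usr/local/apache/conf/banned.txt
RewriteCond ${banned:%{REMOTE_ADDR}|ok} =banned
RewriteRule ^ - [F]
The lookup returns "banned" for listed addresses and falls back to "ok" for
everyone else, so only listed addresses get the 403 Forbidden answer.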
- Traps to catch robots that slurp up email addresses
Each of the two trap files above, plus an additional file which is plainly
visible, contains an email address that is generated anew for each robot.
If such an address is ever used, it is trivial to search the log files to
find out who slurped it, and then to block that client. To generate the
addresses I again use server-side includes; here is an email address you
had better not use:
<a href=mailto:bot.<!--#echo var="DATE_GMT" -->@ars.net>bot.<!--#echo var="DATE_GMT" -->@ars.net</a>
Each trap file simply contains a line like the one above.
If you want to install the traps, you can download them here. You will need to
customize them, as they will not work correctly right out of the box. Note
that they do work with the servers I am using, but depending on your server
software and configuration they may or may not work for you. Be sure to test
everything before leaving it in place.