
How to Set Up a robots.txt to Control Search Engine Spiders
http://www.thesitewizard.com/archive/robotstxt.shtml
by Christopher Heng, thesitewizard.com (http://www.thesitewizard.com/)

When I started writing my first website, I did not think I would ever have a reason to create a
robots.txt file. After all, did I not want search engine robots to spider and thus index every
document in my site? Yet today, all my sites, including thesitewizard.com
(http://www.thesitewizard.com/), have a robots.txt file in their root directory. This article explains why
you might also want to include a robots.txt file on your sites, how you can do so, and notes some
common mistakes made by new webmasters with regard to the robots.txt file.

For those new to the robots.txt file, it is merely a text file implementing what is known as the Standard
for Robot Exclusion. The file is placed in the main directory of a website and advises spiders and other
robots which directories or files they should not access. The file is purely advisory: not all spiders
bother to read it, let alone heed it. However, most, if not all, of the spiders sent by the major search
engines to index your site will read it and obey the rules contained within the file (provided those rules
make sense).

Why is a Robots.txt File Important?


What is the purpose of a robots.txt file?

1. It Can Avoid Wastage of Server Resources

Many, if not most, websites have some sort of script (computer program) running on them.
For example, many websites have a contact form, such as one created using
the Free Feedback Form Script Wizard (http://www.thesitewizard.com/wizards/feedbackform.shtml).
Some also have a search engine on their site
(http://www.thesitewizard.com/archive/searchengine.shtml), such as the one you see in the left
column of every page on thesitewizard.com.

When search engine robots or spiders index your site, they actually call your scripts just as a
browser would. If your site is like mine, where the scripts are solely meant for the use of humans
and serve no practical use for a search engine (why should a search engine need to invoke my
feedback form or use my site search engine?) you may want to block spiders from the directories
that contain your scripts. For example, I block spiders from my feedback form, search engine and
CGI-BIN directory. Hopefully, this will reduce the load on the web server that occurs when scripts
are executed by removing unnecessary executions.

Of course there are the occasional ill-behaved robots that hit your server at high speed. Such
spiders can actually bring down your server or at the very least slow it down for the real users who
are trying to access it. If you know of any such spiders, you might want to exclude them too. You
can do this with a robots.txt file. Unfortunately though, ill-behaved spiders often ignore robots.txt
files as well.

2. It Can Save Your Bandwidth

If you look at your website's web statistics
(http://www.thesitewizard.com/general/web-statistics-primer.shtml), you will undoubtedly find many
requests for the robots.txt file by various search engine spiders. The search engines try to
retrieve the robots.txt file before indexing your website, to see if you have any special
instructions for them.

If you don't have a robots.txt file, your web server will return a 404 error page to the spider
instead. For those who have customized their 404 error document
(http://www.thesitewizard.com/archive/custom404.shtml), that customized 404 page will end up being
sent to the spider repeatedly throughout the day. Now, if you have customized your 404 page,
chances are that it's bigger than the standard server error message "404 File Not Found" (since
you will want your error page to say more than the default error message). In other words, failing to
create a robots.txt will cause the search engine spider to use up more of your bandwidth as a
result of its repeated retrieval of your large 404 error file. (How much more depends, of course, on
the size of your 404 error page.)

Some spiders may also request files that you feel they should not. For example, some search
engines also index graphic files (like ".gif", ".jpg" and ".png" files). If you don't want them to do so,
you can ban them from your graphic files directory using your robots.txt file.

3. It Removes Clutter from your Web Statistics

I don't know about you, but one of the things I check from my web statistics
(http://www.thesitewizard.com/general/web-statistics-primer.shtml) is the list of URLs that visitors tried
to access, but met with a 404 File Not Found Error. Often this tells me if I made a spelling error in
one of the internal links on one of my sites (yes, I know — I should have checked all links in the
first place, but mistakes do happen).

If you don't have a robots.txt file, you can be sure that /robots.txt is going to feature in
your web statistics 404 report, adding clutter and perhaps unnecessarily distracting your attention
from the real bad URLs that need your attention.

4. Refusing a Robot

Sometimes you don't want a particular spider to index your site for some reason or other. Perhaps
the robot is ill-behaved and spiders your site at such a high speed that it takes down your entire
server. Or perhaps you don't want the images on your site indexed in an image search engine.
With a robots.txt file, you can exclude specific spiders from indexing your site, provided the
spider obeys the rules in that file.

How to Set Up a Robots.txt File


Writing a robots.txt file is extremely easy. It's just an ASCII text file that you place at the root of
your domain. For example, if your domain is www.example.com, place the file at
www.example.com/robots.txt. For those who don't know what an ASCII text file is, it's just a
plain text file that you create with a type of program called an ASCII text editor
(http://www.thefreecountry.com/programming/editors.shtml) . If you use Windows, you already have an
ASCII text editor on your system, called Notepad. (Note: only Notepad on the default Windows system
is an ASCII text editor; do not use WordPad, Write, or Word.)

The file basically lists the name of a spider on one line, followed by the list of directories or files that
spider is not allowed to access on subsequent lines, with each directory or file on a separate line. It is
possible to use the wildcard character "*" (just the asterisk, without the quotes) instead of naming a
specific spider. When you do so, the rules apply to all spiders. Note that the robots.txt file is a robots
exclusion file (with emphasis on the "exclusion"): there is no universal way to tell spiders to include
any file or directory.

Take the following robots.txt file for example:

User-agent: *
Disallow: /cgi-bin/

The above two lines, when inserted into a robots.txt file, inform all robots (since the wildcard asterisk
"*" character was used) that they are not allowed to access anything in the cgi-bin directory and its
descendants. That is, they are not allowed to access /cgi-bin/whatever.cgi or even a file or
script in a subdirectory of cgi-bin, such as /cgi-bin/anything/whichever.cgi.
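
If you want to check how a rule like this is read, one option is to feed it to a robots.txt parser. The following is only a sketch: it uses Python's standard `urllib.robotparser` module, and "AnyBot" is a made-up spider name used for illustration. Real spiders have their own parsers and may not interpret every file identically.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# "AnyBot" is a hypothetical spider; the "*" record applies to it.
for path in ("/cgi-bin/whatever.cgi",
             "/cgi-bin/anything/whichever.cgi",
             "/index.html"):
    verdict = "allowed" if parser.can_fetch("AnyBot", path) else "blocked"
    print(f"{path}: {verdict}")
```

Both cgi-bin paths should be reported as blocked, while /index.html, which no rule mentions, remains allowed.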

If you have a particular robot in mind, such as the Google image search robot, which collects images
on your site for the Google Image search engine, you may include lines like the following:

User-agent: Googlebot-Image
Disallow: /

This means that the Google image search robot, "Googlebot-Image", should not try to access any file
in the root directory "/" and all its subdirectories. This effectively means that it is banned from getting
any file from your entire website.

You can have multiple Disallow lines for each user agent (i.e., for each spider). Here is an example of a
longer robots.txt file:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Disallow: /

The first block disallows all spiders from the images directory and the cgi-bin directory. The
second block disallows the Googlebot-Image spider from every directory.
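
The way the two blocks apply to different spiders can be sketched with Python's standard `urllib.robotparser` module. This is an illustration only: "SomeBot" and the file names are hypothetical, and the result shows how this particular parser reads the file, not a guarantee of how any given search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The longer robots.txt from the example above.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(agent: str, path: str) -> bool:
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


# "SomeBot" is a made-up spider with no record of its own,
# so it falls under the "*" block.
for agent in ("SomeBot", "Googlebot-Image"):
    for path in ("/index.html", "/images/logo.png"):
        verdict = "allowed" if is_allowed(agent, path) else "blocked"
        print(f"{agent} -> {path}: {verdict}")
```

A spider that is not named anywhere falls under the "*" block, whereas Googlebot-Image is matched by its own block and is shut out of every path.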

It is possible to exclude a spider from indexing a particular file. For example, if you don't want Google's
image search robot to index a particular picture, say, mymugshot.jpg, you can add the following:

User-agent: Googlebot-Image
Disallow: /images/mymugshot.jpg

Remember to add the trailing slash ("/") if you are indicating a directory. If you simply add

User-agent: *
Disallow: /privatedata

the robots will be disallowed from accessing /privatedata.html as well as
/privatedataandstuff.html as well as the directory tree beginning from /privatedata/ (and
so on). In other words, there is an implied wildcard character following whatever you list in the
Disallow line.
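
The difference the trailing slash makes can be seen by testing both forms against the same paths. The sketch below uses Python's standard `urllib.robotparser` module; "AnyBot" and /privatedata/report.html are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Without the trailing slash: matches anything that starts with "/privatedata".
without_slash = build_parser("User-agent: *\nDisallow: /privatedata\n")
# With the trailing slash: matches only the directory tree.
with_slash = build_parser("User-agent: *\nDisallow: /privatedata/\n")

PATHS = ("/privatedata.html",
         "/privatedataandstuff.html",
         "/privatedata/report.html")  # hypothetical file in the directory

for path in PATHS:
    a = "allowed" if without_slash.can_fetch("AnyBot", path) else "blocked"
    b = "allowed" if with_slash.can_fetch("AnyBot", path) else "blocked"
    print(f"{path}: without slash = {a}, with slash = {b}")
```

Without the slash, all three paths are blocked. With it, only the file inside the /privatedata/ directory is.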

Where Do You Get the Name of the Robots?


If you have a particular spider in mind which you want to block, you have to find out its name. To do
this, the best way is to check out the website of the search engine. Respectable engines will usually
have a page somewhere that gives you details on how you can prevent their spiders from accessing
certain files or directories.

Common Mistakes in Robots.txt


Here are some mistakes commonly made by those new to writing robots.txt rules.

1. It's Not Guaranteed to Work

As mentioned earlier, although the robots.txt format is listed in a document called "A Standard for
Robot Exclusion", not all spiders and robots actually bother to heed it. Listing something in your
robots.txt is no guarantee that it will be excluded. If you really need to block a particular spider
("bot"), you should use a .htaccess file to block that bot
(http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml). Alternatively, you can also
password-protect the directory (also with a .htaccess file)
(http://www.thesitewizard.com/apache/password-protect-directory.shtml).

2. Don't List Your Secret Directories

Anyone can access your robots file, not just robots. For example, typing
http://www.google.com/robots.txt will get you Google's own robots.txt file. I notice that
some new webmasters seem to think that they can list their secret directories in their robots.txt file
to prevent that directory from being accessed. Far from it. Listing a directory in a robots.txt file
often attracts attention to the directory. In fact, some spiders (like certain spammers' email
harvesting robots) make it a point to check the robots.txt for excluded directories to spider.

3. Only One Directory/File per Disallow line

Don't try to be smart and put multiple directories on your Disallow line. This will probably not work
the way you think, since the Robots Exclusion Standard only provides for one directory per
Disallow statement.
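
To see why cramming several directories into one Disallow line is risky, compare it with the one-per-line form. This sketch relies on Python's standard `urllib.robotparser` module, which treats everything after "Disallow:" as a single path; other parsers may handle the malformed line differently. "AnyBot" and the file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Mistake: two directories on a single Disallow line.
combined = build_parser("User-agent: *\nDisallow: /cgi-bin/ /images/\n")

# Correct: one directory per Disallow line.
separate = build_parser(
    "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"
)

for path in ("/cgi-bin/form.cgi", "/images/photo.jpg"):
    a = "allowed" if combined.can_fetch("AnyBot", path) else "blocked"
    b = "allowed" if separate.can_fetch("AnyBot", path) else "blocked"
    print(f"{path}: combined line = {a}, separate lines = {b}")
```

With this parser, the combined line fails to block /images/photo.jpg, while the two separate lines block both paths as intended.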

How to Specify All the Files on Your Website


A recent update to the robots.txt format now allows you to link to something known as a sitemaps
protocol file (http://www.thesitewizard.com/sitepromotion/get-search-engines-index-all-web-pages.shtml) that
gives search engines a list of all the pages on your website. Please read the article How to Get Search
Engines to Discover (Index) All the Web Pages on Your Site
(http://www.thesitewizard.com/sitepromotion/get-search-engines-index-all-web-pages.shtml) for more
information about this extension.
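
As a rough illustration of what such a link looks like, the sitemaps extension is normally written as a "Sitemap:" line giving the full URL of the sitemap file. The sketch below assumes a hypothetical sitemap at http://www.example.com/sitemap.xml and reads the line back with Python's standard `urllib.robotparser` module (the `site_maps()` method needs Python 3.8 or later). See the linked article for the details of the extension itself.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with a sitemap line; the sitemap URL is a made-up example.
RULES = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of the sitemap URLs found in the file,
# or None if the file has no Sitemap lines.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line is not part of any User-agent block, so the existing Disallow rule continues to apply as before.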

It's Worth It

Even if you want all your directories to be accessed by spiders, a simple robots file with the following
may be useful:

User-agent: *
Disallow:

With no file or directory listed in the Disallow line, you're implying that every directory on your site may
be accessed. At the very least, this file will save you a few bytes of bandwidth each time a spider visits
your site (or more if your 404 file is large); and it will also remove /robots.txt from the 404 report in
your web statistics.
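
If you prefer to generate this minimal file with a script rather than a text editor, something like the following would do. It is only a sketch: it writes the file to a temporary directory (on a real site you would save it in the root directory of your website), then reads it back and checks it with Python's standard `urllib.robotparser` module. "AnyBot" and the sample paths are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# The allow-everything robots.txt shown above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def write_robots_txt(directory: Path, content: str = ALLOW_ALL) -> Path:
    """Write `content` as a plain ASCII robots.txt inside `directory`."""
    target = directory / "robots.txt"
    target.write_text(content, encoding="ascii")
    return target


# A temporary directory stands in for the website's root directory.
with tempfile.TemporaryDirectory() as tmp:
    robots_path = write_robots_txt(Path(tmp))
    saved_text = robots_path.read_text(encoding="ascii")

parser = RobotFileParser()
parser.parse(saved_text.splitlines())

# An empty Disallow line should leave every path open to every spider.
sample_paths = ("/", "/cgi-bin/form.cgi", "/images/logo.png")
everything_allowed = all(parser.can_fetch("AnyBot", p) for p in sample_paths)
print("All sample paths allowed:", everything_allowed)
```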

Copyright 2001-2010 by Christopher Heng. All rights reserved.



Do Not Reprint Without Permission


This article is copyrighted. Please do not reproduce this article in whole or part, in any form, without
obtaining my written permission (http://www.thesitewizard.com/feedback.php) .
