
The Robots Exclusion Standard

By Kevin Muldoon



About Blogging Tips
Blogging Tips is a daily blogging advice blog which specialises in
helping bloggers create, develop, promote and make a living from their
blogs.
Visit www.bloggingtips.com for more information about our blog,
newsletter and discussion forums.
About The Author
Kevin Muldoon is the founder of Blogging Tips
and oversees all site developments.
He currently lives in central Scotland and enjoys
travelling, music, movies, keeping fit and reading.
He continues to blog for Blogging Tips on a
weekly basis as well as his personal blog
KevinMuldoon.com.
Copyright
All rights reserved. No part of this book can be reproduced, copied,
stored in a retrieval system or transmitted over the web without prior
approval of BloggingTips.com. Brief quotations embedded in reviews
are permitted.


Who is this book for?
This book was written to introduce bloggers to the Robots.txt file and
give them a basic understanding of what it is for and how they can use it
on their own site. It also explains how you can use meta tags within a
page to stop it from being indexed in a search engine.
Acknowledgements
Last but not least, I would like to thank the following people who have
helped me release this e-book:
The Blogging Tips Writing Team. If they were not writing great
articles for the blog every day, I wouldn't have had the time to
write this e-book!
The Blogging Tips readers!! Yes I know, it's cheesy; but seriously,
I wouldn't have had the passion or drive to write this e-book in the
first place if it wasn't for the great feedback from you all.
Table of Contents


CHAPTER 1: WHAT IS THE ROBOTS EXCLUSION STANDARD?
    ROBOTS.TXT CODE
CHAPTER 2: CREATING THE ROBOTS.TXT FILE
    WILDCARDS
    USER AGENTS
    SEARCH ENGINE SPIDERS
    WORDPRESS EXAMPLE ROBOTS.TXT FILE
CHAPTER 3: NON-STANDARD DIRECTIVES
    UNSUPPORTED DIRECTIVES
CHAPTER 4: THE ROBOTS META TAG
    INDEX & FOLLOW
    ADDITIONAL ROBOT META TAG ATTRIBUTES
    X-ROBOTS-TAG
CHAPTER 5: LIMITATIONS OF THE ROBOTS EXCLUSION STANDARD
    SEARCH ENGINES DOING THEIR OWN THING
    FILE PROTECTION
    ROBOTS.TXT FILE VISIBILITY
    POPULAR ROBOTS.TXT FILES
CHAPTER 6: OVERVIEW
    COMMUNITY SUPPORT
    REFERENCE
Chapter 1: What is the Robots
Exclusion Standard?
The Robots Exclusion Standard was developed to allow webmasters to
stop search engine robots from crawling certain areas of their
website. It is sometimes referred to as the Robots Exclusion
Protocol.

The Robots Exclusion Standard was created on 30th June 1994 as a
means of controlling what search engine robots do when they visit a
web site (search engine robots are sometimes referred to as spiders
because they crawl the web). The main way of controlling what spiders
do is by entering certain details into a plain text file called robots.txt. By
placing the robots.txt file in the root directory of your domain, you can
tell the robots which pages and directories they should and should not crawl.
This can be incredibly useful. For example, in the first few years it was
widely used to exclude the printer friendly version of a web page from
search engines so that only the original copy was indexed (to help with
duplicate content concerns). More recently, webmasters have been
using the robots.txt file to stop their images being displayed on Google
Images, in order to reduce bandwidth costs.
It is important to note that search engine robots only look for the
robots.txt file in the root directory so you need to make sure it is at the
top level of your domain. Also, a blank robots.txt file is treated as if the
file wasn't even there.
Robots.txt Code
The robots.txt file is one of the easiest documents to understand in web
development. There are only 2 pieces of code you need to learn for
basic commands:
User-agent: This directive specifies the search engine robot
which you want to control; it needs to be stated before any other
directives. For example, Googlebot is the name of Google's spider.
Disallow: This directive allows you to determine what page or
folder should not be visited by the search engine.

Additionally, you can use two special symbols. Firstly, you can use an
asterisk (*) to denote a wildcard. For example, User-agent: * would
apply a rule to all search engine spiders.
You can also use the hash symbol (#) to denote a comment. A
comment can be placed on a line of its own or at the end of any
command line. It is recommended to comment each rule you set so that
you can refer back to the file easily in the future.
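For example, a commented rule might look like this (the directory name here is purely illustrative):
User-agent: *
# keep the staging area out of the search engines
Disallow: /staging/ # added when the test site went live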
Please remember that directives are case sensitive so the file or folder
you are trying to block has to be written exactly the same in the
robots.txt file. For example, blocking the file Personal-CV.doc would
not stop a search engine from indexing personal-cv.doc as it sees them
as two different files.
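As a quick sketch of this point, the following rule would block only the capitalised version of the file mentioned above:
User-agent: *
Disallow: /Personal-CV.doc
# personal-cv.doc (all lower case) would still be crawled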

Chapter 2: Creating the Robots.txt File
In this chapter I will show you some basic examples of the robots.txt file
in action to demonstrate how easy it is to use.

Here is the code to disallow all search engine robots from indexing your
domain:
User-agent: *
Disallow: /

The code to allow all search engine robots is almost identical; you just
need to remove the forward slash:
User-agent: *
Disallow:
Earlier I mentioned that a blank robots.txt file is treated the same way as
if the file wasn't there; this is because the default action of most search
engine spiders is to index everything. The code above achieves the same
thing, so there would actually be no point in adding it to your robots.txt
file, though it perfectly illustrates what happens when the forward slash is
removed.
Let's look at another example:
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /admin/
In the above example, all search engine robots would be instructed not
to index the cgi-bin or admin directories, where, presumably, there are
lots of pages which you wouldn't want listed in a search engine.
You can also block individual pages from being indexed:
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /admin/
# disallow all pages below
Disallow: /private/download125.html
Disallow: /private/end-of-year-report.pdf
Wildcards
Let's look at how we can use wildcards to instruct search engine
spiders on what and what not to index.
Below is the code for blocking all gif files from all search engines:
User-agent: *
Disallow: /*.gif$
To block all directories that begin with the letter d you would use:
User-agent: *
Disallow: /d*/
You need to make sure that you don't write the code above the wrong
way around. For example, the code below would block any URL with
the letter d in it (excluding the domain name itself):
User-agent: *
Disallow: /*d
It may seem that this technique is quite pointless, but there are some
useful applications for it. For example, by disallowing /*data you would
block any URL with the word data in it.
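Written out as a robots.txt rule, that example would look like this:
User-agent: *
# block any URL containing the word "data"
Disallow: /*data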
To block all files with a particular extension (i.e. the end of a URL), you
need to add the $ symbol at the end of the rule. For example, to stop all of
your PDF documents from being listed in search engines you would use:
User-agent: *
Disallow: /*.pdf$
This could be extended so that you only stop pdf documents being
indexed from a particular directory:
User-agent: *
Disallow: /reports/*.pdf$
User Agents
Up until this point I have used the wildcard for the user-agent to denote
all search engine robots. However, sometimes you may wish to set rules
for specific search engines. A common spider which webmasters block
is Googlebot-Image, which is the search engine robot Google uses to
index images for Google Images.
To block or remove all images from Google Images you would use:
User-agent: Googlebot-Image
Disallow: /

You may not want to block all images from Google Images. For
example, you may only want to block images from certain parts of your
web site:
User-agent: Googlebot-Image
Disallow: /gallery/
Disallow: /images/
Disallow: /backup/
If you are using Google AdSense on your blog or website, then you may
want to display ads on a page which you don't want crawled. To do this,
you need to permit the AdSense spider (Mediapartners-Google) to visit
the page. You can do this using the Allow directive.
Here is the code which blocks Google from crawling a website but
allows the AdSense spider to crawl it:
User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Allow: /
The Allow directive is a non-standard directive, though the major search
engines now support it (I will talk more about non-standard directives in
the next chapter).
As you can see, by specifying what search engines can crawl on your
website, you can dictate what pages of your site are indexed in each of
the major search engines. Please note that every new directive has to be
entered on a new line.
Search Engine Spiders
By looking at what search engines are crawling your site and what pages
they are indexing, you can plan out what areas of your site need to be
blocked within robots.txt.
For example, if you notice that Googlebot-Image is using up a lot of
bandwidth from crawling your images, you may want to block the bot
from indexing them.
The most common bots you will see in your stats package are
Googlebot (Google), Yahoo Slurp (Yahoo), MSNBot (Microsoft Live
Search & Bing), ia_archiver (Alexa) and Teoma (Ask). These robots
are likely to use up the most bandwidth on your web site; however, you
will probably see a huge list of other robots crawling your site too.
A complete list of all search engine spiders can be found in the Robots
Database at Robotstxt.org (the list is too extensive to reprint here!).
WordPress Example Robots.txt File
To conclude this chapter, I would like to show you an example of the
code which is recommended for bloggers who use WordPress.
It is a great example of how you can block search engines from viewing
specific areas of your web site and illustrates all of the things I have
discussed up to this point.
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?

# Google Image
User-agent: Googlebot-Image
Allow: /*
Disallow:

# Google AdSense
User-agent: Mediapartners-Google*
Allow: /*
Disallow:

# Internet Archiver Wayback Machine
User-agent: ia_archiver
Disallow: /

# digg mirror
User-agent: duggmirror
Disallow: /

Sitemap: http://www.example.com/sitemap.xml

Chapter 3: Non-standard Directives
The Robots Exclusion Standard was conceived in June 1994. However,
it was not created by an official body; it was agreed by consensus by
members of the robots mailing list. That is actually quite impressive
when you think about it, as the standard is still used to this day by all
the major search engines.


There have been some efforts by search engines to improve the
standard, though. Since these directives were not included in the original
standard, they are known as non-standard extensions.
This term is perhaps a little misleading: since June 3rd 2008, most
major search engines have supported these directives, so it is safe to use
them. Any search engine which doesn't support them will simply ignore
the code.
You will find links to the official announcements regarding the June
2008 agreement below:
Google
Yahoo
Live Search (Bing)
The three main non-standard directives are Allow, Crawl-delay and Sitemap.
Allow
The allow directive lets you instruct the search engine robot to index
files within a directory which it is being blocked from crawling. In order
to work with all search engine robots, it has to be placed before the
Disallow directive (though this doesn't seem to be a problem with any
of the major search engine robots).
User-agent: Googlebot-Image
Allow: /images/logo.png
Disallow: /images/
Crawl-delay
The crawl-delay directive allows you to set the length of time in seconds
between requests on your server. This can be very useful in speeding up
the crawl process so that more pages are indexed or slow it down so that
the server load is reduced.

It is supported by most of the major search engines. Crawl delays of
less than one second are written as decimals, e.g. 0.1.
User-agent: *
Crawl-delay: 5
Some search engines will access your server a few times a second so
bear this in mind when choosing the crawl delay rate. Bing suggests not
using a crawl-delay value higher than 10 seconds as it makes it very
difficult for the spider to index all pages of your site.
The table below gives you an idea of how slow the indexing speed will
be according to the crawl delay being set.
Crawl-delay setting      Index refresh speed
No crawl delay set       Normal
1                        Slow
5                        Very slow
10                       Extremely slow
Note, Google does not actively support this directive and will usually
disregard it and crawl at a rate it determines (more on this later).
Sitemap
In 2007 Google, Yahoo, Ask and Microsoft all agreed to allow crawlers
to recognise the sitemap directive in the robots.txt file.

The Sitemap directive allows you to specify the location of your
website's XML sitemap. The robots.txt file is the first thing a search
engine spider checks when it visits your site, so it makes sense to tell the
spider where your sitemap is located.
Sitemap: http://www.example.com/sitemap.xml
Please note, you can add multiple sitemaps to your robots.txt file.
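For example, a robots.txt file pointing to two sitemaps might look like this (the second file name is purely illustrative):
Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/video-sitemap.xml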
Unsupported Directives
There are a few directives which were supported or proposed at one
time but, for one reason or another, were never accepted by the major
search engines.
Noindex directive
The Noindex directive is something which some search engines support,
though it has never been officially backed by the major search engines.
It allows a page or directory to be crawled but stops the pages from
appearing in the SERPS (Search Engine Results Pages).
Google tested it out in 2007 for a while, but it never became officially
supported.
User-agent: Googlebot
Noindex: /private/
Though most search engines do support it to some extent, I wouldn't
recommend using it unless it becomes officially supported by a major
search engine such as Google, Bing or Yahoo.
Extended Standard for Robot Exclusion
Sean Connor made proposals for an extended standard several years ago.
This new standard would address some of the issues which the original
standard did not cover.
Two of the suggested directives were Visit-time and Request-rate. Visit-
time would determine what times a spider could crawl a certain area of
your site (e.g. 8am to 6pm) whilst Request-rate would determine the
number of pages which can be crawled within a certain period of time
(similar to the crawl-delay directive).
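To give a rough idea of how these proposed directives were meant to be written, a rule following the draft might have looked something like this (the times and rate below are made up for illustration, and no major search engine honours these directives):
User-agent: *
# only allow crawling between 8am and 6pm (times in UT)
Visit-time: 0800-1800
# request no more than one page every ten seconds
Request-rate: 1/10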
The extended standard was written many years ago, though search
engines never adopted any of the policies it suggested.

Chapter 4: The Robots Meta Tag
Although the main method of controlling search engine spiders is
through the robots.txt file, it is possible to control how a spider acts by
adding some code to the HEAD section of a webpage using the META
tag. This standard was introduced in 1996.

I prefer to use the robots.txt file to control how search engine spiders
work, as some spiders completely ignore META tags (though the major
search engines support them). The meta tags only control what happens on
that specific page; therefore it is possible that the search engine spider
will find a page you didn't want it to find via another page.
If a page is blocked within your robots.txt file, the spider will not crawl
the page so will not read the meta tags.

However, if a page is not blocked by robots.txt but is blocked within the
page itself using meta tags, the spider would crawl the page, read the
meta tags and then not index the content.
INDEX & FOLLOW
The original standard, created in 1996, stated four attributes for the tag
ROBOTS: INDEX, NOINDEX, FOLLOW and NOFOLLOW.
The INDEX attributes determine whether a page is indexed or not,
whilst the FOLLOW attributes determine whether the spider follows the
links on the page, i.e. whether the spider crawls the pages which are
linked from it.
Here are some examples of this in use. The code below would be placed
within the HEAD section of your web page.
Do not index the content or follow any links:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Do not index the content but follow any links:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
Index the content but do not follow any links:
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

Index the content and follow any links:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
You may come across some web pages which display attributes on
separate lines. Note that:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
Is interpreted the same way as:
<META NAME="ROBOTS" CONTENT="INDEX">
<META NAME="ROBOTS" CONTENT="FOLLOW">
It's worth noting that any page excluded by the robots.txt file or via the
robots meta tag will still have PageRank juice passed to it in Google.
However, the way juice is passed on to other pages is slightly different
between robots.txt and meta tags: specifically, a page which uses
the robots meta tag will still pass PR juice to any link on the page even if
the NOFOLLOW attribute is used, whereas a page blocked in the
robots.txt file will not.
To stop PR juice being passed from a page which uses the NOFOLLOW
attribute, you would have to include the rel="nofollow" attribute in all
of your links.
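For example, an individual link carrying the nofollow attribute looks like this (the URL is just a placeholder):
<a href="http://www.example.com/" rel="nofollow">Example link</a>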


Additional ROBOT Meta Tag Attributes
In the June 2008 agreement, the major search engines agreed to start
supporting additional attributes within the ROBOTS meta tag.
The 4 new attributes which were added were:
NOARCHIVE: Prevents a cached copy of the page being listed in
the SERPS.
NOSNIPPET: Prevents a description of the page appearing in the
SERPS and stops the page being cached.
NOODP: Prevents the Open Directory Project description being
displayed in the SERPS. This is useful if the ODP description for a
website is outdated.
NOYDIR: Prevents the Yahoo Directory description being
displayed in the SERPS (Therefore only applicable to the Yahoo
spider Slurp).
These new attributes are used in the same way as the original ones.
Example:
<META NAME="ROBOTS" CONTENT="INDEX,NOODP,NOARCHIVE">



X-Robots-Tag
One of the new tags to be supported in the June 2008 agreement was
the addition of the X-Robots-Tag. This tag was developed to support
non-HTML documents, including text files, PDF documents,
spreadsheets, Word documents, and audio and video files.
The X-Robots-Tag is placed within the HTTP header, either by a
server-side language such as PHP, Perl or Ruby, or via your web server
configuration. You can use NOINDEX, NOFOLLOW, NOSNIPPET,
NOARCHIVE and NOODP.
The example below, placed in your Apache configuration or .htaccess
file, would stop all PDF documents being indexed:
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "NOINDEX"
</FilesMatch>
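If you would rather send the header from a server-side script, a minimal PHP sketch might look like this (it has to run before any other output is sent to the browser):
<?php
// ask robots not to index this page or follow its links
header('X-Robots-Tag: noindex, nofollow');
?>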
You can see more examples of what X-Robots-Tag can do at the Official
Google Blog.

Chapter 5: Limitations of the Robots
Exclusion Standard
The Robots Exclusion Standard is incredibly useful and is something
which all webmasters, web developers and bloggers should have a basic
understanding of. However, the standard is very limited in many
respects.

One of the main issues with the Robots Exclusion Standard is the fact
that it has no governing body. The June 2008 agreement between
Google, Yahoo and Microsoft was a step in the right direction and has
brought some sort of control over what can be used and what can't.
However, over the last decade most search engines have developed their
own unique parameters for the robots.txt file so that they have more
control over how their own spiders crawl the web. This is particularly
true for non-standard exclusion directives.
Search Engines Doing Their Own Thing
With no official body, it's understandable that search engines will adopt
their own ideas of how their search engine robot interacts with
websites. However, this obviously makes it difficult for web developers
who are trying to control which areas of their site are indexed by multiple
search engines.
For example, whilst doing research for this book I came across a forum
post from Mark Welch from July 2008 about Google slamming his test
server with many requests per second, even though he had set the
Crawl-delay to a few seconds (Remember, Crawl-delay is a non-
standard directive).
Though Google used to actively support the crawl-delay directive, they
now request users to set the crawl rate via Webmaster Tools.
Upon querying this with Google, Mark was informed that:
The rate at which Googlebot crawls is based on many factors. At
this time, crawl rate is not a factor in your site's crawl. If it becomes
a factor, the faster option below will become available.
This effectively means that Google may ignore both the crawl-delay
directive within your robots.txt file and the rate you set within
Webmaster Tools (as someone who used to run a large discussion forum,
I regularly saw search engine robots use up more than 20GB of
bandwidth per month, so it is a little concerning that a website owner
cannot easily dictate how a spider crawls their site).
Perhaps if the major search engines all sat down and discussed their
ideas for the development of the Robots Exclusion Standard, we would
see more functionality within it.
As the internet grows, search engines will continue to modify their
search algorithm and the policies which web developers should adhere
to so perhaps it is asking a lot for them all to work together. However,
the June 2008 agreement indicates that they can work together for
mutual benefit rather than just their own.
File Protection
The major search engines will read your robots.txt file and your
ROBOTS meta tags and not index the pages and files which you do not
want indexed. They do not have to adhere to your suggestions, but most
respected search engines do. However, not all search engine robots are as
trustworthy as Google or Yahoo.
There are a lot of spam bots, email harvesters and untrustworthy
elements which crawl the web, and they will simply index what they
want, regardless of what you have in your robots.txt file. This is why the
robots.txt file should not be considered as a replacement for password
protecting important documents, files and directories.
It is merely a way of suggesting to search engines what to index when
they visit your web site, something which untrustworthy spiders will
simply ignore.
Robots.txt File Visibility
Since the Robots Exclusion Standard directives are placed in a simple
text file, they are visible to everyone on the internet. This will obviously
be a concern to many website owners.
It means that everyone can see the contents of your robots.txt file and
see how you are controlling search engine spiders on your site. Many
blog owners are unaware of this. I will illustrate this with an example.
Blocking Important Files without Highlighting Where They Are
A common gift which blog owners give to readers is a free eBook (just
like this one!). The blog owner usually requests the user to sign up to
their blog, forum or newsletter and then emails a link to the reader so
that they can download it.
This download page is not linked from the main website; however,
search engines crawl everything they can find on a domain, so it is
common for this secret download page to be listed in a search engine,
which means that anyone can download the file without signing up to a
newsletter or whatever.
A way to stop this from happening is to block the search engine from
crawling the download page. For example, you could use:
User-agent: *
# hide ebook download page from SERPS
Disallow: /ebook-download/
The above code would stop search engines from indexing the download
page; however, it would also highlight to anyone who checks the
robots.txt file where the eBook can be downloaded. I would hazard a
guess that it is more likely for someone to find an eBook download page
via a search engine than through a robots.txt file, but it is still not a great
solution.
So what can you do? Well, my first suggestion would be to use the
NOINDEX attribute within the meta tags of the download page. This
would stop search engines from indexing the page, so there would be no
need to mention it in the robots.txt file.
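For example, the download page could include the following within its HEAD section:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">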
However, this isn't always possible with blogging scripts, as a common
header is used for all blog pages. One way round this is to deal with the
download page via a .htaccess file using IndexIgnore, although note that
IndexIgnore only hides files from Apache's automatically generated
directory listings; it does not, by itself, stop a linked page from being
indexed (a more reliable .htaccess option is to send the X-Robots-Tag
header described in the previous chapter).
For example:
IndexIgnore /ebook-download/
Remember, the above code would be entered into your .htaccess file
and not robots.txt.
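If you prefer the X-Robots-Tag approach, one rough sketch is to place a .htaccess file inside the download directory itself containing a single line (this assumes Apache with the mod_headers module enabled):
Header set X-Robots-Tag "NOINDEX"
Every file served from that directory would then be sent with the NOINDEX header, without the directory ever being mentioned in robots.txt.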
If you use WordPress I recommend the fantastic WordPress plugin
Robots Meta from Joost de Valk. The plugin adds a side box in the post
entry page where you can determine whether the page is indexed or
followed.
If you use a different blogging platform from WordPress, can't
specify robot meta tags for individual pages and are unsure about using
.htaccess, you might be better off password protecting the download folder
or the file itself. This is a minor inconvenience for your readers, though
it is worth considering if people are bypassing your sign-up process.
Popular Robots.txt Files
Having your robots.txt information available to the world might not be
preferable to many of you; however, on the plus side, you can see any
website's robots.txt file yourself too (and can view the source of any
page and see what meta tags they used).
Checking out the robots.txt of other websites lets you see how other
webmasters are controlling search engine spiders and is a good way of
understanding how the Robots Exclusion Standard works too.
Here are links to the robots.txt file of some of the most popular
websites on the web:
http://www.google.com/robots.txt
http://www.amazon.com/robots.txt
http://ebay.com/robots.txt
http://www.youtube.com/robots.txt
http://www.facebook.com/robots.txt
http://en.wikipedia.org/robots.txt
http://bing.com/robots.txt
http://sfbay.craigslist.org/robots.txt
http://www.cnn.com/robots.txt
http://www.bbc.co.uk/robots.txt
http://www.cnet.com/robots.txt
http://www.bloggingtips.com/robots.txt
http://www.godaddy.com/robots.txt
https://www.paypal.com/robots.txt
http://www.nhl.com/robots.txt



Chapter 6: Overview
I hope that this eBook has given you a good understanding of what the
Robots Exclusion Standard can do and how you can use it on your site.

The Robot Exclusion Standard is far from being perfect; however it is a
useful way of controlling what search engine spiders do on your web
site.
Just remember, it should not be considered as a replacement for
password protecting important files and folders; it is merely a way of
stopping certain areas of your web site being indexed by the search
engines.
You should also note that it is not an effective way of dealing with
spammers, since their software will simply ignore anything in the
robots.txt file.
Hopefully, the major search engines will sit down and discuss the future
of the Robot Exclusion Standard and make it more useful and consistent
across the board.
Good Luck
If there is a blogging related topic that you would like covered in
another e-book, please let us know via our blog or forums. To see a full
list of Blogging Tips books, please visit
http://www.bloggingtips.com/books/.
I wish you all the best of luck with your blogging careers and I hope that
this e-book will help you take your blog to the next level!

Kevin Muldoon
www.bloggingtips.com

Community Support
If you find any aspect of the Robots Exclusion Standard difficult, then
I encourage you to drop by the Blogging Tips Forums for hands-on
support from experienced bloggers, webmasters and designers.
It's a great place to hang out with like-minded bloggers and, best of all,
registration is free!


Reference
A list of links which you may find useful:
http://www.robotstxt.org
1996 Robots Exclusion Standard (RES)
Google Webmaster Tools Robots.txt page
Search Engine Watch
SEO Round Table
Webmaster World
