How do search engines work?
Search engines maintain a database of web pages on the internet and the links found on them. Because pages are constantly being added, edited, or deleted, search engines must keep reading pages and following their links to update that database, so that each search returns better results. This recursive reading and following of links on web pages is called "spidering", and the program that does it is called a spider or robot.
Most of the time we want our web pages to be spidered so that they can be listed on search engines. But there are situations when we don't want the spider or robot to crawl some of our pages, such as admin panel pages or configuration files. Moreover, sometimes your web pages are online but the site is not yet live for users to visit, so you would not want robots to index it.
For this purpose, there are two ways to direct a robot not to index specified pages: robots.txt and robots META tags. Robots.txt is useful when you want to block many pages throughout the site, whereas robots META tags are useful for blocking page by page.
Note: Not all search engines honor these directives, but most of the reputable ones do.
Robots META Tags
This is a very good method for excluding single web pages. The robots META tags are placed in the <head> section of the web page.
The META tag directives let you control whether a single web page is indexed and whether the links on it are followed. There are 4 options based on index and follow:
1) Do not index, do not follow links.
2) Do not index, follow links.
3) Index this page, but do not follow links.
4) Index this page, follow links.
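Each of the four options corresponds to one value of the content attribute of a robots META tag, built from the standard keywords index/noindex and follow/nofollow. As a minimal sketch, the Python helper below prints the tag for each option in the order listed above. The function name robots_meta_tag is made up for this illustration and is not part of any library.

```python
def robots_meta_tag(index: bool, follow: bool) -> str:
    """Build a robots META tag for the <head> section (illustrative helper)."""
    index_part = "index" if index else "noindex"
    follow_part = "follow" if follow else "nofollow"
    return f'<meta name="robots" content="{index_part},{follow_part}">'


# The four options, in the same order as the list above.
OPTIONS = [(False, False), (False, True), (True, False), (True, True)]

for number, (index, follow) in enumerate(OPTIONS, start=1):
    print(f"{number}) {robots_meta_tag(index, follow)}")
```

The first line printed is the tag for option 1, content="noindex,nofollow", which is the one to paste into a page that should be neither indexed nor followed.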
Robots.txt file
The robots.txt file is the best method when you want to exclude visiting search engine robots from the complete site or from a few sections of it. For example, you might not want a search engine to crawl and index the programs in your admin directories. This method is not particularly suited to excluding single pages (see robots META tags instead).
When a robot visits the site http://www.sakshay.in/blog, it first checks for the existence of a robots.txt file in the root directory of the website:
http://www.sakshay.in/robots.txt
This specially formatted file tells the spider or robot which parts of the site to exclude. There can be only one robots.txt file per website, and it needs to be placed in the root directory of the website, for example in /public_html. A file placed in any other directory will not be read or followed. For example, a file at
http://www.sakshay.in/blog/robots.txt
will not be read.
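To make the lookup rule concrete, here is a small Python sketch of how a robot could work out where to look for robots.txt from any page URL: it keeps the scheme and host and replaces everything else with /robots.txt. The function name robots_txt_url is a hypothetical helper for this post. Real crawlers may handle more cases than this.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; path, query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.sakshay.in/blog"))
# prints http://www.sakshay.in/robots.txt
```

Whatever page the robot starts from, the result is the same root-level address, which is why a copy of the file inside /blog/ is never consulted.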
There are many things you can do with a correctly written robots.txt. Here are the most commonly used options:
a) If you want to allow all robots to spider your site, put the following in your robots.txt file:
User-agent: *
Disallow:
b) If you want to exclude all robots from your site, put the following in your robots.txt file:
User-agent: *
Disallow: /
c) If you want to exclude all robots from particular folders, put the following in your robots.txt file:
User-agent: *
Disallow: /blog/
Disallow: /directory/media/
Disallow: /cgi-bin/
So this will exclude all robots from the URLs http://www.sakshay.in/blog/, http://www.sakshay.in/directory/media/, and http://www.sakshay.in/cgi-bin/.
Note that you need a separate line for each area that you want to exclude from search engine robots. Also note the trailing slash, which indicates that the entire directory should be ignored by the search engine robots.
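If you want to check rule sets like (a), (b) and (c) before uploading them, one option is Python's standard-library urllib.robotparser module. The sketch below is only an illustration: is_allowed is a helper written for this post, "ExampleBot" is a made-up robot name, and real search engine robots may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule sets (a), (b) and (c) from the text above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_FOLDERS = (
    "User-agent: *\n"
    "Disallow: /blog/\n"
    "Disallow: /directory/media/\n"
    "Disallow: /cgi-bin/\n"
)

SITE = "http://www.sakshay.in"
BOT = "ExampleBot"  # hypothetical robot name, used only for this check

print(is_allowed(ALLOW_ALL, BOT, SITE + "/blog/"))               # True
print(is_allowed(BLOCK_ALL, BOT, SITE + "/blog/"))               # False
print(is_allowed(BLOCK_FOLDERS, BOT, SITE + "/blog/post.html"))  # False
print(is_allowed(BLOCK_FOLDERS, BOT, SITE + "/about.html"))      # True
```

In rule set (c), only URLs whose path starts with one of the listed folders are blocked. The rest of the site stays open to robots.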
d) If you want to allow one specific robot, but exclude all others, use the following:
User-agent: FriendlySpider
Disallow:
User-agent: *
Disallow: /
For example, if you want only the Google search engine to crawl your site and want to exclude all other robots, the robots.txt will look like this:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
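The Googlebot-only rules above can be checked with a short Python sketch using the standard-library urllib.robotparser module. It assumes Python's parser is a fair stand-in for a well-behaved robot, and "OtherBot" is a made-up name for any robot that is not Googlebot. The check also shows that the group naming a specific robot applies to that robot even though the catch-all group comes first in the file.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only rules from the text above.
GOOGLE_ONLY = (
    "User-agent: *\n"
    "Disallow: /\n"
    "User-agent: Googlebot\n"
    "Disallow:\n"
)

parser = RobotFileParser()
parser.parse(GOOGLE_ONLY.splitlines())

URL = "http://www.sakshay.in/blog/"
googlebot_ok = parser.can_fetch("Googlebot", URL)
otherbot_ok = parser.can_fetch("OtherBot", URL)  # hypothetical robot name

print("Googlebot allowed:", googlebot_ok)  # Googlebot allowed: True
print("OtherBot allowed:", otherbot_ok)    # OtherBot allowed: False
```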
For more information on robots.txt, please visit http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360