For the majority of websites, a robot or spider (the two terms mean the same thing) will crawl your site almost daily, whether it comes from Google, MSN, Yahoo, or another service such as Alexa. Sometimes you don't want all of your pages or images to be indexed. This guide covers how to control that, and more.
When a crawler (robot, spider, or whatever other name you prefer to call it by) visits your site, it will first look for a robots.txt file at the root of your site. This file tells the crawler what you would prefer it not to index, and it can contain rules for specific crawlers or for all of them. The reason I say "prefer not to index" is that the crawler does not have to do as this file says. If it is really important that a crawler doesn't access particular directories or files, for security or intellectual property reasons for example, then you should block access using whatever technology is available on the server instead. In the case of Apache there is the .htaccess file, as I have previously talked about in Understanding Apache htaccess and Protecting Your Site. The first thing to do, if you have not done so already, is to create your robots.txt file. To start with, let's look at an example file.
User-agent: *
Disallow: /
This example is unlikely to be something you would want to utilise. The first line tells the robot crawling the site that the following rules apply to all crawlers. The second line then tells the robot that it should not index any file on your site. Unless your site is for members only, you are unlikely to want this - it is usually beneficial to have search engines crawl your site so that they can help drive visitors to it. Without turning up in search engines your site is likely to get little traffic. If, however, you collect statistics on your site and you know that one or more robots are abusing your bandwidth, either by not crawling the site correctly or by crawling it too often, then you can adjust this statement to apply to just the intended crawler(s).
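If you want to check how a parser reads these two lines, Python's standard library includes urllib.robotparser. The following is a minimal sketch rather than a definitive test: the crawler name "examplebot" and the paths are made up for illustration, and a real crawler may interpret the file differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The two-line example from above: every crawler, whole site disallowed.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# "examplebot" and these paths are hypothetical, used only for illustration.
for path in ("/", "/index.html", "/images/logo.png"):
    print(path, is_allowed(BLOCK_ALL, "examplebot", path))
```

Each path should be reported as not allowed, since the wildcard rule covers every crawler and the single slash covers every path.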
User-agent: googlebot
Disallow: /
User-agent: b2w
Disallow:
User-agent: *
Disallow: /sitemorse/
This example is slightly more complicated. The first rule tells the googlebot crawler (the main one Google sends round for indexing in their search engine) that it is not allowed to crawl any page or folder on the site. The second rule tells the b2w crawler that it can access anything. It is needed because of the final rule, which tells all other crawlers that they cannot access the /sitemorse/ folder: a crawler follows the rule set that names it and only falls back to the * rule set when no named one matches. The empty Disallow in the b2w rule therefore gives that crawler permission to index the contents of /sitemorse/ as well.
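The way the three rule sets interact can be sketched with Python's urllib.robotparser. This is only an illustration under a few assumptions: "otherbot" stands in for any crawler not named in the file, the page names are hypothetical, and the module's matching may not be identical to what a given search engine does.

```python
from urllib.robotparser import RobotFileParser

# The three rule sets from the example above.
RULES = """\
User-agent: googlebot
Disallow: /
User-agent: b2w
Disallow:
User-agent: *
Disallow: /sitemorse/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# (crawler name, path) pairs; "otherbot" and the file names are hypothetical.
checks = [
    ("googlebot", "/index.html"),
    ("b2w", "/sitemorse/report.html"),
    ("otherbot", "/sitemorse/report.html"),
    ("otherbot", "/index.html"),
]

# Map each pair to whether the parser would let that crawler fetch the path.
results = {(agent, path): parser.can_fetch(agent, path) for agent, path in checks}

for (agent, path), allowed in results.items():
    print(f"{agent:10} {path:25} {'allowed' if allowed else 'blocked'}")
```

With these rules, googlebot is blocked everywhere, b2w is allowed into /sitemorse/, and the unnamed crawler is blocked from /sitemorse/ but allowed elsewhere.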
If you do not want to put restrictions on the crawlers indexing your site, it is still advisable to have a robots.txt file present, as it will be requested anyway. If the file is not there, then a 404 error will be recorded in your server's log file. To avoid this, use the following:
User-agent: *
Disallow:
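If you manage several sites, creating this permissive file can be scripted. The sketch below is one possible approach, not a recommended deployment method: the helper names write_allow_all and allows_everything are invented for this example, and a temporary directory stands in for your real document root. It writes the file only if none exists, then reads it back with urllib.robotparser to confirm that nothing is blocked.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# The permissive two-line file from above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def write_allow_all(doc_root):
    """Create robots.txt in doc_root unless one already exists."""
    target = Path(doc_root) / "robots.txt"
    if not target.exists():
        target.write_text(ALLOW_ALL, encoding="ascii")
    return target


def allows_everything(robots_path, sample_paths=("/", "/any/page.html")):
    """Return True if the file lets a hypothetical crawler fetch every sample path."""
    parser = RobotFileParser()
    parser.parse(Path(robots_path).read_text(encoding="ascii").splitlines())
    return all(parser.can_fetch("examplebot", p) for p in sample_paths)


# A temporary directory stands in for the real document root here.
with tempfile.TemporaryDirectory() as doc_root:
    robots_file = write_allow_all(doc_root)
    print(robots_file.name, "allows everything:", allows_everything(robots_file))
```

Because the helper leaves an existing robots.txt untouched, running it against a site that already has rules will not overwrite them.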













