A simple definition of robots.txt is “a file that gives instructions to robots, bots, spiders and crawlers”. Web robots, also known as crawlers or spiders, are programs that traverse the Web automatically and crawl your website and its pages. Search engines like Yahoo and Google use them to index web pages, spammers use them to scan for email addresses and other content, and they have many other uses.

 

This file helps you control how spiders move through your site. It lets you block a single page or the whole website from being crawled, for example if you have confidential pages or do not want your website indexed by search engines. If your site is blocked by robots.txt, search engines that honor the file will not crawl your pages. The basic structure of a robots.txt file is shown in the figure:

 

 

[Figure: basic structure of a robots.txt file]

 

Does Your Website Really Need a robots.txt File?

 

A robots.txt file is not required for a website. According to Google: “A robots.txt file restricts access to your site by search engine robots that crawl the web. You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file—not even an empty one. If you don’t have a robots.txt file, your server will return a 404 when Googlebot requests it, and we will continue to crawl your site. No problem.” (Source: Google)

 

 

SEO Ranking and Robots.txt File

If your website has a robots.txt file, it needs to be correct if you want good rankings on search engines. Before you upload a robots.txt file, analyze it: a file that is not properly structured can block pages you want indexed and harm your SEO ranking. You can check your robots.txt file here: http://www.frobee.com/robots-txt-check

 

 


Structure of Robots.txt File:

A robots.txt file uses three directives:

  • User-agent: the search engine robot that the rules below it apply to (* means all robots)
  • Disallow: the page, folder or URL you want to block
  • Allow: (optional) specific pages you want to allow
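
As a rough illustration of how a crawler reads these directives, the sketch below uses Python's standard `urllib.robotparser` module. The rules, the crawler name `ExampleBot` and the `example.com` URLs are made up for the example. This parser does simple prefix matching, so it may not treat every rule exactly as Google does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one folder for every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /web-design/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name used only for illustration.
folder_page_ok = parser.can_fetch(
    "ExampleBot", "https://example.com/web-design/logo.html"
)
home_page_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")

print("folder page may be crawled:", folder_page_ok)
print("home page may be crawled:", home_page_ok)
```

Only the URL under /web-design/ is refused here; everything else stays open because no Disallow rule matches it.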

 

To allow everything
User-agent: *
Disallow:

 

 

To block the entire site
User-agent: *
Disallow: /

 

 

To block a directory and everything in it
User-agent: *
Disallow: /web-design/

 

 

To block a specific page
User-agent: *
Disallow: /web-design.html

 

 

To block Googlebot from crawling a folder, except for one file in that folder
User-agent: Googlebot
Disallow: /web-design/
Allow: /web-design/logo.html

 

 

To remove a specific image from Google Images
User-agent: Googlebot-Image
# for a JPG file
Disallow: /images/logos.jpg
# for a GIF file
Disallow: /images/logos.gif

 

 

To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /

 

 

To remove all JPG images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /*.jpg$

 

 

To remove all HTML pages on your site from search
User-agent: *
Disallow: /*.html$

 

 

To block all bots from crawling your site, except for Googlebot
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /*

 

 

Robots.txt Wildcard Matching

Note: Apart from the major search engines, most crawlers don’t support wildcard matching and will most likely misunderstand or ignore such rules.

Google, Yahoo! Search, and Microsoft allow the use of wildcards (the special characters * and $) in robots.txt files.
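
To make the wildcard semantics concrete, here is a minimal Python sketch that turns a robots.txt path pattern into a regular expression. It assumes the convention used by the major search engines: `*` matches any sequence of characters and a trailing `$` marks the end of the URL. The function names are made up for illustration, and this is not any search engine's actual implementation.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' = any run of characters, trailing '$' = end of URL.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' where '*' appeared.
    literal_pieces = [re.escape(piece) for piece in pattern.split("*")]
    regex = ".*".join(literal_pieces)
    if anchored:
        regex += "$"
    return re.compile(regex)


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path (plus query string) matches the pattern."""
    # re.match anchors at the start of the string, like a robots.txt prefix rule.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/*?", "/page?id=1"),
        ("/*?", "/page"),
        ("/*.pdf$", "/docs/file.pdf"),
        ("/*.pdf$", "/docs/file.pdf?download=1"),
    ]:
        print(pattern, path, path_matches(pattern, path))
```

A crawler without wildcard support would instead treat `*` and `$` as literal characters, which is why such rules are often silently ignored.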

 

 

To block access to all URLs that include a question mark (?)
User-agent: *
Disallow: /*?

 

 

To block access to all subdirectories that begin with web-design:
User-agent: Googlebot
Disallow: /web-design*/

 

 

To match the end of a URL, such as .html, .php, .asp or .pdf, use the $ character
User-agent: Googlebot
Disallow: /*.htm$
Disallow: /*.html$
Disallow: /*.php$
Disallow: /*.asp$
Disallow: /*.xls$
Disallow: /*.pdf$

 

 

To exclude all files except one
The easy way is to put all the files to be disallowed into a separate directory, say “logo”, and leave the one file in the level above this directory:
User-agent: *
Disallow: /~web-design/logo/

 

To block all URLs that include a ? but allow those that end with a ?
User-agent: *
Allow: /*?$
Disallow: /*?

 

 

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?). When a URL matches both rules, Google applies the more specific (longer) rule, so the Allow takes effect for URLs that end in a ?.
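
The way these two directives combine can be sketched in Python. The code below assumes the precedence Google documents (the longest matching pattern wins, and Allow wins a tie) plus the `*` and `$` wildcard convention. The helper names `matches` and `is_allowed` are invented for this example, and it is a simplified model rather than Googlebot's real matcher.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Wildcard match: '*' is any run of characters, trailing '$' is end of URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None


def is_allowed(rules, path: str) -> bool:
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs for one User-agent group.
    Assumed precedence: longest matching pattern wins, Allow wins ties.
    """
    best_length = -1
    best_is_allow = True  # no matching rule means the URL may be crawled
    for directive, pattern in rules:
        if not pattern or not matches(pattern, path):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_length
        tie_for_allow = len(pattern) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(pattern)
            best_is_allow = is_allow
    return best_is_allow


# The two directives discussed above.
RULES = [("Allow", "/*?$"), ("Disallow", "/*?")]

for path in ["/page?", "/page?id=1", "/page"]:
    print(path, "allowed" if is_allowed(RULES, path) else "blocked")
```

The same function handles the folder-with-one-exception case: with `Disallow: /web-design/` and `Allow: /web-design/logo.html`, the longer Allow pattern wins for logo.html only.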

 

Mediapartners-Google Robot

 

To prevent pages on your site from being crawled while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google:
User-agent: *
Disallow: /
User-agent: Mediapartners-Google
Allow: /
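
To see how per-robot groups like these are resolved, the following sketch feeds the same rules to Python's standard `urllib.robotparser`. The page URL and the crawler name `ExampleBot` are hypothetical, and the result shows this parser's interpretation only, which is assumed here to agree with Google's for such simple rules.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, with a blank line between the two groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parsed in memory, no network access

page = "https://example.com/article.html"  # hypothetical page URL

# The AdSense crawler gets its own group; every other robot falls back to "*".
adsense_ok = parser.can_fetch("Mediapartners-Google", page)
other_ok = parser.can_fetch("ExampleBot", page)  # made-up crawler name

print("Mediapartners-Google may crawl:", adsense_ok)
print("ExampleBot may crawl:", other_ok)
```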

 

 

Robots.txt and Google Webmasters

Google Webmaster Tools lets you see how Google crawls and indexes your website. It provides detailed reports about your pages’ visibility on Google. To get started, simply add and verify your site and you’ll start to see information right away.

 

 

To see all the URLs that Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.

 

 


 

 

 

The Index Status page also provides stats about how many of your URLs Google was able to crawl and/or index. On the Advanced tab it shows the number of pages crawled, the number of pages that Google knows about but has not crawled because they are blocked by robots.txt, the number of pages that were not selected for inclusion in Google’s results, and the number of URLs removed from Google’s search results as a result of a URL removal request.

 

 


 

 

 

Robots failure and robots.txt unreachable

Before Googlebot crawls your site, it accesses your robots.txt file to determine whether your site is blocking Google from crawling any pages or URLs. If your robots.txt file exists but is unreachable (in other words, it doesn’t return a 200 or 404 HTTP status code), Google postpones the crawl rather than risk crawling disallowed URLs. When this happens, Googlebot will return to your site and crawl it as soon as it can successfully access your robots.txt file.
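
The decision described above can be written as a small Python function. This is a simplified sketch of the behavior as this article describes it, not Googlebot's actual logic, and the names `CrawlDecision` and `decide_from_robots_status` are invented for the example.

```python
from enum import Enum


class CrawlDecision(Enum):
    FOLLOW_RULES = "parse robots.txt and obey its rules"
    CRAWL_ALL = "no robots.txt exists, so nothing is blocked"
    POSTPONE = "robots.txt is unreachable, so try again later"


def decide_from_robots_status(status_code: int) -> CrawlDecision:
    """Map the HTTP status of the robots.txt request to a crawl decision."""
    if status_code == 200:
        return CrawlDecision.FOLLOW_RULES
    if status_code == 404:
        return CrawlDecision.CRAWL_ALL
    # Any other status (for example a server error) is treated as
    # "unreachable": crawling is postponed rather than risking blocked URLs.
    return CrawlDecision.POSTPONE


for code in (200, 404, 503):
    print(code, decide_from_robots_status(code).value)
```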

 

 

If Googlebot cannot access your robots.txt file, Google Webmaster Tools also shows the total number of robots.txt fetch errors.

 

 


 

 

 

 

 

A robots.txt file can help your search engine rankings when used correctly, and it can keep error pages (404 Not Found pages) on your website from being indexed. You can also use a robots.txt analysis tool to evaluate changes to the file and confirm that they still allow the access to your website that you intend.

 

 

