Robots.txt is a text file, usually created by webmasters, that tells web robots (mainly search engine robots such as Google's spider) how to crawl the pages of a website. Robots.txt is part of the REP.
Are you wondering what REP is? REP, the Robots Exclusion Protocol, is a group of standards that decides how a web robot should behave. There are two types of directives that help us get the results we want: crawler directives and indexer directives. They are different, but both matter to what we get back when we do a search.
First, we will talk about the crawler. Crawler is just another name for spider. Spiders are robots, or programs, that collect the words found on a website. Put simply, crawlers fetch files and scripts from the web server and store them in a database. That is the end of the spider's job; after fetching everything, it can relax a bit.
The next job belongs to the indexer: it indexes the pages the spiders brought back and also decides how to rank those pages.
Now we will have a slightly more detailed look at what the crawler and indexer directives are.
Coming to crawler directives: they suggest to crawlers what they should and shouldn't crawl, which makes the crawlers' job pretty easy. Examples of this type of directive are robots.txt and sitemaps.
Coming to indexer directives, all of them require crawling first. In fact, you won't find a separate indexer-directive file on your website today; maybe in the future you will, but not for now. Examples of this type of directive are REP tags and microformats.
Because the REP directives relevant to search engine crawling, indexing and ranking are defined at different levels, search engines have to follow a kind of command hierarchy.
The image above describes how REP or Robots Exclusion Protocol works.
The robots.txt file indicates whether search engines can or cannot crawl your web pages. It can not only block entire pages from search engines, it can also keep a particular section away from search engine crawlers. That doesn't mean search engines cannot see or index those sections, but well-behaved crawlers avoid them. If we don't want those sections crawled, why should search engines be bothered with them?
The basic format of robots.txt looks like this:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
This is a set made up of a user agent and its directives. A robots.txt file can contain many sets like this.
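If you want to experiment with how a crawler reads such a set, Python's standard library includes a small REP parser. Here is a minimal sketch; the bot name and path are made-up values filled in for the placeholders above:

```python
from urllib.robotparser import RobotFileParser

# One user-agent/directive set, with made-up values standing in
# for [user-agent name] and [URL string not to be crawled].
robots_txt = """\
User-agent: BadBot
Disallow: /secret/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("BadBot", "/secret/data.html"))  # False
print(rp.can_fetch("BadBot", "/public/data.html"))  # True
```

This is the same logic a polite crawler applies before fetching each URL.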
The most common user agent is *. If you use this as the user agent, the rules that follow will apply to all user agents.
For example, in your robots.txt you write:
Now, what do these two lines do? The first line says that the rule is for all user agents. The second line then tells them not to crawl the pages or links inside the about folder. That is the end of the command; from it, the search engine robots learn that they shouldn't crawl any page in that folder.
The image below makes this a little clearer.
You can give instructions about a folder, a page, or anything else you want. I have mentioned the steps, but what do you think a user agent is? You may have guessed it already: a user agent is a bot or program used by a search engine. The main robots are:
You may have worked out which bot belongs to which search engine. The first few bots are Google's. Google has a crawler that visits the site, and the details it collects are used to show results to users. The next one is for mobile; it may be checking whether sites are responsive. Google has two more: one for media partners and one for ads. When you develop a website, you will surely come across these kinds of robots. Who do you think owns Slurp? It's Yahoo; Slurp is the bot Yahoo uses to crawl websites for its search. The next ones belong to what may be the biggest name on the web after Google: msnbot and msnbot-media are used by Bing. You may not have heard of Teoma, or you may have heard of it as a virus. Teoma is actually a search engine like Google or Bing; however, its programs sometimes install themselves in your computer or browser, which makes users call it a virus.
The image above says a lot more about SEO. It is from Buzzfeed.com, and in the first section they are giving instructions to msnbot. The first line may not be familiar to us. It says:
Maybe this is a new command for you; even for many SEO experts it is still unfamiliar. The Crawl-delay command tells the bot to wait (here, 120 seconds) between requests before crawling. The drawback of this command is that many search engines do not support it; most importantly, Google does not.
Some bots, like msnbot, discobot and Slurp, are called out specifically, so those bots will only listen to the commands in their own groups; the * group's commands won't apply to them. All other bots will follow the * group's commands.
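Both behaviors can be checked with Python's standard-library parser. The rules below are a sketch modeled on the description of the Buzzfeed file, not a copy of it:

```python
from urllib.robotparser import RobotFileParser

# Sketch (hypothetical rules): msnbot gets its own group with a
# crawl delay, and every other bot falls back to the * group.
robots_txt = """\
User-agent: msnbot
Crawl-delay: 120
Disallow: /badpage/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# msnbot obeys only its own group: /badpage/ is off limits,
# but the * group's "Disallow: /" does not apply to it.
print(rp.can_fetch("msnbot", "/badpage/"))   # False
print(rp.can_fetch("msnbot", "/other/"))     # True
print(rp.crawl_delay("msnbot"))              # 120

# Any other bot falls into the * group and is blocked entirely.
print(rp.can_fetch("SomeOtherBot", "/other/"))  # False
```

Note that honoring the delay is still up to the crawler; the parser only reports what the file asks for.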
Now, how would you block all web crawlers or spiders from your site? All you need to do is write two lines in robots.txt:
With this command you can easily disable access to the whole website. If instead you want to allow access to your whole website (access is allowed by default, but if you run into problems you may need this command):
Keeping Disallow: empty tells web crawlers that they are permitted to crawl any page they wish.
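You can confirm the difference between the two variants with the same standard-library parser:

```python
from urllib.robotparser import RobotFileParser

# Block-all: every crawler is shut out of the whole site.
block_all = ["User-agent: *", "Disallow: /"]
# Allow-all: an empty Disallow value permits everything.
allow_all = ["User-agent: *", "Disallow:"]

blocked = RobotFileParser()
blocked.parse(block_all)
allowed = RobotFileParser()
allowed.parse(allow_all)

print(blocked.can_fetch("Googlebot", "/any/page.html"))  # False
print(allowed.can_fetch("Googlebot", "/any/page.html"))  # True
```

One character ("/") is all that separates "crawl nothing" from "crawl everything", which is why typos here are so dangerous.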
Now, how do you block a web crawler from a specific folder? As an example, let's take Googlebot as our user agent.
The main thing to remember is that the Disallow value always starts, after the colon and a space, with "/". The "/" symbol points to the root folder of your website (we will talk about the root folder shortly). In this case you also need the same symbol after your subfolder name; that second "/" points to all the files inside that folder.
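Here is the same idea as a runnable sketch; the subfolder name is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: keep Googlebot out of one subfolder only.
robots_txt = """\
User-agent: Googlebot
Disallow: /example-subfolder/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Everything under the subfolder is blocked for Googlebot...
print(rp.can_fetch("Googlebot", "/example-subfolder/page.html"))  # False
# ...while the rest of the site, and other bots, are unaffected.
print(rp.can_fetch("Googlebot", "/other-folder/page.html"))       # True
print(rp.can_fetch("Slurp", "/example-subfolder/page.html"))      # True
```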
Now, how can we block just one web page from a web crawler?
It is even easier: you just need to mention the path of that file.
You can see the page's name and URL in your web browser's address bar when you visit that page.
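A quick sketch of the single-page case, with a made-up page name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule blocking a single page rather than a folder.
robots_txt = """\
User-agent: *
Disallow: /blocked-page.html
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/blocked-page.html"))  # False
print(rp.can_fetch("Googlebot", "/open-page.html"))     # True
```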
Now, what is a root folder? In everyday speech we just say "it's in the root"; we don't even say "root folder", because we know that everything starts from the root. If we say root, we mean the root folder. Didn't get it?
Okay, do you have the Chrome web browser? If you don't, I suggest you download it, because it has great support for developers and helps beginners a lot. I used to make changes to my blog's CSS using Chrome's developer tools.
The first thing to do is go to any blog and open one of its articles. You can open some other page instead, but I suggest one of the posts. After opening the post, right-click on the page and a menu box will pop up.
You may have used this menu to download photos from Google Images or other websites, or surely for some other purpose, but this will look a bit different from downloading images. The first option is Back, which takes you to the last page you were on. Next is Forward: if you have gone back from a page, this option moves you forward again.
Then there are a couple of options like Save as, to save the web page to your computer, and a Print option, with which you can print the page, save it as a PDF, or make a paper copy if you have a printer. The next one is Cast, a slightly more advanced option used to cast the page to your phone or a nearby device. You can find out more about casting on YouTube, so if you want to know more, try there.
Then there is the option to translate the page to English. Up to this point, all the options are for ordinary web users; any web user can easily use them. But when it comes to the next two options, most web users don't know what they are for. Before I studied coding, it was the same for me. You don't try to find out what they do; you just ignore them. I know, because that's exactly what I did.
The next option is a bit different, and at first you won't understand it at all. While View Page Source shows the code of the website, Inspect Element pops up a narrow, full-height panel. The image below gives you an idea of how it looks.
You can see a panel on the right with lots of code, boxes and text in different colors. All of this is what you get from Inspect Element. What is the right-hand panel showing? It is just the code, the same code we got from View Page Source; but while View Page Source shows it all in one color, here there are many colors. To the right of that there is another box with many layers. Look carefully and you will see that each layer has a different color and a label on it. The inner box with the blue fill shows the height and width of the element you are focused on. To know which element that is, look at the lines of code beside it: one line will be highlighted in blue, and that is your element, usually some kind of div or container. You don't need to study all of this now, but understand the basics: the blue-filled box is the content box, the green-filled box is the padding, the light cream color is the border, and the next one is the margin; above them all you will notice the position. To learn more, try searching for "box model" in a search engine. You can also find videos about the box model on YouTube; I found a simple explanation there and thought it would be great to share it with you.
Above all the code, do you see a navigation bar? First there is a mouse-cursor button, and then an icon of a mobile and a tablet. If you click that icon you can see the site you are on as it looks on mobile; you can pick the device you want from the drop-down or set any size you like. After that icon there is another section with tabs: the first is Elements, then Console, then Sources, then Network, Performance and some others. Leave all the rest; all you need is the Sources tab. Click on Sources, and you will get something similar to the image below.
You will get a white panel on the left and a blank area on the right. Sometimes it won't be blank, but the content is the same.
Now, what is this? This is what the folder structure looks like. In that white panel you will see the domain name of the site you are on. Click the arrow next to it to expand the view; sometimes it expands automatically. When you expand it you will see some folders. If the site or blog is built on WordPress, you will see something similar to this: inside the domain name there are folders called wp-content and wp-includes. As I said, this only shows up when you are inspecting a WordPress blog. You may also see a file or a folder; in the image above it is a folder named "blouse-neck-designs-with-patch-work-new-ones". That whole text is the slug of the page; the slug is the URL part of the page. On more optimized blogs it may be a file instead, like "blouse-neck-designs-with-patch-work-new-ones.html"; the extension may differ, or there may be no extension at all. The reason I asked you to open this is that the domain name you see is the main folder, or the root, of the website. All the files are stored there by default. It is like a folder on your computer. Imagine you are creating folders for your family photos and you want to store each family member's images separately. First you make a folder named Family Photos, which we can compare to the domain name of a website. Then you create a folder named Ann, maybe for your sister; after creating hers, you make one for your brother Justin. Once both are done, you create a folder for your mother's relatives and another for your father's relatives. You also create one more folder inside Ann's folder, named Friends, to store the photos of her friends. So now you have created a couple of folders.
Now suppose you are coding and you want to display one of the photos of a friend of your sister Ann. What will you do?
You will have to mention the folder's address. To make it easy, you create a folder for the website and copy the Family Photos folder into it. Now you want to write the address of where the photo is stored. Should you mention that it is in C:, then in Documents, then in a folder called coding, and so on through every folder until you finally reach Friends? You don't need to; doing so wastes your time and, if there are a lot of images, can even hurt the site's SEO. You have copied the photos into the website folder, and your code file is in that same folder, so why write the address from the very beginning? All you have to do is write the address starting from the website folder. But how? You first put the folder name Family Photos, because that is where you stored it, then the folder Ann, then Friends, and then the image, maria.jpeg. This way, linking files becomes much easier. Files and folders are stored in exactly the same way on a web server (the place where all the files of a website live). I think you get the idea.
By default, if you want to start a path from the root folder, you just need to begin it with "/".
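You can see this root-relative behavior with Python's standard-library URL resolver; the domain and paths below are made up to match the family-photos example:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post.html"  # a made-up page URL

# A path starting with "/" is resolved from the root folder...
print(urljoin(base, "/family-photos/ann/friends/maria.jpeg"))
# https://example.com/family-photos/ann/friends/maria.jpeg

# ...while a path without "/" is resolved from the current folder.
print(urljoin(base, "family-photos/ann/friends/maria.jpeg"))
# https://example.com/blog/family-photos/ann/friends/maria.jpeg
```

Browsers resolve the paths in your HTML the same way.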
Now how does robots.txt work?
All search engines have two main jobs to complete before they display results to us:
To crawl sites, search engines follow links to get from one site to another; ultimately, crawling happens across many billions of links and websites. This crawling is also known as spidering.
Even these crawlers are a bit lazy: before spidering the whole website, the crawler looks for a robots.txt file in the website's root folder. If it finds one, the crawler first reads and understands that file before it crawls the website. Since the robots.txt file carries so much information for those crawlers, it guides them through the whole procedure.
Now, what happens if a website does not have a robots.txt file? Then the search engine robot will simply scan and crawl all the information on the website. Anything you don't mention in robots.txt, no search engine robot is going to treat as forbidden; it is like saying "do whatever you want".
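This default-allow behavior is easy to demonstrate: parsing an empty rule set leaves everything crawlable.

```python
from urllib.robotparser import RobotFileParser

# With no robots.txt content at all, nothing is disallowed.
rp = RobotFileParser()
rp.parse([])  # behaves like a site with no robots.txt rules

print(rp.can_fetch("Googlebot", "/anything/at/all.html"))  # True
```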
The image above is what the robots.txt file of facebook.com looks like. You can easily access it by entering http://www.facebook.com/robots.txt, and you will get this result.
According to Facebook's robots.txt file, it first instructs Applebot, telling it not to scan files inside ajax, checkpoint and lots more. It also lists some other user agents that we may not know. The user agents in Facebook's file are:
We already know some of them, but not all: Googlebot, msnbot, Slurp and Teoma. Just those few.
Now, what is Applebot? You may have guessed: Applebot is the web crawler for Apple, used by many of its products, including Siri and Spotlight Suggestions. The notable thing about Applebot is that if a robots.txt file doesn't mention instructions for Applebot but does for Googlebot, Applebot will follow the instructions given to Googlebot.
If you are a web explorer, you will surely have heard about Baidu too.
Baidu is one of the main Chinese search engines out there, and Baiduspider is the bot it uses to index the web. Maybe you haven't heard of this search engine, but China certainly has, and so have many others: according to Alexa, baidu.com has a global rank of 4 and a rank of 1 in China.
The next one, which you may not have heard of, is ia_archiver. It is owned by a company we just mentioned: Alexa. Alexa uses ia_archiver to scan and crawl the web for details and to rank web pages and websites; this is how alexa.com can display details about someone's traffic when you search there.
The next one you may not have noticed is Naverbot, the bot of Naver, a South Korean search engine with a global rank of 113 and a rank of 2 in South Korea. It may not be as famous as Baidu; after all, China has the largest population in the world, and if everyone there uses the local search engine it can easily become the top one. I don't think anyone in the USA will use the Baidu search engine; I never will, for sure, not because I hate China but because I don't know Chinese. You can get more information about Naver from its Alexa ranking.
The next one to think about is SeznamBot, owned by Seznam, the search engine of the Czech Republic, with the domain name seznam.cz. It has an Alexa rank of 3 in the Czech Republic and a global rank of 466. The site is not very famous, but to me it looks better and a bit more stylish than Baidu and Naver; many of you may have a different opinion. According to Alexa, 92.9% of its visitors are from the Czech Republic, while Slovakia delivers around 2% of its total traffic with a country rank of 80 there; even the UK provides 0.7% of its total traffic.
Then the next one we should know about is Twitterbot, which Twitter uses for the best user experience on its site. You may also see Yandex, the bot owned by the Yandex search engine. Yandex is fairly popular; it even has an app store, and registration as a publisher on the Yandex app store is free, while the Google Play Store charges a one-time fee of $25.
Next you will come across something called Yeti. Don't think too hard: Yeti is a bot used by the same owner as Naverbot, the Korean site Naver. Maybe they use it for a different purpose, but I still lack the details because I really can't read Korean. If you want to research it properly, check Google or other search engines for more information.
We were talking about the important things you need to remember when creating and managing a robots.txt file.
It is considered good practice to indicate the location of any sitemap associated with your domain at the bottom of the robots.txt file. An example of doing so is given in the image above.
Now, what are sitemaps? Sitemaps are simply an easy way for webmasters to inform search engines (or their bots) about the pages on their sites that are available for crawling.
Expanding on that, a sitemap is an XML file that contains a list of the URLs of a site along with some additional metadata about each URL: when it was last updated, how often it changes, how important it is, and more.
Web crawlers or spiders usually discover pages from the links within a site, but sitemaps make this job easier by allowing crawlers that support them to pick up all the URLs in the sitemap and learn about those URLs from the associated metadata.
However, even the creators of the sitemap standard do not guarantee that web pages will be included in search engines just because they appear in a sitemap; sitemaps simply provide hints that help crawlers do a better job of crawling a site.
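The Sitemap line at the bottom of robots.txt is machine-readable too. A sketch with a made-up domain (note that `site_maps()` is available in Python 3.8 and later):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt ending with a Sitemap line, as
# recommended above; the domain is made up.
robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# site_maps() collects every Sitemap line found in the file.
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```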
Robots.txt files control crawler access to certain areas of your site, and using robots.txt without proper knowledge can be very dangerous. If you accidentally disallow Googlebot from crawling your entire site, Google may stop showing your site to its users. So be careful.
You are in the right place for creating this .txt file. All you have to do is use our robots.txt generator, and you will be able to create a robots.txt file without any damage.
This is how our tools page looks. First, select whether to allow or disallow all robots. You can disallow all search engine bots and then allow only the required ones; that way you make sure only the search engines you need are allowed. What I usually do is allow all of them, because I don't think any viruses are going to get onto my site via crawling; but some people think otherwise, and if you are one of them, you can select Disallow in that section.
Next, set the crawl delay; note that some search engine bots, like Googlebot, won't listen to this command, so I usually select no delay. You can also enter your sitemap URL if you wish to add one; if you don't have one or don't want it, simply leave the field empty.
In the next section you can choose whether to allow or disallow each bot. We list a large variety of bots so that you can select all the ones you want and disallow all the ones you don't want searching or crawling your site.
The next step is to enter the directories you want to disallow. After entering all the details, click the Create robots.txt button; if you wish to clear everything you entered, just press the Clear button to its right. When you click Create, a browser popup will ask whether to create the robots.txt file; press OK to get it. The generated robots.txt text will then appear in the box below the Create and Clear buttons. Copy the text, create a robots.txt file, and paste the text into it. Before leaving the site, make sure you have allowed and disallowed the right bots.