February 10, 2006
Earlier this week, we told you about a feature we made available through the Sitemaps program that analyzes the robots.txt file for a site. Here are more details about that feature.
What the analysis means
The Sitemaps robots.txt tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed.
This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard. It understands Allow: lines, as well as * and $. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.
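For instance, a robots.txt file along these lines (the paths here are just placeholders for illustration) combines all three extensions:

User-Agent: Googlebot
Disallow: /private/
Disallow: /*.pdf$
Allow: /private/summary.html

Googlebot should skip URLs under /private/ and any URL ending in .pdf, but still crawl /private/summary.html because of the more specific Allow: line. A crawler that supports only the basic standard may treat these lines quite differently.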
Subdirectory sites
A robots.txt file is valid only when it's located in the root of a site. So, if you are looking at a site in your account that is located in a subdirectory (such as https://www.example.com/mysite/), we show you information on the robots.txt file at the root (https://www.example.com/robots.txt). You may not have access to this file, but we show it to you because the robots.txt file can impact crawling of your subdirectory site and you may want to make sure it's allowing URLs as you expect.
Testing access to directories
If you test a URL that resolves to a file (such as https://www.example.com/myfile.html), this tool can determine if the robots.txt file allows or blocks that file. If you test a URL that resolves to a directory (such as https://www.example.com/folder1/), this tool can determine if the robots.txt file allows or blocks access to that URL, but it can't tell you about access to the files inside that folder. The robots.txt file may set restrictions on URLs inside the folder that are different from those on the folder URL itself.
Consider this robots.txt file:
User-Agent: *
Disallow: /folder1/

User-Agent: *
Allow: /folder1/myfile.html
If you test https://www.example.com/folder1/, the tool will say that it's blocked. But if you test https://www.example.com/folder1/myfile.html, you'll see that it's not blocked even though it's located inside of folder1.
Syntax not understood
You might see a "syntax not understood" error for a few different reasons. The most common one is that Googlebot couldn't parse the line. However, some other potential reasons are:
- The site doesn't have a robots.txt file, but the server returns a status of 200 for pages that aren't found. If the server is configured this way, then when Googlebot requests the robots.txt file, the server returns a page. However, this page isn't actually a robots.txt file, so Googlebot can't process it.
- The robots.txt file isn't a valid robots.txt file. If Googlebot requests a robots.txt file and receives a different type of file (for instance, an HTML file), this tool won't show a syntax error for every line in the file. Rather, it shows one error for the entire file.
- The robots.txt file contains a rule that Googlebot doesn't follow. Some user-agents obey rules other than the robots.txt standard. If Googlebot encounters one of the more common additional rules, the tool lists them as syntax errors (see the example after this list).
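For example, a directive such as Crawl-delay:, which some other crawlers support but which isn't part of the basic standard and isn't followed by Googlebot, would be reported this way. A made-up file illustrating this:

User-Agent: *
Crawl-delay: 10
Disallow: /cgi-bin/

The Disallow: line is processed normally; only the Crawl-delay: line would be flagged.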
Known issues
We are working on a few known issues with the tool, including the way it handles capitalization and the way it analyzes Google user-agents other than Googlebot. We'll keep you posted as we get these issues resolved.