Friday, April 11, 2008
Google is constantly trying new ideas to improve our coverage of the web. We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages, and today, we would like to talk about another new technology we've started experimenting with recently.
In the past few months we have been exploring some HTML forms to try to discover new web pages and
RLs that we otherwise couldn't find and index for users who search on Google. Specifically, when
we encounter a <form>
element on a high-quality site, we might choose to do a
small number of queries using the form. For text boxes, our computers automatically choose words
from the site that has the form; for
select menus, check boxes, and radio buttons on the form, we choose from among the values of
the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that
correspond to a possible query a user may have made. If we ascertain that the web page resulting
from our query is valid, interesting, and includes content not in our index, we may include it
in our index much as we would include any other web page.
Needless to say, this experiment follows good Internet citizenry practices. Only a small number of
particularly useful sites receive this treatment, and our crawl agent, the
ever-friendly Googlebot,
always adheres to robots.txt, nofollow
, and noindex
rules. That means that if a search
form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate.
Similarly, we only retrieve GET
forms and avoid forms that require any kind of user
information. For example, we omit any forms that have a password input or that use terms commonly
associated with personal information such as logins, userids, contacts, etc. We are also mindful
of the impact we can have on web sites and limit ourselves to a very small number of fetches for a
given site.
The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.
This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.