One of the biggest search challenges has long been that the major search

engines like Google cannot crawl material that can only be retrieved through the

use of forms. Now Google is filling out those forms to obtain the information

previously hidden, the company has announced.

Google says that for the past few months, it has been filling in forms on a

"small number" of "high-quality" web sites to get back information. What words has it been entering into those forms? Words automatically selected that occur

on the site, with check boxes and drop-down menus also being selected:

In the past few months we have been exploring some HTML forms to try to

discover new web pages and URLs that we otherwise couldn’t find and index for

users who search on Google. Specifically, when we encounter a <FORM> element

on a high-quality site, we might choose to do a small number of queries using

the form. For text boxes, our computers automatically choose words from the

site that has the form; for select menus, check boxes, and radio buttons on

the form, we choose from among the values of the HTML.
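The process Google describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's actual crawler: the names `FormParser` and `candidate_urls`, the `site_words` list, and the `limit` cap are all hypothetical, and the sketch handles only text boxes and select menus on a single form.

```python
from html.parser import HTMLParser
from itertools import islice, product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect a form's action, its text boxes, and its select-menu values."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.text_inputs = []        # names of text boxes
        self.selects = {}            # select name -> list of option values
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
        elif tag == "input" and a.get("type", "text") == "text" and a.get("name"):
            self.text_inputs.append(a["name"])
        elif tag == "select" and a.get("name"):
            self._current_select = a["name"]
            self.selects[self._current_select] = []
        elif tag == "option" and self._current_select and a.get("value"):
            self.selects[self._current_select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


def candidate_urls(page_url, html, site_words, limit=5):
    """Build a small number of GET query URLs for the form found in `html`."""
    parser = FormParser()
    parser.feed(html)
    # Text boxes draw from words found on the site (assumed to be given);
    # select menus draw from the values listed in the HTML.
    fields = [(name, site_words) for name in parser.text_inputs]
    fields += [(name, values) for name, values in parser.selects.items()]
    names = [name for name, _ in fields]
    combos = product(*(values for _, values in fields))
    base = urljoin(page_url, parser.action)
    # Cap the number of queries, mirroring the "small number" in the quote.
    return [base + "?" + urlencode(dict(zip(names, combo)))
            for combo in islice(combos, limit)]
```

Each returned URL could then be fetched and crawled like any other page. How words are actually picked from the site is not described in the announcement, so the sketch simply takes them as input.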

Results returned are then crawled. Ironically, it was just over a year ago

that Google warned against getting search results like these indexed. Now it’s actually

generating and crawling those results itself.

Don’t want Google doing this to your site? Google says that forms blocked

through robots.txt or meta robots instructions won’t be accessed. In addition,

some other forms won’t be touched if they fit certain

technical criteria:

We only retrieve GET forms and avoid forms that require any kind of user

information. For example, we omit any forms that have a password input or that

use terms commonly associated with personal information such as logins,

userids, contacts, etc.
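For site owners who want to reason about whether a form might be touched, the stated rules can be approximated in a few lines. This is a minimal sketch, assuming the HTML snippet contains a single form. The helper names `allowed_by_robots` and `is_crawlable_form` and the `PERSONAL_TERMS` list are hypothetical, and Google's real checks are certainly more involved than this.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical term list, taken from the examples in the quote above.
PERSONAL_TERMS = ("login", "userid", "contact")


def allowed_by_robots(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


class FormInspector(HTMLParser):
    """Record the form method, input types, and field names in a snippet."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.input_types = []
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.method is None:
            # HTML forms default to GET when no method is given.
            self.method = (a.get("method") or "get").lower()
        elif tag in ("input", "select", "textarea"):
            if tag == "input":
                self.input_types.append((a.get("type") or "text").lower())
            if a.get("name"):
                self.field_names.append(a["name"].lower())


def is_crawlable_form(html):
    """Apply the quoted criteria: GET only, no password, no personal terms."""
    inspector = FormInspector()
    inspector.feed(html)
    if inspector.method != "get":
        return False
    if "password" in inspector.input_types:
        return False
    return not any(term in name
                   for name in inspector.field_names
                   for term in PERSONAL_TERMS)
```

For example, a robots.txt containing `User-agent: *` and `Disallow: /search` would make `allowed_by_robots` return False for any URL under `/search`, which is the simplest way to keep a search form's results out of reach.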

The move is potentially good for searchers, in that it will open up material

often referred to as part of the "deep web" or "invisible

web" because it has been hidden behind forms. Search Engine Land executive editor

Chris Sherman actually co-authored a book on the topic. He and fellow author

Gary Price didn’t coin the term invisible web, but they certainly helped

popularize it.

It should be noted that Google’s not the first to do something like this.

Companies like Quigo, BrightPlanet, and WhizBang Labs were doing this type of

work years ago, but it never carried over to the major search engines. Now

chapter two of surfacing deep web material is opening, this time with a major

search player, and in that respect Google is being a pioneer.