Tuesday, January 6, 2009

On Google Disallowing Crawling of Their LIFE Hosting

Isn’t it great how Google makes millions of photos available to the world in their LIFE photo archive? Well, with one exception: they don’t allow other search engines to access these photos. The kind of access that lets Google Image Search crawl other sites’ photo collections is not something Google grants to competing image search engines for its own photo hosting. Here’s the relevant part of Google’s robots.txt that says so:

Disallow: /hosted/images/

Disallow: /hosted/life/

In crawler speak, this means: stay out of our images directory! Indeed, if there’s any site in the world where the owners don’t need to think about SEO in traditional terms, it’s Google.
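To make the "stay out" part concrete, here is a rough sketch of how a well-behaved third-party crawler might read those two rules, using Python's standard `urllib.robotparser`. Some assumptions on my side: the `User-agent: *` line is added so the rules form a complete group (the excerpt above only shows the Disallow lines), and the bot name and example paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow lines are quoted from the post. The "User-agent: *" line
# is an assumption, added so the parser has a complete rule group to apply.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hosted/images/
Disallow: /hosted/life/
"""


def is_allowed(path, agent="ExampleImageBot"):
    """Return True if `agent` may fetch `path` under the rules above.

    "ExampleImageBot" is a hypothetical crawler name, not a real one.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    for path in ("/hosted/life/some-photo",
                 "/hosted/images/some-image.jpg",
                 "/some/other/page.html"):
        verdict = "allowed" if is_allowed(path) else "blocked"
        print(path, "->", verdict)
```

Anything under the two hosted directories comes back as blocked for a crawler that honors robots.txt, while paths outside them are not affected by these particular rules.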

LIFE images do still appear in Google Image Search, as they have ever since the service was released in November last year. Google’s own programs can access whatever Google stores without looking at any robots.txt.

When I asked Google for their reasoning behind barring external image search engines from accessing their site, I received a reply, but it didn’t actually provide a reason:

While Google allows crawling of many of its own properties from Blogger to Knol, the LIFE photo archive is not available for crawling at this time. To learn more about the licensing or merchandising of these images, visit www.timelifepictures.com.

Andy Baio tells me, “It’s disappointing that Google gets exclusive access to index these images and every other search engine is out of luck. Exclusivity like this doesn’t seem in line with Google’s philosophy. To me, it’d be like the Flickr Commons only allowing Yahoo crawlers or Picasa blocking access to anything but Googlebot.”



