Indeed, it seems that Google IS forgetting the old Web

XML pioneer and early blogger Tim Bray says that Google maybe suffers of deliberate memory loss. I may have found more evidence that this is the case.

Bray writes that: “I think Google has stopped in­dex­ing the old­er parts of the We­b. I think I can prove it. Google’s com­pe­ti­tion is do­ing bet­ter.”

Bray’s case: a 2006 review

Back in 2006, Bray pub­lished on his own blog a re­view of a Lou Reed album. A few days ago, when he needed to quote again the URL of that review, he realized that he himself couldn’t find it via Google “Even if you read them first and can care­ful­ly con­jure up exact-match strings, and then use the “site:” pre­fix”.

My own case, again from 2006

Back in 2006, I published on one of my domains, digifreedom.net, the opinion piece “Seven Things we’re tired of hearing from software hackers”. A few years later, for reasons not relevant here, I froze that whole project. One unwanted consequence was that the “Seven Things”, together with other posts, were not accessible anymore. I was able to put the post back online only in December 2013, at a new URL on this other website. Last Saturday I needed to email that link to a friend and I had exactly the same experience as Bray: Google would only return links to mentions, or even to whole copies, but archived elsewhere. I asked Google to reindex this whole website, but nothing changed. Yesterday afternoon, through BoingBoing I discovered Bray’s post. As soon as I read it, I tried DuckDuckGo and got the same result: Google ignores my copy of my own post, DuckDuckGo correctly lists it as first result (click here for high-resolution screenshot):

Different stories, same practical result?

Unlike Bray’s, my own post disappeared from the Web for a while, and then reappeared with the original date, but only after a few years, and in a different domain. This is an important difference which may mean that, in my case, part of Google’s failure is my own fault. Still, for all practical purposes, the result is the same:

DuckDuckGo gives as first result the most, if not the only correct answer to whoever would be interested in that post today: the current link to the original version, on the (current) website of its author. DuckDuckGo gets things right. Google does not (not at the time of writing, of course).

Bray concludes that Google deliberately forgets old pages, because “in­dex­ing the whole Web is crush­ing­ly ex­pen­sive, and get­ting more so ev­ery day” and Google “cares about giv­ing you great an­swers to the ques­tions that mat­ter to you right now”. His conclusion is that, if the Web should be “a per­ma­nen­t, long-lived store of humanity’s in­tel­lec­tu­al her­itage… it needs to be in­dexed, just like a li­brary. Google ap­par­ent­ly doesn’t share that view.”

I agree with that conclusion. For the same reason, I also find misleading the title of BoingBoing’s report of this story: “Google’s forgetting the early web”. The two posts mentioned here are not “early web”, nor really “old”. Unless we’re all missing something here, it seems more correct to say that Google forgets stuff that is more than 10 years old. If this is the case, Google will remember and index a smaller part of the web every year. Google may do so simply because it would be impossible to do more, for economical and/or technological constraints, which sooner or later would also hit its competitors. But this only makes bigger the problem of what to remember, what to forget and above all who and how should remember and forget.