Note: If this speculation is off the mark, then this is why Google should care about Reader.

The NY Times recently ran an interesting piece with a behind-the-scenes look into the world of Google search and its ranking gurus Udi Manber and Amit Singhal [#1]. One part of the article that caught my eye was the discussion of the “freshness” problem.

Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.

According to the article, Mr. Singhal’s solution to this problem is something he calls QDF, or “query deserves freshness”. The aim is to show fresh results for queries that are topical to recent events.

The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject.
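
To make the “hot topic” idea concrete, here is a toy sketch of my own. It is not Google’s model; the article gives no implementation details. It assumes we have per-topic counts of news/blog mentions and of search queries for a recent window and for an equally long baseline window, and it calls a topic hot when the weighted spike crosses a threshold. The function names, the 0.7 query weight, and the threshold of 3.0 are all made-up assumptions; the heavier query weight only reflects the article’s remark that the query stream is the better signal.

```python
def spike_ratio(recent: int, baseline: int) -> float:
    """How many times busier the recent window is than the baseline.

    Adds 1 to the baseline so a brand-new topic does not divide by zero.
    """
    return recent / (baseline + 1)


def deserves_freshness(
    topic: str,
    recent_mentions: dict,
    baseline_mentions: dict,
    recent_queries: dict,
    baseline_queries: dict,
    query_weight: float = 0.7,   # hypothetical: queries count more than mentions
    threshold: float = 3.0,      # hypothetical cutoff for "hot"
) -> bool:
    """Return True if the topic looks hot in both streams combined."""
    mention_spike = spike_ratio(
        recent_mentions.get(topic, 0), baseline_mentions.get(topic, 0)
    )
    query_spike = spike_ratio(
        recent_queries.get(topic, 0), baseline_queries.get(topic, 0)
    )
    score = query_weight * query_spike + (1 - query_weight) * mention_spike
    return score >= threshold


if __name__ == "__main__":
    # Invented counts, purely for illustration.
    recent_m = {"blackout": 40, "weather": 10}
    base_m = {"blackout": 4, "weather": 10}
    recent_q = {"blackout": 900, "weather": 100}
    base_q = {"blackout": 99, "weather": 100}
    for t in ("blackout", "weather"):
        print(t, deserves_freshness(t, recent_m, base_m, recent_q, base_q))
```

A real system would need to handle seasonality, spam, and topic extraction from raw text, none of which this sketch attempts.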

Cool idea, right? But there’s a question that begs answering. Once you’ve figured out that a user is looking for fresh results, how do you figure out which ones to show to them?

Google’s web search index has a lag time of up to a few days. This may be due to the sheer size of the index and the technical problems that come with re-indexing the entire web very frequently. But I think it’s because updating it any more often than that provides little further benefit. If they don’t wait a few days, then people have no time to create links to the new web pages out there and provide the PageRank algorithm with its secret sauce. In the PageRank world, fresh pages have no value. People need time to link (or not link) to them and tell the algorithm whether or not they’re worthwhile.

This means that a new mode of thought is required to index and rank very fresh information. Information for which the usual trust metadata is not yet available.

To fill this void, so-called “blog search” has emerged. Blog search is actually just an arena in which “query deserves freshness” is always true. It’s the same problem. Technorati is the leader in this area, but others are catching up. Unfortunately, Technorati’s results are often littered with spammy and irrelevant links. They try to mitigate this by assigning an authority score to the source of each result, but it’s not granular enough to get at the real issue (Google doing more frequent indexing won’t really help web search for the same reason). All the other blog search providers have similar ranking issues. So, what are they to do?

The way to solve the ranking problem for fresh information is to analyze the attention streams of people that are consuming very fresh information.

And where do people consume very fresh information? That’s right, they do it in a feed reader.

Google Reader is a way for the big G to get extremely reliable data about which new web pages are worthwhile and which ones are not. If lots of people email a story in a feed to their friends, that’s a clue that it’s interesting. If lots of people star something, yet another. If they tag it, even better. This all happens very quickly, within minutes of an item’s publication. You see, if they get a large enough group of people to consume their fresh information inside Google Reader, then they are acquiring massive amounts of structured, valuable, implicit metadata that can help them to solve the freshness ranking problem.
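
Here is a rough sketch of how such signals might be turned into a ranking. This is my own illustration, not a description of anything Google Reader actually does. It assumes a stream of (item URL, user ID, action) events and gives each action a hypothetical weight, with tagging weighted a bit higher than emailing or starring to match the “even better” above. Each user’s action is counted only once per item so a single enthusiastic reader can’t inflate a score.

```python
from collections import defaultdict

# Hypothetical weights; the only thing taken from the text is that
# a tag is a somewhat stronger signal than an email or a star.
SIGNAL_WEIGHTS = {"email": 1.0, "star": 1.0, "tag": 1.5}


def rank_fresh_items(events, weights=SIGNAL_WEIGHTS):
    """Rank feed items by accumulated reader attention.

    events: iterable of (item_url, user_id, action) tuples.
    Returns a list of (item_url, score), highest score first,
    with ties broken alphabetically by URL.
    """
    seen = set()
    scores = defaultdict(float)
    for url, user, action in events:
        if action not in weights:
            continue  # ignore actions we have no weight for
        key = (url, user, action)
        if key in seen:
            continue  # count each user's action once per item
        seen.add(key)
        scores[url] += weights[action]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Invented events, purely for illustration.
    sample = [
        ("example.com/a", "u1", "star"),
        ("example.com/a", "u2", "email"),
        ("example.com/b", "u1", "tag"),
        ("example.com/a", "u1", "star"),   # duplicate, ignored
        ("example.com/c", "u3", "read"),   # unweighted action, ignored
    ]
    for url, score in rank_fresh_items(sample):
        print(url, score)
```

Because these events arrive within minutes of publication, a score like this is available long before anyone has had time to link to the page.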

This attention data is infinitely more valuable to Google than the potential advertising dollars they could obtain from showing ads to Reader users. That’s why Reader is ad free and will remain that way. They just want to know what you’re looking at.

This attention data is why Technorati acquired the Personal Bee and why Ask bought Bloglines. They want to get at this type of information too. This is also why Yahoo! is really blowing it by not building a proper reader of their own. At this point, they should just buy one (nudge, nudge, wink, wink). [#2]

If there’s one thing I learned while working at Yahoo!, it’s that search is the cash cow, the big prize. Everything else matters only with regard to how it can improve search and drive increased search market share. This is probably even more of a truism at Google. Google Reader may have started out as a 20% project, but all that fresh attention metadata means that it will eventually become an integral part of their search platform.

Notes: