Google and the Evolution of Search I: Human Evaluators

The goal is to enable Google users to be able to ask questions such as ‘What shall I do tomorrow?’ and ‘What job shall I take?’…We are very early in the total information we have within Google. The algorithms [software] will get better and we will get better at personalization. — Google CEO Eric Schmidt

For many years, Google (GOOG), on its Explanation of Our Search Results page, claimed that “a site’s ranking in Google’s search results is automatically determined by computer algorithms using thousands of factors to calculate a page’s relevance to a given query.”

Then in May of 2007, that statement changed: “A site’s ranking in Google’s search results relies heavily on computer algorithms using thousands of factors to calculate a page’s relevance to a given query.”

A slight adjustment in wording, but an important comment on the supremacy of the algorithm that Google had touted for years. Google had finally acknowledged that its search results were no longer solely and automatically determined by the company’s vaunted algorithms. Now they simply “relied heavily” on them. Why the sudden change?

Google claims it was arbitrary, unrelated to any sudden philosophical shifts within the company. But it seems far too specific an adjustment to chalk up to a random brand-management edit. We are, after all, talking about the company’s official explanation of its search results. And indeed, sources say the language was changed to account for the continual calibration of the algorithm, which these days is done with a bit of human help.

Google, for example, employs a vast team of human search “Quality Raters” (you’ll find a copy of an old training manual here). Spread out around the world, these evaluators, mostly college students, review search returns against established criteria, testing different algorithms to see which works “best” at predicting the quality of a site (though not directly judging the quality of any individual site itself).

They’re aided by Google’s own registered users, who can now, when logged into their Google accounts, promote and delete sites from their own search returns according to their preferences. These data, too, are used to tweak and further optimize the algorithm. So Google’s objective evaluation and ranking of Web sites is to some extent shaped by the subjective reasoning of a collective human intelligence. And so it must be if Google is to continue returning search results that we perceive to be the “best” answers to our search queries.

In interviews serialized over the next three days, key Google engineers with central roles in managing the company’s search engine discuss the resources and techniques they use to optimize the system for users worldwide. The series kicks off below with engineering director Scott Huffman, who oversees the company’s search evaluation team. Senior Google software engineer Matt Cutts appears tomorrow. And Google Fellow Amit Singhal wraps up the series on Friday.

Part I: Scott Huffman

John Paczkowski: How do you maintain quality in search ranking?

Scott Huffman: We are constantly evaluating the quality of our results in something like a hundred different locales and language tiers all around the world. So every day, we are looking at a random sample of queries that we think represents the queries we get from users. Evaluators look at the quality of each result relative to those queries. We are constantly tracking a pretty wide array of different kinds of quality signals that come through our tests.

JP: Talk a bit more about the human element here. You’ve hired people to evaluate pages?

SH: Yes, we have folks around the world who are trained to evaluate the quality of results. We like them to be in-country so they understand the culture and that type of thing. And then we have a work flow system that feeds them different kinds of evaluation tasks. Things like “tell us how good you think this result is for this query.” And then out of the data, we produce a set of aggregate metrics that we look at and that we can track over time.
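
The workflow Huffman describes — evaluators grading query-result pairs, with grades rolled up into aggregate metrics tracked over time — can be sketched roughly as follows. This is purely illustrative: the data shape, the 1–5 grading scale, and the per-locale averaging are assumptions, not Google's actual pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation tasks: each rater grades how good a result
# is for a query, on an assumed 1-5 relevance scale.
ratings = [
    # (locale, query, result_url, grade)
    ("en-US", "olympics", "example.org/2008-games", 5),
    ("en-US", "olympics", "example.org/1996-games", 2),
    ("de-DE", "olympics", "example.org/2008-games", 4),
]

def aggregate_by_locale(ratings):
    """Roll individual grades up into one mean score per locale --
    the kind of aggregate metric engineers could track over time."""
    by_locale = defaultdict(list)
    for locale, _query, _url, grade in ratings:
        by_locale[locale].append(grade)
    return {loc: mean(grades) for loc, grades in by_locale.items()}

print(aggregate_by_locale(ratings))  # {'en-US': 3.5, 'de-DE': 4}
```

The point of the aggregation is exactly what Huffman emphasizes: individual grades never touch individual results; only the rolled-up numbers are compared over time.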

JP: So how many of these evaluators are there?

SH: How many? I don’t think we can talk about the exact number, unfortunately.

JP: Ballpark? I’ve heard 10,000.

SH: Well, the number actually is pretty large and that’s for a couple of reasons. One is that, like I mentioned, we try to do an evaluation pretty broadly across all of the locales Google is in, and there are a lot of them. So you’re already talking about a pretty large group of people. Secondly, we prefer a larger group to a narrow one because we want to use our evaluations to give us an independent picture of our quality. We get a lot of queries from all over the world so we need a broad base of people to help us understand how good our results are for them.

JP: So are these raters college students or random folks responding to a job post? What are the requirements?

SH: It’s a pretty wide range of folks. The job requirements are not super-specific. Essentially, we require a basic level of education, mainly because we need them to be able to communicate back and forth with us, give us comments and things like that in writing.

JP: And how are they trained?

SH: The training is pretty simple. There are manuals and video training and, ultimately, participation in the rating program. We help them understand what it means for search results to be highly relevant and usable for the viewer. Is there a dominant result for a particular query today? If so, it should be right there at the top. Take a broad-based query like…“Olympics.” If a user searches for “Olympics,” the results from the 1996 Olympics are not as interesting as the ones from the 2008 Olympics.

JP: So how do you vet data provided by the raters? Is there any quality control?

SH: Well, the raters work in-country, so we don’t see them every day. And we don’t typically talk to them on the phone. We have some automated measures that account for things like, say, evaluators who consistently say two sites in a side-by-side comparison are about the same. We also have moderators. But ultimately, the real quality control is done by the folks who are working on ranking and search UI. They’re the ones who understand why we are better today in China than we were a week ago or a month ago. What changed? What are we doing better? The evaluation program really just gives our engineers an aggregate measure of how good their algorithms are so they can improve them.
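
One of the automated checks Huffman alludes to — flagging evaluators who answer “about the same” in nearly every side-by-side comparison — might look something like this sketch. The verdict labels, threshold, and rater names are hypothetical, invented here for illustration.

```python
from collections import Counter

# Hypothetical side-by-side verdicts per rater: "left", "right", or "same".
verdicts = {
    "rater_a": ["left", "same", "right", "left", "same"],
    "rater_b": ["same", "same", "same", "same", "same"],
}

def low_signal_raters(verdicts, same_threshold=0.9):
    """Flag raters whose 'about the same' rate is suspiciously high --
    one simple automated consistency check of the kind described."""
    flagged = []
    for rater, answers in verdicts.items():
        counts = Counter(answers)
        if counts["same"] / len(answers) >= same_threshold:
            flagged.append(rater)
    return flagged

print(low_signal_raters(verdicts))  # ['rater_b']
```

A rater who never expresses a preference contributes no signal to the comparison, so filtering such raters out keeps the aggregate metrics meaningful.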

JP: So you’re describing a process in which these evaluators are going to specific Web pages and rating them according to specific criteria. Do these data have any effect on those sites’ page ranks or pay-per-click and AdWords bids?

SH: We don’t use any of the data we gather in that way. I mean, it is conceivable you could. But the evaluation site ratings that we gather never directly affect the search results that we return. We never go back and say, “Oh, we learned from a rater that this result isn’t as good as that one, so let’s put them in a different order.” Doing something like that would skew the whole evaluation. So we never touch it.

JP: Let’s backtrack a little bit. How did this project begin? Who came up with it? What were its origins?

SH: Well, from the earliest days of Google, of course, we have always been interested in measuring how well our search algorithms are doing. I wasn’t here, but what I understand is that way back when, there was a set of Sergey’s 10 favorite queries; people would run those and make sure that any change they made to the ranking algorithms kept them working well. Obviously, as Google grew in traffic and reach, it needed a broader set of queries, and there was a realization that we really needed to have evaluators in the countries we service who understand the culture to do that well. We needed a team that could evaluate results from the users’ perspective.