In order to explore the relative contributions of relevance and position, we employed eye tracking as the methodology in the current study. Eye tracking devices record eye movements and can reveal subjects' attention and cognitive processes. In the areas of cognitive psychology, human-computer interaction, and marketing, eye tracking methods have been used for decades (Rayner, 1998). We use eye tracking to investigate how users make decisions when confronted with the results Google returns for a query. Eye tracking adds meaning to the more traditional log file or click behavior analysis: it allows for a more complete assessment of the information-seeking process by revealing which query result abstracts users looked at, or were aware of, before selecting a result or refining their query. This article provides behavioral evidence that sheds light on the influential factors in the evaluation process of search engine users.

Our curiosity regarding this question was piqued by an earlier study conducted by Granka et al. (2004) . Their results indicated that most student subjects only view and click the top two results returned by Google. The design of this earlier study did not tease apart whether those choices were the result of the top positions of the two abstracts as influenced by Google’s ranking algorithm, or if those were truly the most relevant results as evaluated by the subjects. We were interested in finding out whether a user’s choice of a particular abstract was based on the position of that abstract, the user’s evaluation of the relevance of that abstract, or a combination of the two.

However, how well a Web page actually reflects a user's search intentions is hard to measure. For example, Google's ranking algorithm uses a page's in-links to help estimate its quality and relevance (Pandey, Roy, Olston, Cho, & Chakrabarti, 2005). Some have argued that such algorithms, including PageRank, simply set up a rich-get-richer loop whereby relatively few sites dominate the top ranks (Hindman, Tsioutsiouliklis, & Johnson, 2003). Retrievability and visibility represent only part of the search process. We wondered what role the user plays in perpetuating this rich-get-richer dynamic. In particular, we wondered how much of the correlation between Web traffic and site popularity (Hindman et al., 2003) is due to the alleged efficiency of these algorithms as opposed to users' tendency to simply trust the ranked output displayed by a search engine and forego any in-depth analysis or comparison of the retrieved results. More importantly, Google's imperfect algorithm is open to abuses such as Google bombing (Tatum, 2005; see also Bar-Ilan, this issue), which can deliver erroneous messages to a large population when searchers trust Google without questioning its underlying ranking mechanism.

The information search process is made possible through three parties: Web authors, the search engines themselves, and the users of search engines. The Web authors put their Web pages online with appropriate linking to other pages. The link structure has been used by popular search engine algorithms ( Brin & Page, 1998 ) that can take advantage of this structure to rank relevant Web pages. Users of search engines enter various keywords (sometimes with Boolean commands) according to their understanding of the task and the functionality of the search engine, and they evaluate the results returned by the search engine, making a decision on whether or not to select one of the returned results or reformulate the query. Search engines act as an information intermediary that facilitates the information seeking process.

All of the search engines noted above respond to a query with a ranked list of 10 abstracts in their default setting. The ranking reflects the search engine's estimate of the relevance of Web pages to the query. Individual search engines vary both in their underlying ranking implementations and in how they display the ranked results, as well as in any additional support they provide for finding related Web pages. Users can evaluate the abstracts, or other information displayed about a given result, before deciding whether to visit any of the suggested pages by clicking on a hyperlink. In this study, we chose Google because of the frequency of its use, the simplicity of its display of query results (which can serve as a common basis for comparison with many other search engines), and our prior experience studying Web search on Google (Granka, Joachims, & Gay, 2004). We also confined the study to a single search engine to ensure a constant visual display on which we could analyze and interpret the subjects' eye movements.

Finding online information using search engines has become a part of our everyday lives (Gordon & Pathak, 1999). Currently the search engine serving the largest percentage of queries (47.3%) is Google, with an index of around 25 billion Web pages and 250 million queries a day (Brooks, 2004; Search Engine Watch, 2007). Google now provides search functions on handheld devices and smart phones (Google Inc., 2005a). With the ubiquitous presence of mobile devices, anytime, anywhere access to the information world has become a reality. Other popular search engines include Yahoo, MSN, AOL, and Ask.com, all of which serve the pervasive need of finding pertinent information within the vastness of the Web. Despite the popularity of search engines, most users are not aware of how they work and know little about the implications of their algorithms (Gerhart, 2004).

In Granka et al. (2004), the subjects were given the 10 tasks in random order and asked to start with Google and search for answers, with a limit of three minutes per task. The time constraint allowed for the collection of a substantial amount of eye tracking data for each task and also minimized the total time required of each subject. Most subjects voluntarily stopped the search tasks within the allotted time. In this earlier study, complete eye tracking data were obtained for 23 subjects. The results showed that the subjects viewed the top two abstracts almost equally often, and much more often than any other abstract. However, they clicked the number one ranked abstract significantly more often than the number two ranked abstract (Granka et al., 2004). This behavior prompted us to ask whether the subjects were simply defaulting to Google's ranked results, or whether their decisions were based on some critical evaluation of the results.

In the study conducted by Granka et al. (2004) , 10 tasks were devised, ranging from informational tasks such as “Who discovered the first antibiotics?” to navigational tasks such as “Find the homepage of Emeril ‐ the chef who has a TV cooking program.” Informational tasks require finding a particular fact, while navigational tasks involve searching for a particular Web page ( Broder, 2002 ). Among the 10 tasks, some were inspired by popular topics from Google Zeitgeist, while others covered local or specialized topics. Google Zeitgeist ( Google Inc., 2005b ) is a report provided by Google that reveals the most popular queries or other related trends about the queries Google receives. Our resulting mix of popular and specialized topics was an effort to simulate a likely query task situation.

In general, knowing how users evaluate result pages through eye tracking methods can help researchers to understand users’ motivations, tasks, and cognitive processes. Understanding this evaluation and decision‐making will enable correct interpretation of Web log files as feedback data and thereby improve search engine performance ( Joachims, 2002 ). As a result, it may be possible to design Web‐based information retrieval systems to better satisfy user needs.

Goldberg et al. (2002) used eye tracking methods to test the performance of subjects completing several tasks on a Web portal page. Their research characterized subjects' eye movements on the portal page and yielded implications for improving its design. Pan et al. (2004) showed that gender, website type, and the interaction between search sequence and website type all affect Web viewing behavior. For example, female subjects had shorter mean fixation durations than males; subjects had longer mean fixation durations on the first Web pages viewed than on the second; and subjects spent more time gazing at the first pages than at the second. In general, as an indicator of information processing, eye movements on Web pages are influenced both by individual variables, such as gender, and by characteristics of the stimuli, such as the layout and content of the pages. The current study combines eye tracking with clickstream data in order to make inferences regarding the impact of position versus judged relevance on the decision-making processes involved in information search. By doing so, we gain an in-depth understanding of how users evaluate search results and the factors that influence their choices.

Studies of eye movements date back to work by Javal in 1879 (Huey, 1908), and over time they have informed fundamental facts about eye movements, behavioral and experimental psychology, and human-computer interaction, an application area that is benefiting greatly from advances in the ease and accuracy of eye trackers. In a typical user study, the subject is calibrated to the eye tracking device: the researcher asks the subject to look at specific targets while the software registers the corresponding target locations. The device works by directing a weak infrared light at the subject's eye and measuring its reflection from the cornea and pupil. As a result, eye movements on a computer screen can be recorded, with a high degree of accuracy, for the majority of people.

The hypertext nature of the Web has changed how people search for and access information (Bilal & Kirby, 2002). It is imperative to understand user behavior on the Web in order to design better search engines. The relevant literature is organized into three parts: past research on user behavior and search engines, past eye tracking research related to Web viewing behavior, and a review of the major results of the authors' first Google eye tracking study, which provided the basis for the current one.

Research Methods and Design

In the present study, rank refers to the original sequence of abstracts returned by Google: lower-ranked Web pages are those judged less relevant by Google's algorithm and thus placed later in the sequence. Position represents the actual physical location of an abstract on the Google results page, from top to bottom (1 to 10) on the first results page. Relevance represents a subjective judgment of the likelihood that a piece of information is related to the answer to the question or the goal of a search task. In this study, we obtained relevance through human judgments of the abstracts returned by Google (abstract relevance) as well as of the pages associated with those abstracts (Web page relevance). These judgment data were important for verifying that the 10 results were not all equally relevant.

Based on the findings from our previous study (Granka et al., 2004), the current work was designed to exploit Google’s ranking function in order to investigate how much the subjects rely on Google’s ranking to make their decisions about relevance. Unbeknown to the subjects, we manipulated the order of Google’s returned results in some cases, such that abstracts of actual lower ranked Web pages appeared higher in position and vice versa. Thus, choosing a lower ranked abstract that is in a higher position in the Google results page but is evaluated to be less relevant by human judges would be evidence that the subjects have assigned priority to Google’s “expertise” over the actual relevance of the abstract.

This section introduces eye tracking as a methodology and introduces the details of the research methods and procedures used in the current study. A laboratory setting was necessary to capture all aspects of the search sessions and related eye movements, in order to compare the variables across all subjects systematically. Although some scholars have argued that external validity is compromised in a laboratory setting, previous studies have shown that in laboratory settings and Web settings there are few or no differences in the subjects’ behavior on information search, especially on those tasks using keywords (Epstein, Klinkenberg, Wiley, & McKinley, 2001; Schulte‐Mecklenbeck & Huber, 2003).

The Subjects

In this study, participants were undergraduate students with various majors (including communication, engineering, and arts and sciences) at Cornell University (U.S.A.). All students received extra class credit for their participation in the experiment. Twenty-two subjects were recruited, and 16 complete data sets (11 males and 5 females) were obtained; attrition was due to random recording difficulties and the inability of some subjects to be calibrated precisely. The average age of participating subjects was 20 years and 4 months. All subjects reported that they used Google as their primary search engine and had a high familiarity with the Google interface (all scored 10 out of 10); when asked about their level of trust in Google, they reported an average of 7.9 out of 10. Thus, our subjects were, in general, savvy users of Google who tended to trust it to a high degree.

Search Tasks

Ten search tasks were included in this study, each of which addressed a unique aspect of the information retrieval experience. Half of the searches were navigational in nature, asking subjects to find a specific Web page or homepage. These were definitive searches, meaning that only one correct Web page would provide an acceptable answer. The other five tasks were informational, asking subjects to find a specific bit of information (Broder, 2002). Much of the content for the tasks was generated according to the content of top searches listed on Google Zeitgeist (Google Inc., 2005b). Our purpose was to ensure that the tasks in this experiment represented the various genres of searches that the general population uses on a regular basis, including travel, movies, current events, celebrities, and local issues. These tasks were also pre-tested to ensure that the most intuitive queries would not always produce top-ranked results; therefore, the findings should be interpreted in light of the fact that these queries are on average more difficult than a subject's typical query. Table 1 gives a brief description of the 10 search tasks included in the experiment and the correct answers to these tasks.

Table 1. The 10 information search tasks

Navigational tasks:
- Find the homepage of Michael Jordan, the statistician.
  Answer: http://www.cs.berkeley.edu/~jordan/
- Find the page displaying the route map for Greyhound buses.
  Answer: http://www.greyhound.com/maps/
- Find the homepage of the 1000 Acres Dude Ranch.
  Answer: http://www.1000acres.com/
- Find the homepage for graduate housing at Carnegie Mellon University.
  Answer: http://www.housing.cmu.edu/graduatehousing/
- Find the homepage of Emeril, the chef who has a television cooking program.
  Answer: http://www.emerils.com/emerilshome.html

Informational tasks:
- Where is the tallest mountain in New York located?
  Answer: The Adirondacks / High Peaks Region
- With the heavy coverage of the Democratic presidential primaries, you are excited to cast your vote for a candidate. When are/were the Democratic presidential primaries in New York?
  Answer: March 2, 2004
- Which actor starred as the main character in the original Time Machine movie?
  Answer: Rod Taylor
- A friend told you that Mr. Cornell used to live close to campus, near University and Stewart Ave. Does anybody live in his house now? If so, who?
  Answer: Members of Llenroc, the Cornell chapter of the Delta Phi fraternity, live in the mansion.
- What is the name of the researcher who discovered the first modern antibiotic?
  Answer: Alexander Fleming

Experimental Procedure

All participants gave informed written consent prior to the start of the experiment. Before the actual experiment, the eye tracker was calibrated for each subject using a nine-point standard calibration procedure (Duchowski, 2003). Participants were instructed to search for the 10 different tasks through the Google interface. Subjects were told to view the Web pages and search as they typically would under normal conditions, with the opportunity to scroll up and down the page at their leisure. The experimenter sat to the right of and behind the subject, where she was able to watch the subject, the subject's eye, and the corresponding eye movements on the two control monitors. If the eye tracking system temporarily lost a subject's eye path due to extreme movements, the experimenter could re-center and, if appropriate, perform a quick recalibration fix. This happened rarely and randomly, and it did not interrupt the experimental session since the quick fix took only a few seconds. The 10 search tasks were read aloud to the subject by the experimenter to eliminate unnecessary eye movements away from the computer monitor; such eye movements could potentially hinder the accuracy of the ocular calibration. Typically, due to the monitor size, scrolling was required to view abstracts in positions seven through ten on the Google results pages. To eliminate potential bias from question order effects, the search questions were completely randomized for each subject. The maximum time for completing each task was restricted to three minutes. As before, the time constraint allowed a sufficient amount of eye tracking data to be collected for each task and also minimized the total time required of each subject. The most important data for this study come from how each subject responds to and interacts with the 10 results following each query, not from the full completion of the task itself.

Design

Because Google continuously updates its search algorithms, one specific query will not produce exactly the same results on two separate occasions. Because much of the data analysis was to occur after the experimental sessions, it was necessary to cache the Web pages with which the subjects actually interacted. A proxy server was set up to mediate the interaction between the subjects and the Google Web server. The proxy script ran on the subject's computer and stored every search query typed by the subjects, as well as all links and Web pages that were viewed, along with the corresponding times at which they were accessed and viewed. When the subject typed in a query, the query was sent to the proxy server, and the proxy server relayed it to Google. After receiving the results from Google, the proxy server manipulated the results and passed the modified results on to the subject's Web browser. Results were modified in two ways. First, the proxy server removed the advertisements on the Google results page to avoid distraction and to ensure consistent stimulus exposure across all subjects. This also saved the authors from having to filter out eye movements on ads, since the study was concerned with which query results were viewed and selected. Second, in order to explore the relative contribution of relevance versus position to the decision-making process, the results were further manipulated for each subject in one of three ways. In the "Normal" condition, the proxy server returned the results in their original ranked order; in the "Swapped" condition, it swapped the positions of the first and second ranked abstracts, keeping the rest of the ranking intact; and in the "Reversed" condition, it reversed the positions of the abstracts on the first result page: the rank 1 abstract was swapped with the rank 10 abstract, the rank 2 abstract with the rank 9 abstract, and so on.
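The three experimental conditions amount to a simple reordering of the ranked abstract list. The following Python sketch is an illustrative reconstruction of that reordering logic, not the authors' actual proxy script; the function and condition names are taken from the paper's labels:

```python
def manipulate_results(abstracts, condition):
    """Reorder one page of ranked result abstracts for an experimental condition.

    `abstracts` is a list in Google's original rank order (index 0 = rank 1).
    Illustrative reconstruction of the paper's three conditions.
    """
    results = list(abstracts)  # copy so the original ranking is untouched
    if condition == "Normal":
        pass                                              # original ranked order
    elif condition == "Swapped":
        results[0], results[1] = results[1], results[0]   # swap ranks 1 and 2
    elif condition == "Reversed":
        results.reverse()                                 # rank 1 <-> 10, 2 <-> 9, ...
    else:
        raise ValueError(f"unknown condition: {condition}")
    return results
```

Because only the display order changes, each abstract retains its true Google rank, which is what lets the analysis separate position from rank.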

Eye Tracking Indices

During an eye tracking experiment, several measurements are typically recorded that are relevant for studying college students' interactions with search engines. 'Fixation' refers to a relatively stable eye-in-head position within some threshold of dispersion (typically ∼2°), over some minimum duration, and with a velocity below some threshold (typically 15–100 degrees per second). In this study, we set the minimum duration to 50 milliseconds, as suggested in the ASL504 eye tracker manual (Applied Science Laboratories, 2005). Eye fixations are the most relevant metric for evaluating information processing in online search, as fixations represent the instances in which most information acquisition and processing occurs (Rayner, 1998). The total number of fixations is often used as an indicator of processing difficulty, with fixation density related to the complexity and informativeness of the visual stimulus (De Graef, De Troy, & d'Ydewalle, 1992; Friedman, 1979; Henderson, Weeks, & Hollingworth, 1999), such that as informativeness increases, so too does the number of fixations in that area. In the current study, we also used the average number of fixations as a measure; a higher number of fixations on an abstract represents more intensive information processing. 'Pupil dilation' refers to the widening of the pupil. It has long been known that pupils dilate in response to emotion-evoking stimuli (Beatty, 1982). Although pupil size is also affected by light, the lighting remained constant in our experiment. As Rayner and others have pointed out (Rayner, 1998), relying on a single indicator of processing difficulty may oversimplify the relationship between the indicator and processing difficulty. Hence, both fixation and pupil dilation measures are frequently used as corroborating measures of cognitive workload (Hess, 1965; Just & Carpenter, 1980; Kahneman, 1973).
Last, a ‘scanpath’ is the spatial arrangement of a sequence of fixations, or simply the sequence of LookZones that a subject views, as in the present study.
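The dispersion-and-duration definition of a fixation can be made concrete with a standard dispersion-threshold (I-DT) filter. The sketch below is illustrative, not the ASL504's internal algorithm: the 50 ms minimum duration follows the study, while the 2° dispersion limit and the (time, x, y) sample format are assumptions:

```python
def detect_fixations(samples, max_dispersion=2.0, min_duration=50):
    """Dispersion-threshold (I-DT style) fixation filter.

    `samples` is a list of (t_ms, x_deg, y_deg) gaze samples, ordered by time.
    Returns (start_ms, duration_ms, centroid_x, centroid_y) tuples.
    Thresholds are illustrative assumptions, not the tracker's own settings.
    """
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # Grow the window while total dispersion (x range + y range) stays small.
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        duration = samples[j][0] - samples[i][0]
        if duration >= min_duration and j > i:
            window = samples[i:j + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((samples[i][0], duration, cx, cy))
            i = j + 1          # continue after the detected fixation
        else:
            i += 1             # too short or too dispersed: slide the window
    return fixations
```

A stable cluster of samples lasting at least 50 ms yields one fixation at the cluster's centroid; a large jump in gaze position (a saccade) ends the window.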

Definition of LookZones

In addition to logging the clickstream and Web page data of subjects, the script also constructed 'LookZones' around key content regions. The script utilized a feature inherent to the GazeTracker software system that automatically creates LookZones around links and pictures, which the software recognizes within the HTML tags. (For more information on the GazeTracker software system and the eye tracking apparatus itself, see Appendix A.) Thus, the script enabled the creation of distinct LookZone regions around each of the ten displayed results (Figure 1). For the analysis, each of the displayed results on Google, from the rank #1 abstract through the rank #10 abstract, was given its own set of LookZones, from which we could then compare eye tracking behaviors across all queries relative to these zones. LookZones were not visible to participants during the experiment.

Figure 1. LookZone division on a Google results page

Judged Relevance

As stated above, we considered rank, position, and judged relevance in this study. For all queries and results pages encountered in the study, we gathered relevance assessments of the abstracts, which allowed us to examine the choices made by subjects as a function of the position and judged relevance of the chosen page in cases where Google's rank did not reflect what other humans might consider relevant. Five non-participants served as judges. For each results page, we randomized the order of the abstracts and asked the judges to weakly order them (ties were allowed) by how promising they looked for leading to relevant information. Each of the five judges assessed all results pages for two questions, plus 10 results pages from two other questions, for inter-judge agreement verification. The set of abstracts/pages we asked judges to weakly order was not limited to the (typically 10) hits from the first results page; rather, it included all results encountered by a particular subject for a particular question. The inter-judge agreement on the abstracts was 82.5%. We also collected relevance judgments for the actual Web pages those abstracts represent; the inter-judge agreement on the relevance assessment of the pages was 86.4%.
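Agreement between weak orderings can be computed over pairs of items: two judges agree on a pair if they order it the same way (including both calling it a tie). The study does not spell out its exact agreement statistic, so the following is one plausible reconstruction; the dict-of-ranks input format is an assumption:

```python
from itertools import combinations

def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of item pairs on which two judges' weak orderings agree.

    Each ranking maps an abstract id to its rank value; tied abstracts
    share a rank value. Illustrative reconstruction, not the study's
    documented statistic.
    """
    def relation(r, x, y):
        # -1 if x is ranked before y, 0 if tied, 1 if after
        return (r[x] > r[y]) - (r[x] < r[y])

    items = sorted(ranking_a)
    pairs = list(combinations(items, 2))
    agree = sum(relation(ranking_a, x, y) == relation(ranking_b, x, y)
                for x, y in pairs)
    return agree / len(pairs)
```

For example, if two judges order three abstracts identically except that one reverses a single pair, they agree on two of the three pairs, giving an agreement of about 67%.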