Abstract Cognitive science has long shown interest in expertise, in part because prediction and control of expert development would have immense practical value. Most studies in this area investigate expertise by comparing experts with novices. The reliance on contrastive samples in studies of human expertise only yields deep insight into development where differences are important throughout skill acquisition. This reliance may be pernicious where the predictive importance of variables is not constant across levels of expertise. Before the development of sophisticated machine learning tools for data mining larger samples, and indeed, before such samples were available, it was difficult to test the implicit assumption of static variable importance in expertise development. To investigate if this reliance may have imposed critical restrictions on the understanding of complex skill development, we adopted an alternative method, the online acquisition of telemetry data from a common daily activity for many: video gaming. Using measures of cognitive-motor, attentional, and perceptual processing extracted from game data from 3360 Real-Time Strategy players at 7 different levels of expertise, we identified 12 variables relevant to expertise. We show that the static variable importance assumption is false - the predictive importance of these variables shifted as the levels of expertise increased - and, at least in our dataset, that a contrastive approach would have been misleading. The finding that variable importance is not static across levels of expertise suggests that large, diverse datasets of sustained cognitive-motor performance are crucial for an understanding of expertise in real-world contexts. We also identify plausible cognitive markers of expertise.

Citation: Thompson JJ, Blair MR, Chen L, Henrey AJ (2013) Video Game Telemetry as a Critical Tool in the Study of Complex Skill Learning. PLoS ONE 8(9): e75129. https://doi.org/10.1371/journal.pone.0075129 Editor: Hans P. Op. de Beeck, University of Leuven, Belgium Received: December 7, 2012; Accepted: August 12, 2013; Published: September 18, 2013 Copyright: © 2013 Thompson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This research was made possible by funding from the Social Sciences and Humanities Research Council of Canada and from Simon Fraser University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Work in expertise and skill learning most often follows one of two paradigms: making precise measurements of performance, but with poorly trained participants doing relatively simple laboratory tasks [1], [2], [3], [4], or studying real-world experts while taking only indirect measures of domain performance [5] from two or three levels of skill [6], [7], [8], [9]. The applicability of these paradigms to understanding the development of expertise rests on the validity of extrapolating from short-term laboratory training or of interpolating from long-term comparisons between experts and novices. These methodologies are thus highly informative where skill development is a smooth transition between novice and expert, but may be problematic if the skill level of the participants changes whether or not a process is important to success. For example, the method is deeply problematic in the comparison of 10-month-old infants and 20-year-old college students. The two groups could obviously be distinguished by the capacity to pass traditional false belief tasks and by the capacity for algebra, but it does not follow that false belief tests are useful for distinguishing 15- and 20-year-olds, or that such tests are even relevant to studying this period of development. Similarly, contrastive methods in the study of expertise are potentially misleading if variable importance changes throughout development. Given that expertise encompasses years of training and significant cognitive-motor change, the assumption that variable importance remains static warrants investigation. There is some evidence in the motor learning literature that variable importance can change over small amounts of training (<10 hours) in relatively simple laboratory tasks [4]. Whether changes in variable importance exist on the longer timescale of the development of expertise, especially expertise involving a substantial cognitive component, is unclear.
One possible source of evidence could be found in medical expertise, as some authors report that the relationship between expertise and the number of propositions recalled from a medical diagnosis follows an inverted-U shaped function [10] implying that the utility of this predictor varies depending on the levels of expertise being compared. The variable may, for example, be less useful for distinguishing novices and experts than it is for distinguishing intermediates and experts. Until recently, however, there was no straightforward and direct way to test the assumption of static variable importance in a rich, dynamic, realistic context. Here we use the analysis of video game telemetry data from real-time strategy (RTS) games to explore the development of expertise. Expertise in strategy games has long been a subject of interest for researchers [9], [11], [12], [13]. This is not because there is some expectation that expert chess players will be more savvy generals, or that expert tennis players are likely to be better pilots. While the knowledge and skills do not transfer, there is enough consistency in the development of expertise that unified theories have been developed [14]. One would therefore expect that the development of RTS expertise would resemble the development of expertise in these and even less related domains, such as surgery. RTS games, in which players develop game pieces called units with the ultimate goal to destroy their opponent’s headquarters, have three relevant differences from traditional strategy games such as chess. First, the games have an economic component such that players must spend resources to produce military units. Many of a player’s strategic decisions are related to balancing spending on military and economic strength. Second, the game board, called a map, is much larger than what that player can see at any one time. 
The resulting uncertainty about the game state leads to a variety of information gathering strategies, and requires vigilance and highly developed attentional processes. Third, in RTS games players do not have to wait for their opponent to play their turn. Players that can execute strategic goals more efficiently have an enormous advantage. Consequently, motor skills with a keyboard and mouse are an integral component of the game. Each game produces a large amount of behavioral data: an average game of chess consists of 40 moves [15] per player, while the average RTS game in our study consists of 1635 moves per player. We bear the burden of arguing that RTS play can be considered an area of expertise in the same sense that chess or Go are areas of expertise. Playing well requires a great deal of strategy and knowledge, and these require a great deal of experience. It satisfies the definition of expertise as being “characteristics, skills and knowledge that separates experts from novices and less experienced people” (p. 3) [16]. Skilled StarCraft players perform consistently better than less skilled ones, as evidenced by the game developer’s need to develop a matchmaking system for fair play. RTS games also meet more commonplace notions of expertise (such as athletic expertise) grounded in professional performance requiring skills and commitment far beyond that of average individuals. StarCraft 2 supports a variety of professional and semi-professional players. Top players can earn 250,000 USD a year [17], motivating full-time commitment to the game. Professionals practice 6–9 hours a day, 6 days a week, and often have a decade or more of RTS experience. Tournaments are broadcast live and professional teams are sponsored by major corporations. All of this is evidence that RTS gaming is a domain of expertise.
Furthermore, competence in the game necessarily involves fast and meaningful hand movements and intelligent control of the game’s view-screen in order to see and act, so it follows that attention, perception, decision making, and motor control (all of which we will collapse under the term “cognitive-motor abilities”) are important to StarCraft 2 expertise. By studying a domain of expert performance that is entirely computer-based, we are able to obtain accurate measures of performance in its natural environment. By using an existing, popular and competitively played video game, we are able to obtain much larger and more diverse samples via online correspondence from participants all over the world. The present study analyzes data from 3,360 StarCraft 2 players across 7 distinct levels of skill called leagues (Bronze, Silver, Gold, Platinum, Diamond, Masters, Professional), making it the largest expertise study ever conducted. Our research goal was to identify potential markers of expertise in RTS games (with special interest in general cognitive markers), and to form a clear picture of the complexity of expertise. When these aspects of RTS games are taken in conjunction with the telemetric collection and analysis of detailed game records, we are left with a project that has the following virtues: a rich, dynamic task environment; highly motivated participants; accurate measures of motor performance and attentional allocation; noninvasive and direct measures of domain performance; large datasets; numerous variables; and many levels of expertise. This approach is therefore uniquely situated for exploring expert development.

Materials and Methods
Ethics Statement This study was reviewed and approved by the Office of Research Ethics at Simon Fraser University (Study Number: 2011s0302). Participants provided informed consent in an online survey.
Data Collection Telemetric data were collected from 3,360 RTS game players from 7 levels of expertise, ranging from novices to full-time professionals. We posted a call for StarCraft 2 players through online gaming communities and social media. From each respondent we gathered a replay file (a recording of all the commands issued in the game), demographic information, and a player identification code that allowed us to verify their level of expertise (as measured by the league in which they compete; online competitive leagues are composed such that 20% of players are Bronze, 20% Silver, 20% Gold, 20% Platinum, 18% Diamond, and 2% Masters [18]). Replays of professional players’ games were obtained from gaming websites. The primary research question does not depend on any particular variable but on the pattern of importance of all the variables across the levels of expertise. Nevertheless, we selected predictor variables that relate to cognitive-motor abilities. In addition, we selected variables that relate to cognitive load, that is, the amount of mental energy required to perform the task. Unlike laboratory tasks, many of which ask participants to do a single simple task, success at StarCraft 2 requires the completion of many separate but interrelated tasks. This can lead to difficulties, as there are serious constraints on attention that limit the ability to perform multiple tasks concurrently [19]. There is much work on skill learning which indicates that after extensive practice, people not only perform tasks more quickly and accurately, but the tasks also come to require fewer cognitive resources and become nearly effortless.
This is typically called automaticity (e.g., Logan [2], Schneider & Shiffrin [20], Shiffrin & Schneider [21]) in cognitive psychology. Related concepts can be found in other fields. For example, one concern of research on Unmanned Ground Vehicle operation is how the degree to which navigation is autonomously controlled by computer systems affects operator mental workload [22]. The variables chosen for the present study, which reflect the considerations above, fall under the following categories:
Perception-Action-Cycle variables. Each variable pertains to a period of time where players are fixating and acting at a particular location. Many of these variables will therefore reflect attentional processes (because Perception-Action-Cycles have consequences for what players are able to attend to), perceptual processes (because shifts of the screen imply new stimuli), and cognitive-motor speed (in the sense that actions must not only be fast but meaningful and useful).
Hotkey usage variables. Players can customize the interface to select and control their units or buildings more rapidly, thus offloading some aspects of manually clicking on specific units to the game interface.
Complex unit production and use variables. Certain units pose dual task challenges and some need to be given explicit direction or targeting instructions. The production and use of these units and abilities is sometimes optional, and so their production and use may reflect a player’s modulation of their own cognitive load.
Direct measures of attentional control. StarCraft 2 presents a number of attentional challenges for players. One challenge is that the primary view screen contains detailed and highly salient information that potentially distracts players from the less detailed information of the entire map (this “mini-map” occupies a small portion on the bottom-left of their screen).
Mini-map variables reflect players’ performance of actions on the mini-map, and we hypothesized that better players would do a better job of attending to, and using, this map. We also considered how much of the total map was looked at by players, which we thought relevant to the seeking of information about the game state.
Actions per minute. This variable is often used as a predictor of expertise in the StarCraft community and is automatically calculated by the game. It is a measure of cognitive-motor speed.
The Supplementary Materials (Materials S1) contain complete definitions for all variables in the analysis. We extracted a list of all the actions and screen moves from each game replay file. Players move their screen to different locations on the map to perform actions at those locations or to gather information about what is occurring at those locations. These screen moves are much like saccadic eye-movements. Following methods used for eye-movement data, we aggregated screen movements into PoVs using the fixation-IDT algorithm in Salvucci & Goldberg [23], with a dispersion threshold of 6 game coordinates and a duration threshold of 20 Timestamps (about 230 milliseconds). The algorithm aggregates screen movements to provide (a) a pair of Cartesian screen coordinates and (b) a duration for each PoV. Proceeding from earliest screen movements to later ones, the algorithm first collects the smallest set of screen movements such that adding another screen movement would exceed the duration threshold. If the dispersion (defined as [(max(x) - min(x))+(max(y) - min(y))]) of these points exceeds the dispersion threshold, the first screen movement is dropped from the set of screen movements and the process is repeated. If the dispersion of the set does not exceed the dispersion threshold, then a new screen movement is added to the set and the process is repeated until adding a new screen movement produces a window that fails to satisfy the dispersion threshold.
The coordinates of screen movements in the set are then averaged into PoV coordinates and the PoV is said to begin at the earliest screen movement in the set and end at the point where the dispersion threshold is violated. The definition of PoVs allowed the analysis of PoVs that contain one or more actions, which we call Perception Action Cycles (PACs). Hotkey selects are not considered an action for calculating any PAC variable as these actions may also be used to produce new PoVs themselves. PACs encompass roughly 87% of the participants’ game time. This finding echoes research using eye-tracking to record gaze while participants do real world tasks [24]. This work found that participants’ PoVs are predominantly part of sequences of object related actions. PACs also make a useful parallel to individual trials within a laboratory experiment in which the participant perceives stimuli and makes a series of responses. For example, Action Latency, the time from the onset of a PoV to the first action, is a close analogue to reaction time in laboratory experiments. In order to ensure comparability between games, we restricted our analysis to rated competitive ladder games between two humans that lasted longer than five minutes, were played at the same game speed, and were played on StarCraft 2 version 1.3.6.19269 (also see exclusion criteria). This ensures that each game had essentially the same starting conditions. Two important exceptions are that games are sometimes played on different maps, and that players may occupy different starting positions (although competitive ladder maps are typically symmetrical and are balanced to ensure fair games).
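The dispersion-threshold aggregation and the Action Latency measure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study. The names `Move`, `PoV`, `aggregate_povs`, and `action_latencies` are hypothetical, and so is the input layout of `(timestamp, x, y)` tuples. The sketch also assumes that screen positions are sampled frequently, and it reads the initial window as the shortest run of consecutive screen movements whose time span reaches the duration threshold, which is the usual I-DT interpretation.

```python
from typing import List, NamedTuple, Sequence, Tuple

# Hypothetical input layout: one (timestamp, x, y) tuple per screen movement.
Move = Tuple[int, float, float]


class PoV(NamedTuple):
    x: float    # mean x coordinate of the aggregated screen movements
    y: float    # mean y coordinate of the aggregated screen movements
    start: int  # timestamp of the earliest screen movement in the set
    end: int    # timestamp at which the dispersion threshold was violated


def dispersion(window: Sequence[Move]) -> float:
    """Dispersion as defined in the text: (max(x) - min(x)) + (max(y) - min(y))."""
    xs = [m[1] for m in window]
    ys = [m[2] for m in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def aggregate_povs(moves: Sequence[Move],
                   max_dispersion: float = 6.0,
                   min_duration: int = 20) -> List[PoV]:
    """Aggregate time-ordered screen movements into PoVs (I-DT style)."""
    povs: List[PoV] = []
    i, n = 0, len(moves)
    while i < n:
        # Assumption: grow the initial window until it spans min_duration.
        j = i + 1
        while j < n and moves[j - 1][0] - moves[i][0] < min_duration:
            j += 1
        if moves[j - 1][0] - moves[i][0] < min_duration:
            break  # the remaining movements cannot cover the duration threshold
        if dispersion(moves[i:j]) > max_dispersion:
            i += 1  # drop the first screen movement and try again
            continue
        # Add movements until the next one would violate the dispersion threshold.
        while j < n and dispersion(moves[i:j + 1]) <= max_dispersion:
            j += 1
        window = moves[i:j]
        end = moves[j][0] if j < n else window[-1][0]
        povs.append(PoV(sum(m[1] for m in window) / len(window),
                        sum(m[2] for m in window) / len(window),
                        window[0][0], end))
        i = j
    return povs


def action_latencies(povs: Sequence[PoV], action_times: Sequence[int]) -> List[int]:
    """Action Latency for each PAC, i.e. each PoV containing at least one action.

    The caller is assumed to have removed hotkey selects from action_times.
    """
    latencies: List[int] = []
    for pov in povs:
        inside = [t for t in action_times if pov.start <= t < pov.end]
        if inside:
            latencies.append(min(inside) - pov.start)
    return latencies
```

Keeping the thresholds as parameters makes it easy to check how sensitive the resulting PoVs are to the choice of 6 game coordinates and 20 Timestamps.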
Exclusion Criteria and Sample Characteristics Of the 9222 participants who began the process of filling out the survey, 5917 were dropped from the study for satisfying one of the following exclusion criteria:
Participant failed to supply a Battle.net ID for league verification: 4706
Participant failed to submit a valid replay file: 191
Game had a Max Timestamp smaller than 25000 (roughly 5 minutes): 72
Game was played with more or fewer than two human players: 44
Game was played at a game-speed slower than “faster”: 5
Participant did not have a 1v1 ranking on Battle.net: 141
Game was not a grandmaster game, and was played in leagues that were not played using Blizzard’s “Automatchup” feature: 356
Game had fewer than 100 commands and screen movements overall: 0
Game was not a professional game, and was played on a version of StarCraft 2 other than 1.3.6.19269: 291
Participant submitted a Battle.net ID, but it did not match any player in the game: 76
The game was not a 1 versus 1 game: 0
Belonged to the league Grandmasters: 35
The survey data includes 7 leagues (Bronze, Silver, Gold, Platinum, Diamond, Masters, Grandmasters). The sample of Grandmasters participants was significantly smaller than that of the other leagues. This was not surprising as the Grandmasters league in StarCraft 2 included only the top 200 players in each region, a population orders of magnitude smaller than that of the other leagues. These 200 players consisted of both top casual players and professional players, which could not be distinguished independently of the variables used in the analysis. Due to the analytic difficulties this group imposed, we dropped its data from the analysis.
Instead, we were able to obtain a larger and more homogeneous group, the professionals, from 55 additional publicly available games collected online from professional StarCraft 2 players who competed in the GomTV StarCraft League (GSL) tournament (the most prestigious tournament in competitive StarCraft) in July 2011 or August 2011. The sample size by league was as follows: Bronze: 167; Silver: 347; Gold: 553; Platinum: 811; Diamond: 806; Masters: 621; Professional: 55. Participants reported their countries of origin. According to the survey results, participants came from 77 countries, primarily the United States (1425), Canada (480), Germany (246), and the United Kingdom (187). Participants’ ages ranged from 16 to 44 (Median = 21; Mean = 21.6; SD = 4.2), and the survey sample included 3276 males and 29 females. The one-tail 95% trimmed mean of reported hours of StarCraft 2 experience was 545, and the mean of reported StarCraft 1 experience was 4.07 years. Histograms for each variable (by league) are given in Supplementary Figures S1–15.
Analysis The primary theoretical question is whether the predictive importance of variables is stable across levels of experience. To answer this, we evaluated variable importance across skill levels by creating a series of statistical classifiers that distinguished players from two different leagues. An important challenge we encountered with this dataset is that the players are grouped into somewhat heterogeneous skill classes. The placement of players into leagues does not perfectly reflect skill, and a high-ranking player within a class might be objectively a better player than a lower-ranked player in the class above. As a consequence, the classes directly beside each other are not separable, and we found that classifiers performed poorly when trying to distinguish between neighboring classes. We had significantly more success when we used the variables to predict class when the distance between classes was at least two.
Our method of determining variable importance across skill thus consists of a series of two-league classifiers, each based on classes two leagues apart (e.g. Bronze-Gold). We include a final classifier that emulates the contrastive (novice-expert) approach by comparing only the most extreme skill levels (Bronze and Professional). Although logistic regression is an option for two-class classification, we preferred to use the more flexible conditional inference forest algorithm, which has emerged from work on random forest classifiers [25], [26], [27] (for more information see the work of Carolin Strobl [28]). The main advantage of using conditional inference forests over logistic regression is that we do not need to make unnecessary assumptions about the structure of the relationship between the predictive variables and the response. Furthermore, these classifiers do not exhibit some of the biases present in other random forest techniques [29]. However, random forests in general do not come with significance tests, so we needed to adopt a suitable procedure, which is discussed below. The forests were created using the cforest function in R with ntree = 1000 trees and mtry = 5 variables per split. We assess the randomness in the algorithm by running the forest on samples of size 70% drawn without replacement from the original data twenty five times, as a distribution of importance scores is required by the decision procedure in Linkletter et al. [30]. The Conditional Inference Forest algorithm gives a measure of variable importance called permutation importance index for each variable, but it does not give a p-value for a hypothesis test against zero (see supplementary methods and materials in Materials S1). We follow Linkletter et al. by adding our own random noise variable as a control variable each time we subsample the data, and the 95th percentile of this distribution serves as a critical value for a test against the null [30]. 
It is important to note that this method does not control for a particular family-wise type 1 error. This is reasonable for screening research such as ours [30], where the goal is to identify variables worthy of further research. Our research sets the stage for further studies that will confirm the importance of variables identified here and probe the relations between them.
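The bookkeeping around the screening procedure described above can be sketched in Python as follows. This is a hedged illustration, not the authors' R code: it does not fit a conditional inference forest, and it assumes the permutation importance scores for each variable and for the added noise variable have already been computed on each subsample. The names `league_pairs`, `subsamples`, `percentile`, and `screen_variables` are hypothetical. Two further choices are assumptions of this sketch rather than statements from the text: the linearly interpolated percentile, and the rule that a variable is flagged when its median importance exceeds the critical value.

```python
import random
from statistics import median
from typing import Dict, List, Sequence, Tuple

LEAGUES = ["Bronze", "Silver", "Gold", "Platinum",
           "Diamond", "Masters", "Professional"]


def league_pairs(leagues: Sequence[str] = LEAGUES,
                 gap: int = 2) -> List[Tuple[str, str]]:
    """Two-league comparisons `gap` leagues apart, plus the extreme contrast."""
    pairs = [(leagues[i], leagues[i + gap]) for i in range(len(leagues) - gap)]
    pairs.append((leagues[0], leagues[-1]))  # contrastive novice-expert classifier
    return pairs


def subsamples(n_rows: int, fraction: float = 0.7,
               repeats: int = 25, seed: int = 0) -> List[List[int]]:
    """Row indices for `repeats` subsamples drawn without replacement."""
    rng = random.Random(seed)
    size = int(round(fraction * n_rows))
    return [rng.sample(range(n_rows), size) for _ in range(repeats)]


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile for q in [0, 1] (an assumed definition)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def screen_variables(importance: Dict[str, Sequence[float]],
                     noise_name: str = "noise",
                     q: float = 0.95) -> List[str]:
    """Flag variables whose median importance exceeds the noise critical value.

    `importance` maps each variable name to its importance scores across
    subsamples. Comparing the median to the critical value is an assumption.
    """
    critical = percentile(importance[noise_name], q)
    return sorted(name for name, scores in importance.items()
                  if name != noise_name and median(scores) > critical)
```

In use, one would loop over `league_pairs()`, fit a forest on each index set from `subsamples(...)` with a fresh random noise column appended, collect the importance scores, and pass them to `screen_variables`. Because no family-wise error correction is applied, the flagged variables are candidates for follow-up rather than confirmed effects.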

Discussion The primary finding is that predictors of expertise change in their importance across skill levels. We also demonstrated that a purely contrastive approach produces a distorted view of changes across expertise. These results make the interpretation of contrastive studies and the generalization from laboratory designs more problematic. They also show that the telemetric collection of data can confer deep benefits to the study of skill development. The results also show that RTS game replays in particular can track abilities of interest to cognitive science. The extreme compression of these cognitive-motor measures, the comparative ease of worker production in mid- to higher-skill players, and the increasing importance of using hotkeys are in keeping with the view that automaticity is an important component of expertise development. As some skills are automatized, cognitive resources are freed up for players to devote to learning other skills. Interestingly, this change would also have a profound impact on the learning environment and therefore shape future change. It is important to note, however, that transitions between skill levels may not reflect the process of automaticity alone. Ericsson, for example, argues that conscious control and management of learning are required for individuals to continue to improve particular skills [31]. In our sample, the use of hotkeys is especially pronounced in professional players (see Figures S4 and S6), and while this could be because using hotkeys requires substantially more experience than is available to non-professional players, it also may reflect consciously controlled training on behalf of professionals. The present work has several important limitations. First, our measure of skill, though more fine-grained than typical contrastive studies of expertise, is nonetheless ordinal. This hampers our ability to describe development in a continuous fashion.
Given that professionals train many hours a day, it would be helpful, for example, to chart the substantial development from Masters to Professional. The present design also fails to capture expertise changes at the individual level. The average Bronze player has 200 hours of experience, but there is no way to know if a particular player, given another 800 hours of practice, will end up in Masters league. Another limitation is that we have only a single game from each participant, and thus have no good estimate of the variability of individual performance. We also cannot say anything about individual differences in learning trajectories or whether there are multiple pathways to expertise. Finally, the present study is observational, and not experimental, and so causal relationships are not identifiable. Future work is needed, for example, to demonstrate that the number of workers created is automatized by showing that in higher leagues it is less prone to disruption by an additional cognitive load. While the above limitations apply to the present study, they are not limitations of the general paradigm of analyzing telemetric data from RTS games. Continuous measures of skill exist. In any competitive game, developers need to match players of similar skill to ensure the games are fair. This is often done with a continuous measure called a match-making rating. While these data are not always available to researchers, game developers are, at least in our experience, supportive of research efforts. Perhaps more importantly, the method can be adapted to longitudinal designs. Replay files are compact, meaning that many players have accumulated a record of literally every StarCraft 2 game they have ever played. This allows for the sampling of entire ontogenies of expert development in longitudinal studies of human performance on the microgenetic scale. Scientists can also test specific causal hypotheses in RTS games using existing game-modification tools.
With StarCraft 2, for example, the company includes tools which allow the modification of almost any aspect of the game. The modified games can be published online for other players to use. A massive sample of participants, randomly assigned to conditions by the modified game, can thus be collected telemetrically. The kinds of manipulations used to understand chess expertise, for example, are easily implemented, but with larger and more diverse datasets. This was the dream of “Space Fortress”, a game designed by Mané and Donchin [32] to study cognitive-motor development. They wrote: “The goals were (1) to create a complex task that is representative of real-life tasks, (2) to incorporate dimensions of difficulty that are of interest based on existing research on skill and its acquisition, and (3) to keep the task interesting and challenging for the subjects during extended practice” (p. 17). Space Fortress studies have used up to 40 hours of training [33], and this is far more than most skill learning experiments. While this is admirable, the present method can do better. The least skilled group in our study, the Bronze players, report 200 hours of experience on average. Lewis, Trinh, and Kirsh [34] demonstrated that researchers could analyze telemetric data from video games, like StarCraft, that are already extremely popular. Our study develops this paradigm further and motivates additional research into StarCraft 2, which has millions of players worldwide, and allows for easy telemetric data collection, skill verification, and even experimentation. Of course, the research opportunities extend beyond StarCraft 2, as the features making these virtues possible are becoming more common in video games generally. We have argued that the present paradigm has tremendous advantages on its own, but it can also be used to guide researchers using other methods.
For example, if one were interested in studying neural changes involved in multitasking, our data suggest that at least one of the multitasking challenges of StarCraft 2 is overcome in the early leagues (see workers trained per minute, Figure S10). Given the difficulty of acquiring professional players, and the expense of neuroimaging studies, knowing when these skills develop allows researchers to efficiently target specific changes of interest. In this way analysis of telemetric data can provide a kind of map of skill development that can serve as a guide for a variety of research tools and paradigms. In light of the improvement this method provides over the typical contrastive methods, we propose that RTS games can serve cognitive science as a ‘standard task environment’ [35], as drosophila have served biology. As the number of domains of expertise that are predominantly computer mediated increases, so will the relevance of telemetric data to the study of complex learning. As human computer interactions involve more sensors to record human behavior (such as eye-tracking and biometrics) more interesting real-world performance can be recorded and leveraged to make significant advances in our understanding of human cognition and learning.

Acknowledgments The authors would like to thank Jordan Barnes, Caitlyn McColeman, Kim Meier, James MacGregor, Gordon Pang, Jozef Pisko-Dubienski, Alexander Lee, Betty Leung, Scott Harrison, Marcus Watson, and all the members of the Cognitive Science Lab for help on many aspects of the project. We would also like to thank Dario Wünsch, a player on Team Liquid for giving us some insight into the life of a StarCraft professional. We would like to thank Vincent Hoogerheide, Matt Weber, and David Holder for invaluable assistance during data collection. Finally we’d like to thank all the StarCraft players who submitted games to www.skillcraft.ca.

Author Contributions Analyzed the data: JJT MRB AJH LC. Contributed reagents/materials/analysis tools: JJT MRB AJH LC. Wrote the paper: JJT MRB AJH LC.