Abstract Despite dozens of empirical studies and a growing body of meta-analytic work, there is little consensus regarding the efficacy of cognitive training. In this review, we examine why this substantial corpus has failed to answer the often-asked question, “Does cognitive training work?” We first define cognitive training and discuss the general principles underlying training interventions. Next, we review historical interventions and discuss how findings from this early work remain highly relevant for current cognitive-training research. We highlight a variety of issues preventing real progress in understanding the underlying mechanisms of training, including the lack of a coherent theoretical framework to guide training research and methodological issues across studies and meta-analyses. Finally, suggestions for correcting these issues are offered in the hope that we might make greater progress in the next 100 y of cognitive-training research.

The results cited in this book are so extraordinary as to challenge attention both from psychologists and educators, for if it be possible, by devoting ten or fifteen minutes daily to simple exercises, to accomplish the results which are claimed, it would appear to be incumbent upon all teachers to institute such exercises and to regard them as a very essential part of schoolroom training. If, on the other hand, Miss Aiken’s results cannot be duplicated, it is equally important to establish this fact and then, if possible, to find out the cause of the discrepancy.

Whipple, 1910 (1)

In the movie Groundhog Day, Bill Murray’s character, a television meteorologist, is forced to repeat the eponymous holiday over and over again until he learns from his many mistakes and finally has a “perfect” day. Researchers who study cognitive training seem to be having the same experience as Murray’s character: A cursory study of news articles seems to reveal a new “brain-training works” or “brain-training doesn’t work” headline every single week. However, Murray’s character had an advantage that psychologists seem to lack; he was fully aware of his previous experiences and could synthesize all he had learned. Cognitive-training researchers, however, seem to generate an endless cycle of replications and meta-analyses without much progress toward any real consensus—and with little awareness of the long history of cognitive-training practice and research. They are likely unaware that they are living out Santayana’s famous aphorism, “Those who cannot remember the past are condemned to repeat it.”

The goal of this article is to help the reader gain a historical understanding of “mind” or “brain” training in the hopes that the field can begin to learn from the past. We speculate about why cognitive training has been attempted for centuries and why cognitive and educational scientists have spent over 100 y empirically testing its promise. A recent, excellent, review by Simons et al. (2) addresses both the claims made by proponents of brain training and also the methodological strengths and weaknesses of cognitive-training research. That paper provides recommendations for improving the quality of research in this field. We endorse these recommendations. At the same time, we point out that even following stringent methodological protocols may not ultimately lead to an understanding of the potential efficacy of different types of cognitive training for different populations. Thus, we provide a historical perspective on cognitive training research that suggests that asking the question, “Does cognitive training work?”—even with a well-designed study—is not an adequate means of better understanding the underlying mechanisms that may support these interventions.

What Is Cognitive Training? Cognitive training (or “brain training,” or “mind training”) refers to activities designed to make people “smarter” and thus better at reasoning, problem solving, and learning. Many current cognitive-training programs target basic cognitive skills such as attention (the ability to selectively attend to relevant information), working memory (the ability to actively keep in mind task-relevant thoughts), or executive functions (the set of processes involved in controlling and regulating thought and action). The focus on these processes arises from the fact that there are very real limits on their capacity and that individuals differ in terms of these limits (3); that they are necessary for complex, intelligent behavior (4); and that they are highly correlated with individual differences in intelligence, academic achievement, and life outcomes (5, 6). In recent years, many cognitive-training programs have used tasks that were originally created to help us understand these processes (e.g., refs. 7, 8). Given their importance, it is not surprising that researchers have long been interested in their potential for malleability (9). Although questions of how these processes operate are not fully resolved (see, e.g., refs. 10, 11), we know that the prefrontal cortex is the primary brain region associated with them (12, 13). Historically, similar, but noncomputerized, tasks were used in attempts to enhance attention and memory (14). Other interventions embed these basic processes in other activities, such as play (e.g., Tools of the Mind) (15). More complex activities that are thought to transfer to skills like reasoning are sometimes incorporated into cognitive training as well (e.g., problem solving). In addition to activities developed specifically to enhance cognition, sometimes off-the-shelf activities (e.g., video games, board games, dance, music) are used for the purpose of improving reasoning and problem solving (16, 17). 
We know that practice on any of the activities above leads to improved performance on those same activities, but to what extent do those improvements matter for other, untrained tasks? For some things, we assume that practice does transfer to other situations. A basketball player who lifts weights or practices sprinting does so not to improve those basic skills but to become a better basketball player (these athletic examples are also apt because of the too-frequent assumption that fade-out effects mean that training programs have failed, whereas it may simply be that, as with physical exercise, continued practice is necessary to reap the benefits). The degree to which practice-based improvements transfer to other cognitive tasks is, however, a matter of controversy (and has been for some time; see refs. 18, 19). Cognitive skill improvements are relevant for a wide range of populations, from older adults whose cognitive capacities might be in decline to fighter pilots who need to perform at peak capacity. Additionally, children who have attention deficit hyperactivity disorder (ADHD), experiences of early stress due to poverty, nutrition deficits, and so forth all might benefit from cognitive interventions. In fact, cognitive training is potentially relevant for everyone, even those whose abilities are within the normal range. Thus, it is not surprising that so many people are interested in mind training. In fact, public interest in activities designed to improve basic mental skills is at least as old as the Buddhist tradition (20).

Lessons from the Past

The mind is very hard to check. And swift it falls on what it wants; The training of the mind is good, A mind so tamed brings happiness.

Dhammapada, third century BCE (21)

Sources from antiquity, such as this passage from the Dhammapada, suggest that humans have long recognized that being able to attend to the world and inhibit distractions is key to a successful mental life and that these capacities may be improved through practice. Before the industrial revolution, many documented mental training activities in the West, such as the mnemonics of Simonides and Saint Thomas Aquinas, focused on long-term memory (22), although Plato was cognizant of the idea that training in arithmetic could impact general mental quickness (23). By the 1880s in America and Europe, compulsory education was generally accepted as a public good, and the disciplines of psychology and neuroscience were formalized. However, it was among the many spiritualist movements, dubiously grounded in the world of animal magnetism and odd bromides, that mind training would return to the forefront. Pamphlets for programs of this time, such as the Ralston Brain Regime, “designed to develop perfect health in the physical brain, strengthen the mind, and increase the power of thought” may have literally been sold alongside snake oil supplements (24). Another program, Pelmanism, brought mind training to popular consciousness across the United States and Great Britain. It combined self-help invectives with the completion of repeated cognitive tasks; one activity it included was the card game Concentration. Pelmanism was probably the first example of widely available commercial brain training and at its peak counted over 500,000 customers worldwide (25, 26).
By World War II, however, the scientific community firmly saw Pelmanism as lumped in with “autosuggestion … unfired food, dietetic and psychological magic” (27) and other offerings in which “prestige and profits are acquired by dubious interventions in the lives of others” (28). Although the “Pelman Institute” was printing its first little gray books in the late 1890s, Catherine Aiken, a Quaker schoolteacher in Stamford, CT, developed a system of attention training for the girls in her school at least a decade earlier. Like many cognitive interventions of today, it required pupils to spend about 15 min a day on short attention and memory activities. In 1894, Charles Dudley Warner described her system in Harper’s (29). In one activity, “a collection of figures was placed upon the reverse side of a revolving blackboard, then quickly turned; the figures were instantly recognized in their order” (i.e., a working memory task). To make the task more difficult, some exercises required students not only to memorize numbers but also to apply various arithmetic operations to them. Her program also included subitizing practice long before Kaufman coined the term; the Harper’s account describes it as follows: “Another exercise which developed quick perception is that of ‘unconscious counting,’ or of immediately recognizing the number of a group of objects without counting them” (29). Following the release of the Harper’s article, Aiken published two books, Methods of Mind-Training: Concentrated Attention and Memory and Exercises in Mind-Training: In Quickness of Perception Concentrated Attention and Memory (30, 31). These books describe her program in detail and provide fascinating accounts of the “action research” Aiken carried out. 
Aiken was interested in developing more than her students’ attention, however: “This power of concentration has been sought for, not with the idea of making mere memorizers, but in order that they may be able to recall promptly what they have gathered from the great realm of facts and principles, so as to hold it in the mind as a basis of reasoning, and ultimately, for the purpose of possessing well disciplined and self-controlling minds” (30). In other words, Aiken believed that her program led to successful transfer. But even early accounts of Aiken’s work were skeptical of this claim. L. H. Galbreath, of the University of Buffalo, wrote of Aiken’s training program: “However, a great danger for theoretical and practical pedagogy arises out of the assumption of the possibility of training a power to attend to things in general from special formal exercises. Because one acquires special power to attend to things of sight, it does not follow that he can attend with equal skill and efficiency to sensations of sound” (32). Aiken was eager to establish that her work was not associated with “animal magnetism, hypnotism, and other isms” (30). What is truly remarkable about Aiken’s program (and what sets it apart from the “isms” of the time, such as Pelmanism) is the attention it received from psychologists and educational researchers. Aiken’s first book on mind training, for example, includes an encouraging letter from G. Stanley Hall, the first President of the American Psychological Association. In 1907, G. M. Whipple presented research conducted on Aiken’s program at a meeting of the American Association for the Advancement of Science. A brief account of this presentation was published in Science the following year (33), and a detailed account of the study appeared 2 y later in The Journal of Educational Psychology (1). To evaluate Aiken’s system under “laboratory conditions,” Whipple, then at Cornell, conducted two experiments.
Experiment 1 tested six college students before and after practice on the letter memorization component of Aiken’s exercises for approximately an hour a day. Experiment 2 included three adult participants and a broader range of Aiken’s exercises for 3 h a week for 7 wk. Whipple used a tachistoscope, rather than Aiken’s revolving chalkboard, to prevent “eye-moving or the roving of attention” (1). Whipple found no evidence that training on these exercises led to any general improvements. Instead, he found “a very slight effect” of practice “which is easily explicable in terms of habituation to the experimental conditions and of development of the ‘trick’ of grouping.” W. S. Foster, a student of Whipple’s, supplemented the original study with what he believed was a significant improvement: Each participant was a trained psychologist. Foster arrived at similar results (34): “That training in these experiments has made the observers noticeably better observers or memorizers in general, or given them any habits of observing closely or reporting correctly, or finished any ability to meet better and situations generally met with, neither we nor any of the observers themselves believe. It seems, therefore as if the value of formal training of our kind had been greatly overestimated.” Foster and Whipple both noted that their experiments were imperfect and that the level of participant experience or the duration of practice and age of the participants may have impacted their results; Whipple himself adds a footnote expressing regret at being unable to conduct his experiment with children. However, he also adds, in reference to the issues above, “neither of these objections seem to us of great moment; we feel that our observers had reached their maximal efficiency, and we are unable to believe that children could be brought to exhibit a range of apprehension so markedly superior to that of competent and well-trained university students and instructors” (1).
Although Whipple and Foster did not conduct follow-up experiments with children, Karl Dallenbach (a student of Titchener) did. Dallenbach conducted his study within the Ithaca, New York public school district, with 29 students. Students trained for 10 min daily for 17 wk, with progressively more difficult material; furthermore, pretests, posttests, and follow-up tests (41 wk after training) were created with untrained material (35). Unlike Whipple and Foster, Dallenbach found that his students did improve, particularly those initially classified as having “poor” performance. These improvements persisted at follow-up. Dallenbach collected not only grades of his students (which rose following the intervention) but also performance on an early Binet Test of Attention. Furthermore, Dallenbach compared the trained students’ performance to that of a set of students who had not been trained (although this test was administered only at posttest). Dallenbach noted that grades were significantly higher following the intervention and that students who completed the training outperformed those who had not done so on the Binet attention test. Like Whipple and Foster before him, Dallenbach was aware of many of his methodological limitations, but he arrived at very different conclusions: “Our more lengthy experiments with children, however, have not only showed more decided practice-effects, but have also rendered it at least quite possible, if not practically certain, that these practice effects have brought about a permanent modification in the mental traits exercised, and what is more, a modification that certainly seems to have made itself felt in a number of ways outside the special tests we made (as in an improvement in school work and increased efficiency long afterward in supplementary tests of observation and report)” (36).
In 1919, Dallenbach repeated his experiment with children with cognitive deficits and arrived at a similar conclusion (37), a somewhat remarkable effort given that even some modern researchers improperly generalize training findings between different study populations. Whipple, Foster, and Dallenbach eventually moved on from this research, but interest in the improvement of basic cognitive skills continued. There are far too many individual studies to list in a single review, but we briefly detail some here. Some of these were used quite widely, such as Feuerstein’s “Instrumental Enrichment” program: Reuven Feuerstein, working from a Piagetian perspective in the 1960s and reflecting on his experiences with young Holocaust survivors, created a variety of facilitator-administered pen-and-paper tasks meant to improve memory and attention in school (38). Others were designed for specific purposes, such as the Space Fortress computer game, funded by the Defense Advanced Research Projects Agency (DARPA) with the ultimate goal of improving prefrontal function in highly cognitively demanding jobs (e.g., fighter pilots) (39). In addition to attempts to improve basic cognitive processes, educators and psychologists tested the potential effectiveness of reasoning training, logic, philosophy, and even Latin language learning to assess whether or not these skills could impact academic achievement and thinking more generally. For example, the “academic games” designed by Layman Allen in the 1960s, including WFF N’ Proof and Equations, were the subject of at least two controlled studies (40, 41). Programs developed or used outside of English-speaking countries were often given less attention. For example, Project Intelligence, a reasoning training program developed for classroom use and offered widely in Venezuela, saw minimal adoption in the United States (42). 
Also, the children’s concentration program developed by Kossow and Vehreschild in East Germany in the early 1980s generally goes unmentioned, despite promising findings (43, 44). Finally, although many children of the 1980s remember Logo as their introduction to programming, many do not realize that an original motivator behind the program was the development of cognitive skills more generally (45). The studies testing these interventions included transfer tests of various kinds, including IQ tests, academic achievement, and complex reasoning. Many of these studies led to critical discourses that recalled the debates regarding Aiken’s intervention; for example, see Stanley and Schild’s 1971 response to Allen’s work (46) or Shayer and Beasley’s (47) discussion of Feuerstein’s Instrumental Enrichment. What can we conclude from these early studies? First, cognitive training has long been a subject of heated debate in psychology and education. Second, the focus of the research has, from the very start, been on whether it works rather than why it might work, under what conditions, and for whom; this is despite the fact that early researchers noted that these factors mattered. Third, even though researchers noted limitations in their methods (sample size, age, amount of training, experimental design, etc.), they nonetheless felt comfortable drawing sweeping conclusions about the effectiveness of cognitive training in general. When scientists began to focus on specific questions about cognitive training for specific populations (e.g., can Space Fortress help military personnel perform complex tasks), they received less attention, possibly because neither scientists nor the media were inappropriately generalizing the findings. For some readers this section may rightly bring to mind the old debate over “formal discipline”—that is, the question of whether training or experience in one area may transfer to another skill or intelligence more generally (see ref. 
48 for a more recent study and ref. 18 for an early historical review). This was often used as justification for teaching Latin or math, even if these subjects did not have obvious utility in everyday life. Whipple and Dallenbach were well aware that their studies were some of the first direct experimental investigations of this theory and that the issue had not been definitively settled by Thorndike and Woodworth in 1901 (19). We note that there is evidence that Whipple remained intensely interested in questions about the efficacy of mind training and the mechanisms of transfer in the years following his initial study. In his preface to C. P. Wang’s 1916 dissertation (49) on visual sense training in children, he wrote that “contributions to the experimental study of the transfer of training (formal discipline) scarcely need either apology or introduction in a period when, despite the considerable amount of investigation, so very much remains undetermined with respect to the amount of such transfer and the mechanism by means of which it takes place.” Perhaps it was still on his mind in 1922, when he completed his educational psychology textbook Problems in Educational Psychology (50). One of the problems in the book reads as follows: “Catherine Aiken describes a series of exercises (columns of figures, groups of dots to be counted, important dates, sets of drawings, etc.) to be placed on a revolving blackboard, which is then whirled about before the pupils in such a way as to expose the material for a few seconds only. These exercises are strongly urged as a means of developing concentrated attention, quick and accurate observation, and of accelerating the whole process of learning. Miss Aiken reports very wonderful results from the use of such exercises for five or ten minutes daily.” He then asks, “Is there psychological warrant for the use of such exercises as a means of developing attention and observation? 
Would you advocate the introduction of such exercises as a stock feature of school training?” A review of the recent literature illustrates that the questions that Whipple fixated on remain unanswered today.

Lessons from the Present In 2008, Susanne Jaeggi and Martin Buschkuehl published their graduate work in this journal (7); they found that practice on a dual n-back task led to improvements in fluid intelligence. The dual n-back task requires individuals to listen to a stream of letters and judge whether a letter was the same as the one presented n trials previously, while simultaneously viewing a set of boxes on a screen and judging whether the same box “lit up” n trials previously. Fluid intelligence is defined as the ability to solve abstract, novel problems that require little knowledge and was measured before and after training by matrix reasoning tests that require participants to judge which of several options best fits into an array of figures. Tests of fluid intelligence are correlated with working memory and prefrontal function more generally (5) because they require keeping track of and testing numerous rules during the course of problem solving (4). The dramatic improvements detailed by Jaeggi et al. (7) received a considerable amount of attention from the scientific community and the popular press. Additionally, companies offering cognitive-training software often took advantage of their findings for marketing purposes (e.g., Learning RX, Lumos Labs, and CogMed). The media hype around Jaeggi’s paper emphasized its putative novelty; for example, Madrigal (51) wrote in Wired, “Fluid intelligence was previously thought to be genetically hard-wired.” As the historical summary above suggests, such claims were inaccurate. Many studies explicitly tested and found improvements in fluid intelligence, even if they did not necessarily use the term fluid intelligence.* More generally, there has always been a debate regarding the relative importance of nature (i.e., genetics) and nurture (i.e., experiences) in the development of intelligence. 
One would be hard-pressed to find a scientist who argues that intelligence (including fluid intelligence) is entirely genetically determined and not at all affected by experience. The Jaeggi et al. (7) paper is a bit more nuanced in discussing this issue than the media reports and acknowledges that there is a history of cognitive training research but that successful transfer has been difficult to achieve. The interpretation of the Jaeggi et al. (7) study in terms of a paradigmatic shift within a false dichotomy of fixed versus malleable intelligence, with little attention to historical context, is one reason for the swift critique the study received. There were also numerous concerns regarding the research methods of the study, most notably the lack of an active control group. Furthermore, scientists were concerned that the general public might expend resources on unproven products, possibly to the detriment of other beneficial activities. Adding to the controversy, Redick et al. tried to replicate the Jaeggi findings with a somewhat better controlled trial but found no evidence of gains in fluid intelligence (57). Many additional studies continued to ask the question, “Does cognitive training improve intelligence?” In the next sections, we discuss why we cannot, as yet, answer that question. Indeed, we argue that it—like the question “Does medicine cure disease?”—is inappropriate. That we continue to ask it, over 100 y after the studies of Whipple and Dallenbach, should give researchers reason to pause and take stock. Why do some studies find a positive impact of cognitive training whereas others do not? One reason is that “cognitive training” refers to such a broad range of activities (e.g., commercial programs like Cogmed, laboratory tasks such as the n-back, and off-the-shelf games). It is not possible to draw conclusions regarding cognitive training as a whole with a single empirical study. 
The extent to which one can reasonably generalize from one intervention to others is not clear, and we are not yet well aware of what intervention characteristics may be important for transfer. Consider, for example, different working memory interventions. In addition to the n-back task, one can train working memory by having individuals remember sequences of items (i.e., span tasks; see ref. 58). Training might be spaced across time or take place within a shorter time frame (59). And studies may involve a fixed block of training (say, across one month) or add “booster” sessions later on (60). Cognitive interventions may vary on numerous other dimensions (which processes are practiced, type of instructions, game-like features, amount of training, computerized or not, etc.). Studies also differ in terms of the samples tested. Recall two studies mentioned earlier: Jaeggi et al. (7) used students from the University of Bern, Switzerland, and found successful transfer to fluid intelligence. Redick et al. (57) used students from Michigan State, Georgia Tech, and nonstudents from the Atlanta area and did not find transfer. There are other methodological merits and concerns regarding both studies, but the populations examined in each study are different enough that it is possible that the divergent outcomes could be driven by demographics. For example, two factors that may influence whether one benefits from training are socioeconomic status and motivation (61, 62). The list goes on: personality, age, baseline ability, and many others (63). But many studies do not examine these characteristics, and too few researchers take the step that Dallenbach did early on to replicate his training study with different populations, such as children with cognitive difficulties. Thus, we cannot know the extent to which they influence performance. It is also difficult to judge whether or not interventions are effective above and beyond the influence of various confounding factors.
Consider, for example, a study that tests whether improvements on an intelligence test are due to a placebo effect by asking participants about their beliefs. Unfortunately, this too is problematic. Hundreds of participants may be required to adequately test whether or not a construct with a true moderate effect size had an impact on an outcome variable above and beyond a reasonably reliable confound (64). Attempting to statistically control for several factors may require impractically large sample sizes. A related problem is that many studies test participants on a large number of laboratory tasks or surveys but lack the sample size needed to conduct multiple comparison corrections. Our own studies suffer from this concern, as do many others. One concern associated with having a large number of transfer measures or a large sample size has to do with the quality of testing implementation. Outcome measures, when given to participants in rapid succession, shortened for time constraints, and administered over several hours, may be less reliable than ideal. The Redick et al. (57) study may have exactly this problem: Participants performed 17 demanding cognitive tests, several of which were shortened. Although the reliability of their tasks is normally high in standard administration, reliability under these conditions is not clear. In general, one ironic aspect of cognitive training research is that a large sample size is crucial, but studies with large sample size have their own problems. Studies with large sample sizes often have much less control over the training regimens or quality of data collection. The Owen et al. (65) study, which included thousands of participants, is one such example; the administration of tasks is not at all standardized, and training dosage was highly variable. Another concern is presentation of post hoc or selective analyses (66). 
In a study based in one of our laboratories, for example, we tested children on a battery of tests and compared performance of a group that received a single n-back training with a control group that learned science facts (62). Overall, there was no impact of the cognitive training intervention on our measures of fluid intelligence. However, upon noting vast individual differences in improvement on the n-back task, we tested whether or not children who actually improved in the training also improved on matrix reasoning. We did find improvements for this group. Also, children who viewed the training as “too difficult” did not get better on the training task. We interpreted these findings to mean that some students were easily discouraged and thus did not benefit from the intervention. Our findings could also be explained by assuming that people who can learn well improve from their experiences during training and are also more likely to benefit from taking the same test twice. This is a valid alternative explanation. Although we prefer our own interpretation, we must collect data in a new study in which we explicitly test it to be confident in our conclusions. One final limitation is that there is minimal testing for real-life outcomes (e.g., How much better does a child do in school?). Instead, most outcome measures are laboratory tasks, surveys, or standardized tests. Many studies use performance on matrix reasoning tests as their main outcome measure. Although performance on such tests is correlated with real-life success (e.g., the ability to learn new facts), scoring better on these tests does not mean that one will actually be better in real-world tasks. Some studies do include some real-life outcomes or ecologically valid tasks (e.g., refs. 67, 68), but these studies are few. The above list is not exhaustive but is intended to give the reader some idea of why most studies are far from conclusive.
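The multiple-comparisons concern noted earlier in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions: the tests are treated as independent, the significance threshold of 0.05 is a conventional choice rather than one taken from any study discussed here, and the count of 17 simply borrows the number of tests reported for ref. 57 as a convenient example. The function names are hypothetical.

```python
# Illustrative sketch only; function names and the alpha level are assumptions.

def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests,
    each run at the uncorrected threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


def holm_reject(p_values, alpha: float = 0.05):
    """Holm step-down procedure: returns one True/False decision per p-value,
    in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first failure.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


if __name__ == "__main__":
    n = 17  # illustrative count of outcome measures
    print(f"Uncorrected familywise error rate: {familywise_error_rate(n):.3f}")
    print(f"Bonferroni per-test threshold:     {bonferroni_alpha(n):.5f}")
    print(holm_reject([0.001, 0.04, 0.03]))
```

Under these assumptions, testing many outcomes without correction makes at least one spurious "transfer" finding more likely than not, whereas the corrected per-test threshold becomes so strict that a small sample has little chance of detecting a modest true effect. This is the tradeoff the text describes.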
Although it may be easy to scoff at the tiny samples and limited methods used by Whipple in his century-old cognitive-training experiments, contemporary studies, including our own, often share similar issues. Why would psychologists design studies that are underpowered or that have clear methodological problems? In part, they do so because there is a tradeoff such that avoiding one problem (e.g., sample size) leads to another problem (e.g., poor control over intervention). Researchers include a variety of transfer measures all designed to answer different questions: to see if there is change in fluid intelligence measures, academic achievement measures, or assessments of basic skills that underlie more complex measures. It is practically impossible, because of cost and time constraints, to recruit enough participants to make up for the large number of planned statistical tests. One solution to address issues of small sample size and differences across individual studies is to use quantitative meta-analyses. Unfortunately, the extant meta-analyses arrive at very different conclusions and do little to settle the issue (69–74). Although Au et al., Karr et al., and Karbach and Verhaeghen conclude that training executive functions like working memory may be effective in improving capacities such as fluid intelligence, Melby-Lervag and Hulme suggest that transfer gains are nonsignificant or minor at best. These varied outcomes arise because of key differences in how they were conducted such as the populations included and the type of intervention used. These decisions, along with the choice of statistical procedures, have a substantial impact on the outcomes of meta-analyses. A nice demonstration of this point is a pair of analyses conducted by van Elk et al. (75) about the effect of religious priming on prosocial behavior. 
Responding to a meta-analysis that found that religious priming has a positive impact on prosocial behavior in religious participants (76), van Elk conducted two publication bias correction analyses [precision-effect testing–precision-effect estimate with standard error (PET-PEESE) and Bayesian] using the same data as Shariff et al. Although each of these methods is reasonable, they ultimately arrive at different conclusions. Furthermore, to return to the medication analogy, it is impossible to draw conclusions about a broad question (does medication work?) by combining studies of different medications and illnesses in a single analysis. The conundrum is that each individual study differs on so many dimensions that statistically accounting for these differences may be, in essence, the equivalent of reducing the sample size back to the level of individual studies. And as Stegenga (77) writes, “meta-analysis fails to provide objective grounds for intersubjective assessments of hypotheses because numerous decisions must be made when performing a meta-analysis which allow wide latitude for subjective idiosyncrasies to influence its outcome.” Despite the limitations outlined above, there is some agreement that although “far” transfer may not be possible, “near” transfer is easier to achieve (72, 73, 78). In the context of cognitive training, near transfer would mean the improvement of the underlying construct being trained (e.g., working memory) per se. However, when some researchers refer to near transfer, they may also mean “superficial transfer.” In this case, improvement on tasks similar to trained tasks may be due to the acquisition of a superficial strategy. Suppose a working-memory intervention is designed like the popular 1970s game Simon (in which there are four lights arranged in a circle and the task is to remember the order in which they lit up). 
It is possible that with practice you learn to use the strategy of remembering numerals on a clock (3, 6, 9, 12); you get better at the game and also at remembering numbers, but it does not mean you have a better working memory. Researchers try to avoid superficial transfer by selecting training and transfer tasks that do not readily allow for the use of narrow task-specific strategies and often include multiple training tasks to reduce the likelihood that specific strategies are developed (see ref. 79). But it is impossible to ensure that strategy development is not responsible for transfer. In fact, we know that practice on prefrontal tasks typically involves strategy development (80), and if these strategies do transfer to other contexts, they may have practical value. Furthermore, one recent study found that participants reported acquiring grouping strategies during working-memory training and applied those strategies to near transfer working-memory tests (81). Von Bastian and Oberauer (82) offer a list of strategies that could explain improvements on both near and far transfer tasks and suggest that transfer tasks must be selected with these strategies in mind. If we accept that there is some measure of “real” near transfer from cognitive training and the skills trained serve as rate-limiting factors to more complex cognitive task performance, then it is somewhat puzzling that far transfer is not found. One explanation may be that what we assume to be real near transfer is actually superficial (83). Alternatively, the near transfer is real but not adequate for far transfer without an additional skill. If a person’s working memory improves as a function of training but they still have low vocabulary skills, then performance on a reading comprehension measure may not show improvement. Likewise, getting better at measures of fluid intelligence might require both a better working memory and better reasoning strategies that engage this new ability. 
Research that explicitly tests such possibilities has yet to be systematically conducted. Additionally, if far transfer but not near transfer is observed, the theoretical underpinnings of far transfer improvements become difficult to discern (76). A “transfer” finding could reflect positive mood, motivation, or a placebo effect. A recent study of placebo effects found that, at least in a single session of “training,” expectations about improvement may be driving gains (84). However, many researchers, even those highly critical of studies finding far transfer, have concluded that near transfer actually can be found reliably and, presumably, that it is not merely superficial (70, 72). Finally, we note that despite the difficulties in establishing the consistent presence of real transfer effects following training interventions and the cognitive mechanisms associated with this transfer, we would be remiss to omit the considerable work that has been conducted into the underlying neural mechanisms of this transfer (see refs. 85–87 for reviews of this work). Although the present piece does not focus on these studies, we note that this research is useful in that it may illuminate the neural correlates of training and, perhaps most importantly, could help provide complementary evidence to establish whether transfer effects are in fact “superficial” or real. We welcome future work incorporating neurophysiological techniques but also emphasize that, ultimately, behavioral outcomes are most important and most relevant for the end users of cognitive training.

Lessons for the Future

Most psychologists agree that meaningfully improving fluid intelligence, reasoning, and executive function is incredibly hard. The example of the Abecedarian study (a randomized trial of early childhood education for low-income children) is often mentioned and with good reason—it demonstrates how many hours of intervention may be necessary to have long-lasting effects (88). Psychologists largely split into two camps on this issue. For many, improving cognitive skills through direct intervention is so challenging that it may not be worth the work required. Improving these capacities requires effort that should be better spent elsewhere. For others, this difficulty is more of a challenge than a barrier. We fully admit to belonging in the second camp. If there is any hope for meaningfully improving the capacities that underlie a child’s (or adult’s) ability to learn and think, this research is worth pursuing. There is a vast, unexplored space between lengthy interventions such as the Abecedarian project and brief interventions like n-back training. How efficiently can we improve cognitive abilities? One possible approach is developing interventions that combine exercises that tax prefrontal processes with reasoning instruction and practice. Indeed, some research suggests that playing reasoning games that include prefrontal demands (e.g., off-the-shelf games like Set), or learning and practicing reasoning strategies, may be especially effective (17). There is also evidence that some video games (such as Portal) may improve reasoning and other prefrontal functions (89, 90). 
Furthermore, there are entire fields of intervention research—such as work with cognitive-behavioral therapy and ADHD, music training to improve auditory cognition, or useful field-of-view training to reduce accidents while driving—that suggest that interventions that share some features with the cognitive training discussed here may be effective in delivering real-world cognitive improvements (91–93). Of course, these lines of work are not excepted from the methodological issues outlined above, and it remains largely unknown whether they ultimately may have a meaningful effect on everyday cognition for the general population. We acknowledge that existing cognitive-training interventions have not yet demonstrated clear real-life impact. The remainder of this piece will focus on why that is—and what must be changed if we wish to have a chance of success. Earlier we detailed some of the issues inherent to using meta-analyses as a means of coming to a consensus regarding the efficacy of cognitive training. For many psychologists, including van Elk et al. (75), there is an obvious solution to the weaknesses of meta-analysis: registered replications. We agree with this sentiment in principle. Furthermore, the movement to preregister studies in general, so that both methodological and analytical decisions are recorded a priori, should help to address some of the issues laid out above. However, the implications of even a well-done, adequately powered, and preregistered study or replication must be drawn with caution. Implicit assumptions about this research—that is, that it has a high internal validity—are absolutely appropriate. The problem is that the media and policymakers may fail to realize that studies with high internal validity may nevertheless have poor external validity. External validity refers to the extent to which an empirical finding can be generalized to other contexts (94). 
In cognitive training, this means that it may not be possible to generalize the results of one study to different types of interventions, populations, or contexts. This point applies not only to studies with positive outcomes but also those with negative outcomes. As mentioned earlier, one common interpretation of intervention studies is that because their benefits often fade out, they are not worthwhile. However, that an intervention does not have long-term impact does not mean that it is not useful but rather that some sort of continued enrichment may be necessary (95). But again, we return to the question “Does cognitive training work?” We have already discussed the folly of asking such a binary question on such a complicated topic. Allen Newell, in his classic piece that inspired our title (96), points out that psychological science operates on two levels: one in which there are incremental, specific studies that elaborate on a phenomenon (e.g., Does cognitive training work better when practice is spaced or massed?), and one that asks fairly large binary questions (e.g., nature vs. nurture). But what is missed is a more unified approach that allows us to better understand “the behavior of man” (p. 6). Newell offers one highly relevant strategy to achieve this: to center experimental and theoretical work around “a single complex task” and in the service of this develop a coherent theoretical model supported by many smaller studies (e.g., ref. 97). If cognitive-training researchers were to take up his recommendations, they would need to develop computational process models of prefrontal function and intelligent behavior. Hypotheses about how the model improved on the training task and how that would generalize to a transfer task could then be tested via the model. Empirical studies would support model development via microstudies that help generate parameters and also studies that test the model’s predictions. 
In particular, empirical studies should test possible underlying mechanisms that support transfer. As the Buffalo Springfield lyric goes, “there’s something happening here, but what it is ain’t exactly clear.” In cognitive training, we have identified what appears to be a compelling phenomenon, but without an overarching theoretical framework to guide empirical research, progress in understanding this phenomenon will likely be stalled. As we discussed earlier, there are often practical limits on conducting high-quality studies of cognitive training. At the same time, there are other conflicts of interest and motivational factors that influence which studies are conducted and ultimately published (98). In general, the studies that are most likely to appear in the press or high-impact journals are those that have novel, unexpected, and clearly impactful results. Studies with null effects, or those that replicate and incrementally test the boundary conditions of a finding, are perceived as much less valuable. Additionally, scientists are not immune to the idea of “motivated reasoning.” If they have strong beliefs or motivations inconsistent with the results of the study, they are easily able to find flaws (see the classic study by Lord, Ross, and Lepper) (99). But when they wish to believe a finding, the flaws are less visible. If there were no multimillion-dollar cognitive-training industry, the field would be much less controversial. And yet, this is the world in which we live. So what do we tell parents who want to know whether these programs can help their children? We know that proper nutrition (100), sleep (101), and physical exercise (102) are beneficial for cognitive development, and such factors need to be addressed whether or not children engage in cognitive training. 
However, it is clear that there is little to be lost, and possibly much to be gained, through engaging in cognitively enriching activities (e.g., cognitive training but also music, dance, meditation, board games, etc.). At the same time, we hope that consumers will be on guard against strong promises offered by purveyors of cognitive-training programs. Consider the continuing allure of “brain-based” marketing techniques. Although recent work has provided experimental evidence for these strategies as tools of persuasion (103, 104), their dangerous efficacy has been clear since the days of Pelmanism over a century ago. When neuroscience is evoked, nonexperts are more likely to believe explanations—even if those explanations are otherwise unsound. If one thing is certain, it is that the public interest in improving cognition will continue for the foreseeable future. But the outcome of any individual study, any individual intervention, and as we have illustrated, any individual meta-analysis cannot be construed as a conclusive answer to the question of how much cognitive function might be improved through intervention. In addition to the theoretical and modeling work discussed above, we note that significant attention should be given to the careful communication of findings. Often the fault on this count lies not in conducting studies that have methodological limitations and potential alternative interpretations, as these studies might guide us toward better future work and a richer understanding of the phenomenon. Rather, the fault lies in interpreting the results of these limited studies as “proof” that cognitive training does or does not work. When studies are published in short-form journals or reported in press releases, claims tend to be exaggerated and the limitations receive short shrift. One study in the British Medical Journal recently analyzed 462 health science press releases and found that 40% of them overstate the implications of the findings (105). 
Unfortunately, even when “hedging” language is present in scientific articles, most readers gloss over the details and focus on the main claim when reading about scientific studies (106).

Conclusion

It may seem as if we have written two papers: one about contemporary issues, and one focused on history. However, we do not believe that these discussions should be separate from each other. Although we cannot address the entirety of the vast historical literature here (107), the work of Aiken, Whipple, and Dallenbach and a cursory review of early training studies reveals that many of the “lessons of the present” were actually raised over 100 y ago (see especially ref. 108). These issues remain despite the actual empirical work conducted, and possibly because of it as well, especially when researchers improperly generalize or ignore previous findings. And they were not solved despite decades of theoretical and methodological discussions that in many respects were not dissimilar from more recent reviews, such as Simons et al. (2). Thus, the issue of whether cognitive training “works” was not settled in 1910, nor 1914, nor in all of the years that followed. As Mead wrote in 1946, “a final question is this: how long will it take for the facts known about transfer to be used, and adjustments to be made accordingly? One hundred years? Or never?” (109). We assert that we still do not have definitive answers to questions regarding training and transfer. Researchers may not have the answers 100 y hence. But if we keep asking, simply, “Does cognitive training work?” rather than investigating the mechanisms of transfer within a coherent theoretical framework, we will never have them at all. How many more studies of this nature must be completed before we start asking the right questions? Toward the end of the film Groundhog Day, Bill Murray’s character, finally approaching something resembling wisdom, declares “When Chekhov saw the long winter, he saw a winter bleak and dark and bereft of hope. Yet we know that winter is just another step in the cycle of life.” For his character, this cycle is a long one, but it is not endless. 
And it is no mistake that his journey recalls elements of Buddhism. The director, Harold Ramis, reportedly carried his own pocket Buddhist mind-training guide (110) that contained some of the same pointers from the Dhammapada referenced earlier. In Ramis’ pocket-guide version of the Seven Factors of Enlightenment, “investigation and research” are indeed important. But another factor comes first in his list: mindfulness. We hope that, above all, researchers (including us) will be mindful: of the lessons of the present, yes, but also the lessons of the past. The future of the field depends on it.

Footnotes

Author contributions: B.K., P.S., and D.E.M. wrote the paper.

Conflict of interest statement: B.K. worked as a game designer for Lumos Labs, the company behind the brain-training website Lumosity.com, before beginning graduate school in 2012.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Digital Media and Developing Minds,” held October 14–16, 2015, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/Digital_Media_and_Developing_Minds.

This article is a PNAS Direct Submission.

*Many early studies did use the term fluid intelligence (or “nonverbal intelligence”) and used matrix reasoning tests or other nonverbal IQ tests as outcome measures (e.g., refs. 52–56).