We examined the efficacy of one of the most commonly used WM training paradigms, the n-back task. In contrast to previous meta-analyses in the field, we did not lump untrained versions of the training task together with the other WM transfer measures. Instead, we kept these two types of transfer measures separate in order to examine whether n-back training mainly yields task-specific transfer or a more general WM improvement. We also addressed certain methodological shortcomings that limit the interpretability of previous meta-analyses of WM training effects. Previous meta-analyses have included only one measure (or an average of several measures) per cognitive domain from each original study sample. We employed a multi-level meta-analytic approach that allowed us to include all measures from the original studies, and thereby obtain a less biased estimate of the training effects. Finally, we took full advantage of the repeated measures design in the original studies by accounting for the correlation between pre- and posttest performance, thereby increasing the statistical power of our analyses.
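To illustrate the power gain from accounting for the pre-post correlation, the following sketch (in Python; the numbers are purely illustrative and assume equal pre and post SDs, which is an assumption not taken from the original studies) shows how the standard error of the mean pre-post change shrinks as the pre-post correlation increases:

```python
import math

def se_of_mean_change(sd: float, n: int, r: float) -> float:
    """Standard error of the mean pre-post change for n participants,
    assuming equal pre and post SDs and a pre-post correlation r."""
    sd_diff = sd * math.sqrt(2.0 * (1.0 - r))  # SD of the difference scores
    return sd_diff / math.sqrt(n)

# Illustrative values: SD = 15, n = 25 participants.
# Ignoring the correlation (r = 0) gives a much larger standard error
# than a typical test-retest correlation (e.g., r = 0.7).
print(se_of_mean_change(15.0, 25, 0.0))  # ~4.24
print(se_of_mean_change(15.0, 25, 0.7))  # ~2.32
```

A smaller standard error for the same mean change translates directly into a larger test statistic and hence higher statistical power.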

Main analyses

The present meta-analysis included 203 training effects (190 after data screening) from 33 studies. These studies consisted of 41 unique experiments. In total, data were obtained from 2,105 individuals. The results from the main analyses showed a moderate effect of task-specific transfer to untrained n-back tasks and very small transfer effects to other untrained WM measures, cognitive control, and Gf.

The transfer effect to untrained WM tasks as a whole is approximately the same size as in the n-back training analyses by Melby-Lervåg et al. (2016; when averaged across verbal and nonverbal WM and across active and passive control groups), but considerably smaller than in the other meta-analyses not focusing specifically on n-back training (Melby-Lervåg & Hulme, 2013; Weicker et al., 2016; Schwaighofer et al., 2015; Table 1). There are several possible reasons for this discrepancy. First, Melby-Lervåg and Hulme (2013), who reported the highest near transfer effects, in some cases included the training task in their analyses, resulting in an overestimation of the near transfer effect. Second, except for Melby-Lervåg and Hulme (2013), all previous meta-analyses investigating near transfer of WM training have excluded simple spans from the WM transfer domain. We decided to include simple spans because it has been argued that simple and complex spans (the latter being included both in previous meta-analyses and in the present one) can in fact be considered measures of the same cognitive processes (Unsworth & Engle, 2007). Nevertheless, when we ran the same analysis without the simple spans, the effect size remained similar (g = 0.21). Third, Jaeggi et al. (2010) have suggested that near transfer effects may be smaller following n-back training than following other WM training paradigms, because the n-back task shows low correlations with other WM tasks. However, a recent study showed that n-back tasks are in fact highly correlated with other WM tasks at the latent level (Schmiedek et al., 2014). Fourth, the differences between the size of the near transfer effect in the previous meta-analyses and the present one may be partly related to differences in study inclusion, and to the fact that we included all relevant measures from each original study in our analysis.
Finally, in light of the present results, perhaps the most important explanation for the higher WM transfer effect sizes in previous meta-analyses is that those studies did not separate untrained variants of the training task from other WM tasks. This problem may be even more acute in meta-analyses of studies using several training tasks, where it becomes more likely that the transfer tasks include untrained versions of the training tasks. To investigate this, we reanalyzed the data from healthy adults included in Melby-Lervåg et al. (2016), separating untrained versions of the training task(s) from other untrained WM tasks. We excluded studies in which the n-back task was the only training task, in order to examine whether the inclusion of untrained versions of the training task(s) leads to an overestimation of near transfer effects for other training paradigms as well. The pooled effect size for all types of WM tasks was g = 0.29 [0.18, 0.40], p < 0.001. However, when untrained versions of the training tasks were analyzed separately from other WM tasks, the effect size was significantly stronger, QM(1) = 8.89, p < 0.01, for the former, g = 0.62 [0.38, 0.86], p < 0.001 (number of effect sizes = 19), than for the latter, g = 0.21 [0.08, 0.33], p < 0.01 (number of effect sizes = 70). The findings of this reanalysis are in line with the present meta-analysis and indicate that task-specific transfer must be taken into account when investigating transfer effects of other WM training paradigms as well. These results also speak against the idea that the transfer effects in the present study are lower because of issues with the concurrent validity of the n-back task.
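As a back-of-the-envelope check, the reported moderator statistic can be approximately reproduced from the two subgroup estimates alone, assuming the 95% confidence intervals are symmetric Wald intervals (an assumption for illustration; small discrepancies reflect rounding of the reported values):

```python
import math

def wald_se_from_ci(lo: float, hi: float) -> float:
    """Standard error recovered from a symmetric 95% Wald interval."""
    return (hi - lo) / (2 * 1.96)

# Reported subgroup effects (Hedges' g with 95% CIs) from the reanalysis
g_taskvariant = 0.62
se_taskvariant = wald_se_from_ci(0.38, 0.86)
g_other = 0.21
se_other = wald_se_from_ci(0.08, 0.33)

# Moderator test: Q_M with 1 df compares the two pooled effects
q_m = (g_taskvariant - g_other) ** 2 / (se_taskvariant ** 2 + se_other ** 2)
print(round(q_m, 2))  # close to the reported Q_M(1) = 8.89
```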

For cognitive control, the transfer effect was approximately of the same size as in the two previous meta-analyses investigating transfer of different kinds of WM training to executive functions and attention (Melby-Lervåg & Hulme, 2013; Weicker et al., 2016). Also for Gf, the present transfer effect size is roughly in line with the three previous meta-analyses investigating transfer to Gf from n-back training (Au et al., 2015; Melby-Lervåg & Hulme, 2016; Melby-Lervåg et al., 2016).

In sum, the only notable transfer effect in the present meta-analysis is to untrained n-back tasks. Although the transfer effects to the other domains are also statistically significant, they can be considered very small: an effect size of 0.2 means that only approximately 1% of the variance of the dependent variable (e.g., score on a Gf task) can be explained by whether the participant belonged to the training or the control group. The practical significance of such effects can thus be questioned (see also Melby-Lervåg & Hulme, 2016).
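The 1% figure follows from the standard conversion between a standardized mean difference and a point-biserial correlation; a minimal sketch, assuming two groups of equal size:

```python
import math

def d_to_variance_explained(d: float) -> float:
    """Convert a standardized mean difference (Cohen's d / Hedges' g)
    to the proportion of variance explained (r squared), assuming
    two groups of equal size: r = d / sqrt(d^2 + 4)."""
    r = d / math.sqrt(d ** 2 + 4)
    return r ** 2

print(round(d_to_variance_explained(0.2), 4))  # ~0.0099, i.e., about 1%
```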

Mechanisms behind transfer

We hypothesized that if n-back training enhances the WM components it consists of, the magnitude of transfer effects would follow the presumed cognitive overlap between the transfer tasks and the training task (Dahlin et al., 2008; Waris et al., 2015). This would result in a gradual decrease in effect sizes, with the strongest transfer effects to untrained n-back tasks, followed by other WM tasks, cognitive control, and Gf. However, in the present study, the only noteworthy transfer effect was seen to untrained n-back tasks, while transfer to other tasks was at similar, very small levels. This pattern of results suggests that the transfer effects of n-back training are mainly caused by the acquisition of task-specific aspects such as suitable strategies, rather than by better-functioning WM components. This is because task-specific improvement can enhance performance only on tasks with a similar structure where the same strategies can be successfully employed. An actual improvement in the effectiveness of the underlying WM components, such as flexibility of updating and storage capacity, should by contrast result in broad transfer effects to different kinds of measures (von Bastian & Oberauer, 2014). A potential caveat is that a pattern of results similar to the one seen here could also be due to an improvement that is limited to the updating component, because we did not separate updating tasks from other WM tasks in the analyses. However, a post hoc analysis of our data showed that the transfer effects to updating tasks other than n-back (g = 0.26) were roughly the same size as the transfer effects to other WM tasks (g = 0.23), strengthening the conclusion that n-back training mainly improves task-specific aspects and not WM.
The very small and similar transfer effects to other WM tasks, Gf tasks, and tasks measuring cognitive control may thus reflect general factors such as enhanced attention, perceptual speed, or simply becoming accustomed to the computer and to performing demanding cognitive tasks.

Moderator analyses

We also investigated whether the choice of control group affected the training outcome. The results showed no differences in improvement between passive and active control groups when they were compared within studies that had employed both types of control groups. Based on these results, we agree with Au et al. (2015) that there does not seem to be any clear support for the idea that Hawthorne effects drive the results. However, when the type of control group was added as a covariate to our main analysis, the results revealed a small but significant effect. This could mean either of two things: the within-study comparison was underpowered and failed to detect a true difference, or training groups genuinely improve more in studies with passive controls than in studies with active controls. The latter explanation has also been proposed by Au et al. (2015).

Also, the results from the present meta-analysis do not give us reason to claim that publication bias plays a major role in the results. However, the reviewed studies contained a few observations that could be considered outliers. These were removed, because they either over- or underestimated the training effect beyond what was expected from sampling error alone.

Regarding the other moderators in the present meta-analysis, the results showed no difference in transfer effects between young and old participants. On the whole, this is in line with previous meta-analyses (Melby-Lervåg & Hulme, 2013; Schwaighofer et al., 2015; Melby-Lervåg et al., 2016). The results also showed that training with single or dual n-back tasks was equally effective in producing transfer to the four cognitive domains. This concurs with the Au et al. (2015) meta-analysis showing similar transfer effects to Gf after single and dual n-back training. The number of training hours or sessions did not affect the transfer results either, again in line with most of the previous meta-analyses. However, Weicker et al. (2016) found that the number of training sessions was positively related to the size of the transfer effect to WM. This difference in results may stem from several sources. While the present meta-analysis focused on n-back training in healthy adults, the Weicker et al. (2016) meta-analysis included studies investigating all kinds of WM training paradigms in both healthy and clinical samples of children and adults. It is in principle also possible that higher doses of n-back training than those currently used might yield stronger effects. Finally, in line with the Au et al. (2015) meta-analysis, our results indicated that it did not matter whether the WM and Gf transfer tasks consisted of verbal or nonverbal material.

Limitations of the present study

The present meta-analysis focused on only one type of WM training, namely n-back training. Restricting the analysis to one type of training task made the present pattern of transfer effects easier to interpret: investigating transfer elicited by several training tasks makes it even more challenging to separate task-specific effects from increased effectiveness of general WM mechanisms. We were able to conclude that, of the four cognitive domains studied here, n-back training produces substantial transfer only to untrained variants of the training task. Transfer effects to other tasks (whether they measure WM, Gf, or cognitive control), albeit observable, are small and apparently of little practical significance.

Due to the limited number of available studies, we were not able to investigate transfer to other cognitive domains than the four studied here. For the same reason, we could not perform all the moderator analyses for all four cognitive domains (e.g., material-specific aspects of transfer were investigated only for WM and Gf), and we refrained from investigating interactions between moderators.

Directions for future studies

There is still much controversy regarding the efficacy of WM training, despite the fact that the issue has been investigated in many training studies and several meta-analyses. We believe that much of the controversy is due to the great variability between training studies regarding, for example, the choice of training and transfer tasks, control group, and study population. This variability is then reflected in meta-analyses as well, because different researchers will make different choices in categorizing tasks, participant groups, and studies. On the one hand, one could argue that variation in, for example, transfer tasks is important in order to draw conclusions about improvement in general WM mechanisms (for a discussion, see e.g., Shipstead, Redick et al., 2010; Shipstead et al., 2012). This is because cognitive tasks always involve variance that stems from sources other than the ability of interest. Apart from random error (such as variation in alertness or disturbing noises in the testing environment), such variance comes from other abilities engaged in solving the tasks, strategies employed by the participants, or differences related to the type of stimuli involved. Therefore, it has been recommended that training studies utilize factor analysis to analyze transfer at a latent variable level, which should provide more reliable information about the ability of interest than task-specific performance does (Schmiedek, Lövdén, & Lindenberger, 2010). On the other hand, the present results indicate that task-specific aspects play an important role in the transfer effects.
A failure to distinguish between task-specific and other WM transfer not only inflates effect sizes for near transfer, but may also lead one down the wrong track when searching for theoretical explanations of WM training effects (i.e., assuming that WM training increases the effectiveness of WM in general, rather than also considering alternative hypotheses about strategy-based effects). Thus, if one wants to shed more light on the underlying mechanisms of transfer, more emphasis should be put on task-specific aspects. It would be important to systematically analyze performance on tasks that are closely related to the training task, in order to identify the mechanisms that drive the major transfer effects.

As previously mentioned, there is, however, not much research on what different executive tasks actually measure and how reliable they are over time. Low task reliabilities, together with the small sample sizes often employed in training studies, result in weak statistical power and consequently lower chances of observing a putative effect of training.
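This point can be sketched numerically, using the classical test theory attenuation of an effect size together with a normal-approximation power calculation (all numbers are purely illustrative, not drawn from the reviewed studies):

```python
import math

def attenuated_d(d_true: float, reliability: float) -> float:
    """Observed standardized effect when measurement error inflates the
    outcome SD (classical test theory: observed variance = true / rel)."""
    return d_true * math.sqrt(reliability)

def approx_power(d: float, n_per_group: int, z_crit: float = 1.96) -> float:
    """Normal-approximation power of a two-group comparison (one-sided
    rejection region at z_crit) with n participants per group."""
    se = math.sqrt(2.0 / n_per_group)
    z = d / se
    return 0.5 * math.erfc((z_crit - z) / math.sqrt(2.0))

# A true effect of d = 0.5 measured with a perfectly reliable task...
power_perfect = approx_power(attenuated_d(0.5, 1.0), 20)
# ...versus the same effect measured with a task of reliability 0.6:
power_noisy = approx_power(attenuated_d(0.5, 0.6), 20)
print(power_perfect > power_noisy)  # lower reliability -> lower power
```

With only 20 participants per group, power is modest even for a perfectly reliable measure, and unreliable tasks erode it further.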

We believe that it would be important for future studies to solve issues related to task reliability and validity, the pairing of training and transfer tasks, and statistical power, rather than to conduct more training studies that carry the current methodological problems. Ultimately, training effects should also be evaluated with measures that more closely tap real-life working memory demands. Furthermore, future studies should pave the way toward a theory of the processes involved in training-induced change in WM, a theory that is currently missing (Gibson, Gondoli, Johnson, Steeger, & Morrissey, 2012). Our results highlighting the role of task-specific transfer suggest that a considerable part of the transfer effects is related to self-generated performance strategies that emerge during repeated practice with a limited set of WM tasks. We concur with what Shipstead, Hicks, and Engle (2012) stated 4 years ago: “Working memory training remains a work in progress.”