Neuropsychological assessment is central to the diagnosis of dementia and to identifying individuals who may be in a prodromal phase of AD. People with MCI have a ten-fold higher risk of progressing to dementia every year than people of the same age in the general population. Thus, many people with MCI are actually in a prodromal stage of AD. However, not all patients with MCI progress to dementia, and it is thus critical for clinicians to identify tools that can accurately separate those who will remain stable from those who will progress to dementia. Many longitudinal studies have published predictive accuracy values for different cognitive tests. There is a critical need for a systematic analysis of this literature because of the large number of cognitive dimensions that can be measured and because each domain can be assessed with an even larger number of neuropsychological tools.

In this systematic review, we found 28 longitudinal studies that assessed the value of neuropsychological tests to predict progression from MCI to dementia. We selected studies based on strict inclusion and exclusion criteria and hence discarded studies that contained fatal methodological flaws for a meta-analysis of predictive diagnostic test accuracy (for example, those that failed to use clinical criteria to identify progression to dementia or those for which the methodological information provided was insufficient to allow replication). Nevertheless, it was important to assess the general quality level of the remaining studies and whether the methodology could bias the data, as the studies varied somewhat. Based on the quality criteria used here (QUADAS tool and Cochrane guidelines), most studies had a relatively low risk of bias. Furthermore, the studies were relatively homogeneous in their methodological approach and relied on well-accepted clinical criteria to identify their patients and outcomes. Most included studies relied on a prospective design and used Petersen’s criteria to select their participants, and all relied on the NINCDS-ADRDA criteria to identify dementia. Yet, 10 of the 28 studies showed a high risk of bias. A few features were found to be problematic. The most frequent problem was related to the selection process, which led to the sample not being representative of the population of interest. This limitation might have a negative impact on the generalizability of the results from these studies. Another problem was related to a failure to keep predictive tests independent from the gold standard used to identify the outcome (here, progression to dementia). This was found in five studies, and five others failed to report information regarding this criterion. This is an important methodological control because predictive accuracy can be artificially inflated when the predictors are not kept independent from the standard.

We decided to focus on measures of sensitivity and specificity and to report both independently rather than focusing mostly on overall predictive accuracy. Ideally, a test with optimal predictive accuracy should combine excellent sensitivity and specificity. However, the optimal ratio of specificity to sensitivity may also depend on the clinical and research context. For example, clinicians might favor sensitivity over specificity to detect a deadly disease that could be cured if treated. In contrast, specificity might be favored over sensitivity when a disease cannot be treated, when a diagnosis has the potential to result in stigmatization, exclusion or depression, or when treatment has important side effects. In a research context, investigators might favor a different balance between sensitivity and specificity as a function of their research question. Remarkably, our systematic review indicated that many domains and tests show an appropriate balance with very good sensitivity and specificity values.

The studies that we reviewed covered data for a total of 2365 participants who met the criteria for MCI at entry and were followed over an average of 31 months to assess whether they met the criteria for AD-type dementia. In total, 916 individuals with MCI were later found to progress to dementia. This represents a progression rate of 38.7%, which is fairly consistent with the literature, considering the 31-month average follow-up (Gauthier et al. 2006). However, the progression rate across individual studies was quite variable, ranging from 6% to 39% per year.
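
As a rough illustration of how the pooled figures above relate to a yearly rate, the following sketch recomputes the cumulative progression rate and converts it to an annual figure in two ways. Both conversions are simplifying assumptions introduced here for illustration (even accrual over follow-up, or a constant yearly probability among those not yet progressed); neither is taken from the included studies, and the function names are hypothetical.

```python
# Pooled figures reported in the text above.
N_MCI = 2365                 # participants with MCI at entry
N_PROGRESSED = 916           # participants who progressed to dementia
MEAN_FOLLOW_UP_MONTHS = 31   # average follow-up


def cumulative_rate(progressed, total):
    """Proportion of the pooled sample that progressed during follow-up."""
    return progressed / total


def linear_annual_rate(cumulative, months):
    """Assumption: progression accrues evenly across the follow-up period."""
    return cumulative * 12 / months


def constant_hazard_annual_rate(cumulative, months):
    """Assumption: the same yearly probability applies to everyone
    who has not yet progressed."""
    return 1 - (1 - cumulative) ** (12 / months)


cum = cumulative_rate(N_PROGRESSED, N_MCI)
linear = linear_annual_rate(cum, MEAN_FOLLOW_UP_MONTHS)
hazard = constant_hazard_annual_rate(cum, MEAN_FOLLOW_UP_MONTHS)

print(f"cumulative progression: {cum:.1%}")
print(f"annual rate, even accrual: {linear:.1%}")
print(f"annual rate, constant yearly probability: {hazard:.1%}")
```

Under either assumption the pooled annual rate falls within the 6% to 39% range observed across individual studies.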

The systematic review examined 61 cognitive tests that evaluated 22 cognitive dimensions. It identified many neuropsychological measures with very good sensitivity for predicting dementia, with some reaching more than 90%. Similarly, many neuropsychological tests revealed excellent specificity values. Five neuropsychological measures had an overall accuracy of greater than 90%. Three were episodic memory tests (Guild paragraph delayed recall, RAVLT delayed recall, and free delayed recall of names from a face-name association) and two measured visual semantics (object function recognition and the VOSP silhouette). One global test measuring different cognitive components (Addenbrooke’s Cognitive Examination, ACE) also yielded excellent overall accuracy. Thus, although the sensitivity to specificity ratio varied for individual tests, many had an appropriate balance between the two and some showed both excellent sensitivity and specificity. This systematic review also examined studies that assessed the predictive value of a combination of cognitive measures. The use of a combination of neuropsychological measures is likely to be the best approach to identify future progressors, because the sensitivity to specificity ratio varies for individual domains. Studies that have examined combined markers generally reported very high to excellent predictive accuracy with a good balance between sensitivity and specificity, particularly when they combined memory with executive or language tests.

We performed a meta-analysis of the predictive accuracy for 14 cognitive domain categories that included at least three independent studies. The meta-analysis pooled sensitivity and specificity values to obtain quantitative indicators. The meta-analysis showed that most measures of verbal memory were excellent predictors with very good (≥ 0.7) specificity and sensitivity values. In addition, predictive values from verbal memory tests were barely influenced by the testing conditions. For example, delayed recall did not predict progression better than immediate recall. Similarly, there was no major difference between cued recall and free recall and there was no added benefit from providing orientation at retrieval. This goes against the concept that measures of delayed recall or tests that orient processing at encoding are the best indicators of early AD, because they reflect hippocampal dysfunction (Dubois et al. 2007; Albert et al. 2011). Rather, the present data indicate that a range of verbal memory tests can be used as appropriate indicators of early AD and that the nature of the task may not profoundly influence predictive accuracy. Contrary to the general finding of high predictive value for verbal memory tasks, two testing conditions were associated with relatively low predictive accuracy: word recognition and word recall with orientation at encoding and cues at retrieval. This is consistent with the notion that AD patients suffer from impaired encoding, because being impaired on tasks that increase encoding is not a good predictor of progression.

Interestingly, some language categories were rather good predictors of future decline, particularly naming and semantic fluency. These tests implicate semantic memory and some form of executive function, which may explain their ability to predict future decline, as both processes have been proposed to be impaired early in MCI (Belleville et al. 2008; Joubert et al. 2008). The predictive accuracy of two language tests (fluency and naming) was modified as a function of the length of follow-up. This indicates that the predictive value of language categories varies with disease stage in the prodromal continuum, contrary to verbal memory, for which predictive accuracy was similar, irrespective of the prodromal stage of the patient. This finding and its implications are discussed further below.

Several cognitive categories showed better specificity than sensitivity values. This was true for memory tasks that provide support at encoding and retrieval, for example, recognition (0.547 and 0.789 for sensitivity and specificity, respectively) or word-list cued delayed recall with oriented encoding (0.676 and 0.896 for sensitivity and specificity, respectively). The same pattern was found for some non-memory categories as well. For example, shifting (sensitivity = 0.541; specificity = 0.679), working memory (sensitivity = 0.599; specificity = 0.667), and semantic knowledge (sensitivity = 0.703; specificity = 0.814) showed higher specificity than sensitivity. Thus, these tests may not be very sensitive for identifying future progression, but they might be useful for identifying patients with MCI who will remain stable and thus contribute to reducing the number of false positives.
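
To make the practical meaning of these pairs of values more concrete, the sketch below converts each reported sensitivity and specificity into likelihood ratios and into expected counts per 1000 patients with MCI. It uses the standard definitions of the likelihood ratios. The base rate of 0.387 is the pooled progression rate reported earlier, applied here as an assumption, and the function names are hypothetical.

```python
# (sensitivity, specificity) pairs as reported in the paragraph above.
REPORTED = {
    "word recognition": (0.547, 0.789),
    "cued delayed recall, oriented encoding": (0.676, 0.896),
    "shifting": (0.541, 0.679),
    "working memory": (0.599, 0.667),
    "semantic knowledge": (0.703, 0.814),
}

# Assumption: the pooled 38.7% progression rate is used as the base rate.
BASE_RATE = 0.387


def likelihood_ratios(se, sp):
    """Return (LR+, LR-) using the standard definitions."""
    return se / (1 - sp), (1 - se) / sp


def expected_counts(se, sp, base_rate, n=1000):
    """Expected classification counts among n patients with MCI."""
    progressors = n * base_rate
    stable = n - progressors
    return {
        "true_positives": progressors * se,
        "false_negatives": progressors * (1 - se),
        "false_positives": stable * (1 - sp),
        "true_negatives": stable * sp,
    }


for name, (se, sp) in REPORTED.items():
    lr_pos, lr_neg = likelihood_ratios(se, sp)
    counts = expected_counts(se, sp, BASE_RATE)
    print(
        f"{name}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, "
        f"missed progressors = {counts['false_negatives']:.0f}, "
        f"false positives = {counts['false_positives']:.0f} per 1000"
    )
```

Higher specificity lowers the expected number of false positives among stable patients, whereas lower sensitivity raises the expected number of missed progressors.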

The meta-regressions examined whether age and length of follow-up determined differences in sensitivity and specificity values. Age had no effect on predictive accuracy. However, conclusions concerning age might be limited by the fact that this aspect was examined across studies and not across individuals. It is therefore possible that the search for an effect was limited by the lack of difference in the average age across studies, because disease onset is likely to be equivalent across different samples. The effect of the length of follow-up is perhaps more informative, as studies have control over this variable, which differed between studies. The predictive accuracy of naming and the semantic fluency category varied as a function of the length of follow-up. Studies with very short follow-ups might increase the likelihood of false negatives, as they do not allow sufficient time for individuals to progress to meet the dementia criteria. It is unlikely that a test would be more sensitive for a shorter than longer follow-up, as symptoms increase with progression. Thus, higher sensitivity for studies with longer follow-ups might reflect such a phenomenon. Semantic fluency followed this pattern. Fluency tasks yielded excellent sensitivity (0.842) for long follow-ups (31 to 36 months), but sensitivity markedly declined (0.540) for short follow-ups (12 to 24 months). Hence, this interaction might reflect the contribution of false negative cases to the data from studies relying on short follow-ups.

Assessing the impact of the length of follow-up is also informative for identifying very early markers. Indeed, a reasonably long follow-up can be used as a proxy for how far the patient was from the diagnosis when the test was given. Determining what constitutes a reasonable follow-up is complex, but if approximately 15% of individuals with MCI progress to dementia yearly, and approximately 25% of them remain stable, irrespective of follow-up length, a three-year follow-up would allow the detection of approximately 60% of the MCI progressors, which would increase to 80% for a four-year follow-up. A task found to be predictive, irrespective of the length of follow-up, is a good candidate for an early predictor. A task found to be less predictive at a longer than shorter follow-up may be less well-suited as an early indicator of future dementia and more representative of imminent decline. This pattern was found for the naming category. Although the test showed good specificity at longer follow-ups (0.648 for follow-ups of 31 to 36 months), specificity was markedly higher, and excellent, at shorter follow-ups (0.852 for follow-ups of 12 to 24 months). Sensitivity was unaffected by the length of follow-up, suggesting that naming might fare better as a predictor at shorter rather than longer follow-up and might be better used as a marker of imminent progression rather than an early marker of the disease. This result and its interpretation need to be confirmed by future studies and meta-analyses, because the effect was not found when examining the alternative distribution and it was based on a relatively small number of studies.
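
The follow-up arithmetic above can be reproduced with a short sketch. It assumes that 15% of the initial MCI cohort progresses each year (a linear reading, which is the one that reproduces the 60% and 80% figures) and that 25% of the cohort never progresses; the function name is hypothetical.

```python
def share_of_progressors_detected(years, yearly_rate=0.15, stable_fraction=0.25):
    """Fraction of eventual progressors who have progressed after `years`.

    Assumptions (for illustration only): a fixed share of the initial
    cohort progresses each year, and a fixed share never progresses.
    """
    eventual_progressors = 1 - stable_fraction
    # Cumulative progression cannot exceed the share that ever progresses.
    progressed_so_far = min(yearly_rate * years, eventual_progressors)
    return progressed_so_far / eventual_progressors


for years in (1, 2, 3, 4, 5):
    share = share_of_progressors_detected(years)
    print(f"{years}-year follow-up: {share:.0%} of progressors detected")
```

With these inputs, 45% of the cohort has progressed after three years, which is 60% of the 75% who eventually progress, and 60% of the cohort after four years, which is 80% of them.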

One strength of this meta-analysis was the use of the Bayesian approach, which has several advantages over other frequently used approaches. Simulation studies have shown that the Bayesian method provides better coverage probabilities for global sensitivity and specificity, particularly in the case of sparse data (Paul et al. 2010). One reason for this is that Bayesian inferences generally come with wider credibility intervals (often more realistic) than frequently used alternative methods (Warn et al. 2002). The Bayesian approach yields less biased estimations of variance and correlation parameters (Paul et al. 2010). Other frequently used approaches often experience convergence issues, which are less of an issue with the Bayesian approach (Paul et al. 2010). Finally, the Bayesian approach generally produces an approximate joint posterior distribution of all model parameters. This has the advantage of not only making it possible to test hypotheses, but also to obtain the probability that any given parameter is above or below any given threshold (Rutter and Gatsonis 2001). It also allows the easy computation of point estimates of any functions of Se and Sp, such as predictive values or likelihood ratios, along with their credibility intervals.
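
The last two advantages can be illustrated with a minimal sketch. It assumes that posterior draws of Se and Sp are already available as paired lists, and summarizes a function of them (here the positive likelihood ratio) together with the probability that Se exceeds a threshold. The draws below are hypothetical stand-ins generated from Beta distributions, not output from the model used in this study, and they are drawn independently, whereas a real joint posterior would preserve the correlation between Se and Sp.

```python
import random
import statistics


def summarize(draws, level=0.95):
    """Posterior median and an equal-tailed credibility interval."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int((1 - level) / 2 * n)]
    upper = ordered[min(n - 1, int((1 + level) / 2 * n))]
    return statistics.median(ordered), lower, upper


def prob_above(draws, threshold):
    """Posterior probability that the parameter exceeds the threshold."""
    return sum(d > threshold for d in draws) / len(draws)


# Hypothetical stand-in draws (NOT results from the meta-analysis).
rng = random.Random(0)
se_draws = [rng.betavariate(40, 12) for _ in range(5000)]
sp_draws = [rng.betavariate(45, 10) for _ in range(5000)]

# Any function of Se and Sp is computed draw by draw, then summarized.
lr_pos_draws = [se / (1 - sp) for se, sp in zip(se_draws, sp_draws)]

median, lower, upper = summarize(lr_pos_draws)
print(f"LR+ median {median:.2f}, 95% CrI [{lower:.2f}, {upper:.2f}]")
print(f"P(Se > 0.7) = {prob_above(se_draws, 0.7):.2f}")
```

Because the function is evaluated on each draw, its credibility interval follows directly from the joint posterior without any additional approximation.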

This study has limitations. One major limitation is that we focused on individuals meeting the criteria for MCI, which might not represent the earliest stage of AD. Thus, the follow-up periods reported in the included studies are relatively short if one considers that the disease develops over two decades prior to diagnosis. Therefore, it is unclear whether the predictive accuracy identified here is representative of earlier stages and whether it can be extended further back during the prodromal period. It is also possible that a slightly different ensemble of tasks would display different sensitivity and specificity values at an earlier period. Future studies could meta-analyze data from individuals reporting a subjective cognitive decline to assess earlier markers, as these individuals might be in a phase that precedes MCI in the disease continuum (Jessen et al. 2014). Similarly, the outcome of studies interested in pre-dementia diagnosis depends on the type of recruitment at entry and the validity of the classification scheme, for example, how subjective cognitive decline or MCI is diagnosed. Advances in the field and refinement of diagnostic criteria will certainly increase the ability to identify early markers. Another limitation is the large variability in the tasks that were tested across studies. As a result, we focused on cognitive domain categories rather than individual tasks for the meta-analyses. Although this quantitative meta-analysis provides information as to the domains that should be measured for early prediction, it does not identify specific tests. However, the systematic review included in this study identifies the predictive accuracy for a range of neuropsychological tests that map the cognitive domain categories identified in the meta-analysis. 
Our finding that the use of different testing conditions for memory tasks does not substantially modify predictive accuracy lends support to our approach, as it indicates that sensitivity and specificity do not vary much within broad cognitive domains. As already mentioned, the statistical analyses were limited by the small number of included studies. This does not diminish the reliability of the results, but the high between-study variability was reflected by the large CrI widths. Some statistical assumptions, for example that of normality, could not be verified due to the small number of studies. Also, the reported CrIs are not applicable to a hypothetical future study. For example, using different cut-offs would change the sensitivity and specificity values. We experienced instances of convergence failures. The Bayesian approach that we used is known to be less prone to convergence issues than other frequently used methods (Paul et al. 2010). Yet, when there are relatively few studies with sparse data, convergence can be particularly difficult to achieve because such data may lead to very wide estimates of the corresponding credibility intervals. This is even more challenging for meta-regression models, as they contain additional parameters. We suspect that the convergence failures were due to small amounts of data compared to the number of model parameters. Finally, sensitivity and specificity values were slightly modified in four cases after excluding studies with a high risk of bias (reduced sensitivity for three, increased specificity for one). In these cases, correct inferences for sensitivities and specificities probably fell somewhere between the original results (left-hand side of Table 5) and those excluding high-risk studies (right-hand side of Table 5).

In conclusion, the results from this meta-analysis are encouraging for those interested in the early identification of AD. They show that neuropsychological assessment, which is affordable and widely accessible, can strongly contribute to predicting dementia while individuals are still in the MCI phase. The meta-analysis revealed very good to excellent predictive accuracy for many cognitive domains, particularly those concerned with verbal memory and semantic processing. Based on the meta-analyzed data, performance on cognitive tests can predict whether MCI patients will progress to dementia at least 3 years prior to the time at which the diagnosis is made and should contribute substantially to the development of early indices of AD.