Conclusion The NIHTB-CB is a reliable and valid test battery for children and young adults with ID with a mental age of ≈5 years and above. Adaptations for very low-functioning or younger children with ID are needed for some subtests to expand the developmental range of the battery. Studies examining sensitivity to developmental and treatment changes are now warranted.

Results Above a mental age of 5.0 years, all tests had excellent feasibility. More varied feasibility across tests was seen between mental ages of 3 and 4 years. Reliability and convergent validity ranged from moderate to strong. Each test and the Crystallized and Fluid Composite scores correlated moderately to strongly with IQ, and the Crystallized Composite had modest correlations with adaptive behavior. The NIHTB-CB showed known-groups validity by detecting expected executive function deficits in FXS and a receptive language deficit in DS.

Methods We assessed feasibility, test-retest reliability, and convergent validity of the NIHTB-CB (measuring executive function, processing speed, memory, and language) by assessing 242 individuals with fragile X syndrome (FXS), Down syndrome (DS), and other ID, ages 6 through 25 years, with retesting completed after 1 month. To facilitate accessibility and measurement accuracy, we developed accommodations and standard assessment guidelines, documented in an e-manual. Finally, we assessed the sensitivity of the battery to expected syndrome-specific cognitive phenotypes.

Approximately 2.0% of the global population has an intellectual disability (ID), an understudied condition with lifelong effects on academic, vocational, and personal functioning.1 As the etiologies and mechanisms underlying specific forms of ID are discovered, targeted treatments and human clinical trials soon follow in the translational process,2,3 raising the potential for medical remediation of disability. The preclinical development of promising targeted treatments for ID-associated disorders has not been followed by successes in human trials. Unfortunately, testing accessibility issues, pervasive floor effects, and lack of consensus on acceptable cognitive endpoints have been obstacles in the field. Although other barriers such as limitations of animal models are involved, it is apparent that the IDs continue to lag behind other neurologic or psychiatric conditions on scalable, psychometrically supported, and broadly accepted endpoints.4,5

The NIH Toolbox Cognitive Battery6–8 (NIHTB-CB), an iPad-based battery of brief memory, executive function (EF), processing speed, and language tests, was developed within the NIH Blueprint for Neuroscience Research. The NIHTB-CB has the potential to provide a highly standardized, objective, and scalable tool for use across laboratories and clinical trial sites. As an extension of our pilot work,9 the present study reflects progress made over a 4-year period to empirically validate and refine the NIHTB-CB for ID (aim 1), including standardized administration guidelines required for this challenging population, with the goal of supporting its use as a set of outcome measures for clinical trials and other clinical research. The second aim was to measure the sensitivity of the battery to detect known cognitive phenotypes in 2 ID-associated syndromes that are a focus of translational research: Down syndrome5 (DS) and fragile X syndrome10 (FXS). On the basis of prior research, we hypothesized EF deficits in FXS and DS and episodic memory deficits in DS (relative to a heterogeneous other-ID group), a relative language strength in FXS, and no group differences on visual processing speed.

Our prior work on IQ measurement demonstrated the utility of deviation-based scoring to deal with problematic floor effects in ID. 24 We used this method in the current study in place of NIHTB-CB standard scores to circumvent imposed floored scores (e.g., the current NIHTB-CB winsorizes age-corrected standard scores at 54). We created z scores by transforming participant raw scores on each NIHTB-CB test using normative means and SDs for their chronologic age band. The z scores were used to create deviation-based composites following the previously defined criteria. 25 For a Crystallized Composite, 1 of 2 valid scores on PVT and OR was required; for a Fluid Composite, 4 of 5 valid scores on Flanker, DCCS, LS, PC, and PSM were required. The Crystallized and Fluid z scores were used to create a Cognitive Function Composite, used in ecologic validity analyses. Deviation scores were also created for FSIQ on the SB-5 24 and were used in FSIQ analyses. Group comparisons to examine known-groups validity were assessed with a 2-way mixed-model analysis of variance on NIHTB-CB z scores. Significant results were followed up with the Tukey honest significant difference tests to examine group differences.
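The deviation-based scoring described above can be illustrated with a minimal sketch. The normative means and SDs below are placeholders rather than published NIHTB-CB norms, and the function names are hypothetical. Averaging the valid z scores is an assumption made for illustration; the previously defined criteria cited in the text may combine scores differently. Only the valid-score thresholds (1 of 2 for Crystallized, 4 of 5 for Fluid) are taken from the study.

```python
from statistics import mean

# Hypothetical normative (mean, SD) pairs keyed by (test, age band).
# These numbers are placeholders, not published NIHTB-CB norms.
NORMS = {
    ("PVT", "10-11"): (70.0, 10.0),
    ("OR", "10-11"): (60.0, 12.0),
}

CRYSTALLIZED = ("PVT", "OR")
FLUID = ("Flanker", "DCCS", "LS", "PC", "PSM")


def z_score(raw, test, age_band):
    """Express a raw score in normative SD units; None marks an invalid score."""
    if raw is None:
        return None
    norm_mean, norm_sd = NORMS[(test, age_band)]
    return (raw - norm_mean) / norm_sd


def composite(z_by_test, tests, min_valid):
    """Average the valid z scores (an assumption), or None if too few are valid."""
    valid = [z_by_test[t] for t in tests if z_by_test.get(t) is not None]
    if len(valid) < min_valid:
        return None
    return mean(valid)


z = {
    "PVT": z_score(40.0, "PVT", "10-11"),  # (40 - 70) / 10
    "OR": z_score(None, "OR", "10-11"),    # invalid score stays None
}
# Crystallized needs 1 of 2 valid scores; Fluid needs 4 of 5.
crystallized = composite(z, CRYSTALLIZED, min_valid=1)
fluid = composite({"Flanker": -2.0, "DCCS": -3.0, "LS": -2.5}, FLUID, min_valid=4)
```

Because no lower bound is imposed on the z scores, a participant far below the normative mean keeps a distinct score instead of being set to the floor of 54.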

SAS version 9.4 (SAS Institute Inc, Cary, NC) and R version 3.6.0 (R Foundation for Statistical Computing, Vienna, Austria) were used for analyses. Visit 1 data were used for feasibility and validity analyses. For validity and reliability, NIHTB-CB raw scores were used: computed score on DCCS, Flanker, and PC; theta score on PSM, PVT, and OR; raw score on LS and Flanker DEXT; and percent correct on DCCS DEXT. Because DEXT scores are currently on a different scale than standard Flanker and DCCS scores, DEXT results are presented separately. For test-retest reliability, single-score intraclass correlations (ICCs) were used. The Cohen d was used to evaluate potential practice effects, with paired-sample t tests to measure the significance of change. Convergent, discriminant, and ecologic validities were measured with Pearson correlations.
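As a rough illustration of the reliability statistics named above, the sketch below computes a single-score ICC and a Cohen d for two visits using only the Python standard library. The text does not state which ICC form or which standardizer for d was used, so both choices here are assumptions: a two-way random-effects, absolute-agreement ICC (often labeled ICC(2,1)) and the pooled SD of the two visits. The function names and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def icc_2_1(visit1, visit2):
    """Single-score, absolute-agreement ICC for two visits (assumed form)."""
    n, k = len(visit1), 2
    rows = list(zip(visit1, visit2))
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(visit1), mean(visit2)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-participant variance
    ms_cols = ss_cols / (k - 1)                 # between-visit variance
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual variance
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def cohen_d_paired(visit1, visit2):
    """Visit 1 to visit 2 change in pooled-SD units (assumed standardizer)."""
    pooled_sd = sqrt((stdev(visit1) ** 2 + stdev(visit2) ** 2) / 2)
    return (mean(visit2) - mean(visit1)) / pooled_sd


# Hypothetical scores: every participant gains exactly 1 point at retest.
v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [2.0, 3.0, 4.0, 5.0]
icc = icc_2_1(v1, v2)
d = cohen_d_paired(v1, v2)
```

In this example the rank order is perfectly preserved, yet the absolute-agreement ICC is below 1 because the uniform gain counts against agreement. A consistency-type ICC would not penalize it, which is one reason the choice of form matters when practice effects are present.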

Before conducting analyses, we visually inspected all data and bivariate correlations for normality and the presence of outliers. All NIHTB-CB tests were examined for floor or ceiling issues. Only 2 tests had such issues. At visit 1, 21 individuals received a score at the floor on LS Age 3–6, 33 received a score at the ceiling on PSM Age 3–4, and 4 received a score at the floor on PSM Age 5–6. After a thorough review of administration details and reliability and validity analyses, LS floored scores were kept in the analyses. Because PSM has multiple age versions and these participants likely should have received a harder or easier version, PSM scores at floor or ceiling were excluded.

After every administration of the NIHTB-CB, the Administration Form was used to record whether each test was considered valid for the participant. All analyses used only valid scores. The most common reasons for invalid scores were an invalid response pattern, refusal, and excessive prompting (3.0%, 0.78%, and 0.72% of all scores, respectively).

To increase feasibility and to improve reliability and validity in ID, we developed a manual of standardized procedures regarding the test environment and NIHTB-CB administration: the “NIH Toolbox Cognitive Battery Supplemental Administrator's Manual for Intellectual and Developmental Disabilities” (e-Manual; hereafter Supplemental Manual, links.lww.com/WNL/B58 ). 22 Strategies to proactively improve feasibility and to reduce participant stress included using a visual schedule before the visit, a caregiver questionnaire on behaviors and potential reinforcement rewards, and a visual token board of the NIHTB-CB for the participant to check off during testing. Best practices in administering standardized assessments with appropriate accommodations for ID were used. 23 Test-specific guidelines are available in the manual to aid future users of the NIHTB-CB in standard administration and feasibility specifically for the ID population. In addition, the Supplemental Manual includes the Administration Form that we developed to document test environment, behavioral responses, and validity of tests for each participant.

Six tests were preselected as convergent validity measures for the NIHTB-CB. The NEPSY Inhibition subtest 16 (NEPSY-In), iPad version, was used as the convergent measure for DCCS. The NEPSY-In measures cognitive flexibility and inhibitory control. From piloting the NEPSY-In, we found that participants could rarely do the most difficult level (Switching). We thus administered only the Naming and Inhibition portions and created a prorated score indicating the number of correct items per minute. For Flanker, we used the Conners Kiddie Continuous Performance Test 2nd Edition 17 , administered on a computer with a spacebar as the response button. The hit reaction time SD was used as the convergent validity variable. For LS convergent validity, the SB-5 verbal working memory raw score was used. PC validity was measured with the Wechsler Preschool and Primary Scale of Intelligence, 4th Edition 18 Bug Search, from the number of correct items per minute. The Leiter International Performance Scale, 3rd Edition 19 Forward Memory (FM) subtest assesses sequential memory span. The raw score was the convergent variable for PSM. The Peabody Picture Vocabulary Test, 4th edition 20 (PPVT-4) measures receptive vocabulary. The raw score from the PPVT-4 iPad version was used for PVT. For OR, the Woodcock Johnson 4th Edition 21 Letter-Word Identification was used, which measures letter recognition and single word reading. Discriminant measures for NIHTB-CB tests were selected out of these measures by choosing a feasible measure of a different construct than the NIHTB-CB test.
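The rate-based convergent scores (correct items per minute for NEPSY-In and Bug Search) amount to a simple proration. The sketch below shows that arithmetic; the function name is hypothetical, and the text does not specify how the Naming and Inhibition portions of the NEPSY-In were combined before prorating, so that step is omitted.

```python
def correct_per_minute(n_correct, seconds_elapsed):
    """Prorate a count of correct items to a per-minute rate."""
    if seconds_elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_correct * 60.0 / seconds_elapsed


# Hypothetical example: 30 correct items in 90 seconds.
rate = correct_per_minute(30, 90)
```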

In addition, PSM has multiple forms available for each test version. Pilot reliability results suggested nonequivalence of forms. To assess PSM reliability, Form A was used at visit 1 and for half of participants at visit 2 (PSM A-A); the other half received Form B at visit 2 (PSM A-B). For LS, pilot studies demonstrated that additional teaching items improved feasibility. For the current study, PowerPoint slides of these teaching items were used before test items on LS Age 3–6. The NIHTB-CB developers then released the LS Age 3–6 Experimental version with these extended instructions during the study, which was subsequently used.

The NIHTB-CB is a computerized assessment validated in ages 3 to 85 years in the general population. The battery includes 7 tests: Dimensional Change Card Sort (DCCS), Flanker Inhibitory Control and Attention (Flanker), List Sorting Working Memory (LS), Pattern Comparison Processing Speed (PC), Picture Sequence Memory (PSM), Picture Vocabulary (PVT), and Oral Reading Recognition (OR). 9 , 15 The NIHTB-CB provided experimental Developmental Extension (DEXT) versions of DCCS and Flanker designed to be more accessible to lower-functioning or very young participants. Because tests have multiple age versions, the participant's mental age derived from the SB-5 was used to select test versions, allowing for a starting point of reasonable difficulty, thereby reducing frustration and improving compliance. The DEXT versions were used for participants in the 3- to 7-year mental age range. For PVT and OR, there is 1 computerized adaptive testing (CAT) version, and the start point is typically based on age (children) or education (adults). Instead, we used the education override feature, entering the grade equivalent of the mental age as the start point.

The NIHTB-CB and all convergent validity measures (see below) were completed at visit 1 across 2 days. After completion of the SB-5, the order of remaining assessments was randomized with the exception of the NIHTB-CB, which was the first assessment of day 2. The order of the NIHTB-CB tests was randomized for each participant. At visit 2, the NIHTB-CB was readministered with the same test order within participants.
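One way to implement the ordering scheme described above (a random test order per participant at visit 1, reused unchanged at visit 2) is sketched below. The seeding scheme, the participant identifier, and the function names are assumptions for illustration; the study does not describe how its randomization was generated or stored.

```python
import random

TESTS = ["DCCS", "Flanker", "LS", "PC", "PSM", "PVT", "OR"]


def visit1_order(participant_id, study_seed="demo"):
    """Draw a reproducible random test order for one participant."""
    rng = random.Random(f"{study_seed}-{participant_id}")
    order = TESTS[:]       # copy so the master list is never mutated
    rng.shuffle(order)
    return order


def visit2_order(stored_visit1_order):
    """Visit 2 reuses the participant's own visit 1 order."""
    return list(stored_visit1_order)


order_v1 = visit1_order("P001")  # "P001" is a hypothetical ID
order_v2 = visit2_order(order_v1)
```

Holding the order fixed within participants keeps order effects from being confounded with retest change, while randomizing across participants spreads any fatigue effect evenly over the tests.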

In all, 288 participants consented to the study. After completion of the SB-5, 45 participants were ineligible: 16 with an FSIQ >80 and 29 with a mental age <3.0 years. One participant discontinued due to behavioral noncompliance. Across sites, 242 participants completed initial neuropsychological testing, with 228 completing retesting of the NIHTB-CB ≈1 month later to examine test reliability. This retest duration was selected to evaluate reliability within a typical time interval used in clinical trials. Participants included 91 with DS, 75 with FXS, and 76 with OID. A subset of 21 participants with OID had a diagnosed ID-associated syndrome; the represented syndromes were 16p11.2 deletion (1), 22q11.2 deletion (1), Bannayan Riley Ruvalcaba (1), cri-du-chat (1), fetal alcohol (5), Floating-Harbor (1), Kleefstra (1), mitochondrial disease (1), mosaic trisomy 8 (1), neurofibromatosis type 1 (2), Phelan-McDermid (1), Potocki-Lupski (4), and Williams (1).
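The participant flow reported above can be checked with a few lines of arithmetic. All counts are taken directly from the text; only the variable names are introduced here.

```python
consented = 288
ineligible = {"FSIQ > 80": 16, "mental age < 3.0 years": 29}
discontinued = 1

# Participants who completed initial testing.
completed = consented - sum(ineligible.values()) - discontinued

groups = {"DS": 91, "FXS": 75, "OID": 76}

# Diagnosed ID-associated syndromes within the OID group.
oid_syndromes = {
    "16p11.2 deletion": 1, "22q11.2 deletion": 1,
    "Bannayan Riley Ruvalcaba": 1, "cri-du-chat": 1, "fetal alcohol": 5,
    "Floating-Harbor": 1, "Kleefstra": 1, "mitochondrial disease": 1,
    "mosaic trisomy 8": 1, "neurofibromatosis type 1": 2,
    "Phelan-McDermid": 1, "Potocki-Lupski": 4, "Williams": 1,
}

retested = 228
retest_rate = retested / completed  # proportion retested after about 1 month
```

The counts are internally consistent: the three groups sum to the 242 completers, the listed syndromes sum to 21, and about 94% of completers were retested.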

Because of the study aim to measure the sensitivity of the battery to syndrome-specific cognitive phenotypes, 3 groups were recruited. Two ID-associated syndromes were chosen: DS (affecting ≈1 in 700) 11 and FXS (affecting ≈1 in 7,000 males and 1 in 11,000 females). 12 Individuals with ID of other or unknown cause (OID) were also recruited to evaluate the NIHTB-CB within a more heterogeneous group and to serve as comparison to FXS and DS. Eligible participants met the following criteria: chronologic age of 6 through 25 years; full-scale IQ (FSIQ) <80 on the Stanford-Binet, 5th edition (SB-5), 13 mental age of at least 3.0 years on the SB-5 (in concordance with the lowest chronologic age limit of the NIHTB-CB), adaptive behavior deficits as measured by the Vineland Adaptive Behavior Scales, 3rd edition Comprehensive Interview 14 (VABS-3), speech of at least short phrases, English as first language, and stable medication and intervention regimen for 6 weeks before enrollment. Exclusion criteria were uncorrected vision impairment, uncontrolled seizures, motor impairment affecting touchscreen use, and a history of head trauma, brain infection, or stroke.
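The eligibility criteria listed above can be expressed as a single screening predicate, sketched below. The record fields and function name are hypothetical. The thresholds come from the text, with "6 through 25 years" interpreted as ages 6.0 up to but not including 26.0, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Screening:
    """Hypothetical screening record; field names are illustrative only."""
    age_years: float
    fsiq: float                  # SB-5 full-scale IQ
    mental_age_years: float      # SB-5 mental age
    adaptive_deficits: bool      # per VABS-3
    speaks_short_phrases: bool
    english_first_language: bool
    weeks_stable_regimen: float  # stable medication and intervention
    exclusions: tuple = ()       # e.g. ("uncontrolled seizures",)


def is_eligible(s):
    """Apply the inclusion and exclusion criteria stated in the text."""
    return (
        6 <= s.age_years < 26
        and s.fsiq < 80
        and s.mental_age_years >= 3.0
        and s.adaptive_deficits
        and s.speaks_short_phrases
        and s.english_first_language
        and s.weeks_stable_regimen >= 6
        and not s.exclusions
    )


candidate = Screening(12.5, 55, 5.2, True, True, True, 8)
too_young_mentally = Screening(7.0, 40, 2.8, True, True, True, 8)
```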

There was a main effect of group on NIHTB-CB z scores [F(2, 238) = 4.90, p = 0.008], as well as a main effect of test on z scores [F(6, 1,067) = 17.26, p < 0.001]. These were qualified by a significant group × test interaction [F(12, 1,067) = 3.68, p < 0.001] and a significant IQ × test interaction [F(6, 1,067) = 6.72, p < 0.001]. Follow-up Tukey tests showed that FXS performed worse on DCCS than OID [t(238) = 4.02, p < 0.001, d = 0.52] and worse than DS [t(238) = −3.04, p = 0.007, d = 0.39]. On Flanker, FXS performed worse than OID [t(238) = 3.45, p = 0.002, d = 0.45] and worse than DS [t(238) = −2.85, p = 0.01, d = 0.37]. These 2 EF test results supported the hypothesized EF impairment in FXS, although the DS impairment relative to OID was not supported. On PVT, FXS performed better than DS, fitting with expected language strength in FXS [t(238) = −2.77, p = 0.02, d = 0.36]. On PC, DS showed a poorer performance than OID, which approached significance [t(238) = 2.24, p = 0.07, d = 0.29]. There were no significant group differences on LS, PSM, or OR.

Figure 2. The z scores on each test (representing group performance relative to the general population average performance) are shown, derived from the mixed-model analysis of variance, adjusted for IQ. A z score of 0 (horizontal line at top) represents the average performance in the general population normative sample. The z scores <0 represent the number of SDs below the general population average for the chronologic age band. Error bars represent 95% confidence intervals. *Comparison between fragile X syndrome (FXS) and intellectual disability (ID) of other or unknown cause. †Comparison between FXS and Down syndrome. *p < 0.05; **p < 0.01; ***p < 0.001; †p < 0.05; ††p < 0.01; †††p < 0.001. DCCS = Dimensional Change Card Sort; LS = List Sorting Working Memory; OR = Oral Reading and Recognition; PC = Pattern Comparison Processing Speed; PSM = Picture Sequence Memory; PVT = Picture Vocabulary.

To examine the specificity of the NIHTB-CB to detect syndrome-specific performance, a 2-way mixed-model analysis of variance was conducted on NIHTB-CB test z scores with group as a between-participants factor and NIHTB-CB test as a within-participants factor; we also examined their interaction ( figure 2 ). IQ was included as a repeated-measures varying covariate, and an IQ-by-test interaction term was included to allow the effect of IQ to vary by test. Because groups differed on FSIQ, covarying IQ aimed to clarify whether group results reflect phenotype-specific impairments or if they simply reflect globally poorer performance due to overall level of cognitive functioning. To avoid overcontrolling for the domain of interest (because NIHTB-CB domains overlap with components of IQ), verbal IQ was used as the covariate for the fluid NIHTB-CB test outcomes (DCCS, Flanker, LS, PC, and PSM), and nonverbal IQ was used as the covariate for crystallized NIHTB-CB tests (PVT and OR). To obtain effect sizes for pairwise comparisons, the Cohen d was calculated from the estimated marginal means from the model to account for the effects of IQ.
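Two pieces of the analysis described above lend themselves to a short sketch: the choice of IQ covariate by test type and the effect size computed from estimated marginal means. The covariate rule follows the text. The standardizer used for Cohen d is not stated, so dividing by a model-based SD supplied by the caller is an assumption, and all names and example values are hypothetical.

```python
FLUID_TESTS = {"DCCS", "Flanker", "LS", "PC", "PSM"}
CRYSTALLIZED_TESTS = {"PVT", "OR"}


def covariate_for(test):
    """Pick the IQ covariate from the opposite domain to avoid overcontrol."""
    if test in FLUID_TESTS:
        return "verbal_iq"
    if test in CRYSTALLIZED_TESTS:
        return "nonverbal_iq"
    raise ValueError(f"unknown test: {test}")


def cohen_d_from_emm(emm_a, emm_b, sd):
    """Standardized difference between two IQ-adjusted marginal means.

    `sd` is whatever standardizer the analyst chooses (for example, the
    model residual SD); which one the study used is not stated.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (emm_a - emm_b) / sd


# Hypothetical adjusted means on the z-score scale.
d_example = cohen_d_from_emm(-2.0, -2.5, 1.0)
```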

Table 5 provides the ecologic validity of NIHTB-CB tests and composites. The composites each had moderate to strong correlations with FSIQ ( figure 1B ), as did all NIHTB-CB test scores other than Flanker DEXT. VABS-3 Adaptive Behavior Composite had small but significant correlations with several tests (Flanker, PC, PSM, PVT, and OR) and with the Crystallized and Cognitive Function composites, with better performance associated with higher levels of adaptive behavior.

In DCCS, for a subgroup of participants below a raw score of 1.88, there appeared to be no association with the NEPSY-In; in this subgroup, DCCS score was not significantly correlated with NEPSY-In score (r = 0.29, p = 0.16). This DCCS score represents participants who did not pass the introductory switching portion of the test. When this subgroup was removed, validity improved (r = 0.57, p < 0.001).

Test-retest reliability was assessed after ≈1 month (mean = 31.7 days, SD = 6.3 days) ( table 3 ). ICCs on each test were moderate to strong, with the exception of DCCS DEXT and PSM A-A, which were in the high 0.40s. All composites had strong reliability. Three tests had small but significant visit 1 to visit 2 effect sizes, reflecting a modest increase in performance: Flanker, PC, and the PSM A-B group. However, the PSM A-B increase likely reflects nonequivalent forms rather than a practice effect because the PSM A-A group had no practice effect. The Fluid and Cognitive Function composites also had small but significant increases, suggestive of small practice effects. Within groups, reliability coefficients were mostly moderate to strong; some DEXT and PSM reliabilities in small sample sizes were not significant: Flanker DEXT in OID (ICC = 0.46, p = 0.14) and both PSM groups in DS (A-A: ICC = 0.24, p = 0.15; A-B: ICC = 0.33, p = 0.05). In FXS, DCCS reliability (ICC = 0.41) was notably lower than in the total sample (ICC = 0.71), but FXS Flanker reliability (ICC = 0.84) was stronger than in the total sample (ICC = 0.74).

Feasibility data are provided in table 2 as the percentage of participants with valid scores on each test. Feasibility overall was similar to the normative 3- to 15-year-old sample, 26 with DCCS, PC, and LS having slightly lower feasibility and Flanker, PSM, PVT, and OR having similar or higher feasibility rates than the normative sample. Even down to a mental age of 3 years, PSM, PVT, and OR feasibility was very good. The feasibility of the remaining tests improved particularly at 5 years. All tests were feasible for nearly every participant with a mental age of ≥6 years.
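Feasibility as defined above, the percentage of participants with a valid score on each test, can be tabulated by mental age with a short routine like the following. The record layout, the whole-year banding of mental age, and the example data are assumptions for illustration and are not taken from table 2.

```python
from collections import defaultdict


def feasibility_by_band(records):
    """Percent of valid scores per (test, whole-year mental-age band).

    `records` is an iterable of (test, mental_age_years, is_valid) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_valid, n_total]
    for test, mental_age, is_valid in records:
        key = (test, int(mental_age))     # e.g. 4.7 years falls in band 4
        counts[key][0] += int(is_valid)
        counts[key][1] += 1
    return {key: 100.0 * v / n for key, (v, n) in counts.items()}


# Hypothetical administration records.
records = [
    ("DCCS", 3.4, False), ("DCCS", 3.9, True),
    ("DCCS", 5.1, True), ("DCCS", 5.6, True),
    ("PVT", 3.2, True), ("PVT", 3.8, True),
]
table = feasibility_by_band(records)
```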

Table 1 presents descriptive statistics by diagnostic group and overall. Groups did not differ significantly by chronologic age [F(2, 238) = 0.91, p = 0.41] or by VABS-3 Adaptive Behavior Composite [F(2, 227) = 1.50, p = 0.23]. However, FSIQ differed significantly by group [F(2, 238) = 31.6, p < 0.001], with FSIQ higher in OID than in both DS [t(238) = 7.57, p < 0.001] and FXS [t(238) = 6.05, p < 0.001]. FSIQ did not significantly differ between DS and FXS [t(238) = 1.23, p = 0.22]. Similarly, mental age was significantly different by group [F(2, 238) = 25.3, p < 0.001], with mental age higher in OID than in both DS [t(238) = 6.76, p < 0.001] and FXS [t(238) = 5.40, p < 0.001]. Mental age did not differ significantly between DS and FXS [t(238) = 1.10, p = 0.27]. Figure 1A shows the distribution of the NIHTB-CB Cognitive Function Composite age-adjusted standard scores of the sample without the current imposed floor, illustrating the variability of the sample below this floor and the benefit of deviation-based composites (used in all analyses).

Discussion

This study provides the first comprehensive examination of the psychometric properties and feasibility of the NIHTB-CB for individuals with ID, with an initial focus on 2 of the most common genetic causes with robust translational research programs: FXS and DS. Overall, the NIHTB-CB has demonstrated strong potential for use as an objective, standardized outcome measure that can be confidently used in ID trials with participants with a mental age of 5 years or higher. Results of the study demonstrate very strong psychometrics for the Crystallized reasoning tests (PVT and OR) and good to excellent performance of Fluid reasoning tests, with more variation across FXS, DS, and OID groups for some measures (e.g., strong reliability for Flanker in FXS compared to DS, and vice versa for DCCS). Indeed, the Fluid Composite appeared to have more consistently strong reliability across conditions and a solid convergent association with FSIQ. Thus, it may be a good candidate outcome measure for studies seeking to examine broad nonverbal cognitive changes for individuals with a mental age of ≥5 years. Below a mental age of 5 years, feasibility was more variable across tests, indicating the need for further adaptations, scoring algorithms on developmental extensions, or new tests targeting these lower-functioning individuals.

The Supplemental Manual was compiled after hundreds of administrations of the NIHTB-CB to individuals with ID, and the study results on feasibility, reliability, and validity support its use. We encourage researchers and examiners planning to use the NIHTB-CB to follow these guidelines for the ID population.

Group comparisons demonstrated that the NIHTB-CB is sensitive to substantial EF deficits among individuals with ID in that all 3 groups performed relatively poorly on Flanker and DCCS. In particular, participants with FXS showed weakness in inhibitory control and attention and cognitive flexibility, in excess of their general cognitive level and compared to controls with other forms of ID. This aligns with previous research showing that boys with FXS are impaired in inhibitory control, set shifting, and planning relative to mental age–matched controls27 and that boys with FXS have impairments in inhibition and attention relative to mental age–matched controls and relative to children with DS.28 The hypothesized EF weakness in DS compared to OID (to a lesser extent than FXS) was not found; however, there have been some mixed results on EF in children with DS compared to children with other IDs.29,30 The low-scoring subgroup on DCCS are those who failed introduction to the switching portion of the test; for participants who cannot perform switching at the earliest level, the test may be less sensitive to variation in EF. Removing this subgroup from analyses strengthened the convergent validity correlation. This suggests that these individuals' level or lack of cognitive flexibility may not be captured by the introductory switching portion of DCCS. An extension of DCCS that does capture this construct in very young or low-functioning persons would have much value. The lower reliability of DCCS in the FXS group may also reflect this limitation in that participants with FXS overall scored very low on this test. Because anxiety and hyperactivity are common in FXS, it is possible that individual state interacts with the task complexity and cognitive flexibility to result in more variable performance over time in this group. 
When participants needed Flanker DEXT or DCCS DEXT, they were almost always able to perform the tests; however, a few issues remain to be worked out regarding these experimental versions, notably the ability to interpret these scores relative to the standard Flanker and DCCS scores. This study highlighted other concerns with the DEXT measures (e.g., difficulty may be ordered incorrectly, or portions are overly burdensome). Our results provide clear evidence that DEXT levels are necessary (and extremely feasible), especially in those with FXS and DS, but that further refinements and modifications are necessary.

To improve feasibility of LS, extra instructions and practice items were developed and used. The LS Age 3–6 experimental version with these additions is now available. Feasibility did improve after our initial pilot studies; however, the test remains challenging for this population, and some participants pass practice but get no test sequences completely correct. The limited feasibility and floored or low variability in raw scores suggest that a lower range of LS is necessary. A potential approach may be to give some credit on early sequences that are partially recalled or items recalled out of sequence because the construct of working memory builds on more basic short-term memory (e.g., SB-5 Working Memory index and Wechsler Working Memory indices).13,31

Both language tests, PVT and OR, had excellent performance in our samples of individuals with ID. Both demonstrated strong reliability and clear domain specificity with a much higher convergent than discriminant correlation. The NIHTB-CB was sensitive to expected language characteristics in that DS was impaired relative to FXS on PVT.32 These tests have an advantage over other language measures such as the PPVT-4 in that PVT and OR are brief (≈3 minutes each) but accurate, owing to CAT; the ability to obtain results with a brief assessment is especially important in individuals with frequent behavior or attention issues such as in those with ID.

Episodic memory is relevant to clinical trials, particularly for DS, in which memory impairments are well documented,33 and FXS, in which memory of sequential information is especially impaired.34 It is important to emphasize that despite these known weaknesses, PSM was among the highest of the test scores in each group (figure 2), suggesting that individuals may have compensatory strategies for performing well, perhaps such as use of the contextual information in the stories. Reliability was moderate and lower than that of the normative 3- to 15-year-old study, although improved from our pilot study. The significant effect size in the PSM A-B group suggests nonequivalence of Forms A and B. Notably, in DS, ICCs were small and nonsignificant, suggesting that in its present form, PSM appears contraindicated as a separate outcome measure for this population.

The high rate of ceiling scores on PSM 3–4 suggests that mental age may not be a good indicator of start point on PSM in ID, at least at this mental age level. It is also possible that this age version is too easy in comparison to PSM scores on older versions; perhaps a person's score on 1 PSM version is not equivalent to that person’s score on another. Development of a CAT PSM test with the full range of version difficulties would likely simplify the testing process and yield more comparable and reliable scores.

PC showed adequate feasibility overall and good feasibility in FXS and OID, with sufficient feasibility at a mental age of ≥4 years. While the ICC was excellent, there was a small but significant practice effect, although smaller than that found in the normative sample.35 The most common reason for lack of feasibility on PC was an invalid alternating response pattern, especially common in DS. The task has an inherent challenge of understanding “same” and “different” while mapping this choice onto smiley face and frown face options. For participants without invalid response patterns, PC performs well in ID. On the basis of the feasibility challenges, we developed a new processing speed task, Speeded Matching, now available as an experimental version in the app. The task is to select the animal face among 3 foils that matches a target image. Future psychometric studies will provide more information about its performance.

The Fluid, Crystallized, and Cognitive Function composites demonstrated reliability results similar to those of the normative age 3 to 6 sample, with small practice effects in Fluid and Cognitive Function composites (although smaller than in the normative sample). Each composite was well correlated with FSIQ. Although not all participants were able to receive a composite (due to missing valid scores on some tests), when complete, the composites appear to perform well. The deviation method used to create these composites has a clear advantage over the current age-corrected standard scores, on which more than half of our sample obtained scores at the lower limit, currently set at 54. We are conducting analyses to identify the best option for composite scores below this floor (deviation approach vs extension of existing age-adjusted standard scores). These findings provide further evidence that test developers should consider and address the range and sensitivity of tests and scores for individuals with moderate to severe ID.24,36

This study has some important limitations that warrant consideration. Construct validity challenges are inherent with the ID population in that fully adequate convergent validity measures are not always available or they present their own feasibility and psychometric limitations. The lack of clear discriminant validity in most measures is another challenge. While discriminant correlations are generally desired to be markedly lower than the convergent correlations, in early development, domains of cognition (especially EFs) are thought to be unidimensional, with increasing differentiation of constructs occurring through early adulthood.37 In ID, there is likely even less differentiation between domains than in typically developing children. Our discriminant validity results are similar to normative 3- to 6-year-old results.38 Therefore, the Fluid Composite may be a good outcome measure choice on the basis of its psychometric performance and some limitations of subtest construct differentiation in this population. The study was also limited by sample size in evaluations of group-specific results. Future work with larger samples should provide more clarity about reliability and validity within individual ID subgroups.

The NIHTB-CB is a promising outcome measure for ID clinical trials and for many types of nonintervention observational studies. Although not originally intended for clinical use or for special education purposes, ongoing and future research may be done to explore such applications23 such as in the school psychology setting for accurate and feasible assessment of students with IDs. In addition, the Supplemental Manual developed from this study provides key guidance to examiners and researchers working with ID populations; the procedures on administration, test environment, fidelity, and scoring not only improve examiner familiarity and comfort but, more important, extend accessibility of the NIHTB-CB to more cognitively impaired or behaviorally challenged individuals.

Besides evaluating the NIHTB-CB as an appropriate assessment for ID in general, the present results demonstrate the sensitivity of the battery to known syndrome-specific cognitive phenotypes, such as the impairment of EFs in FXS relative to other ID and to DS. A critical remaining question is the degree to which the battery is sensitive to change, especially to effects of intervention. As an initial test of sensitivity to change, we are currently collecting longitudinal data from study participants to explore natural developmental changes within each NIHTB-CB test and the composites relative to measures already established as change sensitive such as the SB-5 and Vineland. The present results warrant the next step of evaluating the NIHTB-CB in ID for individuals down to a mental age of 5 years to demonstrate the treatment-specific sensitivity of the battery and to determine the degree to which measure gains reflect functional improvements in daily life. Below a mental age of 5 years, the NIHTB-CB performs more variably, and adaptations to the lower test ranges or scoring adaptations are needed for some measurement domains. Studies of the performance of the battery in older adults with ID are needed, especially focusing on those experiencing cognitive decline or dementia. Overall, the present validation results represent an important step toward providing an objective, scalable, and standardized method for successfully measuring cognition and tracking cognitive changes in ID.