Abstract Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http://jakewestfall.org/ivy/) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity.

Citation: Westfall J, Yarkoni T (2016) Statistically Controlling for Confounding Constructs Is Harder than You Think. PLoS ONE 11(3): e0152719. https://doi.org/10.1371/journal.pone.0152719 Editor: Ulrich S. Tran, University of Vienna, School of Psychology, AUSTRIA Received: January 18, 2016; Accepted: March 17, 2016; Published: March 31, 2016 Copyright: © 2016 Westfall, Yarkoni. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Reanalyzed data are from the Eugene-Springfield Community Sample, the maintainer of which (Lewis Goldberg) may be contacted at lewg@ori.org. Funding: The authors received no specific funding for this work. Competing interests: The authors have declared that no competing interests exist.

Introduction A common goal of statistical analysis in the social sciences is to draw inferences about the relative contributions of different variables to some outcome variable. When regressing academic performance, political affiliation, or vocabulary growth on other variables, researchers often wish to determine which variables matter to the prediction and which do not—typically by considering whether each variable’s contribution remains statistically significant after statistically controlling for other predictors. When a predictor variable in a multiple regression has a coefficient that differs significantly from zero, researchers typically conclude that the variable makes a “unique” contribution to the outcome. And because measured variables are typically viewed as proxies for latent constructs of substantive interest—for example, two cognitive ability measures might be taken to index spatial versus verbal ability—it is natural to generalize the operational conclusion to the latent variable level; that is, to conclude that the latent construct measured by a given predictor variable itself has incremental validity in predicting the outcome, over and above other latent constructs that were examined [1,2]. Incremental validity claims pervade the social and biomedical sciences. In some fields, these claims are often explicit. To take the present authors’ own field of psychology as an example, a Google Scholar search for the terms “incremental validity” AND psychology returns (in January 2016) over 18,000 hits—nearly 500 of which contained the phrase “incremental validity” in the title alone. 
More commonly, however, incremental validity claims are implicit, as when researchers claim that they have statistically “controlled” or “adjusted” for putative confounds. This practice is exceedingly common in fields ranging from epidemiology to econometrics to behavioral neuroscience (Google Scholar searches for “after controlling for” and “after adjusting for” each produce over 300,000 hits). The sheer ubiquity of such appeals might well give one the impression that such claims are unobjectionable, and if anything, represent a foundational tool for drawing meaningful scientific inferences. Unfortunately, incremental validity claims can be deeply problematic. As we demonstrate below, even small amounts of error in measured predictor variables can result in extremely poorly calibrated Type I error probabilities. This basic problem has been discussed in a number of literatures: most extensively in epidemiology and biostatistics, where concerns about incremental validity claims are often discussed under the heading of residual confounding [3–5], but also in fields ranging from psychology to education to econometrics [6–11]. The common thread is that measurement unreliability and model misspecification will often have a large, deleterious effect on parameter estimates (and associated error rates) when covariates are entered into regression-based models. Consequently, under realistic assumptions, it can be shown that a large proportion of incremental validity claims in many disciplines are likely to be false. In this paper, we develop and apply a general statistical and conceptual framework for understanding and evaluating claims about incremental validity. We begin by providing an intuitive statement of the problem using simple examples and simulated data. We discuss the most common forms of incremental validity argument and identify the unstated assumptions they rest on. 
Next, we introduce a formal statistical framework for analytically determining the expected Type I error rate of incremental validity claims as a function of key parameters like sample size, effect size, and reliability. We demonstrate that the likelihood of spurious inference is surprisingly high under real-world conditions, and often varies in counterintuitive ways across the parameter space. For example, we show that, because measurement error interacts in an insidious way with sample size, the probability of incorrectly rejecting the null and concluding that a particular construct contributes incrementally to an outcome quickly approaches 100% as the size of a study grows. In the latter part of the paper, we consider potential solutions to the problems we have identified. We focus attention on structural equation modeling (SEM) methods that can maintain appropriate Type I error rates provided certain assumptions are met—or, alternatively, that can be used to identify the boundary conditions under which an observed association can be said to hold. We also provide a novel perspective on power analysis that takes the measurement unreliability of covariates into account, providing more realistic—and surprisingly large—estimates of the sample sizes typically required to support incremental validity claims. Taken as a whole, our work provides a formal framework for understanding the effects of multiple predictors on significance testing in the presence of unreliability, and offers practical guidelines for dealing with a very common, but largely unappreciated, problem.

An Intuitive Statement of the Problem Incremental validity claims come in a number of different forms. The most basic and common of these is what might be called the argument for predictive utility. Stated abstractly, it says: “If measurements of construct X correlate significantly with outcome Y even when controlling for existing measure(s) Z, then X is a useful predictor of Y, over and above Z.” As noted above, examples of this argument abound throughout the social and biomedical sciences. For example, epidemiologists have concluded that eating processed meat significantly increases colorectal cancer risk, on the basis of prospective studies that consistently find a positive association between the two variables when controlling for a host of confounding variables [12,13]. Organizational psychologists advocate the use of measures such as Emotional Intelligence on the grounds that they incrementally predict job performance when controlling for standard personality and cognitive ability measures [14,15]. Political scientists frequently seek to quantify the incremental contributions of specific demographic variables to voting preferences (e.g., are higher-income individuals more likely to vote Republican in US elections after controlling for differences in education level, race, state, etc.; [16–18]). And cognitive neuroscientists' arguments for the utility of brain-based predictive models are often predicated on those models’ putative ability to predict real-world outcomes (e.g., product purchases or smoking cessation) above and beyond relevant self-report variables [19,20]. In all of these cases—and thousands of others—the claims in question may seem unobjectionable at face value. After all, in any given analysis, there is a simple fact of the matter as to whether or not the unique contribution of one or more variables in a regression is statistically significant when controlling for other variables; what room is there for inferential error? 
Trouble arises, however, when researchers behave as if statistical conclusions obtained at the level of observed measures can be automatically generalized to the level of latent constructs [9,21]—a near-ubiquitous move, given that most scientists are not interested in prediction purely for prediction’s sake, and typically choose their measures precisely so as to stand in for latent constructs of interest. That is, researchers typically do not care to show that, say, school vouchers are associated with improved academic performance after controlling for a specific survey item asking about respondents’ income bracket; rather, the goal is to show that the vouchers may improve performance after accounting for the general construct of income (or, more generally, socioeconomic status). To see the problem intuitively, consider a slight alteration of a familiar example from many introductory data analysis courses. Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths. Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Fig 1, with the reliability of the subjective heat ratings set to 0.40—a fairly typical level of reliability for a single item in psychology. (A conventionally acceptable level of reliability for sum-scores derived from a measurement scale in psychology is around 0.8. 
If such a scale consists of six parallel items, which would be a fairly typical number of items in many contexts, then by the Spearman-Brown formula, the reliability of each individual item would be around 0.4.) Fig 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.
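The ice cream example can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the authors' simulation: the loading of 0.7 is an assumption (it gives a population simple correlation of 0.7 × 0.7 = 0.49), the heat rating is kept continuous rather than rounded to a 7-point scale, and all function names are hypothetical. The `spearman_brown` helper reproduces the arithmetic linking an item reliability of 0.4 to a six-item scale reliability.

```python
import numpy as np


def spearman_brown(item_reliability, n_items):
    """Reliability of a sum of n_items parallel items (Spearman-Brown)."""
    return n_items * item_reliability / (1 + (n_items - 1) * item_reliability)


def partial_corr(x, y, z):
    """Correlation between x and y after regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


def simulate_summer(n_days=120, loading=0.7, reliability=0.4, seed=0):
    """One batch of standardized daily data (hypothetical parameter values)."""
    rng = np.random.default_rng(seed)
    temperature = rng.standard_normal(n_days)
    noise_sd = np.sqrt(1 - loading**2)
    # Ice cream sales and pool deaths depend on temperature only, not on each other.
    ice_cream = loading * temperature + noise_sd * rng.standard_normal(n_days)
    deaths = loading * temperature + noise_sd * rng.standard_normal(n_days)
    # Heat rating: the share of its variance due to true temperature is `reliability`.
    heat_rating = (np.sqrt(reliability) * temperature
                   + np.sqrt(1 - reliability) * rng.standard_normal(n_days))
    return temperature, ice_cream, deaths, heat_rating


if __name__ == "__main__":
    print("six-item scale reliability:", round(spearman_brown(0.4, 6), 3))
    temperature, ice_cream, deaths, heat_rating = simulate_summer()
    print("simple r:", np.corrcoef(ice_cream, deaths)[0, 1])
    print("partial r given heat rating:", partial_corr(ice_cream, deaths, heat_rating))
    print("partial r given temperature:", partial_corr(ice_cream, deaths, temperature))
```

With a large `n_days` the pattern is easy to see: controlling for the noisy rating leaves a clearly positive partial correlation, while controlling for the true temperature drives it toward zero.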


Fig 1. Plot of subjective heat ratings on a 7-point Likert scale against the “true” underlying daily temperatures. https://doi.org/10.1371/journal.pone.0152719.g001


Fig 2. Illustration of residual confounding. (A) Simple relationship between daily swimming pool deaths and number of ice cream cones sold. (B) Relationship between daily swimming pool deaths and number of ice cream cones sold after controlling for subjective heat Likert ratings. (C) Relationship between daily swimming pool deaths and number of ice cream cones sold after controlling for recorded daily temperatures. https://doi.org/10.1371/journal.pone.0152719.g002

The foregoing example is based on a single batch of simulated data. If we repeat the simulation 10,000 times with the same parameter values, we find a spurious partial correlation between ice cream sales and swimming pool deaths when controlling for subjective heat ratings 92% of the time. While the variables in the above example were deliberately chosen so that the absurdity of the hypothetical relationship is clear, the parameter values upon which it is based, and the structure of the statistical argument itself, are representative of many common research situations. Table 1 presents expected Type I error rates for a number of other parameter regimes common to different scientific disciplines, ranging from small-sample lab-based experiments involving large effects (e.g., n = 30, r = 0.6) to population-level models involving tens of thousands of individuals and putatively small associations (n = 30,000, r = 0.2). In each case, we quantify the probability of (incorrectly) rejecting the null, that is, of concluding that a construct of interest makes a statistically significant contribution after controlling for a putative confound, when in fact the confound fully accounts for the relationship at the latent-variable level. For simplicity, we assume that measurement reliability is identical for both predictors.
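The repeated-simulation estimate described above can be sketched as a small Monte Carlo routine. The parameter values (loadings of 0.7, a continuous rather than Likert-scaled covariate) and the function name are assumptions, and the significance test uses the Fisher z normal approximation rather than an exact test, so the result will not in general reproduce the 92% figure; the sketch only shows the mechanics of counting spurious rejections.

```python
import numpy as np
from statistics import NormalDist


def spurious_rejection_rate(n=120, loading=0.7, reliability=0.4,
                            n_sims=10_000, alpha=0.05, seed=0):
    """Share of simulated studies that reject a true null of no partial effect."""
    rng = np.random.default_rng(seed)
    shape = (n_sims, n)  # one row per simulated study
    confound = rng.standard_normal(shape)
    noise_sd = np.sqrt(1 - loading**2)
    # Predictor and outcome are related only through the shared confound.
    predictor = loading * confound + noise_sd * rng.standard_normal(shape)
    outcome = loading * confound + noise_sd * rng.standard_normal(shape)
    # Observed covariate: an error-laden measure of the confound.
    covariate = (np.sqrt(reliability) * confound
                 + np.sqrt(1 - reliability) * rng.standard_normal(shape))

    def row_corr(a, b):
        a = a - a.mean(axis=1, keepdims=True)
        b = b - b.mean(axis=1, keepdims=True)
        return (a * b).sum(axis=1) / np.sqrt((a**2).sum(axis=1) * (b**2).sum(axis=1))

    r_py = row_corr(predictor, outcome)
    r_pc = row_corr(predictor, covariate)
    r_cy = row_corr(covariate, outcome)
    partial = (r_py - r_pc * r_cy) / np.sqrt((1 - r_pc**2) * (1 - r_cy**2))
    # Fisher z test for a partial correlation with one covariate (approximation).
    z = np.arctanh(partial) * np.sqrt(n - 4)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return float(np.mean(np.abs(z) > critical))


if __name__ == "__main__":
    print("noisy covariate:  ", spurious_rejection_rate(reliability=0.4))
    print("perfect covariate:", spurious_rejection_rate(reliability=1.0))
```

Setting `reliability=1.0` serves as a sanity check: with an error-free covariate the rejection rate should sit near the nominal 5%.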


Table 1. Type I error rates for a few parameter combinations. N = sample size; ES (r) = correlations of predictor with covariate and covariate with outcome; reliability = reliability of predictor and covariate. These error rates are determined using the methods described in the next section, and described in more detail in the S1 Appendix. https://doi.org/10.1371/journal.pone.0152719.t001

While the Type I error rate varies considerably depending on sample size, effect size, and reliability, it is apparent from Table 1 that it is very often much larger than the nominal value of 5%. We submit that if there is a high probability of rejecting the null hypothesis in such situations even when it is actually true, then rejecting the null hypothesis cannot be considered convincing empirical evidence that a construct has incremental predictive utility. To be confident that an incremental validity argument is sound, one would need to either ensure perfect measurement reliability, or formally account for the potential effects of unreliability in one’s model. The former is a daunting, and usually impossible, proposition. The latter is quite feasible, but, as we discuss in a later section, cannot be accomplished with standard multiple regression.

A General Statistical Framework for Assessing Incremental Validity Having provided illustrative examples to prime readers’ intuitions, we now undertake a more comprehensive evaluation of the Type I error rates associated with incremental validity arguments. We first lay out the statistical models involved and define the relevant null hypotheses. We then quantify how the Type I error rates for tests of incremental validity claims vary across a broad range of parameter values. We have also written an interactive web application (“Ivy,” accessible online at http://jakewestfall.org/ivy/) that readers can use to explore the statistical properties of these and other incremental validity arguments for themselves. In S1 Appendix, we give the analytical derivations underlying these results, in which we determine the probabilities of rejecting different combinations of regression coefficients as a function of the simple or partial correlations among the outcome and the latent predictors, the reliabilities, and the sample size.

Consider a regression of an outcome $Y$ on two true scores $T_j$,

$$Y = \beta_0 + \beta_1 T_1 + \beta_2 T_2 + e_T,$$

with $e_T$ a random disturbance term (subscripts indexing people are omitted for simplicity). But rather than observing the latent predictors $T_j$ directly, we instead observe two imperfectly measured indicators $X_j = b_j T_j + e_j$, so that the regression we actually observe is

$$Y = \beta_0^{*} + \beta_1^{*} X_1 + \beta_2^{*} X_2 + e_X.$$

From these regressions we define the following parameters:

$\rho_1$: The simple correlation between $Y$ and $T_1$.
$\rho_{1.2}$: The partial correlation between $Y$ and $T_1$, controlling for $T_2$.
$\rho_2$: The simple correlation between $Y$ and $T_2$.
$\rho_{2.1}$: The partial correlation between $Y$ and $T_2$, controlling for $T_1$.
$\delta$: The simple correlation between $T_1$ and $T_2$.
$\alpha_1$: The reliability of $X_1$, $\operatorname{var}(b_1 T_1)/\operatorname{var}(X_1)$.
$\alpha_2$: The reliability of $X_2$, $\operatorname{var}(b_2 T_2)/\operatorname{var}(X_2)$.

(Note that in the special case where $X_1$ and $X_2$ measure the same true score, that is, $T_1 = T_2 = T$, then $\delta = 1$ and $\rho_1 = \rho_2 = \rho$.)
The core incremental validity argument (i.e., the “argument for predictive utility”) claims that $T_1$ is a useful predictor of $Y$ even after controlling for $T_2$. The corresponding null hypothesis for this argument, stated in terms of the statistical parameters just defined, is that $\rho_{1.2} = 0$. We reject this null hypothesis if we observe a significant partial correlation between $Y$ and the measured variable $X_1$, controlling for $X_2$. Type I error rates for this argument are illustrated in Fig 3. The first thing to note is that if the control variable $X_2$ is free of measurement error, the Type I error rate is, as expected, 5%. Although not illustrated in the plot, the error rate is also 5% if either $\rho_2 = 0$ or $\delta = 0$, either of which implies that the indirect effect of $T_1$ on $Y$ via $T_2$ is 0, in which case there is no confounding influence to control for. However, if $X_2$ is contaminated with any amount of measurement error, and there is any indirect effect of $T_1$ on $Y$ via $T_2$, then the Type I error rate exceeds 5%. The extent to which the error rate exceeds 5% depends on three factors: the size of the indirect effect ($\rho_2 \delta$), the sample size (n), and the reliability of $X_2$ ($\alpha_2$).
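A short derivation may help show where these factors come from. It is a sketch, assuming classical measurement error (each $e_j$ uncorrelated with the true scores, with the other error term, and with $e_T$); the symbols $r_{X_j Y}$, $r_{X_1 X_2}$, and $r_{X_1 Y \cdot X_2}$ for population correlations among the observed variables are introduced here for illustration and may not match the notation of the S1 Appendix.

```latex
\begin{align*}
r_{X_j Y} &= \sqrt{\alpha_j}\,\rho_j, \qquad
r_{X_1 X_2} = \sqrt{\alpha_1 \alpha_2}\,\delta, \\
\rho_{1.2} = 0 \;&\Longrightarrow\; \rho_1 = \rho_2\,\delta, \\
r_{X_1 Y \cdot X_2}
  &= \frac{r_{X_1 Y} - r_{X_2 Y}\, r_{X_1 X_2}}
          {\sqrt{\left(1 - r_{X_2 Y}^2\right)\left(1 - r_{X_1 X_2}^2\right)}}
   = \frac{\sqrt{\alpha_1}\,\rho_2\,\delta\,(1 - \alpha_2)}
          {\sqrt{\left(1 - \alpha_2 \rho_2^2\right)\left(1 - \alpha_1 \alpha_2 \delta^2\right)}}.
\end{align*}
```

Under these assumptions the observed partial correlation is exactly zero when $\alpha_2 = 1$ or $\rho_2 \delta = 0$, and is otherwise a fixed nonzero quantity, so a test on the observed variables rejects more often as n grows even though $\rho_{1.2} = 0$.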


Fig 3. Contour plots of Type I error probabilities for the argument for predictive utility. The null hypothesis is that $T_1$ has no partial relationship with $Y$ after controlling for $T_2$ (i.e., $\rho_{1.2} = 0$). The size of the true indirect effect of $T_1$ on $Y$ via $T_2$ varies from small (panel A) to medium (panel B) to large (panel C). https://doi.org/10.1371/journal.pone.0152719.g003

The influence of the indirect effect size is straightforward: All else equal, as the indirect effect increases, the Type I error rate increases. When the indirect effect is small (both $\rho_2$ and $\delta$ are modest; Fig 3A), the error rate is only slightly inflated. When the indirect effect is large ($\rho_2$ and $\delta$ are large; Fig 3C), the error rate is very high for most representative values of n and $\alpha_2$. The intuitive explanation for this is that measurement unreliability makes it easier for the regression model to confuse the direct and indirect paths (i.e., to apportion variance in the outcome incorrectly between the various predictors). The larger the influence of the confounding covariate, the more variance can be misattributed to the predictor of interest, leading to an increase in Type I error. The relationship between n and Type I error may be less obvious: all else equal, as sample size increases, error rates also increase. It is worth reflecting on this result, because it contravenes the received wisdom that larger samples mitigate most common statistical problems (e.g., as n grows, power to reject the null increases, parameter estimates become more precise, etc.). Indeed, we find that for studies involving thousands of participants and non-negligible indirect effects, rejection of the null hypothesis is a near certainty even when the null is in fact true (cf. Table 1). On reflection, the reason for this behavior becomes clear: as samples grow, power to detect any reliable association between the predictors and the outcome necessarily increases.
This remains true even when measurement unreliability causes the model to confuse a common effect of two or more predictors with a unique effect of one predictor: as n grows, the model more confidently concludes that there is a reliable association between the predictor of interest and the outcome. Finally, the effect of reliability on error rates is even less intuitive: there is a non-monotonic relationship, such that Type I error approaches 5% when reliability nears 0 or 1, but is highest when reliability is moderate. The error rate typically peaks when reliability is between 0.3 and 0.7, which is likely representative of many commonly used measures in the social sciences, particularly those that consist of a single item. However, even at a conventionally acceptable reliability of 0.7 or 0.8, the error rate can still be extremely high if the sample size and/or indirect effect are large. The non-monotonic effect of reliability has a compound explanation that becomes clear when one considers each extreme separately. When reliability is very low, the observed associations between all variables must be very small (i.e., power is very low), so the null cannot be rejected simply because it becomes almost impossible to detect any effect. Conversely, when reliability is very high, the model is able to avoid misattributing the effect of the covariate to the predictor of interest. In the middle, however, there exists a territory where effects are large enough to afford detection, but reliability is too low to prevent misattribution, leading to particularly high Type I error rates.
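Both patterns described above (error rates rising with n, and peaking at moderate reliability) can be explored numerically with a rough approximation. The sketch below is not the derivation in the S1 Appendix: it assumes classical measurement error, computes the population partial correlation among the observed variables when the latent partial correlation is zero, and converts it to a rejection probability with the Fisher z normal approximation. Function and argument names are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def observed_partial_corr(rho2, delta, alpha1, alpha2):
    """Population partial correlation of Y and X1 given X2 when rho_1.2 = 0."""
    numerator = sqrt(alpha1) * rho2 * delta * (1 - alpha2)
    denominator = sqrt((1 - alpha2 * rho2**2) * (1 - alpha1 * alpha2 * delta**2))
    return numerator / denominator


def approx_type1_rate(n, rho2, delta, alpha1, alpha2, level=0.05):
    """Approximate two-sided rejection rate via the Fisher z transformation."""
    normal = NormalDist()
    critical = normal.inv_cdf(1 - level / 2)
    # Expected test statistic under the (true) latent null.
    shift = atanh(observed_partial_corr(rho2, delta, alpha1, alpha2)) * sqrt(n - 4)
    return (1 - normal.cdf(critical - shift)) + normal.cdf(-critical - shift)


if __name__ == "__main__":
    # Effect of sample size at a fixed, moderate reliability.
    for n in (50, 500, 5000):
        print("n =", n, round(approx_type1_rate(n, 0.5, 0.5, 0.6, 0.6), 3))
    # Effect of reliability (same value for both measures) at fixed n.
    for rel in (0.05, 0.5, 0.95, 1.0):
        print("reliability =", rel, round(approx_type1_rate(500, 0.5, 0.5, rel, rel), 3))
```

Note that the non-monotonic shape appears here when both reliabilities vary together, as in the equal-reliability setting of Table 1; holding the focal predictor's reliability fixed while lowering only the covariate's gives a different picture.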

Statistical Power of Incremental Validity Arguments Using SEM The reanalyses presented above make it clear that, when arguing for incremental validity, the reliability of the predictors matters. Seemingly strong evidence for incremental validity based on a multiple regression model that ignores measurement error can easily disintegrate when one uses more appropriate, SEM-based methods that account for measurement error in the predictors. This observation naturally raises an important question: what kind of statistical power do incremental validity arguments have when they are based (correctly) on SEM rather than multiple regression? Naively, one might suppose that an SEM analysis should be only modestly more conservative than its multiple regression equivalent. However, multiple regression and SEM respond very differently to the presence of measurement error in a covariate. As illustrated in Fig 11, adding an increasingly unreliable covariate to an SEM model causes the standard error of the parameter estimate for the predictor of interest to grow larger and larger. The intuition for this behavior is that the model must adjust the parameter estimate to account for the overlap with the covariate, but as the covariate’s unreliability increases, it becomes increasingly unclear exactly how much of an adjustment is required. This increasing uncertainty is reflected in the increasing standard error. By contrast, multiple regression will typically show the opposite trend: the more unreliable the covariate, the more the multiple regression actually capitalizes on this unreliability by conflating the direct and indirect effects of the predictor of interest, leading to biased, inconsistent parameter estimates and inflated test statistics [6]. 
The net effect is that, as the reliability of a covariate falls, it typically becomes easier to reject the null with multiple regression (resulting, as we have already seen, in very high false positive rates when the null is true), but harder to reject the null with SEM. The latter is the correct behavior, as it reflects our expectation that introducing additional measurement error into a set of regression equations should increase the uncertainty in parameter estimates and correspondingly attenuate the test statistics for hypothesis tests on those parameters. The upshot is that shifting from multiple regression to SEM should increase the sample size required to support incremental validity claims. The key question is by how much.
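To make the contrast concrete, the following sketch compares ordinary regression slopes with slopes from a simple method-of-moments errors-in-variables correction, in which the assumed error variance of each predictor is subtracted from the observed covariance matrix before solving for the slopes. This is a stand-in for, not an implementation of, the SEM estimator discussed in the paper: it treats the reliability as known, yields no standard errors, and uses hypothetical function names and parameter values.

```python
import numpy as np


def simulate_null(n, rho2=0.7, delta=0.7, alpha2=0.4, seed=0):
    """Data in which T1 has no direct effect on Y once T2 is controlled."""
    rng = np.random.default_rng(seed)
    t2 = rng.standard_normal(n)
    t1 = delta * t2 + np.sqrt(1 - delta**2) * rng.standard_normal(n)
    y = rho2 * t2 + np.sqrt(1 - rho2**2) * rng.standard_normal(n)
    x1 = t1  # focal predictor measured without error
    # Covariate: true score plus noise sized so that var(t2) / var(x2) = alpha2.
    x2 = t2 + np.sqrt((1 - alpha2) / alpha2) * rng.standard_normal(n)
    return x1, x2, y


def slopes(x1, x2, y, reliabilities=(1.0, 1.0)):
    """Slopes of y on (x1, x2) after removing assumed measurement-error variance."""
    cov = np.cov(np.vstack([x1, x2, y]))
    s_xx, s_xy = cov[:2, :2].copy(), cov[:2, 2]
    for j, rel in enumerate(reliabilities):
        s_xx[j, j] *= rel  # keep only the reliable share of each predictor's variance
    return np.linalg.solve(s_xx, s_xy)


if __name__ == "__main__":
    x1, x2, y = simulate_null(n=200_000)
    print("ordinary regression slopes:", slopes(x1, x2, y))
    print("error-corrected slopes:    ", slopes(x1, x2, y, reliabilities=(1.0, 0.4)))
```

With reliabilities of `(1.0, 1.0)` the function reduces to ordinary least squares, which assigns a clearly nonzero slope to the focal predictor in these null data; supplying the covariate's assumed reliability moves that slope back toward zero.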


Fig 11. Incremental validity in multiple regression vs. SEM. The SEM results are from a simulation using 300,000 iterations. The multiple regression results are computed analytically. The SEM line in the left panel is a smoothed curve derived from fitting a generalized additive model with a binomial response to the simulation results tracking whether the null hypothesis was rejected. In the right panel, the SEM line and shaded region are based on first applying rolling medians of width 101 to the simulated regression coefficients and standard errors (to reduce the distorting influence of extreme outlying parameter estimates occurring particularly at low reliability values), and then fitting a generalized additive model to these rolling medians. SEM = Structural Equation Model. https://doi.org/10.1371/journal.pone.0152719.g011

To find out, we conducted a simulation. We generated random data according to a structural equation model identical in structure to the model shown in Fig 9. The reliability of the focal predictor of interest was always kept at 1.0, while the assumed reliability of the covariate was set to either perfect ($\alpha = 1$; equivalent to multiple regression), high ($\alpha = .8$; a typical reliability for an aggregate of multiple items), or low ($\alpha = .4$; a typical reliability for a single item). (Adding measurement error to the focal predictor would, of course, simply diminish the statistical power even further and lead the required sample sizes to be even larger.) We assumed a relatively large indirect effect of the focal predictor via the covariate, with $\delta = \rho_{2.1} = .7$. We reasoned that, in the real world, the situations where it occurs to the researcher that it might be important to control for a particular covariate are precisely those in which the covariate has a large indirect effect, so that large indirect effects are probably common in much actual research.
We varied the size of the partial correlation between the focal predictor and the outcome from $\rho_{1.2} = 0$ (to verify that the SEM can keep the Type I error rate at approximately the nominal alpha level of 5%) to $\rho_{1.2} = .3$, in increments of 0.1. For each parameter combination we ran the simulation 30,000 times, each time drawing sample sizes from a distribution uniform on the log scale from n = 50 to n = 5000. The results of the simulation are shown in Fig 12. Panel A, in which the covariate is perfectly reliable, shows a relatively happy situation: with effect sizes of $\rho_{1.2} = .3$, .2, or .1, achieving 80% power requires sample sizes of n ≈ 80, 200, or 800, respectively. These are equivalent to the power results for multiple regression, and we suspect that most researchers’ intuitions about statistical power are calibrated to a situation similar to this one. We also see that the Type I error rate is maintained at 5%. In panel B, where we now introduce just a relatively small amount of measurement error to the covariate, we can see that the required sample sizes increase substantially: with effect sizes of $\rho_{1.2} = .3$, .2, or .1, achieving 80% power now requires sample sizes of n ≈ 180, 400, or 1600, respectively. The Type I error rates are still maintained at 5%. Finally, in panel C, where the covariate is measured with a substantial amount of error (as is likely typical with single indicator covariates widely used across many fields), the required sample sizes are now very large. With effect sizes of $\rho_{1.2} = .3$ or .2, achieving 80% power requires sample sizes of n ≈ 1200 or 2300, respectively. The required sample size for $\rho_{1.2} = .1$ is too large to even show within the plot margins, but it seems to be well into the tens of thousands. We also see that, in the low reliability case, there is some slight elevation of the Type I error rate, although it does not appear to go much beyond 10%.


Fig 12. Power to detect incremental validity using SEM. The lines in each panel are smoothed curves derived from fitting generalized additive models with a binomial response to the simulation results. SEM = Structural Equation Model. https://doi.org/10.1371/journal.pone.0152719.g012

For comparison, we also conducted a simulation involving a small indirect effect size of $\delta = \rho_{2.1} = .3$. In this simulation, the required sample sizes to achieve 80% power with direct effect sizes of $\rho_{1.2} = .3$, .2, or .1 were n ≈ 80, 200, or 800, respectively, for both perfect reliability and high reliability. For low reliability, the required sample sizes were n ≈ 110, 300, or 1000, respectively. We note that such estimates are probably much too optimistic for most real-world situations, as it is rare for a single predictor to exert nearly all of its influence on the outcome via the direct path, and independently of other possible covariates.
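As a rough cross-check on the perfect-reliability sample sizes quoted above (n ≈ 80, 200, or 800), the required n for a given partial correlation can be approximated with the Fisher z transformation. This is a textbook approximation offered as a sketch, not the simulation procedure used for Fig 12, and it says nothing about the unreliable-covariate cases, which depend on the SEM estimator. The function name and defaults are hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(partial_r, power=0.80, level=0.05, n_covariates=1):
    """Approximate n needed to detect a partial correlation (two-sided test)."""
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - level / 2) + normal.inv_cdf(power)
    # The Fisher z statistic has variance of about 1 / (n - 3 - n_covariates).
    return ceil((z_total / atanh(partial_r)) ** 2 + 3 + n_covariates)


if __name__ == "__main__":
    for r in (0.3, 0.2, 0.1):
        print("partial r =", r, "-> n of about", required_n(r))
```

The approximation lands in the same neighbourhood as the rounded values reported for the perfectly reliable covariate, which is consistent with the statement that those results match ordinary multiple regression power.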

Discussion To most social scientists, observed variables are essentially just stand-ins for theoretical constructs of interest. The former are only useful to the extent that they accurately measure the latter. Accordingly, it may seem natural to assume that any statistical inferences one can draw at the observed variable level automatically generalize to the latent construct level as well. The present results demonstrate that, for a very common class of incremental validity arguments, such a strategy runs a high risk of failure. The scope of the problem is considerable: literally hundreds of thousands of studies spanning numerous fields of science have historically relied on measurement-level incremental validity arguments to support strong conclusions about the relationships between theoretical constructs. The present findings inform and contribute to this literature—and to the general practice of “controlling for” potential confounds using multiple regression—in a number of ways. First, we show that the traditional approach of using multiple regression to support incremental validity claims is associated with extremely high false positive rates under realistic parameter regimes. Researchers relying on such arguments will thus often conclude that one construct contributes incrementally to an outcome, or that two constructs are theoretically distinct, even when no such conclusion is warranted. Of course, this general problem is not novel, and has been discussed in a number of literatures [8–10]—most extensively, under the heading of “residual confounding” in epidemiology [3,5]. However, previous treatments have typically focused on circumscribed aspects of the problem or applications to specific domains. Here we have introduced a general formal framework that can be easily used to derive expected false positive rates for any combination of reliabilities and effect sizes, expressed in terms of either simple or partial correlations. 
As a complement, we also provide a web application that enables researchers to obtain these quantities using a simple point-and-click interface (http://jakewestfall.org/ivy/). Application of our framework to a wide range of realistic scenarios demonstrates that key parameters interact with one another in complex, and sometimes counterintuitive, ways. For example, we find that false positive rates typically increase with sample size, and typically peak when reliability is moderate rather than when it is very low or very high. In general, we find that the probability of spurious inference approaches 100% much more quickly than one might imagine, and under realistic parameter regimes will typically be several times the nominal rate of 5%.

Second, we demonstrate that the problem has a principled solution: inferences about the validity of latent constructs should be supported by latent-variable statistical approaches that can explicitly model measurement unreliability. Researchers in a position to measure constructs using multiple indicators can rely on well-established structural equation modeling techniques to support construct-level inferences. However, we also show that, even when only a single indicator is available, researchers can use an SEM approach to estimate what level of reliability must be assumed in order to support the validity of their inferences ([50] p. 168), [51], ([52] p. 276), which provides important insights into the plausibility and/or boundary conditions of posited relationships. A major strength of the latter approach is that it can be readily applied to existing datasets, thus enabling researchers to re-evaluate previous incremental validity claims with measurement unreliability taken into account.
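The first point above, that false positive rates grow with sample size and are highest at intermediate reliability, can be illustrated with a small Monte Carlo sketch. This is not the simulation code used for the paper. The function name `false_positive_rate`, the default correlations of 0.5, the equal reliability for both measures, and the use of a normal critical value in place of the exact t quantile are all assumptions made for illustration.

```python
import numpy as np


def false_positive_rate(n, reliability, rho12=0.5, rho_y1=0.5,
                        n_sims=400, seed=0):
    """Share of simulated studies in which X2 is 'significant' in an OLS
    regression of Y on X1 and X2, although construct T2 has no direct
    effect on Y. All default values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    z_crit = 1.96  # normal approximation to the two-sided 5% critical value
    rejections = 0
    for _ in range(n_sims):
        # Latent constructs T1 and T2 with correlation rho12.
        t1 = rng.standard_normal(n)
        t2 = rho12 * t1 + np.sqrt(1 - rho12**2) * rng.standard_normal(n)
        # The outcome depends on T1 only; T2 adds nothing beyond T1.
        y = rho_y1 * t1 + np.sqrt(1 - rho_y1**2) * rng.standard_normal(n)
        # Observed scores: unit variance, with the given reliability.
        x1 = (np.sqrt(reliability) * t1
              + np.sqrt(1 - reliability) * rng.standard_normal(n))
        x2 = (np.sqrt(reliability) * t2
              + np.sqrt(1 - reliability) * rng.standard_normal(n))
        # OLS of Y on an intercept, X1 and X2, then a test of the X2 slope.
        X = np.column_stack([np.ones(n), x1, x2])
        xtx_inv = np.linalg.inv(X.T @ X)
        beta = xtx_inv @ X.T @ y
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 3)
        se_x2 = np.sqrt(sigma2 * xtx_inv[2, 2])
        if abs(beta[2] / se_x2) > z_crit:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for n in (50, 200, 1000):
        rates = [false_positive_rate(n, rel) for rel in (0.2, 0.4, 0.6, 0.8, 1.0)]
        print(n, [round(r, 2) for r in rates])
```

In this setup the population coefficient on X2 is nonzero whenever X1 is measured with reliability below 1, so a larger sample only makes the spurious effect easier to detect. With perfect reliability the rejection rate should stay near the nominal 5%.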
Lastly, we address an important question that, to our knowledge, has not been previously investigated in the literature: what kind of sample sizes are required to achieve adequate statistical power to detect incremental contributions at the latent variable level? While the answer will necessarily vary across contexts, we show that, under realistic conditions likely to apply fairly widely, statistical power to establish incremental validity at the construct level is often shockingly low. In particular, when the unique contribution of the construct of interest is relatively small, a study can easily require tens of thousands of participants to establish that construct’s incremental validity. Even when the effect is moderate to large, achieving adequate power in the presence of moderately unreliable covariates will often require hundreds of participants. Moreover, our analyses focused only on the case where a single covariate is included in the model. The inclusion of additional imperfectly measured covariates—as is common in real-world analyses—will generally make detection of incremental validity even more difficult.
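One rough way to see why power drops at the latent level is to compare the sampling variability of an uncorrected regression coefficient with that of a coefficient corrected for unreliability. The sketch below uses a simple method-of-moments disattenuation with the reliability treated as known. This is a simplified stand-in for SEM, not the analysis reported in the paper, and the function name `compare_estimators` and every parameter value are assumptions chosen for illustration.

```python
import numpy as np


def compare_estimators(n, reliability=0.7, rho12=0.5, b1=0.4, b2=0.1,
                       n_sims=500, seed=0):
    """Simulate studies in which T2 has a true direct effect b2 and return
    (mean, sd) across studies for two estimates of that effect: the naive
    standardized coefficient on X2, and a version disattenuated with the
    reliability, which is assumed to be known exactly."""
    rng = np.random.default_rng(seed)
    # Residual SD chosen so that Y has unit variance in the population.
    resid_sd = np.sqrt(1 - (b1**2 + b2**2 + 2 * b1 * b2 * rho12))
    naive, corrected = [], []
    for _ in range(n_sims):
        t1 = rng.standard_normal(n)
        t2 = rho12 * t1 + np.sqrt(1 - rho12**2) * rng.standard_normal(n)
        y = b1 * t1 + b2 * t2 + resid_sd * rng.standard_normal(n)
        x1 = (np.sqrt(reliability) * t1
              + np.sqrt(1 - reliability) * rng.standard_normal(n))
        x2 = (np.sqrt(reliability) * t2
              + np.sqrt(1 - reliability) * rng.standard_normal(n))
        r = np.corrcoef([x1, x2, y])
        r12, r1y, r2y = r[0, 1], r[0, 2], r[1, 2]
        # Naive standardized coefficient on X2 from observed correlations.
        naive.append((r2y - r1y * r12) / (1 - r12**2))
        # Disattenuated correlations among the constructs and the outcome.
        # Note: in small samples c12 can approach 1 and the ratio becomes unstable.
        c12 = r12 / reliability
        c1y = r1y / np.sqrt(reliability)
        c2y = r2y / np.sqrt(reliability)
        corrected.append((c2y - c1y * c12) / (1 - c12**2))
    return {
        "naive": (float(np.mean(naive)), float(np.std(naive))),
        "corrected": (float(np.mean(corrected)), float(np.std(corrected))),
    }


if __name__ == "__main__":
    for name, (mean, sd) in compare_estimators(500).items():
        print(f"{name}: mean={mean:.3f}, sd={sd:.3f}")
```

Under these assumptions the corrected estimate is centered near the true value of `b2` but varies more from study to study than the naive one, which is one reason a calibrated construct-level test needs a larger sample to reach the same power.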

Conclusion

Taken as a whole, our results demonstrate that drawing construct-level inferences about incremental validity is considerably more difficult than most researchers recognize. We do not think it is alarmist to suggest that many, and perhaps most, incremental validity claims put forward in the social sciences to date have not been adequately supported by empirical evidence, and run a high risk of spuriousness. By this we do not mean to suggest that such claims are wrong, but simply that the modal analytical strategy of controlling for one or more covariates in a multiple regression cannot provide adequate evidence for a construct-level incremental validity claim under realistic conditions where variables are measured unreliably. Our hope is that greater appreciation of the inferential dangers of confusing measures with constructs [21] will lead researchers to adopt statistical approaches like SEM that provide appropriately calibrated evidence for incremental validity claims.

Supporting Information

S1 Appendix. Derivation of statistical properties of incremental validity. We derive the probabilities of rejecting different combinations of regression coefficients as a function of (1) the simple or partial correlations among the outcome and the latent predictors, (2) the reliabilities, and (3) the https://doi.org/10.1371/journal.pone.0152719.s001 (DOCX)

Author Contributions

Conceived and designed the experiments: JW TY. Performed the experiments: JW TY. Analyzed the data: JW TY. Contributed reagents/materials/analysis tools: JW TY. Wrote the paper: JW TY.