Two Warnings

Drum begins by admitting that there is a Black-White IQ gap and that nobody thinks otherwise. This is the case: for as long as it has been recorded, there has been a roughly one Cohen’s d Black-White IQ gap. The following graph from this preprint makes this clear:

[Figure 1: the Black-White IQ gap over time, roughly one Cohen’s d throughout]

Drum then mentions that the gap is not the result of test bias or test construction and that the gap really does exist. This is an extremely important point which, unbeknownst to him, invalidates much of the rest of his argument. To even make this statement, we have to show that the construct the tests assess (which cannot be directly observed) is measurement invariant, meaning that the test assesses the same construct in both groups and that it does so equally well. Mellenbergh (1989) stated this condition as that in which

Eq 1:  f(Y | η, s) = f(Y | η)

where Y and η are the observed and factor scores and s is the selection variable for group membership. In other words, given some level of the factor (or latent ability) score η, the level of the observed score Y is not affected by one’s group s: an IQ of 70 is an IQ of 70 in both Blacks and Whites, and so on. Another way to state this is in terms of a regression equation whereby

Eq 2:  Y = v + λη + ε          (initial)
       Y = v_s + λ_s η + ε_s   (group-specific)

Note that the ε for the initial and group-specific equations is not set to be equivalent because a score may be unbiased even with residual differences. When the residuals are equivalent, the error in measuring the factor scores or latent variables is the same in each group, and a formulation of measurement invariance called strict factorial invariance is achieved. The theoretical implications of strict factorial invariance are substantial and will be touched on below. For now, note that Eq 2 implies that unbiased scores are a regression of observed on factor scores in which the slopes (regression weights, or factor loadings) and intercepts (levels) are equal in both groups. If the slopes of the regression of observed on latent scores are unequal, we encounter nonuniform bias: different levels of the observed scores equate to different levels, and perhaps types, of the latent abilities by group, so the IQ scores cannot be compared and the tests cannot be assumed to mean the same thing in each group. Intercept bias is a form of uniform bias in which the level alone cannot be interpreted the same way by group, so a score of, say, 70 in one group may equal 70 in the latent ability whereas in another it may be equivalent to 80. It is possible to have both forms of bias simultaneously, but, as Arthur Jensen noted in his 1980 Bias in Mental Testing, it is almost never the case that natives of the same country, regardless of race, have noninvariant (biased) test parameters. The difference in the means is just that: a difference in the means. Visually oriented individuals can play around with this shiny applet in order to better understand test bias.
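To make uniform (intercept) bias concrete, here is a minimal sketch with purely illustrative numbers: the loading is invariant, but a shifted intercept in one group means the same observed score maps back to different latent levels.

```python
# Illustrative parameters, not estimates from any real test.
lam = 1.0    # factor loading (slope), invariant across groups
nu_a = 0.0   # intercept in group A
nu_b = 10.0  # intercept in group B: uniform (intercept) bias

# Inverting Y = nu + lam * eta: the same observed score of 70 maps back
# to different latent ability levels in the two groups.
eta_a = (70 - nu_a) / lam
eta_b = (70 - nu_b) / lam
print(eta_a, eta_b)  # 70.0 60.0
```

With invariant intercepts (nu_a == nu_b) the two recovered latent levels coincide, which is exactly the unbiasedness condition of Eq 1.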

A deeper theoretical understanding of test bias than this is necessary to fully comprehend why exactly this concept disqualifies so much of what Drum believes. As implied above, the factor model is a linear regression model relating observed scores on a given set of items or subscales to a smaller number of factors or latent variables, which are theoretical constructs constituted by the shared variance of a set of test items. If we have i = 1, …, I observed scores, Y, measuring l = 1, …, L factors and we suppose further that the total sample consists of j = 1, …, J subjects belonging to one of s = 1, …, S groups with I = 10, L = 2, J = 500, and S = 2, we have a test with 10 items/subscales, measuring two factors (latent variables), in a sample of 500 test takers split into two groups. The within-group model for person j on item i is

Eq 3:  y_ij = v_i + Σ_l λ_il η_jl + ε_ij,   l = 1, …, L
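Before unpacking the notation, note that this model is easy to simulate directly. A minimal numpy sketch with arbitrary parameter values, using the dimensions given above (I = 10 items, L = 2 factors, J = 500 subjects):

```python
import numpy as np

rng = np.random.default_rng(0)

I, L, J = 10, 2, 500                      # items, factors, subjects (as in the text)
nu = rng.uniform(1.0, 3.0, size=I)        # intercepts v_i
Lam = rng.uniform(0.4, 0.9, size=(I, L))  # factor loadings lambda_il
eta = rng.normal(0.0, 1.0, size=(J, L))   # latent scores eta_jl
eps = rng.normal(0.0, 0.5, size=(J, I))   # residuals eps_ij

# y_ij = v_i + sum_l lambda_il * eta_jl + eps_ij, vectorized over subjects
Y = nu + eta @ Lam.T + eps
print(Y.shape)
```

Each row of Y is one subject's vector of item scores; averaging the rows gives the group's observed item means discussed next.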

The observed score y is the sum of the regression intercept v, the scores on the factors η multiplied by the corresponding slopes (again, factor loadings) λ, and a residual ε. The intercepts and factor loadings in this model are the same for everyone but may differ across items, which is why they carry the item subscript i rather than the subject subscript j. The factor score (henceforth latent score) η is subject-specific and accordingly has the subscript j, with the subscript l indicating which of the L factors is being referred to. A latent score (for, say, mathematics or visuospatial ability) does not vary across items. The residual includes error specific to items and individuals and thus has the subscripts ij. The multiple group model is fitted to the means and covariances of the observed items rather than the raw scores; to obtain the mean of some item Yᵢ within some group, scores are averaged across the group’s subjects. With the residual mean assumed to be zero, the mean of item Yᵢ is

Eq 4:  E(Y_i) = v_i + Σ_l λ_il ξ_l

with ξ being the expected value of the factor score. It is more common to denote observed mean item scores μ and factor means α. Hence,

Eq 5:  μ_i = v_i + Σ_l λ_il α_l

and for I items/subscales we have the I equations

Eq 6:  μ_1 = v_1 + λ_11 α_1 + … + λ_1L α_L
       ⋮
       μ_I = v_I + λ_I1 α_1 + … + λ_IL α_L

described in matrix notation as

Eq 7:  μ_s = v_s + Λ_s α_s

where the subscript s indexes groups, s = 1, …, S; μ and v have dimensions I × 1, Λ is I × L, and α is L × 1. Assuming residuals which are uncorrelated with the latent abilities and with each other, and intercepts which are constant and have zero covariance with the factors and residuals, the covariances of the observed scores Y equal the latent covariances pre- and post-multiplied by the corresponding factor loadings (slopes), plus the residual variances. Writing Σ for the covariance matrix of the items, Ψ for the covariance matrix of the latent ability scores, and Θ for the residual variances, the equation for the variances and covariances in an arbitrary group s is

Eq 8:  Σ_s = Λ_s Ψ_s Λ_s′ + Θ_s

The equations for the means and covariances are both derived from Eq 3: Eq 7 describes the means of the item scores in terms of the latent variable means, whereas Eq 8 describes the covariances of the items in terms of the covariances of the latent abilities. The factor loadings Λ are identical in the models for the means (Eq 7) and the covariances (Eq 8). A multiple group model comprises both equations, modeling the regression intercepts, factor loadings, means, covariances, and residual variances in order to impose a specific structure on the means and covariances of the observed scores, and each parameter can be restricted to take the same value in each group (to be invariant) or to take a specific value. By comparing the fits of models with greater or fewer restrictions, we can assess whether those restrictions are tenable and the models fit comparably, meaning that the mean differences we observe are due to differences in the same constructs. When such a model is fitted to the means and covariances of several groups, the means (between-group differences) and covariances (within-group differences) can be compared in a single analysis.
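The mean and covariance structure can be verified numerically. The sketch below (arbitrary parameter values, a single group) simulates a large sample from the factor model and checks that the sample means approach v + Λα and the sample covariances approach ΛΨΛ′ + Θ:

```python
import numpy as np

rng = np.random.default_rng(1)

I, L, J = 10, 2, 200_000
nu = rng.uniform(1.0, 3.0, size=I)        # intercepts
Lam = rng.uniform(0.4, 0.9, size=(I, L))  # loadings
alpha = np.array([0.5, -0.3])             # factor means (arbitrary)
Psi = np.array([[1.0, 0.3],
                [0.3, 1.0]])              # latent covariance matrix
theta = np.full(I, 0.25)                  # residual variances

eta = rng.multivariate_normal(alpha, Psi, size=J)
eps = rng.normal(0.0, np.sqrt(theta), size=(J, I))
Y = nu + eta @ Lam.T + eps

mu_implied = nu + Lam @ alpha                        # mean structure (Eq 7)
Sigma_implied = Lam @ Psi @ Lam.T + np.diag(theta)   # covariance structure (Eq 8)

# Largest absolute deviation between sample and model-implied moments
print(np.abs(Y.mean(axis=0) - mu_implied).max())
print(np.abs(np.cov(Y.T) - Sigma_implied).max())
```

Both deviations shrink toward zero as J grows, which is what it means for the model to impose a structure on the observed means and covariances.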

Measurement invariance (the state in which items measure the same things in both groups equally well, i.e., unbiasedness) is the statement that, for some level of a given ability, the probability of answering a given question correctly does not depend on one’s group membership. There are a few steps, each nested within the preceding one and all clearly implied above, which are used to assess whether this condition is met. The first step is configural invariance, in which a model with the same pattern of indicator variables and of fixed and estimated parameters fits well in both groups; the coefficients may differ in this model, but the price is that this level does not let us interpret group mean differences. The next step is weak or metric invariance, in which the factor loadings are constrained to equality across groups. This step only allows us to state that the slopes of the observed on the latent scores are the same by group and thus that the sampled populations attribute the same meanings to the latent constructs being studied. At this stage we may also compare the latent variances and covariances, but still not the means. The next step, strong or scalar invariance, adds a constraint on the intercepts whilst allowing the means to differ by group, finally allowing comparisons of the group means in addition to the variances and covariances. The next level, strict factorial invariance (SFI), adds that, for a given indicator, the error term is constrained to be equal across groups.
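The nesting of these steps can be summarized schematically. The sketch below is only a mnemonic, listing which parameters each level constrains to equality across groups (the level names are standard):

```python
# Each invariance level constrains these parameters equal across groups,
# on top of requiring the same model configuration (the configural level).
invariance_levels = {
    "configural": [],
    "metric": ["loadings"],
    "scalar": ["loadings", "intercepts"],
    "strict": ["loadings", "intercepts", "residual variances"],
}

# The levels are nested: each constraint set contains the previous one.
names = list(invariance_levels)
for prev, nxt in zip(names, names[1:]):
    assert set(invariance_levels[prev]) <= set(invariance_levels[nxt])
print("nested:", " -> ".join(names))
```

Because the models are nested, each level can be tested against the previous one by comparing model fit, as described above.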
This level is not necessary to make group comparisons, but if it holds, it follows that the scale for the construct being measured is equally reliable in both groups, that there is also measurement invariance with respect to unmeasured variables, that the residual variances do not mask differences in residual means, and, most importantly, that the sources of between-group variation in the constructs being measured are a subset of the within-group sources of variation. (An additional level that is rarely assessed, homogeneity of latent variances, constrains the variances of the latent variables to be equal in both groups, assessing whether the groups used the same ranges of the latent constructs in question; this level is necessary to obviate any issues with predictive bias by group generated by specificity and predictive value being higher or lower due to group-specific range restriction.) The measurement invariant model (MI with SFI) is represented as

Eq 9:  y_js = v + Λ η_js + ε_js

Eq 10:  μ_s = v + Λ α_s

Eq 11:  Σ_s = Λ Ψ_s Λ′ + Θ
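One consequence of the MI model worth making explicit: with shared intercepts and loadings, the observed mean gap between two groups reduces to Λ(α₁ − α₂), i.e., it is carried entirely by the factor means. A small numpy sketch with made-up numbers:

```python
import numpy as np

# Hypothetical invariant parameters (no group subscript)
Lam = np.array([[0.8, 0.1],
                [0.7, 0.2],
                [0.2, 0.9],
                [0.1, 0.8]])
nu = np.array([1.0, 1.2, 0.9, 1.1])

alpha_1 = np.array([0.0, 0.0])    # factor means, group 1
alpha_2 = np.array([-0.5, -0.3])  # factor means, group 2 (invented gap)

mu_1 = nu + Lam @ alpha_1         # group 1 observed means
mu_2 = nu + Lam @ alpha_2         # group 2 observed means

# The intercepts cancel: the observed gap is Lam @ (alpha_1 - alpha_2),
# so it lies entirely in the latent means.
print(mu_1 - mu_2)
```

Any observed mean-difference pattern that cannot be written this way, with the same Λ that fits the covariances, would cause the MI model to be rejected.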

In contrast to equations 7 and 8, equations 9–11 allow only the factor means and covariances to differ: the intercepts, loadings, and residual variances carry no group subscript. This absence of group effects is MI. As mentioned above, with MI, the factors affecting the tests are the same in both groups and the background factors can be interpreted in the same way. As such, when MI holds we can interpret the effects of variables such as socioeconomic status, education, and discrimination the same way in both groups. Additionally, things like stereotype threat, unequal Flynn effects, race-related anxiety and nervousness from having an other-race invigilator, racial discrimination, or anything else that affects one group’s test performance but not the other’s will generate measurement non-invariance. When MI holds, there is no bias. Things like unequal access to knowledge are then related to the level of ability, not group membership: if a test score depends on knowing some factoid like vocabulary words or mathematical equations, a group difference in that score is not due to unequal access to the words or equations as a result of group membership, but due to the groups having different levels of ability which predispose different levels of learning; if access rather than ability were responsible, it would produce non-invariance in the scores or cause gains loading on different latent abilities than in the initial arrangement. Another way to state this is that, when SFI holds, the groups compared cannot be thought of as two sets of identical seeds raised in pots of differing quality (bookish readers will recognize this as a reference to the thought experiment known as Lewontin’s Seeds, or X-Factors). Lubke et al. write:

Suppose observed mean differences between groups are due to entirely different factors than those that account for the individual differences within a group. The notion of ‘‘different factors’’ as opposed to ‘‘same factors’’ implies that the relation of observed variables and underlying factors is different in the model for the means as compared with the model for the covariances, that is, the pattern of factor loadings is different for the two parts of the model. If the loadings were the same, the factors would have the same interpretation. In terms of the multigroup model, different loadings imply that the matrix Λ in [Eq 10] differs from the matrix Λ in [Eq 11] (or [Eqs 7 and 8]). However, this is not the case in the MI model. Mean differences are modeled with the same loadings as the covariances. Hence, this model is inconsistent with a situation in which between-group differences are due to entirely different factors than within-group differences. In practice, the MI model would not be expected to fit because the observed mean differences cannot be reproduced by the product of α and the matrix of loadings, which are used to model the observed covariances. Consider a variation of the widely cited thought experiment provided by Lewontin (1974), in which between-group differences are in fact due to entirely different factors than individual differences within a group. The experiment is set up as follows. Seeds that vary with respect to the genetic make-up responsible for plant growth are randomly divided into two parts. Hence, there are no mean differences with respect to the genetic quality between the two parts, but there are individual differences within each part. One part is then sown in soil of high quality, whereas the other seeds are grown under poor conditions. Differences in growth are measured with variables such as height, weight, etc. 
Differences between groups in these variables are due to soil quality, while within-group differences are due to differences in genes. If an MI model were fitted to data from such an experiment, it would be very likely rejected for the following reason. Consider between-group differences first. The outcome variables (e.g., height and weight of the plants, etc.) are related in a specific way to the soil quality, which causes the mean differences between the two parts. Say that soil quality is especially important for the height of the plant. In the model, this would correspond to a high factor loading. Now consider the within-group differences. The relation of the same outcome variables to an underlying genetic factor are very likely to be different. For instance, the genetic variation within each of the two parts may be especially pronounced with respect to weight-related genes, causing weight to be the observed variable that is most strongly related to the underlying factor. The point is that a soil quality factor would have different factor loadings than a genetic factor, which means that [Eqs 10 and 11] cannot hold simultaneously. The MI model would be rejected. In the second scenario, the within-factors are a subset of the between-factors. For instance, a verbal test is taken in two groups from neighborhoods that differ with respect to SES. Suppose further that the observed mean differences are partially due to differences in SES. Within groups, SES does not play a role since each of the groups is homogeneous with respect to SES. Hence, in the model for the covariances, we have only a single factor, which is interpreted in terms of verbal ability. To explain the between-group differences, we would need two factors, verbal ability and SES. This is inconsistent with the MI model because, again, in that model the matrix of factor loadings has to be the same for the mean and the covariance model. 
This excludes a situation in which loadings are zero in the covariance model and nonzero in the mean model. As a last example, consider the opposite case where the between-factors are a subset of the within-factors. For instance, an IQ test measuring three factors is administered in two groups and the groups differ only with respect to two of the factors. As mentioned above, this case is consistent with the MI model. The covariances within each group result in a three-factor model. As a consequence of fitting a three-factor model, the vector with factor means, α in [Eq 10], contains three elements. However, only the two elements corresponding to the factors with mean group differences are nonzero. The remaining element is zero. In practice, the hypothesis that an element of α is zero can be investigated by inspecting the associated standard error or by a likelihood ratio test. In summary, the MI model is a suitable tool to investigate whether within- and between-group differences are due to the same factors. The model is likely to be rejected if the two types of differences are due to entirely different factors or if there are additional factors affecting between-group differences. Testing the hypothesis that only some of the within factors explain all between differences is straightforward. Tenability of the MI model provides evidence that measurement bias is absent and that, consequently, within- and between-group differences are due to factors with the same conceptual interpretation.
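The seeds scenario can be caricatured numerically. In the sketch below (all loadings invented), between-group mean differences that follow a "soil" loading pattern cannot be reproduced by any latent mean difference on the within-group "genes" factor, whereas differences following the within-group pattern can:

```python
import numpy as np

# Invented loading patterns for four plant-growth indicators.
lam_within = np.array([0.3, 0.4, 0.9, 0.8])  # "genes" factor: weight-heavy
lam_soil = np.array([0.9, 0.8, 0.2, 0.3])    # "soil" factor: height-heavy

# Case 1 (Lewontin's seeds): between-group mean differences follow the soil
# pattern. Try to reproduce them as lam_within * delta_alpha by least squares.
delta_mu = lam_soil * 1.0
_, res, _, _ = np.linalg.lstsq(lam_within[:, None], delta_mu, rcond=None)

# Case 2 (consistent with MI): between differences follow the within factor.
delta_mu_mi = lam_within * 1.0
_, res_mi, _, _ = np.linalg.lstsq(lam_within[:, None], delta_mu_mi, rcond=None)

print(res)     # large residual misfit: the MI model would be rejected
print(res_mi)  # (near) zero misfit: consistent with MI
```

The least-squares residual plays the role of model misfit here: when the between-group loading pattern differs from the within-group one, no scalar latent mean difference can absorb the observed gap, which is Lubke et al.'s point.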

Just as in 1980, practically all modern assessments of MI find invariance with respect to race/ethnicity for natives within the same country, including, notably, two recent studies which assessed invariance with respect to latent variances (here and here). With this background out of the way, readers can better understand some of Drum’s errors. The following is a point-by-point response to Drum’s reasons to believe the Black-White IQ gap is environmental.