As a newcomer to cross-cultural research a few years ago, I soon became aware of the term “measurement invariance,” which is typically presented as a necessary condition for using a psychological measurement instrument, such as a personality inventory, in more than one cultural context[1]. At one of the first talks where I presented some then-new data gathered in 20 different countries, using a new instrument developed in our lab (the Riverside Situational Q-sort), a member of the audience asked, “What did you do to assess measurement invariance?” I had no real answer, and my questioner shook his head sadly.

Which, I started to realize, is kind of the generic response when these issues come up. If a researcher gathers data in multiple cultures and doesn’t assess measurement invariance, then the researcher earns scorn – from certain kinds of critics – for ignoring the issue. If the researcher does do the conventional kinds of analyses recommended to assess measurement invariance, the results are often discouraging. The RMSEAs are out of whack, the delta CFIs are bigger than .01, and oh my goodness, the item intercepts are not even close to equivalent, so scalar invariance is a total joke – not to mention the forlorn hope of attaining “strict” invariance (which sounds harsh, because it is). A participant at a symposium I recently attended exclaimed, “If you can show me some real data where strict measurement invariance was achieved across cultures, I shall buy you a beer!” He had no takers. The following message is approaching the status of conventional wisdom: the lack of equivalence in the properties of psychological measures across cultures means that they cannot be used for cross-cultural comparison, and attempts to do so are not just psychometrically ignorant but fatally flawed.

As I have become a bit more experienced, however, I have begun to develop some misgivings about this conclusion, and about the whole business of “measurement invariance,” which I put in scare quotes because I suspect there is less there than meets the eye. Below, I shall refer to it simply as MI.

The assessment of MI uses complex methods that (it appears) few researchers really understand, and it often rests on dichotomous evaluative decisions based on seemingly arbitrary benchmarks. I’m one of those researchers who doesn’t fully understand conventional MI analyses, and I’ll go out on a limb to confess this only because I strongly suspect I’m not the only one. Indeed, I’m starting to recover from the imposter syndrome I used to suffer around people nodding sagely as they talk about factorial invariance, RMSEA, delta CFI, and equivalence of intercepts. And I have flashbacks to the whole debate about the arbitrariness of the .05 p-level for evaluating research results when I hear experts propound a .01 benchmark for the maximum permissible delta CFI. Where, exactly, did that come from? Does anybody really know? The only answer concerning the origin of this or other benchmarks is that some authoritative figure (or an institution such as the Educational Testing Service) published an article recommending it. But the basis of the recommendation generally remains obscure, and the (I suspect few) researchers who actually go and read said authoritative article will not necessarily be enlightened. But they will obey.
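For concreteness, here is roughly what the conventional ladder of MI tests looks like from the researcher’s side. This is only a minimal sketch, using the lavaan package in R; the one-factor model, the item names h1–h4, the data frame df, and the country grouping variable are all hypothetical placeholders, not anyone’s actual analysis.

```r
# A minimal sketch of the conventional MI "ritual" in lavaan.
# The model, items, and data are hypothetical.
library(lavaan)

model <- 'happiness =~ h1 + h2 + h3 + h4'

# Configural: same factor structure in every country, parameters free.
# This level suffices for interpreting within-country correlations.
fit_configural <- cfa(model, data = df, group = "country")

# Metric ("weak"): factor loadings constrained equal across countries.
fit_metric <- cfa(model, data = df, group = "country",
                  group.equal = "loadings")

# Scalar ("strong"): loadings and item intercepts constrained equal.
# Conventionally required before comparing means across countries.
fit_scalar <- cfa(model, data = df, group = "country",
                  group.equal = c("loadings", "intercepts"))

# The verdict usually rests on the change in fit indices between
# steps, e.g., the oft-cited rule that delta CFI should not exceed .01.
fits <- sapply(list(configural = fit_configural,
                    metric     = fit_metric,
                    scalar     = fit_scalar),
               fitMeasures, fit.measures = c("cfi", "rmsea"))
diff(fits["cfi", ])  # a "failure" is declared if CFI drops by more than .01
```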

I suggest that for most, and perhaps nearly all, empirical cross-cultural researchers (by which I mean the ones who actually gather data), the whole process is a black box: they dump their data in one side (such as an R program) and wait for the output on the other side, with fingers crossed. And then, almost always, they get bad news. Discussions of MI often have a prohibitionist tone. It is my impression that a “failure” (that’s the actual word used most often) to achieve MI by conventional criteria is not typically treated as a scientific finding of interest in its own right. Rather, it is treated as a reason – even a “violation” (another often-used word) – implying that one should not take the cross-cultural data seriously or, sometimes, even look at them. I recently saw a paper in which, because MI was not achieved, the authors primly stated that they did not examine their data any further. No wonder they didn’t dare, in the face of a recently published warning that “widespread hidden invalidity in the measures we use… pose[s] a threat to many research findings.”

Such a prohibitionist tone goes too far. First, the amount of non-invariance required to actually throw substantive results into question is far from clear and, as noted above, is often evaluated on the basis of mysterious and seemingly arbitrary benchmarks. Second, the implications of a “failure” of MI depend on the kind of MI one decides to insist on. Do you want to interpret correlations among measures within countries? Then configural MI is enough, and it is indeed often found (e.g., Aluja et al., 2019). Do you want to interpret mean differences between countries? Well, then maybe (but not necessarily; see below) you do need scalar or “strict” invariance, which is a very high bar seldom attained.

The more balanced treatments of failures of scalar invariance may say something like “all questionnaires showed some noninvariance across countries, indicating that caution needs to be exercised when investigating and interpreting mean differences.” I appreciate the careful, moderate tone of this quote, but still: what is this finely worded advice supposed to mean? That otherwise you can throw caution to the winds? I don’t think so. Assuming it means anything at all (and I’m not sure it does), I think it means that if you don’t have strict MI, your mean differences don’t mean anything. So you are prohibited from looking at them – an attitude that strikes me as, how shall I put this, anti-scientific (see my final point, below).

The repeated disappointments in MI appear at odds with conclusions emerging elsewhere in cross-cultural psychology. An emerging theme, and a real surprise I think, is that cross-cultural differences in psychological attributes and processes are turning out to be smaller than was expected when this field of research really got going a couple of decades ago. The touted fundamental differences between East and West not only were almost absurdly simplistic (Asia is a big and diverse place, as is Europe), but also turned out to be smaller and less profound than initially assumed. China contains lots of individualists, Europe and North America have a fair number of collectivists, and while there still might be overall differences, the distributions overlap considerably. Our own international project, with data from 64 countries, is finding that two measures of happiness – one developed in the US and the other, purported to be profoundly different, developed in Japan – yield correlates and other results that are much more similar than different around the world (and two countries in which the two measures behave especially similarly are, wait for it, the US and Japan). Our two studies of situational experience around the world, one with 20 countries and one with 64, both found that individual experiences within countries were more similar to each other than experiences compared across countries, but the difference was surprisingly small and indeed just barely reached statistical significance, even with Ns in the thousands. More generally, the distinguished and pioneering cross-cultural researcher Jüri Allik (2005) has written about how personality variation across countries is (unexpectedly) small compared with variation within countries (see also Hanel, Maio, & Manstead, 2018). In the face of all this, how long can we maintain a conventional wisdom that cultural variation in the basic properties of well-established measurement instruments is typically large, consequential, and maybe even fatal?
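For readers curious how such a within- vs. between-country comparison might be computed, here is a bare-bones sketch in the same hypothetical R style as above. The ratings data frame (a country label in the first column, followed by each participant’s ratings of a common set of items) is invented for illustration; this is not the procedure used in the studies just described, only the general idea.

```r
# Sketch: compare the average similarity of individual response
# profiles within countries to the average similarity between
# countries. `ratings` is hypothetical: column 1 = country, remaining
# columns = each participant's ratings of the same set of items.
profile_cors <- cor(t(as.matrix(ratings[, -1])))  # person-by-person similarity

same_country <- outer(ratings$country, ratings$country, "==")
off_diag     <- !diag(nrow(ratings))              # drop self-correlations

within  <- mean(profile_cors[same_country & off_diag])
between <- mean(profile_cors[!same_country])

within - between  # typically positive, but (the surprise) often quite small
```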

Consider, again, the nature of cross-cultural vs. within-culture variation. One example: perhaps, indeed, the items on the BFI-2 extraversion scale have a different meaning for someone living in Japan than they do for me. But might they not also, to some degree, have a different meaning for my next-door neighbor than they do for me? And can we assume that the former difference in meaning is really all that different from the latter difference? Jüri Allik’s conclusions give reason to doubt. Measurement instruments surely have at least somewhat different properties and implications for different individuals. But I don’t see a strong reason to presuppose that these properties and implications necessarily vary to any importantly consequential degree according to whether the individuals in question reside in the same or different countries. Maybe, sometimes, they do. But the burden of proof seems misplaced, given what we are learning about cultural variation elsewhere.

Conventional assessments of MI are completely internal to the measurement instruments. That is, they assess internal validity as opposed to external validity. They focus on the structure of the latent factors of the instruments, the degree to which this structure is maintained across contexts, and – even more stringently – the intercepts of the items on latent traits or factors.[2] This is all well and good, I suppose, but internal validity is not the same as external validity, and the former is actually not even always necessary for the latter. The classic examples in personality psychology are the MMPI and the CPI (California Psychological Inventory), the scales of which have well-established validity-in-use for predicting important outcomes, but which “fail” many conventional psychometric tests of internal reliability and factorial homogeneity.

A useful future direction, I propose, would be to move away from the almost exclusive focus on the internal properties of our measurement instruments in favor of increased emphasis on external validity. This can and should be done at both the cultural and the individual level. At the cultural level, research could compare country-level averages of a measure with other country-level variables (e.g., Mõttus, Allik, & Realo, 2010). For example, are country-level averages of happiness associated in sensible ways with other variables measured at the country level, such as economic and demographic indicators, or cultural attributes such as religiosity? But lest we fall prey to the ecological fallacy, this kind of research must be complemented by investigations at the individual level, assessing to what degree, and when, a measure’s correlations with other psychological variables are maintained across cultural contexts. For example, does a measure of happiness correlate with other indicators of well-being within some, many, or all countries? This, to me, would be more persuasive evidence for the cross-cultural validity of a measure than even the finest demonstration of configural MI.
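To make the proposal concrete, here is a minimal sketch of both levels of analysis, once more in hypothetical R. The data frame df and its variables – country, a happiness score, another well-being indicator life_sat, and a country-level covariate gdp – are invented placeholders, not a prescription.

```r
# Country level: do national means of happiness track a country-level
# indicator? (gdp is assumed constant within each country.)
country_means <- aggregate(cbind(happiness, gdp) ~ country,
                           data = df, FUN = mean)
cor(country_means$happiness, country_means$gdp)

# Individual level: does happiness correlate with another well-being
# measure *within* each country? Computing the correlation separately
# per country guards against the ecological fallacy.
within_r <- sapply(split(df, df$country),
                   function(d) cor(d$happiness, d$life_sat))
summary(within_r)  # broadly similar correlations across countries would
                   # support the measure's cross-cultural validity-in-use
```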

Even better – and even more difficult – a measure used in more than one country could be compared in its associations with actual behavior, something that, despite psychology’s self-definition as the “science of behavior,” continues to get less attention than it should (Baumeister et al., 2007). One example is the use of “anchoring vignettes,” in which respondents in different cultures report how they would respond to various situations, and then these responses are compared with their personality scores (Mõttus et al., 2012). Another example is a study that assessed differences in sociability between Mexicans and Americans using naturalistic audio recordings as well as self-reports (Ramírez-Esparza et al., 2009)[3]. Research like this may lead to an eventual gold standard for cross-cultural psychology, in which behavioral data, and not just self-reports, are gathered. Doing this will be difficult and expensive, but we must, sooner or later.

The data are the data. This is my most important point. Researchers who go to the considerable trouble of gathering data in more than one country should not be discouraged from doing so, should not be prohibited from analyzing their data in any way they find informative, and certainly should not be disadvantaged compared to researchers who avoid cross-cultural complications by gathering data only at their home campus. Of course, interpretations should be appropriately cautious, but this warning is a truism that applies to all research of any kind. You never really know for sure what the scores on your measures mean; all you can do is try to triangulate them with other data and interpret the patterns – and even the mean differences – that emerge as best you can. This is a worthy endeavor and, indeed, the essence of scientific activity. The issue of “measurement invariance” should not be allowed to inhibit it.

References

Allik, J. (2005). Personality dimensions across cultures. Journal of Personality Disorders, 19, 212-232.

Aluja, A., et al. (2019). Multicultural validation of the Zuckerman-Kuhlman-Aluja Personality Questionnaire Shortened Form (ZKA-PQ/SF) across 18 countries. Assessment. Advance online publication. https://doi.org/10.1177/1073191119831770

Baumeister, R.F., Vohs, K.D., & Funder, D.C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2, 396-403.

Gardiner, G., Sauerberger, K., Members of the International Situations Project, & Funder, D. (2019). Towards meaningful comparisons of personality in large-scale cross-cultural studies. In A. Realo (Ed.), In praise of an inquisitive mind: A Festschrift in honor of Jüri Allik on the occasion of his 70th birthday (pp. 123-139). Tartu: University of Tartu Press.

Hanel, P.H.P., Maio, G.R., & Manstead, A.S.R. (2018). A new way to look at the data: Similarities between groups of people are large and important. Journal of Personality and Social Psychology, 116, 541-562.

Mõttus, R., et al. (2012). Comparability of self-reported conscientiousness across 21 countries. European Journal of Personality, 26, 303-317.

Mõttus, R., Allik, J., & Realo, A. (2010). An attempt to validate national mean scores of Conscientiousness: No necessarily paradoxical findings. Journal of Research in Personality, 44, 630-640.

Plieninger, H. (2017). Mountain or molehill? A simulation study on the impact of response styles. Educational and Psychological Measurement, 77, 32-53.

Ramírez-Esparza, N., Mehl, M.R., Álvarez-Bermúdez, J., & Pennebaker, J.W. (2009). Are Mexicans more or less sociable than Americans? Insights from a naturalistic observation study. Journal of Research in Personality, 43, 1-7.

Acknowledgment

I thank several friends and colleagues for their advice, some of which I took, and some of which I ignored. For their protection I shall maintain their anonymity unless they want to go public via a comment here, on Twitter, or elsewhere.

Footnotes

[1] Other contexts for assessing measurement invariance concern possible changes in the meaning of a measurement instrument across time or for participants of different ages. I am not talking about those applications here.

[2] A new (and simpler) method of assessing the similarity in meaning of measurement instruments across cultures, developed in our lab, is also based entirely on analyses internal to the instrument itself (Gardiner et al., 2019).

[3] I have now run out of examples.