Can personality traits be measured and interpreted reliably across the world? While the use of Big Five personality measures is increasingly common across social sciences, their validity outside of western, educated, industrialized, rich, and democratic (WEIRD) populations is unclear. Adopting a comprehensive psychometric approach to analyze 29 face-to-face surveys from 94,751 respondents in 23 low- and middle-income countries, we show that commonly used personality questions generally fail to measure the intended personality traits and show low validity. These findings contrast with the much higher validity of these measures attained in internet surveys of 198,356 self-selected respondents from the same countries. We discuss how systematic response patterns, enumerator interactions, and low education levels can collectively distort personality measures when assessed in large-scale surveys. Our results highlight the risk of misinterpreting Big Five survey data and provide a warning against naïve interpretations of personality traits without evidence of their validity.

The top panel presents the relationship between income and the different PTs. We estimated the coefficients and their 95% confidence intervals by running the following regression separately for each country: $y_i = \alpha_0 + \beta_0\, cog_i + \sum_{PT=1}^{5} \beta_{PT}\, PT_i + \epsilon_i$, where $y_i$ is the income of person $i$ (transformed into the rank of income and scaled from 0 to 100). For $cog_i$, the full literacy test was used when available, and the partial literacy test was used otherwise (see the Supplementary Materials). The last line pools all observations from all countries, controlling for the best measure of cognitive ability available in each country (country-fixed effects are not needed since relative rankings were calculated by country). All regressors are standardized, so each coefficient can be interpreted as the effect of a 1-SD change in the regressor on the percentile rank of income. The bottom panel is based on similar regressions, but replacing the five PTs by one index averaging the five values (for each observation). It only includes the nine countries with the full literacy test included in the STEP database. The last line pools all observations from the nine countries. The coefficient of cognitive ability (literacy) is significantly higher than the coefficient of the Big Five index (at the 90% level) in four out of nine countries and in the pooled regression.
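For concreteness, the specification above can be sketched in code. This is a minimal illustration with hypothetical inputs: numpy's least-squares routine stands in for the actual estimation procedure, and standard errors are omitted.

```python
import numpy as np

def rank_percentile(x):
    # rank-transform a variable and scale it from 0 to 100
    order = np.argsort(np.argsort(x))
    return 100.0 * order / (len(x) - 1)

def standardize(x):
    return (x - x.mean()) / x.std()

def income_regression(income, cog, traits):
    # income: (n,) raw income; cog: (n,) cognitive score;
    # traits: (n, 5) Big Five scores, one column per PT
    y = rank_percentile(income)
    X = np.column_stack(
        [np.ones(len(y)), standardize(cog)]
        + [standardize(traits[:, j]) for j in range(traits.shape[1])]
    )
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [alpha_0, beta_0 (cognition), beta_PT1, ..., beta_PT5]
```

Because all regressors are standardized and the outcome is a 0-to-100 percentile rank, each slope reads directly as "percentile points of income per 1 SD of the regressor."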

As motivation for the analysis, Fig. 1 shows that the relationship between the Big Five PT measures and income in a large database covering 14 low- and middle-income countries appears much less stable than that found in the U.S. literature (see Supplementary Materials for regression specifications and variable definitions). Most notably, Conscientiousness (often emphasized in the U.S. literature as being predictive of earnings) is not a significant predictor of income in 10 of the 14 countries analyzed (with the point estimate being negative in three), while Extraversion and Agreeableness also show some negative coefficients. Emotional Stability and, to a lesser extent, Openness show a more consistent positive relationship. The bottom panel of Fig. 1 compares an index averaging the five PTs to cognitive ability for the nine countries for which a good measure of cognition was available. Cognitive ability shows a stronger and more significant relationship with income than the Big Five index in almost every country. Moreover, when pooling the nine countries together, the size of the coefficient of the Big Five index is less than half of that of cognitive ability. Taken at face value, these results would suggest that personality matters much less than cognition and potentially lead to puzzling conclusions such as the observation that Conscientiousness (being achievement oriented and organized) does not correlate with earnings in many low- and middle-income countries. In psychometrics, predictive validity assesses the extent to which the measure is a good predictor of outcomes with which it can be expected to be related. It is used as one of the criteria for assessing whether a measure captures what it intends to capture. 
Hence, the fact that the Big Five measures are less predictive of income than cognition and that the traits that appear to best predict income are quite different from what has been found in other contexts can raise doubts about the measures themselves. Of course, it could also mean that PTs actually matter differently for economic success in low- and middle-income countries, where most individuals face many more external constraints than do WEIRD populations. However, these conclusions would be premature if, as we show in this paper, the Big Five questions collected in large-scale surveys may not only be noisy but also measure latent traits other than the PTs they are intended to measure.

This paper compiles and analyzes Big Five data collected through a very diverse set of surveys from low- and middle-income countries in different parts of the world and covering all types of education levels, including large nationally representative surveys and surveys collected on targeted samples for impact evaluation purposes. It shows that commonly used personality questions fail to measure the intended PTs in these settings and do not pass standard validity tests. The analysis of the psychometric properties of Big Five measures raises serious concerns about their validity and hence about their use and subsequent interpretation in many studies in low- and middle-income countries when collected through surveys. In contrast, data collected through the internet from the same set of countries reflect the Big Five factor structure, suggesting that low validity is not primarily driven by cultural or contextual differences. We explore the possible reasons for the measurement challenges and derive recommendations for the use of personality questions collected through surveys among non-WEIRD populations.

While there is a large body of evidence on the importance of cognitive ability for predicting social and economic success, personality traits (PTs) are often emphasized as being equally important for many aspects of life (1, 2). The most influential taxonomy of PTs is the Big Five personality inventory (3, 4). Ample empirical evidence from the United States and other high-income countries shows that the Big Five PTs correlate with earnings, employment, and other labor market outcomes. Recent reviews conclude that Conscientiousness and Emotional Stability in particular are strong predictors of job performance and wages (5, 6). PTs are found to be particularly important for people with lower levels of job complexity or education, whereas cognitive ability is more important at higher levels of job complexity (1). One could hence hypothesize that PTs are even more important in low- and middle-income countries, where large shares of the population work in lower-complexity jobs.

Increasingly, the Big Five PTs are measured in developing-country settings. A growing literature in economics [reviewed in (7)] analyzes not only whether different PTs can predict selection (8, 9) or performance (10) in public sector jobs but also whether the effectiveness of interventions aimed at improving public service delivery (for instance, through increased payments or monitoring) depends on the PTs of the targeted personnel (11) and whether these interventions can change the quality of people being recruited into the public sector (12). Measurements of PTs are also used for predicting economic performance (13, 14), for screening purposes in the job market (15), for credit eligibility (16, 17), and for analyzing treatment heterogeneity of nudge interventions (18). Last, PTs are sometimes considered as outcomes that can be affected by skill-enhancing or behavioral interventions (19, 20). In a few cases (12, 15), studies use versions of Big Five measures that have been specifically validated for the countries studied.

This rapidly growing literature builds on the wide support for the universality of the Big Five across cultures established in the psychology literature (21-25). The motivation for the Big Five taxonomy originates in research observing that the same five factors broadly emerged each time a factor analysis was conducted to classify a set of questions describing personality (4), hence the term five-factor model (FFM). However, these studies have mostly focused on highly educated populations (often college students) in high-income countries, often referred to in the literature as WEIRD (western, educated, industrialized, rich, and democratic) (26). Notable exceptions are (27, 28), who do not find robust support for the FFM using data from orally administered surveys of rural populations with low levels of education in Bolivia, Colombia, and Kenya. Relatedly, Ludeke and Larson (29) flag concerns with the use of the BFI-10 (30), a short 10-item Big Five instrument used in the World Values Survey, showing low correlations between items meant to measure the same PT. While they interpret this as possible evidence against universality across cultures, Gosling et al. (31) explain that interitem correlation per se is not necessarily a good indicator of validity for short scales. More generally, Church (32) reviews evidence on cultural differences in personality and concludes that the FFM of personality continues to find cross-cultural support but may be difficult to replicate in less educated populations. Soto et al. (33) show that, between ages 10 and 18, there is a strong relationship between cognitive ability and between-domain differentiation of Big Five questions, while internal consistency of Big Five scales also increases with age. This suggests that variations in cognitive ability may explain variations in differentiation of the Big Five structure.

RESULTS

Data

We use four types of databases, all of which include self-reported measures of the Big Five drawn from the BFI 44-item scale (21, 34, 35). First, the World Bank's Skills towards Employment and Productivity (STEP) database contains data on the same subset of 15 Big Five items, which we will refer to as the STEP items. They were collected in large representative samples in 12 countries; in two additional countries, a larger set of 35 BFI items was collected (for a total of 20,584 individuals; see table S1 for descriptive statistics and Supplementary Materials section 1.1 for the list of STEP items). The STEP data were collected through face-to-face surveys, often undertaken in local languages (36), and are mostly representative of urban populations with diverse levels of education. A critical advantage of these data is that the same data collection instrument and standardized methods were applied across different countries in Africa, Asia, Latin America, Europe, and Central Asia. The data also contain measures of functional literacy, incorporated in the STEP surveys to provide a proxy for the respondents' cognitive ability, and are therefore referred to as such in this paper. See the Supplementary Materials for an explanation of the choice of literacy as a proxy for a broader set of foundational cognitive skills. The second database consists of 15 other datasets from 12 developing countries, including a total of 54,167 households, all of which contain a (partial or complete) version of the BFI. These datasets, described in table S2, were collected for varied purposes and by different institutions and researchers, in local languages. They were either graciously shared by the researchers who collected them or are in the public domain (and some were collected by authors of this article). They include datasets used in top-level publications (9, 12, 20, 37-43).
The database contains both face-to-face and self-administered surveys and covers different populations: farmers, entrepreneurs, civil servants, and adolescents. Most surveys are representative of specific subpopulations targeted for particular interventions and randomized controlled trials. While the number of items varies across datasets, we use the same 15 items as in the STEP data for most of the analysis, for comparability. These datasets are identified with a code because the objective is not to provide diagnostics for individual studies but rather to show the generalizability of the findings. The third database contains data obtained from volunteers who visited a noncommercial website (www.outofservice.com), provided sociodemographic data, and completed the 44 BFI items. This database has been widely used in the psychology literature (44, 45). The website provides respondents immediate scoring and feedback regarding their personality, which is the main driver of sample recruitment (46), resulting in a population of young and highly educated respondents. To facilitate comparisons, we keep only the data from the 198,356 individuals who live in one of the 14 countries included in the STEP database and restrict most of the analysis to the same 15 items (see table S3 for descriptive statistics). The BFI could be completed in English, Spanish, German, or Dutch, so not necessarily in local languages. A fourth database, used for comparison and reference, contains 44 BFI items self-administered by a community sample of 642 adults in the United States. It was chosen because of its high degree of validity, shown by Soto and John (35); it includes a wide age range (18 to 85 years) and is balanced by gender (58% women). The factor structure in these data is similar to that found in many other datasets on WEIRD populations in the United States and has been used in influential methodological work (47), making it an appropriate benchmark.
Overall, we combine 31 datasets from 24 countries, amalgamating data on about 300,000 individuals. We corrected all data for acquiescence response style, i.e., the tendency of an individual to consistently agree (yea-saying) or disagree (nay-saying). Not correcting for acquiescence before factor analysis often results in the emergence of a factor representing the response pattern (48). See Materials and Methods for an explanation of the correction. Table S4 shows that, without the correction, the internal validity of the survey data is even lower. Most notably, before the correction, we find 17 cases (in 10 different countries) in which an item has a negative correlation with other items intended to measure the same PT construct (table S5A). This number goes down to only one case after correcting for acquiescence bias (table S5B).
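One simple way to implement such a correction can be sketched as follows. This is an illustration assuming the item set is balanced between regular and reverse-coded items; the exact procedure used here is described in Materials and Methods.

```python
import numpy as np

def correct_acquiescence(responses, reverse_coded):
    # responses: (n_respondents, n_items) raw Likert answers, BEFORE reverse-coding
    # reverse_coded: length-n_items boolean array marking negatively keyed items
    # A respondent's acquiescence index is their mean raw agreement across a
    # balanced item set: a "yea-sayer" agrees even with contradictory statements.
    acq = responses.mean(axis=1, keepdims=True)
    centered = responses - acq                    # remove the agreement tendency
    corrected = np.where(reverse_coded, -centered, centered)  # then reverse-code
    return corrected, acq.ravel()
```

A respondent who answers "strongly agree" to every item, including mutually contradictory pairs, ends up with a high acquiescence index and no residual trait signal, which is exactly why an uncorrected factor analysis can recover a "response style" factor instead of the PTs.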

The absence of the Big Five factor structure in survey data

We start by analyzing the extent to which the FFM can be found in the data. Separately for each dataset, we examine the factorial structure. Following common practice in the literature (25, 49), we use principal components analysis (PCA) with Procrustes rotation to align the factor loadings with those in the U.S. data (described in Materials and Methods). The matrix of factor loadings, which estimates how much each item contributes to each component, indicates the extent to which the factor structure of the data matches the expected one. Table 1 provides a visual representation of the (lack of) congruence. We assigned each item to the PT for which it has the highest factor loading and colored in red bold font every cell where an item is associated with a factor different from the one it is intended to measure. When the data behave according to the FFM, we expect items within the same PT to correlate among themselves more than with items from a different PT, so that they are pulled together into the same factor, as can be seen for the U.S. data in Table 1. By contrast, of the 23 survey datasets, only two have all items sorted according to the FFM. For the other countries, between one and nine of the 15 items have their highest factor loadings on PTs other than those they are intended to measure. The number of wrongly assigned items is high for Conscientiousness, Openness, and Agreeableness, whereas in most countries, Emotional Stability more clearly distinguishes itself from the other PTs.

Table 1. Congruence, factor structures obtained by PCA, and comparison with the theoretical scale. The congruence coefficient (first column) is a proxy for the similarity of the factor structure, obtained from the correlation between factor loadings in two samples (here, the sample of the corresponding line is compared to the factor loadings of the U.S. data). A detailed description of the calculation of the congruence coefficient is provided in Materials and Methods. In the rest of the table, each column represents one item (the same across datasets), sorted by PT, and an "R" in its name means that it is a reverse-coded item (see Supplementary Materials section 1.1 for the phrasing of each item). The number in each cell indicates the factor on which the item has its highest factor loading in the PCA (with Procrustes rotation on U.S. data). Cell entries are in red bold font when the factor with the highest loading differs from the one the item aims to measure. For the other surveys, only identifiers are provided to preserve anonymity. All data were corrected for acquiescence bias before the analysis.

The congruence coefficient in the first column of Table 1 provides a quantitative indicator of the observed mismatch, as it can be interpreted as an index of similarity between two factor structures. It is the correlation between the factor loadings in a given dataset and the factor loadings in the United States (see Materials and Methods for computational details). Higher congruence coefficients are clearly associated with more differentiation of the FFM. The two datasets with the lowest congruence coefficients (0.59 and 0.60) are the ones with the most items wrongly sorted (seven and nine items, respectively), while the two datasets with the highest congruence coefficients (0.84 and 0.92) are the only ones for which all items are sorted according to the FFM. The average congruence coefficient across all survey data is 0.73. Although any threshold is somewhat arbitrary, Lorenzo-Seva and ten Berge (50) argue that a congruence coefficient in the range of 0.85 to 0.94 corresponds to fair similarity. Across all databases used in this paper, we find that a proper differentiation of the FFM appears to emerge with congruence coefficients around 0.85.
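For a single factor, this kind of congruence coefficient can be computed as Tucker's phi, an uncentered correlation between two columns of factor loadings. A minimal sketch (per-factor values are then averaged, as described in Materials and Methods):

```python
import math

def tucker_congruence(loadings_a, loadings_b):
    # Tucker's congruence coefficient between two columns of factor loadings:
    # an uncentered correlation, so it is sensitive to sign agreement
    num = sum(a * b for a, b in zip(loadings_a, loadings_b))
    den = math.sqrt(sum(a * a for a in loadings_a) *
                    sum(b * b for b in loadings_b))
    return num / den
```

Identical loading patterns give a coefficient of 1, unrelated patterns give values near 0, and sign-flipped loadings give negative values, which is why the samples must first be rotated into a common orientation.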
For comparison, the top panel of Table 1 shows that the variables in the U.S. data sort perfectly into the FFM, and in the bottom panel of Table 1, we present the same analysis with the internet dataset, restricted to the same countries and 15 items as STEP. In stark contrast with the survey data, the average congruence coefficient is 0.9, and in 10 out of the 14 countries, the PCA matched all 15 items with the intended PT. The internet data are from the same countries as the STEP database; hence, the results suggest that cultural differences are unlikely to be the main driver of the low validity found in the survey data. In many cases, the language of administration differed between the survey and the internet; however, differences are about as stark when comparing survey and internet data from countries in which the languages used in the survey include one of the four languages in which the internet questionnaire was available (e.g., Colombia or Kenya).
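The alignment step itself can be sketched as an orthogonal Procrustes rotation, which has a closed-form SVD solution; this is a simplified illustration of the rotation described in Materials and Methods, with the item-assignment rule (here, largest absolute loading) as a hypothetical helper:

```python
import numpy as np

def procrustes_align(loadings, target):
    # find the orthogonal rotation T minimizing ||loadings @ T - target||_F
    # (closed-form solution via the SVD of the loadings-target cross-product)
    u, _, vt = np.linalg.svd(loadings.T @ target)
    t = u @ vt
    return loadings @ t

def assign_items(rotated):
    # assign each item (row) to the factor with the largest absolute loading
    return np.abs(rotated).argmax(axis=1)
```

After rotation, comparing each item's assigned factor with the factor it is intended to measure reproduces the kind of bookkeeping shown in Table 1.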

Other reliability and validity statistics

To more comprehensively document the lack of reliability and validity of the Big Five measures in survey data, Table 2 shows the within correlation (average correlation between items within the same PT), the between correlation (average correlation between items of different PTs), Cronbach's alpha, and the congruence coefficients. These indicators were first calculated by dataset and then aggregated by database (table S6 shows statistics by dataset, and table S7 splits up the results by PT).

Table 2. Psychometric indicators by database. Datasets, psychometric measures, and the different subsamples are described in detail in the main text and Supplementary Materials. All calculations were done after correcting for acquiescence bias. Within correlation, between correlation, Cronbach's alpha, and congruence coefficient are first calculated by dataset before calculating a nonweighted average across all datasets. See table S6 for calculations for each dataset separately. For within correlation, Cronbach's alpha, and congruence coefficient, we first calculate each statistic by PT for each dataset and then average across PTs (before averaging across datasets).

The top panel of Table 2 shows within and between correlations for all available datasets but restricts the analysis to the 15 items available in STEP. Strictly positive correlations between items indicate that they capture something in common rather than just noise. The expected factor structure stands out when there is a sufficient difference between the within and between correlations (i.e., the three items belonging to the same PT should correlate more highly with one another than with the other 12 items). We find that the survey data have an average within correlation of 0.22 and a between correlation of 0.09.
This compares to a within and between correlation of 0.45 and 0.10, respectively, in the United States and 0.32 and 0.09 for the internet data. The fact that the difference between the between and within correlation is greater with the internet data is consistent with its more discernible factor structure. This can further be inferred from table S8, which shows the item-by-item correlation coefficients for the survey database (combining STEP and other surveys) versus the correlation coefficients of the internet data and the United States. For internet data and the United States, correlations between items meant to capture the same PT are consistently much higher than correlations with any other items (with the only exception of the first Openness item). By contrast, for the survey database, there are several items that show higher correlations with some items meant to measure other PTs than with items meant to measure the same PT (including two items of Conscientiousness). For the internet and U.S. data, the 10 highest correlations are all within correlations. But for the survey data, despite averaging over a large number of datasets, of the 10 highest correlations, 4 are between correlations and 6 are within. The fact that many questions correlate more with items intended to measure a different PT than with items intended to measure the same PT makes it arguably hard to interpret items as capturing the intended PT. Table 2 also shows Cronbach’s alpha, one of the most widely used measures of internal consistency of a test. For a given number of items, it increases when the correlation between items of the same PT increases. Hence, it is higher when the noise of each item is low and when they measure the same underlying factor. A minimum threshold of 0.7 is often applied, but as explained by Gosling et al. (31), a short measure should not aim at maximizing Cronbach’s alpha, because each set of items of a PT needs to capture the breadth of the concept. 
Still, the fact that the survey data obtain substantially lower Cronbach’s alpha than the internet data or the U.S. data when comparing the same items raises further concerns about the internal validity of the measure in the large survey databases. While all PTs measured in surveys show relatively low values, results indicate that internal consistency in the survey data is the lowest for Agreeableness and somewhat better for Emotional Stability. This core set of results highlights that many Big Five survey data collected in low- and middle-income countries do not follow the FFM. Conceptually, this implies that some items correlate less with items that aim at measuring the same PT than with items that belong to different PTs. The evidence that led psychometricians to conclude that the Big Five are universal tends to hold for the internet data, but it is far from apparent in the survey data from developing countries. Moreover, this appears not to be driven by differences in average age of the respondents or sample size in the internet data, as limiting the internet sample to ages and sample sizes similar to those of the survey data yields broadly similar results as for the full sample of internet data (bottom panel of Table 2).
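For reference, the two internal-consistency statistics discussed above can be computed as follows (a pure-Python sketch; `items` holds one response vector per item, over the same respondents):

```python
import math
from statistics import mean, pvariance

def pearson(x, y):
    # plain Pearson correlation between two item response vectors
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def average_pairwise_corr(items):
    # mean correlation across all item pairs: the "within" correlation when
    # the items belong to the same PT, the "between" correlation otherwise
    pairs = [(i, j) for i in range(len(items)) for j in range(i + 1, len(items))]
    return sum(pearson(items[i], items[j]) for i, j in pairs) / len(pairs)

def cronbach_alpha(items):
    # items: list of k response vectors (one per item, same respondents)
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    item_var = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))
```

The FFM prediction is then simply that `average_pairwise_corr` computed within a PT's items should clearly exceed the same statistic computed across PTs, and that alpha for each three-item scale should be well above zero.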

What can explain the lack of a clear Big Five factor structure in survey data?

We investigate possible explanations for the low validity of the PT measures, including the number of items, the cognitive ability and education levels of the respondents, the administration method, and systematic response patterns. One might be concerned that it is difficult to recover the factor structure with only 15 items or that the 15 items selected for STEP were poorly chosen. Note, however, that validity with the same 15 items is much higher in the internet data. Moreover, the congruence coefficient tends, if anything, to decrease with the number of items, because adding items reduces the overfitting that occurs when the number of components is large relative to the number of items. In addition, the between and within correlations are on average not affected by the number of items. Second, Fig. 2 shows that, even though Cronbach's alpha increases with the number of items, following the Spearman-Brown prophecy formula, for any number of items, the survey data perform substantially worse than the internet or U.S. data. Last, the middle panel of Table 2 presents statistics using the full set of 44 items available in other surveys and still shows overall low validity, substantially lower than the levels observed for the internet data.

Fig. 2. Cronbach's alpha as a function of the number of items, by type of data. The estimates for the survey data are based on surveys with at least six items per PT, while the estimates for the internet data are based on data from the 14 STEP countries, using all 44 items of the BFI. The U.S. data also use all items of the BFI. For each dataset and each number of items n, we calculate Cronbach's alpha for every possible combination of n items before averaging across all combinations and then across datasets (by type of data collection).
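The Spearman-Brown prophecy formula referenced above predicts how reliability changes mechanically with test length; as a one-function sketch:

```python
def spearman_brown(reliability, length_factor):
    # predicted reliability when the number of (parallel) items is
    # multiplied by length_factor, holding item quality constant
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)
```

The formula implies that lengthening a scale raises alpha with diminishing returns, so a scale that starts from a lower per-item reliability, as the survey data do, stays below the internet and U.S. curves at every test length.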
The educational level of the respondents is another potential driver of the differences in reliability and internal validity between the survey data and the datasets used by psychologists in the United States or the internet data. On average, only 25% of respondents in the STEP surveys have a college education, compared to 81% in the internet data. The bottom panel of Table 2 therefore presents a set of psychometric indicators when restricting the STEP data to respondents who have had some college education. This increases comparability between the STEP and internet data and brings the sample of STEP respondents closer to the convenience samples of university students often used in psychometric studies. Perhaps unexpectedly, we find no improvement in any of the indicators, suggesting that the education level of respondents may not be a primary driver of the low validity. Similar results are obtained when restricting the STEP samples to individuals with white-collar jobs. More generally, cognitive ability could play a role, since it may affect people's understanding of the arguably abstract Big Five questions. We analyze the role of cognitive ability with the STEP database, using as a proxy for cognition the measure of functional literacy that is comparable between individuals and countries in the STEP surveys. Figure 3 presents the relationship between psychometric indicators and the cognitive ability of the respondents. In this figure, the unit of observation is the region, corresponding to the largest geographical division within each country as indicated in STEP, resulting in between 2 and 15 regions per country. Each panel depicts one of the indicators, calculated separately for each region, and its correlation with the region-level average cognitive ability of the respondents. The analysis is limited to the nine countries for which good cognitive measures are available.
Of course, since regions with low average cognitive ability are likely to differ along many other dimensions, these correlations may not have a causal interpretation. However, the congruence coefficient and Cronbach's alpha have a clear positive and significant relationship with the measure of cognitive ability. The variation is substantial, with the congruence coefficient ranging from about 0.5 at the lower end to about 0.7 at the upper end of the cognitive scores. Hence, survey data for regions with lower average cognitive ability show factor structures that are less consistent with the FFM and less internally valid. Yet, this is not the entire story, because even the regions with the highest average cognitive ability remain below acceptable psychometric standards. Moreover, once we account for average differences between countries by including country-fixed effects in the estimations, these relationships are no longer significant, making it unclear whether they capture differences in cognition or rather other cross-country contextual differences that could affect responses in face-to-face surveys.

Fig. 3. Relationship between psychometric indicators and cognitive ability. In each panel, the level of observation is the largest geographical division in the country (regions, provinces, or districts). We apply a weight that is the inverse of the number of geographical divisions, to give the same weight to each country. The calculations of the congruence coefficient, Cronbach's alpha, the absolute value of acquiescence bias, and enumerator bias are described in Materials and Methods. Enumerator bias measures the share of the variation in responses (by PT) that can be explained by systematic differences across the enumerators who administered each survey. Cognitive ability is measured by the full literacy test, also described in the Supplementary Materials. The nine countries in the regression are those with the full literacy test included in the STEP surveys.
Low validity of the PT measures could also be related to systematic response biases and answering patterns that are potentially more prevalent in survey data. Social desirability bias, for instance, could help explain why Conscientiousness and Agreeableness are the most problematic PTs and why Conscientiousness has little predictive power in the survey data. The databases do not allow us to quantify social desirability bias, but we consider two other indicators of undesirable response patterns that can contribute to blurring the measures of the PTs: (i) the absolute value of acquiescence, which indicates a respondent's tendency to agree with two contradictory statements, and (ii) the share of the variance in responses that can be explained by enumerator-fixed effects (i.e., dummy variables capturing which enumerator administered the face-to-face survey), which we refer to as potential enumerator bias. The bottom left panel of Fig. 3 shows a small negative relationship between acquiescence bias and cognition. Using regions as the level of observation, the relationship is not significant, but it is steeper and significant when using individual data rather than regional averages (P < 0.001). This result highlights that, within a country, respondents with lower cognitive skills are more likely to agree with statements that are mutually inconsistent. It is in line with Soto et al. (33), who found that acquiescence bias declines between ages 10 and 18, as cognitive ability increases. Overall, Fig. 3 suggests that, even if low levels of respondents' cognitive ability alone cannot explain the low validity of the survey questions, they may accentuate response biases and contribute to making the PTs difficult to identify.
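The enumerator-bias indicator has a simple closed form: the R² of a regression of PT scores on enumerator dummies equals the between-enumerator sum of squares divided by the total sum of squares. A sketch:

```python
from statistics import mean

def enumerator_variance_share(scores, enumerators):
    # R^2 of a regression of PT scores on enumerator-fixed effects:
    # between-enumerator sum of squares over total sum of squares
    grand = mean(scores)
    groups = {}
    for s, e in zip(scores, enumerators):
        groups.setdefault(e, []).append(s)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    total = sum((s - grand) ** 2 for s in scores)
    return between / total
```

A share near 0 means responses vary mostly within enumerators' caseloads, as one would hope; a large share flags that who asked the questions predicts the answers.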