The psychology of human males and females is marked by a complex pattern of similarities and differences in cognition, motivation, and behavior. Describing and quantifying sex differences is a crucial task of psychological science, and a source of lively controversy among researchers [1]–[12]. In this paper, we will consider the nature and magnitude of sex differences in personality. It is difficult to overstate the theoretical and practical importance of sex differences in personality; finding large overall differences would tell us that the sexes differ broadly in their emotional and behavioral patterns, rather than just in a few (and comparatively narrow) motivational domains such as aggression and sexuality.

While the gender similarities hypothesis does not make specific predictions about personality, sex differences in personality were found to be “small” in Hyde's meta-analytic review. Specifically, Hyde found consistently “large” (d between .66 and .99) or “very large” (d≥1.00) sex differences in only some motor behaviors and some aspects of sexuality; “moderate” differences (d between .35 and .65) in aggression; and “small” differences (d between .11 and .35), or even differences close to zero (d≤.10) in the other domains she considered. Cohen's d is a standardized difference, obtained by dividing the difference between group means by the pooled within-group standard deviation. Assuming normality, a standardized difference d≤.35 implies that the male and female distributions overlap by at least 75% of their joint area. Even if conventional criteria for labeling effect sizes as “small”, “medium” and “large” have many limitations and should be used with great caution [2], [13], [14], this amount of overlap does indicate that the statistical distributions of males and females are not strongly differentiated. In fact, the nonoverlapping portion of the joint distribution becomes larger than the overlapping portion only when d>.85. For comparison, the criterion used by Hyde to identify “very large” sex differences (d≥1.00) corresponds to an overlap of 45% or less between the male and female distributions.
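The correspondence between d and overlap quoted above can be checked numerically. The following is a minimal sketch, assuming two normal distributions with equal variances and assuming that "overlap of the joint area" means the area shared by the two curves divided by the total area covered by either curve (that is, one minus Cohen's U1). This definition is our reading, adopted because it reproduces the figures in the text; the function name `overlap_joint` is purely illustrative.

```python
from statistics import NormalDist


def overlap_joint(d: float) -> float:
    """Overlap of two equal-variance normal distributions whose means are
    d standard deviations apart, expressed as a share of their joint area.

    Assumed definition: shared area / area covered by either curve.
    """
    # Area lying under both density curves (each curve has total area 1).
    shared = 2.0 * NormalDist().cdf(-abs(d) / 2.0)
    # Area covered by at least one curve is 2 - shared.
    return shared / (2.0 - shared)


if __name__ == "__main__":
    for d in (0.10, 0.35, 0.85, 1.00):
        print(f"d = {d:.2f}  overlap = {overlap_joint(d):.3f}")
```

Under these assumptions the overlap falls to about one half near d = .85, which is the point at which the nonoverlapping portion starts to exceed the overlapping portion.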

The idea that the sexes are quite similar in personality – as well as most other psychological attributes – has been expressed most forcefully in Hyde's “gender similarities hypothesis” [9]. The gender similarities hypothesis holds that “males and females are similar on most, but not all, psychological variables. That is, men and women, as well as boys and girls, are more alike than they are different.” Hyde's paper has been remarkably influential; between 2005 and 2010, it accumulated 247 citations in the Web of Knowledge database and 498 citations in Google Scholar (retrieved May 19th, 2011).

In addition to their direct influences on mating processes, personality traits correlate with many other sexually selected behaviors, such as status-seeking and risk-taking (see e.g., [20], [34], [35]). Thus, in an evolutionary perspective, personality traits are definitely not neutral with respect to sexual selection. Instead, there are grounds to expect robust and wide-ranging sex differences in this area, resulting in strongly sexually differentiated patterns of emotion, thought, and behavior – as if there were “two human natures”, as effectively put by Davies and Shackelford [15].

Most personality traits have substantial effects on mating- and parenting-related behaviors such as sexual promiscuity, relationship stability, and divorce. Promiscuity and the desire for multiple sexual partners are predicted by extraversion, openness to experience, neuroticism (especially in women), positive schizotypy, and the “dark triad” traits (i.e., narcissism, psychopathy, and Machiavellianism). Negative predictors of promiscuity and short-term mating include agreeableness, conscientiousness, honesty-humility in the HEXACO model, and autistic-like traits [20]–[31]. Relationship instability is associated with extraversion, low agreeableness, and low conscientiousness [26], [29]–[31]. Finally, neuroticism, low conscientiousness, and (to a smaller extent) low agreeableness all contribute to increase the likelihood of divorce [32], [33].

At the other end of the theoretical spectrum, evolutionary psychologists have emphasized how divergent selection pressures on males and females are expected to produce consistent – and often substantial – psychological differences between the sexes [1], [7], [15], [16]. By the logic of sexual selection theory and parental investment theory [17], [18], large sex differences are most likely to be found in traits and behaviors that ultimately relate to mating and parenting. More generally, sex differences are expected in those domains in which males and females have consistently faced different adaptive problems. For example, typical effect sizes in research on mate preferences range from d = .80 to 1.50, a finding consistent with this expectation [1], [19]. In contrast, similarities between males and females can be expected when the sexes have been subjected to similar selection pressures. Thus, the evolutionary approach to sex differences is consistent with a weak version of the gender similarities hypothesis [19], although the latter is stated so vaguely that it is extremely difficult to test empirically (e.g., how many “psychological variables” should be considered, and what is the appropriate index of similarity?).

The task of quantifying sex differences in personality faces a number of important methodological challenges. Indeed, all the studies performed so far suffer, to various degrees, from limitations that ultimately lead to systematic underestimation of effect sizes. In this paper, we discuss those challenges and present a set of guidelines for the accurate measurement of sex differences. We then apply our guidelines to the analysis of a large, representative US sample to obtain a “gold standard” estimate of global sex differences in personality, which turns out to be extremely large by any reasonable criterion.

Given the contrast between the predictions derived from evolutionary theory and those based on the gender similarities hypothesis, there is a pressing need for accurate empirical estimates of sex differences in personality. Of course, the existence of large sex differences would not, by itself, constitute proof that sexual selection had a direct role in shaping human personality. For example, Eagly and Wood [4] advanced an alternative theory in which selection is assumed to be responsible for physical, but not psychological differences between the sexes – the latter resulting from sex role socialization. Nevertheless, the accurate quantification of between-sex differences represents a necessary initial step toward an informed theoretical debate [36], and may eventually help researchers discriminate between alternative models of biological and cultural evolution [2].

Methodological Challenges and Guidelines

In this section we review three key methodological challenges in the quantification of sex differences in personality. In doing so, we review the main empirical studies in this area and present the relevant effect sizes. However, the reader should keep in mind that our aim is to illustrate key methodological issues, not to present a comprehensive review of empirical research. In presenting univariate effect sizes (d), a positive sign indicates a male advantage, and a negative sign a female advantage.

Broad Versus Narrow Personality Traits. Personality traits can be organized in a hierarchical structure, from the broad and inclusive (e.g., extraversion) to the narrow and specific (e.g., gregariousness or excitement seeking). Researchers often focus on the Big Five, i.e., the broad “domains” of the five-factor model of personality (FFM [37]), which are at the same hierarchical level as the six factors of the HEXACO model [38], the five factors of the “alternative FFM” by Zuckerman and colleagues [39], and others. Up in the hierarchy, correlations between broad traits give rise to two “metatraits”, often labeled stability and plasticity [40], [41]. It may even be possible to identify a single, general factor of personality (the GFP or “Big One”) at the top of the hierarchy [42]–[45]. Right below the level of the Big Five, about 10–20 narrower traits can be identified; the ten “aspects” described by DeYoung and colleagues [46] and the fifteen primary factors in Cattell's 16PF [47], [48] fall in this category. At the lowest level are dozens of specific personality “facets”; questionnaires based on the FFM typically identify 30 to 45 such facets [46]. Choosing the proper level of description is a crucial challenge in the study of sex differences. Specifically, differences that are apparent (and possibly substantial) at a given descriptive level may become muted, or even disappear, when traits are aggregated into broader constructs at a higher hierarchical level. The effect is especially dramatic when sex differences in two narrow traits go in opposite directions, canceling each other out at the level of a broader trait. For example, FFM extraversion has loadings on two narrower dimensions, warmth/affiliation (consistently higher in females) and dominance/venturesomeness (consistently higher in males). These two effects of opposite sign result in a small overall sex difference in extraversion, with females typically scoring (slightly) higher than males [49]–[54].
A similar pattern of crossover sex differences has been found in openness to experience, with males scoring higher on the “ideas” dimension and females on the “aesthetics” dimension of this trait [49], [54], [55]. Sex differences in Conscientiousness are also confined to just some of its components [50], [52]. Taken together, these findings make it apparent that measuring personality at the level of the Big Five hides some important differences between the sexes. Thus, in order to get the most accurate picture of sex differences, researchers need to measure personality with a higher resolution than that afforded by the Big Five (or other traits at the same hierarchical level). A corollary is that, when investigating sex and personality as predictors of a given outcome (such as health, self-esteem, and so forth), cleaner and more meaningful results are likely to obtain if personality is measured at the level that yields the most clearly sex-differentiated profiles. As traits become narrower, however, it also becomes more difficult to measure them with sufficient reliability, and the signal-to-noise ratio decreases. We provisionally suggest that the best compromise may be reached by describing personality with about 10–20 traits, i.e., at the hierarchical level immediately below that of the Big Five.
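The canceling effect of aggregation can be illustrated with a small numerical sketch. It assumes a broad trait scored as the unit-weighted sum of two standardized narrow traits whose within-sex correlation is r; under that assumption the standardized difference on the composite is (d1 + d2) / sqrt(2 + 2r). The facet values and the correlation below are hypothetical and are not taken from any of the studies cited; the function name `composite_d` is likewise illustrative.

```python
import math


def composite_d(d1: float, d2: float, r: float) -> float:
    """Standardized group difference on the unit-weighted sum of two
    standardized traits with within-group correlation r.

    The mean difference of the sum is d1 + d2 (in within-group SD units),
    and the within-group SD of the sum is sqrt(1 + 1 + 2r).
    """
    return (d1 + d2) / math.sqrt(2.0 + 2.0 * r)


if __name__ == "__main__":
    # Hypothetical values: negative sign = female advantage,
    # positive sign = male advantage, as in the text.
    d_warmth, d_dominance, r = -0.50, 0.40, 0.30
    print("broad-trait d:", round(composite_d(d_warmth, d_dominance, r), 3))
    print("mean |d| of narrow traits:",
          round((abs(d_warmth) + abs(d_dominance)) / 2.0, 3))
```

With these illustrative inputs, two narrow-trait differences of moderate size leave only a trivial difference on the broad trait, slightly favoring females, which mirrors the pattern described for extraversion.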

Observed Scores Versus Latent Variables. Theories of personality conceptualize traits as unobserved latent variables; as such, they are only imperfectly represented by scores on personality inventories. Typically, observed scores are contaminated by substantial amounts of specific variance and measurement error, leading to attenuated estimates of sex differences when such scores are used to compute effect sizes. Unfortunately, most studies of sex differences in personality have relied on observed scores, thus making the implicit (and incorrect) assumption that observed scores are equivalent to latent variables. The contrast between latent, error-free effect sizes and observed-score effect sizes is strikingly illustrated by a recent study by Booth and Irwing [54]. On the 15 primary factors of the 16PF, observed-score effect sizes ranged from d = −1.34 to +.32, with an average absolute effect size d̄ = .26 (within the bounds of “small” effects according to Hyde [9], and corresponding to a male-female overlap of 81%). When group differences on latent variables were estimated by multi-group covariance and mean structure analysis (MG-CMSA), effect sizes ranged from d = −2.29 to +.54, with an average absolute effect size d̄ = .44 (a “moderate” effect by Hyde's criteria, corresponding to an overlap of 70%). Estimating group differences on latent variables is clearly preferable to relying on observed scores, but this methodology depends on the assumption of measurement invariance, i.e., the assumption that the construct being measured is actually the same in both groups [56]–[58]. Booth and Irwing [54] found that between-sex invariance was violated for the five global scales of the 16PF (analogous to the Big Five), but satisfied for the 15 primary factors of personality. There is evidence that the same may apply to FFM inventories [59].
Measurement invariance is thus another reason to measure sex differences at the level of narrow traits, instead of focusing on broad traits like the Big Five.
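To give a sense of how measurement error alone attenuates observed-score effect sizes, the sketch below applies the classical correction for unreliability, d_corrected = d_observed / sqrt(reliability). This is a simpler technique than the latent-variable (MG-CMSA) approach discussed above, and it does not remove specific variance or test measurement invariance. The reliability value used here is hypothetical, and `disattenuate_d` is an illustrative name.

```python
import math


def disattenuate_d(d_observed: float, reliability: float) -> float:
    """Classical correction of a standardized difference for unreliability.

    Measurement error inflates the within-group SD by 1 / sqrt(reliability),
    so the observed d is divided by sqrt(reliability) to undo the shrinkage.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must lie in the interval (0, 1]")
    return d_observed / math.sqrt(reliability)


if __name__ == "__main__":
    # Hypothetical scale reliability of .70 applied to an observed d of .26.
    print(round(disattenuate_d(0.26, 0.70), 3))
```

The lower the reliability of a scale, the larger the gap between the observed and the corrected difference, which is one reason why narrow, less reliable traits call for latent-variable or reliability-corrected estimates.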

Univariate Versus Multivariate Effect Sizes. Since personality is a multidimensional construct, the question of how to quantify the overall magnitude of sex differences in personality is far from trivial. A common way of dealing with multiple effect sizes is to simply average them. For Big Five traits, the average absolute effect size across studies is d̄ = .16 to .19, corresponding to an overlap of about 87% between the male and female distributions [11], [16]. When narrower traits are measured, average effect sizes increase somewhat. For example, Costa and colleagues [49] analyzed sex differences in FFM facets; their average effect sizes were d̄ = .24 (US adults) and d̄ = .19 (adults from other countries). As reported above, Booth and Irwing [54] found d̄ = .26 for observed scores on the 15 primary factors of the 16PF. Finally, the average effect size in Weisberg and colleagues [53] was d̄ = .21 for the Big Five and d̄ = .26 for the ten FFM aspects (uncorrected raw scores). The problem with this approach is that it fails to provide an accurate estimate of overall sex differences; in fact, average effect sizes grossly underestimate the true extent to which the sexes differ. When two groups differ on more than one variable, many comparatively small differences may add up to a large overall effect; in addition, the pattern of correlations between variables can substantially affect the end result. As a simple illustrative example, consider two fictional towns, Lowtown and Hightown. The distance between the two towns can be measured on three (orthogonal) dimensions: longitude, latitude, and altitude. Hightown is 3,000 feet higher than Lowtown, and they are located 3 miles apart in the north-south direction and 3 miles apart in the east-west direction. What is the overall distance between Hightown and Lowtown? The average of the three measures is 2.2 miles, but it is easy to see that this is the wrong answer.
The actual distance is the Euclidean distance, i.e., 4.3 miles – almost twice the “average” value. The same reasoning applies to between-group differences in multidimensional constructs such as personality. When groups differ along many variables at once, the overall between-group difference is not accurately represented by the average of univariate effect sizes; in order to properly aggregate differences across variables while taking correlation patterns into account, it is necessary to compute a multivariate effect size. The Mahalanobis distance D is the natural metric for such comparisons. Mahalanobis' D is the multivariate generalization of Cohen's d, and has the same substantive meaning. Specifically, D represents the standardized difference between two groups along the discriminant axis; for example, D = 1.00 means that the two group centroids are one standard deviation apart on the discriminant axis. A crucial (and convenient) property of D is that it can be translated to an overlap coefficient in exactly the same way as d: for example, two multivariate normal distributions overlap by 50% when D = .85, just as two univariate normal distributions overlap by 50% when d = .85 [60], [61]. The only difference between d and D is that the latter is an unsigned quantity. The formula for D is $D = \sqrt{\mathbf{d}'\mathbf{S}^{-1}\mathbf{d}}$ (1), where $\mathbf{d}$ is the vector of univariate standardized differences (Cohen's d) and $\mathbf{S}$ is the correlation matrix. Confidence intervals on D can be computed analytically [61], [62] or bootstrapped. For more information about D and its applications in sex differences research, see [2], [63], [64]. Multivariate effect sizes can make a big difference in the study of sex differences. Del Giudice [2] reanalyzed a dataset collected by Noftle and Shaver [65] in an undergraduate sample.
On the Big Five, univariate effect sizes (corrected for unreliability) ranged from d = −.57 to +.11, and the average absolute effect size was a “small” d̄ = .30, corresponding to a 79% overlap between the male and female distributions. However, the multivariate effect size was D = .98, a “large” effect corresponding to a multivariate overlap of 45%. In a similar fashion, univariate effect sizes in the study by Weisberg and colleagues [53] ranged from d = −.49 to +.07, and the average absolute effect size on the ten FFM aspects was d̄ = .29 (corrected for unreliability). Computing a multivariate effect size with the same scores, however, gives D = .94, corresponding to an overlap of 47%. The importance of using multivariate effect sizes is further increased by the fact that personality traits interact with each other to determine behavior [66]; for example, high extraversion can have very different consequences when coupled with high versus low agreeableness. For this reason, global, “configural” sex differences (quantified by multivariate effect sizes such as D) may be especially relevant in determining both the social perception and the social behavior of the two sexes.
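The two-towns example and the computation of D can be sketched in a few lines of code. The sketch assumes NumPy is available, takes D as the square root of the quadratic form of the vector of standardized differences with the inverse correlation matrix, and uses a hypothetical two-trait input (differences of ±.50 with a correlation of .50) that does not come from any dataset discussed here. The function name `mahalanobis_d` is illustrative.

```python
import numpy as np

# Two-towns example: offsets along three orthogonal dimensions, in miles.
FEET_PER_MILE = 5280.0
offsets_miles = np.array([3.0, 3.0, 3000.0 / FEET_PER_MILE])
average_offset = float(offsets_miles.mean())              # the misleading "average"
euclidean_distance = float(np.linalg.norm(offsets_miles))  # the actual distance


def mahalanobis_d(d, corr) -> float:
    """Multivariate standardized difference D = sqrt(d' S^-1 d).

    d    : univariate standardized differences (Cohen's d), one per trait
    corr : correlation matrix of the traits (assumed positive definite)
    """
    d = np.asarray(d, dtype=float)
    corr = np.asarray(corr, dtype=float)
    # Solve S x = d rather than forming the inverse explicitly.
    return float(np.sqrt(d @ np.linalg.solve(corr, d)))


if __name__ == "__main__":
    print("average of offsets:", round(average_offset, 2), "miles")
    print("Euclidean distance:", round(euclidean_distance, 2), "miles")

    # Hypothetical traits: opposite-sign differences, positive correlation.
    d_vec = [0.50, -0.50]
    corr = [[1.0, 0.5],
            [0.5, 1.0]]
    print("mean |d|:", float(np.mean(np.abs(d_vec))))
    print("D:", round(mahalanobis_d(d_vec, corr), 3))
```

When the traits are uncorrelated, D reduces to the Euclidean length of the vector of d values, exactly as in the towns example. When differences of opposite sign occur on positively correlated traits, as in the hypothetical input, D exceeds that Euclidean length, which shows how the correlation pattern affects the overall result.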