Every so often, I read an article quoting some claim arising from research on sex differences. Typically, scientists have found a sex difference that is statistically significant, and the difference is reported with much fanfare and various claims about what it means. Unfortunately, a statistically significant difference is not necessarily a useful difference in practice, which makes many popular interpretations of the original narrow scientific claim simply wrong.

A naïve perspective on a statistical difference often goes like this: “I heard that girls aren’t as good at spatial rotation as boys are, so I guess that explains why I get lost so easily.” Or, for other differences, someone might say, “Apparently, they’ve found that gay men’s index fingers are longer than straight men’s, so I measured mine and my partner’s. I passed the test, but my partner failed it. I guess he is a bit straight acting.” Or, “Researchers have found that transsexual brains are different from cissexual brains. If only they could have tested me when I was a child, I could have been diagnosed back then and been saved a lot of pain.”

The problem here is a fundamental misunderstanding of what statistics tell us. Statistics tell us about properties of samples taken from populations. They don’t necessarily tell us about individuals. To understand why, we’ll look at three sex differences: height, finger length, and brains. Tasty, tasty brains.

Measuring Up the Sexes

Let’s start with height, an “obvious” sex difference. In most human populations, men are noticeably taller than women. For our discussion, we’ll use demographic data from the USA. Men have a median height of 5′ 8.5″ (174 cm), whereas women have a median height of 5′ 3.5″ (162 cm). Thus, we have a tangible difference of five inches between the sexes. (I used the median as my “average” here, rather than the mean; see Note 1 at the bottom.)

So, we have a statistic, and it meshes well with our daily observations of life, but what do these numbers actually tell us? Can we use height as a predictor for sex, and, if so, just how good a predictor is it?

To try to answer that question, let’s arbitrarily pick someone from the US population who is 5′ 6.5″ (169 cm) tall, making them three inches taller than the median for women, and two inches shorter than the median for men. Naïvely, you might expect that this individual is more likely to be a man than a woman; after all, the person’s height is closer to the male median than the female one. But you’d be wrong. In fact, statistically, it is slightly more likely that the person we have picked is a woman.

It isn’t enough to have a sense of the height of the average man or the average woman. We also need to know something about the distribution of heights in the population. If we measured one million Americans, chosen at random, we would expect to get statistics that look like the ones below:

| Height (cm) | Women | Men | P(Woman) | P(Man) |
|---|---|---|---|---|
| 75–80 | 146 | | 100.00% | |
| 80–85 | 907 | 66 | 93.22% | 6.78% |
| 85–90 | 1,911 | 2,115 | 47.47% | 52.53% |
| 90–95 | 3,915 | 4,708 | 45.40% | 54.60% |
| 95–100 | 4,805 | 4,727 | 50.41% | 49.59% |
| 100–105 | 5,050 | 4,015 | 55.71% | 44.29% |
| 105–110 | 5,303 | 3,941 | 57.37% | 42.63% |
| 110–115 | 4,949 | 5,462 | 47.54% | 52.46% |
| 115–120 | 4,461 | 6,651 | 40.15% | 59.85% |
| 120–125 | 5,347 | 7,371 | 42.04% | 57.96% |
| 125–130 | 5,728 | 6,404 | 47.21% | 52.79% |
| 130–135 | 6,199 | 5,475 | 53.10% | 46.90% |
| 135–140 | 5,265 | 5,056 | 51.01% | 48.99% |
| 140–145 | 6,461 | 7,699 | 45.63% | 54.37% |
| 145–150 | 20,368 | 6,669 | 75.33% | 24.67% |
| 150–155 | 55,291 | 7,245 | 88.41% | 11.59% |
| 155–160 | 99,716 | 10,198 | 90.72% | 9.28% |
| 160–165 | 122,115 | 24,369 | 83.36% | 16.64% |
| 165–170 | 100,480 | 57,832 | 63.47% | 36.53% |
| 170–175 | 41,443 | 94,289 | 30.53% | 69.47% |
| 175–180 | 9,660 | 102,024 | 8.65% | 91.35% |
| 180–185 | 1,521 | 74,355 | 2.00% | 98.00% |
| 185–190 | 54 | 33,282 | 0.16% | 99.84% |
| 190–195 | | 12,338 | | 100.00% |
| 195–200 | | 2,212 | | 100.00% |
| 200–205 | | 402 | | 100.00% |
| Total | 511,095 | 488,905 | 51.11% | 48.89% |
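The P(Woman) and P(Man) columns are simply each sex's share of the people in a height bin. Here is a minimal Python sketch of that arithmetic, using three rows copied from the table above (the function name `share_by_sex` is only illustrative):

```python
def share_by_sex(women: int, men: int) -> tuple[float, float]:
    """Return (P(woman), P(man)) for one height bin, given the counts in it."""
    total = women + men
    return women / total, men / total


# (women, men) counts copied from three rows of the height table.
bins = {
    "160-165": (122_115, 24_369),
    "165-170": (100_480, 57_832),
    "170-175": (41_443, 94_289),
}

for label, (women, men) in bins.items():
    p_woman, p_man = share_by_sex(women, men)
    print(f"{label} cm: P(woman) = {p_woman:.2%}, P(man) = {p_man:.2%}")
```

The 165–170 cm row is the one that contains our 169 cm person, which is why a woman is the slightly better guess there.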

One of the interesting points to note from this data is that our sample has more women than men, because there are more women than men in the US population (men are more likely to die). But it is also the case that men occupy a greater range of heights, so they are necessarily spread more thinly.

If we plotted these counts, they would look like the following:

The graph shows two distributions, one for men and one for women (each roughly following the classic bell-curve shape of a statistical normal distribution). At a little before 5′ 7″ (169.5 cm), they cross: anyone taller than that appears to be statistically more likely to be a man, and anyone shorter is more likely to be a woman. It might seem that we could use this value as a threshold between “female” heights and “male” heights. But there are a lot of people “on the wrong side” of the cutoff point. More than 1 in 9 women (11.6%) are taller than 169.5 cm (the shaded pink section of the graph), putting them on the side we might have described as “more likely to be male”, but there are even more men who have “girly” heights: almost exactly one third of men (33.4%) are shorter than 169.5 cm.

So, it is false to imagine that there is a height threshold that we could use to reliably partition women from men. And our daily experience backs that up. There is a lot of overlap between the range of heights for men and the range of heights for women. Statistically, it may be the case that someone who is 5′ 6″ (167 cm) is more likely to be a woman, but 1 in 3 people of that height are men, which means that, as a test to sort men from women, a height threshold would be wrong quite often. Of course, in places where the distributions have less overlap, things are more clear cut; for example, only 1 in 10 people 5′ 3″ (159.5 cm) tall are men, and only 1 in 10 people 5′ 9.5″ (176.5 cm) tall are women. And there are very few women taller than 6′ 1″ (185 cm).

It’s tempting to imagine we might be able to salvage the simple threshold idea by saying that people taller than 176.5 cm are usually male and those shorter than 159.5 cm are usually female, and abandoning everyone in the middle (more than half of our sample!) as living in a gray area. But even that doesn’t work. Once we get to people shorter than 4′ 9″ (145 cm), there is no reliable gender difference. There are 60,447 women less than 145 cm tall, and 63,690 men, making it pretty much a wash. And because there are fewer men than women overall, particularly short people make up a larger share of men (13% of all men) than of women (11.8% of all women).
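Those two percentages come from dividing each count of very short people by the total for that sex in the sample (511,095 women and 488,905 men in the height table). As a quick check in Python:

```python
# Counts from the height table: people shorter than 145 cm, and totals by sex.
short_women, all_women = 60_447, 511_095
short_men, all_men = 63_690, 488_905

print(f"short women: {short_women / all_women:.1%} of all women")
print(f"short men:   {short_men / all_men:.1%} of all men")
```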

So even though height is a sex difference that is fairly visible in the world, and the difference between the height of men and women is statistically significant, it’s difficult to put this information to good use to make predictions about individuals. If all we know is that men have heights centered around 5′ 8.5″ and women have heights centered around 5′ 3.5″, that’s not enough information to make any reasonable predictions at all about individuals, and certainly not enough to use someone’s height to predict whether they are male or female. Even with a better sense of how height is distributed, we can only predict gender with at least 90% accuracy for 34.7% of people, mostly people who are really tall (but also people between 5′ 1″ and 5′ 3″). In other words, even though knowing the statistical distributions for the heights of men and women can tell us something some of the time, for most people, knowing their height is useless for making reliable predictions about their sex.
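The 34.7% figure can be recovered from the height table by adding up everyone who falls in a bin where the majority sex accounts for at least 90% of that bin. The sketch below does this under two assumptions: the table's 5 cm bins are used as-is, and blank cells are treated as zero. The names `COUNTS` and `confident_share` are illustrative.

```python
# (women, men) counts per 5 cm bin, from 75-80 cm up to 200-205 cm,
# copied from the height table. Blank cells are treated as 0.
COUNTS = [
    (146, 0), (907, 66), (1_911, 2_115), (3_915, 4_708), (4_805, 4_727),
    (5_050, 4_015), (5_303, 3_941), (4_949, 5_462), (4_461, 6_651),
    (5_347, 7_371), (5_728, 6_404), (6_199, 5_475), (5_265, 5_056),
    (6_461, 7_699), (20_368, 6_669), (55_291, 7_245), (99_716, 10_198),
    (122_115, 24_369), (100_480, 57_832), (41_443, 94_289),
    (9_660, 102_024), (1_521, 74_355), (54, 33_282), (0, 12_338),
    (0, 2_212), (0, 402),
]


def confident_share(counts, min_accuracy=0.90):
    """Fraction of all people who sit in a bin where guessing that bin's
    majority sex would be right at least `min_accuracy` of the time."""
    total = sum(w + m for w, m in counts)
    confident = sum(
        w + m for w, m in counts if max(w, m) / (w + m) >= min_accuracy
    )
    return confident / total


print(f"{confident_share(COUNTS):.1%} of people are in a 90%-or-better bin")
```

Lowering `min_accuracy` shows how quickly coverage grows once we accept being wrong more often.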

Furthermore, we have not even begun to consider other factors that influence height, such as race and ethnicity. For example, an average woman from Norway is 5′ 6.5″, but an average man from rural India is only 5′ 3.5″. So, any generalizations we make about height are “all other things being equal”, but in real life, those other factors are not being held constant.

And that was height—what about other sex differences? After all, you’re hardly likely to be considered to have made a major research discovery if you announce that on average men are taller than women. Typically, research on sex differences focuses on more subtle, less obvious differences. These differences are good for headlines, but at least some of the time, the differences that are uncovered, while “statistically significant”, are even less practically significant than height when applied to individuals.

Pull My Finger

With height under our belt, let’s move on to looking at a more subtle sex difference, finger length, specifically the 2D:4D ratio. Here’s what Wikipedia says about it (at the time of writing):

The digit ratio is the ratio of the lengths of different digits or fingers typically measured from the bottom crease where the finger joins the hand to the tip of the finger. It has been suggested by some scientists that the ratio of two digits in particular, the 2nd (index finger) and 4th (ring finger), is affected by exposure to androgens e.g. testosterone while in the uterus and that this 2D:4D ratio can be considered a crude measure for prenatal androgen exposure, with lower 2D:4D ratios pointing to higher androgen exposure. The 2D:4D ratio is calculated by measuring the index finger of the right hand, then the ring finger, and dividing the former by the latter. A longer ring finger will result in a ratio of less than 1, a longer index finger will result in a ratio higher than 1. The 2D:4D digit ratio is sexually dimorphic: in males, the second digit tends to be shorter than the fourth, and in females the second tends to be the same size or slightly longer than the fourth. A number of studies have shown a correlation between the 2D:4D digit ratio and various physical and behavioral traits.

It seems to say, then, that we can set a threshold of one for the ratio, with men on one side and women on the other. If you have a ratio of less than one (longer ring finger), you have “boy fingers”, and if you have a ratio of greater than one (longer index finger), you have “girl fingers”. It’s a simple rule that’s easy to remember. But we ought to be suspicious. We had a threshold for height (169.5 cm), but a full third of men fell into the “girly height” category. And we haven’t been told anything about the average ratios for men and women, or their distribution. At the time of writing, Wikipedia gives no such information.

So, let’s dive into a paper on the topic. For convenience I’m going to pick just one paper that has a moderately good sample size, The Visible Hand: Finger Ratio (2D:4D) and Competitive Behavior, by Matthew Pearson and Burkhard C. Schipper. Here’s their data (see Note 2):

| Race | Sex | Count | Average | Std Dev | Min | Max |
|---|---|---|---|---|---|---|
| White | Male | 35 | 0.960 | 0.026 | 0.899 | 1.022 |
| Asian | Male | 47 | 0.944 | 0.026 | 0.882 | 1.000 |
| Hispanic | Male | 10 | 0.954 | 0.025 | 0.913 | 1.002 |
| Black | Male | 2 | 0.951 | 0.015 | 0.941 | 0.962 |
| Others | Male | 6 | 0.973 | 0.025 | 0.938 | 0.998 |
| All | Male | 100 | 0.952 | 0.0272 | 0.882 | 1.033 |
| White | Female | 20 | 0.959 | 0.030 | 0.898 | 0.999 |
| Asian | Female | 65 | 0.963 | 0.026 | 0.912 | 1.033 |
| Hispanic | Female | 5 | 0.948 | 0.043 | 0.898 | 0.996 |
| Black | Female | 1 | 0.917 | | 0.917 | 0.917 |
| Others | Female | 7 | 0.978 | 0.034 | 0.942 | 1.033 |
| All | Female | 98 | 0.962 | 0.0293 | 0.898 | 1.033 |

The first intriguing detail is that no group has an average greater than one. Women, in aggregate, do not match our intuition of a “girly” finger length ratio at all, averaging 0.962. Also, while the paper does claim to have found a statistically significant difference between men and women in general, for white women, they failed to find any significant difference at all, which is just as well, because their white women had more mannish hands than their white male counterparts. Oops!

The authors of the paper don’t give us the distribution for their data, but they do give the standard deviation, and so it is reasonable to assume that we can approximate it with a normal (bell-curve) distribution. The graph below shows what the two distributions look like:

As you can see, there is a lot of overlap. If we used finger length as a sex test, it would be right only 56.7% of the time. It’s only better than 75% accurate for people with finger length ratios of 1.02 and above, and only 1.6% of the population fall into that category. It’s only 90% accurate for people with a finger ratio of 1.06, which is a tiny 0.026% of the population.
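Here is a rough sketch of how figures like these can be estimated from the two fitted bell curves. It assumes each group is normally distributed with the mean and standard deviation from the “All Male” and “All Female” rows, and it weights each curve by its sample size. That weighting is an assumption of this sketch, so its output may differ slightly from the percentages quoted above.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and std dev."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Summary statistics from the "All Male" and "All Female" rows of the table.
MEN = {"n": 100, "mean": 0.952, "sd": 0.0272}
WOMEN = {"n": 98, "mean": 0.962, "sd": 0.0293}


def p_woman(ratio):
    """Estimated probability that someone with this 2D:4D ratio is a woman,
    weighting each normal curve by its group's sample size (an assumption)."""
    w = WOMEN["n"] * normal_pdf(ratio, WOMEN["mean"], WOMEN["sd"])
    m = MEN["n"] * normal_pdf(ratio, MEN["mean"], MEN["sd"])
    return w / (w + m)


for ratio in (0.89, 0.95, 1.02):
    print(f"ratio {ratio:.2f}: P(woman) = {p_woman(ratio):.1%}")
```

Near the two averages the estimate hovers close to a coin flip, which is what the heavy overlap predicts.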

Also, while the test can accurately identify a very small number of women, it can never accurately identify men. It is at its most clear-cut at a ratio of 0.89; 3 out of 5 people with that ratio (60%) are men.

So while the researchers for this paper did find an actual “statistically significant” difference between the finger ratios of men and women, in practice the difference is not one we can usefully apply to individuals.

Other researchers have examined finger-length ratios of smaller groups, including gays and lesbians and transsexuals. They, too, have found “statistically significant” differences, but we have no reason to expect that they will be any more useful, especially as the sample sizes for these groups are smaller and the differences observed more subtle, as we will see in our final sex difference.

Brains, Bring Me Brains

Brains are a favorite choice of sex-differences researchers, so let’s pick one random study of brains, namely Male-to-Female Transsexuals Have Female Neuron Numbers in a Limbic Nucleus, by Kruijver, et al. Here’s a summary of their results (see Note 3):

| Group | Subjects | Mean BSTc (× 10³) | Std Dev |
|---|---|---|---|
| Cissexual Gay Men | 9 | 34.6 | 10.20 |
| Cissexual Straight Men | 9 | 32.9 | 9.00 |
| Transsexual Women | 6 | 19.6 | 8.08 |
| Cissexual Women | 10 | 19.2 | 7.91 |

A naïve view of this data is that transsexual women’s brains look a lot like cissexual women’s brains, and unlike the brains of cissexual men. We might also naïvely suppose from these results that we could have a “brain femininity” test and use it to detect transsexual women (provided that we found killing them and dissecting their brains to make the measurement to be a good trade-off!).

You might be concerned that someone is making a generalization about the world’s hundreds of thousands of transsexuals (we can estimate at least 350,000 in the USA and Europe alone) from comparing six transsexual brains, but let’s forget the issue of tiny sample sizes for now. Instead, we’ll assume that the average and standard deviation values are accurate, and come from data with a normal distribution.

If we examine the distributions, their probabilities look like the following:

Here, we can see that despite the mean BSTc count of the women’s brains being almost half that of men, there is still considerable overlap. If we just compare cissexual men with cissexual women, and say that BSTc counts above 26.5 × 10³ are male and those below female, we find that 23.9% of men have “female” brains, and 17.8% of women have “male” brains.
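Both percentages are tail areas of the assumed normal curves on the wrong side of the 26.5 threshold. A minimal sketch, using the straight cissexual men row (mean 32.9, standard deviation 9.00) and the cissexual women row (mean 19.2, standard deviation 7.91), with counts in thousands:

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and std dev."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


THRESHOLD = 26.5  # BSTc count, in thousands

# Men whose count falls below the threshold (on the "female" side).
men_below = normal_cdf(THRESHOLD, mean=32.9, sd=9.00)
# Women whose count falls above the threshold (on the "male" side).
women_above = 1 - normal_cdf(THRESHOLD, mean=19.2, sd=7.91)

print(f"men below threshold:   {men_below:.1%}")
print(f"women above threshold: {women_above:.1%}")
```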

But the picture changes a lot when you look at the distributions in a context where everyone comes from the same population. As before, we’ll use an imagined sample of one million people from the general population of the USA. In that sample, we’d expect to see about 510,000 cissexual women, 444,000 straight cissexual men, 45,000 cissexual gay men, and a mere 500 transsexual women (plus 500 transsexual men). When we scale the curves proportionately, the bell curve for gay men (about 1 in 10 of the population) drops a good deal, but the curve for transsexual women (about 1 in 1000) flatlines. (I didn’t color the x-axis red; that’s the line for transsexual women.)

Now we can see why a test for transsexualism based on measuring BSTc is not possible (other than that annoying brain-dissection requirement). If we have someone who was born physiologically male who comes to us, survives a brain examination, and appears to have a very “girly” BSTc count in the 14 × 10³ range, there will be about 2,175 straight cissexual men with a similar value, against only 19 transsexual women. In other words, if you used BSTc measurements to test for transsexualism, you’d only be right on someone with a brain this “girly” a mere 0.76% of the time. Even worse, fewer than 25% of transsexual women’s brains would score this (or more) “girly”; those with less girly brains are even tougher to correctly identify with our putative transsexuality test.
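The base-rate effect can be sketched by scaling each group's normal curve by how many people of that group we expect per million. The population counts and summary statistics are the ones used in this section. Counting gay cissexual men among the people who could be mistaken for transsexual women is a choice made in this sketch, so the exact percentage it prints need not match the figure quoted above, but it lands in the same under-1% territory.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and std dev."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# group: (people per million, mean BSTc, std dev), BSTc in thousands.
GROUPS = {
    "straight cis men": (444_000, 32.9, 9.00),
    "gay cis men": (45_000, 34.6, 10.20),
    "trans women": (500, 19.6, 8.08),
}


def expected_counts(bstc):
    """Expected people per unit of BSTc (per million population) near `bstc`."""
    return {
        name: n * normal_pdf(bstc, mean, sd)
        for name, (n, mean, sd) in GROUPS.items()
    }


counts = expected_counts(14.0)
share_trans = counts["trans women"] / sum(counts.values())

for name, value in counts.items():
    print(f"{name}: about {value:,.0f}")
print(f"share who are trans women: {share_trans:.2%}")
```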

So, even if these researchers have found a statistically significant difference in the brains of transsexual women from their tiny sample of six women, in practice, it is of little use to individual transsexuals.

What Would a Useful Sex Difference Look Like?

If you’re hoping for a sex difference that might be helpful for some kind of test applied to individuals, let me give you some rules of thumb. For gay people (or any group that is about 1 in 10 of the population), you want a difference where there isn’t too much overlap between the two distributions. Since the larger the standard deviation, the greater the overlap, we can set a rule for the maximum standard deviation that will avoid excessive false positives. Find the distance between the average values for men and women (or whatever groups we’re distinguishing between), and divide it by 4. The standard deviation should be no larger than the result. For example, if men average 25 and women average 33, the groups are 8 apart and the maximum workable standard deviation is 2. If the standard deviation is larger than that, you’ll have excessive false positives (more than about 25%) from opposite-sexed heterosexuals in the tail of their distribution.

For a sex-difference-based test for transsexualism (or any group that is about 1 in 1000 of the population), the sex difference needs to have an even smaller standard deviation. To get the rough value you need, divide the distance between the two averages by 7 instead of 4. But the truth is, that would give you two distributions that barely touch at all. Very, very few sex differences are going to be that clear cut. For that reason, you can probably count on there never being a useful and reliable test for transsexualism based on sex differences.
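Both rules of thumb are the same one-line calculation with a different divisor. A small sketch, where the function name `max_workable_sd` is made up for illustration:

```python
def max_workable_sd(mean_a: float, mean_b: float, divisor: float) -> float:
    """Largest standard deviation the rule of thumb allows: the gap between
    the two group averages divided by 4 (1-in-10 group) or 7 (1-in-1000)."""
    return abs(mean_a - mean_b) / divisor


# The worked example from the text: averages of 25 and 33.
print(max_workable_sd(25, 33, divisor=4))  # rule for a 1-in-10 group
print(max_workable_sd(25, 33, divisor=7))  # rule for a 1-in-1000 group
```

For comparison, the BSTc averages for straight cissexual men and cissexual women are 13.7 apart, so the divide-by-7 rule would ask for a standard deviation under 2, against the 8 to 9 actually reported.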

A Test That Works

Despite all that I have said, there is a pretty accurate test for transsexualism, based on sex differences. Simply tell your would-be transsexual about this sex difference, and watch them. If you see them frantically measuring their fingers, or wishing they could scan their brains, they’re probably transsexual. It’s probably not that accurate, but it’s better than we’d do from actually scanning their brains or measuring their fingers.

Notes

1. I used the median for the height data because the average and standard deviations are skewed by the long tail at the left of the distribution. The tail is presumably caused by the various disorders that can cause stunted growth.
2. In their original paper, Pearson & Schipper seem to come up with a different total, and, more importantly, their standard deviation lacks precision. By using some tricks with the data they do give, I calculated a standard deviation with three digits of precision rather than two.
3. The paper presents SEM values (standard error of the mean), rather than standard deviation, but we can convert between the two using the formula stddev = SEM × sqrt(sampleSize).
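The SEM conversion is a one-liner in code. The input values here are purely illustrative and are not taken from the paper:

```python
import math


def sem_to_stddev(sem: float, sample_size: int) -> float:
    """Convert a standard error of the mean back to a standard deviation."""
    return sem * math.sqrt(sample_size)


# Hypothetical example: an SEM of 3.0 measured on 9 subjects.
print(sem_to_stddev(3.0, 9))
```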