If, like me, you still find yourself shaking your fist at the abysmal numbers of women speakers at your average STEM conference, and you enjoy a bit of geeking out over math, then today’s post is going to make your toes curl in delight.

A few months back, my mathematician friend Greg Martin at the University of British Columbia invited my feedback on a paper he was writing on how to increase the number of women speaking at math conferences. One bit in particular jumped out at me:

He used statistical probability to disprove the notion that underrepresentation of women on any given speakers’ list “just happens.”

(Picture me bouncing in my seat with glee. Take that, sexist STEM-ers!)

If you’ve ever followed a debate about why an event has so few women speakers, you’re likely familiar with the argument that gender was not a factor (AKA “we chose the best speakers, regardless of gender”), and that speakers were chosen in an unbiased fashion, on merit alone. Well, if I understand the math correctly, the odds of that assertion being true are next to nothing.

It delights me to no end that Greg has found a way to use the master’s tools to dismantle the master’s house. So naturally, I asked if he’d be willing to share a little more about how he arrived at his calculations.

What follows is, to be sure, a fairly technical read—but it’s an accessible and engaging one, too. I hope you’ll follow along, even if math isn’t your most cherished subject, and share this with your favourite stats nerds.

Over to Greg:

In a recent article, I made the following statement regarding the genders of plenary speakers at the International Congress of Mathematicians:

The appropriate null hypothesis is “the ICM speakers were selected independently of gender from among the pool of people who have received PhDs in mathematics in the last 25 years”. Under our conservative 24% assumption from above, the observation of nineteen male plenary speakers and one female plenary speaker rejects ($p < 0.031$) this null hypothesis. Indeed, it is 18 times as likely that we would have seen an “overrepresentation” of female plenary speakers (five or more, since $0.24 \times 20 = 4.8$) by chance than to have seen at most one.

Clearly the gist of this statement is that having one female speaker out of twenty is really unlikely to be the result of chance or bad luck. But how were those exact numbers, 0.031 and 18, determined? How can we calculate analogous numbers in similar situations? It turns out not to be that hard, once we know the formula to use; the purpose of this post is to supply that formula and give some examples of how to use it.

Let’s start by examining an idealized situation divorced from social issues. Imagine that we have a giant bag full of marbles; the marbles come in two colours, orange and green (we like the orange ones better), which are well mixed together. We are going to take 50 marbles out of the bag, one by one, and see how many orange marbles we end up with.

Of course, without knowing whether orange marbles are common, rare, or somewhere in between, we have no idea how many orange marbles to expect! Let’s say that we know that 40% of the marbles are orange and 60% of them are green. On average, we’d expect to get $50 \times 0.4 = 20$ orange marbles in our selection of 50 marbles; but of course, we might get lucky and get more than 20 orange marbles, or we might get unlucky and get fewer than 20 orange marbles. How likely is it to end up with, say, only 13 orange marbles?

As it happens, the probability of ending up with exactly 13 orange marbles, if we draw 50 marbles from a bag containing 40% orange marbles and 60% green marbles, is [1]

$$\binom{50}{13} (0.4)^{13} (1-0.4)^{50-13}.$$

Here, the symbol $\binom{50}{13}$ is a “binomial coefficient” [2] (sometimes written as ${}_{50}C_{13}$, and pronounced “50 choose 13”).

While it’s not impossible to calculate this on our own, we might as well use WolframAlpha to help out: if we type in

Binomial[50,13] * (0.4)^13 * (1-0.4)^(50-13) ,

we receive the answer $0.0147378\ldots$, telling us that the chance of getting exactly 13 orange marbles is about 1.47%. We can also go to Stat Trek’s online binomial calculator and enter 0.4, 50, and 13 in the first three fields; we see the answer 0.0147378… appear in the fourth field.

The general version of the above situation is: we have $n$ independent opportunities for success or failure (in the example above, $n$ was 50, and “success” meant drawing an orange marble while “failure” meant drawing a green marble). In each opportunity, the probability of success is some number $p$ (above, $p = 0.4$). If we are interested in a certain number, $k$, of successes (above, $k$ was 13), then the probability of succeeding exactly $k$ out of $n$ times is given by the formula [3]

$$\binom{n}{k} p^k (1-p)^{n-k}.$$
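For readers who would rather use a programming language than an online calculator, here is a minimal Python sketch of the same calculation. The function name `binomial_pmf` is made up for this illustration; `math.comb` is the standard-library binomial coefficient (Python 3.8 or later).

```python
from math import comb


def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent tries,
    each succeeding with probability p: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# The marble example: exactly 13 orange marbles out of 50 when 40% are orange.
exactly_13 = binomial_pmf(50, 13, 0.4)
print(exactly_13)  # should agree with the 0.0147378... reported by WolframAlpha
```

Changing the three arguments is all it takes to reuse the function for any other number of tries, number of successes, or success probability.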

Now, suppose we suspected that some funny business was going on. For example, maybe our housemate loves orange marbles, and we think that she snuck around one night and pulled a bunch of orange marbles out of the bag before we started taking our 50 marbles. (I guess the bag is so huge that emptying it out and counting all the marbles it holds is out of the question.) She denies it, however, saying “sometimes you get few orange marbles by random chance”. What should you believe?

It seems unlikely to get only 13 orange marbles out of 50 (if there really are still 40% orange marbles in the bag). On the other hand, any specific number of orange marbles is pretty unlikely. Getting exactly 20 orange marbles is the most likely outcome, as we remarked above, but even that has less than an 11.5% chance of happening. [4] So instead of asking how likely it is to get exactly 13 orange marbles, standard procedure is to ask how likely it is to get at most 13 orange marbles (in other words, how likely it is to get a result this extreme or even more extreme).
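The claim that even the most likely outcome is individually rare can be checked with a short Python sketch that lists the probability of every possible count, from 0 to 50 orange marbles, and picks out the largest. The helper name `binomial_pmf` is hypothetical, chosen for this illustration.

```python
from math import comb


def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 50, 0.4

# One probability for each possible number of orange marbles, 0 through 50.
probabilities = [binomial_pmf(n, k, p) for k in range(n + 1)]

# The count with the highest probability, and how likely that count is.
most_likely = max(range(n + 1), key=lambda k: probabilities[k])
print(most_likely, probabilities[most_likely])
```

Because every single count has a small probability, comparing "exactly 13" against "exactly 20" says little; comparing cumulative ranges of outcomes is more informative.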

There’s no secret here: to get the probability of obtaining at most 13 orange marbles out of 50 (assuming that the bag really does contain 40% orange marbles), we just add up the probability of obtaining 0 orange marbles, 1 orange marble, 2 orange marbles, and so on up to 13 orange marbles: [5]

$$\binom{50}{0}(0.4)^{0}(1-0.4)^{50} + \binom{50}{1}(0.4)^{1}(1-0.4)^{49} + \cdots + \binom{50}{13}(0.4)^{13}(1-0.4)^{37}.$$

Fortunately WolframAlpha can do this for us, if we enter [6]

the sum of Binomial[50,j] * (0.4)^j * (1-0.4)^(50-j) from 0 to 13 .

The Stat Trek binomial calculator also calculates this for us (the answer appears in the third field from the bottom). Either way, we obtain 0.0279883…; so the chance of getting at most 13 orange marbles is less than 2.8%.
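The same sum is a one-liner in Python. This is only a sketch; the function name `prob_at_most` is invented for the example.

```python
from math import comb


def prob_at_most(n: int, k: int, p: float) -> float:
    """Probability of k or fewer successes in n independent tries,
    found by adding up the probabilities of exactly 0, 1, ..., k successes."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


# At most 13 orange marbles out of 50 when 40% of the bag is orange.
at_most_13 = prob_at_most(50, 13, 0.4)
print(at_most_13)      # should agree with the 0.0279883... found above
print(1 / at_most_13)  # roughly how many honest draws per one result this extreme
```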

So… what should we believe about our suspicious housemate? Well, there are no guarantees in any of this: while 2.8% is a pretty small probability of getting at most 13 orange marbles out of 50 if everything is on the up and up, it can still happen; in fact, it happens about 1 time in 36. But an outcome that improbable is hard to put down to chance alone. If this were a statistics paper, our “null hypothesis” would be that nothing funny had happened to the bag of marbles, and we would probably reject that null hypothesis since the standard threshold is 5% (the ubiquitous “$p = 0.05$ level” of statistical tests). If the only other reasonable hypothesis, in your mind, was that your housemate was stealing orange marbles during the night, then perhaps you should believe that.

More often, though, we get to make similar observations many different times. (I’m not sure what’s going on in this world we’re inventing, but maybe we get a well-proportioned marble delivery every week to top up the bag of marbles, and we draw a 50-marble allotment every week…?) If we consistently see extreme results like this, then we become convinced that something fishy is happening. So each individual calculation becomes a piece of data that we can collect to see if there is a larger, systemic pattern.
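To get a feel for how quickly "just chance" stops being believable when the same extreme result keeps recurring, here is a small Python sketch. It rests on an assumption that is not spelled out above: the weekly draws are independent of one another and each comes from a bag that really is 40% orange, so the weekly probabilities multiply. The choice of one to four weeks is arbitrary, and `prob_at_most` is a made-up helper name.

```python
from math import comb


def prob_at_most(n, k, p):
    """Chance of k or fewer successes in n independent tries."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


# Chance that one honest 50-marble draw yields at most 13 orange marbles.
single_week = prob_at_most(50, 13, 0.4)

# Assumption: weeks are independent, so the chance that *every* week
# shows a result this extreme is the single-week chance multiplied by itself.
for weeks in range(1, 5):
    all_weeks = single_week**weeks
    print(f"{weeks} week(s) in a row: about 1 chance in {1 / all_weeks:,.0f}")
```

Real data will rarely be this tidy (some weeks will look extreme and others will not), but the sketch shows why a consistent pattern is much stronger evidence than any single observation.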

From a sufficiently abstract perspective, examining the gender of speakers at a STEM conference is the same as looking at how many marbles of each colour we get from the bag. The collection of all people who might have been invited to this conference is our bag of marbles; female speakers are our orange marbles (our “successes”) and male speakers are our green marbles (our “failures”, but only in the context of examining appropriate representation of women in STEM, that is!). In a world where gender was unrelated to being invited to STEM conferences, we would expect the proportion of female speakers to be (more or less, since things are always a little bit random) the same as the proportion of female practitioners in the field of the conference.

Of course, knowing or estimating that latter proportion can be difficult. In the quote that started this post, I used the estimate 24% (which I called a conservative estimate) for the proportion of women in research mathematics; my reasoning was based on data of PhD graduates in the US over the past 25 years, where women earned at least 24% of the mathematics PhDs in all 25 of those years (and sometimes up to 34%). [7]
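One way to see why 24% counts as a conservative choice is to redo the calculation for the observation in the quoted passage (at most one woman among twenty plenary speakers) with larger assumed proportions. The following Python sketch does that; 0.24 and 0.34 are the two ends of the range mentioned above, while 0.29 is simply a midpoint picked for illustration, and `prob_at_most` is a hypothetical helper name.

```python
from math import comb


def prob_at_most(n, k, p):
    """Chance of k or fewer successes in n independent tries."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


# 0.24 and 0.34 come from the text; 0.29 is an illustrative midpoint.
assumed_proportions = [0.24, 0.29, 0.34]

# Chance of seeing at most 1 woman among 20 speakers under each assumption.
chance_of_at_most_one = {p: prob_at_most(20, 1, p) for p in assumed_proportions}

for p, chance in chance_of_at_most_one.items():
    print(f"assumed proportion {p:.0%}: P(at most 1 woman in 20) = {chance:.4f}")
```

The larger the assumed proportion of women in the pool, the less likely it is to see so few by chance, so using the smallest figure in the range gives the "selected independently of gender" hypothesis its best shot.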

The particular conference I was discussing had 20 plenary speakers, only one of whom was female. The above formulas tell us that the probability of having at most one female speaker by chance is

$$\binom{20}{0}(0.24)^{0}(1-0.24)^{20} + \binom{20}{1}(0.24)^{1}(1-0.24)^{19} = 0.0302366\ldots,$$

or a tiny bit over 3%. On the other hand, if gender bias were not present in the academic system, then we would expect 20-speaker conferences to have, on average, $20 \times 0.24 = 4.8$ female speakers. The probability of having women “overrepresented” (that is, of there being at least 5 female speakers) is

$$\sum_{j=5}^{20} \binom{20}{j}(0.24)^{j}(1-0.24)^{20-j} = 0.5439357\ldots,$$

or over 54%. Since $0.5439357 / 0.0302366 \approx 17.99$, it is indeed almost 18 times more likely (under our assumptions) to have an “overrepresentation” of female speakers than to have at most one.
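The two conference probabilities and their ratio can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text (20 independently chosen speakers, each female with probability 0.24); the function and variable names are invented for the example.

```python
from math import comb


def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


speakers, proportion_women = 20, 0.24

# At most one female speaker: add the chances of exactly 0 and exactly 1.
at_most_one = sum(binomial_pmf(speakers, j, proportion_women) for j in range(0, 2))

# "Overrepresentation": add the chances of exactly 5, 6, ..., 20 female speakers.
at_least_five = sum(
    binomial_pmf(speakers, j, proportion_women) for j in range(5, speakers + 1)
)

ratio = at_least_five / at_most_one
print(at_most_one, at_least_five, ratio)
```

Swapping in a different number of speakers, a different observed count, or a different assumed proportion gives the analogous numbers for any other conference.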

Although it doesn’t show all the numbers corresponding to its calculations, the Conference Diversity Distribution Calculator gives a nice visual representation of how many female speakers we should expect at conferences in a bias-free world.

Footnotes

[1] Coders and other detail-oriented folks might raise the following nitpick: if the first marble we draw is orange, say, then the percentage of orange marbles remaining in the bag is then actually slightly less than 40%! While certainly true, the slight change is negligible if our huge bag contains a very large number of marbles. It is common practice to ignore this detail when “selecting without replacement”, as a statistician would say, from a huge number of possibilities.

[2] For number geeks and other interested rockstars: in general, the binomial coefficient $\binom{n}{k}$ is a shorthand for a quotient of factorials:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Here the factorial $n!$ is just the product of all the whole numbers from 1 up to $n$, that is, $n! = 1 \times 2 \times \cdots \times (n-1) \times n$. See http://en.wikipedia.org/wiki/Combination for more information.

[3] See http://www.mathsisfun.com/data/binomial-distribution.html for a from-scratch explanation of why this is the correct formula, and http://www.statisticshowto.com/binomial-distribution-formula for a reference to a more concise version.

[4] As we can see by entering Binomial[50,20] * (0.4)^20 * (1-0.4)^(50-20) into WolframAlpha, or the numbers 0.4, 50, and 20 into the Stat Trek binomial calculator.

[5] In general, the probability of getting at most $k$ successes out of $n$ tries, if each success occurs with probability $p$, is

$$\binom{n}{0}p^{0}(1-p)^{n} + \binom{n}{1}p^{1}(1-p)^{n-1} + \cdots + \binom{n}{k}p^{k}(1-p)^{n-k},$$

which we can write using summation notation (or Sigma notation) as follows:

$$\sum_{j=0}^{k} \binom{n}{j} p^{j} (1-p)^{n-j}.$$

The probability of getting at least $k$ successes out of $n$ tries equals

$$\sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}.$$

[6] Or, for lovers of syntax, Sum[ Binomial[50,j] * (0.4)^j * (1-0.4)^(50-j), {j,0,13} ] .

[7] As a side note: among all critical responses to the posting of that article, I found that this one detail (the 24% figure) generated the most criticism, much of it (in my opinion) demonstrably misguided. Some people complained that the “correct” figure to use should be much lower, perhaps around 10-12%, because that is the proportion of women among tenured faculty at top-tier mathematics departments. There are at least two flaws in this argument. First, it assumes that being hired and granted tenure at top-tier institutions is done equitably for women and men, but there is ample evidence of systemic gender bias in hiring, one that is more pronounced at top-tier institutions; in other words, it relies upon the assumption that men at top-tier institutions are generally stronger than women at next-tier institutions, which is a dubious assumption. Second, my article explicitly discusses all of the obstacles for women in STEM, even after they earn their PhD (implicit bias in evaluations, double standards for behaviour and self-promotion, the impostor phenomenon, and all the usual factors known to contribute to the leaky pipeline), and hence comparing the proportion of female PhD-earners to the proportion of female conference speakers is reasonable. In this context, claiming that the lower proportion of women in top-tier tenured positions should be the measuring stick for representation at conferences is, quite simply, using a symptom of the very problem we are discussing to justify perpetuating the problem.

Update, 21 October 2015: I interviewed Greg for Quartz, and you can read that interview here (or here, in The Atlantic).

Photo credit: PearlsofJannah on Flickr.