by Yohan J. John

Probability theory is a relative newcomer in the history of ideas. It was only in the 19th century, two centuries after Isaac Newton ushered in the scientific revolution, that thinkers began to systematize the laws of chance. In just a few generations, the language of probability has seeped into popular discourse — a feat that older branches of mathematics, such as calculus, have not quite managed. We encounter numbers that express probabilities all the time. Here are a few examples:

“With a 6-sided die, the probability of rolling a 5 is 1 in 6, or around 16.7%.”

“The chance of rain tomorrow is 60%.”

“The chance of Bernie Sanders winning the 2016 US Presidential Election is 15%.”

What exactly do these numbers — 16.7%, 60%, 15% — mean? Does the fact that we use the words 'chance' or 'probability' for all three suggest that dice, rainfall, and elections have something in common? And how can we assess the accuracy and usefulness of such numbers?

Mathematicians, scientists and philosophers agree on the basic rules governing probability, but there is still no consensus on what probability is. As it turns out, there are several different interpretations of probability, each rooted in a different way of looking at the world.

Before we get to the interpretations of probability, let us review the basic mathematical rules that any interpretation must conform to. Let's imagine we have a set of possible events: S = {A, B, C,…Z}. The set S is called a possibility space or a reference class. The probability of event 'A' is symbolized by P(A), and its value must be between 0 and 1. A value of 0 means the event cannot occur, and a value of 1 means the event definitely occurs. For the set of possible events, P(S) = 1. This means that some event from the set S will occur, and nothing outside the set can occur. In other words, the probability that something will occur must be 100%. Finally, for any two events 'A' and 'B' that cannot happen simultaneously, the probability that either one or the other will occur is just the sum of the two individual probabilities, so P(A or B) = P(A) + P(B).
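As a rough illustration of these rules, here is a minimal Python sketch. It assumes a possibility space made of the six faces of a fair die; the names S, prob and obeys_rules are my own illustrative choices and do not come from any library.

```python
from fractions import Fraction

# Hypothetical possibility space: the six faces of a fair die.
S = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, space):
    """Probability of an event, given as a set of outcomes drawn from the space."""
    return sum(space[outcome] for outcome in event)

def obeys_rules(space):
    """Check that every probability lies in [0, 1] and that P(S) = 1."""
    in_range = all(0 <= p <= 1 for p in space.values())
    total_is_one = sum(space.values()) == 1
    return in_range and total_is_one

A = {1, 2}   # "roll a 1 or a 2"
B = {5}      # "roll a 5"; A and B cannot happen simultaneously

print(obeys_rules(S))
# Additivity: P(A or B) equals P(A) + P(B) for mutually exclusive events.
print(prob(A | B, S) == prob(A, S) + prob(B, S))
```

Exact fractions are used instead of floating-point numbers so that the check P(S) = 1 is not disturbed by rounding.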

The Stanford Encyclopedia of Philosophy lists 6 major interpretations of probability, but for our purposes it makes sense to look at the three that are most common and easily understood. These are

Classical probability — a way to quantify “balanced ignorance”

Frequentist probability — a way to quantify observations of a random process

Subjective probability — a way to quantify subjective degree of belief

Classical Probability

The earliest conception of probability is the classical interpretation, which is based on an argument called the principle of insufficient reason or the principle of indifference: it states that if we have no reason to distinguish among a group of mutually exclusive events, then each of them should be assigned equal probability. Let's say we have an unpredictable process, like the rolling of a die. We can list six events or outcomes that appear symmetrical: rolling any of its six faces. Since we have no reason to doubt that these outcomes are symmetrical, we assign equal probability to each. So the probabilities of rolling 1, 2, 3, 4, 5, or 6 are all equal to 1/6, or roughly 16.7%. In the simpler case of a coin flip, there are two symmetrical outcomes, heads and tails, so each is assigned 50% probability.
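The principle of indifference is simple enough to write down as a short Python sketch. The function name indifference is a hypothetical one of my own, and the sketch assumes the outcomes passed in are distinct and mutually exclusive.

```python
from fractions import Fraction

def indifference(outcomes):
    """Assign equal probability to each of a list of distinct, mutually exclusive outcomes."""
    outcomes = list(outcomes)
    return {o: Fraction(1, len(outcomes)) for o in outcomes}

die = indifference(range(1, 7))
coin = indifference(["heads", "tails"])

print(die[5], float(die[5]))   # the exact fraction, then its decimal value
print(coin["heads"])
```

Note that all the work is done by the list of outcomes handed to the function: the code has no opinion about which outcomes belong on the list.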

Classical probability can seem quite straightforward, especially for simple objects like dice and coins. But this simplicity can often obscure an important aspect of probability: the process of delineating the boundaries of the possibility space. In the case of coin-flipping, several outcomes that are definitely possible are left out of consideration. The coin might land on its edge. It might slip through a crack in the floor and get lost. An anti-gambling campaigner might seize the coin in mid-air, ending the game. Even if we only count the instances when the coin lands heads up or tails up, we can read the outcome of the coin toss in more elaborate ways. We could, for instance, ask about the angle made by the coin with respect to some fixed line, such as the direction of north. Our outcome set would consist of possibilities like “Heads oriented at 42 degrees”, “Heads oriented at 45 degrees”, “Tails oriented at 76 degrees”, and so on.

These examples may seem silly, given the way we typically employ coin-flipping in everyday life. But this is precisely the point. Probability is highly dependent on what we choose to count and what we choose to ignore in a given phenomenon. This was vividly revealed by a popular probability problem that was doing the rounds a few years ago. This is the problem: “I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?” A natural reaction is to ask, “What does Tuesday have to do with anything?” You might think (as I initially did) that Tuesday is a red herring, and that the answer must be 50% — just the probability of having a boy, since we typically assume each birth in a family is independent of the others, with even probabilities of having a girl and having a boy. But probability experts generally agree that Tuesday gives us a clue about what to count.

The best way to solve the problem is to lay out all the possibilities, and then count the ones that fit. There are 4 possible ways of having 2 children: {boy-boy, boy-girl, girl-boy and girl-girl}. Since there are 7 days in a week, there are 49 (7×7) possible pairs of birth days for the 2 children. We can represent the 4×49 possibilities in a set of tables (shown on the right).

Now all we have to do is count the boxes that line up with the terms of the problem. The red squares are outside the possibility space, since they are not consistent with our initial information: that one of the two children is a boy born on Tuesday. The 14 yellow squares are all in the possibility space, but are not two-boy possibilities. Only the 13 green squares are two-boy possibilities. So the answer to the question is 13/(13+14) or 13/27, roughly 48.1%. The strong dependence of the answer on subtleties inherent in the question must be particularly disconcerting for those who think of probability as something that exists 'out there' in the world, just as mass, length or temperature does.
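The counting can also be done by brute force. The Python sketch below enumerates all 196 families and keeps those consistent with the statement. It assumes that every combination of sex and birth day is equally likely, and that "one is a boy born on a Tuesday" means "at least one"; the variable names and the choice of 1 to stand for Tuesday are arbitrary choices of mine.

```python
from fractions import Fraction
from itertools import product

SEXES = ("boy", "girl")
DAYS = range(7)    # days of the week numbered 0..6
TUESDAY = 1        # arbitrary label; any fixed day gives the same count

# One child is a (sex, birth day) pair; a family is an ordered pair of children.
children = list(product(SEXES, DAYS))          # 14 kinds of child
families = list(product(children, repeat=2))   # 196 = 4 x 49 families

def has_tuesday_boy(family):
    """True if at least one child is a boy born on Tuesday."""
    return any(child == ("boy", TUESDAY) for child in family)

def two_boys(family):
    """True if both children are boys."""
    return all(sex == "boy" for sex, _ in family)

possible = [f for f in families if has_tuesday_boy(f)]
favourable = [f for f in possible if two_boys(f)]

answer = Fraction(len(favourable), len(possible))
print(len(possible), len(favourable), answer)   # 27 13 13/27
```

Changing has_tuesday_boy to a different reading of the sentence (for example, "the elder child is a boy born on Tuesday") changes the count, which is the point of the puzzle.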

Frequentist probability

The classical interpretation of probability proved to be quite effective in simple situations such as rolling dice or flipping coins, where symmetrical outcomes are easy to identify. But what does one do when the outcomes of some process don't seem to possess any natural symmetry? For experimental scientists, the answer is obvious: perform experiments! From the frequentist perspective, probability only has meaning in the context of a repeatable random process. The frequentist probability of a particular outcome is its long-term relative frequency. As with classical probability, we define a possibility space — the set of outcomes of the random process. But instead of simply assigning equal probabilities to each outcome, we run the process over and over again, counting the number of times each outcome is observed. If we run the process N times, where N is a sufficiently large number of trials, and we observe that outcome 'A' occurs M times, then the probability of A is M/N. If we flip a fair coin 1000 times, we expect that it will land heads up around 500 times. So the probability of heads is 500/1000, which is 0.5 or 50%. It is easy to apply this procedure to any process that has a well-defined set of outcomes, however asymmetrical those outcomes appear.
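The frequentist recipe can be mimicked with a simulated coin. This is only a sketch: it uses Python's pseudo-random number generator as a stand-in for a physical coin, and the function name relative_frequency, the p_heads parameter and the fixed seed are my own assumptions.

```python
import random

def relative_frequency(n_trials, p_heads=0.5, seed=0):
    """Simulate n_trials coin flips and return M/N, the fraction that land heads."""
    rng = random.Random(seed)   # fixed seed so that the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials

# Watch the estimate settle down as the number of trials N grows.
for n in (10, 100, 1000, 100_000):
    print(n, relative_frequency(n))
```

With small N the estimate can sit some distance from 0.5; only with many trials does it reliably hover near the value built into the simulation.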

But the practical applicability of frequentist probability masks a few conceptual problems. When we define probability as the long-term relative frequency, what exactly do we mean by long-term? How long is long enough? How many trials is a sufficiently large number? Consider flipping coins. If we flip a fair coin just once, we only get one outcome. Let's say the coin lands heads. We might then assign a probability of 1 to heads and zero to tails. But this flies in the face of everything we know about coin flipping. Is 100 flips enough? We might get 60 heads, even if the coin is fair. Should we then say that the probability of heads is 60%? What is the 'true frequency' of heads?

The rule of thumb when working in a frequentist mode is that the more trials we perform, the closer we approach the 'true frequency'. In the limit of infinite trials, we converge exactly upon the 'true frequency'. But in the real world we cannot do anything infinitely many times, and so we are left having to pick an appropriate number of trials based on a variety of practical considerations. Real-world estimates of relative frequency also open us up to an interesting problem. If I obtain 60 heads from 100 coin tosses, should I accept that the coin is fair? At what point do I begin to suspect that the coin is unfair? Dealing with issues of this sort led to the emergence of frequentist hypothesis testing, and to the concept of statistical significance. Despite its periodic bouts of notoriety, significance testing remains a key tool of experimental science.
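To give a flavour of what a significance test computes, here is a sketch of an exact two-sided binomial test for the 60-heads-in-100 example. It is one simple version of the idea, not the only way such a test is set up, and the function names are my own.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k, n):
    """Probability, under a fair coin, of a head count at least as far from n/2 as k is."""
    distance = abs(k - n / 2)
    return sum(binomial_pmf(i, n) for i in range(n + 1)
               if abs(i - n / 2) >= distance)

p = two_sided_p_value(60, 100)
print(round(p, 4))
```

The result comes out at roughly 0.057: a fair coin produces a result at least this lopsided a little more than one time in twenty. Whether that is grounds for suspicion depends on a threshold that we have to choose for ourselves.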

Subjective probability

Subjectivism is based on the idea that people's beliefs are not black-or-white positions — there are always shades of gray, known as 'degrees of belief' or 'credences'. Subjectivists hold that probability is the appropriate way to quantify degrees of belief. So if you are completely sure about something, you believe in it with 100% probability. If you are less sure, you might instead believe in it with only 60% probability. Subjectivism has had limited success as a theory of how human beliefs actually work, but it has become a very popular normative theory for how human beliefs should work. A set of rational beliefs, according to this framework, must be associated with degrees of belief that obey the laws of probability.

Many statements about probability fit into both frequentist and subjective interpretations. The statement “The chance of rain tomorrow is 60%” can mean that the subjective belief of the meteorologist is 60%. But it can also mean that the weather conditions have been divided into a possibility space, so that the meteorologist just counts the number of days with similar conditions, and the number of days with similar conditions that preceded a day of rain, and estimates a relative frequency. A frequentist 'translation' of our weather statement might be something like: “the frequency with which a rainy day follows a day with weather conditions like today's is 60%”. Frequentist techniques can be used to determine subjective beliefs: relative frequencies can become degrees of belief.

When people predict a unique future event — like an election result — using probabilities, it is much harder to find a clear frequentist connection. A particular day may be a unique stretch of time that never repeats itself, but it is not hard to slot it into a reference class of similar days for purposes of weather forecasting. Categorizing elections is somewhat more tricky. Consider the statement “The chance of Bernie Sanders winning the 2016 US Presidential Election is 15%”. The number 15% does not mean that Bernie Sanders (or someone very much like Bernie Sanders) stood repeatedly in US presidential elections and won 15% of them. An election among particular candidates happens only once, and any analysis of previous elections is going to be very hard to translate into frequentist probabilities. When we encounter estimates of the probability of a unique future event, what we are really looking at is the subjective confidence of a media outlet, an expert, or perhaps even a bookmaker.

Bookmakers play a surprisingly important role in the theory of subjective probability. One of the most popular arguments for making your degrees of belief obey the laws of probability defines rationality in terms of betting. The Dutch Book Argument is based on a mathematical finding: if your beliefs do not obey the axioms of probability, then a Dutch book can be created against you — a set of bets on your beliefs in which you are guaranteed to lose money. Conversely, if your beliefs do obey the laws of probability, then no Dutch book exists for you. Beliefs that obey the probability laws are called coherent; the hallmark of rationality according to subjectivists is a coherent set of beliefs. Since degrees of belief can pertain to non-repeatable events like particular elections, we cannot check to see if our beliefs are correct using a frequentist experimental procedure.
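A toy Dutch book makes the argument concrete. The Python sketch below assumes an agent who regards a bet paying 1 unit on an outcome as fairly priced at their credence in that outcome, and a bookmaker who sells the agent one such bet on every outcome. The credences are made-up numbers and net_payoffs is a hypothetical helper of my own.

```python
from fractions import Fraction

def net_payoffs(credences, stake=1):
    """Agent buys one bet on each outcome at a price of credence * stake.

    Exactly one outcome occurs, and the bet on it pays out `stake`.
    Returns the agent's net gain for each outcome that could occur.
    """
    total_price = sum(c * stake for c in credences.values())
    return {outcome: stake - total_price for outcome in credences}

# Made-up credences over two mutually exclusive, exhaustive outcomes.
incoherent = {"rain": Fraction(3, 5), "no rain": Fraction(3, 5)}   # sums to 6/5
coherent = {"rain": Fraction(3, 5), "no rain": Fraction(2, 5)}     # sums to 1

print(net_payoffs(incoherent))   # a loss of 1/5 whichever outcome occurs
print(net_payoffs(coherent))     # breaks even whichever outcome occurs
```

If the credences summed to less than 1, the bookmaker would buy the bets from the agent instead of selling them, and the guaranteed loss would reappear with the roles reversed.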

Strangely, the accuracy of your beliefs has no bearing on whether they are deemed rational from a subjectivist point of view. You can believe in any sort of nonsense, as long as you assign your beliefs numbers that obey the laws of probability. Subjectivism typically makes contact with the real world through Bayesian inference, a mathematical procedure for updating your beliefs when new evidence arises. Bayesian methods are named after Bayes' theorem, a mathematical law that has been described as being “to the theory of probability what Pythagoras's theorem is to geometry”.

Bayes' celebrated theorem

Bayes' theorem has to do with conditional probability, which is the probability of one event or outcome, given that another has already occurred. If we are interested in the conditional probability of event A, and we know that event B has occurred, we write the conditional probability as P(A|B) (the vertical bar “|” is normally read as “given”, so P(A|B) is “the probability of A given B”). Bayes' theorem gives us a formula to calculate this conditional probability from other probabilities:

P(A|B) = P(B|A)P(A)/P(B)

Let's imagine we have a new medical test for a rare disease. Let's say the test is highly accurate, producing 99% positive results for people who have the disease, and 98% negative results for people who do not have the disease. Further, let's imagine that only 0.5% of the population has the disease. If someone tests positive for the disease, what is the probability that she has the disease? In other words, what is the probability of the disease, given that the person tested positive? The answer is calculated using Bayes' theorem as follows:

P(disease|positive) = P(positive|disease) × P(disease) / P(positive)

The probability of a positive test given that the person has the disease is 99%. The probability of the disease is 0.5%. The probability of a positive test requires adding up the accurate positive tests (99% of 0.5% of the population) and the false positives (2% of 99.5% of the population). Plugging these numbers in, we get

P(disease|positive) = 0.99 × 0.005 / (0.99 × 0.005 + 0.02 × 0.995) = 19.9% (approximately)

So even though the test is quite accurate, only around 20% of people who test positive will actually have the disease. This is because the baseline prevalence of the disease is only 0.5%, and the false positives (2% of the remaining 99.5% of the population) outweigh the true positives.
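The arithmetic is easy to reproduce. In the sketch below, the parameter names sensitivity (the 99% figure) and specificity (the 98% figure) are the conventional medical terms for these quantities; they are my labels, not ones used in the example above.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_pos = sensitivity * prior                  # P(positive|disease) x P(disease)
    false_pos = (1 - specificity) * (1 - prior)     # positives among the healthy
    return true_pos / (true_pos + false_pos)        # denominator is P(positive)

p = posterior(prior=0.005, sensitivity=0.99, specificity=0.98)
print(f"{p:.1%}")   # 19.9%
```

Raising the prior to, say, 0.5 while keeping the test unchanged pushes the answer above 98%, which shows how much of the result is driven by the baseline prevalence.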

In this example, Bayes' theorem is perfectly consistent with frequentist relative frequencies. We might say that the result of the medical test gives us a 20% degree of belief that the person has the disease, but we can also make a pure frequency-based inference: 20% of the people who test positive will actually have the disease.

Bayes' theorem moves from being a useful mathematical tool to the foundation of an entire system of epistemology — Bayesianism — when we plug in numbers that have not been derived from relative frequencies. In order to understand how the same mathematical formula takes on new meaning in the Bayesian subjective probability framework, we have to change the names of the terms. P(A) is now called the prior of A, which means the initial degree of belief in A. P(A|B) is the posterior: the degree of belief in A given that B is true. The term P(B|A)/P(B) represents the support that B provides for A. Even if one has not actually performed any experiments, one can still have priors for A and B, derived from other sorts of knowledge, or even from personal whim.
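In code, the Bayesian relabelling looks like the sketch below: a prior goes in, a posterior comes out, and the posterior can serve as the prior for the next piece of evidence. The example reuses the medical test numbers and assumes, purely for illustration, that a second positive test is independent of the first. The function name update is my own.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """Return the posterior P(A|B) from the prior P(A) and the two likelihoods of B."""
    # P(B): total probability of seeing the evidence, whether or not A holds.
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    support = likelihood_if_true / evidence   # the P(B|A)/P(B) term
    return support * prior

belief = 0.005   # prior degree of belief: here, the 0.5% prevalence
for test_number in (1, 2):
    belief = update(belief, 0.99, 0.02)   # a positive result each time
    print(test_number, round(belief, 3))
```

Nothing in the function cares where the prior came from. Starting from a prior chosen on a whim is mathematically just as legitimate as starting from a measured prevalence, though the resulting posterior will of course differ.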

According to Bayesian thinking, probabilities are assigned to statements or hypotheses, rather than to proportions of outcomes or events. After all, one does not typically believe in events per se, only in assertions about events. This represents a fundamental conceptual gulf between subjective and frequentist interpretations of probability. For a frequentist, assigning numbers to hypotheses is meaningless. A statement can either be correct or incorrect. What can a frequentist do with an assertion like “Hypothesis H is 75% true”? Hypothesis H is not a repeatable random process, so we cannot perform an experiment in which 75% of trials result in H being true.

Frequentism and Bayesianism seem to represent very different attitudes towards the world. Frequentists are most interested in testable, accurate predictions of the cumulative results of random experiments. Bayesians are most interested in constructing coherent belief systems that are internally consistent, and which reflect the existence of shades of belief. These two attitudes towards probability have given rise to competing statistical methods for experimental science. Frequentist approaches have long been standard, but in recent years Bayesian methods have become increasingly popular. Delving into the frequentist-versus-Bayesian debate might take us too far into the murkily technical zone where scientific methodology and philosophy intersect, so instead let us revisit an intriguing issue that is common to most interpretations of probability.

The reference class problem

Probability theory creates a way to talk about possibilities — whether they are events or hypotheses — using the tools of mathematics. But what is the relevant space of possibilities? As John Venn, one of the pioneers of probability theory, stated in 1876, “every single thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things”. The problem of deciding which class is relevant to a given question of probability is called the reference class problem. [1]

We saw a version of this issue when we looked at the 'boy born on Tuesday' problem. Intuitively, the probability of a boy being born doesn't depend on the day of the week, and therefore many people expect the answer to be 50%. But the correct answer — or at the very least, the one that experts seem to agree on — is 48.1%, and to arrive at this answer we had to take into account the days of the week. So the reference class implied by the question is not the class of families with two boys, but the subtly different class of families in which one of the two boys was born on a Tuesday. The question implies a particular reference class. In other words, solving the problem requires figuring out what the problem-setter expects us to count.

The reference class problem shows up in classical probability when we try to list symmetrical outcomes that will be assigned equal probabilities. In a coin toss, we normally pick heads and tails as the two outcomes. But as we saw, we could have chosen to include the angle of each side with respect to a line. The classical reference class problem is particularly interesting when we deal with a continuous possibility space. Consider a problem involving squares. Given the (infinite) set of squares whose side length can take any real value between 0 and 1, what is the probability of picking a square that has side length 0.5 or less? The ranges 0 to 0.5 and 0.5 to 1 seem symmetrical, and so we might pick a classical probability of 50%. But we could restate the question as follows. What is the probability of picking a square whose area is between 0 and 0.25? Note that the very same squares that fit the first problem fit the second one (since the area of a square of side length 0.5 is 0.5×0.5 = 0.25). And yet when we divide up the continuum of possible areas into symmetrical ranges, we might pick 0 to 0.25, 0.25 to 0.5, 0.5 to 0.75, and 0.75 to 1. We would now have to assign a classical probability of 1 in 4, or 25%. Clearly the probability we assign depends on how we select the squares, and this in turn is implicit in the framing of the question.
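The two framings of the squares problem correspond to two different ways of picking a square at random, and a simulation can make the difference visible. The sketch below is my own illustration: uniform_side treats every side length as equally likely, while uniform_area treats every area as equally likely. Both function names and the fixed seed are assumptions of the sketch.

```python
import random

def fraction_small(n, sample_side, seed=0):
    """Fraction of n randomly picked squares whose side length is 0.5 or less."""
    rng = random.Random(seed)   # fixed seed so that the run is repeatable
    hits = sum(sample_side(rng) <= 0.5 for _ in range(n))
    return hits / n

def uniform_side(rng):
    """Pick the side length uniformly between 0 and 1."""
    return rng.random()

def uniform_area(rng):
    """Pick the area uniformly between 0 and 1, then return the matching side length."""
    return rng.random() ** 0.5

print(fraction_small(100_000, uniform_side))   # close to 0.5
print(fraction_small(100_000, uniform_area))   # close to 0.25
```

Both procedures select from exactly the same set of squares, and both ask the same question about the result. Only the notion of "equally likely" differs.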

The reference class problem is also encountered when estimating frequentist probabilities. Let's say I want to know the probability that Radha Singh, an Indian-American woman, will get skin cancer. The frequency estimate requires counting the number of people who are 'like' Radha along some dimension or set of dimensions, and the smaller number of these people who also have skin cancer, and dividing the latter by the former. Which classes should we use? Should we compare Radha to other women in the US, to Indian-Americans specifically, to all South Asians, or to members of her particular ethnic subgroup? And should factors like age, income, education and diet be included? The largest class (say, the world's female population) would result in a number that is close to the 'true frequency', but is this estimate useful? The smallest classes (say, Indian-Americans of her ethnic subgroup who are also matched for age, income and diet) would presumably result in inaccurate estimates of the frequency based on very small numbers. Whichever class we end up choosing, we can do our best to make an accurate estimate. But which of the potentially infinite relative frequencies is the true probability of Radha getting skin cancer?

Choosing a reference class can have consequences in the medical world beyond estimations of cancer risk. A growing body of research suggests that people with red hair are more sensitive to pain, and require on average 20% higher doses of anesthesia than the general population. Prior to this research being conducted, the fact of being a redhead simply did not count as a factor when deciding on an appropriate anesthesia dose, because a person's reference class was just a broad average of humanity. We might say that P(pain|redhead), the probability of pain at the population average dose given that the patient is a redhead, was a probability that for many years no one bothered to estimate. The reference class problem also appears to crop up when dealing with the results of randomized controlled trials of drugs and other treatments. Imagine a doctor who prescribes a drug based on the fact that it was shown to be 95% effective in male patients above the age of 55 with hypertension. The drug proves to be ineffective, however. The doctor's patient is a male above the age of 55, but is also of African origin. If the original study that generated the 95% estimate was not explicit about the ethnicities of the participants, it may not be clear if the patient actually fits the reference class on which the drug was actually tested. Researchers are now realizing that ethnicity is an important factor for medical testing. In the absence of accurate causal scientific theories, very little can be said about other factors that have simply not yet been incorporated into clinical tests. [2]

Subjective probability interpretations have a way to escape the reference class problem, but it comes at quite a cost. Subjectivists only care that their degrees of belief obey the laws of probability — the rationality of their belief system requires internal coherence but not accuracy. So radical subjectivists can escape the reference class problem by having very little to say about how their beliefs relate to the real world. Less extreme subjectivists might use classical or frequentist procedures to estimate their Bayesian prior degrees of belief. In that case the subjectivists inherit the same sort of reference class problems that emerge for the other interpretations. Alternatively, they might resort to the testimonies of experts in order to determine their prior beliefs. They might, for instance, defer to a group of physicists on all matters to do with physics. But this runs into a version of the reference class problem too: if there is ever a conflict between multiple experts — a very common situation in physics and everywhere else — then the subjectivists must come up with some procedure for choosing among experts, or combining their beliefs into one number.

The answer is 42… but what was the question?

We live in the age of Big Data. Much of this data is analyzed from a probabilistic perspective. Companies want to be able to predict consumer behavior. Governments want to know how to weigh various pieces of information, in order to decide on policy and allocate resources. Health care organizations want to know the probabilities of diseases and infirmities so they can choose treatments that are cost-effective at the population level. Reliable frequentist probabilities are perhaps the most useful for such purposes, given that they are directly comparable with outcomes in the world. Subjective degrees of belief become useful when frequencies can't be directly assessed; they can encapsulate the weight of evidence and expert opinion for some course of action. As long as the nature of the reference class or the set of prior beliefs is clear to the person or organization asking the question, the numbers churned out by the mathematical machinery don't lie.

But how clear are we on reference classes and prior beliefs? In my experience even many scientists are only dimly aware of the differences between the various interpretations of probability, and the problems that go along with each. Perhaps this just means that the nature of probability is a mere academic quibble — something that practically-minded people can defer to theoreticians. After all, most concepts seem to evaporate upon close examination by philosophers. Nevertheless, I find that engaging with these foundational questions can help to understand the numbers that play an increasingly important role in the decisions that govern our lives. If a doctor tells me that I have a certain chance of getting a deadly disease, I'd like to be able to think about what this means. If a climate scientist tells us the chances for several possible outcomes of global warming, a sufficient number of informed citizens needs to be able to critically examine the numbers. If a Wall Street analyst tells us that a particular financial innovation reduced the probability of risk, we need to be able to ask where the numbers came from, and what we are supposed to do with them. None of this means we should adopt a cynical “Lies, damned lies, and statistics” attitude. Probabilities often contain useful information, as long as we understand how they have been estimated.

Perhaps more importantly, looking under the hood of the engine of probability reveals to us that there is always a role for human choice, and therefore bias. Probability is rooted in our ability to count, and for this reason it can seem like an objective concept. But the universe does not provide us with natural reference classes or prior degrees of belief — we have to actively choose them as individuals and as societies. The reference class problem and the problem of choosing prior beliefs, rather than being in need of technical resolution, are useful reminders of the potentially powerful consequences of deciding what counts and what doesn't count. Fundamentally, probabilities depend on the questions we ask of nature, and we should therefore try to uncover, whenever possible, the assumptions and biases latent in the act of questioning.

______

Notes & References

[1] My treatment of the reference class problem is based on Alan Hájek's arguments in the paper The Reference Class Problem is Your Problem Too. The square example and the cancer example are based on examples from this paper.

[2] The hypertension example is based on one used in a paper by Connor Cummings: The Reference Class Problem vis-à-vis Evidence-Based Medicine.

The diagram tabulating the 'boy born on Tuesday' possibilities is from Decision Science News.