An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem

Richard Feynman once said that if nuclear war caused the human race to lose all its knowledge and start over from scratch, but he could somehow pass on to them just one piece of information, he would tell them this:

All things are made of atoms – little particles that move around in perpetual motion, attracting each other when they are a little distance apart, but repelling upon being squeezed into one another.

For Feynman, this was the single most helpful and important piece of information we could pass on to a future human race that had lost all other knowledge.

It’s an excellent choice, especially since it entails reductionism.

After giving it some thought, I think maybe the second piece of information I would pass to a new society is Bayes’ Theorem.

Seeing the world through the lens of Bayes’ Theorem is like seeing The Matrix. Nothing is the same after you have seen Bayes.

But I’d rather not just give the equation and then explain its parts, because if you don’t understand the logic behind the equation, it’s hard to know how to apply it correctly. The goal of the tutorial below is not to teach you how to guess the teacher’s password and give the right responses on an exam. No, the goal of the tutorial below is to give you a true understanding of Bayes’ Theorem so that you can apply it correctly in the complexities of real life that exist beyond the exam sheet. By the end of this tutorial you will not just be able to recite Bayes’ Theorem; you will feel it in your bones.

The most popular online tutorial on Bayes’ Theorem, Eliezer Yudkowsky’s “An Intuitive Explanation of Bayes’ Theorem,” opens like this:

Your friends and colleagues are talking about something called “Bayes’ Theorem” or “Bayes’ Rule”, or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a webpage about Bayes’ Theorem and… It’s this equation. That’s all. Just one equation. The page you found gives a definition of it, but it doesn’t say what it is, or why it’s useful, or why your friends would be interested in it. It looks like this random statistics thing. So you came here. Maybe you don’t understand what the equation says. Maybe you understand it in theory, but every time you try to apply it in practice you get mixed up trying to remember the difference between p(a|x) and p(x|a), and whether p(a)*p(x|a) belongs in the numerator or the denominator. Maybe you see the theorem, and you understand the theorem, and you can use the theorem, but you can’t understand why your friends and/or research colleagues seem to think it’s the secret of the universe. Maybe your friends are all wearing Bayes’ Theorem T-shirts, and you’re feeling left out. Maybe you’re a girl looking for a boyfriend, but the boy you’re interested in refuses to date anyone who “isn’t Bayesian”. What matters is that Bayes is cool, and if you don’t know Bayes, you aren’t cool. Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen? Soon you will know. Soon you will be one of us.

Eliezer’s explanation of this hugely important law of probability is probably the best one on the internet, but I fear it may still be too fast-moving for those who haven’t needed to do even algebra since high school. Eliezer calls it “excruciatingly gentle,” but he must be measuring “gentle” on a scale for people who were reading Feynman at age 9 and doing calculus at age 13 like him.

So, I decided to write an even gentler introduction to Bayes’ Theorem. One that is gentle for normal people.

There are times when Yudkowsky introduces new terms without defining or explaining them (“mean revised probability,” for example). Other times, he leaves you with a difficult problem without the resources you need to solve it (for example, the problem stated right before the phrase “mean revised probability”). That is where, I suspect, many non-mathematicians just give up and don’t come back. If you gave up on Yudkowsky’s introduction to Bayes’ Theorem, I hope you’ll try mine, below. It’s much gentler.

Because this article is gentler than Yudkowsky’s, it’s also longer, so I advise you to tackle just one section per day.

This introduction also replaces Yudkowsky’s interactive elements with lots of pictures so that you can read it on a mobile device like the Kindle. Here: download the PDF (updated 01/04/2011).

I hope you find it useful!


You probably already use Bayesian reasoning without knowing it.

Consider an example I adapted from Neil Manson:

You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop. You peek your head out of the trench, trying to get a better look. Bam! A bullet glances off your helmet and you duck down again. Okay, you think. I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it might still be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away. After a few minutes, you dare to take another look, and peek your head out of the trench again. Bam! Another bullet glances off your helmet! You duck down again. Oh shit, you think. It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me twice in a row from that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.

If that’s roughly how you’d reason in a situation like that, then congratulations! You already think like a Bayesian, at least some of the time.

But of course it will be helpful to be more precise than this, and it will be helpful to know when and how our reasoning departs from correct Bayesian reasoning. In fact, in the scientific study of human reasoning biases, a bias is defined in terms of a systematic departure from ideal Bayesian reasoning.

So, let us be Accidental Bayesians no more. Let’s learn to be consistent Bayesians.

We begin with a story problem, like the ones you had in high school. But I promise you, learning Bayes’ Theorem will be far more useful than almost anything you learned in high school.

Here’s the problem:

Only 1% of women at age forty who participate in a routine mammography test have breast cancer. 80% of women who have breast cancer will get positive mammographies, but 9.6% of women who don’t have breast cancer will also get positive mammographies. A woman of this age had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

If you’re struggling to figure out the answer, you’ll be relieved to know that only 15% of doctors give the correct answer.

And no, I didn’t make up that number. See: Casscells et al. (1978), Eddy (1982), Gigerenzer & Hoffrage (1995).

But grab yourself a calculator and see if you can get the right answer. It’s simple math, but it’s tricky.

Okay. You at least gave it a try, right? You won’t learn Bayes’ Theorem just by reading. You can only learn Bayes’ Theorem by doing. So you really should try all the exercises.

Go ahead. Give it a try.

I’ll still be here when you’re done.

What answer did you get? Most doctors estimate between 70% and 80%, but that’s wildly incorrect.

Let’s try an easier version of the same problem. With this one, nearly half the doctors get the right answer.

100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of these women with positive mammographies will actually have breast cancer?

Give it a try. What’s the answer?

The answer is 7.8%. Just 7.8% of the women with positive mammographies will have breast cancer!

Here, follow the logic…

Always begin by figuring out what you want to know. In this case, we want to know what fraction (or percentage) of the women with positive mammographies actually have breast cancer.

First, let’s figure out how many women have positive mammographies. That’s the denominator of our fraction.

The story above says that 950 of the 9,900 that do not have breast cancer will have a positive mammography. So that’s 950 women with a positive test result right there.

The story also says that 80 out of the 100 women who do have breast cancer will get a positive test result. So that’s another 80 women, and 950 + 80 = 1,030 women with a positive test result.

Good. We’ve got half our fraction. Now, how do we find the numerator? How many of those 1,030 women with a positive test result actually have breast cancer?

Well, the story says that 80 of the 100 women with breast cancer will get a positive test result, so 80 is our numerator.

The fraction of women with positive test results who actually have breast cancer is 80/1,030, which is a probability of .078, which is 7.8%.
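The arithmetic above is simple enough to check with a few lines of Python. This is just a sketch of the same frequency reasoning; the variable names are mine, not from the original problem:

```python
# The mammography story problem, worked with raw counts of women.
with_cancer = 100            # women with breast cancer, out of 10,000
without_cancer = 9_900       # women without breast cancer

true_positives = 0.80 * with_cancer   # 80 women: cancer AND positive test
false_positives = 950                 # given in the story: no cancer, positive test

total_positives = true_positives + false_positives  # 1,030 positive tests in all

# Fraction of positive-testers who actually have cancer:
answer = true_positives / total_positives
print(round(answer, 3))  # 0.078, i.e. 7.8%
```

Notice the structure: the denominator is *all* positive tests, from both groups of women, and the numerator is only the true positives.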

So if one of these 40-year-old women got a positive mammography, and the doctor knew the above statistics, then the doctor should tell the woman she has only a 7.8% chance of having breast cancer, even though she had a positive mammography. That’s much less stressful for the woman than if the doctor had told her she had a 70%-80% chance of having breast cancer like most doctors apparently would!

Already, we can see that careful reasoning of this sort has real-world consequences. This is not “just math.”

Why do even doctors get that kind of problem wrong, so often? The math isn’t hard, so what goes wrong?

The most common mistake is to focus only on the women with breast cancer who get positive results, while ignoring the other important information, such as the original fraction of women who have breast cancer and the fraction of women without breast cancer who get false positives.

But you always need all three pieces of information to get the right answer.

To get a feel for why you always need all three pieces of information, imagine an alternate universe in which only one woman out of a million has breast cancer. And let’s say the mammography test detected breast cancer in 8 out of 10 cases, while giving a false positive only 10% of the time.

Now, I think you can see that in this universe, the initial probability that a woman has breast cancer is so incredibly low that even if a woman gets a positive test result, it’s still almost certainly true that she does not have breast cancer.

Why? Because there are a lot more women getting false positives (10% of 99.9999% of women) than there are getting true positives (80% of 0.0001% of women). So if a woman gets a positive result, it’s almost certainly a false positive, not a true positive.
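To see just how extreme this is, here’s the same comparison in Python, a minimal sketch of the one-in-a-million universe (variable names are mine):

```python
# Alternate universe: only 1 woman in 1,000,000 has breast cancer.
prior = 1 / 1_000_000
p_pos_given_cancer = 0.8      # test detects cancer in 8 out of 10 cases
p_pos_given_healthy = 0.1     # 10% false positive rate

true_pos = prior * p_pos_given_cancer          # tiny sliver of women
false_pos = (1 - prior) * p_pos_given_healthy  # nearly 10% of all women

posterior = true_pos / (true_pos + false_pos)
print(posterior)  # roughly 0.000008: still almost certainly a false positive
```

Even after a positive result, her chance of having cancer is about eight in a million, because the false positives outnumber the true positives by a factor of more than 100,000.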

An extreme example like this illustrates that the new data you get from the mammography test does not replace the data you had at the outset about how improbable it was that the woman had breast cancer. Instead, imagine that you start with the original probability that a woman has breast cancer, and then getting the new evidence of the mammography test moves the probability one direction or the other from that starting point, depending on whether the test is positive or negative. In this way, the mammography test slides the probability of the woman having breast cancer in the direction of the test result.

To illustrate this, consider the original problem again. In that story, 1% of 40-year-old women (100 out of 10,000) have breast cancer. 80% of women with cancer (80 out of 100) test positive, and 9.6% of women without cancer (950 out of 9,900) also test positive. When we did the math, we found that a positive test result slides a woman’s chances of having breast cancer from 1% upward to 7.8%.

You can’t replace the original probability with new information. You can only update it with new information, by sliding in one direction or another from the original probability. The original probability still matters, a fact which is obvious when the original probability is really extreme – for example in a universe where only one in every million women has breast cancer.

Remember that we always need all three pieces of information. We need to know the original fraction of women with breast cancer, the fraction of women with breast cancer who get positive test results, and the fraction of women without breast cancer who get positive test results.

To see why that last piece of information matters – the fraction of women without breast cancer who get false positives – consider a new test: mammography*. A mammography* has the same rate of false negatives as before: 20%. But it also has an alarmingly high rate of false positives: 80%!

Here’s the story problem:

1% of women have breast cancer. 80% of women with breast cancer will get a positive test result. 80% of women without breast cancer will also get a positive test result. A woman had a positive mammography*. What is the probability that she actually has breast cancer?

Go ahead; calculate the answer.

Got your answer?

Okay, let’s start by calculating what percentage of women will get a positive test result. 80% of the 1% of women with breast cancer will get a positive result, so that’s 0.8% of women right there. Also, 80% of the 99% of women without breast cancer will get a positive result, so that’s another 79.2%. And since 0.8% + 79.2% = 80%, that means 80% of women will get a positive test result.

Even though only 1% of women actually have breast cancer!

So already you can tell that third piece of information can make a huge difference.

But let’s finish the calculation. What fraction of women with positive mammography* results actually have cancer?

First, how many women will get a positive test result? That’s our denominator.

Well, there are two groups of women who will get a positive mammography* result: those with a positive result who do have cancer (0.8%), and those with a positive result who don’t have cancer (79.2%). Add those together, and our denominator is 80%.

Time to figure out our numerator. Out of those 80% of women who will get a positive result, how many actually have cancer? We already know the answer, because we already know what percentage of women will test positive and have breast cancer: 0.8%. So the fraction of women with positive mammography* results who actually have cancer is 0.8%/80%, which is 1%.

The woman started out with a 1% chance of having breast cancer, and after the test she still has a 1% chance of having breast cancer.
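You can confirm in Python that the mammography* test leaves the probability exactly where it started. This sketch uses percentages expressed as fractions; the names are mine:

```python
# The mammography* problem: both conditional probabilities are 80%.
prior = 0.01                # 1% of women have breast cancer
p_pos_given_cancer = 0.80   # true positive rate
p_pos_given_healthy = 0.80  # false positive rate -- same as the true positive rate!

true_pos = prior * p_pos_given_cancer            # 0.008, i.e. 0.8% of women
false_pos = (1 - prior) * p_pos_given_healthy    # 0.792, i.e. 79.2% of women

posterior = true_pos / (true_pos + false_pos)    # 0.008 / 0.80
print(posterior)  # 0.01 -- identical to the prior
```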

How did that happen? Didn’t the test tell us anything, either way?

Nope.

Why didn’t it tell us anything? Remember, the mammography* test had such a high rate of false positives that a woman was quite literally just as likely to get a positive result if she didn’t have breast cancer as if she did have breast cancer!

If she did have breast cancer, she had an 80% chance of testing positive. And if she didn’t have breast cancer, she also had an 80% chance of testing positive.

And that’s why the test didn’t tell us anything. So we updated her chances of having breast cancer by 0%. She was just as likely to get the same test result either way, so the test didn’t do anything to tell us which possibility was correct.

In such a case, the mammography* test is completely uncorrelated with incidences of breast cancer, because it gives the same results either way. In fact in this case, there’s no reason to call one result “positive” and another result “negative,” since neither result tells you to slide your probability in either direction.

Which means you might just as well have flipped a coin as your “test” for breast cancer. Flipping a coin would have been equally uncorrelated with incidences of breast cancer. If the woman has breast cancer, there’s a 50% chance the coin will turn up heads. If the woman doesn’t have breast cancer, there is also a 50% chance the coin will turn up heads.

Or, you could just as well have used a test that always gave the same result. Let’s say your test was adding-two-plus-two. If a woman had breast cancer, the result of the adding-two-plus-two test would have been 4. And if a woman hadn’t had breast cancer, the result of the adding-two-plus-two test would have been 4.

All these tests are equally worthless, because these tests give the same result the same percentage of the time whether or not the woman has breast cancer. In order for a test to give us information we can use to update the probability that the woman has breast cancer, the test has to be correlated with breast cancer in some way. The test has to be more likely to give some particular result when a woman does have breast cancer than when she doesn’t.

That’s what it means for something to be a “test” for breast cancer.

But remember: probability is in the mind, not in reality. Even a useful mammography test does not actually change whether or not the woman has cancer. She either has cancer or she doesn’t. Reality is not uncertain about whether or not the woman has cancer. We are uncertain about whether or not she has cancer. It is our information, our judgment, that is uncertain, not reality itself.

The original proportion of women with breast cancer is known as the prior probability. This is the probability that a woman has breast cancer prior to some new evidence we receive.

What about the proportion of women with breast cancer who get a positive test result, and the proportion of women without breast cancer who get a positive test result? These were the two conditions of our story, so these probabilities are called the two conditional probabilities.

Collectively, the prior probability and the conditional probabilities are known as our priors. They are the bits of information we know prior to calculating the result, which is called the revised probability or posterior probability.

What we showed above is that if the two conditional probabilities are the same – if a positive test is 80% likely if the woman has breast cancer, and a positive test result is 80% likely if the woman doesn’t have breast cancer – then the posterior probability equals the prior probability.

Where do we get our priors from? How do we know what the prior probability is, and what the conditional probabilities are?

Well, those are tested against reality like anything else. For example, if you think 100 out of 10,000 women have breast cancer, but the actual number is 500 out of 10,000, then one of your priors is wrong, and you need to do some more research.

There are also a few easy symbols you should know, because it’s common to work out these kinds of story problems with the help of some symbols.

To illustrate, Eliezer tells a story of some plastic eggs:

Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?

Before you go about solving this problem, let’s introduce the symbols I was talking about. To say “the probability that a certain egg contains a pearl is equal to .4,” we write:

p(pearl) = .4

Now, more notation:

p(blue|pearl) = .3

What is that straight line between blue and pearl? It stands for “given that.” And here, the word blue stands for “is blue.” So we can read the above statement as: “The probability that a certain egg is blue, given that it contains a pearl, is .3.”

One more symbol is the tilde: ~

It means “not,” as in:

p(blue|~pearl) = .1

This reads: “The probability that a certain egg is blue, given that it does not contain a pearl, is .1.”

Now we are ready to express our three pieces of information from the story above, but in symbolic form:

p(pearl) = .4 p(blue|pearl) = .3 p(blue|~pearl) = .1

And of course what we’re looking for is:

p(pearl|blue) = ?

You should be able to read those four statements aloud:

The probability that a certain egg contains a pearl is .4. The probability that a certain egg is blue, given that it contains a pearl, is .3. The probability that a certain egg is blue, given that it does not contain a pearl, is .1. The probability that a certain egg contains a pearl, given that it is blue, is… what?

That’s our problem. Stop and try to solve it without peeking below.

What’s the solution? We’re looking for the probability that an egg contains a pearl, given that it is blue. (This is like trying to figure out the probability that a woman has breast cancer given that she had a positive test result.)

40% of the eggs contain pearls, and 30% of those are blue, so 12% of the eggs altogether are blue and contain pearls.

60% of the eggs contain no pearls, and 10% of those are blue, so 6% of the eggs altogether are blue and contain no pearls.

12% + 6% = 18%, so a total of 18% of the eggs are blue.

We already know that 12% of the eggs are blue and contain pearls, so the chance that a blue egg contains a pearl is 12/18 or about 67%.
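The egg calculation follows the same pattern as the mammography one, and it’s worth seeing them side by side. A quick Python sketch (names mine):

```python
# The plastic-egg problem from Yudkowsky's tutorial.
p_pearl = 0.40              # prior: 40% of eggs contain pearls
p_blue_given_pearl = 0.30   # first conditional probability
p_blue_given_empty = 0.10   # second conditional probability

blue_and_pearl = p_pearl * p_blue_given_pearl         # 0.12: blue AND pearl
blue_and_empty = (1 - p_pearl) * p_blue_given_empty   # 0.06: blue AND empty

p_pearl_given_blue = blue_and_pearl / (blue_and_pearl + blue_and_empty)
print(round(p_pearl_given_blue, 2))  # 0.67, i.e. about 67%
```

The shape is always the same: (prior × first conditional) divided by the total probability of the observation, summed over both possibilities.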

One famous case of a failure to apply Bayes’ Theorem involves a British woman, Sally Clark. After two of her children died of sudden infant death syndrome (SIDS), she was arrested and charged with murdering her children. Pediatrician Roy Meadow testified that the chances that both children died of SIDS was 1 in 73 million. He got this number by squaring the odds of one child dying of SIDS in similar circumstances (1 in 8500).

Because of this testimony, Sally Clark was convicted. The Royal Statistical Society issued a public statement decrying this “misuse of statistics in court,” but Sally’s first appeal was rejected. She was released after nearly 4 years in a women’s prison where everyone else thought she had murdered her own children. She never recovered from her experience, developed an alcohol dependency, and died of alcohol poisoning in 2007.

The statistical error made by Roy Meadow was, among other things, to fail to consider the prior probability that Sally Clark had murdered her children. While two sudden infant deaths may be rare, a mother murdering her two children is even more rare.

It can help to visualize what’s going on here. On Yudkowsky’s page for Bayes’ Theorem, there is an interactive tool that lets you adjust each of the three values independently and see what the result is. But it only works if you have Java, and not on Mac (not on mine, anyway) or on mobile devices like the Kindle, so I’m going to use images here. (Screenshots of Yudkowsky’s interactive tool, in fact.)

First, the original problem:

Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?

The bar at the top, divided between pearl and empty, shows the prior probability that an egg contains a pearl. The probability is 40%, so the division between the two is just left of center. (Center would be 50%.)

The first conditional probability is p(blue|pearl) or “the probability that an egg is blue given that it contains a pearl.” The size of the right-facing arrow reflects the size of this probability.

The second conditional probability is “the probability that an egg is blue given that it does not contain a pearl.” The bottom row shows the probabilities that an egg either does or doesn’t contain a pearl, given that it is blue.

One thing that might be confusing right away is that the bars at the top and bottom of this drawing are not measuring the same collection of eggs, even though they look the same size in the drawing. The top bar is measuring all eggs, both blue and red, while the bottom bar is measuring only blue eggs. Don’t let this confuse you; drawing it this way allows us to most clearly illustrate the effects of each of our three priors: the prior probability, the first conditional probability, and the second conditional probability. At both the top and bottom of the drawing we are looking at the chances an egg has a pearl. It’s just that in the middle we “ran a test” and discovered that the egg we grabbed from the bin was blue, so that eliminated all the red eggs from the situation.

The slant of the line in the middle represents how we should update our probability that an egg contains a pearl after our first test (whether the egg is blue or not). At first, all we know is that if we grab an egg from the bin, it has a 40% chance of containing a pearl. But now let’s say we grab an egg from the bin and see that it is blue. Because we know that the chances it would be blue if it had a pearl were higher than the chances it would be blue if it didn’t have a pearl, we therefore know the egg now has a higher probability of containing a pearl than it did before. So we slide our probability that the egg contains a pearl in the upward direction. That’s why the line in the middle of the drawing is slanted to the right: we shifted up. Knowing how much to shift up our probability requires doing the math, of course.

Now, let’s look at the effect on the posterior probability if the prior probability is different. What if only 10% of all the eggs contained pearls? Now our drawing would look like this:

Our conditional probabilities didn’t change, so the relative slant of the line in the middle didn’t change. That is, the degree by which we have to update our probability that an egg contains a pearl after we discover it is blue – that degree of updating required did not change. However, the prior probability is now much lower, which means the posterior probability is correspondingly lower.

Remember what happened in the above story about women and breast cancer when our story said that only one woman in a million had breast cancer? If we showed that in a diagram like the one above, the slanted line would be slammed up against the left edge of the drawing, and the posterior probability that a woman had breast cancer would be extremely small no matter what the result of the test, given almost any set of conditional probabilities. (You’d have to have a really slanted line of conditional probabilities to update very far from the far left edge of the diagram.)

And what happens if we keep the conditional probabilities locked in place, but jump the prior probability up to 80%?

Now of course, the prior probability is much greater, so the posterior probability is also much greater.

And again, the degree of updating we need to do remains the same, so the line is still slanted mildly to the right. But notice, the exact amount of updating is not the same. The line is not quite as slanted as when the prior probability was 40%, or even as slanted as when the prior probability was 10%. Why is that?

That’s because the amount by which we need to update our probability after discovering the egg is blue depends not just on the difference between the two conditional probabilities (in this case, 30% and 10%), but also on the prior probability. That effect just drops out of the math. And if you think about it, it makes sense. What if the prior probability that an egg contained a pearl was 99.999%, and the conditional probabilities remained the same? If it were the case that we updated by the same amount as before, then the probability that an egg was blue and contained a pearl would be greater than 100%! The slanted line would go off the right edge of the drawing!

If that happened, it would mean you were doing the math wrong. As it happens, pushing the prior probability to 99% just makes the amount of updating we need to do very small in absolute terms, because we still need to adjust upward, but the probability that the egg contains a pearl can’t get much higher than it already is:
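A quick Python check makes the point concrete. With the prior pushed up to 99% and the same 30%/10% conditionals, the posterior barely moves (the names here are mine):

```python
# Very high prior, same conditional probabilities as the egg problem.
prior = 0.99                # 99% of eggs contain pearls
p_blue_given_pearl = 0.30
p_blue_given_empty = 0.10

num = prior * p_blue_given_pearl
posterior = num / (num + (1 - prior) * p_blue_given_empty)
print(posterior)  # about 0.9966: an upward nudge, but it can't go past 100%
```

The update is still upward, as it must be when the first conditional probability exceeds the second, but it is squeezed into the tiny space left between 99% and 100%.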

Now, what if we return the prior probability to its original value of 40%, but change the first conditional probability?

Now the first conditional probability is exerting much more force on the posterior probability, and slanting our line more heavily to the right. That means we have to make a heavier update to our probability that the egg contains a pearl after discovering it is blue.

Why is that? Well, the first conditional probability is p(blue|pearl) or “the probability that an egg is blue, given that it contains a pearl.” What happens if that probability is far larger than the other conditional probability, “the probability that an egg is blue, given that it doesn’t contain a pearl”? If that happens, then there are going to be a lot more eggs that are blue and contain a pearl than eggs that are blue and empty. So once you have discovered the egg you picked from the bin is blue, you know there is now a much better chance than before that the egg you have in your hand contains a pearl… because there are more pearl-containing blue eggs than there are empty blue eggs.

So again, it’s the difference between the two conditional probabilities that determines by how much we need to update our probability as a result of our test. If the difference between the two conditional probabilities is small, we don’t need to update from the prior probability very much. If the difference between the two conditional probabilities is large, we need to update quite a bit from the prior probability.

Just to emphasize that it’s the difference between the two conditional probabilities that determines the degree by which we must update our probability, and not their absolute values, let’s look at what happens if both conditional probabilities are very high, but not very different:

Now the story is that if an egg contains a pearl, it is definitely blue. There are no red eggs with pearls. However, there is also a high degree of “false positives” in our “test,” because it’s also very common for an egg without a pearl to be blue. Because an egg is very likely to be blue whether it contains a pearl or not, the fact that an egg is blue doesn’t tell you much about whether or not it contains a pearl, and so finding that the egg is blue doesn’t allow us to update our probability very much. It’s the difference between the two conditional probabilities that tells us by how much we can update our prior probability.

Now what if the second conditional probability is larger than the first?

Now the second conditional probability is larger than the first one, so the line is slanted left, and we have to update our probability in the opposite direction. Again, this makes sense. Now the story is that “the probability that an egg is blue given that it doesn’t contain a pearl” is larger than “the probability that an egg is blue given that it does contain a pearl,” which means that in this story there are more eggs that are blue and empty than there are eggs that are blue and contain a pearl. So if we’ve grabbed a blue egg, it’s more likely to be empty, relative to containing a pearl, than it was before we knew it was blue (the prior probability). So we have to update our probability that the egg contains a pearl in the downward direction this time, no matter what the prior probability is.

And what if our two conditional probabilities are the same?

If our two conditional probabilities are the same, then they exert the same amount of force on our required update, which means we don’t update at all. If an egg is just as likely to be blue given that it contains a pearl as it is likely to be blue given that it doesn’t contain a pearl, then there are just as many eggs that are blue and contain a pearl as there are eggs that are blue and empty, and so the fact that the egg we picked is blue doesn’t give us any new information at all about whether or not it contains a pearl. Thus, we’re stuck with no new information, and we can’t update from the prior probability.

This is the case no matter what the conditional probabilities are, as long as they are the same:

The usual mistake in thinking about these kinds of problems is to simply ignore the prior probability and focus on the two conditional probabilities. But now you can see why all three of these pieces of information are required for calculating the posterior probability correctly.
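If you like, this lesson can be compressed into a few lines of code. This is just a sketch (the function name and the sample numbers are mine, not from the tutorial): it computes the posterior from the three required pieces of information, and shows that equal conditional probabilities leave the prior untouched, while a large gap between them forces a large update.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' Theorem: update `prior` after observing the evidence.

    p_obs_given_h     -- probability of the observation if the hypothesis is true
    p_obs_given_not_h -- probability of the observation if it is false
    """
    numerator = p_obs_given_h * prior
    denominator = numerator + p_obs_given_not_h * (1 - prior)
    return numerator / denominator

# Equal conditional probabilities: the evidence is uninformative,
# so the posterior equals the prior, whatever the prior is.
print(round(posterior(0.40, 0.90, 0.90), 6))  # 0.4

# A large gap between the conditionals forces a large update.
print(round(posterior(0.40, 0.90, 0.10), 6))  # 0.857143
```

Notice that all three arguments matter: ignoring the prior, or looking at only one of the conditional probabilities, gives the wrong answer.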

Yudkowsky explains:

Studies of clinical reasoning show that most doctors carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer would get a positive mammography. Similarly, on the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive). Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked. It’s like the experiment in which you ask a second-grader: “If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?” Many second-graders will respond: “Twenty-five.” They understand when they’re being prompted to carry out a particular mental procedure, but they haven’t quite connected the procedure to reality. Similarly, to find the probability that a woman with a positive mammography has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammography. Neither can you subtract the probability of a false positive from the probability of the true positive. These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.

A man who knows he has a genetic predisposition to alcoholism can respond to this knowledge by avoiding alcohol more purposefully than he might otherwise. Likewise, if we can understand why our brains handle probabilities poorly, we may be able to plan ahead and counteract the reasoning mistakes our brains lean toward.

So why do human brains, even the brains of trained doctors, usually get these kinds of problems wrong?

Luckily, recent studies have shed some light on the problem.

It turns out that we get the problems right more or less often depending on how the problem is phrased.

The kind of problem we get wrong the most often is phrased in terms of percentages or probabilities: “1% of women…” and so on.

But we do somewhat better when the problem is phrased in terms of frequencies: “1 out of every 100 women has breast cancer…” and “80 out of every 100 women with breast cancer get a positive test result…” Apparently, this phrasing helps us to visualize a single woman in an empty space made to hold 100 women, or 80 women nearly filling a space made to hold 100 women.

We do best of all when the problem is phrased in terms of absolute numbers, which are called natural frequencies: “400 out of 1000 eggs contain pearls…” and “50 out of 400 pearl-containing eggs are blue…” This is closest of all to actually doing the experiment yourself and experiencing how often you pull a blue egg from the bin, and experiencing how often blue eggs contain pearls.
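Natural frequencies are also the easiest form to compute with, because you can literally count eggs. Here is a sketch using a 40% prior carried over from Yudkowsky’s original egg problem (an assumption on my part), with the 30% true-positive and 10% false-positive rates mentioned in the quoted passage earlier:

```python
# Counting with natural frequencies (assumed numbers: 40% of eggs hold
# pearls, 30% of pearl eggs are blue, 10% of empty eggs are blue).
total = 1000
pearl = 400                      # 40% of 1000 eggs contain pearls
empty = total - pearl            # 600 empty eggs
blue_pearl = pearl * 30 // 100   # 120 pearl eggs are blue
blue_empty = empty * 10 // 100   # 60 empty eggs are blue

# Of all the blue eggs, what fraction contain a pearl?
print(blue_pearl / (blue_pearl + blue_empty))  # 120 / 180 = 2/3
```

Phrased this way, the answer nearly computes itself: you just count the blue eggs and see how many of them hold pearls.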

Yudkowsky remarks:

It may seem like presenting the problem in this way is “cheating”, and indeed if it were a story problem in a math book, it probably would be cheating. However, if you’re talking about real doctors, you want to cheat; you want the doctors to draw the right conclusions as easily as possible. The obvious next move would be to present all medical statistics in terms of natural frequencies. Unfortunately, while natural frequencies are a step in the right direction, it probably won’t be enough. When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half. A big improvement, but not big enough when you’re talking about real doctors and real patients.

A visualization of the eggs and pearls problem in terms of natural frequencies might look like this:

Because we’re looking at absolute numbers now instead of percentages, the bar at the top is bigger than the bar at the bottom, because the collection of all eggs is larger than the collection of just blue eggs.

The top bar looks the same as before, and the middle line has the same kind of slant, but now the bottom bar is much smaller because we’re only looking at blue eggs in the bottom bar. (The bottom bar is centered, as you can see.)

In this kind of visualization, we don’t see how much we’re updating our probability so much in the slant of the line, but rather in the difference in proportions between the top bar and the bottom bar. In the above example, you can see just by looking that the pearl condition takes up an even larger proportion of the bottom bar than it does in the top bar, which means we are updating our probability upward.

What does this kind of visualization look like if the conditional probabilities are the same?

In this case, we see that the proportions in the bottom and top bars are the same, so we aren’t updating, even though the line is slanted in this kind of visualization.

But the natural frequencies visualization shows something the probabilities visualization does not. The natural frequencies visualization shows that when we decrease two proportions by the same factor, the resulting proportions are the same. Discovering that the egg was blue decreased the number of pearl-carrying eggs we might be looking at, but it decreased the number of empty eggs we might be looking at by the same factor, and that’s why the probability that the egg we grabbed contains a pearl remains the same as before we discovered it was blue.

Now, let’s look at a natural frequencies visualization for the original problem about breast cancer. 1% of women have breast cancer, 80% of those women test positive on a mammography, and 9.6% of women without breast cancer also receive positive mammographies.

You can hardly see the condition on the left at all because the prior probability of breast cancer is so small: only 1%. And even though the mammography is fairly accurate (only a 20% rate of false negatives and a 9.6% rate of false positives), we still don’t have much reason to think the woman has breast cancer after a positive test, because the prior probability is so low. Even after adjusting upward because of the positive test result, the posterior probability that she has breast cancer is still only 7.8%.

Still, how does the test give us useful information? As the above visualization shows, the test eliminates more of the women without breast cancer than with breast cancer. The proportion of the top bar that represents women with breast cancer is small, but the test passes most of this on to the bottom bar, our posterior probability. In contrast, most of the section of the top bar representing women without breast cancer was not passed to the bottom bar. It’s this difference between conditional probabilities that gives us some information with which to update our prior probability to our posterior probability. The evidence of the positive mammography test slides the prior probability of 1% to the posterior probability of 7.8%.
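You can verify that 7.8% figure with a one-line application of the update rule; a minimal sketch using the story’s three numbers (variable names are mine):

```python
prior = 0.01            # 1% of women have breast cancer
p_pos_cancer = 0.80     # true positive rate: p(positive|cancer)
p_pos_healthy = 0.096   # false positive rate: p(positive|~cancer)

# Bayes' Theorem: p(cancer|positive)
posterior = (p_pos_cancer * prior) / (
    p_pos_cancer * prior + p_pos_healthy * (1 - prior))
print(round(posterior, 4))  # 0.0776, i.e. about 7.8%
```

The tiny numerator (0.8% of all women) divided by the not-so-tiny denominator (10.3% of all women test positive) is what keeps the posterior low despite the accurate test.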

Next, Yudkowsky asks us to imagine a new kind of breast cancer test:

Suppose there’s yet another variant of the mammography test, mammography@, which behaves as follows. 1% of women in a certain demographic have breast cancer. Like ordinary mammography, mammography@ returns positive 9.6% of the time for women without breast cancer. However, mammography@ returns positive 0% of the time (say, once in a billion) for women with breast cancer.

Here is the graph:

Okay, this one is easy. If a woman gets a positive result on this test, what do you tell her?

If a woman gets a positive result on the mammography@ test, you tell her: “Congratulations! You definitely don’t have breast cancer.”

Mammography@ isn’t a cancer test; it’s a health test! As the visualization shows, very few women get a positive result from a mammography@ test, but no women with breast cancer get a positive result. So if a woman gets a positive mammography@ result, she definitely doesn’t have breast cancer!

What this shows is that what makes a normal mammography test a positive test for breast cancer (not for health) is not that somebody named the mammography test “positive,” but that the test has a certain kind of probability relation to breast cancer. Normal mammography is a “positive” test for breast cancer because a “positive” result of the test increases the chances that the tested woman has breast cancer. But in the case of mammography@, a “positive” result actually decreases the chances she has breast cancer. So mammography@ is not a positive test for breast cancer, but a positive test for the condition of not-having-breast-cancer.

Yudkowsky concludes:

You could call the same result “positive” or “negative” or “blue” or “red” or “James Rutherford”, or give it no name at all, and the test result would still slide the probability in exactly the same way. To minimize confusion, a test result which slides the probability of breast cancer upward should be called “positive”. A test result which slides the probability of breast cancer downward should be called “negative”. If the test result is statistically unrelated to the presence or absence of breast cancer – if the two conditional probabilities are equal – then we shouldn’t call the procedure a “cancer test”! The meaning of the test is determined by the two conditional probabilities; any names attached to the results are simply convenient labels.

Now, note that mammography@ is rarely useful. Most of the time, it gives a negative result, which gives very weak evidence that doesn’t allow us to slide our probability (that the woman has cancer) very far away from the prior probability. Only on rare occasions (a positive result) does it give us strong evidence. But when it does give us strong evidence, it is very strong evidence, for it allows us to conclude with certainty that the tested woman does not have breast cancer.

Let’s return to our original mammography story:

Only 1% of women at age forty who participate in a routine mammography test have breast cancer. 80% of women who have breast cancer will get positive mammographies, but 9.6% of women who don’t have breast cancer will also get positive mammographies. A woman of this age had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

Let’s look at all the different quantities involved (taken from Yudkowsky’s page):

p(cancer): 0.01 (Group 1: 100 women with breast cancer)
p(~cancer): 0.99 (Group 2: 9900 women without breast cancer)
p(positive|cancer): 80.0% (80% of women with breast cancer have positive mammographies)
p(~positive|cancer): 20.0% (20% of women with breast cancer have negative mammographies)
p(positive|~cancer): 9.6% (9.6% of women without breast cancer have positive mammographies)
p(~positive|~cancer): 90.4% (90.4% of women without breast cancer have negative mammographies)
p(cancer&positive): 0.008 (Group A: 80 women with breast cancer and positive mammographies)
p(cancer&~positive): 0.002 (Group B: 20 women with breast cancer and negative mammographies)
p(~cancer&positive): 0.095 (Group C: 950 women without breast cancer and positive mammographies)
p(~cancer&~positive): 0.895 (Group D: 8950 women without breast cancer and negative mammographies)
p(positive): 0.103 (1030 women with positive results)
p(~positive): 0.897 (8970 women with negative results)
p(cancer|positive): 7.80% (chance you have breast cancer if mammography is positive)
p(~cancer|positive): 92.20% (chance you are healthy if mammography is positive)
p(cancer|~positive): 0.22% (chance you have breast cancer if mammography is negative)
p(~cancer|~positive): 99.78% (chance you are healthy if mammography is negative)
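All sixteen of these values follow from just three givens. If you want to check the table yourself, here is a short Python sketch (the variable names are mine):

```python
p_cancer = 0.01              # prior: 1% of women have breast cancer
p_pos_given_cancer = 0.80    # true positive rate
p_pos_given_healthy = 0.096  # false positive rate

# Complements (one degree of freedom each)
p_healthy = 1 - p_cancer
p_neg_given_cancer = 1 - p_pos_given_cancer
p_neg_given_healthy = 1 - p_pos_given_healthy

# Joint probabilities: p(a&x) = p(x|a) * p(a)
p_cancer_pos = p_pos_given_cancer * p_cancer     # Group A, ~0.008
p_cancer_neg = p_neg_given_cancer * p_cancer     # Group B, ~0.002
p_healthy_pos = p_pos_given_healthy * p_healthy  # Group C, ~0.095
p_healthy_neg = p_neg_given_healthy * p_healthy  # Group D, ~0.895

# Marginals: every woman who tests positive is in Group A or Group C
p_pos = p_cancer_pos + p_healthy_pos             # ~0.103
p_neg = p_cancer_neg + p_healthy_neg             # ~0.897

# Posteriors: p(a|x) = p(a&x) / p(x)
print(round(p_cancer_pos / p_pos, 4))  # 0.0776, the famous 7.8%
print(round(p_cancer_neg / p_neg, 4))  # 0.0022, i.e. 0.22%
```

Every row of the table above falls out of this chain of complements, products, and sums.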

As you might imagine, it can be easy to mix up one of these quantities with another. p(cancer&positive) is the exact same thing as p(positive&cancer), but p(cancer|positive) is definitely not the same thing or the same value as p(positive|cancer). The probability that a woman has cancer given a positive test result is not the same thing as the probability that a woman will get a positive test result given that she has cancer. If you confuse those two, you’ll get the wrong answer! And of course p(cancer&positive) is entirely different from p(cancer|positive). The probability that a woman has cancer and will get a positive test result is not at all the same thing as the probability that a woman has cancer given that she gets a positive test result.

Later, I’ll present Bayes’ Theorem, and if you stick to the formula, you won’t mix these up. But it helps to know what they all mean, and how they relate to each other.

To see how they relate to each other, consider the “degrees of freedom” between them. What the heck are “degrees of freedom”? Wikipedia sayeth unto you:

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

What does that mean? Let’s look at an example. There is only one degree of freedom between p(cancer) and p(~cancer) because if you know one of them, you know the other. Once you know one of the numbers, there is only one value the other can take; the pair has just one degree of freedom. If p(cancer) is 90%, then p(~cancer) is 10%. If p(~cancer) is 45%, then p(cancer) is 55%. There is “nowhere else to go,” because

p(cancer) + p(~cancer) = 100%

And of course, the same goes for positive and negative tests:

p(positive) + p(~positive) = 100%

Another pair of values that has only one degree of freedom between them is p(positive|cancer) and p(~positive|cancer). Given that a woman has cancer (which is true of both those values), then she will either test positive or she will not test positive. There’s no third option, assuming (as our story does) that she is tested. So once you know one of these values, you know the other. If p(positive|cancer) = 20%, then p(~positive|cancer) must be 80%. Why? Because:

p(positive|cancer) + p(~positive|cancer) = 100%

Remember, it helps to always read these statements aloud. The above statement reads: “The probability that a woman tests positive given that she has cancer, plus the probability that she tests negative given that she has cancer, equals 100%.”

And of course the same is true when looking at cancer or not cancer given a positive test result:

p(cancer|positive) + p(~cancer|positive) = 100%

That reads: “The probability that a woman has cancer given that she tests positive, plus the probability that she doesn’t have cancer given that she tests positive, equals 100%.” If you say it out loud, the truth of the equation becomes obvious.

And likewise:

p(positive|~cancer) + p(~positive|~cancer) = 100%

p(cancer|~positive) + p(~cancer|~positive) = 100%

However, consider the relation between p(positive|cancer) and p(positive|~cancer). It could be the case that, as in the original story, p(positive|cancer) was equal to 80%, while p(positive|~cancer) was equal to 9.6%. In other words, it could be that the probability that a woman would test positive given that she has cancer is 80%, while the probability that a woman would test positive given that she does not have cancer is 9.6%. These two values are independent of each other. It could just as well be that the chance of false negative is not 20% but instead 2%, while at the same time the chance of a false positive is still 9.6%. So these two values, p(positive|cancer) and p(positive|~cancer) are said to have two degrees of freedom. Both numbers could be different, independently.

Let’s take a triplet of values and consider the degrees of freedom between them. Our three values are p(positive&cancer), p(positive|cancer), and p(cancer). How many degrees of freedom are there between them? Since there are three values, there could be as many as three degrees of freedom between them. But let’s check.

In this case, we can calculate one of the other values by looking at the other two. In particular:

p(positive&cancer) = p(positive|cancer) × p(cancer)

Why must this be so? If we know the probability that a woman has cancer, and we know the proportion of those women with cancer that will test positive, then that tells us the probability that a woman has cancer and tests positive. We multiply the probability that a woman has cancer by the probability that a woman tests positive given that she has cancer, and that is the probability that a woman has cancer and tests positive.

Because we can use two of the values to calculate the third, there are only two degrees of freedom between these three values: p(positive&cancer), p(positive|cancer), and p(cancer).
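With the numbers from the mammography story, this multiplication rule reproduces Group A: 80% of the 1% of women with cancer test positive, which is 0.8% of all women. A quick check:

```python
p_cancer = 0.01            # p(cancer)
p_pos_given_cancer = 0.80  # p(positive|cancer)

# Product rule: p(positive&cancer) = p(positive|cancer) * p(cancer)
p_pos_and_cancer = p_pos_given_cancer * p_cancer
print(round(p_pos_and_cancer, 3))       # 0.008
print(round(p_pos_and_cancer * 10000))  # 80 women out of 10,000 (Group A)
```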

The same is true for this arrangement:

p(~positive&cancer) = p(~positive|cancer) × p(cancer)

If we know the probability that a woman has cancer, and we know the proportion of women with cancer who will not test positive, then multiplying the two gives the probability that a woman has cancer and does not test positive.

Let’s consider another triplet of values: p(positive), p(positive&cancer), and p(positive&~cancer). How many degrees of freedom are there between them? It should be rather obvious that:

p(positive&cancer) + p(positive&~cancer) = p(positive)

Every woman who tests positive either has cancer and tests positive or doesn’t have cancer and tests positive. Those two possibilities add up to account for 100% of the women who test positive. So you can use them to calculate the total percentage of women who test positive for breast cancer. So there are only two degrees of freedom between p(positive), p(positive&cancer), and p(positive&~cancer).

Now, consider this set of four values: p(positive&cancer), p(positive&~cancer), p(~positive&cancer), and p(~positive&~cancer). At first glance, it might look like there are only two degrees of freedom here, because it might seem you could calculate all four values from just two numbers, p(positive) and p(cancer). For example: p(positive&~cancer) = p(positive) × p(~cancer).

But this is actually wrong! Notice that the above equation is only true if p(positive) and p(~cancer) are statistically independent. The above equation is only true if the probability of a woman having cancer has no bearing on her chances of testing positive for cancer. But that’s not the case! According to our story, she is more likely to test positive if she does have breast cancer than if she doesn’t have breast cancer.

But a simpler way of seeing why this is wrong may be to notice that these four values correspond to four groups of different women, and of course there could be different numbers of women in each group. We could have 500 women in the has cancer and tests positive group (group A, let’s call it), 150 women in the has cancer and tests negative group (group B), 50 women in the has no cancer and tests positive group (group C), and 900 women in the has no cancer and tests negative group (group D). And each of these values could be different, independent of all the others.

So now you’re thinking this set of four values has four degrees of freedom, and you’d be right if it weren’t for the fact that all four add up to 100% of the women. That is, all four of these probabilities must add up to 100%. For example, in the above paragraph I put 500 women in group A, 150 women in group B, 50 women in group C, and 900 women in group D. That makes for a total of 1,600 women. Thus, the probability that a woman belongs to group A is 31.25%. That is, p(A) = 31.25%. Moving on, p(B) = 9.375%, p(C) = 3.125%, and p(D) = 56.25%. And, not surprisingly, those four percentages add up to 100%.
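Tallying those made-up counts confirms the arithmetic (a quick sketch; the 500/150/50/900 numbers are the ones invented in the paragraph above):

```python
groups = {"A": 500,  # cancer, tests positive
          "B": 150,  # cancer, tests negative
          "C": 50,   # no cancer, tests positive
          "D": 900}  # no cancer, tests negative

total = sum(groups.values())  # 1,600 women in all
probs = {name: count / total for name, count in groups.items()}
print(probs)  # {'A': 0.3125, 'B': 0.09375, 'C': 0.03125, 'D': 0.5625}

# The four joint probabilities always sum to 100% of the women.
print(sum(probs.values()))  # 1.0
```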

Because of this, we can use three of the values to calculate the fourth, because we know the total of all of them is going to add up to 100%, and this costs us one degree of freedom:

p(positive&cancer) + p(positive&~cancer) + p(~positive&cancer) + p(~positive&~cancer) = 100%

In fact, once you have all four group (A = cancer and tests positive, B = cancer and tests negative, C = no cancer and tests positive, and D = no cancer and tests negative), you can easily use them to calculate all the other values. For example:

p(cancer|positive) = A / (A + C)

The probability that a woman has cancer given a positive test result is, of course, the probability that she has cancer and tests positive (group A), divided by the probability that she tests positive (A + C):

p(positive) = A + C

And, the probability that a woman has cancer given that she tests negative is the probability that she has cancer and tests negative (group B), divided by the probability that she tests negative (B + D):

p(cancer|~positive) = B / (B + D) p(~positive) = B + D

Finally, the probability that a woman has cancer is equal to the probability that she has cancer and tests positive (group A) plus the probability that she has cancer and tests negative (group B):

p(cancer) = A + B

Likewise, the probability that she doesn’t have cancer is equal to the probability that she doesn’t have cancer and tests positive (group C) plus the probability that she doesn’t have cancer and tests negative (group D):

p(~cancer) = C + D

If we translate the letters into probability symbols, we’ve just explained the following equations:

p(cancer|positive) = p(cancer&positive) / [p(cancer&positive) + p(~cancer&positive)]

p(positive) = p(cancer&positive) + p(~cancer&positive)

p(cancer|~positive) = p(cancer&~positive) / [p(cancer&~positive) + p(~cancer&~positive)]

p(~positive) = p(cancer&~positive) + p(~cancer&~positive)

p(cancer) = p(cancer&positive) + p(cancer&~positive)

p(~cancer) = p(~cancer&positive) + p(~cancer&~positive)

And since we can calculate all the values we want if we have A, B, C, and D, and since A, B, C, and D have three degrees of freedom, it follows that all 16 values in the problem (see the table of values above) have three degrees of freedom.

But that should not surprise you, since you already knew that you could solve these types of problems with just three pieces of information: the prior probability and the two conditional probabilities.

Now that you understand the relations between the 16 different quantities in this kind of problem, let’s try another kind of story problem offered by Yudkowsky:

Suppose you have a large barrel containing a number of plastic eggs. Some eggs contain pearls, the rest contain nothing. Some eggs are painted blue, the rest are painted red. Suppose that 40% of the eggs are painted blue, 5/13 of the eggs containing pearls are painted blue, and 20% of the eggs are both empty and painted red. What is the probability that an egg painted blue contains a pearl?

Try to solve it using the relations we discovered above.

What pieces of information do we have? We know that 40% of the eggs are painted blue:

p(blue) = 40%

We also know that 5/13 of the eggs containing pearls are blue:

p(blue|pearl) = 5/13

And we also know that 20% of the eggs are both empty and red:

p(~blue&~pearl) = 20%

The piece of information we want to solve for is:

p(pearl|blue) = ?

Okay, how do we get that posterior probability? Since you’re new to Bayes’ Theorem, you’re probably not sure what the fastest way to the answer is, so let’s just start filling in the most obvious values we can, among those 16 values of the problem:

p(blue) = 40% …given in the story

p(~blue) =

p(pearl) =

p(~pearl) =

p(pearl&blue) =

p(pearl&~blue) =

p(~pearl&blue) =

p(~pearl&~blue) = 20% …given in the story

p(blue|pearl) = 5/13 …given in the story

p(~blue|pearl) =

p(blue|~pearl) =

p(~blue|~pearl) =

p(pearl|blue) = ???

p(~pearl|blue) =

p(pearl|~blue) =

p(~pearl|~blue) =

How do we fill in more of the values? Well, let’s check all the relations between the values we have discovered, and see what we can solve for. Here are the rules we discovered when discussing the breast cancer story problem:

p(cancer) + p(~cancer) = 100%

p(positive) + p(~positive) = 100%

p(positive|cancer) + p(~positive|cancer) = 100%

p(cancer|positive) + p(~cancer|positive) = 100%

p(positive|~cancer) + p(~positive|~cancer) = 100%

p(cancer|~positive) + p(~cancer|~positive) = 100%

p(positive&cancer) = p(positive|cancer) × p(cancer)

p(~positive&cancer) = p(~positive|cancer) × p(cancer)

p(positive&cancer) + p(positive&~cancer) = p(positive)

p(positive&cancer) + p(positive&~cancer) + p(~positive&cancer) + p(~positive&~cancer) = 100%

p(cancer|positive) = p(cancer&positive) / [p(cancer&positive) + p(~cancer&positive)]

p(positive) = p(cancer&positive) + p(~cancer&positive)

p(cancer|~positive) = p(cancer&~positive) / [p(cancer&~positive) + p(~cancer&~positive)]

p(~positive) = p(cancer&~positive) + p(~cancer&~positive)

p(cancer) = p(cancer&positive) + p(cancer&~positive)

p(~cancer) = p(~cancer&positive) + p(~cancer&~positive)

But now, let’s translate those rules into talking about blue and pearl, by replacing every occurrence of cancer (what we were trying to detect) with pearl (what we’re now trying to detect), and by replacing every occurrence of positive (our previous test) with blue (our current test):

p(pearl) + p(~pearl) = 100%

p(blue) + p(~blue) = 100%

p(blue|pearl) + p(~blue|pearl) = 100%

p(pearl|blue) + p(~pearl|blue) = 100%

p(blue|~pearl) + p(~blue|~pearl) = 100%

p(pearl|~blue) + p(~pearl|~blue) = 100%

p(blue&pearl) = p(blue|pearl) × p(pearl)

p(~blue&pearl) = p(~blue|pearl) × p(pearl)

p(blue&pearl) + p(blue&~pearl) = p(blue)

p(blue&pearl) + p(blue&~pearl) + p(~blue&pearl) + p(~blue&~pearl) = 100%

p(pearl|blue) = p(pearl&blue) / [p(pearl&blue) + p(~pearl&blue)]

p(blue) = p(pearl&blue) + p(~pearl&blue)

p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]

p(~blue) = p(pearl&~blue) + p(~pearl&~blue)

p(pearl) = p(pearl&blue) + p(pearl&~blue)

p(~pearl) = p(~pearl&blue) + p(~pearl&~blue)

Okay, so which rules can we use with the quantities we know already?

Well, the most obvious ones we can use right away are:

p(blue) + p(~blue) = 100%

and

p(blue|pearl) + p(~blue|pearl) = 100%

That gives us 60% for p(~blue) and 8/13 for p(~blue|pearl).

Here’s another. Remember that:

p(~blue) = p(pearl&~blue) + p(~pearl&~blue)

Well, we know what two of those values are, so:

60% = 20% + p(pearl&~blue)

Which means:

p(pearl&~blue) = 60% – 20% = 40%

So we can add that to our table, too. Now our table looks like this:

p(blue) = 40%

p(~blue) = 60% …because p(blue) + p(~blue) = 100%

p(pearl) =

p(~pearl) =

p(pearl&blue) =

p(pearl&~blue) = 40% …because p(~blue) = p(pearl&~blue) + p(~pearl&~blue)

p(~pearl&blue) =

p(~pearl&~blue) = 20%

p(blue|pearl) = 5/13

p(~blue|pearl) = 8/13 …because p(blue|pearl) + p(~blue|pearl) = 100%

p(blue|~pearl) =

p(~blue|~pearl) =

p(pearl|blue) = ???

p(~pearl|blue) =

p(pearl|~blue) =

p(~pearl|~blue) =

Now, what else can we do? Go through the value relations listed above and see if you can find one that you now have enough data to solve for.

Here’s one:

p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]

Filling that in with the values we already have, we get:

p(pearl|~blue) = 40% / (40% + 20%)

So:

p(pearl|~blue) = 2/3

Which means we can solve for p(~pearl|~blue) also, because:

p(pearl|~blue) + p(~pearl|~blue) = 100%

Therefore:

p(~pearl|~blue) = 1/3

And now our table of values looks like this:

p(blue) = 40%

p(~blue) = 60%

p(pearl) =

p(~pearl) =

p(pearl&blue) =

p(pearl&~blue) = 40%

p(~pearl&blue) =

p(~pearl&~blue) = 20%

p(blue|pearl) = 5/13

p(~blue|pearl) = 8/13

p(blue|~pearl) =

p(~blue|~pearl) =

p(pearl|blue) = ???

p(~pearl|blue) =

p(pearl|~blue) = 2/3 …because p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]

p(~pearl|~blue) = 1/3 …because p(pearl|~blue) + p(~pearl|~blue) = 100%

We are making progress! And here’s another equation we can now solve for:

p(~blue&pearl) = p(~blue|pearl) × p(pearl)

So:

40% = (8/13) × p(pearl)

And therefore:

p(pearl) = 40% / (8/13) = (2/5) / (8/13) = 13/20 = 65%

Now, that we know p(pearl), we can also solve this one from our list of known equations:

p(blue&pearl) = p(blue|pearl) × p(pearl)

So:

p(blue&pearl) = (5/13) × (13/20) = 1/4 = 25%

Updating our table of values, we now have:

p(blue) = 40%

p(~blue) = 60%

p(pearl) = 65% …because p(~blue&pearl) = p(~blue|pearl) × p(pearl)

p(~pearl) =

p(pearl&blue) = 25% …because p(blue&pearl) = p(blue|pearl) × p(pearl)

p(pearl&~blue) = 40%

p(~pearl&blue) =

p(~pearl&~blue) = 20%

p(blue|pearl) = 5/13

p(~blue|pearl) = 8/13

p(blue|~pearl) =

p(~blue|~pearl) =

p(pearl|blue) = ???

p(~pearl|blue) =

p(pearl|~blue) = 2/3

p(~pearl|~blue) = 1/3

The last bit is easy, because:

p(blue) = p(pearl&blue) + p(~pearl&blue)

Which gives us:

40% = 25% + p(~pearl&blue)

And therefore:

p(~pearl&blue) = 40% – 25% = 15%

So now we can finally solve for p(pearl|blue), because:

p(pearl|blue) = p(pearl&blue) / [p(pearl&blue) + p(~pearl&blue)]

Which gives us:

p(pearl|blue) = 25% / (25% + 15%)

And therefore:

p(pearl|blue) = 25% / 40%

Which results in:

p(pearl|blue) = 62.5%

In fact, we didn’t need to calculate all the values we did. But it was good practice. :)
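In fact, the whole chain of deductions compresses into a few lines. Here is a sketch that starts from the three givens and recovers both the 65% prior and the 62.5% posterior (using exact fractions so nothing is lost to rounding):

```python
from fractions import Fraction  # exact arithmetic, no rounding surprises

p_blue = Fraction(2, 5)               # 40% of eggs are blue
p_blue_given_pearl = Fraction(5, 13)  # 5/13 of pearl eggs are blue
p_red_and_empty = Fraction(1, 5)      # 20% of eggs are empty and red

p_red = 1 - p_blue                                # 3/5
p_pearl_and_red = p_red - p_red_and_empty         # 2/5
p_red_given_pearl = 1 - p_blue_given_pearl        # 8/13
p_pearl = p_pearl_and_red / p_red_given_pearl     # (2/5)/(8/13) = 13/20
p_pearl_and_blue = p_blue_given_pearl * p_pearl   # (5/13)*(13/20) = 1/4
p_pearl_given_blue = p_pearl_and_blue / p_blue    # (1/4)/(2/5) = 5/8

print(p_pearl)             # 13/20, the 65% prior
print(p_pearl_given_blue)  # 5/8, the 62.5% posterior
```

Each line is one of the relations from our list, applied in the same order we used above.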

But now, let’s check our calculations. Do they make sense? Here’s the original story problem:

Suppose you have a large barrel containing a number of plastic eggs. Some eggs contain pearls, the rest contain nothing. Some eggs are painted blue, the rest are painted red. Suppose that 40% of the eggs are painted blue, 5/13 of the eggs containing pearls are painted blue, and 20% of the eggs are both empty and painted red. What is the probability that an egg painted blue contains a pearl?

Remember we found that p(pearl) is 13/20. That’s our prior probability: 65%. There’s a 65% chance that an egg has a pearl, even before we run the test of seeing what color it is.

What are our conditional probabilities? One of them was given to us in the story. The probability we would see a blue egg given that it contained a pearl, p(blue|pearl), is 5/13 – which doesn’t reduce nicely to a decimal. The other conditional probability, the probability that we would see a red egg if it contained a pearl, is p(~blue|pearl) = 8/13.

So if a certain egg has a pearl, it’s more likely to be red than to be blue. More to the point, a pearl egg is slightly less likely to be blue (5/13 ≈ 38.5%) than an empty egg is (p(blue|~pearl) = 15% / 35% ≈ 42.9%). So when we discover that the egg we picked from the bin is blue, that makes it slightly less likely to contain a pearl than was the case before we knew its color. So after running the test and discovering the egg we picked is blue, the probability that our egg contains a pearl slides down just a little.

And hey! That’s just what we see. Our prior probability that our chosen egg contains a pearl was 65%, and according to our calculations the posterior probability that our chosen egg contains a pearl is slightly lower at 62.5%.

So yes, our math seems to fit what we would expect given what we’ve learned about how these types of situations work.

Having worked a few of these problems now, you might have noticed that strong but rare evidence pushing in one direction must be balanced by weak but common evidence pushing in the opposite direction. This is because, using the breast cancer story problem as an example:

p(cancer) = p(cancer|positive) × p(positive) + p(cancer|~positive) × p(~positive)

This reads: “The probability that a woman has cancer is equal to [the probability that she has cancer given that she tests positive times the probability that she tests positive] plus [the probability that a woman has cancer given that she tests negative times the probability that she tests negative].”

Thus, if there is rare but strong evidence from one of the conditional probabilities, this must be balanced by common but weak evidence from the other conditional probability, because the two of them must add up to, in this example, p(cancer). Yudkowsky calls this principle the Conservation of Probability.
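A quick Python sketch makes Conservation of Probability concrete, using the mammography numbers from this tutorial (1% prior, 80% hit rate, 9.6% false-positive rate); the variable names are my own:

```python
p_cancer = 0.01              # prior
p_pos_given_cancer = 0.80    # hit rate
p_pos_given_healthy = 0.096  # false-positive rate

# How often each test outcome occurs.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_neg = 1 - p_pos

# Posterior for each outcome.
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_cancer_given_neg = (1 - p_pos_given_cancer) * p_cancer / p_neg

# The posteriors, weighted by how often each outcome occurs,
# must average back out to the prior.
reconstructed_prior = p_cancer_given_pos * p_pos + p_cancer_given_neg * p_neg
print(reconstructed_prior)  # 0.01 (up to floating-point error)
```

The rare-but-strong positive result (posterior of about 7.8% from a 1% prior) and the common-but-weak negative result (posterior of about 0.22%) exactly balance.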

Now, one more term: likelihood ratio. The likelihood ratio has to do with the likelihood of getting a true positive vs. the likelihood of getting a false positive. Specifically, the likelihood ratio is the probability that a test gives a true positive, divided by the probability that a test gives a false positive. However, the likelihood ratio doesn’t tell us much about what we should do if we get a negative result.

For example, p(pearl|blue) does not determine p(pearl|~blue). Even if we know the likelihood ratio, and therefore know what to do if we get a positive result on our test, that doesn’t tell us what to do with a negative result from our test.

To illustrate this further, consider the following problem, again taken from Eliezer:

Suppose that there are two barrels, each containing a number of plastic eggs. In both barrels, 40% of the eggs contain pearls and the rest contain nothing. In both barrels, some eggs are painted blue and the rest are painted red. In the first barrel, 30% of the eggs with pearls are painted blue, and 10% of the empty eggs are painted blue. In the second barrel, 90% of the eggs with pearls are painted blue, and 30% of the empty eggs are painted blue. [Assuming you like pearls, would you rather] have a blue egg from the first or second barrel? Would you rather have a red egg from the first or second barrel?

This time, we need to calculate the probability that a blue egg from the 1st barrel contains a pearl, and compare it to the probability that a blue egg from the 2nd barrel contains a pearl. For the second question, we calculate the probability that a red egg from the 1st barrel contains a pearl, and compare it to the probability that a red egg from the 2nd barrel contains a pearl.

In both barrels, 40% of the eggs contain pearls. So our prior probability, p(pearl), is 40% for either barrel.

And if you’ve made it this far instead of skipping ahead, it might be intuitively obvious to you that we don’t care whether a blue egg comes from the first or second barrel, because:

In the first barrel, p(blue|pearl) / p(blue|~pearl) = 30/10
In the second barrel, p(blue|pearl) / p(blue|~pearl) = 90/30

…which is the same ratio: both equal exactly three. And since the prior probability – specifically, p(pearl) – is the same for both barrels, and the ratio between the conditional probabilities is the same for both barrels, p(pearl|blue) is going to be the same for both barrels. So would we rather have a blue egg from the first or second barrel? We don’t care: p(pearl|blue) is the same in either case.

But what about a red egg? Would we rather have a red egg from the first or second barrel? That is: for which barrel is p(pearl|~blue) higher?

In the first barrel, 70% of the eggs with pearls are painted red, and 90% of the empty eggs are painted red. But in the second barrel, 10% of the eggs with pearls are painted red, while 70% of the empty eggs are painted red. Here, the ratio between the conditional probabilities is different for the first barrel than for the second barrel. Specifically:

In the first barrel, p(~blue|pearl) / p(~blue|~pearl) = 70/90
In the second barrel, p(~blue|pearl) / p(~blue|~pearl) = 10/70

Since the ratio of the conditional probabilities for barrel #1 is different from the ratio of the conditional probabilities for barrel #2, we can tell that we are going to prefer a red egg from one barrel over the other. And without doing the math, we can tell that p(pearl|~blue) is going to be higher for barrel #1, so we’d rather get a red egg from barrel #1 than from barrel #2: the ratio of p(~blue|pearl) to p(~blue|~pearl) is higher for barrel #1 than for barrel #2.

Again, you must be reading these out loud: “The ratio of the probability of drawing a red egg given that it contains a pearl to the probability of drawing a red egg given that it is empty is higher for barrel #1 than for barrel #2, so we’d rather get a red egg from barrel #1 than from barrel #2 (assuming we want a pearl).”
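A short Python sketch (with a helper function of my own) confirms both answers:

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """p(H|E) from the prior and the two conditional probabilities."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

p_pearl = 0.40  # same prior in both barrels

# Blue egg: the likelihood ratios (30/10 and 90/30) are both 3,
# so the posteriors match.
blue_1 = posterior(p_pearl, 0.30, 0.10)
blue_2 = posterior(p_pearl, 0.90, 0.30)

# Red egg: the likelihood ratios (70/90 vs. 10/70) differ,
# so the posteriors differ, favoring barrel #1.
red_1 = posterior(p_pearl, 0.70, 0.90)
red_2 = posterior(p_pearl, 0.10, 0.70)

print(blue_1, blue_2)  # both 2/3
print(red_1, red_2)    # about 0.34 vs. about 0.09
```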

This problem illustrates the fact that p(pearl|blue) and p(pearl|~blue) have two degrees of freedom, even when p(pearl) is fixed. For both barrels above, p(pearl) was the same, but that did not mean that the ratio of p(pearl|blue) to p(pearl|~blue) was the same for both barrels, because p(blue) was different between the two barrels. As Yudkowsky puts it:

In the second barrel, the proportion of blue eggs containing pearls is the same as in the first barrel, but a much larger fraction of eggs are painted blue! This alters the set of red eggs in such a way that the proportions [between the conditional probabilities] do change.

Back to the breast cancer test:

The likelihood ratio of a medical test – the number of true positives divided by the number of false positives – tells us everything there is to know about the meaning of a positive result. But it doesn’t tell us the meaning of a negative result, and it doesn’t tell us how often the test is useful. For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way – it detects disease more often, and a negative result is stronger evidence of health. The likelihood ratio for a positive result summarizes the differential pressure of the two conditional probabilities for a positive result, and thus summarizes how much a positive result will slide the prior probability… Of course the likelihood ratio can’t tell the whole story; the likelihood ratio and the prior probability together are only two numbers, while the problem has three degrees of freedom.

The late great Bayesian master E.T. Jaynes once suggested that evidence should be measured in decibels.

Why decibels?

Decibels measure sound intensity on a logarithmic scale, just as the Richter scale measures earthquakes logarithmically: each step up multiplies the underlying quantity rather than adding to it. On the Richter scale, a magnitude 7 earthquake doesn’t produce merely a bit more ground motion than a magnitude 6 earthquake, but ten times more ground motion (the energy released grows even faster, roughly 32-fold per step). And a magnitude 8 earthquake produces 100 times more ground motion than a magnitude 6 earthquake.

Likewise, if total silence is 0 decibels, then a whisper is about 20 decibels and a normal conversation is about 60 decibels. The normal conversation carries not three times as much energy as the whisper, but 10,000 times as much, because it is 40 decibels louder.

To get the decibels of a sound:

decibels = 10 × log₁₀(intensity)

Allow me a brief aside to make sure we all remember how logarithms work.

I’m sure we all remember how exponents work. If you have a base of 5 and its exponent is 3, that looks like this: 5³. And that’s just a quick way of saying 5 × 5 × 5. This operation of taking a “base” and “raising it to the power of” an exponent is called exponentiation. Well, taking the logarithm of a number is basically the inverse of exponentiation. 5³ asks “What is 5 to the 3rd power?” whereas log₅ 25 asks “5 to the what power equals 25?” Since 5 to the 2nd power equals 25, log₅ 25 = 2. If you “evaluate” the expression log₅ 25 (“logarithm, base 5, of 25”), your answer is the power to which you’d have to raise the base (5, in this case) in order to get the number of which you’re taking the logarithm.

Whenever I see the expression logₓ n, I always read “x to the what power equals n?” So log₄ 64 reads “4 to the what power equals 64?” The answer, obviously, is that 4 to the 3rd power equals 64.

If this isn’t clear, watch the short video here.

So anyway:

decibels = 10 × log₁₀(intensity)

…just reads “The decibel measure of a sound is equal to 10 times the log, base 10, of the intensity.”

Understanding logarithms will help us get the feel for what it means to think of evidence in terms of decibels (in terms of exponents).
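As a sanity check, the logarithm examples and the decibel formula are easy to verify in Python (the function name is mine):

```python
import math

def decibels(ratio):
    """10 × log base 10 of an intensity (or odds) ratio."""
    return 10 * math.log10(ratio)

print(round(math.log(25, 5)))  # 2 — "5 to the what power equals 25?"
print(round(math.log(64, 4)))  # 3 — "4 to the what power equals 64?"
print(decibels(10_000))        # 40.0 — a conversation vs. a whisper
```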

Back to the medical story. Suppose we start with a 1% prior probability that a woman has breast cancer. Then we administer three different tests for breast cancer, and each test has a different likelihood ratio. The likelihood ratios of the tests are 25:3, 18:1, and 7:2.

If we were to take Jaynes’ advice literally and measure our prior probability in decibels, we’d get:

10 × log₁₀(1/99) ≈ -20 decibels of evidence that a woman has breast cancer

I don’t really care if you can calculate that. I always use a calculator for that kind of thing. What I’m hoping is that you can get the feel of working with “decibels” of evidence.

Now, let’s say we administer the first test, the one with a likelihood ratio of 25/3, and the woman tests positive. This gives us 9 positive decibels of evidence that she has breast cancer, because:

10 × log₁₀(25/3) ≈ +9 decibels of evidence that a woman has breast cancer

Next we administer the second test, and she tests positive again!

10 × log₁₀(18/1) ≈ +13 decibels of evidence that a woman has breast cancer

She also tests positive on the third test:

10 × log₁₀(7/2) ≈ +5 decibels of evidence that a woman has breast cancer

The poor woman started out with a very low probability of having breast cancer, but now she has tested positive on three pretty effective tests in a row. Things are not looking good! She started out with -20 decibels of evidence that she had breast cancer, but the three tests added 27 decibels of evidence (9+13+5) in favor of her having breast cancer, so we now have +7 decibels of evidence that she has breast cancer. On a linear scale, +7 decibels of evidence might look small, but on an exponential scale, +7 decibels of evidence means that there is now an 83% chance that she has cancer!

Notice that +7 decibels of evidence is not as large as -20 decibels of evidence is small. The original -20 decibels of evidence meant it was 99% likely she did not have breast cancer, but +7 decibels of evidence means it is 83% likely she has breast cancer. Of course, +20 decibels of evidence would mean it was 99% likely she had breast cancer.
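Here’s the whole decibel bookkeeping as a Python sketch (function names mine). Note that the 83% above comes from rounding each piece of evidence to whole decibels first; carrying full precision gives about 84%:

```python
import math

def to_decibels(odds):
    """Evidence in decibels: 10 × log base 10 of the odds."""
    return 10 * math.log10(odds)

def to_probability(db):
    """Convert decibels of evidence back to a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

prior_db = to_decibels(1 / 99)                            # about -20 dB
test_dbs = [to_decibels(lr) for lr in (25/3, 18/1, 7/2)]  # about +9, +13, +5

posterior_db = prior_db + sum(test_dbs)                   # about +7 dB
print(to_probability(posterior_db))                       # about 0.84
```

Adding decibels is just multiplying odds: 1/99 × 25/3 × 18 × 7/2 ≈ 5.3:1 odds in favor of cancer.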

Now that you understand the exponential power that evidence has in probabilistic reasoning, try to estimate the answer to this problem – which I paraphrased from Yudkowsky – without writing out all the math:

In front of you is a bookbag containing 1,000 poker chips. I started out with two such bookbags, one containing 700 red chips and 300 blue chips, the other containing 300 red chips and 700 blue chips. I flipped a fair coin to determine which bookbag to show you, so your prior probability that the bookbag in front of you is the mostly-red bookbag is 50%. Now, you close your eyes and reach your hand into the bag and take out a chip at random. You look at its color, write it down, and then put it back into the bag and mix the chips up with your hand. You do this 12 times, and out of those 12 “samples” you get 8 red chips and 4 blue chips. What is the probability that this is the mostly-red bag?

Stop here and think about the problem in your head and make a rough guess at the answer.

According to a study by Ward Edwards and Lawrence Phillips, most people faced with this problem give an answer between 70% and 80%. Was your estimate higher than that? If so, congratulations! The correct answer is about 97%.

Without doing the math, here’s how your intuition might have arrived at roughly the right answer. As stated in the problem, the likelihood ratio for drawing a red chip is 7/3, while the likelihood ratio for drawing a blue chip is 3/7. Thus, either result pushes our final probability with the same degree of force, but drawing a red chip pushes p(mostly-red bag) in the opposite direction from drawing a blue chip.

If you draw one red chip, put it back, and then draw a blue chip, these two pieces of evidence have cancelled each other out – but only because the two likelihood ratios are exact reciprocals (7/3 and 3/7). If you draw a red chip and then a blue chip and that’s all the evidence you have, then your probability that the bag in front of you is the mostly-red bag is back to 50%, right where it started.

You drew 12 chips, and got four more red chips than blue chips. That is several “decibels” of evidence in favor of the bag being the mostly-red bag, which is quite a lot of evidence. When you get rid of the red and blue chips that “cancel” each other, every single red chip you have left over pushes your probability that the bag is the mostly-red bag with the strength of a likelihood ratio of 7/3! So even without doing the math you know the final probability that the bag is the mostly-red bag is going to be pretty darned high.

If the likelihood ratio of your positive test is 7/3 and you have four more positive tests than negative ones, it turns out that you can calculate your odds like so:

7⁴:3⁴ = 2401:81

Which is about 30:1, near 97%.
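In Python, the odds arithmetic looks like this (variable names mine):

```python
odds = 1.0      # 50/50 prior: 1:1 odds on the mostly-red bag
lr_red = 7 / 3  # each red chip multiplies the odds by 7:3
lr_blue = 3 / 7 # each blue chip multiplies them by 3:7

odds *= lr_red ** 8 * lr_blue ** 4  # net effect: (7/3)^4, about 29.6
p_mostly_red = odds / (1 + odds)
print(p_mostly_red)                 # about 0.967
```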

Okay. I think you’re starting to get a feel for how these types of probabilities work. Now, let’s work through one last problem:

You are a mechanic for gizmos. When a gizmo stops working, it is due to a blocked hose 30% of the time. If a gizmo’s hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks. If a gizmo’s hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks. A customer brings you a malfunctioning gizmo. You prod the gizmo and find that it produces sparks. What is the probability that a spark-producing gizmo has a blocked hose?

So we want to solve for p(blocked|sparks), and we already know:

p(blocked) = 30%
p(~blocked) = 70%
p(sparks|blocked) = 45%
p(sparks|~blocked) = 5%

Remember that:

p(sparks|blocked) × p(blocked) = p(sparks&blocked)

So:

p(sparks&blocked) = 45% × 30% = 13.5%

Also remember that:

p(sparks|~blocked) × p(~blocked) = p(sparks&~blocked)

So:

p(sparks&~blocked) = 5% × 70% = 3.5%

Finally, remember that:

p(blocked|sparks) = p(sparks&blocked) / [p(sparks&blocked) + p(sparks&~blocked)]

And therefore:

p(blocked|sparks) = 13.5% / (13.5% + 3.5%)

And the answer is:

p(blocked|sparks) = 79.4%
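The same steps, as a Python sketch (variable names mine):

```python
p_blocked = 0.30
p_sparks_given_blocked = 0.45
p_sparks_given_unblocked = 0.05

# The two joint probabilities.
p_sparks_and_blocked = p_sparks_given_blocked * p_blocked            # 0.135
p_sparks_and_unblocked = p_sparks_given_unblocked * (1 - p_blocked)  # 0.035

p_blocked_given_sparks = p_sparks_and_blocked / (
    p_sparks_and_blocked + p_sparks_and_unblocked
)
print(p_blocked_given_sparks)  # about 0.794
```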

Now, if we put the arithmetic you did for this problem into one equation, here’s what you just did:

p(blocked|sparks) = p(sparks|blocked) × p(blocked) / [p(sparks|blocked) × p(blocked) + p(sparks|~blocked) × p(~blocked)]

The general form of this is:

p(H|E) = p(E|H) × p(H) / [p(E|H) × p(H) + p(E|~H) × p(~H)]

That is Bayes’ Theorem.

And because:

p(E) = p(E|H) × p(H) + p(E|~H) × p(~H)

We can reduce Bayes’ Theorem to the following:

p(H|E) = p(E|H) × p(H) / p(E)

That formulation is simpler, so you’ll see it more often, though it doesn’t give as clear a picture of what Bayes’ Theorem does as the earlier formulation. But they’re both correct.

Given some hypothesis H that we want to investigate, and an observation E that is evidence about H, Bayes’ Theorem tells us how we should update our probability that H is true, given evidence E.

In the medical example, H is “this woman has breast cancer” and E is a positive mammography test result. Bayes’ Theorem tells us what our posterior probability of the woman having breast cancer (H) is, given a positive mammography test (E).
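The full theorem fits in one small Python function. This is just a sketch with names of my own, but it handles every story problem in this tutorial; here it reproduces the mammography and egg answers:

```python
def bayes(p_h, p_e_given_h, p_e_given_not_h):
    """Posterior p(H|E), using p(E) = p(E|H)p(H) + p(E|~H)p(~H)."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# H = "this woman has breast cancer", E = a positive mammography.
print(bayes(0.01, 0.80, 0.096))  # about 0.078

# H = "egg contains a pearl", E = egg is blue (two-barrel problem, barrel #1).
print(bayes(0.40, 0.30, 0.10))   # about 0.667
```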

Yudkowsky concludes:

By this point, Bayes’ Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.

So there you have it. Now you understand the famous theorem from Reverend Thomas Bayes. He is proud of you.

So why should you care? Why does Bayes’ Theorem matter?

Yudkowsky gives the example of someone who thinks mankind will avoid nuclear war for at least another 100 years. When asked why, he said, “All of the players involved in decisions regarding nuclear war are not interested right now.” But why extend that for 100 years? “Because I’m an optimist,” was the reply.

What is it that makes this kind of thinking irrational? What is it about saying “Because I’m an optimist” that gives us no confidence that the claim is correct? (Maybe the claim is true, but we wouldn’t believe so merely because someone says he’s an optimist.)

Yudkowsky explains:

Other intuitive arguments include the idea that “Whether or not you happen to be an optimist has nothing to do with whether [nuclear] warfare wipes out the human species”, or “Pure hope is not evidence about nuclear war because it is not an observation about nuclear war.” There is also a mathematical reply that is precise, exact, and contains all the intuitions as special cases. This mathematical reply is known as Bayes’ Theorem.

For example, the reply “Whether or not you happen to be an optimist has nothing to do with whether nuclear warfare wipes out the human species” can be translated into:

p(you’re an optimist | mankind will avoid nuclear war for another century) = p(you’re an optimist | mankind will not avoid nuclear war for another century)

Yudkowsky continues (he uses ‘A’ for the hypothesis and ‘X’ for the evidence, instead of H and E as I did above):

Since the two probabilities for p(X|A) and p(X|~A) are equal, Bayes’ Theorem says that p(A|X) = p(A); as we have earlier seen, when the two conditional probabilities are equal, the revised probability equals the prior probability. If X and A are unconnected – statistically independent – then finding that X is true cannot be evidence that A is true; observing X does not update our probability for A; saying “X” is not an argument for A.

In this case, the evidence X (that you’re an optimist) does not make us update our probability for A (that nuclear warfare will wipe out the human species within 100 years).

But suppose the optimist says: “Ah, but since I’m an optimist, I’ll have renewed hope for tomorrow, I’ll work a little harder at my dead-end job, I’ll pump up the global economy a little, and eventually, through the trickle-down effect, I’ll send a few dollars into the pocket of the researcher who will ultimately find a way to stop nuclear warfare – so you see, the two events are related after all, and I can use one as valid evidence about the other.”

Not so fast:

In one sense, this is correct - any correlation, no matter how weak, is fair prey for Bayes’ Theorem; but Bayes’ Theorem distinguishes between weak and strong evidence. That is, Bayes’ Theorem not only tells us what is and isn’t evidence, it also describes the strength of evidence. Bayes’ Theorem not only tells us when to revise our probabilities, but how much to revise our probabilities. A correlation between hope and biological warfare may exist, but it’s a lot weaker than the speaker wants it to be; he is revising his probabilities much too far.

Statistical models are judged against the Bayesian method because, well, Bayesian statistics is as good as it gets: “the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential.” You’ll also hear cognitive scientists judging decision-making subjects against ideal Bayesian reasoners, such that cognitive biases are defined in terms of departures from ideal Bayesian reasoning.

Yudkowsky concludes:

The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes’ Theorem; experimental evidence is Bayesian evidence. The Bayesian revolutionaries hold that when you perform an experiment and get evidence that “confirms” or “disconfirms” your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account, not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon. Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism - this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if p(X|A) ~ 1 - if the theory makes a definite prediction – then observing ~X very strongly falsifies A. On the other hand, if p(X|A) ~ 1, and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that p(X|B) ~ 1, in which case observing X doesn’t favor A over B. For observing X to definitely confirm A, we would have to know, not that p(X|A) ~ 1, but that p(X|~A) ~ 0, which is something that we can’t know because we can’t range over all possible alternative explanations. For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions. You can even formalize Popper’s philosophy mathematically. 
The likelihood ratio for X, p(X|A)/p(X|~A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, p(X|~A) - there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That’s the hidden gotcha that toppled Newton’s theory of gravity. So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence. On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory. If p(Y|A) is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if p(Y|A) is 0.0001%, and p(Y|~A) is 1%, then the likelihood ratio p(Y|A)/p(Y|~A) will be 1:10000. -40 decibels of evidence! Or flipping the likelihood ratio, if p(Y|A) is very small, then p(Y|~A)/p(Y|A) will be very large, meaning that observing Y greatly favors ~A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper’s falsificationism. Similarly, Popper’s dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ~X would have disconfirmed the theory to some extent. 
If you try to interpret both X and ~X as “confirming” the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect. Bayes’ Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued. So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes’ Theorem. Hence the Bayesian revolution.

Welcome to the Bayesian Conspiracy.

Bonus links:

Bonus problem:

(courtesy of Beelzebub)