I have created a test that can tell you, with 99% accuracy, whether or not you have celiac disease. Yup, that dreaded disease that makes gluten your number one enemy. This test will tell you, with 99% accuracy, whether you have it or not. It gets better: this test will cost you $0.

And better still. It's online. Yup, no blood, no tissue, no DNA, no biological samples of any kind are necessary.

In fact, to get your result, you just need to scroll down. Are you sitting down? Ok, go ahead and scroll.

Processing ...

Processing ...

Still processing ...

Processing ...

Be patient, you didn't think this would happen immediately, did you?

Processing ...

Yes, still processing. This is sophisticated technology!

Getting there ...

Just checking in one more time. I'm serious when I say this test is 99% accurate. It's not perfect, but no test is, right? OK, keep scrolling.

Figure 1: Test Result. (Unnecessary disclaimer: This is not a real medical test. It's a teaching tool.)

I told you it was 99% accurate. Yes, that image always reads "No," but it turns out that of the 322 million people living in the U.S., about 0.5%-1% have celiac disease (which is distinct from gluten sensitivity, BTW). So by guessing "No," I'm right 99% of the time. Venture capitalists can reach me at perry@methodsman.com. Get in on the ground floor of this highly accurate test system.

Yes, nitpickers, the population of people who use this site is certainly systematically different from the general population of the U.S. But this is supposed to be educational.

This little experiment in terrible testing was brought to you by my immense frustration at the way the word "accuracy" is thrown around in the science press. Here are a few of my favorite examples:

Figure 2: The irony of using "accuracy" in your title.

See, accuracy has a very precise biostatistical definition. It is defined as the proportion of correct test results over all test results. That's it. My "test" above guesses "No" all the time. It is correct, therefore, 99 times out of 100, giving it 99% accuracy.

Accuracy, then, is inextricably tied to the prevalence of the disease. The rarer a disease is, the easier it is to develop a highly "accurate" test.
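To make that prevalence dependence concrete, here's a minimal sketch in Python (my illustration, not part of the original test) computing the accuracy of the "Always No" test at a few hypothetical prevalences:

```python
# Minimal sketch: accuracy as a function of disease prevalence.
# For any yes/no test, accuracy = prevalence * sensitivity + (1 - prevalence) * specificity.
def accuracy(prevalence, sensitivity, specificity):
    return prevalence * sensitivity + (1 - prevalence) * specificity

# The "Always No" test never says yes: sensitivity = 0, specificity = 1.
for prevalence in (0.01, 0.10, 0.50):
    print(f"prevalence {prevalence:.0%} -> accuracy {accuracy(prevalence, 0.0, 1.0):.0%}")
# prevalence 1% -> accuracy 99%
# prevalence 10% -> accuracy 90%
# prevalence 50% -> accuracy 50%
```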

And of course, a completely useless test, like the one above, can be highly "accurate" in that sense, making the use of accuracy to compare two tests also useless.

In fact, a better test might be less accurate than a worse test. Let's imagine that I have a test that returns a "No" 90% of the time if you are celiac-disease-free, and returns a "Yes" 90% of the time if you have celiac disease. Such a test would be correct, by definition, 90% of the time, and thus have an accuracy of 90%. That's quite a bit worse than my test above.
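(If you want to see the arithmetic, using the prevalence-weighted formula sketched above and the roughly 1% prevalence figure: accuracy = 0.01 × 0.90 + 0.99 × 0.90 = 0.90. The prevalence cancels out, because this test is right 90% of the time whether or not you have the disease.)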

Nevertheless, this test would actually be more useful. While it's intuitive that this is true (my "No" picture above can't possibly have clinical utility), understanding exactly why requires a bit of statistical savvy.

One problem with accuracy is that it assigns one number to describe the properties of a test. But a test isn't quite that simple. A test has performance characteristics when performed in people with the disease of interest and when performed in people without the disease of interest. To disentangle these, we need to make what is called a contingency table or, more simply, a "2x2" table. You've seen these before. I put the disease condition on top and my test result on the side, and assign everyone to a box. Here's what the 2x2 table looks like for the "Always No" test above:

Figure 3: Assuming a population of 1 million people, this is how my test performs. Accuracy=99%!

I've highlighted in green the two boxes that matter when it comes to accuracy. Those are the boxes where my test got it right. I do really well among those without celiac disease. My performance in those with celiac disease is not so hot.

Now what about our merely 90% accurate test? The contingency table looks like this:

Figure 4: An actual, useful test. Accuracy=90%.

Now the real test doesn't perform as well among those without celiac disease – it will lead to a lot of false positives. But it's light-years ahead among those with celiac disease.
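If you'd like to reproduce those two tables, here's a minimal Python sketch (my own illustration, assuming the same 1 million people and 1% prevalence used in the figures):

```python
# Minimal sketch: 2x2 table counts for 1,000,000 people at 1% celiac prevalence.
def two_by_two(n, prevalence, sensitivity, specificity):
    diseased = round(n * prevalence)
    healthy = n - diseased
    tp = round(diseased * sensitivity)  # test says "Yes", person has celiac disease
    fn = diseased - tp                  # test says "No",  person has celiac disease
    tn = round(healthy * specificity)   # test says "No",  person is celiac-free
    fp = healthy - tn                   # test says "Yes", person is celiac-free
    accuracy = (tp + tn) / n
    return tp, fp, fn, tn, accuracy

print(two_by_two(1_000_000, 0.01, 0.0, 1.0))  # "Always No" test: (0, 0, 10000, 990000, 0.99)
print(two_by_two(1_000_000, 0.01, 0.9, 0.9))  # 90%/90% test:     (9000, 99000, 1000, 891000, 0.9)
```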

Now, we can cut the above data in lots of ways. We can look at the sensitivity of the test, which tells us the chance that the test will be positive given that you have celiac disease. The Methods Man test has a sensitivity of 0%, while the actual test has a sensitivity of 90%. One point for the actual test.

We can look at specificity, which tells us how likely the test is to be negative if you don't have celiac disease. Methods Man test: 100%; real test: 90%. One point for the fake test.

The nice thing about sensitivity and specificity is that they are measures that are independent of disease prevalence, since they are calculated separately in groups with and without the disease. But, as the above example illustrates, neither of these numbers alone is enough to prove that a test is good. Showing that your new, expensive test is more sensitive or more specific than an older test is good marketing, but it doesn't tell the whole story. How can we combine sensitivity and specificity meaningfully? Should we go for the test with the best, what, sum of sensitivity and specificity? (For you stats nerds out there, the average of sensitivity and specificity is called "balanced accuracy" and is actually a pretty good metric.)
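As a quick illustration (again, my own sketch rather than anything from a real implementation), sensitivity, specificity, and balanced accuracy all fall straight out of the 2x2 counts:

```python
# Sketch: sensitivity, specificity, and balanced accuracy from 2x2 counts.
def summarize(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # chance of a positive result given disease
    specificity = tn / (tn + fp)  # chance of a negative result given no disease
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

print(summarize(0, 0, 10_000, 990_000))          # Methods Man test: (0.0, 1.0, 0.5)
print(summarize(9_000, 99_000, 1_000, 891_000))  # real test:        (0.9, 0.9, 0.9)
```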

The real answer is "it depends on what you are using the test for." Screening tests should have high sensitivity. We are OK with false positives – we just want to make sure we catch all those true positives. Confirmatory tests should have high specificity – we want to make really sure you have what we think you have.

We want to put sensitivity and specificity together. The easiest way to do this is to imagine you have two people, one with celiac disease, and one without:

Figure 5: Two patients, both alike in dignity.

We count how often each test gets it right. Then we repeat the experiment, again and again. Always with two patients, always where one has the disease and one doesn't. How often is the Methods Man test correct? 50% of the time. It's a coin flip. How often is the actual test correct? 90% of the time. By setting up the experiment this way, we've taken disease prevalence out of the picture, removing the bias that "accuracy" introduces.
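Here's a small simulation sketch of that repeated two-patient experiment (my own illustration; I'm counting a tie, where the test gives both patients the same answer, as half credit, which is the usual AUC convention and matches the 50% and 90% figures above):

```python
# Sketch: the two-patient experiment. Each round, draw one patient with celiac
# disease and one without, test both, and score 1 if the test flags the sick
# patient and not the healthy one, 0.5 if it gives both the same answer (a tie).
import random

def paired_experiment(sensitivity, specificity, trials=100_000, seed=1):
    rng = random.Random(seed)
    score = 0.0
    for _ in range(trials):
        sick_positive = rng.random() < sensitivity     # true positive on the sick patient
        healthy_positive = rng.random() > specificity  # false positive on the healthy patient
        if sick_positive and not healthy_positive:
            score += 1.0
        elif sick_positive == healthy_positive:
            score += 0.5
    return score / trials

print(paired_experiment(0.0, 1.0))  # Methods Man test: 0.5 (always a tie)
print(paired_experiment(0.9, 0.9))  # real test: about 0.9
```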

As you can see, the statistic we got ranges from 50% or 0.5 (a worthless test) to 100% or 1 (a perfect test). This corresponds to something called the "AUC", or Area Under the receiver operating characteristic Curve. It is a statistic that can be used to directly compare two different diagnostic tests without regard to disease prevalence. It can also be used to evaluate tests that give a continuous metric as an outcome (rather than the binary results above), but we'll avoid that can of worms for the time being.

In short, what we are interested in when we hear about a fancy new diagnostic test is not the accuracy. Really, we want to know how good it is at discriminating between those who have the disease and those who don't. The Methods Man test is useless because it is completely nondiscriminatory (like me!). Its AUC is 0.5. The actual test is a much better discriminator.

Look, I get that "accuracy" seems like a good word to use for lay audiences. In fact, sometimes when the term "accuracy" is used, the science writer is actually referring to the AUC. But we need a shorthand way to tell the difference. I favor the term "predictive power," where we define predictive power as (AUC-0.5)*2, mostly because it sounds awesome. But it also gives you a way to compare how much better the test is than a coin flip. Compare these two headlines:

"Methods Man test identifies celiac disease with 99% accuracy."

Versus:

"Methods man test identifies celiac disease with 0% predictive power."

In closing, I'd like to apologize to all of my shareholders.