Josh Rushton writes:

I’ve been following your blog for a while and checked in today to see if there was a thread on last week’s big-splash Stanford antibody study (the one with the shocking headline that they got 50 positive results in a “random” sample of 3330 antibody tests, suggesting that nearly 2% of the population has been infected “under the radar”). I didn’t see anything, so I thought I’d ask if you’d consider opening a discussion. This paper is certainly relevant to the MrP thread on politicization of the covid response, in that the paper risks injecting misinformation into an already-broken policy discussion. But I think it would be better to use it as a case study on poor statistics and questionable study design. I don’t mean to sound harsh, but if scientists are afraid to “police” ourselves, I don’t know how we can ask the public to trust us. Simply put, I see two potentially fatal flaws with the study (full disclosure: I [Rushton] haven’t read the entire paper — a thousand apologies if I’m jumping the gun — but it’s hard to imagine these getting explained away in the fine print): The authors’ confidence intervals cannot possibly be accounting for false positives correctly (I think they use the term “specificity” to mean “low rate of false positives”). I say this because the test validation included a total of 30+371 pre-covid blood tests, and only 399 of them came back negative. I know that low-incidence binomial CIs can be tricky, and I don’t know the standard practice these days, but the exact binomial 95% CI for the false-positive rate is (0.0006, 0.0179); this is pretty consistent with the authors’ specificity CI (98.3%, 99.9%). For rates near the high end of this CI, you’d get 50 or more false positives in 3330 tests with about 90% probability.
Hard to sort through this with strict frequentist logic (obviously a Bayesian could make short work of it), but the common-sense take-away is clear: It’s perfectly plausible (in the 95% CI sense) that the shocking prevalence rates published in the study are mostly, or even entirely, due to false positives. So the fact that their prevalence CIs don’t go anywhere near zero simply can’t be right.

Recruitment was done via facebook ads with basic demographic targeting. Since we’re looking for a feature that affects something like 2% of the population (or much, much less), we really have to worry about self selection. They may have discussed this in the portions of the paper I didn’t read, but I can’t imagine how researchers would defeat the desire to get a test if you had reason to believe that you, or someone near you, had the virus (and wouldn’t some people hide those reasons to avoid being disqualified from getting the test?)…

Pretty harsh words—but this is just some guy sending me an email. I’ll have to read the paper and judge for myself, which I did with an open mind. (Let me assure you that I did not title this post until after writing most of it.)
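
Before getting to the paper, here's a quick way to check the arithmetic in Rushton's first point. This is a minimal sketch using only the Python standard library: it computes the exact (Clopper-Pearson) 95% interval for the false-positive rate from 2 positives in 401 pre-covid samples, then the chance of seeing 50 or more false positives in 3330 tests if the true rate sat at the top of that interval. The function names are mine, and I'm assuming "exact binomial" in the email means Clopper-Pearson.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def bisect_increasing(g, lo=0.0, hi=1.0, iters=100):
    """Root of an increasing function g on [lo, hi], found by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(y, n, level=0.95):
    """Exact binomial interval for a proportion, given y events in n trials."""
    alpha = 1.0 - level
    # Lower bound: the p at which P(X >= y) equals alpha/2.
    lower = 0.0 if y == 0 else bisect_increasing(
        lambda p: (1.0 - binom_cdf(y - 1, n, p)) - alpha / 2.0)
    # Upper bound: the p at which P(X <= y) equals alpha/2.
    upper = 1.0 if y == n else bisect_increasing(
        lambda p: alpha / 2.0 - binom_cdf(y, n, p))
    return lower, upper

# 30 + 371 = 401 pre-covid samples, of which 399 were negative (2 false positives).
fpr_low, fpr_high = clopper_pearson(2, 401)

# Chance of 50 or more false positives in 3330 tests at the upper-end rate.
prob_50_or_more = 1.0 - binom_cdf(49, 3330, fpr_high)

print(f"false-positive rate, 95% interval: ({fpr_low:.4f}, {fpr_high:.4f})")
print(f"P(50 or more false positives in 3330 tests): {prob_50_or_more:.2f}")
```

If the numbers in the email are right, this should land close to (0.0006, 0.0179) and roughly 90%.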

It’s been a busy month for Stanford on the blog. First there were these pre-debunked forecasts we heard from a couple of assholes from the Hoover Institution, then some grad students sent us this pretty sane literature review, and now this!

Reading through the preprint

Anyway, after receiving the above email, I clicked through and read the preprint, “COVID-19 Antibody Seroprevalence in Santa Clara County, California,” by Eran Bendavid et al., which reports:

On 4/3-4/4, 2020, we tested county residents for antibodies to SARS-CoV-2 using a lateral flow immunoassay. Participants were recruited using Facebook ads targeting a representative sample of the county by demographic and geographic characteristics. We report the prevalence of antibodies to SARS- CoV-2 in a sample of 3,330 people, adjusting for zip code, sex, and race/ethnicity. . . . The unadjusted prevalence of antibodies to SARS-CoV-2 in Santa Clara County was 1.5% . . . and the population-weighted prevalence was 2.8%.

That’s positive test results. Then you have to adjust for testing errors:

Under the three scenarios for test performance characteristics, the population prevalence of COVID-19 in Santa Clara ranged from 2.5% to 4.2%. [I’ve rounded all numbers to a single decimal place for my own sanity. — AG]

To discuss this paper, I’ll work backward, starting from the conclusion and going through the methods and assumptions.

Let’s take their final estimate, 2.5% to 4.2%, and call it 3%. Is a 3% rate of coronavirus antibodies in Santa Clara county a high or a low number? And does this represent good news or bad news?

First off, 3% does not sound implausible. If they said 30%, I’d be skeptical, given how everyone’s been hiding out for a while, but 3%, sure, maybe so. Bendavid et al. argue that if the number is 3%, that’s good news, because Santa Clara county has 2 million people and only an estimated 100 deaths . . . 0.03*(2 million)/100 = 600, so that implies that 1/600 of exposed people there died. So that’s good news, relatively speaking: we’d still like to avoid 300 million Americans getting the virus and 500,000 dying, but that’s still better than the doomsday scenario.

It’s hard to wrap my head around these numbers because, on one hand, a 1/600 death rate sounds pretty low; on the other, 500,000 deaths is a lot. I guess 500,000 is too high because nobody’s saying that everyone will get exposed.

The study was reported in the news as “Santa Clara county has had 50 to 85 times more cases than we knew about, Stanford estimates.” It does seem plausible that lots more people have been exposed than have been tested for the disease, as so few tests are being done.

At the time of this writing, NYC has about 9000 recorded coronavirus deaths. Multiply by 600 and you get 5.4 million. OK, I don’t think 5.4 million New Yorkers have been exposed to coronavirus. New York only has 8.4 million people total! I don’t think I know anyone who’s had coronavirus. Sure, you can have it and not have any symptoms—but if it’s as contagious as all that, then if I had it, I guess all my family would get it too, and then I’d guess that somebody would show some symptoms.
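
To keep the back-of-the-envelope numbers straight, here is the same arithmetic as a few lines of Python. All inputs are the round figures used above (3% prevalence, 2 million people, 100 deaths, 300 million Americans, 9000 NYC deaths, 8.4 million New Yorkers); the variable names are just mine, and nothing here is more precise than those inputs.

```python
# Santa Clara county: implied number of infections per death.
county_population = 2_000_000
assumed_prevalence = 0.03
county_deaths = 100
infections_per_death = assumed_prevalence * county_population / county_deaths

# Extrapolating that ratio to everyone in the U.S. getting infected.
us_population = 300_000_000
implied_us_deaths = us_population / infections_per_death

# Running the same ratio backward from recorded NYC deaths.
nyc_deaths = 9000
nyc_population = 8_400_000
implied_nyc_infections = nyc_deaths * infections_per_death
implied_nyc_share = implied_nyc_infections / nyc_population

print(f"infections per death: {infections_per_death:.0f}")
print(f"implied U.S. deaths if everyone is infected: {implied_us_deaths:,.0f}")
print(f"implied NYC infections: {implied_nyc_infections:,.0f} "
      f"({implied_nyc_share:.0%} of the city)")
```

The last line is the one that bothers me: the 1/600 ratio would imply that well over half of New York City had already been infected.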

That’s fine—for reasons we’ve been discussing for a while—actually, it was just a month and a half ago—it doesn’t make sense to talk about a single “case fatality rate,” as it depends on age and all sorts of other things. The point is that there’ve gotta be lots of coronavirus cases that have not been recorded, given that we have nothing close to universal or random-sample testing. But the 1/600 number doesn’t seem quite right either.

Figuring out where the estimates came from

OK, now let’s see where the Stanford estimate came from. They did a survey and found 1.5% positive tests (that’s 50 out of 3330 in the sample). Then they did three statistical adjustments:

1. They poststratified on zip code, sex, and ethnicity to get an estimate of 2.8%. Poststratification is a standard statistical technique, but some important practical issues arise regarding what to adjust for.

2. They adjusted for test inaccuracy. This is a well-known probability problem—with a rare disease and an imperfect test, you can easily end up with most of your positive test results being false positives. The error rates of the test are the key input to this calculation.

3. They got uncertainty intervals based on the sampling in the data. That’s the simplest part of the analysis, and I won’t talk much about it here. It does come up, though, in the implicit decision of the paper to focus on point estimates rather than uncertainty ranges. To the extent that the point estimates are implausible (e.g., my doubts about the 1/600 ratio above), that could point toward a Bayesian analysis that would account for inferential uncertainty. But I’m guessing that the uncertainty due to sampling variation is minor compared to uncertainty arising from the error rate of the test.

I’ll discuss each of these steps in turn, but I also want to mention three other issues:

4. Selection bias. As Rushton wrote, it could be that people who’d had coronavirus symptoms were more likely to avail themselves of a free test.

5. Auxiliary information. In any such study, you’d want to record respondents’ ages and symptoms. And, indeed, these were asked about in the survey. However, these were not used in the analysis and played no role in the conclusion. In particular, one might want to use responses about symptoms to assess possible selection bias.

6. Data availability. The data for this study do not seem to be available. That’s too bad. I can’t see that there’d be a confidentiality issue: just knowing someone’s age, sex, ethnicity, and coronavirus symptoms should not be enough to allow someone to be identified, right? I guess that including zip code could be enough for some categories, maybe? But if that were the only issue, they could just pool some of the less populated zip codes. I’m guessing that the reason they didn’t release the data is simple bureaucracy: it’s easier to get a study approved if you promise you won’t release the data than if you say you will. Backasswards, that is, but that’s the world that academic researchers have to deal with, and my guess is that the turf-protectors in the IRB industry aren’t gonna let go of this one without a fight. Too bad, though: without the data and the code, we just have to guess at what was done. And we can’t do any of the natural alternative analyses.

Assessing the statistical analysis

Now let’s go through each step.

1. Poststratification.

There are 2 sexes and it seems that the researchers used 4 ethnicity categories. I’m not sure how they adjusted for zip code. From their map, it seems that there are about 60 zip codes in the county, so there’s no way they simply poststratified on all of them. They say, “we re-weighted our sample by zip code, sex, and race/ethnicity,” but “re-weighted . . . by zip code” doesn’t really say exactly what they did. Just to be clear, I’m not suggesting malfeasance here; it’s just the usual story that it can be hard for people to describe their calculations in words. Even formulas are not so helpful because they can lack key details.

I’m concerned about the poststratification for three reasons. First, they didn’t poststratify on age, and the age distribution is way off! Only 5% of their sample is 65 and over, as compared to 13% of the population of Santa Clara county. Second, I don’t know what to think about the zip code adjustment, since I don’t know what was actually done there. This is probably not the biggest deal, but given that they bothered to adjust at all, I’m concerned. Third, I really don’t know what they did, because they say they weighted to adjust for zip code, sex, and ethnicity in the general population—but in Table 1 they give their adjusted proportions for sex and ethnicity and they don’t match the general population! They’re close, but not exact. Again, I’d say this is no big deal, but I hate not knowing what was actually done.

And why did they not adjust for age? They write, “We chose these three adjustors because they contributed to the largest imbalance in our sample, and because including additional adjustors would result in small-N bins.” They should’ve called up a survey statistician and asked for help on this one: it’s a standard problem. You can do MRP—that’s what I’d do!—but even some simple raking would be fine here, I think.
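
To show what an age adjustment would even look like, here is a toy sketch of one-variable poststratification. The 5% and 13% shares for people 65 and over are the figures quoted above; the positive-test rates within each age group are made up purely for illustration, since the study's age breakdown of positives isn't available to me. This is plain poststratification on a single factor, not MRP or raking, and not what the authors did.

```python
# Share of each age group in the sample and in the county population.
# The 65+ figures come from the discussion above; under-65 is the complement.
sample_share = {"under_65": 0.95, "65_plus": 0.05}
population_share = {"under_65": 0.87, "65_plus": 0.13}

# Poststratification weight per respondent in each cell.
weights = {cell: population_share[cell] / sample_share[cell]
           for cell in sample_share}

# HYPOTHETICAL positive-test rates by age group (made-up numbers, not study data).
toy_positive_rate = {"under_65": 0.016, "65_plus": 0.010}

# Raw estimate: cells averaged with the sample's own age mix.
raw_estimate = sum(sample_share[c] * toy_positive_rate[c] for c in sample_share)

# Poststratified estimate: the same cell rates averaged with the population's age mix.
adjusted_estimate = sum(population_share[c] * toy_positive_rate[c]
                        for c in population_share)

print("weights:", {c: round(w, 2) for c, w in weights.items()})
print(f"raw: {raw_estimate:.4f}, poststratified: {adjusted_estimate:.4f}")
```

Each respondent aged 65 or over ends up counting about 2.6 times, which is exactly why small cells get noisy. The direction and size of the adjustment depend entirely on the made-up cell rates, so don't read anything into the toy output beyond the mechanics.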

There aren’t a lot of survey statisticians out there, but there are some. They could’ve called me up and asked for advice, or they could’ve stayed on campus and asked Doug Rivers or Jon Krosnick—they’re both experts on sampling and survey adjustments. I guess it’s hard to find experts on short notice. Doug and Jon don’t have M.D.’s and they’re not economists or law professors, so I guess they don’t count as experts by the usual measures.

2. Test inaccuracy.

This is the big one. If X% of the population have the antibodies and the test has an error rate that’s not a lot lower than X%, you’re in big trouble. This doesn’t mean you shouldn’t do testing, but it does mean you need to interpret the results carefully. Bendavid et al. estimate that the sensitivity of the test is somewhere between 84% and 97% and that the specificity is somewhere between 90% and 100%. I can never remember which is sensitivity and which is specificity, so I looked it up on wikipedia: “Sensitivity . . . measures the proportion of actual positives that are correctly identified as such . . . Specificity . . . measures the proportion of actual negatives that are correctly identified as such.” OK, here our concern is actual negatives who are misclassified, so what’s relevant is the specificity. That’s the number between 90% and 100%.

If the specificity is 90%, we’re sunk. With a 90% specificity, you’d expect to see 333 positive tests out of 3330, even if nobody had the antibodies at all. Indeed, they only saw 50 positives, that is, 1.5%, so we can be pretty sure that the specificity is at least 98.5%. If the specificity were 98.5%, the observed data would be consistent with zero, which is one of Rushton’s points above. On the other hand, if the specificity were 100%, then we could take the result at face value.
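
The logic of the last paragraph can be sketched in a few lines. The first function gives the expected share of positive tests for a given true prevalence, sensitivity, and specificity; the second inverts it, which is the standard textbook correction (often called the Rogan-Gladen estimator). I'm not claiming this is the exact formula in the paper, and the sensitivity of 0.9 below is just a placeholder I picked from inside the 84% to 97% range.

```python
def expected_positive_rate(prevalence, sensitivity, specificity):
    """Expected fraction of tests that come back positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives + false_positives

def corrected_prevalence(observed_rate, sensitivity, specificity):
    """Invert the formula above to estimate prevalence, clipped to [0, 1]."""
    estimate = (observed_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(estimate, 0.0), 1.0)

n_tests = 3330
observed_rate = 50 / n_tests        # 50 positives out of 3330
assumed_sensitivity = 0.9           # placeholder assumption, not from the paper

# Expected positives with nobody infected, at 90% specificity.
positives_if_spec_90 = n_tests * expected_positive_rate(0.0, assumed_sensitivity, 0.90)

# Implied prevalence at two other specificities.
prevalence_if_spec_985 = corrected_prevalence(observed_rate, assumed_sensitivity, 0.985)
prevalence_if_spec_100 = corrected_prevalence(observed_rate, assumed_sensitivity, 1.0)

print(f"expected positives at 90% specificity, zero prevalence: {positives_if_spec_90:.0f}")
print(f"implied prevalence at 98.5% specificity: {prevalence_if_spec_985:.4f}")
print(f"implied prevalence at 100% specificity: {prevalence_if_spec_100:.4f}")
```

At 98.5% specificity the implied prevalence is essentially zero; at 100% it is the observed rate divided by the sensitivity. Everything interesting happens in that narrow band in between.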

So how do they get their estimates? Again, the key number here is the specificity. Here’s exactly what they say regarding specificity:

A sample of 30 pre-COVID samples from hip surgery patients were also tested, and all 30 were negative. . . . The manufacturer’s test characteristics relied on . . . pre-COVID sera for negative gold standard . . . Among 371 pre-COVID samples, 369 were negative.

This gives two estimates of specificity: 30/30 = 100% and 369/371 = 99.46%. Or you can combine them together to get 399/401 = 99.50%. If you really trust these numbers, you’re cool: with y=399 and n=401, we can do the standard Agresti-Coull 95% interval based on y+2 and n+4, which comes to [98.0%, 100%]. If you go to the lower bound of that interval, you start to get in trouble: remember that if the specificity is less than 98.5%, you’ll expect to see more than 1.5% positive tests in the data no matter what!
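
For anyone who wants to reproduce that interval, here is a minimal sketch of the "add two successes and two failures" version of the Agresti-Coull calculation, applied to y=399 and n=401. I'm using 1.96 standard errors; using 2 instead gives the same answer after rounding. The function name is mine.

```python
import math

def agresti_coull_plus4(y, n, z=1.96):
    """Approximate 95% interval for a proportion, based on y+2 and n+4."""
    n_adj = n + 4
    p_adj = (y + 2) / n_adj
    se = math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    lower = max(0.0, p_adj - z * se)
    upper = min(1.0, p_adj + z * se)
    return lower, upper

# Pooled specificity data: 399 negatives out of 401 pre-covid samples.
spec_lower, spec_upper = agresti_coull_plus4(399, 401)

print(f"specificity interval: [{100 * spec_lower:.1f}%, {100 * spec_upper:.1f}%]")
print("lower bound is below 98.5%:", spec_lower < 0.985)
```

The point of the second line of output is that the lower end of the interval sits below the 98.5% threshold at which all 50 positives could be false.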

3. Uncertainty intervals. So what’s going on here? If the specificity data in the paper are consistent with all the tests being false positives—not that we believe all the tests are false positives, but this suggests we can’t then estimate the true positive rate with any precision—then how do they get a confident nonzero estimate of the true positive rate in the population?

It seems that two things are going on. First, they’re focusing on the point estimates of specificity. Their headline is the range from 2.5% to 4.2%, which comes from their point estimates of specificity of 100% (from their 30/30 data) and 99.5% (from the manufacturer’s 369/371). So the range they give is not a confidence interval; it’s two point estimates from different subsets of their testing data. Second, I think they’re doing something wrong, or more than one thing wrong, with their uncertainty estimates, which are “2.5% (95CI 1.8-3.2%)” and “4.2% (2.6-5.7%)” (again, I’ve rounded to one decimal place for clarity). The problem is that we’ve already seen that a 95% interval for the specificity will go below 98.5%, which implies that the 95% interval for the true positive rate should include zero.

Why does their interval not include zero, then? I can’t be sure, but one possibility is that they did the sensitivity-specificity corrections on the poststratified estimate. But, if so, I don’t think that’s right. 50 positive tests is 50 positive tests, and if the specificity is really 98.5%, you could get that with no true cases. Also, I’m baffled because I think the 2.5% is coming from that 30/30=100% specificity estimate, but in that case you’d need a really wide confidence interval, which would again go way below 98.5% so that the confidence interval for the true positive rate would include zero.

Again, the real point here is not whether zero is or “should be” in the 95% interval, but rather that, once the specificity can get in the neighborhood of 98.5% or lower, you can’t use this crude approach to estimate the prevalence; all you can do is bound it from above, which completely destroys the “50-85-fold more than the number of confirmed cases” claim.
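
Here is a crude simulation of that point. It is only a sketch under loud assumptions: it treats the 3330 tests as a simple random sample (no poststratification), fixes the sensitivity at a placeholder value of 0.9, and draws the specificity and the observed positive rate from Beta distributions that correspond to flat priors on the 399/401 and 50/3330 counts. It is not the authors' procedure and it is not a substitute for a real Bayesian analysis.

```python
import random

def simulate_prevalence(n_draws=20000, seed=1, sensitivity=0.9):
    """Draw plausible (specificity, positive-rate) pairs and correct each for test error."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Specificity given 399 negatives out of 401 known-negative samples.
        specificity = rng.betavariate(399 + 1, 2 + 1)
        # Underlying positive-test rate given 50 positives out of 3330 tests.
        positive_rate = rng.betavariate(50 + 1, 3280 + 1)
        # Standard correction for test error, clipped at zero.
        estimate = (positive_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)
        draws.append(min(max(estimate, 0.0), 1.0))
    draws.sort()
    return draws

draws = simulate_prevalence()
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
share_at_zero = sum(1 for d in draws if d == 0.0) / len(draws)

print(f"central 95% range of corrected prevalence: {lower:.3f} to {upper:.3f}")
print(f"share of draws where false positives explain everything: {share_at_zero:.1%}")
```

Under these assumptions the lower end of the range sits at zero, so what the data give you is an upper bound, not a tight estimate.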

They do talk about this a bit: “if new estimates indicate test specificity to be less than 97.9%, our SARS-CoV-2 prevalence estimate would change from 2.8% to less than 1%, and the lower uncertainty bound of our estimate would include zero. On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher.” But I think this misses the point. First, if the specificity were less than 97.9%, you’d expect more than 70 positive cases out of 3330 tests. But they only saw 50 positives, so I don’t think that 1% rate makes sense. Second, the bit about the sensitivity is a red herring here. The uncertainty here is pretty much entirely driven by the uncertainty in the specificity.

This is all pretty much what Rushton said in one paragraph of his email. I just did what was, in retrospect, overkill here because I wanted to understand what the authors were doing.

4. Selection bias. In their article, Bendavid et al. address the possibility: “Other biases, such as bias favoring individuals in good health capable of attending our testing sites, or bias favoring those with prior COVID-like illnesses seeking antibody confirmation are also possible.” That makes sense. Bias could go in either direction. I don’t have a good sense of this, and I think it’s fine to report the results of a self-selected population, as long as (a) you make clear the sampling procedure, and (b) you do your best to adjust.

Regarding (b), I wonder if they could’ve done more. In addition to my concerns expressed above regarding insufficient poststratification (in turn driven by their apparent lack of consultation with a statistics expert), I also wonder if they could’ve done something with the data they collected on “underlying co-morbidities, and prior clinical symptoms.” I don’t see these data anywhere in the report, which is too bad. They could’ve said what percentage of the people in their survey reported any coronavirus-related symptoms.

5. Auxiliary information. and 6. Data availability. As noted above, it seems that the researchers collected some information that could have helped us understand their results, but these data are unavailable to us.

Jeez—I just spent 3 hours writing this post. I don’t think it was worth the time. I could’ve just shared Rushton’s email with all of you—that would’ve just taken 5 minutes!

Summary

I think the authors of the above-linked paper owe us all an apology. We wasted time and effort discussing this paper whose main selling point was some numbers that were essentially the product of a statistical error.

I’m serious about the apology. Everyone makes mistakes. I don’t think the authors need to apologize just because they screwed up. I think they need to apologize because these were avoidable screw-ups. They’re the kind of screw-ups that happen if you want to leap out with an exciting finding and you don’t look too carefully at what you might have done wrong.

Look. A couple weeks ago I was involved in a survey regarding coronavirus symptoms and some other things. We took the data and ran some regressions and got some cool results. We were excited. That’s fine. But we didn’t then write up a damn preprint and set the publicity machine into action. We noticed a bunch of weird things with our data, lots of cases were excluded for one reason or another, then we realized there were some issues of imbalance so we couldn’t really trust the regression as is, at the very least we’d want to do some matching first . . . I don’t actually know what’s happening with that project right now. Fine. We better clean up the data if we want to say anything useful. Or we could release the raw data, whatever. The point is, if you’re gonna go to all this trouble collecting your data, be a bit more careful in the analysis! Careful not just in the details but in the process: get some outsiders involved who can have a fresh perspective and aren’t invested in the success of your project.

Also, remember that reputational inference goes both ways. The authors of this article put in a lot of work because they are concerned about public health and want to contribute to useful decision making. The study got attention and credibility in part because of the reputation of Stanford. Fair enough: Stanford’s a great institution. Amazing things are done at Stanford. But Stanford has also paid a small price for publicizing this work, because people will remember that “the Stanford study” was hyped but it had issues. So there is a cost here. The next study out of Stanford will have a little less of that credibility bank to borrow from. If I were a Stanford professor, I’d be kind of annoyed. So I think the authors of the study owe an apology not just to us, but to Stanford. Not to single out Stanford, though. There’s also Cornell, which is known as that place with the ESP professor and that goofy soup-bowl guy who faked his data. And I teach at Columbia; our most famous professor is . . . Dr. Oz.

It’s all about the blood

I’m not saying that the claims in the above-linked paper are wrong. Maybe the test they are using really does have a 100% specificity rate and maybe the prevalence in Santa Clara county really was 4.2%. It’s possible. The problem with the paper is that (a) it doesn’t make this reasoning clear, and (b) their uncertainty statements are not consistent with the information they themselves present.

Let me put it another way. The fact that the authors keep saying that “50-85-fold” thing suggests to me that they sincerely believe that the specificity of their test is between 99.5% and 100%. They’re clinicians and medical testing experts; I’m not. Fine. But then they should make that assumption crystal clear. In the abstract of their paper. Something like this:

We believe that the specificity of the test used in this study is between 99.5% and 100%. Under this assumption, we conclude that the population prevalence in Santa Clara county was between 1.8% and 5.7% . . .

This specificity thing is your key assumption, so place it front and center. Own your modeling decisions.

P.S. Again, I know nothing about blood testing. Perhaps we could convene an expert panel including George Shultz, Henry Kissinger, and David Boies to adjudicate the evidence on this one?

P.P.S. The authors provide some details on their methods here. Here’s what’s up:

– For the poststratification, it turns out they do adjust for every zip code. I’m surprised, as I’d think that could give them some noisy weights, but, given our other concerns with this study, I guess noisy weights are the least of our worries. Also, they don’t quite weight by sex x ethnicity x zip; they actually weight by the two-way margins, sex x zip and ethnicity x zip. Again, not the world’s biggest deal. They should’ve adjusted for age, too, though, as that’s a freebie.

– They have a formula to account for uncertainty in the estimated specificity. But something seems to have gone wrong, as discussed in the above post. It’s hard to know exactly what went wrong since we don’t have the data and code. For example, I don’t know what they are using for var(q).

P.P.P.S. Let me again emphasize that “not statistically significant” is not the same thing as “no effect.” What I’m saying in the above post is that the information in the above-linked article does not provide strong evidence that the rate of people in Santa Clara county exposed by that date was as high as claimed. Indeed, the data as reported are consistent with the null hypothesis of no exposure, and also with alternative hypotheses such as exposure rates of 0.1% or 0.5% or whatever. But we know the null hypothesis isn’t true—people in that county have been infected! The data as reported are also consistent with infection rates of 2% or 4%. Indeed, as I wrote above, 3% seems like a plausible number. As I wrote above, “I’m not saying that the claims in the above-linked paper are wrong,” and I’m certainly not saying we should take our skepticism in their specific claims and use that as evidence in favor of a null hypothesis. I think we just need to accept some uncertainty here. The Bendavid et al. study is problematic if it is taken as strong evidence for those particular estimates, but it’s valuable if it’s considered as one piece of information that’s part of a big picture that remains uncertain. When I wrote that the authors of the article owe us all an apology, I didn’t mean they owed us an apology for doing the study, I meant they owed us an apology for avoidable errors in the statistical analysis that led to overconfident claims. But, again, let’s not make the opposite mistake of using uncertainty as a way to affirm a null hypothesis.

P.P.P.P.S. I’m still concerned about the zip code weighting. Their formula has N^S_zsr in the denominator: that’s the number of people in the sample in each category of zip code x sex x race. But there are enough zip codes in the county that I’m concerned that weighting in this way will be very noisy. This is a particular concern here because even the unweighted estimate of 1.5% is so noisy that, given the data available, it could be explained simply by false positives. Again, this does not make the substantive claims in the paper false (or true), it’s just one more reason these estimates are too noisy to do more than give us an upper bound on the infection rate, unless you want to make additional assumptions. You could say that the analysis as performed in the paper does make additional assumptions, it just does so implicitly via forking paths.

P.P.P.P.P.S. A new version of the article has been released; see discussion here and here.

P.P.P.P.P.P.S. See here for our analysis of the data published in the revised report. Our conclusion: