by Carl V Phillips

I am skipping the Introduction section here. Which is to say, I assume that the reader is at least somewhat familiar with the overwhelming evidence that people who smoke are much less likely to have bad COVID-19 outcomes. It turns out that this phenomenon, and the (often misguided) chatter around it, provides a great case study for some general science lessons. Here are a few of those:

1. Small nonsystematic collections of observations < systematic observations / experiments < large, somewhat systematic, reasonably comprehensive collections of observations.

To unpack that, anyone who follows pop discussions of science has learned that systematic focused studies, of whatever sort, offer better information than happenstance data collection. That is basically true when they are on the same scale. So if we have a case series of hospitalized COVID-19 patients that happens to have smoking data, it is not as good as a systematic study that focused on smoking status and COVID outcomes. This is true for various reasons — e.g., the smoking data might not have been accurate because it was not anyone’s real focus. The comparison for those studies in this case is population averages (i.e., the smoking prevalence among the patients is observed to be lower than for average people in that country), which do not offer a perfect comparison. A study that focused on smoking status, and tried hard to collect that information correctly and figure out what baseline to compare it to, is a lot more reliable.

But in this case, any limitations of the individual studies are made up for by sheer volume. We have a zillion of the former imperfect-but-easy comparisons, and a decent handful of the latter. And almost all of them support the claim that smoking is protective. It really is approximately a zillion. See this thread by @phil_w888 on Twitter, in which he has collected, at the time of this writing, 762 published reports that inform the question of whether smokers have lower rates of COVID-19. Almost all the results point in the same direction; the rare exceptions are what we would expect from normal study error.

This is absolutely overwhelming evidence. Something would have to be systematically misleading about hundreds of different observations that use different methods in different populations, and have many variations in exposure and outcome measurement. Not to say that is impossible. If we had 762 studies comparing height to breast cancer risk, they might all show that being taller is strongly protective (because men). But it is hard to imagine such a failure to recognize an important variable here. No one has proposed a plausible explanation other than causation.
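To put a rough number on "hard to explain by chance": here is a toy sign-test calculation (the 700-of-762 split is hypothetical, not the actual tally from the thread) showing how improbable a near-unanimous direction would be if each report were just a coin flip.

```python
# Toy sign-test intuition (hypothetical split, not the actual tally):
# how likely is a near-unanimous direction if each report were a coin flip?
from math import comb

n = 762  # published reports in the collection
k = 700  # hypothetical number pointing in the protective direction

# Probability of >= k same-direction results under a 50/50 null.
p = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(p)  # astronomically small
```

This is not a substitute for thinking about systematic error (the height example above), but it does show that "normal study error" cannot produce this pattern on its own.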

Anyone who says, “we need to do study X to see if this is really true” is just chasing grant money. Anyone who says “this new study shows it is really true” is apparently not familiar with the hundreds of previous reports. The statement “this new result should convince the deniers” is wrong; it might be what gets someone’s attention for the first time, but that is different. Anyone familiar with the data who did not believe this was real clear back in April, let alone now, is unlikely to be swayed by any evidence.

The evidence is reasonably systematic, by which I mean that it was not apparently cherrypicked in any way to try to “show” something that is not true. It is not limited to specific and possibly odd populations. It is consistent across methodologies, giving us confidence that it is not an artifact of study design.

If you are seeing a parallel to the question “does vaping cause people to quit smoking”, you are spot-on. We do not know that to be true because of some contrived ultra-systematic little study. All of those are trumped by the much broader knowledge.

2. “Meta-analysis” is usually junk science, and this is a great example of that.

All those “meta-analyses” of the smoking-COVID results you see floating around are complete junk. By junk science I do not just generically mean “bad science” but rather “a methodology that even if done as well and honestly as possible produces meaningless results”. As I have previously explained at length (e.g., here), the meta-analysis method of averaging together a bunch of study results (which is not the only form of meta-analysis but is what “meta-analysis” always means when used by people who are not expert in methodology) is only valid if you can legitimately imagine that all the studies are really slices of a single large study (rows of data) that were separated into different datasets for some reason. If that were the case, it would make sense to put them back together.

This is sometimes(!) kinda(!) the case for clinical treatment experiments where the treatments are mostly(!) the same, the outcome measures are reasonably(!) consistent, and people are mostly(!) functioning just as biological machines and thus are fairly interchangeable. Notice all the emphatics in that sentence, though. Even for this best-case scenario, there are departures from the implicit “the data was separated at birth” assumption. As soon as we depart from that simple case, the averaging becomes absurd.

Consider an example most of you are familiar with, clinical intervention trials where people who smoke are persuaded to try to switch to vaping. This is a collection of very different interventions (though they can all be sloppily described as “give vapes to smokers”) in different populations — different at both the macro level (the larger population in space and time that the study is drawing from) and the micro level (exactly which unusual group from among that population volunteered to be part of the study). Even the relatively simple outcome measures vary across datasets. So averaging the results together is utter nonsense. What is it the average of? The answer is not even something vague and barely-meaningful like, “the average result when you try the many different possible methods across many different peoples and times” because it is not even that. It is the average of results from the particular collection of methods and people that researchers happened to write down, which is unlikely to represent the full range of options. Moreover — if you want to put the icing on the cake of this absurdity — the average is weighted by however big the particular individual study happened to be. So if there was one study of Estonians who were given a cigalike in 2017, but it was only 20 subjects, then its result will barely affect the average, but if they had happened to enroll 1000 subjects in the exact same study with the exact same result, it would have a large effect on the average. Just think about that. Anyone who thinks all this is legitimate scientific methodology, well, I have some bad news for you about astrology also.
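To make the weighting arithmetic concrete, here is a toy sketch (all effect sizes and sample counts are invented) of a naive sample-size-weighted pooled average, showing how inflating one study's enrollment, with the identical result, moves the pooled number:

```python
# Toy illustration (invented numbers): a naive sample-size-weighted
# pooled estimate, as in simple-minded meta-analysis.

def pooled_estimate(studies):
    """Average of effect estimates, weighted by sample size."""
    total_n = sum(n for _, n in studies)
    return sum(effect * n for effect, n in studies) / total_n

# Hypothetical relative risks from three very different studies:
# (effect estimate, number of subjects).
small_pool = [(0.5, 20), (1.2, 500), (0.9, 300)]
print(round(pooled_estimate(small_pool), 3))  # 1.073

# The same hypothetical 2017 cigalike study, identical result, but with
# 1000 subjects instead of 20 -- now it dominates the pooled average.
big_pool = [(0.5, 1000), (1.2, 500), (0.9, 300)]
print(round(pooled_estimate(big_pool), 3))  # 0.761
```

The pooled number swings from above 1 to well below 1 purely because of how many people one research group happened to enroll, with no change in any study's result.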

The data on smoking and COVID makes the smoking cessation trials look like medical treatment trials, in terms of heterogeneity and representativeness. As noted in the previous point, one of the strengths of the data supporting the smoking-COVID conclusion is that it is extremely heterogeneous. Even the collections that are the “same” are very different. For example, even if you limit your collection to case series of COVID patients from Chinese hospitals, it is still a collection of whatever people happened to be in these particular hospitals on particular days (and they differ at the macro and micro levels even though they are more similar than “all the people in the world”), with very different outcome endpoints (each study was about some particular outcome of interest to one group of researchers), different exposure measures, and different data collection methods. It is absurd to average these together. Needless to say, averaging these and whatever data happens to appear in French and American national statistics, and some case studies in Germany, and so on, is several steps more absurd.

Moreover, even if the methodology did not deliver an absurd weighted average of whatever happens to have been reported, why would you want to know the average? Imagine that we had data for every COVID hospitalization for each country in the world, rather than just a random nonsystematic subset of those, and could average them all together. Why would we want to? If the apparent protective effect were greater in China than in France (this seems to be the case), that is more useful information than the average of them. The same is true if the association varies by outcome measures or whatever. Taking the average discards useful information and produces nothing of value.

3. “There is a plausible mechanism” rhetoric is approximately worthless.

Being able to come up with some plausible story for why an association in the data represents causation is not an impressive feat. It can pretty much always be done. This is particularly true when the story is about biochemistry, where human knowledge is still so primitive and where few people have any real intuition for it (unlike with behavioral stories). It is informative when someone proposes the story and then specifically tests it (i.e., “under this story, we would expect to see X and not expect to see Y, and so we looked to see if X and Y….”). But an ad hoc story to explain an existing observation is entirely different.

If data came out that wearing cloth masks does a better job of reducing SARS-CoV-2 spread than paper masks, it would be accompanied by a collection of just-so stories about what mechanism is causing it. If the data said paper worked better than cloth, it too would be accompanied by mechanistic stories. Sitting there by themselves, either set of stories would be plausible and indeed compelling because it was the only thing being presented. People would be saying, “yeah, because of [mechanistic story], we would expect this difference”. Whichever way the difference went, the commentary would be all about why this should be expected to be the case.

I suppose that if no one can come up with any story for how a particular association is causal, that would be a strike against the claim of causation. Though sometimes even when the first reaction is “nah, no way this is real”, it turns out that it is and no one understood the story, so this is far from definitive. Perhaps if data showed, say, that smoking weed on Tuesdays is strongly associated with diminished productivity, but smoking on Wednesdays is not, the right assessment would be “we cannot figure out any story that would make this a real causal difference, so we conclude it is a meaningless artifact of our data.” But I’ll bet that half of you are already coming up with stories for why that contrast might really be causal, so you just proved my point.

So when you hear a story about what is causing a particular observed pattern in the data, keep in mind that it is always easy to make up such a story. Of course, if someone performs a proper focused test to see whether that story, rather than some alternative, holds up, that is good science. That is the essence of science. But don’t expect to see it in public health research, where they will just make up a story and declare it to be true without ever testing it.

3a. There is little reason to believe that the protective factor is nicotine.

This is the specific implication of the previous point. You may have seen the claims that nicotine seems to be what makes smoking protective because “…blah blah…ACE enzymes…blah blah…lots of other words that you think must be true because they sound all sciencey”. While these stories are plausible, the existence of the stories is not informative. Again, plausible stories are always possible.

Smoking is a complex exposure that has a lot of effects on people’s biology and behavior. SARS-CoV-2 transmission is complicated and we do not fully understand it, and COVID-19 severity is even more complicated and we barely understand it at all. There are countless possible causal pathways there, only some of which are about the nicotine. Just as people with little knowledge of tobacco use think of nicotine as being the harmful aspect of smoking, people who are immersed in vaping and NRT politics tend to think of nicotine as the beneficial aspect of smoking. They have an anti-scientific prejudice (like most people do about most things, but it is less forgivable in this context) and so make up a story to make the data fit their assumptions.

The scientific approach would be to withhold judgment until we have some data that resolves the question. The easiest and most obvious observation would be whether exclusive vapers or long-term NRT users have the same protective association that smokers do. Unfortunately, we will not stumble into that observation like we did with smoking because there are fewer of them and data collection about vaping/NRT status is of even lower quality than for smoking, and often not collected at all. Thus there needs to be a focused systematic study to answer this, and it does not seem to be happening.

It is worth adding that trials in which smokers are given nicotine patches when hospitalized for COVID-19 (which are being done, because there is always money for treatment research) are not helpful in answering this question. The biggest variable in that mix is whether or not patients are forced into the stress of nicotine withdrawal. (It would certainly be very useful to know if this is killing people, but it does not address the question of why smoking is protective.) Indeed, even giving non-smoking patients nicotine patches would not address the question very effectively because (a) effects of an ongoing consumption choice do not necessarily start the first week of consumption and (b) the protective effect of smoking occurs before someone becomes a patient, so any effect at this stage might be an entirely different phenomenon.

3b. Where along a causal pathway the protection occurs is also unclear.

This is another specific point relating to story-telling. Almost all of the available comparisons are based on clinically-significant (usually hospitalized) COVID cases. The deficit of smokers in that group could mean that smoking prevents colonization with SARS-CoV-2, or that it prevents the colonization from progressing into COVID-19 at all, or that it prevents cases of the disease from getting severe. The effect is frequently described in terms of preventing colonization, which might be true (and would be bad news for smokers, given that the protection is presumably not perfect, so smokers remain susceptible to eventual infection), but we do not know. The evidence cited in favor of that — comparisons of smoking rates in all test-diagnosed cases (not just hospitalized cases) show a deficit of smokers — does not really show it. If smoking only prevented significant disease after infection, we would still see this because (in most populations, so far) a large portion of tests are among people with disease symptoms.
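The point about symptom-driven testing can be illustrated with toy arithmetic (all rates invented): even if smoking had no effect on infection at all and only prevented symptomatic disease, smokers would still be underrepresented among test-diagnosed cases.

```python
# Toy arithmetic (invented rates) showing why a smoker deficit among
# test-diagnosed cases does not pin down *where* the protection acts.
# Assume smoking does NOT affect infection, only symptomatic disease.

smoking_prevalence = 0.25        # population share who smoke
infection_rate = 0.10            # identical for smokers and non-smokers
p_symptomatic_nonsmoker = 0.60
p_symptomatic_smoker = 0.20      # protection acts only after infection
p_tested_symptomatic = 0.80      # testing is largely symptom-driven
p_tested_asymptomatic = 0.05

def share_tested_positive(smoker):
    """Fraction of the whole population that is in this group,
    infected, and gets tested (and thus diagnosed)."""
    p_symp = p_symptomatic_smoker if smoker else p_symptomatic_nonsmoker
    share = smoking_prevalence if smoker else 1 - smoking_prevalence
    return share * infection_rate * (
        p_symp * p_tested_symptomatic
        + (1 - p_symp) * p_tested_asymptomatic
    )

smokers = share_tested_positive(True)
nonsmokers = share_tested_positive(False)
smoker_share_of_cases = smokers / (smokers + nonsmokers)
print(round(smoker_share_of_cases, 3))  # 0.118, well below the 25% prevalence
```

Smokers make up well under their 25% population share of diagnosed cases here, even though by construction they are infected at exactly the same rate. So the deficit in test-diagnosed cases is consistent with protection acting anywhere along the pathway.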

It turns out epidemiology requires a bit of thinking about what the data means. Go figure.

4. Correlation is not causation, but it is the best possible evidence of causation that exists.

Causation can never be observed. Never. Never ever ever. All we can do is infer it from observed associations. If you have those, whatever form they take, you can infer causation. Or perhaps not — because perhaps there is a good affirmative reason to doubt that the association is causal. If so, it is useful to focus on testing that rather than either doing more of the same or, worse, saying “there is a reason for doubt and therefore I am just going to doubt”. This does not include running any old clinical trial whose only “virtue” (*cough*) is being a clinical trial. There are no simple recipes for inferring causation. Anyone who suggests otherwise is too simple to be doing science.