[Epistemic status: Somewhat confident in the medical analysis, a little out of my depth discussing the statistics]

For years, we’ve been warning patients that their sleeping pills could kill them. How? In every way possible. People taking sleeping pills not only have higher all-cause mortality. They have higher mortality from every individual cause studied. Death from cancer? Higher. Death from heart disease? Higher. Death from lung disease? Higher. Death from car accidents? Higher. Death from suicide? Higher. Nobody’s ever proven that sleeping pill users are more likely to get hit by meteors, but nobody’s ever proven that they aren’t.

In case this isn’t scary enough, it only takes a few sleeping pills before your risk of death starts shooting up. Even if you take sleeping pills only a few nights per year, your chance of dying doubles or triples.

When these studies first came out, doctors were understandably skeptical. First, it seems suspicious that so few sleeping pills could have such a profound effect. Second, why would sleeping pills raise your risk of everything at once? Lung disease? Well, okay, sleeping pills can cause respiratory depression. Suicide? Well, okay, overdosing on sleeping pills is a popular suicide method. Car accidents? Well, sleeping pills can keep you groggy in the morning, and maybe you don’t drive very well on your way to work. But cancer? Nobody has a good theory for this. Heart disease? Seems kind of weird. Also, there are lots of different kinds of sleeping pills with different biological mechanisms; why should they all cause these effects?

The natural explanation was that the studies were confounded. People who have lots of problems in their lives are more stressed. Stress makes it harder to sleep at night. People who can’t sleep at night get sleeping pills. Therefore, sleeping pill users have more problems, for every kind of problem you can think of. When problems get bad enough, they kill you. This is why sleeping pill users are more likely to die of everything.

This is a reasonable and reassuring explanation. But people tried to do studies to test it, and the studies kept finding that sleeping pills increased mortality even when adjusted for confounders. Let’s look at a few of the big ones:

Kripke et al 2012 followed 10,529 patients and 23,676 controls for an average of 2.5 years. They used a sophisticated de-confounding method which “controlled for risk factors and [used] up to 116 strata, which exactly matched cases and controls by 12 classes of comorbidity”. Sleeping pill users still had 3-5x the risk of death, regardless of which of various diverse sedatives they took. Even users in their lowest-exposure category, fewer than 18 pills per year, had 3.6x the mortality rate. Cancer rate in particular increased by 1.35x.
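To make the matching idea concrete, here is a toy sketch (not Kripke et al’s actual procedure, and with invented patient data and comorbidity names) of what “exactly matching cases and controls by classes of comorbidity” means:

```python
# Toy illustration of exact matching on comorbidity classes.
# All IDs, data, and comorbidity names are invented for illustration;
# the real study matched on 12 comorbidity classes across up to 116 strata.
cases = [
    {"id": "c1", "comorbidities": ("diabetes", "hypertension")},
    {"id": "c2", "comorbidities": ()},
]
controls = [
    {"id": "k1", "comorbidities": ("diabetes", "hypertension")},
    {"id": "k2", "comorbidities": ("asthma",)},
    {"id": "k3", "comorbidities": ()},
]

# Exact matching: each case is compared only with controls that share
# its comorbidity profile, so stratum-level differences in baseline
# illness cannot masquerade as a drug effect.
matched = {
    case["id"]: [k["id"] for k in controls
                 if k["comorbidities"] == case["comorbidities"]]
    for case in cases
}
```

Here `c1` matches only `k1`, `c2` matches only `k3`, and the asthma-only control is never compared against either case. The limitation, of course, is that you can only match on the comorbidities you thought to record.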

Kao et al 2012 followed 14,950 patients and 60,000+ matched controls for three years. They tried to match cases and controls by age, sex, and eight common medical and psychiatric comorbidities. They still found that Ambien approximately doubled rates of oral, kidney, esophageal, breast, lung, liver, and bladder cancer, and slightly increased rates of various other types of cancer as well.

Welch et al 2017 took 34,727 patients on sleeping pills and 69,418 controls and followed them for eight years. They controlled for sex, age, sleep disorders, anxiety disorders, other psychiatric disorders, a measure of general medical morbidity, smoking, alcohol use, medical clinic (as a proxy for socioeconomic status), and prescriptions for other drugs. They also excluded all deaths in the first year of their study to avoid patients who were prescribed sleeping pills for some kind of time-sensitive crisis – and check the paper for descriptions of some more complicated techniques they used for this. But even with all of these measures in place to prevent confounding, they still found that the patients on sedatives had three times the death rate.

This became one of the rare topics to make it out of the medical journals and into popular consciousness. Time Magazine: Sleeping Pills Linked With Early Death. AARP: Rest Uneasy: Sleeping Pills Linked To Early Death, Cancer. The Guardian: Sleeping Pills Increase Risk Of Death, Study Suggests. Most doctors I know are aware of these results, and have at least considered changing their sedative prescribing habits. I’ve gone back and forth: such high risks are inherently hard to believe, but the studies sure do seem pretty good.

This is the context you need to understand Patorno et al 2017: Benzodiazepines And Risk Of All Cause Mortality In Adults: Cohort Study.

P&a focus on benzodiazepines, a class of sedatives commonly used as sleeping pills, and one of the types of drugs analyzed in the studies above. They do the same kind of analysis as the other studies, using a New Jersey Medicare database to follow 4,182,305 benzodiazepine users and 35,626,849 non-users for nine years. But unlike the other studies, they find minimal to zero difference in mortality risk between users and non-users. Why the difference?

Daniel Kripke, one of the main proponents of the sleeping-pills-are-dangerous hypothesis, thinks it’s because of the switch from looking at all sleeping pills to looking at benzodiazepines in particular. In a review article, he writes:

[Patorno et al] was not included [in this review] because it was not focused on hypnotics, specifically excluded nonbenzodiazepine “Z” drugs such as zolpidem, and failed to compare drug use of cases and controls during follow-ups.

I’m not sure this matters that much. Most of the studies of sleeping pills, including Kripke’s own, included benzodiazepines, analyzed them as a separate subgroup, and found they greatly increased mortality risk. For example, Kripke 2012 finds that the benzodiazepine sleeping pill temazepam had a death hazard ratio of 3.7x, about the same as Ambien and everything else. If Patorno’s study is right, Kripke’s study is wrong about benzodiazepines and so (one assumes) probably wrong in the same way about Ambien and everything else. I understand why Kripke might not want to include it in a systematic review with stringent inclusion criteria, but we still have to take it seriously.

He’s also concerned about the use of an intention-to-treat design. This is where your experimental group is “anyone who was prescribed medication to begin with” and your control group is “anyone who was not prescribed medication to begin with”. If people switch, they stay in their original group – for example, if someone in the medication group stops taking it, they’re still counted in the “taking medication” group. This is the gold standard for medical research, because having people switch groups midstream can introduce extra biases. But if people in the “taking medication” group end up taking no more medication than people in the “not taking medication” group, obviously it’s impossible for your study to get a positive finding. So although P&a were justified in using an intention-to-treat design, Kripke is also justified in worrying that it might get the wrong result.
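The intention-to-treat rule can be stated in one line of code. A minimal sketch (field names invented for illustration, not from P&a’s actual dataset):

```python
# Hedged illustrative sketch of intention-to-treat grouping.
# Field names ("prescribed_at_baseline", etc.) are hypothetical.
def itt_group(patient):
    """Assign the analysis group from the *initial* prescription only.

    Under intention-to-treat, later changes (stopping the drug,
    starting it midstream) never move a patient between groups.
    """
    return "treated" if patient["prescribed_at_baseline"] else "control"

cohort = [
    {"id": 1, "prescribed_at_baseline": True,  "still_taking_at_year_2": False},
    {"id": 2, "prescribed_at_baseline": False, "still_taking_at_year_2": False},
]

groups = [itt_group(p) for p in cohort]
# Patient 1 stopped the drug, but remains in the treated group.
```

Notice that `still_taking_at_year_2` never enters the function – that’s the whole point, and also exactly what Kripke is worried about: if enough “treated” patients quit, the two groups’ actual drug exposure converges and any real effect is diluted toward zero.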

But the authors respond by giving a list of theoretical reasons why they were right to use intention-to-treat, and (more relevantly) repeating their analysis doing the statistics the other way and showing it doesn’t change the results (see page 10 here). Also, they point out that some of the studies that did show the large increases in mortality also used intention-to-treat, so this can’t explain the differences between their studies and previous ones. Overall I find their responses to Dr. Kripke’s concerns convincing. Also, my prior on a few sleeping pills per year tripling your risk of everything is so low that I’m biased towards believing P&a.

So why did they get such different results from so many earlier studies? In their response to Kripke, they offer a clear answer:

They adjusted for three hundred confounders.

This is a totally unreasonable number of confounders to adjust for. I’ve never seen any other study do anything even close. Most other papers in this area have adjusted for ten or twenty confounders. Kripke’s study adjusted for age, sex, ethnicity, marital status, BMI, alcohol use, smoking, and twelve diseases. Adjusting for nineteen things is impressive. It’s the sort of thing you do when you really want to cover your bases. Adjusting for 300 different confounders is totally above and beyond what anyone would normally consider.

Reading between the lines: one of the P&a co-authors was Robert Glynn, a Harvard professor of statistics who helped develop an algorithm that automatically identifies massive numbers of confounders, combines them into a “propensity score”, and then adjusts for it. The P&a study was one of the first applications of the algorithm to a controversial medical question. It looks like this study was partly intended to test it out. And it got the opposite result from almost every past study in this field.
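For readers who haven’t seen a propensity score before, here is the core idea in miniature. This is emphatically not P&a’s actual algorithm – their method automatically screens enormous numbers of candidate covariates and fits a real model – but a toy, pure-Python sketch with invented data showing why adjusting can flip a result:

```python
# Toy demonstration: a crude "drug kills people" signal that vanishes
# once you compare patients with similar propensity to be treated.
# All data are invented for illustration.
from collections import defaultdict

patients = [
    # (covariate pattern, treated?, died?)
    (("old", "smoker"), True,  True),
    (("old", "smoker"), True,  True),
    (("old", "smoker"), False, True),
    (("young", "nonsmoker"), True,  False),
    (("young", "nonsmoker"), False, False),
    (("young", "nonsmoker"), False, False),
]

def mortality(rows):
    return sum(died for _, _, died in rows) / len(rows)

# Crude comparison: treated patients die far more often...
treated = [p for p in patients if p[1]]
untreated = [p for p in patients if not p[1]]
crude_diff = mortality(treated) - mortality(untreated)

# Propensity score: probability of treatment given covariates. Here we
# use the within-stratum treated fraction as a crude stand-in for a
# fitted model over hundreds of covariates.
strata = defaultdict(list)
for covs, was_treated, _ in patients:
    strata[covs].append(was_treated)
propensity = {covs: sum(t) / len(t) for covs, t in strata.items()}

# ...but within each stratum (same covariates, hence same propensity),
# treated and untreated patients die at identical rates.
stratum_diffs = []
for covs in strata:
    rows = [p for p in patients if p[0] == covs]
    t = [p for p in rows if p[1]]
    u = [p for p in rows if not p[1]]
    stratum_diffs.append(mortality(t) - mortality(u))
```

In this toy data the crude comparison suggests the drug raises mortality, while every within-stratum comparison shows zero effect: the sicker (old, smoking) patients were simply more likely to be treated in the first place. The strength of the approach P&a used is that the propensity model can absorb hundreds of such covariates into one balancing number; the open question, which I can’t evaluate, is whether doing that at scale introduces problems of its own.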

I don’t know enough to judge the statistics involved. I can imagine ways in which trying to adjust out so many things might cause some form of overfitting, though I have no evidence this is actually a concern. And I don’t want to throw out decades of studies linking sleeping pills and mortality just because one contrary study comes along with a fancy new statistical gadget.

But I think it’s important to notice: if they’re right, everyone else is wrong. If you’re using a study design that controls for things, you’re operating on an assumption that you have a pretty good idea what things are important to control for, and that if you control for the ten or twenty most important ones you can think of then that’s enough. If P&a are right (and again, I don’t want to immediately jump to that conclusion, but it seems plausible) then this assumption is wrong. At least it’s wrong in the domain of benzodiazepine prescription and mortality. Who knows how many other domains it might be wrong in? Everyone who tries to “control for confounders” who isn’t using something at least as good as P&a’s algorithm isn’t up to the task they’ve set themselves, and we should doubt their results (also, measurement issues!)

This reminds me of how a lot of the mysteries that troubled geneticists in samples of 1,000 or 5,000 people suddenly disappeared once they got samples of 100,000 or 500,000 people. Or how a lot of seasonal affective disorder patients who don’t respond to light boxes will anecdotally respond to gigantic really really unreasonably bright light boxes. Or of lots of things, really.