An artist’s impression of a coronavirus. Photo: TheDigitalArtist/pixabay.

Quinine is a naturally occurring extract of the bark of the cinchona tree, isolated in 1820. It has been widely used in the treatment of malaria. The loosely related compounds chloroquine and hydroxychloroquine are widely used synthetic alternatives. All three are also used to treat other conditions such as lupus and rheumatoid arthritis.

Chloroquine and hydroxychloroquine have also been reported to have antiviral effects, and were recently used to treat COVID-19 patients in China, South Korea, Italy and France. However, a recent preprint paper from a French group led by Didier Raoult, an infectious diseases specialist in Marseille, claimed exceptional efficacy of a combination of hydroxychloroquine and azithromycin – a common antibiotic also known to have antiviral effects; it kicked up a global storm, including a premature endorsement by US President Trump that was promptly contradicted by Anthony Fauci, a member of the White House Coronavirus Task Force.

The story behind this premature hype is rather strange and involves billionaire entrepreneur Elon Musk, program hosts at Fox News and various rather dubious characters.

To understand the controversy, we should know what the preprint paper does and does not claim.

In particular, the experiment described in the paper is not a randomised controlled trial, considered to be the gold standard when validating the efficacy and safety of drugs. It was conducted with a small number of patients. And it is not appropriate for the president of a powerful country to recommend the drug, and never appropriate for patients to self-prescribe such drugs.

However, is it appropriate for a trained and experienced physician to prescribe it, knowing the literature and its limitations, and also drawing on her own experience?

Let’s explore the answer here from a statistical point of view.

The problem with supplying a drug to a patient, recording an improvement and then attributing the improvement to the drug is that you don’t know what might have happened if you hadn’t given the drug. Say you believe vitamin C cures the common cold. You have 20 patients with common cold and you give them a thrice-daily dose of vitamin C, and all of them are cured in a week. Can you claim vitamin C cured them? No, because they may have become better on their own anyway. To protect against this possibility, you need a control group that is not given vitamin C, and compare the two groups at the end of the trial.

A randomised controlled trial (RCT) does exactly this. From a group of patients with similar symptoms, you randomly select about half for the experimental group, to whom you administer the drug or treatment. The rest, who form the control group, receive a placebo; all members of this group still believe they are being treated. At the end of the trial, you compare the responses of the two groups. Only if the experimental group shows significantly greater improvement than the control group do you consider the drug or treatment to be effective.
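The random split described above can be sketched in a few lines of Python. This is only an illustration: the patient IDs, the seed and the set of "cured" patients are invented for the example, and a real trial would also need blinding, a placebo and a proper significance test.

```python
import random

def randomise(patient_ids, seed=None):
    """Randomly split patients into (experimental, control) groups."""
    rng = random.Random(seed)      # seeded only so the sketch is repeatable
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def cure_rate(group, cured):
    """Fraction of a group that appears in the set of cured patients."""
    return sum(1 for p in group if p in cured) / len(group)

patients = list(range(20))                  # hypothetical patient IDs
experimental, control = randomise(patients, seed=1)

# Hypothetical outcomes, invented for illustration: patients 0..9 recovered.
cured = set(range(10))
difference = cure_rate(experimental, cured) - cure_rate(control, cured)
print(len(experimental), len(control), difference)
```

The difference in cure rates is the quantity the trial compares; whether it is large enough to count as significant is a separate question.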

It is rare that every patient in one group is cured while every patient in the other is not. You have to deal with variation in the numbers. For example, say the groups consisted of three people each; two out of the three in the experimental group were cured, and only one in the control group was. Is that significant? Intuitively, one would say ‘no’. What if all three in the experimental group were cured but none in the control group? The numbers are still too small for you to be confident about the results. If three tossed coins land heads and another three land tails – it could still happen by chance.

Perhaps the most commonly used parameter to measure significance is called the p-value, based on the work of statistician Ronald Fisher, who also developed the essential ideas of randomisation that underlie the modern RCT. Fisher defined the p-value as follows: Your hypothesis is that “the drug works”. Suppose the contrary, called the null hypothesis, that “the drug does not work” and whether the patient is cured or not is unrelated to the drug. The p-value is the probability of seeing data under the null that is the same or more suggestive than what you saw.

That is, if the null hypothesis – the hypothesis that there is nothing special going on – were true, how probable is it that you would see the data that you observe (or more extreme data)? That is the p-value. If it is high, it means your data is likely to occur by chance. If it is low, it means your data is unlikely to occur by chance, and that something of interest is going on. But it says nothing directly about the hypothesis that you are interested in. This is sometimes misunderstood even by experts. The p-value says you wouldn’t be seeing this by chance but it could mean something else is going on.

For example, consider the case of 2/3 in the experimental group being cured and 1/3 in the control group being cured: p here is the probability of seeing these results, or more extreme ones, if the null hypothesis that “the drug doesn’t work” were true.

So if we take the probability of a patient being cured in the absence of an effective drug to be 0.5, the p-value works out to be 0.25: a one-in-four chance, hardly terribly unlikely. Fisher declared somewhat arbitrarily that a p-value of less than 0.05 – a one-in-20 chance – would count as significant. This has become the widely accepted value in the medical literature.
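One way to arrive at the 0.25 figure is sketched below. It assumes, as the text does, that each patient is cured with probability 0.5 under the null hypothesis, and it takes "the same or more extreme" to mean at least two cures in the experimental group together with at most one cure in the control group. That reading of "more extreme" is an assumption made for this example; other tests, such as Fisher's exact test, define it differently and give different numbers.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k cures among n patients, each cured with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(cured_exp, cured_ctrl, n=3, p_cure=0.5):
    """P(experimental group has >= cured_exp cures AND control has <= cured_ctrl cures)
    when the drug does nothing and every patient is cured with probability p_cure."""
    p_exp = sum(binom_pmf(k, n, p_cure) for k in range(cured_exp, n + 1))
    p_ctrl = sum(binom_pmf(k, n, p_cure) for k in range(0, cured_ctrl + 1))
    return p_exp * p_ctrl

print(p_value(2, 1))  # 0.25
```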

As it happens, the p-value has been widely criticised since, especially more recently. The principal issue is that a small value of p only tells us how improbable the data are under the null hypothesis, which we don’t care about; it tells us nothing about the hypothesis we are interested in. The p-value also invokes hypothetical data more extreme than observed data, which we also don’t care about. As Fisher’s contemporary Harold Jeffreys wrote in his 1939 textbook, Theory of Probability: under the p-value criterion, “an hypothesis [the null] that may be true is rejected because it has failed to predict observable results that have not occurred [the more extreme results]. This seems a remarkable procedure.”

Nevertheless, RCTs and p-values have become established as benchmarks in the medical literature – and they sort of work because researchers already believe it’s plausible that the drug is effective.

The p-value rift between Jeffreys and Fisher was one aspect of a larger divide that continues among statisticians from Fisher’s time to this day: how do you calculate a probability?

Fisher’s frequentist school believes that the probability of an event can only be meaningfully calculated when one has access to many different independent trials (like multiple coin tosses instead of just one). Then the probability of an outcome for a large number of trials becomes the observed frequency of the outcome: if 60 out of 100 patients are cured, the probability of the drug working becomes 0.6.

The other and older school of Jeffreys, first formalised in the early 1800s and now called the Bayesian school, essentially defines the probability of an event by the degree of your belief in the event. The frequentists derided this idea as subjective – but it is not, except for one aspect: the Bayesians use the concept of a prior probability, which is defined by what you believe about a hypothesis before you have seen any data at all. In many cases, this does not matter: your prior can be complete ignorance, for example. But when you do have a prior – and even if it was obtained through gut feeling – the Bayesians believe it should be factored into the calculation of your results.

For example, a test for a disease detects the disease successfully in 99% of cases that do have the disease, but has a 2% false positive rate: i.e., in 2% of cases where the patient does not have the disease, it wrongly says the patient does. Now, you run the test on a patient and it comes back positive. What is the probability that the patient has the disease?

Neither of the numbers above gives you the correct answer because the answer also depends on the prior probability that the patient has the disease. Say the disease occurs in 1 out of 10,000 people in the population. That could be your prior. Or if you know of risk factors specific to this patient, you could include them in your prior. So given your prior and the test data, Bayes’s theorem gives you a posterior probability for the disease. In this case, if your prior is 1 in 10,000, the posterior is a mere 0.005, or 1 in 200. But if you perform a second independent test with similar sensitivity and error rate, the previous posterior becomes your new prior, and your new posterior is 0.2, or 1 in 5. A third independent test pushes it up to 0.92, i.e. you are now 92% confident that the patient has the disease (assuming the three tests are really independent).
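The arithmetic behind these three posteriors can be checked with a short sketch of Bayes's theorem applied repeatedly. The function names are made up for this example; the 99% detection rate, 2% false positive rate and 1-in-10,000 prior are the figures from the text, and the tests are assumed to be independent.

```python
def bayes_update(prior, sensitivity=0.99, false_positive_rate=0.02):
    """Posterior probability of disease after one positive test result."""
    true_pos = sensitivity * prior                  # has the disease and tests positive
    false_pos = false_positive_rate * (1 - prior)   # healthy but tests positive
    return true_pos / (true_pos + false_pos)

def posteriors_after_positive_tests(prior, n_tests):
    """Feed each posterior back in as the next prior, once per positive test."""
    results = []
    p = prior
    for _ in range(n_tests):
        p = bayes_update(p)
        results.append(p)
    return results

results = posteriors_after_positive_tests(1 / 10_000, 3)
for i, p in enumerate(results, 1):
    print(f"after test {i}: {p:.3f}")
# after test 1: 0.005
# after test 2: 0.197
# after test 3: 0.924
```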

A 2018 paper authored by Maurizio Pandolfi and Giulia Carreras, researchers working in Sweden and Italy, respectively, argued that the RCT doesn’t work very well when you have a very low prior belief in the efficacy of the medicine because, as noted above, the p-value tells you nothing about your hypothesis itself: it only rejects a fictitious null hypothesis. But the prior probability of your hypothesis may be so low that even its posterior probability remains lower than that of the null hypothesis. In fields like clinical psychology, the awareness of this ‘p-value fallacy’ is now more widespread, and at least one scientific journal has banned the use of p-values.
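A rough sketch can show how a low prior blunts a "significant" result. It treats a significant trial like a positive diagnostic test and applies Bayes's theorem. The power of 0.8, the threshold of 0.05 and the two priors are illustrative assumptions chosen for this example, not figures taken from the paper.

```python
def posterior_given_significant(prior, power=0.8, alpha=0.05):
    """Probability that the drug really works, given a significant trial result.

    power: chance of a significant result if the drug works (assumed value)
    alpha: chance of a significant result if it does not (assumed value)
    """
    true_positive = power * prior
    false_positive = alpha * (1 - prior)
    return true_positive / (true_positive + false_positive)

for prior in (0.5, 0.01):
    print(f"prior {prior}: posterior {posterior_given_significant(prior):.2f}")
# prior 0.5: posterior 0.94
# prior 0.01: posterior 0.14
```

With an even-odds prior, a significant result makes the hypothesis very likely. With a 1-in-100 prior, the same result still leaves the null hypothesis far more probable.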

The preprint paper describing an experiment by Raoult and his colleagues, about the efficacy of a combination of hydroxychloroquine and azithromycin, does not have a randomised control group. They did have a control group but it wasn’t randomly selected from the full group. Therefore, in principle, we can’t say whether the drug works or not. Their sample size was also small: 24 people received hydroxychloroquine (HCQ) only, six received hydroxychloroquine and azithromycin, and 16 received no treatment. Additionally, Raoult et al don’t actually report clinical outcomes like patient deaths or recoveries, only the viral load (the amount of viral particles present in the bloodstream).

However, they do report that every patient in the experimental group experienced a “significant” decrease in viral load, and that the load dropped to zero for every patient who received both drugs, as well as for large numbers of patients receiving only HCQ. What does this mean?

We must interpret these results with great caution, but not by rejecting them entirely. Fauci, who is also the director of the US National Institute of Allergy and Infectious Diseases, and other medical experts are completely correct to repudiate Trump’s endorsement of HCQ + azithromycin: the paper by Raoult’s group is no kind of proof. Self-medication can be very dangerous: overdosing on these drugs can cause deafness, seizures, retinopathy or even death. And this Trump-fuelled craze is precipitating a shortage of these drugs among those who legitimately, and desperately, need them.

The Raoult group paper does not prove anything. However, it is very suggestive.

A clinician who is considering prescribing this combination to a sick patient has to weigh many competing factors:

This is not an RCT, only a non-randomised trial, and with relatively small numbers

But for those small numbers, the results are striking

HCQ and azithromycin both have documented antiviral properties

Both are widely-used drugs and their side-effects are well-known

Both are contra-indicated in some patients

The French paper is from a respected group, though with a reputation for being iconoclastic

A previous paper from China with a larger group (100 patients) contains no data

There is no other approved treatment. Other drugs are being trialled, like Remdesivir, an anti-Ebola drug with a mixed record, and Ritonavir/Lopinavir, an anti-HIV combination drug

The reality is physicians are not computer programs that take a list of symptoms as input and print out a prescription as output. They use their judgement and experience every day – what is commonly known as intuition. They should continue to use all available data and their own experience when prescribing what they think is best for their patient. If experienced physicians didn’t try off-label uses of drugs every now and then, no drug would ever become repurposed.

Some comments on Twitter suggest that until HCQ is validated by an RCT, it is no better than gaumutra. This is sheer hyperbole. And this is also why prior knowledge is important. We know a priori that these drugs have antiviral properties. We know their side effects. We know they have been used for multiple conditions for decades. We know the credentials of the doctors reporting the results of these incomplete, small-sized, non-RCT trials. Practising physicians have intuition built out of successful and unsuccessful interventions they have performed over the years.

We also know that there is no known or plausible mechanism for gaumutra, homeopathy or any alternative treatments being advanced on social media.

All of this prior knowledge is important when making a decision in a given situation.

Researchers are already starting large clinical trials in Europe and elsewhere to test the efficacy of these and other drugs. While they wait, doctors should not fall prey to absurd hype from Trump, Musk or Fox News. Let us be clear: under no circumstances should patients self-prescribe, whether it is HCQ or antibiotics or any future drug that is approved for this disease.

Instead, let’s remember the sword cuts both ways: physicians should also not dismiss HCQ as ‘no better than gaumutra’ until validated. They should trust their instincts.

Rahul Siddharthan is at the Institute of Mathematical Sciences, Chennai. The views expressed here are personal.