Last post, I said we have a safety problem in medical AI. I even suggested that it is bad enough that it could lead to a tragedy.

It may not seem that way. Numerous papers are published every week, showing deep learning systems achieving impressive results on medical tasks. Products are being approved by the FDA, and companies are starting to sell them to healthcare providers.

I’m going to argue we are doing three specific things wrong:

we assume good experimental performance equals good clinical performance

we assume good overall performance equals good subtask performance

we are not very careful with our study designs

Today I will explain the first of these problems, and give the evidence that supports it. Since this topic is so important, I’m even including references!

Over the next few posts, I will cover the other issues, and describe the solutions we have.

As this is gonna be a long one, there will be a TL;DR at the end.

Standard disclaimer: these posts are aimed at a broad audience including layfolk, machine learning experts, doctors and others. Experts will likely feel that my treatment of their discipline is fairly superficial, but will hopefully find some interesting new ideas outside of their domains. That said, if there are any errors please let me know so I can make corrections.

I don’t care about your model

A little bit of housekeeping to start with. We are talking about assessing the safety and efficacy of medical AI, for systems that are being considered for real-world clinical implementation. These are not prototypes, or proof-of-concept models, or research projects. These are systems we want to apply to patients.

This means we are only interested in the test set results. This is a computer science free zone. We don’t care about how the model was built, we don’t care about the design decisions you made, and we honestly don’t even care much about how it was trained.

The only architecture diagram that matters in clinical testing for AI

When testing an intervention to change standard medical practice, we can treat the change itself as a black box. It makes no difference if you are testing AI, or task substitution (for example, a nurse performing a task traditionally done by a doctor), or a new medication. We don’t consider how the system works, or what the nurse was thinking about, or the mechanism of the drug. We just look at the results of the testing. Nothing else matters.

Well, that isn’t quite true, but only because we are never certain that our results are reliable. Knowing the AI system design is sensible can reassure us that the results could be valid (this is a principle of “science-based medicine”, which is itself an extension of “evidence-based medicine”).

To give some examples of how science-based medicine might work in medical AI:

If you are working with medical images using a 3-layer MLP, your results are nonsense.

If you are doing unsupervised learning for a clinical task, your results are nonsense.

If you are using a deep network variant created prior to 2014, your results are nonsense. It doesn’t matter how good they look, they are almost certainly spurious*.

Obviously, I am slightly exaggerating for effect here, but this isn’t far off the current state of science in medical AI. Except in very strange circumstances, it is only if you are using a high performance model, trained on a decently sized dataset, that your results might not be nonsense. If so, we can move on to what might be wrong with your testing 🙂

Performance is not outcomes

“From our experience, most healthcare organizations do not evaluate algorithms in the context of their intended use,” Kakarmath said. “The technical performance of an algorithm for a given task is far from being the only metric that determines its potential impact.”

This quote from healthcareitnews frames today’s topic nicely. So far, no-one has ever shown that patients are better off when we use an AI system. That seems like the most important thing we need to know about these models, right?

If we were doing drug development, we would only have done the equivalent of simulation studies or animal models.

Why is this a problem?

Because performance is not outcomes.

This should be the mantra of anyone who is building medical AI systems.

I’ll introduce a bit of terminology here.

Performance testing is what we have seen in research papers and regulatory approvals so far. We take a set of patients (a cohort), define a performance measure we will judge our model on (a metric), and identify what “good” performance will be (usually a comparison against current practice). We then analyse the results with some sort of statistical test to estimate how reliable they are.
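To make that recipe concrete, here is a minimal sketch of a performance test in Python. Everything in it is invented for illustration: the toy cohort of 20 patients, the model and doctor calls, and the function names. It uses sensitivity and specificity as the metric, a doctor as the comparison, and a simple bootstrap interval as the "some sort of statistical test". Real studies use far larger cohorts and more careful statistics.

```python
import random

def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for binary truth labels and binary calls."""
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_sensitivity_gap(truth, calls_a, calls_b, n_resamples=2000, seed=0):
    """Percentile 95% interval for sensitivity(a) - sensitivity(b), resampling patients."""
    rng = random.Random(seed)
    n = len(truth)
    gaps = []
    while len(gaps) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [truth[i] for i in idx]
        if not any(t) or all(t):
            continue  # resample has no positives or no negatives; draw again
        sens_a, _ = sensitivity_specificity(t, [calls_a[i] for i in idx])
        sens_b, _ = sensitivity_specificity(t, [calls_b[i] for i in idx])
        gaps.append(sens_a - sens_b)
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]

# Hypothetical cohort: 8 patients with disease, 12 without.
truth = [1] * 8 + [0] * 12
model_calls = [1, 1, 1, 1, 1, 1, 1, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
doctor_calls = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

model_sens, model_spec = sensitivity_specificity(truth, model_calls)
doctor_sens, doctor_spec = sensitivity_specificity(truth, doctor_calls)
low, high = bootstrap_sensitivity_gap(truth, model_calls, doctor_calls)

print(f"model:  sensitivity {model_sens:.3f}, specificity {model_spec:.3f}")
print(f"doctor: sensitivity {doctor_sens:.3f}, specificity {doctor_spec:.3f}")
print(f"95% bootstrap interval for the sensitivity gap: {low:.3f} to {high:.3f}")
```

Note what is missing: nothing in this code knows what happened to any patient afterwards. It only scores calls against labels.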

This is like doing an experiment in a laboratory, a drug trial in a petri dish, which is why it is often also called laboratory testing (despite the severe lack of laboratories in radiology research). The point is that in this type of experiment, we control for all factors other than the AI model.

Clinical testing has the goal of not controlling the experiments. Unlike in performance testing, we want to see how the system operates in the context of real healthcare. We want to see that good performance actually leads to better clinical outcomes.

Clinical outcomes are what happens in practice. The two types of outcomes we care about are patient outcomes, like the rates of death and disability for patients who have a specific condition, and healthcare system outcomes, such as the amount of money spent per patient.

So the key components of clinical testing are:

real clinical environments

real patients

real outcomes that really matter

At a glance it wouldn’t be unreasonable to assume that high performance should result in good outcomes. If we look at recent papers, we often see experiments that directly compare the performance of AI systems to that of doctors, with favourable results.

Examples of humans (dots) vs AI (lines) comparisons on ROC curves, from prominent recent papers.

Surely this is apples for apples? If a test shows an AI can do the task as well as a doctor, then they can be swapped just like identical cogs, right?

Of course not, because performance is not outcomes.

Why doctors hate CAD

The experience we have had in computer aided diagnosis (CAD) over the last few decades is instructive. If you are working in medical AI and you are not aware of the failure of CAD, stay a while and listen**.

CAD is the term we used for 90s AI as it applied to screening mammography (mammography being x-rays of breasts, performed to look for breast cancer). The methods used were mostly expert systems built on handcrafted rules, and support vector machines with handcrafted features (SIFT/HOG etc).

I assume everyone knows that this breed of AI didn’t work very well for any perceptual tasks?

Well, radiology didn’t get the memo. Instead of leaving this technology to researchers and enthusiasts, the US government*** decided to pay radiologists $8 more to report a screening mammogram if they used^ CAD. Unsurprisingly, by 2010 it was estimated that 74% of mammograms in the US were read with CAD [1]. This decision has cost billions over the last two decades.

The most valuable thing to come out of this lamentable decision is that we now have direct evidence on whether the performance testing used to justify CAD was good enough. Since we are approving AI systems today on the basis of the same sort of experiments, you can see why it might be important to know if it works.

Spoiler: it doesn’t, because performance is not outcomes.

The early experiments were promising. The first performance study (I think) of CAD that directly compared humans with and without the support of the CAD system (this is usually called a “reader study”) was undertaken in 1990 [2]. It showed a greater AUC for the combo of humans and CAD.

Many more studies followed, with similar performance results. The first FDA approval of mammography CAD was in 1998, and Medicare in the USA started to reimburse use of CAD in 2001.

Almost immediately, doctors started getting uneasy. In practice, CAD systems would highlight a lot of false positives – areas on the study for the radiologist to review that did not end up being important. It was also variable; if you ran the same study through a CAD system twice, you could get quite different results. To the radiologists, it certainly didn’t appear that these systems were very good, and using them could be frustrating.

Example of a CAD interface, with a highlighted area of concern.

Frustrating was expected though. These systems were supposed to add a bit of a burden (a slight increase in interpretation time), but allow us to pick up more cancer. Unfortunately, the evidence trickling in seemed to suggest that patients weren’t doing any better. Many groups started putting these systems to the test, and several massive clinical trials came out in the 2000s. They all found the same thing.

CAD didn’t work. At best.

Even reading the literature, it can be hard to appreciate this. There are numerous studies which say the opposite, that CAD helps radiologists pick up more cancer with minimal costs, but they all had one thing in common.

They were all controlled experiments^^. They involved radiologists reading a set of images with and without CAD, and they show that in combination, more cancer is detected. These studies range from small (tens of patients) to large (thousands of cases), but they never looked at patient outcomes in clinical practice.

Several large-scale clinical trials have now been completed. In 2007, Fenton et al. [3] showed that in a cohort of 222,000 women undergoing 430,000 mammograms, across four years and three states, implementing CAD was associated with a reduction in specificity from 90.2% to 87.2%. The rate of biopsy increased by 19.7%, but the change in the cancer detection rate (from 4.15 per 1000 to 4.20 per 1000) was not significant.

So, 20% more biopsies, no more cancer.
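A quick back-of-the-envelope check on those figures, as a rough sketch. The inputs are only the numbers quoted above from Fenton et al. [3]; the function and key names are mine, and the false positive rate is simply taken as one minus specificity.

```python
def fenton_summary(spec_before=0.902, spec_after=0.872,
                   cdr_before=4.15, cdr_after=4.20):
    """Turn the quoted specificity and detection figures into absolute and relative changes."""
    # False positive rate among women without cancer is 1 - specificity.
    fpr_before = 1 - spec_before
    fpr_after = 1 - spec_after
    return {
        "extra_false_positives_per_1000": (fpr_after - fpr_before) * 1000,
        "relative_false_positive_increase": (fpr_after - fpr_before) / fpr_before,
        "extra_cancers_per_1000": cdr_after - cdr_before,
        "relative_detection_increase": (cdr_after - cdr_before) / cdr_before,
    }

summary = fenton_summary()
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

On these numbers, that is about 30 extra false positives for every 1000 cancer-free women screened (a relative increase of roughly 31%), set against 0.05 extra cancers per 1000 mammograms (about 1%), a difference the study did not find to be significant.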

In 2015, an even larger study by Lehman et al. [4] looked at 630,000 mammograms from 320,000 women across a 6 year period. They found that sensitivity, specificity, and cancer detection rates were not any different between radiologists that used CAD, and those that didn’t. They also found that for the radiologists who had practiced both with and without CAD during the study period, their sensitivity dropped from 89.6% to 83.3%.

Not better, maybe worse.

Similar results have been shown in the other trials on the topic, e.g. Gromet et al., Gur et al., etc. A systematic review in 2008 (prior to Lehman) showed that CAD did not change detection rates, but increased recall rates. It also showed that double reading increased detection rates and decreased recall rates, but more on that later.

So, we have a bunch of laboratory studies, even at large scale, that show improved performance, and a bunch of huge clinical trials that say “nuh-uh, it definitely isn’t better, and most likely is worse”. What is going on?

Check your bias

People are weird. It turns out that if you run an experiment with doctors being asked to review cases with CAD, they get more vigilant. If you give them CAD and make them use it clinically, they get less vigilant than if you never gave it to them in the first place.

There are a range of things going on here, but the most important is probably the laboratory effect. As several studies have shown [5, 6], when people are doing laboratory studies (i.e., controlled experiments) they behave differently than when they are treating real patients. The latter study concluded:

“Retrospective laboratory experiments may not represent either expected performance levels or interreader variability during clinical interpretations of the same set of mammograms”

which really says it all.

An important question to ask, since it gets to the root of how we might want to test medical interventions like AI, is why? Why would laboratory testing fail?

As I said, people are weird. Not weird as in “do strange things”, but weird as in “can be consistently expected to do things that are unintuitive at first glance”. Welcome to the study of human cognitive biases.

Quick bias check: how likely do you think it would be for these two people to revolutionise cognitive psychology?

Human decisions are prone to influence by external forces. In cognitive science generally there has been an enormous amount of work on the question of why humans make decisions. For the time poor, I’d recommend this TED talk from Dan Ariely, and if you want a deeper introduction, Think101x from University of Queensland (a free MOOC I have mentioned before).

From Dan Ariely in the talk above, talking about the external factors that influence us:

“We wake up in the morning and we feel we are making decisions … but what (this evidence) shows is that these decisions are not residing within us.”

This effect is well described in medicine too. Dan Ariely has a medical example in the talk, but in healthcare IT in particular there is a wealth of literature on the topic. From Enrico Coiera (a leader in this field):

Biases such as the anchoring, adjustment and representativeness heuristics, and information presentation order effects all can lead to decisions that do not reflect the available evidence.

You can see how these effects might all come into play in laboratory experiments. Anchoring (and the adjustment heuristic) is when your decision is biased by an initial piece of information like a prompt, e.g. “determine if these cases contain malignant lesions or not.” Instead of treating the case like you would in clinic, the presence of the word “malignant” might make you more vigilant.

I won’t go through all of the possible ways these biases could alter human performance during tests, but I will note one bias in particular because it is specifically relevant to medical AI. Automation bias or automation-induced complacency has been described as:

“the tendency to use automated cues as a heuristic replacement for vigilant information seeking and processing” [7]

or, in other words, our propensity to over-rely on the cues from computers, and under-value other evidence we may have. This effect has been implicated in several recent deaths in partially self-driving cars – it has been shown that even trained safety drivers are unable to remain vigilant in autonomous cars that work most of the time.

Automation bias can reduce vigilance, because we inherently trust computers^^^

This effect has also been directly cited as a possible reason for the failure of mammography CAD. One particularly interesting study showed that using CAD resulted in worse sensitivity (fewer cancers picked up) when the CAD feedback contained more inaccuracies [8]. On the surface this didn’t make a lot of sense, since CAD was never meant to be used to exclude cases; it was approved to highlight additional areas of concern, and the radiologists were supposed to use their own judgement for the remainder of the image. Instead, we find that radiologists are reassured by a lack of highlighted regions (or by dismissing incorrectly highlighted regions) and become less vigilant.

I’ve heard many supporters of CAD claim that the reason for the negative results in clinical studies is that “people just aren’t using the CAD as it was intended,” which is both accurate and absurdly naive as far as defenses go. Yes, radiologists become less vigilant when they use CAD. It is not surprising, and it is not unexpected. It is inevitable and unavoidable, simply the cost that comes with working alongside humans.

If you want to read any more about automation bias and the effects it can have in medical IT, David Lyell has done some really nice studies on the topic.

All robots, all the time

You may ask, what about full automation? When an AI system doesn’t just influence human decisions, but provides the answer autonomously, then don’t these problems just vanish? There are no hidden factors involved, no messy humans, and no way to trip up the system.

Of course, the answer is no. Because performance is not outcomes.

No decision in medicine occurs in isolation from people. A radiology report doesn’t make patients better. The report is delivered to a clinician, who interprets it through their own understanding of the patient, and through their own biases. Do surgeons act differently when they receive a report from an AI? No idea. Do internists alter their treatment plans when an AI presents information in a specific order, in a specific way? Never been tested.

We have no reason to expect that any medical AI system will be unaffected by these problems. An enormous weight of evidence shows that complex human systems will always act differently than we would expect from controlled experiments. Performance studies will never truly show us how a system will operate in clinical practice, and all of our experience suggests that the reality is usually worse than our experiments suggest.

So, if we approve and implement AI that has only been tested in modestly sized performance studies, what could go wrong?

What’s the harm?

In mammography, CAD has cost the United States hundreds of millions of dollars per year [4] without any appreciable benefit. The harm may not only be measured in dollars though, because it is possible that CAD use has prevented the wider dissemination of double reading in mammography, a practice which has been shown to improve patient outcomes.

Double reading is when two radiologists independently read a mammogram, and some consensus mechanism is used if their reports disagree with each other. In many other countries (including my own home, Australia), double reading is widespread. CAD has been seen by many as a cost-efficient way to avoid double-reading.

The evidence [9] shows that double reading costs a bit more (€8,912 per cancer detected vs €8,287 with single reading), but detects about 10% more cancer. This is generally considered a good trade-off, especially once you consider the increased costs of delayed treatment if you miss those cancers.

Across the US, if we pretend that the money spent on CAD had instead been used for double reading, we can estimate the effect. Double reading, according to the above study, finds an additional cancer for every €16,600 (in 2010 Euros). Allowing for differences in exchange rate, currency value, and so on over time, let’s just round that up to $20,000 USD.

If the average cost of CAD per year is $400 million, then double reading could have detected an extra 20,000 cancers per year in the US! Obviously this is not a formal economic analysis, and although I am being conservative in my estimates, the figures are rubbery. But even 10,000 more cancers detected per year would be a huge deal. Even 5000 per year would be incredible and tragic.
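The arithmetic behind that estimate, as a rough sketch. The $400 million and $20,000 figures are the ones used above; the two higher cost values are hypothetical, included only to show how wrong the cost assumption would have to be before the total falls to the 10,000 and 5,000 figures. The function name is mine.

```python
def extra_cancers_per_year(annual_cad_spend_usd, cost_per_extra_cancer_usd):
    """Extra cancers double reading might detect if the CAD budget were redirected."""
    return annual_cad_spend_usd / cost_per_extra_cancer_usd

ANNUAL_CAD_SPEND_USD = 400_000_000  # average yearly CAD cost assumed in the text

# $20,000 is the rounded figure from the text; $40,000 and $80,000 are
# hypothetical "what if double reading cost 2x or 4x more" scenarios.
for cost in (20_000, 40_000, 80_000):
    n = extra_cancers_per_year(ANNUAL_CAD_SPEND_USD, cost)
    print(f"${cost:,} per extra cancer -> {n:,.0f} extra cancers per year")
```

In other words, the cost per additional cancer could be four times higher than assumed and the same budget would still buy around 5000 extra detections a year.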

Courting tragedy

If it wasn’t for the slow, statistical nature of the problems with CAD, occurring over decades and measured in calculated lives rather than visible people, we might call breast CAD a medical tragedy.

What happens when we apply AI to urgent and critical care? If our models appear to perform well, but underperform in actual clinical practice, it is not hard to imagine a local, clustered tragedy of dozens or hundreds of deaths.

What would happen to medical AI as a sector if something like this occurred because we had not been as diligent as we knew we should be? When all the evidence already exists that our current approach is inadequate? Last post I included a quote from Samuel Massengill about the role of his company in the 1937 elixir sulfanilamide tragedy, where he said no-one “could have foreseen the unlooked-for results.” As I said then, unlooked-for is not the same as unforeseeable.

We know that laboratory testing is not good enough. We already have extensive evidence of increased costs and likely patient harm, caused by the very same testing we are still using today to assess and approve medical AI systems.

We have to do better.

TL;DR

Medical AI today is assessed with performance testing: controlled laboratory experiments that do not reflect real-world safety.

Performance is not outcomes! Good performance in laboratory experiments rarely translates into better clinical outcomes for patients, or even better financial outcomes for healthcare systems.

Humans are probably to blame. We act differently in experiments than we do in practice, because our brains treat these situations differently.

Even fully autonomous systems interact with humans, and are not protected from these problems.

We know all of this because of one of the most expensive, unintentional experiments ever undertaken. At a cost of hundreds of millions of dollars per year, the US government paid people to use previous-generation AI in radiology. It failed, and possibly resulted in thousands of missed cancer diagnoses compared to best practice, because we had assumed that laboratory testing was enough.

References

[1] Rao VM, Levin DC, Parker L, Cavanaugh B, Frangos AJ, Sunshine JH. How widely is computer-aided detection used in screening and diagnostic mammography? Journal of the American College of Radiology. 2010;7(10):802-5.

[2] Chan HP, Doi K, Vyborny CJ, Schmidt RA, Metz CE, Lam KL, Ogura T, Wu Y, MacMahon H. Improvement in radiologists’ detection of clustered microcalcifications on mammograms: the potential of computer-aided diagnosis. Investigative radiology. 1990 Oct;25(10):1102-10.

[3] Fenton JJ, Taplin SH, Carney PA, Abraham L, Sickles EA, D’Orsi C, Berns EA, Cutter G, Hendrick RE, Barlow WE, Elmore JG. Influence of computer-aided detection on performance of screening mammography. New England Journal of Medicine. 2007 Apr 5;356(14):1399-409.

[4] Lehman CD, Wellman RD, Buist DS, Kerlikowske K, Tosteson AN, Miglioretti DL. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA internal medicine. 2015 Nov 1;175(11):1828-37.

[5] Rutter CM, Taplin S. Assessing mammographers’ accuracy: a comparison of clinical and test performance. Journal of clinical epidemiology. 2000 May 1;53(5):443-50.

[6] Gur D, Bandos AI, Cohen CS, Hakim CM, Hardesty LA, Ganott MA, Perrin RL, Poller WR, Shah R, Sumkin JH, Wallace LP. The “laboratory” effect: comparing radiologists’ performance and variability during prospective clinical and laboratory mammography interpretations. Radiology. 2008 Oct;249(1):47-53.

[7] Mosier KL, Skitka LJ. Human Decision Makers and Automated Decision Aids: Made for Each Other?. Automation and human performance: Theory and applications. 2018 Jan 29:120.

[8] Alberdi E, Povyakalo A, Strigini L, Ayton P. Effects of incorrect computer-aided detection (CAD) output on human decision-making in mammography. Academic radiology. 2004 Aug 1;11(8):909-18.

[9] Posso M, Carles M, Rué M, Puig T, Bonfill X. Cost-effectiveness of double reading versus single reading of mammograms in a breast cancer screening programme. PloS one. 2016 Jul 26;11(7):e0159806.

* there is nothing wrong with spurious results. We should expect to find spurious things all the time, that is the basis of stats. The important thing is to clearly recognise the limitations of your methods, and explain them clearly in your papers.

** maybe “Tala Moana, warrior”, is more appropriate if you are a millennial and don’t own a mobile phone.

*** most other governments did not do this, and unsurprisingly in other countries almost no-one uses CAD. Something something lobbying something something regulatory capture.

^ “used” in this context was pretty rubbery. There were rules about how the systems should be applied, but in practice most people did not follow them. Anecdotally, many people simply ignored the system and pocketed the extra cash.

^^ presumably there are some exceptions to this, but all the large unmatched trials I’ve read are remarkably consistent.

^^^ obviously I could have included the tragic gif from the Uber self-driving car accident here. I will instead just assume everyone has seen it, and note that the importance of reduced vigilance in life-or-death situations cannot be overstated.