Re-branding AI for Healthcare

Over-hyping Machine Learning hinders adoption; being clear about its limitations can pave the way for it.

DeepMind, Google’s Artificial Intelligence subsidiary, recently made a huge step in cancer screening, with its algorithm outperforming human readers. The research has received a lot of criticism since. Despite its shortcomings, the technology does have potential, and it is a huge milestone for AI in medicine. Unlike most AI companies in healthcare, DeepMind made sure to be as transparent as possible when presenting their findings. They built trust and paved the way for adoption by making both the limitations and the potential of their technology clear, and showed promising metrics for possible human-machine cooperation. They are pioneers not only in AI technology but also in its presentation. There is a lot we can learn from DeepMind, even if their solution is unable to “automate” cancer screening.

After the first AI winter, AI research was re-branded in the form of Expert Systems: software that would mimic the decision-making process of a domain expert. Just put all the necessary information in, and the computer will output the answer without ever getting tired! Hopes were high, and doctors were told that they were about to be replaced by more efficient machine intelligence. All these unfulfilled promises quickly led to a second AI winter instead. Now the hype is back! Countless articles have been published on AI detecting cancer more accurately than doctors, finding relevant papers on rare conditions more efficiently than researchers, and establishing better diagnoses through Big Data than physicians. So why aren’t these solutions available yet? Do they even exist outside the simulated “Petri dishes”? Why does adoption take so long?

Retrofuture of medicine by Matt Novak

The anthropomorphisation of AI, all the unfulfilled promises and the constant fear-mongering in the media have caused people to expect the rise of infallible robot doctors, instead of thinking of these models as diagnostic devices that rely on patient data. This false perception made it a lot harder for healthcare professionals to adopt current Machine Learning models, even though actual “AI” solutions have already been an essential part of medicine for several decades. The only way AI can see broad adoption in medicine is if the industry aims to create tools that aid physicians in their work instead of trying to replace them (a goal that DeepMind pursues, but not all AI companies share).

In this article, I will list several factors (including marketing puffery, white lies and over-hyped promises) that are counterproductive and hinder adoption, and show alternative ways to present research results that are beneficial to both patients and doctors. These examples aim to help the reader read between the lines when tabloid articles on AI’s latest accomplishments are presented to them.

Hype #1: AI will replace ___ jobs!

Software engineers create functions to avoid repetition and add layers of abstraction to make complex problems more manageable. When large amounts of data become available in any field, it is only natural to look more deeply into the data and see if it is possible to automate some human labour (including the labour of these engineers) or gain a better understanding of the underlying system. No programmer would find it offensive if a software library or API “took over some part of their job”. These solutions allow them to tackle more complex tasks by building upon them (with some additional computational overhead). Very few programmers would feel attacked if a solution came from someone with different qualifications. Learning to code is different from studying medicine: being self-taught in the former is common, and skills matter more than a degree. There are no real authorities or approving boards. Even ordinary people can issue pull requests to huge repositories that continuously evolve. Languages and workarounds learned by respected developers can become obsolete within years. Progress happens when repetitive tasks are automated and new layers of abstraction are added. In computer science, creating tools to solve problems with the help of multiple people with different sets of skills, sometimes even without formal education, is quite common.

Now imagine being examined by a doctor who has never studied medicine, but who is smart enough to figure out how to do surgery on a cat alone by reading related Wikipedia articles. It is conceited to think that automating a problem is the same as having the strength and will to make life-and-death decisions. “Replacement” also implies that all doctors do is give out diagnoses.
Exaggerations, like telling physicians with decades of prior education that some junior programmer hacked together a Convolutional Network over a weekend that will render them obsolete, will understandably only make them more sceptical and hinder adoption. Sometimes Machine Learning models used without domain knowledge come up with results that experts of that domain discovered long ago. Presenting such findings usually backfires and gets scrutinized by experts of the field.

On the other hand, if software engineers don’t create proper tools for physicians, they will just rely on whatever solutions are at hand, which is far worse. Doctors sometimes share sensitive information via social media or messaging apps to get help from other professionals, without being fully aware of how this information might be traced back. Using popular general-purpose software for spreadsheets, appointments and medical records makes data breaches more likely, so reliable software solutions are necessary for healthcare. These applications will partly replace paper-based administration and log data, which should be analyzed and understood by engineers and statisticians with domain knowledge.

A revolutionary Health Industry concept from 1982

Hype #2: AI has outcompeted humans in ___!

When Deep Blue defeated Garry Kasparov at chess in 1997, the headlines wrote about how a single, superior AI beat the best human player. Although it was one supercomputer running the chess algorithm (with multiple systems working in parallel), it is misleading to call it a one-on-one match. More than 50 years of AI research, from the likes of Alan Turing, Claude Shannon and John von Neumann, preceded its creation, including years of effort from IBM’s best engineers. Deep Blue was the product of decades of collaboration by some of the sharpest minds in the world. Their joint effort, combined with exponential computational growth, beat Kasparov at chess. That’s nothing to be ashamed of.

As impressive as Deep Blue was, Kasparov could still have beaten it in a game of tic-tac-toe. All current AI solutions are narrow: they can excel in a single area but be useless in others. Machine Learning models have no concept of the world outside their niche domain; they can only work with the features presented to them during training and can only output a finite set of answers based on the problem. Unlike Deep Blue, Kasparov could learn chess by playing not just on a computer but also on wooden, plastic or glass boards. He could play even if the chess pieces were a different shape, read books about chess, discuss chess strategies, and learn to read his opponents’ body language. He could also use his strategic knowledge in decision making, in business, or when playing games other than chess. He would most likely be pretty successful at games he has never played before. Current AI solutions, however, are unable to transfer their knowledge to entirely different areas, and usually have to be trained from zero for each problem.

The narrow AI approach faces additional problems in healthcare:

Training models requires a large amount of (properly labelled) data. Even if this data is available, the classes are usually highly imbalanced (fewer people have a certain disease than don’t). It is harder for a model to generalize from classes with fewer examples, although there are many techniques to mitigate this. This makes rare diseases an even bigger problem, as there aren’t many examples of them.
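The imbalance problem, and one common mitigation (re-weighting the rare class), can be sketched in a few lines. This is a minimal illustration using scikit-learn on made-up data, not a recipe from any particular study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Toy imbalanced dataset: roughly 1 "ill" patient (label 1) per 50 healthy.
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.02).astype(int)
# A single noisy biomarker that tends to be higher for ill patients.
X = (rng.normal(0, 1, n) + 2.5 * y).reshape(-1, 1)

# Without weighting, the model is rewarded for ignoring the rare class.
plain = LogisticRegression().fit(X, y)
# class_weight="balanced" re-weights errors inversely to class frequency,
# so missing a rare ill patient costs far more than flagging a healthy one.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

print("recall (plain):   ", recall_score(y, plain.predict(X)))
print("recall (weighted):", recall_score(y, weighted.predict(X)))
```

Re-weighting is only one option; oversampling the rare class or generating synthetic examples are common alternatives.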

Data needs to be as clean as possible. An AI that learns from patient data will give false predictions if the patient’s history has missing or false information (which is often the case, for administrative reasons). If certain illnesses were misdiagnosed in a patient’s history, the model will come to wrong conclusions. When new information becomes available, the data has to reflect these changes. If new kinds of illnesses are found or new kinds of symptoms arise, the models have to be (at least partly) retrained.

The model cannot predict illnesses outside its domain. For instance, if the model was trained to detect melanoma using images of patients’ skin, it will handle all kinds of images as if they were skin and will try to diagnose them. Also, if multiple skin problems are present in an image, the model will only look for melanoma and nothing else. If the model was trained on examples where each image has only one type of disease, it may only be able to diagnose a single disease at a time, even if multiple are present.
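The single-disease limitation comes down to how the output layer is set up. Here is a minimal sketch with made-up scores for three hypothetical skin conditions: a single-label (softmax) head is forced to pick exactly one answer, while independent per-condition (sigmoid) outputs can flag several diseases at once.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (single-label)."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Convert one raw score to an independent probability (multi-label)."""
    return 1 / (1 + math.exp(-v))

# Made-up raw scores for three hypothetical conditions; the first two
# are both clearly present, the third is not.
logits = [3.0, 2.8, -2.0]

# A single-label classifier must pick exactly one answer...
probs = softmax(logits)
single_label = probs.index(max(probs))
print(single_label)  # 0 — the second condition is silently dropped

# ...while independent per-condition outputs can flag several at once.
multi_label = [i for i, v in enumerate(logits) if sigmoid(v) >= 0.5]
print(multi_label)  # [0, 1]
```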

The limitations of narrow AI make its diagnoses unreliable when the patient suffers from multiple rare diseases at the same time, or when the recorded patient information does not reflect reality. Expert Systems in the past suffered from the same problems: they had to be frequently updated to represent new knowledge, and they gave strange answers when never-before-seen inputs were presented to them.

Note that the brilliant minds at DeepMind solved the first two of these three problems by training models on UK data and evaluating them on images from the US, which means that their system is robust enough to generalize between scans done in different hospitals (although the models benefit from being fine-tuned with local data). They also suggest that “future research should assess the performance of the AI system across a variety of manufacturers in a more systematic way”.

Hype #3: AI has predicted ___ with 99% accuracy!

Computer vision and Machine Learning research have come a long way. But reporting a high accuracy score alone for any model is misleading. Selecting the right metrics to evaluate Machine Learning algorithms is not self-evident. Both Machine Learning and statistical approaches can give you attractive but misleading results if applied incorrectly, and their effectiveness depends heavily on the quality of the available data. If you have a sample of 100 patients where only 1 of them has a certain disease, a model that always outputs “healthy” automatically gets 99% accuracy. The cost of misclassifying a patient also varies: wouldn’t you rather have a 99% accuracy where that 1 ill patient is diagnosed correctly, but 1 healthy person out of 99 is accidentally classified as ill and sent for further examination? The latter would mean that everyone gets proper treatment in the end, even though both models yield the exact same accuracy. Does detecting the same illness at different stages get more or less difficult? If so, do experts and machines make the same amount of errors at each stage? Is it even helpful to diagnose an illness at every stage? Do doctors make more false positives and send patients for further examinations only because they want to be 100% sure that a disease is not missed?
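The 100-patient example above is easy to verify directly. A minimal sketch in plain Python:

```python
# 100 patients, exactly 1 of whom has the disease (label 1).
labels = [1] + [0] * 99

# A "model" that always answers "healthy" never finds the ill patient...
always_healthy = [0] * 100
accuracy = sum(p == t for p, t in zip(always_healthy, labels)) / len(labels)
print(accuracy)  # 0.99 — 99% accuracy while missing every single case

# ...while a model that catches the ill patient but also flags one healthy
# person for further examination scores exactly the same.
cautious = [1, 1] + [0] * 98
accuracy2 = sum(p == t for p, t in zip(cautious, labels)) / len(labels)
print(accuracy2)  # also 0.99
```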

It is far more useful to report confusion matrices, which show where the model was wrong. The errors made by Machine Learning models and the errors made by humans are usually very different, as we make mistakes for different reasons, so a direct comparison of success rates is also misleading. Most models are also sensitive to small changes of scale, rotation and position in an input image. Having high accuracy on holdout sets (the data the model’s accuracy was tested on) does not mean the AI would yield the same accuracy in real life, where human staff can make errors while inputting new images (like accidentally rotating them or bending the scans).
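A confusion matrix can be tallied in a few lines of plain Python. In this sketch (1 ill patient among 100, made-up predictions), the sensitivity/specificity split immediately reveals what a bare accuracy score hides:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives for a binary screen."""
    c = Counter(zip(y_true, y_pred))
    return {
        "tp": c[(1, 1)], "fn": c[(1, 0)],
        "fp": c[(0, 1)], "tn": c[(0, 0)],
    }

# 1 ill patient among 100; the model catches them and also flags
# one healthy person for further examination.
y_true = [1] + [0] * 99
y_pred = [1, 1] + [0] * 98

m = confusion_counts(y_true, y_pred)
sensitivity = m["tp"] / (m["tp"] + m["fn"])  # share of ill patients caught
specificity = m["tn"] / (m["tn"] + m["fp"])  # share of healthy cleared
print(m)
print(sensitivity, round(specificity, 3))  # 1.0 0.99
```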

Teledoctor by Hugo Gernsback

It is surprisingly easy to do bad science with Machine Learning, especially if the research environment was originally designed for hypothesis testing on small samples! Clinical research done on 10–30 people can be acceptable, but Machine Learning models trained on only a small number of examples should not be trusted. Most Machine Learning models are complex enough to completely memorize (overfit) the quirks in the data instead of learning to generalize from it. It is the role of Data Scientists to design models that can grasp the underlying information without making the models too vague (underfitting) in their predictions. Besides, the more features (input variables) a model has, the easier it is for the model to overfit, yielding high training accuracy but little to no actual use in real life when encountering new data. Therefore, published accuracy scores alone are misleading and should be questioned every time a tabloid article tries to convince the reader that “AI has predicted a disease with X% accuracy”!
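The small-sample overfitting trap is easy to reproduce. In this sketch (scikit-learn, purely random data, so there is genuinely nothing to learn), a model with many more input variables than patients memorizes the noise, then collapses to chance on new data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# 20 "patients", 500 random input variables, labels assigned by coin flip.
X = rng.normal(size=(20, 500))
y = rng.integers(0, 2, size=20)

model = LogisticRegression(max_iter=1000).fit(X, y)
# With far more features than samples, the quirks of the data are
# typically memorized perfectly.
print("training accuracy:", model.score(X, y))

# Fresh random data from the same (information-free) distribution:
# accuracy falls back to roughly coin-flip level.
X_new = rng.normal(size=(1000, 500))
y_new = rng.integers(0, 2, size=1000)
print("accuracy on new data:", model.score(X_new, y_new))
```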

Despite all of this, it is still possible that a Machine Learning model could give better diagnoses than humans. However, due to the different nature of their errors, a human physician’s additional second opinion could still improve a machine’s final verdict. When designing such systems, it is crucial that physicians do not become over-reliant on them and know when to override the predictions of a model. Garry Kasparov, after his defeat by Deep Blue, came up with the notion of “centaur chess”, where humans and chess programs work together in a team, relying on the strengths of both. DeepMind states that there are “potentially complementary roles for the AI system and human readers in reaching accurate conclusions” due to different edge cases. They also provide data on the possible effectiveness of human-machine cooperation, with the AI acting as a second reader. DeepMind has also published a detailed explanation of their evaluation and metrics (like AUC-ROC) and made the model compete against six US-board-certified radiologists! The results are still easy to misinterpret, though: the AI seems to be better than doctors at not referring patients for biopsy when they don’t need one. But it is the duty of a physician to request further examinations even when there is only a small chance of cancer being present, so they would obviously score lower.

Hype #4: AI can’t explain its decisions on ___!

People tend to be more forgiving towards humans than towards AI advisors. We expect machines to never make errors, especially when they are marketed as superhuman-level intelligence. Some say that building trust is essential for adoption. This is needless anthropomorphisation that treats AI as a human-like expert that people can go to for advice. If Machine Learning were in fact treated like any other human-made tool, except one for cognitive tasks, the notion of trust wouldn’t make much sense. Do you trust an X-ray? Would trusting an X-ray more make you go to more appointments? Even if X-rays sometimes fail to show certain parts in the needed detail, and physicians sometimes fail to read the images correctly, would that cause you never to have an X-ray due to lack of trust? Again, these screening tools are presented as medical devices and not robotic experts.

DeepMind combined three different state-of-the-art Neural Networks, “each operating on a different level of analysis (individual lesions, individual breasts and the full case)“, and the mean of their predicted probabilities of risk is used to establish the diagnosis. However, what these models see, which patterns or areas make them come to certain conclusions, and how their decisions differ from each other was not discussed in their paper. In my opinion, this is the biggest shortcoming of their otherwise groundbreaking work.
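The ensembling step itself, averaging the predicted risk probabilities of several models, is simple. A minimal sketch with made-up scores (the numbers and the 0.5 cut-off are illustrative assumptions, not DeepMind’s actual values):

```python
# Hypothetical risk scores from three models analysing the same case
# at different levels (lesion, breast, full case) — numbers made up.
lesion_level = 0.82
breast_level = 0.74
case_level = 0.69

# Mean of the predicted probabilities, thresholded to a final call.
mean_risk = (lesion_level + breast_level + case_level) / 3
flag_for_review = mean_risk >= 0.5
print(round(mean_risk, 2), flag_for_review)  # 0.75 True
```

Averaging tends to smooth out the idiosyncratic errors of each individual model, which is one reason ensembles often outperform their members.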

Most Machine Learning models use randomization during training; for example, Neural Networks are initialized with random weights. It is also common to randomly subsample the data or to randomly split the dataset into training and test sets in different ways. This means that each time the very same model is trained, it can turn out slightly different (although this can often be avoided by using random seeds that force the same series of random numbers to be generated each time). The goal of the Data Scientist is in fact to create models that generalize well even for slightly different data. Machine Learning is often criticized for this inherent randomness (the same Neural Network can perform differently with a different set of initial weights), but this is also a problem with linear models and statistics: a few missing or additional data points result in slightly different p-values, while outliers can skew regression coefficients.
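The role of random seeds can be shown with a toy stand-in for a training run (the function below is purely illustrative); fixing the seed makes the “initial weights” identical across runs:

```python
import random

def init_weights(seed=None):
    """Stand-in for a training run: draws 'initial weights' at random."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(3)]

# Two unseeded runs generally start from different weights...
run_a = init_weights()
run_b = init_weights()

# ...while fixing a seed makes the run exactly reproducible.
seeded_a = init_weights(seed=123)
seeded_b = init_weights(seed=123)
print(seeded_a == seeded_b)  # True
```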

A linear relationship between 2 or 3 variables is easy to interpret, but this approach has its limits. The more complex the relationship, the harder it is to explain the inner workings of a model. Most Machine Learning models can handle a large number of input variables and model complex non-linear relationships between them (like how multiple pixels in an image have to be arranged in relation to each other if certain objects are present in that image). However, these models are seldom interpretable; in fact, if a Machine Learning model is easy to interpret, chances are it’s useless. This does not necessarily mean that the model’s decision-making process cannot be explained. It is usually possible to understand which input variables were more important for the decision, or which areas of an image caused the model to generate a certain output. Even if healthcare professionals avoid these models for their lack of transparency, they can still be used in a hybrid way, by running hypothesis tests on their predictions. If an AI designed a drug, even if it were not possible to clearly explain why it came up with the result, its output could still be verified via the scientific method. Most of Data Mining is correlational, but causation can be uncovered via additional hypothesis testing. There are many black-box systems in biology that we don’t understand, yet we still develop drugs by measuring whether they tend to have the desired effects on these systems.
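One common way to see which input variables mattered is permutation importance: shuffle one variable at a time and measure how much the model’s score drops. A minimal sketch on made-up data where only the first of five variables carries signal (an illustration of the general technique, not the method used in any particular paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n = 2000

# Hypothetical data: only the first of five "measurements" carries signal.
X = rng.normal(size=(n, 5))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
base = model.score(X, y)

# Shuffle one input variable at a time and record the accuracy drop:
# destroying an important variable hurts; destroying noise does not.
drops = []
for j in range(5):
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    drops.append(base - model.score(X_shuffled, y))

print(int(np.argmax(drops)))  # the informative variable stands out: 0
```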

It is crucial that Data Scientists come up with Machine Learning models in healthcare that do not disadvantage anyone! Several tabloid news articles call “AI racist”, misunderstanding the technology and its limits. There is targeted governmental surveillance, and there is misuse of these systems by companies, as every technology can be used for either good or bad. But labelling an entire field of software engineering, a branch of mathematics and 70+ years of research, under the umbrella term of AI, as inherently evil only reveals the journalist’s confusion about what Machine Learning is. Anthropomorphizing computer code will not improve these issues. Machine Learning models are trained on data usually generated by users. The algorithm learns from the data presented to it, so if users generate biased data, it will learn relationships from that. Searching for “hand” images will mostly return white hands, as images of white hands are more widely used by designers. The search engine only learned to blindly adapt to the needs of users searching for “hands”, creating an unhealthy feedback loop where even more white hands are shown, causing even more designers to use them. Facial recognition is less effective on people of colour. Most facial recognition models work by finding sudden changes in contrast (like the areas between the forehead and eyebrows, the nose and the area under the nostrils on which the nose casts a shadow, the lips, etc.) and then comparing their relative positions to figure out whether they could indeed be caused by looking at a face. The lower the camera quality, the less likely these models are to work as intended, but the results are also worse when a face is not well lit or is less pale.

There are cases where AI systems (for instance credit scoring) learn bias despite not having any information on gender, race or sexual orientation. The more disadvantages a human being has, the worse their chances in life will be. Both advantages and disadvantages are cumulative, as all social systems generate inequality. Having AI systems output unfavourable answers based on inputs unrelated to gender, race or sexual orientation only proves that the link between socioeconomic status and cumulative disadvantages indeed exists. Calling AI racist for pointing out some of these relationships is as outrageous as calling the scientific method or statistics racist. Machine Learning can model nonlinear relationships between multiple variables. It is clear that disadvantages come not only from gender, race or sexual orientation but also from age, health, place of residence, personality, intelligence, beauty, size of family, etc. What we should all strive for is to lessen suffering and allow people to rise above the unfavourable status life has dealt them. AI has helped raise the standard of living for humankind and lift millions out of poverty, so it has been pretty good at doing that so far.

Change the way we look at AI

Data Scientists and software engineers should be more humble when developing solutions for medicine. If you’ve ever pushed code to GitHub, has it occurred to you that your solution may be used as a building block in a critical system in the future (maybe some other library will rely on it, which will then be included in a critical system)? Even if your code worked reliably most of the time, would you take responsibility if its failure caused someone’s death? Would you want your Machine Learning model with 99% accuracy to replace doctors, knowing that you might harm 1 in every 100 people? Your loss function now translates to losses in human lives. Thinking of healthcare as just another optimization process is not the way to go.

First robots to try to replace doctors, without much success.

This does not mean that engineers shouldn’t have an important role in medicine; their role will only grow with computer-aided drug discovery, administration and robotic medical devices. Neural Networks play a huge role in discovering how our brain works. Formerly, social scientists were responsible for doing large-scale quantitative analysis and running censuses, as the only way of estimating certain aspects of the population. They were concerned with research ethics and what questions to ask in forms, while respondents were concerned about replying in ways that would show them in a better light. Engineers are more practical, as they are trained to solve problems and to optimize, favouring one goal over another. A news feed with articles ranked by importance and by how well known their authors are is alien to a social scientist’s mindset, as it distorts the public sphere and generates echo chambers. And also a lot of likes, as it turns out. By being able to run experiments at world scale, Google and Facebook already know more about human psychology than most (social) psychologists are willing to admit, or than any experiment run on 20 university students participating for credit could describe. As more personal health data becomes available, further paradigm shifts are inevitable. DeepMind has demonstrated that we are indeed at the dawn of such a paradigm shift.

When doctors run tests, their devices might be somewhat off, and the lab results could be misleading for several reasons (like not being able to take enough blood to measure certain properties). Similarly, these models should be thought of as tools that give results based on the patient’s data, rather than on their actual bodily functions. Even if lab results are sometimes incorrect, they can still help physicians make better decisions. This sounds far less exciting than having an “omnipotent AI doctor with X% accuracy”, but it dismisses most concerns about the utility of artificial intelligence in healthcare. The area needs to be strictly regulated, but decision-makers should not expect the technology to be unerring. In fact, different kinds of “AIs” have long been part of decision making in medical screening via signal processing: whenever a physician looks at an ECG or MRI image, they look at the de-noised, augmented version of the device’s imprecisely measured signals, data that was altered by software to help establish a diagnosis. GPS car navigation (pathfinding algorithms that sometimes give wrong directions based on traffic data) is already helping paramedics find shorter routes. None of these technologies is perfect and none of them has replaced doctors, yet they are essential to saving lives.