Zachary Chase Lipton is an assistant professor at Carnegie Mellon University with appointments in Business, Machine Learning, and Public Policy. His critical writing, both in formal academic papers and as the founding editor of the blog Approximately Correct, has helped to shape the conversation about how AI is used in the wild and how the discourse surrounding AI has diverged from technical realities. He is also the lead author of Deep Learning: The Straight Dope, a popular open-source interactive book teaching deep learning through Jupyter notebooks. In industry, he has helped grow the Amazon AI team at AWS since 2017, and he enjoyed stints at Microsoft Research during his PhD studies.

Our discussion with Zachary produced some valuable insights into the far-reaching effects of distribution shift, while also touching on topics as diverse as ML in healthcare and data efficiency. If you want to hear more from experts across the fields of AI and machine learning, take a look at the rest of our interview series here.

Lionbridge AI: How did you become interested in machine learning?

Zachary: Honestly, I got lucky. From before undergrad until the age of 26, I spent more of my time playing music than on any scholarly pursuit. The big decision wasn’t really to pursue machine learning but to ditch New York and pursue a PhD in ‘something’. Research is one of those rare paths in life where you get rewarded for being critical, for identifying problems that other people have overlooked, and where you have broad license to step across disciplinary boundaries and become an authority. In 2012, I spent two weeks visiting a friend pursuing a PhD at UC Santa Cruz (in music, actually), and after debating with a bunch of PhD students about… everything, I decided that was the life for me.

The next step was to decide ‘what to study’. While I had no proper background in computer science, I had taken two classes in undergrad and taught myself a few pathetic web development skills. It was enough exposure to know that computational thinking appealed to me. Together with the other topics that I enjoyed in undergrad (probability and statistics), it didn’t take much reflection to decide that machine learning was the right discipline. I applied broadly to PhD programs, got lucky that a few professors remembered who I was, and UCSD took a chance on me.

I should probably point out that somewhere along the way I built a website for Columbia Biophysics Professor Julio Fernandez. He invited me to hang out at lab meetings and reading groups, met up with me to play chess and drink coffee, encouraged me to go to grad school, and when it came time to apply he was the first to write me a strong recommendation letter. It’s hard to overstate the benefit of a strong supporter early in your career.

L: Which areas is your research particularly focused on?

Z: Most of my work is influenced by my deep interest in several critical questions: when ML should be used, which problems aren’t addressed by current methods, and which roadblocks thwart its use on problems with real consequences. From these, I’ve worked on projects that span core machine-learning methods and their social impact, with concentrations on deep learning, robustness under distribution shift, data scarcity, and sequential decision-making. This work has diverse application areas, including medical diagnosis, natural language processing, computer vision, and business/finance.

Of course, the distinction between methodology and application isn’t clear cut and many of these fields overlap. As a result, it’s probably easiest to summarize my primary interests as follows:

machine learning for healthcare

algorithms that are invariant to various kinds of distribution shift

interactive learning (with humans in the loop)

the social impacts of machine learning, an area now called FAT (for Fairness, Accountability, and Transparency)

L: In your talk at EmTech, you described distribution shift as a ‘fundamental limitation’ on the progress of machine learning. Could you explain what distribution shift is and why it has such an impact?

Z: Most machine learning works in the following way: given gobs of data (input, output pairs) drawn from some distribution, we aim to produce a model that, given new inputs drawn from the same distribution, can accurately guess the corresponding (unseen) output. Depending on the task, we might be predicting the category that an image belongs to or predicting a diagnosis given a stream of sensor readings. The problem lies in the “distribution” part. We estimate our model based on statistics calculated from the historical data.

Unfortunately, there’s nothing to stop our model from identifying patterns that, while predictive, stem from coincidence rather than from truly capturing the concept we care about.

Imagine, for example, a system that identifies that footwear is indicative of a person’s ability to repay a loan. That might be true, but not because there’s a deep connection between sneakers and responsibility with credit. All it takes is a slight change in people’s sartorial choices to break such a model.
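To make the sneaker example concrete, here is a hypothetical toy sketch in Python (not from the interview; all names and numbers are made up). A model that latches onto the spuriously correlated sneaker feature looks as good as one using the genuinely relevant income signal on the training distribution, but collapses to chance once fashion shifts.

```python
import random

random.seed(0)

def sample(p_sneakers_given_repay):
    # Toy loan data: income is the real signal for repayment;
    # sneakers are only coincidentally correlated with it.
    repay = random.random() < 0.5
    income = (1.0 if repay else 0.2) + random.gauss(0, 0.3)
    sneakers = random.random() < (p_sneakers_given_repay if repay
                                  else 1 - p_sneakers_given_repay)
    return sneakers, income, repay

# Training distribution: sneakers happen to track repayment 90% of the time.
train = [sample(0.9) for _ in range(5000)]

def sneaker_model(sneakers, income):   # latched onto the spurious feature
    return sneakers

def income_model(sneakers, income):    # uses the causally relevant feature
    return income > 0.6

def accuracy(model, data):
    return sum(model(s, i) == r for s, i, r in data) / len(data)

print(accuracy(sneaker_model, train))  # high on the training distribution
print(accuracy(income_model, train))   # also high

# Fashion changes: sneakers no longer track repayment (distribution shift).
shifted = [sample(0.5) for _ in range(5000)]
print(accuracy(sneaker_model, shifted))  # collapses toward chance
print(accuracy(income_model, shifted))   # still high
```

Both models are indistinguishable on held-out data from the training distribution; only the shift exposes which one learned the concept.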

L: In your own work in healthcare, has distribution shift had an effect and how have you safeguarded against it?

Z: Most ML for healthcare is still in the exploratory phase and typically the first step is to see what we can predict in the first place. However, deploying these models successfully in the wild will entirely depend on some degree of robustness under minor shocks to the data distribution. Imagine trusting your life to a pathologist that might get thrown off if the microscope lenses are different from those used during training, or if the brand of petri dish changes.

L: What general steps can developers take to ensure that their models can withstand distribution shifts?

Z: Unfortunately this is a big open research problem (actually, many research problems). Unlike normal supervised learning, it won’t succumb so easily to brute force solutions. To start with, the general problem of learning under distribution shift is known to be impossible. So the real question, in each case, is which restrictive assumptions could conceivably make the problem possible to tackle. Often these assumptions correspond to assumed invariances, defined either in feature space (say, invariance to rotations) or in terms of the data generating process. These in turn correspond to knowledge about causal mechanisms.
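One classic instance of such a restrictive assumption is covariate shift: the labeling mechanism p(y | x) is assumed invariant while only the input distribution p(x) changes. Here is a hypothetical toy sketch (not from the interview) where the input densities are known, so the test-time risk can be recovered from training data via importance weights w(x) = p_test(x) / p_train(x):

```python
import random

random.seed(1)

def p_train(x):   # training inputs: uniform density on [0, 1]
    return 1.0

def p_test(x):    # test inputs concentrate on [0.5, 1]
    return 2.0 if x >= 0.5 else 0.0

def label(x):     # the invariant labeling mechanism p(y | x)
    return x > 0.7

def model(x):     # a model that is wrong exactly on (0.7, 0.9]
    return x > 0.9

xs = [random.random() for _ in range(50000)]
errs = [model(x) != label(x) for x in xs]

# Naive risk estimate: average error on training inputs.
naive_risk = sum(errs) / len(xs)

# Importance-weighted estimate of the risk under the test distribution.
weighted_risk = sum(e * p_test(x) / p_train(x)
                    for e, x in zip(errs, xs)) / len(xs)

print(naive_risk)     # about 0.2 — underestimates the error under p_test
print(weighted_risk)  # about 0.4 — recovers the true test risk
```

The catch, of course, is that in practice the densities are unknown and must themselves be estimated, and the estimator blows up when the test distribution puts mass where the training distribution has little.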

L: As a data provider, it sounds to us like highly efficient training data will be central to overcoming this problem. Do you have any thoughts about this and have you seen any meaningful progress based on new approaches to data?

Z: Arguably all of machine learning is about data efficiency. After all, if we had an unlimited amount of data and unlimited computation, then K-nearest neighbors would be the perfect algorithm. One big empirical success in deep learning has been the use of pre-trained networks to transfer knowledge across tasks. The big insight here is that for certain domains, like text and image classification, the features extracted for one task are similar to those necessary for another, related task. Just how far we can push this in practice might depend on some notion of how similar new tasks are to those we’ve encountered in the past. Unfortunately, we don’t yet have good ways of quantifying such similarities.
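The nearest-neighbors point can be illustrated with a hypothetical toy sketch (not from the interview): on a noiseless 1-D threshold task, 1-nearest-neighbor approaches perfect accuracy as the training set grows, so the interesting question is always how well we do with limited data.

```python
import random

random.seed(0)

def label(x):
    return x > 0.5  # the true concept: a threshold at 0.5

def nn_predict(train, x):
    # Predict the label of the closest training point (1-NN).
    nearest = min(train, key=lambda t: abs(t - x))
    return label(nearest)

def accuracy(n_train, n_test=2000):
    train = [random.random() for _ in range(n_train)]
    test = [random.random() for _ in range(n_test)]
    return sum(nn_predict(train, x) == label(x) for x in test) / n_test

print(accuracy(10))    # decent, but errs near the decision boundary
print(accuracy(1000))  # near-perfect as data grows
```

With unlimited data the boundary gap shrinks to nothing; with 10 points, everything hinges on how efficiently the learner uses them.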

L: Finally, do you have any other advice for anyone out there building a machine learning algorithm?

Z: There are a few things I’d think through. First, what’s your goal: to do interesting research or to prototype a product? The difference in perspective between doing science and doing R&D is significant and will color the work that you do. Second, I’d ask whether the real problem you’re trying to solve is a prediction problem. If not, think through the potential dangers of applying the wrong tool for the job. The next, really practical thing I’d always recommend is to try the simplest, easiest baselines first. Everywhere I go, both in industry and academia, people rush, skipping the step of establishing solid baselines and justifying each additional modeling technique.
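A minimal illustration of the “baselines first” advice, with made-up numbers: on imbalanced data, even a trivial majority-class predictor sets a bar that any fancier model must clearly beat before its extra complexity is justified.

```python
from collections import Counter

# Hypothetical toy labels for a spam-detection task (940 ham, 60 spam).
labels = ["ham"] * 940 + ["spam"] * 60

# The simplest possible baseline: always predict the most common class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)  # ham 0.94
```

A model reporting 93% accuracy on this data would actually be worse than doing nothing, which is exactly why establishing the baseline comes before any modeling.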