Use Active Learning to Get the Most Help from Humans

The core job of most machine learning systems is to generalize from sample data created by humans. The learning process starts with humans creating a bunch of labeled data: images annotated with the objects they depict, pictures of faces with the names of the people, speech recordings with an accurate transcript, etc. Then comes training. A machine learning algorithm processes all that human-labeled data. At the end of training the learning algorithm produces a classifier, essentially a small standalone program that can provide the right answer for new input that was not part of the human-labeled training data. That classifier is what you then deploy into the world to guess your users’ ages, or recognize their friends’ faces, or transcribe their speech when they talk to their phone.

The scarce resource in this equation is the human labor needed to label the training data in the first place.

Many impressive Deep Learning results come from domains where enormous amounts of labeled data are available because they were shared by a social network’s billion users or crawled from across the web. However, unless you’re Facebook or Google, you’ll likely find labeled data relevant to your problem somewhat more scarce, especially if you’re working in some new vertical that has its own jargon or behavior or data sources. Hence you’ll need to get your labels from your users. This entails building some kind of interface that shows them examples of the texts or images or other inputs you want to be able to classify and gets them to submit the correct labels.

But, again, human labor — particularly when it’s coming from your users — is a scarce resource. So, you’ll want to only ask your users to label the data that will improve your system’s results the most. Active Learning is the name for the field of machine learning that studies exactly this problem: how to find the samples for which a human label would help the system improve the most. Researchers have found a number of algorithmic approaches to this problem. These include techniques for finding the sample about which the system has the greatest uncertainty, detecting samples for which a label would cause the greatest change to the system’s results, selecting samples for which the system expects that its predictions would have the highest error, and others. Burr Settles’ excellent survey of Active Learning provides a great introduction to the field.
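To make the first of those techniques concrete, here is a minimal sketch of uncertainty sampling in Python. It assumes you already have a model that outputs class probabilities for each unlabeled sample; the sample names and the probabilities are made up for illustration, and entropy is only one of several uncertainty measures you could pick.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(predictions, k=1):
    """Return the ids of the k unlabeled samples with the highest entropy.

    `predictions` maps a sample id to the model's class probabilities.
    """
    ranked = sorted(predictions,
                    key=lambda sid: entropy(predictions[sid]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs for four unlabeled samples (made-up numbers).
predictions = {
    "gesture_a": [0.98, 0.01, 0.01],  # model is nearly certain
    "gesture_b": [0.40, 0.35, 0.25],
    "gesture_c": [0.70, 0.20, 0.10],
    "gesture_d": [0.34, 0.33, 0.33],  # model is close to guessing
}

# Ask the user to label only the samples the model is least sure about.
to_label = most_uncertain(predictions, k=2)
print(to_label)
```

The samples whose predicted distribution is closest to uniform come back first, so those are the ones you would show to a user for labeling.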

As a concrete example of these ideas, here’s a video demonstrating a hand gesture recognition system I built that uses Active Learning principles to request labels from the user when it sees a gesture for which it cannot make a clear prediction (details about this work here).

Don’t Treat the User as an “Oracle”

Active Learning researchers have shown success in producing higher accuracy classifiers with fewer labeled samples. Active Learning is a great way to pull the most learning out of the labeling work you get your users to do.

However, from an interaction design perspective, Active Learning has a major downside: it puts the learning system in charge of the interaction rather than the human user. Active Learning researchers refer to the human who labels the samples they select as an “oracle”. Well, Interactive Machine Learning researchers have shown that humans don’t like being treated as an oracle.

Humans don’t like being told what to do by a robot. They enjoy interactions much more, and are willing to spend more time training the robot, if they are in charge of the interaction.

In a 2010 paper, Designing Interactions for Robot Active Learners, Cakmak et al. studied user perceptions of passive and active approaches to teaching a robot to recognize shapes. One option put the robot in charge. It would use Active Learning to determine the shape it wanted labeled next. Then it would point at the shape and the user would provide the answer. The other option put the users in charge, letting them select which examples to show the robot.

When the robot was in charge of the interaction, selecting which sample it wanted labeled in the Active Learning style, users found the robot’s stream of questions “imbalanced and annoying”. Users also reported a worse understanding of the state of the robot’s learning, which made them worse teachers.

In a software context, Guillory and Bilmes found similar feelings while attempting to apply active learning to Netflix’s movie rating interface.

Choose Algorithms for Their Ability to Explain Classification Results

Imagine you have a persistent health problem that you need diagnosed. You have the choice of two AI systems you can use. System A has a 90% accuracy rate, the best available. It takes in your medical history, all your scans and other data and gives back a diagnosis. You can’t ask it any questions or find out how it arrived at that diagnosis. You just get back the Latin name for your condition and a Wikipedia link. System B has an 85% accuracy rate, substantially less than System A. System B takes all your medical data and also comes back with a diagnosis. But unlike System A it also tells you how it arrived at that diagnosis. Your blood pressure is past a certain threshold, you’re above a certain age, you have three of five factors from your family history, etc.

Which of these two systems would you choose?

There’s a cliche from marketing that half of the advertising budget is wasted but no one knows which half. Machine learning researchers have a related cliche: it’s easy to create a system that can be right 80% of the time; the hard part is figuring out which 80% is right. Users trust learning systems more when they can understand how they arrive at their decisions. And they are better able to correct and improve these systems when they can see the internals of their operation.

So, if we want to build systems that users trust and that we can rapidly improve, we should select algorithms not just for how often they produce the right answer, but for what hooks they provide for explaining their inner workings.

Some machine learning algorithms provide more of these types of affordances than others. For example, the neural networks currently pushing the state-of-the-art in accuracy on so many problems provide particularly few hooks for such explanations. They are basically big black boxes that spit out an answer (though some researchers are working on this problem). On the other hand, Random Decision Forests provide incredibly rich affordances for explaining classifications and building interactive controls of learning systems. You can figure out which variables were most important, the system’s confidence about each prediction, the proximity between any two samples, etc.

You wouldn’t select a database or web server or JavaScript framework simply because of its performance benchmarks. You’d look at the API and see how well it supports the interface you want to provide your users. Similarly, as designers of machine learning systems we should expect to have the ability to access the internal state of our classifiers in order to build richer, more interactive interfaces for our users.

Beyond our own design work on these systems, we want to empower our users themselves to improve and control the results they receive. Todd Kulesza, at Microsoft Research, has done extensive work on exactly this problem, which he calls Explanatory Debugging. Kulesza’s work produces machine learning systems that explain their classification results. These explanations themselves then act as an interface through which users can provide feedback to improve and, importantly, personalize the results. His paper on Why-Oriented End-User Debugging of Naive Bayes Text Classification provides a powerful and concrete example of the idea.
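To give a feel for why Naive Bayes lends itself to this kind of explanation, here is a rough sketch in Python. It is not Kulesza’s system; it is a toy text classifier I am assuming for illustration, with made-up training messages and hypothetical function names (train, explain). The point is that the final score is a sum of per-word contributions, so each word’s pull toward one label or the other can be shown to the user.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and doc counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def explain(text, word_counts, doc_counts, pos, neg):
    """Return (log-odds of pos vs. neg, contribution of each known word)."""
    vocab = set(word_counts[pos]) | set(word_counts[neg])

    def log_prob(word, label):
        # Add-one (Laplace) smoothing so unseen words never get probability 0.
        total = sum(word_counts[label].values())
        return math.log((word_counts[label][word] + 1) / (total + len(vocab)))

    prior = math.log(doc_counts[pos] / doc_counts[neg])
    contributions = {}
    for word in text.lower().split():
        if word in vocab:  # words never seen in training are ignored
            delta = log_prob(word, pos) - log_prob(word, neg)
            contributions[word] = contributions.get(word, 0.0) + delta
    return prior + sum(contributions.values()), contributions

# Made-up training messages, for illustration only.
docs = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, doc_counts = train(docs)
score, why = explain("free lunch monday", word_counts, doc_counts, "spam", "ham")

print("spam" if score > 0 else "ham")
for word, delta in sorted(why.items(), key=lambda item: item[1]):
    print(f"{word:>8}: {delta:+.2f}")  # positive pushes toward spam
```

Because every word’s contribution is visible, the same list could double as a feedback control: a user who disagrees with how much a word counts could adjust it directly, which is the kind of interaction the explanations above are meant to enable.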

Empowering Users to Create Their Own Classifiers

In conventional machine learning practice, engineers build classifiers, designers integrate them into interfaces, and then users interact with their results. The problem with this pattern is that it divorces the practice of machine learning from knowledge about the problem domain and the ability to evaluate the system’s results. Machine learning engineers or data scientists may understand the available algorithms and the statistical tests used to evaluate their results, but they don’t truly understand the input data and they can’t see problems in the results that would be obvious to their users.

At best this pattern results in an extremely slow iteration cycle. Machine learning engineers return to their users with each iteration of the system, slowly learning about the domain and making incremental improvements. In practice, this cumbersome cycle means that machine learning systems either ship with problems that are obvious to end users or are simply too expensive to build for many real problems.

To escape this pattern we have to put the power to create classifiers directly in the hands of users. Now, no user wants to “create a classifier”. So, in order to give them this power we need to design interfaces that let them label samples, select features, and do all the other actions involved in a way that fits with their existing mental models and workflows.

When we figure out how to do this the results can be extremely powerful.

One of the most impressive experiments I’ve seen in Interactive Machine Learning is Saleema Amershi’s work on Facebook group invites, ReGroup: Interactive Machine Learning for On-Demand Group Creation in Social Networks.

The current Facebook event invite experience goes like this: you create a new event and go to invite friends. Facebook presents you with an alphabetical list of all of your hundreds of friends with a checkbox by each one. You look at this list in despair and then click the box to “invite all”. And hundreds of your friends get invites to events they’ll never be able to attend in a city where they don’t live.

The ReGroup system Amershi and her team put together improves on this dramatically. It starts you with the same list of names with checkboxes. But then when you check a name it treats that check as a positively labeled sample. And it treats any names you skipped as negatively labeled samples. It uses this data to train a classifier, treating profile data and social connections as the features. It computes a likelihood for each of your friends that you’ll check the box next to them and sorts the most likely ones to the top. Because the features that determine event relevance are relatively strong and simple (where people live, what social connections you have in common, how long ago you friended them, etc.), the classifier’s results rapidly become useful.
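The loop described above can be sketched in a few lines of Python. This is my own simplified illustration, not the ReGroup implementation: the naive Bayes style scoring, the rank_friends function, the two features, and the friend data are all assumptions made for the example.

```python
def rank_friends(friends, checked, skipped):
    """Rank unlabeled friends by estimated probability of being checked.

    friends: dict of name -> dict of categorical features
    checked / skipped: sets of names treated as positive / negative labels
    """
    def likelihood(features, examples):
        # Product of smoothed per-feature match rates (naive Bayes assumption).
        p = 1.0
        for key, value in features.items():
            matches = sum(1 for name in examples
                          if friends[name].get(key) == value)
            p *= (matches + 1) / (len(examples) + 2)
        return p

    n_labeled = len(checked) + len(skipped)
    scores = {}
    for name, features in friends.items():
        if name in checked or name in skipped:
            continue  # already labeled by the user
        pos = (len(checked) + 1) / (n_labeled + 2) * likelihood(features, checked)
        neg = (len(skipped) + 1) / (n_labeled + 2) * likelihood(features, skipped)
        scores[name] = pos / (pos + neg)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up friend list with two hypothetical profile features.
friends = {
    "Ana": {"city": "Portland", "group": "climbing"},
    "Ben": {"city": "Boston", "group": "college"},
    "Cy": {"city": "Portland", "group": "climbing"},
    "Dee": {"city": "Boston", "group": "climbing"},
    "Eli": {"city": "Portland", "group": "college"},
    "Fay": {"city": "Boston", "group": "college"},
}

# The user checked Ana and scrolled past Ben without checking him.
ranked = rank_friends(friends, checked={"Ana"}, skipped={"Ben"})
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Even with one check and one skip, the friend who shares both features with the checked name rises to the top and the one who shares both with the skipped name sinks to the bottom. Re-running the ranking after every click is what lets the list improve while the user simply keeps doing what they were already doing.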

This work is an incredibly elegant match between existing user interaction patterns and what’s needed to train a classifier.

Another great example is CueFlik, a project by Fogarty et al. that improves web-based image search by letting users create rules that automatically group photos by their visual qualities. For example (as shown above), a user might search for “stereo” and then select just the “product photos” (those on a clean white background). CueFlik takes these examples and learns a classifier that can distinguish product photos from natural photos. Users can later choose to apply that classifier to other searches beyond the initial search for “stereo”, for example to “cars” or “phones”.