In the following sections, we discuss a few kinds of machine learning problems in greater detail. We begin with a list of objectives, i.e., a list of things that we would like machine learning to do. Note that each objective is complemented by a set of techniques for accomplishing it, including types of data, models, and training algorithms. The list below is just a sampling of the problems ML can tackle, intended to motivate the reader and to provide some common language for when we talk about more problems throughout the book.

1.3.1. Supervised learning¶

Supervised learning addresses the task of predicting targets given inputs. The targets, which we often call labels, are generally denoted by \(y\). The input data, also called the features or covariates, are typically denoted by \(\mathbf{x}\). Each (input, target) pair is called an example or instance. Sometimes, when the context is clear, we may use the term examples to refer to a collection of inputs, even when the corresponding targets are unknown. We denote any particular instance with a subscript, typically \(i\), for instance (\(\mathbf{x}_i, y_i\)). A dataset is a collection of \(n\) instances \(\{\mathbf{x}_i, y_i\}_{i=1}^n\). Our goal is to produce a model \(f_\theta\) that maps any input \(\mathbf{x}_i\) to a prediction \(f_{\theta}(\mathbf{x}_i)\).
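To make this notation concrete, here is a minimal sketch in which the dataset, the parameters \(\theta\), and the linear form of \(f_\theta\) are all invented for illustration:

```python
import numpy as np

# A toy dataset of n = 3 examples: each row of X is a feature vector x_i,
# and y[i] is the corresponding target (all values invented).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([0.5, 1.5, 2.5])

# A model f_theta: here theta is a weight vector and f_theta(x) = theta . x,
# chosen only as the simplest possible parametric model.
theta = np.array([0.25, 0.25])

def f(x, theta):
    """Map an input x to a prediction f_theta(x)."""
    return x @ theta

# One prediction per example in the dataset.
predictions = np.array([f(x_i, theta) for x_i in X])
print(predictions)
```

Learning, which we discuss next, amounts to choosing \(\theta\) so that the predictions match the targets well.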

To ground this description in a concrete example, if we were working in healthcare, then we might want to predict whether or not a patient would have a heart attack. This observation, heart attack or no heart attack, would be our label \(y\). The input data \(\mathbf{x}\) might be vital signs such as heart rate, diastolic and systolic blood pressure, etc.

The supervision comes into play because, in choosing the parameters \(\theta\), we (the supervisors) provide the model with a dataset of labeled examples (\(\mathbf{x}_i, y_i\)), where each input \(\mathbf{x}_i\) is matched with its correct label.

In probabilistic terms, we are typically interested in estimating the conditional probability \(P(y|x)\). While it is just one among several paradigms within machine learning, supervised learning accounts for the majority of successful applications of machine learning in industry. Partly, that is because many important tasks can be described crisply as estimating the probability of something unknown given a particular set of available data:

Predict cancer vs not cancer, given a CT image.

Predict the correct translation in French, given a sentence in English.

Predict the price of a stock next month based on this month’s financial reporting data.

Even with the simple description “predict targets from inputs”, supervised learning can take a great many forms and require a great many modeling decisions, depending on (among other considerations) the type, size, and number of inputs and outputs. For example, we use different models for processing sequences (like strings of text or time series data) than for processing fixed-length vector representations. We will visit many of these problems in depth throughout the first 9 parts of this book.

Informally, the learning process looks something like this: grab a big collection of examples for which the covariates are known and select from them a random subset, acquiring the ground-truth labels for each. Sometimes these labels might be available in data that has already been collected (e.g., did a patient die within the following year?) and other times we might need to employ human annotators to label the data (e.g., assigning images to categories).

Together, these inputs and corresponding labels comprise the training set. We feed the training dataset into a supervised learning algorithm, a function that takes as input a dataset and outputs another function, the learned model. Finally, we can feed previously unseen inputs to the learned model, using its outputs as predictions of the corresponding label. The full process is drawn in Fig. 1.3.1.
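The "dataset in, function out" shape of a supervised learning algorithm can be sketched in a few lines. Here a 1-nearest-neighbor learner stands in for any learning algorithm, and the training data is invented for illustration:

```python
import numpy as np

def learn(X_train, y_train):
    """A supervised learning algorithm: takes a dataset as input and
    returns another function, the learned model. The 'algorithm' here is
    1-nearest-neighbor, chosen only because it fits in a few lines."""
    def model(x):
        # Predict the label of the closest training example.
        distances = np.linalg.norm(X_train - x, axis=1)
        return y_train[np.argmin(distances)]
    return model

# Training set: inputs paired with their ground-truth labels (invented).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["cat", "cat", "dog"])

model = learn(X_train, y_train)         # dataset in, function out
print(model(np.array([0.9, 1.2])))      # prediction for an unseen input
```

The final line mirrors the last step of the process above: feeding a previously unseen input to the learned model and reading off its prediction.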

1.3.1.1. Regression¶

Perhaps the simplest supervised learning task to wrap your head around is regression. Consider, for example, a set of data harvested from a database of home sales. We might construct a table, where each row corresponds to a different house, and each column corresponds to some relevant attribute, such as the square footage of a house, the number of bedrooms, the number of bathrooms, and the number of minutes (walking) to the center of town. In this dataset, each example would be a specific house, and the corresponding feature vector would be one row in the table. If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vector for your home might look something like: \([100, 0, .5, 60]\). However, if you live in Pittsburgh, it might look more like \([3000, 4, 3, 10]\). Feature vectors like this are essential for most classic machine learning algorithms. We will continue to denote the feature vector corresponding to any example \(i\) as \(\mathbf{x}_i\) and we can compactly refer to the full table containing all of the feature vectors as \(X\).

What makes a problem a regression is actually the outputs. Say that you are in the market for a new home. You might want to estimate the fair market value of a house, given some features like these. The target value, the price of sale, is a real number. If you remember the formal definition of the reals, you might be scratching your head now. Homes probably never sell for fractions of a cent, let alone prices expressed as irrational numbers. In cases like this, when the target is actually discrete, but where the rounding takes place on a sufficiently fine scale, we will abuse language just a bit and continue to describe our outputs and targets as real-valued numbers.
We denote any individual target by \(y_i\) (corresponding to example \(\mathbf{x}_i\)) and the set of all targets by \(\mathbf{y}\) (corresponding to all examples \(X\)). When our targets take on arbitrary values in some range, we call this a regression problem. Our goal is to produce a model whose predictions closely approximate the actual target values. We denote the predicted target for any instance by \(\hat{y}_i\). Do not worry if the notation is bogging you down. We will unpack it more thoroughly in the subsequent chapters.

Lots of practical problems are well-described regression problems. Predicting the rating that a user will assign to a movie can be thought of as a regression problem and if you designed a great algorithm to accomplish this feat in 2009, you might have won the 1-million-dollar Netflix prize. Predicting the length of stay for patients in the hospital is also a regression problem. A good rule of thumb is that any how much? or how many? problem should suggest regression:

“How many hours will this surgery take?”: regression.

“How many dogs are in this photo?”: regression.

However, if you can easily pose your problem as “Is this a _ ?”, then it is likely classification, a different kind of supervised problem that we will cover next.

Even if you have never worked with machine learning before, you have probably worked through a regression problem informally. Imagine, for example, that you had your drains repaired and that your contractor spent \(x_1=3\) hours removing gunk from your sewage pipes. Then he sent you a bill of \(y_1 = \$350\). Now imagine that your friend hired the same contractor for \(x_2 = 2\) hours and that he received a bill of \(y_2 = \$250\). If someone then asked you how much to expect on their upcoming gunk-removal invoice, you might make some reasonable assumptions, such as more hours worked costs more dollars. You might also assume that there is some base charge and that the contractor then charges per hour. If these assumptions held true, then given these two data examples, you could already identify the contractor’s pricing structure: $100 per hour plus $50 to show up at your house. If you followed that much then you already understand the high-level idea behind linear regression (and you just implicitly designed a linear model with a bias term). In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes that is not possible, e.g., if some of the variance owes to factors besides your two features. In these cases, we will try to learn models that minimize the distance between our predictions and the observed values.
In most of our chapters, we will focus on one of two very common losses: the \(L_1\) loss, where

(1.3.1)
\[l(y, y') = \sum_i |y_i-y_i'|,\]

and the least mean squares loss, or \(L_2\) loss, where

(1.3.2)
\[l(y, y') = \sum_i (y_i - y_i')^2.\]

As we will see later, the \(L_2\) loss corresponds to the assumption that our data was corrupted by Gaussian noise, whereas the \(L_1\) loss corresponds to an assumption of noise from a Laplace distribution.
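The contractor's pricing can be recovered numerically, and both losses evaluated, in a few lines. The hours and bills come straight from the story above; solving a line through two points stands in for a full regression fit:

```python
import numpy as np

# Two observed (hours, dollars) examples from the contractor story.
x = np.array([3.0, 2.0])
y = np.array([350.0, 250.0])

# Fit y = w * x + b exactly through the two points.
w = (y[0] - y[1]) / (x[0] - x[1])   # slope: dollars per hour
b = y[0] - w * x[0]                 # intercept: the base charge
y_hat = w * x + b                   # predictions on the training data

l1 = np.sum(np.abs(y - y_hat))      # L1 loss, Eq. (1.3.1)
l2 = np.sum((y - y_hat) ** 2)       # L2 loss, Eq. (1.3.2)
print(w, b, l1, l2)                 # 100.0 50.0 0.0 0.0
```

Because two points determine a line, both losses are exactly zero here; with more examples and extra sources of variance, we would instead pick the \(w\) and \(b\) that minimize one of the losses.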

1.3.1.2. Classification¶

While regression models are great for addressing how many? questions, lots of problems do not bend comfortably to this template. For example, a bank wants to add check scanning to its mobile app. This would involve the customer snapping a photo of a check with their smartphone’s camera and the machine learning model would need to be able to automatically understand text seen in the image. It would also need to understand hand-written text to be even more robust. This kind of system is referred to as optical character recognition (OCR), and the kind of problem it addresses is called classification. It is treated with a different set of algorithms than those used for regression (although many techniques will carry over).

In classification, we want our model to look at a feature vector, e.g., the pixel values in an image, and then predict which category (formally called a class), among some (discrete) set of options, an example belongs to. For hand-written digits, we might have 10 classes, corresponding to the digits 0 through 9. The simplest form of classification is when there are only two classes, a problem which we call binary classification. For example, our dataset \(X\) could consist of images of animals and our labels \(Y\) might be the classes \(\mathrm{\{cat, dog\}}\). While in regression, we sought a regressor to output a real value \(\hat{y}\), in classification, we seek a classifier, whose output \(\hat{y}\) is the predicted class assignment.

For reasons that we will get into as the book gets more technical, it can be hard to optimize a model that can only output a hard categorical assignment, e.g., either cat or dog. In these cases, it is usually much easier to instead express our model in the language of probabilities. Given an example \(x\), our model assigns a probability \(\hat{y}_k\) to each label \(k\).
Because these are probabilities, they need to be positive numbers that add up to \(1\), and thus we only need \(K-1\) numbers to assign probabilities to \(K\) categories. This is easy to see for binary classification. If there is a \(0.6\) (\(60\%\)) probability that an unfair coin comes up heads, then there is a \(0.4\) (\(40\%\)) probability that it comes up tails. Returning to our animal classification example, a classifier might see an image and output the probability that the image is a cat, \(P(y=\text{cat} \mid x) = 0.9\). We can interpret this number by saying that the classifier is \(90\%\) sure that the image depicts a cat. The magnitude of the probability for the predicted class conveys one notion of uncertainty. It is not the only notion of uncertainty and we will discuss others in more advanced chapters.

When we have more than two possible classes, we call the problem multiclass classification. Common examples include hand-written character recognition \([0, 1, 2, 3, \ldots, 9, a, b, c, \ldots]\). While we attacked regression problems by trying to minimize the \(L_1\) or \(L_2\) loss functions, the common loss function for classification problems is called cross-entropy.

Note that the most likely class is not necessarily the one that you are going to use for your decision. Assume that you find this beautiful mushroom in your backyard, as shown in Fig. 1.3.2. Now, assume that you built a classifier and trained it to predict whether a mushroom is poisonous based on a photograph. Say our poison-detection classifier outputs \(P(y=\text{death cap} \mid \text{image}) = 0.2\). In other words, the classifier is \(80\%\) sure that our mushroom is not a death cap. Still, you would have to be a fool to eat it. That is because the certain benefit of a delicious dinner is not worth a \(20\%\) risk of dying from it. In other words, the effect of the uncertain risk outweighs the benefit by far. We can look at this more formally.
Basically, we need to compute the expected risk that we incur, i.e., we need to multiply the probability of each outcome by the benefit (or harm) associated with it:

(1.3.3)
\[L(\mathrm{action} \mid x) = E_{y \sim p(y \mid x)}[\mathrm{loss}(\mathrm{action}, y)].\]

Hence, the loss \(L\) incurred by eating the mushroom is \(L(a=\mathrm{eat} \mid x) = 0.2 \cdot \infty + 0.8 \cdot 0 = \infty\), whereas the cost of discarding it is \(L(a=\mathrm{discard} \mid x) = 0.2 \cdot 0 + 0.8 \cdot 1 = 0.8\). Our caution was justified: as any mycologist would tell us, the above mushroom actually is a death cap.

Classification can get much more complicated than just binary, multiclass, or even multi-label classification. For instance, there are some variants of classification for addressing hierarchies. Hierarchies assume that there exist some relationships among the many classes. So not all errors are equal: if we must err, we would prefer to misclassify to a related class rather than to a distant class. Usually, this is referred to as hierarchical classification. One early example is due to Linnaeus, who organized the animals in a hierarchy. In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, but our model would pay a huge penalty if it confused a poodle for a dinosaur. Which hierarchy is relevant might depend on how you plan to use the model. For example, rattlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be deadly.
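The mushroom calculation in Eq. (1.3.3) is short enough to verify directly. The probabilities and losses below are exactly the ones from the text, with `float("inf")` standing in for the infinite harm of dying:

```python
# Expected risk, Eq. (1.3.3): sum over outcomes of
# probability of outcome * loss(action, outcome).
def expected_risk(probs, losses):
    """E_{y ~ p}[loss(action, y)] for a discrete distribution."""
    return sum(p * l for p, l in zip(probs, losses))

p_death_cap = 0.2
outcome_probs = [p_death_cap, 1 - p_death_cap]  # (death cap, not death cap)

# Eating: dying is infinitely bad; a safe mushroom costs nothing.
risk_eat = expected_risk(outcome_probs, [float("inf"), 0.0])
# Discarding: no harm if poisonous; a wasted dinner otherwise.
risk_discard = expected_risk(outcome_probs, [0.0, 1.0])
print(risk_eat, risk_discard)  # inf 0.8
```

As in the text, the expected risk of eating is infinite while discarding costs only \(0.8\), so discarding is the rational decision even though "not a death cap" is the more likely class.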

1.3.1.3. Tagging¶

Some classification problems do not fit neatly into the binary or multiclass classification setups. For example, we could train a normal binary classifier to distinguish cats from dogs. Given the current state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate our model gets, we might find ourselves in trouble when the classifier encounters an image of the Town Musicians of Bremen. As you can see, there is a cat in the picture, and a rooster, a dog, a donkey, and a bird, with some trees in the background. Depending on what we want to do with our model ultimately, treating this as a binary classification problem might not make a lot of sense. Instead, we might want to give the model the option of saying the image depicts a cat and a dog and a donkey and a rooster and a bird.

The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. Auto-tagging problems are typically best described as multi-label classification problems. Think of the tags people might apply to posts on a tech blog, e.g., “machine learning”, “technology”, “gadgets”, “programming languages”, “linux”, “cloud computing”, “AWS”. A typical article might have 5–10 tags applied because these concepts are correlated. Posts about “cloud computing” are likely to mention “AWS” and posts about “machine learning” could also deal with “programming languages”.

We also have to deal with this kind of problem in the biomedical literature, where correctly tagging articles is important because it allows researchers to do exhaustive reviews of the literature. At the National Library of Medicine, a number of professional annotators go over each article that gets indexed in PubMed to associate it with the relevant terms from MeSH, a collection of roughly 28k tags. This is a time-consuming process and there is typically a one-year lag between archiving and tagging.
Machine learning can be used here to provide provisional tags until each article can have a proper manual review. Indeed, for several years, the BioASQ organization has hosted a competition to do precisely this.
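Mechanically, multi-label prediction often amounts to giving each tag its own probability and assigning every tag that clears a threshold, rather than picking a single winner. The per-tag scores below are invented stand-ins for a model's outputs:

```python
# Multi-label classification: each tag gets an independent probability
# (unlike multiclass, these need not sum to 1), and any tag whose
# probability clears a threshold is assigned. Scores are invented.
tag_probs = {
    "machine learning": 0.92,
    "cloud computing": 0.81,
    "AWS": 0.77,
    "gadgets": 0.12,
    "linux": 0.05,
}

threshold = 0.5
predicted_tags = sorted(t for t, p in tag_probs.items() if p >= threshold)
print(predicted_tags)
```

Note that correlated tags ("cloud computing" and "AWS") can both be assigned to the same article, which is exactly what the binary and multiclass setups cannot express.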

1.3.1.4. Search and ranking¶

Sometimes we do not just want to assign each example to a bucket or to a real value. In the field of information retrieval, we want to impose a ranking on a set of items. Take web search, for example: the goal is less to determine whether a particular page is relevant for a query, but rather to determine which one of the plethora of search results is most relevant for a particular user. We really care about the ordering of the relevant search results and our learning algorithm needs to produce ordered subsets of elements from a larger set. In other words, if we are asked to produce the first 5 letters from the alphabet, there is a difference between returning A B C D E and C A B E D. Even if the result set is the same, the ordering within the set matters.

One possible solution to this problem is to first assign a relevance score to every element in the set and then to retrieve the top-rated elements. PageRank, the original secret sauce behind the Google search engine, was an early example of such a scoring system, but it was peculiar in that it did not depend on the actual query. Instead, it relied on a simple relevance filter to identify the set of relevant items and then on PageRank to order those results that contained the query term. Nowadays, search engines use machine learning and behavioral models to obtain query-dependent relevance scores. There are entire academic conferences devoted to this subject.
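The score-then-retrieve idea can be sketched in a few lines. The page names and relevance scores are invented; in practice the scores would come from a learned model:

```python
# Score-then-sort ranking: given a relevance score per candidate,
# return the top-k candidates in score order, best first.
def rank(scores, k):
    """Return the k highest-scoring items, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical relevance scores for four candidate pages.
scores = {"page_a": 0.31, "page_b": 0.97, "page_c": 0.58, "page_d": 0.12}
print(rank(scores, 3))  # ['page_b', 'page_c', 'page_a']
```

Returning the same three pages in a different order would be a worse result, which is precisely why ranking is a distinct problem from classification.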

1.3.1.5. Recommender systems¶

Recommender systems are another problem setting related to search and ranking. The problems are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the emphasis on personalization to specific users in the context of recommender systems. For instance, for movie recommendations, the results page for a SciFi fan and the results page for a connoisseur of Peter Sellers comedies might differ significantly. Similar problems pop up in other recommendation settings, e.g., for retail products, music, or news recommendation.

In some cases, customers provide explicit feedback communicating how much they liked a particular product (e.g., the product ratings and reviews on Amazon, IMDB, GoodReads, etc.). In other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which might indicate dissatisfaction but might just indicate that the song was inappropriate in context. In the simplest formulations, these systems are trained to estimate some score \(y_{ij}\), such as an estimated rating or the probability of purchase, given a user \(u_i\) and product \(p_j\). Given such a model, for any given user we could retrieve the set of objects with the largest scores \(y_{ij}\), which could then be recommended to the customer. Production systems are considerably more advanced and take detailed user activity and item characteristics into account when computing such scores. Fig. 1.3.4 is an example of deep learning books recommended by Amazon based on personalization algorithms tuned to capture the author’s preferences.

Despite their tremendous economic value, recommender systems naively built on top of predictive models suffer some serious conceptual flaws. To start, we only observe censored feedback.
Users preferentially rate movies that they feel strongly about: you might notice that items receive many five-star and one-star ratings but that there are conspicuously few three-star ratings. Moreover, current purchase habits are often a result of the recommendation algorithm currently in place, but learning algorithms do not always take this detail into account. Thus it is possible for feedback loops to form, where a recommender system preferentially pushes an item that is then taken to be better (due to greater purchases) and in turn is recommended even more frequently. Many of these problems, concerning how to deal with censoring, incentives, and feedback loops, are important open research questions.
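Setting those caveats aside, the simplest estimate-then-retrieve formulation described above can be sketched directly: given a matrix of estimated scores \(y_{ij}\), recommend each user the items with the largest scores. The users, items, and scores below are invented:

```python
import numpy as np

# Estimated scores y_ij for each (user i, item j) pair; values invented,
# standing in for predicted ratings or purchase probabilities.
items = ["book_a", "book_b", "book_c", "book_d"]
scores = np.array([[4.1, 2.0, 4.8, 3.3],   # user 0
                   [1.2, 4.9, 2.5, 4.4]])  # user 1

def recommend(user, k=2):
    """Return the k items with the largest estimated score for a user."""
    top = np.argsort(scores[user])[::-1][:k]
    return [items[j] for j in top]

print(recommend(0))  # ['book_c', 'book_a']
print(recommend(1))  # ['book_b', 'book_d']
```

Note that the two users receive different lists from the same model, which is the personalization that distinguishes recommendation from plain ranking.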