In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders; there I recommend Keras, which provides a simple API on top of underlying frameworks like TensorFlow and PyTorch).

We'll proceed in this fashion:

give a brief overview of key terminology and the ML workflow

illustrate typical use of the Scikit-Learn API through some simple examples

discuss various metrics that can be used to evaluate ML models

dive deeper with some more complex examples

look at the various ways we can validate and improve our models

discuss the topic of feature engineering - ML models are good examples of "garbage in, garbage out", so cleaning our data and getting the right features is important

finally, summarize some of the main model techniques and their pros and cons

We are going to focus on practical ML, and will not be going much into the internal details of the various algorithms. We're also going to avoid the topics of deep learning and neural nets: these are generating a lot of hype at present but in general involve learning thousands of parameters, which requires massive amounts of training data (and a lot of time). The techniques we are going to look at can be used for much smaller problems and are generally very fast to train. It's worth having them in your toolkit and progressing to deep learning later.

Getting good data and the right features is generally more important than the type of model you choose to use; the performance of the different types of ML models is often similar. It's commonly said in ML circles that 80% of the time is spent on cleaning data and doing feature engineering and only 20% on modeling. For example, until around 2014, when deep learning started taking off, most ML models at Google were trained using the same system (Sibyl), a form of boosted logistic regression that could be run on large data sets with map-reduce. Deep learning has the advantage that it can learn features that would otherwise need to be engineered by hand, and can thus automate some of the manual work typically done by an ML engineer, but in order to do that, as mentioned above, it generally needs a massive dataset and often days or weeks of compute time.

ML Terminology

Machine Learning is the process of building models from data, either to gain insight into the data or to make predictions for new data (generalization). There are two main categories:

supervised learning, in which the data (training data) is labeled with a known outcome (this is the supervision part), and the aim is to predict outcomes for new data; if the outcome is a category this is classification, while if the outcome is a continuous quantity this is regression

unsupervised learning, in which the data is analyzed for underlying patterns to gain insight; common examples are clustering (finding similar cases), outlier detection (finding unusual cases), and dimensionality reduction (reducing the number of variables needed to represent the data, essentially a form of lossy compression)
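To make the two categories concrete, here is a minimal sketch contrasting a supervised classifier with an unsupervised clusterer. It uses the classic Iris dataset that ships with Scikit-Learn purely as an example; any small labeled dataset would do.

```python
# Minimal sketch: supervised vs. unsupervised learning on the Iris data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: we have labels y and learn to predict them from X.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))   # predicted classes for the first 3 rows

# Unsupervised: we ignore the labels and look for structure in X alone.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:3])       # cluster assignments for the first 3 rows
```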

The data used in ML is typically tabular. The columns are called features while the rows are instances or observations. For supervised learning, we call the output the target or label. We'll often refer to the vector of features as X and the output (label) as y, and say that we're trying to find a function f such that f(X) approximates y; this function is our model and is characterized by some model parameters. We usually choose the type or class of model, and then use ML techniques to learn the model parameters that minimize the error (the difference between the predicted and actual output). More generally we can think of this as an optimization problem, where we are trying to learn the parameters that minimize a loss function; that loss function is typically some cumulative function of the errors, a common choice being RMSE (root mean squared error).
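As a concrete example of the loss function just mentioned, here is a minimal sketch of computing RMSE by hand with NumPy; the arrays are made-up values, purely for illustration.

```python
import numpy as np

# Made-up actual and predicted values, purely for illustration.
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.8, 5.4, 2.0, 8.0])

errors = y_predicted - y_actual
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
print(rmse)
```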

While we learn the model parameters, there are some other parameters we need to specify as inputs to the model too; these are called hyperparameters. For example, if we learn a decision tree model, the parameters might be the features being tested at each branch and the values they are being tested against, while the hyperparameters would include the depth that we want to limit the tree to. For a polynomial regression model, the parameters would be the coefficients of the polynomial, while the hyperparameters could include the degree of polynomial we want to learn.
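In Scikit-Learn, hyperparameters are typically the arguments we pass to a model's constructor, while the learned parameters are attributes set during fitting (by convention, names ending in an underscore). A minimal sketch with a depth-limited decision tree, again using the Iris data just as a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth is a hyperparameter: we choose it, the model does not learn it.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

# The learned parameters live in the fitted tree structure itself,
# e.g. which feature and threshold are tested at each node.
print(tree.tree_.feature[:5])     # feature index tested at the first few nodes
print(tree.tree_.threshold[:5])   # threshold each of those nodes compares against
```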

ML Workflow

A typical machine learning workflow begins with:

getting the data

exploring and cleaning up the data (handling missing values, normalizing values, removing bad data and outliers, encoding data in appropriate representations, and more)

possibly creating some additional synthetic features (e.g. we could use linear regression to fit a higher-order polynomial by creating synthetic values that are existing values raised to the power 2, 3, etc.; other common examples might be computing and including aggregate values like means and standard deviations of some of the original values) - see the sketch after this list

picking a type of model to use, and configuring the initial hyperparameters
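As a sketch of the synthetic-feature idea above, Scikit-Learn's PolynomialFeatures can generate the squared (and higher-order) terms for us, and a plain LinearRegression can then fit a curve to them; Scikit-Learn offers similar transformers for the cleaning steps (e.g. SimpleImputer for missing values, StandardScaler for normalization). The data below is made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up 1-D data that follows a rough quadratic trend.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + np.random.normal(scale=0.2, size=50)

# degree=2 adds the synthetic column X^2, so plain linear regression
# can fit a parabola.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```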

After this first phase, we could do multiple iterations of:

training the model

evaluating the results

adjusting the model type or the hyperparameters if the results are not yet satisfactory

And this second phase could necessitate a return to the first phase, to get more data or create more synthetic features.
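A minimal sketch of that train-and-evaluate loop with Scikit-Learn, using the Iris data and a decision tree purely as placeholders for whatever data and model type you have chosen:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out some data so evaluation is done on instances the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3)   # adjust hyperparameters between iterations
model.fit(X_train, y_train)                   # training
print(model.score(X_test, y_test))            # evaluation (accuracy for classifiers)
```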

The hyperparameter tuning could be done manually or may itself be automated by doing a state-space search through the possible values.
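One common way to automate that search is Scikit-Learn's GridSearchCV, which tries every combination in a grid of hyperparameter values and keeps the one that cross-validates best. The grid below is just an illustrative example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to search over (an illustrative grid).
param_grid = {"max_depth": [2, 3, 5, 10], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # the combination that scored best
print(search.best_score_)    # its cross-validated score
```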

After some number of iterations, if we have a model that improves upon existing capabilities, we would want to deploy it. For the model to be useful, new data that we apply the model to should have similar properties to the training data (i.e. we rely on our test set being representative of the new data). If this is not true (e.g. because the future is just different and unpredictable, or perhaps we have overfit the training data), the model likely will not generalize well. If it is true, then we may be done, at least for some time, but in many cases future data will eventually diverge from the past and the model will degrade. We thus need to get more recent data, retrain, retune, and redeploy. The cadence of these refreshes will depend on the particular problem we are addressing. For example, if we build a model to recognize species of flowers we may need to retrain very infrequently or possibly never, as the primary reason to retrain is that we discover species that the model has not accounted for. On the other hand, if we build a model to detect fraudulent behavior we are pitting the model against adversaries that are adaptive and intelligent and we may need constant refreshes.