ML For Everyone — Part 2

A showroom of Supervised ML algorithms

Welcome back to my introductory ML series! Make sure you’ve gone through the first part of this series before continuing. In this article, you are going to get up to speed with a lot of ML terminology and some very cool and popular algorithms! It may seem a little dry, but I promise I will only go into enough detail so that you have a little bit of theory in mind before you jump into training your own models in the next article. It’s imperative that you know a little bit about all of these so that you can choose the appropriate model for your problem, and that is going to save you a LOT of time!

Photo Credits: Pexels

As I mentioned in the last article, there are three major types of ML: supervised, unsupervised and reinforcement. A lot of the problems you will encounter in an ML setting are probably going to be supervised, and sometimes even reinforcement learning problems can be reformulated as supervised learning problems, so you may well be able to use the models described below. Unsupervised learning is usually used as a helper to supervised learning algorithms because, on its own, it does not produce results that are as directly useful. Thus, while I will go into a little more detail on the other two types of ML eventually, this article is going to be a showroom of supervised models.

To be fair, each model I cover below deserves its own separate article to do all the details justice. However, my intention is to give you just enough so that you can use them quickly without getting bogged down by the math and the details. If you are interested after this article and have the time, I strongly recommend checking out Andrew Ng’s Machine Learning course because he explains all the concepts involved with amazing clarity. It is by far the most thorough, well-explained and interesting MOOC I have ever taken. To truly understand and appreciate everything that’s going on there, though, you might need a little bit of knowledge about matrices and vectors, basic differentiation, as well as some programming knowledge.

Regression vs Classification

If you recall, supervised learning involves data points and labels, i.e. inputs and answers, that are fed to train the model. The model learns a representation of what it is trying to predict by fitting the training data. Once the model has been fit well, it can be used to predict the answers for different future inputs.

This can take two forms, however. Consider an ML problem where we need to predict the height of an individual from their weight, age and gender. The answers lie on a continuous spectrum of possible heights, of which there are infinitely many. Therefore, this is known as a regression problem. On the other hand, consider an ML problem where we want to determine whether a given image is of a dog, a cat, or neither. Here, we have only three possible answers, so the problem is a classification problem. More generally, when we want to assign inputs to categories, we call it classification, and when we want to predict continuous, non-categorical outputs, we call it regression.

Interestingly, regression and classification are, in essence, interchangeable. For example, we can convert the animal classification problem mentioned above into a regression problem by changing the answer space to the probabilities of the input image being a dog, a cat or neither. Then we could set a threshold to produce binary outputs, i.e. if the probability of being a dog is more than 50%, then it’s a dog, otherwise it’s not. Likewise, we can break the answer space into many small categories to simulate regression. So, any model you see below can be modified (and popular modifications already exist) so that it can be used for either regression or classification.

Some Terminology

What we have been calling “inputs” and “answers” are more formally known as features and labels (or classes). Features can be single values, or they can be multi-dimensional. For example, a 28x28 pixel image has 784 pixel values, so we can say that there are 784 features. A dataset contains a number of feature-label pairs that are provided as inputs during the training phase of the model. In this phase, we use algorithms to learn patterns in the input data.

However, how do we know if the model has learned good patterns that will help it do well on new data? To find out, we need to test the model, but we cannot do that on the same data as the training phase (we want to avoid rewarding memorization). So, before we begin training, we split the original dataset into training data and testing data. Usually we use two-thirds of the data for training and one-third for testing, but the ratio may vary based on the amount of data you actually have (e.g. if the dataset is very large, you can decrease the testing fraction).

While training or testing, we may sometimes refer to the loss/error of the model. For regression, we usually use the mean squared error, which is simply the mean of the squared distances between the predicted values and the actual values. For example, if we have two data points (2, 3) and (4, 5) and we predict the outputs for 2 and 4 to be 3 and 6 respectively, our mean squared error is ((3 - 3)² + (6 - 5)²) / 2 = 0.5. The error for classification is simply the number of misclassified data points. There are more sophisticated measures, known as loss functions, but for now you can stick with these. Generally, our aim during training is to optimize/minimize the error on the training dataset without overfitting, so that the error is low on new data as well (which we can verify by testing).
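In code, that calculation is only a few lines. Here is a minimal sketch in plain Python (the function name mse is just my own choice) reproducing the worked example above:

```python
# Mean squared error: the average of squared differences between
# the predicted values and the actual values.
def mse(predictions, actuals):
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)

# The example from above: actual outputs 3 and 5, predictions 3 and 6.
print(mse([3, 6], [3, 5]))  # (0 + 1) / 2 = 0.5
```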

In many models, we also have hyper-parameters. No need to be scared! These are just free parameters in the model that we can tune manually to whatever works best for our particular problem. This will become a whole lot clearer when we talk about our first model below! In practice, if we tune the hyper-parameters often to select the model that works best on the test dataset, then we are inadvertently fitting our model to the test dataset as well. To avoid this, we usually have a cross-validation set which we use for tuning our hyper-parameters, and we only use the test dataset to actually test the model. In such a case we would usually have a 60–20–20 split, but you can ignore this for now!
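To make the splits concrete, here is a minimal sketch of a 60–20–20 split using scikit-learn’s train_test_split on made-up data (the trick is two chained calls: 0.25 of the remaining 80% is 20% of the whole):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)  # 100 made-up examples with 3 features each
y = np.random.rand(100)     # 100 made-up labels

# Carve out 20% for testing, then split the remaining 80% into
# 60% training and 20% cross-validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)
```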

Nearest Neighbors

Photo Credits: WikiMedia Commons

In nearest neighbors, we memorize all the training data in a nice index so that, when given a new data point, we can find the points in the training data that resemble it best and then simply average their labels to form our prediction. This is a very intuitive and simple model, and yet it can be incredibly powerful. If you are a software developer, try developing this on your own without any frameworks or libraries, and think about which data structures you could use to do this most efficiently!

One hyper-parameter in this model is the number of neighbors we consider while making predictions. If we set the number too high, our predictions get washed out because we are averaging over points that are farther away and not actually similar, whereas if we set it too low, our predictions become highly variable because we just echo the labels of the nearest few neighbors. The best number depends on your problem, the number of features, the size of the training dataset and more! You need some trial and error, and some reasoning, to see what works best for you here!
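If you want a starting point (hopefully without spoiling the exercise above), here is a naive regression version in Python, with k as the hyper-parameter just discussed. A serious implementation would use a smarter index such as a k-d tree instead of scanning every point:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Average their labels to form the prediction (for classification,
    # you would take a majority vote instead).
    return train_y[nearest].mean()

train_X = np.array([[1.0], [2.0], [3.0], [10.0]])
train_y = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_predict(train_X, train_y, np.array([2.5]), k=2))  # averages 2.0 and 3.0 -> 2.5
```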

Learn more about this algorithm here.

Decision Trees and Variations

Photo Credits: Wikipedia

You might need some tree terminology here. As the name suggests, in this model we construct a tree which we can navigate to decide the label of an input. At each node in this tree, we ask a question like “Is X > 0.5?” and, depending on the answer, we move right or left until we reach a leaf node, where we have enough information to decide the label of the input.

These are extremely powerful models, primarily because of their explainability. In ML, we want to be able to explain why a model makes certain decisions so we can reason about it better, diagnose problems, and make sure it is sensible. After all, we don’t want our models picking up on patterns that do not exist. Decision Trees can also be visualized easily, so you can see the skeleton of your model.

To train one, at each node we generate a list of all the possible questions we could ask about the features of the training data, and then select the question that maximizes our information gain. Intuitively, we choose the partition that gives us the most discriminatory information, helping us distinguish elements while asking the fewest questions.
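To make “information gain” concrete, here is a small sketch using Gini impurity, one common impurity measure (entropy is another), to score a candidate split:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: the probability of mislabeling a random element
    # if we labeled it according to the class distribution.
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

def information_gain(parent, left, right):
    # The weighted drop in impurity from splitting parent into left/right.
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return gini(parent) - (weight_left * gini(left) + weight_right * gini(right))

parent = ["dog", "dog", "cat", "cat"]
# A question that separates the classes perfectly has maximal gain.
print(information_gain(parent, ["dog", "dog"], ["cat", "cat"]))  # 0.5
```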

Josh from the Google Developers channel has a great explanation of this topic as he takes you through writing your own Decision Tree classifier from scratch here.

The power of Decision Trees can be compounded by using them in an ensemble. If you took a statistics course, you might know that the average of multiple independent random variables has lower variance than the variables themselves. Ensemble techniques combine hundreds of decision trees for better predictions with reduced over-fitting. What is really happening under the hood is slightly more complex, and you can learn more about it here and here, but all you really need to know is that you can get really powerful models by combining results from an “army of idiots”. There are two popular ensemble techniques, random forests and gradient boosting, and they are an excellent starting point for almost any machine learning problem you take up. They are usually fast to train, have fewer hyper-parameters than neural networks, and can perform well even on less training data. In the next article, I will share how you can train your own.
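As a small preview (the next article goes deeper), here is roughly what training a random forest looks like with scikit-learn, using its bundled iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# n_estimators is the size of our "army of idiots".
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on the held-out test set
```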

Linear and Logistic Regression

Photo Credits: WikiMedia Commons

You might know linear regression from high-school math. It is really just the best-fit line you used to draw by hand or with your graphing calculator, except that it is a more general technique which works with higher-dimensional data. In general, we have a model y = Wx + b, with W being an n-by-m matrix where n is the dimension of the output and m is the number of features, and b being an n-dimensional bias vector.

If that’s too much math, let’s use an example of housing prices. Say we have three features: the number of bedrooms, the square footage and the neighborhood crime rate. Also, let’s say we want to output two values: the rent we could get by renting the house and the price we could get by selling it. Then x will be our vector with the three features, and y will be a two-dimensional vector containing the two outputs. We want to learn the values of the weight matrix W (here 2-by-3) such that performing that matrix multiplication and adding the bias term b gives outputs y that best fit our training data.

So how do we learn W and b? The usual algorithm is to first randomly initialize W and b and then make predictions for the records in our training dataset. These predictions will most likely be highly inaccurate because our parameters are random, but now we can calculate the loss of our model using a loss function like the mean squared error mentioned above. (There are many other loss functions for different contexts; e.g. categorical cross-entropy is the most common for classification problems.) Next, we use calculus to compute the derivative of the loss with respect to the weights using the chain rule, which tells us how changing the weights affects the loss. Finally, we update the parameters by moving a small step in the direction that reduces the loss. The size of this “small step” is known as the learning rate, which is a hyper-parameter: too big and we keep overshooting the optimal point; too small and we take a long time to converge to it. This process of updating weights by calculating gradients and moving in the direction of steepest descent (the direction that decreases the loss the most) is known as gradient descent.
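Here is that loop as a minimal numpy sketch for a single input feature and a single output, learning y = 2x + 1 from synthetic data (the learning rate and step count are arbitrary choices):

```python
import numpy as np

# Synthetic data generated from y = 2x + 1 (the "true" W and b).
X = np.random.rand(100)
y = 2 * X + 1

W, b = np.random.randn(), np.random.randn()  # random initialization
learning_rate = 0.1                          # the hyper-parameter discussed above

for step in range(1000):
    predictions = W * X + b
    error = predictions - y
    # Derivatives of the mean squared error with respect to W and b (chain rule).
    grad_W = 2 * (error * X).mean()
    grad_b = 2 * error.mean()
    # Take a small step in the direction that reduces the loss.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(W, b)  # should be close to 2 and 1
```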

Photo Credits: WikiMedia Commons

Gradient Descent is at the heart of optimizing many models, including neural networks. If your goals are academic, you should definitely study it in more detail, including some of the more advanced variants of the basic algorithm like the RMSProp and Adam optimizers. Even if your goals are more engineering-oriented, at some point you will want to understand the optimization process well so that you can diagnose the problems your model faces.

You also need to know that, despite the name, linear regression can be used for non-linear regression as well! If you manually add polynomial or exponential features to the feature space, you can fit curves that are non-linear with respect to the original features. For example, you can add a feature x² to your feature space and then fit a linear model of the form y = Wx² + b, and with the same process as before, you get yourself a quadratic regressor!
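For instance, a quadratic fit is just linear regression on an augmented feature. Here is a minimal sketch, with numpy’s least-squares solver standing in for the gradient descent loop above:

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = 3 * x ** 2 + 1          # data that a straight line cannot fit

# Augment the feature space with x^2; the model is still linear in its weights.
features = np.column_stack([x, x ** 2, np.ones_like(x)])
weights, *_ = np.linalg.lstsq(features, y, rcond=None)
print(weights)  # roughly [0, 3, 1]: no linear term, 3 on x^2, bias 1
```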

Photo Credits: WikiMedia Commons

Logistic regression is a variant of this which is better suited for classification problems. In essence, our model is now y = sigmoid(Wx + b), where the sigmoid function converts any input into a value between 0 and 1. This way we can represent “dog” vs “not dog”, or “cat” vs “not cat”, and get categorical results instead of continuous ones. We need to adapt our calculation of the derivative to account for this additional function, but everything else stays the same!
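The sigmoid itself is a one-liner, so here is a minimal sketch of what a trained logistic regression computes at prediction time (the weights here are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1 / (1 + np.exp(-z))

W = np.array([1.5, -2.0])   # hypothetical learned weights
b = 0.3                     # hypothetical learned bias

x = np.array([0.8, 0.1])    # a new input with two features
probability = sigmoid(W @ x + b)
print(probability > 0.5)    # threshold the probability to get a class
```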

Neural Networks

Neural networks are a natural extension of linear/logistic regression, with two main differences. Firstly, instead of the sigmoid function, we can now choose between a variety of activation functions like linear, ReLU, sigmoid, tanh, leaky ReLU, softmax and more (yes, another hyper-parameter!). Secondly, and more importantly, we now have multiple regression-like layers stacked on top of each other. For example, we can have y = A(act(Cx + d)) + b, where act is the activation function.
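In code, that stacked formula is just two matrix multiplications with an activation squeezed in between. Here is a forward pass only, with arbitrary layer sizes and random weights standing in for trained ones:

```python
import numpy as np

def relu(z):
    # A popular activation function: max(0, z) element-wise.
    return np.maximum(0, z)

x = np.random.rand(3)        # input layer: 3 features
C = np.random.randn(4, 3)    # weights into a hidden layer of 4 neurons
d = np.random.randn(4)       # hidden-layer biases
A = np.random.randn(2, 4)    # weights into an output layer of 2 neurons
b = np.random.randn(2)       # output-layer biases

hidden = relu(C @ x + d)     # act(Cx + d)
y = A @ hidden + b           # A(act(Cx + d)) + b
print(y)
```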

Photo Credits: WikiMedia Commons

To break down some terminology, we have multiple layers in the network, where each layer has neurons. We have an input layer where data is fed in (the x vector), and an output layer where the computed predictions come out (the y vector). All layers in the middle are known as hidden layers. In fully-connected networks, each neuron is connected through edges to every neuron in the previous layer as well as every neuron in the next layer. Each edge has a weight representing the strength of the connection. The activation of a neuron is its activation function applied to the weighted sum of all its inputs. This is better explained visually.

Phew! That’s a lot of jargon. But really, it’s nothing different from what we have already seen. Logistic regression is a neural network too, just one with no hidden layers. When we do logistic regression, and then logistic regression again on its output, we have a two-layer network! Of course, we need to propagate the gradients to both layers so that we can update all the weights. This is known as back-propagation, which is just the chain rule applied over and over again for every weight in the network. In a typical neural network, you can have thousands of neurons in each layer, and multiple such layers. It’s easy to see how propagating all these gradients backwards can be a computationally intensive process, which is why training deep neural networks is so slow! The parallelization offered by the recent boom in GPUs has helped, but it is not uncommon in industry for models to train for weeks or even months.

Neural networks are very good function approximators. They have so many parameters that they can detect patterns very accurately. However, the large number of parameters also makes them prone to over-fitting, so to use them successfully you need to start worrying about additional aspects like regularization. While they have been used in very sophisticated ways, the fundamental concept is really simple, and although back-propagation may seem like a lot of work, the libraries we use for building these networks take care of most of it! Also note that while neural networks are traditionally supervised learning models, they are used for unsupervised learning and reinforcement learning as well.

Support Vector Machines

Last model for the day! SVMs try to find the plane that best segregates points of two classes. This is known as a decision boundary or decision frontier, where each side of the boundary represents membership in one class. SVMs work by finding support vectors, the extreme points in each class, i.e. points that lie closest to points of the other class. The algorithm then fits a boundary that pushes against these points, maximizing the margin between the classes, on the assumption that the remaining points can be safely ignored.

Photo Credits: WikiMedia Commons

SVMs are very powerful as well, especially when used with kernels. Kernels transform the inputs into a space where the classes have a linear decision boundary. When there isn’t a linear decision boundary in the original feature space, these transformations, known as the kernel trick, can find good non-linear frontiers with respect to the original features. Learn more about SVMs here, and I recommend watching Andrew Ng’s lecture on SVMs in the aforementioned Coursera MOOC to learn more about the math involved.
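Here is roughly what that looks like in scikit-learn, again on its bundled iris dataset for illustration; the “rbf” kernel is one popular choice for finding non-linear boundaries:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# kernel="linear" fits a flat hyperplane; "rbf" allows curved boundaries.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set
```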

Important Advice

Recently, I have been networking with many industry professionals, startup founders, recruiters, business people, professors and graduate students, and they all have one thing in common to say: for 90% of the problems you want to solve, a simple model is a lot more efficient and suitable than a complex one. This means that if your interests lie in engineering or business value, you shouldn’t need to use deep neural networks, despite all the hype, unless you are doing sophisticated image or text classification or your problem is extremely complex. Whenever you have a machine learning problem, start with a simple model and see how it performs.

“Traditional AI” models like decision trees and logistic regression can perform just as well as (or almost as well as) DNNs on problems in many industries while being significantly easier and cheaper to train. Consider this example from the health industry, or this startup in the finance industry. This is an opportunity for you to build something awesome without getting bogged down by the complications of deep learning frameworks or more complex models. There is untapped potential in AI despite all the buzz, and you don’t need to be a math whiz to be part of the movement.

Also, one of the more important challenges in applied ML is data engineering and preparation. Real data sources are often not as rich or clean as the datasets in your Kaggle competitions, and when you want to apply ML to the real world, you will need to face this challenge. The point is: if your goals are academic, feel free to spend that extra month finessing your TensorFlow skills. For engineering, however, the returns on investing time in practical aspects, like rapid cycles of building and testing models and learning data preparation, are much higher.

Summary

Ah, thank you for sticking around! This has been a long article, but pat yourself on the back because you now know a lot about some of the most popular supervised learning algorithms used in ML practice! There was a lot of content and theory here, but with this foundation in place, you will be able to actually build models in minutes (not exaggerating) once I introduce you to the tools in the next article!