The recent boom in machine learning has driven a tremendous rise in the popularity of complex machine learning methods. Techniques previously relegated to the annals of academic discourse have become common topics of discussion in some tech circles. Terms such as convolutional neural networks, variational autoencoders, and even deep Q-learning now matter to many people who previously could hardly be troubled to care about the latest fads in machine learning.

Despite the sudden prominence of advanced techniques in machine learning, some aspects of the field have not actually changed quite as much as the news would seem to suggest. While machine vision and some related fields have been transformed overnight, many of the tried and true methods in the background have remained the same. Here I would like to discuss and explain the king of workhorse machine learning methods: the logistic regression.

One important thing to note about the logistic regression is that it goes by many different names. Just a few: single-layer sigmoidal perceptron, log-linear model, generalized linear model with a logit (log-odds) link function, the logit, and the logistic regression. Many fields have discovered or repurposed the model for their own ends and particular use cases. The logic of the model remains essentially the same across all disciplines; it gets renamed only to help the people using it better understand the intuition behind why the model works.

The logistic regression is completely ubiquitous. In fact, it is virtually impossible that some aspect of your life has not been predicted by a logit already. The effects of its influence are felt in staid fields such as insurance, as well as in the hippest models coming off the presses in deep learning. A logit is the go-to simple model that takes a long list of inputs, variables, or quantified information and learns to produce a single number between 0 and 1, also known as a probability.

To start understanding the logit, the first step is to understand a linear regression. Think about a simple situation where you are trying to understand a correlation between two variables. Consider shoe size (which I will write as S) and height (which I will write as H). A regression tackles this problem by assigning each input a number that represents its significance, then multiplying the input by that number. So if, for example, we are trying to express shoe size in terms of height:

\(\beta H = S\)

This is basically like one of those old Quizilla polls from the early 2000s. β is a number that represents how strong the connection between H and S is. Where one of those old polls would have a key telling you how many points to award yourself for any given answer on the quiz, β tells you what to expect in terms of shoe size for any given height.

Machine learning and statistics aren't just about setting up equations. After all, machine learning is called machine learning. The real question once you have this equation is, "how big (or small) is β?" The short answer is: wiggle β around a lot, and see what gets you as close as possible to the relationship between H and S that you see in the real world. It might never fit exactly, but you can measure the distance between the value you predict for shoe size (equal to β H) and the actual value you observe for all different people. After all, though height is a good predictor of shoe size, it isn't 100% accurate.
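The "wiggle β and measure the distance" idea can be sketched in a few lines. Here is a minimal example using numpy; the heights and shoe sizes are made-up numbers for illustration, and for this one-variable, no-intercept case the best β actually has a closed-form answer, so no wiggling loop is needed:

```python
import numpy as np

# Hypothetical data: heights (cm) and shoe sizes (EU) for a few people.
H = np.array([160.0, 170.0, 180.0, 190.0])
S = np.array([38.0, 41.0, 43.0, 46.0])

# The beta that minimizes the total squared distance between beta*H and S.
# (This closed form is what "wiggling beta" converges to.)
beta = np.dot(H, S) / np.dot(H, H)

predictions = beta * H
error = np.sum((predictions - S) ** 2)  # never exactly zero on real data
```

The `error` term is the "distance" the text describes: it never quite reaches zero, because height is a good but not perfect predictor of shoe size.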

Exactly how this is done encompasses a very large part of the field of machine learning and statistics, and I'm going to skip the details here. Suffice it to say that improvements in these fitting procedures drive much of the progress in machine learning, particularly as the model (and the problem you are trying to solve) grows more complex.

This is all well and good for things you can count or measure. But how do you solve this problem when the thing you are interested in is binary, either-or, or otherwise a choice between two options? This is where the humble logit comes in. The solution is a simple one: Squash the result.

Now, instead of predicting shoe size, let's take H and use it to predict another quantity: gender or G. Our new equation is:

\(\beta H = G\)

Enter the logistic function. I can write it out here for you:

\(\frac{1}{1 + e^{-\beta H}}\)

More importantly, when you plot it, you see an S-shaped curve: large negative inputs produce values near 0, large positive inputs produce values near 1, and an input of 0 lands exactly on 0.5.

This is going to squash any number into the range (0, 1). We can rewrite our equation with σ standing in for the logistic or, ahem, sigmoid function:

\(\sigma(\beta H)\)
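The squashing step takes almost no code. Here is the logistic function in pure Python, using nothing beyond the standard library:

```python
import math

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    # sigmoid(0) is exactly 0.5; large inputs approach 1, small approach 0.
    return 1.0 / (1.0 + math.exp(-x))
```

Feed it any score, positive or negative, huge or tiny, and a valid probability comes out the other side.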

So, when we ask whether someone is a man or a woman, we assign women a score of 1 and men a score of 0. We then proceed to wiggle \(\beta\) as we did before, trying to get the resulting prediction to match the data we observe as well as possible.

We're almost done. Note that in practice, we add some other number, let's call it \(\beta_0\), to the score we are calculating. We can call our earlier \(\beta\) by the new moniker \(\beta_1\):

\(\sigma(\beta_1 H + \beta_0)\)

This extra number, usually called the intercept or bias, shifts the curve left or right. Depending on what exactly we are trying to do, it can be possible to skip this step. Don't worry too much about it right now. We're done, basically.
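Putting the pieces together, here is a minimal sketch of fitting \(\beta_1\) and \(\beta_0\) by stochastic gradient descent. The heights, labels, learning rate, and iteration count are all invented for illustration; real implementations use more careful optimizers, but the "wiggle until the predictions match" logic is the same:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical data: heights (cm) and 0/1 labels.
H = [150, 155, 160, 165, 170, 175, 180, 185]
G = [1, 1, 1, 1, 0, 0, 0, 0]

# Center the heights so gradient descent behaves well
# (an implementation detail, not part of the model itself).
mean_h = sum(H) / len(H)
Hc = [h - mean_h for h in H]

b1, b0 = 0.0, 0.0  # beta_1 and beta_0, wiggled below
lr = 0.01          # how big each wiggle is

for _ in range(5000):
    for h, g in zip(Hc, G):
        p = sigmoid(b1 * h + b0)
        # Nudge each parameter downhill on the log-loss.
        b1 -= lr * (p - g) * h
        b0 -= lr * (p - g)
```

After training, `sigmoid(b1 * (h - mean_h) + b0)` gives a probability for any new height `h`, high for short heights and low for tall ones on this made-up data.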

This method is a logit. And this method is everywhere.

It's practically like one of those "one weird trick" ads we are so "fond" of. One weird trick to predicting practically everything. Except it works. You can get decent results on everything from predicting census-style statistics (such as gender) to classifying text. Even though classifying text involves thousands of different inputs (one variable standing in for each kind of word), the logit still does remarkably well.
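To make the "thousands of inputs" idea concrete, here is a toy bag-of-words classifier in numpy. The vocabulary, documents, and labels are all invented for illustration, and a real system would have thousands of columns instead of four, but the model is the same logit, just with one β per word:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up "spam" data: each row counts how often each vocabulary word
# appears in one document.
vocab = ["free", "winner", "meeting", "report"]
docs = np.array([
    [2.0, 1.0, 0.0, 0.0],  # spam (label 1)
    [1.0, 2.0, 0.0, 0.0],  # spam (label 1)
    [0.0, 0.0, 1.0, 2.0],  # not spam (label 0)
    [0.0, 0.0, 2.0, 1.0],  # not spam (label 0)
])
labels = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(len(vocab))  # one beta per word
b = 0.0                   # the intercept, beta_0
lr = 0.1

for _ in range(1000):
    p = sigmoid(docs @ w + b)
    grad = p - labels
    w -= lr * (docs.T @ grad)  # nudge every word weight at once
    b -= lr * grad.sum()
```

The only change from the height example is that the score is now a dot product over many inputs instead of a single multiplication; everything else, squashing included, is identical.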

Many state-of-the-art machine learning techniques do a variety of fancy things to their input, teasing out connections and correlations, before finally passing some preprocessed data to a logit for the last step. Practically any neural network that does classification (in other words, deciding what 'bin' your input belongs in) essentially has a logit on top of it to produce the final result.

So that's it, the mighty (yet humble) logistic regression. Logits are everywhere, not just in our lives, but under the hoods of most machine learning systems. Proof that the simplest ideas can be the most effective and pervasive.