Logistic regression is one of those topics I struggled with endlessly. Why the logarithm? Why the natural logarithm? Why the natural logarithm of the odds?

Refresher: Linear regression is the idea that I can take a set of features and find a weighted sum of those features which produces a good estimate of the numeric "target" I'm trying to predict.

$$f(\vec{x}) = w_0 + w_1x_1 + w_2x_2 + \ldots$$
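
As a minimal sketch of that idea in Python (the weights and feature values here are made up purely for illustration):

```python
import numpy as np

# Hypothetical weights: w[0] is the intercept w_0.
w = np.array([2.0, 0.5, -1.3])   # w_0, w_1, w_2
x = np.array([3.0, 1.2])         # x_1, x_2

# f(x) = w_0 + w_1*x_1 + w_2*x_2
f_x = w[0] + w[1:] @ x
print(f_x)  # 1.94
```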

So Logistic Regression...

Logistic regression is used when you want to model the probability that an instance in your dataset belongs to a given class. Let's say you want to predict the probability that a company will still be in business at the end of the year; we'll call that $p_+(\vec{x})$, where $\vec{x}$ is the feature vector describing a company in your dataset. A ($+$) here indicates "still in business" and a ($-$) will indicate "out of business".

The Goal

What's our ideal model here? We want our predicted probability to be 1 for all $\vec{x_+}$ (companies still in business) in our dataset and 0 for all $\vec{x_-}$ (companies out of business). It sounds like the target of our model is a numeric value between 0 and 1; since we already know how to use linear regression to model numeric targets, let's do that.

Houston, we have a problem

But wait, how does linear regression work with a target between 0 and 1? A line by definition extends over $(-\infty, \infty)$, which means some values produced by your linear model could fall totally outside the range of possible values for your target!

So we need a target that spans $(-\infty, \infty)$ but can be mapped directly back to what we really want: the probability of class membership. Transforming the probability into the odds of class membership gets us part of the way there, since odds range over $[0, \infty)$. Odds are defined as $\frac{p_+(\vec{x})}{p_-(\vec{x})}$, which is the same as $\frac{p_+(\vec{x})}{1 - p_+(\vec{x})}$.
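
A quick numeric check of that range (NumPy, with made-up probabilities):

```python
import numpy as np

p = np.array([0.01, 0.5, 0.8, 0.99])  # made-up probabilities
odds = p / (1 - p)
print(odds)  # [ 0.0101  1.  4.  99. ]  -- bounded below by 0, unbounded above
```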

To get us the rest of the way, it turns out that the natural logarithm function ($\ln$) handily transforms our $[0, \infty)$ into $(-\infty, \infty)$. Any logarithm removes the intrinsic asymmetry of odds; we use the natural log in particular because "coefficients on the natural-log scale are directly interpretable as approximate proportional differences" and because it makes the calculus easier. Applying it makes the new target of our regression

$$ \ln\bigg(\frac{p_+(\vec{x})}{1 - p_+(\vec{x})}\bigg) $$
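
A sketch of what this buys us: complementary probabilities now map to log-odds that differ only in sign, which is exactly the symmetry the raw odds lack ($p = 0.8$ gives odds of 4, but $p = 0.2$ gives odds of only $1/4$):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.8])
log_odds = np.log(p / (1 - p))
print(log_odds)  # [-1.386  0.  1.386] -- symmetric about 0
```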

Fitting the Model

Now that we know what we're after (modeling the "log-odds" of class membership using linear regression), how do we go about fitting a solution? Here's the model:

$$\ln\bigg(\frac{p_+(\vec{x})}{1 - p_+(\vec{x})}\bigg) = f(\vec{x}) = w_0 + w_1x_1 + w_2x_2 + \ldots $$

Let's assure ourselves that we can still actually get the probability back out. Exponentiating both sides gives $\frac{p_+(\vec{x})}{1 - p_+(\vec{x})} = e^{f(\vec{x})}$; solving for $p_+(\vec{x})$,

$$ p_+(\vec{x}) = \frac{1}{1 + e^{-f(\vec{x})}} $$
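
This is the logistic (sigmoid) function. A minimal round-trip check in Python:

```python
import numpy as np

def sigmoid(f):
    """Map log-odds back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

log_odds = np.log(0.8 / 0.2)  # start from p_+ = 0.8
print(sigmoid(log_odds))      # 0.8 -- probability recovered
```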

Great! Now we can try different sets of weights and see how they perform. For any set of weights $\vec{w}$ that produces probability estimates $p_+(\vec{x})$, we can score the model's performance on each instance $\vec{x}$ of our dataset:

$$ score(\vec{x},\vec{w}) = \left\{ \begin{array}{ll} p_+(\vec{x}) & \mbox{if } \vec{x} \mbox{ is actually a } + \\ p_-(\vec{x}) & \mbox{if } \vec{x} \mbox{ is actually a } - \end{array} \right. $$
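
Here's a direct transcription of that piecewise score into Python (assuming a boolean label, where `is_positive` is `True` for a $+$ instance):

```python
import numpy as np

def score(x, w, is_positive):
    """Probability the model with weights w assigns to x's actual class."""
    f_x = w[0] + w[1:] @ x
    p_plus = 1.0 / (1.0 + np.exp(-f_x))
    return p_plus if is_positive else 1.0 - p_plus
```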

If we multiply these scores together over the data, we get the model's likelihood: the probability it assigns to the classes we actually observed. (In practice we sum the logs of the scores instead, which is maximized by the same weights and avoids numerical underflow.) The set of weights which gives the highest probabilities to positive instances and the lowest probabilities to negative instances will be the $\vec{w}$ with the highest likelihood. This is known as the "maximum likelihood model".
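
As a sketch of what fitting might actually look like, here's maximum likelihood via generic numerical optimization (SciPy's `minimize` on the negative log-likelihood; the toy data is made up, and real libraries use specialized solvers):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative sum of log-scores; minimizing this maximizes the likelihood."""
    f = w[0] + X @ w[1:]
    p_plus = 1.0 / (1.0 + np.exp(-f))
    scores = np.where(y == 1, p_plus, 1.0 - p_plus)  # score of each instance
    return -np.sum(np.log(scores))

# Made-up toy data: one feature, five companies, 1 = still in business.
X = np.array([[0.5], [1.0], [2.0], [3.0], [3.5]])
y = np.array([0, 1, 0, 1, 1])

result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y))
print(result.x)  # fitted weights [w_0, w_1]
```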

Where do I go from here?