Model fitting and optimization

In the last section, we derived the logistic regression model and the underlying logic for how we can correlate observations with binary events. In this section, we’ll take a look at the math needed to actually fit the model.

Before we get into the specifics of logistic regression, I want to introduce a core concept in machine learning. Machine learning models are, in my opinion, mathematical structures that we use to think about how input and output data are related to each other. The real power of these structures, however, is that we don’t predefine everything about them a priori; we leave some internal parameters that we can tune, like the knobs on an analog radio, to solve the problem at hand. This raises the question: what should be the end result of all this model building and knob tuning? In my understanding, that question has a clear answer: prediction. Our goal in any machine learning problem is to select or develop a model that lets us “learn” a set of rules from past observations for predicting future outcomes.

With that underlying objective in mind, let’s think about a practical way in which we could actually go about learning those rules or model parameters to make the best possible predictions. Well, how do we humans actually go about learning something? For me, the algorithm usually looks something like this:
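```text
Learn Something:
1. Look at some examples of the thing you want to learn
2. Take a stab at what the answer should be
3. Check how wrong you were
4. Adjust your answer so you're a little less wrong next time
5. Go back to step 2 and repeat until you stop improving
```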

Alright, I’ll stop being cheeky now and talk about some specifics for logistic regression, but in all seriousness, what I outlined above is the basic idea behind most (supervised) machine learning algorithms.

As you can see, in line 2 of the “Learn Something” algorithm, the first thing we have to do is take a stab at what the answer should be. In the case of logistic regression, we already have that from the first part of the blog post. What we need to define now is how we are going to 1) measure error and 2) propose adjustments to the model that reduce that error the next time around. The way we typically do this is to set up a function that we want to optimize. In the machine learning literature, this is often referred to as a loss function or cost function. Ideally, this function is convex (or concave), which guarantees that any minimum (or maximum) we find is the global one; this structure is particularly useful because it allows us to use the derivative of the function to find that optimum. Recall from calculus that the minimum of a convex function with respect to some parameter can be found by setting the partial derivative of the function with respect to that parameter to zero and then solving for the parameter in question. So, what function should we use to measure our performance? A good place to start is the likelihood function:
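```latex
L(\theta) = \prod_{i=1}^{N} P(y_i \mid x_i;\, \theta)
```

Here each factor is the probability our model assigns to the observed label y_i for input x_i.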

Basically, what we want to do here is maximize this function over a set of training samples (pairs (x, y) from 1 to N); that is, we adjust the values of theta to increase the overall likelihood of our observed data. First, though, we need to write the right-hand side of the equation in terms of X and theta; luckily, we have that result from earlier, so let’s just plug it in.
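Writing each factor as a Bernoulli probability (the exponents y_i and 1 - y_i pick out the right term for each label), this becomes:

```latex
L(\theta) = \prod_{i=1}^{N} \left( \frac{1}{1 + e^{-\theta^{T} x_i}} \right)^{y_i} \left( 1 - \frac{1}{1 + e^{-\theta^{T} x_i}} \right)^{1 - y_i}
```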

Alright, this is looking better, but the product still makes this a difficult problem to compute and solve. However, we can simplify it by taking the logarithm of both sides, turning the product into a sum.
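```latex
\log L(\theta) = \sum_{i=1}^{N} \left[ y_i \log\!\left( \frac{1}{1 + e^{-\theta^{T} x_i}} \right) + (1 - y_i) \log\!\left( 1 - \frac{1}{1 + e^{-\theta^{T} x_i}} \right) \right]
```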

And for convenience, let’s turn this into a minimization problem by negating each side of the equation:
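Writing J(theta) for the resulting cost, we get:

```latex
J(\theta) = -\sum_{i=1}^{N} \left[ y_i \log\!\left( \frac{1}{1 + e^{-\theta^{T} x_i}} \right) + (1 - y_i) \log\!\left( 1 - \frac{1}{1 + e^{-\theta^{T} x_i}} \right) \right]
```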

Great! Now let’s do some algebra and compute the derivative of this bad boy with respect to theta. Just kidding; I think I’ve tortured you enough today (and I don’t want to type all of this out in LaTeX :P), so I’ll just provide the equation for the derivative:
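```latex
\frac{\partial J(\theta)}{\partial \theta} = \sum_{i=1}^{N} \left( \frac{1}{1 + e^{-\theta^{T} x_i}} - y_i \right) x_i
```

Notice how clean this is: each training sample contributes its input x_i, weighted by how far our predicted probability is from the true label.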

One thing to note, which I haven’t mentioned yet, is that I’m using matrix notation for X and theta. To be clearer: if you had j features (columns) in X, we would represent theta as a vector of j+1 numbers (the extra one is the intercept, or “bias”, term). The above equation for each of our j+1 theta parameters would therefore look like this:
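```latex
\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} \left( \frac{1}{1 + e^{-\theta^{T} x_i}} - y_i \right) x_{ij}
```

where x_ij is the j-th feature of the i-th sample (with x_i0 = 1 for the intercept term).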

So all we have to do now is set this equal to zero and solve for theta, right? Unfortunately, it’s not that simple. Given the nonlinearity in the equation, we cannot solve for theta analytically. But we can use the derivative to “nudge” the values of theta in the right direction over many iterations. The formal name for this process is gradient descent (as in, we are descending down the gradient toward the optimum). The gradient descent algorithm in this case would look like this:
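```text
1. Initialize theta (e.g., to all zeros)
2. Repeat until convergence (or a maximum number of iterations):
3.     Compute the gradient of J(theta) over the training set
4.     Update every parameter simultaneously: theta := theta - alpha * gradient
```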

In the above pseudocode, alpha refers to the “learning rate”: how large an adjustment we make to theta at each step. It can be tricky to set in practice, and there are many schemes that adapt it to keep the updates to theta big enough to decrease the cost, but not so big that we overshoot the optimum. Overall, the algorithm is pretty simple and bears a striking resemblance to the silly “Learn Something” algorithm, right? In the next section, let’s see how this is implemented in Python.
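Before we get there, here is a quick sanity check of the derivative formula above: a minimal NumPy sketch (the function names and the tiny dataset are my own, not from the derivation) comparing the analytic gradient against a centered finite-difference approximation of the cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Negative log-likelihood J(theta)
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # Analytic gradient: sum over samples of (sigmoid(theta^T x_i) - y_i) * x_i
    return X.T @ (sigmoid(X @ theta) - y)

# Tiny made-up dataset: 4 samples, intercept column plus 2 features
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -1.0,  0.3],
              [1.0,  2.0, -0.7],
              [1.0,  0.1,  0.9]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])

# Centered finite-difference approximation, one parameter at a time
eps = 1e-6
approx = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(np.allclose(gradient(theta, X, y), approx, atol=1e-5))  # prints: True
```

If the two disagree, either the derivation or the implementation has a bug, which makes this a cheap test to run before trusting any gradient descent loop.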