(Note: This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

Logistic Regression is a type of classification algorithm involving a linear discriminant. What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. For simplicity, let's assume that we have only two classes (for multiclass problems, you can look at Multinomial Logistic Regression), and the probability in question is $P_+$, the probability that a certain data point belongs to the ‘+’ class. Of course, $P_- = 1 - P_+$. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice ‘regions’, one for each class, by a linear (read: straight) boundary. So what does a ‘linear’ boundary mean? For two dimensions, it's a straight line, with no curving. For three dimensions, it's a plane. And so on. This boundary will of course be decided by your input data and the learning algorithm. But for this to make sense, it is clear that the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points do satisfy this constraint, they are said to be linearly separable. Look at the image below.

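(Image: data points of two classes in three dimensions, separated by a plane. Source linked in the original post.)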

This dividing plane is called a linear discriminant, because (1) it is linear as a function, and (2) it helps the model ‘discriminate’ between points belonging to different classes.

(Note: If your points aren’t linearly separable in the original concept space, you could consider converting the feature vectors into a higher dimensional space by adding dimensions of interaction terms, higher degree terms, etc. Such usage of a linear algorithm in a higher dimensional space gives you some benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)
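To make the note above concrete, here is a minimal sketch of the idea. Everything in it is an assumption made for illustration: the helper names `expand` and `linear_boundary`, and the hand-picked weights, which were chosen so that the boundary is a circle of radius 1 when seen in the original two-dimensional space. Nothing here is learned from data.

```python
def expand(x1, x2):
    """Map a 2-D point to 5-D by adding squared and interaction terms."""
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)


def linear_boundary(features, weights, bias):
    """A plain linear function of the (expanded) feature vector."""
    return bias + sum(w * f for w, f in zip(weights, features))


# Hypothetical weights: only the two squared terms are used.
# Linear in 5-D, but in the original space it reads x1^2 + x2^2 - 1,
# i.e. a circular (non-linear) boundary.
WEIGHTS = (0.0, 0.0, 1.0, 1.0, 0.0)
BIAS = -1.0

inside = linear_boundary(expand(0.5, 0.0), WEIGHTS, BIAS)   # point inside the circle
outside = linear_boundary(expand(2.0, 0.0), WEIGHTS, BIAS)  # point outside the circle
print(inside, outside)  # negative for the inside point, positive for the outside one
```

No straight line in the original two dimensions could separate "inside the circle" from "outside the circle", but a flat boundary in the expanded space can.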

==========X===========

But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?

First, let's try to understand the geometric implication of this ‘division’ of the input space into two distinct regions. Assuming two input variables for simplicity (unlike the 3-dimensional figure shown above), $x_1$ and $x_2$, the function corresponding to the boundary will be something like

$\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

(It is crucial to note that $x_1$ and $x_2$ are BOTH input variables, and the output variable isn't a part of the conceptual space, unlike in a technique like linear regression.)

Consider a point $(a, b)$. Plugging the values of $x_1$ and $x_2$ into the boundary function, we will get its output $\beta_0 + \beta_1 a + \beta_2 b$. Now depending on the location of $(a, b)$, there are three possibilities to consider:

1. $(a, b)$ lies in the region defined by points of the ‘+’ class. As a result, $\beta_0 + \beta_1 a + \beta_2 b$ will be positive, lying somewhere in $(0, \infty)$. Mathematically, the higher the magnitude of this value, the greater is the distance between the point and the boundary. Intuitively speaking, the greater is the probability that $(a, b)$ belongs to the ‘+’ class. Therefore, $P_+$ will lie in (0.5, 1].

2. $(a, b)$ lies in the region defined by points of the ‘-’ class. Now, $\beta_0 + \beta_1 a + \beta_2 b$ will be negative, lying in $(-\infty, 0)$. But like in the positive case, the higher the absolute value of the function output, the greater the probability that $(a, b)$ belongs to the ‘-’ class. $P_+$ will now lie in [0, 0.5).

3. $(a, b)$ lies ON the linear boundary. In this case, $\beta_0 + \beta_1 a + \beta_2 b = 0$. This means that the model cannot really say whether $(a, b)$ belongs to the ‘+’ or the ‘-’ class. As a result, $P_+$ will be exactly 0.5.
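The three cases above can be checked numerically. This is only a sketch under assumed values: the coefficients -3, 1 and 1 (so the boundary is the line where the two inputs sum to 3) are made up, and the function names `boundary` and `region` are illustrative, not part of any library.

```python
def boundary(a, b, beta0=-3.0, beta1=1.0, beta2=1.0):
    """Value of the boundary function at the point (a, b).

    The default coefficients are hypothetical, not learned from data.
    """
    return beta0 + beta1 * a + beta2 * b


def region(value):
    """Report which side of the boundary a function value corresponds to."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "on the boundary"


for point in [(3.0, 3.0), (0.0, 0.0), (1.0, 2.0)]:
    value = boundary(*point)
    print(point, value, region(value))
```

With these made-up coefficients, (3, 3) gives 3 and falls in the ‘+’ region, (0, 0) gives -3 and falls in the ‘-’ region, and (1, 2) gives 0 and sits exactly on the line.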

Great. So now we have a function that outputs a value in $(-\infty, \infty)$ given an input data point. But how do we map this to the probability $P_+$, which ranges over [0, 1]? The answer lies in the odds function.

Let $P(X)$ denote the probability of an event $X$ occurring. In that case, the odds of $X$ (written $OR(X)$, and loosely called the odds ratio in this post) is defined by $\frac{P(X)}{1 - P(X)}$, which is essentially the ratio of the probability of the event happening vs. it not happening. It is clear that probability and odds convey the exact same information. But as $P(X)$ goes from 0 to 1, $OR(X)$ goes from 0 to $\infty$.

However, we are still not quite there yet, since our boundary function gives a value from $-\infty$ to $\infty$. So what we do is take the logarithm of $OR(X)$, called the log-odds function. Mathematically, as $OR(X)$ goes from 0 to $\infty$, $\log(OR(X))$ goes from $-\infty$ to $\infty$!
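A quick way to get a feel for these two mappings is to print them for a few probabilities. The sketch below uses only Python's standard `math` module; the helper names `odds` and `log_odds` are chosen here for illustration.

```python
import math


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))


# Probabilities near 0 give odds near 0 and large negative log-odds;
# probabilities near 1 give large odds and large positive log-odds.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p={p:.2f}  odds={odds(p):.4f}  log-odds={log_odds(p):+.4f}")
```

Note the symmetry: a probability of 0.5 gives odds of 1 and log-odds of 0, and the log-odds for p and 1 - p differ only in sign.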

So we finally have a way to interpret the result of plugging in the attributes of an input into the boundary function. The boundary function actually defines the log-odds of the ‘+’ class in our model. So essentially, in our two-dimensional example, given a point $(a, b)$, this is what Logistic Regression would do:

Step 1. Compute the boundary function (alternatively, the log-odds function) value, $\beta_0 + \beta_1 a + \beta_2 b$. Let's call this value $t$ for short.

Step 2. Compute the Odds Ratio, by doing $OR_+ = e^t$. (Since $t$ is the logarithm of $OR_+$.)

Step 3. Knowing $OR_+$, it would compute $P_+$ using the simple mathematical relation

$P_+ = \frac{OR_+}{1 + OR_+}$.

There you go! In fact, once you know $t$ from step 1, you can combine steps 2 and 3 to give you

$P_+ = \frac{e^t}{1 + e^t}$.

The RHS of the above equation is called the logistic function. Hence the name given to this model of learning :-).
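The three steps can be sketched in a few lines of Python. The coefficient values passed in below (-3, 1, 1) are made up for illustration, and the function names are hypothetical. This is a direct transcription of the steps rather than how a library would implement it.

```python
import math


def probability_in_steps(a, b, beta0, beta1, beta2):
    """Probability of the '+' class for the point (a, b), one step at a time."""
    t = beta0 + beta1 * a + beta2 * b       # Step 1: boundary value = log-odds
    odds_plus = math.exp(t)                 # Step 2: odds = e^t
    return odds_plus / (1.0 + odds_plus)    # Step 3: probability from odds


def logistic(t):
    """Steps 2 and 3 combined into the logistic function."""
    return math.exp(t) / (1.0 + math.exp(t))


# Hypothetical coefficients; (1, 2) lies exactly on the boundary, so t = 0.
for point in [(3.0, 3.0), (1.0, 2.0), (0.0, 0.0)]:
    print(point, probability_in_steps(*point, -3.0, 1.0, 1.0))
```

For very large positive values of t, `math.exp` overflows. Numerical libraries avoid this with algebraically equivalent forms, but the direct version is enough to see the idea.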

==========X===========

We have now understood the intuition behind Logistic Regression, but the question remains: how does it learn the boundary function $\beta_0 + \beta_1 x_1 + \beta_2 x_2$? The mathematical working behind this is beyond the scope of this post, but here's a rough idea:

Consider a function $g(x)$, where $x$ is a data point in the training dataset. $g(x)$ can be defined in simple terms as:

If $x$ is a part of the ‘+’ class, $g(x) = P_+$ (here, $P_+$ is the output given by your Logistic Regression model). If $x$ is a part of the ‘-’ class, $g(x) = 1 - P_+$.

Intuitively, $g(x)$ quantifies the probability that a training point was classified correctly by your model. Therefore, if you average $g(x)$ over your entire training data, you would get the likelihood that a random data point would be classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this ‘average’ that a Logistic Regression learner tries to maximize. (Strictly speaking, what gets maximized is the product of the $g(x)$ values over the training data, or equivalently the sum of their logarithms.) The method adopted for the same is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can do without learning how the optimization happens, as long as you have a good idea of what is being optimized, mostly because most statistics or ML libraries have built-in methods to get it done.
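To see what "being optimized" means, the sketch below scores two candidate boundaries on a tiny made-up dataset. Everything here is an assumption for illustration: the four data points, the two coefficient triples `GOOD` and `BAD`, and the function names. It does no learning; it only evaluates the quantity a learner would try to push up. Alongside the simplified ‘average’, it also computes the log-likelihood (the sum of the logarithms of the per-point values), which is what maximum likelihood estimation formally works with.

```python
import math


def p_plus(point, beta):
    """Model output: probability that `point` belongs to the '+' class."""
    t = beta[0] + beta[1] * point[0] + beta[2] * point[1]
    return 1.0 / (1.0 + math.exp(-t))  # same function as e^t / (1 + e^t)


def g(point, label, beta):
    """Probability the model assigns to the point's true class."""
    p = p_plus(point, beta)
    return p if label == "+" else 1.0 - p


def average_g(data, beta):
    """The simplified 'average' score described in the text."""
    return sum(g(pt, lab, beta) for pt, lab in data) / len(data)


def log_likelihood(data, beta):
    """Sum of log g over the data: the quantity MLE formally maximizes."""
    return sum(math.log(g(pt, lab, beta)) for pt, lab in data)


# Made-up training data and two hypothetical candidate boundaries.
DATA = [((3.0, 3.0), "+"), ((2.0, 4.0), "+"), ((0.0, 0.0), "-"), ((1.0, 0.5), "-")]
GOOD = (-3.0, 1.0, 1.0)   # puts every point on its correct side
BAD = (3.0, -1.0, -1.0)   # same line, but with the two sides swapped

for name, beta in (("good", GOOD), ("bad", BAD)):
    print(name, average_g(DATA, beta), log_likelihood(DATA, beta))
```

The candidate that places the points on their correct sides scores higher on both measures; a learner searches over the coefficients for the best such score.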

==========X===========

That's all for now! And like all my blog posts, I hope this one helps some guy trying to Google up and learn some stuff on his own to understand the misunderstood technique of Logistic Regression. Cheers!