This machine learning tutorial discusses the basics of Logistic Regression and its implementation in Python. Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

We can also say that the target variable is categorical. Based on the number of categories, logistic regression can be classified as:

- binomial: the target variable can have only 2 possible types, “0” or “1”, which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
- multinomial: the target variable can have 3 or more possible types which are not ordered (i.e. the types have no quantitative significance), like “disease A” vs “disease B” vs “disease C”.
- ordinal: it deals with target variables with ordered categories. For example, a test score can be categorized as “very poor”, “poor”, “good”, or “very good”. Here, each category can be given a score like 0, 1, 2, or 3.

First of all, we explore the simplest form of Logistic Regression, i.e. Binomial Logistic Regression.





Binomial Logistic Regression





Consider an example dataset which maps the number of hours of study to the result of an exam. The result can take only two values, namely passed (1) or failed (0):

| HOURS (X) | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 3.75 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PASS (Y) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
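To follow along in code, the table can be expressed as NumPy arrays. A minimal sketch (the variable names `hours` and `passed` are our own choice, not from the original article):

```python
import numpy as np

# Hours studied (feature X) and exam result (target Y) from the table above
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
```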





So, we have

$$y \in \{0, 1\}$$

i.e. y is a categorical target variable which can take only two possible types: “0” or “1”.

In order to generalize our model, we assume that:

The dataset has ‘p’ feature variables and ‘n’ observations.

The feature matrix is represented as:

$$X = \begin{bmatrix} x_{10} & x_{11} & \cdots & x_{1p} \\ x_{20} & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{np} \end{bmatrix}$$

Here, $x_{ij}$ denotes the value of the $j^{th}$ feature for the $i^{th}$ observation. We keep the convention of letting $x_{i0} = 1$. (Keep reading, you will understand the logic in a few moments.)

The $i^{th}$ observation, $x_i$, can be represented as:

$$x_i = \begin{bmatrix} x_{i0} & x_{i1} & \cdots & x_{ip} \end{bmatrix}^T$$

$h(x_i)$ represents the predicted response for the $i^{th}$ observation, i.e. $\hat{y}_i$. The formula we use for calculating $h(x_i)$ is called the hypothesis.



If you have gone through Linear Regression, you should recall that in Linear Regression, the hypothesis we used for prediction was:

$$h(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients.

Let the regression coefficient matrix/vector, $\beta$, be:

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$$
Then, in a more compact form,

$$h(x_i) = \beta^T x_i$$
The reason for taking $x_{i0} = 1$ is pretty clear now: we needed to do a matrix product, but there was no actual $x_{i0}$ multiplied to $\beta_0$ in the original hypothesis formula, so we defined $x_{i0} = 1$.
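As a quick sketch of why this convention helps (the names `X` and `beta` are ours, and the coefficient values are made up purely for illustration), prepending a column of ones to the feature matrix lets the entire hypothesis be evaluated as one matrix product:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00])               # a few sample observations
X = np.column_stack([np.ones_like(hours), hours])  # x_i0 = 1 for every row
beta = np.array([-4.0, 1.5])                       # [beta_0, beta_1], illustrative values

# h(x_i) = beta^T x_i for all observations at once:
# the ones column pairs with beta_0, so the intercept needs no special-casing
h = X @ beta
print(h)  # [-3.25, -2.875, -2.5]
```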







Now, if we try to apply Linear Regression to the above problem, we are likely to get continuous values using the hypothesis we discussed above. Also, it does not make sense for $h(x_i)$ to take values larger than 1 or smaller than 0.
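To see this concretely, here is a hedged sketch (using `np.polyfit` as a stand-in ordinary least-squares fit; this snippet is not from the original article) showing that a straight line fitted to the pass/fail data predicts values outside [0, 1]:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Fit a straight line passed ~ b1 * hours + b0 by least squares
b1, b0 = np.polyfit(hours, passed, deg=1)

print(b1 * 0.0 + b0)  # ~ -0.18: a "probability" below 0 for 0 hours of study
print(b1 * 6.0 + b0)  # ~  1.24: a "probability" above 1 for 6 hours of study
```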



So, some modifications are made to the hypothesis for classification:

$$h(x_i) = g(\beta^T x_i)$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

is called the logistic function or the sigmoid function. Here is a plot showing $g(z)$:

We can infer from the above graph that:

- $g(z)$ tends towards 1 as $z \to +\infty$
- $g(z)$ tends towards 0 as $z \to -\infty$
- $g(z)$ is always bounded between 0 and 1
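A minimal sketch of the sigmoid in Python (the function name `sigmoid` is our own), confirming the three properties numerically:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # ~0.99995 -> tends towards 1 as z grows large
print(sigmoid(-10))  # ~0.00005 -> tends towards 0 as z grows very negative
print(sigmoid(0))    # exactly 0.5, the midpoint of the bounded range (0, 1)
```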

So, now, we can define conditional probabilities for the 2 labels (0 and 1) for the $i^{th}$ observation as:

$$P(y_i = 1 \mid x_i; \beta) = h(x_i) \qquad\qquad P(y_i = 0 \mid x_i; \beta) = 1 - h(x_i)$$

We can write it more compactly as:

$$P(y_i \mid x_i; \beta) = h(x_i)^{y_i} \, (1 - h(x_i))^{1 - y_i}$$

Now, we define another term, the likelihood of the parameters, as:

$$L(\beta) = \prod_{i=1}^{n} P(y_i \mid x_i; \beta) = \prod_{i=1}^{n} h(x_i)^{y_i} \, (1 - h(x_i))^{1 - y_i}$$

Likelihood is nothing but the probability of the data (training examples), given a model and specific parameter values (here, $\beta$). It measures the support provided by the data for each possible value of $\beta$. We obtain it by multiplying all $P(y_i \mid x_i; \beta)$ for a given $\beta$.





And for easier calculations, we take the log-likelihood:

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left[\, y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \,\right]$$

The cost function for logistic regression is proportional to the inverse of the likelihood of the parameters. Hence, we can obtain an expression for the cost function, $J$, using the log-likelihood equation as:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \,\right]$$

and our aim is to estimate $\beta$ so that the cost function is minimized!
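A short sketch of computing this cost under the definitions above (function names are ours; note that `np.log` of an exact 0 or 1 prediction would blow up, which real implementations guard against):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, X, y):
    """J(beta): the negative mean log-likelihood derived above."""
    h = sigmoid(X @ beta)  # predicted probabilities h(x_i) for all observations
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```

Minimizing this $J(\beta)$ is exactly what the gradient descent step in the next section does.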





Using the Gradient Descent Algorithm





Firstly, we take partial derivatives of $J(\beta)$ w.r.t. each $\beta_j \in \beta$ to derive the stochastic gradient descent rule (we present only the final derived value here):

$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n} \, (h(x) - y)^T x_j$$

Here, $y$ and $h(x)$ represent the response vector and the predicted response vector, respectively. Also, $x_j$ is the vector representing the observation values for the $j^{th}$ feature.

Now, in order to get $\min_\beta J(\beta)$, we repeatedly update each $\beta_j$ (simultaneously for all $j$):

$$\beta_j := \beta_j - \alpha \, \frac{\partial J(\beta)}{\partial \beta_j} = \beta_j - \frac{\alpha}{n} \, (h(x) - y)^T x_j$$

where $\alpha$ is called the learning rate and needs to be set explicitly.

Let us see the Python implementation of the above technique on a sample dataset (download it from here):
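The original download link did not survive extraction, so here instead is a hedged end-to-end sketch on the hours/pass dataset from the table above (the learning rate, iteration count, and all names are our own choices, not values from the original article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=100000):
    """Estimate beta by repeatedly applying the update rule derived above."""
    n = X.shape[0]
    beta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ beta)      # predicted response vector h(x)
        grad = X.T @ (h - y) / n   # dJ/dbeta_j = (h(x) - y)^T x_j / n
        beta -= alpha * grad       # simultaneous update of every beta_j
    return beta

# Dataset from the table at the top of this section
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(hours), hours])  # x_i0 = 1 convention
beta = gradient_descent(X, passed)
print(beta)  # roughly [-4.1, 1.5] on this dataset

# Predicted probability of passing after 3 hours of study
print(sigmoid(beta @ np.array([1.0, 3.0])))  # roughly 0.61
```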
