This post covers the basics of one of the main machine learning tasks: classification. More specifically, we are going to scratch the surface of linear classification and briefly address non-linear classification. I assume some basic knowledge of probability and regression.

Linear Classification

Let’s start with a simple example depicted in *An Introduction to Statistical Learning* by G. James, D. Witten, T. Hastie, and R. Tibshirani (which I highly recommend, along with *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman). Given the income and the credit card balance of bank users, classification aims to divide the input space into two regions: default or not default. Each user is then predicted to belong to one of the two categories.

The yellow points correspond to users who defaulted in a given month (a high balance means a large debt to the bank) and the blue points to those who did not.

So, looking at the Figure, the hope is to have a model that draws a line separating defaulters from non-defaulters, so that you could decide whether to grant a loan to new users given their balance and income. Of course, this is a very basic example, but the idea generalizes naturally to higher dimensions. Indeed, imagine that instead of having balance and income as deciding factors (features), you have social status, age, gender, etc.

No matter the features, you have to build an algorithm that, given a bunch of input data, can differentiate among classes.

Classification is in fact just a special case of regression in which the labels are restricted to a discrete set: $\mathcal{Y}=\{0,\dots,K\}$. However, regression techniques are not well suited to a classification task because the two problems do not share the same objective function. Regression aims to minimize the mean squared error, whereas classification aims to minimize the proportion of misclassified observations.

This figure illustrates one of the problems with solving a classification task using regression techniques. Here, only the balance feature was kept. The labels were set to 0 for the “not default” class and 1 for the “default” class. Using linear regression on the training data, a prediction function $P(x)$ was produced (the blue line in this example). $P(x)$ takes the balance as input and predicts the user’s category based on this arbitrary decision rule:

$$ \delta(x)=\textbf{1}\{P(x) \geq 0.5\} $$

It is clear that the predictions are highly influenced by the number of observations in each class. In the figure, you can see that the relatively high number of non-default people (the high density of points on the $y=0$ axis) “shifted” $P(x)$ (the blue line) toward the $y=0$ axis and thus “ruined” the predictions based on our decision rule.

Moreover, notice that adding just a few default users with very high balances would “shift” $P(x)$ toward the $y=1$ axis (because it tries to minimize the mean squared error). Linear regression is therefore unstable for this task.

Finally, the function $P(x)$ will predict values outside the desired range $[0,1]$, while a probability must lie in that range to be interpretable as such. This is why regression techniques can’t be used directly to solve a classification task.

For all those reasons, we definitely need more adapted techniques.

Discriminative methods

Discriminative methods are a class of machine learning methods that directly model the quantity of interest, $p(y\mid\textbf{x})$, leading to the prediction that maximizes this probability. The most popular discriminative method is logistic regression.

Logistic Regression

In the following, we deal with 2-class classification, $\mathcal{Y}=\{0,1\}$. Logistic regression generalizes to multiple classes (see K. Murphy, *Machine Learning: A Probabilistic Perspective*).

In linear regression, the responses were normally distributed : $$ p(y\mid \textbf{x},\textbf{w}) = N(y \mid\textbf{w}^\intercal \textbf{x},\sigma^2) $$

Here, since the $y$ variable takes only 2 possible values (0 or 1), it is more appropriate to use a Bernoulli distribution: $$ p(y\mid \mathbf{x},\mathbf{w}) = Bern(y \mid p(\mathbf{x})) $$ By definition of the expectation of the Bernoulli distribution, we get: $$ p(\textbf{x})= \mathop{\mathbb{E}}(y\mid \textbf{x})=p(y=1 \mid \textbf{x}) $$ Our aim is thus to model this probability. In fact, logistic regression comes from modeling this quantity with the logistic function $\sigma(x)=\frac{e^x}{1+e^x}$.

The intuition behind this choice is that it maps the linear combination of the inputs, $\textbf{w}^\intercal \textbf{x}$, into the $[0,1]$ interval, so that we can interpret it as a probability. Very high values of $\textbf{w}^\intercal \textbf{x}$ thus indicate a high probability of belonging to the class labeled 1. Modeling the probability this way gives us $$ p(y\mid \textbf{x},\textbf{w}) = Bern(y \mid \sigma(\textbf{w}^\intercal \textbf{x})) $$ The final decision rule is: $$ \hat{y}(\textbf{x})= 1 \Leftrightarrow p(y=1 \mid \textbf{x}) > 0.5 \Leftrightarrow \hat{\textbf{w}}^\intercal \textbf{x} > 0 $$
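The equivalence between thresholding the probability at 0.5 and thresholding the score at 0 can be checked in a few lines. This is a minimal sketch with hypothetical weights and input, not learned values:

```python
import numpy as np

def sigmoid(t):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# sigmoid(0) = 0.5 and sigmoid is strictly increasing, so
# p(y=1|x) > 0.5 is exactly the same condition as w.x > 0.
w = np.array([2.0, -1.0])   # hypothetical learned weights
x = np.array([1.5, 0.5])    # one hypothetical input vector

score = w @ x               # w^T x
prob = sigmoid(score)       # p(y=1 | x)
y_hat = int(score > 0)      # same prediction as int(prob > 0.5)
```

Thresholding the raw score avoids computing the sigmoid at prediction time; the sigmoid is only needed when a calibrated probability is wanted.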

The boundary $\hat{\textbf{w}}^\intercal \textbf{x} = 0$ is exactly the line that would be drawn to separate the default and non-default classes in the first Figure ($\hat{\textbf{w}}$ is learned by maximizing the likelihood).

Thus, assuming the observations $(y_n,\mathbf{x}_n)_{n=1}^N$ of our training set are conditionally independent, the joint distribution is:

$$ p(\textbf{y}\mid\textbf{x},\textbf{w})=\prod_{n=1}^{N} p(y_n \mid \textbf{x}_n ) $$ which yields the likelihood: $$ L(\textbf{w})=\prod_{n : y_n=1} p(y_n=1 \mid \textbf{x}_n) \prod_{n : y_n=0} p(y_n=0 \mid \textbf{x}_n) $$

From this likelihood, we can easily derive the cost function that we are going to minimize :

$$ \mathcal{L}(\textbf{w})= -\log(L(\textbf{w}))=\sum_{n=1}^N \log(1+\exp(\textbf{w}^\intercal \textbf{x}_n)) - y_n\textbf{w}^\intercal \textbf{x}_n $$
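A quick numerical sanity check, on random toy data, that this simplified expression really is the negative log of the Bernoulli likelihood:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # toy design matrix
y = rng.integers(0, 2, size=100)    # toy binary labels
w = rng.normal(size=3)              # arbitrary weight vector

# Form 1: negative log of the Bernoulli likelihood.
p = sigmoid(X @ w)
nll_bernoulli = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Form 2: the simplified sum log(1 + exp(w.x_n)) - y_n w.x_n.
scores = X @ w
nll_simplified = np.sum(np.log1p(np.exp(scores)) - y * scores)

# The two forms agree up to floating-point error.
assert np.allclose(nll_bernoulli, nll_simplified)
```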

Note that $\mathcal{L}(\textbf{w})$ is a convex function (it suffices to show that the second derivative of the non-linear term is non-negative).

Since the cost function is convex, gradient descent is a good algorithm to solve the optimization problem, using the following update:

$$ \textbf{w}^{(t+1)}=\textbf{w}^{(t)}-\gamma^{(t)} \nabla\mathcal{L}(\textbf{w}^{(t)}) $$

In this equation, $\gamma^{(t)}$ is the learning rate, a hyper-parameter to tune. A learning rate that is too large can cause divergence.

Gradient descent is also very sensitive to the scaling of the features; one might want to standardize them (mean 0 and variance 1) before using gradient descent.
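The whole training loop fits in a few lines. Here is a minimal sketch on synthetic data (the function name and the toy labeling rule are my own choices, not from the post); it uses the gradient $\sum_n (\sigma(\textbf{w}^\intercal\textbf{x}_n) - y_n)\textbf{x}_n$ of the loss above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, gamma=0.1, n_iters=2000):
    """Plain gradient descent on the logistic negative log-likelihood.

    Gradient of the loss: sum_n (sigmoid(w.x_n) - y_n) x_n.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w -= gamma * grad / len(y)   # averaging keeps the step size scale-free in N
    return w

# Toy linearly separable data: label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w_hat = fit_logistic(X, y)
accuracy = np.mean((X @ w_hat > 0) == (y == 1))
```

Since the toy boundary passes through the origin, no intercept term is needed; for real data you would append a constant feature to $\textbf{x}$.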

Regularized Logistic Regression

Regularization is very useful for reducing model complexity, since it introduces a penalty on the parameters into the minimization problem. It is an effective way to reduce overfitting.

The loss of the regularized logistic regression is :

$$ \mathcal{L}_{reg}(\textbf{w}) = -\sum_{n=1}^N \log(p(y_n\mid \textbf{x}_n,\textbf{w})) + \lambda \left\lVert\textbf{w}\right\rVert^2 $$ For the norm of $\textbf{w}$, the L1 and L2 norms are the ones most often used. The L1 norm tends to shrink parameters exactly to zero, so it can yield sparse models. For this reason, L1 regularization can also be seen as a model selection technique.
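With the squared L2 penalty, the only change to gradient descent is an extra $2\lambda\textbf{w}$ term in the gradient. A self-contained sketch on toy data (function and variable names are mine, for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_l2(X, y, lam=0.1, gamma=0.1, n_iters=2000):
    """Gradient descent on the L2-regularized logistic loss.

    The penalty lambda * ||w||^2 adds 2 * lambda * w to the gradient,
    pulling the weights toward zero.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + 2 * lam * w
        w -= gamma * grad
    return w

# Toy separable data: label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w_reg = fit_logistic_l2(X, y, lam=1.0)    # strongly regularized
w_unreg = fit_logistic_l2(X, y, lam=0.0)  # no penalty
# The regularized weights have a smaller norm than the unregularized ones.
```

On separable data the unpenalized weights would keep growing forever (the loss has no finite minimizer), which is itself a good argument for regularization.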

This Figure depicts an example of logistic regression on height/weight data with labels {Males in red, Females in blue}. We can see that logistic regression performs quite well on this data.

Logistic regression is one of the most popular discriminative methods; however, others often perform much better in machine learning contests, such as tree-based methods, neural networks, or support vector machines.

Why L1 Regularization shrinks the parameters to zero

You can view this as constrained minimization. Imagine you are minimizing the cost function under the constraint that $\textbf{w}=(w_1,w_2)$ (in this example we only have two parameters) lies in the rhombus (for L1) or in the circle (for L2). From the Figure, one can directly see that the L2 norm pushes the weight $w_1$ to a very small value, while the L1 norm sets $w_1=0$ exactly. That is why the L1 norm is a very useful technique to reduce complexity and sparsify your model.

Another class of methods is the generative one; these methods are based on Bayes' rule.

Generative methods

We saw that discriminative methods directly model the quantity of interest, $p(y\mid\mathbf{x})$, which discriminates between classes. Generative methods, on the other hand, model $p(\textbf{x}\mid y)$ and invert it with Bayes' rule.

Generative methods tend to solve a more general problem, since they model the distribution of the observed variables. That is why they can be used to generate data.

A natural question arises: does an optimal classifier exist?

The answer is yes, provided the distribution of the generating model $p(\textbf{x},y)$ is known (see http://www-math.mit.edu/~rigollet/courses/Notes/L2.pdf). This optimal classifier predicts $\hat{y}(\textbf{x})=\operatorname{argmax}_{y \in \mathcal{Y}}p(y\mid \textbf{x})$.

Thus, generative classifiers try to model the joint distribution in order to approximate this optimal classifier. Naive Bayes is one of them.

Naive Bayes

The Naive Bayes classifier models the joint distribution and inverts it to discriminate among classes using Bayes' formula: $$ p(y=c \mid \textbf{x}) = \frac{p(\textbf{x}\mid y=c)p(y=c)}{\sum_k p(\textbf{x}\mid y=k)p(y=k)} $$

The Naive Bayes classifier makes the naive assumption of conditional independence between features. With this assumption, one can easily model $p(\textbf{x} \mid y=c)$ by making assumptions on the conditional marginal distributions, which depend on the nature of the features. For example, with real-valued features, you could assume normality: $$ p(\textbf{x} \mid y=c) = \prod_{i=1}^{p} \mathcal{N}(x_i \mid \mu_{ic},\sigma^2_{ic}) $$

One could also estimate $p(y=c)$ by the proportion of observations in class $\textit{c}$.
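Putting the pieces together, a Gaussian Naive Bayes classifier is just per-class feature means and variances, class proportions as priors, and an argmax over log-posteriors. A minimal sketch on synthetic, well-separated data (all names and the toy data are my own):

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    """Elementwise log N(x | mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy 2-class data with 2 real-valued features.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Fit: per-class feature means/variances, and priors as class proportions.
classes = [0, 1]
mu = np.array([X[y == c].mean(axis=0) for c in classes])
var = np.array([X[y == c].var(axis=0) for c in classes])
prior = np.array([np.mean(y == c) for c in classes])

def predict(x):
    # log p(y=c) + sum_i log N(x_i | mu_ic, var_ic); the sum over
    # features is licensed by the naive independence assumption.
    log_post = [np.log(prior[c]) + gaussian_logpdf(x, mu[c], var[c]).sum()
                for c in classes]
    return int(np.argmax(log_post))

accuracy = np.mean([predict(x) == c for x, c in zip(X, y)])
```

Working in log space avoids numerical underflow when multiplying many small per-feature densities.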

Obviously, we wouldn't expect the features to be independent (take the height/weight example). But even with this naive assumption, Naive Bayes often performs surprisingly well.

You can also find other generative methods like Linear Discriminant Analysis or Quadratic Discriminant Analysis.

Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA, is a generative model that assumes $X=(X_1,\dots,X_p)$ is normally distributed. In fact, each class density is modeled as a multivariate normal $$ p(\textbf{x} \mid y=k)=\frac{1}{(2\pi)^{p/2} \mid \Sigma_k \mid^{1/2}} e^{-\frac{1}{2}(\textbf{x}-\mu_k) ^\intercal \Sigma_k^{-1} (\textbf{x}-\mu_k)} $$ with $\Sigma_k=\Sigma$ for all $k$ in LDA.

To compare two classes, say $k$ and $l$, one can compute their log-ratio $\log(\frac{p(y=k\mid \textbf{x})}{p(y=l \mid \textbf{x})})$. Using Bayes' rule, it is equal to:

$$ \log(\frac{p(y=k)}{p(y=l)}) -\frac{1}{2}(\mu_k+\mu_l)^\intercal \Sigma^{-1}(\mu_k - \mu_l) + \textbf{x}^\intercal \Sigma^{-1}(\mu_k - \mu_l) $$

From this, by setting the log-ratio to zero, one can derive the decision boundary between classes $k$ and $l$: $$ \{\textbf{x} : \delta_k(\textbf{x}) = \delta_l(\textbf{x})\} $$ where:

$$ \delta_k(\textbf{x})=\textbf{x}^\intercal \Sigma^{-1}\mu_k -\frac{1}{2}\mu_k ^\intercal \Sigma^{-1} \mu_k + \log(p(y=k)) $$ which is linear in $\textbf{x}$.

The parameters, namely the class means and the covariance matrix, are estimated using maximum likelihood.
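A compact sketch of the whole LDA pipeline on synthetic data sharing one covariance matrix: estimate class means, a pooled covariance, and priors, then classify by the linear discriminant above (names and toy data are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes sharing the same covariance (the LDA assumption).
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], Sigma, size=100)
X1 = rng.multivariate_normal([3.0, 3.0], Sigma, size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Maximum-likelihood estimates: class means, pooled covariance, priors.
mu_hat = {c: X[y == c].mean(axis=0) for c in (0, 1)}
pooled = sum((X[y == c] - mu_hat[c]).T @ (X[y == c] - mu_hat[c])
             for c in (0, 1)) / len(y)
prior = {c: np.mean(y == c) for c in (0, 1)}
Sinv = np.linalg.inv(pooled)

def delta(x, c):
    """Linear discriminant: x.Sinv.mu_c - 0.5 mu_c.Sinv.mu_c + log prior."""
    return (x @ Sinv @ mu_hat[c]
            - 0.5 * mu_hat[c] @ Sinv @ mu_hat[c]
            + np.log(prior[c]))

pred = np.array([max((0, 1), key=lambda c: delta(x, c)) for x in X])
accuracy = np.mean(pred == y)
```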

Quadratic Discriminant Analysis

The only difference between LDA and QDA is that QDA does not assume a constant covariance matrix. The discriminant functions become: $$ \delta_k(\textbf{x})=-\frac{1}{2}\log(\left|\Sigma_k\right|) - \frac{1}{2}(\textbf{x}-\mu_k)^\intercal\Sigma_k^{-1}(\textbf{x}-\mu_k) + \log(p(y=k)) $$ which is quadratic in $\textbf{x}$.

QDA is more flexible than LDA, since its decision boundaries are quadratic in $\textbf{x}$, giving the model more degrees of freedom. However, for linearly separable data, QDA might overfit. On top of that, QDA has many more parameters to estimate: LDA has $(K-1)(p+1)$ parameters, with $K$ the number of classes, while QDA has $(K-1)(p(p+3)/2 + 1)$. So as the number of features $p$ grows, QDA quickly becomes computationally expensive.
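To make the gap concrete, these two one-liners evaluate the counts above (the quadratic growth in $p$ comes from each class estimating its own covariance):

```python
def lda_params(K, p):
    # (K-1)(p+1) effective parameters for LDA.
    return (K - 1) * (p + 1)

def qda_params(K, p):
    # (K-1)(p(p+3)/2 + 1): each class adds its own covariance terms.
    # p(p+3) is always even, so integer division is exact.
    return (K - 1) * (p * (p + 3) // 2 + 1)

# With K = 3 classes and p = 50 features:
# LDA: 2 * 51 = 102 parameters; QDA: 2 * (50*53/2 + 1) = 2652.
```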

Discriminative vs Generative

V. Vapnik: “One should solve the [classification] problem directly and never solve a more general problem as an intermediate step”

As we said earlier, generative methods tend to solve a more complex problem, which consists in estimating the joint distribution.

This Figure is an artificial example from Bishop's machine learning book showing that the joint distribution can contain a lot of structure that is not needed to discriminate between classes. It is very hard to say that one method is better than the other, since it depends on the problem. In practice, however, it seems that on average discriminative methods perform better but converge more slowly.

In [https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf], the authors compare Naive Bayes and logistic regression on known datasets (breast cancer data, …). Their findings are depicted below.

From this Figure, you can see that Naive Bayes seems to converge faster than logistic regression. That is why the Naive Bayes classifier is often used as a baseline approach.

In fact, they showed that discriminative learning tends to have a lower asymptotic error, and that generative classifiers reach their (higher) asymptotic error faster.

Non Linear Classification

In this post we discussed linear classification; however, in real-world applications data is almost never linearly separable. As an example, you can see in the figure below that the data clearly cannot be separated by a hyperplane (a line here).

However, if you add $(x-x_1)^2+(y-y_1)^2$ as a feature, you end up with this feature space:

The data then becomes linearly separable by a hyperplane. In real-world examples, adding such non-linear features is key to performing well in a machine learning contest. However, finding a good non-linear transformation is very hard and rarely as simple as the example we saw. That is why deep learning is so popular: it builds those non-linear features automatically through the non-linear activation functions applied at each layer. And even though deep learning performs very well, expert hand-engineered features can still be hard to beat.
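The circle trick above can be reproduced in a few lines. This sketch assumes the circle is centered at a known $(x_1, y_1)$ (here taken to be the origin, for illustration); after appending the squared-distance feature, a single hyperplane in the lifted 3-D space separates the classes perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Circular data: class 1 lies inside a circle of radius 1 around (x1, y1).
x1, y1 = 0.0, 0.0
X = rng.uniform(-2, 2, size=(300, 2))
y = (((X[:, 0] - x1) ** 2 + (X[:, 1] - y1) ** 2) < 1.0).astype(int)

# Lift: append the squared distance to the circle center as a third feature.
r2 = (X[:, 0] - x1) ** 2 + (X[:, 1] - y1) ** 2
X_lifted = np.column_stack([X, r2])

# In the lifted space, the hyperplane r2 = 1 separates the classes exactly.
pred = (X_lifted[:, 2] < 1.0).astype(int)
accuracy = np.mean(pred == y)   # 1.0: perfectly separable after the lift
```

In practice the center and radius are unknown, which is exactly why finding the right non-linear transformation is the hard part.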

Some Nice Papers