Logistic Regression

Logistic regression is the canonical example of a "discriminative" classifier (i.e. one that learns the mapping $f:X \rightarrow Y$ directly from the signal, as opposed to learning a model of how the data was generated). Here, $Y$ is categorical and $X$ may be either continuous or categorical. A good short introduction (which also touches on generalized linear models) is [10].

Logistic regression assumes that $f$ is a sigmoid function of an inner product of the parameters with the features, i.e.

$$ P(Y=1 | X; \theta) = \sigma(\theta^T X) = \frac{1}{1+\exp(-\theta^T X)} $$
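As a minimal sketch in NumPy (hypothetical helper names; it assumes the rows of `X` are examples and any intercept has been folded in as a constant feature):

```python
import numpy as np

def sigmoid(z):
    """The logistic function, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """P(Y=1 | X; theta) for a feature matrix X whose rows are examples."""
    return sigmoid(X @ theta)
```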

Training a logistic regression classifier amounts to learning the parameter vector $\theta$. The simplest approach is maximum likelihood estimation, i.e.:

\begin{eqnarray} \theta^{MLE} &=& argmax_\theta \; \mathcal{L}(\theta)\\ &=& argmax_\theta \; P(Y | X; \theta)\\ \end{eqnarray}

Now, in the special case where $Y$ can be only $0$ or $1$, the conditional density can be concisely written as

$$ P(Y | X; \theta) = \sigma(\theta ^T X)^Y (1-\sigma(\theta ^T X))^{1-Y} $$

Assuming the $M$ training examples were generated independently, we can write the likelihood as:

\begin{eqnarray} \mathcal{L}(\theta) = \prod_{i=1}^M P(Y_i | X_i; \theta) \end{eqnarray}

It is easier to work with the log of the product; this leads to a convex, unconstrained optimization:

$$ \theta^{MLE} = argmax_\theta \sum_{i=1}^M \log P(Y_i | X_i; \theta) $$

Remark: Optimization problems tend to be stated as minimizations, so it is common to see this equivalent version instead:

$$ \theta^{MLE} = argmin_\theta \sum_{i=1}^M -\log P(Y_i | X_i; \theta) $$
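A sketch of this objective (hypothetical helper; `y` is a 0/1 label vector and `X` a feature matrix as above):

```python
import numpy as np

def neg_log_likelihood(theta, X, y):
    """Sum over examples of -log P(Y_i | X_i; theta), for labels y_i in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # P(Y=1 | X; theta)
    eps = 1e-12                             # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
```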

The standard way to solve this maximization is via stochastic gradient ascent. Ordinary gradient ascent uses an update rule involving the full gradient, e.g. $\theta^{next} = \theta + \alpha \nabla \mathcal{L}(\theta)$. If we have lots of training data, computing the full gradient at every step is slow. Stochastic gradient ascent instead computes the gradient contributed by a single (randomly selected) training example before updating. The algorithm looks like this:

$$ \begin{align} 1. & \mbox{Select a random ordering of the training examples} \\ 2. & \mbox{while not converged:}\\ & \quad \mbox{for }i \in \{1, \ldots, M\}: \\ & \qquad \theta^{next} = \theta + \alpha \nabla_\theta \log P(Y_i | X_i; \theta) \end{align} $$
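For logistic regression the per-example gradient has the simple form $\nabla_\theta \log P(Y_i | X_i; \theta) = (Y_i - \sigma(\theta^T X_i))\,X_i$. A minimal sketch of the loop above (assumptions: 0/1 labels, a fixed step size $\alpha$, and a fixed number of passes standing in for the convergence test):

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.1, epochs=100, rng=None):
    """Stochastic gradient ascent on the log-likelihood, one example at a time."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                   # stands in for "while not converged"
        for i in rng.permutation(len(y)):     # random ordering of the training examples
            p = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))
            theta = theta + alpha * (y[i] - p) * X[i]   # gradient of log P(Y_i | X_i; theta)
    return theta
```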

Unlike naive Bayes, logistic regression does not assume conditional independence (or any other relationship) between the features. As a result, when many extraneous features are provided, the MLE optimization can increase the likelihood by putting weight on the irrelevant features (i.e. overfitting). The standard approach to reduce overfitting is to replace the MLE estimate with a MAP estimate under a Laplacian prior, $P(\theta) = (\beta/2)^N \exp(-\beta |\theta|_1)$, where the parameter $\beta$ is user-specified [5]. The resulting minimization for such a MAP estimate is:

$$ \theta^{MAP} = argmin_\theta \; \beta |\theta|_1 + \sum_{i=1}^M -\log P(Y_i | X_i; \theta) $$

Note the $L_1$ regularization. It favors a sparser $\theta$ (i.e. a simpler model, which is less prone to overfitting) at the expense of a more difficult optimization problem.
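In practice, one convenient way to get such an $L_1$-regularized fit is scikit-learn's wrapper around LIBLINEAR [1]. The sketch below uses synthetic data, and note that scikit-learn's `C` parameter is (roughly) the inverse of the $\beta$ above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: X is an (M, N) feature matrix, y a 0/1 label vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

beta = 1.0                      # regularization strength from the MAP objective
clf = LogisticRegression(penalty="l1", C=1.0 / beta, solver="liblinear")
clf.fit(X, y)
print(clf.coef_)                # many entries should be exactly zero (sparsity)
```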

Liblinear is usually available [1]

No user-defined parameters to experiment with unless you regularize (which you probably should), and even then $\beta$ is intuitive

Fast to train

Fast to apply

You probably won't get fired for suggesting it

No assumptions about P(X|Y) during the learning stage (true of any discriminative method)

Logistic Regression is available in Spark [6]

Mark Tygert has done some writing about it [7]

Robust to outliers when compared against LDA (LDA assumes Gaussian class-conditional densities for the training data)

Often less accurate than the newer methods

Interpreting $\theta$ isn't straightforward

$L_1$ optimization is not easy

Bizarre use of the word "regression" until you learn about generalized linear models

Parametric

Discriminative

Somehow equivalent to a single-layer neural network (one sigmoid output unit, no hidden layer)

Somehow equivalent to Gaussian naive Bayes [2]

[1] http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

[2] https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

[3] https://cs.stanford.edu/people/ang//papers/nips01-discriminativegenerative.pdf

[4] http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

[5] https://cs.stanford.edu/people/ang//papers/aaai06-efficientL1logisticregression.pdf

[6] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html

[7] http://www.cims.nyu.edu/~tygert/lr.pdf

[8] http://math.arizona.edu/~hzhang/math574m/2014Lect6_LDAlog.pdf

[9] http://webdocs.cs.ualberta.ca/~greiner/C-466/SLIDES/3b-Regression.pdf

[10] http://cs229.stanford.edu/notes/cs229-notes1.pdf