You are probably using machine learning a number of times in a day without even noticing. For instance, whenever you check your mail box, a spam filter saves you from going through spam messages. And that is because it has learned how to distinguish between spam and non-spam mails. That is Machine Learning (ML). To explicitly define the term, machine learning is the science of getting a machine trained, without being explicitly programmed, from past experiences that come through the data collected.

There are two types of ML techniques:

Supervised Machine Learning: It is the most common type of ML in which the system is trained on a set of predefined training data which then reaches an accurate result when given a new set of data.

Unsupervised Learning: In this type of learning, the system is given a bunch of data, and it must find patterns and relationships within. For instance, identifying a group of close friends on Facebook

We will begin by talking about an example of supervised linear regression problem and then understand how to apply a gradient descent algorithm on it.

Supervised Learning

Suppose we have the following dataset which gives the prices of houses in the city of Bengaluru(India).

Living area(in square feet)(x) Price(in USD)(y) 820 30105 1050 58448 1550 85911 1200 87967 1600 73722 1117 54630 550 42441 1162 79596

How can we predict prices of the houses in Bengaluru as a function of the size of their living areas, given this data?

Let us establish notation for future use. We will denote the input variables, also called features(living area in the example) by \(x^{(i)}\) and the output or the target variable (price) by \(y^{(i)}\). A pair \(\left(x^{\left(i\right)},y^{\left(i\right)}\right)\) is called a training example. We will be using \(m\) such training examples. The list of \(m\) training examples is called a training set. When we’re trying to predict a target variable which is continuous, we say that the learning problem is a regression problem like our housing example.

To define the supervised learning problem more formally, given a training set, the aim is to learn a function \(h\) so that \(h(x)\) is a predictor for the corresponding value of \(y\). This function \(h\) is called a hypothesis.

Next, we need to decide while designing a learning algorithm is the representation if the hypothesis function \(h\) as a function of \(x\).

Let us initially assume that the hypothesis function looks like this:

\(h_{\theta}(x)=\theta_{0}+\theta_{1}x\)

Here, \(\theta_{0}\mbox{ and }\theta_{1}\) are called parameters.

How do we go about choosing these two parameter values, \(\theta_0 \mbox{ and } \theta_1\) so that it best fits our training set?

We will get different hypothesis functions with different choices of parameters.

In linear regression, we have a training set and we want to come up with values for the parameters so that the straight line we get out of \(h(x)\) somehow fits the data well.

Let’s try to choose values for the parameters so that given the \(x\) in the training set, we make reasonable predictions for the \(y\) values. Formally, we want to solve a minimization problem, that is, we want to minimize the difference between \(h_{\theta}(x) \mbox{ and } y\). To achieve that, we solve the following equation:

\(\displaystyle \min_{\theta_0,\theta_1}\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

Here, \(m\) is the number of training examples. To make the math a little bit easier, we put a factor of \(\displaystyle\frac{1}{2m}\), and it gives us the same value of the process.

By convention, we define a cost function:

\(\displaystyle J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

This cost function is also called the squared error function.

The expression \( \min\limits_{\theta_0,\theta_1}\) means that we want to find the values of \(\theta_{0}\mbox{ and }\theta_{1}\) so that the cost function \( J(\theta_0,\theta_1) \) is minimized.

To understand cost function more intuitively, let us consider a simple hypothesis function with a single parameter.

Let the training set be \(S=\{(1,1),(2,2),(3,3)\}\).

Say, \(\theta_{0}=0\), then our hypothesis function is \(h_{\theta}(x)=\theta_1x\) and cost function becomes

\(\displaystyle J(\theta_1)= \frac{1}{6}\sum_{i=1}^3\left(\theta_1x^{(i)}-y^{(i)}\right)^2 \).

This is a plot showing how \(J(\theta_1)\) varies with \(\theta_1\).

This shows that the cost function for \(S\) is minimum at \(\theta_1=1\) and that is what we expect as it will give \(y=x\), which perfectly matches our training set.

Now, if we assume that \(h_{\theta}(x)=\theta_0+\theta_1x\) and plot \(J(\theta_0,\theta_1)\) versus \(\theta_0\mbox{ and }\theta_1\), we get the following contour plot:

Here x-axis represents \(\theta_1\), y-axis represents \(\theta_0\) and z-axis represents \(J(\theta_0,\theta_1)\). So, the point where the height of the plot is least represents the point where the cost function is minimized. In this case, we expect it to be at \(y=0, x=1\) and that is what the contour plot shows.

Our objective is to minimize the cost function.

Gradient Descent

Gradient descent is an algorithm that is used to minimize a function. Gradient descent is used not only in linear regression; it is a more general algorithm.

We will now learn how gradient descent algorithm is used to minimize some arbitrary function f and, later on, we will apply it to a cost function to determine its minimum.

We will start off by some initial guesses for the values of \(\theta_0 \mbox{ and }\theta_1\) and then keep on changing the values according to the formula:

\(\displaystyle \theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}f(\theta_0,\theta_1)\mbox{ for j=0,1}\)

Here, \(\alpha\) is called the learning rate, and it determines how big a step needs to be taken when updating the parameters. The learning rate is always a positive number.

We want to simultaneously update \(\theta_0 \mbox{ and }\theta_1\), that is, calculate the right-hand-side of the above equation for both \( j=0 \mbox{ as well as }j=1\) and then update the values of the parameters to the newly calculated ones. This process is repeated till convergence is achieved.

Let us understand the role of learning rate. If \(\alpha\) is too small, then we will end up taking tiny baby steps, which means a lot of steps before we get anywhere near the global minimum. Now, if \(\alpha\) is too large, then there is a possibility that we miss the minimum entirely. It may fail to converge or it can even diverge.

How will one really know that he has already reached the minimum? It turns out that at the local minimum, derivative will be equal to zero because the slope of the tangent line at this point will be equal to zero. So, if the parameters are already at a local minimum then one step with gradient descent does absolutely nothing and that is what we are looking for.

Also, gradient descent converges to the local minimum even when learning rate \(\alpha\) is fixed. Because as the minimum is approached, derivative gets closer and closer to zero. Therefore, after one step of descent, the new derivative is smaller than the previous one. For another step in gradient descent, one will take a somewhat smaller step from the previous. Now, the derivative term is even smaller and so the magnitude of the update to \(\theta_1\) is even smaller and as gradient descent runs, one will automatically end up taking smaller and smaller steps, so there is no need to decrease \(\alpha\) every time.

To summarize our gradient descent algorithm:

\(\displaystyle\mbox{ repeat until convergence }\left\{\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}f(\theta_0,\theta_1)\mbox{ for j=0,1} \right\}\)

Gradient descent for univariate linear regression

For the linear regression model that we have discussed so far, we have the following hypothesis and cost function:

\(\displaystyle h_{\theta}(x)=\theta_0+\theta_1x \mbox{ and } J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

Now, we are going to apply the gradient descent algorithm to minimize the cost function for the linear regression model. So we have:

\(\displaystyle\theta_j:=\theta_j-\alpha \frac{\partial}{\partial\theta_j}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right)\)

The most important term in the equation above is the derivative term, so let’s calculate it for both \(j=0\) and \(j=1\).

\(\displaystyle \frac{\partial}{\partial\theta_0}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right) = \frac{1}{m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right) = \frac{1}{m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)\)

\(\displaystyle \frac{\partial}{\partial\theta_1}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right) = \frac{1}{m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right).x^{(i)} = \frac{1}{m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right).x^{(i)} \)

Now, we will look at gradient descent in action. We know that as we go toward the center of the contour plot, we get those values of \(\theta_0,\theta_1\) which yield the minimum value of the cost function. Let’s now approach the center of the contour plot and see how the hypothesis function fits the training set.

Every data point on the contour plot corresponds to \((\theta_1,\theta_0)\), and we have plotted the hypothesis function corresponding to every point. The one that is closest to the training data set is the center of the contour plot.

Gradient Descent with Python

>>> import numpy as np >>> import matplotlib.pyplot as plt >>> x = np.random.uniform(-4,4,500) >>> y = x + np.random.standard_normal(500)+2.5 >>> plt.plot(x, y, 'o') >>> plt.show() >>> def cost(X, Y, theta): >>> J=np.dot((np.dot(X,theta) - Y).T,(np.dot(X,theta) - Y))/(2*len(Y)) >>> return J >>> alpha = 0.1 # Specify the learning rate >>> theta = np.array([[0,0]]).T # Initial values of theta >>> X = np.c_[np.ones(500),x] >>> Y = np.c_[y] >>> X_1=np.c_[x].T >>> num_iters = 1000 >>> cost_history=[] >>> theta_history=[] >>> for i in range(num_iters): >>> a=np.sum(theta[0]- alpha * (1/len(Y)) * np.sum((np.dot(X,theta)- Y))) >>> b=np.sum(theta[1] - alpha * (1/len(Y)) * np.sum(np.dot(X_1,(np.dot(X,theta)-Y)))) >>> theta= np.array([[a],[b]]) >>> cost_history.append(cost(X,Y,theta)) >>> theta_history.append(theta) >>> if i in (1,3,7,10,14,num_iters): >>> plt.plot(x, a+x*b) >>> plt.suptitle('Linear regression by gradient descent') >>> plt.xlabel('x') >>> plt.ylabel('y') >>> plt.show() >>> elif i in range(20,num_iters,10): >>> plt.plot(x, a+x*b) >>> print(theta) [[ 2.51030574] [ 0.95243058]]

Following is the plot that is displayed when we execute the code above in Python:



Gradient Descent with R

# Generate random data > x <- runif(500, -4, 4) > y <- x + rnorm(500) + 2.5 # Define the squared error cost function > cost <- function(X, y, theta) { > sum( (X %*% theta - y)^2 ) / (2*length(y)) > } > alpha <- 0.1 # Specify the learning rate > num_iters <- 1000 # Specify the number of iterations > cost_history <- rep(0,num_iters) # will be used to store the value of cost function after # every iteration > theta_history <- list(num_iters) # will be used to store the value of theta after every # iteration > theta <- c(0,0) # Initial values of theta > X <- cbind(1,x) # Add a column vector with all values to be 1 to x so that hypothesis # function has an intercept > for (i in 1:num_iters) { > theta[1] <- theta[1] - alpha * (1/length(y)) * sum(((X%*%theta)- y)) > theta[2] <- theta[2] - alpha * (1/length(y)) * sum(((X%*%theta)- y)*X[,2]) > cost_history[i] <- cost(X, y, theta) > theta_history[[i]] <- theta > } > print(theta) [1] 2.470261 1.020998 # Plots the training dataset > plot(x,y, col=rgb(0.2,0.4,0.6,0.4), main='Linear regression by gradient descent') # Plots various lines during the course of convergence > for (i in c(1,3,6,10,14,seq(20,num_iters,by=10))) { > abline(coef=theta_history[[i]], col=rgb(0.8,0,0,0.3)) > } > abline(coef=theta, col='blue') # Plots a straight line with intercept as theta[1] and slope # as theta[2]

Following is the plot that is displayed when we execute the code above in R:

Linear regression via gradient descent is conceptually a simple algorithm. Although, for advanced learning algorithms, the basic concepts remain same but the linear model is replaced by a much more complex model and, correspondingly, a much more complex cost function.