The principle of boosting

A gradient boosting algorithm is a special case of boosting, but how does boosting itself actually work?

The basic idea is similar to that of bagging: rather than relying on a single model, we build several models and then aggregate them to obtain a single result. When building these models, however, boosting works sequentially. It begins by building a first model, which is evaluated. Based on this evaluation, each individual is weighted according to how well its value was predicted.

The objective is to give greater weight to the individuals whose values were badly predicted when building the next model. Adjusting the weights step by step in this way makes the difficult values easier to predict.
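To make this re-weighting concrete, here is a minimal sketch of an AdaBoost-style loop built on scikit-learn decision stumps; the dataset, the number of rounds and the constants are purely illustrative, and a real implementation such as sklearn.ensemble.AdaBoostClassifier adds several refinements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch of the boosting principle: each round re-weights the
# observations that the previous model predicted badly (simplified AdaBoost).
X, y = make_classification(n_samples=500, random_state=0)
y = np.where(y == 0, -1, 1)                  # labels in {-1, +1}

n_rounds = 10
weights = np.full(len(y), 1 / len(y))        # start with uniform weights
models, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)   # weighted fit
    pred = stump.predict(X)

    err = np.sum(weights * (pred != y)) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # weight of this model

    # Increase the weight of badly predicted individuals, decrease the others
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    models.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all the models built along the way
scores = sum(a * m.predict(X) for a, m in zip(alphas, models))
final_pred = np.sign(scores)
print("training accuracy:", (final_pred == y).mean())
```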

Gradient Boosting

This algorithm uses the gradient of the loss function, computed for each individual, to drive the construction of each new model. It works a bit like the gradient descent used to train neural networks.

Gradient boosting generally uses classification and regression trees as base models, and the algorithm can be customized through its hyperparameters and the choice of loss function.
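As an illustration, here is a minimal sketch using scikit-learn's GradientBoostingRegressor on synthetic data; the hyperparameter values are arbitrary, and the loss name "squared_error" assumes a recent scikit-learn release.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Gradient boosting with regression trees as base models; the main knobs are
# the number of trees, the learning rate, the tree depth and the loss function.
X, y = make_regression(n_samples=1000, n_features=10, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=200,      # number of sequential trees
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    max_depth=3,           # depth of each regression tree
    loss="squared_error",  # loss whose gradient drives the fit
)
model.fit(X_train, y_train)
print("R² on the test set:", model.score(X_test, y_test))
```

Lowering the learning rate generally requires more trees but tends to generalize better.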

The algorithm is inspired by the gradient descent algorithm. We consider a real function f(x) and we calculate the gradient to construct a sequence:
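xₖ₊₁ = xₖ − η ∇f(xₖ)

Here η > 0 is the step size (or learning rate) and k is the iteration index; this is the standard gradient descent update.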

The sequence (xₖ) converges to the minimum of the function f. We then apply this idea to the error function of a regression problem.

Most often, we apply this method to a function F, which depends on a parameter θ:
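F(θ, x)

The quantity to minimize is then the error of F on the training sample, for instance E(θ) = Σᵢ e(F(θ, Xᵢ), yᵢ), where e denotes a pointwise loss such as the squared error and E the total error on the sample.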

It is the following sequence that converges towards the minimum of this error, so that the function F(θ, x) approximates the points (Xᵢ, yᵢ) as well as possible:
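θₖ₊₁ = θₖ − η ∇θ E(θₖ) = θₖ − η Σᵢ ∇θ e(F(θₖ, Xᵢ), yᵢ)

This is simply gradient descent applied to the model's parameters, with the same step size η as before.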

But we could solve this problem in a space of functions and not a space of parameters:
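Gₖ₊₁ = Gₖ − η ∇G E(Gₖ),   with   E(G) = Σᵢ e(G(Xᵢ), yᵢ)

The descent step now updates the prediction function G itself rather than a parameter vector; evaluated at the training points, it reads Gₖ₊₁(Xᵢ) = Gₖ(Xᵢ) − η ∂e(Gₖ(Xᵢ), yᵢ)/∂Gₖ(Xᵢ).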

The gradient is easy to calculate: it only involves the current predictions Gₖ(Xᵢ) and not the explicit form of G. We can therefore construct the regression function G as an additive sequence of functions (Fₖ):
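Gₖ = Gₖ₋₁ + η Fₖ

Each new function Fₖ plays the role of one descent step; the learning rate η, often set below 1, shrinks each new contribution (with η = 1 the model is simply the sum F₁ + F₂ + … + Fₖ).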

And we could construct each function Fₖ as the solution to a regression problem defined by the pairs (Xᵢ, zᵢ), with:
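zᵢ = −∂e(Gₖ₋₁(Xᵢ), yᵢ) / ∂Gₖ₋₁(Xᵢ)

zᵢ is the negative gradient of the loss at the current prediction, often called the pseudo-residual. With the squared error e(G(x), y) = ½ (y − G(x))², it is simply the residual zᵢ = yᵢ − Gₖ₋₁(Xᵢ).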

That's how gradient boosting is defined mathematically. For more details, you can look at Krishna Kumar Mahto's excellent article, where he explains the mathematics behind gradient boosting.
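To connect these formulas to code, here is a minimal from-scratch sketch of gradient boosting for the squared error, where each tree plays the role of Fₖ and is fitted to the pseudo-residuals zᵢ; the dataset and the constants are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Minimal gradient boosting for the squared error: each tree Fk is fitted to
# the pseudo-residuals zi = yi - G(Xi), the negative gradient of the loss.
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

eta = 0.1                      # learning rate (the η of the formulas)
n_rounds = 100
G = np.full(len(y), y.mean())  # initial constant prediction
trees = []

for _ in range(n_rounds):
    z = y - G                                  # pseudo-residuals (squared error)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, z)                             # regression problem on (Xi, zi)
    trees.append(tree)
    G += eta * tree.predict(X)                 # additive update G <- G + η·Fk

print("training MSE:", np.mean((y - G) ** 2))
```

scikit-learn's GradientBoostingRegressor, shown earlier, implements the same idea with many additional refinements.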