Today I would like to share my experience with CatBoost, an open-source machine learning library based on gradient boosting on decision trees, developed by the Russian search engine company Yandex.

GitHub profile as of February 12

The library is released under the Apache license and is free to use.

‘Cat’, by the way, is short for ‘category’; Yandex is enjoying the play on words.

You might be familiar with gradient boosting libraries such as XGBoost, H2O or LightGBM. In this tutorial I’m going to give a quick overview of the basics of gradient boosting and then gradually move on to more complex topics.

Decision tree introduction

Before talking about gradient boosting I will start with decision trees. A tree as a data structure has many analogies in real life, and it is used in many areas because it is a good representation of a decision process. A tree consists of a root node, decision nodes and terminal nodes (nodes that are not going to be split further). Trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. Decision trees can be applied to both regression and classification problems.

A simple decision tree used in a scoring classification problem

In classification problems we use different metrics as the criterion for a binary split; the most popular ones are the Gini index and cross-entropy. The Gini index is a measure of total variance across the K classes. In regression problems we use the variance or the mean deviation from the median.
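For reference, with p_k denoting the proportion of objects of class k in a node, the standard textbook definitions of these two criteria (not taken from the figure) are:

G = \sum_{k=1}^{K} p_k (1 - p_k), \qquad H = -\sum_{k=1}^{K} p_k \log p_k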

The functional whose value is maximized for finding the optimal partition at a given vertex

Growing a tree involves deciding which features to choose and what conditions to use for splitting, along with knowing when to stop. Decision trees tend to become very complex and overfit, which means the error on the training set will be low but high on the validation set. A smaller tree with fewer splits might lead to lower variance and better interpretability at the cost of a little bias.

Decision trees show good results on non-linear dependencies. In the example above we can see that the dividing surface of each class is piecewise constant, and each side of the surface is parallel to a coordinate axis, since each condition compares the value of one feature to a threshold.

We can avoid overfitting in two ways: by adding a stopping criterion, or by using tree pruning. A stopping criterion helps to decide whether we need to continue dividing the tree or whether we can stop and turn the current vertex into a leaf. For example, we can set a minimum number of objects per node: if the node contains m > n objects, we continue dividing the tree, otherwise we stop (n == 1 is the worst case).

Alternatively, we can fix the tree height.
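As an illustration, both stopping criteria map directly to constructor parameters in scikit-learn (which I use here only for the example; the rest of this post is about CatBoost):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# stop splitting when a node holds fewer than 10 objects, or at depth 5
tree = DecisionTreeClassifier(min_samples_split=10, max_depth=5)
tree.fit(X, y)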

Another approach is tree pruning: we construct an overfitted tree and then delete leaves based on a selected criterion. Pruning can start at either the root or the leaves. Removing branches from a “fully grown” tree gives a sequence of progressively pruned trees. On cross-validation, we compare the overfitted tree with and without a given split; if the result is better without the node, we exclude it. There are many different pruning techniques used to optimise performance, for example reduced error pruning and cost complexity pruning, where a learning parameter (alpha) is used to weigh whether nodes can be removed based on the size of the sub-tree.

Decision trees suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results that we get could be quite different.

Ensembles

However, researchers found out that a combination of different decision trees may show much better results. An ensemble is when we have N base algorithms and the result of the final algorithm is a function of the results of the base algorithms. We combine a series of learned models with the aim of creating an improved model.

There are various ensemble techniques, such as boosting (a weighted vote over a collection of classifiers), bagging (averaging the predictions over a collection of classifiers) and stacking (combining a set of heterogeneous classifiers).

To build a tree ensemble we need to train the base algorithms on different samples, since we can’t train them all on one single set. We use randomization here to train the classifiers on different datasets; for example, we can use the bootstrap.
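A minimal sketch of bootstrap sampling (plain NumPy, illustrative only):

import numpy as np

def bootstrap_sample(X, y):
    # draw len(X) indices with replacement; each resample contains
    # on average about 63% of the unique original objects
    idx = np.random.randint(0, len(X), size=len(X))
    return X[idx], y[idx]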

The expectation of the error is a sum of variance, bias and noise. The ensemble consists of trees with low bias and high variance. The main objective of the gradient boosting algorithm is to keep constructing trees with low bias.
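In formula form, this is the standard decomposition of the expected squared error for a model a(x) approximating a true function f(x) with noise variance \sigma^2:

\mathbb{E}\big[(y - a(x))^2\big] = \underbrace{\big(\mathbb{E}[a(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(a(x) - \mathbb{E}[a(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}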

For example, suppose we need to approximate the green function in the right picture based on 10 points with noise. The left graph shows polynomials trained on different samples. The averaged polynomial is shown in the right picture as the red line.

We can see that the red graph is almost the same as the green one, while each algorithm taken separately differs significantly from the green function. This family of algorithms has low bias but high variance.

Decision trees are characterised by low bias but high variance: even small changes in the training sample can produce a very different tree. The variance of an ensemble is roughly the variance of one base algorithm divided by the number of algorithms, plus a term for the correlation between the base algorithms.
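Written out (following the standard random forest analysis), for N identically distributed base algorithms with individual variance \sigma^2 and pairwise correlation \rho:

\mathrm{Var} = \rho\,\sigma^2 + \frac{1 - \rho}{N}\,\sigma^2

so decorrelating the base algorithms directly shrinks the first term, which averaging alone cannot remove.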

Random forest algorithm

To reduce the impact of correlation between base algorithms we can use bagging and the random subspace method. One of the most notable examples of this approach is the random forest classifier. This algorithm is based on the random subspace method and bagging, and uses CART decision trees as the base algorithm.

The random subspace method can help to reduce the correlation between trees and avoid overfitting. Let’s have a closer look: suppose we have a dataset with D features, L objects and N base trees.

Each base algorithm is fitted on a bootstrap sample.

We choose d of the D features at random and construct the tree until a stopping criterion (which I mentioned earlier) is met. Usually we build overfitted trees with low bias.

The number of features d is D/3 for regression problems and sqrt(D) for classification.

It should be emphasized that a random subset of size d is selected anew each time another vertex is split. This is the main difference between this approach and the method of random subspaces, where a random subset of features is chosen once, before building the base algorithm.

After that, we apply bagging and average the results of the base algorithms.
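Putting the pieces together, a scikit-learn random forest exposes exactly these knobs (again, scikit-learn appears here only for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 16)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,    # N base trees, each fitted on a bootstrap sample
    max_features="sqrt", # d = sqrt(D) features re-drawn at every split
)
forest.fit(X, y)         # predictions are averaged over the trees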

Random forest has various advantages: it is insensitive to outliers, works well for large feature spaces, and is hard to overfit by adding more trees.

However, there is one drawback: storing the models requires O(NK) memory, where K is the number of trees, which is quite a lot.

Boosting

Boosting is a weighted ensemble method. The base algorithms are added sequentially, one by one: a series of N classifiers is learned iteratively. Weights are updated to allow subsequent classifiers to “pay more attention” to training tuples that were misclassified by previous classifiers. The weighted vote doesn’t influence the complexity of the algorithm, but smooths the answers of the base algorithms.

How does boosting compare with bagging? Because boosting focuses on misclassified tuples, it risks overfitting the resulting composite model to such data.

• A greedy algorithm for constructing a linear combination of base algorithms

• Each subsequent algorithm is constructed to correct the errors of the existing ensemble.

Different approximations of the threshold loss function

As an example, take the standard regression task with MSE as the loss function.

Usually we take the simplest base algorithm; for instance, we can take a short decision tree.

The second algorithm must be fitted in a way that minimizes the error of the composition of b1(x) and b2(x).

And similarly for bN(x).
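The equations referenced above are the standard residual-fitting chain for MSE (reconstructed here, since the original images are not reproduced):

b_1 = \arg\min_b \frac{1}{l} \sum_{i=1}^{l} \big(b(x_i) - y_i\big)^2

b_2 = \arg\min_b \frac{1}{l} \sum_{i=1}^{l} \big(b_1(x_i) + b(x_i) - y_i\big)^2

b_N = \arg\min_b \frac{1}{l} \sum_{i=1}^{l} \big(a_{N-1}(x_i) + b(x_i) - y_i\big)^2, \quad \text{where } a_{N-1}(x) = \sum_{n=1}^{N-1} b_n(x)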

Gradient boosting

Gradient boosting is known to be one of the leading ensemble algorithms.

The gradient boosting algorithm uses the gradient descent method to optimize the loss function. It is an iterative algorithm with the following steps:

1. Initialise the ensemble with a simple first algorithm b0.

2. On each iteration we compute a shift vector s = (s1, ..., sl), where si is the value the new algorithm should take on the i-th training example, bN(xi) = si.

3. The new algorithm bN is then fitted to approximate this vector.

4. Finally, we add the algorithm bN to the ensemble.
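In formulas (standard gradient boosting; the original images presumably showed the same), the shift vector is the negative gradient of the loss at the current predictions:

s_i = -\left.\frac{\partial L(y_i, z)}{\partial z}\right|_{z = a_{N-1}(x_i)}, \qquad b_N = \arg\min_b \frac{1}{l} \sum_{i=1}^{l} \big(b(x_i) - s_i\big)^2, \qquad a_N(x) = a_{N-1}(x) + \eta\, b_N(x)

where \eta is the learning rate.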

There are several gradient boosting libraries available: XGBoost, H2O, LightGBM. The main differences between them are in the tree structure, feature engineering and the handling of sparse data.

Catboost

CatBoost can be used for solving regression, classification, multi-class classification and ranking problems. The modes differ in the objective function that we are trying to minimize during gradient descent. Moreover, CatBoost has pre-built metrics to measure the accuracy of the model.
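For example, the mode is selected via the loss_function parameter (the objective names below are real CatBoost objectives; the rest is a minimal sketch):

from catboost import CatBoostClassifier, CatBoostRegressor

reg = CatBoostRegressor(loss_function="RMSE")           # regression
clf = CatBoostClassifier(loss_function="Logloss")       # binary classification
multi = CatBoostClassifier(loss_function="MultiClass")  # multi-class classification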

On the official CatBoost website you can find a comparison of CatBoost with the major benchmarks.

Figures in this table represent Logloss values (lower is better) for classification mode.

The percentage is the metric difference measured against tuned CatBoost results.

CatBoost advantages

CatBoost introduces the following algorithmic advances:

• An innovative algorithm for processing categorical features. There is no need to preprocess the features on your own: it’s performed out of the box. For data with categorical features, the accuracy will be better compared to other algorithms.

• The implementation of ordered boosting, a permutation-driven alternative to the classic boosting algorithm. On small datasets, gradient boosting overfits quickly, and CatBoost has a special modification for such cases. That is, on datasets where other algorithms had a problem with overfitting, you won’t observe the same problem with CatBoost.

• Fast and easy-to-use GPU training. You can simply install it via pip install.

• Other useful features: missing value support and great visualization.

Categorical features

A categorical feature is one with a discrete set of values, called categories, that are not comparable to each other.

The main advantage of CatBoost is its smart preprocessing of categorical data: you don’t have to preprocess the data on your own. Some of the most popular practices to encode categorical data are:

• One-hot encoding

• Label encoding

• Hash encoding

• Target encoding, and others

One-hot encoding is a popular approach for categorical features with a small number of distinct values. CatBoost uses one-hot encoding for all features with a number of different values less than or equal to the one_hot_max_size parameter.
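For instance (one_hot_max_size is a real CatBoost parameter; the tiny dataset is made up):

from catboost import CatBoostClassifier

# categorical features with <= 4 distinct values get one-hot encoded;
# higher-cardinality ones fall back to CatBoost's target statistics
model = CatBoostClassifier(one_hot_max_size=4, iterations=10, verbose=False)
model.fit(
    [["red", 1], ["green", 2], ["blue", 3], ["red", 4]],
    [0, 1, 1, 0],
    cat_features=[0],
)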

In the case of features with high cardinality (like, e.g., a “user ID” feature), such a technique leads to an infeasibly large number of new features.

Another popular method is to group categories by target statistics (TS), which estimate the expected target value in each category.

The problem with such a greedy approach is target leakage: the new feature is computed using the target of the very example it belongs to. This leads to a conditional shift: the distribution of the feature differs for training and test examples.

The common methods for solving this problem are holdout TS and leave-one-out TS, but they still don’t fully prevent the model from target leakage.

CatBoost uses a more effective strategy. It relies on the ordering principle and is called target-based statistics with a prior (TBS). It is inspired by online learning algorithms, which receive training examples sequentially in time: the value of the TS for each example relies only on the observed history. To adapt this idea to the standard offline setting, we introduce an artificial “time”, i.e., a random permutation σ of the training examples.

In CatBoost, the data is randomly shuffled and the mean target is calculated for every object using only its historical data, i.e., the objects that precede it in the permutation. The data can be reshuffled multiple times.
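A simplified sketch of this idea (the prior P with weight a follows the formula in the CatBoost paper; everything else here is illustrative, not CatBoost's actual code):

import numpy as np

def ordered_target_statistics(categories, targets, prior, a=1.0):
    # encode a categorical column using only each row's "history"
    # under a random permutation (the artificial time)
    n = len(categories)
    perm = np.random.permutation(n)
    sums, counts = {}, {}
    encoded = np.empty(n)
    for pos in perm:  # iterate in the artificial time order
        c = categories[pos]
        s, cnt = sums.get(c, 0.0), counts.get(c, 0)
        # TS from preceding examples only, smoothed by the prior
        encoded[pos] = (s + a * prior) / (cnt + a)
        sums[c] = s + targets[pos]
        counts[c] = cnt + 1
    return encoded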

Another important detail of CatBoost is the use of combinations of categorical features as additional categorical features which capture high-order dependencies, like the joint information of user ID and ad topic in the task of ad click prediction. The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them. CatBoost constructs combinations in a greedy way. Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly.

Fighting gradient biases

CatBoost implements an algorithm that allows it to fight the usual gradient boosting biases. Existing implementations face a statistical issue, prediction shift: the distribution of F(x_k) | x_k for a training example is shifted from the distribution of F(x) | x for a test example x. This problem is similar to the one that occurs in the preprocessing of categorical variables, described above.

The CatBoost team derived ordered boosting, a modification of the standard gradient boosting algorithm that avoids target leakage. CatBoost has two boosting modes, Ordered and Plain. The latter mode is the standard GBDT algorithm with inbuilt ordered TS.

You can find a detailed description of the algorithm in the paper Fighting biases with dynamic boosting

CatBoost uses oblivious decision trees, where the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up prediction significantly at testing time.

Here is the implementation of oblivious tree evaluation in Catboost:

int index = 0;
// one binary condition per tree level: shift each result into its own bit
for (int depth = 0; depth < tree.ysize(); ++depth) {
    index |= binFeatures[tree[depth]] << depth;
}
// the assembled bit pattern directly indexes the leaf values array
result += Model.LeafValues[treeId][resultId][index];

As you can see, there are no “if” statements in this code: you don’t need branches to evaluate an oblivious decision tree.

An oblivious decision tree can be described as a list of conditions, one condition per layer. With oblivious trees you just need to evaluate all the tree’s conditions, compose a binary vector from the results, convert this binary vector to a number and access the leaf array at the index equal to this number.

Compare this with, for example, LightGBM (XGBoost has a similar implementation):

std::vector<int> left_child_;
std::vector<int> right_child_;

inline int NumericalDecision(double fval, int node) const {
  ...
  if (GetDecisionType(decision_type_[node], kDefaultLeftMask)) {
    return left_child_[node];
  } else {
    return right_child_[node];
  }
  ...
}

inline int Tree::GetLeaf(const double* feature_values) const {
  ...
  while (node >= 0) {
    node = NumericalDecision(feature_values[split_feature_[node]], node);
  }
  ...
}

In the Ordered boosting mode, during the learning process we maintain supporting models M_{r,j}, where M_{r,j}(i) is the current prediction for the i-th example based on the first j examples in the permutation σ_r. At each iteration t of the algorithm, we sample a random permutation σ_r from {σ_1, ..., σ_s} and construct a tree T_t on the basis of it. First, for categorical features, all TS are computed according to this permutation. Second, the permutation affects the tree learning procedure.

Based on M_{r,j}(i), the corresponding gradients are computed. While constructing a tree, we approximate the gradient G in terms of cosine similarity, where for each example i we take the gradient based only on the examples preceding i in σ_r. When the tree structure T_t (i.e., the sequence of splitting attributes) is built, we use it to boost all the models M_{r′,j}.

You can find detailed information in the original paper or in the NIPS’18 slides.

GPU training

CatBoost can be efficiently trained on several GPUs in one machine.

Experimental results for different hardware

CatBoost achieves good scalability: on 16 GPUs with InfiniBand, CatBoost runs approximately 3.75 times faster than on 4 GPUs. Scalability should be even better with larger datasets. If there is enough data, we can train models even on slow 1GbE networks: two machines with two cards per machine are not significantly slower than 4 GPUs on one PCIe root complex. You can read more in this NVIDIA article.

Among the described advantages, the following also need a mention:

• Overfitting detector (see the sketch after this list). Usually in gradient boosting we adjust the learning rate to reach stable accuracy, but the smaller the learning rate, the more iterations are needed; the detector stops training once the metric on the evaluation set stops improving.

• Missing values: just leave them as NaN.

• In CatBoost you are able to write your own loss function.

• Feature importance. CatBoost provides tools for the Python package that allow plotting charts with different training statistics. This information can be accessed both during and after the training procedure.
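A small usage sketch (od_type, od_wait and get_feature_importance are real CatBoost API; the data is synthetic):

import numpy as np
from catboost import CatBoostClassifier

X = np.random.rand(500, 6)
y = (X[:, 0] + np.random.rand(500) > 1).astype(int)

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.03,
    od_type="Iter",  # overfitting detector: stop if no improvement...
    od_wait=50,      # ...for 50 consecutive iterations on the eval set
    verbose=False,
)
model.fit(X[:400], y[:400], eval_set=(X[400:], y[400:]))
print(model.get_feature_importance(prettified=True))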

You can monitor training in an iPython Notebook using the visualization tool CatBoost Viewer.

A CatBoost model can also be integrated with TensorFlow. For example, it is a common pattern to combine CatBoost and TensorFlow: a neural network is used for feature extraction, and gradient boosting is trained on the extracted features.

Also, a CatBoost model can now be used in production with the help of CoreML.

Examples

I created an example of applying CatBoost to a regression problem, using data from Allstate Claims Severity as the basis.
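In outline, the setup looks like this (the cat/loss column naming follows the Kaggle dataset, but treat this as a sketch rather than the exact notebook code):

import pandas as pd
from catboost import CatBoostRegressor, Pool

df = pd.read_csv("train.csv")  # Allstate-style training data
cat_cols = [c for c in df.columns if c.startswith("cat")]
X, y = df.drop(columns=["loss"]), df["loss"]

train_pool = Pool(X, y, cat_features=cat_cols)  # Pool is CatBoost's data wrapper
model = CatBoostRegressor(loss_function="MAE", iterations=1000, verbose=100)
model.fit(train_pool)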

Feel free to use my colab in your further research!

You can also find plenty of other examples in the official CatBoost GitHub repository.

Contribution

In case you want to make CatBoost better:

Check out help wanted issues to see what can be improved, or open an issue if you want something.

Add your stories and experience to Awesome CatBoost.

To contribute to CatBoost you need to first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information and instructions for contributors can be found in CONTRIBUTING.md.

Follow me on Twitter or WeChat (zkid18) to stay updated with my posts.