Decision forest

This page contains a brief description of the RDF classification and regression algorithm. Before reading it, you should look through the paper on the general principles of data analysis methods. That paper contains important information that applies to every algorithm in this section and has been moved to a separate page to avoid duplication.

RDF (Random Decision Forest) Algorithm

The RDF algorithm is a modification of the original Random Forest algorithm designed by Leo Breiman and Adele Cutler. It combines two ideas: a committee of decision trees that produces its result by voting, and randomization of the training process. The algorithm is briefly described below:

Let the training set have size N, and let the number of independent variables be M. The algorithm takes three input parameters: a ratio r (0 ≤ r ≤ 1), a number of attributes m ≤ M, and a number of trees NTrees ≥ 1. From the primary training set, generate a random sample of size r·N (without repetitions). The training set elements that did not get into the sample are used later to estimate the generalization error. Grow a decision tree on the generated sample: at each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split on these m variables. The tree is fully grown and is never pruned. The procedure is repeated NTrees times. The grown trees form a committee that decides by voting.
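The procedure above can be sketched in a few dozen lines of Python. This is a minimal illustration for classification, not the ALGLIB implementation: the function names (build_forest, predict, oob_error), the Gini impurity criterion, and the choice of split thresholds at observed values are all assumptions made for the example. The toy data is chosen so that either of its two variables separates the classes.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, features):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, None
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for t in values[:-1]:  # both sides of "x[f] <= t" are non-empty
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best_score is None or score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, rows, m, rng):
    """Grow an unpruned tree. Leaves are 1-tuples, inner nodes are 4-tuples."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return (labels[0],)
    features = rng.sample(range(len(X[0])), m)  # m random variables per node
    split = best_split(X, y, rows, features)
    if split is None:  # the chosen variables are constant on these rows
        return (Counter(labels).most_common(1)[0][0],)
    f, t = split
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t, grow_tree(X, y, left, m, rng), grow_tree(X, y, right, m, rng))


def predict_tree(node, x):
    while len(node) == 4:
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node[0]


def build_forest(X, y, r, m, ntrees, seed=0):
    """Train ntrees trees, each on a sample of size r*N drawn without repetitions."""
    rng = random.Random(seed)
    n = len(X)
    size = max(1, round(r * n))
    forest = []
    for _ in range(ntrees):
        sample = rng.sample(range(n), size)
        forest.append((grow_tree(X, y, sample, m, rng), set(sample)))
    return forest


def predict(forest, x):
    """Majority vote of the committee."""
    votes = Counter(predict_tree(tree, x) for tree, _ in forest)
    return votes.most_common(1)[0][0]


def oob_error(forest, X, y):
    """Error rate where each record is judged only by trees that did not see it."""
    wrong = counted = 0
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(predict_tree(t, x) for t, used in forest if i not in used)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != label
    return wrong / counted if counted else float("nan")


# Toy data: 20 records, two variables, class 0 for i < 10 and class 1 otherwise.
X = [[float(i), float(-i)] for i in range(20)]
y = [0] * 10 + [1] * 10
forest = build_forest(X, y, r=0.6, m=1, ntrees=25)
print(predict(forest, [0.0, 0.0]), predict(forest, [19.0, -19.0]))
print("out-of-bag error:", oob_error(forest, X, y))
```

Each tree keeps the set of records it was trained on, which is what makes the internal (out-of-bag) estimate of the generalization error possible without a separate test set.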

Algorithm Discussion

Advantages of the algorithm:

High training speed

Noniterative training: the algorithm completes in a fixed number of operations

Scalability (capacity for processing large data volumes)

High quality of the models derived (comparable with neural networks and neural network ensembles)

Small number of parameters to be adjusted

Internal estimate of the model's generalization error

At the same time, the algorithm's disadvantages may also be noted:

The model requires a lot of memory. If a committee consists of K trees built from a training set of size N, the memory requirements are O(K·N). For example, with K=100 and N=1000, the model built by ALGLIB is about one megabyte in size.

The trained model works somewhat slower than other algorithms (if the model incorporates 100 trees, all of them must be evaluated to get a result).

The algorithm is prone to overfitting, especially on noisy tasks. This problem can be partly overcome by adjusting the parameter r (see below). A similar but more pronounced problem occurs in the original Random Forest algorithm (see Machine Learning Benchmarks and Random Forest Regression). Its authors did not notice this disadvantage, believing that the algorithm had no tendency to overfit, and some practitioners and theoreticians in machine learning share that mistaken belief.

Like decision trees, the algorithm has absolutely no capacity for extrapolation.
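The lack of extrapolation is easy to see on a one-dimensional regression task. The sketch below is a simplification, not ALGLIB code: it assumes distinct x values, so that a fully grown tree ends up with one training point per leaf, and it replaces the tree with an equivalent sorted lookup whose thresholds lie halfway between neighbouring points. All function names are hypothetical.

```python
import bisect
import random


def fit_tree_1d(xs, ys):
    """Stand-in for a fully grown 1-D regression tree (one leaf per point)."""
    pairs = sorted(zip(xs, ys))
    cuts = [(a[0] + b[0]) / 2.0 for a, b in zip(pairs, pairs[1:])]
    leaves = [p[1] for p in pairs]
    return cuts, leaves


def predict_tree_1d(tree, x):
    cuts, leaves = tree
    # Index of the leaf whose interval contains x; any x beyond the last
    # threshold falls into the last leaf, however far away it is.
    return leaves[bisect.bisect_left(cuts, x)]


def fit_forest_1d(xs, ys, r, ntrees, seed=0):
    """Each tree is built on a sample of size r*N drawn without repetitions."""
    rng = random.Random(seed)
    size = max(1, round(r * len(xs)))
    forest = []
    for _ in range(ntrees):
        idx = rng.sample(range(len(xs)), size)
        forest.append(fit_tree_1d([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest_1d(forest, x):
    """Regression committee: average of the individual tree outputs."""
    return sum(predict_tree_1d(t, x) for t in forest) / len(forest)


# Training data follows y = 2x on x = 0..10, so the largest target is 20.
xs = [float(i) for i in range(11)]
ys = [2.0 * x for x in xs]
forest_1d = fit_forest_1d(xs, ys, r=0.66, ntrees=50)
for x in (5.0, 10.0, 100.0, 1000.0):
    print(x, predict_forest_1d(forest_1d, x))
```

Every leaf stores a value taken from the training targets, so the forest output can never leave the range of targets it has seen: the predictions at x=100 and x=1000 are identical and do not exceed 20, although the underlying relationship gives 200 and 2000.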

Working with decision forests

Operations on the decision forest are performed in the following order:

Selection of tuning parameters (see below)

Forest construction

Using the forest built (data processing, serialization, etc.)

Tuning RDF Algorithm

The algorithm has three parameters that need to be adjusted: the ratio r, which is the fraction of the training set used to build each individual tree; the number of trees NTrees; and the number NFeatures (m in the description above), which is the number of variables used to grow individual trees. These parameters are described in more detail below.

The ratio r, in the range of 0 to 1, impacts the algorithm's tolerance to the noise in the training set. The weak point of the original Breiman algorithm is its inability to offer full compensation for the individual trees' tendency to overfit.

Note #1

A nonregularized decision tree can memorize the training set exactly. The procedure of sampling N records with repetitions, used in the original algorithm, puts approximately 63.2% of the distinct elements of the complete training set into each sample (which corresponds to r=0.632). Thus, for any element of the training set, roughly 63% of the individual trees remember it and win the majority in the vote. If there is noise in the training set, the Breiman algorithm will reproduce all errors in the training set with a high degree of accuracy, failing to find the regular structure behind the noise.
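The 0.632 figure comes from the fact that a given record is missed by all N draws with probability (1 − 1/N)^N, which tends to 1/e as N grows, so the expected fraction of distinct records in the sample is about 1 − 1/e ≈ 0.632. The short simulation below checks this numerically; the function name and the sample sizes are arbitrary choices made for the example.

```python
import math
import random


def unique_fraction(n, rng):
    """Fraction of distinct records when n records are drawn with repetitions."""
    return len({rng.randrange(n) for _ in range(n)}) / n


rng = random.Random(0)
n = 10_000
simulated = sum(unique_fraction(n, rng) for _ in range(20)) / 20
expected = 1.0 - (1.0 - 1.0 / n) ** n  # exact expectation for this n
limit = 1.0 - math.exp(-1.0)           # limit for large n, about 0.632

print("simulated:", simulated)
print("expected: ", expected)
print("limit:    ", limit)
```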

There are different solutions to this problem: either individual trees can be regularized (as suggested in the paper referred to above), or the noise can be suppressed by controlling the variety of individual trees through the ratio r (as implemented in ALGLIB). Recommended values of r range from 0.66 (low noise level) down to 0.05 (very high noise level). The choice is based on the ratio of the training set error to the generalization error (calculated on a test set or by out-of-bag estimation): if this ratio is much smaller than one, r should be decreased and the model rebuilt.
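The selection rule can be written down as a small helper. The text does not say how much smaller than one the error ratio must be, or by how much r should be reduced, so the threshold of 0.5 and the halving step below are purely illustrative assumptions, as is the function name. Only the lower bound of 0.05 comes from the recommendation above.

```python
def suggest_ratio(train_error, generalization_error, r,
                  threshold=0.5, step=0.5, r_min=0.05):
    """Return the ratio r to use for the next rebuild of the forest.

    threshold and step are hypothetical tuning constants; r_min is the
    smallest recommended value of r.
    """
    if generalization_error <= 0.0:
        return r  # no measurable generalization error, nothing to correct
    ratio = train_error / generalization_error
    if ratio < threshold:  # training error much smaller: signs of overfitting
        return max(r_min, r * step)
    return r


# Training error far below the out-of-bag error: decrease r and rebuild.
print(suggest_ratio(0.01, 0.20, r=0.66))
# The two errors are close: keep the current r.
print(suggest_ratio(0.18, 0.20, r=0.66))
```

In practice the helper would be called after each build, with the forest rebuilt whenever the returned value differs from the current r.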

The algorithm's other parameters are much easier to set. The recommended number of trees NTrees is 50 to 100, and the number NFeatures is chosen automatically and is about half of the total number of variables.

Training Set Format

The training set format is described in the paper recommended at the top of the page. That paper also deals with such problems as missing values and nominal variable encoding. Note that the dataset format depends on which problem, regression or classification, the model solves.

Nominal Variable Encoding

When the RDF algorithm is used, it is strongly recommended that you read the 'Nominal Variable Encoding' section and follow the recommended encoding exactly. The algorithm is optimized for this data representation scheme, although it can be used with any encoding (but less efficiently).

This article is licensed for personal use only.