
Random Forests


Definition

A random forest uses decision trees as building blocks: each tree is grown very deep and hence has high variance, and the forest averages the trees to reduce that variance. It also uses bootstrapping and randomization to decorrelate the trees. The procedure involves three key steps: data sampling, parallel training, and aggregation. Its key benefit is reduced variance, and hence improved accuracy, through ensemble learning. ref
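
To make the variance-reduction claim concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (none of the settings below come from the original reference), comparing one deep decision tree with a forest of averaged deep trees:

```python
# Minimal sketch: one deep tree vs. an average of many deep trees.
# Assumes scikit-learn; the synthetic dataset and settings are illustrative only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One deep, high-variance tree vs. an ensemble of averaged deep trees
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

print("single deep tree MSE:", mean_squared_error(y_test, deep_tree.predict(X_test)))
print("random forest MSE:   ", mean_squared_error(y_test, forest.predict(X_test)))
```

On a typical draw the forest's test error comes out noticeably lower than the single tree's, which is the variance reduction that the averaging step buys.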

Parameters to tune

Analytics Vidhya | Tuning random forest models (a tuning sketch follows the list):

  • max_features: the number of features considered when searching for each split
  • n_estimators: the number of trees; because of the bagging mechanism it should not be set too low, and setting it high does no harm apart from the computational cost
  • min_samples_leaf: the minimum number of samples a leaf node must contain, which stops further splitting; it prevents overfitting, and helps avoid underfitting when working with imbalanced data, although imbalanced data is normally better handled by gradient boosting because of the boosting mechanism
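
A minimal tuning sketch for these three parameters, assuming scikit-learn; the grid values are illustrative placeholders rather than recommendations from the cited post:

```python
# Minimal sketch: cross-validated grid search over the three parameters above.
# Assumes scikit-learn; the grid values are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

param_grid = {
    "max_features": ["sqrt", 0.3, None],   # features considered at each split
    "n_estimators": [100, 300, 600],       # more trees rarely hurt, only cost compute
    "min_samples_leaf": [1, 5, 20],        # larger leaves -> shallower, smoother trees
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```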

Randomization

In order to benefit from averaging, the individual trees should not be too correlated. This is achieved by injecting some randomness into the tree-growing process, which can be done in two ways.

  1. Bootstrap: each tree is grown on a bootstrap-resampled training data set.
  2. Split-variable randomization: each time a split is to be performed, the search for the split variable is limited to a random subset of $m$ of the $p$ variables. Typically, $m$ is set to $\sqrt{p}$ or $p/3$. When $m = p$, the procedure is called bagging, a.k.a. Bootstrap Aggregating. With $m = 1$, the split variable is completely random, so all variables get a chance. This decorrelates the trees the most, but can introduce bias, somewhat similar to that in ridge regression.
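
To see the decorrelation effect directly, here is a rough sketch, assuming scikit-learn and synthetic data, that compares the average pairwise correlation of the individual trees' predictions under bagging ($m = p$) versus split-variable randomization ($m = \sqrt{p}$):

```python
# Rough sketch: average pairwise correlation between the trees' test predictions,
# with and without split-variable randomization. Assumes scikit-learn; synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_tree_correlation(max_features):
    forest = RandomForestRegressor(n_estimators=100, max_features=max_features,
                                   random_state=0).fit(X_train, y_train)
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)                          # pairwise correlations between trees
    return corr[np.triu_indices_from(corr, k=1)].mean()

print("bagging (m = p):           ", mean_tree_correlation(None))   # all features per split
print("random forest (m = sqrt p):", mean_tree_correlation("sqrt"))
```

The correlation typically comes out lower with $m = \sqrt{p}$, which is exactly what makes the averaging step more effective.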

Algorithm

  1. Define $m$ and $B$.
  2. For $b = 1, 2, \dots, B$:
    1. Draw a bootstrap sample of size $n$ (sampling with replacement) from the training data.
    2. Grow a regression tree $\hat{r}_b(x)$ on the bootstrapped data, sampling $m$ of the $p$ features at random prior to making each split.
    3. Save the tree and the bootstrap sampling frequencies.
  3. Compute the random-forest fit at any prediction point $x_0$ as the average $\hat{r}_{rf}(x_0) = \frac{1}{B} \sum_{b=1}^{B} \hat{r}_b(x_0)$.
  4. Compute the $\mathrm{OOB}_i$ error (out-of-bag error) for each response observation $y_i$ in the training data using the fit $\hat{r}_{rf}^{(-i)}(x_i)$, where the superscript $(-i)$ indicates that the average is taken only over the trees whose bootstrap samples exclude the $i$th observation. The overall OOB error is the average of these $\mathrm{OOB}_i$. Since the excluded observations are a by-product of the bootstrap sampling, this delivers cross-validated error estimates at virtually no extra cost.
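
The following is a from-scratch sketch of these four steps, assuming NumPy and scikit-learn's DecisionTreeRegressor as the base learner; the function names and settings are mine, not from the reference:

```python
# From-scratch sketch of the four steps above, growing one deep tree per bootstrap sample.
# Assumes NumPy and scikit-learn; names like fit_random_forest are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, m, B, random_state=0):
    """Steps 1-2: grow B deep trees, each on a bootstrap sample, with m features per split."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees, in_bag = [], []
    for b in range(B):
        idx = rng.integers(0, n, size=n)                          # bootstrap sample of size n
        tree = DecisionTreeRegressor(max_features=m, random_state=b)  # m features tried per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        in_bag.append(np.bincount(idx, minlength=n))              # bootstrap sampling frequencies
    return trees, np.array(in_bag)

def predict_forest(trees, X0):
    """Step 3: average the B tree fits at the prediction points."""
    return np.mean([tree.predict(X0) for tree in trees], axis=0)

def oob_error(trees, in_bag, X, y):
    """Step 4: for each i, average only the trees whose bootstrap sample excludes i."""
    preds = np.array([tree.predict(X) for tree in trees])   # shape (B, n)
    oob_mask = (in_bag == 0)                                 # True where tree b never saw obs i
    r_minus_i = np.nanmean(np.where(oob_mask, preds, np.nan), axis=0)
    return np.mean((y - r_minus_i) ** 2)

# Illustrative usage on synthetic data
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
trees, in_bag = fit_random_forest(X, y, m=3, B=200)
print("OOB mean squared error:", oob_error(trees, in_bag, X, y))
```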

Note: If the response type is quantitative, we would fit a regression tree. If it is binary or multilevel qualitative, we would fit a classification tree.

Another interpretation of the algorithm

The random forest has been described as an adaptive nearest-neighbor estimator, adaptive in that it selects predictors. A k-nearest-neighbor estimate finds the k training observations closest in feature space to the target point $x_0$ and averages their responses. Each tree in the random forest drills down by recursive partitioning to pure terminal nodes, often consisting of a single observation. Hence, when evaluating the prediction from each tree, $\hat{r}_b(x_0) = y_l$ for some $l$, and for many of the trees this could be the same $l$. Across the whole collection of $B$ trees, the number of distinct $l$'s can be fairly small. Since the partitioning that reaches the terminal nodes involves only a subset of the predictors, the neighborhoods so defined are adaptive.
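
As a rough illustration of this view (assuming scikit-learn; the neighborhood extraction is my own sketch, not code from the reference), one can collect the training observations that share a terminal node with a query point and see how few distinct observations the forest actually averages over:

```python
# Rough sketch: the "adaptive neighborhood" of a query point x0 is the set of training
# observations that land in the same terminal node as x0 in at least one tree.
# Assumes scikit-learn; synthetic data for illustration only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

x0 = X[:1]                          # one query point for illustration
train_leaves = forest.apply(X)      # shape (n_train, B): terminal node index per tree
x0_leaves = forest.apply(x0)        # shape (1, B)

# Union over trees of the training indices sharing x0's terminal node
neighborhood = np.unique(np.where(train_leaves == x0_leaves)[0])
print("training points:", len(X), "| distinct neighbors used:", len(neighborhood))
```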

Variance Estimation

We can use an idea similar to step 4 of the algorithm to estimate the variance of a random-forest prediction, via the jackknife variance estimator.
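
A sketch of the basic jackknife-after-bootstrap form, reusing the leave-one-out fits $\hat{r}_{rf}^{(-i)}$ from step 4 of the algorithm (exact variants and bias corrections differ across references, so treat this as the general shape rather than the definitive formula):

$$ \widehat{\mathrm{var}}_{\text{jack}}\big(\hat{r}_{rf}(x_0)\big) = \frac{n-1}{n} \sum_{i=1}^{n} \Big( \hat{r}_{rf}^{(-i)}(x_0) - \hat{r}_{rf}(x_0) \Big)^2 $$

Each term compares the fit built only from trees whose bootstrap samples exclude observation $i$ against the full random-forest fit; in practice a Monte Carlo bias correction is often applied when $B$ is not much larger than $n$.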

Interpretation

We can use variable-importance plots to assess the relative importance of variables that are included in the ensemble.
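
A minimal sketch of such a plot, assuming scikit-learn and matplotlib; permutation importance is one common way to build it, though it is not necessarily the measure used in the reference:

```python
# Minimal sketch: two common variable-importance measures for a fitted forest.
# Assumes scikit-learn and matplotlib; synthetic data for illustration only.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# Impurity-based importances come for free; permutation importances use held-out data.
print("impurity-based importances:", forest.feature_importances_.round(3))
perm_imp = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

plt.barh(range(X.shape[1]), perm_imp.importances_mean)
plt.xlabel("mean permutation importance")
plt.ylabel("feature index")
plt.show()
```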

Technical details on averaging trees

  • A maximal-depth tree splits every node until it is pure, meaning all the responses in the node are the same. For very large $n$ this might be unreasonable; in practice, one can put a lower bound on the minimum count in a terminal node.
  • In the case of a classification tree, there are at least two strategies at the averaging stage: taking a plurality vote over the classifications made by each tree, or averaging the class probabilities across trees (both are sketched below).
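
A short sketch of the two aggregation strategies, assuming scikit-learn (whose RandomForestClassifier averages class probabilities internally, so the plurality vote is reconstructed manually from the individual trees):

```python
# Short sketch: plurality vote vs. averaged class probabilities for the same forest.
# Assumes scikit-learn; synthetic data for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Strategy 1: plurality vote over the hard classifications of the individual trees.
tree_votes = np.array([tree.predict(X_test) for tree in forest.estimators_]).astype(int)
vote_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, tree_votes)

# Strategy 2: average the class probabilities across trees, then take the argmax
# (this is what scikit-learn's predict() does internally).
proba_pred = forest.predict_proba(X_test).argmax(axis=1)

print("fraction of test points where the two strategies agree:",
      (vote_pred == proba_pred).mean())
```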

Readings

References

  • Efron, B., & Hastie, T. (2021). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.