
Decision Trees


Definition

It uses a simple but intuitively appealing technique to form a regression surface: recursive partitioning, which is used to fit a piecewise constant surface $\hat{r}(x)$ over the domain $\mathcal{X}$. It is totally nonparametric (Efron and Hastie, 2021).

Use cases

Trees are easy to interpret but, on the other hand, also easy to overinterpret, and they have a reputation for being unstable in practice, i.e. bad for explanation. Their discontinuous regression surfaces also disqualify them for estimation. So their principal use in what follows will be as key parts of prediction algorithms.

Trees are used as the foundation of random forests and boosting (ensembles of trees):

  • Random forest: Grow many deep regression trees to randomized versions of the training data, and average them. The randomization could be bootstrap sampling and/or subsampling of the observations or variables.
    • The basic mechanism is variance reduction by averaging. Each deep tree has a high variance.
  • Boosting: Repeatedly grow shallow trees to the residuals, and hence build up an additive model consisting of a sum of trees.
    • The basic mechanism is bias reduction, though different variants may also include some variance reduction.
  • Explanatory analysis: feature importances can easily be extracted from the tree structure.

Note: Boosting and random forests provide a terrific benchmark for how well a traditional parametrized model is performing: if they do much better, then the model is not doing well, which indicates that some important interactions are probably missing.
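
A minimal sketch of that benchmark idea (the synthetic data and model choices are assumptions for illustration): compare cross-validated scores of a linear model against a random forest and a boosting model.

```python
# Hedged sketch: if the ensembles score much better than the linear model,
# the parametrized model is probably missing structure such as interactions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0, 0.3, 1000)  # hidden interaction term

for name, model in [
    ("linear", LinearRegression()),
    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("boosting", GradientBoostingRegressor(random_state=0)),
]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {score:.2f}")  # the ensembles should beat the linear model here
```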

Common method to split a tree

Suppose at step $k$ of the algorithm, $\text{group}_k$ of $N_k$ cases remains to be split, those cases having mean $m_k$ and sum of squares $s^2_k$. Dividing $\text{group}_k$ into $\text{group}_{k,\text{left}}$ and $\text{group}_{k,\text{right}}$ produces means $m_{k,\text{left}}$ and $m_{k,\text{right}}$ and corresponding sums of squares $s^2_{k,\text{left}}$ and $s^2_{k,\text{right}}$. The algorithm proceeds by choosing the splitting variable $X_k$ and the threshold $t_k$ to minimize the combined sum of squares $s^2_{k,\text{left}} + s^2_{k,\text{right}}$ of the two groups. In other words, it splits $\text{group}_k$ into two groups that are as different from each other as possible.
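
An illustrative sketch of this greedy search for a single candidate feature (the data and helper function are made up; real implementations use faster incremental updates):

```python
# Scan thresholds on one feature and pick the split minimizing s2_left + s2_right.
import numpy as np

def best_split(x, y):
    """Return (threshold, combined_sse) minimizing the two groups' sums of squares."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid threshold between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = (x_sorted[i - 1] + x_sorted[i]) / 2, sse
    return best_t, best_sse

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x > 5, 3.0, 0.0) + rng.normal(0, 0.3, 200)
print(best_split(x, y))  # threshold should come out close to 5
```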

The most important techniques to know

How deep to grow the tree (when to stop splitting)

The algorithm uses cross-validation estimates of prediction error to determine when to stop splitting. If $\text{group}_k$ is not to be further divided, it becomes terminal node $R_k$, with prediction value $\hat{y}_{R_k} = m_k$.
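
For example, with scikit-learn the depth can be chosen by cross-validated prediction error (a sketch on assumed synthetic data):

```python
# Pick the tree depth with the lowest cross-validated prediction error.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, 500)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": range(1, 11)},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # depth with the best CV prediction error
```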

How much to prune it back
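
One common approach, sketched here as an assumption (scikit-learn-style cost-complexity pruning rather than any particular method intended above): grow a full tree and then choose the pruning strength ccp_alpha by cross-validation.

```python
# Hedged sketch of cost-complexity pruning with scikit-learn (assumed data):
# compute candidate alphas from the pruning path, then pick the best by CV.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 400)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
cv_scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())  # larger alpha -> smaller, more pruned tree
```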

Good and bad properties

Good properties:

  • Trees automatically select variables; only variables used in defining splits are in the model.
  • Tree-growing algorithms scale well to large n; growing a tree is a divide-and-conquer operation.
  • They handle mixed features (quantitative/qualitative) seamlessly, and can handle missing data.
  • Small trees are easy to interpret.

Bad properties:

  • Large trees are hard to interpret.
  • Single trees generally don’t have good predictive performance.

It is the best tool for tabular data, even better than deep learning

It performs poorly on extrapolation

It is not good at extrapolation, i.e. predicting outside the range of the training data, because tree-based models are piecewise constant approximations: any input beyond the training range falls into a boundary leaf and receives that leaf's constant prediction. However, it is often possible to reformulate the problem to avoid extrapolation through feature engineering, e.g. including lagged features, differencing, and using multi-output targets.
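
A small illustration of the piecewise-constant limitation (assumed synthetic data): a tree predicts a flat value outside the training range, while a linear model follows the trend.

```python
# Compare a tree and a linear model on inputs outside the training range [0, 10].
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(0, 0.5, 300)

tree = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_out = np.array([[15.0], [20.0]])  # outside the training range
print(tree.predict(X_out))          # both close to the boundary leaf's value (~20)
print(lin.predict(X_out))           # ~30 and ~40: follows the trend
```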

When you have a large sample size, you might not need so many assumptions. For example, you can estimate the distribution of $Y \mid X=2$ without using any assumptions of the classical regression model: find the subset of the data set having $X=2$, and then construct the histogram of the $Y$ variable using only the data in that subset. If there are hundreds of such observations where $X=2$, there is enough data to obtain a reasonable estimate of $p(y \mid X=2)$ from the histogram. (Westfall and Arias, Understanding Regression Analysis, 2020, p.471)
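
A tiny illustration of the subsetting idea (made-up data): estimate $p(y \mid X=2)$ directly from the observations with $X=2$, with no model assumptions.

```python
# Histogram estimate of p(y | X = 2) by subsetting the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=10_000)
y = 2 * X + rng.normal(0, 1, size=10_000)

subset = y[X == 2]                              # all observations with X = 2
counts, edges = np.histogram(subset, bins=20)   # histogram estimate of p(y | X = 2)
print(len(subset), subset.mean().round(2))      # enough data for a reasonable estimate
```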

Tree regression is a technique that uses the subsetting approach to estimate the conditional distributions. When there are no repeats of Y values for given X values, the tree regression algorithm divides the X data into intervals, where there are sufficient data values within the interval ranges to estimate the distributions. (Westfall and Arias, Understanding Regression Analysis, 2020, p.473)

The conditional distribution $p(y \mid X=x)$ for every single x in a common range is estimated to be exactly the same distribution. With tree regression, the estimated conditional distributions of Y are identical for all X values within a given subset range. The limitation that regression trees assume identical distributions within X ranges is balanced by the “common sense” appeal of estimating distributions by using actual data values, rather than by imposing some special form (normal, lognormal, T, Poisson, etc.). (Westfall and Arias, Understanding Regression Analysis, 2020, p.474)

When there is not enough data in a subset to obtain an adequate estimate of the distribution, the tree algorithm will perform poorly, because the subsetting estimate follows the actual data rather than making any assumptions. This is why the tree algorithm is not good at extrapolation.

It inherently includes interaction terms

It is believed that tree-based models consider variables sequentially, which makes them handy for capturing interactions without specifying them. Interactions that are useful for prediction will be picked up easily by a large enough forest, so there is no real need to include an explicit interaction term (example). However, if you believe that an interaction is important, you can manually create the interaction term, because an engineered feature will make it easier for the model to discover relationships in the data. Apart from this general belief, there is an article stating that interactions are masked by marginal effects and cannot be differentiated from marginal effects, such that variable importance measures are unable to detect them as interactions (paper). See also the XGBoost tutorials' discussion on feature engineering in tree models.
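
A hedged sketch of manually adding an interaction term as an engineered feature (the column names and data are made up for illustration):

```python
# Add an explicit interaction column so the model can use it directly.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
y = df["x1"] * df["x2"] + rng.normal(0, 0.1, 1000)   # target driven by an interaction

df["x1_x2"] = df["x1"] * df["x2"]                    # explicit interaction feature
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(df, y)
print(dict(zip(df.columns, model.feature_importances_.round(2))))
```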

It is robust to outliers and feature scales

Decision trees divide the feature space into regions, splitting on one feature at a time into left and right branches. A split therefore depends only on the ordering of the values, so trees are not affected by the scale of the features and are also robust to outliers.

However, this is not the case for the target variable, so all the feature engineering techniques that are usually required for the target variable are still needed.
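
A quick check of the scale-invariance claim for features (illustrative data): multiplying a feature by a positive constant is a monotone transformation, so the fitted splits separate the same samples and the predictions are unchanged.

```python
# Rescaling a feature leaves the tree's predictions unchanged.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, 300)

pred_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y).predict(X)
pred_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X * 1000, y).predict(X * 1000)
print(np.allclose(pred_raw, pred_scaled))  # True: scaling the feature changes nothing
```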

How to get predictions with uncertainty

confidence intervals vs prediction intervals

Confidence intervals

Prediction intervals

  • Fit two more models using a quantile regression forest (or another quantile-loss model) with alpha = 0.025 and 0.975, and use their predictions as the lower and upper bounds of the prediction interval.
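
A minimal sketch of the idea. The bullet above mentions a quantile regression forest (available in third-party packages); here scikit-learn's GradientBoostingRegressor with quantile loss is used as a readily available stand-in.

```python
# Lower/upper quantile models give the bounds of a ~95% prediction interval.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 500)

point = GradientBoostingRegressor(loss="squared_error").fit(X, y)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.025).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.975).fit(X, y)

X_new = np.array([[2.0], [7.5]])
print(point.predict(X_new))   # point prediction
print(lower.predict(X_new))   # lower bound of the prediction interval
print(upper.predict(X_new))   # upper bound of the prediction interval
```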

How to handle missing values
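
One readily available option, shown as a hedged sketch rather than a prescribed method: scikit-learn's HistGradientBoostingRegressor accepts NaN inputs natively, routing missing values to whichever child improves the loss during training.

```python
# No imputation step required: NaN values are handled inside the model.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
X[rng.random(500) < 0.2, 0] = np.nan                 # introduce 20% missing values
y = np.where(np.isnan(X[:, 0]), 5.0, X[:, 0]) + rng.normal(0, 0.2, 500)

model = HistGradientBoostingRegressor().fit(X, y)
print(model.predict(np.array([[np.nan, 3.0], [2.0, 3.0]])))
```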

Tree-based models vs neural networks

Random forests are good when the features are already somewhat informative, while neural nets are good when you have to learn the features during training as well.

Interpretations

  • dtreeviz: A Python library for decision tree visualization and model interpretation.
  • Permutation importance

Feature importances in decision trees are tricky

Standard feature importances simply tell you which features were more useful when building the model. They are not to be interpreted as a direct dependence between predictor and target.

  1. They are completely useless if the model has bad predictive power.
  2. They are strongly influenced by correlated features.
  3. They are biased towards numerical and high cardinality features.

However, permutation importances are computed at the validation stage, and therefore address the first, overfitting-related issue. Moreover, as they are computed on a metric of your choice, they are easier to interpret and can in some sense be seen as a “strength coefficient”, since they answer the question: “How much does the performance of my model degrade if I shuffle this predictor?”. ref

Nevertheless, features that are deemed of low importance for a bad model could be very important for a good model. It is always important to evaluate the predictive power of a model prior to interpreting feature importances. Permutation importance does not reflect the intrinsic predictive value of a feature by itself, but how important this feature is for a particular model. ref
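
A minimal sketch of computing permutation importance on held-out data with scikit-learn (the data and model here are illustrative assumptions):

```python
# Permutation importance on a validation split: drop in R^2 when a feature is shuffled.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 1000)   # only the first two features matter

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_val, y_val, scoring="r2", n_repeats=10, random_state=0)
print(result.importances_mean.round(3))   # larger values = bigger performance drop
```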

The second issue still exists: when two features are correlated and one of them is permuted, the model still has access to the same information through the correlated feature. This results in a lower reported importance value for both features, even though they might actually be important. For example, if two features are both strongly related to the target, they will always end up with a feature importance score of about 0.5 each, whereas one would expect both to score something close to one. One way to handle the issue is to cluster features that are correlated and keep only one feature from each cluster, or simply remove the upper triangle of the correlation matrix based on a threshold, or use SelectKBest with f_regression from sklearn. ref clustering notes
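
A hedged sketch of the clustering option (adapted from the approach shown in the scikit-learn documentation; the data and threshold are arbitrary): group features by Spearman rank correlation and keep one representative per cluster.

```python
# Cluster correlated features and keep one column per cluster.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0 + rng.normal(0, 0.05, 500), rng.normal(size=500)])

corr = spearmanr(X).correlation              # feature-by-feature rank correlation
dist = 1 - np.abs(corr)                      # correlated features -> small distance
linkage = hierarchy.ward(squareform(dist, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
keep = [int(np.flatnonzero(cluster_ids == c)[0]) for c in np.unique(cluster_ids)]
print(cluster_ids, keep)                     # the first two features share a cluster
```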

References

  • Efron, B., & Hastie, T. (2021). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.
  • Westfall, P. H., & Arias, A. L. (2020). Understanding Regression Analysis: A Conditional Distribution Approach. CRC Press.