UBC Theses and Dissertations


An empirical study of practical, theoretical and online variants of random forests
by David Matheson


Random forests are ensembles of randomized decision trees in which diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and simplicity has led to their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structured domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices, including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams, and using a fixed frontier for online variants. Our empirical study finds several trends, some of which contradict commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners, the most important findings are: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams.
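To make the split-selection comparison concrete, here is a minimal sketch (not from the thesis; function names and the candidate count `k` are illustrative assumptions) of the two strategies contrasted above: enumerating all midpoints between consecutive feature values versus drawing candidates uniformly at random over the feature's range.

```python
import random

def midpoint_candidates(values):
    """Candidate splits at every midpoint between consecutive
    sorted unique values of one feature (exhaustive strategy)."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def uniform_candidates(values, k, rng=random):
    """k candidate splits drawn uniformly at random over the
    feature's observed range (randomized strategy)."""
    lo, hi = min(values), max(values)
    return [rng.uniform(lo, hi) for _ in range(k)]

# Example: one feature column at a tree node.
feature = [1.0, 2.0, 4.0, 4.0, 7.0]
exhaustive = midpoint_candidates(feature)   # [1.5, 3.0, 5.5]
randomized = uniform_candidates(feature, k=3)
```

The exhaustive strategy never misses the best axis-aligned cut but costs a sort per feature per node; the randomized strategy trades some per-tree strength for speed and extra diversity across trees.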
Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.
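The structure/estimation split required by the consistency proofs mentioned above can be sketched as follows (a hypothetical illustration, not the thesis's implementation; the stream-assignment probability is an assumption): each data point is assigned to exactly one stream, so the points that shape the tree's partition are disjoint from the points that fit the leaf predictions.

```python
import random

def split_streams(points, p_structure=0.5, rng=random):
    """Partition data points into a structure stream (used only to
    choose split points) and an estimation stream (used only to fit
    leaf predictions), as consistency proofs require."""
    structure, estimation = [], []
    for x in points:
        if rng.random() < p_structure:
            structure.append(x)
        else:
            estimation.append(x)
    return structure, estimation
```

Because each tree effectively sees only half the data for each role, this requirement alone can account for much of the accuracy gap between provably consistent variants and the variants used in practice.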



Attribution-NoDerivs 2.5 Canada