An empirical study of practical, theoretical and online variants of random forests

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

An empirical study of practical, theoretical and online variants of random forests Matheson, David

Abstract

Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structure domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams and using a fixed frontier for online variants. Our empirical study finds several trends, some of which are in contrast to commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams. Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.

Item Metadata

Title	An empirical study of practical, theoretical and online variants of random forests
Creator	Matheson, David
Publisher	University of British Columbia
Date Issued	2014
Description	Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structure domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams and using a fixed frontier for online variants. Our empirical study finds several trends, some of which are in contrast to commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams. Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2014-04-25
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NoDerivs 2.5 Canada
DOI	10.14288/1.0167214
URI	http://hdl.handle.net/2429/46586
Degree	Master of Science - MSc
Program	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2014-09
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nd/2.5/ca/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

An empirical study of practical, theoretical and online variants of random forests Matheson, David

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights