An Empirical Study of Practical, Theoretical and Online Variants of Random Forests

by

David Matheson

BSCS, The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2014

© David Matheson 2014

Abstract

Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structured domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices, including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams, and using a fixed frontier for online variants. Our empirical study finds several trends, some of which are in contrast to commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy.
We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams. Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.

Preface

All work in this thesis was conducted in UBC's Laboratory for Computational Intelligence under the supervision of Professor Nando de Freitas. Chapter 3 is based on the work of Denil et al. [21] and Chapter 4 is based on the work of Denil et al. [20]. I am an author on both papers and parts of the text have been used with permission from the remaining authors: Misha Denil and Nando de Freitas.

For both papers I was responsible for the empirical results, which included implementing and running all experiments. I was also responsible for writing the experiment sections and generating all plots and figures. The algorithms and experiments were conceived by myself, Misha Denil and Nando de Freitas, and the remaining text and theoretical results are the work of Misha Denil.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Symbols and Notation
Acknowledgments

1 Introduction
  1.1 Prior work
  1.2 Online methods
  1.3 Contributions of this thesis
  1.4 Structure of this thesis

2 Random forest framework
  2.1 Walking a decision tree
  2.2 Leaf predictors
    2.2.1 Regression
    2.2.2 Classification
  2.3 Fitting a randomized tree
    2.3.1 Impurity measures
    2.3.2 Split criteria
  2.4 Candidate split point strategies
    2.4.1 Global uniform
    2.4.2 Node uniform
    2.4.3 All data points
    2.4.4 All gaps midpoint
    2.4.5 All gaps uniform
    2.4.6 Efficient computation of candidate split points
  2.5 Bagging
  2.6 Subsample at nodes
  2.7 Computation and memory complexity
    2.7.1 Computational complexity of testing
    2.7.2 Computational complexity of training
    2.7.3 Memory requirements
    2.7.4 Computational and memory requirements in practice
  2.8 Features
    2.8.1 Sparse random projections
    2.8.2 Class difference sparse projections
    2.8.3 Kinect features
  2.9 Accuracy diversity trade-off
    2.9.1 Ambiguity decomposition for regression
    2.9.2 Breiman's bound for classification

3 Consistent random forests
  3.1 Biau 2008
  3.2 Biau 2012
  3.3 Denil 2014
  3.4 Theory

4 Online random forests
  4.1 Sufficient statistics and split points
  4.2 Online bagging
  4.3 Memory management
  4.4 Saffari
  4.5 Denil 2013

5 Experiments
  5.1 Datasets and procedure
    5.1.1 Kinect Dataset
    5.1.2 Comparing variants
  5.2 Variants used in practice
    5.2.1 Bagging
    5.2.2 Selecting number of candidate features
    5.2.3 Selecting minimum child size
    5.2.4 All gaps midpoint versus node uniform candidate split points
    5.2.5 Subsample data points at node
    5.2.6 Other split point strategies
    5.2.7 Linear combination features
  5.3 Variants with provable consistency
    5.3.1 Comparison of Biau2008, Biau2012 and Denil2014
    5.3.2 Denil2014 design choices
  5.4 Online variants
    5.4.1 Memory management: fixed depth versus fixed frontier
    5.4.2 Split points: at data points versus global uniform
    5.4.3 Online bagging
    5.4.4 Split criteria: Mns + Mim versus α(d) + β(d) + Mim
    5.4.5 Structure and estimation streams
    5.4.6 Multiple passes through the data

6 Conclusion

Bibliography

Appendices

A Appendix of experiment figures
  A.1 All gaps midpoint with and without bagging
  A.2 Number of candidate features for different forest sizes
  A.3 Minimum child size for different forest sizes
  A.4 Error, individual strength and diversity for all gaps midpoints versus node uniform
  A.5 Computation and model size for all gaps midpoints versus node uniform
  A.6 Subsample data points at node
  A.7 Sort and walk split point variants
  A.8 Uniform split point variants
  A.9 Random and class difference projections
  A.10 Error, individual strength and diversity for Biau2008, Biau2012 and Denil2014
  A.11 Computation and model size for Biau2008, Biau2012 and Denil2014
  A.12 Sampling number of candidate features from a Poisson distribution
  A.13 Stream splitting
  A.14 Online: fixed frontier versus fixed depth
  A.15 Online: global uniform candidate split points versus at data points
  A.16 Online: negative effect of online bagging
  A.17 Online: Mns + Mim versus α(d) + β(d) + Mim
  A.18 Online: negative effect of structure and estimation streams on Denil2013
  A.19 Online: reduction in error from multiple passes

List of Tables

5.1 Overview of classification datasets with a train and test split
5.2 Overview of classification datasets which were evaluated with k-folds
5.3 Overview of regression datasets with a train and test split
5.4 Overview of regression datasets which were evaluated with k-folds
5.5 How often all gaps midpoint with bagging is better, tied or worse than all gaps midpoint without bagging. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.6 How often all gaps midpoint is better, tied or worse than node uniform with KS = 1 with respect to test error. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.7 How often all gaps midpoint is better, tied or worse than node uniform with KS = 1 with respect to training time. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.8 Comparison of subsampling 10, 100, 1000, ∞ data points at each node with respect to test error. On most datasets subsampling 100 data points achieves similar results to using all data points. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.9 How Breiman (all gaps midpoint), Denil2014, Biau2012 and Biau2008 compare to each other with respect to test error. Denil2014 is second best after Breiman. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.10 How often selecting candidate split points at the first KS data points is better, tied or worse than sampling candidate split points with global uniform for online random forests. The rest of the design choices are fixed to a fixed frontier, online bagging and Mns + Mim split criteria. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.11 A comparison of online forests with and without online bagging. The rest of the design choices are a fixed frontier with the first KS data points as candidate split points. The three numbers in each cell are the number of times the row beat/tied/lost to the column.
5.12 A comparison of Mns + Mim split criteria versus α(d) + β(d) + Mim split criteria across datasets. α(d) + β(d) + Mim split criteria is equivalent or better than Mns + Mim. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

List of Figures

1.1 Predictions for a classification forest with three trees where the forest prediction is the average of each tree's estimate of the posterior probability. The class with the highest averaged probability estimate is then chosen, which in this example is red.
2.1 Left: X traverses to the right if h(X, φj) > τj is true and to the left if it is false. The feature extractor parameter φj and threshold τj are specific to node j and are learnt when fitting the tree. Right: The path of X down a decision tree is highlighted in orange.
In this example the index of the leaf that X reaches is 15, so l(X) = 15.
2.2 Predictions for a regression forest with three trees where the forest prediction is the average of each tree.
2.3 Growing the first four nodes of a classification tree for the Iris dataset with KF = 2 and KS = 3. Each row visualizes three aspects of the node being split. On the left is the current tree structure, where the node being split is highlighted in orange, nodes waiting to be split are light grey and finalized nodes are dark grey. In the middle are the data points that arrive at the node being split. On the right are the data points projected onto the two candidate features that were randomly sampled, along with the three candidate split points represented by arrows. If there is a valid split, the best split point is highlighted orange and two children are added to the set of nodes waiting to split.
2.4 Relative impurities of several histograms and scatter plots. The impurity is highest on the left and decreases while moving right.
2.5 Three candidate split points sampled using the global uniform strategy for one candidate feature. The dark points are data points within the node being split and the light grey points are data points outside the node. The upward facing arrows are sampled candidate split points, the blue square brackets indicate the minimum and maximum of the range and the highlighted orange region indicates where candidate split points can be sampled.
2.6 Example of three candidate split points chosen with the node uniform strategy for one candidate feature. The dark points are data points within the node and the light grey points are data points outside the node.
The blue square brackets indicate the minimum and maximum of the range and the orange regions indicate where candidate split points can be sampled.
2.7 Example of all data points being candidate split points for a candidate feature. The dark points are data points within the node and the light grey data points are outside the node. With this deterministic strategy, there are large regions of the feature range that can never contain a split point.
2.8 Example of all gaps midpoint for one candidate feature. The dark points are data points within the node and the light grey data points are outside the node. With this deterministic strategy, there are large regions of the feature range that can never contain a split point.
2.9 Example of candidate split points generated by all gaps uniform for one candidate feature. The dark points are data points within the node and the light grey data points are outside the node. While this figure makes it appear as if there is a gap at each data point, this was just added to show the regions between data points where each candidate split point is sampled.
2.10 Example of efficiently computing the left and right sufficient statistics of one split point by updating an existing split point with the sufficient statistics of the data points between the two split points.
2.11 A balanced tree where each node is half the size of its parent. Notice that the sum of the sizes of all nodes at any level is less than or equal to N.
2.12 Class difference projection for data points x′ and x′′.
2.13 Depth image and corresponding body part labels for the challenging computer vision task of predicting the body part labels of each pixel in a depth image.
2.14 An example of the Kinect feature φ specified by two offsets u and v.
3.1 Example of a Biau2008 candidate split point for a candidate feature. A gap is selected uniformly and a candidate split is sampled uniformly within the gap. The dark points are data points within the node and the light grey data points are outside the node. The blue region is the gap where this particular split point was sampled and the orange regions are other gaps that could have been sampled.
3.2 Examples of Biau2012 candidate split points as the midpoint of the current range for one candidate feature. When a candidate feature is selected as the best split, the range is halved and future splits are based on the left or right region. This is depicted with three examples, each of which is the same feature being split. The dark points are data points within the node and the light grey data points are outside the node.
3.3 Example of Denil2014 candidate split points for a candidate feature where Knr = 3. The three downward pointing arrows indicate the selection of data points used to determine the range. All gaps midpoints within the range are selected. The dark points are data points within the node and the light grey data points are outside the node.
4.1 An example of selecting three candidate split points, KS = 3, with global uniform for a single candidate feature and updating the sufficient statistics as seven data points arrive.
4.2 An example of candidate split points being selected at data points as they arrive for a single candidate feature. Data points that arrive before a split point is created are not included in the split point sufficient statistics.
4.3 An example with structure and estimation streams with candidate split points being selected at structure points for a single candidate feature.
Each data point is assigned a stream, S or E, which determines the set of sufficient statistics that are updated. Only structure points, S, can create a new candidate split point.
5.1 Left: Depth image with a candidate feature specified by the offsets u and v. Center: Body part labels. Right: Left hand joint predictions (green) made by the appropriate class pixels (blue).
5.2 Top: The red and blue variants are equal at the minimum because the standard deviation overlaps the mean at each variant's minimum. Bottom: The red variant is better than the blue variant because the standard deviation does not overlap the mean at each variant's minimum.
5.3 Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the vowel dataset. The plot is test error with respect to number of candidate features (KF). Notice that for most settings of KF, the performance of bagging is better.
5.4 Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the letter dataset. The plot is test error with respect to number of candidate features (KF). This plot is indicative of a typical classification dataset where all gaps midpoint without bagging is better than all gaps midpoint with bagging for most settings of KF.
5.5 Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the diabetes dataset. The plot is test error with respect to number of candidate features (KF). With bagging the test error remains relatively flat as the number of candidate features increases, which is a common pattern for regression datasets.
5.6 A classification and regression example of balancing individual error with diversity to achieve the lowest test error. Top: Test error, Breiman strength and Breiman correlation for the satimage dataset. As the number of candidate features (KF) increases, the strength and correlation of each tree also increase. Bottom: Test MSE, individual MSE and ambiguity MSE for the housing dataset. As the number of candidate features (KF) increases, the individual MSE and ambiguity MSE both decrease.
5.7 Test MSE, individual MSE and ambiguity MSE for the abalone dataset. Unlike the typical scenario, the ambiguity MSE actually increases with respect to number of candidate features for KF < 3 and the individual MSE decreases with respect to number of candidate features for KF > 3. This indicates a more complex relationship between number of candidate features and the generalization error for some datasets.
5.8 The test error for the letter dataset for different forest sizes (T). Notice that the best number of features to sample at each node decreases as the forest size increases, up until 25 trees.
5.9 Top: The test error for the wine dataset with respect to minimum child size (Mcs) for different size forests. As the forest size increases, the slope of the line increases until a minimum child size of one achieves the lowest error. Bottom: The test error for the diabetes dataset with respect to minimum child size (Mcs) for different size forests. As the forest size increases, the slope of the line approaches zero, showing that the sensitivity of the test error with respect to the minimum child size decreases as the number of trees increases.
5.10 All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the letter dataset.
As the number of split points increases, node uniform converges to all gaps midpoint. For this dataset the extra diversity created from node uniform with KS = 1 outweighs the loss of individual strength and results in a lower overall test error. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.11 All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the satimage dataset. As the number of split points increases, node uniform converges to all gaps midpoint. For this dataset the diversity created from node uniform with KS = 1 does not outweigh the loss of individual strength and the overall test error is higher for node uniform with KS = 1. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.12 All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the satimage dataset. Node uniform with KS = 1 produces the largest model and all gaps midpoint produces the smallest. The test times are correlated with model size, so forests trained with node uniform have higher test times. For this dataset all gaps midpoint has the lowest training time. Top: Model size with respect to KF. Middle: Test time with respect to KF. Bottom: Train time with respect to KF.
5.13 All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the usps dataset. Node uniform with KS = 1 produces the largest model and all gaps midpoint produces the smallest. The test times are correlated with model size, so forests trained with node uniform have higher test times. For this dataset node uniform with KS = 1 has the lowest training time. Top: Model size with respect to KF. Middle: Test time with respect to KF.
Bottom: Train time with respect to KF.
5.14 Subsampling 10, 100 and 1000 data points with all gaps midpoint candidate split points for the mnist dataset. With Kss = 100 the individual strength does drop; however, the correlation also drops, which results in an overall test error that is comparable to using all the data points. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.15 The test error with respect to number of candidate features to sample at each node (KF) for the letter dataset for node uniform and global uniform candidate split points. For 100 split points (KS = 100) global uniform does not affect performance; however, for 1 split point (KS = 1) global uniform appears to be shifted right so that for the same number of candidate features it performs equal or worse.
5.16 Test error with respect to KF for random projections, class difference projections and axis aligned features for the usps dataset. Class difference projections do better than random projections, and at KF = 100 node uniform with class difference projections begins to outperform axis aligned node uniform.
5.17 The strength and correlation for the usps dataset with respect to KF for random projections, class difference projections and axis aligned features. Class difference projections have much higher individual strength but also have higher correlation, while random projections have lower strength and lower correlation. This trend holds for all datasets and the best generalization error depends on the interaction between the two. Top: Breiman strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.18 Test error with respect to KF for random projections, class difference projections and axis aligned features for the pendigits dataset. For this dataset, both class difference projections and random projections outperform standard axis aligned features.
5.19 Comparison of Biau2008, Biau2012, Denil2014 and Breiman (all gaps midpoint) for the satimage dataset. This dataset follows the typical ordering. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.20 Comparison of Biau2008, Biau2012, Denil2014 and Breiman (all gaps midpoint) for the gisette dataset. For this dataset Biau2012 outperforms Denil2014. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.
5.21 All gaps midpoint compared with all gaps midpoint where the number of candidate features is sampled from a Poisson distribution for the letter dataset. Sampling the number of candidate features from a Poisson smooths test error relative to KF near the ends of the range, but near the minimum they are equivalent.
5.22 Comparison of Denil2014 with structure and estimation streams per tree, Denil2014 with one stream and standard all gaps midpoint for the usps dataset. All gaps midpoint and Denil2014 with one stream are equivalent.
5.23 Comparison between fixed depth, fixed frontier and offline all gaps midpoint for pendigits. Managing memory with a fixed frontier achieves significantly lower test error.
5.24 Candidate split points at the first KS data points versus global uniform for the letter dataset. The other design choices are constrained to a fixed frontier, online bagging and Mns + Mim split criteria. Offline all gaps midpoint is included on the plot as a baseline.
In this example using the first KS data points as candidate split points outperforms global uniform. This represents the typical scenario.
5.25 Candidate split points at the first KS data points versus global uniform for the satimage dataset. The other design choices are constrained to a fixed frontier with online bagging, and offline all gaps midpoint is included on the plot as a baseline. In this example global uniform outperforms using the first KS data points.
5.26 Error relative to number of candidate features for online with (red) and without (blue) bagging on the gisette dataset. Offline all gaps midpoint (green) is also included as a baseline. This demonstrates that online bagging hurts performance, which holds across most datasets.
5.27 Comparison between Mns + Mim split criteria (red), α(d) + β(d) + Mim split criteria (blue) and Mcs split criteria for offline (green) for the usps dataset.
5.28 Comparison between Denil2013 with and without splitting each data point into structure and estimation streams, along with offline all gaps midpoint as a baseline, for the pendigits dataset. Denil2013 without stream splitting is significantly closer to offline all gaps midpoint. In this example Denil2013 has a fixed frontier, uses the first KS data points as candidate split points and samples the number of candidate features from a Poisson at each node.
5.29 Different numbers of passes through the data (p) for a fixed frontier, KS data points as split points and Poisson number of candidate features, along with offline all gaps midpoint as a baseline, for the pendigits dataset. Online with 10 passes does as well as offline all gaps midpoint.
Symbols and Notation

A — subset of data points at a node being split
A′ — subset of data points for the left path of a candidate split
A′′ — subset of data points for the right path of a candidate split
D — the dataset of all X, Y data points
h(X, φ) — feature value for data point X and feature φ
Iφ,τ — impurity value of feature φ and split point τ
j — the index of a node in a tree
Kdr — growth rate of the number of data points required to split with respect to depth for Denil2013
KF — number of candidate features to try at each node
Knr — number of data points to sample to create the valid range for Denil2014
KP — size of subspace for linear projections of features
KS — number of candidate splits to try for each candidate feature
Kss — number of data points to subsample when computing the best split for each node
lt(x) — leaf reached by x for tree t
Lj(D) — subset of data points in D which reach leaf j
Maf — maximum number of active leaves in the frontier of an online random forest
Mcs — minimum size of both child nodes for a split to be valid
Md — maximum depth a node can be for a split to be valid
Mim — minimum gain in impurity for a split to be valid
ML — maximum number of leaves to create for Biau2008 and Biau2012
Mns — minimum size of a node for a split to be valid
N — number of data points in the dataset
τ — split point used to determine if the test of a node is true or false (also referred to as the threshold)
φ — feature parametrization
Ȳtj — prediction of leaf j for tree t

Acknowledgments

I would like to express my utmost gratitude to many people, without whom this thesis would not have been possible. First and foremost, my supervisor Nando de Freitas for his invaluable guidance and support throughout my graduate studies. I am in awe of his unbelievable passion for machine learning research and I am grateful for the freedom he gave me to explore my own ideas and find my own way. Much of this work was done in collaboration with Misha Denil and I was privileged to have worked so closely with such a brilliant person.
I am particularly thankful for his patience while explaining the theoretical foundations required for this research. I'd also like to thank Ziyu Wang, Matt Hoffman, Bobak Shahriari, Mareija Eskelinen and Masrour Zoghi for their help and ideas. Finally, I would like to thank my wife for her continuous support, my father for inspiring me to pursue the sciences at a young age and the rest of my family who were so critical in helping me become the person I am today.

Chapter 1

Introduction

Random forests are ensembles of decision trees commonly used for classification and regression. Unlike in boosting [41], where the base models are trained and combined using a sophisticated weighting scheme, each tree of a random forest is trained in isolation and the predictions are typically combined by averaging or max vote. In order to increase diversity among the independently trained base learners, randomness is injected into the construction of each tree. Increasing the randomness will cause the strength of each tree to decrease while the diversity among trees will increase. The accuracy of the entire forest is a complex function of the data and the types and amount of randomness injected into the construction of each tree.

This thesis will focus on random forests for supervised learning. In the supervised learning scenario, the dataset typically consists of pairs X, Y where X ∈ R^{D_X} is a vector. The dimension of the vector X, D_X, is often referred to as the number of features or attributes. More formally, the dataset is represented as $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$ where N is the size of the dataset. During training, which is also known as fitting, a model, in our case a random forest, is fit to the dataset D. The trained model can then make predictions for Y given a new X. If Y is a discrete category, Y ∈ {1, 2, . . . , C}, then the problem is defined as classification and error is typically measured as 0-1 loss.
If Y ∈ R^{D_Y} then the problem is defined as regression and error is typically measured as mean squared error.

Despite there being numerous variants of random forests, the core decision tree is the same across all variants. Each decision tree is a hierarchy of binary tests where a data point X is recursively evaluated on each test until reaching a leaf. The test in each node is typically axis-aligned; that is to say, it evaluates whether the jth feature of X is greater than some threshold τ. If the test is true, the right child is taken; otherwise the left child is taken. Each leaf contains a model that predicts Y. During training, the best feature j and threshold τ are chosen from a set of candidates with a purity function. It is common to call the threshold τ a split point during training because it splits the data points into left and right subsets.

The most common choice of predictor in each leaf is a class histogram for classification and the average response for regression.

Figure 1.1: Predictions for a classification forest with three trees where the forest prediction is the average of each tree's estimate of the posterior probability. The class with the highest averaged probability estimate is then chosen, which in this example is red.

Figure 1.1 visualizes a classification forest averaging predictions for a single data point. Criminisi et al. [18] explore the use of several different leaf predictors for regression and other tasks, but these generalizations are beyond the scope of this thesis.

1.1 Prior work

Random forests [10] were originally conceived as a method of combining several CART [12] style decision trees using bagging [8]. Their early development was influenced by the random subspace method of Ho [29], the approach of random split selection from Dietterich [22] and the work of Amit and Geman [2] on feature selection. Several of the core ideas used in random forests were also present in the early work of Kwok and Carter [32] on ensembles of decision trees.
The motivation for using ensembles of randomized decision trees is the inherent instability of individual trees and their low computational requirements. While randomness is injected into Breiman's original algorithm by bagging and subsampling candidate features at each node, the best candidate split point is chosen by optimizing over all possible candidate split points.

The extremely randomized trees of Geurts et al. [27] extend Breiman's random forests [10] by injecting extra randomness into the tree construction through sampling candidate split points uniformly. This extra randomness, combined with not using bootstrap samples for each tree, was shown to improve the accuracy of the forest and is used by many practitioners.

Criminisi et al. [18] generalized random forests through a framework that supports density estimation and semi-supervised learning in addition to classification and regression. This framework abstracts away the details of selecting candidate split points and assumes some mechanism for sampling entire decisions (i.e. both the candidate feature and the candidate split point). The authors argue that without bagging, and by sampling candidate features and split points uniformly, their model obtains max-margin properties. By leaving out other candidate split point selection strategies they choose to focus on more advanced leaf predictors, such as linear models for regression, using random forests for density estimation, manifold learning and semi-supervised learning, as well as introducing more complex feature shapes. This paper is a great introduction to random forests and is recommended to any reader new to the subject. To make this thesis easier to digest we have adopted much of their notation.

Rodríguez et al. [39] extended axis-aligned decision trees to linear projections of features where the weights are found by applying PCA to a subspace of features at each node being split.
This was shown to improve predictive accuracy on datasets with features that are highly correlated, but requires significantly more computation for training. More recently, Menze et al. [34] explored learning optimal projections by using linear discriminative models to learn the feature weights for a candidate split. Their model achieves lower generalization error when features are highly correlated but does worse when this is not the case.

Empirically focused papers describe other elaborate extensions to the basic random forest framework, adding domain specific refinements which push the state of the art in performance [35, 42, 43, 47, 48]. Random forests have been applied to great effect in a wide variety of fields including computer vision, medical imaging, medicine, chemistry, natural language processing, ecology, fault diagnosis and fraud detection [15–17, 19, 37, 43–45].

In contrast, theoretical papers focus on simplifications of the standard framework where analysis is more tractable. The analysis is typically to prove consistency of the model. Notable contributions in this direction are the recent papers of Biau et al. [4], Biau [3], Denil et al. [20] and Denil et al. [21].

1.2 Online methods

In addition to offline methods, several researchers have focused on building online versions of random forests. Online models are attractive because they do not require that the entire training set be accessible at once. These models are appropriate for streaming settings where training data is generated over time and should be incorporated into the model as quickly as possible. Several variants of online decision tree models are present in the MOA system of Bifet et al. [6].

The primary difficulty with building online decision trees is their recursive structure. Data encountered once a split has been made cannot be used to correct earlier decisions. A notable approach to this problem is the Hoeffding tree [23] algorithm, which works by maintaining several candidate splits in each leaf.
The quality of each split is estimated online as data arrive in the leaf, but since the entire training set is not available these quality measures are only estimates. The Hoeffding bound is employed in each leaf to control the amount of data which must be collected to ensure that the split chosen on the basis of these estimates is the true best split with high probability. Domingos and Hulten [23] prove that under reasonable assumptions the online Hoeffding tree converges to the offline tree with high probability. The Hoeffding tree algorithm is implemented in the system of Bifet et al. [6].

Alternative methods for controlling tree growth in an online setting have also been explored. Saffari et al. [40] use the online bagging technique of Oza and Russell [36] and control leaf splitting using two parameters in their online random forest. One parameter specifies the minimum number of data points which must be seen in a leaf before it can be split, and another specifies a minimum quality threshold that the best split in a leaf must reach. This is similar in flavor to the technique used by Hoeffding trees, but trades theoretical guarantees for more interpretable parameters.

One active avenue of research in online random forests involves tracking non-stationary distributions, also known as concept drift. Many of the online techniques incorporate features designed for this problem [1, 5, 7, 24, 40]. However, tracking of non-stationarity is beyond the scope of this thesis.

1.3 Contributions of this thesis

In this thesis we present a generalized framework which describes the algorithms of Breiman [10], Geurts et al. [27], Biau et al. [4], Biau [3], Saffari et al. [40], Denil et al. [20] and Denil et al. [21]. Two of the variants are commonly used in practice, four are proven to be consistent and two are online. Every variant selects candidate split points using a different strategy.
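Returning to the Hoeffding tree mechanism of Section 1.2: for a statistic with range $R$ averaged over $n$ observations, the Hoeffding bound guarantees that the empirical mean is within $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ of the true mean with probability at least $1 - \delta$, so a leaf is split once the observed gain gap between its best and second-best candidates exceeds $\epsilon$. The following is a minimal sketch of that rule with our own function names and default $\delta$, not the MOA or thesis implementation:

```python
import math

def hoeffding_radius(value_range: float, n: int, delta: float) -> float:
    # epsilon such that the empirical mean of n observations of a variable
    # with range `value_range` is within epsilon of the true mean with
    # probability at least 1 - delta
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 value_range: float, n: int, delta: float = 1e-6) -> bool:
    # split only once the observed gap between the two best candidate
    # splits exceeds the Hoeffding radius for the data seen so far
    return (best_gain - second_gain) > hoeffding_radius(value_range, n, delta)
```

The radius shrinks as $O(1/\sqrt{n})$, so a leaf that keeps receiving data eventually commits to a split unless the top two candidates are genuinely tied.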
In this thesis we present and empirically study the effect of the different split point strategies on generalization error, individual error, diversity, computation cost and model complexity. We also investigate other modifications to understand what combination of design choices is optimal. For practical variants this includes bagging, subsampling data points at the node, random linear combinations of features and sampling linear combinations of features by projecting along the line between two data points of different classes. For theoretical variants this includes sampling the number of candidate features from a Poisson distribution and splitting data points into structure and estimation streams. For online variants we examine applying a fixed frontier versus a fixed tree depth, online bagging and the splitting conditions of Saffari et al. [40] and Denil et al. [20].

We discover two trends that are counter to the common beliefs regarding random forests. First, bagging rarely helps Breiman's algorithm [10] and much of the reported improvement from Geurts et al. [27] can be accounted for by not using bagging. Second, increasing the number of candidate split points for Geurts et al. [27] only improves accuracy if Breiman's algorithm already achieves higher accuracy. In this situation, the extra split points never beat Breiman's algorithm and they significantly increase the training time. When Geurts et al. [27] does do better than Breiman's algorithm, using one candidate split point always achieves the highest accuracy, which conveniently requires the least amount of training time. For the variants with proofs of consistency, we show that splitting data points into structure and estimation streams accounts for most of the gap between variants used in practice and those studied by theoreticians.
For online variants we show that online bagging hurts performance and that sampling split points at data points, instead of uniformly, actually improves accuracy on many datasets.

While both this thesis and Criminisi et al. [18] present a framework of random forests, the emphasis is quite different. This thesis focuses on supervised learning variants with different strategies for selecting candidate split points while Criminisi et al. [18] focuses more on unsupervised learning and does not discuss different candidate split point strategies. The following are the key contributions of this thesis:

1. a framework of practical, theoretical and online random forest variants,
2. an exhaustive empirical study of variants, including ten different split point selection strategies, across different parameter settings,
3. the first empirical study of variants with provable consistency,
4. a variant with provable consistency which is closer to Breiman in terms of the algorithm itself and its generalization error,
5. a strategy to reduce computational cost without decreasing accuracy by subsampling data points at each node,
6. a novel approach for managing the memory for an online random forest using a fixed frontier, and
7. a feature shape that uses the projection between two data points of different classes.

1.4 Structure of this thesis

The remainder of this thesis is structured as follows:

• In Chapter 2 we introduce a random forest framework and present the work of Breiman [10] and Geurts et al. [27] in terms of the framework. We present subsampling data points at each node, suggest candidate linear projections from data points of different classes and formalize the trade-off between diversity and individual strength.

• In Chapter 3 we present three variants that are provably consistent in terms of the framework from Chapter 2. Two of these are the work of Biau [3, 4] while the third is an algorithm that we developed [21] which is closer to Breiman in terms of the algorithm itself and its predictive accuracy.
Along with the other alterations, we explain splitting data points into structure and estimation streams and why this is required by the theory.

• In Chapter 4 we extend the framework from Chapter 2 to support online random forests and we present our work [20] for managing memory with a fixed frontier. We present the candidate split point selection strategies and splitting conditions for Saffari et al. [40] and Denil et al. [20].

• In Chapter 5 we report test error, individual strength, diversity, computation time and model complexity for all variants and parameter configurations across 14 classification datasets and 11 regression datasets. This chapter is broken down into experiments for Chapter 2, Chapter 3 and Chapter 4.

• In Chapter 6 we conclude by suggesting to practitioners which variants and parameter settings should be evaluated, as well as presenting challenges to theoreticians for completely closing the gap between variants used in practice and variants with provable consistency.

Chapter 2

Random forest framework

In this chapter we present a random forest framework for classification and regression. We begin with a brief review of making predictions with classification and regression trees. We then present the process for fitting a randomized tree to data. We adopt a very similar notation to Criminisi et al. [18] but we delve into more detail regarding the selection of candidate split points. We present five strategies for selecting candidate split points and examine the advantage of subsampling data at each node. We then examine the computational and memory requirements and introduce metrics for measuring diversity to help understand the strength and diversity trade-off.
This includes the ambiguity decomposition of Krogh and Vedelsby [31] for regression and Breiman's original work [10] which bounds the generalization error of a classification forest in terms of the strength and correlation of each tree in the ensemble.

2.1 Walking a decision tree

A decision tree is a hierarchy of binary tests which determine the path of a data point X to a leaf. We define l(x) as the function mapping x to the index of the leaf that x lies within. The test at each node is of the form h(x, φ) > τ where φ is the parameterization of the feature that is extracted from x and τ is a threshold. Figure 2.1 illustrates the test at a single node and visualizes the path of a data point X from the root to the corresponding leaf. In this example the index of the leaf is 15 so l(X) = 15.

In the common case of axis-aligned splits, φ is the index of the feature dimension. Represented formally, h(x, φ) = π_φ x where π_φ is a projection onto the φth dimension. For now we only consider axis-aligned splits but we will explore more complex feature shapes in Section 2.8.

2.2 Leaf predictors

Every leaf has a model for predicting Y.

Figure 2.1: Left: X traverses to the right if h(X, φ_j) > τ_j is true and to the left if it is false. The feature extractor parameter φ_j and threshold τ_j are specific to node j and are learnt when fitting the tree. Right: The path of X down a decision tree is highlighted in orange. In this example the index of the leaf that X reaches is 15 so l(X) = 15.

2.2.1 Regression

For regression, E(Y | X) is typically estimated with a constant prediction Ȳ_j for leaf j which is the average of Y for all training data points, D′, that lie in leaf j.
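The traversal rule of Section 2.1 and the forest averaging of this section can be sketched together as follows. This is a minimal illustration with our own Node structure and function names, not the thesis code; each leaf carries only a constant regression prediction:

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

@dataclass
class Node:
    feature: int = 0              # phi: index of the tested feature dimension
    threshold: float = 0.0        # tau: the split point
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    leaf_value: float = 0.0       # constant leaf prediction (regression)

def walk(node: Node, x: Sequence[float]) -> Node:
    # follow the axis-aligned tests, going right when x[phi] > tau and
    # left otherwise, until a leaf is reached
    while node.left is not None and node.right is not None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node

def forest_predict(trees: List[Node], x: Sequence[float]) -> float:
    # regression forest: average the constant leaf predictions over trees
    return sum(walk(t, x).leaf_value for t in trees) / len(trees)
```

With three trees whose reached leaves predict 7.0, 8.0 and 9.6, `forest_predict` returns the averaged 8.2 of the Figure 2.2 example.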
We define the function L_j(D′) as the subset of data points in D′ that reach leaf j:

\[ L_j(\mathcal{D}) = \{(X_i, Y_i) \mid X_i \in \mathcal{D} \wedge l(X_i) = j\} \]

\[ \bar{Y}_j = \frac{1}{|L_j(\mathcal{D}')|} \sum_{Y_i \in L_j(\mathcal{D}')} Y_i \]

For regression, the predictions of each tree in the forest are combined with simple averaging:

\[ \bar{f}(x) = \frac{1}{T} \sum_{t=1}^{T} \bar{Y}^t_{l(t,x)} \]

In the context of a forest we define Ȳ^t_j as the prediction of leaf j of tree t and l(t, x) as the map from data point x and tree index t to the leaf index. Figure 2.2 visualizes combining the predictions of a regression forest for a single data point. In this example the forest's prediction is 8.2.

Figure 2.2: Predictions for a regression forest with three trees where the forest prediction is the average of each tree.

2.2.2 Classification

For classification the posterior probability, P(y = c | x), is estimated for each leaf with a normalized histogram. Ȳ_{j,c} is an estimate of the probability of class c for leaf j:

\[ \bar{Y}_{j,c} = \frac{1}{|L_j(\mathcal{D}')|} \sum_{Y_i \in L_j(\mathcal{D}')} \mathbb{I}\{Y_i = c\} \]

The posterior probability of the forest is then computed by simple averaging and the class with the highest probability is selected. Following the same convention as the previous section, the tree is indexed with t:

\[ \bar{f}(x) = \arg\max_{c \in C} \frac{1}{T} \sum_{t=1}^{T} \bar{Y}^t_{l(t,x),c} \]

Figure 1.1 visualizes combining the predictions of a classification forest for data point X. In this example the most likely class is red.

Another approach, called max-vote, only stores the most likely class for each leaf. Each tree in the forest then votes for one class and the class with the most votes is chosen. Max vote is less stable but requires each leaf to store only a class index instead of a full histogram.

2.3 Fitting a randomized tree

Each tree is fit to the data in isolation, where the fitting process can be summarized with two steps. First, the data set D′ used to train the tree is generated and a corresponding root node is created.
Second, nodes are recursively split until there are no nodes left to split.

The node splitting procedure selects candidate features, and for each candidate feature it selects candidate split points. The impurity gain of each candidate split point for each candidate feature is measured, and the pair with the highest impurity gain that also meets the splitting criteria is selected as the best split. If there is a valid best split, the two new children are then added to the frontier of nodes to split. Figure 2.3 visualizes this recursive process where each row captures a node being split, the data points that arrive at that node, the candidate features and split points that are sampled and the best feature and split point pair that is selected. Impurity gain is calculated from an impurity metric which typically measures the reduction in uncertainty resulting from a node being split into two children.

The five design decisions when training a randomized tree are the following:

1. how to select data points used to train a tree,
2. how to select candidate features at each node,
3. how to select candidate split points for each candidate feature,
4. the impurity measure to determine the best feature and split point pair, and
5. the criteria for determining if a split is valid.

Randomness can be injected when selecting data points, candidate features and candidate split points. In Section 2.5 we will explore other strategies for selecting data points and in Section 2.4 we will cover five strategies for selecting candidate split points. For now, assume each tree is trained on the entire dataset and split points are sampled uniformly within the range of each feature.

Candidate axis-aligned features are typically selected by sampling K_F component indices without replacement.
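This feature selection step amounts to a single draw without replacement. A minimal sketch, with our own function name and K_F as in the notation table:

```python
import random
from typing import List

def select_candidate_features(num_features: int, k_f: int,
                              rng: random.Random) -> List[int]:
    # sample K_F distinct feature indices (axis-aligned candidates)
    # without replacement from the D_X feature dimensions
    return rng.sample(range(num_features), k_f)
```

Passing an explicit `random.Random` instance keeps each tree's randomness reproducible and independent of the other trees in the forest.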
K_F is a parameter of the algorithm and in general a smaller K_F results in weaker but more diverse trees and a larger K_F results in stronger but less diverse trees.

The process of growing a randomized tree by recursively selecting the best candidate feature φ′ and candidate split point τ′ is visualized in Figure 2.3 and outlined in more detail in Algorithm 1 and Algorithm 2.

Figure 2.3: Growing the first four nodes of a classification tree for the Iris dataset with K_F = 2 and K_S = 3. Each row visualizes three aspects of the node being split. On the left is the current tree structure where the node being split is highlighted in orange, nodes waiting to be split are light grey and finalized nodes are dark grey. In the middle are the data points that arrive at the node being split. On the right are the data points projected onto the two candidate features that were randomly sampled, along with the three candidate split points represented by arrows. If there is a valid split, the best split point is highlighted orange and two children are added to the set of nodes waiting to split.

Algorithm 1: BestRandomizedSplit
Input: The set of data points A which are within the node being split
Output: A tuple (v, φ, τ) where v is whether the split is valid, φ is the feature parameter and τ is the split point

    S ← {}
    /* Generate candidate splits and their impurity values */
    features ← SelectCandidateFeatures(A)
    foreach φ ∈ features do
        splitpoints ← SelectCandidateSplitpoints(A, φ)
        foreach τ ∈ splitpoints do
            (A′_{φ,τ}, A′′_{φ,τ}) ← split A with h(X_i, φ) > τ such that X_i ∈ A
            I_{φ,τ} ← I(A, A′_{φ,τ}, A′′_{φ,τ})
            S ← S ∪ {(φ, τ, I_{φ,τ}, A′_{φ,τ}, A′′_{φ,τ})}
    /* Select the best valid split */
    S′ ← filter S with ValidSplit(I_{φ,τ}, A′_{φ,τ}, A′′_{φ,τ})
    if S′ is not empty then
        v ← "Valid"
        (φ′, τ′) ← select the φ and τ for which I_{φ,τ} is maximal
    else
        v ← "Invalid"
        (φ′, τ′) ← (0, 0)
    return (v, φ′, τ′)

2.3.1 Impurity measures

The best candidate feature φ′ and candidate split point τ′ are chosen by maximizing the impurity gain I(A, A′_{φ,τ}, A′′_{φ,τ}) with respect to some impurity measure H(Z), such that A is split into A′_{φ,τ} and A′′_{φ,τ} by φ′ and τ′, where the impurity gain is defined as

\[ I(A, A'_{\phi,\tau}, A''_{\phi,\tau}) = H(A) - \frac{|A'_{\phi,\tau}|}{|A|} H(A'_{\phi,\tau}) - \frac{|A''_{\phi,\tau}|}{|A|} H(A''_{\phi,\tau}) \]

The impurity measure H(Z) captures the similarity of all data points in Z. For classification, if all the data points in Z are the same class then H(Z) should return 0, while if the number of data points per class is equal then H(Z) should return a large impurity. For regression, the impurity captures the variance of Y for all data points in Z. Figure 2.4 provides some intuition for relative impurities.
The histograms and scatter plots on the left have the highest impurity values and the ones on the right have the lowest.

Algorithm 2: GrowDecisionTree
Input: A data set D
Output: A randomized decision tree fit to D

    D′ ← sample tree dataset from D
    frontier ← {0}
    nodes_test ← empty map from node index to test parameters
    nodes_children ← empty map from node index to child indices
    nodes_predictor ← empty map from node index to leaf predictor
    while frontier is not empty do
        j ← pop(frontier)
        A_j ← L_j(D′)   /* all data points within node j */
        (v, φ_j, τ_j) ← BestRandomizedSplit(A_j)
        if v = "Valid" then
            (j′, j′′) ← create two child nodes for j
            frontier ← frontier ∪ {j′, j′′}
            nodes_test[j] ← (φ_j, τ_j)
            nodes_children[j] ← (j′, j′′)
        nodes_predictor[j] ← FitPredictorModel(A_j)
    return Tree(nodes_test, nodes_children, nodes_predictor)

Figure 2.4: Relative impurities of several histograms and scatter plots. The impurity is highest on the left and decreases while moving right.

Classification

Two common impurity measures for classification are the Gini index and Shannon entropy. Gini is used in CART while Shannon entropy was adopted for ID3, C4.5 and C5.0 decision trees. Both Gini, H_G(Z), and Shannon entropy, H_E(Z), use an estimate of the probability of each class:

\[ p(c) = \frac{1}{|Z|} \sum_{Y_i \in Z} \mathbb{I}\{Y_i = c\} \]

\[ H_G(Z) = 1 - \sum_{c \in C} p(c)^2 \]

\[ H_E(Z) = -\sum_{c \in C} p(c) \log p(c) \]

Gini tends to isolate the largest class from all other classes while entropy tends to find groups of classes that add up to 50% of the data. As a result, Shannon entropy tends to produce more balanced trees. Shannon entropy is used in most applications and has been shown to differ from Gini for only 2% of class distributions [38]; therefore, we have decided to adopt it for the rest of this thesis.
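The two classification impurities above can be computed directly from class counts. A minimal sketch (our own function names; natural logarithm used for the entropy):

```python
import math
from collections import Counter
from typing import Hashable, Sequence

def gini(labels: Sequence[Hashable]) -> float:
    # H_G(Z) = 1 - sum_c p(c)^2
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def shannon_entropy(labels: Sequence[Hashable]) -> float:
    # H_E(Z) = -sum_c p(c) log p(c); empty classes never appear in
    # the Counter, so log(0) is never evaluated
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())
```

Both return 0 for a pure node and their maximum for a uniform class distribution, matching the intuition of Figure 2.4.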
The only downside of using Shannon entropy is that it requires the evaluation of log p(c), which can be computationally expensive.

Regression

For regression, we use the sum of the variance of Y as the impurity measure. This is equivalent to the mean squared error when using a constant predictor:

\[ \bar{Y}_Z = \frac{1}{|Z|} \sum_{Y_i \in Z} Y_i \]

\[ H_V(Z) = \frac{1}{|Z|} \sum_{Y_i \in Z} (Y_i - \bar{Y}_Z)^2 \]

Another impurity measure suggested by Criminisi et al. [18] is differential entropy, which for the 1-D case is proportional to the sum of the logs of the squared errors. We did not explore this further because of the computational requirements for higher dimension multivariate regression.

Sufficient statistics

While the pseudocode in Algorithm 1 splits the set of data points A into A′_{φ,τ} and A′′_{φ,τ} and then measures I(A, A′_{φ,τ}, A′′_{φ,τ}), in practice we only need to store the sufficient statistics of A, A′_{φ,τ} and A′′_{φ,τ} required to compute I(A, A′_{φ,τ}, A′′_{φ,τ}).

For classification the sufficient statistics required by H_E(Z) are the class counts, and for regression the sufficient statistics required by H_V(Z) are $\sum_{Y_i \in Z} Y_i$, $\sum_{Y_i \in Z} Y_i^2$ and |Z|. The variance estimate is simply the standard formula, and in the case of multivariate regression there is an additional sum over the number of dimensions of Y:

\[ H_V(Z) = \frac{1}{|Z|} \sum_{Y_i \in Z} Y_i^2 - \left( \frac{1}{|Z|} \sum_{Y_i \in Z} Y_i \right)^2 \]

By using the sufficient statistics outlined above, it is possible to evaluate a new split of A by updating an existing split A′_{φ,τ} and A′′_{φ,τ}, removing a point from A′_{φ,τ} and adding it to A′′_{φ,τ}. This allows for efficient computation of different split points, which we cover in more detail in Section 2.4.6.

2.3.2 Split criteria

The best split, φ′ and τ′, is chosen by selecting the split with the highest impurity gain I(A, A′_{φ,τ}, A′′_{φ,τ}) for which all split criteria are met. In this section we introduce four split criteria.

Minimum child size

The minimum child size is a constraint on the size of each candidate child in a node being split. If either candidate child has fewer than M_cs data points, the split is not valid.
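The sufficient-statistics trick for the variance impurity above can be sketched as a small accumulator (an illustration with our own class name, not the thesis code): keeping $n$, $\sum Y_i$ and $\sum Y_i^2$ lets a point be moved between the two sides of a split in O(1).

```python
class VarianceStats:
    """Sufficient statistics (n, sum, sum of squares) for H_V(Z)."""

    def __init__(self) -> None:
        self.n = 0
        self.s = 0.0    # running sum of Y_i
        self.s2 = 0.0   # running sum of Y_i^2

    def add(self, y: float) -> None:
        self.n += 1
        self.s += y
        self.s2 += y * y

    def remove(self, y: float) -> None:
        self.n -= 1
        self.s -= y
        self.s2 -= y * y

    def impurity(self) -> float:
        # H_V(Z) = E[Y^2] - (E[Y])^2
        if self.n == 0:
            return 0.0
        mean = self.s / self.n
        return self.s2 / self.n - mean * mean
```

Moving a data point from the right child to the left child is then a `remove` on one accumulator and an `add` on the other, with no rescan of the node's data.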
More formally, |A′_{φ,τ}| ≥ M_cs and |A′′_{φ,τ}| ≥ M_cs must both be true. A larger M_cs reduces the size of each tree while increasing the accuracy of the estimate of each leaf's predictor. Therefore a large M_cs produces smaller trees that will underfit the data while a small M_cs produces larger trees that will overfit the data. The overfitting of each tree can also be addressed by increasing the number of trees in the forest. See Section 5.2.3 for a detailed empirical study of the number of trees and minimum child size.

Minimum node size

The minimum node size ensures |A| ≥ M_ns where M_ns is the minimum number of data points in a node being split. This criterion is very similar to minimum child size but it can be evaluated before measuring any candidate splits because it is independent of φ and τ. However, it does not make any guarantees about the balance of a split and is usually used in conjunction with minimum child size as an early check where M_ns = 2 M_cs.

Minimum impurity gain

The minimum impurity gain ensures I_{φ,τ} ≥ M_im where M_im is the minimum impurity gain of split φ, τ. This stops trees from splitting leaves which are close to pure.

Maximum depth

The maximum depth ensures that the length of the path from the root of a tree to any leaf never exceeds M_d. This is another way of controlling the complexity of a tree but it offers no guarantees on the number of data points used in estimating the leaf predictors. Because of this, we favour minimum child size over maximum depth.

2.4 Candidate split point strategies

Thus far we have assumed candidate split points have been sampled uniformly from the range of each feature. We begin this section by illustrating this strategy in more detail and then examine four other strategies for generating candidate split points. As discussed in Section 2.3, the candidate split points are generated for each feature.
Fewer candidate split points result in more randomness, which creates weaker trees that are more diverse.

2.4.1 Global uniform

The global uniform candidate split point strategy samples split points uniformly from the range of a candidate feature, where the range is specified by the minimum and maximum feature value over all the data points in the dataset. The parameter K_S specifies the number of candidate split points to sample. Figure 2.5 illustrates an example with three candidate split points (K_S = 3). The black circles represent data points in the node being split and the light grey circles represent data points outside the node. The arrows are the sampled split points and the highlighted orange regions visualize where potential split points could have been sampled. One advantage of this strategy is that the minimum and maximum value for each feature only needs to be computed once for the entire forest. The disadvantage is that some split points lie outside the range of feature values for the data points at the node.

Figure 2.5: Three candidate split points sampled using the global uniform strategy for one candidate feature. The dark points are data points within the node being split and the light grey points are data points outside the node. The upward facing arrows are sampled candidate split points, the blue square brackets indicate the minimum and maximum of the range and the highlighted orange region indicates where candidate split points can be sampled.

2.4.2 Node uniform

Instead of sampling across the entire range of a feature, candidate split points are now sampled between the minimum and maximum value of a candidate feature for data points within the node. We label this candidate split point strategy node uniform. This strategy is illustrated in Figure 2.6 and it also has the parameter K_S for specifying the number of candidate split points to sample. This strategy is employed by Geurts et al.
[27] as the core differentiation of their algorithm from Breiman [9].

2.4.3 All data points

Using all data points as split point candidates is illustrated in Figure 2.7. This strategy clearly cannot find a max margin split, but we have included it to see how much it hurts generalization error. For online variants, using a subset of data points is appealing, so it is important to understand how much potential accuracy is being given up.

Figure 2.6: Example of three candidate split points chosen with the node uniform strategy for one candidate feature. The dark points are data points within the node and the light grey points are data points outside the node. The blue square brackets indicate the minimum and maximum of the range and the orange regions indicate where candidate split points can be sampled.

Figure 2.7: Example of all data points being candidate split points for a candidate feature. The dark points are data points within the node and the light grey data points are outside the node. With this deterministic strategy, there are large regions of the feature range that can never contain a split point.

2.4.4 All gaps midpoint

The original random forest algorithm presented by Breiman [9] followed the candidate split point strategy from CART. This strategy proposes the midpoint of all gaps between adjacent data points, as illustrated in Figure 2.8. While this produces a max margin split, the range of possible split points is discontinuous. For the rest of this thesis we refer to this strategy as all gaps midpoint.

Figure 2.8: Example of all gaps midpoint for one candidate feature. The dark points are data points within the node and the light grey data points are outside the node. With this deterministic strategy, there are large regions of the feature range that can never contain a split point.

2.4.5 All gaps uniform

The final strategy, illustrated in Figure 2.9 and named all gaps uniform, samples a candidate split point uniformly within each gap created by adjacent data points.
This combines the split point strategy of all gaps midpoint with node uniform: it ensures the range of possible split points is continuous while creating a split point for every gap. This allows for a max margin split point while still injecting extra randomness.

Figure 2.9: Example of candidate split points generated by the all gaps uniform strategy for one candidate feature. The dark points are data points within the node and the light grey data points are outside the node. While this figure makes it appear as if there is a gap at each data point, this was added only to show the regions between data points where each candidate split point is sampled.

2.4.6 Efficient computation of candidate split points

There are two approaches for computing the impurity of every split point. The first is to maintain the sufficient statistics of A′ and A′′ for each split point. This requires storing sufficient statistics for all split points, which is feasible when there are few candidate split points and has the advantage of being able to process the data points in any order.

The second strategy, which we term sort and walk, is to sort the data points for each candidate feature and then walk along the split points in order while maintaining the sufficient statistics of A′ and A′′. Each split point can be computed from the last split point by removing the data points between the old split point and the new split point from A′′ and adding them to A′. Figure 2.10 illustrates the process for a classification problem. This approach can be applied to all candidate split point selection strategies but is required for all data points, all gaps midpoint and all gaps uniform.

Figure 2.10: Example of efficiently computing the left and right sufficient statistics of one split point by updating an existing split point with the sufficient statistics of the data points between the two split points.

2.5 Bagging

Another common method for introducing randomness is to build each tree using a bootstrapped or subsampled data set.
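As a concrete illustration, bootstrap resampling can be sketched in a few lines of Python (the function name is ours, not from the thesis); for large N, roughly a 1/e fraction of the original points never appear in the bootstrapped set:

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices uniformly with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
n = 10_000
sample = bootstrap_indices(n, rng)
# Fraction of the original points that were never drawn; for large n this
# concentrates around (1 - 1/n)**n, which tends to 1/e (about 0.37).
missing_fraction = 1.0 - len(set(sample)) / n
```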
The process of creating an ensemble of equally weighted learners using a bootstrapped data set is known as bagging, which is short for bootstrap aggregating. Breiman's original random forest algorithm [9] was influenced by bagging and used a bootstrapped data set, while more recent random forest algorithms do not [18, 27].

Given a data set D of size N, a bootstrapped data set D′ is generated by sampling the original data set N times with replacement. This new bootstrapped data set D′ has multiple copies of some data points and is missing others. In expectation, the bootstrapped data set D′ will be missing approximately a fraction 1/e ≈ 0.37 of the data points.

In addition to injecting additional randomness, one advantage of bagging is that it provides a mechanism to estimate the generalization error of an ensemble. For every data point there will be a subset of trees that were not trained on that point. Therefore, it is possible to evaluate the error of a forest by predicting Y with the trees that were not trained with the corresponding X. This is called the out of bag error and it tends to be pessimistic because the out of bag forest is smaller than the actual forest.

2.6 Subsample at nodes

For most data sets, subsampling the data at each tree increases the test error. The smaller trees cannot capture the complexity of the entire data set and the leaf predictors are estimated with fewer data points. The motivation for subsampling is to reduce computation time and the amount of data that is required to be in memory at one time.

To allow trees to be grown to the same size with the same number of data points in each leaf, we propose subsampling Kss data points at each node. Close to the root of the tree, this injects additional randomness and reduces computation.
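A minimal sketch of this per-node subsampling, with a helper name of our choosing:

```python
import random

def node_subsample(points, k_ss, rng):
    """Return at most k_ss points to use for split selection at a node.

    Near the root (len(points) > k_ss) this adds randomness and reduces
    computation; near the leaves all points are kept, so split selection
    and predictor estimation are unchanged there."""
    if len(points) <= k_ss:
        return points
    return rng.sample(points, k_ss)

rng = random.Random(1)
root_points = list(range(1000))
subsampled = node_subsample(root_points, 64, rng)
```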
Near the leaves the number of data points is less than Kss, so split selection and predictor estimation are unaffected.

2.7 Computation and memory complexity

2.7.1 Computational complexity of testing

The cost of predicting Y for a point X is linear in the number of trees and the average depth of each tree. Assuming the trees are reasonably balanced, the average depth is proportional to log N, where N is the number of data points.

2.7.2 Computational complexity of training

Fixed split points

The computational cost of splitting a node when storing the sufficient statistics for each candidate split is proportional to n · KF · KS · dim(Y), where n is the number of data points at the node, KF is the number of candidate features, KS is the number of candidate split points and dim(Y) is the dimension of Y for regression and the number of classes for classification. The naive approach for bounding the cost per tree would be to take a worst case of N nodes and n = N. This results in a per tree training time that is O(N²).

However, this is overly pessimistic because the number of data points in all nodes of each layer of the tree must be less than or equal to N. To simplify, we assume a well balanced tree such as the one illustrated in Figure 2.11. At any level of the tree the cost is bounded by N · KF · KS · dim(Y), so if the depth is proportional to log N, the total cost of fitting the forest is proportional to T · N · log N · KF · KS · dim(Y), where T is the size of the forest.

Figure 2.11: A balanced tree where each node is half the size of its parent. Notice that the sum of the sizes of all nodes at any level is less than or equal to N.

Sort and walk

For a single node, the computational cost of computing impurity by sorting and walking the data points is no longer dependent on the number of split points KS.
Instead, the cost is proportional to n · log n · KF · dim(Y); the factor of KS is replaced by an extra log n term.

For each layer in the tree, let n_i be the size of the i-th node in the layer. For a layer with L nodes, \sum_{i=0}^{L} n_i \le N, and the cost of splitting all nodes in the layer is

\sum_{i=0}^{L} (n_i \log n_i) \cdot K_F \cdot \dim(Y) \le N \log N \cdot K_F \cdot \dim(Y).

As a result, the cost of training the forest is bounded by T · N · log² N · KF · dim(Y). If log N is less than KS then it is more efficient to sort and walk the data points to compute the impurity of each split point. Using a random forest variant with a fixed number of split points, such as node uniform, is computationally more efficient than the sort and walk variants, such as all gaps midpoint, when the number of data points is sufficiently large that log N > KS. With a large number of candidate split points, global uniform and node uniform can also sort and walk the sampled candidate split points, but we did not implement this for our experiments because increasing the number of candidate split points did not decrease generalization error. See Section 5.2.4 for the empirical study.

Subsampling at each node

By subsampling the data at each node it is possible to further reduce the cost of training to linear in N. The cost for a fixed number of split points becomes bounded by T · N · Kss · KF · KS · dim(Y) and the cost to sort and walk becomes bounded by T · N · Kss · log Kss · KF · dim(Y).

2.7.3 Memory requirements

The memory required for a tree is proportional to N · dim(Y) because there can be up to N nodes and each leaf predictor estimate needs dim(Y) parameters. Depending on the data as well as the settings of Mcs, Mns, Mim and Md it is possible to replace N with a constant of choice. However, this comes at the cost of limiting the complexity of the model, which could result in underfitting the data.

During training the memory required depends on the amount of parallelism.
If the impurity gain of each split point, feature and node is computed sequentially, then N memory is required to store the extracted feature, KF · KS memory is required to store the impurity gains and dim(Y) memory is required to store the sufficient statistics of the impurity gain being computed.

If the impurity gain of all candidate features and candidate split points is computed in parallel, then N · KF · KS · dim(Y) memory is required. When the impurity gain is computed using a sort and walk variant, only the best split point of each feature is stored. Therefore the memory requirements become proportional to N · KF · dim(Y).

Additional parallelism can be introduced by training each tree in parallel and on separate machines. It would be possible to further distribute the computation by splitting every node in the frontier on separate machines.

2.7.4 Computational and memory requirements in practice

So far we have looked at worst case scenarios where KF does not affect the number of nodes. In practice, however, increasing KF often results in better splits being found higher up the tree. This results in smaller trees, which reduces the computation for training the model. These conflicting forces can result in the computational cost increasing at less than a linear rate as KF increases.

2.8 Features

In this section we expand the class of features used in our experiments. Some of the literature [18] makes a distinction between the extraction of features and a function of existing features. The function of existing features is typically linear or conic and is called the feature shape. Instead, we treat feature shapes as new features which are functions of existing features. In our notation, all candidate features and shapes have the form h(x, φ), where φ is the parameter of the feature.
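The h(x, φ) interface can be sketched as plain functions; this is a toy rendering of the notation, not the thesis implementation:

```python
def axis_aligned(x, phi):
    """h(x, phi) = pi_phi . x: project x onto coordinate phi."""
    return x[phi]

def linear(x, phi):
    """h(x, phi) = phi . x for a weight vector phi (a sparse phi simply
    has few non-zero components)."""
    return sum(w * v for w, v in zip(phi, x))

x = [0.5, -1.0, 2.0]
```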
To quickly review, axis-aligned splits have the form

h(x, φ) = π_φ · x

where π_φ is the projection onto the φ-th coordinate.

2.8.1 Sparse random projections

Breiman introduced random projections as a technique for injecting additional randomness when training on low-dimensional data sets [10]. Criminisi et al. [18] showed that random projections reduce the artifacts from axis aligned splits.

Sparse random projections have the benefit of reducing the memory required for φ from the dimension of the feature space to a constant. We define KP as the size of the subspace. Weights are sampled uniformly between −1 and 1 for KP components, resulting in a sparse linear projection φ that has KP non-zero components.

h(x, φ) = φ · x

In order for the random projection to be influenced by all features equally, the original feature space must be normalized. This requirement does not apply to axis-aligned random forests.

2.8.2 Class difference sparse projections

For classification, the motivation for using sparse random projections is to find hyperplanes that separate the different classes. However, in high dimensional spaces, random projections are less likely to find such hyperplanes.

Rotation forests [39] find good linear combinations of features by applying PCA to random subspaces of features at each node. This technique is computationally expensive but does not use the class labels. The oblique random forests of Menze et al. [34] perform LDA on the data within a node to determine the projection which best separates the data. This significantly increases the computation of fitting a forest; while it improves the predictive accuracy for data sets where the features of X are highly correlated, it also decreases predictive accuracy when the features of X are uncorrelated.

We now present a computationally efficient process for generating randomized candidate projections that are likely to separate some of the data points.
Two data points, x′ and x′′, are sampled uniformly with replacement from all data points at a node such that their class labels are different. The candidate projection is simply φ = x′ − x′′. To make the candidate projection sparse, a subset of its components is sampled uniformly without replacement. By sampling two points with different classes, the resulting projection is guaranteed to produce a hyperplane that separates these two points. One candidate projection is illustrated in Figure 2.12.

Figure 2.12: Class difference projection for data points x′ and x′′.

2.8.3 Kinect features

Rather than extract all the features for every data point, φ can parameterize a raw feature extractor where x becomes the index of a data point. Shotton et al. [43] demonstrate this in their work on predicting human body part labels for pixels in a depth image. Figure 2.13 visualizes a depth image and the corresponding ground truth body part labels. Each of the twenty labels is represented by a different color. After predicting the body part class of each pixel, a post process such as mean shift is applied to find the center of each joint.

Figure 2.13: Depth image and corresponding body part labels for the challenging computer vision task of predicting the body part label of each pixel in a depth image.

In this application, the pixel index x is the position of a pixel in an image and φ = (u, v) consists of two 2-D offsets to neighboring pixels. The feature h(x, φ) is the difference of the depths at the two pixels specified by the offsets u and v. Figure 2.14 visualizes one set of u and v offsets for a single pixel x. The offsets are scaled as a function of the depth of pixel x to produce depth invariant features. As a subject gets closer to the camera, the depth of x decreases and the offsets specified by u and v increase to compensate for the larger subject.
The constant c_f specifies the rate at which the offsets are scaled and depends on the field of view of the camera.

h(x, \phi) = d\left(x + \frac{u}{c_f \, d(x)}\right) - d\left(x + \frac{v}{c_f \, d(x)}\right)

Figure 2.14: An example of the Kinect feature φ specified by two offsets u and v.

2.9 Accuracy diversity trade-off

As an ensemble model, random forests achieve the lowest error when balancing individual tree accuracy with diversity across the forest. In this section we survey the literature on understanding this trade-off for regression and classification.

2.9.1 Ambiguity decomposition for regression

Krogh and Vedelsby [31] introduced the ambiguity decomposition for regression. They proved that the mean squared error of a weighted ensemble, where the weights sum to one, can be decomposed into the average squared error of each base model and the average squared difference between each base model and the ensemble. We present the decomposition for the special case where all the weights are 1/T and T is the number of base models in the ensemble.

(\bar{f}(X) - Y)^2 = \frac{1}{T}\sum_{t=1}^{T}(f_t(X) - Y)^2 - \frac{1}{T}\sum_{t=1}^{T}(f_t(X) - \bar{f}(X))^2

Here f_t(X) is the prediction of base model t and \bar{f}(X) is the prediction of the ensemble. This decomposition captures the trade-off between individual accuracy and diversity. If each base model is perfect, both terms are zero. As individual accuracy goes down, the first term goes up, but the second term may also go up if the errors made by each base model are different. As the second term increases, the overall error of the ensemble goes down. A proof of this decomposition, as well as an excellent discussion, can be found in Brown et al. [14].

2.9.2 Breiman's bound for classification

For classification there is no equivalent to the ambiguity decomposition. The closest approximation is Breiman's bound on the generalization error in terms of tree strength and raw margin correlation [10].
This bound requires the variance between base classifiers to be the result of independent randomness during the fitting of each base classifier, and the predictions to be combined with max vote. The randomness is captured by the random variable Θ, and \bar{f}(X, Θ) is the forest prediction for data point X and randomness Θ. Breiman defines the margin of the ensemble, mr(X, Y), as the difference between the probability of the true class and that of the most likely class that is not the true class, as predicted by the ensemble.

mr(X, Y) = P_\Theta(\bar{f}(X, \Theta) = Y) - \max_{j \neq Y} P_\Theta(\bar{f}(X, \Theta) = j)

The strength s is defined as the expectation of the margin over the data.

s = E_{X,Y}\, mr(X, Y)

Since the trees are combined with max vote,

mr(X, Y) = P_\Theta(\bar{f}(X, \Theta) = Y) - \max_{j \neq Y} P_\Theta(\bar{f}(X, \Theta) = j)
         = E_\Theta\left\{ I\{\bar{f}(X, \Theta) = Y\} - I\{\bar{f}(X, \Theta) = \arg\max_{j \neq Y} \bar{f}(X, \Theta)\} \right\}

and the raw margin, rmg(Θ, X, Y), can be defined as mr(X, Y) without taking the expectation over Θ.

rmg(\Theta, X, Y) = I\{\bar{f}(X, \Theta) = Y\} - I\{\bar{f}(X, \Theta) = \arg\max_{j \neq Y} \bar{f}(X, \Theta)\}

The mean correlation of the raw margin is defined as

\bar{\rho} = \frac{\mathrm{var}_{X,Y}(mr(X, Y))}{\left(E_\Theta\, \mathrm{sd}_{X,Y}(rmg(\Theta, X, Y))\right)^2}

where sd_{X,Y}(rmg(Θ, X, Y)) is the standard deviation of rmg(Θ, X, Y) over the data while keeping Θ fixed, and var_{X,Y}(mr(X, Y)) is the variance of mr(X, Y) over the data. The expectation E_Θ sd_{X,Y}(rmg(Θ, X, Y)) can be estimated by averaging sd_{X,Y}(rmg(Θ, X, Y)) over all the trees. Notice that \bar{\rho} goes up when the margin of the ensemble varies across data points, but decreases the more often each tree votes for the second best class as voted by the ensemble. At its essence, \bar{\rho} is lowest when there is a large amount of disagreement between trees and the amount of disagreement is similar for all data points in the data set.
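These quantities are straightforward to compute from tree votes. A toy sketch for a single data point, with function names of our choosing:

```python
from collections import Counter

def ensemble_margin(tree_votes, y_true):
    """mr(X, Y) for one data point: vote share of the true class minus
    the largest vote share among the remaining classes."""
    counts = Counter(tree_votes)
    t = len(tree_votes)
    p_true = counts[y_true] / t
    p_other = max((c / t for cls, c in counts.items() if cls != y_true),
                  default=0.0)
    return p_true - p_other

def raw_margin(vote, y_true, second_best):
    """rmg for one tree: +1 if it votes the true class, -1 if it votes
    the ensemble's second-best class, 0 otherwise."""
    return int(vote == y_true) - int(vote == second_best)

votes = ["a", "a", "b", "a", "c"]  # five trees voting on one point
```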
In this context it makes sense for the generalization error, PE*, to be linearly bounded by \bar{\rho}:

PE^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}

To estimate \hat{s}, let \bar{f}(x, y) be the ensemble's estimate of the probability of class y for data point x.

\hat{s} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{f}(X_i, Y_i) - \max_{j \neq Y_i} \bar{f}(X_i, j)\right)

To estimate \hat{\bar{\rho}}, let f_t(x) be the class that tree t would vote for, let \hat{p}'_t be an estimate of the probability that tree t voted for the true label Y, and let \hat{p}''_t be an estimate of the probability that tree t voted for the second best label as voted by the ensemble.

\hat{p}'_t = \frac{1}{N}\sum_{i=1}^{N} I\{f_t(X_i) = Y_i\}

\hat{p}''_t = \frac{1}{N}\sum_{i=1}^{N} I\{f_t(X_i) = \arg\max_{j \neq Y_i} \bar{f}(X_i, j)\}

\hat{\bar{\rho}} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(\bar{f}(X_i, Y_i) - \max_{j \neq Y_i} \bar{f}(X_i, j)\right)^2 - \hat{s}^2}{\left(\frac{1}{T}\sum_{t=1}^{T}\sqrt{\hat{p}'_t + \hat{p}''_t - (\hat{p}'_t - \hat{p}''_t)^2}\right)^2}

Chapter 3

Consistent random forests

In spite of the extensive use of random forests in practice, the mathematical forces underlying their success are not well understood. The early theoretical work of Breiman [11], for example, is essentially based on intuition and mathematical heuristics, and was not formalized rigorously until quite recently [3].

There are two main properties of theoretical interest associated with random forests. The first is consistency of estimators produced by the algorithm, which asks (roughly) whether we can guarantee convergence to an optimal estimator as the data set grows infinitely large. Beyond consistency we are also interested in rates of convergence. In this chapter we focus on consistency, which, surprisingly, has not yet been established even for Breiman's original algorithm.

Theoretical papers typically focus on stylized versions of the algorithms used in practice. An extreme example of this is the work of Genuer [25, 26], which studies a model of random forests in one dimension with completely random splitting.
In exchange for simplification researchers acquire tractability, and the tacit assumption is that theorems proved for simplified models provide insight into the properties of their more sophisticated counterparts, even if the formal connections have not been established.

An important milestone in the development of the theory of random forests is the work of Biau et al. [4], which proves the consistency of several randomized ensemble classifiers. Two models studied in Biau et al. [4] are direct simplifications of the algorithm from Breiman [10], and two are simple randomized neighborhood averaging rules, which can be viewed as simplifications of random forests from the perspective of Lin and Jeon [33]. We refer to the scale invariant version as Biau2008 and present it in more detail in Section 3.1.

More recently, Biau [3] has analyzed a variant of random forests originally introduced in Breiman [11] which is quite similar to the original algorithm. The main differences between the model in Biau [3] and that of Breiman [10] are in how candidate split points are selected and that the former requires a second independent data set to fit the leaf predictors. The specifics of this algorithm, which we refer to as Biau2012, are covered in Section 3.2.

The latest work of Denil et al. [21] extends the work of Biau [3] to select candidate split points using a strategy that is very similar to Breiman [10], as well as removing the requirement of a second data set for fitting the leaf predictors. Instead, the data set is split in two for each tree, where half is used for constructing the structure of the tree while the other half is used for estimating the leaf predictors.
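The per-tree split into the two halves can be sketched as follows (function and variable names are ours):

```python
import random

def split_streams(data, rng):
    """Assign each point independently to the structure stream (used to
    choose splits) or the estimation stream (used to fit leaf
    predictors) with equal probability, once per tree."""
    structure, estimation = [], []
    for point in data:
        (structure if rng.random() < 0.5 else estimation).append(point)
    return structure, estimation

rng = random.Random(0)
structure, estimation = split_streams(list(range(1000)), rng)
```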
In Section 3.3 we specify this algorithm, which we refer to as Denil2014.

For the remainder of this chapter we present Biau2008, Biau2012 and Denil2014 in terms of the framework presented in Chapter 2 and then provide a brief overview of the requirements for consistency and a high level understanding of the theory.

3.1 Biau 2008

For each node being split, Biau2008 selects a single candidate dimension and a single split point. The split point is sampled by selecting a random gap between adjacent data points and then sampling uniformly within the gap. Figure 3.1 illustrates this uniform gaps uniform split point selection strategy employed by Biau2008. In Section 5.3, the plots with number of candidate features (KF) always use KF = 1 for Biau2008 regardless of the setting of KF, as this is required for consistency.

Figure 3.1: Example of a Biau2008 candidate split point for a candidate feature. A gap is selected uniformly and a candidate split is sampled uniformly within the gap. The dark points are data points within the node and the light grey data points are outside the node. The blue region is the gap where this particular split point was sampled and the orange regions are other gaps that could have been sampled.

Rather than split leaves in depth first or breadth first order, Biau2008 selects a random leaf at each iteration of growing the tree. The candidate feature and split point are sampled for the random leaf, and the leaf is split as long as there are at least two data points at the leaf. This process continues until there are no leaves with more than one data point or until there are more than ML leaves. ML functions in a very similar manner to Mns, as both control the size of each tree. By setting ML = N/Mns, the upper bound on the number of nodes in the tree is the same. In practice, rather than keeping track of whether any leaves have more than one data point, a random leaf is sampled and checked R times. If no leaf could be split after R tries, the process assumes that no leaves can be split and stops. For all experiments we set R = 100.

Figure 3.2: Examples of Biau2012 candidate split points as the midpoint of the current range for one candidate feature. When a candidate feature is selected as the best split, the range is halved and future splits are based on the left or right region. This is depicted with three examples, each of which is the same feature being split. The dark points are data points within the node and the light grey data points are outside the node.

3.2 Biau 2012

The trees in Biau2012 assume the data is supported on [0, 1]^{D_X}, so the data must first be scaled to lie in this range. Trees are grown by expanding leaves in breadth first order until ML leaves have been created.

For each node being split, KF candidate features are sampled and a single candidate split point is chosen at the midpoint of each feature. Figure 3.2 visualizes the candidate split point of the same feature at different levels of the tree. This mechanism of selecting the feature midpoint as the candidate split point requires each node to maintain the range of each feature. In the scenario where no candidate feature splits the data (i.e. all the data points lie on one side of every candidate split), a candidate feature is chosen at random. In this case, one of the new leaves will not have any data points and will predict zero.

The last requirement for Biau2012 is that one data set is used to determine the best candidate feature and split point, which in turn determines the structure of the tree, while another data set is used to fit the leaf predictors. The pseudocode from Section 2.3 only needs to be modified slightly to support two data sets: structure points and estimation points. See Algorithm 3 for the details, where DS is the set of structure points and DE is the set of estimation points.
In practice the two data sets are created by randomly assigning each point in D to either DS or DE with equal probability.

Algorithm 3: Biau2012GrowDecisionTree
Input: Data sets DS and DE
Output: A Biau2012 randomized decision tree fit to DS and DE
1  frontier ← {0}
2  nodes_test ← empty map from node index to test parameters
3  nodes_children ← empty map from node index to children indices
4  nodes_predictor ← empty map from node index to predictor
5  while frontier is not empty do
6      j ← pop(frontier)
7      A_j^S ← L_j(DS)    /* all structure points within node j */
8      (v, φ_j, τ_j) ← BestRandomizedSplit(A_j^S)
9      if v = "Valid" then
10         (j′, j′′) ← create two child nodes for j
11         frontier ← frontier ∪ {j′, j′′}
12         nodes_test[j] ← (φ_j, τ_j)
13         nodes_children[j] ← (j′, j′′)
14     A_j^E ← L_j(DE)    /* all estimation points within node j */
15     nodes_predictor[j] ← FitPredictorModel(A_j^E)
16 return Tree(nodes_test, nodes_children, nodes_predictor)

3.3 Denil 2014

Denil2014 extends Biau2012 to support the selection of structure and estimation points for each tree rather than for the entire forest.

Structure points are allowed to influence the shape of the tree. They are used to determine split dimensions and split points in each internal node of the tree. However, structure points are not permitted to affect the predictions made in the tree leaves.

Estimation points play the dual role. These points are used to fit the estimators in each leaf of the tree, but have no effect on the shape of the tree partition.

Each data point is assigned to the estimation or structure stream, i ∈ {S, E}, with equal probability. We define L_j^S(D) as all structure points in leaf j and L_j^E(D) as all estimation points in leaf j. The pseudocode is now modified to Algorithm 4.

The number of candidate features for each node is sampled from a Poisson distribution where KF is the expected number of candidate features. This ensures every feature is split infinitely often. The candidate split points are all gaps midpoint within a valid range.
This is the same as Breiman's algorithm except that the valid range does not contain all data points. Instead, the valid range is computed by sampling Knr data points in the node. This process is illustrated in Figure 3.3.

Figure 3.3: Example of Denil2014 candidate split points for a candidate feature where Knr = 3. The three downward pointing arrows indicate the selection of data points used to determine the range. All gaps midpoints within the range are selected. The dark points are data points within the node and the light grey data points are outside the node.

Algorithm 4: Denil2014GrowDecisionTree
Input: A data set D
Output: A randomized decision tree fit to D
1  frontier ← {0}
2  nodes_test ← empty map from node index to test parameters
3  nodes_children ← empty map from node index to children indices
4  nodes_predictor ← empty map from node index to predictor
5  while frontier is not empty do
6      j ← pop(frontier)
7      A_j^S ← L_j^S(D)    /* all structure points within node j */
8      (v, φ_j, τ_j) ← BestRandomizedSplit(A_j^S)
9      if v = "Valid" then
10         (j′, j′′) ← create two child nodes for j
11         frontier ← frontier ∪ {j′, j′′}
12         nodes_test[j] ← (φ_j, τ_j)
13         nodes_children[j] ← (j′, j′′)
14     A_j^E ← L_j^E(D)    /* all estimation points within node j */
15     nodes_predictor[j] ← FitPredictorModel(A_j^E)
16 return Tree(nodes_test, nodes_children, nodes_predictor)

3.4 Theory

In this section we formally define consistency for a regression forest, introduce the fundamental definitions and provide insight into the requirements for consistency of Biau2008, Biau2012 and Denil2014. Although we focus on regression forests, the structure is very similar for classification forests.

We denote a tree partition created by our algorithm trained on data D_n = {(X_i, Y_i)}_{i=1}^{n} as f_n. As n varies we obtain a sequence of base models, and we are interested in showing that the sequence {f_n} is consistent as n → ∞. More precisely,

Definition 1.
A sequence of estimators {f_n} is consistent for a certain distribution on (X, Y) if the value of the risk functional

R(f_n) = E_{X,Z,D_n}\left[ |f_n(X, Z, D_n) - f(X)|^2 \right]

converges to 0 as n → ∞, where f(x) = E[Y | X = x] is the (unknown) regression function.

All three theoretical variants take advantage of their structure as empirical averaging estimators.

Definition 2. A (randomized) empirical averaging estimator is an estimator that averages a fixed number of (possibly dependent) base estimators, i.e.

f_n^{(M)}(x, Z^{(M)}, D_n) = \frac{1}{M} \sum_{j=1}^{M} f_n(x, Z_j, D_n)

where Z^{(M)} = (Z_1, \ldots, Z_M) is composed of M (possibly dependent) realizations of Z.

The first step of each proof is to show that the consistency of the random forest is implied by the consistency of the trees it is composed of. This result is shown by Denil et al. [21] for regression, Biau et al. [4] for binary classifiers and Denil et al. [20] for multi-class classifiers.

Proposition 3. Suppose {f_n} is a sequence of consistent estimators. Then {f_n^{(M)}}, the sequence of empirical averaging estimators obtained by averaging M copies of {f_n} with different randomizing variables, is also consistent.

Proposition 3 allows each proof to focus on the consistency of each of the trees in the forest. The task of proving that the tree estimators are consistent is greatly simplified if we condition on the partition of the data into structure and estimation points. Conditioned on the partition, the structure of the tree becomes independent of the estimators in the leaves. The following proposition shows that, under certain conditions, proving consistency conditioned on the partitioning variables is sufficient.

Proposition 4. Suppose {f_n} is a sequence of estimators which are conditionally consistent for some distribution on (X, Y) based on the value of some auxiliary variable I. That is,

\lim_{n \to \infty} E_{X,Z,D_n}\left[ |f_n(X, Z, I, D_n) - f(X)|^2 \mid I \right] = 0

for all I ∈ \mathcal{I}, and that ν is a distribution on \mathcal{I}. Moreover, suppose f(x) is bounded.
If these conditions hold, if ν(\mathcal{I}) = 1 and if each f_n is bounded with probability 1, then {f_n} is unconditionally consistent, i.e. R(f_n) → 0.

Let k_n be the minimum node size required to split. Then the main theorems in [3, 4, 20, 21] are very similar to Theorem 5.

Theorem 5. Suppose that X is supported on R^D and has a density which is bounded from above and below. Moreover, suppose that f(x) is bounded and that E[Y^2] < ∞. Then the random regression forest algorithm is consistent provided that k_n → ∞ and k_n/n → 0 as n → ∞.

To ensure k_n → ∞ and k_n/n → 0 as n → ∞, the size of the diagonal of every node must go to zero. By splitting at the range midpoint, Biau2012 is able to achieve this. For Denil2014, this requires every dimension being split infinitely often. By sampling the number of candidate dimensions from a Poisson distribution, there is always some probability of selecting one candidate dimension, which is then guaranteed to split. The splitting mechanic of Denil2014 also ensures that the trees are sufficiently balanced, which is an additional requirement for the diagonal going to zero.

Proposition 4 is the constraint that led Biau2012 to require two data sets and Denil2014 to require structure and estimation streams. Biau2008 did not need these restrictions because the labels, Y, are not used to construct the tree and X is not used when fitting the predictors.

Chapter 4

Online random forests

In this chapter we present online variants of random forests. Online models fit a model by processing a single data point (X_t, Y_t) at a time. They are attractive because they do not require the entire training set to be accessible at once and are appropriate for streaming settings where training data is generated over time and should be incorporated into the model as quickly as possible. Several variants of online decision tree models are present in the Massive Online Analysis (MOA) framework of Bifet et al. [6].

The primary difficulty with building online decision trees is their recursive structure.
Data encountered once a split has been made cannot be used to correct earlier decisions. A notable approach to this problem is the Hoeffding tree [23] algorithm, which works by maintaining several candidate splits in each leaf. The quality of each split is estimated online as data arrive in the leaf, but since the entire training set is not available these quality measures are only estimates. The Hoeffding bound is employed in each leaf to control the amount of data which must be collected to ensure that the split chosen on the basis of these estimates is the true best split with high probability. Domingos and Hulten [23] prove that under reasonable assumptions the online Hoeffding tree converges to the offline tree with high probability. The Hoeffding tree algorithm is implemented in the system of Bifet et al. [6].

Alternative methods for controlling tree growth in an online setting have also been explored. Saffari et al. [40] control leaf splitting using two parameters in their online random forest. One parameter specifies the minimum number of data points which must be seen in a leaf before it can be split, and another specifies a minimum quality threshold that the best split in a leaf must reach. This is similar in flavor to the technique used by Hoeffding trees, but trades theoretical guarantees for more interpretable parameters. Denil et al. [20] grow the minimum number of data points required for splitting as a function of depth: as the tree gets deeper, the number of data points required for splitting increases. This requirement, along with several others, allows Denil et al. [20] to provide a proof of consistency.

One active avenue of research in online random forests involves tracking non-stationary distributions, also known as concept drift.
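The Hoeffding-bound splitting rule described above can be sketched as follows; this is a generic rendering under our own naming, not the exact rule from any of the cited systems:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """With probability at least 1 - delta, the empirical mean of n
    i.i.d. observations of a variable with range value_range lies
    within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the observed gap between the best and second-best
    candidate exceeds epsilon, so the apparent best split is the true
    best with probability at least 1 - delta."""
    return best_gain - second_gain > hoeffding_epsilon(value_range, delta, n)
```

With value_range = 1, delta = 0.05 and n = 200 observations, epsilon is about 0.087, so an observed gain gap of 0.2 triggers a split while a gap of 0.05 does not.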
Many of the online techniques incorporate features designed for this problem [1, 5, 7, 24, 40]. However, tracking of non-stationarity is beyond the scope of this thesis.

The remainder of this chapter presents how to efficiently grow an online forest and defines two variants: Saffari [40] and Denil2013 [20].

4.1 Sufficient statistics and split points

The typical approach to building trees online, which is employed by Domingos and Hulten [23], Saffari et al. [40] and Denil et al. [20], is to maintain a frontier of candidate splits in each leaf of the tree. Every leaf maintains sufficient statistics for each candidate split along with sufficient statistics for the predictor. These sufficient statistics are updated every time a new data point (Xt, Yt) arrives at a leaf. When the splitting criterion is met for the split with the best impurity gain, two child nodes are created with predictor sufficient statistics initialized from the best split's sufficient statistics. Unlike offline trees, only the split with the best impurity gain should be chosen, since waiting for more data will always make the best split valid. Picking a split which is currently valid, but not the best, results in lower predictive accuracy.

Since the model can only store sufficient statistics, it is not possible to select candidate split points by sort-and-walk strategies. Instead, candidate split points must be selected before collecting sufficient statistics. One option is to sample candidate split points when a node is created; the other is to add new split points as data points arrive at a leaf. By sampling candidate split points when a leaf is created, all data points that arrive will be used to update candidate split sufficient statistics. By creating split points as data arrives, all data points that arrived before the creation of a new split point will not be included in its sufficient statistics.
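As a concrete illustration of this bookkeeping, the sketch below maintains per-feature candidate thresholds with left/right class counts, using the "split points at data points" flavor in which new thresholds are added as data arrive. All names (CandidateSplitStats, update) are illustrative, not the implementation used in this thesis.

```python
from collections import defaultdict

class CandidateSplitStats:
    """Sufficient statistics for one leaf of an online tree (sketch).

    For each candidate feature we keep up to KS thresholds; each
    threshold stores left/right class counts, which is all an
    impurity-gain computation needs.
    """
    def __init__(self, candidate_features, KS):
        self.KS = KS
        # feature -> list of (threshold, left_counts, right_counts)
        self.splits = {f: [] for f in candidate_features}

    def update(self, x, y):
        for f, cands in self.splits.items():
            # Add a new threshold at this data point if there is room
            # and the value has not been seen as a threshold yet.
            if len(cands) < self.KS and all(t != x[f] for t, _, _ in cands):
                cands.append((x[f], defaultdict(int), defaultdict(int)))
            # Every already-existing candidate split sees this point.
            for t, left, right in cands:
                side = left if x[f] <= t else right
                side[y] += 1
```

Note that a point arriving before a threshold is created is never counted at that threshold, which is the delayed-splitting cost of this strategy discussed in the text.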
Therefore more data points must arrive before splitting, which slows the growth of the tree; however, the split points reflect the density of the data. We assume candidate features are always sampled when a new node is created.

The simplest strategy for selecting split points is to use global uniform. This is adopted by Saffari et al. [40] and was covered in detail in Section 2.4.1. In the online setting, this requires knowing the feature ranges a priori or maintaining an estimate of the minimum and maximum of each feature during the fitting process. Figure 4.1 visualizes three candidate global uniform split points, along with seven data points that arrive and update the split point estimates.

Figure 4.1: An example of selecting three candidate split points, KS = 3, with global uniform for a single candidate feature and updating the sufficient statistics as seven data points arrive.

The second strategy is to use the first KS data points as split points. This has a similar flavor to selecting all data points as described in Section 2.4.3, but rather than using all data points as split points, new split points are created as data points arrive. For each candidate feature, a new split point is added if there is no existing split point with the same value and if there are fewer than KS split points already being monitored. Candidate split sufficient statistics are collected from each data point for each existing candidate split. The entire process is outlined in Figure 4.2.

4.2 Online bagging

Saffari et al. [40] use the online bagging technique of Oza and Russell [36]. For each data point, each tree samples a weight from a Poisson distribution, w ∼ Poisson(λ), where λ is typically one. Data points with weight equal to zero are skipped by the tree, while data points with a weight greater than one are processed multiple times.

4.3 Memory management

The difficulty of growing trees online is that the trees must be grown breadth first, and maintaining the frontier of potential splits is very memory intensive when the trees are large.
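Returning briefly to Section 4.2, the Oza-Russell weighting amounts to a few lines of code. Since Python's standard library has no Poisson sampler, a Knuth-style sampler is included; function names are illustrative.

```python
import math
import random

def sample_poisson(lam=1.0, rng=random):
    """Draw w ~ Poisson(lam) using Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def online_bagging_update(tree_update, x, y, lam=1.0, rng=random):
    """Oza-Russell online bagging: each tree sees (x, y) w ~ Poisson(lam)
    times, where w = 0 means the tree skips the point entirely."""
    w = sample_poisson(lam, rng)
    for _ in range(w):
        tree_update(x, y)
    return w
```

With λ = 1 each tree skips roughly e⁻¹ ≈ 37% of the stream, mimicking the effect of an offline bootstrap sample.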
Maintaining the frontier requires KF · KS · dim(Y) statistics in each leaf, where KF is the number of candidate features, KS is the number of candidate split points per candidate feature and dim(Y) is the dimension of the sufficient statistics used for computing the impurity gain. For classification dim(Y) is equal to the number of classes, and for regression it is twice the dimension of Y. These statistics can be quite large, and for deep trees the memory cost may become prohibitive.

Offline forests do not suffer from this problem, because they are able to grow the trees depth first. Since they do not need to accumulate statistics for more than one leaf at a time, the cost of computing even several megabytes of statistics per split is negligible. Although the size of the trees still grows exponentially with depth, this memory cost is dwarfed by the savings from not needing to store split statistics for all the leafs.

In practice the memory problem is resolved either by growing small trees, as in Saffari et al. [40], or by bounding the number of nodes in the frontier of the tree, as in Domingos and Hulten [23]. Other models of streaming random forests, such as those discussed in Abdulsalam [1], build trees in sequence instead of in parallel, which reduces the total memory usage.

We now present a bounded frontier that adopts the technique of Domingos and Hulten [23] to control the policy for adding and removing leafs from the frontier. In each tree we partition the leafs into two sets: a set of active leafs, for which we collect split statistics as described in earlier sections, and a set of inactive leafs, for which we store only two numbers. We call the set of active leafs the frontier of the tree, and describe a policy for controlling how inactive leafs are added to the frontier.

Figure 4.2: An example of candidate split points being selected at data points as they arrive for a single candidate feature.
Data points that arrive before a split point is created are not included in the split point sufficient statistics.

In each inactive leaf j we store the following two quantities for classification:

• p̂(j), which is an estimate of p(j) = P(l(X) = j), and
• ê(j), which is an estimate of e(j) = P(f(X) ≠ Y | l(X) = j).

Both of these are estimated based on the estimation points which arrive in j during its lifetime. From these two numbers we form the statistic ŝ(j) = p̂(j)ê(j) (with corresponding true value s(j) = p(j)e(j)), which is an upper bound on the improvement in error rate that can be obtained by splitting j. For regression, ê(j) is an estimate of e(j) = E[(Y − f(X))^2 | l(X) = j], which is the expected MSE of points which reach leaf j.

Membership in the frontier is controlled by ŝ(j). When a leaf is split it relinquishes its place in the frontier, and the inactive leaf with the largest value of ŝ(j) is chosen to take its place. The newly created leafs from the split are initially inactive and must compete with the other inactive leafs for entry into the frontier. Let Maf be the maximum number of leafs in the active frontier at any one time. The pseudo code for constructing an online randomized tree with a fixed frontier is outlined in Algorithm 5 and Algorithm 6.

4.4 Saffari

The design choices of the Saffari et al. [40] variant include controlling memory with max depth, online bagging, sampling KS candidate split points with global uniform, and splitting a leaf when a minimum impurity of Mim and a minimum node size of Mns are reached. In Section 5.4 we evaluate the effect of the following modifications to Saffari:

1. replacing max depth with a fixed frontier to control memory,
2. removing online bagging, and
3. sampling KS candidate split points at data points instead of using global uniform.

4.5 Denil 2013

Denil et al.
[20] combines the requirements for consistency from Chapter 3 with a fixed frontier online forest to create an online forest with provable consistency. Unlike Domingos and Hulten [23], who use the fixed frontier only as a heuristic for managing memory use, Denil et al. [20] incorporate the memory management directly into their analysis.

Algorithm 5: OnlineBestRandomizedSplit
Input: X, Y: data point; Φ: candidate features; T: map from features to split points; S: map from features, split points and left/right to sufficient stats
Output: A tuple (v, φ, τ, S_φ,τ) where v is whether the split is valid, φ is the feature parameter, τ is the split point and S_φ,τ are the sufficient stats

  /* Generate splits and their purity values */
  foreach φ ∈ Φ do
      T[φ] ← UpdateCandidateSplitPoints(X, T[φ])
      foreach τ ∈ T[φ] do
          s ← h(X, φ) > τ
          S[φ][τ][s] ← UpdateSufficientStats(S[φ][τ][s], X, Y)
          I_φ,τ ← I(S[φ][τ])
  (I′, φ′, τ′, S′) ← select I_φ,τ, φ, τ and S_φ,τ such that I_φ,τ is maximal
  if IsValid(I′, φ′, τ′, S′) then
      v ← "Valid"
  else
      v ← "Invalid"
  return (v, φ′, τ′, S′)

Algorithm 6: OnlineGrowDecisionTree
Input: A stream of data points D
Output: A randomized decision tree fit

  inactive_frontier ← {0}
  nodes_test ← empty map from node index to test parameters
  nodes_children ← empty map from node index to child indices
  nodes_predictor ← empty map from node index to predictor
  cf ← empty map from node index to candidate features
  cs ← empty map from node index to a map from candidate features to candidate split points
  css ← empty map from node index to sufficient statistics of splits
  /* Continuously process data points X, Y from the stream */
  while X, Y ← D do
      w ← DatapointWeight()
      while w > 0 do
          w ← w − 1
          /* Get the index of the leaf that X falls in */
          j ← l(X)
          nodes_predictor[j] ← UpdatePredictorModel(X, Y)
          (v, φj, τj, Sj) ← OnlineBestRandomizedSplit(X, Y, cf[j], cs[j], css[j])
          if v = "Valid" then
              (j′, j′′) ← create two child nodes for j
              nodes_predictor[j′] ← FitPredictorModel(Sj[L])
              nodes_predictor[j′′] ← FitPredictorModel(Sj[R])
              inactive_frontier ← inactive_frontier ∪ {j′, j′′}
              nodes_test[j] ← (φj, τj)
              nodes_children[j] ← (j′, j′′)
              delete cf[j], cs[j] and css[j]
          while the size of the active frontier is less than Maf do
              /* Select the leaf which is most likely to improve the tree by being split */
              j ← pop(inactive_frontier)
              cf[j] ← SelectCandidateFeatures()
              cs[j] ← empty split points
              css[j] ← zeroed sufficient stats
  return Tree(nodes_test, nodes_children, nodes_predictor)

Similar to Section 3.3, the proof of Denil et al. [20] requires data points to be split into structure and estimation streams. Only structure points can be used to select candidate split points and determine the best candidate split. Estimation points are used to update the current leaf predictor estimates and to collect the sufficient statistics that initialize the leaf predictors of new children. Figure 4.3 visualizes data points being assigned to structure and estimation streams, creating new candidate split points at structure points and gathering structure and estimation sufficient statistics.

In addition to requiring two streams, Denil2013 requires the number of candidate features to be sampled from a Poisson distribution with mean KF, and the number of data points required to split a leaf to grow with depth. A leaf is split when the best impurity gain is greater than Mim and there are at least α(d) estimation points in the best candidate split. A leaf is also split when there are more than β(d) estimation points, regardless of the impurity. By growing the number of estimation points required to split a leaf as a function of depth, the number of data points in each leaf goes to infinity as N → ∞. By forcing a split after β(d) estimation points, each leaf is guaranteed to eventually split. To ensure consistency it is necessary that α(d) → ∞ monotonically in d and that d/α(d) → 0. β(d) ≥ α(d) is also required.
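One concrete family satisfying all three constraints grows the thresholds exponentially with depth. The sketch below is illustrative; the constants are placeholders rather than the thesis's settings.

```python
def alpha(d, base=5.0, rate=1.3):
    """Estimation points required in the best candidate split before an
    impurity-based split is allowed.  Exponential growth gives
    alpha(d) -> inf monotonically in d and d / alpha(d) -> 0."""
    return base * rate ** d

def beta(d, base=5.0, rate=1.3):
    """Point count at which a split is forced regardless of impurity.
    Any beta(d) >= alpha(d) works; here a constant multiple."""
    return 4 * alpha(d, base, rate)

def should_split(depth, best_gain, best_split_points, leaf_points, min_gain=0.01):
    """Split when the impurity gain is sufficient and alpha(d) estimation
    points have reached the best candidate split, or force a split once
    beta(d) estimation points have arrived at the leaf."""
    return (best_gain > min_gain and best_split_points >= alpha(depth)) \
        or leaf_points >= beta(depth)
```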
We define α(d) = Mcs · (Kdr)^d and β(d) = 4 · α(d), where Mcs is the minimum child size and Kdr is the growth rate with respect to depth.

Figure 4.3: An example with structure and estimation streams where candidate split points are selected at structure points for a single candidate feature. Each data point is assigned a stream, S or E, which determines the set of sufficient statistics that are updated. Only structure points, S, can create a new candidate split point.

Chapter 5

Experiments

In this chapter we present the results of our in-depth empirical study of the random forest variants presented in the previous three chapters. The goal is to learn the properties of different variants and parameter settings, to help practitioners select the optimal configuration for their requirements, and to help theoreticians understand which simplifications required by the theory result in the largest gap from the variants used in practice. While online variants cannot achieve the same predictive accuracy as the offline variants, the gap can be reduced by selecting the right variant and choosing the correct parameters.

We begin by outlining the procedure used in each experiment and the metrics that were gathered. We then introduce the 23 datasets which were used in this study and define the requirements for a variant to be better than, tied with or worse than another variant.

Our first experiments examine the effect of different parameter settings on the predictive accuracy of Breiman's original random forest algorithm. We examine test error with respect to the number of candidate features, with and without bagging, and for different size forests. We show that on almost all datasets bagging does not improve predictive accuracy for the optimal number of candidate features (KF).
This contradicts the commonly held belief that bagging improves Breiman's random forest.

While it is well known that increasing the number of trees in a forest decreases the generalization error, we also show that the optimal number of candidate features (KF) and minimum child size (Mcs) change as the number of trees in the forest increases. However, the optimal parameter setting stabilizes after about 25 trees, which is critical when applying Bayesian optimization [13] with the goal of finding the best parameter setting with the least computation. This was observed by Wang et al. [46] when using Bayesian optimization to find the optimal hyperparameters of a random forest for predicting the body part label of each pixel in a depth image.

We then show the trade-off between reducing overfitting with a higher minimum child size (Mcs) versus increasing the number of trees in the forest. By increasing the number of trees, the optimal Mcs goes to 1 for classification, and while it does not always become 1 for regression, the generalization error becomes much less sensitive to Mcs.

We then expand the study to include the node uniform candidate split point strategy of Geurts et al. [27]. We show that decreasing the number of split points decreases strength and increases diversity. We also show that by increasing the number of split points, node uniform candidate split points converge to the same test error as all gaps midpoint for any KF. For all datasets, all gaps midpoint has higher strength but lower diversity when compared to node uniform with one split point. Depending on whether the individual strength or the diversity term dominates the overall accuracy, one end of the spectrum always has the lowest error. Therefore either all gaps midpoint (Breiman) or node uniform with one split point (extremely randomized trees) is the best configuration.

Another approach for increasing diversity and decreasing training time that we discussed in Section 2.6 is to subsample data points at nodes higher up in the tree.
We empirically confirm that subsampling data points at each node does not affect generalization error for the all gaps midpoint split point strategy. Like the number of trees, the number of data points to sample at each node provides practitioners with a parameter to trade off training time against predictive accuracy.

We then revisit the two new sort-and-walk split point selection strategies and the one new uniform split point strategy proposed in Section 2.4. We begin by examining all gaps uniform and all data points. The hypothesis was that selecting candidate split points at all data points would perform worse, because it would not result in a max margin separation, and that all gaps uniform would inject extra randomness similar to extremely randomized trees. Surprisingly, neither split point selection strategy has much effect on predictive accuracy. This gives hope to the online variants that use data points as split point candidates.

In comparison to node uniform, the global uniform split point strategy has increased test error for the same KF and requires more candidate features to be sampled at each node to achieve the same error. While there is a decrease in performance on most datasets, it shows that using global uniform will not significantly affect the performance of online random forests.

For all experiments up until this point we studied decision trees with axis-aligned splits. We then evaluate features which are linear combinations of other features. This includes the effect of the subspace dimension (Kss) and the number of candidate projections sampled at each node (KF).

We then move on to variants with provable consistency. We compare the predictive accuracy, strength and diversity of Biau2008, Biau2012 and Denil2014.
We show that Denil2014 achieves the best accuracy and that most of the gap between Denil2014 and Breiman is due to splitting data points into structure and estimation streams.

For the online variants, we begin by evaluating using a fixed frontier versus a maximum depth to manage memory. The fixed frontier results in better accuracy, as it is able to grow larger trees. We then investigate the trade-off between selecting split points at data points versus selecting them with global uniform. We then examine Saffari's split criterion of impurity gain and a constant number of data points (Mns + Mim) versus Denil2013's split criterion of impurity gain and a number of data points that grows with depth (α(d) + β(d) + Mim). We show that growing the number of data points required to split as a function of depth does not affect accuracy, whereas online bagging hurts performance and multiple passes through the data improve performance.

By the end of this chapter, practitioners should have a clear understanding of how to train a random forest to achieve the lowest generalization error with the least amount of computation, and theoreticians should understand which of the modifications required to prove consistency have the largest effect on empirical results for real-world datasets.

5.1 Datasets and procedure

For our empirical study we trained a random forest for each variant and parameter setting on each dataset. All parameter settings were defined on a simple grid across the combined space. For each configuration we measured five metrics from the trained forest, the most important being test error, which is 0-1 loss for classification and MSE for regression. Individual strength and correlation are also measured. For classification, strength and correlation are measured as Breiman's strength and correlation, as defined in Section 2.9.2. It is important to note that higher individual strength means lower individual test error and higher correlation means lower diversity.
Therefore, the generalization error goes down as individual strength goes up or as correlation goes down. For regression, we measure individual MSE and ambiguity MSE as defined in Section 2.9.1. Individual MSE is the average MSE of each base model, and ambiguity MSE is the MSE between each base model and the ensemble. The generalization error of the ensemble decreases as individual MSE decreases or as ambiguity MSE increases.

The other three metrics measured for each variant and parameter configuration are training time, test time and the number of nodes in each model. Generally training time is the largest computational cost, but for some real-world applications, where billions of predictions are made per hour, test time can be equally important to the practitioner. The number of nodes in the model represents the complexity of the model and is proportional to the memory requirements.

For each dataset, variant and parameter setting we independently train a forest five times and report the mean and standard deviation of each metric. While five runs is quite small, we decided to trade off certainty in our estimates for being able to run across more variants, datasets and a larger grid of parameter values. For datasets without a training set and test set split, we used k-folds. Table 5.1 outlines the properties of the classification datasets which have train and test sets, while Table 5.2 summarizes the properties of the classification datasets which are evaluated with k-folds. For datasets with a training and test split, the train column is the number of data points in the training set and the test column is the number of data points in the test set. For datasets without a training and test split, the size column is the total number of data points and folds is the number of k-folds. The features column is the number of features in X and classes is the number of classes.
The constant predictor error column is the test error that would result from predicting the most common class for all data points. Table 5.3 outlines the regression datasets with train and test splits and Table 5.4 contains the regression datasets which are evaluated using k-folds. The dimension of Y is the dimension of the response variable, and the constant predictor MSE is the MSE that would result if the average Y across the entire dataset were predicted for every point.

All but one of the datasets are from the UCI or libsvm repositories and have real-valued features. For all sections up until Section 5.2.7, axis-aligned splits are used for these datasets. The datasets were chosen for their diverse domains and because of their availability to other researchers. Due to computational limitations we were not able to run all variants and parameter settings on all datasets. In each subsequent section, we outline which datasets the variants were run on.

5.1.1 Kinect Dataset

The additional dataset is for the challenging computer vision problem of predicting the body part label for each pixel in a depth image, along with offsets to each joint.
Instead of using axis-aligned features, the depth difference features presented in Section 2.8.3 are extracted for candidate offsets.

Data set        Train   Test   Features     Classes  Constant Predictor Error
vowel             528    462         10       11     90.91
dna              2000   1186        180        3     49.16
satimage         4435   2000         36        6     76.95
gisette          6000   1000       5000        2     50.00
usps             7291   2007        256       10     82.11
pendigits        7494   3498         16       10     89.62
letter          15000   5000         16       26     96.30
news20          15935   3993      62060       20     94.99
cifar10         50000  10000       3072       10     90.00
mnist           60000  10000        778       10     88.65
sensit-vehicle  78823  19705        100        3     50.22
kinect         768524  48797 2949081600       20     94.88

Table 5.1: Overview of classification datasets with a train and test split

Data set  Size    Folds  Features  Classes  Constant Predictor Error
webspam   350000  5      254       2        39.37

Table 5.2: Overview of classification datasets which were evaluated with k-folds

Data set     Train   Test   Features    Dim of Y  Constant Predictor MSE
e2006        16087   3308   150358      1         3.827039e-01
kinect-head  40000   2500   2949081600  3         1.863756e+01
ypm          463715  51630  90          1         1.471805e+02

Table 5.3: Overview of regression datasets with a train and test split

Typically, the first step in a joint location pipeline is to predict the body part label of each pixel in the depth image, and the second step is to use the labeled pixels to predict the joint locations [43]. Further refinements to this procedure can predict both the pixel labels and the joint locations simultaneously using a Hough forest, as in Girshick et al. [28]; however, these refinements are well beyond the scope of this thesis.

Since our primary goal is to evaluate classification and regression models rather than to build an end product, we learn each independently.

Data set  Size   Folds  Features  Dim of Y  Constant Predictor MSE
bodyfat   252    5      14        1         3.607580e-04
diabetes  442    5      10        1         5.929887e+03
housing   506    5      13        1         8.441954e+01
abalone   4177   5      8         1         1.039254e+01
wine      6497   5      11        1         7.624701e-01
cadata    20640  5      8         1         1.331540e+10
ct-slice  53500  5      384       1         4.993774e+02

Table 5.4: Overview of regression datasets which were evaluated with k-folds

For classification we learn, for each pixel, a model from the depth image to the body part label. For regression we use depth images and ground truth body part labels to learn a regression model of 3-D offsets from each pixel to an associated joint.

For each joint, we train a regression forest on the pixels of the body parts associated with that joint. For instance, the offsets for the left hand joint would be predicted by the pixels labeled as left hand, left lower arm and left upper arm. Typically these predictions would be post-processed with mean shift to find a more accurate final prediction for the joint location. Instead, we directly report the regression error over pixels to avoid confounding factors in the comparison between the forest models.

Figure 5.1: Left: Depth image with a candidate feature specified by the offsets u and v. Center: Body part labels. Right: Left hand joint predictions (green) made by the appropriate class pixels (blue).

To build our data set, we sample random poses from the CMU mocap dataset¹ and render a pair of 320x240 resolution depth and body part images along with the positions of each joint in the skeleton. The 19 body parts and one background class are represented by 20 unique color identifiers in the body part image.

For this study we generate 2000 poses for training and 500 poses for testing. To create the training set, we sample 20 pixels without replacement from each body part class in each pose.
For the regression problem of predicting the offsets of pixels to neighbouring joints, we sample 40000 pixels without replacement from the body part pixels associated with each joint. Figure 5.1 visualizes the raw depth image, the ground truth body part labels and the votes for the left hand made by all pixels associated with the left hand joint.

The classification dataset listed as kinect in Table 5.1 reports the error for predicting the body part labels, and the regression dataset listed as kinect-head in Table 5.3 reports the error for predicting the offset to the head joint from all body parts near the head. The head is just an example; all other joints are also measured. The listed feature space dimension of 2949081600 is computed by counting all possible pairs of pixels in a 320x240 resolution depth image.

5.1.2 Comparing variants

When comparing variants we adopt a simple definition of whether they are equivalent. We define two variants as equivalent if the standard deviation at the best parameter setting of either variant overlaps the mean of the other variant at its optimal setting. If the standard deviations do not overlap, we declare the better variant the winner. The best parameter setting and the winning variant are at the minimum for metrics where smaller is better and at the maximum for metrics where larger is better.

Figure 5.2 provides two examples for a metric where smaller is better. In the top example the standard deviation of the red line overlaps the standard deviation of the blue line at the minimum; as a result, they are tied. In the bottom example the red and blue variants do not overlap at the minimum, so the red variant is better.
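This equivalence rule can be made precise in a few lines, summarizing each variant's results as (mean, std) pairs over its parameter grid. The sketch below uses illustrative names and assumes a smaller-is-better metric such as test error.

```python
def compare_variants(curve_a, curve_b):
    """Tie rule sketch (smaller is better): take each variant at its
    best (minimum-mean) parameter setting; if either variant's std band
    at its best setting covers the other variant's best mean, the
    variants are tied, otherwise the smaller mean wins."""
    mean_a, std_a = min(curve_a, key=lambda ms: ms[0])
    mean_b, std_b = min(curve_b, key=lambda ms: ms[0])
    gap = abs(mean_a - mean_b)
    if gap <= std_a or gap <= std_b:
        return "tie"
    return "a" if mean_a < mean_b else "b"
```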
While more sophisticated statistical tests could be used to determine whether one variant is significantly better than another, we choose this simple criterion because we believe it is more intuitive.

¹ Data obtained from mocap.cs.cmu.edu

Figure 5.2: Top: The red and blue variants are equal at the minimum because the standard deviation overlaps the mean at each variant's minimum. Bottom: The red variant is better than the blue variant because the standard deviation does not overlap the mean at each variant's minimum.

For each variant in a plot, we optimize the free parameters not included in the plot in order to achieve the lowest test error. In the legend of each plot, all the fixed parameter settings are included, along with the candidate split point strategy and any other design choices.

5.2 Variants used in practice

We now examine the design choices of random forest variants used in practice. For all experiments we search over a grid of values for two parameters: the number of candidate features (KF) and the minimum child size (Mcs). In every plot, the optimal configuration is chosen by minimizing the test error. The optimal setting of KF, Mcs and any other parameters is included in the legend of each plot.

5.2.1 Bagging

Bagging is part of Breiman's original random forest algorithm and is described in Section 2.5. By sampling data points with replacement, each tree is trained on a random subset of the data. This should increase the variance between trees and could potentially decrease the overall generalization error. However, the work of Geurts et al. [27] showed that with node uniform candidate split points, the forest achieves higher accuracy without bagging. We now test all gaps midpoint with and without bagging; this is equivalent to Breiman's original algorithm with and without bagging.

For the datasets that we studied, bagging never significantly improved the test error when selecting candidate splits at all gaps midpoint. For 15 of the 23 datasets bagging actually hurt performance.
See Table 5.5 for a breakdown of the number of datasets where all gaps midpoint with bagging beat all gaps midpoint without bagging. The datasets included in this evaluation were vowel, dna, satimage, gisette, usps, pendigits, letter, news20, cifar10, mnist, sensit-vehicle, webspam, kinect, bodyfat, diabetes, housing, abalone, wine, cadata, e2006, kinect-head, ypm and ct-slice.

                 With bagging  Without bagging
With bagging     -             0/8/15
Without bagging  15/8/0        -

Table 5.5: How often all gaps midpoint with bagging is better than, tied with or worse than all gaps midpoint without bagging. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

Vowel is the only dataset where all gaps midpoint with bagging is better than all gaps midpoint without bagging; however, based on the definition in Section 5.1.2, it is classified as a tie. Figure 5.3 plots test error for different numbers of candidate features (KF). While the variant without bagging does overlap the variant with bagging at the minimum, the variant with bagging has a lower test error over most of the range of KF. Since vowel is the smallest dataset, with only 528 training data points, we suspect that bagging has the most benefit on small datasets with outliers.

Figure 5.3: Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the vowel dataset. The plot is test error with respect to the number of candidate features (KF). Notice that for most settings of KF, the performance with bagging is better. [Legend: All gaps midpoint, Mcs = 3, T = 100; All gaps midpoint + Bagging, Mcs = 2, T = 100]

Figure 5.4 is indicative of a typical test error plot for a classification problem using all gaps midpoint candidate split points with and without bagging. In this scenario all gaps midpoint without bagging does better for most parameter settings. For regression datasets, all gaps midpoint with bagging has the effect of maintaining a relatively constant test error as the number of candidate features increases. However, all gaps midpoint without bagging achieves equivalent test error with a lower number of candidate features, as illustrated in Figure 5.5. See Section A.1 for plots comparing all gaps midpoint with bagging and without bagging across all datasets.

Figure 5.4: Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the letter dataset. The plot is test error with respect to the number of candidate features (KF). This plot is indicative of a typical classification dataset, where all gaps midpoint without bagging is better than all gaps midpoint with bagging for most settings of KF. [Legend: All gaps midpoint, Mcs = 1, T = 100; All gaps midpoint + Bagging, Mcs = 1, T = 100]

Figure 5.5: Comparing all gaps midpoint with bagging versus all gaps midpoint without bagging for the diabetes dataset. The plot is test error with respect to the number of candidate features (KF). With bagging the test error remains relatively flat as the number of candidate features increases, which is a common pattern for regression datasets. [Legend: All gaps midpoint, Mcs = 7, T = 100; All gaps midpoint + Bagging, Mcs = 8, T = 100]

Since bagging does not significantly improve generalization error for any dataset and actually hurts generalization error on many datasets, we ignore bagging for the rest of this thesis.
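For reference, the resampling step that bagging adds (Section 2.5) is simply drawing n training indices with replacement for each tree; each bootstrap sample then contains only about 1 − 1/e ≈ 63% of the distinct training points, which weakens the individual trees. A minimal sketch:

```python
import random

def bootstrap_indices(n, rng=random):
    """Bagging's per-tree resampling: n indices drawn with replacement."""
    return [rng.randrange(n) for _ in range(n)]
```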
It is quite surprising that bagging does not help on more datasets, as it is part of Breiman's original algorithm and is enabled by default in most implementations of random forests.

5.2.2 Selecting number of candidate features

Varying the number of candidate features (KF) is the primary mechanism for injecting randomness to increase diversity and decrease generalization error. The overly simplified explanation is that sampling fewer candidate features results in weaker but more diverse trees, while sampling more candidate features results in stronger but less diverse trees. Somewhere between the two extremes is a sweet spot where the overall generalization error is lowest. Figure 5.6 plots test error, Breiman strength and Breiman correlation for the satimage dataset and test MSE, individual MSE and ambiguity MSE for the housing dataset. For the satimage dataset, the strength and correlation both increase as the number of candidate features increases. For the housing dataset, the individual MSE and ambiguity MSE both decrease as the number of candidate features increases. Recall from Section 2.9.1 that the test MSE is just the ambiguity MSE subtracted from the individual MSE.

Figure 5.6: A classification and regression example for balancing individual error with diversity to achieve the lowest test error. Top: Test error, Breiman strength and Breiman correlation for the satimage dataset. As the number of candidate features (KF) increases, the strength and correlation of each tree also increase. Bottom: Test MSE, individual MSE and ambiguity MSE for the housing dataset. As the number of candidate features (KF) increases, the individual MSE and ambiguity MSE both decrease.
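The identity recalled above (test MSE equals individual MSE minus ambiguity MSE) can be verified numerically on a toy ensemble. The function below is our own illustration, not thesis code; the predictions and targets are made up.

```python
# Ambiguity decomposition for a regression ensemble:
# ensemble MSE = mean individual MSE - mean ambiguity.
def ensemble_decomposition(preds_per_tree, y):
    """Return (ensemble MSE, mean individual MSE, mean ambiguity)."""
    T, n = len(preds_per_tree), len(y)
    ens = [sum(p[i] for p in preds_per_tree) / T for i in range(n)]
    ensemble_mse = sum((ens[i] - y[i]) ** 2 for i in range(n)) / n
    individual = sum((p[i] - y[i]) ** 2 for p in preds_per_tree for i in range(n)) / (T * n)
    ambiguity = sum((p[i] - ens[i]) ** 2 for p in preds_per_tree for i in range(n)) / (T * n)
    return ensemble_mse, individual, ambiguity

preds = [[1.0, 2.0, 0.5], [1.5, 1.0, 0.0], [0.5, 1.5, 1.0]]  # three trees
targets = [0.9, 1.4, 0.6]
mse, ind, amb = ensemble_decomposition(preds, targets)
# The identity mse == ind - amb holds exactly (up to float rounding),
# so more diverse trees (larger ambiguity) lower the ensemble error
# for the same individual error.
```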
While the metrics of individual error and diversity are inverted between classification and regression, they are telling the same story.

Figure 5.7: Test MSE, individual MSE and ambiguity MSE for the abalone dataset. Unlike the typical scenario, the ambiguity MSE actually increases with respect to the number of candidate features for KF < 3 and the individual MSE decreases with respect to the number of candidate features for KF > 3. This indicates a more complex relationship between the number of candidate features and the generalization error for some datasets.

While this simplified story holds true for some datasets such as satimage and housing, there exist datasets where increasing the number of candidate features actually increases the individual error of each tree after some critical point. In addition, the diversity can also decrease for lower numbers of candidate features. With fewer candidate features, the split criterion is less likely to be met and the resulting trees are shallower and less diverse. Figure 5.7 shows that for the abalone dataset, the individual MSE increases for higher numbers of candidate features and the ambiguity MSE decreases for fewer candidate features.
This demonstrates a more complex relationship between the number of candidate features to sample at each node and the generalization error.

Figure 5.8: The test error for the letter dataset for different forest sizes (T). Notice that the best number of features to sample at each node decreases as the forest size increases, up until 25 trees.

Using grid search or Bayesian optimization to find the best number of candidate features to sample at each node is computationally expensive. Ideally, we would like to search for the best parameter setting with a smaller forest size and then train a larger forest from the optimal parameter setting found with the smaller forest. Unfortunately, the reduction in test error due to diversity requires a minimum forest size to find the optimal parameter setting. While dataset dependent, we found that a forest with 25 trees finds close to the same optimal number of candidate features to sample at each node as a larger forest with 500 trees. Figure 5.8 plots the test error for different forest sizes for the letter dataset. The optimum shifts to the left as the forest size increases, up until a size of 25 trees. While the test error decreases after 25 trees, the optimal KF does not continue to shift significantly. For plots of the rest of the datasets, refer to Section A.2.

It is also worth noting that the decrease in generalization error diminishes exponentially as the forest size increases, while the training time and test time increase linearly. In other words, the vast majority of the decrease in the test error is achieved by increasing the number of trees from 1 to 25. The improvement from going from 25 to 500 trees is much smaller.

5.2.3 Selecting minimum child size

Criminisi et al. [18] suggest two ways to reduce overfitting.
The first is to increase the number of trees in the forest. The second is to increase the minimum child size (Mcs) required to split a node. Increasing the minimum child size will result in shallower trees with more data points in each leaf. Since both of these strategies reduce overfitting, we now look at the effect of the size of a forest on the optimal minimum child size. Recall that the default minimum child size is one for classification and five for regression. The wine plot in Figure 5.9 shows that as the number of trees in the forest increases, the optimal minimum child size decreases until reaching one.

For regression datasets, such as the diabetes plot in Figure 5.9, even when a higher minimum child size achieves better generalization error, the sensitivity of the test error to the minimum child size decreases as the number of trees in the forest increases. Therefore, rather than using grid search or Bayesian optimization to select the optimal minimum child size, our experiments show that for the same training time a lower generalization error will be achieved by selecting a smaller minimum child size and growing a larger forest.

5.2.4 All gaps midpoint versus node uniform candidate split points

So far we have focused on selecting candidate split points with all gaps midpoint. We now compare the all gaps midpoint candidate split point strategy with the node uniform candidate split point strategy. Other than bagging, this is the difference between Breiman's original random forest algorithm and Geurts' extremely randomized trees. For node uniform candidate split points, additional randomness can be injected by sampling a smaller number of candidate split points, as we described in Section 2.4.2. The most random strategy, which creates the most diversity across the forest, is sampling a single candidate split point for each candidate feature (KS = 1). However, this also creates weaker individual trees.
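The two strategies being compared can be sketched as follows. This is a minimal illustration under our own naming, not the experimental code: all gaps midpoint enumerates the midpoint of every gap between consecutive sorted unique feature values at the node, while node uniform samples KS thresholds uniformly from the node's value range.

```python
import random

def all_gaps_midpoints(values):
    """Candidate splits at the midpoint of every gap between sorted unique values."""
    v = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(v, v[1:])]

def node_uniform(values, ks, rng=None):
    """Sample ks candidate split thresholds uniformly from the node's value range."""
    rng = rng or random.Random(1)
    lo, hi = min(values), max(values)
    return [rng.uniform(lo, hi) for _ in range(ks)]

feature = [0.2, 0.9, 0.4, 0.4, 0.7]          # values of one feature at a node
mids = all_gaps_midpoints(feature)           # midpoints near 0.3, 0.55, 0.8
rand = node_uniform(feature, ks=1)           # one random threshold in [0.2, 0.9]
```

With KS = 1 each node considers a single random threshold per candidate feature, maximizing diversity at the cost of individual tree strength.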
As demonstrated in Figure 5.10, increasing the number of split points for node uniform results in the test error, individual accuracy and correlation converging to the same results as all gaps midpoint. For node uniform, it is reasonable to expect that there would be an optimal number of split points that balances individual strength with diversity. However, for the datasets we evaluated the best generalization error is achieved either with all gaps midpoint or with node uniform with KS = 1. The satimage plots in Figure 5.11 demonstrate an example where all gaps midpoint produces a lower test error in comparison to node uniform with KS = 1. Notice that the ordering of strength and correlation is still the same as in Figure 5.10. The remainder of the plots for all of the datasets can be seen in Section A.4.

Figure 5.9: Top: The test error for the wine dataset with respect to minimum child size (Mcs) for different forest sizes. As the forest size increases, the slope of the line increases until a minimum child size of one achieves the lowest error. Bottom: The test error for the diabetes dataset with respect to minimum child size (Mcs) for different forest sizes. As the forest size increases, the slope of the line approaches zero, showing that the sensitivity of the test error with respect to the minimum child size decreases as the number of trees increases.

Figure 5.10: All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the letter dataset. As the number of split points increases, node uniform converges to all gaps midpoint. For this dataset the extra diversity created by node uniform with KS = 1 outweighs the loss of individual strength and results in a lower overall test error. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.

Figure 5.11: All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the satimage dataset. As the number of split points increases, node uniform converges to all gaps midpoint. For this dataset the diversity created by node uniform with KS = 1 does not outweigh the loss of individual strength and the overall test error is higher for node uniform with KS = 1. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.

Another distinction between all gaps midpoint and node uniform is that node uniform with a small number of split points produces larger models.
Because the pairs of candidate features and split points are more random, more nodes are required to reach pure leaves. Recall from Section 2.7 that the computation time for training a node uniform decision tree is linear in the number of split points. Therefore, as the number of split points surpasses the log(N) term associated with sorting the data points for all gaps midpoint, the training time of the node uniform strategy becomes prohibitively expensive. Since the test time is proportional to the model size, it is higher for node uniform with fewer split points. Figure 5.12 plots the model size, test time and training time for the satimage dataset and Figure 5.13 plots the same for the usps dataset. For the usps dataset node uniform with KS = 1 has a lower training time in comparison to all gaps midpoint, whereas all gaps midpoint has a lower training time for the satimage dataset.

                              All gaps midpoint    Node uniform with KS = 1
  All gaps midpoint           -                    8/9/6
  Node uniform with KS = 1    6/9/8                -

Table 5.6: How often all gaps midpoint is better, tied or worse than node uniform with KS = 1 with respect to test error. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

                              All gaps midpoint    Node uniform with KS = 1
  All gaps midpoint           -                    7/5/11
  Node uniform with KS = 1    11/5/7               -

Table 5.7: How often all gaps midpoint is better, tied or worse than node uniform with KS = 1 with respect to training time. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

Since it appears that either all gaps midpoint or node uniform with KS = 1 always achieves the best generalization error, it is not necessary to evaluate node uniform with more than one split point.
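The per-node training cost comparison can be made concrete with a toy cost model. This is our own simplification of the Section 2.7 analysis, not a measured result: all gaps midpoint pays a sort (N log N per feature), while node uniform pays one linear scan per (feature, split point) pair.

```python
import math

def cost_all_gaps_midpoint(n, kf):
    """Toy per-node cost: sort n points for each of kf candidate features."""
    return kf * n * math.log2(max(n, 2))

def cost_node_uniform(n, kf, ks):
    """Toy per-node cost: one linear scan per (feature, split point) pair."""
    return kf * ks * n

# With ks = 1, node uniform avoids the log(n) sorting factor entirely;
# once ks exceeds log2(n), node uniform becomes the more expensive strategy.
cheap = cost_node_uniform(1000, 10, 1)
expensive = cost_node_uniform(1000, 10, 100)
baseline = cost_all_gaps_midpoint(1000, 10)
```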
Only evaluating all gaps midpoint and node uniform with KS = 1 is also advantageous to practitioners, as node uniform with KS = 1 is faster to train than node uniform with KS > 1. We now compare the number of times all gaps midpoint beats, ties and loses to node uniform with KS = 1 in terms of test error and training time for the datasets vowel, dna, satimage, gisette, usps, pendigits, letter, news20, cifar10, mnist, sensit-vehicle, webspam, kinect, bodyfat, diabetes, housing, abalone, wine, cadata, e2006, kinect-head, ypm and ct-slice. See Table 5.6 for the test error comparison and Table 5.7 for the training time comparison.

Figure 5.12: All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the satimage dataset. Node uniform with KS = 1 produces the largest model and all gaps midpoint produces the smallest. The test times are correlated with model size, so forests trained with node uniform have higher test times. For this dataset all gaps midpoint has the lowest training time. Top: Model size with respect to KF. Middle: Test time with respect to KF. Bottom: Train time with respect to KF.

Figure 5.13: All gaps midpoint versus node uniform with 1, 5, 25 and 100 candidate split points for the usps dataset. Node uniform with KS = 1 produces the largest model and all gaps midpoint produces the smallest. The test times are correlated with model size, so forests trained with node uniform have higher test times. For this dataset node uniform with KS = 1 has the lowest training time. Top: Model size with respect to KF. Middle: Test time with respect to KF. Bottom: Train time with respect to KF.
The general trend was that node uniform with KS = 1 required less training time for larger datasets and more training time for smaller datasets. See Section A.5 for the plots for all datasets.

5.2.5 Subsample data points at node

Another approach for increasing diversity and decreasing training time, which we discussed in Section 2.6, is to subsample data points at each node. In this section we empirically show that subsampling data points at each node increases diversity, decreases individual strength and maintains similar test error as long as the number of data points to sample is moderately large. At each node, we subsample Kss data points without replacement. If there are fewer than Kss data points at a node, all data points are used. Figure 5.14 demonstrates that for the mnist dataset, sampling 100 data points per node does not have a negative effect on test error; however, sampling only 10 data points does have a negative effect on accuracy.

               Kss = 10    Kss = 100    Kss = 1000    Kss = ∞
  Kss = 10     -           0/1/13       1/0/13        1/0/13
  Kss = 100    13/1/0      -            0/11/3        1/11/2
  Kss = 1000   13/1/0      3/11/0       -             0/14/0
  Kss = ∞      13/0/1      2/11/1      0/14/0        -

Table 5.8: Comparison of subsampling 10, 100, 1000 and ∞ data points at each node with respect to test error. On most datasets subsampling 100 data points achieves similar results to using all data points. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

In Table 5.8 we compare the number of times subsampling, with different constants, beats, ties or loses to the other subsampling strategies for the vowel, dna, satimage, gisette, usps, pendigits, letter, mnist, sensit-vehicle, bodyfat, diabetes, housing, abalone and wine datasets.
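The subsampling rule described above is simple enough to sketch directly. This is a minimal illustration with our own names, not the experimental code: at each node, split statistics are computed on at most Kss points drawn without replacement.

```python
import random

def subsample_at_node(indices, kss, rng=None):
    """Keep at most kss data point indices, sampled without replacement.
    If the node holds kss points or fewer, all of them are used."""
    rng = rng or random.Random(0)
    if len(indices) <= kss:
        return list(indices)
    return rng.sample(indices, kss)

at_node = list(range(5000))
used = subsample_at_node(at_node, kss=100)           # split statistics use 100 points
small = subsample_at_node(list(range(50)), kss=100)  # fewer than kss: keep all 50
```

Because only split selection uses the subsample (the chosen split is then applied to all points at the node), training time per node drops while the trees remain grown on the full dataset.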
The number of data points to sample at each node provides practitioners with a parameter to trade off training time with predictive accuracy. The most encouraging result is that even with a reasonably small number of samples, Kss = 100, subsampling achieves equivalent results to using all data points for 11 of the 14 datasets.

Figure 5.14: Subsampling 10, 100 and 1000 data points with all gaps midpoint candidate split points for the mnist dataset. With Kss = 100 the individual strength does drop; however, the correlation also drops, which results in an overall test error that is comparable to using all the data points. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.

5.2.6 Other split point strategies

In Section 2.4 we presented five candidate split point selection strategies. We now compare them on the vowel, dna, satimage, gisette, usps, pendigits, letter, mnist, sensit-vehicle, bodyfat, diabetes, housing, abalone and wine datasets.
For the sort and walk strategies, which include all data points, all gaps midpoint and all gaps uniform, we expected all gaps uniform to achieve higher diversity without much loss of individual strength in comparison to all gaps midpoint. Surprisingly, using all gaps uniform instead of all gaps midpoint did not significantly affect accuracy. Neither did selecting candidate split points at all data points, which is promising for online variants that sample data points as candidate split points. Refer to Section A.7 for all plots comparing the candidate split point strategies: all data points, all gaps midpoint and all gaps uniform.

Figure 5.15: The test error with respect to the number of candidate features to sample at each node (KF) for the letter dataset for node uniform and global uniform candidate split points. For 100 split points (KS = 100) global uniform does not affect performance; however, for 1 split point (KS = 1) global uniform appears to be shifted right, so that for the same number of candidate features it performs equally or worse.

The remaining two candidate split point strategies presented in Section 2.4 are node uniform and global uniform. Our experiments show that with a large number of split points the two strategies are equivalent. With only one split point (KS = 1) the error with respect to the number of candidate features (KF) for global uniform candidate split points appears to be the same as node uniform but shifted to the right. In other words, the same error can be achieved by sampling more candidate features (KF). See Figure 5.15 for an example with the letter dataset and refer to Section A.8 for all plots comparing the candidate split point strategies: node uniform and global uniform.
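One intuition for the rightward shift of global uniform with KS = 1 can be shown with a short sketch (our own illustration, not thesis code): global uniform draws thresholds from the feature's range over the whole training set, so at deep nodes, which only see a narrow slice of that range, many globally sampled thresholds fail to separate any data points, and more candidate features are needed to find a useful split.

```python
import random

def global_uniform_thresholds(feature_min, feature_max, ks, rng=None):
    """Thresholds drawn from the feature's global range, ignoring the node."""
    rng = rng or random.Random(7)
    return [rng.uniform(feature_min, feature_max) for _ in range(ks)]

# A deep node may only see values in a narrow slice of the global range,
# so most globally sampled thresholds do not separate its data points.
node_values = [0.40, 0.45, 0.55]                       # values reaching a deep node
thresholds = global_uniform_thresholds(0.0, 1.0, ks=20)
useful = [t for t in thresholds if min(node_values) < t < max(node_values)]
```

Node uniform avoids this waste by redrawing thresholds from each node's own range.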
The fact that global uniform can achieve similar results to node uniform is positive for online variants that use global uniform to sample candidate split points.

5.2.7 Linear combination features

In this section we evaluate features which are linear combinations of other features. These include the sparse random projections described in Section 2.8.1 and the class difference projections described in Section 2.8.2. For these experiments we evaluated all gaps midpoint, still without bagging, and node uniform with one split point (KS = 1), each with sparse random projections and class difference projections. We evaluate three different numbers of candidate features, KF = {5, 25, 100}, as well as three different subspace sizes. The subspace sizes we evaluate are 0.2, 0.5 and 1.0 of the total number of dimensions in the feature space. If there are more than 100 dimensions, we restrict the subspace sizes to KP = {20, 50, 100}.  The datasets that we evaluated in this study were vowel, dna, satimage, gisette, usps, pendigits, letter, mnist, sensit-vehicle, bodyfat, diabetes, housing, abalone and wine.

There are several trends that hold for both types of projections. We begin with the observation that increasing the number of candidate features to try at each node always improves performance. It appears that the extra randomness generated by the linear projections ensures that increasing the number of candidate features does not result in trees which are too correlated. However, for the same number of candidate features or fewer, regular axis aligned splits outperform either random or class difference projections. For the same computational cost, axis aligned splits can perform better. If computational efficiency is not a requirement, then with enough candidate features random projections and class difference projections can outperform regular axis aligned splits on some datasets. Figure 5.16 illustrates this for the usps dataset.
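A projection feature of the kind evaluated here can be sketched as follows. This is a hedged illustration under our own naming, and the exact weighting scheme of Section 2.8.1 may differ: a sparse random projection picks a small subspace of KP dimensions and combines them with random ±1 weights.

```python
import random

def sparse_random_projection(num_dims, kp, rng=None):
    """Build a projection feature: kp randomly chosen dimensions
    combined with random +/-1 weights (a simple sparse scheme)."""
    rng = rng or random.Random(3)
    dims = rng.sample(range(num_dims), kp)
    weights = [rng.choice((-1.0, 1.0)) for _ in dims]
    def feature(x):
        return sum(w * x[d] for d, w in zip(dims, weights))
    return feature

proj = sparse_random_projection(num_dims=10, kp=3)
value = proj([0.1 * i for i in range(10)])  # evaluate the feature on one point
```

Each candidate feature drawn at a node is then such a projection rather than a single axis, which is why the extra randomness keeps the trees decorrelated even for large KF.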
The only exception to class difference projections outperforming random projections was the letter dataset. Whether class difference projections or random projections have lower generalization error is dictated by the strength and correlation of each tree. Across all datasets class difference projections have much higher individual strength but also higher correlation, while random projections have lower strength and lower correlation. Figure 5.17 illustrates this trend for the usps dataset.

Figure 5.16: Test error with respect to KF for random projections, class difference projections and axis aligned features for the usps dataset. Class difference projections do better than random projections, and at KF = 100 node uniform with class difference projections begins to outperform axis aligned node uniform.

Another common trend was that increasing the subspace size tended to hurt performance. With some exceptions, such as pendigits performing better with KP = 8 rather than KP = 3, the smallest tested subspace tended to perform the best. On most datasets, increasing the size of the subspace decreases the individual strength of each tree. With the exception of the pendigits dataset shown in Figure 5.18, class difference projections and random projections do not significantly improve generalization error. This supports practitioners' decision to use axis aligned splits.
It also provides some insight into why it is so hard to improve the random forest algorithm. Alterations that tend to increase individual strength also tend to increase correlation, resulting in the same overall generalization error.

Figure 5.17: The strength and correlation for the usps dataset with respect to KF for random projections, class difference projections and axis aligned features. Class difference projections have much higher individual strength but also higher correlation, while random projections have lower strength and lower correlation.
This trend holds for all datasets andthe best generalization error depends on the interaction between the two.Top: Breiman strength with respect to KF Bottom: Breiman correlationwith respect to KF .7420 40 60 80 100Number of candidate features (KF )23456TesterrorAll gaps midpoint, Mcs= 2, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100All gaps midpoint + random projections, KP= 8All gaps midpoint + class diff projections, KP= 8Node uniform + class diff projections, KP= 8Figure 5.18: Test error with respect to KF for random projections, classdifference projections and axis aligned features for the pendigits dataset.For this dataset, both class difference projections and random projectionsoutperform standard axis aligned features.5.3 Variants with provable consistencyIn this section we begin by comparing the variants with provable consistencythat were introduced in Chapter 3 in terms of test error, individual strengthand diversity. In all aspects Denil2014 is closest to standard offline variantof all gaps midpoint (Breiman). We then break down two design choices ofDenil2014. The first is sampling the number of candidate features at eachnode from a Poisson distribution. The second is splitting data points intostructure and estimation streams. We find that most of the performancegap between Denil2014 and Breiman is attributed to the stream splittingmechanic required by the theory.Relative to the different candidate split point strategies of the previ-ous chapter, the variants being compared in this section have additional orslightly different parameters. Denil2014 has a parameter, Knr, to controlthe number of data points that are sampled to determine the bounds tosearch for the best split point. For all experiments we let Knr = 1000 as wefound that the generalization error is not sensitive to Knr.Biau2008 and Biau2012 do not have minimum child size as a split cri-teria. Instead they have a parameter to control the total number of leafs(ML). 
This plays the same role as minimum child size, as it controls the size of the tree. For these experiments we set the total number of leaves to ML = (N/1, N/2, N/3, N/4, N/5, N/6, N/7, N/8) to match the minimum child sizes of 1, 2, 3, 4, 5, 6, 7, 8 that are evaluated for the different split point strategies of the previous section as well as Denil2014.

Because Biau2012 requires all data to be in a fixed size hypercube, we center and scale all features to be in the range [-1.0, 1.0] for Biau2012.

5.3.1 Comparison of Biau2008, Biau2012 and Denil2014

For this comparison of Biau2008, Biau2012, Denil2014 and Breiman (all gaps midpoint) we evaluate the vowel, dna, satimage, gisette, usps, pendigits, letter, bodyfat, diabetes, housing, abalone and wine datasets. On most datasets the ordering from best to worst is Breiman, Denil2014, Biau2008 and Biau2012, as depicted in Figure 5.19. Biau2008 creates very diverse base models because each split is selected randomly. Biau2012 normally has the lowest individual strength because it always splits at the range midpoint; however, in some cases Biau2012 does better than Denil2014, as depicted in Figure 5.20.

Table 5.9 summarizes how often each variant wins, ties or loses to the other variants. It is reasonably clear that Denil2014 is second best after Breiman (all gaps midpoint), as it has the most ties with Breiman and loses to Biau2012 and Biau2008 once each.

              Breiman    Denil2014    Biau2012    Biau2008
  Breiman     -          8/4/0        11/1/0      9/3/0
  Denil2014   0/4/8      -            8/3/1       7/4/1
  Biau2012    0/1/11     1/3/8        -           3/2/7
  Biau2008    0/3/9      1/4/7        7/2/3       -

Table 5.9: How Breiman (all gaps midpoint), Denil2014, Biau2012 and Biau2008 compare to each other with respect to test error. Denil2014 is second best after Breiman.
The three numbers in each cell are the number of times the row beat/tied/lost to the column.

Figure 5.19: Comparison of Biau2008, Biau2012, Denil2014 and Breiman (all gaps midpoint) for the satimage dataset. This dataset follows the typical ordering. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.

Figure 5.20: Comparison of Biau2008, Biau2012, Denil2014 and Breiman (all gaps midpoint) for the gisette dataset. For this dataset Biau2012 outperforms Denil2014. Top: Test error with respect to KF. Middle: Individual strength with respect to KF. Bottom: Breiman correlation with respect to KF.

5.3.2 Denil2014 design choices

One of the requirements for proving consistency for Denil2014 is to sample the number of candidate features at each node from a Poisson distribution. In order not to conflate different factors, we now compare standard all gaps midpoint with all gaps midpoint where the number of candidate features is sampled from a Poisson distribution. Figure 5.21 demonstrates that they are equivalent for the letter dataset, and the same holds for the vowel, dna, satimage, gisette, usps, pendigits, bodyfat, diabetes, housing, abalone and wine datasets.

Figure 5.21: All gaps midpoint compared with all gaps midpoint where the number of candidate features is sampled from a Poisson distribution for the letter dataset. Sampling the number of candidate features from a Poisson smooths test error relative to KF near the ends of the range, but near the minimum they are equivalent.

The most important requirement of Denil2014 is splitting data points into structure and estimation streams. A data point is either used to construct the tree or to compute the estimates of the leaf predictors. Denil2014 extends Biau2012 to support stream splitting per tree versus per forest. Unfortunately, our experiments across the vowel, dna, satimage, gisette, usps, pendigits, letter, bodyfat, diabetes, housing, abalone and wine datasets show that splitting data points per tree versus for the entire forest does not significantly affect performance on most datasets. However, changing Denil2014 to use a single stream, where all data points are used for both structure and estimation, results in the test error being equivalent to standard all gaps midpoint for all datasets that we tested.
Figure 5.22 shows this for the usps dataset.

Figure 5.22: Comparison of Denil2014 with structure and estimation streams per tree, Denil2014 with one stream and standard all gaps midpoint for the usps dataset. All gaps midpoint and Denil2014 with one stream are equivalent.

To conclude, the design choice that results in the largest gap between Denil2014 and the best practical variant, all gaps midpoint, is splitting the stream into structure and estimation points. Therefore, in order to achieve similar performance, theoreticians need to find a way to relax the two stream requirement.

5.4 Online variants

In this section we empirically evaluate five design choices for online random forest variants:

1. Memory management: fixed depth versus fixed frontier,
2. Candidate split points: at data points versus global uniform,
3. Bagging: whether to use Poisson online bagging,
4. Split criteria: Mns + Mim versus α(d) + β(d) + Mim, and
5. Stream splitting: whether to split data points into structure and estimation streams.

The algorithm presented in Saffari et al. [40] uses a fixed depth, samples candidate split points with global uniform, uses online bagging, uses the Mns + Mim split criteria and does not split data points into structure and estimation streams. The algorithm presented in Denil2013 [20] uses a fixed frontier, samples split points at the first KS data points, uses the α(d) + β(d) + Mim split criteria and splits data points into structure and estimation streams. Denil2013 also samples the number of candidate features, KF, from a Poisson distribution just like Denil2014. In the online setting, sampling the number of candidate features from a Poisson has no effect on generalization error and we refer the reader to the offline study in Section 5.3.2.
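The Poisson sampling of the number of candidate features can be sketched as follows. This is a minimal sketch, not the thesis implementation: the function name and the clamping of out-of-range draws to [1, num_features] are our assumptions.

```python
import math
import random

def sample_num_candidate_features(mean_kf, num_features, rng=random):
    """Sample K_F ~ Poisson(mean_kf), clamped to the valid range [1, num_features].

    Uses Knuth's multiplicative Poisson sampler; a library call such as
    numpy's Generator.poisson would do the same job.
    """
    threshold = math.exp(-mean_kf)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            break
        k += 1
    # Clamp so at least one and at most all features are considered (our assumption).
    return min(max(k, 1), num_features)
```

Sampling KF this way, rather than fixing it, is what the consistency proof requires; the experiments above suggest it changes test error very little near the optimum.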
Some of the Denil2013 design decisions are primarily made to improve empirical performance, while others are made to allow existing proof techniques to be used to prove consistency.

The methodology adopted over the next five sections is to begin with Saffari and compare each design choice against the competing Denil2013 choice. The configuration that achieves the lowest test error across datasets is then carried forward into subsequent sections. By the end we show that the only alteration required by the theory that consistently hurts generalization error is splitting data points into structure and estimation streams. The rest of the design choices outlined in Denil et al. [20] either improve or do not affect generalization error.

We conclude our study of online variants by examining the effect of doing multiple passes through the dataset, as well as providing recommendations to practitioners. All experiments in this section were evaluated on the vowel, dna, satimage, gisette, usps, pendigits, letter, bodyfat, diabetes, housing, abalone and wine datasets.

5.4.1 Memory management: Fixed depth versus fixed frontier

In Section 4.3 we study our most significant contribution to online random forests, which is to constrain the memory requirements of an online forest by collecting split statistics for only a subset of leaves. We now present the performance improvement from replacing Saffari's memory management technique of using a fixed depth with using a fixed frontier. All other design choices remain fixed and optimal parameters are chosen using cross validation. In Figure 5.23 we compare a fixed depth of 8 with a fixed frontier of 10000. With a max depth of 8 there are at most 2^7 = 128 leaves per tree that are collecting statistics, and with 100 trees this results in a maximum of 12,800 leaves collecting split statistics. Whereas with the fixed frontier there is a maximum of 10,000 leaves collecting split statistics at any one time, and there is no limit on the depth of the trees.
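The two memory budgets above follow directly from the quoted settings; a quick sketch of the arithmetic (assuming the thesis's depth convention, under which a tree of max depth 8 carries 2^7 statistics-collecting leaves):

```python
# Leaf-statistics budget under the two memory-management schemes.
max_depth = 8
trees = 100
leaves_per_tree = 2 ** (max_depth - 1)        # 128 leaves per tree at the frontier
fixed_depth_budget = trees * leaves_per_tree  # 12,800 leaves across the forest
fixed_frontier_budget = 10_000                # single forest-wide cap, any depth
```

The key difference is that the fixed frontier spends its (smaller) budget wherever splits are most promising, instead of uniformly across every tree's bottom level.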
Figure 5.23: Comparison between fixed depth, fixed frontier and offline all gaps midpoint for pendigits. Managing memory with a fixed frontier achieves significantly lower test error.

Depending on the size of the trees that are grown, using a fixed frontier either ties or beats using a fixed depth. The larger the dataset, the better fixed frontier does. See Section A.14 for the plots for the vowel, dna, satimage, gisette, usps, pendigits, letter, bodyfat, diabetes, housing, abalone and wine datasets.

5.4.2 Split points: at data points versus global uniform

In Section 2.4 we evaluated different split point strategies for offline random forests. We showed that selecting all data points as candidate split points achieves equivalent test error to sampling all gaps midpoint. Recall that sampling candidate split points with global uniform does hurt performance for a low number of candidate features; however, for a large number of candidate features the results are much closer to node uniform.

For global uniform candidate split points we assume that each feature's range is known a priori. In practice it would be straightforward to maintain an estimate of each feature's range as new data points arrive. Denil2013 [20] selects candidate split points for each feature by using the first KS data points to arrive at the node. We now evaluate the two candidate split point strategies while using a fixed frontier, online bagging and the Mns + Mim split criteria.
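The two candidate split point strategies just described can be sketched as follows. This is a minimal sketch under our own naming, not the thesis implementation; in particular, the assumption of a known feature range for global uniform is exactly the assumption discussed above.

```python
import random

def global_uniform_candidates(feature_min, feature_max, num_candidates, rng=random):
    # Sample candidate split points uniformly from the feature's known
    # (or incrementally estimated) range.
    return [rng.uniform(feature_min, feature_max) for _ in range(num_candidates)]

def at_data_point_candidates(stream, num_candidates):
    # Use the feature values of the first K_S data points to reach the node
    # as the candidate split points (the Denil2013 strategy).
    return [x for _, x in zip(range(num_candidates), stream)]
```

The at-data-points strategy needs no range information at all, which is part of why it transfers naturally to the online setting.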
All other parameters are optimized with cross validation.

Figure 5.24: Candidate split points at the first KS data points versus global uniform for the letter dataset. The other design choices are constrained to a fixed frontier, online bagging and the Mns + Mim split criteria. Offline all gaps midpoint is included on the plot as a baseline. In this example using the first KS data points as candidate split points outperforms global uniform. This represents the typical scenario.

Table 5.10 summarizes how often the two variants beat, tie and lose to each other. Using the first KS data points as candidate split points does as well as or better than global uniform candidate split points across all datasets, with the exception of satimage, which is presented in Figure 5.25.

                  At data points   Global uniform
At data points          -              6/5/1
Global uniform        1/5/6              -

Table 5.10: How often selecting candidate split points at the first KS data points is better, tied or worse than sampling candidate split points with global uniform for online random forests. The rest of the design choices are fixed to a fixed frontier, online bagging and the Mns + Mim split criteria. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

Figure 5.25: Candidate split points at the first KS data points versus global uniform for the satimage dataset.
The other design choices are constrained to a fixed frontier with online bagging, and offline all gaps midpoint is included on the plot as a baseline. In this example global uniform outperforms using the first KS data points.

Figure 5.26: Error relative to the number of candidate features for online with (red) and without (blue) bagging on the gisette dataset. Offline all gaps midpoint (green) is also included as a baseline. This demonstrates that online bagging hurts performance, which holds across most datasets.

5.4.3 Online bagging

We now compare online random forests with and without the Poisson online bagging that we introduced in Section 4.2. For each data point a weight, w, is sampled from a Poisson distribution for each tree. The tree is updated with the data point w times. This injects extra randomness and converges to the same model as offline bagging. Just like offline bagging, each tree in the model uses fewer data points because for some data points w is zero. Table 5.11 shows that for the twelve datasets that we evaluated, online bagging hurts performance for five datasets and never improves performance. Figure 5.26 is a typical example, but the plots for all the datasets can be found in Section A.16.

                         With online bagging   Without online bagging
With online bagging              -                    0/7/5
Without online bagging         5/7/0                    -

Table 5.11: A comparison of online forests with and without online bagging. The rest of the design choices are a fixed frontier with the first KS data points as candidate split points.
The three numbers in each cell are the number of times the row beat/tied/lost to the column.

5.4.4 Split criteria: Mns + Mim versus α(d) + β(d) + Mim

To evaluate the split criteria strategies, we continue from the last section with a fixed frontier and selecting the first KS data points as candidate split points. We also sample the number of candidate features at each node from a Poisson distribution, as this is shown to have little effect on prediction accuracy and is required by the theory. With all of these design choices fixed, we compare a variant with the Mns + Mim split criteria to a variant with the α(d) + β(d) + Mim split criteria.

The split criteria Mns + Mim has two parameters: the number of data points required to split (Mns) and a minimum impurity (Mim). Once both of these conditions are met, the node will split. For these experiments we evaluated Mns = {5, 25, 100} and Mim = {0.0, 0.01, 0.1} for classification. For regression we let Mim be 0, 0.01 or 0.1 times the constant errors presented in Section 5.1.

The split criteria α(d) + β(d) + Mim requires the number of data points needed to split a node to grow with the depth of the tree. While it was developed to prove consistency for Denil2013, it does not negatively impact the generalization error. Figure 5.27 shows a typical example where the α(d) + β(d) + Mim split criteria required by the theory actually outperforms the simpler Mns + Mim split criteria.

Figure 5.27: Comparison between the Mns + Mim split criteria (red), the α(d) + β(d) + Mim split criteria (blue) and the offline Mcs split criteria (green) for the usps dataset.

Recall from Section 4.5 that the α(d) + β(d) + Mim split criteria consists of two conditions.
Either the number of data points in each candidate child of a candidate split must surpass α(d) and the impurity of the best split must be at least Mim, or the number of data points to reach a node must surpass β(d). For these experiments we let α(d) = Mcs · Kdr^d and β(d) = 4 · α(d), where d is the depth of the node. We evaluate Kdr = {1.001, 1.01, 1.1} and Mcs = {2, 12, 50}, as this is comparable to Mns = {5, 25, 100} for the Mns + Mim split criteria. These choices meet the requirements for proving consistency.

Table 5.12 shows that the α(d) + β(d) + Mim split criteria beats Mns + Mim for one of the twelve datasets and ties on the remaining eleven datasets. While it isn't clear why α(d) + β(d) + Mim performs better, one possible explanation is that splitting with fewer data points higher up the tree allows for more splits, while increasing the number of data points lower down the tree results in better estimates of the leaf predictors. In other words, by increasing the number of data points required to split with the depth of a node, the size of the final tree and the accuracy of the leaf estimators are balanced.

                      Mns + Mim   α(d) + β(d) + Mim
Mns + Mim                 -            0/11/1
α(d) + β(d) + Mim       1/11/0            -

Table 5.12: A comparison of the Mns + Mim split criteria versus the α(d) + β(d) + Mim split criteria across datasets. The α(d) + β(d) + Mim split criteria is equivalent to or better than Mns + Mim. The three numbers in each cell are the number of times the row beat/tied/lost to the column.

We now examine the effect of the parameter settings Mcs and Kdr on test error. In the cases where the α(d) + β(d) + Mim split criteria outperforms Mns + Mim, Mns = 100 for Mns + Mim while Mcs ≤ 12 for α(d) + β(d) + Mim. However, with Kdr > 1.0 the number of data points per node quickly increases with depth, which results in the final leaves containing many data points. The optimal value of Mcs must balance splitting fast enough with ensuring there are enough data points in the leaves. For the parameter configurations that we evaluated the optimal Mcs was either 2 or 12.
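The depth-dependent thresholds used in these experiments can be written down directly from the definitions above (a sketch of the quoted schedule only; the full consistency conditions are in Section 4.5):

```python
def alpha(d, m_cs, k_dr):
    # Points required in each candidate child before a split is considered:
    # alpha(d) = Mcs * Kdr**d, growing geometrically with node depth d.
    return m_cs * k_dr ** d

def beta(d, m_cs, k_dr):
    # Forced-split threshold on the points reaching a node:
    # beta(d) = 4 * alpha(d).
    return 4 * alpha(d, m_cs, k_dr)
```

For example, with Mcs = 2 and Kdr = 1.1 a root node (d = 0) needs only 2 points per child, while deeper nodes need exponentially more, which is what shifts data toward the leaf estimates.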
For a lower Mcs a larger Kdr tended to perform better. Unfortunately there were no clear trends, and cross validation or Bayesian optimization is required to find the best parameter settings.

5.4.5 Structure and estimation streams

Splitting data points into structure and estimation streams is required for the proof of consistency but, as we now show, it hurts empirical performance. This is the same conclusion which was reached in Section 5.3.2 for offline variants with provable consistency. Figure 5.28 demonstrates that splitting data points into structure and estimation streams hurts performance for the pendigits dataset. The results are the same for all twelve datasets that were tested and can be seen in Section A.18.

Figure 5.28: Comparison between Denil2013 with and without splitting each data point into structure and estimation streams, along with the offline all gaps midpoint as a baseline, for the pendigits dataset. Denil2013 without stream splitting is significantly closer to offline all gaps midpoint. In this example Denil2013 has a fixed frontier, uses the first KS data points as candidate split points and samples the number of candidate features from a Poisson at each node.

5.4.6 Multiple passes through the data

Saffari et al. [40] introduced multiple passes through the data to improve performance. On each pass through the data, all data points are shuffled to ensure a random ordering. By doing multiple passes through the data, the size of each tree can grow because each data point can be used in multiple splits at different depths of the tree. However, in the extreme case, the data points to arrive at each leaf are the same data points from the previous pass through the data. At this point extra passes through the data can only result in overfitting and cannot improve generalization error. However, for a reasonably small number of passes the generalization error of the online forest falls and can even approach offline variants. Figure 5.29 visualizes the reduction in test error resulting from increasing the number of passes from 1 to 5 to 10; p = {1, 5, 10}. Section A.19 includes plots for all datasets, which show that increasing the number of passes through the data to ten reduces the test error.

Figure 5.29: Different numbers of passes through the data (p) for a fixed frontier, KS data points as split points and a Poisson number of candidate features, along with the offline all gaps midpoint as a baseline, for the pendigits dataset. Online with 10 passes does as well as offline all gaps midpoint.

Chapter 6

Conclusion

In this thesis we have presented an empirical study of random forests across numerous variants and parameter settings. We now conclude with some general trends, with suggestions for practitioners wishing to find the optimal configuration with the least amount of computation, and with the critical modification that theoreticians need to remove to reduce the gap between variants with provable consistency and variants used in practice. For practical offline variants we have shown, for the datasets that we tested:

1. bagging increases generalization error for all gaps midpoint candidate split points and accounts for most of the gap between Breiman's [10] original algorithm and Geurts et al.
[27] extremely randomized trees,
2. all gaps midpoint is better than node uniform on some datasets and worse on others,
3. increasing the number of candidate split points for node uniform converges on all gaps midpoint but with higher computation cost,
4. there is an optimal KF and Mcs that balances individual strength with diversity,
5. there is a critical number of trees required for the optimal KF and Mcs of a smaller forest to match that of a larger forest,
6. by increasing the number of trees in the forest, the best setting of Mcs either goes to 1 or has less influence on the test error,
7. subsampling Kss ≥ 100 data points at each node does not significantly affect generalization error or the optimal parameter setting, but it does decrease training time, and
8. linear projections rarely improve accuracy and come at greater computational cost.

We now suggest a heuristic for finding the optimal variant and parameter setting based on the trends listed above. Because the worst case computation cost of training a forest is linear in KF, Kss, KS and T, we recommend starting the search with all gaps midpoint and Mcs = 1, T = 25 and Kss = 100. Begin searching for the best KF by starting at √N/2 and doubling it at each iteration until the test error no longer decreases. The search can be further refined by recursively splitting the region containing the optimum. Then increase the number of trees and the number of data points per node such that T = 100 and Kss = 1000, and continue a local search from the optimum found with T = 25 and Kss = 100. Now search for the best minimum child size by starting at Mcs = 1 and doubling it until the test error does not decrease. Then follow the same entire procedure for node uniform candidate split points, but begin from the optimal KF found for all gaps midpoint, as the optimal KF for node uniform is always larger than or equal to that for all gaps midpoint for the datasets evaluated in this study.
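The doubling search over KF described above can be sketched as follows. This is a minimal sketch: `test_error` stands in for training a forest with the given KF and measuring held-out error, and the stopping rule is the quoted one (stop once the error no longer decreases); the recursive refinement step is omitted.

```python
import math

def doubling_search_kf(test_error, num_features):
    """Search for K_F starting at sqrt(N)/2 and doubling until error stops improving."""
    kf = max(1, int(math.sqrt(num_features) / 2))
    best_kf, best_err = kf, test_error(kf)
    while kf < num_features:
        kf = min(2 * kf, num_features)   # cap at using all features
        err = test_error(kf)
        if err >= best_err:
            break                        # error stopped decreasing
        best_kf, best_err = kf, err
    return best_kf
```

Because each doubling at most doubles the training cost of the previous step, the whole search costs a small constant factor more than a single run at the largest KF it visits.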
Finally, use the test error to choose between selecting candidate split points with all gaps midpoint or node uniform. The final forest is then grown as large as computationally possible. Since random forests are not deterministic, it is possible that the variance in the test error will result in choosing a KF that is smaller than it should be. While this is unlikely, it can be remedied by training several forests for each parameter setting near the optimum.

By starting with KF = √N/2 and doubling KF we ensure that even if the optimal setting is to use all the features as candidate features, the optimization process will not cost more than three times the cost of training the forest with all the features. It is reasonable to assume that in the worst case the training cost is linear in the number of features. For example, if a dataset has 1000 features and the optimal number of candidate features is KF = 1000 then our procedure would train forests with 50, 100, 200, 400, 800 and 1000 candidate features. In this example the total computational cost would be proportional to 50 + 100 + 200 + 400 + 800 + 1000 = 2550. The other reason to start at √N/2 is that the default setting recommended by Breiman and accepted in the literature is to use √N candidate features for classification and N candidate features for regression.

Unlike Bayesian optimization, this heuristic takes the computation cost of each parameter setting into consideration while searching for the optimal parameter setting. While the work of Hutter et al. [30] takes computational cost into consideration, it must evaluate expensive configurations to learn that they are expensive. The ultimate goal is to develop a random forest parameter optimization algorithm with provable guarantees that incorporates the trends found in this thesis as prior knowledge to reduce the computation time required for finding the optimal configuration. We leave this open problem as future work.

For online variants we have empirically demonstrated that:

1.
using a fixed frontier to control memory greatly reduces generalization error,
2. selecting candidate split points at the first KS data points tends to have lower generalization error in comparison with global uniform,
3. using online bagging hurts performance, just like offline bagging,
4. using the provably consistent split criteria of α(d) + β(d) + Mim or the simpler split criteria Mns + Mim does not affect accuracy,
5. splitting data points into structure and estimation streams hurts accuracy, and
6. training with multiple passes through the dataset decreases the error until, in some cases, it converges with offline.

However, selecting KF, KS and the best split criteria parameters online is still an open question that we leave as future work.

For variants with provable consistency we demonstrated that Denil2014 achieves the lowest test error. The one simplification that makes up the largest gap between Denil2014 and offline with all gaps midpoint is splitting data points into structure and estimation streams. Therefore, we end by challenging theoreticians to find proof techniques that do not rely on independence between the points used to create the structure of the tree and those used to estimate the leaf predictors. If this is possible, the variants with theoretical guarantees would achieve equivalent empirical results to variants commonly used in practice.

Bibliography

[1] H. Abdulsalam. Streaming Random Forests. PhD thesis, Queen's University, 2008.

[2] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545–1558, 1997.

[3] G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063–1095, 2012.

[4] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033, 2008.

[5] A. Bifet, G. Holmes, and B. Pfahringer. New ensemble methods for evolving data streams.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 139–148, 2009.

[6] A. Bifet, G. Holmes, and B. Pfahringer. MOA: Massive online analysis, a framework for stream classification and clustering. In Workshop on Applications of Pattern Analysis, pages 3–16, 2010.

[7] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer. Ensembles of restricted Hoeffding trees. ACM Transactions on Intelligent Systems and Technology, 3(2):1–20, 2012.

[8] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[9] L. Breiman. Some infinity theory for predictor ensembles. Technical Report 579, Statistics Department, UC Berkeley, 2000.

[10] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[11] L. Breiman. Consistency for a simple model of random forests. Technical report, University of California at Berkeley, 2004.

[12] L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. CRC Press LLC, 1984.

[13] E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia, November 2009.

[14] G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: A survey and categorisation. Journal of Information Fusion, 6:5–20, 2005.

[15] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In International Conference on Machine Learning, pages 161–168, 2006.

[16] R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical comparison of supervised learning algorithms. In International Conference on Machine Learning, pages 96–103, 2008.

[17] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer London, 2013.

[18] A. Criminisi, J. Shotton, and E. Konukoglu.
Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227, 2011.

[19] D. R. Cutler, T. Edwards, and K. Beard. Random forests for classification in ecology. Ecology, 88(11):2783–92, 2007.

[20] M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In International Conference on Machine Learning, volume 3, pages 1256–1264, 2013.

[21] M. Denil, D. Matheson, and N. de Freitas. Narrowing the gap: Random forests in theory and in practice. In International Conference on Machine Learning, 2014.

[22] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, pages 139–157, 2000.

[23] P. Domingos and G. Hulten. Mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.

[24] J. Gama, P. Medas, and P. Rodrigues. Learning decision trees from dynamic data streams. In ACM Symposium on Applied Computing, pages 573–577, 2005.

[25] R. Genuer. Risk bounds for purely uniformly random forests. arXiv preprint arXiv:1006.2980, 2010.

[26] R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24:543–562, 2012.

[27] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[28] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In International Conference on Computer Vision, pages 415–422, 2011.

[29] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.

[30] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration.
In Learning and Intelligent Optimization (LION 5), pages 507–523, 2011.

[31] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press, 1995.

[32] S. W. Kwok and C. Carter. Multiple decision trees. In Uncertainty in Artificial Intelligence, pages 213–220, 1988.

[33] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.

[34] B. Menze, M. Kelm, D. Splitthoff, U. Koethe, and F. Hamprecht. On oblique random forests. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, pages 453–469, 2011.

[35] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In Information Processing in Medical Imaging, pages 184–196, 2011.

[36] N. Oza and S. Russell. Online bagging and boosting. In Artificial Intelligence and Statistics, volume 3, pages 105–112, 2001.

[37] A. Prasad, L. Iverson, and A. Liaw. Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9(2):181–199, 2006.

[38] L. Raileanu and K. Stoffel. Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93, 2004.

[39] J. Rodríguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630, 2006.

[40] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In International Conference on Computer Vision Workshops, pages 1393–1400, 2009.

[41] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[42] F. Schroff, A. Criminisi, and A. Zisserman.
Object class segmentation using random forests. In Proceedings of the British Machine Vision Conference, pages 54.1–54.10, 2008.

[43] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Computer Vision and Pattern Recognition, pages 1297–1304, 2011.

[44] V. Svetnik, A. Liaw, C. Tong, J. Culberson, R. Sheridan, and B. Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–58, 2003.

[45] A. Verikas, A. Gelzinis, and M. Bacauskiene. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 44(2):330–349, 2011.

[46] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In International Joint Conference on Artificial Intelligence (IJCAI), Distinguished Paper Award, 2013.

[47] C. Xiong, D. Johnson, R. Xu, and J. J. Corso. Random forests for metric learning with implicit pairwise position dependence. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 958–966, 2012.

[48] D. Zikic, B. Glocker, and A. Criminisi. Atlas encoding by randomized forests for efficient label propagation.
In International Conference on Medical Image Computing and Computer Assisted Intervention, 2012.

Appendix A

Appendix of experiment figures

A.1 All gaps midpoint with and without bagging

Figure A.1: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the vowel dataset.

Figure A.2: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the dna dataset.

Figure A.3: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the satimage dataset.

Figure A.4: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the gisette dataset.

Figure A.5: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the usps dataset.

Figure A.6: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the pendigits dataset.
Figure A.7: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the letter dataset.
Figure A.8: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the news20 dataset.
Figure A.9: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the cifar10 dataset.
Figure A.10: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the mnist dataset.
Figure A.11: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the sensit-vehicle dataset.
Figure A.12: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the webspam dataset.
Figure A.13: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the kinect dataset.
Figure A.14: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the bodyfat dataset.
Figure A.15: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the diabetes dataset.
Figure A.16: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the housing dataset.
Figure A.17: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the abalone dataset.
Figure A.18: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the wine dataset.
Figure A.19: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the cadata dataset.
Figure A.20: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the e2006 dataset.
Figure A.21: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the kinect-head dataset.
Figure A.22: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the ypm dataset.
Figure A.23: Comparison of all gaps midpoint with bagging (Breiman) versus all gaps midpoint without bagging for the ct-slice dataset.

A.2 Number of candidate features for different forest sizes

Figure A.24: Test error relative to number of candidate features (KF) for different size forests (T) for dataset vowel.
Figure A.25: Test error relative to number of candidate features (KF) for different size forests (T) for dataset dna.
Figure A.26: Test error relative to number of candidate features (KF) for different size forests (T) for dataset satimage.
Figure A.27: Test error relative to number of candidate features (KF) for different size forests (T) for dataset gisette.
Figure A.28: Test error relative to number of candidate features (KF) for different size forests (T) for dataset usps.
Figure A.29: Test error relative to number of candidate features (KF) for different size forests (T) for dataset pendigits.
Figure A.30: Test error relative to number of candidate features (KF) for different size forests (T) for dataset letter.
Figure A.31: Test error relative to number of candidate features (KF) for different size forests (T) for dataset mnist.
Figure A.32: Test error relative to number of candidate features (KF) for different size forests (T) for dataset sensit-vehicle.
Figure A.33: Test error relative to number of candidate features (KF) for different size forests (T) for dataset bodyfat.
Figure A.34: Test error relative to number of candidate features (KF) for different size forests (T) for dataset diabetes.
Figure A.35: Test error relative to number of candidate features (KF) for different size forests (T) for dataset housing.
Figure A.36: Test error relative to number of candidate features (KF) for different size forests (T) for dataset abalone.
Figure A.37: Test error relative to number of candidate features (KF) for different size forests (T) for dataset wine.
Figure A.38: Test error relative to number of candidate features (KF) for different size forests (T) for dataset cadata.
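The sweeps in Sections A.1 and A.2 can be approximated with off-the-shelf tooling. The sketch below is a stand-in, not the thesis code: it uses scikit-learn's RandomForestClassifier on a small built-in dataset, mapping KF to max_features, Mcs to min_samples_leaf and T to n_estimators, with bootstrap=False corresponding to training without bagging.

```python
# A sketch in the spirit of Sections A.1-A.2, using scikit-learn's
# RandomForestClassifier as a stand-in for the thesis implementation
# (assumption: the thesis uses its own forest code, not scikit-learn).
# Mapping: KF -> max_features, Mcs -> min_samples_leaf, T -> n_estimators.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for bagging in (True, False):      # bootstrap resampling on/off (Section A.1)
    for kf in (2, 8, 32):          # candidate features per node (x-axis, KF)
        forest = RandomForestClassifier(
            n_estimators=100,      # T: forest size
            max_features=kf,       # KF: candidate features per split
            min_samples_leaf=1,    # Mcs: minimum child size
            bootstrap=bagging,
            random_state=0,
        ).fit(X_tr, y_tr)
        err = 100.0 * (1.0 - forest.score(X_te, y_te))
        print(f"bagging={bagging} KF={kf} test error={err:.2f}%")
```

Plotting test error against kf for each setting reproduces the shape of the curves in the figures above, though the exact numbers depend on the implementation and dataset.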
A.3 Minimum child size for different forest sizes

Figure A.39: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset vowel.
Figure A.40: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset dna.
Figure A.41: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset satimage.
Figure A.42: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset gisette.
Figure A.43: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset usps.
Figure A.44: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset pendigits.
Figure A.45: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset letter.
Figure A.46: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset mnist.
Figure A.47: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset sensit-vehicle.
Figure A.48: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset bodyfat.
Figure A.49: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset diabetes.
Figure A.50: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset housing.
Figure A.51: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset abalone.
Figure A.52: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset wine.
Figure A.53: Test error relative to minimum child size (Mcs) for different size forests (T) for dataset cadata.

A.4 Error, individual strength and diversity for all gaps midpoints versus node uniform

Figure A.54: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the vowel dataset.
Figure A.55: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the dna dataset.
Figure A.56: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the satimage dataset.
Figure A.57: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the gisette dataset.
Figure A.58: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the usps dataset.
Figure A.59: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the pendigits dataset.
Figure A.60: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the letter dataset.
Figure A.61: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the news20 dataset.
Figure A.62: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the cifar10 dataset.
Figure A.63: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the mnist dataset.
Figure A.64: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the sensit-vehicle dataset.
Figure A.65: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the webspam dataset.
Figure A.66: Test error, Breiman strength and Breiman correlation for all gaps midpoint (Breiman) vs node uniform (Extra) for the kinect dataset.
Figure A.67: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the bodyfat dataset.
Figure A.68: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the diabetes dataset.
Figure A.69: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the housing dataset.
Figure A.70: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the abalone dataset.
Figure A.71: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the wine dataset.
Figure A.72: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the cadata dataset.
Figure A.73: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the e2006 dataset.
Figure A.74: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the kinect-head dataset.
Figure A.75: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the ypm dataset.
Figure A.76: Test error, individual MSE and ambiguity MSE for all gaps midpoint (Breiman) vs node uniform (Extra) for the ct-slice dataset.

A.5 Computation and model size for all gaps midpoints versus node uniform
1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )−40−200204060TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 2, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )0.000.050.100.15Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 2, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )150200250300350400NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 2, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 500Figure A.77: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the vowel dataset.1515 10 15 20 25 30 35Number of candidate features (KF )4.85.05.25.45.65.86.06.2TesterrorAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 5005 10 15 20 25 30 35Number of candidate features (KF )−600−400−2000200400600TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 5005 10 15 20 25 30 35Number of candidate features (KF )−0.6−0.4−0.20.00.20.40.60.8Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 5005 10 15 20 25 30 35Number of candidate features (KF )40060080010001200NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 2, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, 
Mcs= 1, KS= 100.0, T= 500Figure A.78: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the dna dataset.1522 4 6 8 10 12 14 16 18Number of candidate features (KF )8.28.48.68.89.09.29.4TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12 14 16 18Number of candidate features (KF )−600−400−2000200400600TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12 14 16 18Number of candidate features (KF )−1.0−0.50.00.51.01.5Testtime All gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12 14 16 18Number of candidate features (KF )1000150020002500NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 500Figure A.79: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the satimage dataset.15350 100 150 200Number of candidate features (KF )2.22.42.62.83.03.2TesterrorAll gaps midpoint, Mcs= 2, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 2, KS= 100.0, T= 50050 100 150 200Number of candidate features (KF )−6000−4000−20000200040006000TraintimeAll gaps midpoint, Mcs= 2, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 2, KS= 100.0, T= 50050 100 150 200Number of candidate features (KF )−0.4−0.20.00.20.40.6Testtime All 
gaps midpoint, Mcs= 2, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 2, KS= 100.0, T= 50050 100 150 200Number of candidate features (KF )20040060080010001200NumberofNodesAll gaps midpoint, Mcs= 2, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 500Node uniform, Mcs= 2, KS= 100.0, T= 500Figure A.80: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the gisette dataset.15410 20 30 40Number of candidate features (KF )5.56.06.57.0TesterrorAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 50010 20 30 40Number of candidate features (KF )−2000−1000010002000TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 50010 20 30 40Number of candidate features (KF )0.00.20.40.60.81.01.21.4Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 50010 20 30 40Number of candidate features (KF )100015002000250030003500NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 500Figure A.81: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the usps dataset.1552 4 6 8 10 12Number of candidate features (KF )2.53.03.54.04.55.05.5TesterrorAll gaps midpoint, Mcs= 2, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node 
uniform, Mcs= 1, KS= 100.0, T= 1002 4 6 8 10 12Number of candidate features (KF )−60−40−20020406080TraintimeAll gaps midpoint, Mcs= 2, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 1002 4 6 8 10 12Number of candidate features (KF )−1.5−1.0−0.50.00.51.01.5Testtime All gaps midpoint, Mcs= 2, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 1002 4 6 8 10 12Number of candidate features (KF )50010001500200025003000NumberofNodesAll gaps midpoint, Mcs= 2, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 100Figure A.82: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the pendigits dataset.1562 4 6 8 10 12Number of candidate features (KF )3456789TesterrorAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )−2000−1000010002000TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )−202468Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )40005000600070008000900010000NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 
500Figure A.83: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the letter dataset.1570 200 400 600 800 1000Number of candidate features (KF )181920212223242526TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform0 200 400 600 800 1000Number of candidate features (KF )02000400060008000100001200014000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform0 200 400 600 800 1000Number of candidate features (KF )2345678TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform0 200 400 600 800 1000Number of candidate features (KF )020004000600080001000012000140001600018000NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.84: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the news20 dataset.15820 40 60 80 100Number of candidate features (KF )52.052.553.053.554.054.5TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform20 40 60 80 100Number of candidate features (KF )−2000−1000010002000300040005000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform20 40 60 80 100Number of candidate features (KF )0.51.01.52.02.53.0TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform20 40 60 80 100Number of candidate features (KF )3500040000450005000055000NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.85: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the cifar10 dataset.15910 20 30 40 50Number of candidate features (KF )2.52.62.72.82.93.03.1TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 10010 20 30 40 50Number of candidate features (KF )−4000−20000200040006000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, 
KS= 100.0, T= 10010 20 30 40 50Number of candidate features (KF )0.40.60.81.01.21.41.6Testtime All gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 10010 20 30 40 50Number of candidate features (KF )50001000015000200002500030000NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 1, KS= 100.0, T= 100Figure A.86: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the mnist dataset.1605 10 15 20Number of candidate features (KF )15.615.816.016.216.416.616.817.0TesterrorAll gaps midpoint, Mcs= 4, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 4, KS= 100.0, T= 1005 10 15 20Number of candidate features (KF )−4000−2000020004000TraintimeAll gaps midpoint, Mcs= 4, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 4, KS= 100.0, T= 1005 10 15 20Number of candidate features (KF )−10123456Testtime All gaps midpoint, Mcs= 4, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 4, KS= 100.0, T= 1005 10 15 20Number of candidate features (KF )100002000030000400005000060000NumberofNodesAll gaps midpoint, Mcs= 4, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 1, KS= 25.0, T= 100Node uniform, Mcs= 4, KS= 100.0, T= 100Figure A.87: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the sensit-vehicle dataset.1615 10 15 20 25 30Number of candidate features (KF )0.900.951.001.051.101.151.20TesterrorAll gaps midpoint, Mcs= 
1, T= 100Node uniform5 10 15 20 25 30Number of candidate features (KF )−50005001000150020002500TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform5 10 15 20 25 30Number of candidate features (KF )051015TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform5 10 15 20 25 30Number of candidate features (KF )01000020000300004000050000NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.88: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the webspam dataset.16250 100 150 200Number of candidate features (KF )6789101112TesterrorAll gaps midpoint, Mcs= 1Node uniform, Mcs= 150 100 150 200Number of candidate features (KF )−150000−100000−50000050000100000150000200000TraintimeAll gaps midpoint, Mcs= 1Node uniform, Mcs= 150 100 150 200Number of candidate features (KF )1416182022TesttimeAll gaps midpoint, Mcs= 1Node uniform, Mcs= 150 100 150 200Number of candidate features (KF )100000150000200000250000300000350000NumberofNodesAll gaps midpoint, Mcs= 1Node uniform, Mcs= 1Figure A.89: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the kinect dataset.1632 4 6 8 10 12Number of candidate features (KF )−0.000020.000000.000020.000040.000060.00008TesterrorAll gaps midpoint, Mcs= 5, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 5, KS= 5.0, T= 25Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )−30−20−100102030TraintimeAll gaps midpoint, Mcs= 5, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 5, KS= 5.0, T= 25Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )−0.03−0.02−0.010.000.010.020.03Testtime All gaps midpoint, Mcs= 5, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 5, KS= 5.0, T= 25Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5002 
4 6 8 10 12Number of candidate features (KF )100200300400500NumberofNodesAll gaps midpoint, Mcs= 5, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 5, KS= 5.0, T= 25Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 500Figure A.90: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the bodyfat dataset.1641 2 3 4 5 6 7 8 9Number of candidate features (KF )3000350040004500TesterrorAll gaps midpoint, Mcs= 8, T= 500Node uniform, Mcs= 4, KS= 1.0, T= 500Node uniform, Mcs= 6, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )−60−40−200204060TraintimeAll gaps midpoint, Mcs= 8, T= 500Node uniform, Mcs= 4, KS= 1.0, T= 500Node uniform, Mcs= 6, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )−0.015−0.010−0.0050.0000.0050.0100.0150.020Testtime All gaps midpoint, Mcs= 8, T= 500Node uniform, Mcs= 4, KS= 1.0, T= 500Node uniform, Mcs= 6, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9Number of candidate features (KF )50100150200NumberofNodesAll gaps midpoint, Mcs= 8, T= 500Node uniform, Mcs= 4, KS= 1.0, T= 500Node uniform, Mcs= 6, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 500Figure A.91: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the diabetes dataset.1652 4 6 8 10 12Number of candidate features (KF )81012141618TesterrorAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )−100−50050100150TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, 
T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )0.030.040.050.06Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5002 4 6 8 10 12Number of candidate features (KF )600650700750800850900950NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 500Figure A.92: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the housing dataset.1661 2 3 4 5 6 7Number of candidate features (KF )4.44.64.85.05.25.45.6TesterrorAll gaps midpoint, Mcs= 7, T= 500Node uniform, Mcs= 5, KS= 1.0, T= 500Node uniform, Mcs= 8, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7Number of candidate features (KF )−400−2000200400TraintimeAll gaps midpoint, Mcs= 7, T= 500Node uniform, Mcs= 5, KS= 1.0, T= 500Node uniform, Mcs= 8, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7Number of candidate features (KF )−0.4−0.20.00.20.4Testtime All gaps midpoint, Mcs= 7, T= 500Node uniform, Mcs= 5, KS= 1.0, T= 500Node uniform, Mcs= 8, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 5001 2 3 4 5 6 7Number of candidate features (KF )50010001500NumberofNodesAll gaps midpoint, Mcs= 7, T= 500Node uniform, Mcs= 5, KS= 1.0, T= 500Node uniform, Mcs= 8, KS= 5.0, T= 100Node uniform, Mcs= 8, KS= 25.0, T= 500Node uniform, Mcs= 8, KS= 100.0, T= 500Figure A.93: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the abalone dataset.1671 2 3 4 5 6 7 8 9 10Number of candidate 
features (KF )0.350.400.450.50TesterrorAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )−1500−1000−500050010001500TraintimeAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.40.60.81.01.21.4Testtime All gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 5001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )30004000500060007000NumberofNodesAll gaps midpoint, Mcs= 1, T= 500Node uniform, Mcs= 1, KS= 1.0, T= 500Node uniform, Mcs= 1, KS= 5.0, T= 500Node uniform, Mcs= 1, KS= 25.0, T= 500Node uniform, Mcs= 1, KS= 100.0, T= 500Figure A.94: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the wine dataset.1681 2 3 4 5 6 7Number of candidate features (KF )2.42.62.83.03.2Testerror×109 All gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 100Node uniform, Mcs= 2, KS= 100.0, T= 1001 2 3 4 5 6 7Number of candidate features (KF )−1000−500050010001500TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 100Node uniform, Mcs= 2, KS= 100.0, T= 1001 2 3 4 5 6 7Number of candidate features (KF )0.00.20.40.60.8Testtime All gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 100Node uniform, Mcs= 2, KS= 100.0, T= 1001 2 3 4 5 6 7Number of candidate features (KF 
)1500020000250003000035000400004500050000NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniform, Mcs= 1, KS= 1.0, T= 100Node uniform, Mcs= 1, KS= 5.0, T= 100Node uniform, Mcs= 2, KS= 25.0, T= 100Node uniform, Mcs= 2, KS= 100.0, T= 100Figure A.95: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the cadata dataset.169100 200 300 400 500 600 700Number of candidate features (KF )0.1900.1950.2000.2050.2100.215TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform100 200 300 400 500 600 700Number of candidate features (KF )0500010000150002000025000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform100 200 300 400 500 600 700Number of candidate features (KF )−0.50.00.51.01.52.02.53.0TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform100 200 300 400 500 600 700Number of candidate features (KF )3180031900320003210032200NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.96: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the e2006 dataset.170200 400 600 800 1000Number of candidate features (KF )0.480.500.520.540.560.580.60TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform200 400 600 800 1000Number of candidate features (KF )−30000−20000−10000010000200003000040000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform200 400 600 800 1000Number of candidate features (KF )0.70.80.91.01.11.21.3TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform200 400 600 800 1000Number of candidate features (KF )0.0150.0200.0250.0300.0350.0400.045NumberofNodes+7.999897×104 All gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.97: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the kinect-head dataset.17110 20 30 40 50 60 70 80Number of candidate features (KF )100120140160180TesterrorAll gaps midpoint, Mcs= 5Node uniform, Mcs= 510 20 30 40 50 60 70 80Number of candidate features (KF 
)050001000015000TraintimeAll gaps midpoint, Mcs= 5Node uniform, Mcs= 510 20 30 40 50 60 70 80Number of candidate features (KF )1213141516171819TesttimeAll gaps midpoint, Mcs= 5Node uniform, Mcs= 510 20 30 40 50 60 70 80Number of candidate features (KF )147000147500148000148500NumberofNodesAll gaps midpoint, Mcs= 5Node uniform, Mcs= 5Figure A.98: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the ypm dataset.17250 100 150 200 250 300 350Number of candidate features (KF )246810TesterrorAll gaps midpoint, Mcs= 1, T= 100Node uniform50 100 150 200 250 300 350Number of candidate features (KF )200030004000500060007000TraintimeAll gaps midpoint, Mcs= 1, T= 100Node uniform50 100 150 200 250 300 350Number of candidate features (KF )1.82.02.22.42.62.8TesttimeAll gaps midpoint, Mcs= 1, T= 100Node uniform50 100 150 200 250 300 350Number of candidate features (KF )85050851008515085200852508530085350NumberofNodesAll gaps midpoint, Mcs= 1, T= 100Node uniformFigure A.99: Test error, train time, test time and model size for all gapsmidpoint (Breiman) vs node uniform (Extra) for the ct-slice dataset.173A.6 Subsample data points at node1741 2 3 4 5 6 7 8 9Number of candidate features (KF )4045505560657075TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )−0.15−0.10−0.050.000.050.10BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )−0.3−0.2−0.10.00.10.20.30.40.5BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 100Figure 
A.100: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the vowel dataset.17510 20 30 40 50 60 70 80Number of candidate features (KF )567891011TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 10010 20 30 40 50 60 70 80Number of candidate features (KF )0.00.10.20.30.40.50.60.70.8BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 10010 20 30 40 50 60 70 80Number of candidate features (KF )−0.2−0.10.00.10.20.30.4BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 100Figure A.101: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the dna dataset.1765 10 15 20Number of candidate features (KF )91011121314TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 1005 10 15 20Number of candidate features (KF )0.500.550.600.650.70BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 1005 10 15 20Number of candidate features (KF )0.200.250.300.350.40BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 100Figure A.102: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at 
each node (Kss)on the satimage dataset.17750 100 150 200Number of candidate features (KF )468101214TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )0.30.40.50.60.70.8BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )0.000.050.100.150.200.25BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 3, T= 100Figure A.103: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the gisette dataset.17810 20 30 40Number of candidate features (KF )6.06.57.07.58.08.59.0TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )0.400.450.500.550.600.650.700.75BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )0.050.100.150.200.250.300.350.40BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 100Figure A.104: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the usps dataset.1792 4 6 8 10 12Number of candidate features (KF )3.54.04.55.05.5TesterrorAll 
gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )0.650.700.750.800.85BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )0.100.150.200.250.300.350.40BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 2, T= 100Figure A.105: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the pendigits dataset.1802 4 6 8 10 12Number of candidate features (KF )45678TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )0.500.550.600.650.700.750.80BreimanstrengthAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )0.050.100.150.200.250.300.350.40BreimancorrelationAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + subsample at node, Kss= 1000All gaps midpoint, Mcs= 1, T= 100Figure A.106: Test error, Breiman strength and Breiman correlation fordifferent settings of the number of data points sampled at each node (Kss)on the letter dataset.18150 100 150 200Number of candidate features (KF )3.03.54.04.55.05.5TesterrorAll gaps midpoint + subsample at node, Kss= 10All gaps midpoint + subsample at node, Kss= 100All gaps midpoint + 
(Plots omitted. Each figure in this section plots the quantities named in its caption against the number of candidate features (KF), comparing subsampling at each node with Kss ∈ {10, 100, 1000} against the all gaps midpoint baseline, with T = 100 trees throughout.)

Figure A.107: Test error, Breiman strength and Breiman correlation for different settings of the number of data points sampled at each node (Kss) on the mnist dataset.

Figure A.108: Test error, Breiman strength and Breiman correlation for different settings of the number of data points sampled at each node (Kss) on the sensit-vehicle dataset.

Figure A.109: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the bodyfat dataset.

Figure A.110: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the diabetes dataset.

Figure A.111: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the housing dataset.

Figure A.112: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the abalone dataset.
Figure A.113: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the wine dataset.

Figure A.114: Test error, individual MSE and ambiguity MSE for different settings of the number of data points sampled at each node (Kss) on the cadata dataset.

A.7 Sort and walk split point variants

(Plots omitted. Each figure plots test error against the number of candidate features (KF) for the all gaps midpoint, all datapoints and all gaps uniform strategies, each with its own Mcs setting and T = 100 trees.)

Figure A.115: Test error for different sort and walk candidate split point selection strategies for dataset vowel.

Figure A.116: Test error for different sort and walk candidate split point selection strategies for dataset dna.

Figure A.117: Test error for different sort and walk candidate split point selection strategies for dataset satimage.

Figure A.118: Test error for different sort and walk candidate split point selection strategies for dataset gisette.

Figure A.119: Test error for different sort and walk candidate split point selection strategies for dataset usps.

Figure A.120: Test error for different sort and walk candidate split point selection strategies for dataset pendigits.

Figure A.121: Test error for different sort and walk candidate split point selection strategies for dataset letter.

Figure A.122: Test error for different sort and walk candidate split point selection strategies for dataset mnist.

Figure A.123: Test error for different sort and walk candidate split point selection strategies for dataset sensit-vehicle.

Figure A.124: Test error for different sort and walk candidate split point selection strategies for dataset bodyfat.

Figure A.125: Test error for different sort and walk candidate split point selection strategies for dataset diabetes.

Figure A.126: Test error for different sort and walk candidate split point selection strategies for dataset housing.

Figure A.127: Test error for different sort and walk candidate split point selection strategies for dataset abalone.

Figure A.128: Test error for different sort and walk candidate split point selection strategies for dataset wine.

Figure A.129: Test error for different sort and walk candidate split point selection strategies for dataset cadata.

A.8 Uniform split point variants
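The captions in this section distinguish two uniform strategies: global uniform samples candidate split points across the entire feature range, while node uniform samples only across the range of the data points at the node. As a minimal sketch of that distinction (an assumed interpretation, not the thesis's implementation; `num_splits` is a hypothetical stand-in for the KS parameter):

```python
import numpy as np

rng = np.random.default_rng(0)

def node_uniform_splits(x_node, num_splits):
    """Sample candidate split points uniformly between the min and max
    of the feature values of the data points present at this node."""
    lo, hi = x_node.min(), x_node.max()
    return rng.uniform(lo, hi, size=num_splits)

def global_uniform_splits(feature_range, num_splits):
    """Sample candidate split points uniformly across the entire
    (global) range of the feature, ignoring the data at the node."""
    lo, hi = feature_range
    return rng.uniform(lo, hi, size=num_splits)

# Example: a node whose points occupy [2, 3] of a feature with global range [0, 10].
x_node = np.array([2.0, 2.2, 2.7, 3.0])
node_splits = node_uniform_splits(x_node, num_splits=5)
global_splits = global_uniform_splits((0.0, 10.0), num_splits=5)
assert all(2.0 <= s <= 3.0 for s in node_splits)
assert all(0.0 <= s <= 10.0 for s in global_splits)
```

The practical difference is that global uniform can propose splits that separate none of the node's data, whereas node uniform never wastes candidates outside the node's range.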
(Plots omitted. Each figure plots test error against the number of candidate features (KF) for node uniform and global uniform strategies with KS ∈ {1, 100} and T = 100 trees.)

Figure A.130: Test error for node uniform and global uniform candidate split point selection strategies for the vowel dataset. Global uniform samples candidate split points uniformly across the entire feature range, whereas node uniform samples across the data points at the node.

Figure A.131: Test error for node uniform and global uniform candidate split point selection strategies for the dna dataset.

Figure A.132: Test error for node uniform and global uniform candidate split point selection strategies for the satimage dataset.

Figure A.133: Test error for node uniform and global uniform candidate split point selection strategies for the gisette dataset.

Figure A.134: Test error for node uniform and global uniform candidate split point selection strategies for the usps dataset.

Figure A.135: Test error for node uniform and global uniform candidate split point selection strategies for the pendigits dataset.

Figure A.136: Test error for node uniform and global uniform candidate split point selection strategies for the letter dataset.

Figure A.137: Test error for node uniform and global uniform candidate split point selection strategies for the mnist dataset.

Figure A.138: Test error for node uniform and global uniform candidate split point selection strategies for the sensit-vehicle dataset.

Figure A.139: Test error for node uniform and global uniform candidate split point selection strategies for the bodyfat dataset.

Figure A.140: Test error for node uniform and global uniform candidate split point selection strategies for the diabetes dataset.

Figure A.141: Test error for node uniform and global uniform candidate split point selection strategies for the housing dataset.

Figure A.142: Test error for node uniform and global uniform candidate split point selection strategies for the abalone dataset.

Figure A.143: Test error for node uniform and global uniform candidate split point selection strategies for the wine dataset.

Figure A.144: Test error for node uniform and global uniform candidate split point selection strategies for the cadata dataset.

A.9 Random and class difference projections
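The projection variants split on linear combinations of features rather than on single features. The constructions below are a hedged sketch, not the thesis's exact definitions: a random projection takes a random linear combination of a few features (the hypothetical `num_features` corresponds loosely to KP), and a class difference projection uses the direction between two data points drawn from different classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, num_features):
    """Project data onto a random linear combination of a few features."""
    d = X.shape[1]
    idx = rng.choice(d, size=num_features, replace=False)
    w = rng.standard_normal(num_features)
    return X[:, idx] @ w  # one derived feature value per data point

def class_diff_projection(X, y):
    """Project data onto the direction between two points sampled from
    two different classes (illustrative; the exact construction may differ)."""
    a, b = rng.choice(np.unique(y), size=2, replace=False)
    xa = X[y == a][rng.integers((y == a).sum())]
    xb = X[y == b][rng.integers((y == b).sum())]
    return X @ (xa - xb)

X = rng.standard_normal((6, 4))
y = np.array([0, 0, 1, 1, 2, 2])
assert random_projection(X, num_features=2).shape == (6,)
assert class_diff_projection(X, y).shape == (6,)
```

Either projection yields a scalar feature per data point, on which the usual one-dimensional candidate split point strategies can then be applied.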
Figure A.145: Test error, strength, correlation, train time, test time and model size for class difference and random projections on letter.

Figure A.146: Test error, strength, correlation, train time, test time and model size for class difference and random projections on gisette.

Figure A.147: Test error, strength, correlation, train time, test time and model size for class difference and random projections on pendigits.

Figure A.148: Test error, strength, correlation, train time, test time and model size for class difference and random projections on usps.

Figure A.149: Test error, strength, correlation, train time, test time and model size for class difference and random projections on satimage.

Figure A.150: Test error, strength, correlation, train time, test time and model size for class difference and random projections on dna.

Figure A.151: Test error, strength, correlation, train time, test time and model size for class difference and random projections on vowel.

Figure A.152: Test error, strength, correlation, train time, test time and model size for class difference and random projections on mnist.

Figure A.153: Test error, strength, correlation, train time, test time and model size for class difference and random projections on sensit-vehicle.

A.10 Error, individual strength and diversity for Biau2008, Biau2012 and Denil2014

Figure A.154: Test error, strength and correlation for Biau2008, Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) on dataset vowel.

Figure A.155: Test error, strength and correlation for Biau2008, Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) on dataset dna.
Mcs= 1, T= 1002 4 6 8 10 12 14 16 18Number of candidate features (KF )0.200.250.300.350.40BreimancorrelationBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.156: Test error, strength and correlation for Biau2008, Biau2012,Denil2014 and Breiman without bagging (all gaps midpoint) on datasetsatimage.21850 100 150 200Number of candidate features (KF )468101214TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )−0.20.00.20.40.60.8BreimanstrengthBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )−0.2−0.10.00.10.20.3BreimancorrelationBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 100Figure A.157: Test error, strength and correlation for Biau2008, Biau2012,Denil2014 and Breiman without bagging (all gaps midpoint) on datasetgisette.21910 20 30 40Number of candidate features (KF )68101214TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )0.20.30.40.50.60.7BreimanstrengthBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )0.00.10.20.30.4BreimancorrelationBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.158: Test error, strength and correlation for Biau2008, Biau2012,Denil2014 and Breiman without bagging (all gaps midpoint) on datasetusps.2202 4 6 8 10 12Number of candidate features (KF )46810121416TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate 
features (KF )0.50.60.70.8BreimanstrengthBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )−0.2−0.10.00.10.20.30.40.50.6BreimancorrelationBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 100Figure A.159: Test error, strength and correlation for Biau2008, Biau2012,Denil2014 and Breiman without bagging (all gaps midpoint) on datasetpendigits.2212 4 6 8 10 12Number of candidate features (KF )10203040506070TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )−0.20.00.20.40.60.8BreimanstrengthBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )0.00.10.20.30.4BreimancorrelationBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.160: Test error, strength and correlation for Biau2008, Biau2012,Denil2014 and Breiman without bagging (all gaps midpoint) on datasetletter.2222 4 6 8 10 12Number of candidate features (KF )−0.0010.0000.0010.0020.0030.0040.005TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 1002 4 6 8 10 12Number of candidate features (KF )0.000.020.040.060.080.100.120.14IndividualMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 1002 4 6 8 10 12Number of candidate features (KF )0.000.020.040.060.080.100.12AmbiguityMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 100Figure A.161: Test error, individual MSE and ambiguity MSE for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps 
midpoint) ondataset bodyfat.2231 2 3 4 5 6 7 8 9Number of candidate features (KF )30003500400045005000TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )40005000600070008000IndividualMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )10002000300040005000AmbiguityMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.162: Test error, individual MSE and ambiguity MSE for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondataset diabetes.2242 4 6 8 10 12Number of candidate features (KF )10203040506070TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )20406080100120IndividualMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )10203040AmbiguityMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.163: Test error, individual MSE and ambiguity MSE for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondataset housing.2251 2 3 4 5 6 7Number of candidate features (KF )5678910TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7Number of candidate features (KF )67891011IndividualMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7Number of candidate features (KF )123456AmbiguityMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 
100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.164: Test error, individual MSE and ambiguity MSE for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondataset abalone.2261 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.40.50.60.70.80.9TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.81.01.21.4IndividualMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.200.250.300.350.400.450.500.55AmbiguityMSEBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.165: Test error, individual MSE and ambiguity MSE for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondataset wine.227A.11 Computation and model size for Biau2008,Biau2012 and Denil20142281 2 3 4 5 6 7 8 9Number of candidate features (KF )5060708090TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )−2−101234TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )0.0050.0100.0150.0200.025Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )200300400500600700800900NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 100Figure A.166: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) 
onvowel dataset.2295 10 15 20 25 30 35Number of candidate features (KF )1020304050TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1005 10 15 20 25 30 35Number of candidate features (KF )−2024681012TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1005 10 15 20 25 30 35Number of candidate features (KF )−0.020.000.020.040.060.080.10Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1005 10 15 20 25 30 35Number of candidate features (KF )500100015002000250030003500NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 100Figure A.167: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondna dataset.2302 4 6 8 10 12 14 16 18Number of candidate features (KF )101520253035TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12 14 16 18Number of candidate features (KF )05101520TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12 14 16 18Number of candidate features (KF )−0.10−0.050.000.050.100.150.200.25Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12 14 16 18Number of candidate features (KF )10002000300040005000600070008000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.5Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.168: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onsatimage dataset.23150 100 150 200Number of candidate features (KF )468101214TesterrorBiau2008, T= 100, KL= 
1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )−1000100200300TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )−0.10.00.10.20.3Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 10050 100 150 200Number of candidate features (KF )200040006000800010000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 3, T= 100Figure A.169: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ongisette dataset.23210 20 30 40Number of candidate features (KF )68101214TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )−200−1000100200300TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )−0.2−0.10.00.10.20.3Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 10010 20 30 40Number of candidate features (KF )2000400060008000100001200014000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 1.0Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.170: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onusps dataset.2332 4 6 8 10 12Number of candidate features (KF )46810121416TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )−10−5051015202530TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 
100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )−0.2−0.10.00.10.20.30.40.5Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 1002 4 6 8 10 12Number of candidate features (KF )2000400060008000100001200014000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 100Figure A.171: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onpendigits dataset.2342 4 6 8 10 12Number of candidate features (KF )10203040506070TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )−50050100TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )0.00.51.0Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )500010000150002000025000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.25Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.172: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onletter dataset.2352 4 6 8 10 12Number of candidate features (KF )−0.0010.0000.0010.0020.0030.0040.005TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 1002 4 6 8 10 12Number of candidate features (KF )0.00.20.40.60.8TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 1002 4 6 8 10 12Number of candidate features (KF 
)0.00100.00150.00200.0025Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 1002 4 6 8 10 12Number of candidate features (KF )50100150200250300350NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 5, T= 100Figure A.173: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onbodyfat dataset.2361 2 3 4 5 6 7 8 9Number of candidate features (KF )30003500400045005000TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )−0.50.00.51.01.5TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )0.0000.0010.0020.0030.004Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7 8 9Number of candidate features (KF )100200300400500600NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.166666666667Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.174: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) ondiabetes dataset.2372 4 6 8 10 12Number of candidate features (KF )10203040506070TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )−2−101234TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )−0.0020.0000.0020.0040.0060.008Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 
100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1002 4 6 8 10 12Number of candidate features (KF )200400600800100012001400NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.333333333333Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.175: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onhousing dataset.2381 2 3 4 5 6 7Number of candidate features (KF )5678910TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7Number of candidate features (KF )−50510TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7Number of candidate features (KF )−0.04−0.020.000.020.040.060.080.10Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 1001 2 3 4 5 6 7Number of candidate features (KF )100020003000400050006000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.176: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onabalone dataset.2391 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.40.50.60.70.80.9TesterrorBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )−1001020TraintimeBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1001 2 3 4 5 6 7 8 9 10Number of candidate features (KF )−0.10−0.050.000.050.100.150.20Testtime Biau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 1001 2 3 4 5 6 7 8 9 10Number of candidate features (KF 
)2000400060008000NumberofNodesBiau2008, T= 100, KL= 1.0Biau2012, T= 100, KL= 0.125Denil2014, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.177: Test error, train time, test time and model size for Biau2008,Biau2012, Denil2014 and Breiman without bagging (all gaps midpoint) onwine dataset.240A.12 Sampling number of candidate featuresfrom a Poisson distribution1 2 3 4 5 6 7 8 9Number of candidate features (KF )455055606570TesterrorAll gaps midpoint, Mcs= 3, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.178: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the vowel dataset.5 10 15 20 25 30 35Number of candidate features (KF )4.85.05.25.45.65.86.06.2TesterrorAll gaps midpoint, Mcs= 2, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.179: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the dna dataset.2412 4 6 8 10 12 14 16 18Number of candidate features (KF )8.28.48.68.89.09.29.4TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.180: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the satimagedataset.50 100 150 200Number of candidate features (KF )2.42.62.83.0TesterrorAll gaps midpoint, Mcs= 3, T= 100All gaps midpoint + Poisson KF , Mcs= 3, T= 100Figure A.181: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the gisette dataset.24210 20 30 40Number of candidate features (KF )5.86.06.26.46.66.87.0TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.182: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the usps dataset.2 4 6 8 10 12Number of candidate features (KF )3.54.04.55.0TesterrorAll gaps midpoint, Mcs= 2, 
T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.183: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the pendigitsdataset.2432 4 6 8 10 12Number of candidate features (KF )45678TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.184: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the letter dataset.5 10 15 20Number of candidate features (KF )15.715.815.916.016.1TesterrorAll gaps midpoint, Mcs= 4, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.185: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the sensit-vehicledataset.24410 20 30 40 50Number of candidate features (KF )2.652.702.752.802.85TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.186: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the mnist dataset.2 4 6 8 10 12Number of candidate features (KF )0.000000.000050.000100.000150.00020TesterrorAll gaps midpoint, Mcs= 5, T= 100All gaps midpoint + Poisson KF , Mcs= 5, T= 100Figure A.187: Test error for all gaps midpoint vs all gaps midpoint with pos-sion sampling of the number of candidate features for the bodyfat dataset.2451 2 3 4 5 6 7 8 9Number of candidate features (KF )3000350040004500TesterrorAll gaps midpoint, Mcs= 7, T= 100All gaps midpoint + Poisson KF , Mcs= 7, T= 100Figure A.188: Test error for all gaps midpoint vs all gaps midpoint with pos-sion sampling of the number of candidate features for the diabetes dataset.2 4 6 8 10 12Number of candidate features (KF )810121416TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.189: Test error for all gaps midpoint vs all gaps midpoint with pos-sion sampling of the number of 
candidate features for the housing dataset.2461 2 3 4 5 6 7Number of candidate features (KF )4.64.85.05.25.4TesterrorAll gaps midpoint, Mcs= 7, T= 100All gaps midpoint + Poisson KF , Mcs= 7, T= 100Figure A.190: Test error for all gaps midpoint vs all gaps midpoint with pos-sion sampling of the number of candidate features for the abalone dataset.1 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.350.400.450.50TesterrorAll gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 1, T= 100Figure A.191: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the wine dataset.2471 2 3 4 5 6 7Number of candidate features (KF )2.42.62.83.03.2Testerror×109 All gaps midpoint, Mcs= 1, T= 100All gaps midpoint + Poisson KF , Mcs= 2, T= 100Figure A.192: Test error for all gaps midpoint vs all gaps midpoint withpossion sampling of the number of candidate features for the cadata dataset.248A.13 Stream splitting1 2 3 4 5 6 7 8 9Number of candidate features (KF )455055606570TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 100Figure A.193: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the vowel dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.2495 10 15 20 25 30 35Number of candidate features (KF )68101214TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 2.0, T= 100All gaps midpoint, Mcs= 2, T= 100Figure A.194: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the dna dataset. 
Thisdemonstrates that the increase in error is a result of splitting data pointsinto structure and estimation streams.2 4 6 8 10 12 14 16 18Number of candidate features (KF )9101112TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.195: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the satimage dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.25050 100 150 200Number of candidate features (KF )2.53.03.54.04.55.05.5TesterrorDenil2014, Mcs= 2.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 3, T= 100Figure A.196: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the gisette dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.10 20 30 40Number of candidate features (KF )67891011TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.197: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the usps dataset. 
Thisdemonstrates that the increase in error is a result of splitting data pointsinto structure and estimation streams.2512 4 6 8 10 12Number of candidate features (KF )3.54.04.55.0TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 2, T= 100Figure A.198: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the pendigits dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.2 4 6 8 10 12Number of candidate features (KF )45678TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.199: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the letter dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.2522 4 6 8 10 12Number of candidate features (KF )0.000000.000050.000100.000150.00020TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 5.0, T= 100All gaps midpoint, Mcs= 5, T= 100Figure A.200: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the bodyfat dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.1 2 3 4 5 6 7 8 9Number of candidate features (KF )3000350040004500TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 8.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.201: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the diabetes dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.2532 4 6 8 10 12Number of candidate features (KF )8101214161820TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 
100All gaps midpoint, Mcs= 1, T= 100Figure A.202: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the housing dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.1 2 3 4 5 6 7Number of candidate features (KF )4.64.85.05.25.45.6TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 7.0, T= 100All gaps midpoint, Mcs= 7, T= 100Figure A.203: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the abalone dataset.This demonstrates that the increase in error is a result of splitting datapoints into structure and estimation streams.2541 2 3 4 5 6 7 8 9 10Number of candidate features (KF )0.350.400.450.50TesterrorDenil2014, Mcs= 1.0, T= 100Denil2014 + one stream, Mcs= 1.0, T= 100All gaps midpoint, Mcs= 1, T= 100Figure A.204: Test error for Denil2014 with structure and estimationstreams vs with one stream vs all gaps midpoint for the wine dataset. 
This demonstrates that the increase in error is a result of splitting data points into structure and estimation streams.

A.14 Online: fixed frontier versus fixed depth

Each figure in the following sections plots test error against the number of candidate features (KF); the legend entries listing per-dataset hyperparameter settings (Mns, Md, Mft, Mim, Mcs, T, p) are omitted here.

Figures A.205–A.216: Test error for fixed depth versus fixed frontier (global uniform candidate split points with online bagging), with the offline all gaps midpoint strategy as a baseline, for the datasets vowel, dna, satimage, gisette, usps, pendigits, letter, bodyfat, diabetes, housing, abalone and wine. In every case the error of a fixed frontier is equal to or less than that of a fixed depth.

A.15 Online: global uniform candidate split points versus at data points

Figures A.217–A.228: Test error for selecting candidate split points at data points versus with global uniform (both with a fixed frontier and online bagging) for the same twelve datasets.

A.16 Online: negative effect of online bagging

Figures A.229–A.240: Test error with and without online bagging (candidate split points at data points, fixed frontier) for the same twelve datasets. Online bagging either has no effect or hurts performance.

A.17 Online: Mns + Mim versus α(d) + β(d) + Mim

Figures A.241–A.252: Test error for the Mns + Mim split criteria versus the α(d) + β(d) + Mim split criteria (candidate split points at data points, fixed frontier) for the same twelve datasets.

A.18 Online: negative effect of structure and estimation streams on Denil2013

Figures A.253–A.264: Test error with structure and estimation streams (S,E) versus with no stream splitting for the same twelve datasets. All gaps midpoint is included as a baseline.

A.19 Online: reduction in error from multiple passes

Figures A.265–A.276: Test error for Denil2013 without stream splitting for different numbers of passes (p = 1, 5, 10) through the data for the same twelve datasets. As the number of passes increases, the test error approaches the offline strategy of all gaps midpoint.
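Several of the comparisons above (in particular A.16) pit online bagging against training without resampling. In online bagging, in the style of Oza and Russell, each incoming example is shown to each tree k times with k drawn from a Poisson(1) distribution, which approximates bootstrap resampling without storing the data set. A minimal, self-contained sketch follows; it assumes a simple incremental learner interface (update/predict), and the class and function names are illustrative rather than the implementation used in this thesis:

```python
import random


def poisson1(rng=random):
    """Sample k ~ Poisson(lambda=1) by Knuth's method:
    multiply uniforms until the running product drops below e^{-1}."""
    threshold = 2.718281828459045 ** -1
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


class OnlineBaggedEnsemble:
    """Online bagging: each example is replicated k ~ Poisson(1) times
    for each base learner, mimicking the bootstrap one example at a time."""

    def __init__(self, make_learner, n_learners=10, seed=0):
        self.rng = random.Random(seed)
        self.learners = [make_learner() for _ in range(n_learners)]

    def update(self, x, y):
        for learner in self.learners:
            for _ in range(poisson1(self.rng)):
                learner.update(x, y)

    def predict(self, x):
        votes = [learner.predict(x) for learner in self.learners]
        return max(set(votes), key=votes.count)  # majority vote


class Majority:
    """Toy base learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get)
```

In the experiments above the base learners are online decision trees rather than this toy majority learner; the point of the sketch is only the Poisson replication step, which is the extra randomization that Figures A.229–A.240 show either has no effect or hurts accuracy.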
Item Metadata

Title | An empirical study of practical, theoretical and online variants of random forests
Creator | Matheson, David
Publisher | University of British Columbia
Date Issued | 2014
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2014-04-25
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NoDerivs 2.5 Canada
DOI | 10.14288/1.0167214
URI | http://hdl.handle.net/2429/46586
Degree | Master of Science - MSc
Program | Computer Science
Affiliation | Science, Faculty of; Computer Science, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2014-09
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nd/2.5/ca/
Aggregated Source Repository | DSpace