UBC Theses and Dissertations


Rare-class classification using ensembles of subsets of variables. Tomal, Jabed H. (2013)



Full Text

Rare-class Classification using Ensembles of Subsets of Variables

by Jabed H. Tomal
M.Sc., University of Windsor, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2013

© Jabed H. Tomal 2013

Abstract

An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare-class observations are found at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable, while a set of descriptors describing the structures of the chemical compounds serves as the predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble achieves better predictive ranking than the most accurate random forest applied to a single descriptor set.

Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (we call them phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for the sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other while forming an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds.

The phalanxes are also aggregated within and across the descriptor sets. In all four bioassays, the resulting ensemble outperforms the ensemble of descriptor sets, and random forests applied to the pool of the five descriptor sets.

The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found to be more powerful in terms of predictive ranking, and less computationally demanding, than the original ensemble of phalanxes with random forests.

Preface

This thesis was written under the supervision of my advisors Dr. William J. Welch and Dr. Ruben H. Zamar. The research questions, the developed methods and the analyses were thoroughly discussed in our weekly research meetings. For all the chapters of this thesis, the computations, analyses, first drafts and follow-up revisions were done by me. Dr. Welch and Dr. Zamar spent a great deal of time checking the results and improving the writing to make this thesis better.

A provisional patent application has been submitted based on the work in Chapter 3: Jabed H. Tomal, William J. Welch and Ruben H. Zamar (UBC Ref: 14-019 US Prov, 2013), Ensembling Classification Models Based on Phalanxes of Variables with Applications in Drug Discovery. US Patent & Trademark, No. 61/832,384. In that chapter, I developed the research question and most of the steps of Algorithm 3.1, which uncovers data-adaptive subsets of variables in a variable-rich dataset.
The work is currently under revision for submission for possible publication.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Algorithms
Glossary
Acknowledgements
Dedication
1 Introduction
  1.1 Objectives
  1.2 Some Classifiers
    1.2.1 Classification Tree
    1.2.2 k-Nearest Neighbours
    1.2.3 Discriminant Analysis
    1.2.4 Neural Networks
    1.2.5 Support Vector Machines
    1.2.6 LAGO
  1.3 Ensembles of Classifiers
    1.3.1 Bagging
    1.3.2 Random Forests
    1.3.3 Regularized Random Forests
    1.3.4 Boosting
    1.3.5 Ensemble of Systematic Subsets
  1.4 Variable Selection via Filtering
  1.5 Drug Discovery Datasets
  1.6 Protein Homology Dataset
  1.7 Summary
2 Ensembling Natural Subsets of Variables
  2.1 Introduction
  2.2 BioAssay Data
    2.2.1 Assay AID364
    2.2.2 Assay AID371
    2.2.3 Assay AID362
    2.2.4 Assay AID348
  2.3 The Descriptor Sets / Natural Subsets of Variables
  2.4 Ensemble of Descriptor Sets
  2.5 Evaluation of Classifiers
    2.5.1 Balanced 10-fold Cross-validation
    2.5.2 Hit Curve
    2.5.3 Initial Enhancement
    2.5.4 Average Hit Rate
  2.6 Results
    2.6.1 Assay AID364
    2.6.2 Assay AID371
    2.6.3 Assay AID362
    2.6.4 Assay AID348
  2.7 Diversity Map
  2.8 Computational Complexity
  2.9 Summary
3 Ensembling Data-Adaptive Subsets of Variables
  3.1 Introduction
  3.2 Datasets and Variables
  3.3 Performance Measures
  3.4 Searching for Data-Adaptive Subsets / Phalanxes of Variables
    3.4.1 Predictor Variables
    3.4.2 Initial Groups
    3.4.3 Screening Initial Groups
    3.4.4 Phalanx Formation
    3.4.5 Screening Out Weak Phalanxes
  3.5 Computational Complexity
  3.6 Ensemble of Phalanxes
  3.7 Results
    3.7.1 Assay AID348
    3.7.2 Assay AID362
    3.7.3 Assay AID364
    3.7.4 Assay AID371
  3.8 Diversity Map
  3.9 Summary
4 Ensembling Phalanxes Across and Within Descriptor Sets
  4.1 Introduction
  4.2 Datasets and Variables
  4.3 Ensemble of Descriptor Sets using Data-Adaptive Phalanxes
  4.4 Evaluation of Classifiers
  4.5 Results
    4.5.1 Assay AID348
    4.5.2 Assay AID362
    4.5.3 Assay AID364
    4.5.4 Assay AID371
  4.6 Diversity Map
  4.7 Summary
5 Protein Homology: Ensembling via Logistic Regression Models
  5.1 Introduction
  5.2 Protein Homology Data
  5.3 Evaluation Metrics
    5.3.1 Rank Last
    5.3.2 Average Precision
    5.3.3 TOP1
  5.4 The Modified Algorithm of Phalanx Formation
  5.5 Ensemble of Phalanxes
  5.6 Results
    5.6.1 Grouping the Training Blocks
    5.6.2 Evaluation on the Test Set
  5.7 Summary
6 Conclusion and Future Work
  6.1 Parallel Computation
  6.2 Future Work
Bibliography
Appendices
A Supplementary Tables

List of Tables

2.1 The four bioassay datasets considered in this chapter. The numbers in parentheses are the numbers of active compounds in each assay.
2.2 The number of non-constant feature variables for each of the five descriptor sets. The numbers in parentheses show the total number of feature variables generated by PowerMV.
2.3 Hotelling-Lawley test statistic from the MANOVA fit using two responses, AHR and IE, to the AID364 bioassay data.
2.4 The mean AHRs and IEs over the 16 cross-validations for classification tree (Tree) and random forests (RF) applied to the five descriptor sets (AP, BN, CAP, FP and PH), and for their ensembles EDSs.
2.5 Observed computational time (in minutes) to complete a balanced 10-fold cross-validation for the seven ensembles applied to the five descriptor sets of the four assay datasets. The number in parentheses shows the number of feature variables used to construct the corresponding classifier.
3.1 The number of variables, initial groups, screened groups, candidate phalanxes, and screened/army of phalanxes for the 3 runs of the algorithm for the AID348 assay.
3.2 Average hit rate (AHR) for an ensemble of phalanxes (EPX), random forests (RF), and regularized random forests (RRF), averaged over 16 repeats of balanced 10-fold cross-validation for the AID348 assay. Larger AHR values are better. The last two columns show the number of times EPX has larger AHR among the 16 repeats of cross-validation relative to RF and RRF.
3.3 The initial enhancement (IE), averaged over the 16 balanced 10-fold cross-validations, for the three ensembles RF, RRF, and EPX of the five descriptor sets of the AID348 assay.
4.1 Mean AHR and IE for AF, EDS, and EDS-PX for the four assay datasets. Winning rates of EDS over AF, and of EDS-PX over AF and EDS, are also given for AHR and IE.
5.1 The structure of the protein homology dataset. The top and bottom portions are extracted from the training and test sets, respectively.
5.2 Block sizes in the training and test sets of the protein homology dataset.
5.3 The performances of the three ensembles RF, EPX(RF) and EPX(LR) using three evaluation metrics: mean block RKL, APR and TOP1. The ensembles are learned on the 153 training blocks and are evaluated on the 150 test blocks. The top performance corresponding to each evaluation metric is marked in bold face.
5.4 The number of processors used, amount of memory allocated and the elapsed time recorded (in minutes) for parallel execution of the three ensembles.
5.5 The results from the Weka solution to the 2004 KDD Cup.
A.1 The number of variables, initial groups, screened groups, candidate phalanxes, and screened/army of phalanxes for the 3 runs of the algorithm for the AID362 assay.
A.2 Average hit rate (AHR) for an ensemble of phalanxes (EPX), random forests (RF), and regularized random forests (RRF), averaged over 16 repeats of balanced 10-fold cross-validation for the AID362 assay. Larger AHR values are better. The last two columns show the number of times EPX has larger AHR among the 16 repeats of cross-validation relative to RF and RRF.
A.3 The number of variables, initial groups, screened groups, candidate phalanxes, and screened/army of phalanxes for the 3 runs of the algorithm for the AID364 assay.
A.4 Average hit rate (AHR) for an ensemble of phalanxes (EPX), random forests (RF), and regularized random forests (RRF), averaged over 16 repeats of balanced 10-fold cross-validation for the AID364 assay. Larger AHR values are better. The last two columns show the number of times EPX has larger AHR among the 16 repeats of cross-validation relative to RF and RRF.
A.5 The number of variables, initial groups, screened groups, candidate phalanxes, and screened/army of phalanxes for the 3 runs of the algorithm for the AID371 assay.
A.6 Average hit rate (AHR) for an ensemble of phalanxes (EPX), random forests (RF), and regularized random forests (RRF), averaged over 16 repeats of balanced 10-fold cross-validation for the AID371 assay. Larger AHR values are better. The last two columns show the number of times EPX has larger AHR among the 16 repeats of cross-validation relative to RF and RRF.

List of Figures

1.1 (a) An example of a classification tree grown from a simulated dataset of size 60 with two classes and two feature variables. (b) The corresponding partitions of the two-dimensional space into hyper-rectangles. Class 0 and class 1 cases are denoted by "o" and "+", respectively.
1.2 (a) The 7-nearest neighbours for two test objects using Euclidean distance for the simulated dataset. (b) Two decision boundaries using k-nearest neighbours: the solid line is for k = 1 and the dashed line is for k = 7. Class 0 and class 1 cases are denoted by "o" and "+", respectively.
2.1 (a) The hit curves obtained from applying random forests to the five descriptor sets (AP, BN, CAP, FP and PH) and to the pool of the five descriptor sets (AF) of the AID364 dataset, together with the hit curve for the ensemble of descriptor sets (EDS) applied to the AID364 dataset. (b) The corresponding initial enhancement (IE) versus average hit rate (AHR) plot for the seven ensembles applied to the AID364 assay. Both panels are obtained using the probabilities of activity from a particular balanced 10-fold cross-validation.
2.2 Plots of mean initial enhancement (IE) versus mean average hit rate (AHR) with 95% confidence bands obtained from applying the seven classifiers to the four bioassay datasets: AID364, AID371, AID362, and AID348.
2.3 Diversity maps of ranks of the active compounds for the AID362 and AID348 assays. The probabilities of activity associated with cross-validation number 1 are used to rank the compounds.
3.1 Illustration of hit curves for three methods, random forests (RF), regularized random forests (RRF) and ensemble of phalanxes (EPX), corresponding to Atom Pairs of the AID348 assay. The number in parentheses is the corresponding average hit rate.
3.2 Schematic presentation of the algorithm of phalanx formation: D predictor variables are grouped into d initial groups, then reduced down to s screened groups, then combined into c candidate phalanxes, which are then reduced to p screened phalanxes in the final army (D ≥ d ≥ s ≥ c ≥ p).
3.3 Box plots of average hit rates (AHR) from 16 repeats of cross-validation for random forests (RF), regularized random forests (RRF), and 3 ensembles of phalanxes (EPX) for the five descriptor sets (AP, BN, CAP, FP, and PH) of the AID348 assay. The boxes for RF, RRF, and EPX are marked by light-grey, grey and dark-grey colour schemes.
3.4 Hit curves for random forests (dashed line), regularized random forests (dotted line), and army of phalanxes (solid line) for the four descriptor sets, BN in panel (a), CAP in panel (b), FP in panel (c) and PH in panel (d). The number in parentheses is the corresponding average hit rate. Data from the AID348 assay.
3.5 Diversity map of ranks of the active compounds by 8 phalanxes from BN of the AID348 assay. The average hit rate of a random forest for this cross-validation is 0.103.
4.1 Hit curves comparing three ensembles, AF, EDS and EDS-PX, for the AID348 assay dataset (panel a). Initial enhancement (IE) versus average hit rate (AHR) plot for AF, EDS and EDS-PX (panel b). The results are obtained from the estimated probabilities of activity using a balanced 10-fold cross-validation.
4.2 Plots of mean initial enhancement (IE) versus mean average hit rate (AHR) with 95% confidence regions for four assay datasets: AID348, AID362, AID364, and AID371.
4.3 Diversity maps of EDS (left panel) and EDS-PX (right panel) for the ranks of the active compounds in the AID348 assay. The compounds are ranked using the probability of activity corresponding to cross-validation number 1.
5.1 Histogram of block sizes in the training and test sets of the protein homology dataset.
5.2 Histogram of the proportions of homologous proteins in the training set.
5.3 Kernel density plots of the feature variables x63 (panel a) and x47 (panel b) for the homologous and non-homologous proteins in the training set.
5.4 Histograms of the proportion of homologous proteins in four groups (panel a). Side-by-side box plots comparing the performances of the three ensembles RF, EPX(RF) and EPX(LR) in each of the four groups in terms of mean block RKL (panel b), mean block APR (panel c) and mean block TOP1 (panel d). The colours of the boxes differentiating RF, EPX(RF) and EPX(LR) are white, light grey and dark grey, respectively.
5.5 Hit curves comparing the performances of the three ensembles RF, EPX(RF) and EPX(LR) in blocks 18 (panel a) and 162 (panel b), chosen from groups 1 and 4, respectively. The respective block sizes are 1114 and 816. The legend in each figure also shows the estimated APR, RKL and TOP1 for the three ensembles.

List of Algorithms

1.1 Random Forests
1.2 AdaBoost.M1
3.1 Phalanx Formation
5.1 Modified Algorithm of Phalanx Formation

Glossary

RF: Random Forests
RRF: Regularized Random Forests
EDS: Ensemble of Descriptor Sets
EPX: Ensemble of Phalanxes
LR: Logistic Regression Model
AHR: Average Hit Rate
IE: Initial Enhancement
APR: Average Precision
RKL: Rank Last
AP: Atom Pairs
BN: Burden Numbers
CAP: Carhart Atom Pairs
FP: Fragment Pairs
PH: Pharmacophores
SNP: Single Nucleotide Polymorphism
KDD: Knowledge Discovery and Data Mining

Acknowledgements

My sincere gratefulness goes to my supervisors, Professor William J. Welch and Professor Ruben H. Zamar, for their thoughtful guidance, remarkable insights and continuous encouragement throughout this research. I have acquired a handful of good qualities from them, especially inquisitiveness, the ability to think hard, and skills for academic writing. This thesis would not have been possible to finish without their scholarly support.

I thank Dr. Alexandre Bouchard-Côté for his time and effort in serving on the supervisory committee, reviewing the thesis and providing thoughtful comments.

My thanks go to Dr. S. Stanley Young for his prompt help in generating the descriptor sets for the bioassay datasets.

I also want to thank Professors Raymond Ng and Gabriela Cohen Freue for serving as the university examiners, and Professor Marianthi Markatou for serving as the external examiner.

I want to convey my sincere gratitude to the faculty and administrative staff of the Department of Statistics. I have learned a lot taking courses with Professor John Petkau and Professor Paul Gustafson. My cordial thanks go to Peggy Ng, Andrea Sollberger, Elaine Salameh, and Viena Tran for their supportive and prompt administrative help.

Special thanks go to my friends in the department for their warm support during my study.
Andy Leung helped me debug and run a portion of the C code used in my thesis. Camila P. Estevam de Souza was always supportive, with her big smile, throughout my graduate life. Thank you very much!

The warmest thanks and gratefulness are for my wife, Shafinaz Shafique, for her support, care and affection throughout my study. My son, Tashreeq, was always with me to bring happiness into my life. My sincere gratefulness goes to my parents, who were and are always in my heart.

This thesis has been written with the generous financial support of the Natural Sciences and Engineering Research Council (NSERC) of Canada, to which I want to express my sincere thankfulness.

Dedication

To My Family

Chapter 1
Introduction

The theory of classification has received considerable attention in statistics and in machine learning. A classifier places objects into different classes based on the characteristics of the objects. In this thesis, I deal with supervised classification only, where a classifier is learned on a dataset with objects of known classes and the learned classifier is then used to predict the classes of test objects. The dataset with objects of known classes, on which a classifier is trained or learned, is known as the training or learning dataset, and the dataset for which predictions are required is known as the test dataset.

Suppose we wish to classify an incoming email as "spam" or "non-spam" using features of the email such as email ID, subject, length, percentage of specific words, etc. (DeBarr and Wechsler, 2012). We build a classifier on previously received emails, already labelled as "spam" or "non-spam", and classify incoming emails based on their features. The collection of received emails forms the training or learning data and the collection of incoming emails forms the test data.

The applications of statistical classification span all disciplines of science and business. Machine learners have used classification methods in robotics, internet search engines, image/face/speech recognition, computer games and many other systems. However, the goal of my research is to identify relevant rare-class objects in a large collection of objects. The reason why an object is considered relevant depends on the context of the specific problem. For example, in terrorist detection the relevant objects are individual communications deemed very suspicious (Fienberg, 2004); in credit card fraud detection the relevant objects are individual transactions that are fraudulent (Srivastava et al., 2008); in spam detection the relevant objects are individual emails that are highly likely to be spam (DeBarr and Wechsler, 2012); and in drug discovery the relevant objects are chemical compounds that are active against a specific biological target (Warmuth et al., 2001), for example, the human immunodeficiency virus (HIV).

The goal is to rank the rare-class objects ahead of the majority-class objects. The objects in a test set are ranked using the probabilities of belonging to the rare class from a classifier learned on a training set. The top-ranked objects may be shortlisted for further investigation. The shortlist is intended to include most, if not all, of the rare-class objects in the test set. The length of the shortlist will depend on the available resources in a study.

Although this rare-class ranking problem is similar to those encountered in two-class classification, the underlying objective is different. In particular, automatic classification is of little interest.
This is because further investigation of the top-ranked candidates is almost always necessary. For example, in the fraud detection application, one seldom terminates a credit card account without confirming the suspected fraud (Bolton and Hand, 2002). Therefore, we are most interested in producing an effective ranking of all of the candidates so that any further investigation is least likely to be carried out in vain.

With the advancement of computer speed, memory and software, the number of variables in recent statistical datasets has become very large. For example, the datasets in drug discovery and in computational chemistry contain a few hundred to thousands of variables. The analysis of such large datasets poses two types of problems: (i) large computational time and (ii) poor model building, in terms of low prediction power and high complexity of the resulting model. The large computational time incurred by a model is usually handled by developing fast computational tools or algorithms for fitting the model. Poor model building is usually addressed by filtering or selecting variables, or by regularizing the model. The problems relating to a large number of variables are popularly known as the "curse of dimensionality". Instead of repeating the conventional term "curse of dimensionality", we describe this aspect of the data as "rich in variables", since a variable-rich dataset may possess many signals for modelling the response.

Ensembling is a fairly novel method of improving the prediction accuracy of a classifier. In an ensemble, a collection of classifiers is constructed and the class membership of an object is determined by aggregating the individual classifiers. Ensemble methods have been shown to be more accurate than non-ensemble methods and insensitive to irrelevant feature variables (e.g., Breiman, 1996a, 2001; Freund and Schapire, 1996a; Wang, 2005). Capitalizing on the richness of variables in a dataset, I develop an ensemble method which outperforms existing off-the-shelf ensembles in terms of predictive ranking of the rare-class objects.

1.1 Objectives

The objectives of this thesis are:

1. Placing rare-class objects before the majority-class objects in highly unbalanced two-class problems.

2. Developing an ensemble which capitalizes on the richness of variables in a dataset to outperform existing ensembles.

1.2 Some Classifiers

In this section, I briefly describe some popular non-ensemble classifiers. The first two were the top-performing non-ensembles in two-class ranking problems (Hughes-Oliver et al., 2011; Wang, 2005).

1.2.1 Classification Tree

A classification tree is a binary decision tree constructed by splitting a node into two child nodes (descendants) repeatedly, beginning with the root node that contains the whole training sample. The tree-building procedure partitions the space of the feature variables successively into smaller hyper-rectangles (nodes) so that the class variable in each hyper-rectangle is as homogeneous as possible.

At a node, the tree-growing process chooses a split from all possible splits so that the resulting left and right descendants are as pure as possible. If all of the objects in a node belong to a particular class, the node is called pure. The tree-growing algorithm first finds each feature's best split and then finds the node's best split by choosing the one that maximizes a splitting criterion. The process then splits the node using the node's best split if the stopping rules are not satisfied.

An example will make this concrete.
For this, I used a toy dataset of size 60 with two classes (1 and 0) and two feature variables, x1 and x2. The observations are generated as follows. Let \(\mu_1 = (2.5, 2.5)\), \(\mu_2 = (7.5, 7.5)\), \(\mu_3 = (2.5, 7.5)\) and \(\mu_4 = (7.5, 2.5)\) be four mean vectors and

\[
\Sigma = \begin{pmatrix} 1.00 & 0.15 \\ 0.15 & 4.00 \end{pmatrix}
\]

be the common variance-covariance matrix. A total of 40 class 0 observations are generated from the mixture of two bivariate normal distributions \(\pi\, N(\mu_1, \Sigma) + (1 - \pi)\, N(\mu_2, \Sigma)\), where \(\pi = 0.5\) is the mixture parameter. The other 20 class 1 observations are generated independently from the mixture \(\pi\, N(\mu_3, \Sigma) + (1 - \pi)\, N(\mu_4, \Sigma)\). Unlike the rare-class problem, the classes are kept only roughly unbalanced so that the tree partitioning method can be explained better.

The left panel of Figure 1.1 shows a typical classification tree grown on the simulated dataset. The root node, which contains all 60 cases, is split using the feature variable x2 into two child nodes: x2 < 2.256 (left descendant) and x2 ≥ 2.256 (right descendant). The left descendant is an internal node (an internal node is represented by a circle), which contains 15 cases: 7 from class 0 and 8 from class 1. Thus, the predicted class for this node is 1, with probability 8/15. This left internal node is further split using the feature variable x1 into two child nodes: x1 < 5.415 (left descendant) and x1 ≥ 5.415 (right descendant). Both descendants are terminal nodes, represented by rectangles, with predicted classes 0 and 1, respectively, each with probability 1.

The right descendant of the root node is split using the feature variable x1 into two child nodes: x1 < 6.716 (left descendant) and x1 ≥ 6.716 (right descendant). The right descendant is a terminal node with predicted class 0 and probability 18/19. The left descendant, an internal node, is further split using the feature variable x2 into two terminal nodes: x2 < 4.871 (left terminal) and x2 ≥ 4.871 (right terminal). If a test case is sent down this tree and reaches the right-most terminal node, the case will be predicted as class 0 with probability 18/19.

[Figure 1.1: (a) An example of a classification tree grown from a simulated dataset of size 60 with two classes and two feature variables. (b) The corresponding partitions of the two-dimensional space into hyper-rectangles. Class 0 and class 1 cases are denoted by "o" and "+", respectively.]

The resulting hyper-rectangles obtained by splitting the two-dimensional space are displayed in the right panel of Figure 1.1. It is evident that the space-partitioning process of a tree allows the effect of one feature variable to differ across values of the other feature variables, and includes non-linear effects of features.

The tree-building algorithm uses one of several splitting criteria to determine how homogeneous (i.e., pure) the response variable is in a child node: deviance, entropy, Gini index, and misclassification error. Among them, deviance, entropy and the Gini index are mainly used to grow a tree, while misclassification error is usually used to prune a tree. The rules that decide when to stop growing a tree are called stopping rules. When growing a tree, a node will not be split (i) if the node becomes pure, (ii) if all cases have identical values for each feature variable, (iii) if the tree size reaches the user-specified maximum tree size, (iv) if the size of the node is smaller than the user-specified minimum node size, (v) if the child node size is smaller than the user-specified minimum child node size, or (vi) if the improvement in information gain from the best split of the node is smaller than the user-specified minimum improvement. For further reference, see Breiman, Friedman, Olshen, and Stone (1984) and Chapter 9 of Venables and Ripley (2002).
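To make the toy example concrete, a minimal R sketch of how such data could be simulated and a tree grown on them is given below. The use of MASS::mvrnorm and the rpart package is an assumption for illustration (the thesis does not say how the data or the tree in Figure 1.1 were produced), so the fitted splits will not match Figure 1.1 exactly.

# Minimal sketch (assumed implementation): simulate the toy data and grow a tree.
library(MASS)    # mvrnorm() for bivariate normal samples
library(rpart)   # a standard CART-style classification tree

set.seed(1)
Sigma <- matrix(c(1.00, 0.15, 0.15, 4.00), 2, 2)
mu <- list(c(2.5, 2.5), c(7.5, 7.5), c(2.5, 7.5), c(7.5, 2.5))

# 40 class-0 cases from a 50/50 mixture of N(mu1, Sigma) and N(mu2, Sigma)
pick0 <- sample(1:2, 40, replace = TRUE)
x0 <- t(sapply(pick0, function(k) mvrnorm(1, mu[[k]], Sigma)))
# 20 class-1 cases from a 50/50 mixture of N(mu3, Sigma) and N(mu4, Sigma)
pick1 <- sample(3:4, 20, replace = TRUE)
x1 <- t(sapply(pick1, function(k) mvrnorm(1, mu[[k]], Sigma)))

toy <- data.frame(rbind(x0, x1))
names(toy) <- c("x1", "x2")
toy$y <- factor(c(rep(0, 40), rep(1, 20)))

fit <- rpart(y ~ x1 + x2, data = toy, method = "class")
print(fit)      # splits analogous to those described above
predict(fit, data.frame(x1 = 7.35, x2 = 4.56), type = "prob")

The final line returns the class probabilities for a new case sent down the tree, which is the quantity used for ranking later in the thesis.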
1.2.2 k-Nearest Neighbours

The k-nearest neighbours (KNN) classifier assigns a class to an object based on the majority class of that object's k nearest neighbours. The value of k, the number of nearest neighbours, is usually chosen by the user. Hall, Park, and Samworth (2008) study the effect of the value of k on the misclassification error rate and present a number of methods for selecting this parameter. Often Euclidean distance is used to define the neighbourhood for continuous features. The pioneers of KNN are Cover and Hart (1967), who investigated the properties of 1-NN, the rule that uses one nearest neighbour. If there are ties, the predicted class is determined by a random mechanism.

[Figure 1.2: (a) The 7-nearest neighbours for two test objects using Euclidean distance for the simulated dataset. (b) Two decision boundaries using k-nearest neighbours: the solid line is for k = 1 and the dashed line is for k = 7. Class 0 and class 1 cases are denoted by "o" and "+", respectively.]

The algorithm for KNN is very simple. For example, suppose we want to predict the classes of the two test objects (2.34, 8.31) and (7.35, 4.56) using the simulated dataset of size 60 from Section 1.2.1. The left panel of Figure 1.2 shows the two 7-nearest neighbourhoods of the test objects. The neighbourhood of the object (2.34, 8.31) contains seven training observations, all from class 1. Hence, this test object will be predicted as class 1 with probability 1. The neighbourhood of the second test object (7.35, 4.56) contains seven training observations: five from class 0 and two from class 1. Thus this object will be classified as class 0 with probability 5/7.

The right panel of Figure 1.2 contains two decision boundaries, one for k = 1 (the solid line) and the other for k = 7 (the dashed line). It is evident that the decision boundary for k = 1 is more irregular than that for k = 7. Seemingly, the classifier with k = 1 overfits the data and captures noise variability. For low values of k we have low bias and high variance in prediction, and vice versa. The balance between bias and variance is usually struck by choosing k via cross-validation.
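As a hedged illustration, the 7-nearest-neighbour prediction for the two test objects could be computed with the class package (an assumption; the thesis does not name an implementation), reusing the simulated toy data frame from the previous sketch:

# Minimal sketch (assumed implementation): 7-NN class votes for the two test objects.
library(class)   # knn() with Euclidean distance

test <- data.frame(x1 = c(2.34, 7.35), x2 = c(8.31, 4.56))

pred <- knn(train = toy[, c("x1", "x2")], test = test,
            cl = toy$y, k = 7, prob = TRUE)
pred                 # predicted classes for the two test objects
attr(pred, "prob")   # proportion of the 7 neighbours voting for the winning class

The vote proportions returned here play the role of the probabilities 1 and 5/7 discussed above, up to the randomness of the simulated data.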
1.2.3 Discriminant Analysis

In discriminant analysis, the joint distributions of the D-dimensional feature variables for the different classes are usually assumed to be multivariate normal with class-specific mean vectors and covariance matrices, and the posterior class probabilities conditional on the features are determined by Bayes' rule. The unknown parameters (such as class prior probabilities, mean vectors and covariance matrices) are usually computed from the training sample. The assumption of equal covariance matrices results in linear discriminant analysis (LDA) (Fisher, 1936), whereas the assumption of unequal covariance matrices results in quadratic discriminant analysis (QDA). LDA works well when the feature space is linearly separable for the classes, and QDA works well when the decision boundary is quadratic in the feature space. Friedman (1989) proposed a compromise between LDA and QDA which allows shrinking of the separate covariance matrices of QDA toward a common covariance, as in LDA, by introducing a regularization parameter; this method is known as regularized discriminant analysis (RDA). In some applications a mixture of Gaussian distributions is a good choice for the joint distributions of the features, which generates a classification method known as mixture discriminant analysis (MDA) (Hastie, Tibshirani, and Friedman, 2009, Section 6.8). On the other hand, nonparametric discriminant analysis (Hastie, Tibshirani, and Friedman, 2009, Section 6.6.2) uses a kernel density approach to determine the joint distribution of the features in each class.

LDA is one of the oldest classification methods, but it is often found effective, especially when the decision boundary is linearly separable, as it generates discriminant functions which are linear combinations of the features. The assumption of multivariate normality is not a requirement, but it is helpful in determining cut-offs of the discriminant functions to help separate objects in the two-class problem. In a multi-class problem, the prediction for a new object can be made by computing posterior probabilities for each class or by determining optimal discriminant functions. Suppose we have c classes and \(s = \min(c - 1, D)\) discriminant functions \(\mathbf{w}_j^T \mathbf{x}\), where \(\mathbf{w}_j\) is the vector of weights for the jth discriminant function and \(\mathbf{x}\) is the D-dimensional vector of feature variables. A new object with feature vector \(\mathbf{x}_{\mathrm{new}}\) will be assigned to class c if

\[
\sum_{j=1}^{s} \left[ \mathbf{w}_j^T (\mathbf{x}_{\mathrm{new}} - \bar{\mathbf{x}}_c) \right]^2 \;\le\; \sum_{j=1}^{s} \left[ \mathbf{w}_j^T (\mathbf{x}_{\mathrm{new}} - \bar{\mathbf{x}}_i) \right]^2, \quad \forall\, i \ne c, \qquad (1.1)
\]

where \(\bar{\mathbf{x}}_i\) is the mean vector of the x's for class \(i = 1, \ldots, c\). In words, the object \(\mathbf{x}_{\mathrm{new}}\) will be classified as from class c if, in the space transformed by the common covariance matrix, \(\mathbf{x}_{\mathrm{new}}\) is closest to the centroid of class c.
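A minimal sketch of LDA on the same toy data, assuming the MASS implementation (the thesis does not name one), and reusing the test objects defined in the KNN sketch; the posterior probabilities it returns are the quantities one would use for ranking:

# Minimal sketch (assumed implementation): LDA posterior probabilities on the toy data.
library(MASS)

lfit <- lda(y ~ x1 + x2, data = toy)
post <- predict(lfit, test)$posterior   # one row per test object, one column per class
post
# Rank the test objects by their posterior probability of class 1 (largest first)
order(post[, "1"], decreasing = TRUE)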
1.2.4 Neural Networks

Neural networks (NN) are very popular in machine learning. The history of NN dates back to McCulloch and Pitts (1943), and the motivation goes back to Widrow and Hoff (1960) and Rosenblatt (1962). The idea of neural networks mimics the structure of the human brain, where many biological neurons are interconnected with each other to perform complicated tasks. In a neural network, each neuron receives a linear combination of inputs as derived features, transforms the derived features using a nonlinear function (usually the sigmoid function) and passes the result to another neuron. The output function, generally the softmax function in classification (Hastie, Tibshirani, and Friedman, 2009, Section 11.3), allows a final transformation. In general there are several layers of neurons in a network, with many units in each layer. The units in the middle of a network, which compute the derived features, are called hidden units because the values of the derived features are not directly observed. The terminal neurons are sometimes allowed to receive linear combinations of the raw inputs, skipping the hidden layer(s) and simplifying the interpretation of the network. The structure of a neural network is usually complex, but the central idea is to extract linear combinations of the inputs as derived features, and to model the class variable as a nonlinear function of these features.

A neural network has unknown parameters, often called weights, and we seek values for them so that the model fits the training data well. For classification, cross-entropy (deviance) is generally used as the measure of fit, and it is minimized by gradient descent, known as back-propagation (Hastie, Tibshirani, and Friedman, 2009, Section 11.4). Typically we do not want the global minimizer of the cross-entropy, as this is likely to be an overfit solution. Instead, a regularized cross-entropy is used, with a penalty term often called weight decay, which is usually optimized via cross-validation. Scaling of the input variables and multiple starting values of the weights are often utilized to achieve optimal weights at the global minimum of the cross-entropy. The choice of the number of hidden layers is guided by background knowledge and experimentation. It is better to have many hidden units than too few (see Hastie, Tibshirani, and Friedman, 2009, Section 11.5.4). With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used.

1.2.5 Support Vector Machines

The support vector machine (SVM) is also a popular classifier with applications in statistics and in machine learning. SVM was introduced by Boser, Guyon, and Vapnik (1992), and the current standard incarnation was proposed by Cortes and Vapnik (1995). The idea is to separate the space of feature variables so that the classes become as distinct as possible. In a two-class problem, SVM separates the feature space by a linear hyperplane, and its parameters are determined from a training sample by a constrained optimization technique. To produce a nonlinear decision boundary, SVM constructs a linear boundary in a large, transformed version of the feature space. Generally, linear boundaries in the enlarged space achieve better training-class separation and translate to nonlinear boundaries in the original space. In SVM, the points well inside their class boundary do not play a big role in shaping the optimal hyperplane; this attractive property differentiates SVM from LDA, in which the decision boundary is determined by the covariance of the class distributions and the positions of the class centroids. The support vector machine is usually extended to multi-class problems by solving many two-class problems: a classifier is built for each pair of classes, and the final classifier is the one that dominates the most. However, SVM is slow in high dimensions and does poorly in the presence of many noise feature variables (Hastie, Tibshirani, and Friedman, 2009, Chapter 12).
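For completeness, a hedged sketch of the two classifiers just described, fit to the toy data: the nnet and e1071 packages are assumptions (the thesis does not state which implementations were used), and the tuning values shown (hidden units, weight decay, radial kernel) are illustrative only.

# Minimal sketch (assumed implementations): a small neural network and an SVM.
library(nnet)    # single-hidden-layer neural network with weight decay
library(e1071)   # support vector machine with a radial (nonlinear) kernel

set.seed(1)
nn_fit  <- nnet(y ~ x1 + x2, data = toy, size = 5, decay = 0.01,
                maxit = 500, trace = FALSE)
svm_fit <- svm(y ~ x1 + x2, data = toy, kernel = "radial", probability = TRUE)

predict(nn_fit, test, type = "raw")   # estimated P(class 1) from the network
attr(predict(svm_fit, test, probability = TRUE), "probabilities")

In both cases the estimated class-1 probabilities, rather than the hard class labels, would be retained for ranking.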
1.2.6 LAGO

LAGO is a computationally efficient classifier, developed by Zhu, Su, and Chipman (2006), to rank the rare-class objects ahead in a two-class problem. LAGO targets only the rare class for detection and thus saves much computation time on a large database. Whenever the goal is ranking rare objects, a classifier computes posterior class probabilities conditional on the feature variables and uses the computed probabilities to rank the target. Zhu, Su, and Chipman (2006) argued that, as far as ranking is concerned, the posterior class probabilities given the features and the ratio of the feature distribution of the rare class to that of the background class serve the same purpose, as the former is a monotone transformation of the latter for fixed prior probabilities. In LAGO, the density for the rare class is computed by a kernel approach with mass only on the rare-class objects. LAGO takes the distribution of the features for the background class around a rare-class object to be flat (i.e., uniform) and reciprocal to the bandwidth of the kernel density determined earlier. Finally, the values of the ratio of the feature distribution of the rare class to that of the background class are used to rank the objects. Zhu, Su, and Chipman (2006) generalize LAGO to the multivariate problem using the naive Bayes principle, i.e., by multiplying the univariate feature distributions together as if they were independent of each other. LAGO has a parameter k, like KNN, which is usually determined adaptively by cross-validation.

1.3 Ensembles of Classifiers

An ensemble of classifiers is an aggregated collection of models/classifiers which can itself be considered a model/classifier. An ensemble can improve the classification accuracy of a non-ensemble classifier. In this section, I briefly present five ensembles: bagging, random forests, regularized random forests, boosting, and the ensemble of systematic subsets.

1.3.1 Bagging

Breiman (1996a) introduced an ensemble method, called bagging, which aggregates over several tree classifiers constructed on bootstrap samples of the training observations. Bagging is an acronym for bootstrap aggregating. A bootstrap sample is a sample, drawn with replacement, of the same size as the training set (Efron and Tibshirani, 1994). The bootstrap samples are used to construct multiple versions of tree predictors, and bagging aggregates them into an ensemble. The aggregation is done by majority vote when predicting a class variable and by averaging over the versions when predicting a numerical outcome. When aggregation is done by majority vote, ties are handled by a random mechanism.

The vital element for the success of bagging is the instability of the prediction method used to build the constituent classifiers. If perturbation of a training dataset causes significant changes in prediction performance, then bagging can improve accuracy. Breiman (1996b) showed that bagging can push a good but unstable procedure a significant step toward optimality. On the other hand, bagging does not improve the performance of a stable classifier. The instability of classifiers is studied by Breiman (1996b), where it is pointed out that neural networks, classification and regression trees, and subset selection in linear regression are unstable, while k-nearest neighbours is stable.
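A bare-bones sketch of bagging with classification trees, written out by hand to expose the bootstrap-and-aggregate idea; in practice one would use a package such as ipred or randomForest. Aggregation here is by averaging the class-1 probabilities, which is what matters for ranking; a majority vote over the predicted classes would give the classification rule described above. The toy and test data frames are reused from the earlier sketches, and the number of bootstrap trees is illustrative.

# Minimal sketch: bagging by hand -- bootstrap samples, one tree per sample,
# aggregation by averaging the predicted class-1 probabilities.
library(rpart)

set.seed(1)
M <- 100
prob_mat <- sapply(1:M, function(m) {
  idx  <- sample(nrow(toy), replace = TRUE)            # bootstrap sample
  tree <- rpart(y ~ x1 + x2, data = toy[idx, ], method = "class")
  predict(tree, test, type = "prob")[, "1"]            # P(class 1) for the test objects
})
rowMeans(prob_mat)   # bagged probability of class 1 for each test object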
1.3.2 Random Forests

In bagging, each tree is built using a bootstrap sample of the training dataset and employs all of the feature variables. Breiman (2001) incorporated one extra layer of randomness, random selection of features at each node of a tree, into bagging and named the new ensemble random forests. In random forests, each tree is grown using a bootstrap sample of the training data, and at each node of the tree the best split is chosen from a random sample of mtry variables instead of all feature variables. Algorithm 1.1 shows the steps in learning a random forests classifier using a training set and predicting the test observations. This extra layer of randomness is used to break the correlation between classifiers and thus to improve overall accuracy. Each tree is grown to its maximal depth and aggregation is done by majority vote.

Algorithm 1.1 Random Forests
1. Let the number of training observations and predictor variables be n and D, respectively.
2. For ntree = 1 to M:
   - Generate a synthetic training set by drawing a bootstrap sample of size n from the original training set.
   - Build a fully grown, unpruned classification tree using the synthetic training set.
   - At each node of the tree, randomly choose mtry = √D variables to find the split-point of that node.
3. End For.
4. Predict test observations by aggregating the predictions from the M classification trees using majority vote.

Breiman (2001) showed that the upper bound on the prediction error of random forests depends on the ratio of the average correlation between classifiers to the average strength of all classifiers. Thus, a tighter upper bound on the prediction error can be achieved by minimizing the average correlation and maximizing the average strength of the classifiers. The extra layer of randomness helps to lower the average correlation, and growing trees to maximal depth helps to increase the average strength. Although growing trees to maximal depth increases prediction variability, aggregation over many trees reduces the overall variance and controls overfitting.

Random forests offers built-in estimation of the test-set error via the out-of-bag (OOB) samples. Approximately one third of the training cases are left out of each bootstrap sample and constitute the OOB sample. The importance of feature variables can also be assessed in random forests: each feature variable is randomly permuted in the OOB samples and its impact on the prediction error is measured, and variables with large changes in prediction error are deemed important. Random forests also computes a proximity matrix by counting how often a pair of points lands in the same terminal node. The proximity matrix is useful for outlier detection, clustering, missing value replacement, and so on.

The algorithm for random forests is fast, often faster than growing and pruning a single tree. Random forests has only one tuning parameter, mtry. Breiman suggests using mtry = √D, where D is the number of feature variables. To construct a random forests ensemble, I grow 500 trees with tuning parameter mtry = √D using the R package randomForest. Constructing 500 trees in a random forest is common practice.
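The thesis fits random forests with the R package randomForest, ntree = 500 and mtry = √D. The sketch below follows that recipe, but applies it to the toy data from Section 1.2.1 rather than the bioassay data, and ranks the training observations by their predicted probability of class 1; rf$votes gives the out-of-bag version of these probabilities.

# Random forest with the settings used in the thesis: 500 trees, mtry = sqrt(D),
# followed by ranking on the predicted probability of the rare class.
library(randomForest)

set.seed(1)
D  <- 2                                      # number of feature variables
rf <- randomForest(y ~ x1 + x2, data = toy,
                   ntree = 500, mtry = floor(sqrt(D)), importance = TRUE)

p_rare <- predict(rf, toy[, c("x1", "x2")], type = "prob")[, "1"]
head(order(p_rare, decreasing = TRUE))       # top-ranked observations
importance(rf)                               # permutation-based variable importance
rf$votes[1:5, ]                              # out-of-bag class-vote fractions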
1.3.3 Regularized Random Forests

Deng and Runger (2012) proposed a regularized version of random forests. The goal is to select high-quality feature subsets without reducing the prediction performance of the ensemble itself.

The only difference between regularized random forests and random forests (Breiman, 2001) is in the tree-growing process. A regularized random forests regularizes the variables at each node of the trees in the forest. In order to be selected for splitting a node, a variable needs to produce substantially larger "information gain" than the variables already selected in previous splits. For example, let F be the set of features used in previous splits in a tree. To split a node, the tree-growing algorithm avoids selecting a new feature \(x_j \notin F\) unless its gain(\(x_j\)) is substantially larger than \(\max(\mathrm{gain}(x_i))\) for \(x_i \in F\). To achieve this goal, the regularized gain is calculated as

\[
\mathrm{gain}_R(x_j) =
\begin{cases}
\lambda \cdot \mathrm{gain}(x_j), & x_j \notin F \\
\mathrm{gain}(x_j), & x_j \in F
\end{cases}
\qquad (1.2)
\]

where \(\lambda \in [0, 1]\) is the regularization coefficient. The regularized gain \(\mathrm{gain}_R(x_j)\) is used for splitting each non-terminal node instead of \(\mathrm{gain}(x_j)\). In this thesis, I have used the R package RRF (Deng, 2012) to fit regularized random forests.
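A hedged sketch of the corresponding RRF call on the toy data; coefReg plays the role of the regularization coefficient λ in Equation (1.2), and the value shown here is illustrative rather than the one used in the thesis.

# Minimal sketch: regularized random forest via the RRF package (assumed usage).
library(RRF)

set.seed(1)
rrf <- RRF(y ~ x1 + x2, data = toy, ntree = 500, coefReg = 0.8)
rrf$feaSet                        # features actually selected by the regularized splits
predict(rrf, test, type = "prob")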
1.3.4 Boosting

Boosting, pioneered by the work of Schapire (1990), Freund (1995) and Freund and Schapire (1996a,b, 1997), is an ensemble method. The boosting algorithm starts by assigning equal weights to all observations and sequentially builds a series of classifiers by reweighting only the misclassified observations. It focuses more on the observations misclassified by previous classifiers. At the end, each classifier in the ensemble casts a vote with an appropriate weight determined from its misclassification rate. The AdaBoost.M1 algorithm (Freund and Schapire, 1996a) for classification is presented below.

Algorithm 1.2 AdaBoost.M1
Suppose we have n observations in the training data, \((\mathbf{x}_i, y_i)\), \(i = 1, \ldots, n\), where the class variable is \(y_i = 1\) for class 1 and \(y_i = -1\) for class 2, and \(\mathbf{x}_i\) is the vector of feature variables.
1. Initialize the observation weights \(w_i = 1/n\), \(i = 1, \ldots, n\).
2. For m = 1 to M:
   - Fit a classifier to the training data with weights \(w_i\) and get the prediction \(C_m(\mathbf{x}) \in \{-1, 1\}\).
   - Compute the weighted misclassification error
     \[
     \mathrm{err}_m = \frac{\sum_{i=1}^{n} w_i\, I(y_i \ne C_m(\mathbf{x}_i))}{\sum_{i=1}^{n} w_i}.
     \]
   - Compute the weight for the mth classifier
     \[
     \alpha_m = \log\!\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right).
     \]
   - Increase the weights for the misclassified observations,
     \[
     w_i \leftarrow w_i \cdot \exp\!\left[\alpha_m\, I(y_i \ne C_m(\mathbf{x}_i))\right],
     \]
     and scale the weights to sum to 1, for all \(i = 1, \ldots, n\).
3. Output the aggregated classifier
   \[
   C(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{m=1}^{M} \alpha_m C_m(\mathbf{x})\right).
   \]

Algorithm 1.2 aims to improve the classification accuracy of a weak classifier. However, the weak classifier needs to perform better than a random guess. If the current classifier misclassifies an observation, then this observation receives more weight in the next iteration. In the final overall ensemble, classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight.

1.3.5 Ensemble of Systematic Subsets

The ensemble of systematic subsets (Wang, 2005) is specifically designed to rank objects from a particular class ahead of the other class. Although this idea is intuitive for a two-class problem, it can easily be generalized to the multi-class problem, i.e., ranking objects from a particular class ahead of the other classes. For this ensemble, the constituent classifiers are constructed on subsets of the feature variables and aggregation is done by probability averaging, i.e., averaging the probabilities of belonging to the target class from the different classifiers built on subsets of variables. The performance of this ensemble depends on the strength and diversity of the constituent classifiers constructed on the subsets of feature variables.

Consider a dataset with three feature variables: x1, x2 and x3. The set of all possible subsets is {x1, x2, x3, (x1, x2), (x1, x3), (x2, x3), (x1, x2, x3)}. A total of seven classifiers can be constructed, each using one subset of features, and the probability of belonging to the target class is averaged over the seven classifiers. Such average probabilities are used to rank the target objects ahead in the list of all objects. Many ensembles can be constructed based on the orders of the subsets: the systematic subsets of order one average over the first three classifiers, order two averages over the first six classifiers, and order three averages over all seven classifiers.

Wang (2005) employed KNN and the classification tree as base learners to construct the ensemble of systematic subsets. When KNN is the base classifier, leave-one-out cross-validation was used to choose k, the number of nearest neighbours. Although the use of KNN was found encouraging in terms of prediction performance by Wang (2005), it proved computationally expensive for large data. An increase in the number of feature variables exponentially increases the number of possible subsets, making the ensemble even slower.
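A small sketch of the probability-averaging idea behind the ensemble of subsets, using random forests as the base classifier on the toy data; the original proposal used k-nearest neighbours or classification trees, so the base learner here is an illustrative substitution, and with only two features the toy example has just three non-empty subsets.

# Minimal sketch: ensemble over all non-empty subsets of the feature variables,
# aggregated by averaging the predicted probabilities of the target class.
library(randomForest)

features <- c("x1", "x2")                     # with D features there are 2^D - 1 subsets
subsets  <- unlist(lapply(seq_along(features),
                          function(k) combn(features, k, simplify = FALSE)),
                   recursive = FALSE)

set.seed(1)
probs <- sapply(subsets, function(s) {
  fit <- randomForest(x = toy[, s, drop = FALSE], y = toy$y, ntree = 500)
  predict(fit, test[, s, drop = FALSE], type = "prob")[, "1"]
})
rowMeans(probs)   # averaged probability of the target class, used for ranking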
A similar approach willbe adapted in this thesis to filter weak and useless variables.1.5 Drug Discovery DatasetsI have used four PubChem BioAssay datasets (www.ncbi.nlm.nih.gov/pcassay). Eachdataset provides bioassay screens of chemical substances which are either active or inactiveagainst a biological target. The activity status of a chemical compound is determined throughhigh-throughput screening (HTS). The selected datasets are highly unbalanced: on averagethe ratio is 1 active compound to 100 inactive compounds. The process of determining activitystatus of many chemical compounds in a chemical library is costly and time consuming.Chemists are able to understand the structures of chemical compounds at the atomiclevel which provides quantitative representation of chemical structures. A set of variablesrepresenting the quantitative structure of chemical compounds is also known as a descriptorset. The molecule-based hypothesis is that compounds with similar chemical structures havesimilar activities (Mezey et al., 2001). This hypothesis leads to the quantitative structure-activity relationship (QSAR) models which relate the biological activity status of chemicalcompounds with their quantitative structures. The QSAR models can be used to rank the testcompounds and the top ranked candidates can be passed through high-throughput screeningto determine their activity status. This process of virtual screening is much cheaper and fasterthan the high-throughput screening of all of the compounds as it is very cheap and fast togenerate descriptor sets using computer softwares. There exist many molecular descriptor setsin QSAR studies; (see Leach and Gillet, 2007; Todeschini and Consonni, 2000). I have usedfive variable-rich descriptor sets for each of the four PubChem Assays. The drug discoverydatasets are described in detail in Sections 2.2 and 2.3 of Chapter 2.131.6. Protein Homology Dataset1.6 Protein Homology DatasetProtein homology means biological homology between proteins. Two or more proteins arehomologous in the sense that they have a common evolutionary origin. Knowing homologystatus helps scientists to infer the evolutionary sequences of many proteins (Koonin andGalperin, 2003, Chapter 2). The protein homology dataset is downloaded from the 2004KDD Cup website. The response variable is the homology status between a candidate proteinand a native protein. The predictors are the 74 feature variables which are derived fromthe similarity search between a candidate protein and a native protein. A total of 303 nativeproteins are considered in this dataset which represent 303 blocks. There are several candidateproteins in each block which are tested for homology against a native protein. The 303 blocksare randomly divided into training and test sets by the KDD Cup organizers. The trainingset contains a total of 153 blocks for which the homology status is known to us. The test setcontains a total of 150 blocks for which the homology status is completely unknown to us.The dataset is highly sparse: on average there are 5 homologous proteins in each block in thetraining set. The protein homology dataset is described in detail in Section 5.2 of Chapter 5.I fit an ensemble using the 153 blocks in the training set. The fitted ensemble is used torank the candidate proteins in each test block so that the rare homologous proteins are foundin the top. The estimated probabilities of being homologous are used to rank the candidateproteins in each block. 
Thus, a performance metric is computed in each test block. The final performance of the ensemble is measured by averaging the performance metric over the 150 test blocks. The method will be described in detail in Chapter 5.

1.7 Summary

The goal of this thesis is to rank the rare class objects ahead of the majority class objects in a dataset in such a way that the rare objects are found at the top of the ranked list. We propose ensemble methods to serve this goal. The ensembles are developed focusing on datasets with many variables. Instead of regularization, I use the richness of variables in a dataset to formulate the ensemble. The motivation for developing our ensemble comes from the work described next in Chapter 2.

Chapter 2
Ensembling Natural Subsets of Variables

2.1 Introduction

In drug discovery, the compounds in a chemical library may be "active" or "inactive" against a biological target. High-throughput screening (HTS) may be used to assay a portion of the compounds to determine their activity status. In quantitative structure-activity relationship (QSAR) studies, it is customary to model the biological activity of compounds by their chemical structures or physiochemical properties, known as molecular descriptors (Hughes-Oliver et al., 2011; Merkwirth et al., 2004; Rusinko et al., 1999; Svetnik et al., 2003). In this chapter, I train a classifier using molecular descriptors of chemical compounds with known activity status. The trained classifier is used to estimate the probability of activity for the test compounds.

The classifiers are applied to four drug discovery bioassays (described in Section 2.2) which contain very small proportions of active compounds compared to the proportion of inactive compounds. Moreover, I am interested in identifying the rare active compounds only. In such problems, a prediction model might perform poorly in classifying the rare active compounds. For example, if the proportion of active compounds is smaller than 1%, a classifier that classifies every compound as inactive will have a correct classification rate greater than 99% without correctly classifying a single active compound. Obviously, this does not serve our purpose at all. Hence, unlike classification, I rank all of the compounds in a dataset using the probabilities of activity so that the rare actives are found at the top of a shortlist of all compounds. Thus, a classifier will be considered good as long as it ranks more actives earlier in the shortlist, no matter whether the model can correctly classify any active or not. As such, if a chemist is interested in finding the rare active compounds in a huge chemical library, he or she can go through a shortlist of the top-ranked compounds only and thereby save time and money.

Many non-ensemble classifiers, such as neural networks, recursive partitioning, k-nearest neighbours, support vector machines, etc., have been successfully applied in QSAR studies (see Aoyama et al., 1990; Doniger and Hofmann, 2002; Kauffman and Jurs, 2001; Rusinko et al., 1999). In this chapter, I will be dealing with ensembles of classifiers. Merkwirth et al. (2004) described the application of ensemble methods to binary classification problems in QSAR studies. They built ensembles by combining classifiers constructed on random subspaces of the features, using a particular modelling method chosen from k-nearest neighbours and support vector machines.
Their performances were compared with the non-ensemble k-nearest neighbours, support vector machines, and the linear regression model with a ridge penalty. The resulting ensembles were found to perform better than the non-ensemble classifiers. Bruce et al. (2007) presented a comparative assessment of several machine learning tools for mining drug data, including support vector machines and decision-tree-based ensembles: boosting, bagging, and random forests. The authors demonstrated that the ensemble "random forests" can provide consistent improvement in predictive performance over a single decision tree in terms of classification accuracy. Svetnik et al. (2003) applied random forests for classification of chemical compounds in QSAR studies. Their analysis demonstrates that random forests is a powerful tool capable of delivering highly accurate performance.

Chen et al. (2004) proposed two techniques to deal with unbalanced classification problems using random forests: one is based on assigning a large (small) weight to the misclassification of the minority (majority) class objects, and the other is based on under-sampling of the majority class objects while drawing bootstrap samples during the training process. Both approaches were shown to improve prediction accuracy for the minority class. Zhang et al. (2009) proposed an ensemble of classification trees for QSAR studies by adaptively determining the probability threshold, instead of using the default 0.5, to classify the classes of chemical compounds. In this ensemble, the constituent classifiers were built by sampling 70% of the feature variables and then growing classification trees optimized by pruning. This ensemble was shown to perform well on highly unbalanced QSAR data. These works (Chen et al., 2004; Zhang et al., 2009) dealt with the classification of the minority class, whereas, in this chapter, I am interested in early ranking of the minority class objects.

Hughes-Oliver et al. (2011) compared a comprehensive collection of state-of-the-art QSAR models and five descriptor sets to rank the rare active compounds ahead of the inactive compounds in several highly unbalanced two-class assay datasets. Among the models, there were eleven non-ensemble classifiers and one ensemble classifier. Repeatedly, the ensemble random forests with one of the five sets of descriptors appeared as the top performing model. But there was no clear winner among the descriptor sets. For example, in one of the assay datasets, AID362, random forests using the descriptor sets "Burden Numbers" and "Atom Pairs" (the descriptor sets are described in Section 2.3) ranked first and fourth from the top, respectively, whereas in another assay dataset, AID364, the positions of these two descriptor sets (again with random forests) were swapped.

As noted earlier, the molecular descriptors play an important role in developing a classifier. There exist many molecular descriptor sets in QSAR studies (see Leach and Gillet, 2007; Todeschini and Consonni, 2000). Although the number of available descriptor sets is large, there is no consensus on the type of input descriptors needed to build a uniformly well-performing QSAR model (Zhang et al., 2009). In such a situation, rather than using a single descriptor set, we use several sets to build our model.

Until recently, many researchers have tried to find either a good model/classifier or a good set of descriptors of the chemical compounds in QSAR studies.
But the performance of the descriptor sets is target-specific, and no single set can yet be labelled the best. As such, we have proposed an ensemble method which aggregates over several molecular descriptor sets employing one of the most accurate QSAR classifiers: random forests. I rank compounds in a dataset by averaging the probabilities of activity from random forests classifiers applied to different sets of molecular descriptors (the method is presented in Section 2.4). The proposed ensemble performs better than the most accurate random forests classifier that uses a single descriptor set. Our method is also more computationally efficient than the random forests classifier applied to the pool of descriptor sets. Moreover, for the most part, the former model outperforms the latter.

2.2 BioAssay Data

I focus on the application of our methods to four BioAssay datasets made available by the Molecular Libraries Screening Center Network (MLSCN) (http://mli.nih.gov/mli/mlscn). The bioassay datasets were briefly introduced in Section 1.5 of Chapter 1. These assays typically resulted in continuous responses (percent inhibition) from a primary screen and binary responses (active versus inactive) from a secondary screen. In this thesis I focus on the binary responses. Table 2.1 presents the biological targets, number of compounds, number of active compounds and proportion of actives in the four assay datasets.

2.2.1 Assay AID364

Assay AID364 is a cytotoxicity assay conducted by The Scripps Research Institute - Molecular Screening Center (http://mlpcn.florida.scripps.edu/). The cytotoxicity of a chemical compound is its quality of being toxic to cells, a property useful in developing potential human therapeutics (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=364). In this assay the goal of the Scripps Research Institute was to screen for cytotoxic compounds which were active against cancer cells. A total of 3311 compounds were tested, and 50 (1.51%) of them were found active. This is the smallest dataset, in terms of the number of compounds, among the four assays.

2.2.2 Assay AID371

Assay AID371, also a cytotoxicity assay, was conducted by the Southern Research Molecular Libraries Screening Center (http://www.southernresearch.org/life-sciences/). The ability of a cytotoxic compound to inhibit the growth of the human non-small cell lung tumour line, A549, is a preliminary indication of anti-cancer activity for treating patients with lung cancer (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=371). This assay contains a total of 3312 compounds with 278 actives (8.4%).

2.2.3 Assay AID362

Assay AID362 is a formylpeptide receptor ligand binding assay conducted by the New Mexico Molecular Libraries Screening Center (http://nmmlsc.health.unm.edu/). The formylpeptide receptor (FPR) family of G-protein coupled receptors (GPCR) contributes to the localization and activation of tissue-damaging leukocytes at sites of chronic inflammation (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=362). This assay contains 4279 unique compounds, of which 60 are active (1.4%).

2.2.4 Assay AID348

According to Wikipedia, "Beta-glucocerebrosidase is an enzyme with glucosylceramidase activity that is needed to cleave, by hydrolysis, the beta-glucosidic linkage of the chemical glucocerebroside, an intermediate in glycolipid metabolism." (http://en.wikipedia.org/wiki/Glucocerebrosidase).
The inherited deficiency of this enzyme results in Gaucher's disease, a lysosomal storage disease characterized by an accumulation of glucocerebrosides. The deficiency also causes some forms of disorder affecting the central nervous system. This beta-glucocerebrosidase assay, AID348, developed by the National Institutes of Health Chemical Genomics Center (http://www.ncats.nih.gov/research/reengineering/ncgc/ncgc.html), has 4946 compounds, of which 48 are active. This is the largest assay in terms of the number of compounds, with only 0.97% active compounds.

Table 2.1: The four bioassay datasets considered in this chapter. The numbers in () are the number of active compounds in each of the assays.

  Assay    Biological Target              # of Compounds   Fraction of Actives
  AID364   Potential human therapeutics    3311 ( 50)       0.0151
  AID371   Lung tumor cells                3312 (278)       0.0839
  AID362   Tissue-damaging leukocytes      4279 ( 60)       0.0140
  AID348   Gaucher's disease               4946 ( 48)       0.0097

2.3 The Descriptor Sets / Natural Subsets of Variables

Molecular descriptors are numeric variables that describe the structure or shape of molecules, helping to predict the activity or properties of molecules in drug discovery. The Quantitative Structure-Activity Relationship (QSAR) model relates the activity/toxicity/drug potency of chemical compounds to their molecular descriptors. In this chapter, I have used five sets of molecular descriptors for the four PubChem assays. As the molecular descriptor sets are available in the QSAR literature as separate sets of variables, each describing different properties of the chemical compounds, I term each of them a "natural subset of variables."

The descriptor sets are obtained using a descriptor generator engine: computer software called PowerMV (Liu et al., 2005). The generated descriptor sets are: Atom Pairs (AP), Burden Numbers (BN) (Burden, 1989; Pearlman and Smith, 1999), Carhart Atom Pairs (CAP) (Carhart et al., 1985), Fragment Pairs (FP), and Pharmacophores Fingerprints (PH). The Burden Numbers are continuous descriptors, and the other four are bit string descriptors where each bit is set to "1" when a certain feature is present and "0" when it is not. See Liu et al. (2005) and Hughes-Oliver et al. (2011) for further information on these descriptor sets, how they are generated and what properties of the molecules they represent.

Table 2.2: The number of non-constant feature variables for each of the five descriptor sets. The numbers in () show the total number of feature variables generated by PowerMV.

  Descriptor Set                    AID364   AID371   AID362   AID348
  Atom Pairs / AP (546)                380      382      360      367
  Burden Numbers / BN (24)              24       24       24       24
  Carhart Atom Pairs / CAP (4662)     1585     1498     1319     1795
  Fragment Pairs / FP (735)            580      580      563      570
  Pharmacophores / PH (147)            120      119      112      122
  All Features / AF (6114)            2689     2603     2378     2878

PowerMV generates a total of 546, 24, 4662, 735 and 147 feature variables for AP, BN, CAP, FP and PH, respectively. Table 2.2 shows the number of non-constant feature variables for each descriptor set and assay. The binary descriptor set CAP provides the largest number of feature variables (1319-1795), and the continuous descriptor set BN provides the smallest number of feature variables (24). Three of the four binary descriptor sets, AP, CAP and FP, are very rich in terms of the number of variables. The continuous descriptor set BN is also rich in the sense that a continuous variable possesses much more resolution than a binary variable.
The last row of Table 2.2 shows the number of feature variables after pooling the five descriptor sets together. I call this pool of descriptor sets all features and denote it by AF.

2.4 Ensemble of Descriptor Sets

I build an ensemble by averaging the probability of activity across descriptor sets and call the resulting ensemble the "ensemble of descriptor sets", in short EDS. Suppose there are n_s sets of molecular descriptors. Using training data, I build n_s random forests, one per descriptor set, and compute probabilities of activity using each of the classifiers. The computed probabilities are then averaged across the classifiers/sets of descriptors to build the final ensemble. The averaged probabilities of activity are used to rank the compounds in the test set. The compound with the largest averaged probability of activity is ranked first, followed by the compound with the second largest averaged probability of activity, and so on.

The logic for building such an ensemble is straightforward. Dietterich (2000) pointed out that the performance of an ensemble depends critically on three factors: (i) strong base learners, i.e., a good modelling/classification method to build the constituent models/classifiers, (ii) low correlation between predictions obtained from the constituent classifiers, i.e., a diverse set of constituent models/classifiers, and (iii) the strategy of combining results from the constituent models/classifiers. To construct EDS, I choose random forests as the base learner, which is undoubtedly a strong classifier. Random forests is a highly accurate ensemble, as it has often appeared as the top performer in various QSAR studies (Bruce et al., 2007; Hughes-Oliver et al., 2011; Svetnik et al., 2003; Zhang et al., 2009). I conjecture that the different sets of descriptors are so diverse that they will generate weakly correlated classifiers, i.e., low correlation between the probabilities of activity across the classifiers.

There are two popular schemes to aggregate multiple classifiers in an ensemble: majority voting and probability averaging. Wang (2005) showed that probability averaging is better than majority voting. Zhang et al. (2009) pointed out that the gain in accuracy from probability averaging relative to majority voting can exceed 6% when the base learner is more accurate than a learner that predicts at random.

In this chapter, I average classifiers across five sets of descriptors: Atom Pairs (AP), Burden Numbers (BN), Carhart Atom Pairs (CAP), Fragment Pairs (FP), and Pharmacophores (PH). The advantages are two-fold: (i) improvement in every dataset investigated in this chapter over the most accurate random forest applied to any single set of molecular descriptors, and (ii) a smaller computational requirement than the random forest applied to the pool of descriptor sets (AF). The base learner, random forests, is constructed with the R package randomForest (Liaw and Wiener, 2011). As there is only one tuning parameter (mtry) in random forests, I use the default, mtry = √D, following the suggestion of Breiman (2001).
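A minimal R sketch of the EDS construction may make the averaging step concrete. It assumes the five descriptor sets are stored as a list of data frames (descr_list) sharing the same rows (compounds), a binary activity factor y with the active class labelled "1", and index vectors train_id and test_id; these names are illustrative, not part of the thesis software.

  library(randomForest)

  eds_probabilities <- function(descr_list, y, train_id, test_id) {
    prob <- sapply(descr_list, function(X) {
      # one random forest per descriptor set; mtry is left at its default sqrt(D)
      rf <- randomForest(x = X[train_id, ], y = y[train_id], ntree = 500)
      predict(rf, newdata = X[test_id, ], type = "prob")[, "1"]
    })
    rowMeans(prob)   # average probability of activity across the descriptor sets
  }

  # The test compounds are then ranked by the averaged probability:
  # shortlist <- order(eds_probabilities(descr_list, y, train_id, test_id),
  #                    decreasing = TRUE)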
2.5 Evaluation of Classifiers

The goal of this thesis is to rank the test compounds. But I have assay datasets where the classes of all of the compounds are known, i.e., there is no test data. Hence, to evaluate the predictive ranking performance of the ensembles, one potential solution is to split the data randomly into two parts of approximately equal size, and to consider one of the parts as training and the other as test. It is also possible to swap the training and test parts so as to use the entire dataset. This splitting process is called 2-fold cross-validation. In this chapter, I have used balanced 10-fold cross-validation, which I describe next.

2.5.1 Balanced 10-fold Cross-validation

In regular 10-fold cross-validation the dataset is randomly partitioned into 10 folds, each containing an approximately equal number of compounds. Treating one of these folds as a "test set", the remaining (10 - 1) = 9 folds are combined to form a "training set" in order to learn a model. This model is then applied to the "test set" to obtain predictions. The process is repeated, holding out each of the 10 folds for testing in turn. The advantage of 10-fold cross-validation over a one-time split (2-fold cross-validation) is the increased precision of error estimation.

Since the assay datasets contain very few active compounds, I have to be careful in making the splits. For example, in regular 10-fold cross-validation, a particular fold might completely miss the rare active compounds and thus make the training and test splits unrepresentative of each other with respect to the rare active compounds. Thus, I randomly divide the data into 10 folds in such a way that the resulting folds are approximately equal in size, with approximately equal numbers of actives in each. I call this splitting method balanced 10-fold cross-validation.

In a particular turn of the cross-validation, a set of compounds appears exactly once in the test fold. After all ten turns of the cross-validation, every compound has appeared exactly once in one of the ten test folds. Thus, having completed a balanced 10-fold cross-validation, I obtain the probabilities of activity for all of the compounds in an assay and use them to rank the compounds in order to determine the performance of the corresponding model.

2.5.2 Hit Curve

A hit curve provides a graphical inspection of the performance of a classifier when the objective is ranking rare class objects ahead of the majority class objects. I used the estimated probabilities of activity to rank the compounds in a dataset and thus to produce a shortlist of the top-ranked compounds. The number of compounds to be shortlisted for further investigation depends on the resources available in a project.

Let n be the total number of compounds in a test set and let a be the number of active compounds. Let 0 ≤ a_t ≤ a be the number of actives (called hits) in a shortlist of size t ∈ [1, n]. The hit curve is a plot of a_t versus t or, equivalently, a plot of a_t/a versus t/n when the information is given in terms of proportions. A hit curve can be viewed as a variant of the receiver-operating characteristic (ROC) curve, which plots the true positive and false positive rates on the vertical and horizontal axes, respectively. The hit curve is an effective method for evaluating a ranking procedure, showing its performance at every possible shortlist cutoff point t ∈ [1, n]. Classifier 1 with hit curve H_1(t) is unambiguously superior to classifier 2 with hit curve H_2(t) if H_1(t) ≥ H_2(t) at every t.

The left panel of Figure 2.1 shows seven hit curves generated by seven classifiers using the AID364 data.
A hit curve resembles a step function, with sloped segments where there are ties in the probabilities of activity. Having seen the hit curves, we may realize that finding the best hit curve is not an easy task, as they often cross and overlap each other. In such a situation, a single-number summary of a hit curve is helpful for evaluating classifiers and facilitating comparisons.

2.5.3 Initial Enhancement

Initial enhancement (IE), proposed by Kearsley et al. (1996), is a popular measure for evaluating the performance of classifiers when the objective is ranking rare class compounds. Suppose, having ranked the compounds using the estimated probabilities of activity, I have a shortlist of the top t ≤ n compounds which are highly likely to be active. Initial enhancement at a shortlist of length t is the hit rate at t divided by the proportion of actives in the entire collection of compounds. IE is defined as

IE = \frac{a_t / t}{a / n}.

It is a relative measure of the hit rate improvement offered by a classifier beyond what can be obtained under random ranking, and values much larger than 1 are desired. Notice that IE depends on the particular shortlist cutoff point t. Moreover, IE does not distinguish whether the actives are ranked at the very top or right before the end of the shortlist. Therefore, IE is not our favourite ranking evaluation method.

Here, we will see that sometimes IE does not adequately reward a hit curve with many actives at the start of a shortlist. The left panel of Figure 2.1 shows seven hit curves obtained from applying random forests to the five descriptor sets of the AID364 assay data. Following the results of Hughes-Oliver et al. (2011), IE is computed at t = 300. I compared two ensembles in terms of IE: random forests applied to atom pairs (AP) and random forests applied to the pool of five descriptor sets (AF). It is very clear that AP gives a very good initial enhancement of 6.401, which is bigger than 5.960, the IE obtained from AF. That is, AP ranks more actives in this shortlist of 300 compounds and hence has the larger IE. But careful inspection of the two hit curves (red and yellow lines for AP and AF, respectively) reveals that AF identifies more actives earlier in the list than AP. If our goal is to rank the rare actives early in the list, rather than merely by a particular cutoff point, we should choose random forests with AF as our desired QSAR classifier among the two. As IE fails to reward early ranking of active compounds in a shortlist, we should find another metric to evaluate QSAR models alongside IE.

2.5.4 Average Hit Rate

The average hit rate (AHR) gives a single-number summary for a hit curve and is a common measure in information retrieval (Zhu, 2004). Suppose we shortlist the top t ≤ n compounds and a_t of them are active. Then

h(t) = \frac{a_t}{t} \in [0, 1]   (2.1)

is the hit rate for the top t ranked compounds. Naturally, we want h(t) to be as large as possible at every t ∈ [1, n]. Let 1 ≤ t_1 < t_2 < ... < t_a ≤ n be the positions of the active compounds in the ranked list. The average hit rate, occasionally referred to as average precision, is defined as

AHR = \frac{1}{a}\left[h(t_1) + h(t_2) + \cdots + h(t_a)\right].   (2.2)

AHR averages the hit rates at the positions of the active compounds, and a larger AHR corresponds to the hit curve with the most rapid early rise. If a hit curve ranks more actives ahead of the inactive compounds, then AHR rewards the hit curve by assigning a bigger number. AHR reaches the maximum of 1 if all of the active compounds are ranked before the inactive compounds amongst those selected.
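Both measures can be computed directly from a vector of estimated probabilities of activity. The sketch below is illustrative only (the function and argument names are not from the thesis software): y is the 0/1 activity indicator, and ties in the probabilities are broken at random, as assumed in the next paragraph.

  rank_measures <- function(y, prob, cutoff = 300) {
    ord  <- order(prob, runif(length(prob)), decreasing = TRUE)  # random tie-breaking
    ys   <- y[ord]
    hits <- cumsum(ys)                        # a_t: actives among the top t
    n <- length(y); a <- sum(y)
    IE  <- (hits[cutoff] / cutoff) / (a / n)  # hit rate at the cutoff relative to random
    pos <- which(ys == 1)                     # positions t_1 < ... < t_a of the actives
    AHR <- mean(hits[pos] / pos)              # average of the hit rates h(t_j)
    list(hit_curve = hits, IE = IE, AHR = AHR)
  }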
To calculate AHR, we assume random ordering of the response within a tied group. Further details on AHR can be found in Chapter 3 of Wang (2005).

Sometimes AHR does not fully capture the properties of a hit curve in a single number. The measure gives much larger weight to actives found early in the list than to those found later. For example, we can compare the two hit curves corresponding to random forests applied to atom pairs (AP) and pharmacophores (PH) in Figure 2.1. AP gives an AHR of 0.279 and PH gives an AHR of 0.285, but for most of the list the hit curve corresponding to AP (the red line) is far above the hit curve corresponding to PH (the sky-blue line). The larger AHR for PH is due to the fact that it identifies more hits at the very start of the list: 10 of the first 10 compounds tested are hits.

If we rely on IE alone, we might infer that AP gives the better model. On the other hand, if we rely on AHR alone, we might infer that PH gives the better model. Thus, sometimes a single evaluation metric is not enough to reward a good hit curve. Hence, ideally I shall try to choose a classifier by maximizing both of the assessment measures, IE and AHR, i.e., I shall look for a classifier which gives more hits early in the list as well as more actives at the chosen test point. Otherwise, I will choose a classifier by maximizing AHR alone, provided that the IE is close to the maximum.

2.6 Results

This section contains results from applying the described ensembles to the assay datasets. So far, we have five random forests for the five sets of descriptors: atom pairs (AP), burden numbers (BN), carhart atom pairs (CAP), fragment pairs (FP), and pharmacophores (PH). There are also two ensembles using all of the descriptor sets: random forests applied to the pool of descriptor sets (AF), and the ensemble of descriptor sets (EDS). In total, there are seven ensembles to compare in each of the four assays.

2.6.1 Assay AID364

Panel (a) of Figure 2.1 shows seven hit curves corresponding to the seven classifiers. The hit curves are produced from the probabilities of activity obtained from a balanced 10-fold cross-validation of the compounds of the AID364 assay. As this assay contains only 50 active compounds, I computed IE at t = 300. The top three ensembles in terms of AHR are EDS, AF and CAP, and in terms of IE they are EDS, CAP and AP. So the top performing ensemble is EDS. The top performing random forest using a single descriptor set is CAP. The performance of random forests with the pool of descriptor sets (AF) is in second and fourth position in terms of AHR and IE, respectively.

[Figure 2.1 appears here; the legend reports AHR / IE for each method: AP 0.279/6.401, BN 0.349/5.518, CAP 0.352/6.622, FP 0.317/5.718, PH 0.285/4.635, AF 0.363/5.960, EDS 0.387/6.622.]

Figure 2.1: (a) The hit curves obtained from applying random forests to the five descriptor sets (AP, BN, CAP, FP and PH), and to the pool of the five descriptor sets (AF), of the AID364 dataset. The other hit curve is for the ensemble of descriptor sets (EDS) applied to the AID364 dataset. (b) The corresponding initial enhancement (IE) versus average hit rate (AHR) plot for the seven ensembles applied to the AID364 assay.
Both panels are obtained using the probabilities of activity from a particular balanced 10-fold cross-validation.

To make comparison of the methods easy and straightforward, I have plotted IE against AHR for the same balanced 10-fold cross-validation of the AID364 dataset (Panel (b) of Figure 2.1). As the target is to maximize both AHR and IE, a good method is positioned in the top right corner of this plot. Fortunately, the proposed ensemble EDS is indeed found in the top right corner. The top performing random forest using a single descriptor set is CAP, and our ensemble EDS outperforms CAP in terms of AHR and performs equally well in terms of IE.

In order to compare the average performances of the classifiers, I repeated the balanced 10-fold cross-validation a total of 16 times. There are 16 processors available in my department's computing network for parallel computation, and hence the 16 cross-validations. Since I want to compare performances in terms of AHR and IE, I performed a multivariate analysis of variance (MANOVA) with two dependent variables: AHR and IE. The factor variable methods has seven labels (AP, BN, CAP, FP, PH, AF, EDS) and the blocking factor has 16 blocks corresponding to the 16 cross-validations. After fitting the MANOVA model, employing standard assumptions (bivariate normality of the dependent variables, homogeneity of the covariance matrices across levels of the independent variables, independence of observations, etc.), I found that the effects of methods on both of the responses are highly significant using four testing methods: Pillai's Trace, Wilks' Lambda, Hotelling's Trace, and Roy's Largest Root (Johnson and Wichern, 2002, Section 6.9). Table 2.3 shows the Hotelling-Lawley test statistic from the MANOVA fit using the two responses, AHR and IE, for the AID364 bioassay data. The blocking factor is also found to be significant, which confirms the importance of including blocks in the model.

Table 2.3: Hotelling-Lawley test statistic from the MANOVA fit using two responses, AHR and IE, to the AID364 bioassay data.

  Factors     df   Hotelling-Lawley   App. F    Num df   Den df   Pr(> F)
  Methods      6        19.203        140.824     12      176     < 0.0001
  Blocks      15         4.877         14.307     30      176     < 0.0001
  Residuals   90

As the effect of the factor variable methods is found significant, I have compared the bivariate mean vectors (AHR, IE) for the seven ensembles. Using the bivariate normality assumption, the 95% confidence region for each bivariate mean vector is constructed. Panel (a) of Figure 2.2 shows the plot of mean AHR versus mean IE with 95% confidence regions for the seven ensembles in the AID364 assay. In terms of mean AHR and IE, the top performing ensemble is EDS. Using mean IE alone, the top performing random forests constructed on a single descriptor set are AP and CAP, and their performances exceed the performance of AF. All three models (EDS, AP and CAP) provide significantly larger mean IE than AF. Using mean AHR, the second performer from the top is AF, followed by CAP and BN. The message is: when we use different assessment criteria, the performances of the classifiers fluctuate a lot, but my ensemble EDS is in the top right corner. It is clear that when we aggregate classifiers over the five descriptor sets (AP, BN, CAP, FP, and PH), we get higher AHR and IE than any classifier constructed on a single descriptor set.
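The MANOVA comparison described above can be carried out with R's built-in manova() function. The short sketch below is illustrative: it assumes a data frame res with one row per method and cross-validation, factors method (seven levels) and block (16 levels), and numeric columns AHR and IE; these object names are mine, not from the thesis software.

  fit <- manova(cbind(AHR, IE) ~ method + block, data = res)
  summary(fit, test = "Hotelling-Lawley")   # alternatives: "Pillai", "Wilks", "Roy"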
[Figure 2.2 appears here: four panels, (a) Assay AID364, (b) Assay AID371, (c) Assay AID362, (d) Assay AID348.]

Figure 2.2: Plots of mean initial enhancement (IE) versus mean average hit rate (AHR) with 95% confidence bands obtained from applying the seven classifiers to the four bioassay datasets: AID364, AID371, AID362, and AID348.

2.6.2 Assay AID371

Assay AID371 contains 3312 compounds in total, of which 278 are active. As there are 278 active compounds, I have computed IE at the cutoff t = 600. I have fitted a MANOVA using the bivariate responses AHR and IE, and the results are similar to those for assay AID364. Panel (b) of Figure 2.2 shows the plot of mean AHR versus mean IE for the seven classifiers applied to the AID371 assay. As we have seen earlier, averaging random forests across descriptor sets provides better predictive ranking than any constituent random forest applied to a single descriptor set. In terms of the assessment measures AHR and IE, the top performing ensemble is EDS, followed by AF. The top performing random forest among the individual descriptor sets is FP, and its performance is significantly lower than that of my ensemble EDS.

2.6.3 Assay AID362

As the number of active compounds in this assay is 60, IE has been computed at t = 300 test compounds. The bottom left panel of Figure 2.2 shows the plot of mean AHR versus mean IE for the seven classifiers applied to assay AID362. I see results similar to the earlier assays AID364 and AID371: averaging random forests over the descriptor sets gives better predictive ranking of the rare actives than random forests applied to any single descriptor set. In terms of mean IE and AHR, the top performing ensemble is EDS. The second performer from the top is AF. The top performing random forest constructed on a single descriptor set is BN in terms of IE and AP in terms of AHR.

2.6.4 Assay AID348

Assay AID348 contains the largest number of compounds (4946) among the four assays, with only 48 active compounds. The assessment measure IE is calculated at t = 300 shortlisted compounds. The MANOVA with the bivariate response of AHR and IE gives results similar to the other three assays, and hence we directly present the mean AHR versus mean IE plot (Panel (d) of Figure 2.2). I see that averaging random forests over the descriptor sets gives better predictive ranking of the rare active compounds than random forests applied to any single set of descriptors. The two ensembles, EDS and AF, which use the five sets of descriptors, have significantly outperformed all five random forests constructed using any single set of descriptors. The classifier EDS appeared in second position in terms of both AHR and IE.

Surprisingly, random forests with the pool of descriptor sets (AF) appeared as the top performing ensemble in terms of mean AHR and IE for this assay. This result is not consistent with the results from the other assays. It is hard to say exactly why random forests with the pool of descriptor sets (AF) outperformed EDS, but I conjecture that the constituent classifiers of EDS are in general weak and so is their ensemble. I provide an
example in the following paragraphs.

The basic principle of constructing an ensemble states that, in order to perform well, an ensemble needs to aggregate strong and diverse constituent classifiers (Breiman, 2001). The diversity among the constituent random forests in EDS will be examined in Section 2.7. Here, I check the strengths of the constituent random forests of EDS by comparing their performances with classification trees applied to the five descriptor sets of the AID348 assay. The classification trees are fitted using the default settings of the R package Tree (Ripley, 2011). For a fair comparison, I used the same 16 balanced 10-fold cross-validations as for random forests to evaluate the classification trees.

Table 2.4 shows the mean AHRs and IEs over the 16 cross-validations for the classification tree (Tree) and random forests (RF) applied to the five descriptor sets (AP, BN, CAP, FP and PH). The table also shows the performances of their ensembles, the EDSs. The results are fairly interesting. In terms of mean AHR, Tree with CAP outperformed random forests with CAP. In terms of mean IE, Tree with FP and PH outperformed random forests with FP and PH, respectively. The ensembles of descriptor sets with random forests and with classification trees perform fairly similarly.

Table 2.4: The mean AHRs and IEs over the 16 cross-validations for classification tree (Tree) and random forests (RF) applied to the five descriptor sets (AP, BN, CAP, FP and PH), and for their ensembles, EDSs.

  Metric   Classifier     AP      BN      CAP     FP      PH      EDS
  AHR      Tree          0.033   0.039   0.071   0.066   0.069   0.122
           RF            0.063   0.090   0.068   0.077   0.070   0.128
  IE       Tree          3.547   4.585   6.470   7.290   5.774   8.007
           RF            5.186   6.618   7.161   7.069   5.526   8.388

In general, random forests provides much better predictive performance than a classification tree. So the question arises: why could random forests not improve on the performance of the classification trees for some descriptor sets? Perhaps those descriptor sets contain many unimportant feature variables, which decrease the performance of random forests. Thus, equipped with weak constituent random forests, the ensemble EDS appears weak. However, as you will see, I improve the performance of our ensemble in Chapters 3 and 4.

2.7 Diversity Map

In this section, I try to understand why the proposed ensemble, averaging probabilities of activity across descriptor sets, provides good predictive ranking of the rare active compounds. I plot the ranks of the active compounds corresponding to the seven classifiers applied to the AID362 and AID348 assays (Panels (a) and (b) of Figure 2.3, respectively).

[Figure 2.3 appears here; the legends report AHR / IE for each classifier. Panel (a), Assay AID362: AP 0.273/8.80, BN 0.245/9.51, CAP 0.268/9.03, FP 0.283/9.27, PH 0.215/7.46, EDS 0.325/9.98, AF 0.301/9.98. Panel (b), Assay AID348: AP 0.062/5.84, BN 0.103/7.15, CAP 0.072/7.21, FP 0.083/7.56, PH 0.071/5.68, EDS 0.137/8.59, AF 0.209/10.3.]

Figure 2.3: Diversity maps of ranks of the active compounds for the AID362 and AID348 assays. The probabilities of activity associated with cross-validation number 1 are used to rank the compounds.

The compounds are ranked using the probabilities of activity obtained from cross-validation number 1. If an active compound is ranked at the top position by a classifier, the compound receives the darkest gray colour. If a compound is ranked far down the sequence, it receives no colour at all.
The colour key in each plot shows where in the sequence an active compound is ranked by a classifier. I also order the active compounds along the right side of the vertical axis. The compounds are ordered using the probabilities of activity obtained from EDS. As a result, we can see the diversity in ranks across different classifiers. Such a plot is popularly known as a diversity map (Hughes-Oliver et al., 2011). I also report the AHRs and IEs for the seven classifiers to show how strong the classifiers are.

It is the diversity among AP, BN, CAP, FP and PH which makes our ensemble EDS perform better than the top performing random forest applied to any of the five descriptor sets. There is variation in ranks across the random forests applied to AP, BN, CAP, FP and PH, i.e., different classifiers rank different sets of active compounds well. For example, in the AID362 assay, active compound 1 is ranked very well by AP, CAP and FP, and moderately well by BN and PH. As a result of this variation, or diversity, the compound is ranked very well by EDS. We also see diversity in ranks among the classifiers AP, BN, CAP, FP and PH in assay AID348 (Panel (b) of Figure 2.3). We see that EDS ranks a compound very well if all of the constituent classifiers rank it well. If some of the classifiers rank a compound well and the others fail to rank it well, then EDS still ranks the compound well. If none of the constituent classifiers ranks a compound well, then EDS fails to rank it well.

However, it has been observed that random forests with AF is computationally very intensive. The next section (Section 2.8) presents the computational complexity of the classifiers presented so far.

2.8 Computational Complexity

Table 2.5 shows the computational time in minutes for the seven ensembles to complete a balanced 10-fold cross-validation. The times are obtained on an Intel Core i5 CPU (2.67 GHz) with 6.00 GB RAM and a 64-bit Windows operating system. It is easy to see that the computational time does not vary much with the number of compounds, but depends heavily on the number of feature variables. As a result, the computational time for AF is very high compared to the other models. It is worth mentioning that EDS and AF use the same number of feature variables, but the former requires 33-48% less computational time than the latter.

Table 2.5: Observed computational time (in minutes) to complete a balanced 10-fold cross-validation for the seven ensembles applied to the five descriptor sets of the four assay datasets. The number in () shows the number of feature variables used to construct the corresponding classifier.

  Assay Name                         AID364    AID371    AID362    AID348
  (Number of Compounds)              (3311)    (3312)    (4279)    (4946)
  Atom Pairs (360-382)                 9.09     12.30     11.28     12.76
  Burden Numbers (24)                  0.49      0.74      0.67      0.70
  Carhart Atom Pairs (1319-1795)     103.78    114.91    101.85    170.79
  Fragment Pairs (563-580)            21.60     27.59     25.98     31.05
  Pharmacophores (112-122)             3.11      3.86      3.38      5.39
  Ensemble of DS (2378-2878)         138.62    159.36    144.01    220.55
  All Features (2378-2878)           252.70    278.99    230.14    329.62

The computational time of a random forest can be divided into two parts: (1) time for reading the data, and (2) time for computation. Having read the data, the R package randomForest stores the entire data file in memory. Obviously, a large data file will occupy a large amount of computer memory. So this program is not scalable to a dataset for which the memory requirement exceeds the memory capacity of a computer.
So far we have been successful in running random forests on the pool of five descriptor sets (AF). But there are many descriptor sets available in the cheminformatics literature. Thus, if the memory requirement for a pool of many descriptor sets (more than 5) exceeds the memory of a computer, then AF would not be scalable on that computer. In the presence of many descriptor sets, EDS is easily scalable, whereas the scalability of AF is in doubt.

2.9 Summary

I have introduced a novel ensemble method, averaging probabilities of activity over the molecular descriptor sets, to rank rare active compounds ahead of the majority inactive compounds in QSAR studies. Convincing results have been observed in favour of the proposed ensemble, which ensures significant improvement over the most accurate classifier that can be built using a single set of descriptors. In fact, no particular set of descriptors performs uniformly better than the others, and the results of this chapter also support this statement. Moreover, the proposed method often outperforms the random forest classifier applied to the pool of the five descriptor sets. On the other hand, the ensemble EDS is scalable to many descriptor sets, whereas it may be quite difficult, or perhaps impossible, to apply random forests to AF. We have used five sets of descriptors in this study, but it is possible to include more, as many sets of molecular descriptors are available in the cheminformatics literature.

One of the reasons why averaging over descriptor sets works better than the top performing random forest applied to any of the five descriptor sets is the diversity of the classifiers built using different sets of molecular descriptors. The classifiers are diverse in the sense that each of them ranks a different set of active compounds well. As a result, the ensemble obtained by averaging the probabilities of activity ranks more active compounds well (perhaps a subset of the union of the active compounds found by all of the constituent classifiers) than any single constituent classifier does.

A naive method of pooling the descriptor sets together (AF) gave the top performance in conjunction with random forests for the AID348 assay. Perhaps there is a manageable number of good feature variables in that assay, and a regular random forest can handle them effectively to produce the top result. However, this result alone does not justify the use of AF instead of EDS, as the performance of AF fluctuates a lot across the other assays. For example, AF sometimes performs worse than the most accurate classifier based on an individual descriptor set. EDS is computationally efficient, scalable to high dimensions, and is one of the top two models. Moreover, it performs better than the top performing classifier based on any single molecular descriptor set.

I used random forests because it is one of the top performing ranking models. Some other modelling methods (classification trees, k-NN, or boosted trees, for example) might also give improved performance, as they have been found highly accurate in other applications. However, the good results from the ensemble of natural descriptor sets motivate me to uncover data-adaptive subsets of variables in a variable-rich descriptor set, which I describe next in Chapter 3.

Chapter 3
Ensembling Data-Adaptive Subsets of Variables

3.1 Introduction

A new ensemble using data-adaptive subsets of variables is proposed and applied to the four highly unbalanced (i.e.
sparse) drug discovery datasets. The response variable in our datasets is the compound activity status. The predictors are the five sets of variables called "descriptor sets". The variables in each descriptor set describe the chemical/molecular structures of the compounds. Each of the five descriptor sets contains a large number of predictors. Without uttering the conventional term "curse of dimensionality", we say that the descriptor sets are rich in terms of the number of variables. In fact, we judiciously use the richness of the descriptor sets to develop our ranking procedure.

We introduce the concept of a phalanx, a group of variables that work well together to form one of several models in an ensemble. Capitalizing on the richness of variables, we form phalanxes by grouping variables together. Phalanxes are characterized by their ability to yield a complete predictive model. The variables in a phalanx are in the same model because they complement each other, while the variables in different phalanxes work better in an ensemble.

Natural phalanxes are present in our datasets. For a particular bioassay, each of the five descriptor sets can be considered a natural phalanx. By ensembling over the natural phalanxes, our ensemble EDS performed better than the top performing random forest built on a single descriptor set (see Chapter 2). Natural phalanxes are sometimes suggested by subject-matter knowledge. For example, in the SNP genotyping platform of Tebbutt et al. (2005), a grouping of the variables is naturally suggested by the different chemical procedures employed in the genotyping platform. Finding a good ensemble of natural phalanxes is loosely connected to, but different from, the group LASSO (Yuan and Lin, 2007) and its variants (Meier et al., 2008). The main difference is that in the group LASSO the given groups are only evaluated by their ability to work together with other groups in a single model. Moreover, the groups may or may not form natural phalanxes, depending on whether the variables in each group are able to yield a good predictive model by themselves.

On the other hand, unknown statistical phalanxes may be present in variable-rich datasets. Uncovering these phalanxes is an interesting and challenging statistical problem. Motivated by the results of Chapter 2, we propose an algorithm to uncover statistical phalanxes and demonstrate its performance in several applications. We can think of our phalanx-forming algorithm as a special type of clustering of variables where "similarity" means working well together in a particular model and "dissimilarity" means working well apart in different models which are ultimately ensembled.

In principle, phalanxes can be formed using any given classifier. We use random forests (Breiman, 2001) because of its well-known superior performance for ranking compounds. We compare our ensemble of phalanxes with random forests and regularized random forests (Deng and Runger, 2012). As we will see in Section 3.7, the descriptor sets contain a fraction of noise variables, that is, variables which are not useful for ranking the compounds. Regularized random forests is useful for screening noise variables in a dataset. By comparing our ensemble of phalanxes against regularized random forests, we convey the message that, in this context, phalanxing is better than regularization.

We have found that the ensemble of phalanxes performs better than its competitors, random forests and regularized random forests.
Our ensemble performs very well when there are many variables in a descriptor set and when the proportion of active compounds is very small. The harder the problem, the better the ensemble of phalanxes performs. When the proportion of actives is very small, regularized random forests outperforms random forests; and when the number of variables is very large, random forests outperforms regularized random forests. In both situations, the difference between the top performer, our ensemble, and the nearest competitor is very large. Convinced by the good results of our ensemble, we urge people to use the term "rich in variables" instead of "curse of dimensionality".

3.2 Datasets and Variables

We analyze 20 datasets from four different assays in the Molecular Libraries Screening Center Network (MLSCN). The response variable in all cases is the compound activity status. Each of the assays contains a number of compounds that are either active or inactive against a biological target. Section 2.2 of Chapter 2 describes the four bioassay datasets.

All the assays are highly unbalanced (i.e., sparse) in terms of the proportion of actives. For three of the assays the fractions of active compounds are around 0.01. Hence, the problem of sparseness poses difficult challenges. For example, a classification tree would soon run short of rare active compounds and tend to build a shallow tree, using a few important variables. If there are many important variables, their participation would become difficult. This problem is partially addressed, but not completely resolved, by random forests-like ensembles. Our approach allows more participation of the variables in the phalanxes.

The covariates in all cases are molecular descriptors, numeric variables that describe the structure or shape of molecules, which may help to predict the activity of molecules in drug discovery. Quantitative Structure-Activity Relationship (QSAR) models relate the activity/toxicity/drug potency of chemical compounds to their molecular descriptors. We consider five sets of descriptors for each of the four assays. This gives a total of 4 × 5 = 20 datasets. Section 2.3 of Chapter 2 describes the five descriptor sets in the four bioassays.

3.3 Performance Measures

In Chapter 2, we described a few methods to evaluate ranking procedures when the goal is to detect a few rare occurrences hidden in the midst of a large number of uneventful objects. Ranking is usually performed using a classifier to estimate the probability of activity for the candidate compounds. We rank the compounds using their probabilities of activity. The goal is to rank the actives at the top of the list.

The standard method for evaluating the performance of a classifier is to compute its misclassification error (ME): the proportion of compounds assigned to the wrong class. But ME is not useful when the classes are highly unbalanced (Zhu et al., 2006). Instead of minimizing ME, we evaluate a classifier using a hit curve. The definition of a hit curve is presented in Subsection 2.5.2 of Chapter 2.

Figure 3.1 shows hit curves for three ensembles, RF, RRF, and EPX, applied to Atom Pairs of the AID348 assay. We consider a shortlist of 300 compounds because we wish to rank the actives early in the list. Note that EPX dominates the other two ensembles, as it finds more active compounds earlier in the list.
RRF and RF do not dominate each other because their hit curves criss-cross at several points.

Comparison of crossing hit curves, as for RF and RRF in Figure 3.1, can be difficult when there are many of them. The comparison would become impossible to handle if we repeated the computations many times through cross-validation. Thus, a single-number summary of a hit curve is desirable in order to facilitate comparison and to automate the selection process.

The average hit rate (AHR) gives a single-number summary for a hit curve and is a common measure in information retrieval (Zhu, 2004). The definition of the average hit rate is provided in Subsection 2.5.4 of Chapter 2. If a classifier identifies more hits earlier in the ranked list, AHR rewards that classifier by assigning a large number. The AHRs for RF, RRF and EPX (see Figure 3.1) are 0.06, 0.08 and 0.20, respectively. We use AHR not only to evaluate the ranking procedures but also to form the phalanxes (introduced in Section 3.4).

Initial enhancement (IE), introduced by Kearsley et al. (1996), is a popular ranking performance measure in QSAR studies. The definition of IE is presented in Subsection 2.5.3 of Chapter 2. Usually, IE evaluates the performance of a classifier at a particular cutoff point of a hit curve. Naturally, larger values of IE indicate better ranking by a classifier. Following Hughes-Oliver et al. (2011), we use a shortlist (cutoff) of 300 compounds. The IEs for RF, RRF and EPX in Figure 3.1 are 5.84, 5.53 and 6.26, respectively. For the reasons mentioned in Subsection 2.5.3 of Chapter 2, however, IE is not our favourite ranking evaluation method.

[Figure 3.1 appears here; legend: RF (0.062), RRF (0.081), EPX (0.199).]

Figure 3.1: Illustration of hit curves for three methods: random forests (RF), regularized random forests (RRF) and the ensemble of phalanxes (EPX), corresponding to Atom Pairs of the AID348 assay. The number in () is the corresponding average hit rate.

We have used balanced 10-fold cross-validation to assess the performance of the ranking procedures. Since the assay datasets contain very few actives, we make sure that there are enough actives in each of the ten groups. One of the groups is set aside to serve as the "test set" and the remaining nine groups are combined to fit the ranking model. The ranking model is then applied to the left-out "test set" to obtain the probability of activity for its compounds. The process is repeated for the ten groups to obtain probabilities of activity for all the compounds (see Subsection 2.5.1 of Chapter 2).

3.4 Searching for Data-Adaptive Subsets / Phalanxes of Variables

The term phalanx is borrowed from the ancient military formation used by Alexander the Great and his father Philip II of Macedon to deploy infantry soldiers on the battlefield. The soldiers in each phalanx had to trust their neighbours to protect them, and be willing to protect their neighbours too. To provide a psychological incentive, the phalanxes were organized so that friends and family members stood closely together. As a result, the strength of the phalanx depended upon the individual strength of the soldiers and the psychological and emotional bond between them. A phalanx was an autonomous fighting unit and could be ensembled with other phalanxes to form a formidable military machine.
3.4 Searching for Data-Adaptive Subsets / Phalanxes of Variables

The term phalanx is borrowed from the ancient military formation used by Alexander the Great and his father Philip II of Macedon to deploy infantry soldiers on the battlefield. The soldiers in each phalanx had to trust their neighbours to protect them, and be willing to protect their neighbours too. To provide psychological incentive, the phalanxes were organized so that friends and family members fought closely together. As a result, the strength of a phalanx depended on the individual strength of the soldiers and the psychological and emotional bond between them. A phalanx was an autonomous fighting unit and could be combined with other phalanxes to form a formidable military machine.

In our case, the variables grouped into a statistical phalanx work better together than separated into different groups. Moreover, as the variables in different phalanxes work better separated than together, the phalanxes help each other, yielding a stronger ranking procedure.

Given m test items and D covariates there exists, in principle, an optimal partition of these covariates into p + 1 groups consisting of p phalanxes plus a subset of screened-out noise variables. Optimality here is in the sense of best ranking of the given test items by some given criterion (e.g. AHR, IE, etc.). Finding this optimal partition, however, is infeasible because the number of phalanxes p and the phalanx memberships are unknown, making the total number of possible solutions exponentially large. Moreover, in most applications we do not have a test set. So we aim at the more realistic goal of finding a good phalanx partition using cross-validation and a greedy aggregation algorithm that resembles hierarchical clustering. As shown in Figure 3.2, there are five main steps:

- The input of predictor variables (x1, x2, ..., xD).
- The arrangement of the D predictors into d initial groups (this step is optional, necessary only when dealing with binary predictors).
- The d initial groups are screened down to s groups.
- The s groups are arranged into c candidate phalanxes.
- The c candidate phalanxes are screened down to p final phalanxes.

Our algorithm and its main parameters (groups, iquant, nsim, and ntree) are described below.

[Figure 3.2: Schematic presentation of the phalanx-formation algorithm: the predictor variables x1, ..., xD are grouped into initial groups g1, ..., gd, reduced to screened groups G1, ..., Gs, combined into candidate phalanxes PX1, ..., PXc, and finally reduced to the screened phalanxes PX(1), ..., PX(p) in the final army (D ≥ d ≥ s ≥ c ≥ p).]

3.4.1 Predictor Variables

The main input of our algorithm is the D predictor variables x1, x2, ..., xD. The other input is the response variable y.

3.4.2 Initial Groups

As a preliminary (optional) step of the phalanx-formation algorithm, the D original covariates may be grouped into d ≤ D initial groups g1, g2, ..., gd. This step is controlled by the group-indicator argument groups and is optional (the default takes the groups equal to the individual input covariates). However, initial grouping is convenient and recommended in the case of binary covariates for two reasons. (i) It reduces the computational burden: since we fit classifiers for all pairs of covariates, our computational complexity is quadratic in the number of covariates. (ii) It increases the ranking model's resolution: a single binary predictor has very low resolution, as it takes only the two values 0 and 1, whereas a group of k binary covariates can jointly distinguish 2^k possible situations. So forming groups of binary variables not only saves time but may also increase the overall ranking accuracy of the models.

In our application we aim at forming initial groups of covariates that are as diverse as possible. In particular, we have experimentally verified that grouping based on similar covariate names is better than random grouping. For example, there are seven features which represent the atomic distances (up to 7 bonds) between a pair of atoms or groups of atoms. For Fragment Pairs, an example of these seven features is AR_01_AR, AR_02_AR, AR_03_AR, AR_04_AR, AR_05_AR, AR_06_AR, and AR_07_AR; here, AR_02_AR represents two phenyl rings separated by two bonds. In this case each initial group of covariates corresponds to a particular pair of atoms, as sketched below.
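Purely for illustration, one way such name-based grouping might be coded is shown here; the regular expression assumes the naming convention just described (a pair label separated by a two-digit bond distance), and the function and example names are ours, not the thesis implementation.

```r
## Illustrative sketch of grouping binary descriptors by covariate name, assuming
## names of the form <ATOM>_<distance>_<ATOM> as in AR_01_AR, ..., AR_07_AR.
## Not the thesis code; it only shows the idea of name-based initial groups.
make_initial_groups <- function(var_names) {
  pair_id <- sub("_[0-9]+_", "_", var_names)   # drop the bond-distance token
  split(var_names, pair_id)                    # one initial group per atom pair
}
# Hypothetical example:
# make_initial_groups(c("AR_01_AR", "AR_02_AR", "AR_01_N", "AR_03_N"))
# $AR_AR: "AR_01_AR" "AR_02_AR"    $AR_N: "AR_01_N" "AR_03_N"
```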
3.4.3 Screening Initial Groups

We screen out weak initial groups to reduce computational burden and noise. Strong groups are those which help to rank the active compounds high, yielding larger average hit rates. To be considered strong, a group must be strong either by itself, or in a model with another group, or in an ensemble with another group.

The key tool for screening groups (and for other procedures in our algorithm) is the distribution of AHR under random ranking. To obtain this distribution we randomly permute the response variable y nsim = 1000 times and for each permutation we compute the corresponding AHR. We then take the iquant-th quantile of this distribution, which we denote AHR_iquant. The parameter iquant gives the desired level for the probability threshold (0.95 in our applications).

We now describe how we decide whether a given group, gi, is strong enough to be kept in the phalanx-formation algorithm. We perform three tests and the group survives if it passes at least one of them.

Test 1: Individual strength of gi. We first fit a random forest classifier using y and gi with ntree = 150 trees and extract the out-of-bag (OOB) vector of probabilities of activity, P̂(gi). Using these probabilities we compute AHRi = AHR(P̂(gi)). We say gi is strong by itself if

    AHRi ≥ AHR_iquant.    (3.1)

The OOB class probabilities are an attractive feature of random forests. Breiman (1996c, 2001) showed that the OOB class probabilities are as good as cross-validated class probabilities. See also Tibshirani (1996) and Wolpert and Macready (1999).

Test 2: Joint strength of gi together with another group. Now we fit two random forest classifiers with ntree = 150 trees, the first with covariates {gi, gj} and the second with covariates gj, j ≠ i. Again, we obtain the OOB probability-of-activity vectors P̂(gj) and P̂(gi, gj), and compute the corresponding average hit rates AHRj = AHR(P̂(gj)) and AHRij = AHR(P̂(gi, gj)). We say gi has strong joint predictive power together with gj if

    E(AHR_R) + AHRij − AHRj ≥ AHR_iquant,    (3.2)

where E(AHR_R) is the expected AHR under random ranking (roughly equal to a/n).

Test 3: Joint strength of gi in ensemble with another group. Now we compute AHR̄ij = AHR({P̂(gi) + P̂(gj)}/2) and say that gi has strong ensemble predictive power with gj if

    E(AHR_R) + AHR̄ij − AHRj ≥ AHR_iquant.    (3.3)

The group gi is not screened out if it is strong by itself (satisfies (3.1)), or has strong joint predictive power together with another group (satisfies (3.2) for some j ≠ i), or has strong predictive power in ensemble with another group (satisfies (3.3) for some j ≠ i). After removing weak initial groups, we update the list of preliminary groups to {G1, G2, ..., Gs} and update the group-indicator parameter groups accordingly. A sketch of these screening tests is given below.
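To make the tests concrete, the following hedged sketch uses the ahr() helper sketched earlier and the randomForest package; the permutation null and the three tests follow the description above, but the function and object names are ours, not the thesis code.

```r
## Hedged sketch of the screening step; assumes ahr() from the earlier sketch
## and y coded 0/1. Illustrative only.
library(randomForest)

oob_prob <- function(x, y, ntree = 150) {
  rf <- randomForest(x = as.data.frame(x), y = factor(y), ntree = ntree)
  rf$votes[, "1"]               # OOB proportion of votes for the active class
}

## Null threshold: iquant-th quantile of AHR under random ranking,
## obtained by permuting y for a fixed ranking score.
ahr_null_quantile <- function(prob, y, nsim = 1000, iquant = 0.95) {
  quantile(replicate(nsim, ahr(prob, sample(y))), iquant)
}

## Does group g_i pass at least one of Tests 1-3 when paired with group g_j?
passes_tests <- function(p_i, p_j, p_ij, y, thresh) {
  e_r <- mean(y == 1)                       # expected AHR under random ranking
  ahr(p_i, y) >= thresh ||                  # Test 1: individual strength
    e_r + ahr(p_ij, y) - ahr(p_j, y) >= thresh ||           # Test 2: joint model
    e_r + ahr((p_i + p_j) / 2, y) - ahr(p_j, y) >= thresh   # Test 3: ensemble
}
## A group is screened out only if passes_tests() is FALSE for every other group.
```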
3.4.4 Phalanx Formation

The inputs to this step are the strong groups G1, G2, ..., Gs that survived screening. The step resembles hierarchical clustering: at each iteration the two groups that optimize a merging criterion are merged, and the parameters groups and s are updated to reflect the merge. The merging criterion consists of two conditions: (i) minimizing the ratio AHR̄ij/AHRij, where AHRij is the average hit rate of a model using Gi and Gj together and AHR̄ij is the average hit rate of the ensemble of the two separate models; and (ii) the minimized ratio must fall below a baseline value b, with default value b = 1. We keep merging groups of variables until the minimum ratio is no longer below b, i.e., until the remaining candidate phalanxes are better ensembled than merged. The following example illustrates the hierarchical procedure.

Let s = 3 and consider the three variables G1 = WBN_GC_L_1.00, G2 = WBN_EN_H_0.50, and G3 = WBN_LP_H_1.00 from the descriptor set BN in AID348. The AHRs when we pair them together in one model are AHR12 = 0.052, AHR13 = 0.037 and AHR23 = 0.054. The AHRs when we ensemble them in pairs are AHR̄12 = 0.069, AHR̄13 = 0.050 and AHR̄23 = 0.031. The corresponding ratios AHR̄ij/AHRij are 1.312, 1.357 and 0.570, respectively. As the variables G2 and G3 give the smallest ratio, which is less than 1, we merge G2 and G3 together. We then recompute the AHR of the model putting G1 and {G2, G3} together, obtaining 0.058, and the AHR of the ensemble of G1 and {G2, G3}, obtaining 0.069. Their ratio is 1.177, which is greater than 1. Hence we stop, producing the two candidate phalanxes PX1 = {G1} and PX2 = {G2, G3}. A sketch of this greedy merging loop follows.
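The sketch below conveys the loop structure of the merge, again assuming the oob_prob() and ahr() helpers from the earlier sketches; for brevity it recomputes all pairwise quantities at each iteration, whereas Algorithm 3.1 only updates the entries involving the newly merged group.

```r
## Sketch of the greedy merging loop over screened groups; `groups` is a list of
## column-name vectors. Assumes ahr() and oob_prob() from earlier; illustrative only.
form_phalanxes <- function(groups, x, y, b = 1) {
  repeat {
    s <- length(groups)
    if (s == 1) break
    ratio <- matrix(Inf, s, s)
    for (i in 1:(s - 1)) for (j in (i + 1):s) {
      p_i  <- oob_prob(x[, groups[[i]], drop = FALSE], y)
      p_j  <- oob_prob(x[, groups[[j]], drop = FALSE], y)
      p_ij <- oob_prob(x[, c(groups[[i]], groups[[j]]), drop = FALSE], y)
      ratio[i, j] <- ahr((p_i + p_j) / 2, y) / ahr(p_ij, y)  # ensemble vs. together
    }
    if (min(ratio) >= b) break                 # remaining groups ensemble well
    idx <- which(ratio == min(ratio), arr.ind = TRUE)[1, ]
    groups[[idx[1]]] <- c(groups[[idx[1]]], groups[[idx[2]]])  # merge best pair
    groups[[idx[2]]] <- NULL
  }
  groups                                       # candidate phalanxes
}
```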
3.4.5 Screening Out Weak Phalanxes

A candidate phalanx is kept in the ensemble if it is individually strong (see (3.1)) or if it is strong in ensemble with another phalanx (see (3.3)). After this second stage of screening, the surviving phalanxes form our army of phalanxes. Notice that we no longer check whether two candidate phalanxes should be merged, since we did an exhaustive search for merging predictor groups in the previous step of the algorithm. The output of this step is the army of phalanxes PX(1), PX(2), ..., PX(p).

Algorithm 3.1 Phalanx Formation
1. Setting the arguments:
   (a) Predictors and the response: {x1, x2, ..., xD} and y.
   (b) Set initial groups, nsim = 1000, iquant = 0.95, ntree = 150.
2. Forming initial groups of variables:
   (a) g1, g2, ..., gd ← x1, x2, ..., xD (initial groups).
3. Screening initial groups of variables/predictors:
   (a) Record the ordering of the compounds. Permute the response variable y and compute AHR. Repeat the process nsim times.
   (b) Store AHR_iquant, the iquant-th quantile of the distribution of AHR under random ranking.
   (c) For each group of variables gi, i = 1, 2, ..., d, fit a random forest growing ntree trees and get the out-of-bag (OOB) probability vector P̂(gi). Store the probability vectors in the columns of a matrix, say PROB.
   (d) Using P̂(gi), compute the average hit rate AHRi = AHR(P̂(gi)) and store it in a vector, say SAHR.
   (e) For each pair of initial groups {gi, gj}, i = 1, ..., (d − 1), j = (i + 1), ..., d, get P̂(gi, gj) and {P̂(gi) + P̂(gj)}/2, and thus AHRij = AHR(P̂(gi, gj)) and AHR̄ij = AHR({P̂(gi) + P̂(gj)}/2), respectively. Store AHRij and AHR̄ij in matrices AHRSYN and AHRCOM, respectively.
   (f) Screen the ith group of variables out if max[AHRi, E(AHR_R) + AHRij − AHRj, E(AHR_R) + AHR̄ij − AHRj] < AHR_iquant for all j ≠ i = 1, ..., d.
   (g) Update 3(c), 3(d), and 3(e) by deleting the rows and columns of PROB, SAHR, AHRSYN, and AHRCOM corresponding to the screened-out groups of variables.
   (h) Supply the screened groups of predictors: G1, G2, ..., Gs.
4. Forming candidate phalanxes:
   (a) Find rmin = min_ij {AHR̄ij/AHRij}, i = 1, 2, ..., (s − 1), j = (i + 1), ..., s.
   (b) If rmin < 1, merge the pair of groups {Gi, Gj} attaining rmin and set s ← (s − 1).
   (c) Redo steps 3(c), 3(d), and 3(e) for the entries specific to the index of rmin.
   (d) Repeat 4(a) to 4(c) until rmin ≥ 1 or s = 1.
   (e) Set c ← s and supply the candidate phalanxes: PX1, PX2, ..., PXc.
5. Screening candidate phalanxes:
   (a) Screen the ith candidate phalanx out if max[AHRi, E(AHR_R) + AHR̄ij − AHRj] < AHR_iquant for all j ≠ i = 1, ..., c.
   (b) Return PX(1), PX(2), ..., PX(p): the army of phalanxes.

3.5 Computational Complexity

I now show the computational complexity of our algorithm in terms of the maximum number of random forests that need to be grown. In the worst-case scenario there is no initial grouping of the D feature variables.

Screening phase: For the screening phase, we fit a total of D(D + 1)/2 random forests. First we fit D random forests, one for each feature variable. Second, for all possible pairs of feature variables we fit D(D − 1)/2 random forests, one for each pair. Hence, in total we have D(D + 1)/2 random forests. Again, in the worst-case scenario no feature variables are screened out.

Phalanx formation phase: For the phalanx formation phase, we fit a maximum of D(D − 1)/2 random forests. First we fit one random forest to the merged pair of feature variables; then we fit D − 2 random forests by pairing each of the remaining D − 2 feature variables with the merged pair. In this iteration we therefore fit a total of D − 1 random forests; note that we do not need to grow random forests for all possible pairs. In the next iteration we fit D − 2 random forests, and so on. In the worst-case scenario we end with a single phalanx, which gives a total of D(D − 1)/2 fitted random forests.

Adding the numbers in the two phases, we obtain a maximum of D^2 random forests. Thus, the computational complexity of our algorithm is of order O(D^2). In this chapter, most of the datasets are high dimensional in terms of the number of feature variables. Hence, the computational burden is greatly reduced by forming initial groups of the binary feature variables. It is reduced further by using many processors in parallel.

3.6 Ensemble of Phalanxes

We fit p random forest classifiers using the p phalanxes and obtain probabilities of activity from them. Each of those random forests is grown using the default settings of the R package randomForest (Liaw and Wiener, 2011). The p random forest classifiers are aggregated to form the ensemble of phalanxes, denoted by EPX, by averaging the probabilities of activity across the p random forests/phalanxes. A sketch of this prediction step is given below.
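The following minimal sketch shows the aggregation in EPX, assuming `phalanxes` is a list of column-name vectors returned by the formation step; randomForest defaults are used as described in the text, and the function names are illustrative rather than the packaged implementation.

```r
## Minimal sketch of EPX: one random forest per phalanx, probabilities of
## activity averaged across phalanxes. Illustrative names only.
library(randomForest)

epx_fit <- function(x, y, phalanxes) {
  lapply(phalanxes, function(v)
    randomForest(x = x[, v, drop = FALSE], y = factor(y)))
}

epx_predict <- function(fits, phalanxes, newdata) {
  probs <- mapply(function(fit, v)
      predict(fit, newdata[, v, drop = FALSE], type = "prob")[, "1"],
    fits, phalanxes)
  rowMeans(probs)              # averaged probability of activity
}
```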
3.7 Results

We apply the algorithm of phalanx formation to the five descriptor sets of the four assays. For each descriptor set we get an army of phalanxes and ensemble them to form EPX. The results of EPX are compared with RF and RRF. The RF and RRF are constructed using the default setups of the R packages randomForest (Liaw and Wiener, 2011) and RRF (Deng, 2012).

3.7.1 Assay AID348

The algorithm of phalanx formation was applied to the five descriptor sets separately. For a particular descriptor set, the algorithm is run 3 times with 3 different random seeds. The first column of Table 3.1 identifies the descriptor set, the second column identifies the random run, and the last 5 columns correspond to the 5 main steps of our algorithm (see Figure 3.2). For example, looking at the first row in the body of the table, the descriptor set AP has a total of 367 variables arranged into 75 initial groups, of which 22 survive screening. The 22 groups are combined into 4 candidate phalanxes, of which 2 survive screening. The screened phalanxes form the army of phalanxes and are ensembled to get EPX. Rows 2 and 3 show the results for the other two runs. We observe some random variation from run to run, but we will see below that performance remains fairly stable.

Table 3.1: The number of variables, initial groups, screened groups, candidate phalanxes, and screened/army of phalanxes for the 3 runs of the algorithm on the AID348 assay.

DS   Run  Variables  Initial groups  Screened groups  Candidate phalanxes  Screened phalanxes
AP   1    367        75              22               4                    2
     2                               19               8                    5
     3                               22               8                    4
BN   1    24         24              24               8                    8
     2                               24               9                    9
     3                               24               4                    4
CAP  1    1795       455             398              13                   10
     2                               128              8                    8
     3                               352              17                   12
FP   1    570        101             24               6                    4
     2                               22               5                    4
     3                               22               5                    5
PH   1    120        21              5                1                    1
     2                               5                3                    2
     3                               5                1                    1

A large number of binary predictors are screened out for the four descriptor sets based on binary variables: AP, CAP, FP, and PH. For example, in the case of AP, between 71% and 75% of the initial groups are dropped. The ranges for the other binary descriptor sets are 12% to 72%, 76% to 79%, and 77%, respectively. All of the continuous predictors of BN are useful (none is dropped) in our 3 runs. The initial screening is most unstable for CAP, perhaps due to its large number of variables.

We repeat the balanced 10-fold cross-validation experiment 16 times to obtain 16 AHRs for EPX, RF and RRF. The average hit rates are reported in Table 3.2. The average hit rates for EPX are much larger than those of RF and RRF, three times larger in the case of CAP, the descriptor set with the largest number of variables. For the binary descriptor sets AP, CAP, FP and PH, where screening of variables is dominant, RRF outperforms RF; but for BN, where there is no filtering, RF outperforms RRF. The last two columns of Table 3.2 show that EPX consistently beats RF and RRF in all our cross-validation experiments.

Table 3.2: Average hit rate (AHR) for the ensemble of phalanxes (EPX), random forests (RF), and regularized random forests (RRF), averaged over 16 repeats of balanced 10-fold cross-validation for the AID348 assay. Larger AHR values are better. The last two columns show the number of times EPX has larger AHR than RF and RRF among the 16 repeats of cross-validation.

DS   Run  EPX    RF     RRF    EPX beats RF  EPX beats RRF
AP   1    0.182  0.063  0.081  16/16         16/16
     2    0.194                16/16         16/16
     3    0.146                16/16         16/16
BN   1    0.143  0.090  0.078  16/16         16/16
     2    0.153                16/16         16/16
     3    0.132                16/16         16/16
CAP  1    0.201  0.068  0.090  16/16         16/16
     2    0.184                16/16         16/16
     3    0.155                16/16         16/16
FP   1    0.157  0.077  0.098  16/16         16/16
     2    0.130                16/16         16/16
     3    0.157                16/16         16/16
PH   1    0.108  0.070  0.080  16/16         16/16
     2    0.108                16/16         16/16
     3    0.108                16/16         16/16

Figure 3.3 shows box plots of the 16 AHRs. The three box plots for EPX correspond to the three runs, with different random seeds, in our cross-validation experiment. Notice that, despite some run-to-run variability, EPX consistently outperforms RF and RRF. We could stabilize EPX's performance by using random forests with a larger number of trees at the phalanx-formation stage, but this change would increase the computational burden of the algorithm without necessarily improving performance.

[Figure 3.3: Box plots of average hit rates (AHR) from 16 repeats of cross-validation for random forests (RF), regularized random forests (RRF), and the 3 ensembles of phalanxes (EPX) for the five descriptor sets (AP, BN, CAP, FP, and PH) of the AID348 assay. The boxes for RF, RRF, and EPX are shown in light-grey, grey, and dark-grey.]

To further illustrate the performance of EPX, we use balanced 10-fold cross-validation number 1 to draw the hit curves (see Figure 3.4). We choose t = 300, which comprises 6% of the compounds of assay AID348. Figures 3.1 and 3.4 display the hit curves: descriptor set AP in Figure 3.1 and the other four descriptor sets in Figure 3.4.
In all cases the hit curve for EPX starts rising very quickly and dominates the other two curves at every t ∈ [1, 300].

[Figure 3.4: Hit curves for random forests (dashed line), regularized random forests (dotted line), and the army of phalanxes (solid line) for four descriptor sets of the AID348 assay: (a) Burden Numbers (AHR: RF 0.103, RRF 0.080, EPX 0.152); (b) Carhart Atom Pairs (RF 0.072, RRF 0.097, EPX 0.210); (c) Fragment Pairs (RF 0.083, RRF 0.087, EPX 0.191); (d) Pharmacophores (RF 0.071, RRF 0.092, EPX 0.106).]

Although we form the phalanxes by optimizing average hit rate, the army of phalanxes also performs very well in terms of initial enhancement (IE), as shown in Table 3.3, which reports IE averaged over the 16 balanced cross-validations for assay AID348.

Table 3.3: Initial enhancement (IE), averaged over the 16 balanced 10-fold cross-validations, for the three ensembles RF, RRF, and EPX on the five descriptor sets of the AID348 assay.

Ensemble   AP     BN     CAP    FP     PH
RF         5.19   6.62   7.16   7.07   5.53
RRF        5.80   6.25   6.83   7.28   5.78
EPX        6.27   8.80   8.20   8.40   6.44

3.7.2 Assay AID362

In this assay, we observed a smaller percentage of weak variables screened out than in the AID348 assay. The results for screening and phalanx formation are shown in Table A.1. For the binary descriptor sets AP, CAP, FP and PH, the percentages of screened-out initial groups are 25–36%, 11–14%, 5–10% and 14–24%, respectively. As before, none of the continuous variables of BN is filtered. The largest descriptor set, CAP, tends to supply many phalanxes: 10–14 screened phalanxes form the final army. A good number of phalanxes (4–6) are found for BN and FP. The descriptor sets AP and PH supply moderate (1–4) and small (1–2) numbers of phalanxes, respectively, to ensemble.

Table A.2 shows the mean AHRs, averaged over 16 repeats of the balanced 10-fold cross-validation, for the three ensembles EPX, RF, and RRF. For the largest descriptor set, CAP, the ensemble EPX shows the largest improvement over RF. The second largest improvement over RF is observed for the descriptor set FP, followed by BN and AP. For PH, our ensemble could not improve much over RF, but it did not perform worse. The ensemble RRF underperforms RF in all five descriptor sets, and by the largest margin in CAP, where our ensemble shows the largest improvement.
3.7.3 Assay AID364

As in the AID362 assay, we observed roughly similar proportions of filtering, and roughly similar room for ensembling, as represented by the number of screened phalanxes. The largest and the smallest descriptor sets tend to supply the largest and the smallest numbers of screened phalanxes, respectively, in the final army. For details, please see Table A.3.

Table A.4 shows the mean AHRs for EPX, RF and RRF. The largest improvement of EPX over RF is observed in CAP, followed by BN, AP, FP and PH. The highest and lowest improvements correspond to the largest descriptor set, CAP, and the smallest binary descriptor set, PH, respectively. The ensemble RRF underperforms RF in all five descriptor sets. The margin of underperformance of RRF relative to RF is largest for the largest descriptor set, CAP.

3.7.4 Assay AID371

For the binary descriptor sets, the algorithm filters larger proportions of initial groups than in the AID362 and AID364 assays, and smaller proportions than in the AID348 assay. None of the continuous Burden Numbers is filtered. The largest descriptor set, CAP, tends to supply the largest number (6–9) of screened phalanxes. The other descriptor sets, including PH, supply a good number (3–5) of screened phalanxes. For details, please see Table A.5.

Table A.6 shows the results for EPX, RF and RRF. The largest improvement of EPX over RF is observed in CAP, followed by AP, BN and PH, respectively. For the 3 runs in FP, the ensemble EPX slightly outperforms RF in 1 run and underperforms in 2 runs; for this descriptor set, EPX outperforms RRF in all 3 runs. RRF underperforms RF in AP, BN, CAP and FP, and outperforms it in PH.

3.8 Diversity Map

We now focus on why EPX ranks the active compounds so well. To begin with, recall the work of Breiman (2001), who pointed out that a strong ensemble needs strong constituent classifiers with low correlation among them. The stronger the classifiers, the stronger the ensemble.
The smaller the correlation between the predictions of the constituent classifiers, the better the performance of the ensemble.

To proceed, we choose an army of 8 phalanxes from one run of the phalanx-formation algorithm on BN of the AID348 assay. We examine how the 8 phalanxes and their ensemble EPX rank the 48 active compounds. We obtain probabilities of activity from a cross-validation and plot the ranks of the 48 active compounds. Figure 3.5 shows the ranks of the active compounds for each of the 8 phalanxes and for their ensemble EPX. The darker the colour, the earlier the active is found in the ranked list. We also compute the average hit rates for the 8 phalanxes and for the ensemble. On the vertical axis we order the 48 active compounds using the probabilities of activity from EPX; on the left-hand side we plot the scale of the ranks of the active compounds.

[Figure 3.5: Diversity map of the ranks of the active compounds for the 8 phalanxes from BN of the AID348 assay, with average hit rates PX-1: 0.141, PX-2: 0.066, PX-3: 0.165, PX-4: 0.082, PX-5: 0.088, PX-6: 0.106, PX-7: 0.132, PX-8: 0.068, and EPX: 0.163. The average hit rate of a random forest for this cross-validation is 0.103.]

We see different masses of colour for the active compounds across the 8 phalanxes. In other words, the phalanxes rank different sets of actives well. If one phalanx misses an active compound, other phalanxes rank it well, and eventually their ensemble ranks the active well. Such behaviour of ranking different sets of actives is called diversity across phalanxes. Having seen these results, we say our algorithm is successful in producing a diverse set of phalanxes. Diversity is related to correlation: the higher the diversity, the smaller the correlation.

In addition to being diverse, all of the phalanxes are strong, as they have passed the second screening test. Moreover, checking the AHRs of the 8 phalanxes carefully, we find that 4 of them are very strong, even stronger than a regular random forest using all 24 predictors. For this cross-validation, the random forest with all 24 predictors gives an AHR of 0.103, and phalanxes 1, 3, 6 and 7 provide larger AHRs than 0.103. This is also true for the other 15 cross-validations. We are now convinced that our algorithm produces phalanxes that constitute strong classifiers. Equipped with a strong and diverse set of classifiers, the ensemble EPX outperforms the benchmark random forests.

For this cross-validation, phalanx 3 provides a slightly larger AHR than the overall ensemble EPX. This is not the case for the other cross-validations; in general, the ensemble EPX gives a larger AHR than any of the constituent phalanxes.
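Purely as an illustration of how the ranks behind such a diversity map might be assembled (not the thesis plotting code), the following sketch assumes an n-by-p matrix of cross-validated per-phalanx probabilities.

```r
## Sketch: ranks of the active compounds under each phalanx's probabilities.
## `probs` is an n-by-p matrix of cross-validated probabilities (one column per
## phalanx); names and the image() display are illustrative only.
diversity_ranks <- function(probs, y) {
  ranks <- apply(probs, 2, function(p) rank(-p, ties.method = "min"))
  ranks[y == 1, , drop = FALSE]        # keep only the active compounds
}
# e.g. image(t(diversity_ranks(cv_probs, y)), col = grey.colors(50))
```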
3.9 Summary

In this chapter, we used four bioassays, each with five sets of molecular descriptors. As we applied our ensemble to each descriptor set, there are in effect 20 datasets. Our ensemble, in all of its 3 runs, clearly outperformed random forests and regularized random forests in almost every dataset. However, there were a few datasets where our ensemble could not outperform its competitors; we now present those situations case by case. For the three runs of EPX in PH of the AID362 assay, we observed larger AHR than RF in 14, 7 and 9 of the 16 cross-validations, respectively. In FP of the AID371 assay, EPX produced better AHR than RF 3, 4, and 12 times, respectively. Thus, in the first and second examples, EPX won over RF in 2 and 1 of the 3 runs, respectively; in those two examples EPX did not lose outright, however. Our ensemble EPX clearly won against RRF in 19 of the 20 datasets; in PH of the AID371 assay, EPX outperformed RRF in 1 run out of 3.

The highest improvement of EPX over RF was observed for the largest descriptor set, CAP. It was also CAP for which the largest margin of underperformance of RRF relative to RF was observed. The story was similar for the continuous descriptor set BN. Note that the continuous predictors of BN possess more resolution than any binary predictor. The least improvement of EPX over RF and RRF was observed for the smallest binary descriptor set, PH. Given a descriptor set, the largest and smallest improvements of EPX over RF and RRF were observed for the AID348 and AID371 assays, respectively. The smaller the proportion of actives, the larger the improvement of EPX over RF and RRF. In a nutshell, the harder the problem, the better the ensemble of phalanxes performed relative to its competitors RF and RRF.

Some instability in the results of EPX was observed from run to run. Despite this instability, the resulting ensemble consistently outperformed RF and RRF. Some stability could be gained by growing a larger number of trees (more than 150, for example) when forming the phalanxes, but this change would increase the computational complexity, and not necessarily the performance, of the ensemble of phalanxes. We plan to tackle the increased computational complexity of our method by coding the algorithm in C/C++ or in Fortran. Even so, we managed to run our method on fairly large descriptor sets and observed impressive results.

The ensemble EPX is developed by optimizing average hit rate. For good performance of EPX in terms of IE, we propose forming the phalanxes by optimizing IE instead. Moreover, the extension of this ensemble of phalanxes to classification and regression is straightforward: we need only replace AHR by misclassification error or mean squared error, respectively, to form the phalanxes and to evaluate the ensemble. Our next target is to implement this algorithm for regression and classification.

We have used subject-matter knowledge to form the initial groups of binary feature variables. Specifically, the set of feature variables belonging to a particular pair of atoms is grouped together. The goal is to form diverse initial groups: naturally, two sets of feature variables specific to two different pairs of atoms are diverse from each other. However, such diversity between the initial groups might also be obtained by examining the correlation structure of the feature variables.

On the other hand, the resulting screened phalanxes represent different characteristics of the predictor set with respect to the response variable. Such screened phalanxes may also represent multiple mechanisms of activity in a dataset. If multiple mechanisms of activity produce different activity classes, the phalanxes could be formed so that each phalanx represents one activity class, and the phalanxes would ultimately be ensembled to provide predictive ranking over the activity classes.

There exist learning algorithms that handle class imbalance in two-class classification problems through over-sampling of the minority class and under-sampling of the majority class (see Chawla, Lazarevic, Hall, and Bowyer (2003); Chen, Liaw, and Breiman (2004)). The goal of over-sampling the minority class, for example, is to reduce the increased bias of a classifier towards the majority class and thus to improve prediction accuracy for the minority class.
Our method differs from these over- and under-sampling procedures in that it does not correct any bias towards the minority/majority class. Rather, it aims to rank the minority-class items at the start of the list.

Chapter 4
Ensembling Phalanxes Across and Within Descriptor Sets

4.1 Introduction

In Chapter 2, an ensemble of descriptor sets (EDS) was developed by aggregating random forests (RF) over several sets of molecular descriptors. Specifically, the chemical compounds in a bioassay dataset are ranked by averaging probabilities of activity from random forests applied to the five sets of molecular descriptors. The top-ranked compounds are shortlisted so that most, if not all, of the rare active compounds are found in the shortlist. The developed ensemble performs better than the top-performing random forest using a single descriptor set. Moreover, for the most part, our ensemble EDS outperforms random forests applied to the pool of descriptor sets (AF). However, for assay AID348, random forests with the pool of descriptor sets outperforms our ensemble of descriptor sets.

There are five molecular descriptor sets for each of the four bioassays (see Sections 2.2 and 2.3 of Chapter 2). All five sets of descriptors are considered rich in variables as they contain a large number of predictors. In the process of developing a good ranking model, we exploit the richness of variables in the descriptor sets. That ranking model, the ensemble of phalanxes (EPX), was developed in Chapter 3. First, data-adaptive phalanxes are formed in a descriptor set by grouping predictor variables together. The variables in a phalanx are good in terms of predictive ranking when used together in one model, and the variables in different phalanxes are good in separate models. The resulting phalanxes are aggregated by growing a random forest in each and averaging the probabilities of activity across the phalanxes.

The performance of EPX is better than that of random forests (Breiman, 2001) and regularized random forests (Deng and Runger, 2012). EPX performs very well when there are many variables in a descriptor set and when the proportion of active compounds is small. The harder the problem, the better the ensemble of phalanxes performs compared to alternative procedures such as random forests.

In this chapter, the ensembles of phalanxes (EPX) are applied to the five descriptor sets, replacing random forests, to form an improved version of EDS. Specifically, the probabilities of activity obtained from applying EPX to each of the five descriptor sets are averaged across the five sets of molecular descriptors. The new ensemble is denoted by EDS-PX.

The ensemble EDS-PX provides better predictive ranking of the rare active compounds than EDS in all four bioassay datasets. Moreover, EDS-PX outperforms random forests applied to the pool of the five descriptor sets (AF). Careful investigation reveals that EDS-PX aggregates strong and diverse constituent classifiers to form a highly powerful ranking model.

4.2 Datasets and Variables

In this chapter, four bioassay datasets are used, each containing a small proportion of active compounds relative to the proportion of inactive compounds. The response variable in each assay is the compounds' activity status, where each compound is recorded as either active or inactive against a biological target. The interest is in detecting the active compounds only. The assay datasets are presented in Section 2.2 of Chapter 2.

There are five sets of predictor variables for each assay.
Each set of predictors is also known as a descriptor set. The predictors in a set are numeric variables which describe the structure or shape of the chemical compounds in a bioassay dataset. In alphabetical order, the descriptor sets are: Atom Pairs (AP), Burden Numbers (BN) (Burden, 1989; Pearlman and Smith, 1999), Carhart Atom Pairs (CAP) (Carhart et al., 1985), Fragment Pairs (FP), and Pharmacophore Fingerprints (PH). The aim is to develop a predictive ranking model which relates the activity status/toxicity/drug potency of the chemical compounds in a bioassay to all the descriptor sets. The descriptor sets are presented in Section 2.3 of Chapter 2.

4.3 Ensemble of Descriptor Sets using Data-Adaptive Phalanxes

First, let me explain the process of aggregating the descriptor sets. Let ns be the number of molecular descriptor sets in an assay. We build ns classifiers using the ns sets of descriptors and estimate probabilities of activity using each of the classifiers. The aggregation is then performed by averaging the probabilities across the classifiers/sets of descriptors. The averaged probabilities of activity are used to rank the compounds: the compound with the largest averaged probability of activity is ranked first, followed by the compound with the second largest averaged probability, and so on.

In Chapter 2, RF is applied to the five sets of descriptors: Atom Pairs (AP), Burden Numbers (BN), Carhart Atom Pairs (CAP), Fragment Pairs (FP), and Pharmacophores (PH). When the probabilities of activity from random forests are averaged across AP, BN, CAP, FP and PH, the resulting ensemble is called the ensemble of descriptor sets, or simply EDS.

In this chapter, the ensemble of phalanxes is applied to each of the five descriptor sets. The probabilities of activity from the five ensembles of phalanxes are averaged to form another ensemble, which we call the "ensemble of descriptor sets from data-adaptive phalanxes", or simply EDS-PX. We will show that EDS-PX is an improved version of EDS. A sketch of this two-level aggregation is given below.
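As a minimal sketch, EDS-PX can be expressed as a second level of averaging over the per-descriptor-set EPX probabilities; this assumes the epx_fit()/epx_predict() helpers sketched in Chapter 3, and the list names below are illustrative.

```r
## Sketch of EDS-PX: average EPX probabilities of activity across descriptor sets.
## `descriptor_sets`, `phalanx_lists`, and `new_sets` are illustrative names for
## lists indexed by descriptor set; assumes epx_fit()/epx_predict() from Chapter 3.
eds_px <- function(descriptor_sets, phalanx_lists, y, new_sets) {
  probs <- mapply(function(x, px, newx) {
    fits <- epx_fit(x, y, px)
    epx_predict(fits, px, newx)
  }, descriptor_sets, phalanx_lists, new_sets)
  rowMeans(probs)            # averaged over the ns descriptor sets
}
```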
4.4 Evaluation of Classifiers

A classifier is evaluated on how well it positions the rare active compounds at the beginning of a ranked list of compounds. The compounds in a bioassay are ranked using the estimated probabilities of activity from a classifier. Since the activity statuses of all compounds in the four bioassays are known, we use cross-validation to mimic the training and test parts. Specifically, the probabilities of activity for the compounds in a bioassay are obtained using balanced 10-fold cross-validation. Ten random groups (folds) of the compounds, each with approximately equal numbers of active and inactive compounds, are formed first. Nine such groups are used for training and the remaining group is left out for testing. Repeating the training and testing procedure ten times, each time leaving one group out for testing, provides the probabilities of activity for all of the compounds in a dataset. The details are presented in Section 2.5.1 of Chapter 2.

Having estimated the probabilities of activity, the ranking performance of a classifier is evaluated using a hit curve, which enables visual inspection of the performance: a classifier with a high hit curve is preferred. The definition of a hit curve is given in Subsection 2.5.2 of Chapter 2.

The comparison of many classifiers using hit curves is not well defined if the curves cross each other, and such comparison becomes very hard if the classifiers are to be evaluated repeatedly. To automate the comparison, we summarize a hit curve by a single number. The metric average hit rate (AHR) summarizes a hit curve by a single number and facilitates comparison. AHR varies from 0 to 1, where larger values imply better ranking of the rare active compounds. The definition of AHR is presented in Subsection 2.5.4 of Chapter 2.

Initial enhancement (IE) is also a single-number summary of a hit curve, evaluated at a particular cut-off point. An IE of 1 implies performance similar to random ranking, and larger values of IE imply better predictive ranking. The definition of IE is presented in Subsection 2.5.3 of Chapter 2.

4.5 Results

This section contains the results of applying three ensembles to the four bioassay datasets. The three ensembles are: (1) random forests applied to the pool of the five descriptor sets (AF), (2) the ensemble of descriptor sets using random forests (EDS) as in Chapter 2, and (3) the ensemble of descriptor sets using data-adaptive phalanxes (EDS-PX). The results for EDS-PX are obtained from run 1 of EPX presented in Chapter 3.

[Figure 4.1: (a) Hit curves comparing the three ensembles AF, EDS and EDS-PX for the AID348 assay dataset, with AHR/IE of 0.209/10.304 for AF, 0.137/8.587 for EDS, and 0.286/12.365 for EDS-PX. (b) Initial enhancement (IE) versus average hit rate (AHR) for AF, EDS and EDS-PX. The results are obtained from the estimated probabilities of activity using a balanced 10-fold cross-validation.]

4.5.1 Assay AID348

I start with the results of the AID348 assay for two reasons: in this assay (i) EDS appeared weak compared to AF (see Section 2.6.4 of Chapter 2), and (ii) EDS-PX improves the performance of EDS the most. Panel (a) of Figure 4.1 shows the three hit curves corresponding to the three ensembles AF, EDS and EDS-PX. To plot the hit curves I used the probabilities of activity from a balanced 10-fold cross-validation of the compounds of the AID348 assay. As this assay contains only 48 active compounds, I computed IE at 300 shortlisted compounds, following Hughes-Oliver et al. (2011). The top-performing ensemble in terms of a high hit curve is EDS-PX, which clearly dominates the other two ensembles, AF and EDS.

For a straightforward comparison of the methods, panel (b) of Figure 4.1 plots IE against AHR for the same balanced 10-fold cross-validation of the AID348 dataset. As the goal is to maximize both AHR and IE, we want our method in the top right corner of this plot. Indeed, EDS-PX is found in the top right corner, followed by AF and EDS, respectively.

Now we show that the performance of our ensemble is consistent over many cross-validations. To compare average performances of the ensembles, we repeated the balanced 10-fold cross-validation a total of 16 times: 16 processors are available in our department's computing network for parallel computation, and one processor is used for each cross-validation. The bivariate mean vectors (AHR, IE) corresponding to the three ensembles are compared, and the 95% confidence region for each bivariate mean vector is constructed under the assumptions of bivariate normality and homogeneity of covariance.
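As a hedged illustration of how such a bivariate confidence region can be computed from the 16 (AHR, IE) pairs, here is one standard construction based on Hotelling's T-squared statistic; the thesis additionally assumes homogeneity of covariance across the ensembles (a pooled covariance), which this simplified sketch does not use.

```r
## Sketch: 95% confidence ellipse for a bivariate mean from n (AHR, IE) pairs,
## based on Hotelling's T^2 under bivariate normality. `scores` is an n-by-2
## matrix; uses that ensemble's own sample covariance rather than the pooled
## covariance assumed in the thesis.
mean_ellipse <- function(scores, level = 0.95, npts = 100) {
  n <- nrow(scores); p <- ncol(scores)
  m <- colMeans(scores); S <- cov(scores)
  r2 <- p * (n - 1) / (n * (n - p)) * qf(level, p, n - p)
  e  <- eigen(S)
  theta <- seq(0, 2 * pi, length.out = npts)
  circ  <- cbind(cos(theta), sin(theta)) %*%
           diag(sqrt(r2 * e$values)) %*% t(e$vectors)
  sweep(circ, 2, m, "+")       # boundary points of the ellipse in (AHR, IE) space
}
```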
Resultsvariate mean vector is constructed employing the bivariate normality and homogeneity ofcovariance assumptions. Panel (a) of Figure 4.2 shows the plots of mean AHR versus meanIE with 95% confidence bands for AF, EDS and EDS-PX applied to the AID348 assay. Interms of both the mean AHR and mean IE, the top performing ensemble is EDS-PX followedby AF and EDS. The performances are significantly different from one to another.Next we will compare the performances of the ensembles at the cross-validation level. Ina particular cross-validation, we fitted AF, EDS and EDS-PX and computed their AHRs andIEs. The process is repeated 16 times, enabling us to perform pairwise comparison betweentwo methods. Column 3 of Table 4.1 shows the mean AHRs and mean IEs and the winningrates of one ensemble over another for Assay AID348. In terms of AHR and IE, AF beatsEDS for 16 out of 16 cross-validations. In terms of both AHR and IE, EDS-PX beats bothAF and EDS a total of 16 out of 16 times.4.5.2 Assay AID362EDS-PX again improves the performance of EDS for both AHR and IE. The improvement isparticularly large for AHR. Panel (b) of Figure 4.2 shows the plots of mean IE versus meanAHR for the three ensembles applied to AID362. The ensemble EDS-PX is found in the topright corner followed by EDS and AF. Hence, in this assay the top performing ensemble isEDS-PX.Let us compare the performances of the ensembles at the cross-validation level. Column4 of Table 4.1 shows the mean AHRs and mean IEs and the winning rates of one ensembleover another. In terms of AHR, EDS wins over AF a total of 15 out of 16 cross-validations,and EDS-PX beats both EDS and AF a total of 16 out of 16 cross-validations. In termsof IE, which is calculated at 300 shortlisted compounds, EDS beats AF a total of 14 times,and EDS-PX wins over AF and EDS a total of 16 and 14 times, respectively, out of 16cross-validations.4.5.3 Assay AID364The ensemble EDS-PX shows large improvement over EDS in terms of both AHR and IE. Inthis assay the IE is computed at 300 shortlisted compounds. Panel (c) of Figure 4.2 showsthe plots of mean IE versus mean AHR for the three ensembles. Following the diagonal linefrom the top right corner the ensembles are sequenced as EDS-PX, EDS and AF. Thus thetop performer is EDS-PX, and its performance is much larger than the performances of EDSand AF. In column 5 of Table 4.1, we see that EDS-PX beats EDS and AF a total of 16 timesout of 16 cross-validations using both evaluation metrics AHR and IE.554.5. Results0.15 0.20 0.2589101112AFEDSEDS?PXAverage Hit RatesInitial Enhancement(a) Assay AID3480.30 0.32 0.34 0.36 0.389.29.49.69.810.010.2AFEDSEDS?PXAverage Hit RatesInitial Enhancement(b) Assay AID3620.34 0.36 0.38 0.40 0.426.06.57.07.5AFEDSEDS?PXAverage Hit RatesInitial Enhancement(c) Assay AID3640.38 0.39 0.40 0.41 0.42 0.433.43.53.63.7AFEDSEDS?PXAverage Hit RatesInitial Enhancement(d) Assay AID371Figure 4.2: Plots of mean initial enhancement (IE) versus mean average hit rate (AHR) with95% confidence regions for four assay datasets: AID348, AID362, AID364, and AID371.564.6. Diversity Map4.5.4 Assay AID371There are 278 active compounds in this assay and the IE is computed at 600 shortlistedcompounds. The performance of EDS-PX is far better than the performances of EDS andAF, see panel (d) of Figure 4.2. Column 6 of Table 4.1 shows that the ensemble EDS-PXbeats EDS and AF a total of 16 times out of 16 cross-validations.Table 4.1: Mean AHR and IE for AF, EDS, and EDS-PX for the four assay datasets. 
Table 4.1: Mean AHR and IE for AF, EDS, and EDS-PX for the four assay datasets. Winning rates of EDS over AF, and of EDS-PX over AF and EDS, are also given for AHR and IE.

                          AID348   AID362   AID364   AID371
Mean AHR
  AF                      0.179    0.305    0.357    0.389
  EDS                     0.128    0.328    0.367    0.396
  EDS-PX                  0.257    0.368    0.409    0.421
Larger AHR
  EDS versus AF           0/16     15/16    14/16    15/16
  EDS-PX versus AF        16/16    16/16    16/16    16/16
  EDS-PX versus EDS       16/16    16/16    16/16    16/16
Mean IE
  AF                      9.630    9.478    5.878    3.381
  EDS                     8.388    9.744    6.479    3.444
  EDS-PX                  11.989   9.962    7.257    3.672
Larger IE
  EDS versus AF           0/16     14/16    16/16    12/16
  EDS-PX versus AF        16/16    16/16    16/16    16/16
  EDS-PX versus EDS       16/16    14/16    16/16    16/16

It was shown in Section 2.6 of Chapter 2 that the aggregate of random forests over the five descriptor sets (EDS) outperformed the top-performing random forest applied to any single set of descriptors. Moreover, in three of the four bioassays, EDS outperformed the random forest applied to the pool of descriptor sets (AF). In this chapter, we have shown that the aggregated ensemble of phalanxes over the five descriptor sets (EDS-PX) not only outperforms AF but also improves over EDS. To examine the strengths and diversity of the constituent classifiers of EDS and EDS-PX, we plot a diversity map next in Section 4.6.

4.6 Diversity Map

In this section, we try to better understand why EDS-PX outperforms EDS. To that end, the ranks of the 48 active compounds in the AID348 assay are plotted for EDS and EDS-PX (left and right panels of Figure 4.3, respectively). The ranks are obtained from the probabilities of activity corresponding to cross-validation number 1. The ranks of the active compounds are also plotted for the constituent random forests of EDS (AP, BN, CAP, FP and PH) and for the constituent ensembles of phalanxes of EDS-PX (AP-PX, BN-PX, CAP-PX, FP-PX and PH-PX). If an active compound is ranked first by a classifier, the compound receives the darkest grey colour; if a compound is ranked far down the sequence, it receives no colour. The colour key on the left of each panel shows where in the sequence an active compound is ranked by a classifier. In the left and right panels, the active compounds are ordered using the probabilities of activity from EDS and EDS-PX, respectively. We report the AHRs and IEs of all the constituent classifiers and their ensembles to show how strong the classifiers are.

[Figure 4.3: Diversity maps of EDS (left panel) and EDS-PX (right panel) for the ranks of the active compounds in the AID348 assay, using the probabilities of activity from cross-validation number 1. The AHR/IE values are: AP 0.062/5.84, BN 0.103/7.15, CAP 0.072/7.21, FP 0.083/7.56, PH 0.071/5.68, EDS 0.137/8.59; and AP-PX 0.199/6.26, BN-PX 0.152/9.27, CAP-PX 0.21/8.59, FP-PX 0.191/8.59, PH-PX 0.106/6.24, EDS-PX 0.286/12.37.]

We see variation in ranks across the constituent classifiers of EDS (AP, BN, CAP, FP and PH); that is, the five random forests rank different sets of active compounds well. The story is similar among the constituent classifiers of EDS-PX. However, the constituents of EDS-PX show more diversity, reflected by the contrast in colours among the five ensembles of phalanxes, than the constituents of EDS. Moreover, the constituents of EDS-PX are much stronger than the constituents of EDS.
Hence, equipped with stronger and more diverse sets of constituent classifiers, EDS-PX outperforms EDS.

4.7 Summary

In Chapter 2, we developed an ensemble that averages probabilities of activity from random forests applied to the five molecular descriptor sets in order to rank the rare active compounds ahead of the majority inactive compounds in four bioassay datasets. That ensemble of descriptor sets (EDS) provides better predictive ranking of the rare active compounds than the most accurate random forest applied to any single set of molecular descriptors. For the most part, in 3 out of the 4 bioassays, the ensemble of descriptor sets also outperforms random forests applied to the pool of the five descriptor sets. It is the diversity and strength of the random forests across the descriptor sets that make EDS a good ranking model.

We define phalanxes as groups of predictors, where the variables in a phalanx are good at predicting the rare actives when used together in one model, and the variables in separate phalanxes are good in separate models. Having defined the term phalanx, we may label the sets of molecular descriptors as natural phalanxes. We then asked whether there are data-adaptive phalanxes within variable-rich descriptor sets. This question motivated the algorithm (Algorithm 3.1 in Chapter 3) that uncovers data-adaptive phalanxes in a descriptor set. The ensemble of phalanxes (EPX) is then built by fitting a random forest to each phalanx and averaging the probabilities of activity across the phalanxes. Equipped with a strong and diverse set of phalanxes, EPX outperforms a regular random forest applied to a descriptor set.

In this chapter, we have aggregated the ensembles of phalanxes over the five descriptor sets and compared the resulting EDS-PX with the ensemble of descriptor sets (EDS) and with random forests applied to the pool of descriptor sets (AF). In all four bioassays, the ensemble EDS-PX outperforms EDS and AF by a large margin. Our method EDS-PX aggregates strong constituent classifiers into a powerful committee, capitalizing on the data-adaptive diversity within descriptor sets and the natural diversity across them.

The algorithm of phalanx formation could also be applied to the pool of the five descriptor sets in each bioassay; the five sets of initial groups obtained from the five descriptor sets could be combined into a single set of initial groups to run the algorithm. I did not attempt this because of the large computational burden of our algorithm on such a unified set of predictors. However, it is possible to run a group-lasso-like algorithm (Meier et al., 2008; Yuan and Lin, 2007) on the unified initial groups to obtain a set of screened initial groups through regularization. If the number of screened initial groups drops to a manageable figure, then the algorithm of phalanx formation might be applied to the set of screened initial groups.

Chapter 5
Protein Homology: Ensembling via Logistic Regression Models

5.1 Introduction

Ensembles are applied to the protein homology dataset. This dataset was briefly introduced in Section 1.6 and is described in more detail in Section 5.2. The ensembles are constructed on the training set by predicting the response variable, homology status, using a set of predictor variables. In addition to the training set, this application comes with a test set for which we do not know the status of the response variable.
But the performance of anensemble can be evaluated by submitting intermediate results to the KDD Cup website whichgives us a chance for honest comparison of the competing ensembles. As before, the goal isthe detection of the rare class, more specifically, ranking rare homologs of a native proteinahead of the non-homologous candidate proteins.The original ensemble of phalanxes is developed in Chapter 3 which requires repeated fitsof random forests. But random forests itself is fairly computationally demanding and therepeated fits of random forests, as in the ensemble of phalanxes, increases the computationaltime substantially. In this chapter, we incorporate the popular and computationally lessdemanding logistic regression model to form the phalanxes and to build the ensemble. Theaims are: (i) improving the performance of the ensemble of phalanxes in terms of predictiveranking, and (ii) reducing computational burden of the algorithm. Moreover, we want toshow that the ensemble of phalanxes is easily adaptable to a base learner other than randomforests.As noted in Section 1.3.2, random forests (Breiman, 2001) possesses inherent variableselection properties and is insensitive to the inclusion of a few noise variables. Unlike randomforests, the performance of a logistic regression model containing some good variables ishampered by the inclusion of noise variables. Thus, in order to avoid mixing up noise variableswith good variables, we introduce an intermediate step of filtering noise variables duringphalanx formation using logistic regression. Moreover, we use cross-validation to form thephalanxes instead of out-of-bag samples. The modified algorithm of phalanx formation willbe discussed in Section 5.4.The training set of the protein homology data contains a total of 145, 751 rows, eachrepresenting a candidate protein, and 74 continuous predictors. The formation of phalanxes605.2. Protein Homology Datawith repeated fits of random forests to such a huge training set appeared computationally veryexpensive. For example, the parallelization of the processors in our department?s computingnetworks is found ineffective to handle this massive computation. Those computational issuesare tackled by parallel execution of the algorithm of phalanx formation using many computingnodes in a cluster of processors of the WestGrid (Western Canada Research Grid) computingnetwork. Moreover, we reduce the overhead data-transfer from the remote to local nodesby sending an appropriate chunk of data used at each node through careful coding of ouralgorithm.Ensembles of phalanxes using random forests and logistic regression ? denoted by EPX(RF)and EPX(LR), respectively ? and regular random forests (RF) are applied to this protein ho-mology dataset. This dataset comes with many blocks where each block belongs to a nativeprotein to which the homology status is tested for many candidate proteins. This block struc-ture makes this dataset special. The ensembles are learned using the training blocks and areevaluated on the test blocks. The performance of an ensemble is tested in each test block, andthe overall performance is obtained by averaging over all of the test blocks. Thus, in orderto provide good predictive ranking an ensemble needs to perform well in as many blocks aspossible. The proposed ensembles of phalanxes, EPX(RF) and EPX(LR), are found betterthan RF in terms of predictive ranking. 
5.2 Protein Homology Data

The protein homology dataset is downloaded from the 2004 KDD Cup competition's website (http://osmot.cs.cornell.edu/kddcup/datasets.html). The structure of this dataset is presented in Table 5.1. Each block relates to one native protein. Each row (or line) within a block describes a candidate protein which is tested for homology (y = 0/1 for no/yes) to the block's native protein. The first element of each line is a BLOCK ID that denotes to which native protein the line belongs; there is a unique BLOCK ID for each native protein sequence. For example, blocks 279 and 48 in Table 5.1 represent the first and last native proteins in the training set, respectively.

Many candidate proteins were tested for homology to a native protein. The second element of each line uniquely identifies the corresponding candidate protein; this variable is called EXAMPLE ID. For example, the candidate proteins with EXAMPLE IDs from 261532 to 262336 belong to the first training block, 279.

The third element of a line is a realization of the response variable, y, a class variable known as homology status. Candidate proteins that are homologous to the native protein are denoted by 1, and the non-homologous proteins (i.e., decoys) by 0. The test set (lower part of Table 5.1) provides "?" in this position.

The following elements in a line are realizations of the feature variables x1, x2, ..., xD.

Table 5.1: The structure of the protein homology dataset. The top and bottom portions are extracted from the training and test sets, respectively.

BLOCK  EXAMPLE   y   x1     x2     x3     ...  x72    x73   x74
279    261532    0   52.00  32.69  0.30   ...  -0.35  0.26  0.76
279    261533    0   58.00  33.33  0.00   ...   1.16  0.39  0.73
...
279    262336    0   46.00  27.08  1.72   ...  -0.75  0.55  0.69
...
48     43884     0   87.96  25.26  -0.94  ...   1.61  0.29  0.08
48     43885     0   48.61  25.47  -0.50  ...   0.67  0.08  0.09
...
48     44825     1   87.50  29.33  5.84   ...  -0.58  0.16  0.23
153    141691    ?   71.84  23.17  -0.57  ...  -0.58  0.24  0.29
153    141692    ?   83.50  26.09  1.50   ...   0.50  0.11  0.05
...
153    142501    ?   50.49  24.56  0.84   ...  -0.40  0.34  0.08
...
140    129681    ?   82.80  31.58  2.62   ...   2.64  0.14  0.34
140    129682    ?   67.52  25.00  1.76   ...  -0.06  0.20  0.52
...
140    130656    ?   23.57  27.03  -1.49  ...   1.05  0.00  -0.30
The homology status for the training set is known to us, and is unknown for the test set, i.e., undisclosed by the KDD Cup organizers.

The training and test sets contain 145,751 and 139,658 candidate proteins, respectively. The large number of candidate proteins in each set brings computational challenges for our method.

Table 5.2 shows the block sizes for the training and test sets. The 1st quartile, median, 3rd quartile, and the maximum of the block sizes in both the training and test sets are fairly similar. The minimum block sizes for the training and test sets are 612 and 251, respectively. The test set possesses 3 blocks with sizes 251, 256 and 372, and has a more left-skewed distribution; please see the histograms of block sizes in Figure 5.1.

Table 5.2: Block sizes in the training and test sets of the protein homology datasets.

    Dataset    Min   1st Qrt   Median   Mean    3rd Qrt   Max
    Training   612   859       962      952.6   1048      1244
    Test       251   847       954      931.1   1034      1232

[Figure 5.1: Histogram of block sizes in the training and test sets of the protein homology datasets; panel (a) training, panel (b) test.]

Most of the blocks in the training set contain very few homologous proteins. The minimum, median and maximum of the proportions of homologous proteins are 0.0008, 0.0047 and 0.0581, respectively. Figure 5.2 shows the histogram of the proportions in the training set. In this figure, the 3rd quartile is marked by the vertical bold line. We see that more than 75% of the blocks contain at most 2 homologs per 100 candidate proteins. Thus, in order to do well, the proposed ensemble of phalanxes needs to ensure good ranking of the homologous proteins particularly when the block-wise proportion of homologous proteins is small.

[Figure 5.2: Histogram of the proportions of homologous protein in the training set; the vertical bold line marks the 3rd quartile.]

There are 74 predictor variables, and some of them look very useful as a single variable while some do not. Figure 5.3 (a) shows the kernel density plots of the feature variable x63 for the homologous proteins (the solid line) and for the non-homologous proteins (the dashed line) in the training set. This variable seems to differentiate the homologous and non-homologous proteins fairly well. However, there is also evidence of less-informative feature variables in the training set; please see the density plots of the feature variable x47 in Figure 5.3 (b). Such a less-informative variable is not thrown out instantly, however. We further check whether the variable possesses good marginal prediction power when used with any other predictors in the dataset. If the variable appears useful, it is used in forming the phalanxes.

[Figure 5.3: Kernel density plots of the feature variables x63 (panel a) and x47 (panel b) for the homologous and non-homologous proteins in the training set.]

5.3 Evaluation Metrics

We have used three metrics, specific to ranking rare homologous proteins, to evaluate the predictive performances of the classifiers. Since the protein homology data come in blocks, each of the three metrics is computed on each block independently. For each metric, the average performance across the blocks is used as the final metric. Thus, in order to do well, a classifier has to rank the rare homologous proteins well across many blocks. By ranking we mean sequencing (or sorting) the candidate proteins in each block using the probability of being homologous to the native protein. The candidate protein with the largest probability of being homologous in a block is ranked first, followed by the second largest probability and so on. The three metrics are as specified by the 2004 KDD Cup competition.
For further reading please see the following link: http://osmot.cs.cornell.edu/kddcup/metrics.html.

5.3.1 Rank Last

By rank last (RKL) we mean the rank of the last homologous protein. This metric measures how far the last true homolog falls among the sorted cases. Ties are treated conservatively: if multiple sequences tie, the last element of the tie determines the rank, so ties are not beneficial. An RKL of 1 means that the last true homolog is sorted in the top position. This is excellent but can only happen in a block containing only one homologous protein. The maximum possible value of RKL is the size of the block, i.e., the number of candidate proteins. The goal is to keep the RKL as small as possible.

5.3.2 Average Precision

The metric average precision (APR) is a variant of the metric average hit rate (AHR) we defined in Section 2.5.4. They are no different when there are no ties in the score (i.e., probability of being homologous) used for ranking. In the case of ties, the former uses a non-standard definition, sometimes called "expected precision." This definition handles ties by calculating the average precision of all possible orderings of the cases that are tied. An average precision of 1 indicates perfect ranking. The lowest possible average precision depends on the data, and our goal is to maximize this metric.

5.3.3 TOP1

The candidate proteins in each block are sorted by the estimated probabilities of being homologous to the native protein. If the top-ranked candidate protein is homologous to the native protein, TOP1 scores 1, otherwise 0. TOP1 is calculated conservatively when there are ties: if multiple sequences are tied for rank 1, all of them must be homologous to score a 1 for TOP1, and if any of the sequences tied for rank 1 is non-homologous, TOP1 scores 0. This means it is never beneficial to have ties. However, it would be easy (difficult) to maximize TOP1 if a block contains many (a few) homologous proteins. We want to maximize the TOP1.
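To make the per-block evaluation concrete, here is a minimal R sketch of the three metrics. This is not the official KDD Cup scoring code (which handles ties by expected precision); it assumes there are no ties in the predicted probabilities and that every block contains at least one homolog, and the function names are ours.

    ## 'prob' and 'y' are the predicted probabilities and 0/1 homology labels for one block.
    rkl <- function(y, prob) {
      ord <- order(prob, decreasing = TRUE)    # rank candidates by probability
      max(which(y[ord] == 1))                  # position of the last true homolog
    }
    apr <- function(y, prob) {
      ord  <- order(prob, decreasing = TRUE)
      hits <- which(y[ord] == 1)               # ranks at which homologs are found
      mean(seq_along(hits) / hits)             # average precision over the homologs
    }
    top1 <- function(y, prob) {
      as.numeric(y[which.max(prob)] == 1)      # 1 if the top-ranked candidate is a homolog
    }

    ## Block-averaged version of any per-block metric; 'block' holds each candidate's BLOCK ID.
    mean_block <- function(metric, y, prob, block) {
      mean(sapply(split(seq_along(y), block),
                  function(idx) metric(y[idx], prob[idx])))
    }

In the thesis itself, the original C implementations of APR, RKL and TOP1 downloaded from the 2004 KDD Cup website are used, as noted in Section 6.2.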
5.4 The Modified Algorithm of Phalanx Formation

Algorithm 5.1 shows the steps of phalanx formation using logistic regression. Figure 3.2 in Chapter 3 shows the schematic of the original algorithm (Algorithm 3.1). The differences between the original and the modified algorithms are presented below:

- The modified algorithm incorporates logistic regression as the base classifier. The original algorithm used random forests.

- The modified algorithm uses 10-fold cross-validation of the training blocks in phalanx formation to evaluate a predictor (or a set of predictors). Thus, grouping is at the level of blocks, with all the proteins of a block belonging to just one fold. On the other hand, the original algorithm uses out-of-bag (OOB) samples in random forests to evaluate a predictor (or group of predictors) during phalanx formation.

- The original Algorithm 3.1 used average hit rate (AHR) to optimize the phalanxes. The modified Algorithm 5.1 uses mean block average precision (MAPR) to optimize the phalanxes.

- In step 4, both the original and adapted algorithms start by searching for the most promising pair of screened predictors to merge together. Unlike the original, the modified algorithm introduces a new step 4(b)i. The reasons for adding this step are explained in the third paragraph of Section 5.1. In this added step, the new algorithm checks whether the weak performer of the merging pair of predictors degrades the performance of the strong performer. If true, the weak performing predictor (or group of predictors) is filtered out. Otherwise, the pair of predictors (or groups of predictors) is merged together.

In this chapter, we have also formed the phalanxes using random forests. For a fair comparison between the resulting ensembles of phalanxes, all of the steps in this adapted algorithm, except for 4(b)i, are incorporated while forming phalanxes using random forests. The reasons for not incorporating 4(b)i are simple: (i) the performance of random forests is insensitive to the inclusion of a few weak variables, and (ii) we want to use as many useful variables as possible in the phalanxes.

Algorithm 5.1: Modified Algorithm of Phalanx Formation

1. Setting the arguments:
   (a) Predictors and the response: {x1, x2, ..., xD} and y.
   (b) Set initial groups, nsim = 1000, iquant = 0.95.
   (c) Define a leave-10%-of-blocks-out cross-validation.

2. Forming initial groups of variables:
   (a) g1, g2, ..., gd <- x1, x2, ..., xD (initial groups).

3. Screening initial groups of variables/predictors:
   (a) Record the ordering of the candidate proteins. Permute the response variable y and compute the mean block APR, i.e., MAPR. Repeat the process nsim times.
   (b) Store MAPR_iquant, the iquant-th quantile of the distribution of MAPR under random ranking.
   (c) For each group of variables g_i, i = 1, 2, ..., d, fit a logistic regression model and get the cross-validated probability vector P̂(g_i). Store the probability vectors in the columns of a matrix, say PROB.
   (d) Using P̂(g_i), compute MAPR_i = MAPR(P̂(g_i)) and store it in a vector, say SAPR.
   (e) For each pair of initial groups {g_i, g_j}, i = 1, ..., (d-1), j = (i+1), ..., d, get P̂(g_i, g_j) and {P̂(g_i) + P̂(g_j)}/2, and thus MAPR_ij = MAPR(P̂(g_i, g_j)) and MAPRbar_ij = MAPR({P̂(g_i) + P̂(g_j)}/2), respectively. Store MAPR_ij and MAPRbar_ij in matrices APRSYN and APRCOM, respectively.
   (f) Screen the ith group of variables out if max[MAPR_i, E(MAPR_R) + MAPR_ij - MAPR_j, E(MAPR_R) + MAPRbar_ij - MAPR_j] < MAPR_iquant for all j != i, j = 1, ..., d.
   (g) Update 3(c), 3(d), and 3(e) by deleting the rows and columns of PROB, SAPR, APRSYN, and APRCOM corresponding to the screened-out groups of variables.
   (h) Supply the screened groups of predictors: G1, G2, ..., Gs.

4. Forming candidate phalanxes:
   (a) Find rmin = min_ij {MAPR_ij / MAPRbar_ij}, i = 1, 2, ..., (s-1), j = (i+1), ..., s.
   (b) If rmin < 1, then:
       i. If MAPR_ij <= max(MAPR_i, MAPR_j), screen out the ith group of variables {G_i} if MAPR_i <= MAPR_j; else screen out the jth group. Update 3(c), 3(d), and 3(e) by deleting the rows and columns of PROB, SAPR, APRSYN, and APRCOM corresponding to the screened-out group of variables. Set s <- (s-1) and go to 4(a).
   (c) Block together the groups of variables {G_i, G_j} attaining rmin, and set s <- (s-1).
   (d) Redo steps 3(c), 3(d), and 3(e) specific to the index of rmin.
   (e) Repeat 4(a) to 4(d) until rmin >= 1 or s = 1.
   (f) Set c <- s and supply the candidate phalanxes: PX1, PX2, ..., PXc.

5. Screening candidate phalanxes:
   (a) Screen the ith candidate phalanx out if max[MAPR_i, E(MAPR_R) + MAPRbar_ij - MAPR_j] < MAPR_iquant for all j != i, j = 1, ..., c.
   (b) Return PX(1), PX(2), ..., PX(p): the army of phalanxes.

5.5 Ensemble of Phalanxes

After forming the phalanxes, p (>= 1) classifiers are fitted to obtain p vectors of probabilities of homology. The classifiers are aggregated by averaging the p vectors of probabilities of homology across the phalanxes. If the algorithm ends up with only one phalanx, there is no scope to ensemble over many phalanxes; in this case, a single classifier is fitted to that phalanx of screened predictors. When a logistic regression model is used in phalanx formation, the resulting ensemble of phalanxes is denoted by EPX(LR). For random forests, the resulting ensemble is denoted by EPX(RF).

5.6 Results

This section shows the results after applying the ensembles of phalanxes, EPX(RF) and EPX(LR), and random forests (RF) to the protein homology data. First, we grouped the training blocks into four quarters using the per-block proportions of homologous protein and evaluated the performances of the ensembles in each of the four groups. The goal is to find the situations where the ensembles of phalanxes perform better than random forests. The results are presented next in Subsection 5.6.1. Second, the ensembles are evaluated on the test set. The results are presented in Subsection 5.6.2.

5.6.1 Grouping the Training Blocks

Groups are formed stratified by proportions of homologous proteins. The 153 training blocks are partitioned into four groups using the per-block proportions of homologous protein. Specifically, the four groups are the four quarters of the 153 training blocks. For example, the first group contains 39 native protein sequences (i.e., blocks) having per-block proportions of homologous proteins less than or equal to the first quartile (0.00143). The second group contains 38 native protein sequences with per-block proportions of homologous proteins greater than the first quartile and less than or equal to the second quartile (0.0047). The third and fourth groups are defined in the same way, each containing 38 native protein sequences.

Panel (a) of Figure 5.4 shows four histograms of the proportions of homologs in the four groups. Group 1 contains the blocks with the smallest proportions of homologous proteins: there is only 1 homologous protein in each block. In contrast, Group 4 contains the blocks with the largest proportions of homologous proteins, i.e., the number of per-block homologous proteins ranges from 11 to 50. I conjecture that the ranking of the homologous proteins is most challenging in group 1, followed by groups 2, 3 and 4. As the blocks in group 4 contain plenty of homologous proteins, it might be easy to score a good number for some of the evaluation metrics. For example, if there are 50 homologous proteins in a block, it might not be very hard to rank a true homolog in the top position, yielding the highest score of 1 for TOP1.

Let us now explain the evaluation process of EPX(LR) in a particular group. The evaluation processes for the other groups are analogous. Given the blocks in a particular group, a set of screened phalanxes is obtained by running Algorithm 5.1. As explained in Section 5.4, we used a leave-10%-of-blocks-out cross-validation to guide the formation of the phalanxes. The screened phalanxes are ensembled to get EPX(LR).
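The aggregation itself is just an average of probability vectors. The sketch below illustrates it for the logistic regression version; the function name fit_epx_lr and the data-frame layout are our assumptions, and the thesis evaluates the resulting ensemble with cross-validated probabilities rather than a single fit to the whole training set.

    ## 'train' and 'test' are data frames with the 0/1 response 'y'; 'phalanxes' is a list
    ## of character vectors, each holding the predictor names of one screened phalanx.
    fit_epx_lr <- function(train, test, phalanxes) {
      probs <- sapply(phalanxes, function(px) {
        f   <- reformulate(px, response = "y")            # y ~ x1 + x2 + ... for one phalanx
        fit <- glm(f, data = train, family = binomial)    # logistic regression base learner
        predict(fit, newdata = test, type = "response")   # probability of being homologous
      })
      rowMeans(probs)   # EPX(LR): unweighted average over the phalanxes
    }

Replacing glm() by a random forest fitted to the same phalanx gives the corresponding EPX(RF) aggregation.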
To evaluate the resulting EPX(LR), a completely independent leave-10%-of-blocks-out cross-validation is performed to get a vector of the probabilities of being homologous to the native protein for all of the candidate proteins in each block of that group. This vector of probabilities is used to compute the three evaluation metrics. The leave-10%-of-blocks-out cross-validation is repeated a total of 16 times, providing 16 numbers for each evaluation metric.

Panels (b), (c) and (d) of Figure 5.4 show the side-by-side box plots comparing the performances of the three ensembles, RF, EPX(RF) and EPX(LR), in each of the four groups in terms of mean block RKL, mean block APR and mean block TOP1, respectively. Each box represents 16 realizations of a metric for an ensemble in a group. The colors of the boxes differentiating RF, EPX(RF) and EPX(LR) are white, light grey and dark grey, respectively. The numbers at the top of the box plots are the numbers of phalanxes obtained after running the modified algorithm of phalanx formation (Algorithm 5.1).

[Figure 5.4: Histograms of the proportion of homologous proteins in the four groups (panel a). Side-by-side box plots comparing the performances of the three ensembles, RF, EPX(RF) and EPX(LR), in each of the four groups in terms of mean block RKL (panel b), mean block APR (panel c) and mean block TOP1 (panel d). The colors of the boxes differentiating RF, EPX(RF) and EPX(LR) are white, light grey and dark grey, respectively.]

In terms of mean block RKL (panel (b)), our ensembles EPX(RF) and EPX(LR) gave better predictive ranking than RF in all of the four groups. The adapted ensemble EPX(LR) outperformed EPX(RF) in groups 1, 2 and 3, but not in group 4. The ensembles of phalanxes, EPX(RF) and EPX(LR), outperformed RF in all of the four groups in terms of mean block APR (panel (c)). The most striking improvement of EPX(LR) over EPX(RF) is observed in groups 2 and 3, followed by group 1. In terms of mean block TOP1 (panel (d)), EPX(RF) often outperformed RF. In group 4, all of the ensembles secured the top mean block TOP1. But in groups 2 and 3, EPX(LR) substantially outperformed EPX(RF).

Clearly the ensembles of phalanxes are the winners against random forests in groups 2 and 3. Hence, we want to look closely at the performances of the three ensembles in groups 1 and 4, where our ensembles marginally win against random forests. The left and right panels of Figure 5.5 show hit curves in two typical blocks, 18 and 162, selected from groups 1 and 4, respectively. In both panels, the dotted, dashed and solid lines are for random forests, the ensemble of phalanxes using random forests and the ensemble of phalanxes using the logistic regression model, respectively. The legend also shows the estimated APR, RKL and TOP1 for the three ensembles. In each panel, the performances are reported for only one block, so the metrics are not averaged.
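A hit curve plots the number of homologous proteins found against the number of candidates selected from the top of the ranked list. For one block it can be computed in a couple of lines of R (a sketch only; the function name is ours):

    ## Cumulative number of homologs found as candidates are selected in ranked order;
    ## 'y' and 'prob' are the 0/1 labels and predicted probabilities for one block.
    hit_curve <- function(y, prob) {
      cumsum(y[order(prob, decreasing = TRUE)])
    }
    ## e.g. plot(hit_curve(y, prob), type = "s",
    ##           xlab = "Number of proteins selected", ylab = "Number of homologous proteins")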
Panel (a) of Figure 5.5 is for block 18, which is chosen from group 1. Ranking of the homologous protein is challenging as this block contains only 1 homolog out of 1114 candidate proteins. Using the three evaluation metrics, the top performers are our ensembles EPX(LR) and EPX(RF). They ranked the only homologous protein in the top position, and all of the three metrics achieved the top score. Random forests could not rank the homologous protein in the top position and scored 0 for TOP1. For random forests, the estimated APR and RKL were 0.081 and 14, respectively.

Block 162 is chosen from group 4 and contains a total of 37 homologous proteins out of 816 candidate proteins (panel b of Figure 5.5). Ranking of the homologous proteins is least challenging in this block compared to the blocks in the other three groups. All of the three ensembles perfectly rank the first 28 homologous proteins, scoring 1 for TOP1. In terms of RKL and APR, the top performer is EPX(LR), followed by EPX(RF) and RF.

[Figure 5.5: Hit curves comparing the performances of three ensembles, RF, EPX(RF) and EPX(LR), in blocks 18 (panel a) and 162 (panel b) chosen from groups 1 and 4, respectively. The respective block sizes are 1114 and 816. The legends (APR / RKL / TOP1) are: panel (a) RF 0.081 / 14 / 0, EPX(RF) 1 / 1 / 1, EPX(LR) 1 / 1 / 1; panel (b) RF 0.962 / 104 / 1, EPX(RF) 0.963 / 97 / 1, EPX(LR) 0.972 / 74 / 1.]

When there are many homologous proteins in a block, all of the three ensembles might rank a homologous protein in the top position and obtain the best possible score for TOP1. Hence, the comparison of the ensembles using TOP1 might appear problematic, especially when there are many homologous proteins. However, using TOP1, our ensembles EPX(LR) and EPX(RF) often outperform RF if the blocks contain only a few homologous proteins. Our favorite performance metrics are RKL and APR, and using them, the ensembles EPX(LR) and EPX(RF) are clear winners over random forests. Moreover, for the most part, the adapted ensemble EPX(LR) outperforms EPX(RF).

5.6.2 Evaluation on the Test Set

In this section, the three ensembles are evaluated on the test set. The organizers of the 2004 KDD Cup decided not to disclose the status of the response variable for the test set; however, the results can be obtained by submitting intermediate results to their website. Hence, the ensembles of phalanxes and random forests were learned on the 153 training blocks, for which the response was known, and were applied to the test set to obtain the probabilities of being homologous to the native protein. The estimated probabilities were saved in the specific format suggested by the KDD Cup organizers and were submitted to their website for the result. The KDD Cup website checked the format of the submitted probabilities automatically and reported back the final results.
It is important to mention that we submitted one set of probabilities corresponding to an ensemble to compute all of the evaluation metrics, i.e., the methods were not tuned specifically for each criterion.

Table 5.3 shows the performances of the three ensembles, RF, EPX(RF) and EPX(LR), using the three evaluation metrics: mean block RKL, APR and TOP1. We obtain 4 and 5 phalanxes using EPX(RF) and EPX(LR), respectively. In terms of mean block RKL, APR and TOP1, the top performer is EPX(LR), followed by EPX(RF) and RF. In terms of mean block TOP1, the performances of RF and EPX(RF) are found similar.

Table 5.3: The performances of the three ensembles, RF, EPX(RF) and EPX(LR), using three evaluation metrics: mean block RKL, APR and TOP1. The ensembles are learned on the 153 training blocks and are evaluated on the 150 test blocks. The top performance corresponding to each evaluation metric is marked in bold face.

    Ensemble   Number of Phalanxes   Mean Block RKL   Mean Block APR   Mean Block TOP1
    RF         -                     143.733          0.8089           0.8733
    EPX(RF)    4                      82.3067         0.8140           0.8733
    EPX(LR)    5                      71.4733         0.8274           0.8867

Table 5.4 shows the number of processors used, the amount of memory allocated and the elapsed time recorded (in minutes) for parallel execution of the three ensembles. For EPX(RF) and EPX(LR), a total of 32 processors of the Bugaboo machine were parallelized in the Western Canada Research Grid (WestGrid) computing network (http://www.westgrid.ca/support/quickstart/bugaboo). We allocated 8GB of memory to each processor while running EPX(RF) and 4GB to each while running EPX(LR). Whereas EPX(RF) took a total of 26 hours and 9 minutes for learning on the training set and estimating the probabilities of being homologous to the native protein on the test set, EPX(LR) took only 1 hour and 20 minutes to finish the same task. Using random forests, the task was completed assigning only 1 processor with 8GB of memory; the total running time was 1 hour and 7 minutes. However, the goal was to reduce the computational time of EPX(RF) through EPX(LR). Here, EPX(LR) runs much faster than EPX(RF) and gives better results as well.

Table 5.4: The number of processors used, amount of memory allocated and the elapsed time recorded (in minutes) for parallel execution of the three ensembles.

    Ensemble   Number of Processors   Memory allocation (GB)   Elapsed time (Minutes)
    RF         1                      8                        67
    EPX(RF)    32                     8                        1569
    EPX(LR)    32                     4                        80

Now let us compare the results of our ensemble with the winner of the 2004 KDD Cup. For the protein homology section, the winner of the 2004 KDD Cup was the Weka group (Pfahringer, 2004). Weka is a collection of machine learning algorithms for data mining tasks (http://www.cs.waikato.ac.nz/ml/weka/). Weka supports a large number of algorithms out of the box, and the winning group tried all of them. The Weka group then selected only the top three classifiers, which were: (1) a boosting ensemble of 10 unpruned decision trees (Boost10), (2) a linear support vector machine with a logistic model fitted to the raw outputs for improved probabilities (LinSVM), and (3) 100,000 or more random rules (10^5 RR). The top three performers were aggregated to obtain the winning ensemble. Table 5.5 shows the performance of the top three constituent classifiers as well as the performance of the winning ensemble. The performance of the ensemble of phalanxes using logistic regression, EPX(LR), is more or less like that of the LinSVM of the Weka group, which was one of the top performers selected from many classifiers. However, our target was not to change much of the original ensemble of phalanxes, but to develop a general purpose ensemble with good results.
Despite changing only a little of the algorithm of phalanx formation, the adapted ensemble EPX(LR) does provide impressive results.

Table 5.5: The results from the Weka solution to the 2004 KDD Cup.

    Classifier   Mean Block RKL   Mean Block APR   Mean Block TOP1
    Boost10      500.68           0.6858           0.7933
    LinSVM        64.41           0.8258           0.8867
    10^5 RR       53.77           0.8373           0.8933
    Ensemble      52.45           0.8412           0.9067

5.7 Summary

The ensemble of phalanxes is easily adapted to the logistic regression model replacing random forests. This proves that the scope of the ensemble of phalanxes is not limited to random forests only; other simple models could also be used. One potential candidate is obviously the classification tree, which is a simple model in nature and fast in terms of computational complexity. A potential question in using a classification tree would be how deep to grow the trees. This problem might be addressed, however, with the extra computational effort of employing cross-validation.

The ensemble of phalanxes is adapted to the logistic regression model by introducing filtering in step 4(b)i during phalanx formation. If the weak performing variable (or group of variables) of the most promising pair to merge harms the top performing variable (or group of variables), the modified algorithm filters out the weak performer. This change is crucial for the logistic regression model, but not for random forests. In random forests, merging two groups of variables usually does not harm the top performing group unless the weak performing group contains too many noise variables. Moreover, we did not encounter any initial screening in this application, as all of the 74 predictors, either alone or together with other variables, beat the threshold of random ranking.

In order to ensure the good performance of EPX(LR), the adaptation through the intermediate step of filtering was necessary. However, the goal was not to change much of the algorithm, as we wanted to develop a general purpose ensemble. We have observed that the adapted ensemble EPX(LR) outperforms EPX(RF) by a big margin using three evaluation metrics specific to ranking rare homologous proteins. However, some thoughtful modifications could further improve the performance of EPX(LR). For example, the use of appropriate weights to aggregate the phalanxes might help improve the performance in terms of predictive ranking.

It is shown that EPX(LR) is much faster than EPX(RF) in terms of computational time. Although parallel computation is used to learn EPX(LR), we expect that the ensemble could also be trained using a single processor in a reasonable amount of time by coding the algorithm in C/C++ or in Fortran. Having finished the coding, we plan to embed the program in an R package in the near future.

Chapter 6

Conclusion and Future Work

This thesis tackles a very special type of two-class classification problem. The specialty is that the frequency of one of the classes is very small (i.e., rare) compared to the other class. Moreover, it is the rare, not the majority, class for which correct classification is required. For example, in medicinal chemistry, a chemist might be interested in classifying only the rare active compounds in a chemical library of many compounds.

Minimizing the misclassification error rate can be unhelpful for such problems.
To adaptclassification to unbalanced classes, statisticians or machine learners often maximize the fol-lowing evaluation metrics: precision, recall, G-mean (geometric mean of precision and recall)and F-measure (harmonic mean of precision and recall). An alternative is to sequence all ofthe observations in a dataset in such a way that the rare class observations are found at thetop of a shortlist. The shortlist comprises the top-ranked observations which are intendedto include most, if not all, of the rare-class observations. The length of the shortlist woulddepend on the available resources in a project. As such, after shortlisting the compounds in alarge chemical library, the chemist only needs to go through the shortlist to find most of theactive compounds in the library and thus to save time and money.I chose four bioassay datasets with chemical compounds that are either active or inactiveagainst a biological target, e.g., active against lung tumor cells. The predictors are the numer-ically represented chemical structures of the compounds. It is possible to use many differenttypes of classifiers to rank the compounds. However, classification-tree based ensembles areused in this thesis for two reasons: (i) ensembles usually possess better predictive power thannon-ensembles, and (ii) classification-tree based methods possess good predictive power withinherent variable selection properties, and are applicable to mixed type of high-dimensionaldatasets. Specifically, the popular state-of-the-art ensemble random forests is used becauseof its superiority over other ensemble and non-ensemble classifiers.In each of the four bioassays, I use five sets of predictor variables. In Chapter 2, theensemble of descriptor sets (EDS) was proposed by aggregating five random forests fitted tothe five descriptor sets. The resulting EDS outperforms the top performing random forestsapplied to any of the single set of descriptors. It is the natural diversity and strength of therandom forests built across the descriptor sets which help to develop a powerful ensemble likeEDS. I name such descriptor sets as natural phalanxes, where the variables within a phalanxare good when used together and the variables between phalanxes are good to help each otherin an ensemble.756.1. Parallel ComputationThe presence of such natural phalanxes in the data motivates me to unmask data-adaptivephalanxes in a variable-rich descriptor set. For this purpose, an algorithm of phalanx forma-tion (Algorithm 3.1) is proposed in Chapter 3. The ensemble random forests itself is usedto unmask data-adaptive phalanxes in each of the five descriptor sets. The algorithm filtersthe weak variables and forms the phalanxes using unfiltered variables. As expected, the largedescriptor sets tend to supply many phalanxes.The ensemble of phalanxes (EPX) is obtained by fitting a random forest in each phalanxand aggregating probabilities across the phalanxes. The performance of EPX was comparedwith random forests and regularized random forests. The former performs better in termsof predictive ranking than the latter two ensembles. The advantage of EPX over the twoensembles is large when there are many variables in a descriptor set and when an assaycontains a small proportion of active compounds. The ensemble of descriptor sets is alsoimproved replacing random forests in each descriptor set by the ensemble of phalanxes. 
Thereplacement of random forests by a more powerful ensemble like EPX further improves theperformances of the resulting ensemble of descriptor sets.In Chapter 3, the phalanxes are formed using the probabilities of belonging to the rareclass obtained from the out-of-bag (OOB) samples in a random forests, whereas in Chapter5, the phalanxes are formed using the probabilities obtained from a cross-validation. Thedatasets used in Chapters 3 and 5 are the bioassay datasets and the protein homology dataset,respectively. The ensemble of phalanxes performs better than the random forests in thoseapplications. The ensemble of phalanxes is also adapted to the logistic regression model inChapter 5. The resulting ensemble of phalanxes runs faster in terms of computational timeand performs better in terms of predictive ranking than the original ensemble of phalanxesusing random forests. This shows that the ensemble of phalanxes is generalizable to anysuitable classification methods.Last, but not the least, the protein homology dataset provides a real test set withoutdisclosing the status of the response variable for the candidate proteins. However, it is pos-sible to evaluate the performances of the ensembles by submitting the probabilities of beinghomologous to the KDD cup website. The test results prove the superiority of the ensembleof phalanxes over random forests. The evaluation of the ensemble of phalanxes in the test setis completely honest and confirms the cross-validated results found in Chapters 3 and 4.6.1 Parallel ComputationIn this thesis, instead of executing the computations sequentially, I choose to execute themin parallel using many processors/cores in a computer or using many nodes in a cluster. Theprimary goal is to reduce the computational time.Specifically, I used the four MAC processors (Grouse, Baker, Seymour and Cypress) in thecomputing servers of the department of statistics at the University of British Columbia. Those766.2. Future Workprocessors provide high computing power with large distributed memory in each and are goodfor high performance computing within each. But the interconnection of the processors doesnot support large data transfer from one to another. In order to facilitate the transfer of largedata and results from one node to another as well as to get high performance computing, Imoved to the western Canada research grid (WestGrid) cluster Bugaboo.To provide parallel backend for the R commands which support parallel execution, I usedthe R packages doSNOW and doMPI. If the processors are not connected as a cluster dueto various reasons, one can build a Simple Network Of Workstations (SNOW) using the Rpackage doSNOW. The processors in the computing servers of the department of statisticsat UBC fall in this category, and I used doSNOW to create parallel backend. On the otherhand, the computing nodes in Bugaboo are connected as clusters, and I used doMPI forcreating parallel backend.Both of the R packages doSNOW and doMPI create parallel backend for the R packageforeach. The R package foreach gives a parallel programming framework and provides alooping construct for executingR codes repeatedly on multiple processors/cores on a computeror on multiple nodes of a cluster. The package foreach also supports sequential execution ofrepeating statements with a minor change in the command line. 
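As a small illustration of this setup, the same foreach loop can be run in parallel after registering a doSNOW backend, or sequentially by switching %dopar% to %do%. This is a toy sketch: the cluster size and the loop body are placeholders, not the thesis computation.

    library(foreach)
    library(doSNOW)

    cl <- makeCluster(4)     # a SNOW cluster of 4 workers on the local machine
    registerDoSNOW(cl)       # register the parallel backend used by foreach

    ## Each iteration could, for example, process one cross-validation split;
    ## here the body is just a placeholder computation.
    res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)

    stopCluster(cl)

    ## The same loop runs sequentially, e.g. for debugging on a personal computer:
    res_seq <- foreach(i = 1:8, .combine = c) %do% sqrt(i)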
This attractive feature offoreach enables a programmer writing and debugging codes in a personal computer and thenexecuting the final codes in parallel in a cluster.As stated earlier, I moved from the statistics department?s servers to WestGrid?s clusterBubaboo in order to facilitate transfer of large data files and results from one node to another.Yet I employed another R package iterators to further decrease the overhead data transfer.As the data communication overhead was heavy, I considered chunking and transferring theappropriate portion of the data needed for computing using iterators. Even though there areseveral built-in statements in iterators, I created my own iterators to transfer appropriatechunk of data suitable to my R program.6.2 Future WorkThroughout this thesis, the phalanxes are aggregated to form an ensemble using unweightedaverage of the probabilities belonging to the rare class. This is common both in case of formingthe phalanxes and aggregating the final phalanxes. I propose that the strong phalanxes getlarger weights than the weak phalanxes to improve prediction performance of the resultingensemble. Thus, determining appropriate weights for the phalanxes would be a good extensionof the current ensemble of phalanxes. To determine the optimal weight, I plan to optimize theperformance of the ensemble by examining individual strengths of the phalanxes as well as thecovariance structure of the probabilities belonging to the rare class between the phalanxes.This process of determining optimal weight would be adapted not only to aggregate finalphalanxes but also to form the phalanxes. In addition to optimal weights, I plan to determine776.2. Future Workan appropriate bound on prediction performance of the ensemble by studying the strengthsand correlations between constituent classifiers constructed using the phalanxes.One of the objectives of this thesis is to sort the rare class observations through ranking.To evaluate the performance of a classifier in terms of predictive ranking, I mainly usedaverage hit rate (i.e., average precision) as a single number summary of a hit curve. Themetric ?average hit rate? is also used to form the phalanxes. The formation of phalanxes isalso possible by optimizing other evaluation metrics, for example, optimizing rank last (RKL).Such generalization of the ensemble of phalanxes would be a good extension of this currentmethod.A straightforward extension of this ensemble would be in regression, where the responsevariable is continuous. For example, the four assay datasets used in Chapters 2 to 4 typicallyresulted in continuous responses (percent inhibition) from a primary screen and binary re-sponses (active versus inactive) from a secondary screen. Had I used the continuous responsevariables from those assays, I would have dealt with models related to classical regression. Todeal with continuous response variable as in classical regression, I would have to optimize themetric squared error loss to form the phalanxes as well as to evaluate the resulting ensemble.Having predicted the response percent inhibition, one may rank to sort out the compoundswith high activity.Unlike the rare class problem, the frequencies of the classes could be approximately equal toeach other in many applications. An extension of this ensemble of phalanxes is possible in suchbalanced classification problem by employing the metric misclassification error. 
Hopefully, theensemble of phalanxes would outperform random forests in balanced classification problemstoo, if a dataset contains many useful feature variables with signals for the response.The socialization of the predictor variables in the screened phalanxes can be made by asimple modification of the algorithm 1.1 of random forests presented in Section 1.3.2. Theproposed modification is as follows: at each node of a tree in a forest, randomly choosemtry =?Di variables to find the split-point of that node. Here, Di ; i = 1, ? ? ? p, is the numberof feature variables in the ith phalanx. As a result of this modification, the resulting trees inthe modified random forests would be more diverse and stronger than the trees generated byregular random forests. For the above mentioned reasons, the modification to random forestsmight perform better in terms of predictive ranking than regular random forests.The algorithm for the ensemble of phalanxes is coded in R, the freely available statisticalsoftware. The original C codes for the three evaluation metrics ? APR, RKL, and TOP1 -are downloaded from the 2004 KDD Cup website and are used in my programs. Besides, thewhole R program is designed in such a way that the program can be executed using as manynodes as possible in a cluster of processors. With the vision of running the codes in a singleprocessor, I plan to rewrite the entire program in C/C++ or in Fortran and wrap it up in anR package. In this R package, there would be opportunity for employing most of the popularevaluation metrics to form the phalanxes and to build the ensemble as well.78BibliographyAoyama, T., Suzuki, Y., and Ichikawa, H. (1990), ?Neural networks applied to quantitativestructure-activity relationships,? Journal of Medicinal Chemistry, 33, 2583?2590. ? pages15Bolton, R. and Hand, D. (2002), ?Statistical fraud detection: a review,? Statistical Science,17, 235?255. ? pages 2Boser, B., Guyon, I., and Vapnik, V. (1992), ?A training algorithm for optimal margin clas-sifiers,? in 5th Annual ACM Workshop on COLT, ed. Haussler, D., Pittsburgh, PA: ACMPress, pp. 144?152. ? pages 7Breiman, L. (1996a), ?Bagging predictors,? Machine Learning, 26, 123?140. ? pages 2, 8? (1996b), ?Heuristics of instability and stabilization in model selection,? The Annals ofStatistics, 24, 2350?2383. ? pages 9? (1996c), ?Out-of-bag estimation,? Tech. rep., Department of Statistics, University of Cal-ifornia, Berkley, USA. ? pages 38? (2001), ?Random forests,? Machine Learning, 45, 5?32. ? pages 2, 9, 10, 20, 28, 33, 38,47, 51, 60Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification andRegression Trees, The Wadsworth Statistics/Probability Series, Belmont, California, USA:Wadsworth International Group. ? pages 4Bruce, C. L., Melville, J. L., Pickett, S. D., and Hirst, J. D. (2007), ?Contemporary QSARclassifiers compared,? Journal of Chemical Information and Modeling, 47, 219?227. ?pages 16, 20Burden, F. R. (1989), ?Molecular identification number for substructure searches,? Journalof Chemical Information and Computational Sciences, 29, 225?227. ? pages 19, 52Carhart, R. E., Smith, D. H., and Ventkataraghavan, R. (1985), ?Atom pairs as molecu-lar features in structure-activity studies: definition and application,? Journal of ChemicalInformation and Computational Sciences, 25, 64?73. ? pages 19, 5279BibliographyChawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2003), ?SMOTEBoost: Improvingprediction of the minority class in boosting,? 
in 7th European Conference on Principles andPractice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, pp. 107?119.? pages 50Chen, C., Liaw, A., and Breiman, L. (2004), ?Using random forest to learn imbalanced data,?Tech. rep., Department of Statistics, University of California, Berkeley, CA. ? pages 16,50Cortes, C. and Vapnik, V. (1995), ?Support-vector networks,? Machine Learning, 20, 273?297.? pages 7Cover, T. and Hart, P. (1967), ?Nearest neighbour pattern classification,? IEEE Transactionon Information Theory, 13, 21?27. ? pages 5DeBarr, D. and Wechsler, H. (2012), ?Spam detection using random boost,? Pattern Recog-nition Letters, 33, 1237?1244. ? pages 1Deng, H. (2012), R Package Regularized Random Forest, Version 1.2; R Foundation for Sta-tistical Computing: Viena, Austria. ? pages 10, 42Deng, H. and Runger, G. (2012), ?Feature selection via regularized trees,? in The 2012 In-ternational Joint Conference on Neural Networks (IJCNN), IEEE. ? pages 10, 33, 51Dietterich, T. G. (2000), ?Ensemble methods in machine learning,? Lecture Notes in ComputerScience, 1857, 1?15. ? pages 20Doniger, S. and Hofmann, T. (2002), ?Predicting CNS permeability of drug molecules: com-parison of neural network and support vector machine algorithms,? Journal of Computa-tional Biology, 9, 849?864. ? pages 15Efron, B. and Tibshirani, R. (1994), An Introduction to the Bootstrap, Chapman & Hall/CRC.? pages 8Fienberg, S. E. (2004), ?Homeland insecurity: Data mining, terrorism detection, and confi-dentiality,? Tech. Rep. 148, National Institute of Statistical Sciences, 19 T. W. AlexanderDrive, Research Triangle Park, NC. ? pages 1Fisher, R. (1936), ?The use of multiple measurements in taxonomic problems,? Annals ofEugenics, 7, 179?188. ? pages 6Freund, Y. (1995), ?Boosting a weak learning algorithm by majority,? Information and Com-putation, 121, 256?285. ? pages 10Freund, Y. and Schapire, R. (1996a), ?Experiments with a new boosting algorithm,? in Ma-chine Learning: Proceedings of the Thirteenth International Conference, Morgan Kauf-mann, San Francisco, pp. 148?156. ? pages 2, 10, 1180Bibliography? (1996b), ?Game theory, online prediction and boosting,? in Proceedings of the Ninth AnnualConference on Computational Learning Theory, pp. 325?332. ? pages 10? (1997), ?A decision-theoretic generalization of online learning and an application to boost-ing,? Journal of Computer and System Sciences, 55, 119?139. ? pages 10Friedman, J. (1989), ?Regularized discriminant analysis,? Journal of the American StatisticalAssociation, 84, 165?175. ? pages 6Hall, P., Park, B., and Samworth, R. (2008), ?Choice of neighbor order in nearest-neighborclassification,? The Annals of Statistics, 36, 2135?2152. ? pages 4Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning:Data Mining, Inference, and Prediction, Springer, 2nd ed. ? pages 6, 7, 8, 12Hughes-Oliver, J. M., Brooks, A. D., Welch, W. J., Khaledi, M. G., Hawkins, D., Young,S. S., Patil, K., Howell, G. W., and Ng, R. T. (2011), ?ChemModLab: A web-basedcheminformatics modeling laboratory,? In Silico Biology, 11, 61?81. ? pages 2, 15, 16, 19,20, 22, 29, 34, 54Johnson, R. A. and Wichern, D. W. (2002), Applied Multivariate Statistical Analysis, PrenticeHall, 5th ed. ? pages 25Kauffman, G. W. and Jurs, P. C. (2001), ?QSAR and k-nearest neighbor classification analysisof selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors,?Journal of Chemical Information and Computational Sciences, 41, 1553?1560. ? 
pages 15Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., and Sheridan,R. P. (1996), ?Chemical similarity using physiochemical property descriptors,? Journal ofChemical Information and Computational Sciences, 36, 118?127. ? pages 22, 34Koonin, E. V. and Galperin, M. Y. (2003), Sequence Evolution Function: ComputationalApproaches in Comparative Genomics, Boston: Kluwer Academic. ? pages 14Leach, A. R. and Gillet, V. J. (2007), An Introduction to Chemoinformatics, Dordrecht,Netherlands: Springer. ? pages 13, 16Liaw, A. and Wiener, M. (2011), R Package Random Forest, Version 4.6-2; R Foundation forStatistical Computing: Viena, Austria. ? pages 20, 41, 42Liu, K., Feng, J., and Young, S. S. (2005), ?PowerMV: A software environment for molecu-lar viewing, descriptor generation, data analysis and hit evaluation,? Journal of ChemicalInformation and Modeling, 45, 515?522. ? pages 19McCulloch, W. and Pitts, W. (1943), ?A logical calculus of the ideas immanent in nervousactivity,? Bulletin of Mathematical Biophysics, 5, 115?133. ? pages 681BibliographyMeier, L., van de Geer, S., and Buhlmann, P. (2008), ?The group lasso for logistic regression,?Journal of the Royal Statistical Society, Series B, 70, 53?71. ? pages 32, 59Merkwirth, C., Mauser, H., Schulz-Gasch, T., Roche, O., Stahl, M., and Lengauer, T. (2004),?Ensemble methods for classification in chemiinformatics,? Journal of Chemical Informa-tion and Computational Sciences, 44, 1971?1978. ? pages 15Mezey, P., Carbo, R., and Girones, X. (2001), Fundamentals of molecular similarity, NewYork: Kluwer Academic/Plenum Publishers. ? pages 13Pearlman, R. S. and Smith, K. M. (1999), ?Metric validation and the receptor-relevant sub-space concept,? Journal of Chemical Information and Computational Sciences, 39, 28?35.? pages 19, 52Pfahringer, B. (2004), ?The weka solution to the 2004 KDD cup,? SIGKDD Explorations, 6,117?119. ? pages 73Ripley, B. D. (2011), R Package Tree, Version 1.0-29; R Foundation for Statistical Computing:Viena, Austria. ? pages 28Rosenblatt, F. (1962), Principles of Neurodynamics: Perceptrons and the Theory of BrainMechanisms, Spartan, Washington, D.C. ? pages 6Rusinko, A., Farman, M. W., Lambert, C. G., Brown, P. L., and Young, S. S. (1999), ?Analysisof a large structure/biological activity data set using recursive partitioning,? Journal ofChemical Information and Computational Sciences, 39, 1017?1026. ? pages 15Schapire, R. (1990), ?The strength of weak learnability,? Machine Learning, 5, 197?227. ?pages 10Srivastava, A., Kundu, A., Sural, S., and Majumdar, A. (2008), ?Credit card fraud detectionusing hidden markov model,? IEEE Transactions on Dependable and Secure Computing, 5,37?48. ? pages 1Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., and Feuston, B. R. (2003),?Random forest: A classification and regression tool for compound classification and QSARmodeling,? Journal of Chemical Information and Computational Sciences, 43, 1947?1958.? pages 15, 16, 20Tebbutt, S., Opushnyev, I., Tripp, B., Kassamali, A., Alexander, W., and Andersen, M.(2005), ?SNP chart: An integrated platform for visualization and interpretation of mi-croarray genotyping data,? Bioinformatics, 21, 124?124. ? pages 32Tibshirani, R. (1996), ?Bias, variance, and prediction error for classification rules,? Tech. rep.,Department of Statistics, University of Toronto, ON, Canada. ? pages 3882Todeschini, R. and Consonni, V. (2000), Handbook of Molecular Descriptors, Weinbeim, Ger-many: WILEY-VCH. ? pages 13, 16Venables, W. 
and Ripley, B. (2002), Modern Applied Statistics with S, Springer, 4th ed. ?pages 4Wang, Y. (2005), ?Statistical Methods for High Throughput Screening Drug Discovery Data,?Ph.D. thesis, University of Waterloo, Waterloo, Canada. ? pages 2, 11, 12, 13, 20, 23Warmuth, W., Ratsch, G., Mathieson, M., Liao, J., and Lemmen, C. (2001), ?Active learningin the drug discovery process,? NIPS. ? pages 1Widrow, B. and Hoff, M. (1960), ?Adaptive switching circuits,? IRE WESCON Conventionrecord, 4, 96?104. ? pages 6Wolpert, D. and Macready, W. (1999), ?An efficient method to estimate baggings generaliza-tion error,? Machine Learning, 35, 41?55. ? pages 38Yuan, M. and Lin, Y. (2007), ?Model selection and estimation in regression with groupedvariables,? Journal of the Royal Statistical Society, Series B, 68, 49?67. ? pages 32, 59Zhang, Q., Hughes-Oliver, J. M., and Ng, R. T. (2009), ?A model-based ensembling approachfor developing QSARs,? Journal of Chemical Information and Modeling, 49, 1857?1865. ?pages 16, 17, 20Zhu, M. (2004), ?Recall, precision and average precision,? Tech. rep., Department of Statisticsand Actuarial Science, University of Waterloo, ON, Canada. ? pages 23, 34Zhu, M., Su, W., and Chipman, H. (2006), ?LAGO: A computationally efficient approach forstatistical detection,? Technometrics, 48, 193?205. ? pages 8, 3483Appendix ASupplementary Tables84Appendix A. Supplementary TablesTable A.1: The number of variables, initial groups, screened groups, candidate phalanxes,and screened/army of phalanxes for the 3 runs of the algorithm to AID362 assay.DS RunNumber ofVariables Groups PhalanxesInitial Screened Candidate ScreenedAP1360 7750 1 12 54 6 43 57 12 4BN124 2424 4 42 24 5 53 24 6 6CAP11319 325281 10 102 289 12 123 288 14 14FP1563 10292 5 52 95 6 53 97 4 4PH1112 2118 2 22 16 1 13 18 2 1Table A.2: Average hit rate (AHR) for an ensemble of phalanxes (EPX), a random forests(RF), and a regularized random forests (RRF) averaged over 16 repeats of balanced 10-foldcross-validation for the AID362 assay. Larger AHR values are better. The last two columnsshow the number of times EPX has larger AHR among the 16 repeats of cross-validationrelative to RF and RRF.DS Run Mean AHR EPX beatsEPX RF RRF RF RRFAP1 0.3000.280 0.25616/16 16/162 0.306 15/16 16/163 0.295 13/16 15/16BN1 0.2610.242 0.23816/16 16/162 0.299 16/16 16/163 0.285 16/16 16/16CAP1 0.3630.267 0.17116/16 16/162 0.355 16/16 16/163 0.368 16/16 16/16FP1 0.3150.266 0.17416/16 16/162 0.323 16/16 16/163 0.306 16/16 16/16PH1 0.2270.216 0.16814/16 16/162 0.212 7/16 16/163 0.218 9/16 16/1685Appendix A. Supplementary TablesTable A.3: The number of variables, initial groups, screened groups, candidate phalanxes,and screened/army of phalanxes for the 3 runs of the algorithm to AID364 assay.DS RunNumber ofVariables Groups PhalanxesInitial Screened Candidate ScreenedAP1380 7863 6 52 71 6 63 65 5 5BN124 2424 4 42 24 3 33 24 4 4CAP11585 394316 7 72 321 10 103 380 10 7FP1580 104100 4 42 101 3 33 102 6 6PH1120 2118 3 32 16 1 13 16 2 2Table A.4: Average hit rate (AHR) for an ensemble of phalanxes (EPX), a random forests(RF), and a regularized random forests (RRF) averaged over 16 repeats of balanced 10-foldcross-validation for the AID364 assay. Larger AHR values are better. 
The last two columnsshow the number of times EPX has larger AHR among the 16 repeats of cross-validationrelative to RF and RRF.DS Run Mean AHR EPX beatsEPX RF RRF RF RRFAP1 0.2910.265 0.23016/16 16/162 0.292 16/16 16/163 0.310 16/16 16/16BN1 0.3710.327 0.30016/16 16/162 0.365 16/16 16/163 0.373 16/16 16/16CAP1 0.3790.334 0.25216/16 16/162 0.390 16/16 16/163 0.390 16/16 16/16FP1 0.3180.305 0.26115/16 16/162 0.320 16/16 16/163 0.317 14/16 16/16PH1 0.2780.275 0.21911/16 16/162 0.276 9/16 16/163 0.282 14/16 16/1686Appendix A. Supplementary TablesTable A.5: The number of variables, initial groups, screened groups, candidate phalanxes,and screened/army of phalanxes for the 3 runs of the algorithm to AID371 assay.DS RunNumber ofVariables Groups PhalanxesInitial Screened Candidate ScreenedAP1382 7858 5 42 59 5 43 61 3 3BN124 2424 3 32 24 5 53 24 3 3CAP11498 362192 7 72 196 9 93 202 7 6FP1580 10485 5 52 85 2 23 88 3 3PH1119 2115 5 52 15 4 43 14 3 3Table A.6: Average hit rate (AHR) for an ensemble of phalanxes (EPX), a random forests(RF), and a regularized random forests (RRF) averaged over 16 repeats of balanced 10-foldcross-validation for the AID371 assay. Larger AHR values are better. The last two columnsshow the number of times EPX has larger AHR among the 16 repeats of cross-validationrelative to RF and RRF.DS Run Mean AHR EPX beatsEPX RF RRF RF RRFAP1 0.3270.315 0.28116/16 16/162 0.331 16/16 16/163 0.328 16/16 16/16BN1 0.3420.335 0.33316/16 16/162 0.354 16/16 16/163 0.338 13/16 13/16CAP1 0.3900.347 0.31016/16 16/162 0.384 16/16 16/163 0.378 16/16 16/16FP1 0.3580.362 0.3383/16 15/162 0.358 4/16 14/163 0.364 12/16 16/16PH1 0.2770.277 0.2829/16 5/162 0.284 15/16 10/163 0.279 9/16 6/1687
