Scandent tree: a decision forest basedclassification method for multimodalincomplete datasetsbySoheil HorB.Sc., Isfahan University of Technology, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Biomedical Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)April 2016c© Soheil Hor 2016AbstractIncomplete and inconsistent datasets often pose difficulties in multimodalstudies. A common scenario in such studies is where many of the samplesare non-randomly missing a large portion of the most discriminative features.We introduce the novel concept of scandent decision trees to tackle this issuein the context of a decision forest classifier. Scandent trees are decision treesthat optimally mimic the partitioning of the data determined by anotherdecision tree, and crucially, use only a subset of the feature set. We use theforest resulting from ensembling these trees as a classification model. Wetest the proposed method on a real world example of the target scenario,a prostate cancer dataset with MRI and gene expression modalities. Thedataset is imbalanced with many MRI only samples and few with MRI andgene expression. Using scandent trees, we train a classifier that benefitsfrom the large number of MRI samples at training time, and of the presenceof MRI and gene expression features at the time of testing. The resultsshow that the diagnostic value of the proposed model in terms of detectingprostate cancer is improved compared to traditional methods of imputationand missing data removal.The second major contribution of this work is the concept of tree-basedfeature maps in the decision forest paradigm. The tree-based feature mapsenable us to train a classifier on a rich multimodal dataset, and use it toclassify samples with only a subset of features of the training data. This hasimportant clinical implications: one can benefit from an advanced modalityto train a classifier, but use it in a practical situation when less expensivemodalities are available. We use the proposed methodology to build a modeltrained on MRI and PET images of the Alzheimer’s Disease NeuroimagingInitiative (ADNI) dataset, and then test it on cases with only MRI data. Weshow that our method is significantly more effective in staging of cognitiveimpairments compared to a model trained and tested on MRI only, or onethat uses other kinds of feature transform applied to the MRI data.iiPrefaceThe concept of scandent trees was originally introduced by the author in apaper published in Medical Imaging and Computer Assisted Interventionsconference (MICCAI-2015) titled as “Scandent tree: a random forest learn-ing method for incomplete multimodal datasets”. This paper is the sourceof most of the material used in chapter 3.Majority of the work discussed in chapter 4 is extracted from a paperconditionally accepted to the MICCAI special issue on Medical Image Anal-ysis (MedIA) journal under the title of “Learning in data-limited multimodalscenarios: scandent decision forests and tree-based features”.The contribution of the author is development and evaluation of the tech-niques proposed in these publications and was performed under supervisionof Dr. Mehdi Moradi.This study has been performed as part of an ethics certificate approvedby UBC research ethics board under the title of “Computational multimodalradiologic profiling as a prognostic bio-marker for individualized prostatecancer therapy” with UBC CREB number of H14-00359. Under supervisionof Dr. Peter Black and Dr. Mehdi Moradi.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 41.4 Organization of the thesis . . . . . . . . . . . . . . . . . . . . 52 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Decision trees and decision forests . . . . . . . . . . . . . . . 62.2.1 Classification and regression decision trees . . . . . . 62.2.2 C5.0 decision trees . . . . . . . . . . . . . . . . . . . . 112.2.3 Decision forests . . . . . . . . . . . . . . . . . . . . . 142.3 Handling missing values . . . . . . . . . . . . . . . . . . . . . 162.3.1 Data removal methods . . . . . . . . . . . . . . . . . 182.3.2 General imputation methods . . . . . . . . . . . . . . 192.4 State of the art tree-based imputation methods . . . . . . . . 212.4.1 CART embedded imputation method: surrogate divi-sions . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.2 C5.0 embedded imputation method . . . . . . . . . . 23ivTable of Contents2.4.3 Decision forest embedded imputation method: rfIm-pute . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 Scandent tree: a forest based method for multimodal clas-sification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.2.1 Mathematical formulation . . . . . . . . . . . . . . . 283.2.2 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . 283.2.3 Support tree . . . . . . . . . . . . . . . . . . . . . . . 293.2.4 Scandent trees . . . . . . . . . . . . . . . . . . . . . . 303.2.5 Leaf level inference . . . . . . . . . . . . . . . . . . . 323.2.6 Implementation . . . . . . . . . . . . . . . . . . . . . 333.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.3.1 Evaluation using benchmark datasets . . . . . . . . . 343.3.2 A real scenario: prostate cancer dataset . . . . . . . . 373.4 Simulation and experimental results . . . . . . . . . . . . . . 413.4.1 Simulation results . . . . . . . . . . . . . . . . . . . . 413.4.2 Experimental results: prostate cancer dataset . . . . 453.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474 Tree-based feature transforms: applying scandent tree modelfor single modal classification . . . . . . . . . . . . . . . . . . 494.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . 524.3 Evaluation and results . . . . . . . . . . . . . . . . . . . . . . 534.3.1 Evaluation using benchmark datasets . . . . . . . . . 534.3.2 A real scenario: ADNI dataset . . . . . . . . . . . . . 554.3.3 Comparison with other work on ADNI . . . . . . . . 614.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655.2 Discussions and limitations . . . . . . . . . . . . . . . . . . . 665.2.1 Limitations of the implementated method . . . . . . . 665.2.2 Discussions and limitations of the multimodal study . 675.2.3 Discussions and limitations of the single modal study 685.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 69vTable of ContentsBibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71viList of Tables3.1 List of features and the outcome classes, dermatology dataset 353.2 The feature set of the heart disease dataset . . . . . . . . . . 363.3 The feature set of the breast cancer dataset . . . . . . . . . . 373.4 List of the genes used in the prostate cancer study . . . . . . 404.1 Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) andArea under ROC curve (AUC) of the proposed methods andthe baseline forest for the NL vs. pMCI single modal classi-fication task, ADNI dataset . . . . . . . . . . . . . . . . . . . 594.2 Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) andArea under ROC curve (AUC) of the proposed methods andthe baseline forest for the sMCI vs. AD single modal classifi-cation task, ADNI dataset . . . . . . . . . . . . . . . . . . . . 594.3 Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) andArea under ROC curve (AUC) of the proposed methods andthe baseline forest for the sMCI vs. pMCI single modal clas-sification task, ADNI dataset . . . . . . . . . . . . . . . . . . 624.4 Comparison of the proposed single modal method with thestate of the art for sMCI vs. pMCI prediction, ADNI dataset 63viiList of Figures3.1 Diagram of the proposed method for growing the scandent trees 323.2 Registration example: (a) T2-weighted , (b) DTI and (c)DCE-MRI slice. The green contour represents the bound-aries of the prostate gland. The red contour represents themapped tumor ROI([16, 25]). . . . . . . . . . . . . . . . . . . 383.3 Gene expression heat-map of the probes corresponding to theselected genes for each patient. Each row presents a sample.Each column presents a gene expression feature. The verti-cal dendograms show clustering of samples. The horizontaldendograms show clustering of features. Sample clusteringcorrectly clusters each patient. This shows that the gene ex-pression profiles are mostly patient-specific. Although all ofthe selected genes are known to be biomarkers of prostate can-cer, neither correlations between features nor cancer-relatedpatterns are visible. . . . . . . . . . . . . . . . . . . . . . . . 393.4 AUC vs multimodal sample size for the dermatology dataset(each box shows variation of AUC values for different ran-domised training and test sets) . . . . . . . . . . . . . . . . . 423.5 AUC vs multimodal sample size for heart disease dataset(each box shows AUC values for different single modal featuresets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.6 AUC vs single modal feature set size for heart disease dataset(each box shows AUC values for different multimodal samplesizes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.7 AUC vs multimodal sample size for breast cancer dataset(each box shows AUC values for different singlemodal featuresets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.8 AUC vs single modal feature set size for breast cancer dataset(each box shows AUC values for different multimodal samplessizes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45viiiList of Figures3.9 AUC for multimodal classification task obtained with differ-ent strategies for handling the missing data issue, prostatecancer dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 464.1 Extracting tree-based feature transforms from the scandenttree model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2 Diagram of the proposed method for training the “multimodalfeature transform” forest . . . . . . . . . . . . . . . . . . . . . 524.3 AUC vs single modal sample size for dermatology dataset . . 544.4 AUC vs single modal feature set size for breast cancer dataset(each box shows AUC values for different multimodal samplesizes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.5 Diagram of the method used for forming the PC forest . . . . 574.6 Diagram of the method used for forming a forest based onsingle modal tree-based feature transforms . . . . . . . . . . . 584.7 ROC curve for NL vs. progressive MCI classification, singlemodal classification task, ADNI dataset . . . . . . . . . . . . 584.8 ROC curve for stable MCI vs. AD classification, single modalclassification task, ADNI dataset . . . . . . . . . . . . . . . . 604.9 ROC curve for stable MCI vs. progressive MCI classification,single modal classification task, ADNI dataset . . . . . . . . . 61ixAcknowledgementsFunding from Canadian Institutes of Health Research (CIHR, OperatingGrant) and Natural Sciences and Engineering Research Council of Canada(Discovery Grant) is acknowledged. Prostate imaging and genomic datawere obtained at Vancouver General Hospital and UBC Hospital with ap-proval from Clinical Research Ethics Board and informed patient consent.I would like to acknowledge Drs. Peter Black, Larry Goldenberg, Piotr Ko-zlowski, Jennifer Locke, Silvia Chang, Edward C. Jones, Ladan Fazli, allfrom UBC/VGH; Dr. Elai Davicioni, Christine Buerki, Heesun Shin, andZaid Haddad from GenomeDx Biosciences Inc.ADNI Disclosure: Data collection and sharing for parts of this workwas funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Na-tional Institutes of Health Grant U01 AG024904) and DOD ADNI (De-partment of Defense award number W81XWH-12-2-0012). ADNI is fundedby the National Institute on Aging, the National Institute of Biomedi-cal Imaging and Bioengineering, and through generous contributions fromthe following: AbbVie, Alzheimers Association; Alzheimers Drug Discov-ery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-MyersSquibb Company; CereSpir, Inc.; Eisai Inc.; Elan Pharmaceuticals, Inc.;Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and itsaffiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.;Janssen Alzheimer Immunotherapy Research & Development, LLC.; John-son & Johnson Pharmaceutical Research & Development LLC.; Lumosity;Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRxResearch; Neurotrack Technologies; Novartis Pharmaceuticals Corporation;Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company;and Transition Therapeutics. The Canadian Institutes of Health Researchis providing funds to support ADNI clinical sites in Canada. Private sectorcontributions are facilitated by the Foundation for the National Institutes ofHealth (www.fnih.org). The grantee organization is the Northern CaliforniaInstitute for Research and Education, and the study is coordinated by theAlzheimer’s Disease Cooperative Study at the University of California, SanDiego. ADNI data are disseminated by the Laboratory for Neuro ImagingxAcknowledgementsat the University of Southern California.xiDedicationFirst and foremost, I want to dedicate this thesis to my parents and mysisters for being the most supportive family one could possibly have. Thankyou for supporting me with love and believing in me all through my life.I also want to thank my supervisor, Dr. Mehdi Moradi for his kind super-vision through this study. During the past two years, his kind comments,his guidance and support kept me motivated. Thank you for your trust andyour patience.Finally, I want to thank all my friends who made my life away from homemore tolerable. Thank you for your continuous help and encouragement,when I needed it the most.xiiChapter 1Introduction1.1 MotivationIn recent years there has been a surge of interest in multimodal data anal-ysis. Different modalities provide researchers with complementary informa-tion about diseases and provide the means for more accurate detection andstaging. This can be valuable in the case of progressive illnesses such asAlzheimer’s disease and certain kinds of cancer. Simultaneous analysis ofmultiple modalities could also help us discover novel relations between differ-ent modalities, such as understanding the relationship of molecular changescaused by a disease and its imaging signature when both genetics and imag-ing data are available. Given these potential advantages, there has been atrend of merging different modalities in biomedical studies. For instance theAlzheimer’s Disease Neuroimaging Initiative (ADNI), a six year $65 millionstudy, has focused on using medical imaging modalities like Magnetic Res-onance Imaging (MRI) and Positron Emission Tomography (PET) togetherwith genetics and other clinical biomarkers for gaining better understandingof Alzheimer’s Disease and its progression.Acquiring multimodal data is generally more costly and time consum-ing than a single modality. As a result, multimodal datasets usually havevaluable features, but a small set of samples with all features. This makesit difficult to build classifiers with large training data for highly multimodalprotocols. For instance, in the case of the ADNI dataset, nearly half ofthe patients are missing the PET data. PET imaging is expensive andrequires the use of radioactive tracers. As a result, a large number of pa-tients only receive MRI scans, despite the fact that PET imaging providesunique brain functional information by quantification of the cerebral bloodflow, metabolism, and receptor binding, which are not measured with MRI.This is a common scenario in dealing with multimodal data. So designing acomputational model that can be trained on both MRI and PET data (mul-timodal data), but be deployed in clinical settings where only MRI (singlemodal data) is available, is a valuable contribution in this area, providedthat the model outperforms one that is solely trained on MRI data.11.1. MotivationAnother common scenario is the case of a new multimodal research pro-tocol including at least one component which is only obtained in the courseof the study itself. An example of this scenario is our study to understandthe relationship of molecular signature of prostate cancer with the imagingsignature of the disease obtained through multiparametric MRI (mpMRI).The hope is that the simultaneous analysis of the molecular and imagingdata can provide clues towards building a reliable and affordable clinicalstaging test. Since prostate cancer is a multifocal disease with tumors atdifferent stages in each foci, this study requires tissue samples for molecularanalysis that are obtained from a specific area with known spatial regis-tration to the MRI images. The steps taken to acquire this data are notpart of the clinical routine and the pace of data acquisition is slow. Onthe contrary, we have access to hundreds of samples with only mpMRI dataand known histology. In this scenario, we are building a computationalmodel that would be studied on the mpMRI+genomics (multimodal) data.Here, we may benefit from a computational framework that can utilize therather large single modal dataset during training, but be able to handle themultimodal data at the testing stage.In this thesis we present solutions, within the context of decision tree/forestparadigm of learning to address the problems posed in the two scenarios de-scribed above. Recent relevant work includes an investigation of the appli-cations of imputation methods for dealing with missing values in the ADNIdataset [6]. The results show that joining a multimodal dataset with a singlemodal dataset by imputation of the missing values improves the classifica-tion accuracy, compared to training a classifier on either the single modal orthe available multimodal data. In our current work, we intend to go beyondthe paradigm of imputation. This is due to the fact that multimodal studiesdo not necessarily hold the usual assumptions in imputation that only asmall number of data points are missing at random. We intend to deal withsituations where blocks of data are missing together and the missing valuesare not spread randomly.One trend in dealing with block wise missing values in multimodal datasetsis separately modeling different blocks of data and then joining the resultingmodels by using a merging classifier or an ensembling method. One of themost successful attempts in this field is applying multi-source learning tech-niques for dealing with block wise missing data in ADNI [38, 39, 42]. Theincomplete Multi-Source Feature learning method (iMSF) proposed by Yuanet al., models different blocks of data with similar feature sets as differenttasks and learns a joint model by imposing a sparse learning regularisationon these tasks [42]. The authors also propose a different approach by using21.1. Motivationa model score completion scheme. This method is based on training inde-pendent classifiers on different blocks of data, and then using the predictionscores calculated by each classifier as a new presentation of the data thatcan then be imputed using conventional imputation techniques. A recentpaper by Yu et al., proposes a new method based on Multi-task Linear Pro-gramming Discriminant (MLPD) analysis [41]. This method formulates theproblem as a multi-task learning scenario in a fashion similar to the iMSFmethod but does not constraint all of the tasks to share the same set offeatures, allowing joint learning of a more flexible model.As a limitation to these studies, the training and testing datasets areassumed to have the same distribution and feature sets. Recently, Cheng etal., addressed this issue and proposed a method for multimodal data analysisbased on multimodal manifold-regularized transfer learning method [9]. Thismethod enables using data from different domains together with unlabeleddata for multimodal classification. This work uses a kernel based data fusionapproach and includes a sparsity constraint in order to deal with the highdimensionality issue.In this thesis we address the same limitations reported in [9], but withdifferent assumptions that fit our scenarios. We don’t assume that thereis unlabeled data available. We do assume that the feature set of the testdata is a subset of the training data. For instance, in case of the ADNIdataset, we assume that the training dataset consists of a set of sampleswith both MRI and PET data (although incomplete) but the test sampleonly consists of MRI data. This scenario is aimed at enabling the use ofmultimodal datasets for training of a classifiers that requires only a subsetof modalities for testing.Another important issue in multimodal classification is the high dimen-sionality that poses difficulties in feature selection and classifier building.The majority of the methods in the literature use the multi-kernel SVMframework for multimodal classification and need to impose sparse condi-tions on the multimodal feature set in order to avoid over-fitting [9, 20, 43].However by working within the decision tree/forest paradigm we can ben-efit from its embedded way of dealing with high dimensional data throughfeature bagging [4]. Another motivation for the use of decision forest paradigmis that it provides the ability to morph the treatment of missing data withinthe framework of learning to maximize the classification performance. Thisarea of work has seen significant contributions in recent years. These in-clude the state of the art imputation methods embedded in the classificationand regression tree (CART) algorithm and C5.0 algorithm for decision treegrowth [28, 29] and in Random Forests (rfImpute) [4] which we discuss in31.2. Objectivemore detail in the next chapter.1.2 ObjectiveThe objective of this thesis is two fold:First, developing a classifier in the context of decision forests that ben-efits from a large single modal dataset and a small multimodal dataset atthe time of training, but is tested on multimodal data. This is motivatedby the work in the area of prostate cancer detection and staging.Second, developing a method based on the decision forest classifier thatcan benefit from a multimodal dataset together with a single modal datasetat the time of training, but is tested on single modal data. This is motivatedby the work in the area of Alzheimer’s Disease detection and staging.1.3 ContributionsThis thesis reports two specific contributions:• First, introducing the concept of scandent trees, a novel forest-basedmethod that can leverage one or more single modal datasets in orderto enhance a multimodal forest. To our knowledge, this is the firstdecision-forest based algorithm specifically designed for this purpose.We provide results for different scenarios by simulation of the missingvalue problem on publicly available benchmark datasets. We also com-pare the scandent tree method to different state of the art methodsfor missing value imputation on a prostate cancer dataset which is areal-world example of the target scenario.• Second, we develop the idea of scandent tree-based feature transformsto solve the problem of missing data in the single modal testing sce-nario. This problem has many clinical applications in areas whereexpensive research protocols meet the realities of clinical practice andhigh cost. Here, the assumption of a multimodal dataset with blockwise missing values remains. However, there is no multimodal assump-tion about the test set. Using the proposed approach on the ADNIdataset, we show that we can use MRI and PET data for training aclassifier that only requires the MRI data for the prediction of differ-ent stages of Alzheimer’s disease. We show that the inclusion of thePET data at the time of training results in an improved classification41.4. Organization of the thesisaccuracy, even though the test cases are not subjected to PET imag-ing. We also examine the proposed method in different scenarios bysimulation of the single modal and multimodal datasets on publiclyavailable benchmark datasets. To our knowledge this is also the firstmethod based on decision forests that has been designed specificallyfor this purpose.1.4 Organization of the thesisIn this chapter, we discussed the advantages and limitations of multimodalstudies and the importance of developing a new approach based on the deci-sion forest classifiers leveraging multimodal datasets with block-wise missingdata for either single modal or multimodal classification tasks. The remain-der of this thesis is organised as follows:• In Chapter 2 we explain the basics of the state of the art methods forgrowing decision trees and decision forests. Then we present a reviewof conventional methods for handling missing data, including generalpurpose imputation methods and the methods specific to tree-basedclassifiers.• Chapter 3 describes the concept of scandent trees for multimodal clas-sification and provides experimental and simulation results as proofsof concept.• Chapter 4 introduces the tree-based feature transforms and applica-tions of the scandent tree model for single modal classification. Thischapter also provides simulation results together with experimentalresults.• Finally, chapter 5 provides the conclusions of the thesis and a discus-sion about the limitations of this study and the potential for futurework.5Chapter 2Background2.1 IntroductionIn order to gain a better understanding of the missing value handling prob-lem in tree-based classifiers, we first review tree-based classifiers. In the firstsection of this chapter, two of the most well known algorithms for growth ofdecision trees named, Classification And Regression Trees (CART) and C5.0algorithms are introduced and explained in detail. Then an introduction tothe concept of random forests and their theory of operation is presented. Ina separate section, the challenges present in handling the problem of missingvalues in a general context are discussed in detail and a brief introductionto different general approaches to handle data with missing values is pro-vided. In the next section of this chapter, we investigate the state of theart imputation methods which are specifically designed for tree-based clas-sifiers. Detailed information about three of the state of the art embeddedimputation methods for CART, C5.0 and random forests (rfImpute) is alsoin this section.2.2 Decision trees and decision forests2.2.1 Classification and regression decision treesIn this section the tree growth algorithm known as “Classification And Re-gression Trees” or in short, the CART algorithm will be explained. Thisalgorithm was first introduced by Breiman, et al. in 1984 [5]. A CART treeis a binary tree that is grown based on an iterative process of finding thebinary split point that gives the maximum purity gain at each node andusing this division point to split each node to two child nodes. Let Y be thedependent variable or outcome class that can be ordinal categorical, nom-inal categorical or a continuous number. If Y is categorical with k classes,its class takes values in C = 1, 2, . . . , k. Lets also define F as the set offeatures describing the data. Each feature (predictor) can also be ordinalcategorical, nominal categorical or continuous. Assuming this notation, the62.2. Decision trees and decision forestsgrowing process of a CART tree can be explained as described below.Tree growing processAs it was mentioned, the CART tree algorithm is an iterative process appliedto each node of the tree starting from the root (the node with all of theavailable samples). The aim of this algorithm is to find the best split definedas the split in data that can result in the maximum purity in the child nodes.In the basic implementation of CART it is assumed that the splits are uni-variate. Meaning that each split only depends on the value of one feature andone feature only. If F is the set of all available features and fi is a nominalcategorical feature in F with ki different categories, there exist 2k−1 − 1possible splits for this feature. If fi is an ordinal categorical or continuousvariable with ki different values there are ki − 1 different possible splits forfi. The iterative growth algorithm of CART for each node can be simplifiedas:• Find each predictor’s best split. In case of continuous or ordinalcategorical features, first sort the samples by the given feature. Thenfor each possible split from smallest to largest form the two child nodes.Given the child nodes one can calculate the splitting criterion or purityfunction for each split. The splitting criterion will be defined later.Choose the split with the highest purity. For each nominal categoricalfeature, examine all the possible subsets of the categories to find thepurest split. For the sorted predictor, go through each value from topto examine each candidate split.• Find the node’s best split. Among the splits selected for eachfeature in previous step, select the one with maximum purity (the onethat maximises the splitting criterion).• Split parent node to child nodes. Use the best split found inprevious step to divide the parent node to the child nodes.• Iterate. Until the criteria for maximum depth of the tree has notreached, apply the same algorithm to each child node.Splitting criteria and impurity measuresFor any given node t, and a given split s, the best split is the split thatmaximizes the splitting criterion ∆i(s, t). Which corresponds to a decreasein impurity i.72.2. Decision trees and decision forestsFor classification tasks (categorical Y ), there are three splitting criteriadefined for CART algorithm: Gini, Twoing, and ordered Twoing.Let us define P (t) and P (j, t) as the probability that a sample belongs tonode t and the probability that a sample in class j be in node t respectively.We can estimate these probabilities by :P (j, t) =pi(j)Nw,j(t)Nw,j, (2.1)P (t) =∑jP (j, t), (2.2)where pi(j) is the prior probability of the outcome class j and by definitionP (j|t) = P (j, t)P (t)=P (j, t)∑j P (j, t), (2.3)andNw,j(t) =∑h(t)wnfnI(yn = j), (2.4)in which wn and fn are the case weight and frequency weight associatedwith sample n, h(t) is the set of samples at node t and I(j1 = j2) is theidentifier function resulting in 1 when j1 and j2 are equal and is 0 otherwise.Gini criterionThe Gini impurity measure at node t is defined as:i(t) =∑i,jC(i|j)P (j|t)P (i|t), (2.5)in which C(i|j) is the cost of classifying a sample to class i given that itbelongs to class j assuming that C(i|i) is equal to zero. Using the Gini impu-rity measure we can define one of the most well known splitting criterion’s,the Gini decrease of impurity, defined as:∆i = i(t)− PLi(tL)− PRi(tR). (2.6)in which PL and PR are the probabilities that a sample is sent to the leftchild node or the right child node respectively and can be defined as:PL =P (tL)P (t), PR =P (tR)P (t). (2.7)82.2. Decision trees and decision forestsIt should be noted that if user specified costs are involved, altered priorscan be used instead of the empirical estimations. In this case the alteredprior can be defined as :pi′(j) =C(j)pi(j)∑j C(j)pi(j), (2.8)in whichC(j) =∑iC(j|i). (2.9)Twoing criterionThis Criterion is actually a goodness measure not an impurity measure. Soit should be maximised for each split. The Twoing criterion can be definedas:∆i(s, t) = PLPR[∑j|p(j|tL)− P (j|tR)|]2(2.10)Ordered Twoing criterionIn the case of ordinal categorical outcome classes, the ordered Twoing crite-rion is the purity measure of choice. The algorithm to calculate this measureis as follows:• First separate the class C = {1, . . . , k} of Y into two complementarysuper-classes C1 and C2 such that C1 is of the form C1 = {1, 2, . . . , k1}in which, k1 ∈ (1, 2, . . . , k − 1).• Using the purity measure i(t) = p(C1|t)P (C2|t) find the split thatmaximises the Twoing criterion shown in equation 2.10 on C1.• Find the super class C1 that results in best split (maximum gain inTwoing measure).Continuous dependent variableFor continuous outcome variables (regression trees) a splitting criterion sim-ilar to the equation 2.6 can be used. However, the impurity measure usedin this equation is different. A usual choice for the impurity measure is theLeast Squares Deviation (LSD) measure which can be described as:92.2. Decision trees and decision forestsi(t) =∑h(t)wnfn(yn − y(t))2∑h(t)wnfn, (2.11)in whichPL =Nw(tL)NW (t), PR =Nw(tR)NW (t), (2.12)andy(t) =∑h(t)wnfnynNW (t), (2.13)whereNw(t) =∑h(t)wnfn. (2.14)Stopping rulesStopping rules determine if the growth algorithm should continue on dividingeach child node into two other nodes or should it set the final nodes as leaves.The stopping roles can be set based on the application. But the ones usuallyused in CART are as follows:• If a node becomes completely pure, the tree growth stops. Meaningthat if in classification trees, all cases in a node belong to the sameoutcome class, or in regression trees, all the cases in one node havethe exact same number as the outcome variable, the node will not besplit any further.• If all samples in a node have the exact same values for all of thefeatures, the node will not be split any further.• If the tree depth reaches the predefined maximum limit set by the user,the tree growth will stop.• If the sample size at a node is less than the minimum threshold set bythe user the tree growth at this node will stop.• If in case of splitting the parent node into child nodes, the child nodesample size would be less than the minimum threshold set by the user,the parent node will not be split.102.2. Decision trees and decision forests• If for the best split possible at node t, the splitting criterion (∆I(t))is less than a user specified minimum purity gain, the node will not besplit.2.2.2 C5.0 decision treesThe C5.0 algorithm is based on the Iterative Dichotomiser 3 (ID3) algo-rithm first introduced by Ross Quinlan in 1986 [27]. Similar to the CARTalgorithm, this algorithm is based on iterative splitting of the sample spaceinto smaller nodes. However, unlike the CART algorithm, the first versionsof the ID3 algorithm only supported categorical features. Later improve-ments resulted in C4.5 algorithm which is the most well-known variationof the ID3 algorithm and in addition to supporting continuous variables,has several advantages over its ancestor, ID3. In this subsection we startwith a brief explanation of the ID3 algorithm, then we introduce the C4.5algorithm and finally the latest version of this family of algorithms, C5.0, isexplained.ID3 algorithmSimilar to the CART algorithm, the Iterative Dichotomiser 3 (ID3) algo-rithm grows a tree using a top-down greedy search through all the possiblesplits of training data for each feature at each node. However it uses in-formation gain as the measure of goodness for each split. Information andentropy are measures in information theory that can directly be used as im-purity measures for tree growth algorithms. In information theory, entropyof a sample set S that consists of c different classes can be defined as:Entropy(S) =c∑i=1−pilog2(pi), (2.15)in which pi is the probability that sample s belongs to class i and canbe estimated by the proportion of samples in class i relative to the wholepopulation. Given that in the above equation the logarithm function is inbase 2, the unit for entropy is Bits. In this equation if the probability ofa class is too small (pi =˜0) or pi is too large (pi =˜1) the entropy measurebecomes very small. So the entropy measure gives smaller values for puresubsets of samples. In other words, the more uniform the probability dis-tribution between classes becomes, the larger the entropy measure. If wedefine information simply as “lack of entropy”, information gain can be de-112.2. Decision trees and decision forestsfined as an splitting criteria that uses entropy as the impurity measure. Theinformation gain can be defined as :∆I(S, F ) = Entropy(S)−∑fi∈FSfiSEntropy(Sfi) (2.16)In which F is the set of all features of samples in S and the sum on fi isover the all the values in F . Assuming this function as the splitting criteria,the algorithm to grow an ID3 tree is as follows:• Given the samples in node t, calculate the splitting criteria (informa-tion gain) for all the features,• Select the feature with the largest information gain (ft) and the rela-tive splitting point as the optimum division points,• Use the optimum division points to split the samples at node t to twochild nodes,• If none of the stopping rules are true, continue growth of the tree onthe child nodes using the remaining features.The minimum set of stopping rules for ID3 algorithm are:• If all the cases in one node are from the same class (entropy=0),• If the Information gain is 0 or smaller than a pre-defined threshold,• If the number of remaining features is 0.It should be noted that other limits similar to the ones explained for theCART algorithm can be put on the number of samples at each node or thetotal depth of the tree in order to penalize over-fitting.C4.5 algorithmThe C4.5 is an improved version of the ID3 algorithm with three majorimprovements:• It can handle continuous features as well as categorical features. C4.5extends the ID3 algorithm by putting a threshold on the continuousfeatures and calculation of the Information gain based on the selectedthreshold,122.2. Decision trees and decision forests• It has an embedded method for handling missing values (this methodwill be explained in detail later),• It can incorporate user-defined weights for importance or cost of thefeatures,• It has an embedded method for pruning the tree.More information about this method can be found in a recent paper byQuinlan [28].C5.0 algorithmC5.0 algorithm is the latest version of the tree growth algorithms of thisfamily which has several advantages over the previous implementations, in-cluding:• More computationally efficient implementation in comparison to C4.5resulting in faster performance,• A more memory efficient implementation,• Results in smaller trees in comparison to previous versions which usu-ally results in smaller probability of over fitting,• Supports boosting. The C5.0 algorithm uses a process similar to ad-aboost [15]. In this process, first a conventional C5.0 tree is grown.Then the weights for each sample are calculated and subsequent it-erations are used to build weighted trees and rule-sets. Then theserule-sets and trees are used to generate class probabilities. Finally theaverage of these class probabilities is reported as the final prediction,• Supports using different weightings for samples and different mis-classification measures,• Supports winnowing, an embedded feature selection method that isparticularly useful if the number of features are large but sample sizeis relatively small. This process is done as follows: First the samplesare randomly split in half and one conventional C5.0 forest is grownusing the first half of data. The effect of removing each feature on theperformance of the first tree is determined using the other half of thedata. If there is a feature that it’s removal does not increase the totalerror rate, that feature will be removed from the set of features usedfor growth of the final tree.132.2. Decision trees and decision forests2.2.3 Decision forestsDecision trees offer a fast and easy-to-interpret classifier for simple classi-fications tasks but generally result in weak classifiers that easily over-fit,especially if the sample size is small. An idea to increase the classificationpower of decision trees and avoiding over-fitting at the same time is to en-semble decision trees and form a decision forest. Ensembling is known as avery effective method that can improve the performance of any single clas-sifier. In a similar way, the ensemble of decision trees as a forest is expectedto always outperform a single decision tree. The idea of decision forestsbecame popular after a paper by Ho, et al. in 1995. In [18] they show thatif trees of a forest are trained on a set of randomly selected features, theaccuracy of the forest grows by increasing the number of the trees of the for-est. This observation that forests get more and more accurate as the modelgrows (gets more complex) is in direct contrast to the common belief thatcomplexity of a classifier can only be increased to a certain point before itis reduced do to over-fitting. The key for this unique advantage of randomforest over other classifiers is in randomising the basic classifiers to gain anensemble of independent estimators of the outcome label. More detail onthe importance of randomisation in random forests can be find in a paperby Kleinberg, et al [22].The current state of the art decision forest method is based on the al-gorithm proposed by Breiman, et al. [4]. In this work Breiman uses twomain tricks in order to ensure a fully randomised forest. First, randomlyselecting a “bag” of samples for each tree, introduced for the first time byBreiman, et al. And second, randomly sampling a subset of features for eachtree, introduced by Ho, et al in [19]. Breiman also introduced methods forcalculating a feature importance measure based on a forest by calculating adistance measure between samples and calculation of error rates based onout of the bag samples. In the original implementation of random forest byBreiman, the decision trees were grown using an algorithm similar to theCART algorithm and the randomisation was introduced to each tree usingthe bagging and randomised feature selection methods. There exist differ-ent implementations of random forest that have put more emphasis on therandomisation of the base classifiers. For instance, an idea introduced byDietterich, et al. in [12] is to ignore the optimum division step in CARTalgorithm and choose a random split for the data at each node of each tree.This results in a faster classifier but because it also results in weaker basetrees it may limit the performance of the final forest.142.2. Decision trees and decision forestsAlgorithmDecision forests use the general technique of bagging or bootstrap aggrega-tion with replacement to re-sample nb number of bags. Each one with thesame size as the original training set. For instance if the training set consistsof n samples S = {s1, s2, . . . , sn} described by k features F = {f1, f2, . . . , fk}and an outcome class C = {c1, c2, . . . , cn}, each bag of samples will also con-sist of n samples randomly chosen (with replacement) from the same sampleset. We hereby name these samples as Sb. These samples are described bythe same set of features and corresponding outcome classes (Cb). After bag-ging samples, each bag is used to train a decision tree that is grown solelybased on Sb and uses F to predict Cb. In the testing phase, the predictionsof these trees can be merged either by majority voting or by averaging theprobabilities using the following equation:P (C) =1nb∑b∈Bpˆ(Sb, C) (2.17)In which P (C) is the probability of class C predicted by the whole forestand pˆ(Sb) is the probability of class C predicted by the tree trained on bagb.If grown too deep, each decision tree will over-fit to the training dataand result in a low-bias but high-variance classifier. The boot strap andensemble trick does not have any effect on the bias but reduces the varianceof the final classifier as the number of the trees grow. However, this istrue with the assumption that the resulting decision trees are uncorrelated.Otherwise the resulting trees will be very similar and averaging the similartrees does not have significant effect on the variance of the final classifier.The number of the bags and corresponding trees (nb) is a parameter rangingfrom a few hundreds to several thousand trees depending on the nature ofthe dataset and the sample size and can be optimised by cross validation orby observing the out-of-bag error. The decrease in out of bag error seemsto diminish after the number of the trees in the forest becomes large. Therandom forest growth algorithm also randomly selects a subset of features(for classification, usually√k, in which k is the total number of features) togrow each tree of the forest. This process helps in building a larger numberof uncorrelated trees in the random forest.152.3. Handling missing valuesVariable importance measureBreiman also introduces a novel way to measure importance of each fea-ture embedded in the random forest. The first step in order to measureimportance of a feature is to grow a random forest using the training dataavailable. Then the out-of-bag errors for each data point is calculated andaveraged over the whole forest. Now for each feature f , we permute valuesof this feature from the whole data set (replace it with a randomly generatedvalue in the same range). Then we re-calculate and average the out-of-bagerror. The importance of each feature is the difference in the total out-of-bag error before and after permutation normalised by standard deviation ofthese differences. The features that result in larger differences are ranked asthe most important features. This method has a few drawbacks, for instancethis method is biased to give higher importance to categorical variables thathave many different levels in comparison with features with fewer levels. Away to overcome this problem is to use methods like growing unbiased trees[30] or partial permutations [1].2.3 Handling missing valuesMissing data is a well known problem that if not handled correctly, cansignificantly affect the accuracy of any statistical inference performed on adataset. This problem might be caused by human factors during the dataacquisition stage. For instance patients that drop out of the study beforethe data acquisition is complete. Or it can be a natural possible state forthe target variable. For instance the age of the spouses of patients in case ofsingle patients. To decide how to handle missing data, it is helpful to firstlearn about the usual assumptions about missing data. The missing valuescenarios can be divided into four general types or classes:• Missing completely at random. A variable is missing completelyat random if the probability of a data point being missing is the samefor all samples. For example, if the decision that each patient shouldor should not undergo a specific clinical exam is taken by generatinga random number or rolling a dice. If data is missing completely atrandom, then throwing out cases with missing data does not bias thestatistical inference.• Missing at random. Most of the time data points are not missingcompletely at random. For instance, the probability that a patient isrequired to undergo additional examinations might be taken based on162.3. Handling missing valueshis preliminary clinical test results. As a general assumption in thisscenario, it is assumed that the probability that a variable is missingdepends only on the available information. Thus, if preliminary datafor a patient consists of age, sex and race, then it is assumed thatthe probability that each patient will undergo more examinations issolely dependant on these fully available parameters. In this case itis reasonable to assume a model for this process. One example isassumption of a logistic regression model, where the outcome variableequals 1 for observed cases and 0 for missing. In this scenario, anydata point can be removed from the study as long as it does not affectthe assumed model for probability of the data-point being missing.• Missing that depends on missing parameters. In this scenarionot only the data is not missing at random, the probability that eachdata point is missing depends on the variables that are also missing.For instance, an example of this scenario is when the probability that acancer patient is sent for an MRI scan is dependent on ultrasound pre-screening results that are not available at the analysis time. Anotherfamiliar example from medical studies is that if a particular experimentcauses discomfort, a patient is more likely to drop out of the study.These data points are not missing at random (unless discomfort ismeasured and observed for all patients). If the data points are notmissing at random, they must be explicitly modeled, otherwise addingbias to the statistical inference will be inevitable.• Missing that depends on the missing value itself. This sce-nario makes handling missing values difficult not only because the datapoints are not missing at random, but also because the probability ofmissing a data point depends on the data point that might be missing.For instance in case of heart disease patients, if the blood pressure ofthe patients is saved only if it is outside of the normal range, we aredealing with missing values that are also determining the probabilityof them being missing. Another example is the case of censoring ofdata. For example in case of financial surveys, people with very highearnings may be less likely to report their actual salary. In the ex-treme case (for instance, if all participants earning more than $100k ayear refuse to report their earning) a large part of the dataset will bemissing and the probability of lack of the income variable depends onthe income itself.172.3. Handling missing valuesIt is correct that when data is not missing at random specially whenit depends on the missing values themselves, it is hard to compensate forthe bias introduced into the inference algorithm. However, this bias can bemitigated by using the available variables. For instance, available featureslike age or sex or preliminary clinical data can be used in order to guesswhether a patient’s blood pressure would be out of the standard range andif it is higher than average or lower. Or in case of financial surveys, it canbe assumed that if a person has higher education and is above a certain agehe or she will have higher income. We then can use that information tocompensate for relative bias in inference. These methods can not yield in agood estimation of the missing values but can certainly help in achieving abetter model of the blanks in data.It should be mentioned that it is in general impossible to prove whetherdata is missing at random or not. If it is, the process will be simple becausewe can model the probability of data being missing based on the availablefeatures, if the data is not missing at random we try to add as many relatedparameters as possible to the model so the missing at random assumptionbecomes reasonable. For instance, it is correct that the probability of missingthe blood pressure is dependant on itself, but because the blood pressureand heart rate are indirectly related, adding the heart rate to the model canhelp in modeling the probability of missing of the blood pressure values.This approach helps in reducing the bias caused by removal of the datapoints from a study. The next step will be to fill-in the missing data pointsor removing them in a way that has the least negative effect on the finalclassifier used on the final completed dataset. This approach is the basisof the imputation methods introduced in literature. We first investigatethe general imputation methods that do not assume any specific model indata other than the missing at random assumption. We also introduce theregression based methods that treat each missing variable as a regressiontarget, for instance for linear regression. These methods can model complexrelationships in data but are still independent of the final classifier used onthe imputed data so might not yield in the best classification performance.We also investigate imputation methods that are proposed specifically forour target classifiers: decision trees and forests.2.3.1 Data removal methodsThe simplest and maybe most common approach in dealing with missingdata is simply ignoring the part of data that is missing one or more of thefeatures. This can be done in two general ways, removing columns (features)182.3. Handling missing valuesor rows (samples) from the data matrix.Sample removalThis approach is most useful in cases when the number of samples with oneor more missing values is small or a set of samples are missing a large set offeatures. This method may decrease the accuracy of the final classificationbecause it reduces the total sample size but lets us train a more complexmodel because it preserves all of the features.Feature removalIn contrast to the sample removal approach, the feature removal approach isusually selected when almost all of the samples are missing a set of specificfeatures. This method decreases the accuracy of the final classifier by re-moving some of the potentially useful features but it is the simplest methodthat can preserve the sample size of the dataset.2.3.2 General imputation methodsZeroA simple approach to the imputation problem is to just replace all themissing values with a constant value: Zero. This approach originates fromthe natural assumption that in the absence of input signal in a sensor, therecorded value should be zero. This method sometimes significantly out-performs the data removal methods because it preserves both samples andfeatures but obviously it introduces a bias towards smaller values.Random guessAnother approach is replacing the missing values with a random guess in theacceptable range of the missing value. This approach matches a scenario inwhich a data acquisition system yields in white noise in the absence of inputsignal. This method also makes use of all data by filling in the missingvalues. Therefore, it may perform significantly better than data removalmethods. As an advantage of this method, it guarantees that no unwantedcorrelation will be added between different features. In other words, thefeatures that are statistically independent will stay statistically independentafter the imputation. This is a necessity that the constant value replacementmethods usually do not guarantee.192.3. Handling missing valuesReplacement with meanAnother well known method that is frequently used for handling missing datain large datasets is replacement by the mean value. This approach originatesfrom the statement that without any other knowledge, the average of thefeature over the whole population is the best estimation of the real value andyields in the smallest average error. Assuming that the sampling populationis balanced and the final classifier uses the deviation from mean as an errormeasure (which is the common scenario in regression models), this method isthe simplest method that can introduce a minimum bias into the dataset. Incase of a normalised dataset with mean value of 0 and standard deviation ofone which is the usual scenario in many data analysis problems, this methodis the same as the zero imputation method.Replacement with medianFor very large datasets, this method performs the same as the mean re-placement method. However, in smaller datasets it results in a more robustsolution which is less dependent on the outliers.KNN imputationThe mean and median replacement methods might result in a reasonableestimation of the expected value for missing values over the whole popula-tion but do not necessarily give the best local estimation. The K-Nearest-Neighbors (KNN) method is based on replacement of missing values locallywith mean or median of the neighbor data points selected via a distancefunction. Although KNN method seems like a very simple and naive ap-proach, it has some advantages comparing to some advanced imputationmethods. For instance, it can predict both discrete attributes (the most fre-quent value among the k nearest neighbours) and continuous attributes (themean among the k nearest neighbours), so there is no necessity for creatinga predictive model for each attribute that has missing data. But as a majordrawback, it has to compute distance to all the samples in the dataset foreach sample that needs to be imputed. In large datasets this becomes amajor issue. Even with this drawback, the KNN method is used in manymedical data analysis studies (for instance [3] and [21]) and many generaldata analysis applications as a simple and robust imputation method.202.4. State of the art tree-based imputation methodsRegression based imputation methodsAnother approach to the missing value problem which is specifically usefulwhen the missing values are distributed only between a small set of featuresis using regression for imputation. These methods try to predict missingvalues of each feature using the other available features. These methodsalso make use of all the samples and all the features in order to modeland predict missing values in each feature so they may be more effectivethan methods which only use local data for estimation of each missing value(like KNN). However, assuming a wrong model between features may haveunexpected effects on the final classification. For instance, using a linearmodel for regression may introduce correlations between different featuresthat did not previously exist in the data. The regression methods also cannot guarantee the same error rate for all the predictions because missingvalues in each feature are predicted using a potentially very different set ofsamples depending on the distribution of missing values among features andsamples. Although these weaknesses limit the power of regression methodsin many applications, these methods are among the most popular methodsused in literature especially in medical applications (for instance [8], [34],[2]).2.4 State of the art tree-based imputationmethodsThe imputation methods explained in the previous section can be used withany type of classifier including decision trees or random forests. But it shouldbe noted that the aim of imputation is not finding the best estimation ofthe missing values. It is finding a set of values that cause the minimumreduction in the accuracy of the final classifier. With this goal in mind, theimputation methods that are designed to work with a specific classifier, inour case the tree-based classifiers, are the best choice. The three most wellknown methods for this purpose are introduced in this section.2.4.1 CART embedded imputation method: surrogatedivisionsThe missing data problem in CART algorithm can be divided into twosmaller problems. First, finding the optimum division points and second,assigning samples to child nodes at each division. Let us re-examine the212.4. State of the art tree-based imputation methodsinformation gain equation used for finding the optimum division point:∆I = I(t)− p(tL)I(tL)− p(tR)I(tR) (2.18)In which I is the impurity function and P is the probability that a sample inthe parent node belongs to the corresponding child node. Considering thatI(t) is calculated on the parent node, the missing features do not have anyeffect on this term of the equation above. However, the left and right childnode impurities and the probability that a sample belongs to these nodesis affected by the missing features. In the CART algorithm for every givenfeature, tested for the impurity gain, these values are calculated using thesamples that are not missing that particular feature.Given that the division points are calculated using the explained method,the problem of assigning samples with missing features to each child noderemains unsolved. The method used in the CART algorithm for this purposeis “surrogate divisions”. This method proposes that for each optimal divisionwhich is found using the method explained above, we find a “surrogate”division that uses one of the other features to split the data in a similarway as the original optimal division. In other words, given a known divisionof the complete samples in the parent node, lets grow a tree of length onethat can split the samples into the same child nodes using a different feature.Suppose there are n predictors (x1, x2 . . . xn) included in the CART analysis,lets assume that there are missing values only for one of the features. Inthis case, x1 which happens to be the best predictor chosen to define theoptimal split. The split necessarily defines two categories for x1. This x1feature now actually becomes a binary response variable that splits the datainto two classes, left and right nodes. Then a tree of depth one (a singlesplit) is grown that uses x2 . . . xn as potential splitting variables and x1 asthe response variable. The next step is to rank the n− 1 possible predictorsby the proportion of cases that are inevitably misclassified. The surrogatesplits that do no better than the marginal distribution of the missing featureare ignored and removed from the list of surrogate divisions. The best splitbased on this ranking is then used to divide the samples with missing valuesinto the child nodes. In other words, the class predicted by the surrogatesplit is then used in order to split the data similar to the original divisionwhen x1 is not available. If a sample is missing the optimum division feature(x1 in this example) we then use the best surrogate division instead. If thebest surrogate division is also missing, we use the second best and so on. Ifnone of the features are available, the sample is assigned to the child nodethat the majority of samples have been directed to. In the implementation222.4. State of the art tree-based imputation methodsof the CART algorithm in R language (rpart package [33]) there are threeways to deal with missing data:• Display only. Samples with missing values are completely ignored andare not passed to deeper nodes in the tree.• Use surrogates. Split subjects with missing values according to thesurrogate divisions, if all of the surrogates are missing, ignore theobservation.• The same as the second option, but if all surrogates are missing, assignthe samples with missing value to the child node with the majority ofcomplete samples.In practice when a small portion of data is missing completely at random(MCAR) or missing at random (MAR), this method provides a robust andeffective solution. However, if a large portion of the data is missing, con-secutive surrogate divisions will be very likely to completely mis-guide thedecision tree. Another issue that is very common in scenarios with verysmall sample size is skewed data. As mentioned in [17] this problem can bemade worse by this imputation method. These weaknesses motivate us toexamine other methods for imputation of the missing values, for instancethe embedded method of C5.0 decision trees.2.4.2 C5.0 embedded imputation methodC5.0 is the new version of the C4.5 algorithm which is one of the mostwell-known decision tree growth methods used today. Besides the basicdifferences between a C5.0 decision tree and a CART decision tree, the twoalgorithms basically use the same criteria for finding the optimum divisionpoints. However, unlike the CART algorithm that simply ignores the missingvalues in calculation of the information gain in equation 2.18, the C5.0method uses a modified version of the information gain equation as it canbe seen bellow.∆I =N −N0N∆I(t− t0). (2.19)In which N is the total sample size, N0 is the number of samples thatare missing the feature tested by the C5.0 algorithm and ∆I(t − t0) is theimpurity decrease assuming that only the samples that possess the respectivefeature are present. In simple words, the C5.0 algorithm calculates theimpurity decease in the same fashion as the CART algorithm does, but232.4. State of the art tree-based imputation methodsit assigns a smaller weight to the attributes that are missing from a largeportion of data. As the second difference to the CART algorithm, C5.0 doesnot use surrogate variables in order to assign samples with missing valuesin parent node to the corresponding child nodes. Instead, the samples arefragmented into fractional cases and then assigned to the corresponding childnodes. For instance, if child node i has Ni samples, the missing samples inchild node i will have weights equal to Ni/(N−N0). These weights will thenbe used in order to weight the class probabilities in the leaves and calculatethe probability of each class at each leaf.The method used to handle missing data during the training phase of theC5.0 algorithm is straightforward. However, handling the missing values inthe testing phase is a different story. Let P be the classification result of thetest case Y using a C5.0 tree named T . There are three possible scenarios:• If T is leaf (a tree of depth 0), P is found by the relative frequency oftraining cases that belong to the leaf.• If T is a tree of depth one or more and all the features used as divisionpoints in T are available for Y , P is found by the relative frequencyof training cases that belong to the same leaf as Y .• Otherwise, all the possible outcomes of the decision tree (all the leavesthat Y might belong to) are are explored and combined probabilisti-cally, giving:P =k∑i=1NiN −N0Pi, (2.20)in which Pi is the probability of each class given that Y belongs toleaf i. N is the total training set sample size, Ni is the number ofcorresponding training samples at leaf i, N0 is the number of trainingsamples at leaf i with missing value (each N might be fractional).When the probability P for each class of the outcome is calculated, theclass with the largest probability is chosen as the classification result. TheC5.0 algorithm has a more sophisticated approach for dealing with missingdata problem. Nonetheless the approach is designed for a single decisiontree. In comparison to a random forest, decision trees are very prone toover-fitting. In the next section, we investigate the embedded imputationmethod for random forests, rfImpute.242.5. Summary2.4.3 Decision forest embedded imputation method:rfImputeThere are two main methods for dealing with missing values embedded inthe implementation of random forests by Breiman, et al. [4]. First is therough-fix method. This method is a simple and naive approach which uses atechnique similar to the median imputation explained in previous section toestimate the missing values for continuous variables and assigns the majorityclass for the categorical variables.The state of the art embedded imputation method for random forestsis rfImpute. This method starts with the rough-fix method as an initialestimation of the missing values, grows a random forest based on the imputeddataset, calculates the proximity matrix based on the resulting forest andthen updates the missing values relative to the proximity matrix. Thisprocess is repeated for the new imputed values for a few iterations. Inorder to explain this method in more detail, let’s first see how the proximitymatrix is formed. The proximity matrix is an intrinsic measure of similaritybetween samples in a forest based on the number of times that two samplesland in the same leaf in each tree of the forest. The values in the proximitymatrix are calculated as follows:Given that all the samples are run down each tree of the forest, for eachtree in the forest add 1 to the proximity measure between i and j if theyboth land in the same leaf. Then divide the whole matrix by the numberof the trees in the forest and set the main diagonal of the matrix to 1.This matrix is embedded within the random forest growth algorithm and asBreiman, et al., mentioned in [4], the values 1− prox(i, j) can be interpretedas squared distances in an Euclidean space of high dimension. Each row ofthis proximity matrix is then used in order to calculate an estimate of eachmissing value based on weighted average of the corresponding feature inother samples. For the categorical features, each class is weighted relative tothe proximity measures and the most probable class is chosen. This processis usually repeated for 5 to 6 iterations. Although the method explainedis iterative and therefore slow, it has proven to be an effective imputationmethod designed specifically for random forests.2.5 SummaryIn this section we provided a detailed review of the most common tree growthalgorithms, an overview of different types of missing data, and some generalpurpose imputation methods. Then we focused on the imputation methods252.5. Summarydesigned specifically for tree-based classifiers like decision trees and randomforests.In the work presented in this thesis, we select the state of the art imputa-tion method embedded in decision forests (rfImpute) as a natural choice fora baseline imputation method. In the multimodal test scenario of prostatecancer we compare the proposed method with two data-discarding methodsin which we simply drop one or the other dataset (the single modal forest andthe multimodal forest), two forest-based imputation methods (C5.0 forestand rfImpute) and two other general purpose imputation methods, namelyreplacing the missing values with zero, and replacing with the weighted av-erage value of the K nearest neighbors (KNN) [3].26Chapter 3Scandent tree: a forest basedmethod for multimodalclassification3.1 IntroductionMissing data is a well known problem in data analysis and machine learning,however missing data handling in a multimodal study is more challengingbecause it might not hold some of the basic assumptions in usual missingdata scenarios. For instance, the usual assumption in missing value handlingis that only a subset of features are randomly missing from a subset ofsamples (typically 10% to 30% of data). However, in multimodal scenariosit is common that a large set of samples is missing a large set of features,often belonging to the same modality.Incomplete multimodal datasets are common in biomedical experiments,where a usual scenario is to examine new protocols or new modalities andthe relationship between them. For instance, joint analysis of medical imag-ing modalities and genetic biomarkers is an attractive subject for biomed-ical research. However, since clinical analysis of both imaging and geneticbiomarkers is not common, it is hard for biomedical researchers to buildlarge datasets of this type in a time and cost effective manner. This makesdealing with very valuable but very small datasets a common scenario.On the other hand, some of the modalities used in a multimodal studymight be part of the standard clinical procedures separately. For instance,although it is impractical and expensive to do MRI and genetics analysis oneach and every patient that visits a clinic, the MRI data of a large numberof patients is easily accessible from medical imaging archives of hospitals. Sothe missing value problem in multimodal scenarios can usually be formulatedas merging a small valuable multimodal dataset with a large but single modaldataset for enhancement of a multimodal data analysis task.In this chapter we intend to introduce a new method named “scandent273.2. Methodtrees” that is based on decision forests and is specifically designed for thistask. In this chapter first we explain the concept and implementation of theproposed method in detail then evaluate the proposed method in differentsample sizes and feature set sizes by simulation of an incomplete multimodalscenario using publicly available datasets. Finally, we examine the proposedmethod on a real incomplete multimodal dataset, a joint analysis of MRIand genetics features for prostate cancer detection.3.2 Method3.2.1 Mathematical formulationLet us assume that the training data consists of at least one single modaldataset defined as S = (s1, . . . , sNs) (in which each si is a single modalsample) and at least one multimodal dataset defined as M = (m1, . . . ,mNm)(in which each mi is a multimodal sample). These two datasets are describedrespectively by the single modal feature set Fs and the multimodal featureset Fm, where Fs ⊂ Fm. We do not set conditions on the feature or samplesizes but in practical scenarios, usually the multimodal dataset has fewersamples (Nm < Ns). Also the single modal set is missing some of the morediscriminative features. In this section we aim to train a classifier using bothS and M that can predict the outcome class C, for any test data describedby Fm. In other words, we want to make use of a single-modal dataset foroptimisation of a multimodal decision forest.3.2.2 IntuitionAssuming the decision tree model for a classification task, two main sourcesof error can be imagined. First is the error caused by in-efficient partitioningof the sample space by the decision tree. And second, the error in estimationof the outcome class probabilities at each leaf. As an advantage of having allthe important features, decision trees formed by the multimodal dataset areexpected to partition the feature space very effectively. However, because ofthe low multimodal sample size, the estimation of the outcome probabilityat each leaf may not be accurate. The proposed method tries to reduce theprediction error at each leaf of the multimodal tree by using single modalsamples that are likely to belong to the same leaf.In order to find these single modal samples, a feature space partitioningalgorithm is needed that can simulate the feature space division of the targetmultimodal tree on the single modal dataset. The proposed method is to283.2. Methodgrow single modal trees (trees that only need the available feature set) thatmimic the feature space division structure of the multimodal decision tree(a tree that needs all of the features for classification). Although it can notbe guaranteed that such tree exists, we try to provide an estimation of thedivision boundaries of a multimodal tree by breaking it into smaller sub-trees and trying to estimate simple division boundaries of these trees usingthe set of available features.This technique is expected to be more effective than the imputationmethods because it eliminates the need to know the exact value of eachmissing feature and only relies on predicting the feature-space partition thateach sample belongs to. This can be a much easier task.By using such a technique we expect that we can use high-level rela-tionships between modalities in order to merge datasets. For instance, weknow that in usual scenarios the values of two different modalities like MRIand PET might not be predictable by each other. However, assuming thatthey are not statistically independent, the local trees might be able to avoidprediction of the exact values and instead translate the knowledge of themissing modality in form of given that a patient belongs to a partition ofthe PET feature space, what is the probability that the patient belongs toa partition in MRI feature space?. For instance, if a patient is similar toanother patient in PET space, is it likely that these patients are also similarin MRI space ?. We are trying to show that sometimes, the answer to thisquestion is enough for a more accurate classification and we wont need topredict exact values of a modality by the other one.Growing a tree that follows the structure of another tree from the rootto the top brings analogy to the behaviour of “scandent”trees in nature thatclimb a stronger “support” tree. Considering this analogy, the proposedmethod can be divided into three basic steps: First, division of the samplespace by a multimodal decision tree, called “the support tree”. Second,forming the single modal trees that mimic the structure of the support tree,called “the scandent trees”. And third, leaf level inference of outcome labelC, using the multimodal samples in each leaf and the single modal samplesthat are most likely to belong to the selected leaf.3.2.3 Support treeThe first step in the proposed method is growing a decision tree to predictthe outcome class based on the multimodal dataset. This tree can be oneof the trees in a decision forest or an individual tree grown using any of thewell known methods, such as C4.5 [28] and CART [33]. The method used293.2. Methodin this paper for growth of the support tree is based on the implementationof CART algorithm in the package “rpart” in R language [32].Assuming that the tree is grown and optimized using the multi-modaldataset M , there are two steps that might be the source of classificationerror in the tree: Division of sample space at inner branches, and majorityvoting at the leaves. The sample space division requires sufficient samplesize at each division point which becomes an issue as the tree gets deeper.However, ensembling within the forest paradigm compensates for occasionalincorrect divisions at inner branches, leaving majority voting at the leavesas the critical step to get a precise estimation of probability of the classlabel. This error can be compensated for by the scandent trees.3.2.4 Scandent treesThe second step is to form the scandent trees which enable the assignmentof single modal samples to the leaves of the support tree. The process offeature space division in the support tree can be considered as grouping themultimodal data set M to different multimodal subsets. Let us define thesubset of the samples of M in the ith node as Mi and the feature used forsample space division at node i as fi. For any arbitrary choice of node j, andit’s immediate parent node i, we define node j as a ’link node’ if fi belongsto a different feature set from fj , or if node j is either the root node or aleaf. In other words, node j is a link node if and only if :Node j is the root node,orNode j is a leaf node,orfj ∈ Fs and fi /∈ Fsorfj /∈ Fs and fi ∈ FsIntuitively, the link nodes are the nodes that mark the roots and leaves ofthe largest sub-tree that uses only one modality for feature space partition-ing. For each division node i in the set of the link nodes of the support tree,there exists a set of nearest child link nodes and child leaves j1, j2, . . . jki. Wedefine Ti as an optimum tree that can divide the set of multimodal samplesat node i (Mi) to the set of multimodal samples at each child node (Mj)using the feature set Fs. The pseudo-code for forming a scandent tree is asfollows:303.2. MethodFor each link node i in the support tree,{For each sample n in Mi and each node j in set ofnearest child link nodes and child leaves of node i{if n ∈Mj ,C′i,n = j}Grow Ti, as optimum tree that for each sample n in Mi,predicts C′i,n using only Fs.}The above algorithm forms local trees Ti for each node i that divide Mito the child subsets Mj , using only the single modal features Fs. Here C′is a new categorical label-set defined for the corresponding local tree. Foreach sample in the parent node, the C′is assigned in a way that the samplesbelonging to a specific child node j are mapped to the same category withinC′.For each node i, if fi ∈ Fs, then Ti is expected to divide Mi to thechild subsets (Mj) with perfect accuracy. But if fi /∈ Fs, then Ti will beoptimized to form the smallest tree that can divide the sample space in asimilar manner to the support tree. Using Ti’s for feature space division ateach node, we can form a new tree that consists of the same link nodes asthe support tree but only uses features of a single modal (Fs) for samplespace division, we name this single modal tree, a scandent tree. Since Ti’sare single modal trees, they can be used to predict the probability that eachsingle modal sample s belongs to link node j, calculated by:p(s ∈ Nodej) = p(s ∈ Nodej |s ∈ Nodei)p(s ∈ Nodei)in which Nodei is the parent link node of Nodej , the term p(s ∈ Nodej |s ∈Nodei) is estimated by the corresponding sub-tree Ti and p(s ∈ Nodei) iscalculated by recursion.This method is expected to be generally more accurate than direct esti-mation of the leaves by other single modal classifiers. Because the scandenttree only has to predict the division boundaries for features that do notbelong in Fs and other divisions will be perfectly accurate.Given the small multimodal sample size, the local trees could be proneto over-fitting if only the few samples in the corresponding link nodes are313.2. Methodused for training Ti’s. We overcome this problem by using all of the availablemultimodal samples (M) for training of each local tree by running the wholemultimodal training set through the corresponding sub-tree of the supporttree. This will give each sample in the multimodal dataset a label from theset C′. This method adds more multimodal samples to the parent link node(Mi) and each child link node (Mj). This results in better estimation of Ti.We found that using this trick adds to the robustness of the scandent treemethod. Figure 3.1 shows a simple diagram of the proposed method.Figure 3.1: Diagram of the proposed method for growing the scandent trees3.2.5 Leaf level inferenceThe standard method for leaf level inference is majority voting. However,if there are a large number of single modal samples misplaced by the scan-dent tree, they might flood the original multimodal samples. The proposedmethod is weighted majority voting by non-uniform re-sampling from eachleaf i and then calculating the probability of outcome C using the resampleddata. We define the re-sampling probability of each sample x in leaf i as:p(x)re−sample,i =1/N, x ∈Mip(x ∈ Leaf i)/N, x /∈Mi &p(x ∈ Leaf i) > q0, x /∈Mi &p(x ∈ Leaf i) < qIn which q is the selected minimum threshold for the probability that323.2. Methoda single modal sample belongs to the selected leaf i, and N is the totalnumber of samples in leaf i (single modal and multimodal). As q valueincreases, the probability that a misplaced sample is used in the leaf levelinference is reduced. This may increase the accuracy of the majority votingbut increasing q will also reduce the number of single modal samples at eachleaf resulting in low precision of the probability estimation. This trade-off ismore evident at the two ends of the spectrum, for q = 1 the tree will be thesame as the support tree which suffers from low sample size at the leaves.For q = 0 all the single modal samples will be used for inference at each leaf.The optimization of the q parameter for each leaf is essential for optimalperformance of the resulting tree. This can be done by cross validationover the multimodal dataset, using out of the bag samples in case of adecision forest. Using non-uniform re-sampling instead of majority votingensures that the single modal samples at the leaves are randomized. Thisrandomization is critical because the single modal samples are not randomlyselected in the scandent tree growth algorithm and without re-sampling,there is a possibility that many of the scandent trees in the resulting forestare not independent. This would violate one of the basic requirements oftree ensembling in a decision forest.Although the proposed algorithm is explained only for one single modaldataset, the same method can be applied on different single modal datasetsusing the same support tree. As a result, the proposed framework can beused flexibly when different subsets of features are missing.3.2.6 ImplementationFor building the support trees, we randomly bagged two-third of the mul-timodal samples and randomly selected the square root of the dimensionof the multimodal feature set as the feature bag. This bootstrapping andbagging phase is done separately for each of the outcome classes to ensurebalanced class labels. Then the scandent trees are formed and for each leafof each support tree in the forest the q parameter is optimised using thecorresponding out of the bag samples.After growing and optimizing each of the trees, the probability of out-come class C is calculated by averaging the corresponding probabilities of alltrees in the forest. We use the R package “rpart” [32] both for growing eachsupport tree and each of the local single modal trees (Ti’s). This packageuses internal cross validation to form the optimal tree. But for the purposeof controlling the bias-variance of the resulting forest, the depth of supporttree is limited by controlling the minimum of samples needed for each divi-333.3. Evaluationsion. The depth of Ti’s in each scandent tree is optimized by internal crossvalidation.3.3 EvaluationWe simulate the missing data scenario using three publicly available datasets.First is a dermatology dataset which is multimodal in nature, therefore weonly need to simulate the size of the single modal and multimodal datasets.Second is a heart disease dataset which is formed by subsets that are par-tially overlapping in terms of features. However, it does not match thedefinition of a multimodal dataset in the sense that the missing features donot belong to separate modalities. For this dataset we simulate modalitiesby intentional feature removal from one of the subsets. The third dataset isa breast cancer dataset which is neither multimodal nor multi-source, so weshould simulate both the modalities and the single modal and multimodalsub-sets by random sampling and feature removal. A brief description ofeach dataset and evaluation method is presented below.3.3.1 Evaluation using benchmark datasetsDermatology datasetIn order to test the performance of the proposed method for different sam-ple sizes we simulate an incomplete multimodal dataset by discarding a setof features from a complete multimodal dataset. Because the dermatologydataset is a complete multimodal dataset, we have the opportunity to sim-ulate the target incomplete multimodal scenario by intentionally removingone of the modalities from a set of randomly selected samples.This set is a natural example of a multimodal dataset publicly availablethrough the University of California Irvine (UCI) database [23]. This datasetconsists of two distinct feature sets, an easily accessible feature set obtainedduring clinic visit of each patient (for instance, age of each patient) anda harder to access feature set that is acquired by further histopathologicaltests in a laboratory (eg. Melanin incontinence observed in skin samples).The total sample size of this dataset is 357, the clinical feature set size is 12,while the size of the histopathological feature set obtained in the laboratoryis 22. The outcome class is the diagnosis of one of six dermatology diseases.For this simulation we examine the classification task of one disease class(Seborrheic Dermatitis) vs other classes. We examine the performance of theproposed method for different multimodal sample sizes by assuming that the343.3. Evaluationhistopathological features are missing from a subset of samples. We changethe size of the multimodal sub-set (the set with no missing features) andreport Area Under Curve (AUC) as a function of the multimodal samplesize in comparison to the state of the art imputation method for randomforests (rfImpute). A list of clinical and histopathological features used inthis study is shown in the Table 3.1. The outcome class is selected fromthe dermatology diseases shown in the same table.Table 3.1: List of features and the outcome classes, dermatology datasetHistopathological features clinical features dermatology diseasesMelanin incontinence Erythema PsoriasisEosinophils in the infiltrate Scaling Seboreic dermatitisPNL infiltrate Definite borders Lichen planusFibrosis of the papillary dermis Itching Pityriasis roseaExocytosis Koebner phenomenon Cronic dermatitisAcanthosis Polygonal papules Pityriasis rubra pilarisHyperkeratosis Follicular papulesParakeratosis Oral mucosal involvementClubbing of the rete ridges Knee and elbow involvementElongation of the rete ridges Scalp involvementThinning of the suprapapillary epidermis Family historySpongiform pustule AgeMunro microabcessFocal hypergranulosisDisappearance of the granular layerVacuolisation and damage of basal layerSpongiosisSaw-tooth appearance of retesFollicular horn plugPerifollicular parakeratosisInflammatory monoluclear inflitrateBand-like infiltrateHeart disease datasetThis set consists of data from two different studies reported in [11]. Thisdataset is a natural example of a complete dataset accompanied by a similarlarge dataset with non-random missing features. One set (data from theHungarian Institute of Cardiology) is missing two out of 14 features. Weuse this as the single modal dataset in our experiments while the completeset (the Cleveland dataset) is used as the multimodal dataset. In real worldproblems, such as our prostate cancer study, the single modal dataset ismissing some of the most discriminative features. To simulate this conditionwe used a classical random forest feature ranking approach. Moreover we353.3. Evaluationstudy the effect of decreasing the number of features in the single modaldataset on the overall performance by sweeping from 12 to two features,always removing the most discriminative ones. The multimodal datasetin this experiment (the Cleveland dataset) consists of 303 samples. 100samples were randomly separated and used as the test data. The remainingsamples were used as the multimodal data for training the support trees.We experimented with scenarios that included 10% to 90% of this data intraining of the support trees. More information about the features used inthis experiment can be found in Table 3.2Table 3.2: The feature set of the heart disease datasetFeatures/Variables Explanationage Age in yearssex sex (1 = male; 0 = female)cp chest pain type:Value 1: typical anginaValue 2: atypical anginaValue 3: non-anginal painValue 4: asymptomatictrestbps resting blood pressure (in mm Hg on admission to the hospital)chol serum cholestoral in mg/dlfbs fasting blood sugar > 120 mg/dl (1 = true; 0 = false)restecg resting electrocardiographic results:Value 0: normalValue 1: having ST-T wave abnormalityValue 2: showing left ventricular hypertrophy by Estes’ criteriathalach maximum heart rate achievedexang exercise induced angina (1 = yes; 0 = no)oldpeak ST depression induced by exercise relative to restslope the slope of the peak exercise ST segmentValue 1: upslopingValue 2: flatValue 3: downslopingca Number of major vessels (0-3) colored by flourocsopythal 3 = normal; 6 = fixed defect; 7 = reversible defectoutcome class Diagnosis of heart disease (Angiographic disease status)Value 0: < 50% diameter narrowingValue 1: > 50% diameter narrowingBreast cancer datasetThis is a complete set with 569 samples [37]. This dataset consists of 30features describing the nucleus properties of a breast cancer cell. The sce-nario of multimodal and single modal datasets was simulated with sampling.363.3. EvaluationWe change the size of the single modal feature set and report the AUC asa function of the number of single modal features. In a fashion similar tothe heart disease experiment, we simulate the feature quality disadvantageof the single modal dataset by removing the top ranked features of thisdataset. More information about this feature set can be found in Table 3.3Table 3.3: The feature set of the breast cancer datasetFeatures/variables Explanationradius Mean of distances from center to points on the perimetertexture Standard deviation of gray-scale valuesperimeterareasmoothness Local variation in radius lengthscompactness perimeter2/area - 1concavity Severity of concave portions of the contourconcave points Number of concave portions of the contoursymmetryfractal dimension “coastline approximation” - 1outcome class Cancer diagnosis:B:BenignM:Malignant3.3.2 A real scenario: prostate cancer datasetThis set consists of a small genomics+MRI prostate cancer dataset (Nm =27) accompanied by a relatively large MRI only dataset (Ns = 428). Thesingle modal dataset consists of five multi-parametric MRI features fromdynamic contrast enhanced (DCE) MRI and diffusion MRI on a 3 Teslascanner. We used the apparent diffusion coefficient (ADC) and fractionalanisotropy (FA) from diffusion MRI, and three pharmacokinetic parametersfrom DCE MRI: volume transfer constant, ktrans, fractional volume of ex-travascular extracellular space, ve, and fractional plasma volume vp [16, 25].This data is from patients undergoing radical prostatectomy at Vancou-ver General Hospital and has been collected with informed consent, andwith the approval of the Research Ethics Board of the Vancouver GeneralHospital. Imaging is performed a week before the surgery. After the surgery,the prostate specimens were processed with wholemount cuts that matchedthe slices in the MRI scans. A cutting device and the procedure described in[13] ensured that the cuts matched the MRI slices. An experienced patholo-gist outlined the area of the tumor/normal from wholemount histopathologyslides ([16, 25]).373.3. EvaluationThe histopathology slide of each patient was then registered to a T2-weighted image and corresponding DTI and DCE-MRI slices ([16, 24, 25]).This registration lets us find the Region Of Interest (ROI) in the DTI andDCE MRI slices that co-respond to the tumor(please see Figure 3.2). Wetook the average of the quantitative MRI features (apparent diffusion coeffi-cient, ADC, fractional anisotropy, FA, volume transfer constant, ktrans, frac-tional volume of extravascular extracellular space, ve, and fractional plasmavolume, vp) as the MRI features for each ROI.Figure 3.2: Registration example: (a) T2-weighted , (b) DTI and (c) DCE-MRI slice. The green contour represents the boundaries of the prostategland. The red contour represents the mapped tumor ROI([16, 25]).The tissue samples were then obtained by needle biopsy from the corre-sponding formalin-fixed paraffin-embedded (FFPE) tissue blocks and RNAwas extracted and purified from these samples. The expression level of39 genes that form the most recent consensus on the genetic signature ofprostate cancer for patients with European ancestry as reported and main-tained by National Institute of Health [26] were used as features (please seeTable 3.4). Each of the selected genes were mapped to the closest probelocation on an Affimetrix Exon micro-array and the gene expression at eachof these locations were selected as a genetic feature in our study. Figure3.3 shows a heat-map of the gene expression data for all the patients in ourstudy. The dendrogram shown in this figure shows clustering of samples andgenes based on the gene expression.We have 27 samples with gene expression data and registered imagingdata (14 normal, 13 cancer) from 21 patients. The evaluation of the proposedmethod on this small dataset was carried out in a leave one out scheme. Eachtime, the support trees were trained using 26 samples, with all the singlemodal data samples and features used for forming the scandent trees.383.3. Evaluation383953834137872845829273125729206193754797331141739683033956433328692133496603128411394760424691573014159285276633591803338060263648325365312562343304952226520272436826294990129345212738146248497037396793043264400842737617372852742273632231279783739668288763324173902376037GDX−4540−RAD037.CELGDX−4485−RAD_001.CELGDX−4520−RAD017.CELGDX−4538−RAD035.CELGDX−4522−RAD019.CELGDX−4516−RAD013.CELGDX−4537−RAD034.CELGDX−4534−RAD031.CELGDX−4536−RAD033.CELGDX−4544−RAD041.CELGDX−4542−RAD039.CELGDX−4523−RAD020.CELGDX−4532−RAD029.CELGDX−4528−RAD025.CELGDX−4535−RAD032.CELGDX−4521−RAD018.CELGDX−4513−RAD010.CELGDX−4539−RAD036.CELGDX−4517−RAD014.CELGDX−4486−RAD_002.CELGDX−4529−RAD026.CELGDX−4511−RAD008.CELGDX−4543−RAD040.CELGDX−4541−RAD038.CELGDX−4533−RAD030.CELGDX−4530−RAD027.CELGDX−4531−RAD028.CELFigure 3.3: Gene expression heat-map of the probes corresponding to theselected genes for each patient. Each row presents a sample. Each columnpresents a gene expression feature. The vertical dendograms show clusteringof samples. The horizontal dendograms show clustering of features. Sampleclustering correctly clusters each patient. This shows that the gene expres-sion profiles are mostly patient-specific. Although all of the selected genesare known to be biomarkers of prostate cancer, neither correlations betweenfeatures nor cancer-related patterns are visible.393.3. EvaluationTable 3.4: List of the genes used in the prostate cancer studyProbe ID Gene name Probe ID Gene name2376037 MDM4 3947604 BIK4008427 NUDT11 3956433 CHEK22436826 KCNN3 2887633 BOD12562343 GGCX 3968303 SHROOM23128411 EBF2 2920619 ARMC23761737 ZNF652 3286921 08-Mar3754797 HNF1B 2934521 SLC22A32852766 AMACR 2852742 AMACR2731257 AFM 2652027 CLDN112736322 PDLIM5 2949901 NOTCH43127978 NKX3-1 3043264 JAZF12484970 EHBP1 3349660 HTR3B2845829 TERT 3359180 TH3739668 VPS53 3739679 VPS532738146 TET2 3014159 LMTK22536531 FARP2 3338060 MYEOV3839538 KLK3 3049522 TNS32417390 CTBP2 2469157 GRHL13311417 CTBP2 2636483 SIDT13413787 TUBA1C403.4. Simulation and experimental results3.4 Simulation and experimental resultsIn this section we provide simulation and experimental results to verify theperformance of the method explained in the previous sections. The experi-mental results are on a prostate cancer dataset which is an example of a realmultimodal incomplete dataset which consists of a small but multimodaldataset with genetics and MRI and a larger single modal dataset with onlyMRI. This dataset is a perfect example of our target scenario but because ofthe small sample size of the multimodal dataset, it is hard to verify that theproposed method outperforms the state of the art in imputation of randomforests.In order to evaluate the performance of the proposed method in differentscenarios we also used three publicly available datasets. We simulated themissing value scenario for different sample sizes of the multimodal datasetand different feature set sizes of the single modal dataset. These datasetsare real biomedical datasets and some are even multimodal in nature (forinstance the dermatology dataset). However, we call the results on thesedatasets, “simulation results” because these datasets are complete and themissing values are intentionally removed from each dataset to simulate dif-ferent missing value scenarios.3.4.1 Simulation resultsDermatology datasetFigure 3.4 shows the AUC of the proposed method and the state of theart embedded imputation method of random forests(rfImpute) in differentmultimodal sample sizes. It can be seen that the proposed method outper-forms the rfimpute method especially in smaller samples sizes. For instancewhen the sample size is as small as 51, simulations on this dataset resultedin average AUC of 0.97 for the proposed method and AUC of 0.92 for therfImpute method. This is because in smaller samples sizes the imputationmethod is forced to predict a large number of missing values using a verylimited set of available samples and this bias introduced by the imputationin the training set may mis-guide the random forest classifier. The scan-dent tree method enhances the performance of the random forest leaf byleaf, using cross validation instead of trying to predict the missing values.Therefore, it is expected to be less vulnerable to mis-classifications.Because the dermatology dataset is multimodal in nature, we can notexamine the performance of the proposed method when a larger number of413.4. Simulation and experimental results51 64 76 890.900.920.940.960.981.00Number of Multimodal samplesAUC51 64 76 890.900.920.940.960.981.00rfImputeProposed methodFigure 3.4: AUC vs multimodal sample size for the dermatology dataset(each box shows variation of AUC values for different randomised trainingand test sets)features are missing. In the next section we use the heart disease datasetfor this purpose.Heart disease datasetFigure 3.5 shows the AUC of the proposed method and the rfImpute methodfor different multimodal sample sizes. Each box in this figure shows AUC val-ues for different single modal feature set sizes and a fixed multimodal datasetsample size. The expected upward trend in AUC vs. multimodal samplesize is evident and it can be seen that the proposed method outperformsthe rfImpute method especially in smaller samples sizes. For example, whenonly 14 multimodal samples are available, the rfImpute method results in amean AUC of 0.91 whereas the proposed method delivers an AUC of 0.94.As the number of multimodal samples increases to 112, the performancesincrease for rfImpute and scandent tree to 0.96 and 0.97, respectively. Inother words, the scandent tree approach has a clear advantage when thedataset with multimodal data is significantly smaller.Figure 3.6 shows the AUC of the proposed method and the rfImputemethod for different single modal feature set sizes. Each box shows changesof AUC for different sample sizes at a fixed feature set in the single modal423.4. Simulation and experimental results14 18 22 26 30 35 39 51 71 91 1121321521730.850.900.95Number of Multi modal samplesAUC14 18 22 26 30 35 39 51 71 91 1121321521730.850.900.95Proposed methodrfImputeFigure 3.5: AUC vs multimodal sample size for heart disease dataset (eachbox shows AUC values for different single modal feature sets)data. Smaller variances of the boxes for the proposed method, especially insmaller feature set sizes, show that the proposed method is on average lesssensitive to the multimodal sample size especially when the single modaldataset has a large number of missing features. For example, at featurevector size of 2 for the single modal dataset, the performance of rfImputevaries from 0.88-0.98, whereas scandent tree shows a performance range of0.93-0.98. This stable behavior is due to the unique ability of the scandenttrees to predict division points for missing features that only conditionallydepend on the available features.Breast cancer datasetFigure 3.7 shows the AUC for the proposed method and the state of the artimputation method of random forests as a function of multimodal samplesize. Each box in this figure shows AUC of the two methods for a fixedmultimodal sample size and different single modal feature set sizes. Smallervariances of the boxes for the proposed method, especially in smaller samplesizes, show that the proposed method is on average less sensitive to the singlemodal featureset size. For instance, at sample size of 393 for the multimodaldataset, the performance of rfImpute varies from 0.978-0.989, (more than433.4. Simulation and experimental results2 4 5 6 7 8 10 11 120.850.900.95Number of Single modality featuresAUC2 4 5 6 7 8 10 11 120.850.900.95Proposed methodrfImputeFigure 3.6: AUC vs single modal feature set size for heart disease dataset(each box shows AUC values for different multimodal sample sizes)24 33 42 51 60 69 78 87 1231682132583033483930.940.950.960.970.980.99Number of Multi modal samplesAUC24 33 42 51 60 69 78 87 1231682132583033483930.940.950.960.970.980.99rfImputeProposed methodFigure 3.7: AUC vs multimodal sample size for breast cancer dataset (eachbox shows AUC values for different singlemodal feature sets)443.4. Simulation and experimental results1%) whereas scandent tree shows a performance range of 0.986-0.988 (lessthan 0.3%).6 9 12 16 19 22 25 280.940.950.960.970.980.99Number of Single modal featuresAUC6 9 12 16 19 22 25 280.940.950.960.970.980.99RFimputeProposed methodFigure 3.8: AUC vs single modal feature set size for breast cancer dataset(each box shows AUC values for different multimodal samples sizes)Figure 3.8 shows the AUC of the proposed method and the rfImputemethod for different single modal feature set sizes. Each box shows changesof AUC for different sample sizes at a fixed feature set in the single modaldata. It can be seen that the proposed method outperforms the state of theart imputation method of random forests (rfImpute) when a large numberof features are missing from the single modal dataset. It also has similarperformance when the single modal feature set is almost complete. Smallervariances of the boxes for the proposed method, show that the proposedmethod is on average less sensitive to the multimodal sample size especiallywhen the single modal dataset has a large number of missing features.3.4.2 Experimental results: prostate cancer datasetFigure 3.9 shows the AUC obtained on this data, for detection of prostatecancer, for several experiments, namely from left to right the bars show thedistribution of AUC areas for 1) a multimodal decision forest that simplyignores the existence of archival imaging data, 2) our proposed scandent treeapproach to use the archival data to improve the performance of a foresttrained and testes on multimodal data, 3) the standard rfImpute method453.4. Simulation and experimental resultsapplied at the forest level to include the single modal data in training, 4)the standard C5.0 method applied at trees level, 5) training and testing atree using only the single modal features of the multimodal set, 6) KNNimputation, and 7) zeroing of the missing feature values.multimodal scandent rdImpute C5.0 singlemodal knn zero0.40.50.60.70.80.91.0AUCFigure 3.9: AUC for multimodal classification task obtained with differentstrategies for handling the missing data issue, prostate cancer datasetIt can be seen that the multimodal forest is performing significantly bet-ter than the single modal forest even though the sample size of the singlemodal dataset is significantly larger than the multimodal dataset. This sug-gests that the missing modality, in this case the genetic features, is far morediscriminative than the shared modality, MRI. The imputation methods out-perform a single modal forest, but they fail to outperform the multimodalforest. This shows that even the state of the art imputation methods maymisguide the decision forest when a large portion of data is missing, to theextent that a simple imputation method like zero replacement outperformsthe state of the art imputation approaches.In order to measure the statistical significance of the difference betweenAUC values of different methods we use one of the most well-known uni-variate tests, student’s t-test. The t-test measures the difference betweentwo populations relative to their variances. This test is commonly used463.5. Summarywhen the sample size is small and the variance of the two populations areunknown. The outcome of a t-test is the p-value: the probability that theNull hypothesis (in this case, having similar distributions for AUC values) istrue. In other words, the p-value is the probability that we observe a resultsimilar to what we are observing or more extreme, assuming that the twodistributions are equal. In our experiments we measure the significance ofresults by comparison of the AUC values resulting from 100 random runs ofdifferent methods.In case of the proposed method, scandent forest, the significant advan-tage over a single modal forest, and each of the imputation methods isevident. Moreover, the proposed method does not introduce bias into theprediction like the other imputation methods and as a result, it outper-forms both the multimodal forest and the single modal forest. However, be-cause the shared modality is significantly less discriminative than the missingmodality, the improvement in performance is small (mean AUC of 94% forthe scandent forest and 93% AUC for the multimodal forest), although it isstatistically significant (a two sample t-test resulted in p <0.01).3.5 SummaryIn this section we introduced the novel concept of scandent trees, singlemodal trees that enable a conventional random forest trained on a smallmultimodal dataset to leverage a single modal dataset. Also a detailedexplanation of the proposed method and implementation was presented.We evaluated the proposed method using three publicly available datasetsby simulation of different incomplete multimodal scenarios. We also com-pared the proposed method with the state of the art imputation methods onprostate cancer dataset as a real incomplete multimodal dataset. Using thedermatology and heart disease datasets we showed that the proposed methodoutperforms the state of the art imputation method of random forests if themultimodal dataset (the block of the dataset that has all of the modalities)is very small in comparison with the whole dataset. Using the breast cancerdataset we showed that the proposed method outperforms the state of theart imputation method for random forests if a large number of features bemissing from the single modal dataset. Moreover, we showed that on allof the datasets, in comparison with rfImpute, the proposed method is ingeneral less sensitive to multimodal sample size and single modal feature setsize.The experimental results on the prostate cancer dataset show that the473.5. Summaryproposed method significantly outperforms the well known imputation meth-ods, even the state of the art embedded imputation methods of randomforests (rfImpute) or C5.0 trees. Because in this study the missing modality(genetic features) are far more discriminative than imaging features, a mul-timodal classifier which is equivalent of using the sample removal methodto handle the missing values was the best method among the conventionalimputation methods. The proposed method outperformed the multimodalclassifier. This improvement was small, but statistically significant.48Chapter 4Tree-based featuretransforms: applyingscandent tree model forsingle modal classification4.1 IntroductionIn the previous chapter we focused on a scenario in which a small but valu-able multimodal dataset could be merged with a large and easy to accessdataset in order to improve performance of a multimodal classifier. Wediscussed that this is a common scenario in biomedical data analysis lab-oratories or in data analysis scenarios that deal with very novel or specialmultimodal datasets. However, from clinical point of view the reverse prob-lem is more attractive: leveraging a multimodal dataset in order to enhancea classifier trained and tested on a single modality.One of the well-known applications of multimodal data analysis is ingrading of generative diseases like cancer or Alzheimer’s disease. Eachmodality provides us with unique information about the patient and is ableto compensate for weaknesses of other modalities. However, the fact thatmany modalities are very expensive or are simply not feasible to use forevery patient, limits the clinical applications of multimodal data analysis.In case of Alzheimer’s disease, the state of the art medical imagingmodalities used are MRI and PET scan. The information provided by aPET scan is more useful than MRI because it provide information aboutcerebral blood flow, metabolism, and receptor binding but PET imaging isexpensive and requires the use of radioactive tracers. As a result, a largenumber of patients only receive MRI scans. For example, in the Alzheimer’sDisease Neuroimaging Initiative (ADNI) study which is one of the largestmultimodal studies of Alzheimer’s disease worldwide, nearly half of the pa-tients are missing the PET data. So a very valuable contribution in this494.1. Introductionfield would be to use a multimodal dataset together with a single modaldataset in order to train a classifier that only needs one of the modalities.This means, training a classifier that uses MRI and PET data to train aclassifier that only needs MRI data for Alzheimer’s disease grading.In a fashion similar to the previous chapter, we benefit from the embed-ded way of decision forests for dealing with high dimensional data throughfeature bagging [4]. Another motivation for the use of decision forest paradigmis that it provides the ability to morph the treatment of missing data withinthe framework of learning to maximize the classification performance.In this chapter we focus on applications of tree-based feature maps inthe scandent tree model. A disadvantage of decision forests compared withSVM is the lack of an embedded framework for kernel-based feature trans-formation in the case of forests. Using multi-kernel approaches, researchershave devised solutions for incorporation of various modalities in the SVMcontext.Tree-based feature transforms have recently received some attention. Forexample, a recent work by Cao et al., [7] uses stacked decision forests. Thismethod is based on using the probability values estimated by trees in arandom forest as a feature vector, and using this feature vector for trainingof an enhanced decision forest, potentially together with the original featureset. Inspired by the applications of multi-kernel SVMs in multimodal dataanalysis, we apply this concept of tree-based feature maps, for multimodaldata analysis.In this chapter we intend to develop the idea of scandent tree-basedfeature transforms to solve the problem of missing data in the single modaltesting scenario. This problem has many clinical applications in areas whereexpensive research protocols meet the realities of clinical practice and highcost. Here, the assumption of a multimodal dataset with block wise missingvalues remains. However, there is no multimodal assumption about the testset. To solve this problem, we use the idea of tree-based feature transformsalong with the scandent tree. This combination allows us to use tree-basedfeature transforms built on one modality to transform the features from adifferent modality. Using this approach, we use MRI and PET data in theADNI dataset and train a classifier that only requires the MRI data forthe prediction of different stages of Alzheimer’s disease. We show that theinclusion of the PET data at the time of training results in an improvedclassification accuracy, even though the test cases are not subjected to PETimaging. In order to test the performance of the proposed classifier in differ-ent scenarios, we also simulate the missing target scenario by intentionallyremoving a set of features from two publicly available benchmark datasets,504.2. MethodThe dermatology dataset and The breast cancer dataset introduced in theprevious chapter.4.2 MethodTo obtain a forest that transfers the value of the multimodal dataset intoa single modal environment, it is tempting to simply replace all the treesin a forest trained on the available multimodal training data with theircorresponding scandent trees. However, this approach fails due to bias andthe fact that many of the multimodal divisions of support trees might notbe predictable by the single modal feature set.Instead, we choose an approach inspired by the use of decision treesas feature maps. For this we start with growing a scandent forest usingthe method explained in the previous chapter. However, instead of directlyusing the scandent trees, we use the set of local trees (Ti’s) from all thescandent trees of a multimodal forest as tree-based “feature-maps” or “tree-based feature transforms”. Each Ti is a single modal tree which maps Fs toa new space defined by the corresponding C′set. This means that each Tiyields a categorical feature to describe each sample. Then we use the singlemodal dataset with the extended feature set, including the original and thesetree-based features, to grow an improved single modal forest. Note thattrees trained on single modal features can be directly used as categoricalor continuous (similar to [7]) feature transformers. In the current work,however, we use the scandent subtrees to link two inconsistent datasets.Figure 4.1: Extracting tree-based feature transforms from the scandent treemodelThis method has a few advantages compared to the conventional methodfor forming a single modal decision forest or directly using the scandent treesas a new set of trees in a single modal decision forest. First, because ateach split of each tree in the single modal forest, the tree growth algorithm514.2. Methodsearches for the best division feature among both the original single modalfeatures and the new features generated by the local trees (Tis), the resultingtree is expected to be more accurate than both the scandent tree and the treegrown using only the original single modal features. Second, although the Tisare formed by a small multimodal dataset, the feature selection criteria (Giniimpurity or information gain) is calculated based on the large single modaldataset. In other words, the single modal forest uses the features inspiredby the multimodal forest, but it is completely randomized and optimizedbased on the larger single modal dataset.Figure 4.2: Diagram of the proposed method for training the “multimodalfeature transform” forest4.2.1 ImplementationThe first step is to grow a multimodal forest and the related scandent treesusing the method explained in the previous chapter. Then the local trees(Tis) are extracted from each tree and each Ti is used as feature generatorsfor the single modal dataset. Given that each Ti is a single modal classifier,it can assign labels relative to the local class labels (C′) to each single modalsample. The resulting labels are used as new categorical features which canbe calculated for any test data using the corresponding Ti. We then use aconventional decision forest growth method similar to what was explained inthe previous chapter to grow a forest using this set of new features togetherwith the original single modal feature set.It should be mentioned that because the local trees are trained usingthe small multimodal dataset, many of the generated features might not beuseful for the single modal decision forest. Considering the large numberof local trees in a random forest, this can flood the original single modalfeatures. So we filter the new features by a conventional feature selectionalgorithm, namely based on the feature importance measure in a decision524.3. Evaluation and resultsforest. We apply feature bagging separately to the set of the original singlemodal features and the new features, and then merge them together to formthe feature bag used for each single modal tree. Diagrams of the proposedmethod for extracting the tree-base feature transforms from the scandenttree model and using that for single modal classifier design are shown infigures 4.1 and 4.2.4.3 Evaluation and resultsIn this section we evaluate the proposed method for single modal classifica-tion task. First, we examine the proposed method in different simulated sce-narios and for different single modal sample sizes using two of the datasetsused in previous chapter: the dermatology dataset and the breast cancerdataset. Then we test the performance of the proposed method on a realdataset that exactly matches our scenario, the ADNI dataset.4.3.1 Evaluation using benchmark datasetsFor this simulation task we should simulate the same two simulation parame-ters that were used in the previous chapter, with the difference that insteadof the multimodal sample size that is assumed to be sufficiently large inthis scenario, we examine the effect of the single modal sample size of thetraining set. The dermatology dataset introduced in the previous chapteris multimodal in nature so it is the perfect benchmark for testing the singlemodal sample size. However, because the feature sets for each modality isfixed, it is not meaningful to simulate the single modal feature size usingthis dataset. Instead, we use the breast cancer dataset introduced in theprevious chapter in order to simulate different feature set sizes for singlemodal dataset.Effect of the single modal sample size: dermatology datasetIt is informative to investigate performance of the proposed method forsingle modal classification tasks as a function of the single modal samplesize. We design an experiment similar to the previous chapter using the thedermatology dataset but with the assumption that the test set is also missingthe most discriminative modality. For this experiment we first randomlyselect 100 samples from the dermatology dataset as a test set. Then wediscard the histopathological features from this set. Moreover, we form amultimodal dataset with a fixed size (40 percent of the remaining samples534.3. Evaluation and resultsin this experiment) and use it to extract the proposed feature transforms.Finally we evaluate performance of a single modal forest with and withoutthe new features in different single modal sample sizes.71 96 1221481731992252510.50.60.70.80.9Number of Single modal samplesAUC71 96 1221481731992252510.50.60.70.80.9Single modal Random forestEnhanced RandomForestFigure 4.3: AUC vs single modal sample size for dermatology datasetFigure 4.3 shows AUC of a single modal random forest and the enhancedforest trained on the dermatology dataset for different sample sizes. It can beseen that the enhanced forest can improve the single modal forest especiallywhen the single modal dataset is small. For instance at sample size of 71the single modal classifier gives AUC of 0.78 while the enhanced forest givesAUC of 0.84. This improvement is also evident in larger sample sizes but isless significant (AUC of 0.87 and 0.85 for the proposed method and a singlemodal forest). The dermatology dataset is multimodal in nature, so the sizesof the single modal and multimodal feature sets are fixed. On the other handthe breast cancer dataset does not have a predefined single-modal featureset so it can be used for simulation of different single modal featuresets.Effect of the single modal feature set size: The breast cancerdataset:For this simulation we first randomly select 100 test samples. We thenchange the single modal feature set size from 5 to 29 (the total feature set544.3. Evaluation and resultssize is 30) removing the most discriminative features at each step. Then weobserve AUC as a function of feature set size for a single modal forest withand without the use of the tree-based feature transforms. The tree-basedfeature transforms used in each iteration were trained using 30 samples ran-domly selected from the multimodal dataset. We found that the size of themultimodal dataset in the breast cancer dataset does not have much effect onthe tree-based feature transforms once the multimodal sample size is morethan or equal to 30. Figure 4.4 shows the AUC of the conventional singlemodal forest and the proposed method for different single modal featuresets.The expected upward trend in AUC vs. single modal feature set sizeis evident, it also can be seen that the proposed method outperforms aconventional single modal forest method especially in smaller feature setsizes. For example, when only the 5 least discriminative multimodal featuresare available, the conventional single modal random forest results in a meanAUC of 0.77 whereas the proposed method delivers an AUC of 0.83 whilein larger feature sets the AUC of the two methods methods is almost equal.This is because at feature set sizes of larger than 20, the AUC of a singlemodal forest without using the tree based feature transforms is almost 1.This eliminates the need for any more improvement on the conventionalclassifier.The breast cancer dataset and the dermatology dataset are completedatasets that provided us the opportunity to test the performance of the pro-posed method in different simulated scenarios. Furthermore we test the per-formance of the proposed method on a real incomplete multimodal dataset,the ADNI dataset.4.3.2 A real scenario: ADNI datasetWe test the proposed single modal classification method on a dataset fromAlzheimer’s Disease Neuro-imaging Initiative (ADNI) database. The ADNIwas launched in 2003 as a public-private partnership, led by Dr. MichaelW. Weiner. The primary goal of ADNI has been to test whether serialmagnetic resonance imaging (MRI), positron emission tomography (PET),other biological markers, and clinical and neuropsychological assessment canbe combined to measure the progression of mild cognitive impairment (MCI)and early Alzheimers Disease (AD). The ADNI study is an example of amultimodal scenario in which a large portion of samples are missing oneof the modalities. In this chapter we take the samples that come frompatients with both MRI and PET scan as multimodal dataset (Nm = 218)554.3. Evaluation and results5 8 11 14 17 20 23 26 290.700.750.800.850.900.951.00Number of Single modal featuresAUC5 8 11 14 17 20 23 26 290.700.750.800.850.900.951.00Singlemodal Random forestEnhanced RandomForestFigure 4.4: AUC vs single modal feature set size for breast cancer dataset(each box shows AUC values for different multimodal sample sizes)accompanied by a relatively large single modal dataset (Ns = 508) consistingof patients with only MRI data. This includes the MRI data from the 218multimodal samples.The single modal dataset consists of MRI volume measurements of sixROIs in the human brain (ventricles, hippocampus, whole-brain, entorhinal,fusiform and mid-temporal) and intra-cranial volume (ICV) in mm3. Themultimodal feature set consists of the same MRI features together with twoadditional PET scan features, FluoroDeoxyGlucose (FDG) measurementand AV45 uptake measurement. The outcome labels include cognitivelynormal patients (NL), patients with confirmed dementia (AD) and patientswith mild cognitive impairment (MCI). The MCI group can be divided intoprogressive (pMCI) that eventually converts to dementia and stable (sMCI).In this chapter we assume a maximum of 36 month conversion time for theMCI class to be considered pMCI.The distribution of different outcome classes in the two datasets is asfollows: for the normal class we have 178 samples in the single modal datasetversus only 18 samples in the multimodal dataset, for the dementia class wehave 108 single modal samples versus 29 multimodal samples, for the sMCI564.3. Evaluation and resultsclass we have 126 single modal versus 144 multimodal samples and for thepMCI class we have 96 single modal samples versus 27 multimodal samples.In other words, not only the multimodal dataset is much smaller than thesingle modal dataset, it also does not have the same distribution of outcomeclasses. This makes the data fusion between the two datasets extremelydifficult with traditional approaches. We examine the performance of theproposed method by reporting AUC for three classification scenarios: NLversus pMCI, sMCI versus AD. and sMCI versus pMCI.Figure 4.5: Diagram of the method used for forming the PC forestWe take the performance of a single modal forest trained solely on theoriginal MRI feature set as the baseline. We then compare performance ofthis baseline with a forest enhanced using the principal component features(PC forest) as shown in Figure 4.5, with a similar forest trained using MRI-based transformed features shown in Figure 4.6, and a transformed featureset extracted using both MRI and PET using scandent tree approach shownin Figure 4.2. It should be noted that all of these classifiers are designedfor the single modal classification task, meaning that they only need theoriginal MRI feature set for classification but may use the other modalities(PET in this example) for better feature transform design in the trainingphase.5-fold cross-validated ROC curves of the baseline single modal forest,PC forest, single modal feature-transform forest, and the scandent tree mul-timodal feature transform forest for NL vs. pMCI classification task areshown in Figure 4.7.As it can be seen in Table 4.1, the forests grown based on the tree-based feature transforms significantly outperform the baseline single modalforest and the PC forest. The difference between the baseline and the fea-ture transform methods is statistically significant (p=0.01) for the singlemodal feature transforms and (p=0.002) for multimodal feature transforms.574.3. Evaluation and resultsFigure 4.6: Diagram of the method used for forming a forest based on singlemodal tree-based feature transformsSpecificitySensitivity0.00.20.40.60.81.01.00.80.60.40.20.0Single modal forestPC forestSingle modal feature transform forestMulti modal feature transform forestFigure 4.7: ROC curve for NL vs. progressive MCI classification, singlemodal classification task, ADNI dataset584.3. Evaluation and resultsTable 4.1: Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) and Areaunder ROC curve (AUC) of the proposed methods and the baseline forestfor the NL vs. pMCI single modal classification task, ADNI datasetAcc Sens Spec AUCSingle modal forest 0.744 0.663 0.791 0.779PC forest 0.774 0.878 0.747 0.781Single modal feature transform 0.781 0.691 0.844 0.819Multimodal feature transform 0.788 0.747 0.805 0.837Table 4.2: Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) and Areaunder ROC curve (AUC) of the proposed methods and the baseline forestfor the sMCI vs. AD single modal classification task, ADNI datasetAcc Sens Spec AUCSingle modal forest 0.731 0.824 0.699 0.814PC forest 0.752 0.758 0.748 0.836Single modal feature transform 0.782 0.734 0.863 0.868Multimodal feature transform 0.795 0.737 0.897 0.892However, the improvement in the performance achieved by the PC-basedfeatures is not statistically significant (p-value = 0.92). The multimodalfeature transforms are more effective compared to the single modal featuretransforms. This difference is significant (p=0.04).Another classification problem worth investigating is discrimination ofsamples with stable MCI from dementia cases using the MRI feature set.Figure 4.8 and Table 4.2 show ROC curves and performance measures ofthe enhanced and baseline forests for this classification task.It can be seen that similar to the NL vs. pMCI task, the forests en-hanced by the new feature sets are outperforming the baseline single modalforest. The improvement observed in the PC forest is more significant thanthe previous task but it still can not be considered statistically significant(p-value=0.08). On the other hand the proposed tree-based feature trans-form methods significantly outperform the baseline methods with p-valuesof 0.0001 and 3.698e-07 for the single and multimodal feature transforms,594.3. Evaluation and resultsSpecificitySensitivity0.00.20.40.60.81.01.00.80.60.40.20.0Single modal forestPC forestSingle modal feature transform forestMulti modal feature transform forestFigure 4.8: ROC curve for stable MCI vs. AD classification, single modalclassification task, ADNI dataset604.3. Evaluation and resultsrespectively, and the multimodal feature transforms are more effective thansingle modal feature transforms (p=0.0003).The third classification task which separates sMCI from pMCI cases ispotentially the most clinically relevant model. The ROC curves and perfor-mance measures for this task can be seen in Figure 4.8 and Table 4.2.SpecificitySensitivity0.00.20.40.60.81.01.00.80.60.40.20.0Single modal forestPC forestSingle modal feature transform forestMulti modal feature transform forestFigure 4.9: ROC curve for stable MCI vs. progressive MCI classification,single modal classification task, ADNI datasetThe trends remain the same: the tree-based feature transforms out-perform a simple single modal forest with p=0.01 and and p=0.0002 for thesingle modal (MRI-based) and multimodal (MRI+PET) feature transforms,respectively. It can also be seen that the PC-based features fail to enhancethe baseline forest to a statistically significant level (p=0.672). Similar tothe previous experiments, the multimodal feature transforms yield a largerAUC than single modal feature transforms with p=0.01.4.3.3 Comparison with other work on ADNIThe block wise missing value problem is a well-known issue of the ADNIdataset and it is addressed in many papers in literature. However, noneof them has the same goal and assumptions as our study. For instance,614.3. Evaluation and resultsTable 4.3: Accuracy (Acc), Sensitivity (Sens), Specificity (Spec) and Areaunder ROC curve (AUC) of the proposed methods and the baseline forestfor the sMCI vs. pMCI single modal classification task, ADNI datasetAcc Sens Spec AUCSingle modal forest 0.743 0.713 0.744 0.810PC forest 0.757 0.769 0.746 0.815Single modal feature transform 0.777 0.819 0.750 0.848Multimodal feature transform 0.815 0.831 0.803 0.872this paper focuses on improving the performance of decision forests withthe assumption that a decision forest is the classifier of choice for a givenmultimodal dataset. However, most of the studies on the ADNI dataset useother classifiers like multi-kernel SVM for multimodal classification. As aresult, it is difficult to compare our results with the available literature asany such comparison will be mostly informed by the choice of classificationparadigm.One other issue that makes the comparison difficult is the different fea-ture sets and sample sizes impacted by patient selection criteria. A simpleexample is the different assumptions on the conversion time for MCI to ADfor differentiating progressive versus stable MCI. In our study, we assumeda 36 month conversion time for progressive MCI cases and used the sum-marized set of features extracted by adnimerge R package as our feature set.This package is accessible from the ADNI website (https://adni.loni.usc.edu).With all these differences and limitations in mind, we have gathereda list of comparable methods with performance measures reported in theliterature in Table 4.4. These are all on the sMCI vs pMCI classificationtask. As it can be seen, the proposed method matches or surpasses theperformance of the state of the art, even in cases where multimodal data isavailable for all cases.624.4. SummaryTable 4.4: Comparison of the proposed single modal method with the stateof the art for sMCI vs. pMCI prediction, ADNI datasetMethod sample size modalities performanceAcc Sens Spec AUCProposed method 122 MRI 0.815 0.831 0.803 0.872[9] 99 MRI, PET, CSF 0.801 0.853 0.733 0.852[31] 204 MRI, PET 0.759 0.48 0.952 0.746[6] 397 MRI, PET, CSF 0.732 0.655 0.767 0.786[14] 388 MRI 0.754 0.705 0.776 0.82[35] 200 MRI 0.751 - - 0.84[40] 143 MRI, PET, CSF, APOE 0.741 0.787 0.656 0.795[43] 91 MRI,PET,CSF 0.739 0.686 0.736 0.797[10] 405 MRI 0.71 0.7 0.72 -[36] 162 MRI, CSF 0.685 0.741 0.63 0.764.4 SummaryIn this chapter a novel application of the scandent tree model as tree-basedfeature transforms was introduced. These feature transforms can be used forleveraging a multimodal dataset for single modal classifier design. A methodto extract these single modal tree feature transforms was explained and amethod to use these feature transforms within the context of a conventionalrandom forest classifier was provided.Using two publicly available datasets for simulation and the ADNI datasetas a real incomplete multimodal dataset, it was shown that the proposedmethod can be used to enhance a single modal classifier. The experimentson the ADNI dataset show if one extracts the proposed feature transformsfrom a scandent forest trained on both PET and MRI features and uses themtogether with the original set of MRI features for classification of differentstages of Alzheimer’s disease, the resulting classifier can outperform a con-ventional single modal random forest, a random forest enhanced with PCsof the MRI features and a random forest enhanced with tree-based featuretransforms solely trained on the MRI data.We examined the proposed classifier in three different scenarios of nor-mal vs progressive MCI, stable MCI vs progressive MCI and stable MCI vsAlzheimer’s disease. It was shown that in all of these scenarios the proposedmethod outperforms the conventional single modal forest and the other en-hanced single modal forests mentioned. The stable MCI vs progressive MCIclassification task was found to be the most clinically valuable classificationtask. Comparison of the proposed method and state of the art in this sce-nario show that the proposed method matches or outperforms the state of634.4. Summarythe art even in case of larger training sets or feature sets.64Chapter 5Conclusion5.1 SummaryIn this thesis we addressed the problem of incomplete multimodal datasets inrandom forest learning algorithms in a scenario where many of the samplesare non-randomly missing a large portion of the most discriminative features.This missing value problem in multimodal datasets is different from thecommon scenarios in single modal data analysis. In our problem of interest,features of one specific modality might be missing altogether in training, ortesting. This causes an issue known as block-wise missing data. We showedthat this issue can not be handled by conventional imputation techniquesif the number of samples that include all of the features is small. This isa common scenario in biomedical data analysis applications. In summary,this thesis has two major contributions:• We developed the novel concept of scandent trees for enriching a mul-timodal classifier with a large training dataset from only a subset ofmodalities. The results show that the proposed method for multi-modal classification outperforms the embedded missing value impu-tation method of decision forests introduced in [4] and other stateof the art imputation methods, particularly in smaller samples sizesand when a large portion of features are missing. We showed thatthe proposed method enables the integration of a small genomic plusimaging dataset, with a relatively large imaging dataset. We alsoshowed that this method is in general less sensitive to the number ofmissing features and to the multimodal sample size by simulation ofdifferent missing value scenarios on three publicly available benchmarkdatasets.• We also proposed a novel learning method for training on multiplemodalities and testing on one modality. To this end, we introducedthe concept of tree-based feature transforms. We showed that usingthis approach, we can efficiently transfer the discriminative power ofPET imaging into the training phase of building a model that would655.2. Discussions and limitationsonly use the MRI data at the testing phase. By simulation on two pub-licly available benchmark datasets, we showed that the single modalfeatures generated by the multimodal model can significantly improvea single modal forest especially when the single modal dataset is smallor it is missing most of the discriminative features. We also showedthat the model achieved through multimodal data analysis can be usedto form an enhanced random forest that only needs a single modalityfor classification.5.2 Discussions and limitationsIn this thesis we have addressed two classification scenarios regarding theproblem of missing values in multimodal datasets: leveraging a single modaldataset for multimodal classification, and leveraging a multimodal datasetfor single modal classification.5.2.1 Limitations of the implementated methodThere are some limitations to our current implementation of the proposedmethod:As one limitation, it should be mentioned that similar to the other tree-based imputation methods, we are proposing this algorithm with the as-sumption that random forest is the classifier of choice. In other words, whena different classifier outperforms the baseline random forest, the proposedmethod might not be the best option.Another limitation of the current proposed method is that it is based onoptimising each leaf of each tree in a forest separately which is computation-ally expensive in large forests. Unlike the conventional imputation methods,the computational time of the proposed method scales with the sample sizeof the small multimodal dataset not the large sample size of the single modaldataset. However, as the multimodal dataset grows the computation timebecomes an issue. The computational complexity of the proposed methodnot only depends on the model parameters, but also depends on the datasetand feature-sets used in the analysis. For instance, if the missing modal-ities are significantly more discriminative than the available modalities, itis expected that the link nodes are limited to the root and leaves of thesupport forest. As a result, the computational cost of training a scandenttree forest would be in the same order as training two support forests. Asimilar result is also expected if the missing feature set is significantly less665.2. Discussions and limitationsdiscriminative. However, the testing phase of the scandent forest is exactlysimilar to a conventional random forest.5.2.2 Discussions and limitations of the multimodal studyWe tested the scandent tree method on a prostate cancer dataset as a realworld example of our target scenario. This dataset is an example of theworst case scenario of missing data: a large non-random portion of the datais missing the potentially more powerful genomic features resulting in avery small multimodal dataset. At the same time, the number of featureson the single modal (imaging) side is small. It is, therefore, revealing thateven in this situation, the use of scandent tree methodology provides a clearadvantage against the traditional approaches to deal missing values in asituation like this, such as simply ignoring one or the other set, or imputationapproaches.There are a few limitations to our work with prostate cancer data:• As our experiments show, the missing modality (gene expression) is farmore discriminative than the shared modality (MRI). This makes itextremely difficult for the proposed method to model the relationshipsbetween the modalities and effectively merge the two datasets.• The small number of features in the shared modality (MRI) makes thefeature-bagging in the support tree unbalanced between the modali-ties. As a result, many of the support trees are completely grown basedon the missing modality (gene expression) and scandent trees have tofollow the structure of a whole support tree. This together with smallsample size of the multimodal dataset can cause over-fitting.• Another limitation is that the small sample size of the multimodaldataset. It not only makes it hard to train the support forest neededfor the scandent tree method, it also makes it hard to show statisticallysignificant results in comparison between different methods.Given these limitations, a more revealing test of the performance of thesolution proposed for the multimodal scenario was achieved by study of thebenchmark datasets. One such study was presented in chapter 3 where weexamined the performance of the scandent tree method for different mut-limodal sample sizes and different feature sets using benchmark datasetspublicly available from the University of California Irvine (UCI) database[23]. In comparison with the state of the art imputation method for deci-sion forests (rfImpute), we observed that in larger multimodal sample sizes675.2. Discussions and limitationsor when only a small number of features were missing from the single modaldataset, both of the methods perform very well in handling the missing val-ues for multimodal classification. However, in smaller multimodal datasetsor when a large portion of features are missing from the single modal sam-ples, the scandent tree method showed significantly better performance incomparison with the rfImpute method. Another observation was that fora fixed sample size, the scandent tree method is less sensitive to the num-ber of missing features, especially in smaller multimodal sample sizes. Thisadvantage was also evident from the results on the prostate cancer dataset.5.2.3 Discussions and limitations of the single modal studyWe proposed the tree-based feature transform method in order to leveragea multimodal dataset for single modal classification. We showed than thismethod is very effective in common real-world scenarios when a large numberof samples are missing many of the potentially most discriminative features.We also succeeded to leverage information from PET scan in ADNI datasetfor enhancement of classification of different stages of Alzheimer’s diseasewhen only MRI data is available.Unlike the prostate cancer study we did not have the problem of samplesize in this study but there is one limitation in our results on the ADNIdataset: The different stages of Alzheimer’s disease are not clearly definedand vary from one study to another. For instance, the conversion timebetween MCI and Alzheimer’s disease which determines whether an MCIcase is stable or progressive is different between different studies. Also thefact that the ADNI dataset is continuously being updated, makes it hard tocompare results on this dataset.It should also be mentioned that the current implementation of the pro-posed method is not computationally efficient. This becomes an issue if therelationship between the available modalities and the missing modalities isvery complex or the feature set size is very large. This scenario will require alarge scandent forest in order to generate the tree-based feature transforms.Considering the fact that most of the scandent trees consist of at least 2 or3 local trees, the number of the tree-based feature transforms becomes verylarge. The computation costs of growing a scandent forest for large datasetsmay make it computationally unpractical to use in many studies. However,our experiments show that only around 5% of the feature transform treesare actually useful in practice. This means that 95% of the computationtime used for training of the scandent forest is not necessary if it is onlygrown for the purpose of generating the tree-based feature transforms.685.3. Future workMoreover, note that a large set of tree-based feature transforms are onlylocally discriminant and their number may exceed the original single modalfeature set size by a factor of 5. Therefore, using an off-the-shelf randomforest that does not separately rank and bag the two feature sets may notresult in an improved classifier and even may result in a forest weaker thanthe original single modal forest. This is because a large number of artificialfeatures may flood the set of original features. This becomes an issue in theforest growth algorithms that use a randomized feature selection approachat each division of each tree in a forest. The randomized methods may resultin decreased growth time and independence between trees. However, theymay also yield in a large number of trees that are grown solely based on thetransformed features. These trees are very likely to over-fit.A similar problem is when the scandent trees are not randomised ef-ficiently. In this scenario, the transformed features are very likely to bestatistically dependent because they are formed using the same feature-setwithout bagging. This makes it hard to form independent decision trees andas a result limits the performance of the resulting decision forest.5.3 Future workWe can envision four future improvements to the implementation of thescandent tree method:• The current implementation of the proposed inference method is basedon the optimization of the scandent trees in a leaf by leaf manner. Thisprocess is computationally expensive and becomes an issue in caseof larger multimodal datasets. An area of improvement will be re-designing the inference method so that instead of leaf-by-leaf mergingof the scandent and support trees, it can merge the two trees directlyat tree level.• The second area for continued work is improving the baseline trees.Because we needed full control over each division of each tree in the for-est, we could not use the off-the-shelf decision forest packages availablein R. Therefore, the support forest, which is the base of the scandenttree method, is our in-house implementation. We can improve thebase classifier by using the boosting methods and advanced random-izing schemes embedded in off-the-shelf random forests.• The third area that needs more work is the design of the local predic-tors in each scandent tree. The current implementation uses a very695.3. Future worksimple classification tree while any other machine learning algorithmthat can mimic a local multimodal tree can be used. The currentimplementation has the advantage that it automatically results in ascandent tree, but methods resulting in any single modal rule-set canalso be used.• The fourth area is adapting the scandent tree model for online learning.As it was mentioned, one of the main applications of the proposedmethod is in data analysis tasks where only a limited set of the samplesare multimodal because of the time constraints. Therefore a naturalstep can be to design a mechanism to insert information from newmultimodal samples into the scandent tree model. This way the modelcan be iteratively trained as the number of multimodal samples grows.• Another potential applications of the scandent tree concept that isworth investigating is relationship finding between modalities. Thescandent tree model provides a unique tool to model the relationshipbetween two set of features potentially from two modalities. The lo-cal trees that form each scandent tree are in fact decision trees thatmodel the relationship between one available feature and a set of fea-tures in the missing modality. In many applications, for instance ingene expression studies, a decision tree is a perfect tool to model therelationships between features in an interpretable fashion. This canbe a very simple but effective way to find relationships between animaging modality and a set of genes which may result in finding newbiomarkers for diagnosis of diseases like cancer by finding links betweenfeatures.70Bibliography[1] Andre´ Altmann, Laura Tolos¸i, Oliver Sander, and Thomas Lengauer.Permutation importance: a corrected feature importance measure.Bioinformatics, 26(10):1340–1347, 2010.[2] Francisco Arteaga and Alberto Ferrer. Framework for regression-basedmissing data imputation methods in on-line mspc. Journal of chemo-metrics, 19(8):439–447, 2005.[3] Hussam Aldeen Ashab, Piotr Kozlowski, S Larry Goldenberg, andMehdi Moradi. Solutions for missing parameters in computer-aideddiagnosis with multiparametric imaging data. In Machine Learning inMedical Imaging, pages 289–296. Springer, 2014.[4] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.[5] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Ol-shen. Classification and regression trees. CRC press, 1984.[6] Sergio Campos, Luis Pizarro, Carlos Valle, Katherine R Gray, DanielRueckert, and He´ctor Allende. Evaluating imputation techniques formissing data in adni: A patient classification study. In Progress in Pat-tern Recognition, Image Analysis, Computer Vision, and Applications,pages 3–10. Springer, 2015.[7] Yu Cao, Hongzhi Wang, Mehdi Moradi, Prasanth Prasanna, and Tan-veer F Syeda-Mahmood. Fracture detection in x-ray images throughstacked random forests feature fusion. In Biomedical Imaging (ISBI),2015 IEEE 12th International Symposium on, pages 801–805. IEEE,2015.[8] Qixuan Chen and Sijian Wang. Variable selection for multiply-imputeddata with application to dioxin exposure study. Statistics in medicine,32(21):3646–3659, 2013.71Bibliography[9] Bo Cheng, Mingxia Liu, Heung-Il Suk, Dinggang Shen, and DaoqiangZhang. Multimodal manifold-regularized transfer learning for MCI con-version prediction. Brain imaging and behavior, pages 1–14, 2015.[10] Pierrick Coupe´, Simon F Eskildsen, Jose´ V Manjo´n, Vladimir S Fonov,Jens C Pruessner, Miche`le Allard, and D Louis Collins. Scoring by non-local image patch estimator for early detection of Alzheimer’s disease.NeuroImage: clinical, 1(1):141–152, 2012.[11] Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer,Johann-Jakob Schmid, Sarbjit Sandhu, Kern H Guppy, Stella Lee, andVictor Froelicher. International application of a new probability al-gorithm for the diagnosis of coronary artery disease. The Americanjournal of cardiology, 64(5):304–310, 1989.[12] Thomas G Dietterich. An experimental comparison of three methodsfor constructing ensembles of decision trees: Bagging, boosting, andrandomization. Machine learning, 40(2):139–157, 2000.[13] B. Drew, E. C. Jones, S. Reinsberg, and et al. Device for section-ing prostatectomy specimens to facilitate comparison between histol-ogy and in vivo MRI. Journal of magnetic resonance imaging : JMRI,32:992–996, 2010.[14] Simon F Eskildsen, Pierrick Coupe´, Daniel Garc´ıa-Lorenzo, VladimirFonov, Jens C Pruessner, and D Louis Collins. Prediction of Alzheimer’sdisease in subjects with mild cognitive impairment from the ADNI co-hort using patterns of cortical thinning. Neuroimage, 65:511–521, 2013.[15] Yoav Freund and Robert E Schapire. A decision-theoretic generalizationof on-line learning and an application to boosting. Journal of computerand system sciences, 55(1):119–139, 1997.[16] Nandinee Fariah Haq, Piotr Kozlowski, Edward C Jones, Silvia DChang, S Larry Goldenberg, and Mehdi Moradi. A data-driven ap-proach to prostate cancer detection from dynamic contrast enhancedMRI. Computerized Medical Imaging and Graphics, 41:37–45, 2015.[17] Yan He. Missing data imputation for tree-based models. PhD thesis,University OF California Los Angeles, 2006.[18] Tin Kam Ho. Random decision forests. In Document Analysis andRecognition, 1995., Proceedings of the Third International Conferenceon, volume 1, pages 278–282. IEEE, 1995.72Bibliography[19] Tin Kam Ho. The random subspace method for constructing decisionforests. Pattern Analysis and Machine Intelligence, IEEE Transactionson, 20(8):832–844, 1998.[20] Biao Jie, Daoqiang Zhang, Bo Cheng, and Dinggang Shen. Manifoldregularized multitask feature learning for multimodality disease classi-fication. Human Brain Mapping, 36(2):489–507, 2015.[21] Phimmarin Keerin, Werasak Kurutach, and Tossapon Boongoen.Cluster-based knn missing value imputation for dna microarray data.In Systems, Man, and Cybernetics (SMC), 2012 IEEE InternationalConference on, pages 445–450. IEEE, 2012.[22] EM Kleinberg et al. An overtraining-resistant stochastic modelingmethod for pattern recognition. The annals of statistics, 24(6):2319–2349, 1996.[23] M. Lichman. UCI machine learning repository, 2013.[24] Mehdi Moradi, Firdaus Janoos, Andriy Fedorov, Petter Risholm, TinaKapur, Luciant D Wolfsberger, Paul L Nguyen, Clare M Tempany, andWilliam M Wells. Two solutions for registration of ultrasound to mrifor image-guided prostate interventions. In Engineering in Medicineand Biology Society (EMBC), 2012 annual international conference ofthe IEEE, pages 1129–1132. IEEE, 2012.[25] Mehdi Moradi, Septimiu E Salcudean, Silvia D Chang, Edward C Jones,Nicholas Buchan, Rowan G Casey, S Larry Goldenberg, and PiotrKozlowski. Multiparametric MRI maps for detection and grading ofdominant prostate tumors. Journal of Magnetic Resonance Imaging,35(6):1403–1413, 2012.[26] National Institutes of Health. National cancer institute: PDQ geneticsof prostate cancer, Date last modified 02/20/2015.[27] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.[28] J Ross Quinlan. C4. 5: programs for machine learning. Elsevier, 2014.[29] Dan Steinberg and Phillip Colla. Cart: classification and regressiontrees. The top ten algorithms in data mining, 9:179, 2009.73Bibliography[30] Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Un-biased split selection for classification trees based on the gini index.Computational Statistics & Data Analysis, 52(1):483–501, 2007.[31] Heung-Il Suk, Seong-Whan Lee, Dinggang Shen, et al. Hierarchicalfeature representation and multimodal fusion with deep learning forAD/MCI diagnosis. NeuroImage, 101:569–582, 2014.[32] Terry M Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursivepartitioning. R package version 3.1-46. Ported to R by Brian Ripley.,3, 2010.[33] Terry M Therneau, Beth Atkinson, Brian Ripley, et al. rpart: Recursivepartitioning. R package version, 3:1–46, 2010.[34] Xian Wang, Ao Li, Zhaohui Jiang, and Huanqing Feng. Missing valueestimation for dna microarray gene expression data by support vectorregression imputation and orthogonal coding scheme. BMC bioinfor-matics, 7(1):1, 2006.[35] Chong-Yaw Wee, Pew-Thian Yap, and Dinggang Shen. Prediction ofAlzheimer’s disease and mild cognitive impairment using cortical mor-phological patterns. Human brain mapping, 34(12):3411–3425, 2013.[36] Eric Westman, J-Sebastian Muehlboeck, and Andrew Simmons. Com-bining mri and csf measures for classification of Alzheimer’s diseaseand prediction of mild cognitive impairment conversion. Neuroimage,62(1):229–238, 2012.[37] William H Wolberg, W Nick Street, Dennis M Heisey, and Olvi LMangasarian. Computer-derived nuclear features distinguish malignantfrom benign breast cytology. Human Pathology, 26(7):792–796, 1995.[38] Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, Paul M Thompson,and Jieping Ye. Multi-source learning with block-wise missing datafor Alzheimer’s disease prediction. In Proceedings of the 19th ACMSIGKDD international conference on Knowledge discovery and datamining, pages 185–193. ACM, 2013.[39] Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, Paul M Thompson, andJieping Ye. Bi-level multi-source learning for heterogeneous block-wisemissing data. NeuroImage, 102:192–206, 2014.74Bibliography[40] Jonathan Young, Marc Modat, Manuel J Cardoso, Alex Mendelson,Dave Cash, and Sebastien Ourselin. Accurate multimodal probabilisticprediction of conversion to alzheimer’s disease in patients with mildcognitive impairment. NeuroImage: Clinical, 2:735–745, 2013.[41] Guan Yu, Yufeng Liu, Kim-Han Thung, and Dinggang Shen. Multi-task linear programming discriminant analysis for the identification ofprogressive MCI individuals. PLOS One, 9:e96458, 2014.[42] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, andJieping Ye. Multi-source feature learning for joint analysis of incompletemultiple heterogeneous neuroimaging data. Neuroimage, 61(3):622–632,2012.[43] Daoqiang Zhang and Dinggang Shen. Multi-modal multi-task learningfor joint prediction of multiple regression and classification variables inAlzheimer’s disease. Neuroimage, 59(2):895–907, 2012.75
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Scandent tree : a decision forest based classification...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Scandent tree : a decision forest based classification method for multimodal incomplete datasets Hor, Soheil 2016
pdf
Page Metadata
Item Metadata
Title | Scandent tree : a decision forest based classification method for multimodal incomplete datasets |
Creator |
Hor, Soheil |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Incomplete and inconsistent datasets often pose difficulties in multimodal studies. A common scenario in such studies is where many of the samples are non-randomly missing a large portion of the most discriminative features. We introduce the novel concept of scandent decision trees to tackle this issue in the context of a decision forest classifier. Scandent trees are decision trees that optimally mimic the partitioning of the data determined by another decision tree, and crucially, use only a subset of the feature set. We use the forest resulting from ensembling these trees as a classification model. We test the proposed method on a real world example of the target scenario, a prostate cancer dataset with MRI and gene expression modalities. The dataset is imbalanced with many MRI only samples and few with MRI and gene expression. Using scandent trees, we train a classifier that benefits from the large number of MRI samples at training time, and of the presence of MRI and gene expression features at the time of testing. The results show that the diagnostic value of the proposed model in terms of detecting prostate cancer is improved compared to traditional methods of imputation and missing data removal. The second major contribution of this work is the concept of tree-based feature maps in the decision forest paradigm. The tree-based feature maps enable us to train a classifier on a rich multimodal dataset, and use it to classify samples with only a subset of features of the training data. This has important clinical implications: one can benefit from an advanced modality to train a classifier, but use it in a practical situation when less expensive modalities are available. We use the proposed methodology to build a model trained on MRI and PET images of the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and then test it on cases with only MRI data. We show that our method is significantly more effective in staging of cognitive impairments compared to a model trained and tested on MRI only, or one that uses other kinds of feature transform applied to the MRI data. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-04-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0300237 |
URI | http://hdl.handle.net/2429/57848 |
Degree |
Master of Applied Science - MASc |
Program |
Biomedical Engineering |
Affiliation |
Applied Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_may_hor_soheil.pdf [ 1.04MB ]
- Metadata
- JSON: 24-1.0300237.json
- JSON-LD: 24-1.0300237-ld.json
- RDF/XML (Pretty): 24-1.0300237-rdf.xml
- RDF/JSON: 24-1.0300237-rdf.json
- Turtle: 24-1.0300237-turtle.txt
- N-Triples: 24-1.0300237-rdf-ntriples.txt
- Original Record: 24-1.0300237-source.json
- Full Text
- 24-1.0300237-fulltext.txt
- Citation
- 24-1.0300237.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0300237/manifest