Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Evaluating feature selection techniques for identifying a microbial signature in 16S microbiome sequencing… Nguyen, Thuy Tien 2021

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2021_may_nguyen_thuytien.pdf [ 1.52MB ]
JSON: 24-1.0396544.json
JSON-LD: 24-1.0396544-ld.json
RDF/XML (Pretty): 24-1.0396544-rdf.xml
RDF/JSON: 24-1.0396544-rdf.json
Turtle: 24-1.0396544-turtle.txt
N-Triples: 24-1.0396544-rdf-ntriples.txt
Original Record: 24-1.0396544-source.json
Full Text

Full Text

Evaluating feature selection techniques for identifying amicrobial signature in 16S microbiome sequencing databyThuy Tien NguyenB.Sc., University of Waterloo, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Bioinformatics)The University of British Columbia(Vancouver)March 2021© Thuy Tien Nguyen, 2021The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Evaluating feature selection techniques for identifying a microbial sig-nature in 16S microbiome sequencing datasubmitted by Thuy Tien Nguyen in partial fulfillment of the requirements for thedegree of Master of Science in Bioinformatics.Examining Committee:Raymond Ng, Professor, Department of Computer Science, University of BritishColumbiaCo-supervisorSara Mostafavi, Associate Professor, Paul G. Allen School of Computer Scienceand Engineering, University of WashingtonCo-supervisorStuart Turvey, Professor, Department of Pediatrics, University of British ColumbiaSupervisory Committee MemberBrett Finlay, Professor, Department of Biochemistry, University of British ColumbiaSupervisory Committee MemberiiAbstractThe goal of understanding which microbes are responsible for shifts in different en-vironmental and clinical conditions has motivated the increase in the developmentof custom feature selection techniques for microbiome data. When identifying anappropriate feature selection method, researchers are often faced with the questionof whether or not to apply a phylogenetic approach. However, in many cases, itis not possible to know which is most suitable a priori. The motivation behind aphylogenetic approach is biological, as the features in microbiome data embodiesan inherent hierarchical structure that may contain signal from trait conservation.Microbiome shifts correlated with host outcome could be driven by groups of taxawhich are closely phylogenetically related. As such, techniques that leverage phy-logenetic information seem highly fitting. Recent studies have shown promisingresults that a phylogenetic approach could be beneficial, however less have soughtto provide a thorough evaluation of the robustness and applicability of the phylo-genetic approach for different host outcomes of study. Guidance for researchers onthe applicability of a phylogenetic approach is unclear in the current state of theliterature. In this work, we sought to perform an assessment of feature selectionmethods in order to understand how different classes of methods compete on mi-crobiome data. We sought to evaluate whether a phylogenetic approach would bemore powerful in finding ground-truth features with phylogenetic correlation struc-tures, leading us to discover that non-phylogenetic methods are the best all-aroundmethods both in the presence and absence of strong phylogenetic signal. Someevidence has shown that there is still merit in the phylogenetic approach — suchas in scenarios where the phylogenetic signal is very strong. Our observations andfindings provide insights into strategies for testing for a phylogenetic signal usingiiia combination of techniques.ivLay SummaryThe human body is home to billions of microbes living in and on different sites,each referred to as a microbiome. Research has revealed a role of the gut micro-biome in human disease from microbe-derived metabolites that can influence hostbiological functions. Deciphering the role of the microbiome typically starts withthe use of computational methods to search for the relevant microbes. When theset of microbes found are very genetically related to one another, they are morelikely to produce analogous metabolites that affect host function similarly. Whenvery genetically distant, the metabolites and effect on host function would differand may be synergistic. In this thesis, we aim to address the two following re-search questions: 1) whether computational methods for identifying microbes canfind both very genetically related microbes and very genetically distant microbes,and 2) how relevant the identified microbes are to host function.vPrefaceThis dissertation is an original intellectual product of the author, Thuy Tien Nguyen.The breakdown of contributions reported in the present manuscript is as follows:The implementation of all analyses as described in all of Chapter 3 has been per-formed by Thuy Tien Nguyen. Results, discussions and exploration presented inChapters 4 and 5 has been performed by Thuy Tien Nguyen. This manuscript hasbeen written by Thuy Tien Nguyen under the supervision of Dr. Sara Mostafaviand Dr. Raymond Ng.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 16S microbiome data . . . . . . . . . . . . . . . . . . . . . . . . 41.1.1 16S rRNA sequencing and data representation . . . . . . . 41.1.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . 41.2 Identifying important microbes to a biological variable of interest . 51.2.1 Differential abundance analysis . . . . . . . . . . . . . . 61.2.2 Feature selection . . . . . . . . . . . . . . . . . . . . . . 61.2.3 Leveraging the evolutionary relationships in microbiomedata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Thesis motivation and contribution . . . . . . . . . . . . . . . . . 7vii1.4 Detailed outline of thesis . . . . . . . . . . . . . . . . . . . . . . 82 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1 Differential abundance analysis . . . . . . . . . . . . . . . . . . . 102.1.1 Assessing differentially abundant microbiome compositionbetween groups . . . . . . . . . . . . . . . . . . . . . . . 112.1.2 Identifying differentially abundant taxon between groups . 122.2 Applying generic supervised learning methods to microbiome data 132.2.1 Standard feature selection and classification methods ap-plied to microbiome data . . . . . . . . . . . . . . . . . . 132.2.2 Normalization for feature selection and classification meth-ods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3 Custom feature selection techniques for microbiome data . . . . . 142.3.1 Non-phylogenetic subset selection methods . . . . . . . . 152.3.2 Phylogenetic techniques that use UniFrac distances . . . . 162.3.3 LASSO regularization of the phylogenetic tree . . . . . . . 172.3.4 Feature engineering of the phylogenetic tree . . . . . . . . 172.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.1 Testing for association in global microbiome compositional shifts 213.2 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . 223.2.1 Phylogenetic feature selection methods . . . . . . . . . . 223.2.2 Generic feature selection methods . . . . . . . . . . . . . 243.3 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . 253.4 CHILD study data . . . . . . . . . . . . . . . . . . . . . . . . . . 283.5 Count normalization . . . . . . . . . . . . . . . . . . . . . . . . 293.6 Evaluation metrics and performance assessment . . . . . . . . . . 303.6.1 Evaluation metric for simulated data . . . . . . . . . . . . 303.6.2 Evaluation metric for real data . . . . . . . . . . . . . . . 313.6.3 Performance assessment . . . . . . . . . . . . . . . . . . 324 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344.1 Analyzing the relationship of the microbiome and asthma . . . . . 35viii4.1.1 Association testing of asthma and the microbiome . . . . 354.1.2 Searching for a microbial signature using feature selectiontechniques . . . . . . . . . . . . . . . . . . . . . . . . . 364.2 Evaluating feature selection performance on simulated data . . . . 384.2.1 Phylogenetic methods for finding phylogeny-induced fea-tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394.2.2 Phylogenetic methods for finding phylogeny-independentfeatures . . . . . . . . . . . . . . . . . . . . . . . . . . . 414.2.3 Generic techniques for finding phylogeny-induced ground-truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414.2.4 Generic techniques for finding phylogeny-independent ground-truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Factors that influence feature selection performance in simulation 454.3.1 The effect of phylogenetic relatedness . . . . . . . . . . . 454.3.2 The effect of the number of relevant features . . . . . . . 464.3.3 The effect of sample size . . . . . . . . . . . . . . . . . . 484.3.4 The effect of taxa abundance level . . . . . . . . . . . . . 504.3.5 The effect of microbe-microbe correlation . . . . . . . . . 524.4 Evaluating classes of feature selection techniques on real data . . . 524.4.1 High signal real data condition . . . . . . . . . . . . . . . 534.4.2 Lower signal real data conditions . . . . . . . . . . . . . 544.5 Factors that influence feature selection performance on real data . 574.5.1 The effect of normalization . . . . . . . . . . . . . . . . . 584.5.2 The effect of sample size . . . . . . . . . . . . . . . . . . 604.5.3 The effect of phylogenetic aggregation . . . . . . . . . . 604.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . 685.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.2.1 The phylogenetic signal in real data . . . . . . . . . . . . 715.2.2 Sensitivity of HFE . . . . . . . . . . . . . . . . . . . . . . 725.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 73ix5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77xList of TablesTable 4.1 P-value Table for Covariate-Microbiome Association Tests (3-month Samples) . . . . . . . . . . . . . . . . . . . . . . . . . 36Table 4.2 P-value Table for Covariate-Microbiome Association Tests (1-year Samples) . . . . . . . . . . . . . . . . . . . . . . . . . . 37Table 4.3 P-value Table for MIRKAT-RFE Results for Asthma at 3-years . 38Table 4.4 P-value Table for MIRKAT-RFE Results for Asthma at 5-years . 38Table 4.5 Table of Simulation Parameter Levels . . . . . . . . . . . . . . 40Table 4.6 Table of Sample Sizes for Real Data Outcome Variables . . . . 53Table 4.7 Table of Performance Differences of Normalization on Real Data 62xiList of FiguresFigure 3.1 Simulation Methodology . . . . . . . . . . . . . . . . . . . . 26Figure 4.1 Aggregated Performance (F1) of Generic vs. Phylogenetic Fea-ture Selection Techniques on Two Types of Simulated Data . . 39Figure 4.2 Feature Selection Performance on Simulated Extreme Low andHigh Signal Conditions . . . . . . . . . . . . . . . . . . . . . 42Figure 4.3 Feature Selection Performance of Phylogeny-Induced Ground-Truth (Phylogenetic Relatedness vs. Sample Size . . . . . . . 43Figure 4.4 Aggregate Performance of Feature Selection Techniques AcrossSix Simulated Parameter Levels in Phylogeny-Induced Syn-thetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 44Figure 4.5 Aggregate Performance of Feature Selection Techniques AcrossSix Simulated Parameter Levels in Phylogeny-Independent Syn-thetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 45Figure 4.6 Feature Selection Performance of Phylogeny-Independent Ground-Truth (Number of True Features vs. Minimum Taxa Prevalence) 48Figure 4.7 Feature Selection Performance of Phylogeny-Induced Ground-Truth (Number of True Features vs. Minimum Taxa Prevalence) 49Figure 4.8 Feature Selection Performance of Phylogeny-Independent Ground-Truth (Abundance Level vs. Sample Size) . . . . . . . . . . . 50Figure 4.9 Feature Selection Performance of Phylogeny-Induced Ground-Truth (Number of True Features vs. Sample Size) . . . . . . . 51Figure 4.10 Feature Selection Performance for Timepoint . . . . . . . . . 55Figure 4.11 Feature Selection Performance of Asthma . . . . . . . . . . . 56xiiFigure 4.12 Feature Selection Performance of Exclusive Breastfeeding Status 57Figure 4.13 Feature Selection Performance of Asthma . . . . . . . . . . . 58Figure 4.14 Feature Selection Performance of Normalized Real Data (TimePoint) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63Figure 4.15 Feature Selection Performance of Normalized Real Data (Asthmaat 5 Years) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64Figure 4.16 Feature Selection Performance of Normalized Real Data (Breast-feeding Status) . . . . . . . . . . . . . . . . . . . . . . . . . 65Figure 4.17 Feature Selection Performance of Normalized Real Data (Birthby Vaginal Delivery) . . . . . . . . . . . . . . . . . . . . . . 66Figure 4.18 Effect of Sample Size Differences Between Low and High Sig-nal Real Data Conditions . . . . . . . . . . . . . . . . . . . . 67xiiiGlossaryAUC Area Under the CurveAUROC Area under the Receiver Operator Characteristic curveBC Bray-CurtisCHILD Canadian Healthy Infant Longitudinal DevelopmentCLR Centered Log-RatioFDR False Discovery RateF1 F1-score or F-measureFPR False Positive RatePCR Polymerase Chain ReactionHFE Hierarchical Feature EngineeringxivHSIC Hilbert Schmidt Information CriterionLR Logistic RegressionLASSO Least Absolute Shrinkage and Selection OperatorMIRKAT Microbiome Regression Kernel Association TestOTU Operational Taxonomic UnitPAM Partition Around MediodsPERMANOVA Permutational Analysis of VarianceQC Quality ControlRF Random ForestRFE Recursive Feature EliminationRLE Relative Log ExpressionROC Receiver Operator CharacteristicSVM Support Vector MachinexvTMM Trimmed Mean of M-ValuesTPR True Positive RateTSS Total Sum ScalingUQ Upper QuartileVC Variance ComponentxviAcknowledgmentsI wouldn’t have been able to complete my master’s thesis without the strong sup-port system that I am incredibly fortunate to have. I would like to sincerely thankand acknowledge the following individuals:• My supervisors, Raymond and Sara, for all of their support, guidance, andwisdom throughout the entire journey. Thank you for all of the teachingsand providing the foundation to explore my research endeavours.• My beloved dog, Rakki, who passed near the end of my graduate studies.Having fostered him during my undergraduate studies, he had been at everystep and milestone of my academic and career journey. As the most constantcompanion throughout my adult life, he travelled with me from Waterloo, toToronto, St John’s, 7 American states, finally to Vancouver, pitter-patteringthrough all the fields, trails, and sidewalks. He will forever be in my heart.Thank you for all of those walks you forced me to take when I was in thethick of it.• My family, who always provided their unconditional love and care. I thankmy mother, who did not always understand my academic interests but didher best to support me in the ways that she knew how. She taught me tobe highly independent and to persevere, which I attribute greatly to beingable to endure the journey of graduate school. None of this would have beenpossible without the tough love from my family.• My dearest Imaad, who always believed in me, supported and encouragedme through every single academic interest, idea, and aspiration. I thank himxviifor giving me the wings to fly.• All of my cherished friends, especially Lena, Wendy, and Michelle. I thankthem for always being there to provide many supportive words, accommo-dating my ever changing schedule, celebrating the major milestones, andreminding me of the light at the end of the tunnel. I wouldn’t have been ableto complete it without them.• My friends from graduate school, Greg, Sina, and Elijah, who were alwaysthere to provide support and share experiences with that made the journeymore relatable, bearable, and less lonely.xviiiChapter 1IntroductionIncreasing amounts of research have shown that the microbiota plays a pivotal roleas a proxy to the environmental and lifestyle factors significant in the develop-ment of numerous complex diseases [5, 34, 44, 49]. The computational task ofsearching for the important microbes are among the most impactful applicationsof microbiome data. Identification of relevant microbes can enable the study ofmicrobial function that can shed light on the mechanistic actions in disease aetiol-ogy. Asthma, for example, is a complex respiratory condition in which a definitivecause has yet been determined. Recent findings [5, 10] have shown that a group ofmicrobes with decreased abundance in the gut microbiota of infants who later de-veloped asthma could be key markers in understanding the biological mechanismof the development of the disease. The ability to identify these microbes with highpredictive capability to the clinical outcome is of key interest.To date, the common practice of searching for important microbes related toa clinical outcome have been focused on univariate statistical methods that searchfor correlations between single taxon and outcome referred to as differential abun-dance testing. However, the microbiota is considered a commensal organism — anecosystem where members often do not function independently. Modelling singletaxon-outcome interactions could thus fail to capture the combinatorial relation-ships that multiple microbes could have with a host phenotype. In addition tothis, searching for per taxon-outcome correlations suffer from the problem of mul-tiple testing [27, 67]. Since 2013, an increasing number of novel computational1techniques have been developed for microbiome data analysis. Within this bodyof literature, the last five years have seen a marked increase in the developmentof methods for identifying important microbes through multivariate approaches. Inthe machine learning literature, this process is typically referred to as feature selec-tion. Applied in the classification setting, feature selection aims to identify a subsetof features that can reduce classification error and eliminate irrelevant features thatwould otherwise decrease model performance.Within the computational microbiome literature, there are two broad themesthat have emerged. On one side, simply the abundance profiles of microbial taxaare modelled. On the other side, the evolutionary relationships in microbiomedata are leveraged, referred to as the phylogenetic structure, to model the rela-tionship between taxa and outcome through a biologically motivated theme. Loworder taxa are organized into a tree structure, connected by branches representingphylogenetic distances to higher order clades that group together related members[32, 33, 36]. Researchers seek to leverage this structure because closely relatedtaxa may possess shared traits correlated with a clinical phenotype that can pro-vide greater evidence of the function of the important taxa. In addition to this,due to the high sparsity and high dimensionality in microbiome data, the incorpo-ration of phylogenetic information can enable signal pooling which can direct theanalysis to a more relevant feature space [69, 72].However, a fundamental problem with tackling the task of feature selectionusing a phylogenetic approach makes the assumption that a phylogenetic signalexists [36, 52]. It has been shown that not all conditions are related to the mi-crobiota through a phylogenetic signal [8, 71]. That is, the correlation with themicrobiota could be driven by very phylogenetically distant taxa that are too unre-lated to confer a phylogenetic signal. Applying phylogenetic techniques in thesescenarios would not yield informative results.While several phylogenetic methods have been developed, few studies havesought to provide evidence on whether applying a phylogenetic technique is appro-priate for a condition or trait of study. As a result, guidance of using a phylogeneticapproach versus non-phylogenetic approach remains unclear. [28] was the first ma-jor benchmarking review paper that surveyed feature selection and classificationtechniques on microbiome data. [75] used a novel feature selection technique by2[40] to compare different types of state-of-the-art machine learning classificationmodels on a large number of public datasets. Both studies showed that identifyinga reduced set of informative OTU features can improve classification performancein trait prediction. However, there is limited work to date that has specifically eval-uated and compared the applicability and performance of a phylogenetic approachfor feature selection of microbiome data with varying correlation structures.Novel phylogenetic techniques developed by the computational microbiomecommunity have provided benchmarking studies as part of their evidence of rea-sonable performances achieved, however few have addressed the applicability ofsuch methods to the particular condition and dataset of study. The approach takenoften involves assuming that most microbiome-outcome relationships contain aphylogenetic signal, however, research has shown that this is not the case. Often,performances are reported without commentary on whether the problem that themethods were applied to are appropriate. This can fall short in providing the nec-essary and comprehensive guidance researchers need to decide which methods toapply and what interpretations to draw from their results.In this work, we investigate the application of feature selection techniquesfor both phylogenetic and non-phylogenetic approaches for finding important mi-crobes. We propose an objective evaluation procedure to test the applicability of thephylogenetic approach through an extensive simulation study. This evaluation pro-cedure can be used for benchmarking studies for testing methods on microbiomedata with and without phylogenetic correlation structures. We apply our bench-marking procedure to novel microbiome research data with unknown correlationstructures to show how the feature selection methods used in this work perform insearching for the important microbes for different host phenotypes with varied sig-nal levels. We discover that phylogenetic techniques are at best as good as generictechniques in conditions where a strong phylogenetic signal exists. Our findingsshow that feature selection with generic machine learning methods are highly ro-bust in many conditions. We found that a phylogenetic approach could be useful asa test for the presence of a phylogenetic signal that could inform downstream anal-ysis of microbial function. However, performance optimization in a classificationsetting is best achieved with the generic techniques we evaluated.31.1 16S microbiome dataIn this work, we focus on computational methods for 16S microbiome data. Here,we will provide background on the primary sequencing method, the data represen-tation, and common considerations for normalizing the data.1.1.1 16S rRNA sequencing and data representationShort-read 16S targeted rRNA gene sequencing is a cost-effective method datedback to the 1970s. The elegance of this strategy for microbial identification is inthe nature of the 16S rRNA gene in bacteria and archaea which have hypervariableregions in the gene that allows for highly accurate identification of taxa withoutwhole metagenomic sequencing [43, 65]. In total, there are nine hypervariableregions in the 16S rRNA gene that can be used as targets for PCR primers withnear-universal bacterial specificity [20]. Carried out by pyrosequencing on a 454sequencing machine, the sequence reads of a single variable 16S rRNA gene regionare clustered based on sequence similarity to a minimum identity threshold of typ-ically 97% [20, 65]. This is followed by taxonomy assignment by comparing thesequence clusters to known reference genomes in databases such as Greengenes[21, 37] , producing clusters of sequences corresponding to operational taxonomicunits (OTUs). Phylogenetic tree construction uses the OTU sequences to form arooted tree that has levels corresponding to the taxonomic assignments of the OTUs.In studies that use 16S microbiome data, the OTUs are considered a proxy for thespecies level assignment. OTUs are typically represented in a tabular format wherethe sequence reads that map to each OTU are quantified and represented as countdata, also referred to as abundances. OTU tables are the common data format usedfor analysis, containing microbiota samples on one axis and OTU features on theother axis.1.1.2 NormalizationNormalizing 16S microbiome data is a key step in a typical microbiome data analy-sis pipeline. Microbiome data is highly susceptible to confounding due to technicalartifacts such as sequence library size differences between samples. This can poseissues in downstream models if not accounted for. For example, low relative abun-4dance profiles for an OTU in sample groups with large sequencing depth can appearinflated when compared to other samples with small library sizes, potentially lead-ing to spurious correlations. There are additional reasons for normalizing due todata challenges such as compositionality and overdispersion. Different types ofnormalization methods address different types of issues in the data.The most common and widely accepted technique for normalization is rarefy-ing [51] to handle library size differences and its application is widely debated[38]. The method involves random subsampling of counts in each sample to aneven sequencing depth. Empirical studies showed that rarefying prior to multivari-ate analyses with unsupervised clustering can reproduce biologically meaningfulclusters and had been recommended as a standard for those applications [38, 64].Further, evaluation of rarefying prior to differential abundance analysis leads tovaried control of false discovery rate (FDR) depending on the analysis method ap-plied [64]. Overall, loss of sensitivity is the major drawback of the method due tothe discarding of data samples [38, 64].Aitchinson’s log-ratio transformation is a common technique for handling thecompositional nature of microbiome data [2, 35, 46, 64]. It is the normalizationstrategy used in ANCOM [35], a differential abundance method which has beenshown to have good control of FDR [64]. Scaling methods are another class oftechnniques that aim to reduce the effects of library size differences. It involvesmultiplying the abundance data by a scale or size factor for each sample and this isestimated differently depending on the technique [3, 47, 48, 64].1.2 Identifying important microbes to a biologicalvariable of interestWith research in precision medicine advancing rapidly with the addition of micro-biome data, two of the key applications of microbiome data is in 1) assessing theglobal compositional shifts between communities for a biological variable of inter-est and 2) the identification of important taxa that drive community shifts for thebiological outcome. In this section, we’ll provide an introduction to the themesof analysis that are common in microbiome studies that shape the motivation inthis work. Both of these types of methods are supervised techniques that involves5modelling the correlation structures in the data based on a known output variable.1.2.1 Differential abundance analysisTo date, the most common strategy to search for relevant microbes is through uni-variate differential abundance analysis which is a statistical technique for assessingthe effect of taxon abundances on an outcome variable. These tests are widelyadopted and heavily applied in many microbiota-host clinical analysis studies.They are simple to use, easy to interpret, and are included in popular self-containedpipeline tools for microbiome data such as QIIME [9] and mothur [53]. The meth-ods have been used historically to identify key associations with host phenotypethat have paved the way for the vast amount of interest in studying the role of themicrobiota in human disease. However, due to the inherent biological property thatmany microbes work in synergy with one another to perform key functions in hostphenotypes, there is great motivation to develop more sophisticated methods thatcan identify a set of relevant microbes based on their joint effects with an outcome.1.2.2 Feature selectionBroadly, feature selection is a machine learning strategy (also known as variableselection) that aims to identify a subset of measured features that are jointly pre-dictive of the outcome. It is common to perform feature selection on a trainingset of data, then use the reduced feature set to transform a test set to eliminate theuninformative features for testing the performance of a predictive model. In thiswork, since we are interested in identifying important taxa features to clinical out-comes such as case vs. control studies — our focus is on the application of featureselection methods within a binary classification setting.Feature selection methods for microbiome data offers an alternative strategy tounivariate differential abundance testing where both share the goal of identifying areduced set of informative features, however through a multivariate approach. Ma-chine learning methods for feature selection can implicitly circumvent the problemof multiple testing through cross-validation, enable modelling complex interactionsamong taxa, while reducing the complexity of a model. A number of strategiestake into account the abundance profiles of taxa only to find an optimal subset of6features [6, 22, 68]. In parallel, a growing number of studies are incorporatingphylogenetic information into the feature selection techniques that seek to modelthe correlation structure of microbial taxa with an outcome based on evolutionaryor taxonomic relationships [39, 40, 50, 62, 63, 66, 66, 69].1.2.3 Leveraging the evolutionary relationships in microbiome dataThe growing number of phylogenetic techniques for both association testing andfeature selection is a key motivation for this work. The biological motivation un-derlying the development of phylogenetic techniques is strong and explains theimmense degree of interest in this research area. Many of these methods are builtoff the invention of the UniFrac distance metric [32, 33] that measures pairwisedistances between microbiota communities based on observed phylogenetic rela-tionships within communities. Beta-diversity analysis is among the most com-mon applications of the UniFrac distance metrics, which allow for the analysisof microbiota diversity between communities. Non-phylogenetic metrics are alsowidely used, such as Bray-Curtis dissimilarity, Jaccard index, and Euclidean dis-tance which do not account for the evolutionary relationships of the members inthe microbiota and are considered quantitative metrics that take into account theabundances only. The widely accepted application of beta-diversity metrics haveled way for the use of the UniFrac distances in a number of novel techniques formicrobiome data for association testing, feature selection, and classification mod-els [39, 45, 69, 72, 74]. Other methods that exploit the evolutionary relationshipsin microbiome data without the use of UniFrac metrics also exist [40, 50, 62, 63].These methods have shown promising results and provide great insight into newstrategies for approaching the task of feature engineering the phylogenetic tree forimproved performance in classification models.1.3 Thesis motivation and contributionIn microbiome data, 16S and metagenome alike, taxa in a microbiome are all re-lated at different evolutionary levels. In studying the host-microbiota relationshipsto identify a set of informative features that can improve the prediction of diseasediagnosis, it isn’t always clear what type of methods are the most appropriate to7apply. With custom feature selection techniques for microbiome data increasinglybeing developed each year, the great divide in the motivation for a phylogeneticvs. non-phylogenetic approach has led to further confusion in deciding which setof methods would be best to apply for a condition of study. While a number ofpublications have revealed that their methods have the ability to identify biologi-cally meaningful results with good performance, few have provided guidance onwhether the condition and dataset of study is appropriate for the given technique.To date, a benchmarking study that evaluates the application of feature selectiontechniques using a combination of phylogenetic and non-phylogenetic methods tospecifically evaluate the applicability of such methods on microbiome data doesnot exist. That said, the aim of this work is to test the applicability of feature se-lection methods applied to microbiome data by evaluating phylogenetic and non-phylogenetic methods and their ability to identify relevant features to an outcomeof study with and without a phylogenetic signal. We aim to assess these methods,discuss their advantages and limitations, and provide hints to the microbiome dataanalysis community on how to use such methods for their analyses and provide aguide for the interpretation of their results.1.4 Detailed outline of thesisThe rest of this thesis is outlined as follows:• In Chapter 2, we survey the current literature on methods for identifying im-portant features in microbiome data that correlate with a host phenotypic out-come. We discuss common statistical methods used for microbiome analysisfollowed by machine learning techniques for identifying relevant features.We survey both standard machine learning techniques designed for universalapplications, as well as methods designed for microbiome data. We highlighthow the methods work in a general sense by providing an overview of twoclasses of methods that are present in the literature.• In Chapter 3, we describe methods for a comparative analysis of the appli-cation of feature selection techniques on microbiome data. We first detailan association test for identifying overall signal between global microbiome8structure with an outcome. Then, we describe two major classes of featureselection techniques: phylogenetic and non-phylogenetic approaches to fea-ture selection. Next, we describe our simulation methodology for generatingsynthetic data that can be used to test the two classes of feature selectiontechniques. Then, we describe the real microbiome data that we use to testour methods on real data with unknown correlation structures. Finally, wediscuss the evaluation metrics and performance assessment strategy for per-formance benchmarking of the feature selection techniques.• In Chapter 4, we discuss the results of our comparative analysis of featureselection techniques applied to simulated and real data. First, we detail theresults from our initial analysis that applies a set of generic feature selectiontechniques for identifying the important features related to a disease out-come. Then, we detail the performance and behaviour of our two classesof feature selection techniques for recovering ground-truth features with dif-ferent types of correlation structures in simulated microbiome data. Finally,we present our results on our assessment of the feature selection methodsapplied back to real data while integrating our findings from our simulation.• In Chapter 5, we summarize our conclusions on the feature selection bench-marking that evaluates the differences between phylogenetic and non-phylo-genetic approaches. We discuss possible uses of the feature selection meth-ods we applied, highlighting the strengths and limitations of specific meth-ods for different types of classification problems. We also discuss possibleuses of our simulation methodology for testing the two types of methods forfeature selection. Finally, we will highlight shortcomings of our methodol-ogy and suggest future directions to improve addressing our research ques-tions.9Chapter 2Related WorksIn this section, we will provide a survey of the key literature that has motivatedthis work. First, we will discuss the standard procedure in microbiome analysis fordifferential abundance testing. This starts with describing the methods for globalassociation testing followed by univariate association methods. Then, we will dis-cuss multivariate feature selection techniques for finding important features thatfollow two major themes: methods that only use abundance profiles and methodsthat account for evolutionary relationships using the phylogenetic or taxonomictree. In this, we will also discuss key benchmarking studies that tested featureselection techniques on microbiome data, and we will show that there is a lackof consensus in the literature on the performance and suitability of phylogeneticmethods on microbiome data.2.1 Differential abundance analysisDifferential abundance testing is the statistical technique for assessing the effectof shifts in taxa abundances with an outcome of study. Methods of this class areamong the most common strategies for studying the association between the micro-biota and a biological variable of interest. There are two common modes of study:multivariate tests which typically assess the association with an outcome variablebased on global shifts in microbial composition over the full set of taxa betweensample groups, or univariate, which aims to identify taxon that are differentially10abundant between groups [6, 14].To state clearly, our core focus in this work is to evaluate methods for mod-elling microbiome-outcome relationships through the identification of a small setof microbes with joint effects with a biological outcome. Specifically, we wouldlike to evaluate the phylogenetic vs. non-phylogenetic approaches to do so. Inthe typical microbiome analysis toolbox, most of the phylogenetic approaches areglobal association tests that do not inform of the per taxon relevance to a biologi-cal outcome of interest. On the other hand, the majority of univariate tests that doaim to inform of per taxon-outcome relationships do not utilize phylogenetic in-formation. We will provide a survey of the common statistical techniques for bothtypes of tests to provide background on the motivation for our primary methods ofinterest: multivariate feature selection techniques.2.1.1 Assessing differentially abundant microbiome compositionbetween groupsMost of the current multivariate association methods for microbiome data modelthe compositional shifts between sample groups to explain whether global mi-crobiome variation explains variation in the outcome of study. These tests areoften based on computing the pairwise distances between microbiota communi-ties to model compositional shifts between environments. MIRKAT [16, 66, 74]and PERMANOVA [4] are examples which use phylogenetic beta-diversity distancemetrics such as UniFrac [32, 33] for computing different types of relationshipsamong taxa between communities. Such methods are a core part of the multivariateanalysis workflow for studying microbiome diversity between- and within- groups.MIRKAT and glmmTree [69] are among the kernel-based methods which use phylo-genetic distance metrics to compute kernel matrices. Kernels can be thought of ascovariance matrices that captures pairwise similarities between individuals acrossa set of features. The two methods both aim to model the association of globalmicrobiomes based on the prior that closely related OTUs have a tendency to havesimilar effects. As such, metrics that account for the evolutionary relationships oftaxa would, in theory, yield more powerful and biologically relevant results thannon-phylogenetic counterparts.112.1.2 Identifying differentially abundant taxon between groupsThe research question univariate differential abundance tests address is whether in-dividual microbes are differentially abundant between different ecosystems, suchas whether subjects are disease or healthy subjects [64]. Besides the recent worksof [8, 67] which aim to incorporate phylogenetic information to control false dis-covery rate, incorporating phylogenetic information in univariate differential abun-dance tests is not common. While these methods are easy to use and interpret, asearlier said, the complex correlation structure between taxa are overlooked throughthis strategy. However, due to its popularity, here we provide a brief review of thecommon techniques.First, we’ll state that most of these tests are used for hypothesis testing and alarge number of specialized methods have been developed to handle the charac-teristics of microbiome data. Common challenges addressed by the methods are:overdispersion, excessive zeros, compositional effects, and non-normal distribu-tion. Two major types of methods are outlined below:• Parametric models assume a finite set of parameters to estimate the char-acteristics of the data and can be powerful if the set of assumptions aboutthe data are valid [24]. These models can suffer if the assumptions are vi-olated, which may impose invalid correlation structures [35]. This is theprimary reason that custom models are developed, rather than using clas-sical parametric tests such as t-tests and ANOVA. Among the most widelyadopted methods are those from the literature for RNA sequencing (RNA-seq) analysis such as DESeq2 [3, 31] and edgeR [19] which were proposedfor microbiome data due to the shared characteristics with RNA-seq data.However, since differences exist, such as greater excess zeros, specializedmethods have been proposed [12, 17, 23]• Non-parametric models do not make distributional assumptions and pa-rameters are estimated directly from the data. Excessive zeros and under-sampling are handled in metagenomeSeq [41] using a zero-inflated Gaussianmodel for assessing mean group abundances. ANCOM [35] is designed tohandle compositional effects and is based on a Mann-Whitney U test of the12pairwise log-ratios of taxon abundances between populations and has beenshown to have good control of FDR.2.2 Applying generic supervised learning methods tomicrobiome dataIn 2.1, we discussed common methods for exploring the relationship of taxon abun-dances with biological outcome variables. There, we saw that the purpose of manydifferential abundance techniques is for hypothesis testing between sample groups.These approaches can be limited in their ability for extracting features from adataset with complex correlation structures that can be used for classification andpredictive modeling [28]. Inherently, many supervised learning techniques are de-signed specifically for this purpose, and there are a number of generic techniquesthat can be applied to microbiome data. In this work, we define our use of the termgeneric to refer to models that can be applied to a wide array of data types — notstrictly microbiome data.2.2.1 Standard feature selection and classification methods appliedto microbiome dataOverall, past studies have shown success with applying generic techniques to realdata. [28] was the first benchmarking study that applied supervised classificationand feature selection techniques to a set of real microbiome data to show the ap-plicability of machine learning techniques for identifying a reduced set of predic-tive features that achieved improved classification performance from baseline. Thestudy applied Random Forest (RF) [11] and Support Vector Machine (SVM) [61]classifiers using optimal feature sets produced by 4 different feature selection tech-niques. Treated as a filter-based feature selection technique, a recursive featureelimination method called SVM-RFE [25] originally developed for improving can-cer gene set classification showed the best performance. The method is based onoptimizing the SVM cost function as features from a global model are recursivelyremoved, leading to a reduced set of important features to an outcome. Similarly,another benchmarking paper [55] evaluated 18 machine learning classifiers and 5feature selection techniques on 8 microbiome datasets. They found that SVM-RFE13was in general the best performing feature selection technique and that RF andSVM were among the most effective classifiers for microbiome data.While RF was designed as a classifier, the embedded feature ranking charac-teristic of the technique can be used implicitly for feature selection. The method isbased on an ensemble of many weak decision tree classifiers. Randomly weightedlinear combinations of randomly selected subsets of features are evaluated at eachlevel of the decision tree based on their ability to discriminate between categories[28]. This produces a set of ranked features, ordered by importance. The methodis reportedly one of the most effective techniques for microbiome data, which hasshown to be a high accuracy classifier [55, 59] with a reliable feature ranking ca-pability that has been extended and applied with phylogenetic data [39, 62].2.2.2 Normalization for feature selection and classification methodsOne of the primary considerations before applying generic methods would be de-termining the appropriate normalization technique, which isn’t always clear. [64]assessed the application of different normalization techniques for differential abun-dance testing. There is less known about the effects of normalization for machinelearning methods. [45] recommended centered log-ratio (CLR) transformation forLASSO (least absolute shrinkage and selection operator) and kernel-based methods,indicating that cumulative sum scaling (CSS) can have potential consequences.The idea behind CLR transformation is to remove the effects from compositionaldata, which represents relative information, by converting the relative abundancevectors of OTUs into absolute abundance vectors where its changes do not affectother OTUs. This property in general is known as subcompositional coherence [56]. Recent works of [57] showed that CLR does not adequately adjust for compo-sitional effects in three LASSO-frameworks for microbiome data due to a lack ofsubcompositional coherence.2.3 Custom feature selection techniques for microbiomedataIn the previous section, we surveyed literature that applied machine learning meth-ods to microbiome data for feature selection and classification. In this section, we14will provide a review of custom techniques built for microbiome data, starting withnon-phylogenetic aware methods that model only the abundance profiles of micro-biome data. Then, we will provide a review of the phylogenetic-aware methodsthat account for the evolutionary relationships among taxa through the use of thephylogenetic tree, taxonomic definitions, or evolutionary distances. We will startin that section with methods that were built on top of the UniFrac metrics [32, 33],then we will discuss methods which exploit the phylogenetic structure with alterna-tive strategies. For ease, in this work, we refer to this collection of methods simplyas phylogenetic techniques.2.3.1 Non-phylogenetic subset selection methodsWithin the feature selection subdomain of the machine learning field, there are agrowing number of microbiome-specific techniques that are being developed tomodel the relationships in microbiome data. Here, we will start to explain one ofthe first feature selection techniques made for microbiome data — a subset selec-tion technique called Fizzy [22] which was published in 2015. It is a simple greedyalgorithm that performs a forward selection search to find an optimal set of features.A forward selection algorithm is one which starts with a null or empty model withno features, adding features over iterations while optimizing an objective functionto find a feature subset that produces the best score. The scoring functions in Fizzyare based on information theory [54], such as mutual information, which ultimatelyaims to identify the set of features that carry the most information about the out-come variable of interest. Overall, we do not find that the technique addresses anyspecial characteristics of microbiome data, however we highlight this method as itwas among the first publications for a feature selection methods for microbiomedata.selbal [46] is another forward selection technique designed specifically for mi-crobiome data that aims to search for an optimal joint model while accounting forcompositional effects. Different from Fizzy, it does seek to address challenges ofmicrobiome data. Compositional data poses issues in modelling due to the de-pendency of individual features which can produce spurious correlations if notproperly accounted for. selbal addresses composition by modelling the relation-15ship between relevant taxa and an outcome by taking the geometric means of theabundances from two groups of disjoint taxa as a log-ratio that preserves the scale-invariance principle of compositional data analysis [14].2.3.2 Phylogenetic techniques that use UniFrac distancesThe UniFrac distance metric is the most common technique used for computingsample-sample phylogenetic distances. It measures the phylogenetic distance be-tween sets of taxa present within the phylogenetic tree of two communities, takenas the fraction of branch lengths that are unshared in the communities. From 2015onward, a number of publications have built custom phylogenetic-aware methodsfor microbiome data by incorporating the use of UniFrac distance metrics withcommon machine learning techniques such as kernel logistic regression (KLR)[74], SVM [39], and LASSO [72]. By leveraging the UniFrac distance metrics formeasuring distances between microbiome communities, standard modeling tech-niques can be adapted to discover evolutionary correlation structures in micro-biome data.VC-LASSO [72] is based on kernel construction of the OTU table using theUniFrac distance metric in which feature selection is achieved with regularizationof the variance components corresponding to individual kernel matrices for eachtaxon. It was benchmarked against non-phylogenetic group lasso (Elastic Net) [76]which performs feature selection by selecting groups of correlated features insteadof single features. VC-LASSO showed better ability to recover ground truth in sim-ulation.SVM-UniFrac [39] used the UniFrac distance as the kernel function in an SVMmodel for classification, combined with 3 different filter-based feature selectionmethods that are non-phylogenetic techniques. This is technically not a UniFrac-based feature selection technique but a UniFrac-based classifier. The motivation forthe use of the UniFrac kernel as opposed to the generic RBF or linear kernel is thatecological similarity scores would outperform the naı¨ve generic kernels. However,non-phylogenetic measures outperformed the phylogenetic measures in this study.Further, there was minimal improvement with feature selection compared to thebaseline with all taxa using all 3 methods. A major limitation with this work is16that the intention of applying a phylogenetic approach in machine learning wasincorporated at the classification step and not in the feature selection step. If aphylogenetic signal among a reduced set of features correlated with the outcomewas present, the non-phylogenetic feature selection techniques may not be able tomodel that structure.2.3.3 LASSO regularization of the phylogenetic treeThe concept of regularization involves penalizing a model to produce a parsimo-nious form that can improve generalization to new data. The LASSO regulariza-tion term can achieve feature selection by shrinking the coefficients of irrelevantfeatures to zero, producing a model that is dependent on a smaller number offeatures. It has been extended for use with microbiome data a number of times[22, 50, 57, 72] as we had touched on above. The phylogenetic LASSO [50] wasformulated with the same intention and motivation as other phylogenetic featureselection techniques — that there are clustered signals which represent the pres-ence of shared traits among OTUs that are correlated with a biological outcome. Amodel that traverses the phylogenetic tree and penalizes the hierarchical featuresof the tree could achieve a collapsed version that captures only the most importantphylogenetic levels and members. This technique shares characteristics with VC-LASSO [72] in that both apply a LASSO penalty to a model with elements of thephylogenetic tree structure embedded as features to the model. However, they dif-fer in that VC-LASSO treats features as variance components computed by UniFracdistances and the phylogenetic LASSO directly penalizes the tree structure.2.3.4 Feature engineering of the phylogenetic treeIn addition to techniques that extend the use of UniFrac metrics, recent studies haveshown that feature engineering techniques can have promising results that improvefeature selection and classification performance. Feature engineering refers to aprocess that involves re-engineering (e.g. combining, transforming) the represen-tation of the features in the original data structure, such as the phylogenetic tree, toa representation that exposes the data in a different way. Recent studies [40, 62, 63]have shown that a new way to account for phylogenetic information in supervised17learning models could improve classification performance. [63] reconstructed theOTU table, in which all features typically are of uniform taxonomic rank (e.g. fam-ily, genera, OTU). In the case of [63], the leaf node OTUs and its ancestral nodesare included by computing a combined measure of its phylogenetic distance tothe leaf node along with the abundances of the constituting OTUs [62]. This pro-duces an alternative to the OTU table called the phylogeny-aware abundance matrix(PAAM) structure. An extension of this work called Phy-PMRFI [62] encapsulatedthe PAAM transformation in a feature selection procedure that uses the RF featureimportance ranking algorithm with the Gini coefficient index to calculate tree nodeimpurity for evaluating feature relevance to an outcome variable.On the other hand, HFE (hierarchical feature engineering) [40] exploits the phy-logenetic tree structure by assessing the Pearson correlation of each parent-childpairs to collapse redundant pairs to higher levels that otherwise add little value todownstream models. HFE aims to address the challenge of determining which tax-onomic rank to use for analysis. An objective approach to determining the rank isnot well established and thus HFE removes the need to make this decision.Both Phy-PMRFI and HFE introduce a novel way to exploit the phylogeneticstructure and subsequently achieve feature selection that can improve classificationperformance. The methods offer a promising direction towards modelling the phy-logenetic structure through strategies that do not depend on UniFrac distances orthe diluted structure of the phylogenetic tree comprising full microbiome commu-nities, which have been said to be difficult to model.2.4 SummaryA challenge in using the phylogenetic structure in supervised learning models anddifferential abundance techniques is in the diluted signal from leaf node members(OTUs) across the tree structure, largely attributed to sparsity in the data. Meth-ods such as [69, 74] aim to use the phylogenetic structure to pool signal in orderto determine if a phylogenetic correlation structure exists in global microbiomecommunities between sample groups of a biological variable of interest. Featureselection methods such as [50, 72] build on concepts from the works of multivari-ate global association testing methods [4, 69, 74] to offer multivariate statistical18techniques for modeling joint taxon effects on an outcome. Then, methods basedon the principles of feature engineering [40, 62, 63] took an alternative approach toleveraging the information embedded in the phylogenetic tree for discovering phy-logenetic correlation structures. The methods reviewed in this section have pavedway for our question of the performance of the phylogenetic approach in featureselection in comparison to non-phylogenetic counterparts. In this work, we aim toperform a comparative analysis that incorporates a subset of the above methods ina classification setting applied to both simulated and real data.19Chapter 3MethodsThis chapter details the approach used to perform a comparative analysis of differ-ent techniques for identifying important features in microbiome data with differenttypes of correlation structures. We focus our core methods on feature selectiontechniques applied to 16S rRNA sequencing data.We start in section 3.1 to describe tests that we use to assess for global mi-crobiome compositional shifts. Assessing for global association with multivariatemethods can be used as a proxy for testing whether a signal between the micro-biome and a biological outcome exists. In identifying signal between a host phe-notype and overall microbiome composition, we apply feature selection techniquesoutlined in section 3.2 that can be used for identifying the key microbes driving anassociation. We will describe two types of feature selection techniques that aredivided into two classes: phylogenetic and generic techniques.In our comparative analysis, we use our described feature selection techniquesto identify the important features to an outcome on both simulated and real data.In section 3.3, our simulation methodology describes our synthetic data genera-tion process designed to test feature selection techniques with different correlationstructures. We apply each set of our feature selection methods to two types ofsimulated data with the following induced correlation structures: phylogenetic andnon-phylogenetic. We aim to assess the feature selection techniques on the twotypes of data in order to gain an understanding of how the methods perform inthe presence and absence of phylogenetic correlation structures. Our goal is to20assess whether generic techniques are capable of recovering phylogenetic corre-lation structures and to answer our question of whether phylogenetic techniquesare more appropriate than generic techniques for problems where a phylogeneticsignal exists.We apply our understanding of the comparative analysis on simulated data toaid our interpretation of the presence and absence of phylogenetic correlation struc-ture in real data. As such, in section 3.4, we provide detailed information on thereal microbiome dataset that is used in this work with a description of the host phe-notypic outcomes we focus on. We outline in section 3.5 a collection of techniquesthat are used for normalizing microbiome data in order to improve modelling theunderlying biological signal.Finally, in section 3.6, we detail metrics we use to evaluate the performanceof our feature selection benchmarking procedure on simulated and real data. Inthat section, we will also provide an explanation for our approach on assessing theperformance of feature selection subsets in a classification setting.3.1 Testing for association in global microbiomecompositional shiftsIn this section, we describe a global microbiome association testing technique usedto detect compositional shifts in microbiome communities over the set of all taxapresent. MIRKAT (microbiome regression-based kernel association test) [74] is amultivariate beta-diversity test that is designed to use distance and dissimilaritymetrics, with a transformation, to produce kernel matrices for modelling in a semi-parametric regression framework. The method has been extended multiple times[16, 42, 66, 73] and shares similarities with other techniques [4, 69]. Using a fullset of taxa within communities, the method measures association between covari-ation in the communities with covariation in an outcome variable of study. In thiswork, we employed MIRKAT with four different kernel functions which are dis-tances and dissimilarity matrices: Jaccard, Bray-Curtis dissimilarity, unweightedUniFrac, and weighted UniFrac. Bray-Curtis dissimilarity [7] is a common metricfor studying the differences between ecological communities based on the abun-dances of taxa present in each site. Jaccard is a non-quantitative metric that mea-21sures presence and absence of taxa between sites only. The UniFrac distancesuse the phylogenetic tree to compute distances. Both measure distances betweencommunities based on the fraction of the branch lengths that leads to all descen-dants which are not shared between the two communities. One difference is thatthe weighted UniFrac metric adds a weight to the branches corresponding to anabundance level. Thus, the unweighted UniFrac is considered a non-quantitativetechnique that accounts for the presence and absence of taxa only.As additional support, we also employed PERMANOVA [4]. Both are super-vised techniques that assess separation of patterns that cluster according to an out-put. However, PERMANOVA uses the output of dissimilarity and distance metricsdirectly for assessing an association with sample groups while MIRKAT uses simi-larities from kernel transformation.3.2 Feature selectionOur comparative analysis involves benchmarking feature selection techniques thatcan be separated into two classes: phylogenetic techniques and generic techniques.In this work, we will refer to methods that are used for feature ranking as featureselection techniques for simplicity. We will also state that for all model setups, thedefault parameters are used.3.2.1 Phylogenetic feature selection methodsAmong the phylogenetic techniques are two kernel methods, one which is a step-wise method and another which is a sparse regression method. The third techniqueis HFE, a feature engineering method. In an ideal scenario, a larger set of phylo-genetic techniques would have been tested. However, due to a lack of softwareavailability, we were not able to test [50, 62, 63, 72]. As such, we will describethe two kernel methods that we extended to perform feature selection and modelphylogenetic relationships using the UniFrac kernel. Then, we will describe thefeature engineering method.Phylogenetic kernel methods: The first kernel method involves wrapping theMIRKAT model in a recursive feature elimination (RFE) algorithm for finding a sub-set of features that optimizes the p-value obtained from the association between an22outcome variable and a set of microbial taxa. Searching for the most relevant fea-tures starts with a global model with all taxa, evaluating every single subset withp - 1 features, where p is the current number of remaining features. This is per-formed until all features are ranked from most relevant to least relevant. The sec-ond is HSIC-LASSO which was designed as a generic feature selection technique.It achieves feature selection with LASSO regularization of a kernel matrix contain-ing pairwise transformations of the features instead of the samples, which is moretypical in kernel construction [30, 70]. It shares similarities with [72] which bothapply regularization to regression models with kernel representations of the fea-tures to obtain a sparse model with only the most relevant features to an outcome.For both kernel methods, using a phylogenetic distance metric as the kernel func-tion can achieve modelling the dependency between features and outcome throughthe evolutionary relationships among features. Due to the popularity of the UniFracdistances, we employed two types: unweighted UniFrac and weighted UniFrac.Feature engineering methods: The HFE algorithm [40] for feature engineer-ing and feature selection involves a bottom-up traversal of the phylogenetic treeand computing parent-child correlations to find redundant pairs. Child nodes arediscarded if redundant correlations are found, producing a smaller tree with leafnodes at potentially different taxonomic ranks. The resultant tree undergoes furtherfiltering steps. Again, traversing from leaf to root, a measurement of InformationGain (IG) for each node on the path is calculated with respect to a binary outcome.Then, in order to handle OTUs with incomplete taxonomic information such as theresult of low confidence taxonomic classifications during the processing of 16SrRNA sequencing data, an IG-based leaf node filtering is applied to filter out unin-formative OTUs. The final set of features represents all nodes that are consideredinformative to the outcome, which can comprise of nodes at any levels in the tree(e.g. phylum, family, OTU). In this work, if an informative node in the resultantfeature set from HFE represents a higher order rank, we extract the constitutingOTU features to represent the full feature set produced by HFE.233.2.2 Generic feature selection methodsIn section 2.2 we described generic methods as those techniques that can be appliedto a wide set of data types. They are methods which are not specifically designedfor modelling phylogenetic relationships and include popular state-of-the-art ma-chine learning methods. Among the generic methods, we test six methods whichcan be further divided into the following categories: regularized regression, featureranking, and kernel methods.Regularized regression methods: Within the generic machine learning meth-ods, we employed two regularized regression methods. LASSO [60] and Elastic Net[76] comprise the regularization-based methods which selects features by shrink-ing the coefficients of irrelevant features to zero. The features with non-zero coeffi-cients are thus the selected features. Elastic Net has a slight difference from LASSOin that it is designed to select full groups if multicollinearity among features exist,rather than selecting only a single relevant feature from a group of correlated fea-tures. Studies have sought to benchmark Elastic Net with other methods due to itsnatural application for capturing groups of correlated abundances among taxa inmicrobiome data, hence its inclusion here.Feature ranking methods: Two additional common machine learning meth-ods were employed among the techniques meant for generic data types. Logisticregression with an RFE (LR-RFE) algorithm achieves feature selection similar toMIRKAT-RFE however without kernel computation of the microbiome samples. Anadded positive characteristic is that LR-RFE feature ranking is based on optimizingthe cost function for the model instead of p-value optimization in MIRKAT-RFE.The second feature ranking method is Random Forest (RF) importance ranking. Aswas earlier introduced, RF is a classification technique that has an inherent rankingcharacteristic that can be used to evaluate feature importance. Feature importanceranking is a result of the decision tree evaluation process in an RF model, whereeach node involves evaluating a decision split of a random set of features. Thecriterion applied to evaluate the quality of a split is the Gini impurity, which quan-tifies the probability that a split leads to misclassification of an outcome label. Thetraversal process will produce a ranked set where features associated with nodeswith the greatest decrease in impurity are at the start of the tree, while those lead-24ing to small decreases in impurity are at the end.Generic kernel methods: In 3.2.1, we introduced two kernel methods (MIRKAT-RFE and HSIC-LASSO). For the generic methods, we employ those two methodswith generic kernels that are applicable to a wide array of problems. The Gaussiankernel is applied for HSIC-LASSO and the Radial Basis Function (RBF) kernel isapplied with MIRKAT-RFE. A second kernel, Bray-Curtis is a special case of thegeneric kernels used with MIRKAT-RFE as it was originally designed as a dissimi-larity metric for ecological studies that is popular for microbiome data. In a sense,it could be considered a microbiome-informed, non-phylogenetic method.3.3 Simulation studyIn this section, we outline a simulation methodology to objectively test feature se-lection techniques to focus on two goals: (1) assess the ability of different methodsfor recovering ground-truth features in the presence and absence of induced phy-logenetic correlation structures, and (2) assessing the performances of individualmethods on simulated microbiome data over a set of parameterized conditions. Ul-timately, we wish to gain insight on the applicability of different methods, bothphylogenetic and generic techniques, in the presence of different types of correla-tion structures found in associations between host phenotype and the microbiota.To address (1), we propose a data simulation pipeline to generate two differenttypes of synthetic datasets (Figure 3.1) with the following correlation structures:ground-truth features with phylogeny-induced correlation, or ground-truth featureswith phylogeny-independent correlation. To address (2), we chose six commonparameters characteristic of microbiome data and performed a parameter sweepover four levels each to produce a set of multi-dimensional parameterized datasetscorresponding to different conditions. The process of inducing correlation structurestarts with picking ground-truth features from a calibration dataset. In our work,we used the CHILD study 16S microbiome data as our calibration dataset (N = 736for 3-months and N = 738 for 1-year), described more in 3.4.Phylogeny-induced ground-truth: For phylogeny-induced correlation, we adoptedand extended the strategy from [15] to simulate ground-truth with phylogeny-induced correlation structure. This process is dependent on using the phylogenetic25Phylogenycalibration dataPhylogenycalibration dataType A Type BFigure 3.1: Schematic for the synthetic data generation procedure. The pro-cess involves picking the ground-truth features from the calibrationdataset, setting the simulation parameters, then feature distribution esti-mation and count modelling.tree corresponding to the taxonomic definitions in the calibration OTU table. Tak-ing the phylogenetic tree, we use PAM (partitioning around mediods) to clusterthe phylogenetic distances of pairwise taxon, computed with the patristic distancemetric. This produces clusters of OTUs with different levels of phylogenetic re-latedness. Ground-truth features used for assessing feature selection are randomly26sampled from a cluster of related members. Greater partitioning leads to clusterswith more phylogenetically related OTUs. This measure is additionally parameter-ized over four levels in order to control the relatedness level of ground-truth OTUs.The k hyperparameter that controls the number of clusters formed from PAM is aproxy to the phylogenetic relatedness level. We fixed the parameter to take on thefollowing four levels: k = 4, 6, 8, 16.Phylogeny-independent ground-truth: For phylogeny-independent correlation,the process is simpler. It involves randomly selecting OTU features as ground-truthbased solely on similar abundance levels. An intended phylogenetic structure is notinduced, and members could be phylogenetically related to various degrees how-ever this is not controlled as the selection is random. The process starts with sortingthe variance-adjusted mean abundances of each OTU, then we split the OTUs intoquantiles based on their abundances. The quantiles correspond to four differentabundance levels (low, medium low, medium high, high), which is one of the sim-ulation parameters. Parameterizing abundance level thus involves picking featuresfrom a single abundance quantile corresponding to the specific parameter level.Similar to this process, parameterizing the abundance level for synthetic data withphylogeny-induced ground-truth involves picking features from a cluster with acorresponding variance-adjusted mean abundance level for the specific parameterlevel.Upon picking the set of ground-truth features of a specific correlation structure,parameter estimation of feature distributions and count modelling is performed.We employ the sparseDOSSA package [1] to handle this, along with spiking cor-relations between ground-truth features and a binary synthetic outcome generatedwith a Bernoulli distribution. sparseDOSSA models the marginal distribution ofeach OTU feature as a truncated, zero-inflated log-normal distribution. The modelis fit to the calibration dataset in order to parameterize the microbes present in theCHILD study 16S microbiome data (section 3.4).Six simulation parameters underwent a parameter sweep over four levels each.Abundance level and phylogenetic relatedness were described above. The otherfour are: sample size, number of ground-truth features, microbe-microbe corre-lation, and minimum taxa prevalence. Minimum taxa prevalence is defined by aminimum percentage for which a taxon is present over all samples. It is a common27filtering step in microbiome analysis. The number of ground-truth features is ofinterest because the number of microbial taxa correlated with an outcome can varydepending on the condition. We aim to assess whether specific methods are betterwith smaller or larger sets of true features. Similarly, microbe-microbe correlationadds noise to the data since multicollinearity can increase the difficulty to sepa-rate signal from noise, possibly resulting in the selection of correlated but incorrectfeatures by a feature selection method.3.4 CHILD study dataIn this section we introduce the real dataset we apply our methods to. The CanadianHealthy Infant Longitudinal Development (CHILD) cohort study [58] is a nation-wide effort to collect comprehensive biological, psychological, genetic, and en-vironmental data on families across four provinces. Studying asthma and otheratopic diseases are among the key applications of this data. In this work, we usethe 3-month (N = 836) and 1-year (N = 857) microbiome samples of infants fromthe CHILD study.Sequencing and preprocessing of 16S microbiome samples were performed bythe works of [5, 10]. The microbiome samples are based on amplicon sequencingof the V4 hypervariable region of the 16S rRNA gene extracted from faecal samplesof infants at different time points. PCR amplification of the gene was performed us-ing universal bacterial primers (V4-515f: V4-806r). Pooled PCR amplicons under-went paired-end sequencing on a Illumina MiSeq platform. Using QIIME2, readswere assembled to produce 247 bp reads that were referenced against the Green-genes database (v13.8). Taxonomic classification of a resulting 3916 unique OTUswere formed from classication with a naı¨ve Bayes classifier trained on referencereads extracted from the reference database at 97% sequence similarity. An addi-tional filtering step to remove very low abundant OTUs (present in less than 5%)was applied, resulting in 174 OTUs.For our comparative analysis, our covariates of interest include the asthma di-agnosis variables at 3-years and 5-years of age, as well as other phenotypic andenvironmental data including breastfeeding duration, mode of birth delivery, andtime point. We treat these variables (asthma, breastfeeding, mode of birth delivery,28time point) in our modelling as binary outcomes for feature selection and classifi-cation.3.5 Count normalizationIn this section, we introduce 8 normalization techniques that were benchmarkedfor differential abundance testing and beta-diversity analysis in [64]. Normalizingfor library size differences among samples is the main target for most methods andthis is attributed to the large effect that uneven library size has on the modelling oftaxon abundance profiles.At this point in time, there is no accepted standard for normalization of micro-biome data. Further, a single normalization technique appropriate for all modellingtasks does not exist. Since different methods could have different effects on down-stream uses, we aim to assess the effects that different normalization techniqueshave on feature selection methods applied to our real data. Our methods are sum-marized below:1. Rarefying: Random subsampling counts without replacement until an evensequencing depth across all samples. The sequencing depth chosen is basedon determining an optimal point on a rarefaction curve that maximizes li-brary size while minimizing the loss of species diversity. Rarefying cansuffer from sensitivity and power loss due to the discarding of samples thatdon’t meet the minimum sequencing depth requirements [38]. However, themethod remains the most popular way to account for the effects of librarysize differences.2. Total sum scaling (TSS): Counts in each sample are divided by total librarysize.3. Relative log expression (RLE): Computes a reference sample by taking thegeometric mean for each feature, used to compare against all other samples.A fold change log-ratio is computed for each feature for a sample and themedian ratio for the sample is taken to be the RLE scale factor for normaliz-ing counts sample-wise [18, 19, 31].294. Trimmed mean of M-values (TMM): The scale factor is computed by tak-ing the weighted mean of log-ratios between each pair of samples [31, 47,64]. The highest count OTUs and OTUs with the largest log-fold change are’trimmed’ prior to scale factor calculation. An additional form is included,TMMwzp, which is the zero-inflated version of TMM to handle data with ahigher proportion of zeros such as microbiome data.5. Upper quartile (UQ): Computed by multiplying each sample by a scale fac-tor determined based on its count distribution at the 75th percentile followedby log transformation of the counts [13, 64]. While a different percentile canbe used, the rationale is that the counts should be high enough at the chosenpercentile to fall outside the range of the zero and extremely low counts thatare not informative to sequencing depth [26].6. Centered log-ratio (CLR) transformation: CLR does not address library sizedifferences. It is an extension of the Aitchinson’s log-ratio transformation fordealing with compositional data [2] where all features are bounded within asimplex and thus are dependent, representing relative information [14]. TheCLR transformation pulls the data out of the simplex by dividing each featureby the geometric mean of all features then applying a log transformation.For our normalization methods, we use the following R packages: edgeR forfour of the scaling methods (RLE, TMM, TMMwzp, UQ), compositions for CLRcomputation, and phyloseq for rarefying. For methods where a log transformationis applied, a pseudocount of 1 is added to circumvent invalid log transformationwhen the count is zero, which is prevalent in microbiome data.3.6 Evaluation metrics and performance assessmentTo evaluate feature selection techniques, we employ one evaluation metric each forour simulated versus real data.3.6.1 Evaluation metric for simulated dataTo benchmark feature selection methods on our simulated data, we use the F1 score,computed as the weighted mean of precision and recall. Since the ground-truth fea-30tures for our simulated data are known, we use the metric in an alternative mannerthat is independent of classification performance which is evaluated in terms of amodel’s ability to classify samples. Instead, we use F1 to directly evaluate featureselection capability by evaluating how well a model recovers ground-truth features.Normally, the confusion matrix is used to quantify the frequency that a modellabels a sample correctly and incorrectly as the positive versus negative class. Inour usage, the confusion matrix quantifies the frequency a model labels a featurecorrectly or incorrectly as the ground-truth versus not ground-truth feature. Thatis, our positive label corresponds to the set of ground-truth features, while ournegative label corresponds to all other features. Thus, in our terminology in thissection, we’ll refer to ’labeling a feature’ to be synonymous to ’selecting a feature’or ’deselecting a feature’.F1 is suitable because it puts emphasis on how well a model can correctly labelthe ground-truth features (true positives) while penalizing the overall score basedon how frequently a model incorrectly labels a feature (false negatives and falsepositives). The range for the F1 score is between 0 and 1, with 1 indicating perfectability to recover all of the ground-truth features without selecting any incorrectfeatures, and 0 indicating no ability to recover any ground-truth features.F1 = 2· precision·recallprecision+ recall =T PT P+ 12 ·(FP+FN)(3.1)3.6.2 Evaluation metric for real dataTo evaluate the performance of feature selection techniques on our real data, weuse the area under the curve (AUC) of the receiver operator characteristic (ROC)curve (AUROC), a common metric for evaluating binary classifiers. We use themetric as it was designed, to classify samples as being a positive class (e.g. dis-ease case) or negative class (e.g. healthy control). The ROC curve is a probabilitycurve that plots model discrimination ability measured by true positive rate (TPR= sensitivity) against false positive rate (FPR = 1 - specificity) at various thresh-old values. Threshold values correspond to prediction probability thresholds thatassigns a prediction to one of the possible classes. If the threshold is 0.5, then aprediction greater than 0.5 is assigned to the positive class, and those equal to or31less than 0.5 are assigned to the negative class. For a classification problem withperfectly balanced classes (e.g. 50% cases and 50% controls), the default thresholdof 0.5 can be a good fit. However, since severe class imbalance is a common prob-lem for biomedical applications such as disease diagnosis prediction, evaluationmetrics robust to class imbalance are ideal. The AUC of the ROC can address thisbecause it does not assume that the classes are balanced, and enables finding anoptimal threshold for which TPR is maximized and FPR is minimized.The AUC score represents the integral of the ROC curve, with a range of 0to 1. A score of 0 indicates that at all possible threshold values, the classifier ispredicting all samples incorrectly (negative as positive and positive as negative).At AUC > 0.5, there are thresholds that exist where the model is better able todiscriminate the positive class from the negative class. At AUC of 1, there is athreshold where the model has perfect classification ability. At AUC of 0.5, theclassifier is predicting randomly and is at best predicting at chance level.T PR =T PT P+FN(3.2)FPR = 1− T NT N +FP(3.3)3.6.3 Performance assessmentTo benchmark feature selection techniques on both our simulated and real datausing the evaluation metrics introduced in 3.6.1 and 3.6.2, we compare the perfor-mances of individual subsets of features produced by each method. We take eachfeature selection set produced by a method, ordered from most to least importantand evaluate the performance of the subsets. We start with the most important fea-tures (e.g. top 3) and increase the size of the subset by adding more features witheach performance assessment. For a given condition and dataset of study, we takethe top score corresponding to a number of top features produced by the methodas the best score for that method. For our real data, we use an SVM classifier withdefault parameters to evaluate classification performance. For our simulated data,we also use an SVM classifier, however we note that feature selection F1 is not de-32pendent on classification performance and strictly dependent on feature selectionability.33Chapter 4ResultsIn this chapter, we summarize the results of feature selection benchmarking onsimulated and real data. In 4.1, the results of exploratory analysis of asthma andthe microbiome using CHILD study data 3.4 are presented. We first show results ofglobal association testing of the microbiome data with environmental and clinicalcovariates. Next, in 4.1.2 we show the results of searching for a microbial signatureusing a recursive feature elimination technique, benchmarked against a collectionof standard machine learning feature selection methods.In order to test the suitability of the feature selection techniques used in 4.1.2,in 4.2 we evaluate our methodology through a simulation study that investigatesthe behaviour of different methods in the presence and absence of induced phy-logenetic correlation structures in synthetic data. We start by showing compara-tive aggregate performances of two classes of feature selection methods on twotypes of simulated datasets. Further, we divide our sections to present specific re-sults on how each class of methods behaves on each type of dataset. 4.2.1 and4.2.3 shows results of each class of methods on simulated data with phylogeny-induced correlation. 4.2.2 and 4.2.4 shows results of the methods in the absenceof phylogeny-induced correlation. Finally, in 4.3, we show results of the effectsof different simulation parameters on feature selection. Comparative results showperformance assessment using the F1 score for feature selection of simulated data.To test the transferability of the behaviour and performance of feature selectionmethods on simulated data, in 4.4, we show results of the feature selection tech-34niques applied back to real data. In section 4.4.1 and 4.4.2, we present results ofapplying feature selection techniques to CHILD study data (3.4) using higher andlower signal outcome variables. Next, in 4.5 we show the effects of normaliza-tion, sample size, and phylogenetic aggregation on feature selection with real data.Here, we use AUROC to evaluate the performance of the classification models usingreduced feature sets from feature selection.4.1 Analyzing the relationship of the microbiome andasthma4.1.1 Association testing of asthma and the microbiomeTable 4.1 and 4.2 shows a subset of the results of exploratory analysis of the rela-tionship between environmental and clinical covariates with the microbiome sam-ples at 3-month and 1-year respectively. The values in the table are p-values (ad-justed with Benjamini-Hochberg) obtained from running kernel-based regressionmodels using MIRKAT with four different kernel functions that model differenttypes of relationships in microbiome data.Our goal in this section was to assess the CHILD study data for signal relevantto the disease phenotypes for asthma. Our access to the 3-month and 1-year mi-crobiome samples in addition to a large set of environmental covariates allowedus to perform a large-scale analysis of the relationships between the microbiotaand the clinical and environmental covariates. We observed that a larger numberof covariates are associated with the 3-month microbiota compared to the 1-yearmicrobiota.Table 4.1 and 4.2 shows that the disease phenotypes for asthma at 3-years isassociated with the 3-month microbiota and that asthma at 5-years is associatedwith both the 3-month and 1-year microbiotas. The MIRKAT model we used formicrobiome analysis is a global association testing technique that uses the unionof all taxa across the samples for assessing the correlation between the covariationof taxa abundances between samples with the covariation of an outcome variable.The association we observed between the 3-month microbiota and the disease phe-notypes were statistically significant when using the unweighted UniFrac kernel35function, a phylogenetic distance metric. This metric is useful for modelling phy-logenetic relationships in the microbiome, and is particularly useful in capturingdifferences in the presence or absence of rare taxa.Our covariate analysis also revealed a strong association between the micro-biota (3-month and 1-year) and exclusive breastfeeding status (up to 3-months ofage) using all four kernel types. We also observed an association between birthby vaginal delivery and the 3-month microbiota, with stronger signal using thephylogenetic kernel functions (weighted UniFrac and unweighted UniFrac). Forboth covariates, while an association with the 1-year microbiota of the subjects ispresent, it is less significant compared to the 3-month microbiota. Since we alsoobserved signal with asthma at two diagnostic time points (3-years, 5-years) withthe 3-month microbiome samples, our further analyses focus primarily on 3-monthsamples.Kernel FunctionCovariate BC Jaccard wUniFrac uwUniFracBreastfeeding up to 3-months p < .0001 p < .001 p < .01 p < .00001Birth by vaginal delivery 0.01 0.02 p < .0001 p < .01Antibiotics (1st year) 0.78 0.80 0.86 0.60NO2 (1st year) 0.43 0.40 0.75 0.35Walkability 0.02 p < .01 0.49 0.09Food allergy at 3 years 0.12 0.16 0.04 0.02Atopy 0.06 0.12 0.16 p < .001Asthma at 3 years 0.26 0.28 0.69 0.01Asthma at 5 years 0.17 0.22 0.20 0.03Table 4.1: Table of p-values (p) for association tests (MIRKAT) between the3-month microbiome samples and a subset of clinical and environmentalcovariates using four different kernel functions. The models control forage and batch effects. P-values are adjusted with Benjamini-Hochberg.4.1.2 Searching for a microbial signature using feature selectiontechniquesGiven that MIRKAT tests the association between the full set of taxa across all sam-ples against an outcome, our goal in this section was to identify the important taxon36Kernel FunctionCovariate BC Jaccard wUniFrac uwUniFracBreastfeeding up to 3-months p < .01 p < .01 p < .01 p < .01Birth by vaginal delivery 0.54 0.55 0.22 0.03Antibiotics (1st year) 0.77 0.8 0.25 0.26NO2 (1st year) 0.54 0.53 0.65 0.02Walkability 0.42 0.34 0.22 0.19Food allergy at 3 years 0.94 0.83 0.91 0.24Atopy 0.37 0.08 0.74 0.28Asthma at 3 years 0.84 0.85 0.41 0.17Asthma at 5 years 0.33 0.34 0.49 0.02Table 4.2: Table of p-values (p) for association tests (MIRKAT) between the3-month microbiome samples and a subset of clinical and environmentalcovariates using four different kernel functions. The models control forage and batch effects. P-values are adjusted with Benjamini-Hochberg.that drive the association observed. When encapsulating the MIRKAT test within arecursive feature elimination algorithm, we test for the effect of removing singletaxon from the global MIRKAT model to find the important taxa that drive an asso-ciation. The reduced taxa sets that were produced by this technique had variableresults (Table 4.3 and 4.4), producing a statistically significant result for asthma at3-years but not for asthma at 5-years. Next, we narrowed down our analysis in thissection to focus on a single disease phenotype (asthma at 3-years) for compara-tive analysis against a collection of standard feature selection techniques (LASSO,Elastic Net, LR-RFE). This would allow us to obtain additional evidence for theperformance of the reduced feature set produced by MIRKAT-RFE. From this work,we obtained reduced feature sets for six models and tested the feature sets in anSVM classification model. We compared the classifiers containing reduced featuresets against a baseline SVM model with a full set of taxa features (174 OTUs). Wefound that there was an improvement from the baseline, however poor performanceoverall (under AUROC of 0.65) for each reduced feature set. We used AUROC ofthe classifiers to compare performances between the methods.37Taxa Features MIRKAT (uwUniFrac) PERMANOVA (BC)Full set 0.18 0.06MIRKAT-RFE candidate set p < .01 0.03Table 4.3: Table of p-values of a global model of asthma at 3-years for 3-month microbiome samples using a test set of samples from a MIRKAT-RFE procedure, showing improved performance both with MIRKAT andPERMANOVA using the candidate set.Taxa Features MIRKAT (uwUniFrac) PERMANOVA (BC)Full set 0.61 0.134MIRKAT-RFE candidate set 0.86 0.72Table 4.4: Table of p-values of a global model of asthma at 5-years for 3-month microbiome samples using a test set of samples from a MIRKAT-RFE procedure, showing worse performance with the candidate set.4.2 Evaluating feature selection performance onsimulated dataThe results from section 4.1 led us to re-evaluate the feature selection methods wehad applied. In 4.1.1, we observed an initial signal using a phylogenetic ap- proachwhen evaluating the global relationship between all 3-month microbiota taxa andasthma. Due to the fact that our feature selection procedure from 4.1.2 lacked aphylogenetic technique, we sought to investigate whether the inclusion of a methodthat can identify features with a phylogenetic signal can improve our results. Be-fore applying our methods to real data, we first conducted a simulation study inorder to investigate how the methods perform on synthetic microbiome data undervarious simulated conditions. We present three factors that motivate our need fora simulation study. First, our real data represents novel research data where ourdisease phenotype of interest lacks a ground-truth. This presents a challenge forvalidating our methods. Second, there are techniques we apply in this work whichhave not yet been formally tested or applied to microbiome data. HSIC-LASSO,for example, has limited reports of application to microbiome data. On the otherhand, MIRKAT-RFE and HSIC-LASSO with the Bray-Curtis and UniFrac metricsare custom extensions of existing techniques that were created for this project. The38performance and behaviour of these methods are not yet known. Third, if a phylo-genetic correlation structure is present in the relationship between asthma and themicrobiome, we are uncertain whether standard machine learning feature selectionmethods could recover such ground-truth correlation structure. In this section, weseek to alleviate these challenges by testing our methods on a large set of syntheticdata with known ground-truth correlation structures of different types found in realmicrobiome data. We aim to gain insight on the behaviour and applicability ofdifferent types feature selection techniques on microbiome data.phylogenetic methods generic methods00. Ground-truth Synthetic Dataphylogenetic methods generic methods00. Ground-truth Synthetic DataFigure 4.1: A summary of the best average performances of all feature se-lection techniques across 300 synthetic datasets for different simulatedconditions. Feature selection techniques are binned into two classes,where six are generic (blue) and two are phylogenetic (pink). Here, theperformances of the two classes of methods across all simulated con-ditions are shown for datasets with A) phylogeny-induced structure inthe ground-truth and B) phylogeny-independent structure in the ground-truth.4.2.1 Phylogenetic methods for finding phylogeny-induced featuresIn Figure 4.3, an effect from increasing the phylogenetic relatedness level is notobserved, however an aggregate performance score shown in Figure 4.1 reveals thatphylogeny-induced correlation structure among the ground-truths have an effect onthe performance of phylogenetic techniques. Higher performance is observed in39Simulation Parameter Level 1 Level 2 Level 3 Level 4Phylogenetic Relatedness 4 6 8 16Abundance Level (Mean) < 40 40 to 64 65 to 150 > 150Number of True Features 3 or 5 5 or 9 8 or 14 11 or 24Microbe-Microbe Correlation 0.0 0.30 0.55 0.80Minimum Taxa Prevalence 2% 6% 10% 14%Sample Size 100 250 750 1500Table 4.5: Table of simulation parameter levels, showing four levels from lowto high (left-right). Bolded values are the default configuration for fixedparameter values in the synthetic data generation process.the phylogeny-induced data than on the phylogeny-independent data which is inabsence of an intended phylogenetic relationship among ground-truth features.In Figure 4.2, feature selection benchmarking with a set of synthetic data thattested extreme low and high signal scenarios revealed that MIRKAT-RFE with theweighted UniFrac kernel performed poorly. Thus, for the other steps in perfor-mance benchmarking, this method was eliminated for both reasons of high com-putational cost and poor performance. Elimination of this method also includes itsnon-phylogenetic kernels.In Figure 4.3, we observe that HSIC-LASSO with the unweighted UniFrac ker-nel performs poorly across the simulated conditions. HFE on the other hand showsa degree of randomness in its performance across the conditions, with high perfor-mance in some and poor performance in others. In investigating the random natureof this method’s performance, repeated simulation performed in the extreme lowand high signal conditions in Figure 4.2 shows that HFE consistently does very wellin two scenarios that are considered very high signal conditions. These simulationconditions in Figure 4.2 are a variant of the simulation conditions presented else-where. The synthetic datasets for these conditions differ in that for the phylogeny-induced ground-truths, all members of a phylogenetic group are used instead of arandom selection of the members in a group. We observed that HFE did very wellin two very high signal conditions tested: when the abundance of the phylogeneticgroup is high, and when the phylogenetic group has a greater ratio of OTUs togenera, indicating that it represents greater phylogenetic and taxonomic similarity40at lower depth in the tree. Thus, we observe that HFE shows the most consistentfeature selection capability on phylogeny-induced ground-truth when the signal isvery high. We would also like to note in this section that since the two other formsof phylogenetic techniques performed poorly, any higher performances represent-ing phylogenetic techniques are driven by the performance of HFE.4.2.2 Phylogenetic methods for finding phylogeny-independentfeaturesWe observe that the average feature selection performance of phylogenetic tech-niques in Figure 4.1 perform better when finding phylogeny-induced ground-truthfeatures than it does for phylogeny-independent features. We highlight in Figure4.2 that HFE does very poorly in the high signal phylogeny-independent conditionswhere the top ten most abundant OTUs from the calibration dataset were used asthe ground-truth features. This indicates that the phylogenetic techniques we testedare poor at recovering the true features in the absence of an induced phylogeneticstructure.4.2.3 Generic techniques for finding phylogeny-inducedground-truthFigure 4.1 shows that generic methods are robust in recovering ground-truth fea-tures that have an induced phylogenetic structure. To be specific, RF and HSIC-LASSO with the Gaussian kernel performs with high feature selection F1 scoreconsistently across most simulated conditions. In the extreme low and high signalconditions (Figure 4.2), we found that the top performing generic methods werealso robust at both low and high signal phylogeny-induced ground-truth conditions.In Figure 4.4, we observe lower performances from the following methods, in de-scending order of average best performance for all simulated conditions: LR-RFE,HSIC-LASSO (Bray-Curtis), LASSO and Elastic Net.410 5 15 30 50 1000. type: Phylogeny-InducedRelatedness by Ratio of OTUs/GeneraCondition: low similarity deep in the tree0 5 15 30 50 1000. type: Phylogeny-InducedRelatedness by Ratio of OTUs/GeneraCondition: high similarity deep in the tree0 5 15 30 50 1000. type: Phylogeny-InducedCondition: low abundance0 5 15 30 50 1000. type: Phylogeny-InducedCondition: high abundance0 5 15 30 50 100number of features0. type: Phylogeny-IndependentCondition: low abundance0 5 15 30 50 100number of features0. type: Phylogeny-IndependentCondition: high abundanceLASSOElastic NetHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)LR-RFEMiRKAT-RFE (wUniFrac)MiRKAT-RFE (Bray-Curtis)MiRKAT-RFE (RBF)HFERandom ForestHSIC-LASSO (uwUniFrac)Extreme Low and High Simulated ConditionsFigure 4.2: Performance of both classes of feature selection techniques onextreme low and high signal conditions on synthetic data. Each con-dition is repeated 5 times and the performance represents aggregatesscores. Phylogenetic techniques (3 total) are shown, as well as generictechniques (8 total). Six conditions are simulated for three conditioncategories defined by structure of the ground-truth features: phylogeny-induced (related by phylogenetic distance as well as taxonomic classifi-cation), phylogeny-induced (related by phylogenetic distance only), andphylogeny-independent.420 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. NetRandom ForestLR-RFEHFEHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)Feature Selection of Simulated Phylogeny-Induced Ground-Truth (Phylogenetic Relatedness vs. Sample Size)Figure 4.3: Performance of both classes of methods on phylogeny-inducedsynthetic data at different levels of phylogenetic relatedness (0, 4, 8, 12,16) and sample sizes (100, 250, 750, 1500). All other parameters areset to the default configuration in 4.5. An effect from sample size isobserved (left-right), but an effect from phylogenetic relatedness is notobvious (top-down).4.2.4 Generic techniques for finding phylogeny-independentground-truthFigure 4.1 shows that generic techniques have good performance in recovering thephylogeny-independent ground-truth features. As we expected, we also observe430. (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)HFELR-RFERandom ForestElastic NetLASSOAggregate Performance Across Simulated Parameter Conditions on Phylogeny-Induced Ground-Truth DataFigure 4.4: Aggregate performance of feature selection techniques onphylogeny-induced synthetic data across six simulation parameters(top-down) with four levels each (left-right). Each bar represents amean score for all possible instances of the parameter value. The sixsimulation parameters (top-down) are A) phylogenetic relatedness, B)abundance level, C) number of true features, D) microbe-microbe cor-relation, E) minimum taxa prevalence, F) sample size. There is an effectobserved from abundance level, the number of true features, and samplesize.that the generic techniques are better than the phylogenetic techniques on this task.Overall, we found that generic techniques are robust at recovering true features insimulated microbiome data, whether or not phylogeny-induced structure is presentin the ground-truth features.440. abund=medlow0. abund=medhigh0. abund=high0. n_true= n_true= n_true= bb_corr= bb_corr=0.550. bb_corr= prev= prev= prev= N=2500. N=7500. N=1500HSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)HFELR-RFERandom ForestElastic NetLASSOAggregate Performance Across Simulated Parameter Conditions on Phylogeny-Independent Ground-Truth DataFigure 4.5: Aggregate performance of feature selection techniques onphylogeny-independent synthetic data across five simulation parame-ters (top-down) with four levels each (left-right). Each bar represents amean score for all possible instances of the parameter value. The fivesimulation parameters are A) abundance level, B) number of true fea-tures, C) microbe-microbe correlation, D) minimum taxa prevalence, E)sample size. Similar to 4.4, we observe an effect from abundance level,the number of true features, and sample size.4.3 Factors that influence feature selection performancein simulation4.3.1 The effect of phylogenetic relatednessIn this section, we show our results on changing the simulation parameter for thelevel of phylogenetic relatedness that is shared among the ground-truth features.45The phylogenetic relatedness parameter is ultimately the k hyperparameter for thePAM clustering algorithm which determines how many clusters are formed fromthe phylogenetic tree using the patristic distance metric. A relatedness level of0 indicates that the simulated data has ground-truth features that do not have aninduced phylogenetic relationship. As the relatedness parameter is increased, theground-truth features share more similarities at lower depth in the tree since thephylogenetic tree was partitioned a greater number of times. Our results from4.2.1 and 4.2.3 showed that phylogeny-induced correlation has a small effect onthe performance of generic methods, in which the top methods are very robust inmost conditions. The performance maximum, minimum, and outliers of the box-plots in Figure 4.1(A) show us that the inclusion of phylogeny-induced correlationin the ground-truth features has an effect on the performances of both classes ofmethods. The generic methods show a slight decrease in performance, while thephylogenetic methods show an increase in performance. Overall, the effect on thegeneric methods is small, supporting our claim in 4.2.3 that generic techniquesare robust in the presence of phylogenetic structure among ground-truth features.However, we must note a constraint in our synthetic data generation process thatcould have had an impact on the effect size of the relatedness parameter. The phy-logenetic relatedness parameter is constrained by the calibration data used. In ourcase, we used the CHILD study 3-month microbiome data which after QC prepro-cessing has only 174 OTUs. With a small number of OTUs, the phylogenetic treecan only be partitioned so many times and thus the phylogenetic relatedness para-mater cannot take on a wider range of values. We suspect that using a calibrationset with a larger number of OTU features would lead to a greater effect from thephylogenetic relatedness simulation parameter.4.3.2 The effect of the number of relevant featuresIn this section, we present our observations on the effect of changing the size of theground-truth feature set on both types of simulated data. Aggregate performanceon the phylogeny-independent data in Figure 4.5 shows improved feature selectionperformance as the size of the ground-truth set grows (N true features = 5, 9, 14,24). In Figure 4.6, the method that showed the most improvement was HSIC-LASSO46with the Gaussian kernel, which has low performance when the number of true fea-tures is small (at N true features = 5, average best feature selection F1 < 0.55) andhigher performance when the number of true features is larger (at N true features =24, average best feature selection F1 > 0.70). We observe improvement in perfor-mance in LR-RFE, LASSO and Elastic Net as the number of true features increasesas well, however the performance overall for those methods have an average thatfall under F1 of 0.50. RF also showed reduced performance when the size of theground-truth set is small, however maintained stable and reasonable performanceoverall above 0.70 at both small (N true features = 5) and large (N true features =24) sets of ground-truths. Last, the phylogenetic techniques perform poorly on thephylogeny-independent data at all levels of sizes of the ground-truth set.On simulated data with phylogeny-induced correlation, Figure 4.7 shows thatHFE also performs comparably to the top two generic techniques (RF and HSIC-LASSO with the Gaussian kernel) which are robust on both types of simulated data.The effect of a small number of true features also impacts the performance of HFE,with lower performance observed at a size of 3 (average best feature selection F1< 0.45), and higher performance when the number of true features is increasedto 11 (average best feature selection F1 > 0.70). The range for the size of theground-truth feature set differs between the phylogeny-induced and phylogeny-independent data due to constraints mentioned in 4.3.1. If the relatedness parame-ter is set to a higher value, the clustering algorithm will partition the phylogenetictree a greater number of times, leading to fewer taxa that can be used as ground-truth features since the phylogenetic groups will have a smaller number of mem-bers. We observed a smaller effect in the increase of the number of true featureson our generic methods in the phylogeny-induced data, showing more stable per-formance from a set size of 3 to 11 compared to HFE. Since the change of the sizeof the ground-truth feature set is much smaller for this set of simulated data, theabsence of an effect is most likely attributed to the smaller range. This indicatesthat HFE is more sensitive to small changes in the size of the ground-truth featureset in comparison to the top generic techniques.470 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. NetRandom ForestLR-RFEHFEHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)Feature Selection of Simulated Phylogeny-Independent Ground-Truth (Number of True Features vs. Minimum Taxa Prevalence)Figure 4.6: Performance of both classes of methods on phylogeny-independent synthetic data at different levels of the number of true fea-tures (5, 9, 14, 24) and minimum taxa prevalence (2%, 6%, 10%, and14%). All other parameters are set to the default configuration in Table4.5. An effect from the number of true features is observed (top-down),but an effect from minimum taxa prevalence is not observed (left-right).4.3.3 The effect of sample sizeIn this section, we evaluate the effect of sample size on feature selection. Therange of values we examine are N = 100, 250, 750, 1500. Each synthetic datasetthat is tested with a given sample size is split 80% into a training set and 20% into atesting set. We observe in Figures 4.8, 4.9, and 4.3 that there is a strong effect fromincreasing sample size over the values of N = 100, 250, 750, 1500. At N = 100,all of the methods perform very poorly, with improvement in the top two genericmethods observed at N = 250, and very high performance is reached at N = 750480 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. NetRandom ForestLR-RFEHFEHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)Feature Selection of Simulated Phylogeny-Induced Ground-Truth (Number of True Features vs. Minimum Taxa Prevalence)Figure 4.7: Performance of both classes of methods on phylogeny-inducedsynthetic data at different levels of the number of true features (3, 5, 8,11) and minimum taxa prevalence (2%, 6%, 10%, and 14%). All otherparameters are set to the default configuration in Table 4.5. An effectfrom the number of true features is observed (top-down) by HFE from achange of 3 to 5, while generic methods RF and HSIC-LASSO (Gaussian)are robust in the change of the number of true features from 5 to 11. Aneffect from minimum taxa prevalence is not observed (left-right).and N = 1500. We observed increased feature selection performance among the toptwo generic methods on both the phylogeny-induced and phylogeny-independentdatasets as sample size increased. At N = 1500, we observed improvement with LR-RFE. Figure 4.9 shows that feature selection performance of HFE is not impactedby a change in sample size and that the method performs similarly at both smalland large sample size from N = 100 to N = 1500.490 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. NetRandom ForestLR-RFEHFEHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)Feature Selection of Simulated Phylogeny-Independent Ground-Truth (Abundance Level vs. Sample Size)Figure 4.8: Performance of both classes of methods on phylogeny-independent synthetic data at different levels of abundance level (low,medium low, medium high, high) and sample sizes (100, 250, 750,1500). All other parameters are set to the default configuration in Table4. The effect of taxa abundance levelIn this section, we evaluate the effect of taxa abundance level on feature selec-tion performance. For all simulated conditions except for our extreme low andhigh conditions from Figure 4.2, the mean abundance level of all taxa were sortedand grouped into four levels (low, medium low, medium high, high). Parameter-ization of the abundance level thus meant picking ground-truth features from theappropriate abundance level group. In Figure 4.4 and 4.5 we had observed forboth simulated dataset types, a small effect in the feature selection performance500 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 500. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. 5 15 30 50number of features0. NetRandom ForestLR-RFEHFEHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HSIC-LASSO (uwUniFrac)Feature Selection of Simulated Phylogeny-Induced Ground-Truth (Number of True Features vs. Sample Size)Figure 4.9: Performance of both classes of methods on phylogeny-inducedsynthetic data at different levels of the number of true features (3, 5, 8,11) and sample sizes (100, 250, 750, 1500). All other parameters areset to the default configuration in Table 4.5. An effect from the numberof true features is observed (top-down) by HFE and HSIC-LASSO (Bray-Curtis) from a change of 3 to 11. At at sample size above 100, genericmethods RF and HSIC-LASSO (Gaussian) are robust to the change ofthe number of true features from 5 to 11. Sample size influences theperformance of generic methods, while HFE does not show decreasedperformance at lower sample size.from lower to higher abundance level among the ground-truth features. Our resultsfrom Figure 4.2 shows an effect from the change in abundance level in the top twogeneric methods for the the phylogeny-independent data from the extremely lowabundance condition (lowest ten OTU features) to the extremely high abundancecondition (highest ten OTU features). Although, performance maintains at a rea-51sonable level at both low and high abundance. This shows that for the genericmethods, rare taxa with lower abundance in samples are not substantially more dif-ficult to recover as the ground-truth features than very abundant taxa. On the otherhand, our best phylogenetic method, HFE, shows substantially poorer performancein the extremely low abundance phylogeny-induced data in Figure 4.2 comparedto the extremely high abundance condition where HFE achieves excellent perfor-mance.4.3.5 The effect of microbe-microbe correlationIn this section, we present the results from adding noise through microbe-microbecorrelation. As the levels of microbe-microbe correlation (0, 0.30, 0.55, 0.80)increased, at both low and high levels, we show in Figures 4.4 and 4.5 that our toptwo generic techniques (RF and HSIC-LASSO with the Gaussian kernel) along withthe phylogenetic method HFE show high performance. There was no obvious effectfrom microbe-microbe correlation on the feature selection methods. We expectedhigher levels to lead to lower performance as this creates a more challenging taskfor finding the true features due to feature multicollinearity. While our simulationdid not reveal an effect, there is insufficient evidence to state that the methods arerobust to increased levels of microbe-microbe correlation. This could suggest alimitation in the simulation, such that the correlation level parameterized couldhave been too low to reveal an effect.4.4 Evaluating classes of feature selection techniques onreal dataIn this section, we present results from applying the feature selection techniqueswe employed in 4.2 to the CHILD study microbiome data. We evaluate two kindsof signal levels, a high signal scenario (age time point) and three lower signal sce-narios (asthma at 5 years, birth by vaginal delivery, exclusive breastfeeding up to3-months). Based on the results in 4.1 and prior research on asthma and the CHILDstudy, the variables we investigate in this section were chosen based on evidence ofan association with the microbiota. From 4.2, we had learned that generic machinelearning feature selection techniques that can be applied to a wide variety of data52types are robust in recovering ground-truth features in simulated microbiome data.We saw that a single phylogenetic technique, HFE, could be robust for finding phy-logenetically related ground-truth features if the signal is strong. Now that we havelearned the behaviour of the methods we have evaluated from the previous section,we aim to apply our findings to improve our understanding of how the feature se-lection techniques perform and behave on CHILD study clinical and environmentalcovariates.We note that feature selection methods with the UniFrac distance kernels wereeliminated due to poor performance that was consistently observed (section 4.2.1).Our results in this section will be based on microbiome data that is rarefied andrepresented as counts and fractional abundance to demonstrate that different fea-ture selection techniques perform differently depending on the transformation ofthe data. Section 4.5.1 will present more thorough examination of the effect ofdifferent normalization techniques on our feature selection methods with the realdata.Outcome Number of SamplesTime Point 1528Asthma at 5 years 631-687Birth by Vaginal Delivery 758-827Exclusive Breastfeeding Status (Up to 3-months) 766-836Table 4.6: Table of sample sizes for real data modelling outcomes, which dif-fer for each outcome due to missing values. A range is stated for eachbecause pipelines that include normalization by rarefaction produces areduced sample size. In the process of rarefying a dataset, samples thathave an insufficient library size for the optimal constant library size cho-sen are discarded.4.4.1 High signal real data conditionWe applied feature selection benchmarking to find the most important features thatseparate the labels for the time point variable. This involved combining the 3-month and 1-year microbiome samples (N = 1528 in total) and labeling the samplesbased on their time point of data collection. Time point is ultimately a proxy for53the age of a child and is treated as a categorical variable. Similar to our simulation,the features selected or ranked by a feature selection technique were evaluated ina classification setting using an SVM model. We take subsets of ranked featuresproduced by each feature selection technique and evaluate its AUROC performancebased on the feature subsets with increasing size.Figure 4.10 shows that feature selection techniques can find the important fea-tures with good discrimination of AUROC > 0.70 for all seven methods tested onboth rarefied count data and rarefied fractional abundance data. The best performeron the rarefied fractional abundance data was HSIC-LASSO with the Bray-Curtiskernel with AUROC 0.77 using 7.5% of total features, equating to 13 top OTUsselected by the method. HFE was the second best performer with AUROC 0.74using 12% of the total features, equating to 18 top OTUs selected by the method.On rarefied count data, we observe greatest performance achieved by RF at 6 topOTUs, and LR-RFE at 31 top OTUs. RF and HFE were among the top methods in oursimulation. LR-RFE showed reasonable performance in our simulation, howeverHSIC-LASSO with the Bray-Curtis kernel was not typically among the top meth-ods. We attribute this observation to the fact that our simulated data representedrarefied count data. Hence, we observe the best performances achieved by the sametop methods from our simulation on the rarefied count data, but not necessarily thesame methods on the rarefied fractional abundance data.Our results reveal that beyond the inclusion of an optimum number of featuresof 31 for rarefied count data and 13 for rarefied fractional abundance data, theperformance plateaus for most methods. This implies that there is an importantset of features near the optimum number, where the inclusion of more featuresfrom the full set of 174 OTUs does not lead to a better classification model that candiscriminate samples as 3-month or 1-year.4.4.2 Lower signal real data conditionsNext, we present results on the feature selection benchmarking procedure appliedto 3-month microbiome samples to find the relevant features to three lower signalconditions treated as binary outcomes: asthma at 5 years, birth by vaginal delivery,and exclusive breastfeeding up to 3-months. In total, there are 750 samples for this540 15 30 45 60 75 90 105 120 135number of features0.450.500.550.600.650.700.750.80ROC/AUCNormalization: rarefied counts0 15 30 45 60 75 90 105 120 135number of features0.450.500.550.600.650.700.750.80Normalization: rarefied proportionsLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEmeanFeature Selection for Time PointFigure 4.10: Feature selection benchmarking results of seven methods on 174OTU features from real data for finding the relevant taxa that separatethe time point age variable as 3-month or 1-year samples. Results withtwo forms of normalized data is presented, with normalized countson the left and rarefied fractional abundance on the right. Count datashows greater variability between methods than fractional abundancedata.time point, though each modeling scenario will have less samples after filteringmissing values for a given outcome variable, shown in Table 4.6. For asthma,only absolute labels for case vs. control were used. The disease phenotype for’probable asthma’ were filtered out in order to obtain a cleaner model of the moreconfident asthma diagnosis. For birth by vaginal delivery, the variable is binarizedwhere infants who were birthed by vaginal delivery are labeled as the positiveclass and all other forms of birth delivery represent the negative class. Similarly,for breastfeeding status, all infants that were exclusively breastfed up to 3-monthsof age were treated as one class, and those who weren’t were treated as the otherclass.Showing our results for asthma first, in Figure 4.11, we see that the best per-formance is achieved by RF for the count data at AUROC 0.67 with 20% of featurescorresponding to 35 top OTUs ranked by the algorithm. For fractional abundancedata, we observe that LR-RFE is the best performing method at AUROC 0.66 with8% of features corresponding to 14 top OTUs. HFE selected 1 and 3 features re-55spectively for these two versions of the data, producing classification models withlow performance of AUROC under 0.55.Next, in Figures 4.12 and 4.13, the feature selection results of relevant taxafor exclusive breastfeeding and birth by vaginal delivery, respectively, are shown.We observe that HFE produces larger feature selection sets than was found forasthma. Based on our simulation in 4.2.1 and 4.2.3, we learned that HFE is sen-sitive to strong phylogenetic signal and otherwise performs very poorly with fewor no selected features in the absence of a phylogenetic signal. Our results fromFigure 4.12 could thus suggest a presence of a phylogenetic signal. However, theperformance is subpar, with the best HFE score of AUROC 0.58 for birth by vagi-nal delivery and AUROC 0.62 for exclusive breastfeeding. Thus, more investigationwould be needed to verify this. For both outcomes, the best method is HSIC-LASSOwith the Gaussian kernel, one of the top two generic techniques we found from oursimulation at AUROC 0.62 with 9 top OTUs and AUROC 0.66 with 7 top OTUs forbirth by vaginal delivery and exclusive breastfeeding, respectively.0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80number of features0.250.300.350.400.450.500.550.600.650.70ROC/AUCNormalization: rarefied counts0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80number of features0.250.300.350.400.450.500.550.600.650.70Normalization: rarefied proportionsLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEmeanFeature Selection for Asthma at 5 yearsFigure 4.11: Feature selection benchmarking results of seven methods on 174OTU features from real data for finding the relevant taxa for asthmaticvs. healthy subjects using the 3-month microbiome samples. Re-sults with two forms of normalized data is presented, with normalizedcounts on the left and rarefied fractional abundance on the right. Countdata shows greater variability between methods than fractional abun-dance data.560 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85number of features0.350.400.450.500.550.600.65ROC/AUCNormalization: rarefied proportionsLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEFeature Selection for Infant Breastfeeding Status (Up to 3-months)Figure 4.12: Feature selection benchmarking results of seven methods on 174OTU features from real data for finding the relevant taxa that exclu-sive breastfeeding status (up to 3-months of age) using the 3-monthmicrobiome samples. Due to higher variability in between-methodperformances with rarefied counts, we show only rarefied fractionalabundance.4.5 Factors that influence feature selection performanceon real dataIn this section, we present results on factors that could influence the performanceof feature selection techniques on our real data. First off, we will state that ourresults from 4.4.1 showed us that we could achieve good performance using bothphylogenetic and generic techniques in a high signal real data condition. A range ofpoor to reasonable performance was observed in our lower signal real data condi-tions. We saw that phylogenetic techniques pick up signal for all conditions exceptasthma, which could indicate that the initial phylogenetic signal observed from theMIRKAT association test (Table 4.1) may not be strong. In this section, we evaluatestrategies to improve the performance of our feature selection procedure on our570 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85number of features0.300.350.400.450.500.550.60ROC/AUCNormalization: rarefied proportionsLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEFeature Selection for Birth Mode By Vaginal DeliveryFigure 4.13: Feature selection benchmarking results of seven methods on 174OTU features from real data for finding the relevant taxa for birthby vaginal delivery using the 3-month microbiome samples. Dueto higher variability in between-method performances with rarefiedcounts, we show only rarefied fractional abundance.real data. First, we will show results for our investigation of the effect of differentstrategies for normalization. Next, we will present an investigation of the effect ofsample size on the performance of our high signal real data condition. Finally, wewill show results for our investigation of the effect of aggregating OTU abundancesto a higher phylogenetic depth in order to increase the signal by abundance level.4.5.1 The effect of normalizationOur investigation of the effect of normalization techniques on feature selectionperformance showed a very strong effect on the time point variable (Figure 4.14)and a smaller effect on our lower signal conditions (asthma, birth by vaginal de-livery, and exclusive breastfeeding). Overall, multiple normalization techniquesshowed significant improvement in performance for the time point variable, with58up to a 37.5% performance increase in AUROC from rarefied fractional abundancedata to the other forms of normalization introduced in this section. For time point,we observe very high average classification performance AUROC above 0.90 usingthe feature selection subsets produced by multiple feature selection methods withfour normalization techniques: relative log expression (RLE), trimmed mean of M-values (TMM), a variant of TMM for zero-inflated data called TMMwzp, and upperquartile (UQ) normalization. For the lower signal conditions, UQ, RLE, TMMwzp,and CLR were the methods that had the greatest effect on overall performance.The full normalization suite was applied to both time point and asthma. Then, asubset of techniques were picked out based on higher performance and commonusage and applied for birth by vaginal delivery and exclusive breastfeeding. Thedifference in the effect of normalization on the high vs. low signal conditions isinteresting. We observe that the benefits from normalization of microbiome datais great when the biological effect is strong. This could suggest that the relation-ship between age and the microbiota is more direct, such that reducing the effectof confounding from sample library size differences has a larger impact on featureselection and classification performance.We found the normalization techniques that produced substantial performanceincrease in our high signal condition did not always translate to performance in-creases in our low signal conditions. Table 4.7 shows the percentage of perfor-mance differences of five normalization techniques against a baseline normaliza-tion method, which was chosen to be rarefied fractional abundance. The reasonsfor this choice is based on 1) common usage in microbiome studies and 2) lowerperformance variability between methods in comparison to rarefied count data. Weattribute lower performance variability to signify greater stability. We found thatUQ showed performance improvement across all seven feature selection techniquespresented in Table 4.7. Normalization with RLE shows performance increases forall four real data conditions when the feature selection method is RF. The largestincrease in performance we observed for asthma was with UQ, with a 17.4% in-crease from a baseline AUROC with HSIC-LASSO (Bray-Curtis) of 0.59 to 0.69.594.5.2 The effect of sample sizeSince our simulation results in 4.3.3 revealed that sample size has a strong effect onfeature selection performance from N = 100 to N = 1500, we sought to investigatewhether there is an effect from the differences in sample size in our high signal(section 4.4.1) vs. low signal (section 4.4.2) modeling conditions that could explainour observation of the differences in performance. Our high signal condition, timepoint, has 1528 samples, while our low signal conditions have between 631 to 836.In Figure 4.18, we show that down-sampling the sample size for the time pointcondition to less than half its original size revealed stable performance from N =1500 to N = 650 and thus no visible effect indicating that the higher performancein feature selection and classification of the time point variable presented in 4.4.1is due to greater sample size.4.5.3 The effect of phylogenetic aggregationIn evaluating the effect of phylogenetic aggregation level, we sought to test whethergreater abundance level through grouping OTU features at the genera level couldimprove the performance in our lower signal real data conditions. We found phy-logenetic aggregation did not have an effect on feature selection and classificationperformance.4.6 SummaryTo summarize, through our benchmarking on synthetic and real data, we showedthat both generic and phylogenetic classes of feature selection techniques are ca-pable of identifying microbial signatures with reasonable performance. In oursimulation, we found that generic techniques are highly robust in modeling con-ditions where a phylogenetic signal is present. We found that one phylogenetictechnique, HFE, shows high performance when a strong phylogenetic signal ispresent in our simulation and poor performance in the absence of phylogeneticstructure in the microbial signature. In evaluating the effects of our simulationparameters, we learned that the performances of the feature selection techniquesare influenced by the number of features in the microbial signature, with greaternumber of features corresponding to better performance. We learned that a sam-60ple size of at least 750 for feature selection on simulated microbiome data is themost ideal. We showed that while generic techniques have shown to be robuston both phylogeny-independent and phylogeny-induced datasets, there is an effectobserved from greater phylogenetic relatedness within the ground-truth feature setthat decreases the performance in some simulated conditions.On real data, we showed that the performances of the feature selection tech-niques evaluated in this work can translate from our simulation, showing goodperformance on a higher signal condition. With normalization, we saw a large im-provement in performance for our high signal condition, and a smaller effect on thelower signal conditions. In our exploration of the microbiome and asthma, we haddiscovered a significant association between the global set of taxa for the 3-monthCHILD study microbiome samples with a phylogenetic association test. However,we found that feature selection with a phylogenetic approach did not reveal a goodperforming feature subset. Two additional CHILD study covariates (birth by vagi-nal delivery and exclusive breastfeeding) showed a strong association using phy-logenetic tests with both the 3-month and 1-year microbiome samples. Featureselection benchmarking to find the important taxa for these two covariates showedreasonable performance using a phylogenetic approach, potentially indicating thatphylogenetic signal among the ground-truth features exists.61FS Method Outcome Base-lineTSS(%)UQ(%)RLE(%)TMM-wzp (%)CLR(%)LASSOasthma 0.62 -2.5 +5.9 -6.3 -2.7 -2.5birthmode 0.61 -23.2 -15 -14.1 -5.7 +0.7breastfeeding 0.61 -2.5 +7.4 -3.3 +5.7 +1.4timepoint 0.72 -1.6 +29.6 +37.5 +36.3 +8.9Elastic Netasthma 0.62 -7.3 +1.5 -6.1 -4.6 -5.1birthmode 0.59 -16.9 -14.6 -14 -5.6 +6.0breastfeeding 0.63 -2.2 +3.7 -6.9 +0.6 +0.3timepoint 0.73 -4.2 +24.6 +32.4 +33.2 +5.8LR-RFEasthma 0.66 -10 -6.1 -8.9 -8.9 -14.5birthmode 0.56 -14.4 -6.7 -6.6 -1.4 +8.4breastfeeding 0.63 -2.2 +4.7 -4.1 +1.2 +3.0timepoint 0.74 -5.6 +24.8 +32.2 +32.1 +5.5RandomForestasthma 0.60 +6.3 -0.2 +8.1 +2.3 -7.2birthmode 0.52 +14.4 -0.4 +6.7 -8.5 +20.2breastfeeding 0.63 -3.8 +6.0 +1.0 -2 +2.4timepoint 0.75 -3.8 +22.1 +31.9 +31.7 +5.4HSIC-LASSO(BC)asthma 0.59 +2.3 +17.4 +6.8 +6.3 -5.1birthmode 0.61 -3.3 -12 -12.1 -12.9 -6.1breastfeeding 0.62 -8.7 +5.7 +1.0 +1.8 +2.6timepoint 0.76 -7 +19.3 +28.2 +28.3 +2.3HSIC-LASSO(Gaussian)asthma 0.65 +4.0 +8.2 -1.4 -5.8 -13.7birthmode 0.62 -9.4 -17.2 -5.5 -13.5 -12.9breastfeeding 0.66 -10.3 -0.5 -7.7 -2.8 -1.6timepoint 0.74 -7.1 +22.1 +32.1 +33.6 +5.8HFEasthma 0.54 - - - -5.2 -birthmode 0.58 +20.0 +15.3 -0.4 +6.6 +5.8breastfeeding 0.60 -6.4 +0.4 +0.8 -1.6 -10.5timepoint 0.74 -7.1 +22.0 +24.1 +26.0 +2.9Table 4.7: Table of performance differences when different normalizationtechniques are used for real data outcomes. The baseline represents theAUROC performance for a feature selection method using rarefied frac-tional abundance data. The best performance achieved is shown, whichis defined by the subset of selected or ranked features produced by a fea-ture selection method that produces the highest performance. Only thenormalization methods that were tested on all four outcome variables areshown.620 20 40 60 80 100 120 1400.50.60.7ROC/AUCNormalization: raw counts0 20 40 60 80 100 120 1400.50.60.7Normalization: proportions0 20 40 60 80 100 120 1400.60.7ROC/AUCNormalization: rarefied counts0 20 40 60 80 100 120 1400.50.60.7Normalization: rarefied proportions0 20 40 60 80 100 120 1400.50.60.7ROC/AUCNormalization: TSS0 20 40 60 80 100 120 1400.650.700.75Normalization: CLR0 20 40 60 80 100 120 1400.70.80.9ROC/AUCNormalization: UQ0 20 40 60 80 100 120 1400.81.0 Normalization: RLE0 20 40 60 80 100 120 140number of features0.81.0ROC/AUCNormalization: TMM0 20 40 60 80 100 120 140number of features0.81.0Normalization: TMMwzpLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEmeanFeature Selection for Time PointFigure 4.14: Feature selection benchmarking results of seven methods on 174OTU features from real data for the time point variable using nine nor-malization techniques in addition to raw counts. We select rarefactionas our baseline for comparison due to wide usage and observe largeimprovements in AUROC with other normalization techniques. Fromlowest to highest mean percent increases, the normalization techniquesthat achieve the highest performance improvements are CLR, UQ, RLE,TMM, and TMMwzp. 630 30 60 90 120 1500.40.6ROC/AUCNormalization: raw counts0 30 60 90 120 1500.40.6Normalization: proportions0 30 60 90 120 1500.40.6ROC/AUCNormalization: rarefied counts0 30 60 90 120 1500.40.6Normalization: rarefied proportions0 30 60 90 120 1500.40.6ROC/AUCNormalization: RLE0 30 60 90 120 1500.40.6Normalization: UQ0 30 60 90 120 1500.40.50.6ROC/AUCNormalization: TMM0 30 60 90 120 1500.40.6Normalization: TMMwzp0 30 60 90 120 150number of features0.40.6ROC/AUCNormalization: TSS0 30 60 90 120 150number of features0.40.6Normalization: CLRLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)meanFeature Selection Benchmarking for Asthma at 5 yearsFigure 4.15: Feature selection benchmarking results of seven methods on 174OTU features from real data for asthma at 5-years using nine normaliza-tion techniques in addition to raw counts. We select rarefaction as ourbaseline for comparison due to wide usage and observe large improve-ments in AUROC with other normalization techniques. From lowestto highest mean percent increases, the normalization techniques thatachieve the highest performance improvements are TSS and UQ.640 20 40 60 800.350.450.550.65ROC/AUCNormalization: rarefied proportions0 20 40 60 800.350.450.550.65Normalization: TSS0 20 40 60 800.350.450.550.65ROC/AUCNormalization: UQ0 20 40 60 800.350.450.550.65Normalization: RLE0 20 40 60 80number of features0.350.450.550.65ROC/AUCNormalization: TMMwzp0 20 40 60 80number of features0.350.450.550.65Normalization: CLRLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEmeanFeature Selection for Infant Breastfeeding Status (Up to 3-months)Figure 4.16: Feature selection benchmarking results of seven methods on 174OTU features from real data for breastfeeding status using six normal-ization techniques. We observe that different normalization techniquesresults in larger feature selection sets for HFE, albeit with poor AUROCperformance.650 20 40 60 80 100 120 1400.250.350.450.550.65ROC/AUCNormalization: rarefied proportions0 20 40 60 80 100 120 1400.250.350.450.550.65Normalization: TSS0 20 40 60 80 100 120 1400.250.350.450.550.65ROC/AUCNormalization: UQ0 20 40 60 80 100 120 1400.250.350.450.550.65Normalization: RLE0 20 40 60 80 100 120 140number of features0.250.350.450.550.65ROC/AUCNormalization: TMMwzp0 20 40 60 80 100 120 140number of features0.250.350.450.550.65Normalization: CLRLASSOElastic NetLR-RFERandom ForestHSIC-LASSO (Bray-Curtis)HSIC-LASSO (Gaussian)HFEmeanFeature Selection for Birth Mode by Vaginal DeliveryFigure 4.17: Feature selection benchmarking results of seven methods on 174OTU features from real data for birth by vaginal delivery using six nor-malization techniques. We observe improved performance with HFEusing additional normalization techniques when compared to rarefiedfractional abundance data. UQ and TSS show AUROC above 0.65.66A BFigure 4.18: An evaluation of the effect of sample size on the higher per-formance observed for feature selection of the time point variable.Since performance was much higher than our lower signal conditions,whether the significant difference in sample size could have an effectwas investigated. Down-sampling to a sample size lower than the meansample size of the lower signal real data conditions did not show a de-crease in performance from N = 1528 (left) to N = 650 (right). A singlefeature selection technique (LR-RFE) is shown, however this observa-tion was consistent for all six other feature selection techniques.67Chapter 5Discussion and Conclusion5.1 SummaryIn this thesis, we performed a comparative analysis to evaluate the phylogenetic vs.non-phylogenetic approach to identifying important features to a host phenotypicoutcome which can improve classification models of microbiome data. Due to theimmense amount of interest in developing phylogenetic techniques in the litera-ture, our work sought to investigate the applicability of such approaches throughan objective evaluation procedure with extensive simulation and application to realdata. We conducted this research by benchmarking 11 feature selection techniquesapplied to simulated data with and without induced phylogenetic correlation struc-ture over a set of parameterized conditions. This was followed by the applicationof feature selection and classification on novel research data for four phenotypicoutcomes with varied signal levels and correlation structures. We aimed to gaininsight on the behaviour and performances of different classes of feature selectiontechniques in order to provide a guideline that can inform researchers on the usageof the approaches and the interpretation of their results.For our simulation study, we generated two types of synthetic data with differ-ent types of corrrelation structures to test the ability of feature selection techniquesfor recovering ground-truth features. From our experiments, we found that in boththe presence and absence of induced phylogenetic correlation, generic methodsare very robust in many conditions on both types of synthetic datasets, with the68best methods being RF and HSIC-LASSO with the Gaussian kernel. We observedthat the effect of increased phylogenetic relatedness was as we expected, withincreased performance in our phylogenetic methods and decreased performancein our generic methods. However, the effect on the generic methods was verysmall. Overall, the generic methods outperformed the phylogenetic methods inmost cases, including conditions where there is induced phylogenetic correlationstructure. We discovered that a single phylogenetic technique, HFE, excels in ex-tremely high signal phylogeny-induced conditions, outperforming the best generictechniques. We learned that HFE is highly sensitive to changes in the correlationstructure of the microbial signature, such as the size of the microbial signature(number of OTUs) and the level of phylogenetic relatedness among the OTUs inthe signature. On the other hand, the performance of the phylogenetic methods onsynthetic data in the absence of induced phylogenetic correlation structure was ex-tremely poor, confirming our initial speculation that phylogenetic methods appliedto conditions they are not designed for would result in non-informative results.This suggests that classification problems for microbiome data are not all inher-ently applicable with a phylogenetic approach. This is contrary to the motivationof many phylogenetic methods in literature which propose that the phylogeneticapproach will be more powerful for discovering relationships in microbiome data.We believe this highlights the need to evaluate the question of whether a phyloge-netic signal exists when applying methods to search for a microbial signature forhost phenotypic outcomes. As one of the goals of this work is to provide a guideto approaching a classification problem which may or may not have a phylogeneticsignal, we will propose possible avenues for this in the next sections.The second part of this thesis involved applying the feature selection bench-marking procedure to novel research data with unknown correlation structures andvaried signal levels. Our goal was to show that our feature selection results fromour simulation study can transfer to real data and that we can elucidate the types ofcorrelation structures that may be present for four phenotypic outcomes. We alsocoupled our feature selection benchmarking with a set of normalization techniquesfor handling library size differences and compositionality. In this, we sought toprovide empirical evidence on which normalization techniques may be most suit-able for a feature selection method while also aiming to show that we can improve69the modelling of the underlying biological effect in the data with normalization.We discovered that for the time point outcome, a proxy for age, showed goodperformance with multiple feature selection methods. When our normalization ap-proach was applied, we saw an increase in performance of above 30%, reachingan AUROC of 0.95-0.99 from a baseline of approximately 0.74-0.77 for specificfeature selection-normalization pairs. Our lower signal conditions (asthma at 5years, birth mode, and breastfeeding status) showed reasonable performance fromfeature selection benchmarking, with improvements achieved with normalizationas well (up to 20% increase), however less substantial than was observed with ourhigh signal condition. This revealed that the type of normalization technique thatis paired with a feature selection method can have a strong impact on the classifi-cation performance of the reduced feature sets if the baseline signal is good and thebiological effect is strong. We discovered that RLE paired with RF shows the great-est and most consistent performance increases for all four phenotypic conditions.With rarefied fractional abundance data (our baseline normalization), we foundthat HFE produced higher AUROC performances for time point (> 0.70), exclusivebreastfeeding (0.60-0.66), and birth by vaginal delivery (0.60-0.66). For asthma,few taxa were selected by HFE and the performance was very poor (0-0.55). Com-bined with the results from global association testing using phylogenetic kernelswith MIRKAT, there is potential insight into elucidating the correlation structurespresent among the four conditions. Where strong signal is observed with MIRKAT,a stronger HFE signal is also observed. We believe there is opportunity to combinethese techniques in an analysis guideline for verifying if a phylogenetic signal ispresent when studying a phenotypic outcome. We will outline this in our sugges-tions for future directions in 5.3.To highlight key methods and conditions in which they excel, on both our sim-ulated and real data, we found that RF was the most powerful method in manyconditions. This is consistent with reports found in literature [28, 55, 75]. HSIC-LASSO with the Gaussian kernel was our second best contender in most scenariosand to our knowledge, has not yet been applied to microbiome data. Both methodsare good at recovering ground-truth features with extremely low abundance whichis useful for microbiome data due to the large number of low abundant taxa. Inaddition, the methods are both fairly robust at low to moderate sample sizes of 25070to 750, which is also useful for microbiome studies. RF is the most robust methodwhen the ground-truth feature set is small (3-9 features). To summarize, RF andHSIC-LASSO (Gaussian) were our most reliable methods. We stated above that HFEhas some conditions where it excels. While HFE cannot outperform RF and HSIC-LASSO (Gaussian) in most conditions, it can be a useful tool for diagnosing thesignal in a dataset — something that the two generic techniques cannot inherentlydo. We believe that optimization of a classification model is best achieved withgeneric feature selection techniques and that in general, the phylogenetic methodswe tested in this work are not as competitive compared to the generic techniques.5.2 DiscussionIn this section, we will provide an explanation of our main findings, as well asdetail challenges and limitations of our study.5.2.1 The phylogenetic signal in real dataOur evaluation of the phylogenetic vs. non-phylogenetic approach revealed insightinto the possible correlation structures present in the microbial signature of ourcovariates of interest. In this, we highlight a potential use for the combination ofMIRKAT and HFE as a test for a phylogenetic signal.In our simulation, we saw that if a phylogenetic signal is not strong, it selectsfew (0-3) features and/or performs poorly overall. In the extreme simulated condi-tions, if the signal is very strong, then it outcompetes the top generic methods. Onour real data, if a global association test (MIRKAT) revealed a very strong signal us-ing the phylogenetic kernels, large sets of features were selected by HFE with com-parable performances to the best generic methods. MIRKAT showed that birth byvaginal delivery, exclusive breastfeeding, and asthma were significantly associatedwith the 3-month microbiome using a phylogenetic approach (Table 4.1). For allthree covariates, the signal was strongest using at least one of the UniFrac kernels(weighted or unweighted) compared to the non-phylogenetic kernels (Bray-Curtisand Jaccard). The association for asthma however was not as strong comparedto the other two covariates. Integrating these findings with our feature selectionbenchmarking, we found that the covariates with very strong global association71with phylogenetic MIRKAT showed larger sets of features selected by HFE and bet-ter classification performance. For asthma, HFE selected few to no features, andwith poor performance if any features were selected at all. This was observed onall 7 versions of the data where different normalization techniques applied to re-duce the effects from library size differences or compositionality. We believe thatthe combination of our global association testing and feature selection suggeststhat a phylogenetic signal may be present in the microbial signature that drives thecomposition of the microbiome for exclusive breastfeeding and birth by vaginaldelivery, and that there may be an absence of a sufficiently strong phylogeneticsignal in the microbial signature for asthma.5.2.2 Sensitivity of HFEApplied to both simulated and real data, we saw that HFE is sensitive to changes incorrelation structure and the number of important microbes to an outcome of study.In our simulation, if all members of a phylogenetic cluster were included withinthe ground-truth, HFE did significantly better. Additionally, if the abundance of thecluster was also high or if the cluster contains members with higher similarity atlower taxonomic depth (e.g. high ratio of OTUs to genera), then HFE had excellentperformance. On real data, we saw that HFE would only produce larger sets ofreasonably performing features if there was also a strong phylogenetic signal fromMIRKAT, otherwise it selected close to no features with poor performance.We attribute the behaviour of HFE on our simulated data to the feature engineer-ing phase of the HFE algorithm. In this method, abundances of higher taxonomicranks for the OTUs are also treated as features by summing up the abundances oftheir respective children. If all the OTUs in a phylogenetic cluster are correlatedwith an outcome, then the features corresponding to the shared ancestral ranks ofthose OTUs will have a very strong signal from this pooling process. We believethat this explains the perfect performance in the extreme high signal conditionswhen the OTUs were from the same monophyletic group (when all members be-long to a single ancestor).When evaluating the effect of the change in the number of true features, wesaw that small increases led to larger performance changes compared to the top72generic methods. Due to the feature engineering phase in HFE, with each additionalspiked OTU in a phylogenetic cluster, the pooled signal at the shared ancestral rankwill be stronger. In addition to this, one of the filtering phases can also result inreconfiguration of the tree through the collapsing of parent-child nodes if redundantcorrelation is found. When a single spiked feature is added, if parent-child noderedundancy is present, this small perturbation in the data can lead to larger shiftsin the representation of the tree compared to the shifts that would occur in flatOTU tables. We believe that this could be one explanation for the difference in thesensitivity observed between HFE and the generic methods.5.2.3 LimitationsIn this section, we will explain three limitations of our study: biases in our sim-ulated data, the number of phylogenetic methods used, and limitations from thekernel construction for our UniFrac feature selection methods.Biases in our simulated data: We had explained that our synthetic data gen-eration process contained biases from the calibration dataset we used, which hada small number of OTUs. With a larger set of OTUs, the phylogenetic relatednessparameter could take on a larger range. The reason for the small number of OTUswas due to the QC filtering criteria to remove extremely rare taxa which is a com-mon step in microbiome data preprocessing. This limitation had two effects. Thefirst is that we could not test the influence of the number of true features on ourphylogeny-induced data to a greater degree, which could have been useful in elu-cidating the behaviours of our generic methods over a wide range of values. Thesecond is that we could not stretch the phylogenetic relatedness parameter as much.Our top generic methods were robust in many conditions in the presence of phylo-genetic correlation structure, however a small effect was observed with decreasedperformance in the presence of phylogenetic relatedness. It would have been in-teresting to observe a stronger effect from the phylogenetic relatedness level onthe generic techniques, as well as a stronger positive effect on the phylogeneticmethods over more conditions besides the extreme high signal conditions.The number of phylogenetic methods tested: With the increasing number ofphylogenetic feature selection methods being developed, incorporating more of73them into a benchmarking study would provide greater evidence of the applicabilityof the phylogenetic approach. Due to the lack of working software availability,we were limited in the number of methods we were able to test. In our work,we evaluated three methods, of which two were quickly eliminated. While wewere able to gain some insight, the comparative analysis of the two classes offeature selection methods provided an unfair advantage for the larger set of genericmethods.UniFrac kernel construction as a factor for poor performance: We saw thatthe two UniFrac-based phylogenetic feature selection methods performed poorlyand were eliminated at different stages. We would like to comment that the imple-mentation of the kernels for these methods are likely a key factor in this observa-tion. The kernel construction for the feature selection methods used an optimizedversion that computes partial kernels using subsets of OTU features in order tospeed up the high cost of kernel construction. When using this optimized versionfor UniFrac distances, the distance values computed are less informative than forother distance metrics. A brief explanation of this is that the UniFrac distance ismore informative when there are a large number of OTUs included in the compu-tation and this is attributed to the inherent property of microbiome data that fewOTUs are shared across all samples. UniFrac distance computation is based oncomputing the ratio of unshared branch lengths to all branch lengths, and thus tak-ing subsets of OTUs representing reduced phylogenetic trees will more often thannot produce distance values of 1. This is due to the high likelihood that the numberof unshared to total branch lengths between two communities would be equal. Forpairwise sample computation, this would mean that no samples are similar to anydegree, which is not informative to the model that uses the metric. We note thislimitation because we do not wish to draw the conclusion that UniFrac metrics arenot informative in the feature selection context. Improvements in the implemen-tation and further investigation would be needed to elucidate the performance andbehaviour of such methods.745.3 Future WorkIn this section, we detail two possible avenues for the application of the methodsin this work.Diagnosing a phylogenetic signal: Ultimately, our discovery of the sensitivityof HFE offers a potential avenue for using the method as a test for the presenceand absence of a phylogenetic signal in a microbial signature. Since we had es-tablished that the phylogenetic approach may not be applicable to all conditionsof study, performing a diagnostic assessment combining a collection of differentmethods to test for a phylogenetic signal could help inform downstream analysis.If a phylogenetic signal is present, the analysis of the functional components ofthe microbes may involve searching for genes corresponding to highly conservedtraits. On the contrary, if a phylogenetic signal is absent, it could inform that thetaxa in the microbial signature have specialized traits that are more unique to theOTUs involved. A further direction that combines MIRKAT and HFE in a test wouldbe to incorporate functional analysis using a tool such as PICRUSt [29] for in-ferring the presence of gene families and its content level among a set of taxa.Using the gene count output from PICRUSt to test whether there is a correlationwith the biological output of study can offer an additional piece of evidence for aphylogenetic signal in the microbial signature. PICRUSt results for a microbiomedataset may not necessarily contain gene content that suggests a signal with thehost outcome, and this result may be aligned with the observation of uninformativeresults when applying phylogenetic feature selection methods to conditions wherea phylogenetic signal is absent. Hence, combining MIRKAT, HFE, and PiCRUST ina 3-tier analysis workflow for identifying global phylogenetic signal, a microbialsignature, and functional content of the microbes could provide strong evidence forthe presence or absence of a phylogenetic signal.Benchmarking standards to test the phylogenetic approach: We saw that thereis great interest in developing custom methods for microbiome data, with manystrongly motivated by the phylogenetic approach. While some studies [39, 50]have directly evaluated the correlation structures in the microbial signatures of theoutcomes, there are methods-based studies that do not [40, 62, 63, 72], and it isunclear whether the method would be useful for an arbitrary condition of study.75Hence, through this work, we saw that our extension of the simulation setup from[74] allowed us to objectively evaluate the research question of the applicabilityof a phylogenetic approach. We think that this approach could be extended to in-clude additional relevant simulation parameters for microbiome data in general,but also parameters that can further define the attributes pertaining a phylogeneticsignal. For instance, in our simulation, we produced ground-truth sets that repre-sent only monophyletic groups such that all taxa in a phylogenetic cluster belongto a unique common ancestor. It would be interesting to test the ability of methodsto find multiple monophyletic groups that are distantly related, which could cap-ture a correlation that suggests the presence of a relationship between an outcomewith multiple groups corresponding to different types of conserved functions. Thiscould be packaged as a synthetic data generation software for microbiome data thateither is a standalone package or extends the sparseDOSSA package [1]. We notethat sparseDOSSA is the only software package for simulating microbiome datacurrently.5.4 ConclusionIn this work, we evaluated the motivation for a phylogenetic approach for identify-ing important features through a comparative analysis of two classes of feature se-lection techniques. We discovered that generic techniques that model microbiome-outcome relationships strictly based on abundance levels are highly robust in manyconditions on both real and simulated data. Through our extensive simulation, werevealed conditions in which a single phylogenetic feature selection technique ex-cels. While there have been promising indications of the power of a phylogeneticapproach, our work showed that generic methods are the best techniques for iden-tifying the relevant features in a classification setting on most simulated and realconditions for microbiome data. However, the phylogenetic approach provides ad-ditional information that generic techniques cannot reveal and we suggested a po-tential use for the method as a test for a phylogenetic signal that can aid hypothesisgeneration and support downstream functional studies.76Bibliography[1] SparseDOSSA – The Huttenhower Lab. URL → pages 27, 76[2] J. Aitchison. The Statistical Analysis of Compositional Data. Journal of theRoyal Statistical Society: Series B (Methodological), 44(2):139–160, 1982.ISSN 2517-6161. doi: → pages5, 30[3] S. Anders and W. Huber. Differential expression analysis for sequence countdata. Genome Biology, 11(10):R106, Oct. 2010. ISSN 1474-760X.doi:10.1186/gb-2010-11-10-r106. URL → pages 5, 12[4] M. J. Anderson. Permutational Multivariate Analysis of Variance(PERMANOVA). In Wiley StatsRef: Statistics Reference Online, pages1–15. American Cancer Society, 2017. ISBN 978-1-118-44511-2.doi:10.1002/9781118445112.stat07841. URL→ pages 11, 18, 21, 22[5] M.-C. Arrieta, L. T. Stiemsma, P. A. Dimitriu, L. Thorson, S. Russell,S. Yurist-Doutsch, B. Kuzeljevic, M. J. Gold, H. M. Britton, D. L. Lefebvre,P. Subbarao, P. Mandhane, A. Becker, K. M. McNagny, M. R. Sears,T. Kollmann, t. C. S. Investigators, W. W. Mohn, S. E. Turvey, and B. B.Finlay. Early infancy microbial and metabolic alterations affect risk of77childhood asthma. Science Translational Medicine, 7(307):307ra152–307ra152, Sept. 2015. ISSN 1946-6234, 1946-6242.doi:10.1126/scitranslmed.aab2271. URL → pages 1, 28[6] K. Banerjee, N. Zhao, A. Srinivasan, L. Xue, S. D. Hicks, F. A. Middleton,R. Wu, and X. Zhan. An Adaptive Multivariate Two-Sample Test WithApplication to Microbiome Differential Abundance Analysis. Frontiers inGenetics, 10, 2019. ISSN 1664-8021. doi:10.3389/fgene.2019.00350. URL Frontiers. → pages 7, 11[7] E. W. Beals. Bray-Curtis Ordination: An Effective Strategy for Analysis ofMultivariate Ecological Data. In A. MacFadyen and E. D. Ford, editors,Advances in Ecological Research, volume 14, pages 1–55. Academic Press,Jan. 1984. doi:10.1016/S0065-2504(08)60168-3. URL →page 21[8] A. Bichat, J. Plassais, C. Ambroise, and M. Mariadassou. Incorporatingphylogenetic information in microbiome abundance studies has no effect ondetection power and FDR control. bioRxiv, page 2020.01.31.928309, Feb.2020. doi:10.1101/2020.01.31.928309. URL Publisher:Cold Spring Harbor Laboratory Section: Contradictory Results. → pages2, 12[9] E. Bolyen, J. R. Rideout, M. R. Dillon, N. A. Bokulich, C. C. Abnet, G. A.Al-Ghalith, H. Alexander, E. J. Alm, M. Arumugam, F. Asnicar, Y. Bai, J. E.Bisanz, K. Bittinger, A. Brejnrod, C. J. Brislawn, C. T. Brown, B. J.Callahan, A. M. Caraballo-Rodrı´guez, J. Chase, E. K. Cope, R. Da Silva,C. Diener, P. C. Dorrestein, G. M. Douglas, D. M. Durall, C. Duvallet, C. F.Edwardson, M. Ernst, M. Estaki, J. Fouquier, J. M. Gauglitz, S. M. Gibbons,D. L. Gibson, A. Gonzalez, K. Gorlick, J. Guo, B. Hillmann, S. Holmes,H. Holste, C. Huttenhower, G. A. Huttley, S. Janssen, A. K. Jarmusch,L. Jiang, B. D. Kaehler, K. B. Kang, C. R. Keefe, P. Keim, S. T. Kelley,D. Knights, I. Koester, T. Kosciolek, J. Kreps, M. G. I. Langille, J. Lee,R. Ley, Y.-X. Liu, E. Loftfield, C. Lozupone, M. Maher, C. Marotz, B. D.Martin, D. McDonald, L. J. McIver, A. V. Melnik, J. L. Metcalf, S. C.Morgan, J. T. Morton, A. T. Naimey, J. A. Navas-Molina, L. F. Nothias,S. B. Orchanian, T. Pearson, S. L. Peoples, D. Petras, M. L. Preuss,78E. Pruesse, L. B. Rasmussen, A. Rivers, M. S. Robeson, P. Rosenthal,N. Segata, M. Shaffer, A. Shiffer, R. Sinha, S. J. Song, J. R. Spear, A. D.Swafford, L. R. Thompson, P. J. Torres, P. Trinh, A. Tripathi, P. J.Turnbaugh, S. Ul-Hasan, J. J. J. van der Hooft, F. Vargas, Y. Va´zquez-Baeza,E. Vogtmann, M. von Hippel, W. Walters, Y. Wan, M. Wang, J. Warren,K. C. Weber, C. H. D. Williamson, A. D. Willis, Z. Z. Xu, J. R. Zaneveld,Y. Zhang, Q. Zhu, R. Knight, and J. G. Caporaso. Reproducible, interactive,scalable and extensible microbiome data science using QIIME 2. NatureBiotechnology, 37(8):852–857, Aug. 2019. ISSN 1546-1696.doi:10.1038/s41587-019-0209-9. URL Number: 8 Publisher:Nature Publishing Group. → page 6[10] R. Boutin, H. Sbihi, M. Dsouza, R. Malhotra, C. Petersen, D. Dai, M. Sears,T. Moraes, A. Becker, M. Azad, P. Mandhane, P. Subbarao, B. Finlay, andS. Turvey. Mining the infant gut microbiota for therapeutic targets againstatopic disease. Allergy, 75, Feb. 2020. doi:10.1111/all.14244. → pages 1, 28[11] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, Oct. 2001.ISSN 1573-0565. doi:10.1023/A:1010933404324. URL → page 13[12] B. Brill, A. Amir, and R. Heller. Testing for differential abundance incompositional counts data, with application to microbiome studies.arXiv:1904.08937 [q-bio, stat], Aug. 2019. URL arXiv: 1904.08937. → page 12[13] J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit. Evaluation ofstatistical methods for normalization and differential expression inmRNA-Seq experiments. BMC Bioinformatics, 11(1):94, Feb. 2010. ISSN1471-2105. doi:10.1186/1471-2105-11-94. URL → page 30[14] M. L. Calle. Statistical Analysis of Metagenomics Data. Genomics &Informatics, 17(1), Mar. 2019. ISSN 2234-0742.doi:10.5808/GI.2019.17.1.e6. URL Publisher: KoreaGenome Organization. → pages 11, 16, 30[15] J. Chen and H. Li. Kernel Methods for Regression Analysis of MicrobiomeCompositional Data. In Topics in Applied Statistics - 2012 Symposium of theInternational Chinese Statistical Association, pages 191–201, 2013.79doi:10.1007/978-1-4614-7846-1 16. URL →page 25[16] J. Chen, K. Bittinger, E. S. Charlson, C. Hoffmann, J. Lewis, G. D. Wu,R. G. Collman, F. D. Bushman, and H. Li. Associating microbiomecomposition with environmental covariates using generalized UniFracdistances. Bioinformatics, 28(16):2106–2113, Aug. 2012. ISSN 1367-4803.doi:10.1093/bioinformatics/bts342. URL → pages 11, 21[17] J. Chen, E. King, R. Deek, Z. Wei, Y. Yu, D. Grill, and K. Ballman. Anomnibus test for differential distribution analysis of microbiome sequencingdata. Bioinformatics, 34(4):643–651, Feb. 2018. ISSN 1367-4803.doi:10.1093/bioinformatics/btx650. URL → page 12[18] L. Chen, J. Reeve, L. Zhang, S. Huang, X. Wang, and J. Chen. GMPR: Arobust normalization method for zero-inflated count data with application tomicrobiome sequencing data. PeerJ, 6:e4600, Apr. 2018. ISSN 2167-8359.doi:10.7717/peerj.4600. URL Publisher:PeerJ Inc. → page 29[19] Y. Chen, A. T. Lun, D. J. McCarthy, M. E. Ritchie, B. Phipson, Y. Hu,X. Zhou, M. D. Robinson, and G. K. Smyth. edgeR: Empirical Analysis ofDigital Gene Expression Data in R, 2021. URL → pages 12, 29[20] M. J. Claesson, Q. Wang, O. O’Sullivan, R. Greene-Diniz, J. R. Cole, R. P.Ross, and P. W. O’Toole. Comparison of two next-generation sequencingtechnologies for resolving highly complex microbiota composition usingtandem variable 16S rRNA gene regions. Nucleic Acids Research, 38(22):e200–e200, Dec. 2010. ISSN 0305-1048. doi:10.1093/nar/gkq873. URL → page 4[21] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller,T. Huber, D. Dalevi, P. Hu, and G. L. Andersen. Greengenes, aChimera-Checked 16S rRNA Gene Database and Workbench Compatiblewith ARB. Applied and Environmental Microbiology, 72(7):5069–5072,July 2006. ISSN 0099-2240, 1098-5336. doi:10.1128/AEM.03006-05. URL80 Publisher: American Society forMicrobiology Section: METHODS. → page 4[22] G. Ditzler, J. C. Morrison, Y. Lan, and G. L. Rosen. Fizzy: feature subsetselection for metagenomics. BMC Bioinformatics, 16(1):358, Nov. 2015.ISSN 1471-2105. doi:10.1186/s12859-015-0793-8. URL → pages 7, 15, 17[23] A. D. Fernandes, J. N. Reid, J. M. Macklaim, T. A. McMurrough, D. R.Edgell, and G. B. Gloor. Unifying the analysis of high-throughputsequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencingand selective growth experiments by compositional data analysis.Microbiome, 2:15, May 2014. ISSN 2049-2618.doi:10.1186/2049-2618-2-15. URL → page 12[24] J. B. Greenhouse and J. M. Lachin. Greenhouse, Samuel W. InEncyclopedia of Biostatistics. American Cancer Society, 2005. ISBN978-0-470-01181-2. doi:10.1002/0470011815.b2a17057. URL →page 12[25] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene Selection for CancerClassification Using Support Vector Machines. Machine Learning, 46:389–422, Jan. 2002. doi:10.1023/A:1012487302797. → page 13[26] S. Hawinkel. Evaluation of normalization and analysis methods formicrobiome data. page 42. → page 30[27] L. Jiang, A. Amir, J. T. Morton, R. Heller, E. Arias-Castro, and R. Knight.Discrete False-Discovery Rate Improves Identification of DifferentiallyAbundant Microbes. mSystems, 2(6), Nov. 2017. ISSN 2379-5077.doi:10.1128/mSystems.00092-17. URL → page 1[28] D. Knights, E. K. Costello, and R. Knight. Supervised classification ofhuman microbiota. FEMS Microbiology Reviews, 35(2):343–359, Mar.2011. ISSN 0168-6445. doi:10.1111/j.1574-6976.2010.00251.x. URL → pages2, 13, 14, 7081[29] M. G. I. Langille, J. Zaneveld, J. G. Caporaso, D. McDonald, D. Knights,J. A. Reyes, J. C. Clemente, D. E. Burkepile, R. L. Vega Thurber, R. Knight,R. G. Beiko, and C. Huttenhower. Predictive functional profiling ofmicrobial communities using 16S rRNA marker gene sequences. NatureBiotechnology, 31(9):814–821, Sept. 2013. ISSN 1546-1696.doi:10.1038/nbt.2676. URL 9 Publisher: Nature Publishing Group. → page 75[30] F. Li, Y. Yang, and E. P. Xing. From Lasso regression to Feature vectormachine. In NIPS, 2005. → page 23[31] M. I. Love, W. Huber, and S. Anders. Moderated estimation of fold changeand dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, Dec. 2014. ISSN 1474-760X. doi:10.1186/s13059-014-0550-8. URL → pages 12, 29, 30[32] C. Lozupone and R. Knight. UniFrac: a New Phylogenetic Method forComparing Microbial Communities. Applied and EnvironmentalMicrobiology, 71(12):8228–8235, Dec. 2005. ISSN 0099-2240.doi:10.1128/AEM.71.12.8228-8235.2005. URL → pages2, 7, 11, 15[33] C. Lozupone, M. E. Lladser, D. Knights, J. Stombaugh, and R. Knight.UniFrac: an effective distance metric for microbial community comparison.The ISME journal, 5(2):169–172, Feb. 2011. ISSN 1751-7362.doi:10.1038/ismej.2010.133. URL → pages2, 7, 11, 15[34] S. V. Lynch and O. Pedersen. The Human Intestinal Microbiome in Healthand Disease. New England Journal of Medicine, 375(24):2369–2379, Dec.2016. ISSN 0028-4793. doi:10.1056/NEJMra1600266. URL Publisher: Massachusetts MedicalSociety eprint: → page 1[35] S. Mandal, W. Van Treuren, R. A. White, M. Eggesbø, R. Knight, and S. D.Peddada. Analysis of composition of microbiomes: a novel method forstudying microbial composition. Microbial Ecology in Health and Disease,26, May 2015. ISSN 0891-060X. doi:10.3402/mehd.v26.27663. URL → pages 5, 1282[36] J. B. H. Martiny, S. E. Jones, J. T. Lennon, and A. C. Martiny. Microbiomesin light of traits: A phylogenetic perspective. Science (New York, N.Y.), 350(6261):aac9323, Nov. 2015. ISSN 1095-9203.doi:10.1126/science.aac9323. → page 2[37] D. McDonald, M. N. Price, J. Goodrich, E. P. Nawrocki, T. Z. DeSantis,A. Probst, G. L. Andersen, R. Knight, and P. Hugenholtz. An improvedGreengenes taxonomy with explicit ranks for ecological and evolutionaryanalyses of bacteria and archaea. The ISME Journal, 6(3):610–618, Mar.2012. ISSN 1751-7370. doi:10.1038/ismej.2011.139. URL Number: 3 Publisher:Nature Publishing Group. → page 4[38] P. J. McMurdie and S. Holmes. Waste Not, Want Not: Why RarefyingMicrobiome Data Is Inadmissible. PLOS Computational Biology, 10(4):e1003531, Apr. 2014. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003531.URL Public Library of Science. → pages 5, 29[39] J. Ning and R. G. Beiko. Phylogenetic approaches to microbial communityclassification. Microbiome, 3, Oct. 2015. ISSN 2049-2618.doi:10.1186/s40168-015-0114-5. URL → pages7, 14, 16, 75[40] M. Oudah and A. Henschel. Taxonomy-aware feature engineering formicrobiome classification. BMC Bioinformatics, 19(1):227, June 2018.ISSN 1471-2105. doi:10.1186/s12859-018-2205-3. URL → pages3, 7, 17, 18, 19, 23, 75[41] J. N. Paulson, O. C. Stine, H. C. Bravo, and M. Pop. Differential abundanceanalysis for microbial marker-gene surveys. Nature Methods, 10(12):1200–2, Dec. 2013. ISSN 15487091.doi: URL Num Pages: 6 Place: New York, United StatesPublisher: Nature Publishing Group. → page 12[42] A. Plantinga, X. Zhan, N. Zhao, J. Chen, R. R. Jenq, and M. C. Wu.MiRKAT-S: a community-level test of association between the microbiota83and survival times. Microbiome, 5(1):17, Feb. 2017. ISSN 2049-2618.doi:10.1186/s40168-017-0239-9. URL → page 21[43] R. Poretsky, L. M. Rodriguez-R, C. Luo, D. Tsementzi, and K. T.Konstantinidis. Strengths and Limitations of 16S rRNA Gene AmpliconSequencing in Revealing Temporal Microbial Community Dynamics. PLoSONE, 9(4), Apr. 2014. ISSN 1932-6203. doi:10.1371/journal.pone.0093827.URL → page 4[44] N. Qin, F. Yang, A. Li, E. Prifti, Y. Chen, L. Shao, J. Guo, E. Le Chatelier,J. Yao, L. Wu, J. Zhou, S. Ni, L. Liu, N. Pons, J. M. Batto, S. P. Kennedy,P. Leonard, C. Yuan, W. Ding, Y. Chen, X. Hu, B. Zheng, G. Qian, W. Xu,S. D. Ehrlich, S. Zheng, and L. Li. Alterations of the human gut microbiomein liver cirrhosis. Nature, 513(7516):59–64, Sept. 2014. ISSN 1476-4687.doi:10.1038/nature13568. URL Number: 7516 Publisher:Nature Publishing Group. → page 1[45] T. W. Randolph, S. Zhao, W. Copeland, M. Hullar, and A. Shojaie.KERNEL-PENALIZED REGRESSION FOR ANALYSIS OFMICROBIOME DATA. The annals of applied statistics, 12(1):540–566,Mar. 2018. ISSN 1932-6157. doi:10.1214/17-AOAS1102. URL → pages 7, 14[46] J. Rivera-Pinto, J. J. Egozcue, V. Pawlowsky-Glahn, R. Paredes,M. Noguera-Julian, and M. L. Calle. Balances: a New Perspective forMicrobiome Analysis. mSystems, 3(4), Aug. 2018. ISSN 2379-5077.doi:10.1128/mSystems.00053-18. URL → pages 5, 15[47] M. D. Robinson and A. Oshlack. A scaling normalization method fordifferential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, Mar. 2010. ISSN 1474-760X. doi:10.1186/gb-2010-11-3-r25. URL → pages 5, 30[48] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductorpackage for differential expression analysis of digital gene expression data.Bioinformatics, 26(1):139–140, Jan. 2010. ISSN 1367-4803.doi:10.1093/bioinformatics/btp616. URL → page 584[49] B. Routy, E. L. Chatelier, L. Derosa, C. P. M. Duong, M. T. Alou,R. Daille`re, A. Fluckiger, M. Messaoudene, C. Rauber, M. P. Roberti,M. Fidelle, C. Flament, V. Poirier-Colame, P. Opolon, C. Klein, K. Iribarren,L. Mondrago´n, N. Jacquelot, B. Qu, G. Ferrere, C. Cle´menson, L. Mezquita,J. R. Masip, C. Naltet, S. Brosseau, C. Kaderbhai, C. Richard, H. Rizvi,F. Levenez, N. Galleron, B. Quinquis, N. Pons, B. Ryffel, V. Minard-Colin,P. Gonin, J.-C. Soria, E. Deutsch, Y. Loriot, F. Ghiringhelli, G. Zalcman,F. Goldwasser, B. Escudier, M. D. Hellmann, A. Eggermont, D. Raoult,L. Albiges, G. Kroemer, and L. Zitvogel. Gut microbiome influencesefficacy of PD-1–based immunotherapy against epithelial tumors. Science,359(6371):91–97, Jan. 2018. ISSN 0036-8075, 1095-9203.doi:10.1126/science.aan3706. URL Publisher: AmericanAssociation for the Advancement of Science Section: Report. → page 1[50] S. T. Rush, C. H. Lee, W. Mio, and P. T. Kim. The Phylogenetic LASSO andthe Microbiome. arXiv:1607.08877 [q-bio, stat], July 2016. URL arXiv: 1607.08877. → pages7, 17, 18, 22, 75[51] H. L. Sanders. Marine Benthic Diversity: A Comparative Study. TheAmerican Naturalist, 102(925):243–282, 1968. ISSN 0003-0147. → page 5[52] K. Sankaran and S. Holmes. structSSI: Simultaneous and Selective Inferencefor Grouped or Hierarchically Structured Data. Journal of statisticalsoftware, 59(13):1–21, 2014. ISSN 1548-7660. doi:10.18637/jss.v059.i13.URL → page 2[53] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B.Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson, J. W.Sahl, B. Stres, G. G. Thallinger, D. J. V. Horn, and C. F. Weber. Introducingmothur: Open-Source, Platform-Independent, Community-SupportedSoftware for Describing and Comparing Microbial Communities. Appliedand Environmental Microbiology, 75(23):7537–7541, Dec. 2009. ISSN0099-2240, 1098-5336. doi:10.1128/AEM.01541-09. URL Publisher: American Society forMicrobiology Section: METHODS. → page 6[54] C. E. Shannon. A mathematical theory of communication. The Bell SystemTechnical Journal, 27(3):379–423, July 1948. ISSN 0005-8580.85doi:10.1002/j.1538-7305.1948.tb01338.x. Conference Name: The BellSystem Technical Journal. → page 15[55] A. Statnikov, M. Henaff, V. Narendra, K. Konganti, Z. Li, L. Yang, Z. Pei,M. J. Blaser, C. F. Aliferis, and A. V. Alekseyenko. A comprehensiveevaluation of multicategory classification methods for microbiomic data.Microbiome, 1(1):11, Apr. 2013. ISSN 2049-2618.doi:10.1186/2049-2618-1-11. URL → pages 13, 14, 70[56] C. Stewart. An approach to measure distance between compositional dietestimates containing essential zeros. Journal of Applied Statistics, 44(7):1137–1152, May 2017. ISSN 0266-4763.doi:10.1080/02664763.2016.1193846. URL Publisher: Taylor &Francis eprint: → page14[57] A. Susin, Y. Wang, K.-A. Leˆ Cao, and M. L. Calle. Variable selection inmicrobiome compositional data analysis. NAR Genomics andBioinformatics, 2(2), June 2020. doi:10.1093/nargab/lqaa029. URL → pages14, 17[58] T. K. Takaro, J. A. Scott, R. W. Allen, S. S. Anand, A. B. Becker, A. D.Befus, M. Brauer, J. Duncan, D. L. Lefebvre, W. Lou, P. J. Mandhane, K. E.McLean, G. Miller, H. Sbihi, H. Shu, P. Subbarao, S. E. Turvey, A. J.Wheeler, L. Zeng, M. R. Sears, and J. R. Brook. The Canadian HealthyInfant Longitudinal Development (CHILD) birth cohort study: assessmentof environmental exposures. Journal of Exposure Science & EnvironmentalEpidemiology, 25(6):580–592, Nov. 2015. ISSN 1559-0631.doi:10.1038/jes.2015.7. URL → page 28[59] J. Thompson, R. Johansen, J. Dunbar, and B. Munsky. Machine learning topredict microbial community functions: An analysis of dissolved organiccarbon from litter decomposition. PLOS ONE, 14(7):e0215502, July 2019.ISSN 1932-6203. doi:10.1371/journal.pone.0215502. URL Public Library of Science. → page 1486[60] R. Tibshirani. Regression Shrinkage and Selection Via the Lasso. Journal ofthe Royal Statistical Society, Series B, 58:267–288, 1994. → page 24[61] V. N. Vapnik. Statistical learning theory. Adaptive and learning systems forsignal processing, communications, and control. Wiley, New York, 1998.ISBN 978-0-471-03003-4. → page 13[62] J. Wassan, H. Wang, F. Browne, and H. Zheng. Phy-PMRFI:Phylogeny-Aware Prediction of Metagenomic Functions Using RandomForest Feature Importance. IEEE Transactions on NanoBioscience, PP:1–1,Apr. 2019. doi:10.1109/TNB.2019.2912824. → pages7, 14, 17, 18, 19, 22, 75[63] J. T. Wassan, H. Wang, F. Browne, and H. Zheng. PAAM-ML: A novelPhylogeny and Abundance aware Machine Learning Modelling Approachfor Microbiome Classification. In 2018 IEEE International Conference onBioinformatics and Biomedicine (BIBM), pages 44–49, Dec. 2018.doi:10.1109/BIBM.2018.8621382. → pages 7, 17, 18, 19, 22, 75[64] S. Weiss, Z. Z. Xu, S. Peddada, A. Amir, K. Bittinger, A. Gonzalez,C. Lozupone, J. R. Zaneveld, Y. Va´zquez-Baeza, A. Birmingham, E. R.Hyde, and R. Knight. Normalization and microbial differential abundancestrategies depend upon data characteristics. Microbiome, 5(1):27, Mar. 2017.ISSN 2049-2618. doi:10.1186/s40168-017-0237-y. URL → pages 5, 12, 14, 29, 30[65] P. C. Y. Woo, S. K. P. Lau, J. L. L. Teng, H. Tse, and K.-Y. Yuen. Then andnow: use of 16S rDNA gene sequencing for bacterial identification anddiscovery of novel bacteria in clinical microbiology laboratories. ClinicalMicrobiology and Infection: The Official Publication of the EuropeanSociety of Clinical Microbiology and Infectious Diseases, 14(10):908–934,Oct. 2008. ISSN 1469-0691. doi:10.1111/j.1469-0691.2008.02070.x. →page 4[66] J. Xiao and J. Chen. Phylogeny-Based Kernels with Application toMicrobiome Association Studies. pages 217–237. Jan. 2017. ISBN978-3-319-69415-3. doi:10.1007/978-3-319-69416-0 13. → pages 7, 11, 21[67] J. Xiao, H. Cao, and J. Chen. False discovery rate control incorporatingphylogenetic tree increases detection power in microbiome-wide multipletesting. Bioinformatics, 33(18):2873–2881, Sept. 2017. ISSN 1367-4803.doi:10.1093/bioinformatics/btx311. URL → pages 1, 1287[68] J. Xiao, L. Chen, S. Johnson, Y. Yu, X. Zhang, and J. Chen. PredictiveModeling of Microbiome Data Using a Phylogeny-Regularized GeneralizedLinear Mixed Model. Frontiers in Microbiology, 9, 2018. ISSN 1664-302X.doi:10.3389/fmicb.2018.01391. URL Frontiers. → page 7[69] J. Xiao, L. Chen, Y. Yu, X. Zhang, and J. Chen. A Phylogeny-RegularizedSparse Regression Model for Predictive Modeling of Microbial CommunityData. Frontiers in Microbiology, 9, Dec. 2018. ISSN 1664-302X.doi:10.3389/fmicb.2018.03112. URL → pages2, 7, 11, 18, 21[70] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama.High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso.Neural Computation, 26(1):185–207, Jan. 2014. ISSN 0899-7667,1530-888X. doi:10.1162/NECO a 00537. URL arXiv: 1202.0515. → page 23[71] N. D. Youngblut, G. H. Reischer, W. Walters, N. Schuster, C. Walzer,G. Stalder, R. E. Ley, and A. H. Farnleitner. Host diet and evolutionaryhistory explain different aspects of gut microbiome diversity amongvertebrate clades. Nature Communications, 10, May 2019. ISSN 2041-1723.doi:10.1038/s41467-019-10191-3. URL → page 2[72] J. Zhai, J. Kim, K. S. Knox, H. L. I. Twigg, H. Zhou, and J. J. Zhou.Variance Component Selection With Applications to MicrobiomeTaxonomic Data. Frontiers in Microbiology, 9, 2018. ISSN 1664-302X.doi:10.3389/fmicb.2018.00509. URL → pages2, 7, 16, 17, 18, 22, 23, 75[73] X. Zhan, A. Plantinga, N. Zhao, and M. C. Wu. A fast small-sample kernelindependence test for microbiome community-level association analysis.Biometrics, 73(4):1453–1463, Dec. 2017. ISSN 0006-341X.doi:10.1111/biom.12684. URL → page 21[74] N. Zhao, J. Chen, I. Carroll, T. Ringel-Kulka, M. Epstein, H. Zhou, J. Zhou,Y. Ringel, H. Li, and M. Wu. Testing in Microbiome-Profiling Studies with88MiRKAT, the Microbiome Regression-Based Kernel Association Test.American Journal of Human Genetics, 96(5):797–807, May 2015. ISSN0002-9297. doi:10.1016/j.ajhg.2015.04.003. URL R documentation: → pages7, 11, 16, 18, 21, 76[75] Y.-H. Zhou and P. Gallins. A Review and Tutorial of Machine LearningMethods for Microbiome Host Trait Prediction. Frontiers in Genetics, 10,2019. ISSN 1664-8021. doi:10.3389/fgene.2019.00579. URL → pages2, 70[76] H. Zou and T. Hastie. Regularization and Variable Selection via the ElasticNet. Journal of the Royal Statistical Society. Series B (StatisticalMethodology), 67(2):301–320, 2005. ISSN 1369-7412. URL Publisher: [Royal Statistical Society,Wiley]. → pages 16, 2489


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items