Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A stable and robust method to identify modules of functionally coherent genes Takhar, Mandeep Kaur 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2014_november_takhar_mandeep.pdf [ 3.11MB ]
JSON: 24-1.0166054.json
JSON-LD: 24-1.0166054-ld.json
RDF/XML (Pretty): 24-1.0166054-rdf.xml
RDF/JSON: 24-1.0166054-rdf.json
Turtle: 24-1.0166054-turtle.txt
N-Triples: 24-1.0166054-rdf-ntriples.txt
Original Record: 24-1.0166054-source.json
Full Text

Full Text

A Stable and Robust Method to Identify Modules ofFunctionally Coherent GenesbyMandeep Kaur TakharB.Sc., The University of British Columbia, 2008B.Sc., The University of British Columbia, 2006A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University Of British Columbia(Vancouver)October 2014c©Mandeep Kaur Takhar, 2014AbstractComplex cellular functions are carried out by the coordinated activity of networksof genes and gene products. In order to understand mechanisms of disease and dis-ease pathogenesis, it is crucial to develop an understanding of these complex inter-actions. Microarrays provide the potential to explore large scale cellular networksby measuring the expression of thousands of genes simultaneously. The purposeof our project is to develop a stable and robust method that can identify, from suchgene expression data, modules of genes that are involved in a common functionalrole. These modules can be used as a first step in systems scale analyses to extractvaluable information from various gene expression studies. Our method constructsmodules by identifying genes that are co-expressed across many diseases. We useperipheral blood microarray samples from patients having one of several diseasesand cluster the genes in each disease group separately. We then identify genes thatcluster together across all disease groups to construct our modules. We first use ourmethod to construct baseline peripheral blood modules relevant to the lung using 5groups of peripheral blood microarray samples that were collected as controls forseparate studies. An enrichment analysis using gene sets from a number of path-way and ontology databases reveals the biological significance of our modules. Weutilize our background modules by doing an enrichment analysis on a list of genesthat were differentially expressed in a COPD case vs. control study and identifymodules that are enriched in that list.Although a similar approach has been used to identify modules of genes that arecoordinately expressed across multiple conditions, we show that our method is animprovement as it is robust to the order in which the different disease datasets areiipresented to the algorithm. We also apply our procedure to 3 different datasetsincluding a COPD dataset, a COPD normal dataset and a lung tissue dataset. Wethen assess the stability of our method by performing a resampling experiment onour module construction procedure and find that our method repeatedly producesmodules with high concordance which is measured by Jaccard distance.iiiPrefaceThe research in this thesis was carried out under the supervision of Dr. RaymondT. Ng. I was the primary researcher in all work presented.Part of the data was provided by The PROOF Centre of Excellence and was col-lected from Chronic Obstructive Pulmonary Disease (COPD) patients enrolled inthe Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints(ECLIPSE) Study. Part of the data was collected by Dr. Anne Ellis at Queen’sUniversity. The remaining data was obtained from publicly available sources. Dataused in Chapters 4 and 5 was prepared by The PROOF Centre of Excellence.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Motivation and contribution . . . . . . . . . . . . . . . . . . . . 21.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Clustering based methods . . . . . . . . . . . . . . . . . . . . . . 52.1.1 Chaussabel et al. method . . . . . . . . . . . . . . . . . . 52.1.2 Other relevant clustering based methods to derive func-tional modules . . . . . . . . . . . . . . . . . . . . . . . 62.2 Reverse engineering approaches . . . . . . . . . . . . . . . . . . 92.2.1 WGCNA method . . . . . . . . . . . . . . . . . . . . . . 92.2.2 Other relevant reverse engineering methods to derive func-tional modules . . . . . . . . . . . . . . . . . . . . . . . 102.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12v3 Module Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 133.1 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . 133.1.1 Selecting the number of clusters . . . . . . . . . . . . . . 143.1.2 Selecting the distance function . . . . . . . . . . . . . . . 143.1.3 Parameters used . . . . . . . . . . . . . . . . . . . . . . 153.2 Cluster matching . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2.1 Cluster matching algorithm . . . . . . . . . . . . . . . . 163.2.2 Cluster matching and switched labels . . . . . . . . . . . 183.2.3 Order independence . . . . . . . . . . . . . . . . . . . . 183.3 Module extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 203.4 Evaluation of the module construction procedure . . . . . . . . . 233.4.1 Jaccard distance and order independence . . . . . . . . . 233.5 Background modules . . . . . . . . . . . . . . . . . . . . . . . . 243.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.5.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 263.5.3 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . 274 Module Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 324.1 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.1.1 Stability within datasets . . . . . . . . . . . . . . . . . . 334.1.2 Stability across datasets . . . . . . . . . . . . . . . . . . 364.2 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.2.1 Number of samples . . . . . . . . . . . . . . . . . . . . . 384.2.2 Number of modules . . . . . . . . . . . . . . . . . . . . . 394.2.3 Module size . . . . . . . . . . . . . . . . . . . . . . . . . 405 Application of Baseline Modules . . . . . . . . . . . . . . . . . . . . 435.1 Application to COPD . . . . . . . . . . . . . . . . . . . . . . . . 435.1.1 Description of COPD . . . . . . . . . . . . . . . . . . . . 435.1.2 Description of COPD study . . . . . . . . . . . . . . . . 455.1.3 Using modules for the discovery of transcriptional biomark-ers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45vi5.1.4 Differential expression analysis on a module-by-modulebasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465.2 Baseline module functional annotation . . . . . . . . . . . . . . . 575.2.1 Tissue enrichment . . . . . . . . . . . . . . . . . . . . . 575.2.2 Pathway enrichment . . . . . . . . . . . . . . . . . . . . 586 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65viiList of TablesTable 3.1 Information of samples used . . . . . . . . . . . . . . . . . . . 27Table 3.2 List of parameters used in the construction of background modules 27Table 3.3 List of modules created and their respective sizes along withannotation information of probe-sets included in each module . 30Table 4.1 Split information. This table outlines the number of subsetsused in each split (each row represents one split) of the data andthe proportion of overlapping sample in each subset as well asthe number of repetitions used to sample the data set. . . . . . 34Table 4.2 P-values resulting from t-tests performed on the pairwise aver-age Jaccard2 distances for each split of the data. . . . . . . . . 38Table 5.1 Sample information . . . . . . . . . . . . . . . . . . . . . . . 46Table 5.2 Baseline modules that were enriched for probe-sets that are dif-ferentially expressed in IE vs. NE samples as determined bythe LiMMa moderated t-test and an adjusted p-value cut-off of0.05. Those modules having an adjusted p-value < 0.01 arehighlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . 47Table 5.3 A list of the 24 modules that were significantly enriched (ad-justed pvalue <0.01) in one of the 3 lists of differentially ex-pressed genes. The 3 lists include genes that were found to bedifferentially expressed by moderated t-tests performed using30 day IE vs. NE samples, 60 day IE vs. NE samples, and 90day IE vs. NE samples. . . . . . . . . . . . . . . . . . . . . . 51viiiList of FiguresFigure 2.1 (a) Module Construction Algorithm Overview [11] and (b) pseudocode [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Figure 3.1 Matching algorithm applied to diseases d1 and d2 where thegenes in each disease, represented by letters A-Z, were parti-tioned into k=5 clusters each. (a) depicts the d1-optimal resultswhen matching with respect to d1 whereas (b) depicts the d2-optimal solution when matching with respect to d2. . . . . . 19Figure 3.2 Cluster matching algorithm with the carry forward parameterimplemented. The figure on the left shows a d1-optimal so-lution (purple) whereas the right side shows a d2-optimal so-lution (purple). Each solution contains 8 clusters only one ofwhich is not identical in both solutions. . . . . . . . . . . . . 20Figure 3.3 Illustration of iteratively applying the cluster matching algo-rithm to extract modules from our simulated data set of 4 dis-eases, 26 genes, and 5 cluster centers. . . . . . . . . . . . . . 22ixFigure 3.4 Evaluation of module concordance using Jaccard distance whenvarying the carry forward parameter from 0% to 40% (a) indi-cates the change in Jaccard1 distance, in red, and the changein Jaccard2 distance, in blue. The Jaccard distances are plot-ted along the y-axis and the mean Jaccard distance across allpermutations are indicated in the text. In (b) we’ve illustratedthe change in runtime, in minutes, when the carry forward pa-rameter is increased. The time in seconds is indicated in thetext. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25Figure 3.5 Preprocessing steps used on microarray data before the moduleconstruction procedure was applied to construct backgroundmodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 3.6 Visualization of the separation in our modules by the first prin-ciple component of a PCA analysis. . . . . . . . . . . . . . . 31Figure 4.1 Average pairwise Jaccard2 distances between modules con-structed using our module construction procedure on multipledata sets created by repeated random sampling of three origi-nal data sets (a) Peripheral blood samples from patients havingCOPD, (b) lung tissue samples from normal subjects, and (c)peripheral blood samples from COPD control subjects. . . . . 37Figure 4.2 The relationship between input sample size and Jaccard2 dis-tance. Two subsets of the COPD data with a varying numberof samples are created by repeated random sampling, with avarying number of repetitions used. The module constructionprocedure is applied to both subsets and the average Jaccard2distance is calculated. . . . . . . . . . . . . . . . . . . . . . 40Figure 4.3 The relationship between the number of input samples usedand the number of modules constructed. . . . . . . . . . . . . 41Figure 4.4 Relationship between the size of the universe of probe-sets,number of modules, and input sample size. . . . . . . . . . . 41Figure 4.5 The distribution of module size. The colors indicate the rangeof the number of modules constructed. . . . . . . . . . . . . . 42xFigure 5.1 Venn diagram indicating the advantage of performing differ-ential expression analysis on a module-by-module basis ratherthan a differential expression analysis on the universe of probe-sets that pass the pre-processing steps. . . . . . . . . . . . . . 49Figure 5.2 (a) Venn diagram showing the 391 probe-sets differentially ex-pressed in either the 30 day IE vs. NE samples or 60 day IE vsNE samples that were found in our modules. Of the 391 probe-sets, 132 were differentially expressed in both lists, 182 weredifferentially expressed in only the 60 day IE vs. NE sam-ples and 77 were differentially expressed in only the 30 day IEv. NE samples. (b) An illustration of the differences in meanfold-change of genes that were significant in the 30 day IE vs.NE samples compared to the mean fold-change of genes thatwere significant in the 60 day IE vs. NE samples. Blue circlesrepresent a mean negative fold-change and red circles repre-sent a mean positive fold-change. The number of up-regulatedprobe-sets and the number of down-regulated probe-sets areindicated by the text in each circle . . . . . . . . . . . . . . . 53Figure 5.3 The first principal component of module genes enriched for IEvs. NE differentially expressed genes was computed. The sam-ples were stratified into 30 day IE samples, 60 day IE samplesand NE samples and projected onto the first principal compo-nent. The boxplots on the left indicate the separation of thesesamples by the first principal component and show the devia-tion of IE samples from NE samples. The 30 day IE samplesdeviate farther than the 60 day IE samples. The figure on theright illustrates the same information in module M46. Here,the 30 day IE samples deviate from the NE samples less thanthe 60 day IE samples. . . . . . . . . . . . . . . . . . . . . . 54xiFigure 5.4 Genes that were up regulated and down regulated in IE sam-ples compared to NE samples. The genes that were differ-entially expressed in the 30 day IE group were compared withgenes that were differentially expressed in the 60 day IE group.The venn diagram on the left depicts this information for Mod-ule M5 whereas the venn diagram depicts this information forModule M46. . . . . . . . . . . . . . . . . . . . . . . . . . 55Figure 5.5 Module M5 enrichment in KEGG pathways and GO biolog-ical process. A hypergeometric overlap of genes in moduleM5 and genes in KEGG pathways and GO biological processcategories was performed and a p-value cut-off of 0.05 wasapplied. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58Figure 5.6 Median enrichment values of the genes in each module. Thedifferent tissues are plotted along the x-axis and are groupedby tissue type. The 53 baseline modules are plotted along they-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59Figure 5.7 KEGG pathways associated with our modules. Only modulesthat had at least one significant (adjusted p-value <0.01) path-way association are shown. . . . . . . . . . . . . . . . . . . 61Figure 5.8 Gene Ontology biological processes associated with our mod-ules. Only modules that had at least one significant (adjustedp-value <0.01) biological process associated with them areshown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62xiiChapter 1Introduction1.1 BackgroundComplex cellular functions are carried out by the coordinated activity of networksof genes and gene products. In order to understand mechanisms of disease anddisease pathogenesis, it is crucial to develop an understanding of these complexinteractions in addition to an understanding of individual gene functions [3]. Highthroughput methods for assaying gene expression levels such as microarray ex-periments produce mass amounts of data and have become a standard tool usedin biomedical research. Since microarrays measure the expression of thousandsof genes simultaneously, providing a snapshot of all transcriptional activity takingplace in a sample, they provide the potential to explore large scale cellular net-works [9]. Microarray experiments have frequently focused on identifying changesin gene expression between case-control study groups. Such analyses have oftenled to lists of differentially expressed genes that have been used as biomarkers andhave contributed to the development of new diagnostic tools [5]. However manyof these types of studies encounter analytical challenges as microarray results areprone to include noise which ensues false positive results, and the results oftenare inconsistent across different laboratories and platforms. Furthermore the bio-logical and functional interpretation of such lists of genes has been difficult andis often the bottleneck of many studies when it comes to characterizing diseasesand understanding disease pathogenesis. In recent years as an attempt to mitigate1these problems, there has been a considerable focus on uncovering modules con-sisting of genes involved in a common functional role as a first step in systemsscale analyses [5–7]. Several different computational approaches have been usedto construct such modules [8–11]. The approach focused on in this thesis is theco-expression based approach similar to those used by Chaussabel et al. to createtranscriptional modules that map the human immune system [11] and by Horvathet al. in Weighted Gene Co-expression Analysis (WGCNA) [42].1.2 Motivation and contributionThe motivation of our project was to develop a stable and robust method that canidentify modules of genes that are involved in a common functional role which canbe used as a first step in systems scale analyses and can also be used to extractvaluable information in various studies. Our method builds on previous work doneby Chaussabel et al. which identifies genes that are co-expressed across many dis-eases. We improved this method by making modifications to ensure that it is robustto the order in which different diseases are presented to the algorithm and used itto construct baseline modules pertinent to the lung. We also assessed the stabilityof our method to ensure that we produced stable modules even when using vary-ing samples and different datasets. Furthermore, we aimed to compare our methodto a popular reverse engineering approach to construct modules by applying thatmethod to the same samples and datasets used by our method and comparing theresults. Lastly, we demonstrated the benefits of using the modules constructed byour procedure for biomarker discovery. As a case-study, we used COPD which isa poorly characterized disease consisting of multiple phenotypes.1.3 Thesis outlineThe rest of this thesis is organized as follows. In Chapter 2 we’ve provided anoverview of related work focusing on clustering based methods and reverse en-gineering approaches. In Chapter 3 we’ve provided a detailed description of theimplementation of our module construction procedure as well as an assessmentof the quality of our procedure and finally we applied our procedure to lung datato create baseline modules and illustrated those resulting modules. In Chapter 42we’ve measured the stability and sensitivity of our module construction procedure.We assessed the stability within data sets as well as across different data sets andwe assessed the sensitivity to sample size and module size. In Chapter 5 we’veused our baseline modules to enhance biomarker discovery analysis. Finally ourconclusions have been detailed in Chapter 6.3Chapter 2Related WorkSeveral groups have worked on inferring global regulatory networks by identify-ing strongly co-regulated genes. Most approaches to this fall under two broadcategories: clustering based approaches and reverse engineering approaches. Thenetworks resulting from these methods have frequently been used post hoc to gainbiological insight into experimental results or to advance biological knowledgeof the regulatory interactions of specific genes or processes. Fewer groups haveworked on the identification of modules of genes to use in a more untargeted fash-ion, for example as a first step in differential expression analysis where the goalwould be to find networks of genes that are conserved across conditions but areup or down regulated in case vs control groups. Even fewer studies have usedmodules of genes for differential co-expression analysis where the aim is to iden-tify networks of genes where expression is conserved across case vs. controls butnetwork structure is perturbed. Further, we found very few studies that identifiedmodules of genes and used the perturbations of those modules as the basis forbuilding diagnostic tools or classifiers of poorly characterized conditions, which isthe goal of the work done in this thesis. The focus of this thesis is on a clusteringbased approach however the rest of this chapter will outline popular methods ofboth categories.42.1 Clustering based methodsClustering allows one to identify groups of genes that share the same expressionpattern across a number of different experimental conditions. Since genes havingsimilar expression patterns have a good chance of being functionally related, clus-tering algorithms have often been applied to microarray expression data with theaim of determining functional relationships among genes and to help infer regula-tory networks. The most common clustering methods applied to gene expressiondata are hierarchical clustering [11, 12, 31] and k-means or variations of k-meansclustering[7, 9, 11, 27, 28].2.1.1 Chaussabel et al. methodThe Chaussabel et al. group designed a strategy for constructing modules by identi-fying groups of genes that are coordinately expressed across multiple diseases [11].The reasoning behind this strategy is that the likelihood of many genes having thesame expression pattern across many diseases just by chance is low therefore thesemodules could represent biologically functional units.In this module construction method, K-means clustering is applied to microarraygene expression data of each disease group separately to cluster the transcripts into30 clusters. The number of clusters to use was chosen to be K=30 by the elbow-criterion which requires clustering the datasets numerous times while varying thenumber of clusters, K, each time. The ratio of ’between cluster variance’ and’within cluster variance’ is then assessed for each value of K. The issue of selectingthe number of clusters for k-means clustering is discussed further in Section 3.1.1.Module selection from the clustered data sets is done in a few iterative steps. Theprocedure starts with analyzing cluster membership and selecting the largest set ofgenes belonging to the same cluster in all diseases. This set is then expanded toinclude genes belonging to the same cluster in any combination of n-1 diseases,n-2 diseases, and so forth. The reduction in the number of diseases included incluster membership analysis is implemented to account for the fact that in a givenfunctional unit, some genes may not be expressed in every condition. All genesbelonging to the newly created module are then removed from the pool of genes to5select from for the next iteration. This module selection process, which is depictedin Figure 2.1, is repeated again with the next largest set of co-clustering genes andis continued iteratively until all genes have been considered. A functional annota-tion is assigned to each module using a literature profiling method that associateskeyword occurrence in PubMed abstracts with the genes in each module [10].Chaussabel et al. constructed modules to represent the transcriptional activity ofthe human immune system by using peripheral blood mononuclear (PBMC) sam-ples from individuals with one of 8 immune-mediated diseases. After three roundsof selection, their method resulted in 28 modules. This modular framework hasbeen used to gain biological insight in studies involving poorly characterized dis-eases where the understanding of disease pathogenesis is incomplete [4, 7, 27, 28].These modules have also been used to conduct differential expression analysis atthe modular level by averaging the proportion of each module containing differen-tially expressed genes between 2 groups [9, 11].2.1.2 Other relevant clustering based methods to derive functionalmodulesEisen et al. [12] used a form of hierarchical clustering on both yeast and humangene expression data to separate genes into functional categories. This clusteringmethod is used to assemble all elements into a single tree represented as a den-drogram. The branches of the resulting tree are the desired clusters. A similaritymatrix containing similarity scores for all pairs of genes is computed and used asinput into their clustering algorithm. The clustering algorithm starts with the mostsimilar pair of genes and creates a node by joining the two genes and averagingtheir expression profiles. The similarity matrix is then updated by replacing thetwo joined genes with the new node and this process is repeated until only oneelement remains. Their method was applied to a single time course of a model ofgrowth response in human cells as well as time course data of the yeast cell cycle.In both types of data they found clusters of genes that share similar expression pat-terns over multiple conditions. To ensure the biological origin of this pattern, they6CHAPTER 2. RELATED WORK 7Figure 2.1: (a) Module Construction Algorithm Overview [11] and (b)pseudo code [2]randomized the human growth response data in 3 ways and found no similar clus-tering patterns. They also found that many groups of genes that were co-expressedand represented diverse patterns of expression across the measured conditions wereinvolved in shared cellular processes.Segal et al. [32] applied both hierarchical clustering and the expectation-maximizationmethod, which involves a similar procedure to k-means clustering. They identifiedregulatory modules in yeast stress data using a set of known regulatory genes inaddition to gene expression data. Using hierarchical clustering they created an ini-tial set of 50 modules and then identified for each module a regulation program,which is a small set of regulatory genes that control the mRNA expression of thegenes in that module. The regulation program is structured as a regression treeand is identified through a search over the space of all possible trees. Each gene isthen re-assigned to the module whose regulation program best predicts its behavior.These two steps are implemented using the Expectation-Maximization method andare repeated until convergence. They evaluated the functional coherence of theirmodules using external data sets which resulted in 46 coherent modules. Since theyeast cell cycle and stress response has been thoroughly studied and well docu-mented, they were able to evaluate the ability of their method to derive regulationfrom expression.Another group of researchers implemented a biclustering method [30] to iden-tify co-regulated groups of genes which were then used infer global regulatorynetworks using their own network inference procedure [8, 15]. They used a bi-clustering method where both genes as well as conditions were clustered, to allowsome genes to be grouped into multiple clusters as they may be involved in multi-ple cellular processes and also to account for the chance that some groups of genesmay be co-regulated only in subsets of conditions, similar to the rationale behindsome of Chaussabel et al.’s implementation. Similar to Segal et al.’s incorpora-tion of a priori knowledge to their clustering method, Bonneau et al. integratedknown functional associations and the occurrence or detection of sequence motifswith the biclustering of genes and conditions. While many groups commonly usethese types of associations to assess the quality and biological relevance of their8clusters post-hoc, Bonneau et al. treated these associations as priors with appropri-ately assigned weights (based on prior knowledge and biological relevance). Thisapproach helps reduce the rate of false positives compared to clustering methodsalone. Their clustering method was applied to a Halobacterium data set which ledto 300 biclusters. These biclusters were used as input to their regulatory networkinference procedure [8] which identified a network of 1,431 regulatory influencesof varying strengths. This regulatory network inference procedure has more re-cently been enhanced further by incorporating into it structure priors from multipledata sources [15].2.2 Reverse engineering approachesReverse engineering approaches use graphs to model the structure of co-regulationnetworks where genes are represented by nodes and the co-regulation relationshipsbetween genes are represented by edges. The co-regulation relationships and rela-tive strengths are typically defined by adjacency values. The selection of the adja-cency metric is an important step in these methods and not a trivial one. Once theadjacency measure is selected, several different approaches can be used to removespurious edges and extract important co-regulation modules. An important aspectof these approaches is the identification of hub genes in the resulting modules andnetworks. Hub genes are the few highly connected nodes that account for mostof the connections in the network and link the rest of the less connected nodes tothe system. Hub genes are important in regulatory network inference as they likelyhave a fundamental role and may be drivers of disease due to their vital positionsin the networks.2.2.1 WGCNA methodThe weighted gene co-expression network analysis (WGCNA) method implementedby Horvath et al. constructs networks or modules based on co-expression of themost highly varying genes in microarray gene expression data [42]. This methodbuilds on pre-existing network concepts such as scale-free networks and constructsmodules consisting of genes that have high topological overlap, that is, genes thatare roughly connected to the same group of genes in the network. Hub genes are9central to the topology of this type of network as they link the rest of the less con-nected nodes to the system and provide robustness to errors and false connections.An important feature of the WGCNA method is the use of a soft threshold. In thisnetwork a connection strength, or weight, between two genes is derived from thePearson correlation and is used to define how a pair of nodes is connected ratherthan using the typical threshold cut-off and a binary value to represent connected orun-connected nodes. Using this soft-threshold prevents information loss and allowsone to maintain the continuous spectrum between complete co-expression and noco-expression. The topological overlap matrix, derived from the adjacency matrixof connection strengths, is then used with average linkage hierarchical clusteringto define modules in the network. This module construction process has been usedon various data sets for different studies including simulated data, yeast data, andchimpanzee data in addition to human data. After constructing modules, Horvathet al. use the first principal component of a module expression matrix, known as theeigen-gene, to represent a module. This reduction of each module into one expres-sion profile facilitates the use of analysis methods frequently used in patient-basedmicroarray studies, such as differential expression and causal analysis, but at themodular level. The application of WGCNA modules has been illustrated in variousstudies [13, 18, 29, 40].2.2.2 Other relevant reverse engineering methods to derivefunctional modulesAnother frequently used reverse engineering approach called ARACNe (Algorithmfor the Reconstruction of Accurate Cellular Networks) infers global regulatory net-works [26]. This method uses mutual information to identify the gene co-regulationrelationships. Mutual information of two genes measures the genes mutual depen-dence. A mutual information threshold is also applied in this method to eliminateany spurious non-significant edges. After generating the gene co-regulation rela-tionships, ARACNe uses data processing inequality to eliminate indirect relation-ships between genes. Indirect relationships between two genes are those wherethe genes are separated by two or more links with other genes. These types ofgenes may be co-expressed even though there is no direct regulatory interaction be-tween them. This differentiation between direct and indirect relationships between10genes reduces false positive and is a fundamental difference between ARACNe andother co-expression networks. When applied to expression data of human B cells,ARACNe identified a major hub which led to the identification of new target genescontrolled by the MYC transcription factor which is known to regulate a networkof known target genes [5].GeneNet is another well-developed method that infers global regulatory networks[31]. GeneNet is a Gaussian graphical model which assumes gene expression val-ues are jointly Gaussian distributed. In this method pairwise gene co-regulationrelationships are defined by partial correlation estimates where the correlation ofany two genes is conditioned on the remainder of the genes. This method uses im-proved estimates of the correlation matrix by using regularization methods whichalso makes this method suitable for situations where the sample size is small rela-tive to the number of variables as in microarray data. As in the ARACNe method,this graphical model is also able to distinguish between direct and indirect interac-tions. To obtain the final network, the significance of the correlations is tested andedges that represent spurious correlations are differentiated from those that shouldbe included in the resulting network. To accomplish those tasks, GeneNet uses amethod called the local fdr network search which calculates the posterior probabil-ity for the presence/absence of an edge, and takes account of the multiplicity in thesimultaneous testing of edges. This method when applied to gene expression datafrom a breast cancer study containing 49 samples and expression measurementsof 7129 genes also revealed several highly connected hub genes involved in keyprocesses such as the control of tumor growth [31].Another group of researches implemented a different method to infer global regu-latory networks called GENIE3 (GEne Network Inference with Ensemble of Trees)[20]. In their method, the problem of finding estimates of co-regulation is convertedto a feature selection problem: for any given target gene they define the identifi-cation of its regulatory genes as determining the subset of genes whose expressionis predictive of the expression of the target gene. This method therefore is able topredict directed edges to represent co-regulation relationships. In this method treebased ensemble methods (Random Forests and Extra Trees) are used to solve the11feature selection problem. The tree-based methods are able to deal with combina-torial regulations and non-linear relationships which is an advantage of this methodover other reverse engineering approaches. Tree based methods provide a variableimportance measure, which is a ranking of features by their importance in predict-ing the output variable. Once this ranking is computed for all genes a thresholdcan be applied, similar to many other reverse engineering approaches, to removethe edges that are thought to be spurious. One of the datasets they tested GENIE3on was the E.Coli dataset for which prior, experimentally validated, informationon which genes are transcription factors was available. Their method performedvery well on this dataset however, when prior information was unavailable, theperformance of their method dropped.2.3 SummaryMany of the reverse engineering approaches aim to identify all global transcrip-tional interactions or some transcriptional interactions with high confidence. Thesemethods have been used in a targeted way to identify transcriptional regulators ofspecific genes and to further the understanding of known regulatory processes. Be-cause the WGCNA method focuses on identifying modules of tightly connectedgenes rather than a global regulatory network, it can be used in a more untargetedfashion in differential co-expression analysis which is more suitable for the anal-ysis of poorly characterized conditions. To look for perturbations of networks ormodules, a highly robust method of constructing modules is required. Chauss-abel et al.’s clustering approach to module construction provides this robustnessby looking at clustering patterns of samples collected under many different experi-mental conditions. Another advantage of this framework is that it allows flexibilityin the clustering method and similarity measure as well as flexibility in the datasetsused. Since the datasets are clustered separately, it provides potential to extend itsuse to more heterogeneous data sets from multiple sources and platforms.12Chapter 3Module ImplementationIn Chapter 2 we introduced the approach used by Chaussabel et al. to constructmodules that were used to create a map of the human immune system. The basisof their method was the identification of genes that co-cluster across multiple con-ditions. The method we’ve used to implement our modules also uses that notionas our method builds on the work done by Chaussabel et al. In this chapter we’veprovide detailed descriptions of the steps involved in our implementation.3.1 K-means clusteringClustering methods are applied to data to partition all data points into distinct clus-ters where the points in any given cluster are closer, or more similar, to each otherthan to points belonging to different clusters. This similarity of data points is asimilarity in terms of some variables of interest and is typically determined bysome distance function [17]. Numerous methods can be used to perform clusteringanalysis but in our module construction algorithm we’ve used k-means clusteringto cluster each of our datasets. The k-means clustering algorithm is a top downtechnique as it randomly divides all data points into a predetermined number ofclusters. It then iteratively reassigns each data point to the cluster that has a cen-ter that is closest to that point. The following two key steps are repeated until13convergence in the algorithm [17]:1. For each data point, identify the closest cluster center (measured by a dis-tance function).2. Replace each cluster center by the average of all data points that are closestto it.In our module construction procedure, k-means was applied separately to the datasetof each disease group. As mentioned above this algorithm uses a predeterminednumber of clusters therefore, we can be sure that each disease dataset is beingpartitioned into the same number of clusters.3.1.1 Selecting the number of clustersSelecting the number of clusters for k-means is a subjective matter and the appro-priate number to use can be ambiguous. Chaussabel et al. chose 30 clusters forthe k-means step of their module construction process. This number was decidedupon using the frequently utilized elbow criterion where the proportion of vari-ance explained or the within sum of squares for each clustering is a function of thenumber of clusters. Ideally the proportion of variance explained increases with thenumber of clusters until a certain point where the curve levels off as the increasein number of clusters no longer proves a large increase in proportion of varianceexplained. That point indicates the optimal number of clusters. Chaussabel et al.selected the largest number of optimal clusters indicated by this method for any oftheir datasets. We’ve maintained the use of 30 clusters for the k-means step of ourmodule construction process as 30 clusters explained approximately 97% of thevariance in our datasets and an increase beyond 30 provided very little increase inthe proportion of variance explained.3.1.2 Selecting the distance functionSelecting a dissimilarity measure, or distance function, is fundamental to all clus-tering methods. The k-means algorithm is intended for use with data containingquantitative variables and with Euclidean distance as the chosen dissimilarity mea-sure [17]. Therefore we have selected Euclidean distance as our distance function14with our variables of interest being quantitative measurements of the expression ofeach gene on the microarray. Euclidean distance is defined as:d(xi,xi′) =p∑j=1(xi j− xi′ j)2 = ||xi− xi′ || (3.1)Using Euclidean distance on genes assumes that every gene is ”equi-distant” fromany other gene. This is likely an over simplifying assumption and creates a biastowards round clusters. Our module construction procedure can easily be adaptedto use a number of dissimilarity measures to partition our data into k clusters bymaking minor changes such as partitioning around medoids rather than means.3.1.3 Parameters usedAs mentioned above, k-means clustering partitions data points into k clusters sothat the sum of squares from data points to the assigned cluster’s center is min-imized. Since the initial k centers are arbitrarily chosen, the solution will likelynot be the minimal sum of squares of all possible partitions [16]. Instead, a localminimum is returned where movement of a data point from one cluster to anotherwon’t reduce the within cluster sum of squares any further. One option to help ar-rive at a more stable solution is to attempt multiple initial configurations and reportthe one delivering the best solution. In our module construction procedure we’veused 10 random initial configurations which can also be adjusted by varying oneparameter. Although this provides a minimum over several partitions, it still doesnot guarantee a global minimum.Because the k-means algorithm iterates over two fundamental steps until conver-gence with the upper bound on its runtime being exponential in the number of datapoints [37], we’ve limited the number of iterations to 50 in order to achieve a rea-sonable execution runtime.153.2 Cluster matchingIn our module construction procedure, each disease condition was separately par-titioned into 30 clusters using k-means. Because a module should be composed ofgenes that cluster together in multiple conditions, it was essential to identify cor-responding clusters across all conditions. Cluster matching is a non-trivial issuethat becomes problematic due to two properties that are inherent to the clusteringprocess. The first issue arises due to the stochastic properties of the k-means al-gorithm. Since the k-means algorithm starts with a random initial configuration ofcenters and there is no guarantee of a solution that is the global minimum, replicatecluster analyses can produce different solutions. In our case of clustering acrossdifferent biological conditions, there were also biological factors contributing todifferences in cluster membership in the different groups. The second is a rathertrivial problem caused by the method of label assignment to the clusters. Labelsare arbitrarily assigned during clustering therefore even with replicate clusteringanalyses on the same data, clusters with identical contents may take on differentlabels.3.2.1 Cluster matching algorithmTo resolve the issue of matching clusters across different clustering analyses, weconsidered it as a graph matching problem where the sets of clusters generated byeach clustering analysis were the disjoint sets to match. We used a relaxed variationof the weighted stable marriage problem to solve it. The stable marriage problem(SMP) is a well-known problem of matching elements in one set to elements inanother disjoint set. The elements in each set have associated with them a list ofstrictly ordered preferences for all elements in the other set and SMP provides amatching such that no two elements, which are not matched to each other, bothprefer each other over their current matches [14]. In the weighted stable marriageproblem, each element provides a score or a weight rather than a ranking for el-ements in the other set. There exist more general versions of SMP that allow forpartial ordering of preferences which is suitable for practical applications where analternative pairing is not possible for an element of a set [21]. We used this gener-alization in our matching algorithm as clusters in one set will only match with all16clusters in another set in the worst case.One further adjustment we made to the generalized SMP was to allow for a many-to-one pairing so multiple elements of one set may be paired with the same elementof the second set. This was desired because we were interested only in the genesthat clustered together in multiple conditions and these co-clustered genes may bea subset of larger clusters in each of the conditions, or some conditions may havemerged or split clusters relative to another condition.As an example, which we’ve illustrated in Figure 3.1, we used a simulated dataset of 26 probe-sets represented by letters A-Z and 4 disease groups which havebeen arbitrarily relabeled as d1-d4. The probe-sets in each disease group were par-titioned into k=5 clusters by the k-means algorithm. When matching two sets ofclusters (d1-d2 or d3-d4), we followed these steps:1. Consider 2 sets of clusters (d1 and d2)2. Initialize an adjacency matrix using clusters of d1 as the rows and clusters ofd2 as columns3. Compute the length of the intersection between each row and each columnand assign that as a weight in the matrix. For example C1 of d1 intersectswith C7, C8 and C10 of d2 and the respective lengths of those intersectionsare 2, 3, and 1 as indicated by the numbers on the edges between clusters inFigure 3.1(a)4. For each row, select the column with the largest weight, breaking ties arbi-trarily, thereby maximizing the overlap between clusters in d1 and clustersin d2The algorithm provides a solution for this example that is d1-optimal which isillustrated in Figure 3.1(a). Changing the order in which we select pairings, byswitching columns and rows or selecting the cluster pairs column-wise for exam-ple, will render a different solution that is d2-optimal as illustrated in Figure 3.1(b).Figure 3.1(b) illustrates the situation where the clusters of d2 are being used as17rows and therefore the largest weight is being selected in terms of d2. In doing so,the resulting solution differs by 4 clusters: {DOY} and {HPS} from Figure 3.1(a)are missing whereas {K} and {WZ} are now present. In Figure 3.1(a) clustersC1, C2 and C3 of d1 all intersect with a subset of cluster C8 of d2 each havinga weight of 3. Since that is the largest weight in all three cases, when selectingweights in terms of d1, all three intersections will be chosen resulting in the clus-ters {CMV},{DOY}, {HPS}. However when selecting weights in terms of d2, asis the case in Figure 3.1(b), only one of those intersections will be chosen as C1,C2 and C3 are a tie for C8 and ties are broken arbitrarily. Therefore the cluster{CMV} is a part of the resulting solution but clusters {DOY} and {HPS} are not.This characteristic of multiple solutions is common and acceptable for many graphmatching problems. However in our case, we required the matching algorithmto render one unique solution because this algorithm was applied iteratively toextract modules, which we have described further in Section 3.3, and only one setof possible modules should be constructed for a given set of disease conditionsregardless of the order in which disease groups are presented to the algorithm.3.2.2 Cluster matching and switched labelsIn order to resolve the issue of switched labels, we used a hash to maintain a uni-verse of clusters. Multiple conditions were clustered sequentially and the hash ofclusters was updated with any novel clusters that arose from each clustering anal-ysis. If the clustering algorithm identified clusters that already existed in the hash,the labels of those clusters were replaced to match the labels stored in the hash.This way clusters across all conditions used a uniform labelling convention so theproblem of switched labels was eliminated.3.2.3 Order independenceTo ensure that constructed modules varied minimally when different permutationsof disease order were input to the clustering procedure, we made one more adjust-ment to the algorithm. Rather than selecting the pairing with the largest weight,we carried forward a list of multiple pairings for each cluster. To prevent the al-18Figure 3.1: Matching algorithm applied to diseases d1 and d2 where thegenes in each disease, represented by letters A-Z, were partitioned intok=5 clusters each. (a) depicts the d1-optimal results when matchingwith respect to d1 whereas (b) depicts the d2-optimal solution whenmatching with respect to d2.gorithm from exhausting considerable time and space, we introduced a thresholdparameter that was used as a cutoff to specify the maximum number of pairingsto carry forward for each cluster. We set that parameter to be equal to 40% of thenumber of centers specified in k-means. Further details on selecting this value areprovided in Section 3.4. In our implementation k = 30 therefore the parameter =30(0.4) = 12. In our simulated example k = 5 so the parameter was then 5(0.4) =2. In Figure 3.2 we’ve illustrated the result of the algorithm with the carry forwardparameter implemented using two diseases in different orders.19Figure 3.2: Cluster matching algorithm with the carry forward parameter im-plemented. The figure on the left shows a d1-optimal solution (pur-ple) whereas the right side shows a d2-optimal solution (purple). Eachsolution contains 8 clusters only one of which is not identical in bothsolutions.Although implementing the carry forward parameter does not guarantee a uniquesolution, it does provide a notable increase in the concordance among solutions.Without the carry forward parameter, there was only a 60% overlap in solutionsas illustrated in Figure 3.1 with only 3 out of 5 clusters being identical. Howeveronce the carry forward parameter was implemented 88% of clusters were identicalas illustrated in Figure 3.2 with 7 out of 8 identical clusters. We evaluated theorder independence of our entire module construction procedure further and haveprovided more details in Section Module extractionTo construct modules after clustering each disease group, the algorithm describedin sections Section 3.2.1 and Section 3.2.3 was applied iteratively. The moduleextraction process is illustrated using our simulated example in Figure 3.3. We im-plemented a size cut-off of 30 genes to avoid obtaining small modules that consistof only 1 or 2 genes. Using our simulated example, we followed these steps:1. Apply k-means using 5 cluster centers to each disease group (note that eachdisease has arbitrarily been relabeled d1 through d4)202. Add all clusters created in step 1 to the empty cluster universe3. Initialize two adjacency matrices using d1 and d3 clusters as rows and d2and d4 clusters as columns4. Compute the lengths of the intersection between each row and column andassign that as a weight in each matrix5. In each matrix, select 2 [5*0.4=2] columns with the largest weights for eachrow to carry forward and add the intersection to sets d1∩d2 and d3∩d46. Remove the genes selected in step 4 from the cluster universe7. Repeat steps 2-4 using the newly created d1∩d2 and d3∩d4 clusters as rowsand columns and continue this procedure until only one set results8. Select all clusters from the resulting set with size greater than the specifiedcut-off and add to the module set9. Create a module that indicates no assignment and assign to it any probe-setsthat were not assigned to a module21CHAPTER 3. MODULE IMPLEMENTATION 22Figure 3.3: Illustration of iteratively applying the cluster matching algorithmto extract modules from our simulated data set of 4 diseases, 26 genes,and 5 cluster centers.3.4 Evaluation of the module construction procedureWe assessed the quality of our module construction process by considering a fewdifferent parameters. We observed the results after varying the number of probe-sets and cluster centers, and the number and order of diseases, as well as the carryforward parameter. We also used these results to select the value to use for thecarry forward parameter in our implementation.3.4.1 Jaccard distance and order independenceTo evaluate module concordance when presenting the algorithm with different per-mutations of the diseases, we used Jaccard1 and Jaccard2 distances. Jaccard1 dis-tance is defined as:J1(A,B) =| A∩B || A∪B |(3.2)where A and B are different clusters. Jaccard2 distance is defined as:J2(A,B) =| A∩B |min{| A |, | B |}(3.3)Jaccard1 and Jaccard2 indicate a different measure of similarity when compar-ing clusters of unequal sizes and when some clusters are subsets of other clusters.While Jaccard1 provides an unbiased measure of similarity between 2 clusters bytreating both clusters equally regardless of their sizes, Jaccard2 is more sensitive tothe presence of subsets. For example if cluster B is a subset of cluster A, Jaccard1would indicate a less than complete similarity even though all genes in cluster B arealso in cluster A. Jaccard2 however would show complete similarity as the overlapof the 2 clusters is scaled by the length of the smaller cluster.When using N number of diseases we ran the module construction procedure us-ing each possible permutation of disease order, thereby running the algorithm N!times. However, in the interest of time when N > 6 we used only 1000 randomlyselected permutations of all possible permutations. In Figure 3.4 (a) we’ve plottedthe average Jaccard1 distance (in red) and average Jaccard2 distance (in blue) when23the carry forward parameter was varied between 0%, 10%, 20%, 30%, and 40% ofcluster centers. We used a simulated example of 3328 probe-sets, 7 diseases, and30 cluster centers. Each permutation of inputs produces a different set of modulesas described in Section 3.2.1. The degree of similarity of the modules resultingfrom each permutation is dependent on the carry forward parameter as describedin Section 3.2.3. When the carry forward parameter is 0% of cluster centers, theaverage Jaccard distances should be very low which is the result we obtained asindicated in Figure 3.1(a). As the carry forward parameter increases, the resultingsolutions of all permutations should begin to converge which would be indicatedby higher Jaccard distances. The Jaccard2 distances were higher than the Jaccard1distances in every case. This result was not surprising as it’s likely in our data thata smaller cluster in one permutation may be a complete subset of a larger cluster ina different permutation because some conditions might split or merge clusters rela-tive to other conditions as we described in Section 3.2.1. We can see from the graphthat module concordance does indeed increase with the carry forward parameter.In Figure 3.4 (b) we’ve also indicated how the runtime of the module construc-tion procedure varied with each value of the carry forward parameter. Since using40% as the carry forward parameter constructed modules with high concordance(J1=0.86 and J2=0.95) when using different permutations of diseases and also ranin only several minutes (495.62 seconds), we used that value in the implementationof our module construction procedure.3.5 Background modulesWe used our module construction procedure to create background modules usingnormal control samples from various studies involving the lung. These backgroundmodules were constructed to provide a snapshot of the complex networks of co-expressed genes that exist in the body. Structuring the data in this way allows usto focus the analysis of studies involving the lung on these sets of functionally co-herent genes rather than individually and independently testing thousands of genesmeasured by microarrays. The data used and preprocessing steps involved are de-scribed in this section and presented in Figure 3.5.24CHAPTER 3. MODULE IMPLEMENTATION 25Figure 3.4: Evaluation of module concordance using Jaccard distance whenvarying the carry forward parameter from 0% to 40% (a) indicates thechange in Jaccard1 distance, in red, and the change in Jaccard2 distance,in blue. The Jaccard distances are plotted along the y-axis and the meanJaccard distance across all permutations are indicated in the text. In(b) we’ve illustrated the change in runtime, in minutes, when the carryforward parameter is increased. The time in seconds is indicated in thetext.3.5.1 DataTo construct background modules we used whole blood samples of five controlgroups collected for separate studies. Three of the datasets used included samplesthat were collected from COPD patients enrolled in the ECLIPSE study [38]. Thisstudy was a three year longitudinal study which aimed to identify parameters thatpredict the progression of disease in different COPD phenotypes. One of addi-tional two datasets used included samples that were collected by Dr. Anne Ellisat Queens University. These samples were used in a study done to identify t-celldifferentiation expression patterns associated with pollen exposure in individualswith allergic rhinitis. The final dataset used included samples that were collectedfor a study conducted to identify molecular patterns in peripheral blood of asth-matic individuals that could differentiate between two response phenotypes afteran allergen inhalation challenge [33].The COPD control samples used were smoking and non-smoking controls: 29 cur-rent smokers (CS), 28 former smokers (FS), 29 never smokers (NS). The asthmacontrol samples used were blood draws done before the inhalation challenges: 14pre-methacholine inhalation challenge (AR) and 14 pre-allergen inhalation chal-lenge (Asthma). All samples were processed on the Affymetrix Human Gene STplatform. COPD samples were processed on Gene ST 1.0 platform whereas theAsthma samples were processed on the Gene ST 1.1 platform. Sample informa-tion in summarized in Table PreprocessingThe robust multi-array average (RMA) technique was applied to the microarraysfor background correction and normalization. The Factor Analysis for Robust Mi-croarray Summarization (FARMS) technique was used for summarization [19]. Afiltering method called I/NI calls was also applied to eliminate non-informativeprobe-sets [36]. Since the ECLIPSE samples were processed on a different versionof the Affymetrix Human Gene ST array than the Asthma samples, we were com-pelled to perform the background correction, normalization, summarization andpre-filtering steps on the ECLIPSE samples separately from the Asthma samples.26Study Group (Condition) Number ofsamplesPlatformECLIPSE Current Smokers (CS) 29 Affymetrix Human Gene ST 1.1ECLIPSE Former Smokers (FS) 28 Affymetrix Human Gene ST 1.1ECLIPSE Never Smokers (NS) 29 Affymetrix Human Gene ST 1.1AllergicRhinitisPre-methacholineChallenge (AR)14 Affymetrix Human Gene ST 1.0Asthma Pre-allergen Challenge(Asthma)14 Affymetrix Human Gene ST 1.0Table 3.1: Information of samples usedParameter ValueNumber of cluster centers 30Number of random starts 10Carry forward 0.4 (40% of cluster centers)Minimum module size cut-off 30Table 3.2: List of parameters used in the construction of background modulesAfter those steps, we were left with 12,561 probe-sets in the ECLIPSE samples and6,348 probe-sets in the Asthma samples. Taking the intersection of the probe-setsin both groups left us with 5788 probe-sets. This data set of 5 sample groups and5,788 probe-sets were batch corrected using the ComBat Bioconductor package[22]. The preprocessing steps have been illustrated in Figure ModulesWe applied our module construction algorithm to the 5 datasets after completingthe preprocessing steps. The parameters used during the procedure are listed inTable 3.2. The algorithm produced 53 modules of varying sizes, which are listedin Table 3.3, where the smallest module contained the minimum of 30 probe-setsand the largest contained 155 probe-sets.27Figure 3.5: Preprocessing steps used on microarray data before the moduleconstruction procedure was applied to construct background modulesModule No. ofuniqueprobe-setsNo. ofuniquegenesNo. ofunanotatedprobe-setsNo. ofprobe-setswithmultiplemappingsLargestprobe-set togenemappingM1 155 108 80 21 8M2 135 111 43 0 4M3 123 103 27 0 2M4 119 93 29 0 2Continued on next page28Table 3.3 – continued from previous pageModule No. ofuniqueprobe-setsNo.uniquegenesNo. ofunannotatedprobe-setsNo. ofprobe-setswithmultiplemappingsLargestprobe-set togenemappingM5 116 77 51 0 7M6 107 51 70 4 6M7 98 93 37 1 8M8 96 50 63 8 6M9 96 82 43 0 9M10 94 80 24 3 4M11 85 75 28 3 6M12 82 35 50 0 2M13 82 79 18 0 3M14 78 73 9 0 2M15 78 52 34 0 3M16 74 44 37 0 4M17 72 72 17 2 5M18 63 58 9 0 2M19 62 75 9 0 14M20 62 57 9 0 3M21 62 55 11 0 3M22 58 48 16 0 6M23 55 51 7 0 2M24 54 35 31 0 10M25 54 57 5 0 4M26 53 22 36 0 4M27 52 47 11 0 4M28 52 56 4 0 4M29 52 48 11 0 3M30 51 21 36 0 3M31 51 41 18 0 3M32 50 60 3 0 7M33 50 44 12 1 3Continued on next page29Table 3.3 – continued from previous pageModule No. ofuniqueprobe-setsNo.uniquegenesNo. ofunannotatedprobe-setsNo. ofprobe-setswithmultiplemappingsLargestprobe-set togenemappingM34 50 44 12 0 4M35 48 25 30 0 4M36 45 47 17 6 10M37 45 35 13 0 2M38 43 24 20 0 1M39 39 35 14 0 4M40 38 32 15 0 8M41 38 38 3 0 2M42 37 19 34 0 14M43 37 33 12 0 5M44 36 12 28 0 4M45 35 36 9 0 8M46 34 27 18 0 7M47 34 7 29 0 2M48 33 27 8 0 2M49 33 25 11 0 2M50 33 8 29 0 2M51 32 26 7 0 1M52 31 29 5 0 3M53 30 10 25 0 5Table 3.3: List of modules created and their respective sizes along with an-notation information of probe-sets included in each moduleWe performed a principle component analysis (PCA) on the resulting modules tovisualize their separation and illustrated the result in Figure 3.6. We’ve plottedthe first two principle components and colored each data point according to itscorresponding module. The first principle component explained 97.9% of the vari-30Figure 3.6: Visualization of the separation in our modules by the first princi-ple component of a PCA analysis.ance in our data and we see that it clearly separated the modules as depicted inFigure 3.6.31Chapter 4Module Characteristics4.1 StabilityOur module construction procedure uncovers modules based on the natural bio-logical structure of the data. Since the transcriptional organization in biologicalsystems is conserved and reproducible, a sound and reliable module constructionprocedure should be able to repeatedly uncover modules that reflect this underlyingbiology.In Chapter 3 we used our module construction procedure to construct backgroundmodules using normal control samples from various studies involving the lung. Toconstruct our modules, we used samples from multiple studies involving variousconditions pertinent to the lung to incorporate the biological variability naturallypresent across human cellular systems. These background modules are meant tobe a universal set of modules that provide a snap shot of the complex networks ofco-expressed genes that exist in the body regardless of conditions present. Thesemodules can then be used in numerous studies pertaining to conditions involvingthe lung. In order to accomplish these tasks, it is particularly essential that ourmodule construction procedure is sound and reliable, repeatedly producing stablemodules.To measure this aspect in our module construction procedure, we used an objective32measure, the Jaccard2 index (equation 3.3), along with permutation testing and re-sampling techniques to show that our procedure uncovers biologically consistentmodules even when using varied inputs and different data sets.4.1.1 Stability within datasetsIn our analysis of the robustness of our procedure, we used sampling techniques tocreate multiple data sets from the original data and applied our procedure to eachsubset of the data. We then evaluated module similarity across the resulting sets ofmodules.The data used was a COPD (described in Section 5.1) data set containing 238 pe-ripheral blood microarray samples from patients having COPD. Although COPDis a condition with varying sub-phenotypes and varying severity of disease (dis-cussed in more detail in Chapter 5), for this task we considered COPD samples ashomogeneous and did not stratify the samples by their sub-phenotype.Splitting of the data was carried out seven times with an increasing number ofsubsets created each time, starting with 2 subsets and increasing up to 10 subsetsof the original data. When creating 2 subsets of data the procedure used was tosample 119 samples from the original 238 to obtain the first subset and use the re-maining 119 samples as the second subset. This random sampling was repeated tentimes to account for variability. This procedure returned two distinct subsets withno overlapping samples, however when we increased the number of subsets createdwe used the same sampling procedure with one modification: we also increased theproportion of samples that overlap across the subsets of data. For example, whencreating 3 subsets of data, we sampled 79 samples three times to obtain 3 distinctsubsets. We then merged two of those subsets using all three combinations of two,giving us three subsets of data with 50% of the samples overlapping across subsets.The number of subsets used and the proportion of overlap in each split of the datahave been summarized in Table 4.1.33No. of subsets Proportion of overlap No. of repetitions2 0 103 50 104 67 105 75 106 80 108 86 1010 89 10Table 4.1: Split information. This table outlines the number of subsets usedin each split (each row represents one split) of the data and the proportionof overlapping sample in each subset as well as the number of repetitionsused to sample the data set.Our module construction procedure requires that we have multiple disease groups.In order to emulate that in the COPD data set, we further stratified each subset ofdata into 3 random groups which were to represent 3 disease groups. This stratifi-cation was repeated 5 times for each subset to account for variability.In each of the seven splits of data, we applied our module construction procedureon each subset and evaluated the concordance among the resulting sets of modulesin a pairwise way. The concordance of two sets of modules, set A and set B, wasevaluated by finding for each module in set A, the module in set B with the maxi-mum Jaccard2 distance. The maximum Jaccard2 distance which indicates the bestmatches is defined as:maxJ2(A,B) = ∀a ∈ A,∀b ∈ B max(| a∩b |min{| a |, | b |})(4.1)We then computed the average of those Jaccard2 distances across all modules in setA. Since the maximum Jaccard2 distance between a pair of modules is not symmet-ric i.e. max Jaccard2(A,B) does not always equal max Jaccard2(B,A), as explained34in Chapter 3, the comparison of two sets of modules was done in a bi-directionalway. The average of maximum Jaccard2(A,B) and maximum Jaccard2(B,A) wastaken to be the average Jaccard2 distance for set A and set B and that was used asour measure of concordance:concordance(A,B) = 1|A|+|B|∑[{maxJ2(A,B)},{maxJ2(B,A)}] (4.2)This process was repeated for all pairwise comparisons in one split of the data.When we have k subsets of data, we generate k sets of modules which gives us(k*(k-1))/2 possible pairwise comparisons. For example in our first split we gener-ated 2 sets of modules which gave us 1 pairwise comparison, and in our last split wegenerated 10 subsets of data which gave us (10*9)/2 = 45 pairwise comparisons.The boxplots in Figure 4.1(a) illustrate the result of this analysis where the aver-age maximum Jaccard2 distance from every pairwise comparison is plotted as onepoint in each split of the data. We can see that when two distinct subsets of data areused, there is slightly over 60% mean concordance in the resulting modules witha small range of variation. This indicates that our module construction procedurecan produce modules that reflect the underlying biological consistency in the data.As the data in each subset became more similar, in each subsequent split of thedata, the concordance of the resulting modules also increased as expected. Theseresults indicate the robustness of our procedure as the modules were highly similareven with slightly varied input data.In order to measure the significance of the observed Jaccard2 values, we performeda permutation test procedure where we randomly permuted module assignment be-fore each pairwise comparison. This permutation was repeated 10 times for eachpairwise comparison. The results are indicated by the lower line and set of errorbars in Figure 4.1. We can see that the permuted labels only reached an averageJaccard2 distance of about 0.10 with very small variance which confirms that ourmodules are not a random set of genes and are driven by the innate structure in thedata.354.1.2 Stability across datasetsWe were also interested in evaluating the consistency of our procedure across dif-ferent data sets. We used two data sets for this purpose. One of those data setsincluded transcriptional profiles from 76 normal lung tissue samples which weredownloaded from Gene Expression Omnibus public repository [23, 41]. Using thisdata we performed an analysis replicate to the one described in section 4.1.1 andhave displayed the result in Figure 4.1(b). The second data set used included 87peripheral blood samples from COPD control patients. Using this data we againperformed a replicate analysis to that described in section 4.1.1 and illustrated theresult in Figure 4.1(c). Both of these applications again showed high concordance,measured by the average Jaccard2 values, and a small range of variation in theresulting modules when doing pairwise comparisons. The average Jaccard2 dis-tance in the tissue data set was 0.537-0.704 and in the COPD controls data set itwas 0.587-0.739. The results of the permutation analyses were also similar to thatseen in the COPD data set as in both new cases we reached an average Jaccard2distance of only 0.10. This result implies the reliability of our procedure in uncov-ering underlying biological structure as it produced consistent modules using datafrom different tissues.To compare the results across all three data sets we performed a t-test on the aver-age Jaccard2 distances for each split of the data. The results of this test are givenin Table 4.2. One notable observation is that those modules generated using theGEO tissue normal samples as input as well as those generated using peripheralblood normal samples as input did not give a higher concordance than modulesconstructed using the COPD data set. This is contradictory to what was expectedas when using only normal controls the samples are expected to be more homoge-neous and therefore are likely to produce modules with higher concordance whenusing different subsets of the data. The outcome that we saw however may be at-tributable to the sizes of the datasets used, which we discuss further in Section 4.2.36CHAPTER 4. MODULE CHARACTERISTICS 37Figure 4.1: Average pairwise Jaccard2 distances between modules con-structed using our module construction procedure on multiple data setscreated by repeated random sampling of three original data sets (a) Pe-ripheral blood samples from patients having COPD, (b) lung tissue sam-ples from normal subjects, and (c) peripheral blood samples from COPDcontrol subjects.Overlap COPD vs. GEO COPD vs. Normals GEO vs. Normals0% 7.16e-33 6.53e-41 1.28e-5350% 1.37e-41 1.096e-84 9.94e-11367% 3.92e-08 2.36e-104 9.35e-17975% 0.00262027 2.68e-132 2.93e-20480% 1.62e-30 7.08e-108 5.43e-24986% 4.31e-109 6.93e-90 1.24e-17189% 3.77e-220 6.38e-58 9.008e-99Table 4.2: P-values resulting from t-tests performed on the pairwise averageJaccard2 distances for each split of the data.4.2 Sensitivity4.2.1 Number of samplesIn the previous sections we showed the result of our module construction proce-dure using three different data sets and compared the results across data sets. Onefactor that may contribute to the quality of the results is the size of the input data.The first data set, the COPD data set, was comprised of peripheral blood samplesfrom patients that were classified with different phenotypes of COPD and displayedvarying degrees of symptoms. This data set contained 238 samples. The seconddata set, the GEO data set, included tissue samples from the lungs of 78 normalcontrol samples. This set is considered to be relatively homogeneous compared tothe COPD samples since they are controls and they come from tissue as opposedto peripheral blood. The third data set, the COPD controls, included 89 peripheralblood samples from subjects that did not have a lung condition and were used ascontrols in the ECLIPSE study. We performed an analysis to assess the effect ofsample size on module concordance and we’ve illustrated the result inFigure 4.2.We again used the resampling technique to create two subsets of data from theoriginal COPD data set. This random sampling was repeated multiple times to ac-count for variability. This entire repeated random sampling experiment was run 838times, and an increasingly larger number of samples were used in each experimentstarting from 20 samples and increasing to all 238 samples. When fewer sampleswere used, a larger number of repetitions were used for random sampling. Foreach experiment the module construction procedure was applied to both subsets ofdata in each run and the bi-directional Jaccard2 average was taken between bothsets of modules (described previously in Section 4.1.1). Figure 4.2 clearly showsthe dependence of module concordance, measured by average maximum Jaccard2distance, on sample size as the Jaccard2 distance increases from 0.335 when using20 samples to 0.618 when using 238 samples.One noteworthy point in the result is the average Jaccard2 distance when 80 sam-ples are used. That sample size is more similar to the lung tissue and COPD controldata sets. When comparing this Jaccard2 distance to the 0% overlap columns (twodistinct subsets) in Figure 4.1 (b) and (c), we see that the COPD data set has aJaccard2 distance of 0.481 which is lower than both the lung tissue data (0.537)and COPD normal data (0.587). This is an intuitive result as the lung tissue dataset and COPD normal data set consists of samples that are more homogeneous rel-ative to the COPD data set. When the input samples are more similar, the modulesconstructed are also expected to be more similar.4.2.2 Number of modulesSince the module construction procedure has no parameter to control the numberof modules constructed, unless a replicate data set is used, the number of mod-ules resulting from multiple applications of the procedure will vary. We performedan analysis to assess the relationship between the number of samples in the inputdata and the number of modules constructed. The same dataset from the previoussection was used where two subsets of the original data were repeatedly sampledusing a varying number of samples. The result is depicted in Figure 4.3 whichshows that a fewer number of samples will produce a larger number of modules.However the number of modules constructed is less consistent when using fewersamples as indicated by the larger range of variation when using only 10 samplesto construct modules compared to using 119 samples.39Figure 4.2: The relationship between input sample size and Jaccard2 dis-tance. Two subsets of the COPD data with a varying number of samplesare created by repeated random sampling, with a varying number ofrepetitions used. The module construction procedure is applied to bothsubsets and the average Jaccard2 distance is calculated.We also examined the relationship between the size of the universe of probe-sets(union of probe-sets across all modules) and the number of modules constructed.The result is depicted in Figure 4.4 where figures (a) through (c) show the numberof modules vs. the size of the universe in each of our 3 data sets. We see a fairlyconsistent universe size with the number of modules varying between 50 and 120.Figure 4.4 (d) includes another layer of information by varying the number of sam-ples used to construct modules and we see that a larger number of samples resultsin a larger universe of probe-sets distributed over a smaller number of modules.4.2.3 Module sizeAnother factor that may contribute to the quality of the constructed modules is themodule size distribution i.e. the number of probe-sets in each module. Using the40CHAPTER 4. MODULE CHARACTERISTICS 41Figure 4.3: The relationship between the number of input samples used andthe number of modules constructed.Figure 4.4: Relationship between the size of the universe of probe-sets, num-ber of modules, and input sample size.Figure 4.5: The distribution of module size. The colors indicate the range ofthe number of modules constructed.same dataset we used in Section 4.1.1 (two subsets of the original data, repeatedlysampled for a varying number of samples) we observed the distribution of modulesizes. In Figure 4.5 we have depicted this distribution across the varying numberof input samples (10 samples -119 samples). The colors indicate the range of thenumber of modules created. The sizes of the modules vary between 30 probe-setsand approximately 570 probe-sets. The number of modules created varies between60 and 200 modules.42Chapter 5Application of Baseline Modules5.1 Application to COPDIn previous chapters we introduced our module construction procedure and illus-trated its application in the construction of baseline modules pertinent to the lung.In this chapter we’ve illustrated how the use of those baseline modules can help toenhance analysis methods and we use COPD as a case study.5.1.1 Description of COPDCOPD is a disease containing multiple phenotypes for which pathogenesis is notcompletely understood. It is a common preventable and treatable disease character-ized by the progressive limitation of airflow associated with an exaggerated chronicinflammatory response in the lungs which is caused by the inhalation of toxins ingenetically susceptible individuals. Toxins that contribute to the inflammatory re-sponse include noxious particles and gases such as air pollution, infection, andcigarette smoking. Cigarette smoking is the best-studied risk factor for COPDbut non-smokers also develop chronic airflow limitation and not all people withthe same smoking history will develop COPD which may likely be due to geneticdifferences. The lung inflammation caused by toxins is a normal response whichseems to be modified in patients with COPD [39]. The inflammatory processescause destruction of lung tissue which leads to further structural changes that dis-rupt normal repair and defense mechanisms and reduce the ability of the airways43to stay open during expiration.COPD is a major global health problem which is projected to soon rank fifth interms of disease burden and third in terms of mortality world-wide [39]. COPDis responsible for 11,000 deaths and even more hospital admissions per year inCanada. There is a significant economic burden associated with COPD as there isa direct relationship between the cost of care and COPD severity. COPD relatedcosts in Canada alone are near $4 billion per year and are estimated to more thandouble in the next 15 years [1].One of the primary contributors to the projected increases in economic and dis-ease burden is hospital admission rates caused by acute exacerbation of COPD.Exacerbation is characterized by a worsening of a patient’s respiratory symptomsbeyond day to day variations and leads to a change in medication. Symptoms in-clude dyspnea, cough, and sputum production. Exacerbations are brought on byseveral factors such as bacterial or viral respiratory tract infections and air pollu-tion; however the cause of about one-third of severe exacerbations is unidentified.Exacerbations are critical events in the course of COPD for many reasons: lengthyrecovery time, acceleration of the rate of decline in lung function, association withsignificant mortality, high socioeconomic costs, and negative effect on a patient’squality of life. Diagnosis of exacerbation relies only on the clinical presentation ofa patient conveying an acute change of symptoms beyond normal day to day vari-ation. Beta-2 agonists and systemic corticosteroids are used to treat exacerbations,with the goal of treatment being minimization of the impact of the current exacer-bation and prevention of subsequent exacerbations. Exacerbation rates vary greatlyfrom patient to patient and not all COPD patients exacerbate. Patients having 2 ormore exacerbation events per year are classified as frequent exacerbators (FE) andthe severity of an exacerbation can be mild, moderate, or severe, determined by theextent of treatment required. Despite the fact that not all patients exacerbate, allpatients receive the same treatment of corticosteroids and beta-2-agonists becausethere is no way to risk-stratify patients and the single best predictor of exacerba-tions is a history of exacerbations [39]. This is not only a costly approach but it putsmany COPD patients at unnecessary risk of the adverse effects of corticosteroids.445.1.2 Description of COPD studyThe availability of a simple clinical test such as a blood test for the prognosis ofexacerbations would be extremely beneficial as it would allow patients to receivetreatment tailored for their specific case which would help safeguard against theadministration of drugs to patients who don’t require them. The PROOF Cen-tre’s COPD Biomarker Program aims to develop biomarkers to identify patientsat high risk of exacerbations for this purpose. The goal here is to identify bloodbiomarkers of ”imminent exacerbation” (IE) which is defined as an exacerbationoccurring shortly after blood collection (within 30 days, 60 days or 90 days). Theuntargeted biomarker discovery method includes an analytic pipeline consisting ofmany steps starting with all probe-sets measured on a microarray and narrowingdown to a panel of genes. These steps include quality assessment, pre-processing,pre-filtering, univariate ranking, univariate selection, classifier generation, combi-natorial biomarker generation, and biomarker evaluation. In Section 5.1.3 we’vedescribed how our baseline modules can be used to help enhance this biomarkerdiscovery strategy by adding a targeted feature to the pipeline.5.1.3 Using modules for the discovery of transcriptional biomarkersWe performed differential expression analysis between imminent exacerbators andnever exacerbators (NE). Blood samples from patients enrolled in the ECLIPSEstudy (described previously in Chapter 3) were used. This was a three year longi-tudinal study with the most well-phenotyped COPD cohort in the world. This co-hort included 2,180 patients, 343 smoking control subjects and 233 non-smokingcontrol subjects. For our analysis we used a subset of 110 patients which included24 IE samples and 86 NE samples. All patients were between the ages of 49 and75 and included 37 females and 73 males. Fifty five patients in this dataset weresmokers and the remaining 54 patients were former smokers. Sample informationis outlined in Table 5.1).Sample pre-processing steps were the same as those used for constructing our base-line modules (Section 3.5.2). All samples were processed on the Affymetrix Hu-man Gene ST 1.0 platform and the RMA technique was applied to the microarrays4530 day cohort 60 day cohort 90 day cohortExacerbation Phenotype NE IE NE IE NE IETotal 134 10 86 24 134 52Current Smokers 67 5 43 12 67 26Former Smokers 67 5 42 12 67 26Age 50-75 55-73 49-75 50-75 50-75 52-74Male 94 7 55 17 94 33Female 40 3 30 7 40 19Table 5.1: Sample informationfor background correction and normalization. The FARMS technique was used forsummarization and I/NI calls filtering method was also applied to eliminate non-informative probe-sets. After these pre-processing steps we were left with 12,381probe-sets5.1.4 Differential expression analysis on a module-by-module basisWe first identified probe-sets that were differentially expressed between IE and NEsamples using only those IE samples for which the exacerbation occurred within60 days. We used LiMMa moderated t-test from the Bioconductor package [34] toidentify differentially expressed probe-sets. This is analogous to the classical t-testexcept that empirical Bayes methods are used to borrow information between genesto help with inference about each individual gene [34]. We applied a p-value cut-offand selected all probe-sets with a p-value < 0.05 which returned 1,649 probe-sets.We were interested in determining which of our baseline modules are enriched forthese differentially expressed genes so we performed a hypergeometric overlap be-tween the genes in each of our modules and the list of 1,649 differentially expressedgenes. To control the false discovery rate we applied the Benjamini and Hochbergmethod for p-value adjustment. The results are presented in Table 5.2 which liststhe modules that were enriched for the differentially expressed probe-sets with anadjusted p-value of < 0.01.46CHAPTER 5. APPLICATION OF BASELINE MODULES 47Module No. of DE probe-setsin modulep-value Adjusted p-valueM1 27 2.81E-06 1.86E-05M2 24 1.44E-07 1.27E-06M3 12 0.011763 0.023091M4 11 0.016988 0.031046M5 13 0.004196 0.010065M6 15 0.000422 0.001399M7 22 1.01E-07 1.08E-06M8 15 0.00029 0.001026M9 16 0.000135 0.00055M10 14 0.00023 0.000872M12 11 0.000866 0.002699M13 19 4.10E-08 7.25E-07M14 19 1.82E-09 4.81E-08M16 9 0.006058 0.012844M18 12 1.80E-05 9.54E-05M21 11 7.33E-05 0.000353M25 8 0.002969 0.007493M26 6 0.021957 0.035264M28 12 4.92E-06 2.90E-05M29 14 9.83E-08 1.08E-06M30 8 0.001617 0.004286M31 23 1.11E-16 5.88E-15M33 6 0.020092 0.034123M34 6 0.01834 0.032401M35 8 0.00124 0.003652M36 8 0.006826 0.013914M37 9 8.33E-05 0.000368M38 11 7.66E-07 5.80E-06M41 5 0.013231 0.025044M43 6 0.005532 0.012216M46 5 0.020603 0.034123M49 6 0.001429 0.003987M51 5 0.004368 0.010065Table 5.2: Baseline modules that were enriched for probe-sets that are differ-entially expressed in IE vs. NE samples as determined by the LiMMamoderated t-test and an adjusted p-value cut-off of 0.05. Those moduleshaving an adjusted p-value < 0.01 are highlighted in bold.We were also interested in observing the effect of performing a differential expres-sion analysis on a module by module basis rather than the entire universe of 12,831probe-sets. We used the same differential expression analysis and applied it to eachof the 53 modules. We took the union of probe-sets that were found to be differen-tially expressed across all modules and compared this to the union of probe-sets inour modules that were enriched for the list of 1,649 differentially expressed probe-sets (union of ”No. DE probe-sets in Module” column in Table 5.2). The Venndiagram in Figure 5.1(a) indicates the difference in the results. Performing thedifferential expression on all probe-sets and finding their enrichment in our mod-ules gave us 402 probe-sets among our modules that were differentially expressed.Performing differential expression analysis on a module by module basis identi-fied 470 probe-sets that were differentially expressed across all modules. These470 probe-sets included an additional 74 probe-sets that could not be detected bythe former analysis that considers all probe-sets simultaneously. This suggests thebenefit in performing differential expression analysis on a module by module ba-sis. When performing differential expression on the complete list of probe-sets amasking effect likely occurs since information is borrowed across the ensemble ofgenes. Performing the analysis on a module by module basis has an effect similarto an increase in resolution as information is borrowed across genes that are func-tionally related.LiMMa performs an adjustment to control the false discovery rate to correct formultiple comparisons as well, using the Benjamini and Hochberg procedure, andprovides an associated adjusted p-value. Figure 5.1(b) shows the results of apply-ing an adjusted p-value cut-off of 35%. This cut-off of 35% is much larger thanwould typically be used in a differential expression analysis however a large cut-off was required in this case in order to obtain a sufficient number of probe-setswhen differential expression was performed on the non-modular set of probe-sets.Results of this analysis indicate a reduction in the false discovery rates when per-forming differential expression on a module-by-module basis.48Figure 5.1: Venn diagram indicating the advantage of performing differentialexpression analysis on a module-by-module basis rather than a differ-ential expression analysis on the universe of probe-sets that pass thepre-processing steps.Differential expression in time-course (30d vs. 60d)We were also very interested in observing whether or not there are differences ingene expression of IE patients who exacerbate within 30 days (30 day IE), thosewho exacerbate within 60 days (60 day IE), and those who exacerbate within 90days (90 day IE). We used LiMMa again to identify differentially expressed genesbetween 30 day IE and NE samples. We again used a p-value cutoff of 0.05 andthis returned 1,274 probe-sets. We performed the same analysis using only the90 day IE samples and obtained a list of 1,232 probe-sets. For each of these listswe performed hypergeometric overlaps to identify which of our baseline modulesare enriched for these lists. We also applied the Benjamini-Hochberg procedure tocorrect for the false discovery rate and applied an adjusted p-value cut-off of 0.01.There were 24 modules that were significantly enriched in at least one of the 3 listsof differentially expressed genes and the results are listed in Table 5.3.Module Cohort No.probe-setsin DE listNo.probe-setsin moduleOverlap P-value Adjustedp-valueM160 day 1649 155 27 2.81E-06 1.86E-0590 day 1232 155 16 0.002678 0.007886Continued on next page49Table 5.3 – continued from previous pageModule Cohort No.probe-setsin DE listNo.probe-setsin moduleOverlap P-value Adjustedp-valueM230 day 1274 135 18 6.64E-06 0.00017660 day 1649 135 24 1.44E-07 1.27E-0690 day 1232 135 17 1.56E-05 0.000103M5 60 day 1649 116 13 0.004196 0.010065M630 day 1274 107 11 0.002906 0.01160 day 1649 107 15 0.000442 0.00139990 day 1232 107 9 0.016983 0.031038M730 day 1274 98 12 0.001476 0.00698160 day 1649 98 22 1.01E-07 1.08E-0690 day 1232 98 17 1.52E-06 1.61E-05M930 day 1274 96 8 0.04886 0.08092460 day 1649 96 16 0.000135 0.0005590 day 1232 96 12 0.000703 0.00266M1060 day 1649 94 14 0.00023 0.00087290 day 1232 94 14 8.49E-06 6.42E-05M1230 day 1274 82 14 6.31E-07 3.35E-0560 day 1649 82 11 0.000866 0.002699M1330 day 1274 82 13 1.72E-05 0.00022860 day 1649 82 19 4.10E-08 7.25E-0790 day 1232 82 12 5.35E-05 0.000315M1430 day 1274 78 11 5.77E-05 0.0005160 day 1649 78 19 1.82E-09 4.81E-0890 day 1232 78 15 3.91E-08 6.91E-07M1830 day 1274 63 10 3.83E-05 0.00040660 day 1649 63 12 1.80E-05 9.54E-0590 day 1232 63 7 0.002945 0.008214M2130 day 1274 62 8 0.000795 0.00601660 day 1649 62 11 7.33E-05 0.000353M25 60 day 1649 54 8 0.002969 0.007493M2830 day 1274 52 10 1.28E-05 0.00022560 day 1649 52 12 4.92E-06 2.90E-05Continued on next page50Table 5.3 – continued from previous pageModule Cohort No.probe-setsin DE listNo.probe-setsin moduleOverlap P-value Adjustedp-value90 day 1232 52 14 2.65E-09 7.02E-08M2960 day 1649 52 14 9.83E-08 1.08E-0690 day 1232 52 11 1.15E-06 1.53E-05M3130 day 1274 51 7 0.001575 0.00698160 day 1649 51 23 1.11E-16 5.88E-1590 day 1232 51 18 1.31E-13 6.96E-12M3530 day 1274 48 7 0.000978 0.00647760 day 1649 48 8 0.00124 0.00365290 day 1232 48 7 0.000787 0.002781M3660 day 1649 45 8 0.006826 0.01391490 day 1232 45 8 0.000973 0.003223M3760 day 1649 45 9 8.33E-05 0.00036890 day 1232 45 9 6.89E-06 6.09E-05M46 30 day 1274 34 6 0.00131 0.006981M49 60 day 1649 33 6 0.001429 0.003987Table 5.3: A list of the 24 modules that were significantly enriched (adjustedpvalue <0.01) in one of the 3 lists of differentially expressed genes. The3 lists include genes that were found to be differentially expressed bymoderated t-tests performed using 30 day IE vs. NE samples, 60 day IEvs. NE samples, and 90 day IE vs. NE samples.We focused on the 30 day IE and 60 day IE samples for the purposes of biomarkerdiscovery and looked at the union of probe-sets from our modules that overlapwith the two lists of differentially expressed genes. The results are presented inFigure 5.2(a) which illustrates that there were 391 probe-sets in our modules thatwere differentially expressed in one of the two lists. Of the list of 1,649 probe-setsdifferentially expressed in 60 day IE and NE samples, 314 were in our modules.Out of the list of 1,274 probe-sets differentially expressed in 30 day IE and NEsamples, 209 were in our modules. There was an overlap of 132 probe-sets in ourmodules that were differentially expressed in both the 30 day IE and 60 day IE51samples compared to NE samples.Figure 5.2(b) depicts for each module, the differences in mean fold change of theprobe-sets that were differentially expressed in the 30 day IE samples vs. the 60day IE samples. The numbers of genes that are up or down regulated are indi-cated as well. This figure reveals very interesting results for example, module M5had 5 up-regulated probe-sets and 5 down-regulated probe-sets in the 30 day IEsamples however in the 60 day IE samples, 10 probe-sets were up-regulated andonly 3 were down-regulated. Module M46 also shows an interesting result as mostdifferentially expressed probe-sets were up-regulated in the 30 day IE samples (5up-regulated and 1 down-regulated) but the reverse signal was seen in the 60 dayIE samples as most probe-sets were down-regulated (1 up-regulated and 4 down-regulated). We computed the first principal component of these modules using thesignificant probe-sets in each module and the 30 day IE samples, 60 day IE sam-ples, and NE samples. In Figure 5.3 we’ve illustrated the separation of the samplesin each group by the first principal component. In both modules M5 and M46 thesignal of the exacerbators deviated from that of the non-exacerbators however inmodule M5 the 30 day IE samples deviated more than the 60 day IE samples butin M46 the 30 day IE samples deviated less than the 60 day IE samples. This isan optimistic result as it suggests that at the time of blood draw, the mechanism ofexacerbation is already underway and we may be able to use different modules toadvise how soon an exacerbation might occur which could help guide personalizetreatment.In Figure 5.4 we’ve indicated for both modules the genes that were up regulated(red) and down regulated (blue) in the 30 day IE samples and the 60 day IE sam-ples. Fourteen out of 19 probe-sets had associated gene annotations in moduleM5 with the remaining 5 probe-sets being unannotated. In module M46, 4 out of10 probe-sets had associated gene annotations with the remaining 6 being unan-notated. In both modules, the mean fold change of significant probe-sets changeddirection between 30 day IE samples and 60 day IE samples as seen in Figure 5.2(b)however, the direction of gene expression in the individual genes didn’t change be-tween these samples but rather there were additional genes differentially expressed52CHAPTER 5. APPLICATION OF BASELINE MODULES 53Figure 5.2: (a) Venn diagram showing the 391 probe-sets differentially ex-pressed in either the 30 day IE vs. NE samples or 60 day IE vs NEsamples that were found in our modules. Of the 391 probe-sets, 132were differentially expressed in both lists, 182 were differentially ex-pressed in only the 60 day IE vs. NE samples and 77 were differentiallyexpressed in only the 30 day IE v. NE samples. (b) An illustration of thedifferences in mean fold-change of genes that were significant in the 30day IE vs. NE samples compared to the mean fold-change of genes thatwere significant in the 60 day IE vs. NE samples. Blue circles representa mean negative fold-change and red circles represent a mean positivefold-change. The number of up-regulated probe-sets and the number ofdown-regulated probe-sets are indicated by the text in each circleFigure 5.3: The first principal component of module genes enriched for IEvs. NE differentially expressed genes was computed. The samples werestratified into 30 day IE samples, 60 day IE samples and NE samples andprojected onto the first principal component. The boxplots on the leftindicate the separation of these samples by the first principal componentand show the deviation of IE samples from NE samples. The 30 dayIE samples deviate farther than the 60 day IE samples. The figure onthe right illustrates the same information in module M46. Here, the 30day IE samples deviate from the NE samples less than the 60 day either 30 day IE or 60 day IE samples as indicated by Figure 5.4.Functional interpretation of module M5We performed a literature search to associate possible functions to the differentiallyexpressed genes. Many of the genes in module M5 have a similar functional roleor belong to the same family of genes which is a set of similar genes that originatedfrom one original gene and have similar biochemical roles. Four of the six genesdifferentially expressed in the 30 day IE samples are involved in immune systemfunctioning. Three genes, TNFRSF1A, IGF2R, PPP6C, are part of the CD familywhich is a set of genes that encode for proteins involved in immune system func-54Figure 5.4: Genes that were up regulated and down regulated in IE samplescompared to NE samples. The genes that were differentially expressedin the 30 day IE group were compared with genes that were differen-tially expressed in the 60 day IE group. The venn diagram on the leftdepicts this information for Module M5 whereas the venn diagram de-picts this information for Module M46.tioning. For example TNFR1 protein can trigger either inflammation or apoptosis(self-destruction of the cell). Signaling within the cell initiates a pathway that turnson a protein called nuclear factor kappa B, which triggers inflammation and leadsto the production of immune system proteins called cytokines. Apoptosis is initi-ated when the TNFR1 protein forms a complex with another protein and enters thecell to start a process known as the caspase cascade. Mutations of TNFR1 causemisfolded proteins that get trapped within a cell and constantly trigger alternativeinflammation initiation pathways which lead to excess inflammation. The fourthgene NCF1; NCF1C plays a role in adjusting the inflammatory response. This geneencodes for proteins that form the NADPH oxidase complex which regulates theactivity of phagocytes and neutrophils and those cells are known to play a role inadjusting the inflammatory response to optimize healing and reduce injury to thebody. The remaining two genes differentially expressed in the 30 day IE samplesare involved in the drug response mechanism and lung development. The geneSLC38A is an acetyltransferase which encodes proteins that activate and deacti-vate certain drugs and carcinogens. Polymorphisms in this gene are responsiblefor the polymorphism of N-acetylation which separates humans into rapid, inter-55mediate and slow acetylator phenotypes. This is interesting as it might indicate anassociation with the different exacerbation phenotypes as this gene is differentiallyexpressed between IE and NE in the 30 day samples but not the 60 day samples.The gene KLF2, Krppel-like Factor 2, encodes a protein that has been associatedwith multiple biochemical processes in the human body, including lung develop-ment, epithelial integrity, T-cell viability, and adipogenesis.Three of the genes involved in immune system functioning, IGF2R, PPP6C, andNCF1; NCF1C were also differentially expressed in the 60 day IE samples as wasKLF2. Of the additional seven genes differentially expressed in the 60 day IE sam-ples, two of them, LSP1 and TSC22D3, are also involved in the anti-inflammatoryresponse, and the remaining five encode for proteins required for various functionssuch as structural components of the extracellular matrix, cell cycle regulation, andplatelet function and release.We also performed a gene set enrichment analysis using the MSigDB gene setsthat are available for download from the broad institute [35]. These gene setsare divided into 7 collections of which we use one: curated gene sets. The cu-rated gene sets are further organized by into sets that come from online pathwaydatabases, publications in PubMed, and knowledge of domain experts. We se-lected the gene sets curated from the KEGG pathway database. KEGG [24, 25]gene sets include 186 gene sets derived from the KEGG pathway database whichis a collection of manually drawn pathway maps and represent knowledge on themolecular interactions and reaction networks for metabolism, genetic informationprocessing, environmental information processing, cellular processes, organismalsystems, and human diseases. Our gene set enrichment analysis included a hyper-geometric overlap of the genes in our module and the gene sets mentioned above.We use an adjusted p-value cut-off of 0.05 to select significant overlaps.We also performed an enrichment using the Gene Ontology file which we down-loaded from the Gene Ontology Consortium [3]. The GO project has developedthree structured, controlled vocabularies (ontologies) that describe gene productsin terms of their associated biological processes, cellular components and molecu-56lar functions. GO is structured as a graph, where each GO term is a node, and theedges represent relationships between the terms. GO is a hierarchy, where childterms are more specialized than their parent terms, and a term may have more thanone parent term [3]. We filtered the ontology to use only the biological processescategory. Biological Process terms represent operations or sets of molecular eventswith a beginning and end, which are relevant to the functioning of cells, tissues,organs, and organisms. From the biological processes, we selected only those cat-egories containing more than 175 genes. We further filtered those to only retainthe categories where each of their children contains less than 175 genes. These areconsidered to be the informative functional categories as described in [43]. We per-formed another hypergeometric overlap of the genes in our module and the genesin the informative functional categories. We used an adjusted p-value cut-off of0.05 to select significant overlaps. The results of these enrichment analyses arepresented in Figure Baseline module functional annotationIn an attempt to reveal the functionality associated with our baseline modules weperformed enrichment analyses on all 53 modules using gene sets representingdifferent tissues, pathways and gene ontologies. The results of those analyses arepresented here.5.2.1 Tissue enrichmentIn order to determine whether our baseline modules may be specific to certain celltypes, we performed a tissue specific enrichment using the Gene Set EnrichmentProfiler by Benita et al [6]. They provide enrichment values derived by applyingLiMMa pairwise to hundreds of tissue expression profiles. LiMMa generates co-efficients as a measure of the difference between two groups and the sum of thecoefficients of comparisons with a p-value < 0.05 are used to derive an enrichmentvalue for each gene. Using those enrichment values we performed a hypergeomet-ric overlap of the genes in each module and the tissue specific enrichment valuesprovided by Benita et al. This analysis didn’t reveal that our modules are specific57Figure 5.5: Module M5 enrichment in KEGG pathways and GO biologicalprocess. A hypergeometric overlap of genes in module M5 and genes inKEGG pathways and GO biological process categories was performedand a p-value cut-off of 0.05 was cell types as can be seen in Figure 5.6 where the median enrichment of the genesin each of our modules are plotted and the tissues are grouped by cell type. We cansee from this figure that there isn’t any one cell type that is highly enriched in anyof the modules.5.2.2 Pathway enrichmentWe then performed a gene set enrichment analysis using the gene sets curated fromthe KEGG pathway database and the Gene Ontology file from the Gene OntologyConsortium, which we described in Section 5.1.4. Our enrichment analysis in-cluded a hypergeometric overlap of the genes in each of our baseline module andthe genes in the KEGG pathway gene sets as well as the genes in the Gene On-tolgoy informative functional categories mentioned in Section 5.1.4. We used an58CHAPTER 5. APPLICATION OF BASELINE MODULES 59Figure 5.6: Median enrichment values of the genes in each module. The dif-ferent tissues are plotted along the x-axis and are grouped by tissue type.The 53 baseline modules are plotted along the y-axis.adjusted p-value cut-off of 0.01 to select significant overlaps. The results are de-picted in Figure 5.7 (KEGG) and Figure 5.8 (Gene Ontology). We can see fromthese results that although there was some overlap in the biological processes andpathways associated with each module, the modules were overall quite heteroge-neous, as they were associated with different arrays of biological processes andkegg pathways.60CHAPTER 5. APPLICATION OF BASELINE MODULES 61Figure 5.7: KEGG pathways associated with our modules. Only modulesthat had at least one significant (adjusted p-value <0.01) pathway asso-ciation are shown.CHAPTER 5. APPLICATION OF BASELINE MODULES 62Figure 5.8: Gene Ontology biological processes associated with our modules.Only modules that had at least one significant (adjusted p-value <0.01)biological process associated with them are shown.Chapter 6ConclusionIn this thesis we studied the problem of identifying modules of genes that are in-volved in a common functional role. We first presented several clustering basedmethods as well as several reverse engineering approaches that have been used ina similar capacity. We then introduced our methodology which constructs mod-ules of genes by identifying genes that are co-expressed across many diseases. Weevaluated the quality of our module construction procedure using Jaccard distanceas an objective measure. Using this measure we illustrated that we obtain mod-ules with high concordance regardless of the order in which diseases are presentedto the algorithm which is an improvement on previous methods. We then usedour module construction procedure to construct baseline modules pertinent to thelung. These modules were constructed using 5 peripheral blood data sets, eachof which contained control samples used in different studies involving the lungs.We assessed the stability of our module construction procedure by applying it tothree different data sets, a COPD dataset, a COPD normal dataset and a lung tissuedataset. We used resampling techniques along with Jaccard distance, as an objec-tive measure, to evaluate the concordance of modules constructed by our methodwithin a dataset and between different data sets. We found that our method pro-duces modules with high concordance both within datasets and across differentdatasets as indicated by high Jaccard2 distances and small ranges of variation. Wealso assessed the sensitivity of our procedure to sample size and module size andfound that our procedure produces more stable modules when sample size is larger63and using a smaller sample size produces a larger number of modules with a largervariation in the number of modules produced. Furthermore we used our baselinemodules in a case study using COPD samples and demonstrated the benefits of us-ing modules in biomarker discovery. We performed an enrichment analysis using alist of genes that were previously shown to be differentially expressed between twophenotypes of COPD and identified modules that were enriched in that list. Wethen used the enriched modules and COPD samples that were stratified into threesub-phenotypes to compute the first principal component, which can be thoughtof as an eigen-gene, and projected the stratified samples onto that space. Thisexperiment identified modules that could lead to biomarkers that may be used tostratify patients into different phenotypes upon diagnosis and help tailor the treat-ment to each individual. Finally we performed a differential expression analysis,by module, using moderated t-tests and compared the resulting list of differen-tially expressed genes to previously identified differentially expressed genes andfound that by using our modules we can identify a larger number of differentiallyexpressed genes as the false discovery rate is decreased.64Bibliography[1] Cost risk analysis for chronic lung disease in canada -12-114 costriskanalysisv4.pdf. URL costriskanalysisv4.pdf.→ pages 44[2] Module extraction bioinformatics - ModuleAnnotation. URL wikis/module annotation/Module Extraction Bioinformatics. →pages ix, 7[3] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry,A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill,L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson,M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for theunification of biology. Nature genetics, 25(1):25–29, May 2000. ISSN1061-4036. doi:10.1038/75556. URL → pages 1, 56, 57[4] R. Banchereau, A. Jordan-Villegas, M. Ardura, A. Mejias, N. Baldwin,H. Xu, E. Saye, J. Rossello-Urgell, P. Nguyen, D. Blankenship, C. B.Creech, V. Pascual, J. Banchereau, D. Chaussabel, and O. Ramilo. Hostimmune transcriptional profiles reflect the variability in clinical diseasemanifestations in patients with staphylococcus aureus infections. PLoSONE, 7(4):e34390, Apr. 2012. doi:10.1371/journal.pone.0034390. URL → pages 6[5] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, andA. Califano. Reverse engineering of regulatory networks in human b cells.Nature Genetics, 37(4):382–390, Apr. 2005. ISSN 1061-4036.doi:10.1038/ng1532. URL → pages 1, 2, 1165[6] Y. Benita, Z. Cao, C. Giallourakis, C. Li, A. Gardet, and R. J. Xavier. Geneenrichment profiles reveal t-cell development, differentiation, andlineage-specific transcription factors including ZBTB25 as a novel NF-ATrepressor. Blood, 115(26):5376–5384, July 2010. ISSN 0006-4971.doi:10.1182/blood-2010-01-263855. URL → pages 57[7] M. P. R. Berry, C. M. Graham, F. W. McNab, Z. Xu, S. A. A. Bloch, T. Oni,K. A. Wilkinson, R. Banchereau, J. Skinner, R. J. Wilkinson, C. Quinn,D. Blankenship, R. Dhawan, J. J. Cush, A. Mejias, O. Ramilo, O. M. Kon,V. Pascual, J. Banchereau, D. Chaussabel, and A. OGarra. Aninterferon-inducible neutrophil-driven blood transcriptional signature inhuman tuberculosis. Nature, 466(7309):973–977, Aug. 2010. ISSN0028-0836. doi:10.1038/nature09247. URL →pages 2, 5, 6[8] R. Bonneau, D. J. Reiss, P. Shannon, M. Facciotti, L. Hood, N. S. Baliga,and V. Thorsson. The inferelator: an algorithm for learning parsimoniousregulatory networks from systems-biology data sets de novo. GenomeBiology, 7(5):R36, 2006. ISSN 1465-6906. doi:10.1186/gb-2006-7-5-r36.URL → pages 2,8, 9[9] D. Chaussabel and N. Baldwin. Democratizing systems immunology withmodular transcriptional repertoire analyses. Nat Rev Immunol, 14(4):271–280, Apr. 2014. ISSN 1474-1733. URL → pages 1, 5, 6[10] D. Chaussabel and A. Sher. Mining microarray expression data by literatureprofiling. Genome Biology, 3(10):research0055.1–research0055.16, 2002.ISSN 1465-6906. URL → pages 6[11] D. Chaussabel, C. Quinn, J. Shen, P. Patel, C. Glaser, N. Baldwin,D. Stichweh, D. Blankenship, L. Li, I. Munagala, L. Bennett, F. Allantaz,A. Mejias, M. Ardura, E. Kaizer, L. Monnet, W. Allman, H. Randall,D. Johnson, A. Lanier, M. Punaro, K. M. Wittkowski, P. White, J. Fay,G. Klintmalm, O. Ramilo, A. K. Palucka, J. Banchereau, and V. Pascual. Amodular analysis framework for blood genomics studies: application tosystemic lupus erythematosus. Immunity, 29(1):150–164, July 2008. ISSN1074-7613. doi:10.1016/j.immuni.2008.05.012. URL66 → pages ix, 2, 5,6, 7[12] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysisand display of genome-wide expression patterns. Proceedings of theNational Academy of Sciences, 95(25):14863–14868, Dec. 1998. ISSN0027-8424, 1091-6490. URL →pages 5, 6[13] T. F. Fuller, A. Ghazalpour, J. E. Aten, T. A. Drake, A. J. Lusis, andS. Horvath. Weighted gene coexpression network analysis strategies appliedto mouse weight. Mammalian Genome, 18(6-7):463–472, July 2007. ISSN0938-8990. doi:10.1007/s00335-007-9043-3. URL → pages 10[14] D. Gale and L. S. Shapley. College admissions and the stability of marriage.The American Mathematical Monthly, 69(1):9–15, Jan. 1962. ISSN0002-9890. doi:10.2307/2312726. URL → pages 16[15] A. Greenfield, C. Hafemeister, and R. Bonneau. Robust data-drivenincorporation of prior knowledge into the inference of dynamic regulatorynetworks. Bioinformatics, 29(8):1060–1067, Apr. 2013. ISSN 1367-4803.doi:10.1093/bioinformatics/btt099. URL → pages 8, 9[16] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clusteringalgorithm. Applied Statistics, 28(1):100, 1979. ISSN 00359254.doi:10.2307/2346830. URL → pages 15[17] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of StatisticalLearning: Data Mining, Inference, and Prediction. → pages 13, 14[18] M. J. Hawrylycz, E. S. Lein, A. L. Guillozet-Bongaarts, E. H. Shen, L. Ng,J. A. Miller, L. N. van de Lagemaat, K. A. Smith, A. Ebbert, Z. L. Riley,C. Abajian, C. F. Beckmann, A. Bernard, D. Bertagnolli, A. F. Boe, P. M.Cartagena, M. M. Chakravarty, M. Chapin, J. Chong, R. A. Dalley, B. D.Daly, C. Dang, S. Datta, N. Dee, T. A. Dolbeare, V. Faber, D. Feng, D. R.Fowler, J. Goldy, B. W. Gregor, Z. Haradon, D. R. Haynor, J. G. Hohmann,S. Horvath, R. E. Howard, A. Jeromin, J. M. Jochim, M. Kinnunen, C. Lau,E. T. Lazarz, C. Lee, T. A. Lemon, L. Li, Y. Li, J. A. Morris, C. C. Overly,67P. D. Parker, S. E. Parry, M. Reding, J. J. Royall, J. Schulkin, P. A. Sequeira,C. R. Slaughterbeck, S. C. Smith, A. J. Sodt, S. M. Sunkin, B. E. Swanson,M. P. Vawter, D. Williams, P. Wohnoutka, H. R. Zielke, D. H. Geschwind,P. R. Hof, S. M. Smith, C. Koch, S. G. N. Grant, and A. R. Jones. Ananatomically comprehensive atlas of the adult human brain transcriptome.Nature, 489(7416):391–399, Sept. 2012. ISSN 0028-0836.doi:10.1038/nature11405. URL→ pages 10[19] S. Hochreiter, D.-A. Clevert, and K. Obermayer. A new summarizationmethod for affymetrix probe level data. Bioinformatics, 22(8):943–949, Apr.2006. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/btl033.URL → pages 26[20] V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferringregulatory networks from expression data using tree-based methods. PLoSONE, 5(9):e12776, Sept. 2010. doi:10.1371/journal.pone.0012776. URL → pages 11[21] R. W. Irving. Stable marriage and indifference. Discrete AppliedMathematics, 48(3):261–272, Feb. 1994. ISSN 0166-218X.doi:10.1016/0166-218X(92)00179-P. URL →pages 16[22] W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects inmicroarray expression data using empirical bayes methods. Biostatistics, 8(1):118–127, Jan. 2007. ISSN 1465-4644, 1468-4357.doi:10.1093/biostatistics/kxj037. URL → pages 27[23] M. Kabbout, M. M. Garcia, J. Fujimoto, D. D. Liu, D. Woods, C.-W. Chow,G. Mendoza, A. A. Momin, B. P. James, L. Solis, C. Behrens, J. J. Lee, I. I.Wistuba, and H. Kadara. ETS2 mediated tumor suppressive function andMET oncogene inhibition in human nonsmall cell lung cancer. ClinicalCancer Research, 19(13):3383–3395, July 2013. ISSN 1078-0432,1557-3265. doi:10.1158/1078-0432.CCR-13-0341. URL → pages 36[24] M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes andgenomes. Nucleic Acids Research, 28(1):27–30, Jan. 2000. ISSN680305-1048. URL →pages 56[25] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, andM. Tanabe. Data, information, knowledge and principle: back to metabolismin KEGG. Nucleic Acids Research, 42(D1):D199–D205, Jan. 2014. ISSN0305-1048. doi:10.1093/nar/gkt1076. URL → pages 56[26] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. D.Favera, and A. Califano. ARACNE: An algorithm for the reconstruction ofgene regulatory networks in a mammalian cellular context. BMCBioinformatics, 7(Suppl 1):S7, Mar. 2006. ISSN 1471-2105.doi:10.1186/1471-2105-7-S1-S7. URL → pages 10[27] A. Mejias, B. Dimo, N. M. Suarez, C. Garcia, M. C. Suarez-Arrabal,T. Jartti, D. Blankenship, A. Jordan-Villegas, M. I. Ardura, Z. Xu,J. Banchereau, D. Chaussabel, and O. Ramilo. Whole blood gene expressionprofiles to assess pathogenesis and disease severity in infants withrespiratory syncytial virus infection. PLoS Med, 10(11):e1001549, Nov.2013. doi:10.1371/journal.pmed.1001549. URL → pages 5, 6[28] G. Obermoser, S. Presnell, K. Domico, H. Xu, Y. Wang, E. Anguiano,L. Thompson-Snipes, R. Ranganathan, B. Zeitner, A. Bjork, D. Anderson,C. Speake, E. Ruchaud, J. Skinner, L. Alsina, M. Sharma, H. Dutartre,A. Cepika, E. Israelsson, P. Nguyen, Q.-A. Nguyen, A. C. Harrod, S. M.Zurawski, V. Pascual, H. Ueno, G. T. Nepom, C. Quinn, D. Blankenship,K. Palucka, J. Banchereau, and D. Chaussabel. Systems scale interactiveexploration reveals quantitative and qualitative differences in response toinfluenza and pneumococcal vaccines. Immunity, 38(4):831–844, Apr. 2013.ISSN 1074-7613. doi:10.1016/j.immuni.2012.12.008. URL → pages 5, 6[29] M. C. Oldham, S. Horvath, and D. H. Geschwind. Conservation andevolution of gene coexpression networks in human and chimpanzee brains.Proceedings of the National Academy of Sciences, 103(47):17973–17978,Nov. 2006. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.0605938103.URL → pages 1069[30] D. J. Reiss, N. S. Baliga, and R. Bonneau. Integrated biclustering ofheterogeneous genome-wide datasets for the inference of global regulatorynetworks. BMC Bioinformatics, 7:280, June 2006. ISSN 1471-2105.doi:10.1186/1471-2105-7-280. URL → pages 8[31] J. Schfer and K. Strimmer. An empirical bayes approach to inferringlarge-scale gene association networks. Bioinformatics, 21(6):754–764, Mar.2005. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/bti062.URL → pages 5,11[32] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, andN. Friedman. Module networks: identifying regulatory modules and theircondition-specific regulators from gene expression data. Nature Genetics, 34(2):166–176, June 2003. ISSN 1061-4036. doi:10.1038/ng1165. URL → pages 8[33] A. Singh, M. Yamamoto, S. H. Y. Kam, J. Ruan, G. M. Gauvreau, P. M.O’Byrne, J. M. FitzGerald, R. Schellenberg, L.-P. Boulet, G. Wojewodka,C. Kanagaratham, J. B. De Sanctis, D. Radzioch, and S. J. Tebbutt.Gene-metabolite expression in blood can discriminate allergen-inducedisolated early from dual asthmatic responses. PLoS ONE, 8(7), July 2013.ISSN 1932-6203. doi:10.1371/journal.pone.0067907. URL → pages 26[34] G. K. Smyth. limma: Linear models for microarray data. In R. Gentleman,V. J. Carey, W. Huber, R. A. Irizarry, and S. Dudoit, editors, Bioinformaticsand Computational Biology Solutions Using R and Bioconductor, Statisticsfor Biology and Health, pages 397–420. Springer New York, Jan. 2005.ISBN 978-0-387-25146-2, 978-0-387-29362-2. URL 23. → pages 46[35] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A.Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P.Mesirov. Gene set enrichment analysis: A knowledge-based approach forinterpreting genome-wide expression profiles. Proceedings of the NationalAcademy of Sciences of the United States of America, 102(43):15545–15550,Oct. 2005. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.0506580102.URL → pages 5670[36] W. Talloen, D.-A. Clevert, S. Hochreiter, D. Amaratunga, L. Bijnens,S. Kass, and H. W. H. Ghlmann. I/NI-calls for the exclusion ofnon-informative genes: a highly effective filtering tool for microarray data.Bioinformatics, 23(21):2897–2902, Nov. 2007. ISSN 1367-4803,1460-2059. doi:10.1093/bioinformatics/btm478. URL → pages 26[37] A. Vattani. k-means requires exponentially many iterations even in theplane. Discrete & Computational Geometry, 45(4):596–616, June 2011.ISSN 0179-5376, 1432-0444. doi:10.1007/s00454-011-9340-1. URL → pages 15[38] J. Vestbo, W. Anderson, H. O. Coxson, C. Crim, F. Dawber, L. Edwards,G. Hagan, K. Knobil, D. A. Lomas, W. MacNee, E. K. Silverman, andR. Tal-Singer. Evaluation of COPD longitudinally to identify predictivesurrogate end-points (ECLIPSE). European Respiratory Journal, 31(4):869–873, Apr. 2008. ISSN 0903-1936, 1399-3003.doi:10.1183/09031936.00111707. URL → pages 26[39] J. Vestbo, S. S. Hurd, A. G. Agust, P. W. Jones, C. Vogelmeier, A. Anzueto,P. J. Barnes, L. M. Fabbri, F. J. Martinez, M. Nishimura, R. A. Stockley,D. D. Sin, and R. Rodriguez-Roisin. Global strategy for the diagnosis,management, and prevention of chronic obstructive pulmonary disease.American Journal of Respiratory and Critical Care Medicine, 187(4):347–365, Feb. 2013. ISSN 1073-449X. doi:10.1164/rccm.201204-0596PP.URL →pages 43, 44[40] I. Voineagu, X. Wang, P. Johnston, J. K. Lowe, Y. Tian, S. Horvath, J. Mill,R. M. Cantor, B. J. Blencowe, and D. H. Geschwind. Transcriptomicanalysis of autistic brain reveals convergent molecular pathology. Nature,474(7351):380–384, June 2011. ISSN 0028-0836.doi:10.1038/nature10110. URL →pages 10[41] I. V. Yang, C. D. Coldren, S. M. Leach, M. A. Seibold, E. Murphy, J. Lin,R. Rosen, A. J. Neidermyer, D. F. McKean, S. D. Groshong, C. Cool, G. P.Cosgrove, D. A. Lynch, K. K. Brown, M. I. Schwarz, T. E. Fingerlin, andD. A. Schwartz. Expression of cilium-associated genes defines novelmolecular subtypes of idiopathic pulmonary fibrosis. Thorax, 68(12):711114–1121, Dec. 2013. ISSN 1468-3296.doi:10.1136/thoraxjnl-2012-202943. → pages 36[42] B. Zhang and S. Horvath. A general framework for weighted geneco-expression network analysis. Statistical Applications in Genetics andMolecular Biology, 4(1), Aug. 2005. URL;jsessionid=393B3930C5A7B1D63504AD5CB4452B64.→ pages 2, 9[43] X. Zhou, M.-C. J. Kao, and W. H. Wong. Transitive functional annotation byshortest-path analysis of gene expression data. Proceedings of the NationalAcademy of Sciences of the United States of America, 99(20):12783–12788,Oct. 2002. ISSN 0027-8424. doi:10.1073/pnas.192159399. URL → pages 5772


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items