UBC Theses and Dissertations

Evaluating coexpression analysis for gene function prediction Lotay, Vaneet Singh 2009

EVALUATING COEXPRESSION ANALYSIS FOR GENE FUNCTION PREDICTION

by

VANEET SINGH LOTAY
B.Sc., University of Manitoba, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2009

© Vaneet Singh Lotay, 2009

Abstract

Microarray expression datasets vary in size, data quality and other features, but most methods for selecting coexpressed gene pairs use a ‘one size fits all’ approach. Many different procedures have been used to select coexpressed gene pairs of high functional similarity from an expression dataset. However, it is not clear which procedure performs best, as few studies report comparisons of these approaches. The goal of this thesis is to develop a set of “best practices” for selecting coexpression links of high functional similarity from an expression dataset, along with methods for identifying datasets likely to yield poor information. With these goals, we hope to improve the quality of gene function predictions produced by coexpression analysis.

Using 80 human expression datasets, we examined the impact of different thresholds, correlation metrics, expression data filtering and transformation procedures on performance in functional prediction. We also investigated the relationship between data quality and other features of expression datasets and their performance in functional prediction. We used the annotations of the Gene Ontology as the primary metric to measure similarity in gene function, and employed additional functional metrics for validation.

Our results show that several dataset features have a greater influence on performance in functional prediction than others. Expression datasets that produce coexpressed gene pairs of poor functional quality can be identified by a similar set of data features.
Some procedures used in coexpression analysis have a negligible effect on the quality of functional predictions, while others are essential to achieving the best performance of the algorithm. We also find that some procedures interact strongly with features of expression datasets and that these interactions increase the number of high-quality coexpressed gene pairs retrieved through coexpression analysis. This thesis uncovers important information on the many intrinsic and extrinsic factors that influence the performance of coexpression analysis in functional prediction. The information summarized here will help guide future studies using coexpression analysis and improve the quality of gene function predictions.

Table of contents

Abstract
List of tables
List of figures
Acknowledgements
Dedication
1 Introduction
1.1 Predicting gene function
1.2 Coexpression link selection
1.3 Expression dataset features
1.4 Evaluation of coexpression links
1.5 Preprocessing expression data
2 Methods
2.1 Expression data
2.2 Dataset features
2.3 Baseline algorithm settings
2.4 Filtering
2.5 Additional normalization
2.6 Log-transformation
2.7 Coexpression link selection
2.8 Functional similarity metric
2.9 Evaluating performance of alternate algorithm settings
2.10 Protein to protein interactions
2.11 KEGG pathways
2.12 MitoCarta genes
3 Results
3.1 Summary of performance statistics
3.2 Baseline coexpression settings
3.3 Expression correlation metrics
3.4 Dataset features
3.5 Configuring link selection
3.6 Removing negative correlations
3.7 Filtering of expression data
3.8 Additional normalization
3.9 Log-transformation of expression values
4 Discussion
5 Conclusion
6 Bibliography

List of tables

Table 1. Some previous coexpression studies
Table 2. Expression dataset features part 1
Table 3. Expression dataset features part 2
Table 4. Baseline algorithm settings
Table 5. Gene Ontology functional similarity statistics
Table 6. Effect of Balance normalization on performance
Table 7. New algorithm settings to achieve better performance

List of figures

Figure 1. Correlation magnitude density curves
Figure 2. Coexpression algorithm procedures
Figure 3. Performance of baseline coexpression links
Figure 4. Comparing standards for gene functional similarity
Figure 5. Effect of correlation metric on performance
Figure 6. Effect of number of samples on performance
Figure 7. Effect of number of probes on performance
Figure 8. Effect of dataset size on performance
Figure 9. Effect of missing values proportion on performance
Figure 10. Effect of PC1 variance proportion on performance
Figure 11. Effect of microarray technology type on performance
Figure 12. Effect of median probe-probe correlation on performance
Figure 13. Effect of 90th quantile probe-probe correlation on performance
Figure 14. Effect of number of tissue groups investigated on performance
Figure 15. Effect of correlation distribution cut on performance
Figure 15. Effect of statistical significance threshold on performance
Figure 16. Effect of removing negative correlations on performance
Figure 17. Effect of missing value filtering on performance
Figure 18. Effect of low variance filtering on performance
Figure 19. Effect of low expression filtering on performance
Figure 20. Effect of SVD normalization on performance
Figure 21. Effect of log-transformation on performance

Acknowledgements

I wish to offer my gratitude, first and foremost, to my research supervisor, Dr. Paul Pavlidis. Thank you for your faith in my abilities and constant drive to push me to new levels as a scientific researcher and writer. I sincerely appreciate your willingness and availability to discuss research-related matters whenever possible. It has been a pleasure to work under your guidance and support throughout my degree.

I wish to offer my thanks to the other lab members at the Centre for High Throughput Biology lab where my research was conducted.
Thank you for providing assistance and support throughout my work and allowing me to work in a friendly and enjoyable environment.

I wish to offer my thanks to my thesis committee members, Dr. Jenny Bryan and Dr. Wyeth Wasserman. Thank you for taking the time to give me guidance and feedback throughout the course of my research project.

I wish to offer my thanks to the Bioinformatics program coordinator, Sharon Ruschkowski, for her assistance in helping me enter the Bioinformatics program and for providing guidance whenever needed throughout my degree.

Finally, I would like to give special thanks to my family, most of all my parents, for providing encouragement and support throughout my studies.

Dedication

To my parents

1 Introduction

1.1 Predicting gene function

Determining the function of all human genes is a major goal of post-genomic biomedical research. Approximately 37% (9334) of human genes have no publications documenting their function [1]. Continuing advances in genomic sequencing discover new hypothetical genes and proteins whose functions remain unknown. With an ever-growing amount of publicly annotated biological data, researchers continue to search for a reliable prediction method that will associate genes of unknown function with genes of known function.

Using gene expression data, the unknown functions of genes can be predicted through coexpression analysis. Two genes whose expression profiles are highly correlated are more likely to have related biological functions than two genes whose profiles are not [2-4]. In expression data studies, it is not uncommon for some of these pairings of highly correlated expression profiles to involve one gene with an as yet unclassified biological function. Thus the unknown function of a gene can be predicted from a gene with a well-known function using the guilt-by-association concept.
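
As a minimal sketch of the guilt-by-association idea described above, the following Python snippet computes the Pearson correlation between expression profiles and reports the annotated gene whose profile correlates most strongly with that of an unannotated gene. The gene names, toy profiles, annotations and the helper `best_annotated_partner` are hypothetical illustrations and are not taken from this thesis.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def best_annotated_partner(target, profiles, annotations):
    """Return (gene, correlation, function) for the annotated gene whose
    profile is most positively correlated with the target gene's profile."""
    best = None
    for gene, function in annotations.items():
        if gene == target:
            continue
        r = pearson(profiles[target], profiles[gene])
        if best is None or r > best[1]:
            best = (gene, r, function)
    return best


# Hypothetical toy data: rows are genes, values are expression per sample.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneX": [2.1, 3.9, 6.2, 8.0],  # gene of unknown function
}
annotations = {"geneA": "ribosome biogenesis", "geneB": "apoptosis"}

partner = best_annotated_partner("geneX", profiles, annotations)
print(partner)
```

In this toy example the unannotated gene tracks `geneA` closely, so the function of `geneA` would be the candidate prediction. A real analysis would consider many partners and an explicit selection rule rather than a single best match.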
1.2 Coexpression link selection

Genes with highly correlated expression profiles are referred to as ‘coexpressed’. In expression data analyses, the expression of many genes is studied, and therefore there are many possible gene pairings to consider as ‘coexpressed’ given all their expression profiles. Consequently, it is practical to select the most significant pairings, which are the most likely to share related biological functions. These selected pairs are what we term ‘coexpression links’. We believe that the procedures involved in selecting these coexpression links greatly affect their ability to predict gene function.

Coexpression studies in the past have used a variety of correlation metrics and coexpression link selection procedures to extract the most significant relationships as coexpression links in expression data (Table 1). One common method is to apply a constant threshold to the absolute value of the correlation, so that only the strongest positive and negative correlations between expression profiles are selected as coexpression links [5, 6]. This method relies on the strength of the link’s correlation for selection and will extract a different number of coexpression links for each expression dataset, depending on the overall distribution of correlations among all expression profiles. A similar approach uses a percentage as a threshold to ensure that the same proportion of correlated gene pairs is selected as links, accounting for varying correlation magnitude distributions [7]. In other studies, links are selected based on statistical significance by controlling for false positive error rates under the t-distribution [8, 9]. In this approach, gene pairs with correlated expression profiles are evaluated individually by the statistical likelihood of their association and are selected as coexpression links under a P-value cutoff. The connectivity of nodes in a gene coexpression network is also employed as a link selection method.
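
The two threshold-based rules described above (a constant absolute correlation cutoff and a fixed percentage of gene pairs) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any cited study: the function names are hypothetical, the percentage rule is assumed to rank pairs by absolute correlation, and the P-value rule is left out.

```python
import numpy as np


def pairwise_correlations(data):
    """Return index arrays (i, j) and correlations r for every unordered
    pair of rows (expression profiles) in the data matrix."""
    corr = np.corrcoef(np.asarray(data, dtype=float))
    i, j = np.triu_indices(corr.shape[0], k=1)
    return i, j, corr[i, j]


def select_by_threshold(data, cutoff):
    """Keep pairs whose absolute correlation is at least `cutoff`.
    The number of links returned depends on the dataset."""
    i, j, r = pairwise_correlations(data)
    keep = np.abs(r) >= cutoff
    return [(int(a), int(b)) for a, b in zip(i[keep], j[keep])]


def select_top_fraction(data, fraction):
    """Keep the given fraction of pairs with the largest absolute
    correlation (assumed ranking), so every dataset yields the same
    proportion of links."""
    i, j, r = pairwise_correlations(data)
    n_keep = max(1, int(round(fraction * r.size)))
    order = np.argsort(-np.abs(r), kind="stable")[:n_keep]
    return [(int(i[k]), int(j[k])) for k in sorted(order)]


# Hypothetical toy matrix: 4 probes (rows) by 5 samples (columns).
toy = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 3.0, 1.0],
]
print(select_by_threshold(toy, 0.9))
print(select_top_fraction(toy, 0.5))
```

The constant cutoff makes the link count vary with each dataset's correlation distribution, whereas the fraction-based rule fixes the proportion of pairs and lets the effective correlation cutoff vary instead.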
In a gene coexpression network, only gene pairs whose expression profile correlation satisfies a chosen threshold (usually conservative, and possibly derived from one of the methods mentioned previously) are connected in the initial network. In some cases, the stringency of the correlation threshold is then increased, pruning the coexpression graph, until certain properties of the network, such as the average clustering coefficient, are optimized [10, 11]. This approach seeks to retain genes that are highly connected, in terms of coexpressed neighbors, in gene coexpression networks. A meta-analysis approach aims to detect coexpression between the same pair of genes in multiple expression data studies [7]. A high expression profile correlation between the same pair of genes found in different expression data studies is often not coincidental; it not only further validates their coexpression but also improves their chances of having related biological functions.

Table 1. Some previous coexpression studies

Reference | Correlation metric | Coexpression link selection criteria
Eisen, et al. (1998) [12] | Gene similarity metric similar to Pearson correlation coefficient | none (all genes included in functional enrichment analysis)
Butte, et al. (2000) [13] | Entropy and mutual information measures | mutual information threshold
Stuart, et al. (2003) [3] | Pearson correlation coefficient | P-value cutoff (Bonferroni multiple test corrected)
Tavazoie, et al. (1999) [4] | Euclidean distance metric | none (all genes included in functional enrichment analysis)
Reverter, et al. (2006) [5] | Pearson correlation coefficient | constant correlation threshold
Zhou, et al. (2002) [6] | Pearson correlation coefficient | constant correlation threshold with leave-one-out cross validation
Voy, et al. (2006) [8] | Pearson correlation coefficient (chosen) and Spearman correlation coefficient | constant correlation threshold
Carter, et al. (2004) [9] | Pearson correlation coefficient | P-value cutoff (uncorrected for multiple comparisons)
Elo, et al. (2007) [11] | Pearson correlation coefficient | clustering coefficient based correlation threshold
Mao, et al. (2009) [10] | Pearson correlation coefficient | clustering coefficient based correlation threshold
Lee, et al. (2004) [7] | Pearson correlation coefficient | constant correlation threshold % and p-value cutoff (Bonferroni multiple test corrected)
Prieto, et al. (2008) [14] | Pearson and Spearman correlation coefficients | constant correlation threshold with 25% cross validation
Wolfe, et al. (2005) [15] | Pearson correlation coefficient | P-value cutoff (corrected for multiple tests)

While there are many approaches for selecting coexpression links (Table 1), it is far from clear whether some methods are better than others, because there are few reports comparing the approaches. In this thesis, we carry out multiple methods for link selection. We also investigate, independently, the effect of each link selection method on the accuracy with which coexpression links predict gene function.

1.3 Expression dataset features

We hypothesize that coexpression analysis of some expression datasets will provide better functional predictions than that of others. For some datasets it may not be feasible to produce biologically meaningful coexpression links, or the link selection criteria may need to change to accommodate them. For this thesis, an expression dataset or expression data study is a group of microarray expression profiles that are collected together and described in a single publication [7]. Each study analyzes expression profiles obtained from a data matrix that represents the RNA levels of genes across a set of experimental conditions (different samples).
The experimental conditions can represent a gene’s expression measured in different tissue samples, at different time points or under different conditions (e.g. drug treatments). While the columns identify the different expression samples, the rows of the matrix identify the different probes used on the microarray. In addition, the values in the data matrix could be raw expression intensities, expression ratios or possibly log-transformed values.

Expression datasets vary in numerous ways (e.g., choice of platform) and additional variability is introduced by decisions made during analysis. We hypothesize that the quality of gene function predictions is related to these properties and decisions. In this thesis, we examine 80 human expression datasets with varying properties. We examine the influence of these properties on coexpression analysis and, consequently, on the ability to predict gene function.

1.4 Evaluation of coexpression links

To assess the predictive power of the selected coexpression links, they must be evaluated with a metric for gene functional similarity. In previous coexpression studies, various databases have been used to classify human gene function. Biological pathway annotations from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database or similar pathway databases have been used, defining “functionally related” as “in the same pathway” [3, 14]. Annotations from the Gene Ontology database [16] can also be used as a definition of function [3, 5-9, 11]. A third definition of gene function uses protein interactions as the operational characteristic. In another study, the results of research papers were used to relate gene function to more specific classes such as ‘housekeeping genes’ [14]. Because the Gene Ontology has the best coverage of gene functional annotations across the human genome, we used it as the primary metric for evaluating the functional similarity of the coexpression links.
We also employ additional functional metrics to validate the annotations of the Gene Ontology (Methods).

1.5 Preprocessing expression data

There are many steps in preparing a microarray before expression data can be analyzed: affixing the probes to the microarray, hybridizing the RNA samples to the probes, and capturing the image of fluorescence levels as expression intensities. Technical variation in these steps is a source of noise. Controlling for these sources of noise, and correcting for them where possible, is important for identifying real biological signals in the RNA levels (differences between cell types and tissues) [17]. The steps used to adjust the data to remove noisy signals or systematic errors are grouped together as “preprocessing”. One aim of our study is to investigate the impact that preprocessing procedures have on the quality of gene function predictions from coexpression links.

One of the first preprocessing steps is normalization, which attempts to remove systematic errors in signal levels [18]. One type of normalization used for expression data applies singular value decomposition (SVD) to remove components of the data identified as artifactual [19]. Another method employs a modified SVD approach to “equalize” sources of variance (including biological sources) [20]. Some normalization methods are specific to a particular type of array design while others are generic [21].

A second major preprocessing step is probe filtering, which removes expression profiles from the expression data matrix if they do not satisfy certain criteria. For example, filtering can be done to remove genes with low expression levels [7, 14, 22], low expression variance between samples [7, 14, 22] or too many missing expression values [7]. While one goal of such filtering is to remove noisy or irrelevant data, it also reduces the number of subsequent comparisons performed and increases statistical power [22], at the possible cost of losing information.
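
The three filtering criteria mentioned above (missing values, low expression and low variance) could be applied to a probes-by-samples matrix along the following lines. This is only a sketch: the function name `filter_probes`, its parameters and all cutoff values are hypothetical and are not the settings evaluated in this thesis.

```python
import numpy as np


def filter_probes(data, max_missing=0.3, min_mean=None, min_var=None):
    """Return the indices of rows (probes) that pass every enabled filter.

    data        : probes x samples matrix, missing values encoded as NaN
    max_missing : largest tolerated fraction of missing values per probe
    min_mean    : optional lower bound on mean expression
    min_var     : optional lower bound on sample variance across samples
    """
    data = np.asarray(data, dtype=float)
    valid = ~np.isnan(data)
    n_valid = valid.sum(axis=1)

    # Missing-value filter.
    keep = (1.0 - n_valid / data.shape[1]) <= max_missing

    # Mean and sample variance per probe, ignoring missing values.
    mean = np.where(valid, data, 0.0).sum(axis=1) / np.maximum(n_valid, 1)
    sq_dev = np.where(valid, (data - mean[:, None]) ** 2, 0.0).sum(axis=1)
    var = sq_dev / np.maximum(n_valid - 1, 1)

    if min_mean is not None:
        keep &= mean >= min_mean      # low-expression filter
    if min_var is not None:
        keep &= var >= min_var        # low-variance filter
    return np.flatnonzero(keep).tolist()


nan = float("nan")
example = [
    [5.0, 6.0, 7.0, 8.0],      # well measured and variable
    [5.0, nan, nan, nan],      # mostly missing
    [0.1, 0.2, 0.1, 0.2],      # low expression
    [5.0, 5.0, 5.0, 5.0],      # no variance
]
print(filter_probes(example, max_missing=0.5, min_mean=1.0, min_var=0.01))
```

Each filter shrinks the matrix before correlations are computed, which is where the reduction in the number of comparisons comes from.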
Filtering and normalization methods have not always been evaluated for their direct effect on producing biologically meaningful coexpression links.  7  2 Methods 2.1 Expression data Eighty human microarray expression datasets were used in the study with a total of 6732 samples. Most of these expression datasets were obtained from the Gene Expression Omnibus database (GEO) [23] while others were simply from individual publications (Table 2-3). All expression datasets were loaded and organized into the Gemma database (http://www.chibi.ubc.ca/Gemma), which we use in our lab. Expression data values remained in the same condition they were when retrieved from their database sources before any further analysis. More detailed information about the expression datasets such as the name of the microarray design used can be found on the Gemma website using the Dataset ID. Table 2. Expression dataset features part 1 Dataset ID Data Source Samples Probes Dataset Size Array Type Tissue Groups GSE420 GEO 10 10,125 101,250 one color single GSE685 GEO 9 17,900 161,100 one color single GSE833 GEO 11 5,390 59,290 one color single GSE8514 GEO 15 38,844 582,660 one color single GSE5808 GEO 18 17,900 322,200 one color single GSE2377 GEO 48 10,125 486,000 one color single GSE7509 GEO 26 38,844 1,009,944 one color single GSE2164 GEO 87 10,125 880,875 one color single GSE6008 GEO 103 17,900 1,843,700 one color single GSE1397 GEO 28 17,900 501,200 one color multiple GSE8479 GEO 65 17,663 1,148,095 one color single  8  Dataset ID Data Source Samples Probes Dataset Size Array Type Tissue Groups GSE9006 GEO 117 17,900 2,094,300 one color single GSE8586 GEO 54 38,844 2,097,576 one color single GSE6536 GEO 210 26,322 5,527,620 one color single GSE1037 GEO 91 29,553 2,689,323 two color single GSE8121 GEO 75 38,844 2,913,300 one color single GSE3526 GEO 353 38,844 13,711,932 one color multiple GSE8052 GEO 404 38,844 15,692,976 one color single GSE7307 GEO 677 38,844 26,297,388 one color multiple 
GSE8507 GEO 141 38,844 5,477,004 one color single allander-gist Allander, et al. (2001) [24] 19 1,791 34,029 two color single GSE58 GEO 12 8,484 101,808 two color single GSE53 GEO 26 4,506 117,156 two color single luo-prostate Luo, et al. (2001) [25] 25 5,725 143,125 two color single ma-breast Ma, et al. (2003) [26] 61 1,788 109,068 two color single khan-bluecell Khan, et al. (2001) [27] 88 2,097 184,536 two color single GSE59 GEO 68 8,484 576,912 two color multiple GSE3211 GEO 35 13,954 488,390 two color single GSE4988 GEO 20 9,913 198,260 two color single GSE6818 GEO 39 26,322 1,026,558 one color single GSE4058 GEO 38 25,493 968,734 two color single jazaeri-breast Jazaeri, et al. (2002) [28] 61 5,530 337,330 two color single GSE3398 GEO 73 25,493 1,860,989 two color single dhanasekaran- prstate Dhanasekaran, et al. (2001) [29] 53 8,591 455,323 two color single GSE3023 GEO 38 45,062 1,712,356 two color multiple GSE343 GEO 66 21,587 1,424,742 two color single  9  Dataset ID Data Source Samples Probes Dataset Size Array Type Tissue Groups vantveer-breast Van't Veer, et al. (2002) [30] 117 12,782 1,495,494 two color single GSE4007 GEO 126 34,195 4,308,570 two color single GSE3630 GEO 68 29,957 2,037,076 two color single GSE3497 GEO 114 33,905 3,865,170 two color single ramaswamy-cancer Ramaswamy, et al. 
(2001) [31] 280 6,216 1,740,480 one color multiple GSE7390 GEO 198 17,900 3,544,200 one color single GSE96 GEO 85 10,125 860,625 one color multiple GSE10006 GEO 87 38,844 3,379,428 one color single GSE60 GEO 133 14,015 1,863,995 two color single GSE8218 GEO 148 17,900 2,649,200 one color single GSE511 GEO 30 5,390 161,700 one color single GSE12643 GEO 20 10,125 202,500 one color single GSE4381 GEO 29 11,625 337,125 two color single GSE8919 GEO 193 17,663 3,408,959 one color single GSE11299 GEO 12 25,268 303,216 one color single GSE8481 GEO 63 17,900 1,127,700 one color single GSE137 GEO 70 9,001 630,070 two color single GSE6791 GEO 84 38,844 3,262,896 one color multiple GSE61 GEO 84 7,897 663,348 two color single GSE2361 GEO 36 17,900 644,400 one color multiple GSE8441 GEO 22 17,900 393,800 one color single GSE15219 GEO 16 18,074 289,184 one color single GSE4108 GEO 58 7,400 429,200 two color single GSE5307 GEO 33 24,929 822,657 two color single GSE11882 GEO 172 38,844 6,681,168 one color single  10  Dataset ID Data Source Samples Probes Dataset Size Array Type Tissue Groups GSE7638 GEO 160 17,900 2,864,000 one color single GSE755 GEO 173 10,125 1,751,625 one color single GSE12649 GEO 102 17,900 1,825,800 one color single GSE13879 GEO 24 16,397 393,528 one color single GSE12669 GEO 40 19,521 780,840 one color single GSE8325 GEO 12 17,012 204,144 two color single GSE9164 GEO 14 36,286 508,004 two color single GSE412 GEO 120 10,125 1,215,000 one color single GSE6613 GEO 105 17,900 1,879,500 one color single GSE5107 GEO 83 17,900 1,485,700 one color single GSE7023 GEO 47 14,016 658,752 one color single GSE89 GEO 40 5,390 215,600 one color single GSE7880 GEO 43 7,607 327,101 one color single GSE7849 GEO 78 10,125 789,750 one color single GSE994 GEO 75 17,900 1,342,500 one color single GSE90 GEO 12 5,390 64,680 one color single GSE361 GEO 20 5,390 107,800 one color single GSE88 GEO 31 5,390 167,090 one color single GSE443 GEO 11 10,125 111,375 one color single  Table 3. 
Expression dataset features part 2 Dataset ID Median correlation 90th quantile correlation Missing value proportion PC1 variance proportion Log-transformed GSE420 0.0098 0.54 0 0.93 no GSE685 0.0059 0.58 0 0.98 no GSE833 0.026 0.6 0 0.85 no GSE8514 0.0049 0.45 0 0.89 no GSE5808 0.0088 0.47 0 0.99 yes  11  Dataset ID Median correlation 90th quantile correlation Missing value proportion PC1 variance proportion Log-transformed GSE2377 -0.002 0.27 0 0.85 no GSE7509 0.016 0.38 0 0.79 no GSE2164 0.55 0.83 0.0006 0.86 no GSE6008 -0.002 0.21 0 0.99 yes GSE1397 0.013 0.45 0 0.78 no GSE8479 0.46 0.68 0 0.97 no GSE9006 0.0068 0.23 0 0.95 no GSE8586 0.00098 0.31 0 0.88 no GSE6536 0.32 0.6 0 0.99 yes GSE1037 0.48 0.7 0.0049 0.32 yes GSE8121 0.0088 0.52 0 0.72 no GSE3526 -0.014 0.32 0.0028 0.71 no GSE8052 -0.0049 0.53 0 0.99 no GSE7307 0.25 0.62 0 0.61 no GSE8507 0.015 0.45 0 0.82 no allander-gist 0.05 0.45 0 0.77 yes GSE58 0.13 0.77 0.13 0.68 yes GSE53 0.18 0.71 0.41 0.74 yes luo-prostate 0.057 0.4 0.00044 0.24 yes ma-breast -0.00098 0.28 0.0029 0.53 yes khan-bluecell 0.039 0.33 0 0.66 yes GSE59 0.022 0.27 0.21 0.25 yes GSE3211 0.043 0.41 0.18 0.75 yes GSE4988 0 0.39 0 0.2 yes GSE6818 0 0.24 0 0.98 no GSE4058 0.0039 0.36 0.13 0.74 yes jazaeri-breast -0.0029 0.31 0.000032 0.67 yes GSE3398 0.013 0.26 0.028 0.36 yes dhanasekaran- prstate -0.0039 0.38 0.082 0.37 yes GSE3023 0.00098 0.33 0.017 0.53 yes GSE343 -0.0059 0.38 0.076 0.53 yes vantveer-breast -0.0029 0.23 0.0017 0.17 yes GSE4007 0.019 0.29 0.1 0.57 yes GSE3630 0.017 0.35 0.012 0.67 yes GSE3497 0.03 0.35 0.19 0.37 yes ramaswamy- cancer 0.24 0.57 0 0.66 no GSE7390 0.0039 0.2 0 0.99 no GSE96 -0.026 0.32 0 0.66 no GSE10006 0.016 0.27 0 0.91 no GSE60 0.027 0.32 0.37 0.33 yes  12  Dataset ID Median correlation 90th quantile correlation Missing value proportion PC1 variance proportion Log-transformed GSE8218 -0.0088 0.37 0 0.99 yes GSE511 0.024 0.41 0 0.91 no GSE12643 0.002 0.47 0 0.99 yes GSE4381 0.00098 0.37 0.019 0.63 yes 
GSE8919  0.021  0.31  0  0.92  no
GSE11299  0.39  0.77  0  0.96  no
GSE8481  0.17  0.5  0  0.75  no
GSE137  0.21  0.47  0.08  0.24  yes
GSE6791  -0.0059  0.42  0  0.99  no
GSE61  0.023  0.31  0.16  0.42  yes
GSE2361  -0.021  0.33  0  0.62  no
GSE8441  0.21  0.52  0  0.95  no
GSE15219  -0.023  0.53  0  0.9  no
GSE4108  0.4  0.69  0.18  0.36  yes
GSE5307  -0.002  0.35  0.051  0.4  yes
GSE11882  0.002  0.37  0  0.38  no
GSE7638  0.002  0.44  0  0.99  no
GSE755  -0.002  0.18  0  0.87  no
GSE12649  0.0059  0.34  0  0.93  no
GSE13879  0.15  0.49  0  0.21  yes
GSE12669  -0.0029  0.39  0  0.98  no
GSE8325  -0.002  0.44  0.037  0.16  yes
GSE9164  0.12  0.56  0.00048  0.98  yes
GSE412  0.47  0.68  0.00046  0.89  no
GSE6613  0.00098  0.21  0  0.92  no
GSE5107  -0.0078  0.3  0  0.85  no
GSE7023  0  0.35  0  0.99  no
GSE89  -0.0059  0.35  0  0.84  no
GSE7880  0.13  0.43  0  0.99  no
GSE7849  -0.0059  0.23  0  0.81  no
GSE994  -0.00098  0.26  0  0.79  no
GSE90  0.7  0.9  0  0.96  no
GSE361  0.015  0.49  0  0.9  no
GSE88  0.0059  0.41  0  0.92  no
GSE443  -0.0049  0.49  0  0.91  no
PC1 – The first principal component of the data

2.2 Dataset features

We recorded or computed multiple features of each dataset, detailed in Tables 2 and 3. The number of probes that target 'known genes' (those annotated in the National Center for Biotechnology Information (NCBI) database) was recorded rather than the total number of probes on the microarray, since the selected coexpression links were ultimately kept only between these probes (Section 2.7). The dataset size is the product |samples| * |'known gene' probes|. We computed two additional features from the data: the proportion of missing values and the proportion of variance held by principal component 1 (PC1). Singular value decomposition (SVD) [19] was used to transform the expression data matrix and obtain the principal components of the data, and the proportion of variance held by PC1 was then measured. To compute this feature, missing expression values were replaced by the mean signal level of their associated expression profiles prior to singular value decomposition.
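The PC1 variance proportion described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the thesis: the function name `pc1_variance_proportion` is hypothetical, NumPy is assumed, and the matrix is decomposed without centering because the text does not state whether the data were centered first.

```python
import numpy as np


def pc1_variance_proportion(expr):
    """Proportion of variance held by the first principal component.

    expr: 2-D array, rows = expression profiles (probes), columns = samples,
    with NaN marking missing values. Hypothetical helper for illustration.
    """
    expr = np.asarray(expr, dtype=float)
    # Replace each missing value with the mean of its own expression profile.
    row_means = np.nanmean(expr, axis=1, keepdims=True)
    filled = np.where(np.isnan(expr), row_means, expr)
    # Only the singular values are needed for the variance proportions.
    # Assumption: no centering is applied before the decomposition.
    singular_values = np.linalg.svd(filled, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))
```

A value close to 1 means that a single component accounts for almost all of the variation in the matrix, as is the case for many of the datasets listed in Table 3.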
For each expression dataset within the Gemma database, correlation magnitudes between all rows of the expression data matrix are computed and density distribution curves are displayed on the Gemma web site (Figure 1). We measured the median and 90th quantile correlation values from these density distributions. These two features indicate the proportion of positively correlated gene pairs in the dataset. For example, most datasets have a correlation density distribution with a median near zero (Figure 1A). However, datasets that have more positively correlated gene pairs between all expression profiles would have a median correlation of perhaps 0.3 (Figure 1B).

Figure 1. Correlation magnitude density curves: Density distribution curves of correlation values between all expression profiles in the expression data matrix. Images from the Gemma website: http://www.chibi.ubc.ca/Gemma (A) Dataset ID GSE3023. (B) Dataset ID GSE6536.

2.3 Baseline algorithm settings

The data filtering procedures and criteria for coexpression link selection used in this thesis were based on a previously published algorithm developed in our lab [7]. The original algorithm has default settings for each parameter, chosen on the basis of pilot experiments run with a small number of expression datasets; however, these settings were not thoroughly evaluated for their optimality for gene function prediction. We used the coexpression links produced under those default settings as our "baseline standard" against which to measure performance (Table 4). A flow chart showing the sequence of procedures in this algorithm is shown in Figure 2. The red boxes identify procedures that were not originally implemented in the algorithm but are tested in this thesis and detailed in subsequent sections.

Table 4. Baseline algorithm settings

Parameter  Setting
Correlation metric  Pearson correlation coefficient
Missing values filter  30% of samples must be present
Low variance filter  5% lowest removed
Low expression filter  30% lowest removed
P-value cutoff  ≥ 0.01 removed
Correlation distribution cut  ≤ highest 1% of correlations removed
SVD normalization  No
Log-transformation  No
Removing negative correlations  No

The baseline algorithm is as follows, with settable parameters as given in Table 4. First, probes of the input expression matrix that have more than 70% of their values missing are filtered out. Second, for ratiometric arrays, a proportion of probes with the lowest expression variance between samples is filtered out of the expression matrix. For one color array designs, a proportion of probes with the lowest coefficient of variation of expression among samples is filtered out instead. Third, a proportion of probes with the lowest mean expression level in the expression matrix is filtered out. For datasets of a two color microarray design, the expression level is estimated as the mean of the intensities from the two channels. For some datasets this information was not available, so the filter was not used. Finally, each gene expression profile is compared to all others using the Pearson correlation coefficient. The selection of the most significant pairings between expression profiles is based on two criteria. The first criterion evaluates each correlated gene pair on the basis of statistical significance with a P-value threshold. The assumption is that, under the null hypothesis, the expression values of the two genes in each pair come from a bivariate normal distribution.
Then the ratio of the correlation magnitude over the standard error of the correlation, under the null hypothesis of no correlation, follows a t-distribution with n-2 degrees of freedom, where n is the number of samples in each expression profile. P-values were corrected for the number of genes tested using the Bonferroni multiple test correction. The second criterion is the correlation distribution cut, which selects the top proportion of gene pairings on the basis of correlation magnitude (considering both positive and negative correlations). The more stringent of these two criteria determines the correlation thresholds at each tail of the distribution and marks the cutoff for coexpression link selection. For example, for a particular dataset the top 1% of correlations may include all gene pairs with a correlation over 0.8 and all gene pairs with a correlation less than -0.75. However, the correlation corresponding to a p-value of 0.01, under the assumptions stated earlier and considering the number of unique genes in the dataset, may be ±0.85. This statistical significance threshold is more stringent at both tails of the correlation distribution than the correlation distribution cut, and it therefore becomes the correlation threshold for link selection.

Figure 2. Coexpression algorithm procedures: Flow chart demonstrating the sequence of algorithm procedures for preprocessing the expression data and selecting the coexpression links for each expression dataset. The boxes in red were not part of the original coexpression algorithm (Section 2.3) and are tested for their effect on performance in this thesis.

2.4 Filtering

Three filtering procedures were executed sequentially on the probes (or probe sets, for Affymetrix microarrays).
The exact details of these procedures are based on a previously published coexpression algorithm [7], detailed in Section 2.3. Probes were removed according to three criteria: too many missing values, comparatively low variance between samples, and a comparatively low mean expression level.

2.5 Additional normalization

Additional normalization methods were tested on the expression data to remove potential experimental noise. The first normalization method tested was singular value decomposition (SVD) [19], which is applied to the filtered expression data, decomposing it into three matrices (U, V^T, D) that give information on the principal components of the data. The first principal component, representing the largest source of variation, was removed by setting its corresponding singular value to zero in the 'D' matrix, followed by reconstructing a "corrected" dataset by matrix multiplication. We also tested an SVD-based normalization method that we call 'Balance', which is essentially the normalization procedure used in the SPELL algorithm [20]. The Balance method sets the values for all principal components to one in the 'D' matrix and then recombines the three SVD matrices to form a "corrected" expression data matrix.

2.6 Log-transformation

The effect of log-transformation of raw expression values on algorithm performance was tested for expression datasets using a one color microarray technology type. The 'one color' datasets which originally had log-transformed expression values were 'unlogged' by raising two to the power of each expression value (2^X). The 'one color' datasets which originally had raw expression values (not log-transformed) were 'logged' by taking the base-2 logarithm of each expression value (log2 X). In this case, negative expression values and values equal to zero were left as missing values after log-transformation. All transformations were done prior to expression data filtering.
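The normalization and transformation steps of Sections 2.5 and 2.6 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: all function names are hypothetical, NumPy is assumed, the input is taken to have no missing values for the SVD steps, and 'Balance' is read here as setting every singular value to one.

```python
import numpy as np


def remove_first_component(expr):
    """SVD normalization: zero the largest singular value and rebuild the matrix."""
    u, s, vt = np.linalg.svd(expr, full_matrices=False)
    s = s.copy()
    s[0] = 0.0  # singular values are returned in descending order
    return (u * s) @ vt  # equivalent to U @ diag(s) @ V^T


def balance(expr):
    """'Balance' normalization, read here as replacing every singular value by one."""
    u, _, vt = np.linalg.svd(expr, full_matrices=False)
    return u @ vt


def log2_transform(expr):
    """'Log' raw values; zero and negative values become missing (NaN)."""
    expr = np.asarray(expr, dtype=float)
    out = np.full(expr.shape, np.nan)
    positive = expr > 0
    out[positive] = np.log2(expr[positive])
    return out


def unlog2(expr):
    """'Unlog' log2-transformed values by computing 2 ** x for each value x."""
    return np.power(2.0, np.asarray(expr, dtype=float))
```

For example, `log2_transform(np.array([[4.0, 0.0, -1.0]]))` keeps the first value as 2.0 and marks the other two as missing, matching the handling of non-positive values described in Section 2.6.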
2.7 Coexpression link selection

After any filtering or normalization is complete, each gene expression profile is compared to all others using a correlation coefficient, either Pearson or Spearman. The option to discard all negative correlations between genes is executed at this point if desired. The link selection procedure that follows is based on the previously published algorithm [7], detailed in Section 2.3, which chooses links based on two main criteria: a statistical significance threshold and a correlation distribution cut. One difference from the original algorithm is that, when using the Spearman correlation metric, the p-values for the statistical significance threshold are computed differently and do not use the assumptions related to the t-distribution [32]. The impact of each individual criterion is also investigated in this thesis by selecting coexpression links for each dataset with each threshold independently. Only coexpression links formed between probes that target at least one known gene were kept, in preparation for their functional evaluation at the gene level, described in the following sections.

2.8 Functional similarity metric

Annotations from the Gene Ontology (GO) [16] were used as the primary metric for evaluating the functional similarity of each selected coexpression link. The Gene Ontology is a controlled vocabulary of gene attributes in a hierarchical structure. The ontology covers three main domains, which serve as the root attributes in the hierarchy: cellular component, biological process and molecular function. Each gene attribute or 'GO term' has genes annotated to it which are associated by function. GO terms near the top of the hierarchy, close to one of the three root attributes, are more general, and terms become more specific in function towards the leaves at the bottom of the hierarchy.
In this methodology, each gene may have a set of GO terms associated with it (some genes have zero GO terms). In our method, this set includes all parent terms in the hierarchy of the directly annotated GO terms. The GO overlap score of a coexpression link between genes A and B is simply the GO term overlap |GOa ∩ GOb|, where GOa denotes the set of GO terms associated with gene A and GOb denotes the set of GO terms associated with gene B. This metric was previously validated extensively by Mistry and Pavlidis (2008) [33]. GO overlap scores were recorded for all links in every analysis of the datasets. GO overlap scores were also recorded for randomly selected pairs from the microarray design of each dataset. After scoring different quantities of random pairs, we found that generating more than 10,000 random pairs did not significantly change the resulting distribution of GO overlap scores. Therefore a set size of 10,000 random gene pairs was chosen for each expression dataset. The primary measure of performance, named the mean GO shift, was the difference in the average GO overlap score between the coexpressed and randomly selected gene pairs for each dataset under the same set of data treatment and link selection parameters. A positive mean GO shift for a link set implies that it performs better than its associated set of random links in terms of functional similarity. The Gene Ontology annotations covered 18512 unique human genes, with an average of over seven GO terms annotated to each gene.

2.9 Evaluating performance of alternate algorithm settings

To evaluate the performance of alternate algorithm settings and procedures in contrast to baseline conditions (Table 4), we computed several measures. To first evaluate the performance in individual datasets under an alternate algorithm setting, we measured the change in their functional similarity measure from baseline conditions (DELTA (Mean GO shift)).
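The scoring defined so far (GO overlap, mean GO shift and its DELTA) can be sketched as below. This is a minimal sketch using only the Python standard library, not the thesis code: the function names are hypothetical, the GO term sets are assumed to already include all parent terms, and random pairs are drawn uniformly from the genes on the microarray design.

```python
import random


def go_overlap(terms_a, terms_b):
    """GO overlap score |GOa ∩ GOb| for one gene pair."""
    return len(set(terms_a) & set(terms_b))


def mean_overlap(pairs, gene_to_terms):
    """Average GO overlap over gene pairs; unannotated genes get an empty set."""
    scores = [
        go_overlap(gene_to_terms.get(a, ()), gene_to_terms.get(b, ()))
        for a, b in pairs
    ]
    return sum(scores) / len(scores)


def mean_go_shift(links, gene_to_terms, design_genes, n_random=10_000, seed=0):
    """Mean GO overlap of selected links minus that of random gene pairs."""
    rng = random.Random(seed)
    genes = sorted(design_genes)
    # Assumption: each random pair is two distinct genes drawn uniformly.
    random_pairs = [tuple(rng.sample(genes, 2)) for _ in range(n_random)]
    return mean_overlap(links, gene_to_terms) - mean_overlap(random_pairs, gene_to_terms)


def delta_mean_go_shift(alternate_shift, baseline_shift):
    """Change in mean GO shift relative to baseline; positive is an improvement."""
    return alternate_shift - baseline_shift
```

With made-up term identifiers, `go_overlap({"GO:A", "GO:B", "GO:C"}, {"GO:B", "GO:C", "GO:D"})` gives 2, the number of shared terms.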
A positive DELTA (Mean GO shift) value implies that, for a particular dataset, the alternate algorithm setting or procedure performs better than baseline conditions (Table 4). We also performed one-sided Wilcoxon signed rank tests on these DELTA (Mean GO shift) values to evaluate the statistical significance of this change in performance across all datasets. For each signed rank test, we used the alternative hypothesis that the DELTA (Mean GO shift) values were greater than zero, as we were primarily interested in whether the change in performance was a statistically significant improvement and not merely a significant difference in either direction.

To summarize the average change in performance across all datasets when varying the different thresholds and settings of our coexpression algorithm, we used three additional measures. The first measure, later referred to as the 'average' score, is the average of the mean GO shift scores across all datasets (AVG mean GO shift). The second measure is the standard deviation of the mean GO shift scores for link sets produced under the alternate algorithm settings (SD Mean GO shift). The third measure, later referred to as the 'weighted average' score, is the average of the mean GO shift scores across all datasets while taking into account the change in the number of links produced (Weighted AVG mean GO shift).

For the 'weighted average' scores, the product of the mean GO shift score and the link count is summed over the n link sets produced under the alternate algorithm setting or procedure. This sum is then divided by the sum of the link counts for the n datasets under the alternate algorithm setting being tested, giving the Weighted AVG mean GO shift score. Here n is the number of datasets applicable to the alternate algorithm setting or procedure being tested, which in most cases is all 80 datasets.

2.10 Protein to protein interactions

Human protein-to-protein interactions were downloaded from the InnateDB database [34].
Given a coexpression link consisting of genes A and B, the interaction of A and B on the protein level was considered a functional match. The proportion of matches over the total number of links in a link set was measured as the 'PPI%'. The 'PPI% shift' was the difference in the 'PPI%' measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions. The 'PPI% shift' was used as an additional standard to evaluate the functional similarity of coexpression link sets for each dataset. A positive 'PPI% shift' for a link set implies that it performs better than its associated set of random links in terms of functional similarity. These protein-to-protein interactions cover 16797 unique human genes with 45750 total interactions.

2.11 KEGG pathways

Gene annotations for human biological pathways were obtained from the KEGG [35] pathway database (http://www.genome.jp/kegg/pathway.html). Every KEGG pathway has an associated set of genes annotated to it. Given a coexpression link consisting of genes A and B, the annotation of A and B to the same KEGG pathway was considered a functional match. The proportion of matches over the total number of links in a link set was measured as the 'KEGG%'. The 'KEGG% shift' was the difference in the 'KEGG%' measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions. The 'KEGG% shift' was also used as an additional standard to evaluate the functional similarity of coexpression link sets for each dataset. A positive 'KEGG% shift' for a link set implies that it performs better than its associated set of random links in terms of functional similarity. There were 217 human pathways in total, covering 4972 unique human genes.

2.12 MitoCarta genes

The final evaluation method we applied was based on MitoCarta [36].
The Human MitoCarta gene list contains 1023 genes that encode proteins with strong support for localization to the mitochondria, and which have been shown to be coexpressed [37, 38]. For each analysis, the number of selected coexpression links between two human MitoCarta genes was divided by the total number of possible links that can be formed from the MitoCarta genes present in the expression dataset (MitoCarta ratio). The 'MitoCarta ratio shift' was the difference in the MitoCarta ratio measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions.

3 Results

3.1 Summary of performance statistics

To understand how the preprocessing procedures and methods used for coexpression link selection affect performance in functional prediction, we adjusted the stringency of the algorithm settings controlling these steps and tested new methods. To monitor the average change in performance, we tracked three measures of functional similarity according to the Gene Ontology (GO) annotations, summarized below (Table 5) for each analysis (Methods): AVG Mean GO shift, SD Mean GO shift and Weighted AVG Mean GO shift. These measures summarize the effect of each alternate threshold setting or new procedure on algorithm performance across all 80 expression datasets. While the 'average' scores simply summarize the raw performance values across all datasets, the 'weighted average' scores take into account the change in the average number of links obtained, as we do not desire an improvement in performance at the cost of greatly reducing the number of links produced. The change in these measures, or 'DELTA', indicates how these alternate settings or new procedures perform relative to the coexpression links produced under baseline algorithm settings for all datasets (a positive 'DELTA' is an improvement in performance).
Further details on the change in performance for individual datasets, when adjusting the stringency of each algorithm procedure or testing new procedures, are discussed in later sections.

Table 5. Gene Ontology functional similarity statistics

Setting  AVG Mean GO Shift  AVG Mean GO Shift-DELTA  SD Mean GO shift  Weighted AVG Mean GO Shift  Weighted AVG Mean GO shift-DELTA  AVG Link Count
PEARSON CORRELATION METRIC
Baseline  0.67  0  0.79  0.81  0  697,390
Baseline (Corr. Dist. Cut)  0.48  -0.2  0.54  0.72  -0.082  1,395,360
Baseline (P-value threshold)  0.63  -0.038  0.75  0.41  -0.4  13,701,743
Removing negative correlations  0.8  0.13  0.88  1.1  0.28  420,329
SVD normalization  0.42  -0.25  0.69  0.25  -0.56  732,424
SVD normalization (TC)  0.49  0.12  0.75  0.68  0.092  307,194
Balance normalization  0.48  0.27  1.2  0.016  -0.036  5,124
Missing values cut = 50% (28 datasets) *  0.3  0.075  0.66  0.5  0.012  1,374,348
Missing values cut = 70% (28 datasets) *  0.39  0.16  0.74  0.52  0.032  1,333,784
Low variance cut = 0%  0.47  -0.0037  0.54  0.72  -0.0018  1,541,028
Low variance cut = 15%  0.47  -0.0023  0.53  0.71  -0.016  1,116,016
Low expression cut = 0%  0.36  -0.16  0.43  0.43  -0.29  2,360,463
Low expression cut = 45%  0.58  0.062  0.62  0.87  0.15  1,002,551
Corr. Dist. Cut = 0.1%  0.56  0.081  0.71  0.86  0.14  144,173
Corr. Dist. Cut = 10%  0.37  -0.1  0.38  0.46  -0.27  13,356,312
P-value threshold = 0.001  0.65  0.02  0.87  0.42  0.0068  12,142,044
P-value threshold = 0.1  0.59  -0.044  0.69  0.38  -0.024  15,633,857
Reverse log-transformation (6 datasets) ~  0.6  -0.11  0.53  0.72  0.11  579,524
Log-transformation (46 datasets) ~~  0.81  0.27  0.79  0.63  0.14  1,586,755
SPEARMAN CORRELATION METRIC
Baseline  0.71  0  1  0.9  0  668,886
Baseline (Corr. Dist. Cut)  0.52  -0.19  0.53  0.78  -0.12  1,410,173
Baseline (P-value threshold)  1.3  0.61  1.2  1.1  0.18  598,404
Removing negative correlations  0.85  0.14  1.1  1.3  0.4  378,205
SVD normalization  0.65  -0.065  1.6  1.5  0.61  172,859
SVD normalization (TC)  -0.15  -0.56  1.1  -0.72  -1.5  7,441
Missing values cut = 50% (28 datasets) *  0.45  0.15  0.91  0.6  0.077  1,270,753
Missing values cut = 70% (28 datasets) *  0.54  0.23  0.88  0.62  0.093  1,318,569
Low variance cut = 0%  0.56  0.035  0.95  0.8  0.018  1,404,279
Low variance cut = 15%  0.57  0.045  0.94  0.81  0.023  1,066,854
Low expression cut = 0%  0.57  0.0079  0.59  0.44  -0.33  106,093
Low expression cut = 45%  0.76  0.19  0.81  1.3  0.5  54,019
Corr. Dist. Cut = 0.1%  0.63  0.1  1.1  0.45  -0.34  14,871
Corr. Dist. Cut = 10%  0.52  -0.0091  0.53  0.039  -0.74  1,405,739
P-value threshold = 0.001  1.3  -0.007  1.2  1.1  0.027  10,895,815
P-value threshold = 0.1  1.3  -0.0021  0.83  1.1  -0.014  14,099,479

Blue colored rows – all analyses run under baseline algorithm settings (Table 4)
Green colored rows – all analyses run under baseline algorithm settings except using only the correlation distribution cut for link selection
Red colored rows – all analyses run under baseline algorithm settings except using only the p-value threshold for link selection
* Statistics gathered only for datasets which contain missing values in expression data
~ Statistics gathered only for datasets of one color microarray technology type which originally had log-transformed expression values
~~ Statistics gathered only for datasets of one color microarray technology type which originally did not have log-transformed expression values
AVG – average
SD – standard deviation

3.2 Baseline coexpression settings

To effectively evaluate the significance of each component of coexpression analysis towards predicting gene function, we needed to compare our performance against a
baseline measure. The structure of the coexpression analysis method was based on a previously published algorithm which executed similar preprocessing methods and coexpression link selection criteria (Methods). We produced coexpression links under the default settings used in the original algorithm as our 'baseline links' for each expression dataset (Table 4). We also produced 10,000 randomly selected gene pairs for each dataset and, for each analysis, measured the performance of the selected coexpression links against the performance of the random links in terms of functional similarity. The functional similarity of each selected coexpression link was evaluated by computing the overlap of the Gene Ontology (GO) annotations of its two genes (Methods).

We first produced coexpression links under baseline algorithm settings (Table 4) for all 80 expression datasets. We then evaluated the functional similarity of these baseline link sets by using the difference in the average GO overlap score between selected and random link sets, giving us the 'mean GO shift' measure (Figure 3). Under baseline algorithm settings, the majority of datasets perform well; however, 15 datasets still produce a negative mean GO shift, which implies that their coexpression links do not provide better functional information than simply picking random gene pairs. There are potentially two factors that could account for these results. Firstly, the preprocessing procedures and link selection criteria executed under baseline algorithm settings may be unable to extract the highest number of functionally similar links from each dataset. Secondly, there may be technical features of these expression datasets, possibly related to data quality, which make it difficult to extract coexpression links of high functional similarity.
Further analysis of both avenues in this thesis will help determine which of these factors, if not both, cause coexpression link sets to perform worse than random gene pairs.

Figure 3. Performance of baseline coexpression links: Mean GO shift scores for coexpression links created under baseline settings for each parameter of the coexpression algorithm. Scores for all 80 datasets are shown (each point represents a dataset). The black line parallel to the X-axis separates the positive mean GO shift scores from the negative mean GO shift scores. Scores are sorted in ascending order.

To validate the accuracy of the Gene Ontology as a standard for gene functional similarity, we tested additional functional standards on our coexpression algorithm. For each of the 80 baseline link sets, we computed three new measures: the proportion of links that share a protein-to-protein interaction (PPI%), the proportion of links that share gene annotations to the same KEGG pathway (KEGG%) and the number of links between MitoCarta genes originally in the expression dataset (MitoCarta Ratio). The expression datasets used in this thesis were not specifically chosen for their study of mitochondrial genes; however, this ratio was controlled for the number of MitoCarta genes present in the original dataset (Methods). Also, previous studies have shown that mitochondrial genes are highly coexpressed [37, 38], and they should therefore be detected by our coexpression algorithm. Similar to the mean GO shift, these additional measures were scored relative to the performance of the randomly selected links, giving us the PPI% shift, the KEGG% shift and the MitoCarta ratio shift. We assessed how well these additional standards validated our primary measure of functional similarity, the Gene Ontology, by comparing the respective scores for all functional standards (Figure 4).
The range of these new scoring measures was very small because these additional standards, most notably the MitoCarta genes, have lower annotation coverage of the human genome than the Gene Ontology. Nevertheless, they do show some agreement with the GO functional standard, with the highest correlation (Figure 4) from the protein-to-protein interactions (correlation = 0.45, p-value = 2.87E-5). This instills more confidence in the GO annotations as a reasonable evaluator of functional similarity for coexpression links.

Figure 4. Comparing standards for gene functional similarity: Functional similarity scores using the Gene Ontology are compared with 3 additional functional metrics scoring the baseline coexpression links for all 80 datasets. A high scoring outlier in graphs (A-B) was omitted in order to focus on the overall trend for the majority of the data. The high scores of these outlier points are most likely caused by the smaller size of their datasets, for which even a small number of functional matches would register a high 'PPI% shift' or 'KEGG% shift' score. The rank correlation between the two sets of scores and the associated p-value is shown at the top of each graph. (A) Protein-to-protein interactions (B) KEGG pathway annotations (C) MitoCarta genes

3.3 Expression correlation metrics

We compared the 'quality' of coexpression links selected using the Pearson linear correlation coefficient and the Spearman rank correlation coefficient. We selected links solely based on the correlation distribution cut in order to obtain a similar number of links under both correlation metrics. Besides the correlation metric, all other parameters were configured to baseline algorithm settings. We then compared the functional similarity of the links produced under both correlation metrics by recording their mean GO shift scores (Figure 5).
The performance of both correlation metrics agrees well across all datasets (correlation = 0.91). Because the performance of the two metrics is highly correlated, the biological interpretation of the data does not change significantly when using the Spearman correlation metric. Therefore, in the remainder of the results we focus on the experiments that were run with the Pearson correlation coefficient.

Figure 5. Effect of correlation metric on performance: Mean GO shift scores, representing functional similarity for the coexpression links of all 80 datasets produced under both the Pearson and Spearman correlation metrics. All link sets were selected on the basis of only the correlation distribution cut criterion to produce similar link quantities for both metrics. The remaining conditions for both sets of scores are under baseline algorithm settings. The Spearman rank correlation between the scores of the two correlation metrics is displayed at the top of the graph, together with the p-value of the one-sided Wilcoxon signed rank test. The black line with a slope of 1 drawn through the origin shows where scores would be identical under both correlation metrics.

3.4 Dataset features

It is clear from the performance of the baseline link sets shown earlier (Figure 3) that, under the same preprocessing settings and link selection thresholds, some datasets provide more functionally related links than others. While all baseline links were formed under the same algorithm settings, the expression datasets from which they were obtained vary in their features (Tables 2 and 3). We recorded a variety of features of each expression dataset (Methods) and examined the relationship between these dataset features and the quality of coexpression links produced. We kept all algorithm settings constant in this comparison (baseline settings; Figures 6-14).
Figures 6-8 show that the size of the expression dataset is slightly (but significantly) positively correlated with performance. In contrast, missing values tend to reduce performance (Figure 9). Interestingly, the best performing datasets in coexpression analysis all have a PC1 variance proportion greater than 0.9 (Figure 10). Datasets that use a one color array design such as Affymetrix perform better overall than those of two color array designs (p < 0.05, t-test; Figure 11). Datasets with a high median correlation among probes perform better (the correlation with performance is 0.21), suggesting that even a slight increase in the relative proportion of positive correlations improves performance (Figures 12-13). The expression datasets sampling from only one tissue group seem to perform slightly better than those that study multiple tissue groups (Figure 14); however, this is not a sufficiently balanced comparison, as only 9 datasets in total used multiple tissue groups. While none of these factors explains all the variance in performance (alone or together), there is clearly a connection between some dataset features and performance in functional prediction.

Figure 6. Effect of number of samples on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the number of microarray samples in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 7. Effect of number of probes on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the number of known gene probes in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 8. Effect of dataset size on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the expression data matrix size in each dataset.
The rank correlation between the two measures is displayed at the top of the graph.

Figure 9. Effect of missing values proportion on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the proportion of missing values in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 10. Effect of PC1 variance proportion on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the proportion of variance held by the 1st principal component in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 11. Effect of microarray technology type on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared between datasets using a one color or two color microarray technology type. The p-value from a two-sided t-test between both sets of scores is shown at the top of the graph.

Figure 12. Effect of median probe-probe correlation on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the median correlation magnitude in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 13. Effect of 90th quantile probe-probe correlation on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the 90th quantile correlation magnitude in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 14. Effect of number of tissue groups investigated on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared between datasets that investigated one tissue group or investigated multiple tissue groups. The p-value from a two-sided t-test between both sets of scores is shown at the top of the graph.
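The feature comparisons in Figures 6-14 pair one number per dataset (a dataset feature) with that dataset's mean GO shift and report a rank correlation. The following is a minimal sketch of that kind of calculation, not the code used in this thesis: the function names, the use of None to mark a missing value, and the example numbers are all hypothetical and chosen only for illustration.

```python
import math


def missing_proportion(matrix):
    """Fraction of entries in a probes-by-samples matrix that are missing (None)."""
    total = sum(len(row) for row in matrix)
    missing = sum(1 for row in matrix for value in row if value is None)
    return missing / total


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for five imaginary datasets (not data from this thesis).
missing_fraction = [0.00, 0.02, 0.10, 0.25, 0.40]
mean_go_shift = [1.2, 0.9, 0.5, 0.6, -0.3]
rho = spearman(missing_fraction, mean_go_shift)  # negative for these values
```

The rank-based measure is used here because the figure captions report rank correlations; it depends only on the ordering of the datasets, not on the scale of the feature.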
3.5 Configuring link selection

The goal of our link selection process was to extract coexpression links which contain significant functional relationships between genes. Ideally we would extract all such functionally-relevant links, but there might be more value in selecting fewer links of (hopefully) higher quality. In terms of our link selection procedures, testing this amounts to varying the thresholds used to select the coexpression links. As described earlier, in the baseline method there are two thresholds we use to extract links: one using the properties of the coexpression distribution, and one using a test of statistical significance of the coexpression. In any given dataset either threshold might be in force, complicating evaluation. Therefore, to test the effect of varying these two thresholds independently, we altered our procedure to apply only one of the thresholds at a time (Methods).

We first varied the stringency of the correlation distribution cut. We produced coexpression link sets for each dataset under three different settings for the correlation distribution cut: 0.1%, 1% (baseline setting) and 10%. The significance threshold was not used, and other settings were at baseline. We then measured the change in the overall functional similarity from taking fewer or more links compared to baseline link sets, giving us the ‘DELTA (mean GO shift)’ values. We used the Wilcoxon signed rank test to determine whether the improvement in performance, if any, over the baseline standard was statistically significant across all datasets. Selecting a smaller proportion of the strongest correlations as links (0.1%) shows a statistically significant improvement over baseline (Wilcoxon p-value = 0.0057; Figure 15A). The majority of the improvement occurs for datasets that already perform well (mean GO shift > 0.25) under baseline conditions.
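As an illustration of the correlation distribution cut described above, the sketch below keeps a fixed fraction of the strongest gene-pair correlations as links. It is a sketch under stated assumptions rather than the thesis implementation: the function name, the tuple layout, the ranking by absolute correlation, and the rounding rule are hypothetical choices, and the example correlations are invented.

```python
def correlation_distribution_cut(pairs, fraction, positive_only=False):
    """Keep the strongest `fraction` of gene pairs as coexpression links.

    pairs: list of (gene_a, gene_b, r) tuples, where r is a correlation.
    fraction: proportion of pairs to keep, e.g. 0.01 for a 1% cut.
    positive_only: if True, discard negative correlations before the cut.
    """
    if positive_only:
        pairs = [p for p in pairs if p[2] > 0]
    # Assumption: strength is the absolute value of the correlation.
    ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
    # Assumption: round to the nearest whole link, keeping at least one.
    n_keep = max(1, round(fraction * len(ranked)))
    return ranked[:n_keep]


# Invented correlations for five gene pairs (not data from this thesis).
example_pairs = [
    ("A", "B", 0.95),
    ("A", "C", -0.90),
    ("B", "C", 0.10),
    ("A", "D", 0.50),
    ("B", "D", -0.20),
]
links = correlation_distribution_cut(example_pairs, 0.4)
positive_links = correlation_distribution_cut(example_pairs, 0.4, positive_only=True)
```

Because the cut is a proportion rather than a fixed correlation value, the number of links scales with the number of gene pairs in each dataset, which is what keeps link quantities comparable across settings.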
Selecting a larger proportion of the strongest correlations does not guarantee more coexpression links of high quality, as a negative change relative to the baseline link sets is common (Figure 15B). This negative change is not a statistically significant improvement in performance (Wilcoxon p-value = 1).

Figure 15. Effect of correlation distribution cut on performance: The change in functional similarity as compared to the performance of baseline links when using alternate thresholds for the correlation distribution cut. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis only use the correlation distribution cut to select coexpression links. P-values are shown at the top of each graph, produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Correlation distribution cut = 0.1% (B) Correlation distribution cut = 10%

To test the effect of the statistical significance threshold, we produced coexpression link sets for each dataset under three different p-value thresholds: 0.001, 0.01 (baseline setting) and 0.1. All other algorithm settings were under baseline conditions. We found that, as with the correlation distribution cut, using a more stringent threshold produces a small improvement in the functional similarity of links across most datasets compared to baseline link sets (Figure 16A, Wilcoxon p-value = 0.0073). However, the change in performance is very small. Using a more permissive statistical significance threshold reproduces the same trend but in the opposite direction (Figure 16B). The small change in performance when varying the statistical significance threshold may be due to the Bonferroni multiple test corrections made to the p-values, which can change the effect of this threshold depending on the number of unique genes in each dataset.
This issue can be understood by comparing the average link counts across different settings for the p-value threshold and the runs using different settings for the correlation distribution cut (Table 5). The change in the ‘weighted average’ score follows a similar pattern when the stringency of either correlation criterion is varied; however, the correlation distribution cut seems to have a larger effect on performance, showing a greater dip (-0.27) and rise (+0.14) than the p-value threshold (-0.024, +0.068).

Figure 15. Effect of statistical significance threshold on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the p-value threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis only use the p-value threshold to select coexpression links. P-values are shown at the top of each graph, produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) P-value threshold = 0.001 (B) P-value threshold = 0.1

3.6 Removing negative correlations

We produced sets of coexpression links for each dataset using only positive correlations and tracked the change in performance (Figure 16). A majority of datasets perform better using only positive correlations (58/80 datasets). Furthermore, this change in performance is statistically significant overall (Wilcoxon p-value = 6.5E-9). This new procedure obtained the largest improvement in its ‘weighted average’ score across all analyses run under the Pearson correlation metric (+0.28).

Figure 16. Effect of removing negative correlations on performance: The change in functional similarity as compared to the performance of baseline links when removing negatively correlated gene pairs.
This is plotted against the performance of the baseline links on the x-axis. The p-value shown at the top of the graph is produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance.

3.7 Filtering of expression data

Our baseline algorithm filters the expression data in an attempt to remove noisy signals that (hopefully) do not contribute functional information. It is possible that the filtering is non-optimal (either too stringent or too lenient). To explore this, we examined the impact of multiple filtering and additional procedures aimed at improving data quality. All links throughout these analyses were selected with the correlation distribution cut to keep the number of selected links relatively constant, so that we could focus on the individual impact of each filtering procedure on the resulting performance. Thus, 80 specialized baseline link sets were produced for this analysis using only the correlation distribution cut (1%) for link selection, keeping all other settings under baseline conditions.

We first examined a procedure which filtered out expression profiles of the data matrix which contained too many missing values; as shown earlier, datasets with many missing values tend to perform worse. The baseline filtering threshold for missing values filtered out expression profiles which had less than 30% of their samples present (not missing). We adjusted this threshold and produced new link sets which filtered out more profiles according to missing values by using alternate thresholds of 50% and 70%, while keeping all other settings under baseline conditions. This analysis applied to only the 28 datasets which had any missing values; most of these were two-color datasets.
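The missing value filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function name, the dictionary layout (probe identifier mapped to a list of sample values), and the use of None for a missing value are hypothetical.

```python
def filter_missing(profiles, min_present=0.30):
    """Keep profiles whose fraction of non-missing samples is at least min_present.

    profiles: dict mapping a probe identifier to a list of expression values,
              with None marking a missing value.
    min_present: 0.30 corresponds to the baseline setting in the text;
                 0.50 and 0.70 are the stricter alternatives.
    """
    kept = {}
    for probe, values in profiles.items():
        present = sum(value is not None for value in values)
        if present / len(values) >= min_present:
            kept[probe] = values
    return kept


# Invented profiles over four samples (not data from this thesis).
example_profiles = {
    "probe_1": [1.0, None, None, None],  # 25% present
    "probe_2": [1.0, 2.0, None, None],   # 50% present
    "probe_3": [1.0, 2.0, 3.0, 4.0],     # 100% present
}
baseline_kept = filter_missing(example_profiles, 0.30)
strict_kept = filter_missing(example_profiles, 0.70)
```

Raising `min_present` can only shrink the set of retained profiles, so each stricter setting yields a subset of the profiles kept at the baseline setting.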
When the missing value threshold is increased to 50% (Figure 17A), many datasets show a minimal change in their functional similarity scores; however, those that do display a considerable change improve in performance or at least do not get considerably worse. This indicates that perhaps this increase in the missing value threshold is not removing profiles with true biological signal. When the missing value threshold is increased to 70% (Figure 17B), a larger number of datasets show an improvement in performance than at 50% (Figure 17A). However, in both cases the improvements are not statistically significant (Wilcoxon p-values = 0.13, 0.095). When the missing values threshold is increased, the average link count among applicable datasets (Table 5) does not change greatly (1.395E6, 1.374E6, 1.334E6), possibly indicating that insignificant links with many missing values are replaced by links with a stronger functional relationship. The results show that the missing value threshold is useful for removing rows of the data matrix that can afford to be lost: raising it increases performance in many datasets while keeping the overall functional similarity relatively constant in others.

Figure 17. Effect of missing value filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the missing value filter threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis only use the correlation distribution cut to select coexpression links. Scoring is shown for only the 28 datasets which contained missing values in their expression data. P-values are shown at the top of each graph, produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance.
(A) Missing value filter threshold = 50% (B) Missing value filter threshold = 70%

We then examined the filtering procedure which removed low-variance rows of the expression data matrix. The baseline filtering threshold for low sample variance removes the 5% of expression profiles which have the lowest variation of expression between samples. We produced new link sets using alternate low variance thresholds of 0% and 15%, while keeping all other settings at baseline. With no low variance filtering (Figure 18A), there is no change in performance overall, as all points center around a ‘DELTA (mean GO shift)’ value of zero. Approximately equal numbers of datasets improve and get worse in performance. Increasing the low variance filter threshold to 15% (Figure 18B) improves the performance of certain datasets compared to baseline link sets while decreasing the performance of others, but overall there are no large changes in functional similarity compared to the effect of other algorithm parameters shown previously. Neither alternate threshold for low variance gives a statistically significant improvement over the baseline filter setting (Wilcoxon p-values = 0.62, 0.71). The changes in the ‘average’ (-0.0037, -0.0023) and ‘weighted average’ scores (-0.0018, -0.016) when varying the low variance filter threshold are quite small (Table 5). We conclude that within the range of settings we tested, filtering for low variance does not have a significant impact on the overall functional similarity of coexpression links, and thus should likely be omitted.

Figure 18. Effect of low variance filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the low variance filter threshold. This is plotted against the performance of the baseline links on the x-axis.
All link sets produced in this analysis only use the correlation distribution cut to select coexpression links. P-values are shown at the top of each graph, produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Low variance filter threshold = 0% (B) Low variance filter threshold = 15%

We then investigated the final filtering procedure, which removes rows of the expression data according to low expression levels. The baseline filtering threshold for low expression removes the 30% of expression profiles which have the lowest mean expression level across samples. We adjusted this threshold and produced new link sets using alternate low expression thresholds of 0% and 45%, while keeping all other algorithm procedures under baseline settings. Using no low expression filtering (Figure 19A) decreases performance for the majority of datasets compared to the baseline links. As expected for a decline, the test for improvement is not significant (Wilcoxon p-value = 1), indicating that bypassing low expression filtering is not a desirable option in coexpression analysis. While increasing the low expression threshold to 45% (Figure 19B) decreases the performance of some datasets, most improve overall compared to baseline link sets (Wilcoxon p-value = 0.00088).

Figure 19. Effect of low expression filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the low expression filter threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis only use the correlation distribution cut to select coexpression links. P-values are shown at the top of each graph, produced from using the Wilcoxon signed rank test. The black line with a slope of zero shows where points would lie if there was no change in performance.
(A) Low expression filter threshold = 0% (B) Low expression filter threshold = 45%

3.8 Additional normalization

We tested the effect on performance of two additional normalization methods applied to the expression datasets. We first produced coexpression link sets for each dataset which used SVD as an additional normalization method (Methods), keeping all other procedures under baseline settings. SVD normalization does not improve performance overall (Figure 20), as roughly equal numbers of datasets have positive and negative ‘DELTA (mean GO shift)’ scores. The SVD normalization method is not a statistically significant improvement in performance over the baseline standard (Wilcoxon p-value = 0.997). We observed that most of the datasets that showed improvement were of the two color microarray technology type. Most of these ‘two color’ datasets that improved with SVD normalization had weaker performance for their associated baseline links. It was shown earlier (Section 3.4) that the microarray design is related to performance, with two color array designs typically performing worse overall (Figure 11). In contrast, those datasets that performed well (mean GO shift > 0.5) under baseline conditions seem to get worse with SVD normalization, suggesting that in those cases the principal component of the data believed to be noise was composed of true biological variation in expression.

Figure 20. Effect of SVD normalization on performance: The change in functional similarity as compared to the performance of baseline links when using the SVD normalization method. This is plotted against the performance of the baseline links on the x-axis. The p-value shown at the top of the graph is produced from using the Wilcoxon signed rank test. Red points signify datasets with a two color microarray technology type and blue points signify datasets with a one color microarray technology type.
The black line with a slope of zero shows where points would lie if there was no change in performance.

The second method we tested was the Balance normalization method, which used similar procedures to SVD (Methods). We produced coexpression link sets for 4 datasets using the Balance normalization method (Methods), keeping all other algorithm parameters under baseline settings. Although the results shown below indicate that 3 out of the 4 datasets improve with Balance normalization (Table 6), the number of links obtained is reduced (approximately 98% fewer on average). This is reflected in the change of the ‘weighted average’ score of -0.0358 (Table 5).

Table 6. Effect of Balance normalization on performance

Dataset ID   Baseline # of links   Balance # of links   Baseline mean GO shift   Balance DELTA (mean GO shift)
GSE343       589788                7066                 -1.0309                  0.1556
GSE4988      54863                 7082                 -0.4392                  0.2961
GSE5808      10827                 212                  1.0547                   0.7026
GSE9006      541517                6134                 1.2617                   -0.0941

3.9 Log-transformation of expression values

Out of the 80 expression datasets, 52 datasets used a one color microarray technology type and of these, 6 datasets had log-transformed expression values while the others reported raw expression data values. All of the datasets using a two color microarray technology type had log-transformed expression values. What was not yet known was whether log-transformation affected the ability of the ‘one color’ datasets to yield biologically meaningful coexpression links. We investigated this by log-transforming the expression values of the 46 ‘one color’ datasets which originally had untransformed values and reversing the log-transformation of expression values for the 6 datasets which originally had log-transformed values. We then computed coexpression link sets for these transformed expression datasets while keeping all other algorithm settings under baseline conditions.
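A minimal sketch of the two transformations follows. The logarithm base is not stated in this section, so base 2 is assumed here; the function names and the use of None for missing values are likewise hypothetical. Non-positive values cannot be log-transformed, and the sketch marks them as missing, which matches how the Discussion describes the handling of zero and negative values in ‘one color’ data.

```python
import math


def log_transform(values):
    """Base-2 log of each value; missing or non-positive values become None."""
    return [
        math.log2(value) if value is not None and value > 0 else None
        for value in values
    ]


def reverse_log_transform(values):
    """Undo a base-2 log transform; missing values stay missing."""
    return [2.0 ** value if value is not None else None for value in values]


# Invented raw expression values for one profile (not data from this thesis).
raw_profile = [8.0, 1.0, 0.0, -2.0, None]
logged_profile = log_transform(raw_profile)
restored_profile = reverse_log_transform(logged_profile)
```

Note that the round trip is lossy: values that were zero or negative before the log transform come back as missing, so the transformed dataset may contain more missing values than the original.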
Reversing the log-transformation of expression values shows a decline in performance for 5 of the 6 applicable ‘one color’ datasets (Figure 21A). Log-transformation of the raw expression values shows an improvement in performance for the majority of the applicable ‘one color’ datasets (Figure 21B), with two showing a large increase in their mean GO shift scores (> 2.0). The change in the ‘weighted average’ scores for both types of transformations shows an improvement overall (0.11, 0.14), with the log-transformation of expression values scoring slightly higher (Table 5). However, the change in ‘average’ scores (Table 5) agrees with what is shown visually below (unlogged = -0.11, logged = 0.27).

Figure 21. Effect of log-transformation on performance: The change in the functional similarity (mean GO shift) is measured for links produced after log-transformation of expression values or after reversing log-transformation of expression values for the applicable sets using a one color microarray technology type (52 datasets). This change is plotted against the performance of all 52 associated baseline ‘one color’ link sets on the x-axis. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Reversing the log-transformation of expression values of 6 ‘one color’ datasets. (B) Log-transformation of expression values for 46 ‘one color’ datasets.

4 Discussion

This thesis provides information on the extrinsic and intrinsic factors that influence the ability of coexpression analysis to predict gene function. The results show that the preprocessing procedures and link selection methods used in coexpression analysis affect the biological significance of the coexpression links produced. We have shown that expression datasets that do not perform well compared to others tend to have similar dataset features.
The influence of these dataset features should be taken into account in order to make accurate predictions for gene function through coexpression analysis.

On average, larger expression datasets produced higher-quality coexpression links (Figure 8). This is expected, as larger expression datasets provide more opportunities to form coexpression links across all possible gene pairs (more probes), and higher sample numbers add statistical power to these connections. Poorly performing datasets were mostly associated with two color microarray designs (Figure 11). This could be due to a number of common technical errors involved in dye labeling or in hybridizing two samples together on the same microarray, which can affect the raw expression values. Treating two color expression data in a more specialized manner may correct for this noise. Finally, datasets which produced even slightly more positive correlations than negative correlations tended to perform better (Figure 12). In general, most of these dataset features cannot be controlled for in an experiment, and therefore it would be beneficial to weight the significance of the coexpression links based on the characteristics of each input dataset.

Removing negatively coexpressed gene pairs prior to link selection produced the most consistent improvement in link functional similarity across all datasets (Figure 16). This might explain why datasets that had more positive correlations performed better (Figure 12). Although removing negative correlations produced a consistent improvement in performance compared to the baseline standard, it does not necessarily mean that negatively coexpressed gene pairs do not capture significant biological relationships.
Performance in our algorithm was primarily measured using GO annotations, and the Gene Ontology is constructed in a way that relates genes involved in the same biological process or pathway; such genes are more likely to be positively correlated in expression. Negatively coexpressed gene pairs, which can involve one gene inhibiting the biological process of another gene, will produce lower performance scores by this standard. Because the improvement may reflect this bias, negatively coexpressed gene pairs should not be removed, and all correlated gene pairs should be kept before link selection.

The filtering procedures varied in their effects on performance. The missing value threshold, when increased, tends to improve performance (Figure 17), although the improvement was not statistically significant. Thus, even if missing values take up less than half of the samples in expression profiles, they tend to have a negative effect on the functional similarity of the coexpression links. Removing low variance filtering did not have a significant effect on performance (Figure 18A). Increasing the threshold did improve the performance of some datasets but made others worse, and in both cases only to a small degree (Figure 18B). Consequently, it does not seem beneficial to keep the low variance filter as a preprocessing procedure in coexpression analysis. Finally, the results for the “low expression filter” showed that it is important to performance, as disabling it reduced the functional similarity of many link sets overall (Figure 19A). Increasing the stringency of the low expression filter well over the baseline setting did not produce a consistent improvement in performance across all datasets (Figure 19B). Filtering the expression data by low mean expression level follows the assumption that genes expressed at low values across various experimental conditions are likely the result of experimental noise on the microarray. However, it is still possible for a gene pair to be significantly coexpressed at low average expression levels.
Therefore, while the results suggest that keeping a low expression filter increases the number of functionally similar coexpression links, it need not be increased over the baseline setting of 30%.

Of the normalization methods tested, SVD normalization produced the most significant improvement in performance without greatly reducing the number of links obtained. While not consistently seen across all 80 datasets, SVD seemed to work best for datasets using a two color microarray design (Figure 20). It follows that SVD normalization seems to improve the datasets that mostly performed worse under baseline conditions, which as seen earlier were generally associated with a two color microarray design (Figure 11). This suggests that ‘two color’ datasets may typically contain an element of experimental noise which is mostly contained within the first principal component of the data, as SVD normalization is able to remove its effect on performance.

Log-transformation of the one color expression data resulted in a majority of the datasets improving in performance (Figure 21B). Reversing the log-transformation of one color expression data mostly reduced performance overall (Figure 21A). In addition, the performance improvement shown by using log-transformation was demonstrated across a larger number of datasets (46 datasets) and consequently found to be statistically significant overall (Wilcoxon p-value = 0.0016).

We predicted that, in general, using more stringent correlation thresholds (independent of correlation sign) would increase functional similarity. Indeed, making the thresholds more stringent produces fewer links with overall higher performance, and making thresholds less stringent produces more links but with overall lower performance compared to baseline.
Neither correlation threshold criterion used alone under baseline algorithm settings produces better “weighted average” or better “average” scores than the combination of the two under regular baseline settings (Table 5). Therefore the combination of choosing links by both statistical significance and correlation magnitude is still the better scheme of coexpression link selection. However, the results showed that the correlation distribution cut has a greater impact on performance overall (Section 3.5). In addition, the change in the ‘weighted average’ scores suggests that making the correlation distribution cut more stringent (0.1%) improves performance to a greater degree (DELTA = 0.14) than increasing the stringency (0.001) of the p-value threshold (DELTA = 0.0068) when compared to the performance under baseline settings (Table 5).

We relied on the Gene Ontology annotations as the primary standard to measure similarity in gene function in this thesis, because of their high coverage (18,512 human genes with GO annotations). The fact that the GO measure of functional similarity showed reasonable agreement with the other, lower-coverage metrics (Figure 4) suggests that our findings are not simply a function of particularities of GO annotations, though clearly there is overlap between the standards. This interdependence is one reason we introduced the MitoCarta standard: it is based on the observation in numerous studies that genes involved in mitochondrial function tend to be coexpressed [37, 38]. Thus recovery of links among MitoCarta genes is a sign of the quality of coexpression analysis, though not of functional prediction per se.

Our results suggest that the settings of the baseline algorithm could be adjusted to improve performance. A set of parameters that we predict will result in better performance (without drastically reducing the number of predictions made) is summarized in Table 7.
For several ‘one color’ datasets there are expression profiles that sometimes contain negative expression values or values equal to zero due to MAS5 data adjustment procedures. With our log-transformation procedure all these entries will be treated as missing values and this, coupled with an increase in the missing values threshold, could remove a potentially significant number of probes before any filtering according to low expression. Therefore, since we do not want to filter out expression profiles of low average expression level that may still produce useful functional predictions, we might want to decrease the stringency of the low expression filter (Table 7). The results on the correlation distribution cut showed that making it more stringent (0.1%) produces a significant improvement in performance. However, it also reduces the link counts on average (Table 5). To help ensure that useful coexpression links are retained in this process, we suggest that the correlation distribution cut should be set to 0.5%, while still being coupled with the p-value threshold. While the summarized performance scores under the Spearman correlation metric do show some improvement over the use of the Pearson correlation metric (Table 5) in some analyses (SVD normalization, Baseline with p-value cutoff), they also have a larger standard deviation in their ‘average’ scores, due mostly to a reduction in average link counts. Thus we see little to be gained by switching to the rank correlation coefficient. Experiments running this new set of algorithm settings, or slight variations of it, on a collection of expression datasets are ongoing and will help determine the most favorable procedures and associated settings to implement in order to extract the highest number of biologically meaningful coexpression links from an expression dataset.

Table 7.
New algorithm settings to achieve better performance Parameter Setting Correlation metric Pearson correlation coefficient Missing values filter 70% of samples must be present Low variance filter 0% lowest removed Low expression filter 20-30% lowest removed P-value cutoff ≥ 0.01 removed Correlation distribution cut ≤ highest 0.5% of correlations removed SVD normalization Yes (only for ‘two color’ datasets) Log-transformation Yes (only for ‘one color’ datasets that are not log- transformed already) Removing negative correlations No   55  5 Conclusion  This thesis has uncovered valuable information on how the accuracy of gene functional predictions can be affected by the preprocessing procedures and link selection methods used in coexpression analysis. This thesis also demonstrates that the various features of any expression dataset including data quality can affect the biological significance of the coexpression links produced through this type of analysis. There is still a great deal of information that can be discovered with further experiments using procedures or methods not investigated in this thesis. We did not evaluate the effect of additional correlation metrics on performance such as the Euclidean distance metric and the mutual information measure. The effect of choosing a link selection method based on coexpression network properties was not evaluated in this thesis and could provide interesting information affecting performance. This thesis may aid researchers in further analyses involving the detection of highly coexpressed gene pairs of biological significance.  56  6 Bibliography 1. Wren, J.D., A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide. Bioinformatics, 2009. 25(13): p. 1694-701. 2. Cho, R.J., et al., A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 1998. 2(1): p. 65-73. 3. 
Stuart, J.M., et al., A gene-coexpression network for global discovery of conserved genetic modules. Science, 2003. 302(5643): p. 249-55. 4. Tavazoie, S., et al., Systematic determination of genetic network architecture. Nat Genet, 1999. 22(3): p. 281-5. 5. Reverter, A., et al., A gene coexpression network for bovine skeletal muscle inferred from microarray data. Physiol Genomics, 2006. 28(1): p. 76-83. 6. Zhou, X., M.C. Kao, and W.H. Wong, Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci U S A, 2002. 99(20): p. 12783-8. 7. Lee, H.K., et al., Coexpression analysis of human genes across many microarray data sets. Genome Res, 2004. 14(6): p. 1085-94. 8. Voy, B.H., et al., Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol, 2006. 2(7): p. e89. 9. Carter, S.L., et al., Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 2004. 20(14): p. 2242-50. 10. Mao, L., et al., Arabidopsis gene co-expression network and its functional modules. BMC Bioinformatics, 2009. 10(1): p. 346. 11. Elo, L.L., et al., Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics, 2007. 23(16): p. 2096-103. 12. Eisen, M.B., et al., Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 1998. 95(25): p. 14863-8. 13. Butte, A.J. and I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 2000: p. 418-29. 14. Prieto, C., et al., Human gene coexpression landscape: confident network derived from tissue transcriptomic profiles. PLoS One, 2008. 3(12): p. e3911. 15. Wolfe, C.J., I.S. Kohane, and A.J. Butte, Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics, 2005.
6: p. 227. 16. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9. 17. Tu, Y., G. Stolovitzky, and U. Klein, Quantitative noise analysis for gene expression microarray experiments. Proc Natl Acad Sci U S A, 2002. 99(22): p. 14031-6. 18. Quackenbush, J., Microarray data normalization and transformation. Nat Genet, 2002. 32 Suppl: p. 496-501. 19. Alter, O., P.O. Brown, and D. Botstein, Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A, 2000. 97(18): p. 10101-6. 20. Hibbs, M.A., et al., Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007. 23(20): p. 2692-9. 21. Breitling, R., Biological microarray interpretation: the rules of engagement. Biochim Biophys Acta, 2006. 1759(7): p. 319-27. 22. Hackstadt, A.J. and A.M. Hess, Filtering for increased power for microarray data analysis. BMC Bioinformatics, 2009. 10: p. 11. 23. Edgar, R., M. Domrachev, and A.E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 2002. 30(1): p. 207-10. 24. Allander, S.V., et al., Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Res, 2001. 61(24): p. 8624-8. 25. Luo, J., et al., Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Res, 2001. 61(12): p. 4683-8. 26. Ma, X.J., et al., Gene expression profiles of human breast cancer progression. Proc Natl Acad Sci U S A, 2003. 100(10): p. 5974-9. 27. Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 2001. 7(6): p. 673-9. 28. Jazaeri, A.A., et al., Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers.
J Natl Cancer Inst, 2002. 94(13): p. 990-1000. 29. Dhanasekaran, S.M., et al., Delineation of prognostic biomarkers in prostate cancer. Nature, 2001. 412(6849): p. 822-6. 30. van 't Veer, L.J., et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002. 415(6871): p. 530-6. 31. Ramaswamy, S., et al., Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A, 2001. 98(26): p. 15149-54. 32. Best, D.J. and D.E. Roberts, Algorithm AS 89: The upper tail probabilities of Spearman's rho. Journal of the Royal Statistical Society, Series C (Applied Statistics), 1975. 24(3): p. 377-379. 33. Mistry, M. and P. Pavlidis, Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 2008. 9: p. 327. 34. Lynn, D.J., et al., InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol, 2008. 4: p. 218. 35. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30. 36. Pagliarini, D.J., et al., A mitochondrial protein compendium elucidates complex I disease biology. Cell, 2008. 134(1): p. 112-23. 37. Hibbs, M.A., et al., Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol, 2009. 5(3): p. e1000322. 38. Mootha, V.K., et al., Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A, 2003. 100(2): p. 605-10.  
