UBC Theses and Dissertations

Evaluating coexpression analysis for gene function prediction Lotay, Vaneet Singh 2009


EVALUATING COEXPRESSION ANALYSIS FOR GENE FUNCTION PREDICTION

by

VANEET SINGH LOTAY
B. Sc., University of Manitoba, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2009

© Vaneet Singh Lotay, 2009

Abstract

Microarray expression data sets vary in size, data quality and other features, but most methods for selecting coexpressed gene pairs use a ‘one size fits all’ approach. Many different procedures have been used to select coexpressed gene pairs of high functional similarity from an expression dataset. However, it is not clear which procedure performs best, as few studies report comparisons of these approaches. The goal of this thesis is to develop a set of “best practices” for selecting coexpression links of high functional similarity from an expression dataset, along with methods for identifying datasets likely to yield poor information. With these goals, we hope to improve the quality of gene function predictions produced by coexpression analysis. Using 80 human expression datasets, we examined the impact of different thresholds, correlation metrics, expression data filtering and transformation procedures on performance in functional prediction. We also investigated the relationship between data quality and other features of expression datasets and their performance in functional prediction. We used the annotations of the Gene Ontology as a primary metric to measure similarity in gene function, and employed additional functional metrics for validation. Our results show that several dataset features have a greater influence on the performance in functional prediction than others. Expression datasets which produce coexpressed gene pairs of poor functional quality can be identified by a similar set of data features.
Some procedures used in coexpression analysis have a negligible effect on the quality of functional predictions while others are essential to achieving the best performance in the algorithm. We also find that some procedures interact greatly with features of expression datasets and that these interactions increase the number of high quality coexpressed gene pairs retrieved through coexpression analysis. This thesis uncovers important information on the many intrinsic and extrinsic factors that influence the performance in functional prediction of coexpression analysis. The information summarized here will help guide future studies using coexpression analysis and improve the quality of gene function predictions.

Table of contents

Abstract
List of tables
List of figures
Acknowledgements
Dedication
1  Introduction
    1.1  Predicting gene function
    1.2  Coexpression link selection
    1.3  Expression dataset features
    1.4  Evaluation of coexpression links
    1.5  Preprocessing expression data
2  Methods
    2.1  Expression data
    2.2  Dataset features
    2.3  Baseline algorithm settings
    2.4  Filtering
    2.5  Additional normalization
    2.6  Log-transformation
    2.7  Coexpression link selection
    2.8  Functional similarity metric
    2.9  Evaluating performance of alternate algorithm settings
    2.10  Protein to protein interactions
    2.11  KEGG pathways
    2.12  MitoCarta genes
3  Results
    3.1  Summary of performance statistics
    3.2  Baseline coexpression settings
    3.3  Expression correlation metrics
    3.4  Dataset features
    3.5  Configuring link selection
    3.6  Removing negative correlations
    3.7  Filtering of expression data
    3.8  Additional normalization
    3.9  Log-transformation of expression values
4  Discussion
5  Conclusion
6  Bibliography

List of tables

Table 1. Some previous coexpression studies
Table 2. Expression dataset features part 1
Table 3. Expression dataset features part 2
Table 4. Baseline algorithm settings
Table 5. Gene Ontology functional similarity statistics
Table 6. Effect of Balance normalization on performance
Table 7. New algorithm settings to achieve better performance

List of figures

Figure 1. Correlation magnitude density curves
Figure 2. Coexpression algorithm procedures
Figure 3. Performance of baseline coexpression links
Figure 4. Comparing standards for gene functional similarity
Figure 5. Effect of correlation metric on performance
Figure 6. Effect of number of samples on performance
Figure 7. Effect of number of probes on performance
Figure 8. Effect of dataset size on performance
Figure 9. Effect of missing values proportion on performance
Figure 10. Effect of PC1 variance proportion on performance
Figure 11. Effect of microarray technology type on performance
Figure 12. Effect of median probe-probe correlation on performance
Figure 13. Effect of 90th quantile probe-probe correlation on performance
Figure 14. Effect of number of tissue groups investigated on performance
Figure 15. Effect of correlation distribution cut on performance
Figure 15. Effect of statistical significance threshold on performance
Figure 16. Effect of removing negative correlations on performance
Figure 17. Effect of missing value filtering on performance
Figure 18. Effect of low variance filtering on performance
Figure 19. Effect of low expression filtering on performance
Figure 20. Effect of SVD normalization on performance
Figure 21. Effect of log-transformation on performance

Acknowledgements

I wish to offer my gratitude, first and foremost to my research supervisor, Dr. Paul Pavlidis. Thank you for your faith in my abilities and constant drive to push me to new levels as a scientific researcher and writer. I sincerely appreciate your willingness and availability to discuss research related matters whenever possible. It has been a pleasure to work under your guidance and support throughout my degree.

I wish to offer my thanks to the other lab members at the Centre for High Throughput Biology lab where my research was conducted.
Thank you for providing assistance and support throughout my work and allowing me to work in a friendly and enjoyable environment.

I wish to offer my thanks to my thesis committee members, Dr. Jenny Bryan and Dr. Wyeth Wasserman. Thank you for taking the time to give me guidance and feedback throughout the course of my research project.

I wish to offer my thanks to the Bioinformatics program coordinator, Sharon Ruschkowski, for her assistance in helping me enter the Bioinformatics program and for providing guidance whenever needed throughout my degree.

Finally, I would like to give a special thanks to my family, most of all my parents, for providing encouragement and support throughout my studies.

Dedication

To my parents

1  Introduction

1.1  Predicting gene function

Determining the function of all human genes is a major goal of post-genomic biomedical research. Approximately 37% (9334) of human genes have no publications documenting their function [1]. Continuing advances in genomic sequencing discover new hypothetical genes and proteins whose functions remain unknown. With an ever-growing amount of publicly annotated biological data, researchers continue to search for a reliable prediction method that will associate genes with known functions to genes of unknown function.

Using gene expression data, the unknown functions of genes can be predicted through coexpression analysis. Two genes whose expression profiles are highly correlated are more likely to have related biological functions than two genes whose profiles are not [2-4]. In expression data studies, it is not uncommon for some of these highly correlated pairings to involve one gene with an as yet unclassified biological function. The unknown function of such a gene can then be predicted from a partner gene with a well-known function using the guilt-by-association concept.
1.2  Coexpression link selection

Genes with highly correlated expression profiles are referred to as ‘coexpressed’. In expression data analyses, the expression of many genes is studied, so there are many possible gene pairings to consider as ‘coexpressed’ given all their expression profiles. Consequently, it is practical to select the most significant pairings, which are the most likely to share related biological functions. These selected pairs are what we term ‘coexpression links’. We believe that the procedures involved in selecting these coexpression links greatly affect their ability to predict gene function.

Coexpression studies in the past have used a variety of correlation metrics and coexpression link selection procedures to extract the most significant relationships as coexpression links in expression data (Table 1). One common method is to apply a constant threshold to the absolute value of the correlation, so that only the highest positive and negative correlations between expression profiles are selected as coexpression links [5, 6]. This method relies on the strength of the link's correlation for selection and will extract a different number of coexpression links for each expression dataset, depending on the overall distribution of correlations among all expression profiles. A similar approach uses a percentage as a threshold to ensure that the same proportion of correlated gene pairs is selected as links, to account for varying correlation magnitude distributions [7]. In other studies, links are selected based on statistical significance by controlling false positive error rates under the t-distribution [8, 9]. In this approach, gene pairs with correlated expression profiles are evaluated individually by the statistical likelihood of their association and are selected as coexpression links if they pass a P-value cutoff. The connectivity of nodes in a gene coexpression network is also employed as a link selection method.
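The threshold-based selection rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this thesis or in any of the cited studies: the function names, the toy `profiles` data and the cutoff values are all hypothetical, and the P-value step is reduced to the t statistic that would be compared against a t-distribution with n - 2 degrees of freedom.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def all_pair_correlations(profiles):
    """Map every unordered gene pair to its correlation ({gene: [values]} input)."""
    return {
        (g1, g2): pearson(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


def select_by_constant_threshold(corrs, cutoff):
    """Keep pairs whose absolute correlation reaches a fixed cutoff."""
    return {pair for pair, r in corrs.items() if abs(r) >= cutoff}


def select_by_top_fraction(corrs, fraction):
    """Keep the same proportion of pairs in every dataset, ranked by |r|."""
    ranked = sorted(corrs, key=lambda pair: abs(corrs[pair]), reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return set(ranked[:keep])


def t_statistic(r, n_samples):
    """t statistic for the null hypothesis r = 0 (n_samples - 2 degrees of freedom)."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n_samples - 2) / (1.0 - r * r))


# Hypothetical toy data: four genes measured in five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.1, 5.9, 8.2, 9.8],
    "geneC": [5.0, 3.9, 3.1, 2.0, 1.2],
    "geneD": [2.0, 5.0, 1.0, 4.0, 3.0],
}
corrs = all_pair_correlations(profiles)
links_constant = select_by_constant_threshold(corrs, cutoff=0.9)
links_fraction = select_by_top_fraction(corrs, fraction=0.5)
```

With a constant cutoff the number of links depends on each dataset's correlation distribution, whereas the fraction-based rule always returns the same share of pairs; this is the contrast drawn in the text.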
In a gene coexpression network, only gene pairs whose expression profile correlation satisfies a chosen (usually conservative) threshold, possibly from one of the methods mentioned previously, are connected in the original network. In some cases, the stringency of the correlation threshold is then increased, thereby pruning the coexpression graph, until certain properties of the network, such as the average clustering coefficient, are optimized [10, 11]. This approach aims to retain genes that are highly connected, in terms of coexpressed neighbors, in gene coexpression networks. A meta-analysis approach aims to detect coexpression between the same pair of genes in multiple expression data studies [7]. A high expression profile correlation between the same pair of genes found in different expression data studies is often not coincidental; it not only further validates their coexpression but also improves their chances of having related biological functions.

Table 1. Some previous coexpression studies
Reference | Correlation metric | Coexpression Link Selection Criteria
Eisen, et al. (1998) [12] | Gene similarity metric similar to Pearson correlation coefficient | none (all genes included in functional enrichment analysis)
Butte, et al. (2000) [13] | Entropy and mutual information measures | mutual information threshold
Stuart, et al. (2003) [3] | Pearson correlation coefficient | P-value cutoff (Bonferroni multiple test corrected)
Tavazoie, et al. (1999) [4] | Euclidean distance metric | none (all genes included in functional enrichment analysis)
Reverter, et al. (2006) [5] | Pearson correlation coefficient | constant correlation threshold
Zhou, et al. (2002) [6] | Pearson correlation coefficient | constant correlation threshold with leave-one-out cross validation
Voy, et al. (2006) [8] | Pearson correlation coefficient (chosen) and Spearman correlation coefficient | constant correlation threshold
Carter, et al. (2004) [9] | Pearson correlation coefficient | P-value cutoff (uncorrected for multiple comparisons)
Elo, et al. (2007) [11] | Pearson correlation coefficient | clustering coefficient based correlation threshold
Mao, et al. (2009) [10] | Pearson correlation coefficient | clustering coefficient based correlation threshold
Lee, et al. (2004) [7] | Pearson correlation coefficient | constant correlation threshold % and p-value cutoff (Bonferroni multiple test corrected)
Prieto, et al. (2008) [14] | Pearson and Spearman correlation coefficients | constant correlation threshold with 25% cross validation
Wolfe, et al. (2005) [15] | Pearson correlation coefficient | P-value cutoff (corrected for multiple tests)

While there are many approaches for selecting coexpression links (Table 1), it is far from clear whether some methods are better than others, because there are few reports comparing the approaches. In this thesis, we carry out multiple methods for link selection. We also investigate, independently, the effect of each link selection method on the accuracy with which coexpression links predict gene function.

1.3  Expression dataset features

We hypothesize that coexpression analysis of some expression data sets will provide better functional predictions than that of other expression data sets. It may not be feasible for some data sets to produce biologically meaningful coexpression links, or the link selection criteria may have to change to accommodate these types of expression datasets. For this thesis, an expression dataset or expression data study is a group of microarray expression profiles which are collected together and described in a single publication [7]. These studies analyze expression profiles obtained from a data matrix that represents the RNA levels of genes across a set of experimental conditions (different samples). The experimental conditions can represent a gene's expression measured in different tissue samples, at different time points or under different conditions (e.g. drug treatments). While the columns identify the different expression samples, the rows of the matrix identify the different probes used on the microarray. In addition, the values in the data matrix could be raw expression intensities, expression ratios or possibly log-transformed values.

Expression data sets vary in numerous ways (e.g., choice of platform), and additional variability is introduced by decisions made during analysis. We hypothesize that the quality of gene function predictions is related to these properties and decisions. In this thesis, we examine 80 human expression data sets with varying properties. We examine the influence of these properties on coexpression analysis and, consequently, on the ability to predict gene function.

1.4  Evaluation of coexpression links

To assess the prediction power of the selected coexpression links, they must be evaluated with a metric for gene functional similarity. In previous coexpression studies, various databases have been used to classify human gene function. Biological pathway annotations of the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG) or similar pathway databases are used, defining “functionally related” as “in the same pathway” [3, 14]. Annotations from the Gene Ontology database [16] can also be used as a definition of function [3, 5-9, 11]. A third definition of gene function is provided by using protein interactions as the operational characteristic. In another study, the results of research papers were used to relate gene function to more specific classes such as ‘housekeeping genes’ [14]. Because the Gene Ontology contains the best coverage of gene functional annotations across the human genome, we used it as the primary metric for evaluating the functional similarity of the coexpression links. We also employ additional functional metrics to validate the annotations of the Gene Ontology (Methods).
1.5  Preprocessing expression data

There are many steps in preparing the microarray before expression data can be analyzed: affixing the probes to the microarray, hybridizing the RNA samples to the probes, and capturing the image of fluorescence levels as expression intensities. Technical variation in these steps is a source of noise. Controlling for these sources of noise, and correcting for them where possible, is important for identifying real biological signals in the RNA levels (differences between cell types and tissues) [17]. The steps used to adjust the data to remove noisy signals or systematic errors are grouped together as “preprocessing”. One aim of our study is to investigate the impact preprocessing procedures have on the quality of gene function predictions from coexpression links.

One of the first preprocessing steps is normalization, which attempts to remove systematic errors in signal levels [18]. One type of normalization used for expression data is singular value decomposition (SVD) to remove components of the data identified as artifactual [19]. Another method employs a modified approach of SVD to “equalize” sources of variance (including biological sources) [20]. Some normalization methods are specific to a particular type of array design while others are generic [21].

A second major preprocessing step is probe filtering, which removes expression profiles from the expression data matrix if they do not satisfy certain criteria. For example, filtering can be done to remove genes with low expression levels [7, 14, 22], low expression variance between samples [7, 14, 22] and too many missing expression values [7]. While one goal of such filtering is to remove noisy or irrelevant data, it also serves to reduce the number of subsequent comparisons performed and increases statistical power [22], at the possible cost of losing information.
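As a rough illustration of the three probe filtering criteria named above (missing values, low expression, low variance), the following Python sketch applies them to a small matrix. It is not the filtering procedure of this thesis: the function name `filter_probes`, the use of `None` for a missing value, the toy data and all cutoff values are assumptions made for the example.

```python
from statistics import mean, pvariance


def filter_probes(matrix, max_missing=0.3, min_mean=None, min_variance=None):
    """Return the probes that pass every enabled filter.

    matrix maps a probe name to its expression profile, one value per sample,
    with None marking a missing value. All cutoffs are hypothetical.
    """
    kept = {}
    for probe, profile in matrix.items():
        values = [v for v in profile if v is not None]
        missing_fraction = 1 - len(values) / len(profile)
        if not values or missing_fraction > max_missing:
            continue  # too many missing expression values
        if min_mean is not None and mean(values) < min_mean:
            continue  # low expression level
        if min_variance is not None and pvariance(values) < min_variance:
            continue  # low expression variance between samples
        kept[probe] = profile
    return kept


# Hypothetical toy matrix: four probes measured in four samples.
matrix = {
    "probe1": [8.0, 9.5, 7.2, 10.1],    # passes all filters
    "probe2": [8.0, None, None, 9.0],   # half of the values are missing
    "probe3": [1.0, 1.2, 0.9, 1.1],     # low mean expression
    "probe4": [6.0, 6.0, 6.01, 6.0],    # almost no variance
}
filtered = filter_probes(matrix, max_missing=0.3, min_mean=2.0, min_variance=0.01)
```

Each filter shrinks the number of probe pairs that must later be correlated, which is the trade-off between statistical power and information loss noted in the text.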
Filtering and normalization methods have not always been evaluated for their direct effect on producing biologically meaningful coexpression links.

2  Methods

2.1  Expression data

Eighty human microarray expression datasets were used in the study, with a total of 6732 samples. Most of these expression datasets were obtained from the Gene Expression Omnibus database (GEO) [23], while others were obtained from individual publications (Tables 2 and 3). All expression datasets were loaded and organized into the Gemma database (http://www.chibi.ubc.ca/Gemma), which we use in our lab. Expression data values were left as retrieved from their source databases before any further analysis. More detailed information about the expression datasets, such as the name of the microarray design used, can be found on the Gemma website using the Dataset ID.

Table 2. Expression dataset features part 1
Dataset ID  Data Source  Samples  Probes  Dataset Size  Array Type  Tissue Groups
GSE420  GEO  10  10,125  101,250  one color  single
GSE685  GEO  9  17,900  161,100  one color  single
GSE833  GEO  11  5,390  59,290  one color  single
GSE8514  GEO  15  38,844  582,660  one color  single
GSE5808  GEO  18  17,900  322,200  one color  single
GSE2377  GEO  48  10,125  486,000  one color  single
GSE7509  GEO  26  38,844  1,009,944  one color  single
GSE2164  GEO  87  10,125  880,875  one color  single
GSE6008  GEO  103  17,900  1,843,700  one color  single
GSE1397  GEO  28  17,900  501,200  one color  multiple
GSE8479  GEO  65  17,663  1,148,095  one color  single
GSE9006  GEO  117  17,900  2,094,300  one color  single
GSE8586  GEO  54  38,844  2,097,576  one color  single
GSE6536  GEO  210  26,322  5,527,620  one color  single
GSE1037  GEO  91  29,553  2,689,323  two color  single
GSE8121  GEO  75  38,844  2,913,300  one color  single
GSE3526  GEO  353  38,844  13,711,932  one color  multiple
GSE8052  GEO  404  38,844  15,692,976  one color  single
GSE7307  GEO  677  38,844  26,297,388  one color  multiple
GSE8507  GEO  141  38,844  5,477,004  one color  single
allander-gist  Allander, et al. (2001) [24]  19  1,791  34,029  two color  single
GSE58  GEO  12  8,484  101,808  two color  single
GSE53  GEO  26  4,506  117,156  two color  single
luo-prostate  Luo, et al. (2001) [25]  25  5,725  143,125  two color  single
ma-breast  Ma, et al. (2003) [26]  61  1,788  109,068  two color  single
khan-bluecell  Khan, et al. (2001) [27]  88  2,097  184,536  two color  single
GSE59  GEO  68  8,484  576,912  two color  multiple
GSE3211  GEO  35  13,954  488,390  two color  single
GSE4988  GEO  20  9,913  198,260  two color  single
GSE6818  GEO  39  26,322  1,026,558  one color  single
GSE4058  GEO  38  25,493  968,734  two color  single
jazaeri-breast  Jazaeri, et al. (2002) [28]  61  5,530  337,330  two color  single
GSE3398  GEO  73  25,493  1,860,989  two color  single
dhanasekaranprstate  Dhanasekaran, et al. (2001) [29]  53  8,591  455,323  two color  single
GSE3023  GEO  38  45,062  1,712,356  two color  multiple
GSE343  GEO  66  21,587  1,424,742  two color  single
vantveer-breast  Van't Veer, et al. (2002) [30]  117  12,782  1,495,494  two color  single
GSE4007  GEO  126  34,195  4,308,570  two color  single
GSE3630  GEO  68  29,957  2,037,076  two color  single
GSE3497  GEO  114  33,905  3,865,170  two color  single
ramaswamy-cancer  Ramaswamy, et al. (2001) [31]  280  6,216  1,740,480  one color  multiple
GSE7390  GEO  198  17,900  3,544,200  one color  single
GSE96  GEO  85  10,125  860,625  one color  multiple
GSE10006  GEO  87  38,844  3,379,428  one color  single
GSE60  GEO  133  14,015  1,863,995  two color  single
GSE8218  GEO  148  17,900  2,649,200  one color  single
GSE511  GEO  30  5,390  161,700  one color  single
GSE12643  GEO  20  10,125  202,500  one color  single
GSE4381  GEO  29  11,625  337,125  two color  single
GSE8919  GEO  193  17,663  3,408,959  one color  single
GSE11299  GEO  12  25,268  303,216  one color  single
GSE8481  GEO  63  17,900  1,127,700  one color  single
GSE137  GEO  70  9,001  630,070  two color  single
GSE6791  GEO  84  38,844  3,262,896  one color  multiple
GSE61  GEO  84  7,897  663,348  two color  single
GSE2361  GEO  36  17,900  644,400  one color  multiple
GSE8441  GEO  22  17,900  393,800  one color  single
GSE15219  GEO  16  18,074  289,184  one color  single
GSE4108  GEO  58  7,400  429,200  two color  single
GSE5307  GEO  33  24,929  822,657  two color  single
GSE11882  GEO  172  38,844  6,681,168  one color  single
GSE7638  GEO  160  17,900  2,864,000  one color  single
GSE755  GEO  173  10,125  1,751,625  one color  single
GSE12649  GEO  102  17,900  1,825,800  one color  single
GSE13879  GEO  24  16,397  393,528  one color  single
GSE12669  GEO  40  19,521  780,840  one color  single
GSE8325  GEO  12  17,012  204,144  two color  single
GSE9164  GEO  14  36,286  508,004  two color  single
GSE412  GEO  120  10,125  1,215,000  one color  single
GSE6613  GEO  105  17,900  1,879,500  one color  single
GSE5107  GEO  83  17,900  1,485,700  one color  single
GSE7023  GEO  47  14,016  658,752  one color  single
GSE89  GEO  40  5,390  215,600  one color  single
GSE7880  GEO  43  7,607  327,101  one color  single
GSE7849  GEO  78  10,125  789,750  one color  single
GSE994  GEO  75  17,900  1,342,500  one color  single
GSE90  GEO  12  5,390  64,680  one color  single
GSE361  GEO  20  5,390  107,800  one color  single
GSE88  GEO  31  5,390  167,090  one color  single
GSE443  GEO  11  10,125  111,375  one color  single

Table 3. Expression dataset features part 2
Dataset ID  Median correlation  90th quantile correlation  Missing value proportion  PC1 variance proportion  Log-transformed
GSE420  0.0098  0.54  0  0.93  no
GSE685  0.0059  0.58  0  0.98  no
GSE833  0.026  0.6  0  0.85  no
GSE8514  0.0049  0.45  0  0.89  no
GSE5808  0.0088  0.47  0  0.99  yes
GSE2377  -0.002  0.27  0  0.85  no
GSE7509  0.016  0.38  0  0.79  no
GSE2164  0.55  0.83  0.0006  0.86  no
GSE6008  -0.002  0.21  0  0.99  yes
GSE1397  0.013  0.45  0  0.78  no
GSE8479  0.46  0.68  0  0.97  no
GSE9006  0.0068  0.23  0  0.95  no
GSE8586  0.00098  0.31  0  0.88  no
GSE6536  0.32  0.6  0  0.99  yes
GSE1037  0.48  0.7  0.0049  0.32  yes
GSE8121  0.0088  0.52  0  0.72  no
GSE3526  -0.014  0.32  0.0028  0.71  no
GSE8052  -0.0049  0.53  0  0.99  no
GSE7307  0.25  0.62  0  0.61  no
GSE8507  0.015  0.45  0  0.82  no
allander-gist  0.05  0.45  0  0.77  yes
GSE58  0.13  0.77  0.13  0.68  yes
GSE53  0.18  0.71  0.41  0.74  yes
luo-prostate  0.057  0.4  0.00044  0.24  yes
ma-breast  -0.00098  0.28  0.0029  0.53  yes
khan-bluecell  0.039  0.33  0  0.66  yes
GSE59  0.022  0.27  0.21  0.25  yes
GSE3211  0.043  0.41  0.18  0.75  yes
GSE4988  0  0.39  0  0.2  yes
GSE6818  0  0.24  0  0.98  no
GSE4058  0.0039  0.36  0.13  0.74  yes
jazaeri-breast  -0.0029  0.31  0.000032  0.67  yes
GSE3398  0.013  0.26  0.028  0.36  yes
dhanasekaranprstate  -0.0039  0.38  0.082  0.37  yes
GSE3023  0.00098  0.33  0.017  0.53  yes
GSE343  -0.0059  0.38  0.076  0.53  yes
vantveer-breast  -0.0029  0.23  0.0017  0.17  yes
GSE4007  0.019  0.29  0.1  0.57  yes
GSE3630  0.017  0.35  0.012  0.67  yes
GSE3497  0.03  0.35  0.19  0.37  yes
ramaswamy-cancer  0.24  0.57  0  0.66  no
GSE7390  0.0039  0.2  0  0.99  no
GSE96  -0.026  0.32  0  0.66  no
GSE10006  0.016  0.27  0  0.91  no
GSE60  0.027  0.32  0.37  0.33  yes
GSE8218  -0.0088  0.37  0  0.99  yes
GSE511  0.024  0.41  0  0.91  no
GSE12643  0.002  0.47  0  0.99  yes
GSE4381  0.00098  0.37  0.019  0.63  yes
GSE8919  0.021  0.31  0  0.92  no
GSE11299  0.39  0.77  0  0.96  no
GSE8481  0.17  0.5  0  0.75  no
GSE137  0.21  0.47  0.08  0.24  yes
GSE6791  -0.0059  0.42  0  0.99  no
GSE61  0.023  0.31  0.16  0.42  yes
GSE2361  -0.021  0.33  0  0.62  no
GSE8441  0.21  0.52  0  0.95  no
GSE15219  -0.023  0.53  0  0.9  no
GSE4108  0.4  0.69  0.18  0.36  yes
GSE5307  -0.002  0.35  0.051  0.4  yes
GSE11882  0.002  0.37  0  0.38  no
GSE7638  0.002  0.44  0  0.99  no
GSE755  -0.002  0.18  0  0.87  no
GSE12649  0.0059  0.34  0  0.93  no
GSE13879  0.15  0.49  0  0.21  yes
GSE12669  -0.0029  0.39  0  0.98  no
GSE8325  -0.002  0.44  0.037  0.16  yes
GSE9164  0.12  0.56  0.00048  0.98  yes
GSE412  0.47  0.68  0.00046  0.89  no
GSE6613  0.00098  0.21  0  0.92  no
GSE5107  -0.0078  0.3  0  0.85  no
GSE7023  0  0.35  0  0.99  no
GSE89  -0.0059  0.35  0  0.84  no
GSE7880  0.13  0.43  0  0.99  no
GSE7849  -0.0059  0.23  0  0.81  no
GSE994  -0.00098  0.26  0  0.79  no
GSE90  0.7  0.9  0  0.96  no
GSE361  0.015  0.49  0  0.9  no
GSE88  0.0059  0.41  0  0.92  no
GSE443  -0.0049  0.49  0  0.91  no
PC1: the first principal component of the data

2.2  Dataset features

We recorded or computed multiple features of each data set, detailed in Tables 2 and 3.
The number of probes that target ‘known genes’ (those annotated in the National Center for Biotechnology Information (NCBI) database) was recorded, rather than the total number of probes on the microarray, because the selected coexpression links were ultimately kept only between these probes (Section 2.7). The dataset size is the product |samples| * |‘known gene’ probes|. We computed additional features from the data: the proportion of missing values and the proportion of variance held by principal component 1 (PC1). Singular value decomposition (SVD) [19] was used to transform the expression data matrix and obtain the principal components of the data, and the proportion of variance held by PC1 was then measured. To compute this feature, missing expression values were replaced by the mean signal level of their associated expression profiles prior to singular value decomposition.

For each expression dataset within the Gemma database, correlations between all rows of the expression data matrix are computed and density distribution curves are displayed on the Gemma web site (Figure 1). We measured the median and 90th quantile correlation values from these density distributions. These two features indicate the proportion of positively correlated gene pairs in the dataset. For example, most datasets have a correlation density distribution with a median near zero (Figure 1A). However, datasets with more positively correlated gene pairs among their expression profiles would have a higher median correlation, perhaps 0.3 (Figure 1B).

Figure 1. Correlation magnitude density curves: Density distribution curves of correlation values between all expression profiles in the expression data matrix. Images from the Gemma website: http://www.chibi.ubc.ca/Gemma (A) Dataset ID GSE3023. (B) Dataset ID GSE6536.
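The computed features in this section can be illustrated with a minimal sketch. It is not the code used in the thesis: the function names are hypothetical, the matrix is assumed to be arranged as probes by samples with NaN marking missing values, and the SVD is applied to the matrix without centering, which is an assumption the text does not confirm.

```python
import numpy as np


def pc1_variance_proportion(expr):
    """Proportion of variance held by the first principal component.

    expr: 2-D array (probes x samples); NaN marks a missing value.
    Assumption: the SVD is applied to the matrix as given (no centering).
    """
    x = np.array(expr, dtype=float)
    # Replace each missing value by the mean of its own expression profile (row).
    row_means = np.nanmean(x, axis=1, keepdims=True)
    x = np.where(np.isnan(x), row_means, x)
    # Squared singular values are proportional to the variance per component.
    s = np.linalg.svd(x, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def correlation_summary(expr):
    """Median and 90th quantile of all pairwise Pearson correlations between rows.

    Assumption: expr has no missing values (impute or filter beforehand).
    """
    r = np.corrcoef(np.asarray(expr, dtype=float))
    upper = r[np.triu_indices_from(r, k=1)]  # each pair of rows counted once
    return float(np.median(upper)), float(np.quantile(upper, 0.9))


def dataset_size(n_samples, n_known_gene_probes):
    """Dataset size as defined in the text: samples times 'known gene' probes."""
    return n_samples * n_known_gene_probes
```

For example, `dataset_size(198, 17900)` gives 3,544,200, the value listed for GSE7390 in Table 2.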
2.3  Baseline algorithm settings

The data filtering procedures and criteria for coexpression link selection used in this thesis were based on a previously published algorithm developed in our lab [7]. The original algorithm has default settings for each parameter, chosen on the basis of pilot experiments run with a small number of expression datasets; however, these settings were not thoroughly evaluated for their optimality for gene function prediction. We used the coexpression links produced under those default settings as our “baseline standard” against which to measure performance (Table 4). A flow chart showing the sequence of procedures in this algorithm is shown in Figure 2. The red boxes identify procedures that were not originally implemented in the algorithm but are tested in this thesis and detailed in subsequent sections.

Table 4. Baseline algorithm settings

Parameter  Setting
Correlation metric  Pearson correlation coefficient
Missing values filter  30% of samples must be present
Low variance filter  5% lowest removed
Low expression filter  30% lowest removed
P-value cutoff  ≥ 0.01 removed
Correlation distribution cut  ≤ highest 1% of correlations removed
SVD normalization  No
Log-transformation  No
Removing negative correlations  No

The baseline algorithm is as follows, with settable parameters as given in Table 4. First, probes of the input expression matrix that have more than 70% of their values missing are filtered out. Second, for ratiometric arrays, a proportion of probes with the lowest expression variance between samples is filtered out of the expression matrix. For one color array designs, a proportion of probes with the lowest coefficient of variation for expression among samples is filtered out instead. Third, a proportion of probes with the lowest mean expression level in the expression matrix is filtered out.
For datasets of a two color microarray design, the expression level is estimated as the mean of the intensities from the two channels. For some datasets this information was not available, so the filter was not used. Finally, each gene expression profile is compared to all others using the Pearson correlation coefficient.

The selection of the most significant pairings between expression profiles is based on two criteria. The first criterion evaluates each correlated gene pair on the basis of statistical significance with a P-value threshold. The assumption is that the expression profiles of the two genes follow a bivariate normal distribution. Then the ratio of the correlation magnitude over the standard error of the correlation, under the null hypothesis of no correlation, follows a t-distribution with n-2 degrees of freedom, where n is the number of samples in each expression profile. P-values were corrected for the number of genes tested using the Bonferroni multiple test correction. The second criterion is the correlation distribution cut, which selects the top proportion of gene pairings on the basis of correlation magnitude (considering both positive and negative correlations). The more stringent of these two criteria determines the correlation thresholds at each tail of the distribution and marks the cutoff for coexpression link selection. For example, for a particular dataset it may be found that the top 1% of correlations includes gene pairs with a correlation over 0.8 and all gene pairs with a correlation less than -0.75.
However, the correlation corresponding to a P-value of 0.01, under the assumptions stated earlier and given the number of unique genes in the dataset, may be found to be +/- 0.85. This statistical significance threshold is more stringent at both tails of the correlation distribution than the correlation distribution cut, so it becomes the correlation threshold for link selection.

Figure 2. Coexpression algorithm procedures: Flow chart showing the sequence of algorithm procedures for preprocessing the expression data and selecting the coexpression links for each expression dataset. The boxes in red were not part of the original coexpression algorithm (Section 2.3) and are tested for their effect on performance in this thesis.

2.4  Filtering

Three filtering procedures were executed sequentially on the probes (or probe sets, for Affymetrix microarrays). The procedures follow a previously published coexpression algorithm [7], detailed in Section 2.3, and remove probes according to three criteria: too many missing values, comparatively low variance between samples, and a comparatively low mean expression level.

2.5  Additional normalization

Additional normalization methods were tested on the expression data to remove potential experimental noise. The first normalization method tested was singular value decomposition (SVD) [19], which is applied to the filtered expression data, decomposing it into three matrices (U, D, VT) that give information on the principal components of the data. The first principal component, representing the largest source of variation, was removed by setting its corresponding singular value to zero in the ‘D’ matrix, followed by reconstructing a “corrected” dataset by matrix multiplication. We also tested an SVD-based normalization method that we call ‘Balance’, which is essentially the normalization procedure used in the SPELL algorithm [20].
The Balance method sets all the singular values in the ‘D’ matrix to one and then multiplies the three SVD matrices back together to form a “corrected” expression data matrix.

2.6  Log-transformation

The effect of log-transformation of raw expression values on algorithm performance was tested for expression datasets using a one color microarray technology type. The ‘one color’ datasets which originally had log-transformed expression values were ‘unlogged’ by raising two to the power of each expression value (2^X). The ‘one color’ datasets which originally had raw expression values (not log-transformed) were ‘logged’ by taking the base-2 logarithm of each expression value (log2 X). In this case, negative expression values and values equal to zero were left as missing values after log-transformation. All transformations were done prior to expression data filtering.

2.7  Coexpression link selection

After any filtering or normalization is complete, each gene expression profile is compared to all others using a correlation coefficient, either Pearson or Spearman. The option to discard all negative correlations between genes is executed at this point if desired. The link selection procedure which then follows is based on the previously published algorithm [7], detailed in Section 2.3, which chooses links based on two main criteria: a statistical significance threshold and a correlation distribution cut. One difference from the original algorithm is that, when using the Spearman correlation metric, the P-values for the statistical significance threshold are computed differently and do not use the assumptions related to the t-distribution [32]. The impact of each individual correlation criterion is also investigated in this thesis by selecting coexpression links for each dataset using each threshold independently.
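The two link selection criteria for the Pearson metric (Sections 2.3 and 2.7) can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm's code: the function names are hypothetical, the P-value is treated as two-sided, the Bonferroni factor is the number of genes as stated in the text, and the distribution cut is split evenly between the two tails, which the text does not specify.

```python
import numpy as np
from scipy import stats


def significance_threshold(n_samples, n_genes, alpha=0.01):
    """Smallest |r| whose Bonferroni-corrected P-value is at most alpha.

    Assumptions: two-sided test; correction factor = number of genes.
    """
    df = n_samples - 2
    alpha_corrected = alpha / n_genes
    t_crit = stats.t.ppf(1.0 - alpha_corrected / 2.0, df)
    # Invert t = r * sqrt(df / (1 - r^2)) to recover the critical correlation.
    return float(t_crit / np.sqrt(df + t_crit ** 2))


def distribution_cut(correlations, top_fraction=0.01):
    """Lower and upper cutoffs that keep the most extreme correlations.

    Assumption: top_fraction is split evenly between the two tails.
    """
    r = np.asarray(correlations, dtype=float)
    lower = float(np.quantile(r, top_fraction / 2.0))
    upper = float(np.quantile(r, 1.0 - top_fraction / 2.0))
    return lower, upper


def link_thresholds(correlations, n_samples, n_genes, alpha=0.01, top_fraction=0.01):
    """Apply the more stringent of the two criteria at each tail."""
    r_sig = significance_threshold(n_samples, n_genes, alpha)
    lower, upper = distribution_cut(correlations, top_fraction)
    return min(lower, -r_sig), max(upper, r_sig)
```

A gene pair would then be kept as a link when its correlation is at or above the upper threshold or at or below the lower threshold. With few samples the significance threshold tends to dominate, while with many samples the distribution cut does.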
Only coexpression links formed between probes that target at least one known gene were kept to prepare for their functional evaluation at the gene level, described in the following sections.

2.8  Functional similarity metric

Annotations from the Gene Ontology (GO) [16] were used as the primary metric for evaluating the functional similarity of each selected coexpression link. The Gene Ontology is a controlled vocabulary of gene attributes in a hierarchical structure. The ontology covers three main domains which serve as the root attributes in the hierarchy: cellular component, biological process and molecular function. Each gene attribute or ‘GO term’ has genes annotated to it which are associated by function. The GO terms are more general near the top of the hierarchy, where they relate to one of the three root attributes, and become more specific in function towards the leaves at the bottom of the hierarchy. Each gene may have a set of GO terms associated with it (some genes have zero GO terms). In our method, this set includes all parent terms in the hierarchy of the directly annotated GO terms. The GO overlap score of a coexpression link between genes A and B is simply the GO term overlap |GO_A ∩ GO_B|, where GO_A denotes the set of GO terms associated with gene A and GO_B denotes the set of GO terms associated with gene B. This metric was previously extensively validated by Mistry and Pavlidis (2008) [33]. GO overlap scores were recorded for all links in every analysis of the datasets. GO overlap scores were also recorded for randomly selected pairs from the microarray design of each dataset. After scoring different quantities of random pairs, we found that generating more than 10,000 random pairs did not significantly change the resulting distribution of GO overlap scores. Therefore a set size of 10,000 random gene pairs was chosen for each expression dataset.
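The GO overlap score defined above can be sketched in a few lines. The function names, the dictionary layout and the toy term and gene names below are all hypothetical and serve only to illustrate the definition: each gene's direct annotations are expanded with all ancestor terms, and the score is the size of the intersection of the two expanded sets.

```python
def with_ancestors(terms, parents):
    """Return the given GO terms together with all of their ancestors.

    parents: dict mapping a term to an iterable of its direct parent terms.
    """
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def go_overlap(gene_a, gene_b, annotations, parents):
    """GO overlap score |GO_A ∩ GO_B| using ancestor-expanded term sets.

    annotations: dict mapping a gene to its directly annotated terms.
    Genes without annotations get an empty set and therefore score zero.
    """
    go_a = with_ancestors(annotations.get(gene_a, ()), parents)
    go_b = with_ancestors(annotations.get(gene_b, ()), parents)
    return len(go_a & go_b)


# Toy hierarchy (made-up names): T1 and T2 are children of ROOT, T3 is a child of T1.
toy_parents = {"T1": ["ROOT"], "T2": ["ROOT"], "T3": ["T1"]}
toy_annotations = {"geneA": {"T3"}, "geneB": {"T1"}, "geneC": {"T2"}}
```

In the toy data, `geneA` and `geneB` share the terms T1 and ROOT after expansion, whereas `geneA` and `geneC` share only ROOT, so the more closely related pair receives the higher score.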
The primary measure of performance was the difference in the average GO overlap score between the coexpressed and randomly selected gene pairs for each dataset under the same set of data treatment and link selection parameters, named the mean GO shift. A positive mean GO shift for a link set implies that it performs better than its associated set of random links in terms of functional similarity. The Gene Ontology annotations covered 18,512 unique human genes, with an average of over seven GO terms annotated to each gene.

2.9  Evaluating performance of alternate algorithm settings

To evaluate the performance of alternate algorithm settings and procedures against baseline conditions (Table 4), we computed several measures. First, to evaluate performance in individual datasets under an alternate algorithm setting, we measured the change in their functional similarity measure from baseline conditions (DELTA (Mean GO shift)). A positive DELTA (Mean GO shift) value implies that, for a particular dataset, the alternate algorithm setting or procedure performs better than baseline conditions (Table 4). We also performed one-sided Wilcoxon signed rank tests on these DELTA (Mean GO shift) values to evaluate the statistical significance of this change in performance across all datasets. For each signed rank test, the alternative hypothesis was that the DELTA (Mean GO shift) values were greater than zero, as we were primarily interested in whether the change in performance was a statistically significant improvement and not merely a significant difference in either direction.

To summarize the average change in performance across all datasets when varying the different thresholds and settings of our coexpression algorithm, we used three additional measures. The first measure, later referred to as the ‘average’ score, is the average of the mean GO shift scores across all datasets (AVG mean GO shift).
The second measure was the standard deviation of the mean GO shift scores for link sets produced under the alternate algorithm settings (SD Mean GO shift). The third measure, later referred to as the ‘weighted average’ score, is the average of the mean GO shift scores across all datasets, taking into account the change in the number of links produced (Weighted AVG mean GO shift).

For the ‘weighted average’ scores, the products of the mean GO shift score and the link count are summed over the n link sets produced under the alternate algorithm setting or procedure. This sum is then divided by the sum of the link counts for the n datasets under the alternate algorithm setting being tested to give the Weighted AVG mean GO shift score. Here n is the number of datasets applicable to the alternate algorithm setting or procedure being tested, which in most cases is all 80 datasets.

2.10  Protein to protein interactions

Human protein-to-protein interactions were downloaded from the InnateDB database [34]. Given a coexpression link consisting of genes A and B, the interaction of A and B at the protein level was considered a functional match. The proportion of matches over the total number of links in a link set was measured as the ‘PPI%’. The ‘PPI% shift’ was the difference in the ‘PPI%’ measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions. The ‘PPI% shift’ was used as an additional standard to evaluate the functional similarity of coexpression link sets for each dataset. A positive ‘PPI% shift’ for a link set implies that it performs better than its associated set of random links in terms of functional similarity. These protein-to-protein interactions cover 16,797 unique human genes with 45,750 total interactions.

2.11  KEGG pathways

Gene annotations for human biological pathways were obtained from the KEGG [35] pathway database (http://www.genome.jp/kegg/pathway.html).
Every KEGG pathway has an associated set of genes annotated to it. Given a coexpression link consisting of genes A and B, the annotation of A and B to the same KEGG pathway was considered a functional match. The proportion of matches over the total number of links in a link set was measured as the ‘KEGG%’. The ‘KEGG% shift’ was the difference in the ‘KEGG%’ measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions. The ‘KEGG% shift’ was also used as an additional standard to evaluate the functional similarity of coexpression link sets for each dataset. A positive ‘KEGG% shift’ for a link set implies that it performs better than its associated set of random links in terms of functional similarity. There were 217 human pathways in total, covering 4,972 unique human genes.

2.12  MitoCarta genes

The final evaluation method we applied was based on MitoCarta [36]. The Human MitoCarta gene list contains 1,023 genes that encode proteins with strong support of being localized to the mitochondria, and which have been shown to be coexpressed [37, 38]. For each analysis, the number of selected coexpression links between two human MitoCarta genes was divided by the total number of possible links that can be formed from the MitoCarta genes present in the expression dataset (MitoCarta ratio). The ‘MitoCarta ratio shift’ was the difference in the MitoCarta ratio measure between selected coexpression links and randomly selected gene pairs for a dataset under one set of conditions.

3  Results

3.1  Summary of performance statistics

To understand how the preprocessing procedures and the methods used for coexpression link selection affect performance in functional prediction, we adjusted the stringency of the algorithm settings controlling these steps and tested new methods.
To monitor the average change in performance, we tracked three measures of functional similarity according to the Gene Ontology (GO) annotations, summarized below (Table 5) for each analysis (Methods): AVG Mean GO shift, SD Mean GO shift and Weighted AVG Mean GO shift. These measures summarize the effect of each alternate threshold setting or new procedure on algorithm performance across all 80 expression datasets. While the ‘average’ scores simply summarize the raw performance values across all datasets, the ‘weighted average’ scores take into account the change in the average number of links obtained, as we do not desire an improvement in performance at the cost of greatly reducing the number of links produced. The change in these measures (‘DELTA’) indicates how these alternate settings or new procedures perform relative to the coexpression links produced under baseline algorithm settings for all datasets (a positive ‘DELTA’ is an improvement in performance). Further details on the change in performance for individual datasets when adjusting the stringency of each algorithm procedure or testing new procedures are presented and discussed in later sections.

Table 5. Gene Ontology functional similarity statistics

Analysis  AVG Mean GO shift  AVG Mean GO shift-DELTA  SD Mean GO shift  Weighted AVG Mean GO shift  Weighted AVG Mean GO shift-DELTA  AVG Link Count

PEARSON CORRELATION METRIC
Baseline  0.67  0  0.79  0.81  0  697,390
Baseline (Corr. Dist. Cut)  0.48  -0.2  0.54  0.72  -0.082  1,395,360
Baseline (P-value threshold)  0.63  -0.038  0.75  0.41  -0.4  13,701,743
Removing negative correlations  0.8  0.13  0.88  1.1  0.28  420,329
SVD normalization  0.42  -0.25  0.69  0.25  -0.56  732,424
SVD normalization (TC)  0.49  0.12  0.75  0.68  0.092  307,194
Balance normalization  0.48  0.27  1.2  0.016  -0.036  5,124
Missing values cut = 50% (28 datasets) *  0.3  0.075  0.66  0.5  0.012  1,374,348
Missing values cut = 70% (28 datasets) *  0.39  0.16  0.74  0.52  0.032  1,333,784
Low variance cut = 0%  0.47  -0.0037  0.54  0.72  -0.0018  1,541,028
Low variance cut = 15%  0.47  -0.0023  0.53  0.71  -0.016  1,116,016
Low expression cut = 0%  0.36  -0.16  0.43  0.43  -0.29  2,360,463
Low expression cut = 45%  0.58  0.062  0.62  0.87  0.15  1,002,551
Corr. Dist. Cut = 0.1%  0.56  0.081  0.71  0.86  0.14  144,173
Corr. Dist. Cut = 10%  0.37  -0.1  0.38  0.46  -0.27  13,356,312
P-value threshold = 0.001  0.65  0.02  0.87  0.42  0.0068  12,142,044
P-value threshold = 0.1  0.59  -0.044  0.69  0.38  -0.024  15,633,857
Reverse log-transformation (6 datasets) ~  0.6  -0.11  0.53  0.72  0.11  579,524
Log-transformation (46 datasets) ~~  0.81  0.27  0.79  0.63  0.14  1,586,755

SPEARMAN CORRELATION METRIC
Baseline  0.71  0  1  0.9  0  668,886
Baseline (Corr. Dist. Cut)  0.52  -0.19  0.53  0.78  -0.12  1,410,173
Baseline (P-value threshold)  1.3  0.61  1.2  1.1  0.18  598,404
Removing negative correlations  0.85  0.14  1.1  1.3  0.4  378,205
SVD normalization  0.65  -0.065  1.6  1.5  0.61  172,859
SVD normalization (TC)  -0.15  -0.56  1.1  -0.72  -1.5  7,441
Missing values cut = 50% (28 datasets) *  0.45  0.15  0.91  0.6  0.077  1,270,753
Missing values cut = 70% (28 datasets) *  0.54  0.23  0.88  0.62  0.093  1,318,569
Low variance cut = 0%  0.56  0.035  0.95  0.8  0.018  1,404,279
Low variance cut = 15%  0.57  0.045  0.94  0.81  0.023  1,066,854
Low expression cut = 0%  0.57  0.0079  0.59  0.44  -0.33  106,093
Low expression cut = 45%  0.76  0.19  0.81  1.3  0.5  54,019
Corr. Dist. Cut = 0.1%  0.63  0.1  1.1  0.45  -0.34  14,871
Corr. Dist. Cut = 10%  0.52  -0.0091  0.53  0.039  -0.74  1,405,739
P-value threshold = 0.001  1.3  -0.007  1.2  1.1  0.027  10,895,815
P-value threshold = 0.1  1.3  -0.0021  0.83  1.1  -0.014  14,099,479

Blue colored rows – all analyses run under baseline algorithm settings (Table 4)
Green colored rows – all analyses run under baseline algorithm settings except using only the correlation distribution cut for link selection
Red colored rows – all analyses run under baseline algorithm settings except using only the P-value threshold for link selection
* Statistics gathered only for datasets which contain missing values in expression data
~ Statistics gathered only for datasets of one color microarray technology type which originally had log-transformed expression values
~~ Statistics gathered only for datasets of one color microarray technology type which originally did not have log-transformed expression values
AVG – average
SD – standard deviation

3.2  Baseline coexpression settings

To evaluate the contribution of each component of coexpression analysis towards predicting gene function, we needed to compare our performance against a baseline measure. The structure of the coexpression analysis method was based on a previously published algorithm which executed similar preprocessing methods and coexpression link selection criteria (Methods). We produced coexpression links under the default settings used in the original algorithm as our ‘baseline links’ for each expression dataset (Table 4).
We also produced 10,000 randomly selected gene pairs for each dataset and, for each analysis, measured the performance of the selected coexpression links against the performance of the random links in terms of functional similarity. The functional similarity of each selected coexpression link was evaluated by computing the overlap of the Gene Ontology (GO) annotations of its two genes (Methods).

We first produced coexpression links under baseline algorithm settings (Table 4) for all 80 expression datasets. We then evaluated the functional similarity of these baseline link sets by using the difference in the average GO overlap score between selected and random link sets, giving us the ‘mean GO shift’ measure (Figure 3). Under baseline algorithm settings, the majority of datasets perform well; however, 15 datasets still produce a negative mean GO shift, which implies that their coexpression links do not provide better functional information than randomly picked gene pairs. Two factors could account for these results. First, the preprocessing procedures and link selection criteria executed under baseline algorithm settings may be unable to extract the highest number of functionally similar links from each dataset. Second, there may be technical features of these expression datasets, possibly related to data quality, which make it difficult to extract coexpression links of high functional similarity. Further analysis of both possibilities in this thesis helps determine which of these factors, or whether both, cause coexpression link sets to perform worse than random gene pairs.

Figure 3. Performance of baseline coexpression links: Mean GO shift scores for coexpression links created under baseline settings for each parameter of the coexpression algorithm. Scores for all 80 datasets are shown (each point represents a dataset).
The black line parallel to the X-axis separates the positive mean GO shift scores from the negative mean GO shift scores. Scores are sorted in ascending order.

To validate the accuracy of the Gene Ontology as a standard for gene functional similarity, we tested additional functional standards on our coexpression algorithm. For each of the 80 baseline link sets, we computed three new measures: the proportion of links that share a protein-to-protein interaction (PPI%), the proportion of links that share gene annotations to the same KEGG pathway (KEGG%) and the number of links between MitoCarta genes originally in the expression dataset (MitoCarta ratio). The expression datasets used in this thesis were not specifically chosen for their study of mitochondrial genes; however, this ratio was controlled for the number of MitoCarta genes present in the original dataset (Methods). Also, previous studies have shown that mitochondrial genes are highly coexpressed [37, 38], and they should therefore be detected by our coexpression algorithm. Similar to the mean GO shift, these additional measures were scored relative to the performance of the randomly selected links, giving us the PPI% shift, the KEGG% shift and the MitoCarta ratio shift. We assessed how well these additional standards validated our primary measure of functional similarity, the Gene Ontology, by comparing the respective scores for all functional standards (Figure 4). The range of these new scoring measures was very small because these additional standards, most notably the MitoCarta genes, have lower annotation coverage of the human genome than the Gene Ontology. Nevertheless, they do show some agreement with the GO functional standard, with the highest correlation (Figure 4) from the protein-to-protein interactions (correlation = 0.45, P-value = 2.87E-5).
This gives more confidence in the GO annotations as a reasonable evaluator of functional similarity for coexpression links.

Figure 4. Comparing standards for gene functional similarity: Functional similarity scores using the Gene Ontology are compared with 3 additional functional metrics scoring the baseline coexpression links for all 80 datasets. A high scoring outlier in graphs (A-B) was omitted in order to focus on the overall trend for the majority of the data. The high scores of these outlier points are most likely caused by the smaller size of their datasets, for which even a small number of functional matches would register a high ‘PPI% shift’ or ‘KEGG% shift’ score. The rank correlation between the two sets of scores and the associated P-value are shown at the top of each graph. (A) Protein-to-protein interactions (B) KEGG pathway annotations (C) MitoCarta genes

3.3  Expression correlation metrics

We compared the ‘quality’ of coexpression links selected using the Pearson linear correlation coefficient and the Spearman rank correlation coefficient. We selected links solely based on the correlation distribution cut in order to obtain a similar number of links under both correlation metrics. Besides the correlation metric, all other parameters were configured to baseline algorithm settings. We then compared the functional similarity of the links produced under both correlation metrics by recording their mean GO shift scores (Figure 5). The performance of the two correlation metrics agrees well across all datasets (correlation = 0.91). Because the performance of the metrics is highly correlated, the biological interpretation of the data does not change substantially when using the Spearman correlation metric. Therefore, the remainder of the results focuses on the experiments that were run with the Pearson correlation coefficient.

Figure 5. Effect of correlation metric on performance: Mean GO shift scores, representing functional similarity, for the coexpression links of all 80 datasets produced under both the Pearson and Spearman correlation metrics. All link sets were selected on the basis of only the correlation distribution cut criterion to produce similar link quantities for both metrics. The remaining conditions for both sets of scores are baseline algorithm settings. The Spearman rank correlation between the scores of the two correlation metrics is displayed at the top of the graph. The P-value of the one-sided Wilcoxon signed rank test is displayed at the top of the graph. The black line with a slope of 1 drawn through the origin shows where scores would be identical under both correlation metrics.

3.4  Dataset features

It is clear from the performance of the baseline link sets shown earlier (Figure 3) that, under the same preprocessing settings and link selection thresholds, some datasets provide more functionally related links than others. While all baseline links were formed under the same algorithm settings, the expression datasets from which they were obtained vary in their features (Tables 2 and 3). We recorded a variety of features of each expression dataset (Methods) and examined the relationship between these dataset features and the quality of coexpression links produced. We kept all algorithm settings constant in this comparison (baseline settings; Figures 6-14). Figures 6-8 show that the size of the expression dataset is slightly (but significantly) positively correlated with performance. In contrast, missing values tend to reduce performance (Figure 9). Interestingly, the best performing datasets in coexpression analysis all have a PC1 variance proportion greater than 0.9 (Figure 10). Datasets that use a one color array design such as Affymetrix perform better overall than those of two color array designs (p < 0.05, t-test; Figure 11).
Data sets with a high median correlation among probes perform better (positive correlation with performance is 0.21), suggesting that even a slight increase in the relative proportion of positive correlations improves performance (Figure 12-13). The expression datasets sampling from only one tissue group seem to perform slightly better than those that study multiple tissue groups (Figure 14), however this is not a sufficiently balanced comparison as only 9 datasets in total used multiple tissue groups. While none of these factors explain all the variance in performance (alone or together), there is clearly a connection between some dataset features and performance in functional prediction. Figure 6. Effect of number of samples on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the number of microarray samples in each dataset. The rank correlation between the two measures is displayed at the top of the graph.  33  Figure 7. Effect of number of probes on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the number of known gene probes in each dataset. The rank correlation between the two measures is displayed at the top of the graph.  Figure 8. Effect of dataset size on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the expression data matrix size in each dataset. The rank correlation between the two measures is displayed at the top of the graph.  34  Figure 9. Effect of missing values proportion on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the proportion of missing values in each dataset. The rank correlation between the two measures is displayed at the top of the graph.  Figure 10. 
Effect of PC1 variance proportion on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the proportion of variance held by the first principal component in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 11. Effect of microarray technology type on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared between datasets using a one color or two color microarray technology type. The p-value from a two-sided t-test between both sets of scores is shown at the top of the graph.

Figure 12. Effect of median probe-probe correlation on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the median correlation magnitude in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 13. Effect of 90th quantile probe-probe correlation on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared to the 90th quantile correlation magnitude in each dataset. The rank correlation between the two measures is displayed at the top of the graph.

Figure 14. Effect of number of tissue groups investigated on performance: The functional similarity (Mean GO shift) of the baseline link sets is compared between datasets that investigated one tissue group and those that investigated multiple tissue groups. The p-value from a two-sided t-test between both sets of scores is shown at the top of the graph.

3.5  Configuring link selection

The goal of our link selection process was to extract coexpression links which contain significant functional relationships between genes. Ideally we would extract all such functionally-relevant links, but there might be more value in selecting fewer links of (hopefully) higher quality. In terms of our link selection procedures, testing this amounts to varying the thresholds used to select the coexpression links.
As described earlier, the baseline method uses two thresholds to extract links: one based on the properties of the coexpression distribution, and one based on a test of statistical significance of the coexpression. In any given dataset either threshold might be in force, complicating evaluation. Therefore, to test the effect of varying these two thresholds independently, we altered our procedure to apply only one of the thresholds at a time (Methods).

We first varied the stringency of the correlation distribution cut. We produced coexpression link sets for each dataset under three different settings for the correlation distribution cut: 0.1%, 1% (baseline setting) and 10%. The significance threshold was not used, and other settings were at baseline. We then measured the change in overall functional similarity from taking fewer or more links compared to the baseline link sets, giving us the 'DELTA (mean GO shift)' values. We used the Wilcoxon signed rank test to determine whether the improvement in performance, if any, over the baseline standard was statistically significant across all datasets. As shown in Figure 15, selecting a smaller proportion of the strongest correlations as links (0.1%) gives a statistically significant improvement over baseline (Wilcoxon p-value = 0.0057; Figure 15A). The majority of the improvement occurs for datasets that already perform well (mean GO shift > 0.25) under baseline conditions. Selecting a larger proportion of the strongest correlations does not guarantee more coexpression links of high quality, as a negative change relative to the baseline link sets is common (Figure 15B). This change is not a statistically significant improvement in performance (Wilcoxon p-value = 1).

Figure 15. Effect of correlation distribution cut on performance: The change in functional similarity as compared to the performance of baseline links when using alternate thresholds for the correlation distribution cut.
This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis use only the correlation distribution cut to select coexpression links. P-values, produced using the Wilcoxon signed rank test, are shown at the top of each graph. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Correlation distribution cut = 0.1% (B) Correlation distribution cut = 10%

To test the effect of the statistical significance threshold, we produced coexpression link sets for each dataset under three different p-value thresholds: 0.001, 0.01 (baseline setting) and 0.1. All other algorithm settings were under baseline conditions. We found that, as for the correlation distribution threshold, using a more stringent threshold produces a small improvement in the functional similarity of links across most datasets compared to baseline link sets (Figure 16A, Wilcoxon p-value = 0.0073). However, the change in performance is very small. Using a more permissive statistical significance threshold reproduces the same trend but in the opposite direction (Figure 16B). The small change in performance when varying the statistical significance threshold may be due to the Bonferroni multiple test corrections made to the p-values, which can change the effect of this threshold depending on the number of unique genes in each dataset. This issue can be understood by comparing the average link counts across the runs using different settings for the p-value threshold and the runs using different settings for the correlation distribution cut (Table 5). When looking at the change in the 'weighted average' score for varying the thresholds of both correlation criteria, both show a similar pattern as their stringency is varied; however, the correlation distribution cut seems to have a larger effect on performance, showing a greater dip (-0.27) and rise (+0.14) than the p-value threshold (-0.024, +0.068).
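The way the two link selection criteria interact can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this thesis: the function names (`pearson`, `approx_p_value`, `select_links`), the ranking by absolute correlation, the requirement that a link pass both criteria, and the Fisher z approximation standing in for the significance test are all assumptions made for the example.

```python
import math
from itertools import combinations
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n_samples):
    """Two-sided p-value from the Fisher z approximation (assumed test).

    Requires more than 3 samples.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n_samples - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_links(profiles, top_fraction=0.01, alpha=0.01):
    """Keep gene pairs that pass both the distribution cut and the p-value cut.

    profiles: dict mapping gene name -> list of expression values.
    """
    pairs = [(g1, g2, pearson(x, y))
             for (g1, x), (g2, y) in combinations(profiles.items(), 2)]
    n_tests = len(pairs)
    n_samples = len(next(iter(profiles.values())))

    # Correlation distribution cut: keep the strongest top_fraction of pairs.
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    n_keep = max(1, math.ceil(top_fraction * n_tests))

    links = []
    for g1, g2, r in pairs[:n_keep]:
        # Bonferroni correction: scale the raw p-value by the number of tests.
        p_adj = min(1.0, approx_p_value(r, n_samples) * n_tests)
        if p_adj < alpha:
            links.append((g1, g2, r))
    return links
```

Because the Bonferroni factor grows with the number of gene pairs, the same `alpha` is effectively stricter in datasets with more genes, which is consistent with the dataset-dependent behaviour of the p-value threshold noted above.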
Figure 15. Effect of statistical significance threshold on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the p-value threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis use only the p-value threshold to select coexpression links. P-values, produced using the Wilcoxon signed rank test, are shown at the top of each graph. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) P-value threshold = 0.001 (B) P-value threshold = 0.1

3.6  Removing negative correlations

We produced sets of coexpression links for each dataset using only positive correlations and tracked the change in performance (Figure 16). A majority of datasets perform better using only positive correlations (58/80 datasets). Furthermore, this change in performance is statistically significant overall (Wilcoxon p-value = 6.5E-9). This new procedure obtained the largest improvement in its 'weighted average' score across all analyses run under the Pearson correlation metric (+0.28).

Figure 16. Effect of removing negative correlations on performance: The change in functional similarity as compared to the performance of baseline links when removing negatively correlated gene pairs. This is plotted against the performance of the baseline links on the x-axis. The p-value, produced using the Wilcoxon signed rank test, is shown at the top of the graph. The black line with a slope of zero shows where points would lie if there was no change in performance.

3.7  Filtering of expression data

Our baseline algorithm filters the expression data in an attempt to remove noisy signals that (hopefully) do not contribute functional information. It is possible that the filtering is non-optimal (either too stringent or too lenient).
To explore this, we examined the impact of multiple filtering and additional procedures aimed at improving data quality. All links throughout these analyses were selected with the correlation distribution cut to keep the number of selected links relatively constant, so that we could focus on the individual impact of each filtering procedure on the resulting performance. Thus, 80 specialized baseline link sets were produced for this analysis using only the correlation distribution cut (1%) for link selection, keeping all other settings under baseline conditions.

We first examined a procedure which filtered out expression profiles of the data matrix which contained too many missing values; as shown earlier, datasets with many missing values tend to perform worse. The baseline filtering threshold for missing values filtered out expression profiles which had fewer than 30% of their samples present (not missing). We adjusted this threshold and produced new link sets which filtered out more profiles according to missing values by using alternate thresholds of 50% and 70%, while keeping all other settings under baseline conditions. This analysis applied to only the 28 datasets which had any missing values; most of these were two color datasets. When increasing the missing value threshold to 50% (Figure 17A), many datasets show a minimal change in their functional similarity scores; however, those that do display a noticeable change show an improvement in performance or at least do not get considerably worse. This indicates that perhaps this increase in the missing value threshold is not removing profiles with true biological signal. When increasing the missing value threshold to 70% (Figure 17B), a larger number of datasets show an improvement in performance than at 50% (Figure 17A). However, in both cases the improvements are not statistically significant (Wilcoxon p-values = 0.13, 0.095).
When increasing the missing values threshold, the average link count among applicable datasets (Table 5) does not change greatly (1.395E6, 1.374E6, 1.334E6), possibly indicating that insignificant links with many missing values are replaced by links with a stronger functional relationship. The results show that the missing value threshold is valuable in removing rows of the data matrix that can afford to be lost: it increases performance in many datasets while keeping the overall functional similarity relatively constant in others.

Figure 17. Effect of missing value filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the missing value filter threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis use only the correlation distribution cut to select coexpression links. Scores are shown for only the 28 datasets which contained missing values in their expression data. P-values, produced using the Wilcoxon signed rank test, are shown at the top of each graph. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Missing value filter threshold = 50% (B) Missing value filter threshold = 70%

We then examined the filtering procedure which removed low-variance rows of the expression data matrix. The baseline filtering threshold for low sample variance removes the 5% of expression profiles which have the lowest variation of expression between samples. We produced new link sets using alternate low variance thresholds of 0% and 15%, while keeping all other settings at baseline. With no low variance filtering (Figure 18A), there is no change in performance overall, as all points center around a 'DELTA (mean GO shift)' value of zero.
Approximately equal numbers of datasets improve and get worse in performance. Increasing the low variance filter threshold to 15% (Figure 18B) improves the performance of certain datasets compared to baseline link sets while decreasing the performance of others, but overall there are no large changes in functional similarity compared to the effect of other algorithm parameters shown previously. Neither alternate threshold for low variance improves performance statistically over the baseline filter setting (Wilcoxon p-values = 0.62, 0.71). The changes in the 'average' (-0.0037, -0.0023) and 'weighted average' scores (-0.0018, -0.016) when varying the low variance filter threshold are quite small (Table 5). We conclude that within the range of settings we tested, filtering for low variance does not have a significant impact on the overall functional similarity of coexpression links, and thus should likely be omitted.

Figure 18. Effect of low variance filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the low variance filter threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis use only the correlation distribution cut to select coexpression links. P-values, produced using the Wilcoxon signed rank test, are shown at the top of each graph. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Low variance filter threshold = 0% (B) Low variance filter threshold = 15%

We then investigated the final filtering procedure, which removes rows of the expression data according to low expression levels. The baseline filtering threshold for low expression removes the 30% of expression profiles which have the lowest mean expression level across samples.
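The three row filters examined in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function names (`present`, `drop_lowest`, `filter_profiles`), the use of `None` for missing values, and the order in which the filters are applied are hypothetical. The default arguments follow the baseline settings stated in the text (at least 30% of samples present, lowest 5% by variance removed, lowest 30% by mean expression removed).

```python
import math
from statistics import mean, pvariance


def present(values):
    """Return the non-missing entries of an expression profile."""
    return [v for v in values if v is not None]


def drop_lowest(profiles, score, fraction):
    """Remove the given fraction of rows with the lowest score."""
    n_drop = math.floor(fraction * len(profiles))
    ranked = sorted(profiles, key=lambda gene: score(profiles[gene]))
    dropped = set(ranked[:n_drop])
    return {g: v for g, v in profiles.items() if g not in dropped}


def filter_profiles(profiles, min_present=0.30, low_var=0.05, low_expr=0.30):
    """Apply missing-value, low-variance and low-expression filters in turn.

    profiles: dict mapping gene name -> list of values (None = missing).
    """
    # Missing value filter: keep rows with enough non-missing samples.
    kept = {g: v for g, v in profiles.items()
            if present(v) and len(present(v)) / len(v) >= min_present}
    # Low variance filter: drop the least variable rows.
    kept = drop_lowest(kept, lambda v: pvariance(present(v)), low_var)
    # Low expression filter: drop the rows with the lowest mean expression.
    kept = drop_lowest(kept, lambda v: mean(present(v)), low_expr)
    return kept
```

Calling `filter_profiles(profiles, min_present=0.70, low_var=0.0)` would correspond to the stricter missing-value setting with the low variance filter disabled.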
We adjusted this threshold and produced new link sets using alternate low expression thresholds of 0% and 45%, while keeping all other algorithm procedures under baseline settings. Using no low expression filtering (Figure 19A) results in a decrease in performance for the majority of datasets compared to the baseline links. This decline in performance is not a significant improvement (Wilcoxon p-value = 1), indicating that bypassing low expression filtering is not a desirable option in coexpression analysis. While increasing the low expression threshold to 45% (Figure 19B) decreases the performance of some datasets, most improve overall compared to baseline link sets (Wilcoxon p-value = 0.00088).

Figure 19. Effect of low expression filtering on performance: The change in functional similarity as compared to the performance of baseline links when using alternate settings for the low expression filter threshold. This is plotted against the performance of the baseline links on the x-axis. All link sets produced in this analysis use only the correlation distribution cut to select coexpression links. P-values, produced using the Wilcoxon signed rank test, are shown at the top of each graph. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Low expression filter threshold = 0% (B) Low expression filter threshold = 45%

3.8  Additional normalization

We tested the effect on performance of two additional normalization methods on the expression datasets. We first produced coexpression link sets for each dataset which used SVD as an additional normalization method (Methods), while keeping all other procedures under baseline settings. SVD normalization does not improve performance overall (Figure 20), as there are equal numbers of datasets with positive and negative 'DELTA (mean GO shift)' scores.
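For orientation, the idea of removing a dominant component with SVD can be sketched as below. This is a hedged illustration only: it assumes that "SVD normalization" amounts to subtracting the rank-1 reconstruction of the leading singular component from the genes-by-samples matrix, which matches the later description of noise contained in the first principal component but may differ from the exact procedure defined in Methods. The function name `remove_first_component` is hypothetical, and the sketch depends on NumPy and on a matrix with no missing values.

```python
import numpy as np


def remove_first_component(X):
    """Subtract the leading singular component from a genes x samples matrix."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rank-1 matrix carried by the largest singular value.
    rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
    return X - rank1
```

If the leading component carries true biological variation rather than noise, this subtraction removes signal, which is one possible reading of why well-performing datasets get worse after SVD normalization.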
The SVD normalization method is certainly not a statistical improvement in performance over the baseline standard (Wilcoxon p-value = 0.997). We observed that most of the datasets that showed improvement were of the two color microarray technology type. Most of these 'two color' datasets that improved with SVD normalization had weaker performance for their associated baseline links. It was shown earlier (Section 3.4) that the microarray design used is related to performance, with two color array designs typically performing worse overall (Figure 11). In contrast, those datasets that performed well (mean GO shift > 0.5) under baseline conditions seem to get worse with SVD normalization, suggesting that in those cases the principal component of the data believed to be noise was composed of true biological variation in expression.

Figure 20. Effect of SVD normalization on performance: The change in functional similarity as compared to the performance of baseline links when using the SVD normalization method. This is plotted against the performance of the baseline links on the x-axis. The p-value shown at the top of the graph is produced using the Wilcoxon signed rank test. Red points signify datasets with a two color microarray technology type and blue points signify datasets with a one color microarray technology type. The black line with a slope of zero shows where points would lie if there was no change in performance.

The second method we tested was the Balance normalization method, which used similar procedures to SVD (Methods). We produced coexpression link sets for 4 datasets using the Balance normalization method, keeping all other algorithm parameters under baseline settings. Although the results shown below indicate that 3 out of the 4 datasets improve with Balance normalization (Table 6), the number of links obtained is reduced (approximately 98% less on average).
This is reflected in the change of the 'weighted average' score of -0.0358 (Table 5).

Table 6. Effect of Balance normalization on performance

Dataset ID   Baseline-# of links   Balance-# of links   Baseline-Mean GO shift   Balance-DELTA (mean GO shift)
GSE343       589788                7066                 -1.0309                  0.1556
GSE4988      54863                 7082                 -0.4392                  0.2961
GSE5808      10827                 212                  1.0547                   0.7026
GSE9006      541517                6134                 1.2617                   -0.0941

3.9  Log-transformation of expression values

Out of the 80 expression datasets, 52 datasets used a one color microarray technology type and, of these, 6 datasets had log-transformed expression values while the others reported raw expression data values. All of the datasets using a two color microarray technology type had log-transformed expression values. What was not yet known was whether log-transformation affected the ability of the 'one color' datasets to yield biologically meaningful coexpression links. We investigated this by log-transforming the expression values of the 46 'one color' datasets which originally had untransformed values and reversing the log-transformation of expression values for the 6 datasets which originally had log-transformed values. We then computed coexpression link sets for these transformed expression datasets while keeping all other algorithm settings under baseline conditions. Reversing the log-transformation of expression values results in a decline in performance for 5 of the 6 applicable 'one color' datasets (Figure 21A). Log-transformation of the raw expression values results in an improvement in performance for the majority of the applicable 'one color' datasets (Figure 21B), with two showing a large increase in their mean GO shift scores (> 2.0). The change in the 'weighted average' scores for both types of transformations shows an improvement overall (0.11, 0.14), with the log-transformation of expression values scoring slightly higher (Table 5).
However, the change in 'average' scores (Table 5) agrees with what is shown visually below (unlogged = 0.11, logged = 0.27).

Figure 21. Effect of log-transformation on performance: The change in the functional similarity (mean GO shift) is measured for links produced after log-transformation of expression values or after reversing log-transformation of expression values for the applicable sets using a one color microarray technology type (52 datasets). This change is plotted against the performance of all 52 associated baseline 'one color' link sets on the x-axis. The black line with a slope of zero shows where points would lie if there was no change in performance. (A) Reversing the log-transformation of expression values of 6 'one color' datasets. (B) Log-transformation of expression values for 46 'one color' datasets.

4  Discussion

This thesis provides information on the extrinsic and intrinsic factors that influence the ability of coexpression analysis to predict gene function. The results show that the preprocessing procedures and link selection methods used in coexpression analysis affect the biological significance of the coexpression links produced.

We have shown that expression datasets that do not perform well compared to others tend to have similar dataset features. The influence of these dataset features should be taken into account in order to make accurate predictions of gene function through coexpression analysis. On average, larger expression datasets produced higher-quality coexpression links (Figure 8). This is expected, as larger expression datasets provide more opportunities to form coexpression links across all possible gene pairs (more probes) as well as more statistical power for these connections (more samples). Poorly performing datasets were mostly associated with two color microarray designs (Figure 11).
This could be due to a number of common technical errors involved in dye labeling or in hybridizing two samples together on the same microarray, which can affect the raw expression values. Treating two color expression data in a more specialized manner may correct for this noise. Finally, it was apparent that datasets which produced even slightly more positive correlations than negative correlations tended to perform better (Figure 12). In general, most of these dataset features cannot be controlled for in an experiment, and therefore it would be beneficial to weight the significance of the coexpression links based on the characteristics of each input dataset.

Removing negatively coexpressed gene pairs prior to link selection gave the most consistent improvement in link functional similarity across all datasets (Figure 16). This might explain why datasets that had more positive correlations performed better (Figure 12). Although removing negative correlations produced a consistent improvement in performance compared to the baseline standard, it does not necessarily mean that negatively coexpressed gene pairs do not capture significant biological relationships. Performance in our algorithm was primarily measured using GO annotations, and the Gene Ontology is constructed in a way that relates genes that are involved in the same biological process or pathway; such genes are more likely to be positively correlated in expression. Negatively coexpressed gene pairs, which can involve one gene inhibiting the biological process of another gene, will produce lower performance scores by this standard. Negatively coexpressed gene pairs should therefore not be removed on account of this bias, and all correlated gene pairs should be kept before link selection.

The filtering procedures varied in their effects on performance. The missing value threshold, when increased, improves performance (Figure 17).
Thus, even if missing values take up less than half of the samples in expression profiles, they tend to have a negative effect on the functional similarity of the coexpression links. Changing the threshold for low variance filtering did not have a significant effect on performance (Figure 18A). Increasing the threshold did improve the performance of some datasets but made others worse, and in both cases only to a small degree (Figure 18B). Consequently, it does not seem beneficial to keep the low variance filter as a preprocessing procedure in coexpression analysis. Finally, the results for the low expression filter showed that it is important to performance, as disabling it reduced the functional similarity of many link sets overall (Figure 19A). Increasing the stringency of the low expression filter well over the baseline setting did not give a consistent improvement in performance across all datasets (Figure 19B). Filtering the expression data by low mean expression level follows the assumption that genes expressed at low values across various experimental conditions are likely the result of experimental noise on the microarray. However, it is still possible for a gene pair to be significantly coexpressed at low average expression levels. Therefore, while the results suggest that keeping a low expression filter increases the number of functionally similar coexpression links, it need not be increased over the baseline setting of 30%.

SVD normalization resulted in the most significant improvement in performance, for normalization methods, without greatly reducing the number of links obtained. While not consistently seen across all 80 datasets, SVD seemed to work best for datasets using a two color microarray design (Figure 20). Consequently, it followed that SVD normalization seemed to improve the datasets that mostly performed worse under baseline conditions, which as seen earlier were generally associated with a two color microarray design (Figure 11).
This suggests that 'two color' datasets may typically contain an element of experimental noise which is mostly contained within the first principal component of the data, as SVD normalization is able to remove its effect on performance.

Log-transformation of the one color expression data resulted in a majority of the datasets improving in performance (Figure 21B). Reversing the log-transformation of one color expression data mostly reduced performance overall (Figure 21A). In addition, the performance improvement obtained by log-transformation was demonstrated across a larger number of datasets (46 datasets) and consequently found to be statistically significant overall (Wilcoxon p-value = 0.0016).

We predicted that, in general, using more stringent correlation thresholds (independent of correlation sign) would increase functional similarity. Indeed, making the thresholds more stringent produces fewer links with overall higher performance, and making thresholds less stringent produces more links but with overall lower performance compared to baseline. Neither correlation threshold criterion used alone under baseline algorithm settings produces better 'weighted average' or better 'average' scores than the combination of the two under regular baseline settings (Table 5). Therefore the combination of choosing links by both statistical significance and correlation magnitude is still the better scheme of coexpression link selection. However, the results showed that the correlation distribution cut has a greater impact on performance overall (Section 3.5). In addition, the change in the 'weighted average' scores suggests that making the correlation distribution cut more stringent (0.1%) improves performance to a greater degree (DELTA = 0.14) than increasing the stringency (0.001) of the p-value threshold (DELTA = 0.0068) when compared to the performance under baseline settings (Table 5).
We relied on the Gene Ontology annotations as the primary standard to measure similarity in gene function in this thesis, because of their high coverage (18,512 human genes with GO annotations). The fact that the GO measure of functional similarity showed reasonable agreement with the other, lower-coverage metrics (Figure 4) suggests that our findings are not simply a function of particularities of GO annotations, though clearly there is overlap between the standards. This interdependence is one reason we introduced the MitoCarta standard: it is based on the observation in numerous studies that genes involved in mitochondrial function tend to be coexpressed [37, 38]. Thus recovery of links among MitoCarta genes is a sign of quality of coexpression analysis, though not functional prediction per se.

Our results suggest that the settings of the baseline algorithm could be adjusted to improve performance. A set of parameters we predict will result in better performance (without drastically reducing the number of predictions made) is summarized in Table 7. For several 'one color' datasets there are expression profiles that sometimes contain negative expression values or values equal to zero due to MAS5 data adjustment procedures. With our log-transformation procedure all these entries will be treated as missing values and this, coupled with an increase in the missing values threshold, will potentially remove a significant number of probes before any filtering according to low expression. Therefore, since we do not want to filter out expression profiles of low average expression level that may still produce useful functional predictions, we might want to decrease the filter stringency (Table 7). The results on the correlation distribution cut showed that making it more stringent (0.1%) produces a significant improvement in performance. However, it also reduces the link counts on average (Table 5).
To help ensure that useful coexpression links are being retained in this process, we suggest that the correlation distribution cut should be set to 0.5%, while still being coupled with the p-value threshold. While the summarized performance scores under the Spearman correlation metric do show some improvement over the use of the Pearson correlation metric (Table 5) in some analyses (SVD normalization, Baseline with p-value cutoff), they also obtain a larger standard deviation in their 'average' scores, due mostly to a reduction in average link counts. Thus we see little to be gained by switching to the rank correlation coefficient. Experiments running this new set of algorithm settings, or slight variations of it, on a collection of expression datasets are ongoing and will help determine the most favorable procedures and associated settings to implement in order to extract the highest number of biologically meaningful coexpression links from an expression dataset.

Table 7. New algorithm settings to achieve better performance

Parameter                        Setting
Correlation metric               Pearson correlation coefficient
Missing values filter            70% of samples must be present
Low variance filter              0% lowest removed
Low expression filter            20-30% lowest removed
P-value cutoff                   ≥ 0.01 removed
Correlation distribution cut     ≤ highest 0.5% of correlations removed
SVD normalization                Yes (only for 'two color' datasets)
Log-transformation               Yes (only for 'one color' datasets that are not log-transformed already)
Removing negative correlations   No

5  Conclusion

This thesis has uncovered valuable information on how the accuracy of gene functional predictions can be affected by the preprocessing procedures and link selection methods used in coexpression analysis. This thesis also demonstrates that the various features of any expression dataset, including data quality, can affect the biological significance of the coexpression links produced through this type of analysis.
There is still a great deal that could be learned from further experiments using procedures or methods not investigated in this thesis. We did not evaluate the effect on performance of additional similarity measures, such as Euclidean distance and mutual information. We also did not evaluate link selection methods based on coexpression network properties, which could reveal further influences on performance. This thesis may aid researchers in further analyses involving the detection of highly coexpressed gene pairs of biological significance.

6  Bibliography

1.  Wren, J.D., A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide. Bioinformatics, 2009. 25(13): p. 1694-701.
2.  Cho, R.J., et al., A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 1998. 2(1): p. 65-73.
3.  Stuart, J.M., et al., A gene-coexpression network for global discovery of conserved genetic modules. Science, 2003. 302(5643): p. 249-55.
4.  Tavazoie, S., et al., Systematic determination of genetic network architecture. Nat Genet, 1999. 22(3): p. 281-5.
5.  Reverter, A., et al., A gene coexpression network for bovine skeletal muscle inferred from microarray data. Physiol Genomics, 2006. 28(1): p. 76-83.
6.  Zhou, X., M.C. Kao, and W.H. Wong, Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci U S A, 2002. 99(20): p. 12783-8.
7.  Lee, H.K., et al., Coexpression analysis of human genes across many microarray data sets. Genome Res, 2004. 14(6): p. 1085-94.
8.  Voy, B.H., et al., Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol, 2006. 2(7): p. e89.
9.  Carter, S.L., et al., Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 2004. 20(14): p. 2242-50.
10. Mao, L., et al., Arabidopsis gene co-expression network and its functional modules. BMC Bioinformatics, 2009. 10(1): p. 346.
11. Elo, L.L., et al., Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics, 2007. 23(16): p. 2096-103.
12. Eisen, M.B., et al., Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 1998. 95(25): p. 14863-8.
13. Butte, A.J. and I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 2000: p. 418-29.
14. Prieto, C., et al., Human gene coexpression landscape: confident network derived from tissue transcriptomic profiles. PLoS One, 2008. 3(12): p. e3911.
15. Wolfe, C.J., I.S. Kohane, and A.J. Butte, Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics, 2005. 6: p. 227.
16. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
17. Tu, Y., G. Stolovitzky, and U. Klein, Quantitative noise analysis for gene expression microarray experiments. Proc Natl Acad Sci U S A, 2002. 99(22): p. 14031-6.
18. Quackenbush, J., Microarray data normalization and transformation. Nat Genet, 2002. 32 Suppl: p. 496-501.
19. Alter, O., P.O. Brown, and D. Botstein, Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A, 2000. 97(18): p. 10101-6.
20. Hibbs, M.A., et al., Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007. 23(20): p. 2692-9.
21. Breitling, R., Biological microarray interpretation: the rules of engagement. Biochim Biophys Acta, 2006. 1759(7): p. 319-27.
22. Hackstadt, A.J. and A.M. Hess, Filtering for increased power for microarray data analysis. BMC Bioinformatics, 2009. 10: p. 11.
23. Edgar, R., M. Domrachev, and A.E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 2002. 30(1): p. 207-10.
24. Allander, S.V., et al., Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Res, 2001. 61(24): p. 8624-8.
25. Luo, J., et al., Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Res, 2001. 61(12): p. 4683-8.
26. Ma, X.J., et al., Gene expression profiles of human breast cancer progression. Proc Natl Acad Sci U S A, 2003. 100(10): p. 5974-9.
27. Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 2001. 7(6): p. 673-9.
28. Jazaeri, A.A., et al., Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. J Natl Cancer Inst, 2002. 94(13): p. 990-1000.
29. Dhanasekaran, S.M., et al., Delineation of prognostic biomarkers in prostate cancer. Nature, 2001. 412(6849): p. 822-6.
30. van 't Veer, L.J., et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002. 415(6871): p. 530-6.
31. Ramaswamy, S., et al., Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A, 2001. 98(26): p. 15149-54.
32. Best, D.J. and D.E. Roberts, Algorithm AS 89: The upper tail probabilities of Spearman's rho. Journal of the Royal Statistical Society, Series C (Applied Statistics), 1975. 24(3): p. 377-379.
33. Mistry, M. and P. Pavlidis, Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 2008. 9: p. 327.
34. Lynn, D.J., et al., InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol, 2008. 4: p. 218.
35. Kanehisa, M. and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.
36. Pagliarini, D.J., et al., A mitochondrial protein compendium elucidates complex I disease biology. Cell, 2008. 134(1): p. 112-23.
37. Hibbs, M.A., et al., Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol, 2009. 5(3): p. e1000322.
38. Mootha, V.K., et al., Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A, 2003. 100(2): p. 605-10.
