Knowledge discovery from large-scale biological networks and their relationships. Zhang, Xi 2010

KNOWLEDGE DISCOVERY FROM LARGE-SCALE BIOLOGICAL NETWORKS AND THEIR RELATIONSHIPS

by Xi Zhang

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2010

© Xi Zhang, 2010

Abstract

The ultimate aim of postgenomic biomedical research is to understand the mechanisms of cellular systems systematically. It is therefore necessary to examine various biomolecular networks and to investigate how the interactions between biomolecules determine biological functions within cellular systems. Rapid advancement in high-throughput techniques provides us with increasing amounts of large-scale data sets that can be transformed into biomolecular networks. Analyzing and integrating these biomolecular networks have become major challenges. I approached these challenges by developing novel methods to extract new knowledge from various types of biomolecular networks.

Protein-protein interactions and domain-domain interactions are extremely important in a wide range of biological functions. However, the interaction data are incomplete and inaccurate due to experimental limitations. Therefore, I developed a novel algorithm to predict interactions between membrane proteins in yeast based on the protein interaction network and the domain interaction network. In addition, I developed a novel algorithm, a gram-based interaction analysis tool (GAIA), to identify interacting domains by integrating the protein primary sequences, the domain annotations and interactions, and the structural annotations of proteins. Biological assessment against several metrics indicated that both algorithms were capable of satisfactory performance, facilitating the elucidation of the cell interactome.

Predicting biological pathways is one of the major challenges in systems biology.
I proposed a novel integrated approach, called Pandora, which used network topology to predict biological pathways by integrating four types of biological evidence (protein-protein interactions, genetic interactions, domain-domain interactions, and semantic similarity of GO terms). I demonstrated that Pandora achieved better performance compared to other predictive approaches, allowing the reconstruction of biological pathways and the delineation of cellular machinery in a systematic view. Finally, I focused on investigating biological network perturbations in diseases. I developed a novel algorithm to capture highly disturbed sub-networks in the human interactome as the signatures linked to cancer outcomes. This method was applied to breast cancer and yielded improved predictive performance, providing the possibility to predict the outcome of cancers based on “network-based gene signatures”. These methods and tools contributed to the analysis and understanding of a wide variety of biological networks and the relationships between them.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations and Acronyms
Acknowledgements
Dedication
Co-authorship Statement
1 Introduction
1.1 Background and significance
1.2 Biological networks
1.2.1 The definition of biological networks
1.2.2 Types of biological networks
1.2.2.1 Protein-protein interaction network
1.2.2.2 Genetic interaction network
1.2.2.3 Gene expression network
1.2.2.4 Metabolic network
1.2.2.5 Signaling network
1.3 Basic network nomenclature
1.3.1 Degree
1.3.2 Degree distribution
1.3.3 Scale-free networks
1.3.4 Clustering coefficient
1.4 Computational methods for protein-protein interaction prediction from biological networks
1.5 Computational approaches to predict biological pathways from biological networks
1.6 Biological networks as signatures of human diseases
1.7 Thesis overview and chapter objectives
1.8 References
2 A New Approach to Predict Interactions Between Integral Membrane Proteins in Yeast
2.1 Introduction
2.2 Materials and methods
2.2.1 Obtaining a list of putative membrane proteins in yeast
2.2.2 Integration of protein-protein interaction data in yeast
2.2.3 Retrieving domain-domain interaction data in yeast
2.2.4 Creating an experimentally validated data set or “gold-standard” data set
2.2.5 A model for scoring protein-protein interactions and domain-domain interactions
2.2.6 ROC curve
2.3 Results and Discussion
2.3.1 Properties of the interactions between integral membrane proteins
2.3.2 The interactome map of integral membrane proteins in yeast
2.3.3 Properties of the domain-domain interactions between integral membrane proteins
2.3.4 Comparison between other large-scale data sets
2.4 Conclusion
2.5 References
3 GAIA: a gram-based interaction analysis tool - an approach for identifying interacting domains in yeast
3.1 Background
3.2 Methods
3.2.1 The GAIA algorithm (Figure 3.1)
3.2.2 Data set collection
3.2.3 Evaluation of the GAIA algorithm
3.2.4 Data and program availability
3.3 Results and discussion
3.3.1 Performance of the GAIA algorithm
3.3.2 Parameters of the GAIA algorithm
3.3.3 Case studies on predicted DDIs
3.3.4 Detecting new DDI-mediated PPIs and unknown domains
3.3.5 Characterizing over-represented gram pairs
3.3.6 Comparison between different approaches
3.4 Conclusions
3.5 References
4 Pandora, a PAthway and Network DiscOveRy Approach based on common biological evidence
4.1 Introduction
4.2 Methods
4.2.1 Data sources
4.2.2 Gene ontology similarity scores
4.2.3 Data integration to a weighted biological network
4.2.4 Pathway finding algorithm
4.2.5 Evaluation of the algorithm (Adjusted Rand Index)
4.2.6 Network Randomization
4.3 Results and discussion
4.3.1 Parameter tuning
4.3.2 Summary statistics of identified pathways
4.3.3 Validation of our approach
4.3.4 Comparison between different approaches
4.3.5 Biological examples of predicted pathways
4.3.6 Revealed redundant pathways
4.4 Conclusion
4.5 References
5 A Novel Approach to Predict Cancer Outcomes Based on the Relationship between Protein Structural Information and Protein Networks
5.1 Introduction
5.2 Materials and methods
5.2.1 Data set collection
5.2.2 Gene signature finding algorithm
5.2.3 Calculation of neighboring gene expression profiling score
5.2.4 Construction of the naïve Bayes classifier
5.3 Results and Discussion
5.3.1 Parameter tuning and validation on breast cancer data
5.3.2 The identified biomarkers may be involved in carcinogenesis
5.3.3 Somatic mutations increase the accuracy of our approach
5.3.4 A list of over-represented domains that tend to disrupt the protein interactions network
5.3.5 Comparison between approaches
5.3.6 The robustness of our approach
5.4 Conclusion
5.5 References
6 Conclusions
6.1 Summary
6.2 Knowledge discovery on the basis of networked data
6.2.1 The prediction of protein-protein interactions and domain-domain interactions from noisy and incomplete high-throughput data
6.2.2 The reconstruction of biological pathways from a large set of biomolecular interactions
6.2.3 The identification of active sub-networks associated with the dynamic behaviors of biosystems
6.3 Limitations of computational studies on networks
6.4 Future directions of biomolecular network analysis
6.5 References
Appendices
Appendix A Detecting protein-domains DNA-Motifs association in Saccharomyces cerevisiae regulatory networks
Appendix B Quantum dot conjugates for targeted silencing of bcr/abl gene by RNA interference in human myelogenous leukemia K562 cells
Appendix C New perspectives in predicting membrane protein-protein interactions

List of Tables

Table 1.1 A list of currently available protein-protein interaction data derived from experimental approaches in a variety of organisms.
Table 1.2 A list of commonly used open access metabolic pathway databases.
Table 1.3 A list of commonly used open access signaling pathway databases.
Table 2.1 A list of statistically significant (P-value < 0.05, Z-test) domains among integral membrane proteins in yeast.
Table 3.1 A list of the most frequent gram pairs in our domain-domain interaction data set.
Table 4.1 A list of identified pathways in this study.
Table 4.2 Summary Statistics of Topological Properties of Source PPI Network and Pathway Network.
Table 4.3 A list of discovered redundant pathway pairs in yeast.
Table 5.1 A list of over-represented domains within the ‘singlish-interface’ proteins.
Table 5.2 Feature comparison between different approaches.

List of Figures

Figure 1.1 Five major types of biological networks.
Figure 1.2 An example of a scale-free network.
Figure 2.1 Curve of receiver operating characteristics (ROC) plotted by the different cutoff values when tested against the gold-standard data set.
Figure 2.2 The number of common interaction partners of gene pairs in our predicted interactions, the gold-standard data set and five random data sets.
Figure 2.3 2D scatter plot with marginal histograms showing the correlation between the frequency of the occurrence of one domain-domain interaction (on the X-axis) and the number of involved integral membrane proteins (on the Y-axis).
Figure 2.4 Comparison of the prediction results from three large-scale methods.
Figure 3.1 The general flowchart of the GAIA algorithm.
Figure 3.2 The performance of the GAIA algorithm using different length gram pairs.
Figure 3.3 3D structure of the interaction between RPB1/YDL140C and PRB2/YOR224C.
Figure 4.1 3D PPV performance plot tested on different combinations of the threshold of confidence scores (c) and the threshold of topological similarity scores (s).
Figure 4.2 3D recall rate performance plot tested on different combinations of the threshold of confidence scores (c) and the threshold of topological similarity scores (s).
Figure 4.3 Distribution of pathway sizes of different approaches.
Figure 4.4 Comparison between different approaches based on PPV scores tested on Reactome, KEGG and BioCyc pathway annotations. A bar plot demonstrates the performance of each approach tested on three pathway annotations.
Figure 4.5 An example of identified pathways by our approach.
Figure 4.6 The redundant pathway organization in S. cerevisiae. The redundant pathway organization in yeast was generated from discovered pathway pairs.
Figure 4.7 The redundant pathway network in yeast showing at the detail of PPIs.
Figure 5.1 A schematic view of a ‘singlish-interface’ protein and a ‘multiple-interface’ protein.
Figure 5.2 The performance of our approach using different thresholds of domain index scores (Sd).
Figure 5.3 A network of 171 gene signatures identified in the breast cancer data set using our approach.
Figure 5.4 Predictive performance comparison between different approaches.

List of Abbreviations and Acronyms

AUC  Area Under the Curve
DDI  Domain-domain Interaction
EM  Expectation Maximization
GI  Genetic Interactions
GO  Gene Ontology
MLE  Maximum Likelihood Estimation
MS  Mass Spectrometry
NPV  Negative Predictive Value
PPI  Protein-protein Interactions
PPV  Positive Predictive Value
ROC  Receiver Operating Characteristic
SGA  Synthetic Genetic Array
SLAM  Synthetic Lethal Analysis by Microarray
SVM  Support Vector Machine
TAP  Tandem Affinity Purification
TF  Transcription Factor
Y2H  Yeast Two-hybrid System

Acknowledgements

It has truly been an honor to work with so many outstanding people during my PhD journey. First and foremost, I would like to express my deepest gratitude to my thesis supervisor, Francis Ouellette, for the opportunity to work in his group at the UBC Bioinformatics Centre (UBiC) and the Ontario Institute for Cancer Research (OICR), and for his generosity, support, encouragement, and guidance during my studies. He taught me not only scientific judgment, but also invaluable wisdom about life.

I also particularly thank Drs. Artem Cherkasov, Leonard Foster and Wyeth Wasserman for serving on my thesis committee. They have been extremely helpful in all my research projects. Their comments and suggestions have been the driving force in my research.

I have really appreciated the opportunities to collaborate with excellent researchers, including Yu Zhao and Nawar Malhis. I have enjoyed and appreciated the friendship and help of my fellow graduate students, among them Yvonne Li, Hao Hao, Leon French, Warren Cheung and Xiaohui Chen. The research described in this thesis would not have been possible without the help and advice of numerous others at UBiC and OICR, including Stefanie Butland, John Ling, Michelle Brazas, Quang Trinh, Joseph Yamada, Paul Boutros and Victor Gu, as well as Judy Ramai, who helped me with administrative issues.

Finally, I am grateful for salary and travel funding from the Canadian Institutes of Health Research, the Ontario Institute for Cancer Research, the UBC Bioinformatics Graduate Program and the CIHR/MSFHR Strategic Training Program in Bioinformatics.

Dedication

I would like to dedicate this PhD thesis to my parents, Zhenru Zhang and Wenying Xu, for bringing me into this world and for their unconditional love and encouragement; and to my sister, Lei Zhang, for her continuing support all these years. I would also like to thank Li Zhang, for more than words can express.

Co-authorship Statement

The work presented in this Ph.D. thesis was in part due to collaboration with other researchers. Those involved in the research are listed in the publication citations in each of the chapters. Their individual contributions to the research are outlined below.

Chapter 2: I collected membrane protein data and protein interaction data, developed and implemented the predictive model to score protein-protein interactions and domain-domain interactions in networks, performed the analysis on domain interactions in membrane proteins, and drafted and revised the manuscript. Francis Ouellette helped in the conception and the design of this study, and revised the manuscript.

Chapter 3: I collected protein interaction data, domain interaction data and protein sequence data. I designed and developed the interacting domain prediction (GAIA) algorithm, and performed the analyses on over-represented 4-gram pairs in the domain interaction data set and the comparison to previous approaches. I also drafted and revised the manuscript. Francis Ouellette helped in the conception and the design of this study, and revised the manuscript.

Chapter 4: I acquired and analyzed protein interaction data, genetic interaction data and domain interaction data. I implemented the algorithm to calculate semantic similarity scores of GO terms. I designed and implemented the pathway discovery tool (Pandora).
I performed the analyses on identifying redundant pathways and the comparison to different pathway databases. I also drafted and revised the manuscript. Francis Ouellette helped in the conception and the design of this study, and revised the manuscript.

Chapter 5: I collected protein interaction data, domain interaction data, somatic mutation data and gene expression data. I designed and implemented the method to identify gene signatures associated with cancer outcomes. I also designed and implemented the predictive approach to classify different cancer outcomes. I performed the analyses on the effect of somatic mutations located in domains and the comparison to different predictive approaches. I also drafted and revised the manuscript. Francis Ouellette helped in the conception and the design of this study, and revised the manuscript.

Appendix A: I performed the literature review, carried out the resultant domain analyses, and drafted the manuscript. Nawar Malhis designed the study, implemented the algorithm, and drafted and revised the manuscript. Francis Ouellette revised and reviewed the manuscript.

Appendix B: I developed a tool to predict stranded siRNA sequence in the experiment. Yu Zhao conceived and designed the study, and revised the manuscript.

Appendix C: I performed the literature review, and drafted and revised the manuscript. Francis Ouellette revised and reviewed the manuscript.

1 Introduction

1.1 Background and significance

The recent development of high-throughput techniques has put the study of biological networks in the spotlight. Biological functions of individual molecules were traditionally studied based on reductionism, a scientific paradigm that studies each aspect of life separately at the small-scale level [1], and which has undoubtedly led to a tremendous number of discoveries.
More and more lines of evidence, however, indicate that most biological phenomena are not caused by the functions of each molecule alone; instead, they arise from complex interactions between different components of the cell, such as DNA, RNA, proteins and other small molecules. As a new interdisciplinary field, systems biology aims to study complex interactions in biological systems in an integrative and systematic way [2]. A key challenge in systems biology today is to understand how biological events in a living cell are regulated by diverse biological networks consisting of interactions between individual cellular constituents.

The analysis of biological networks is of great importance to biologists for several reasons. First, most basic activities of a cell can be studied as information flow in various types of networks. For example, the regulation of transcription can be revealed by gene expression networks. Since gene expression is controlled by specific transcription factors (TFs), regulatory networks that contain gene expression data can provide information about the relationships between genes and their transcription factors. Moreover, interactions within protein-protein interaction (PPI) networks play important roles in cellular functions, and these interactions are the foundation of almost all biochemical processes. In addition, genetic interaction networks can shed light on the pathway organization within a living cell.

Second, the architecture of many biological information systems can be represented by different types of networks. For instance, the Gene Ontology (GO) project provides a human-curated vocabulary of controlled terms that describes gene product characteristics and gene product annotation data in a hierarchical tree [3]. Although curation may introduce subjectivity in which terms are covered, it does offer a standard that has proved to be very useful in a number of curation activities [4].
Such hierarchical organization is regarded as one special type of network. Another example is the Reactome database, a manually curated core human biological pathway database, which is also organized in a hierarchical way [5, 6].

Third, network theory is widely used in other scientific fields. A number of network-based algorithms have been proposed to infer new information from various complex technological and social networks, including the Internet, the ecological food web and our society [7-10]. As technological and social networks possess architectural features similar to those of biological networks, it is reasonable to speculate that the methodologies used in analyzing such networks are applicable to analyzing biological networks and subsequently delineating the intricate functional associations that regulate cellular functions.

Finally, incorporating diverse biological networks helps us obtain new information and generate a broad view of the cell’s functional organization. Each type of biological network by itself reveals only a small piece of information about the relationships between the genes within that system of networks. Our understanding of the relationships between genes will be more complete if other biological networks provide supportive data. Previous studies have identified novel genetic interactions by integrating diverse networks [11-13]. In addition, biological network integration has contributed to numerous clinical applications, including disease diagnosis [14-16] and the development of new drugs [17-19].

1.2 Biological networks

1.2.1 The definition of biological networks

Biological networks maintain the biological processes and molecular functions of a living cell through the collaborative efforts of individual components in the cell, such as DNA, mRNA, proteins and other small and large molecules.
Depending on the properties of the molecules participating in the network, biological networks can be categorized into different types. For example, at the DNA level, transcription factors (TFs) can regulate the process of transcription. The relationship between TFs and their regulated genes constitutes a gene regulatory network. Similarly, at the protein level, proteins can cause posttranslational modifications of other proteins, or form protein complexes and pathways together with other proteins. Such local associations between protein molecules are called protein-protein interactions (PPIs), which are the constituents of a protein interaction network. The biochemical reactions in cellular metabolism can likewise be integrated into a metabolic network whose fluxes are regulated by the enzymes that catalyze the reactions. In many cases, the interactions at different levels can be integrated into a signaling network. For example, signals from the exterior of a cell are first relayed to the inside of that cell by a cascade of protein-protein interactions among the signaling molecules. Then, both biochemical reactions and transcriptional regulation involving protein-DNA interactions trigger the expression of some genes in response to the signals.

Biological networks are organized in a hierarchical manner in terms of network structure (Figure 1.1). The basic unit of a biological network, and its first level, is the individual molecule. Any interaction between a pair of molecules is the second level. A number of interactions can make up local structures such as network motifs, clusters and pathways, which form the third level. The final level is the global network. Therefore, from the viewpoint of systems biology, a living cell is supported by global biological networks in which each local structure performs specific biological functions via the interactions between the molecules involved in the structure.

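The four structural levels described above map naturally onto a graph data structure. The following is a minimal sketch in Python and is not taken from the thesis: the molecule names and the helper functions `build_network`, `degree` and `triangles` are hypothetical, and a three-node clique is used only as the simplest stand-in for a "local structure".

```python
from collections import defaultdict


def build_network(interactions):
    """Assemble the global network (level 4) from pairwise interactions (level 2).

    Each molecule (level 1) becomes a key; its value is the set of partners.
    Interactions are assumed to be undirected.
    """
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def degree(network, molecule):
    """Number of interaction partners of one molecule."""
    return len(network.get(molecule, set()))


def triangles(network):
    """Find a simple local structure (level 3): three mutually interacting molecules."""
    found = set()
    for a, neighbours in network.items():
        for b in neighbours:
            for c in network[b]:
                if c in neighbours and len({a, b, c}) == 3:
                    found.add(frozenset((a, b, c)))
    return found


# Hypothetical toy data, for illustration only.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
network = build_network(interactions)
```

With this toy input, `degree(network, "P3")` counts the partners of P3, and `triangles(network)` returns the single fully connected triple. Real motif or pathway detection would need considerably more than this.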
1.2.2 Types of biological networks

Depending on the types of participating molecules, biological networks can be categorized into the following five types: 1) transcription regulatory network (TF-DNA interactions); 2) gene regulatory network (genetic interactions); 3) protein interaction network (PPIs); 4) metabolic network (enzyme-substrate interactions); 5) signaling network. The specific types of networks used for the analyses in this thesis are described as follows:

Figure 1.1 Five major types of biological networks. A hierarchical structure of the five major types of biological networks. The left column shows a schematic view of the biomolecular interactions within the five major types of biological networks illustrated in the right column.

1.2.2.1 Protein-protein interaction network

Protein-protein interactions play a critical role in a wide variety of cellular processes. For example, the signal transduction process involves a cascade of interactions between signaling proteins that transfer signals from the exterior to the interior of a cell. Protein-protein interactions participate in the assembly of the structural compartments of a cell, such as the cytoskeleton and the nuclear pore, and of large protein complexes such as DNA polymerase and the proteasome. Therefore, studying PPIs from a network perspective helps us to understand how proteins work collaboratively to exert a specific biological function.

Currently, yeast two-hybrid (Y2H) [20, 21] and tandem affinity purification (TAP) followed by mass spectrometry [22-25] are the two mainstream experimental techniques to identify PPIs on a large scale. Many of the existing PPI databases (reviewed in [26, 27]) identify and annotate “small-scale” PPIs. Although this remains controversial, the evidence suggests that all types of PPI identification are valid; the data are weakest where only a single publication reports the interaction [27, 28].
In the yeast two-hybrid system, a bait protein (a protein of interest) fused to a DNA binding domain is tested against a prey protein fused to an activation domain. The transcription of a reporter gene is activated if the bait protein interacts with the prey protein, as the DNA binding domain binds to the promoter and the activation domain is responsible for the activation of transcription. Y2H is an in vivo technique, which takes place in the native environment of biological processes. Y2H can identify binary, weak or transient interactions.

An alternative way to identify protein-protein interactions is to add an antigen peptide tag onto a protein of interest and then express it in the cell. The tagged protein is expressed and assembled into its native protein complex. Protein complexes are then affinity-purified on the respective affinity matrix. After purification, proteins interacting with the tagged protein are analyzed and identified through SDS-PAGE followed by mass spectrometry. These approaches have identified a significant number of new PPIs, which makes it possible to build a more robust interactome of cells.

Although the two above-mentioned high-throughput techniques are powerful in identifying PPIs, they have limitations. Because Y2H by design tests pairs of proteins, it typically does not provide information about the members of a protein complex. Also, Y2H may have a bias against cytosolic proteins because the binding must occur in the nucleus (within the transcriptional apparatus). On the other hand, TAP/MS can detect protein complexes but lacks the capability to identify weak or transient interactions. It is important, therefore, to combine the PPIs identified by both techniques so that each compensates for the drawbacks and pitfalls of the other. Some investigators have demonstrated that many PPIs will never be detected with some of these methods [27, 28].
Large-scale PPI data sets are currently available for a variety of organisms (Table 1.1). These data sets have significantly facilitated the study of protein interaction networks.

Table 1.1 A list of currently available protein-protein interaction data derived from experimental approaches in a variety of organisms

Organism          Number of interactions             Experimental type     Publication
S. cerevisiae     854                                Y2H                   Uetz et al. 2000 [21]
S. cerevisiae     3,986                              Y2H                   Ito et al. 2001 [20]
S. cerevisiae     3,221 (spoke), 31,304 (matrix)     TAP/MS                Gavin et al. 2002 [23]
S. cerevisiae     3,589 (spoke), 25,333 (matrix)     HMS-PCI               Ho et al. 2002 [24]
S. cerevisiae     5,068                              DIP (small scale)     Salwinski et al. 2004 [29]
C. elegans        4,027                              Y2H                   Li et al. 2004 [30]
D. melanogaster   20,405                             Y2H                   Giot et al. 2003 [31]
H. sapiens        221                                TAP/MS                Bouwmeester et al. 2004 [32]
H. sapiens        10,534                             HPRD (small scale)    Prasad et al. 2009 [33]

Spoke interactions are interactions between the bait protein and other protein complex members. Matrix interactions are all possible interactions among all members in a complex. This table was modified from [34].

1.2.2.2 Genetic interaction network

For a given gene pair, if the phenotype caused by mutations in both genes (double mutant) differs from the combination of the phenotypes caused by a single mutation in each gene, this pair of genes is considered to genetically interact. Genetic interactions can be aggravating or alleviating phenotypically. For example, a genetic interaction is defined as a "suppression interaction" when the phenotype of the double mutant is less defective than the phenotype of the single mutants combined, and an "enhancement interaction" when the phenotype of the double mutant is more defective than the phenotype of the single mutants combined. There is a special type of genetic interaction named “synthetic lethal”, in which cells are viable when only one nonessential gene is mutated, but a combination of mutations in two nonessential genes causes cell lethality.
A synthetic lethal relationship suggests that these two genes are likely to be situated in parallel pathways that complement each other’s functionality. In this way, synthetic lethal interactions provide information regarding the organization of pathways. Great efforts have been put toward mapping the synthetic lethal interactions at a large scale in several model organisms.

There are three possible biological interpretations of the relationships between genetic interactions and biological pathways. 1) Between-pathway: genetic interactions that occur between pathways indicate that the involved gene products are respectively a part of redundant and parallel pathways that both perform a similar or a functionally related biological process. If the product of one gene is defective, the other pathway can compensate for it. 2) Within-pathway: genetic interactions that occur within a single pathway indicate that the two gene products are both part of a molecular complex that cannot function properly if both genes are mutated. 3) Synergistic pathway: two gene products may be unrelated to each other, but each deleted allele has a side effect and those side effects are synergistic. An example of this relationship is when two functionally unrelated genes are simultaneously mutated and together trigger a cellular stress response [35]. In light of these biological implications, understanding the details of genetic interactions becomes critical to the study of pathway organization.

The synthetic lethal interaction network of S. cerevisiae has been extensively studied using two general strategies. The first approach is called “synthetic genetic array” (SGA) analysis [36, 37]. In the SGA approach, a haploid yeast strain carrying the query gene is crossed systematically with a library of yeast deletion strains in an arrayed format. The resultant diploid yeast is then sporulated. The double mutant haploid progeny are selected because only they can grow under a certain condition.
Synthetic sick and synthetic lethal interactions are then identified by examining the resulting colonies of the double mutants manually or with the help of imaging software. The second technique is called “synthetic lethal analysis by microarray” (SLAM) [38]. In contrast to SGA, SLAM uses “molecular barcode” DNA sequences to uniquely tag each deletion strain. Double mutant strains are identified as the barcode DNA sequences of a query mutation hybridize to the barcode DNA sequences of one deletion strain on a microarray.

SGA and SLAM approaches both generate a large set of genetic interactions. Using the SGA approach, a total of 2,012 synthetic lethal interactions and 2,113 synthetic sick interactions in yeast have been found to be involved in diverse biological functions such as cell polarity, chromosome segregation and DNA synthesis and repair. Using the dSLAM approach (a modified version of SLAM) [39], a total of 4,956 synthetic fitness or lethality defects have been identified by querying 74 single deletion strains involved in DNA replication and repair. The current estimate for the number of genes in S. cerevisiae is approximately 6,000 [40]. By comparison, the gene coverage of known synthetic lethal interactions is approximately 32% in S. cerevisiae based on the estimated total number of genes. More genetic interaction data sets are therefore still needed, as only a small percentage of all possible gene pairs have ever been experimentally tested.

1.2.2.3 Gene expression network

DNA microarrays can be used to examine patterns of gene expression on a global scale. The DNA microarray technique is able to detect the mRNA expression levels of thousands of genes simultaneously.
Under appropriate experimental conditions, gene expression data can capture the changes of molecular states associated with cellular phenotypes under different conditions such as environmental perturbations, the effects of drug treatments and various “wellness” and disease states. Gene expression profiles can be represented as gene expression networks in which genes are connected by pair-wise associations such as the similarity of their gene expression profiles. As the adaptive response of cells to external stimuli involves gene expression variations in a coordinated manner, investigating information flow in the gene expression network can provide insights into how cellular processes generate dynamic associations in response to environmental perturbations.

Gene expression data sets are stored in publicly available databases such as the Stanford Microarray Database (SMD) [41], Gene Expression Omnibus [42] and ExpressDB [43]. These databases contain gene expression profiles of diverse organisms under different conditions, and are thus tremendously useful for the analysis of underlying biological processes.

1.2.2.4 Metabolic network

In a living cell, each unique biological function can be fundamentally understood as the interactions between its chemical constituents. The total set of all chemical reactions that participate in the process through which energy is generated and used is called metabolism. Metabolism is usually divided into two categories: 1) catabolism, which breaks down molecules into smaller units, for example to harvest energy in the cellular respiration chain; 2) anabolism, which uses energy to construct components of cells such as proteins and nucleic acids [44].

The chemical reactions of metabolism are organized into metabolic pathways in which a series of chemical reactions, catalyzed by a sequence of enzymes, transforms one chemical into another.
For example, the citric acid cycle (Krebs' cycle) is a pathway to generate GTP and other valuable intermediates through acetyl-CoA oxidation. Pathway organization is diverse. It can be a linear sequence of a few chemical reactions, a cycle or a mixed structure. In some cases, the starting product of a pathway can be the ending product of another pathway. For example, pyruvate produced by glycolysis is decarboxylated to acetyl-CoA, which then enters the citric acid cycle.

Individual pathways do not work alone. The complete set of metabolic pathways emerges as a complicated and interactive network, defined as a metabolic network, which includes all chemical reactions of metabolism and is organized by pathways. Currently, most metabolic pathway annotations are derived by manual curation and stored in pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [45] and BioCyc [46]. Table 1.2 lists some publicly available metabolic pathway databases.

Table 1.2 A list of commonly used open access metabolic pathway databases

Database                            Website
KEGG                                http://www.genome.jp/kegg/
Reactome                            http://www.reactome.org/
BioCyc                              http://www.biocyc.org
EcoCyc                              http://www.ecocyc.org
MetaCyc                             http://metacyc.org
MPB                                 http://www.gwu.edu/mpb
HumanCyc                            http://humancyc.org/
aMAZE                               http://www.amaze.ulb.ac.be/
WikiPathways                        http://wikipathways.org
Yeast consensus metabolic network   http://www.comp-sys-bio.org/yeastnet/
SGD Pathway Tools                   http://pathway.yeastgenome.org/
PC - Pathway Commons                http://www.pathwaycommons.org/pc/
MIPS CYGD                           http://mips.helmholtzmuenchen.de/genre/proj/yeast/index.jsp

1.2.2.5 Signaling network

Living cells utilize a variety of interacting chemicals and molecules to react and respond to internal and external signals from their cellular environment. Such signal propagation is called signal transduction.
Signal transduction plays a critical role in the cell communication system that governs basic cellular activities and coordinates cell actions [47].

A signaling pathway is defined as an ordered sequence of biochemical reactions (e.g., phosphorylation) for the purpose of communicating information from one place to another in the cell [48]. The mitogen-activated protein (MAP) kinase cascade is a typical example of a signaling pathway that transfers information from an externally activated cell surface receptor to the nucleus, where gene regulation can be modulated.

Different signaling pathways interact and communicate with each other, forming a complicated signaling network, which includes the complete set of signaling pathways that are involved in the information processing of the cell. However, fully understanding how signaling networks are organized is an extremely challenging task, as current experimental approaches are small-scale and focus only on a small part of the cell signaling pathways. As with metabolic pathways, most signaling pathway annotations are based on manual curation from the literature by biologists. Therefore, it is necessary to develop new computational methods to reconstruct signaling pathways by exploiting high-throughput genomic and proteomic data. Table 1.3 lists some signal transduction databases that hold experimentally determined signaling molecules and signaling pathways.
Table 1.3 A list of commonly used open access signaling pathway databases

Database      Website
KEGG          http://www.genome.jp/kegg/
Reactome      http://www.reactome.org/
SPAD          http://www.grt.kyushu-u.ac.jp/spad/index.html
aMAZE         http://www.amaze.ulb.ac.be/
NetPath       http://www.netpath.org/
SigPath       http://icb.med.cornell.edu/crt/SigPath/index.xml
MiST          http://genomics.ornl.gov/mist
TRANSPATH     http://www.biobase-international.com
CellML        http://www.cellmo.org/models

1.3 Basic network nomenclature

Many systems in real life can be displayed as graphic networks, in which each node represents an object and each edge represents the relationship between a pair of objects. For instance, all Air Canada flight routes can be displayed as a network in which each node represents a city and each edge represents a flight route. Likewise, in biological systems, components of cells such as DNA, RNA, proteins and small molecules can be represented as nodes, and the relationships between them, such as protein-protein interaction, genetic interaction and co-expression, can be represented as edges in a network. In order to characterize and compare various biological networks and to analyze the biological behaviors that are represented by these networks, it is necessary to establish quantifiable descriptions for these networks. Therefore, we define the following basic topological properties for a given graphic network (Figure 1.2) G = (V, E) where V is the node set and E is the edge set.

1.3.1 Degree

Degree is the most basic property of a node within a network. Given a node X, the degree of X is the number of connections (edges) linking node X and other nodes within a network. For example, the degree of node X in Figure 1.2A is 6. In directed networks, there are two types of degree: incoming and outgoing degree. Incoming degree is the number of connections (edges) from other nodes to node X. Outgoing degree is the number of connections (edges) from node X to other nodes.
For example, the incoming and outgoing degree of node X in Figure 1.2B are 1 and 5, respectively. The average degree of the whole network is the summation of the degrees of all individual nodes divided by the number of nodes in the network.

Figure 1.2 An example of a scale-free network.

1.3.2 Degree distribution

The degree distribution P(k) indicates the probability that a selected node has the degree of k (k = 1, 2, 3…) over the whole network. It can be calculated by counting the number of nodes with the degree of k in the network and then dividing it by the total number of nodes in the network. Degree distribution reflects different classes of networks. For example, a random network has a Poisson degree distribution, whereas most networks in the real world such as the Internet, the world wide web, and some social networks have a power-law degree distribution.

1.3.3 Scale-free networks

Most biological networks are scale-free networks as their degree distribution follows a power law: $P(k) \sim k^{-\gamma}$, where $\gamma$ is a constant usually in the range of 2 to 3 ($2 < \gamma < 3$). A network with such a degree distribution has a small fraction of nodes with degrees that greatly exceed the average, whereas most nodes have only a few connections (Figure 1.2). Those nodes with the highest degrees are often called “hubs”.

The power law distribution of scale-free networks reflects two fundamental processes involved in the development of real-life networks such as biological networks and social networks [49]. The first process is called growth, in which the network grows in a continuous-time branching scheme so that new nodes are added to the system over an extended time period. The second process is called preferential attachment, in which nodes display a tendency to link with nodes that already have many connections. Such a degree distribution is important for maintaining the network topology even if a few nodes are removed from the network [50].
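The degree-based quantities defined in Sections 1.3.1 and 1.3.2 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy star-shaped network, the node names and all function names are hypothetical and are not taken from Figure 1.2 or from any software used in this thesis. The sketch also assumes that there are no self-loops and that every node has at least one edge, since isolated nodes never appear in an edge list.

```python
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected network given as (u, v) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def in_out_degrees(directed_edges):
    """Incoming and outgoing degree of every node for (source, target) pairs."""
    indeg, outdeg = Counter(), Counter()
    for source, target in directed_edges:
        outdeg[source] += 1
        indeg[target] += 1
    return indeg, outdeg


def average_degree(deg):
    """Sum of all node degrees divided by the number of nodes."""
    return sum(deg.values()) / len(deg)


def degree_distribution(deg):
    """P(k): fraction of nodes whose degree equals k."""
    n = len(deg)
    counts = Counter(deg.values())
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical toy network: a hub X linked to six other nodes.
undirected = [("X", name) for name in "ABCDEF"]
# Directed variant: one edge points into X, five edges point out of X.
directed = [("A", "X")] + [("X", name) for name in "BCDEF"]

deg = degrees(undirected)
indeg, outdeg = in_out_degrees(directed)
dist = degree_distribution(deg)

print(deg["X"], indeg["X"], outdeg["X"])
print(average_degree(deg))
print(dist)
```

In this toy network the hub has degree 6 while every other node has degree 1, so P(1) = 6/7 and P(6) = 1/7. A real biological network would need many more nodes before its distribution could be meaningfully compared with a power law.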
1.3.4 Clustering coefficient

In real-life networks, many links are transitive. For example, as seen in Figure 1.2, if node X is connected to node Y, and node Y is connected to node Z, then it is very likely that node X also has a direct link to node Z. The clustering coefficient C is utilized to capture this phenomenon [51]. The clustering coefficient of a node X can be calculated by the equation $C_X = 2 n_X / (k (k - 1))$, where $n_X$ is the number of links between the neighboring nodes of node X and k is the degree of node X. It is therefore the ratio of the number of existing links between a node’s neighbors to the maximum number of possible links between them. For example, the clustering coefficient of node X in Figure 1.2 is 0.7. The clustering coefficient of a network is the summation of the individual coefficients of all nodes in the network divided by the number of nodes in the network. The clustering coefficient indicates the completeness of a network.

1.4 Computational methods for protein-protein interaction prediction from biological networks

Owing to the current emergence of high-throughput technologies, protein-protein interactions in different organisms have been identified on a large scale, making it possible to elucidate basic biological mechanisms by studying the interactome, which can be defined as the complete set of protein-protein interactions in terms of proteomics. However, the currently available protein-protein interaction data are still far from complete and contain many false positives [52] due to the intrinsic limitations of experimental approaches. In addition, experimental methods are often biased toward proteins with specific cellular localization, resulting in low coverage [53-55]. Consequently, it is necessary to develop computational methods that are capable of filtering existing noisy protein interaction data and predicting undiscovered protein-protein interactions.
Results from such tools provide us with testable hypotheses for experimental screening or validation.

A wide range of computational methods predicts protein-protein interactions. Based on the types of training data, current predictive approaches can be categorized into the following seven classes: (1) Gene neighbor and gene cluster methods [56-58]. Such methods are based on the fact that co-regulated genes tend to share physical locations (i.e., they may be co-transcribed in operons). (2) Gene fusion methods, in which a pair of interacting proteins in one genome are fused into a single protein in another genome [59-61]. (3) Phylogenetic profile methods, under the assumption that functionally linked and potentially interacting non-homologous proteins tend to co-evolve and their orthologs tend to appear in the same subset of organisms [62-64]. A phylogenetic profile is constructed for each protein as a binary vector indicating the presence of this protein in different organisms. High similarity between the phylogenetic profiles of a protein pair suggests an interaction. (4) Co-evolution methods, in which a protein interaction can be measured quantitatively by calculating the correlation coefficient between the phylogenetic trees (defined as distance matrices) of two non-homologous interacting protein families [65-67]. (5) Primary sequence methods, in which sequence features such as residue frequencies, residue pairing preferences and sequence profiles are employed to predict protein interactions [68-70]. (6) 3D protein structure methods, which assess the fitness of two potential interacting proteins from the complexes within known 3D structures or protein-protein docking models [71-73]. (7) Network topology methods, in which interacting proteins are identified based on their topological properties such as similar neighborhood [74-76] and highly connected clusters [77-79] in protein-protein interaction networks.
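As an illustration of the phylogenetic profile idea in class (3) above, the following Python sketch compares binary presence/absence vectors and reports the protein pairs whose profiles are highly similar. It is a minimal sketch under stated assumptions, not the method of any cited study: the text does not specify a similarity measure, so the Jaccard index is used here as one plausible choice, and the protein names, profiles, function names and the 0.7 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(profile_a, profile_b):
    """Jaccard similarity of two binary presence/absence profiles.

    Each profile lists 1 (ortholog present) or 0 (absent) for the same
    ordered set of organisms. The choice of Jaccard is an assumption.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same organisms")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def candidate_pairs(profiles, threshold):
    """Protein pairs whose profile similarity reaches the threshold."""
    hits = []
    for name_a, name_b in combinations(sorted(profiles), 2):
        score = jaccard(profiles[name_a], profiles[name_b])
        if score >= threshold:
            hits.append((name_a, name_b, score))
    # Most similar pairs first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical profiles over six organisms (illustrative values only).
profiles = {
    "P1": [1, 1, 0, 1, 0, 1],
    "P2": [1, 1, 0, 1, 0, 0],
    "P3": [0, 0, 1, 0, 1, 0],
}

pairs = candidate_pairs(profiles, threshold=0.7)
print(pairs)
```

Here P1 and P2 share three of the four organisms in which either occurs, giving a similarity of 0.75, whereas P3 never co-occurs with the other two. Published methods typically use many more genomes and may prefer other measures, such as mutual information or Hamming distance.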
As the basic functional units of proteins, domains can interact with each other and thereby mediate the related protein-protein interactions [80, 81]. A variety of approaches, therefore, have been developed to predict protein-protein interactions at the domain level. In such methods, domain-domain interactions are first predicted from protein interaction networks and then novel protein interactions are identified on the basis of the previously predicted domain-domain interactions. These approaches can be categorized into three general classes: (1) Association methods, in which a log-odds scoring model is used to identify interacting domains. This model distinguishes interacting protein pairs from non-interacting ones [82-84]. (2) Bayesian network models and maximum likelihood estimation (MLE) methods, in which MLE maximizes the probability of interactions of all putative domain pairs in a Bayesian network [85, 86]. (3) Integrative methods, in which interacting domains are identified using not only protein-protein interaction and domain annotation data, but also their primary sequences [87].

1.5 Computational approaches to predict biological pathways from biological networks

Biological pathways can be described as a defined group of biological entities that are organized in a specified order and perform a specified biological task or function [88]. In a living cell, a complex system of intricate and interconnected biological pathways maintains the cell’s viability. These biological pathways include interactions between many different biological components such as genes, proteins and small molecules. The inputs and outputs of various pathways are separated, directed and organized in a highly sophisticated manner. Our understanding of how each biological pathway works and interacts with other pathways is far from complete. Currently, most pathway annotations are derived by manual curation from the literature and stored in pathway databases.
There is, therefore, a growing demand for the development of computational tools that analyze existing biological pathways and predict new ones. These tools could help us reveal the participants of biological processes.

There are basically three types of pathway predictions: metabolic, signaling and functional.

Most approaches for predicting metabolic pathways seek to identify a specific set of reactions and compounds that are active in a metabolic pathway from databases that include information about enzymes, compounds and reactions. McShan and coworkers developed a heuristic algorithm to search active metabolic pathways using a biochemical state space in which each compound represented a state and each transformation between compounds represented a state transition [89]. Another group used classical graph theory to find pathways. In their method, each compound was represented as a node and was weighted by the number of reactions it was involved in. Subsequently, a path-finding algorithm was introduced to search putative pathways by minimizing the weight sum between a pair of source and target reactions [90]. More recently, an optimization approach was proposed to identify pathways based on biochemical constraints and objective functions [91].

As signal transduction usually involves interactions between proteins in the same biological process, most signaling pathway prediction methods are based on protein interaction data and gene expression data. Steuer et al. compiled a set of putative pathways from protein interaction networks and scored them by comparing them to the gene clusters derived from gene expression data [92]. A similar approach used information about the order of the components in pathways [93]. Scott et al. developed a sophisticated method using a weighted protein interaction network [94].

A functional pathway reveals the relevance of the members to the pathway.
Protein interaction data and genetic interaction data are the major data sources used to infer functional pathways. Kelley and Ideker developed a log-odds scoring model that identified 360 pathway pairs and 401 pathways in yeast by incorporating physical interactions and genetic interactions (synthetic-lethal and synthetic-sick interactions) [95]. Their study provides a starting point to reveal pathway organization and function from high-throughput data. Ulitsky and Shamir proposed a modified methodology based on Kelley and Ideker’s approach [96]. Instead of employing both physical interactions and genetic interactions, Ma and colleagues designed a method using synthetic lethal interactions alone. They identified 2,590 pathway pairs and 5,180 pathways in yeast by searching approximately complete bipartite graphs within the synthetic lethal interaction network [97]. In a recent publication, Brady and colleagues introduced a novel approach that discovered 602 pathways and 1,510 pathway pairs by searching stable bipartite subgraphs on two different versions of the genetic interaction network [98]. In order to improve the accuracy and the coverage of pathway discovery, I propose a novel pathway discovery approach, described in Chapter 4, that integrates heterogeneous biological evidence.

1.6 Biological networks as signatures of human diseases

A deluge of data from a variety of experimental approaches makes it possible to better understand the biomolecular networks underlying particular phenotypes associated with human diseases. Diverse types of networks have been constructed to predict specific disease-related genes. The advancement on this front is providing feasible and efficient solutions for the diagnosis and prognosis of diseases.

Gene expression data is widely used to identify gene signatures associated with the onset and the progression of disease.
Two groups identified approximately 70 gene markers that could predict the metastasis of breast cancer with an accuracy of 60-70% in two independent large-scale gene expression studies [99, 100]. In another study, gene expression profiles combined with reverse-engineered gene networks were able to identify the genetic mediators that interconnect pathways associated with prostate cancer [14]. Moreover, a sample-specific regulatory network was generated to capture the relationship patterns between the transcription factors and the regulated genes, which was capable of classifying different disease states [101]. More recently, simultaneous genome-wide assays of both gene expression and genetic variation were utilized to map the genetic factors that underpin individual differences in quantitative levels of expression by expression quantitative trait loci (eQTLs) and to identify networks of genes involved in disease pathogenesis [102-104].

The analysis of protein interaction networks provides us with another approach to investigate the perturbations associated with human diseases, as these perturbations change the status of the protein interaction networks. There is an extensive literature on the application of protein interaction networks to elucidate how network disruption drives complex disease processes. Wachi et al. tested the degree distribution and centrality of a set of differentially expressed genes in a human PPI network and found that up-regulated genes manifested higher network connectivity in the cancerous tissues [105]. Similarly, another group found that cancer-related proteins tended to have twice as many interaction partners as non-cancer-related proteins [106]. Since disease genes and non-disease genes have different topological properties as shown in the previous studies, recent studies predicted new disease-associated genes based on their topological properties in protein interaction networks [107, 108].
Instead of finding disease-associated genes, other studies have focused on the identification of sub-networks related to the diseases, proposing hypotheses regarding the involved molecular complexes, signaling pathways and other cell machineries. Chen et al. developed a method to mine an Alzheimer’s disease sub-network from a protein interaction network [109]. Pujana et al. constructed a breast cancer-related network based on a network in which proteins were linked by various types of biological evidence such as protein-protein interaction and co-expression [15].

Finally, perturbations in metabolic and signaling networks can be utilized in the study of diseases. Lee et al. constructed a bipartite human disease association network in which nodes represented diseases and two diseases were linked if the mutated enzymes associated with them catalyzed adjacent metabolic reactions [110]. In a separate study, Lee et al. proposed a new disease classification method based on the metabolic and signaling pathway activities derived from the subset of genes in the pathway whose combined gene expression delivers optimal discriminative power for the disease phenotype of each individual patient [111].

1.7 Thesis overview and chapter objectives

High-throughput experimental techniques in molecular biology generate an enormous amount of data. Efficient approaches could reveal essential biological mechanisms. More sophisticated theoretical methodologies and computational tools have to be developed to study dynamic interaction networks of genes, proteins and biochemical reactions, rather than individual components. In general, the goal of my Ph.D. research is to discover knowledge based on networked data generated by high-throughput techniques, which includes exploiting special features of the biological systems, and gaining biological insights by further interpreting these features in a systematic manner.
This thesis describes the different projects I undertook to achieve the aforementioned goal, and these projects are briefly introduced in the following paragraphs.

Protein-protein interactions (PPIs) play an extremely important role in performing a variety of biological functions. The interactomes of several model organisms, including the budding yeast Saccharomyces cerevisiae, have recently been studied using experimental techniques such as the yeast two-hybrid assay. However, these techniques are generally biased against integral membrane proteins due to their intrinsic limitations. Because interactions between integral membrane proteins account for a large fraction of the whole interactome, Chapter 2 (as published in [112]) reports a study that predicted the interactions between integral membrane proteins in yeast using a quantitative model. I integrated protein interaction and domain interaction data from disparate sources and applied a log likelihood scoring method to all putative integral membrane proteins in yeast to predict their interactions based on a cut-off threshold. I showed that my approach improved on other predictive approaches when tested on a “gold-standard” data set, achieving a true positive rate of 74.6% at the expense of a false positive rate of 9.9%. Furthermore, I found that two integral membrane proteins were more likely to interact with each other if they shared more common interaction partners. This study provided a more extensive understanding of the yeast integral membrane proteins from a network view, which complemented previous prediction approaches based on the genomic context.

In Chapter 3 (as published in [87]), I investigated the prediction of domain-domain interactions. Protein domains, which are defined as independently folding structural blocks of proteins, physically interact with each other to perform biological functions.
The identification of domain-domain interactions (DDIs) is of great biological interest because it is generally accepted that PPIs are mediated by DDIs. Therefore, much effort has been put toward the prediction of domain pair interactions based on computational methods. Many DDI prediction tools using PPI networks and domain evolution information have been reported. However, tools that combine primary sequences, domain annotations, and structural annotations of proteins had not been evaluated. In this study, I reported a novel approach called Gram-bAsed Interaction Analysis (GAIA). GAIA extracted peptide segments composed of a fixed number of contiguous amino acids, called n-grams (where n is the number of amino acids), from the annotated domain and DDI data set in S. cerevisiae (budding yeast) and identified a list of n-grams that might contribute to DDIs and PPIs based on the frequencies of their appearance. GAIA also reported the coordinate position of gram pairs on each interacting domain pair. I demonstrated that my approach improved on other DDI prediction approaches. I also identified a list of 4-gram pairs that were significantly over-represented in the DDI data set and might mediate PPIs. GAIA represented a novel and reliable way to predict DDIs that mediate PPIs, which made it an important contribution to this active research area.

Many biological phenomena involve extensive interactions between cellular pathways. However, extraction of the inherent biological pathways remains a major challenge in systems biology. With the advancement of high-throughput functional genomic techniques, it has become possible to infer biological pathways and the pathway organization in a systematic way by integrating disparate lines of biological information. Therefore, I proposed a novel integrated approach that utilized network topology to predict biological pathways in Chapter 4 (as published in [113]).
I integrated four types of biological evidence (protein-protein interaction, genetic interaction, domain-domain interaction, and semantic similarity of GO terms) to generate a functionally associated network. This network was then used to develop a new pathway finding algorithm to predict biological pathways in yeast. My approach discovered 195 biological pathways and 31 functionally redundant pathway pairs in yeast. By comparing the identified pathways to three public pathway databases (KEGG, BioCyc and Reactome), I observed that my approach achieved a maximum positive predictive value of 12.8% and improved on other predictive approaches. This study made it possible to reconstruct biological pathways and to delineate cellular machinery in a systematic view.

In Chapter 5 (manuscript in preparation), I investigated network perturbation in human diseases such as cancers. A novel algorithm was developed to capture highly disturbed sub-networks in the human interactome, and these sub-networks were tagged as signatures that are linked to cancer outcomes. Compared to previous studies, my approach greatly improved the prediction performance, with an accuracy of 88.3%, a sensitivity of 87.2% and a specificity of 88.9%. This approach provided the possibility to predict the prognosis of cancer and other diseases based on “network-based gene signatures”. It helped elucidate molecular dysfunctions of proteins involved in genetic diseases and cancers in the context of interaction networks.

The methods and tools presented in these chapters are broadly applicable to the analysis, mining and understanding of a wide variety of biological networks. Therefore, this thesis provides novel and efficient solutions to the field of network biology.

In addition to the studies described above, I have been involved in other collaborative projects.
I assisted Nawar Malhis in developing a new approach (described in Appendix A) to detect associations between protein domains and DNA motifs in Saccharomyces cerevisiae based on Gene Ontology terms [114]. My interacting domain prediction tool GAIA, described in Chapter 3, was modified and used by Zhao et al. to predict double-stranded siRNA sequences in a quantum dots (QDs)-based RNA interference study described in Appendix B (in press). Finally, I wrote a book chapter about new perspectives in predicting membrane protein-protein interactions, described in Appendix C (in press).

1.8 References

1. Sauer U, Heinemann M, Zamboni N: Genetics. Getting closer to the whole picture. Science 2007, 316(5824):550-551.
2. Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2001, 2:343-372.
3. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
4. Ashburner M, Bergman CM: Drosophila melanogaster: a case study of a model genomic sequence and its consequences. Genome Res 2005, 15(12):1661-1667.
5. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D'Eustachio P: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 2009, 37(Database issue):D619-622.
6. Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L: Reactome: a knowledge base of biologic pathways and processes. Genome Biol 2007, 8(3):R39.
7. Degenne A, Forsé M: Introducing social networks.
London; Thousand Oaks, Calif.: Sage; 1999.
8. Good BM, Tennis JT, Wilkinson MD: Social tagging in the life sciences: characterizing a new metadata resource for bioinformatics. BMC Bioinformatics 2009, 10:313.
9. Lewis TG: Network science: theory and practice. Hoboken, N.J.: John Wiley & Sons; 2009.
10. Pastor-Satorras R, Rubí JM, Diaz-Guilera A: Statistical mechanics of complex networks. New York: Springer; 2003.
11. Onami S, Kitano H: Genome-wide prediction of genetic interactions in a metazoan. Bioessays 2006, 28(11):1087-1090.
12. Wong SL, Zhang LV, Tong AH, Li Z, Goldberg DS, King OD, Lesage G, Vidal M, Andrews B, Bussey H, Boone C, Roth FP: Combining biological networks to predict genetic interactions. Proc Natl Acad Sci U S A 2004, 101(44):15682-15687.
13. Zhong W, Sternberg PW: Genome-wide prediction of C. elegans genetic interactions. Science 2006, 311(5766):1481-1484.
14. Ergun A, Lawrence CA, Kohanski MA, Brennan TA, Collins JJ: A network biology approach to prostate cancer. Mol Syst Biol 2007, 3:82.
15. Pujana MA, Han JD, Starita LM, Stevens KN, Tewari M, Ahn JS, Rennert G, Moreno V, Kirchhoff T, Gold B, Assmann V, Elshamy WM, Rual JF, Levine D, Rozek LS, Gelman RS, Gunsalus KC, Greenberg RA, Sobhian B, Bertin N, Venkatesan K, Ayivi-Guedehoussou N, Sole X, Hernandez P, Lazaro C, Nathanson KL, Weber BL, Cusick ME, Hill DE, Offit K, Livingston DM, Gruber SB, Parvin JD, Vidal M: Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet 2007, 39(11):1338-1349.
16. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 2009, 27(2):199-204.
17. di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol 2005, 23(3):377-383.
18. Xing H, Gardner TS: The mode-of-action by network identification (MNI) algorithm: a network biology approach for molecular target identification. Nat Protoc 2006, 1(6):2551-2554.
19. Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring genetic networks and identifying compound mode of action via expression profiling. Science 2003, 301(5629):102-105.
20. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574.
21. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
22. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
23. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147.
24. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183.
25. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643.
26. Lehne B, Schlitt T: Protein-protein interaction databases: keeping up with growing interactomes. Hum Genomics 2009, 3(3):291-297.
27. Cusick ME, Yu H, Smolyar A, Venkatesan K, Carvunis AR, Simonis N, Rual JF, Borick H, Braun P, Dreze M, Vandenhaute J, Galli M, Yazaki J, Hill DE, Ecker JR, Roth FP, Vidal M: Literature-curated protein interaction datasets. Nat Methods 2009, 6(1):39-46.
28. Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray RR, Roncari L, de Smet AS, Venkatesan K, Rual JF, Vandenhaute J, Cusick ME, Pawson T, Hill DE, Tavernier J, Wrana JL, Roth FP, Vidal M: An experimentally derived confidence score for binary protein-protein interactions. Nat Methods 2009, 6(1):91-97.
29. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449-451.
30. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543.
31. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736.
32. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, Hopf C, Huhse B, Mangano R, Michon AM, Schirle M, Schlegl J, Schwab M, Stein MA, Bauer A, Casari G, Drewes G, Gavin AC, Jackson DB, Joberty G, Neubauer G, Rick J, Kuster B, Superti-Furga G: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 2004, 6(2):97-105.
33. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A: Human Protein Reference Database--2009 update. Nucleic Acids Res 2009, 37(Database issue):D767-772.
34. Kepes Fo: Biological networks. [Hackensack], NJ: World Scientific; 2007.
35. Eddy SR: Genetics. Total information awareness for worm genetics. Science 2006, 311(5766):1381-1382.
36. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, Andrews B, Tyers M, Boone C: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294(5550):2364-2368.
37. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the yeast genetic interaction network. Science 2004, 303(5659):808-813.
38. Ooi SL, Shoemaker DD, Boeke JD: DNA helicase gene interaction network defined using synthetic lethality analyzed by microarray. Nat Genet 2003, 35(3):277-286.
39. Pan X, Yuan DS, Ooi SL, Wang X, Sookhai-Mahadeo S, Meluh P, Boeke JD: dSLAM analysis of genome-wide genetic interactions in Saccharomyces cerevisiae. Methods 2007, 41(2):206-221.
40. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG: Life with 6000 genes. Science 1996, 274(5287):546, 563-547.
41. SMD [http://genome-www5.stanford.edu/]
42. Omnibus [http://www.ncbi.nlm.nih.gov/geo/]
43. ExpressDB [http://twod.med.harvard.edu/ExpressDB/]
44. metabolism [http://en.wikipedia.org/wiki/Metabolism]
45. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34(Database issue):D354-357.
46. Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33(19):6083-6089.
47. Witzany G: Life: The Communicative Structure a new philosophy: Norderstedt; 2000.
48. Sadava D, Heller HC, Orians GH, Purves WK, Hillis D: Life: The Science of Biology, 8 edn: Sinauer Associates Inc.; 2008.
49. Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509-512.
50. Albert R: Scale-free networks in cell biology. J Cell Sci 2005, 118(Pt 21):4947-4957.
51. Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393(6684):440-442.
52. Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM: Protein interaction networks from yeast to human. Curr Opin Struct Biol 2004, 14(3):292-299.
53. Fernandez-Suarez M, Chen TS, Ting AY: Protein-protein interaction detection in vitro and in cells by proximity biotinylation. J Am Chem Soc 2008, 130(29):9251-9253.
54. Plewczynski D, Ginalski K: The interactome: predicting the protein-protein interactions in cells. Cell Mol Biol Lett 2009, 14(1):1-22.
55. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.
56. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, Eisenberg D: Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol 2004, 5(5):R35.
57. Ermolaeva MD, White O, Salzberg SL: Prediction of operons in microbial genomes. Nucleic Acids Res 2001, 29(5):1216-1221.
58. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J: Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 2000, 97(12):6652-6657.
59. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86-90.
60. Marcotte CJ, Marcotte EM: Predicting functional linkages from gene fusions with confidence. Appl Bioinformatics 2002, 1(2):93-100.
61. Yanai I, Derti A, DeLisi C: Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci U S A 2001, 98(14):7940-7945.
62. Galperin MY, Koonin EV: Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol 2000, 18(6):609-613.
63. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature 2000, 405(6788):823-826.
64. Date SV, Marcotte EM: Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003, 21(9):1055-1062.
65. Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287(5450):116-122.
66. Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE: Co-evolution of proteins with their interaction partners. J Mol Biol 2000, 299(2):283-293.
67. Pazos F, Valencia A: Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 2001, 14(9):609-614.
68. Han DS, Kim HS, Jang WH, Lee SD, Suh JK: PreSPI: a domain combination based prediction system for protein-protein interaction. Nucleic Acids Res 2004, 32(21):6312-6320.
69. Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics 2006, 7:365.
70. Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics 2005, 21(2):218-226.
71. Aloy P, Russell RB: InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics 2003, 19(1):161-162.
72. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002, 49(3):350-364.
73. Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci U S A 2002, 99(9):5896-5901.
74. McDermott J, Bumgarner R, Samudrala R: Functional annotation from predicted protein interaction networks. Bioinformatics 2005, 21(15):3217-3226.
75. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623-1630.
76. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol 2007, 3:88.
77. Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L: Bridging protein local structures and protein functions. Amino Acids 2008, 35(3):627-650.
78. Guimera R, Nunes Amaral LA: Functional cartography of complex metabolic networks. Nature 2005, 433(7028):895-900.
79. Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins 2004, 54(1):49-57.
80. McGough AM, Staiger CJ, Min JK, Simonetti KD: The gelsolin family of actin regulatory proteins: modular structures, versatile functions. FEBS Lett 2003, 552(2-3):75-81.
81. Kato Y, Nagata K, Takahashi M, Lian L, Herrero JJ, Sudol M, Tanokura M: Common mechanism of ligand recognition by group II/III WW domains: redefining their functional classification. J Biol Chem 2004, 279(30):31833-31841.
82. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 2001, 311(4):681-692.
83. Bock JR, Gough DA: Predicting protein--protein interactions from primary structure. Bioinformatics 2001, 17(5):455-460.
84. Espadaler J, Romero-Isart O, Jackson RM, Oliva B: Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics 2005, 21(16):3360-3368.
85. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12(10):1540-1548.
86. Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6(10):R89.
87. Zhang KX, Ouellette BF: GAIA: a gram-based interaction analysis tool--an approach for identifying interacting domains in yeast. BMC Bioinformatics 2009, 10 Suppl 1:S60.
88. Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC: Getting started in biological pathway construction and analysis. PLoS Comput Biol 2008, 4(2):e16.
89. McShan DC, Rao S, Shah I: PathMiner: predicting metabolic pathways by heuristic search. Bioinformatics 2003, 19(13):1692-1698.
90. Croes D, Couche F, Wodak SJ, van Helden J: Inferring meaningful pathways in weighted metabolic networks. J Mol Biol 2006, 356(1):222-236.
91. Beasley JE, Planes FJ: Recovering metabolic pathways via optimization. Bioinformatics 2007, 23(1):92-98.
92. Steffen M, Petti A, Aach J, D'Haeseleer P, Church G: Automated modelling of signal transduction networks. BMC Bioinformatics 2002, 3:34.
93. Liu Y, Zhao H: A computational approach for ordering signal transduction pathway components from genomics and proteomics Data. BMC Bioinformatics 2004, 5:158.
94. Scott J, Ideker T, Karp RM, Sharan R: Efficient algorithms for detecting signaling pathways in protein interaction networks. J Comput Biol 2006, 13(2):133-144.
95. Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 2005, 23(5):561-566.
96. Ulitsky I, Shamir R: Pathway redundancy and protein essentiality revealed in the Saccharomyces cerevisiae interaction networks. Mol Syst Biol 2007, 3:104.
97. Ma X, Tarone AM, Li W: Mapping genetically compensatory pathways from synthetic lethal interactions in yeast. PLoS ONE 2008, 3(4):e1922.
98. Brady A, Maxwell K, Daniels N, Cowen LJ: Fault tolerance in protein interaction networks: stable bipartite subgraphs and redundant pathways. PLoS ONE 2009, 4(4):e5364.
99. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536.
100. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D,
Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671-679.
101. Tuck DP, Kluger HM, Kluger Y: Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics 2006, 7:236.
102. Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, Linsley PS, Mao M, Stoughton RB, Friend SH: Genetics of gene expression surveyed in maize, mouse and man. Nature 2003, 422(6929):297-302.
103. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature 2004, 430(7001):743-747.
104. Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science 2002, 296(5568):752-755.
105. Wachi S, Yoneda K, Wu R: Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 2005, 21(23):4205-4208.
106. Jonsson PF, Bates PA: Global topological features of cancer proteins in the human interactome. Bioinformatics 2006, 22(18):2291-2297.
107. Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using protein-protein interactions. J Med Genet 2006, 43(8):691-698.
108. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006, 78(6):1011-1025.
109. Chen JY, Shen C, Sivachenko AY: Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pac Symp Biocomput 2006:367-378.
110. Lee DS, Park J, Kay KA, Christakis NA, Oltvai ZN, Barabasi AL: The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci U S A 2008, 105(29):9880-9885.
111. Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification.
PLoS Comput Biol 2008, 4(11):e1000217.
112. Zhang KX, Ouellette BFF: A new approach to predict interactions between integral membrane proteins in yeast. IEEE Congress on Evolutionary Computation 2008 2008:1801-1806.
113. Zhang KX, Ouellette BF: Pandora, a PAthway and Network DiscOveRy Approach based on common biological evidence. Bioinformatics 2009.
114. Malhis N, Zhang KX, Ouellette BF: Detecting Protein-Domains DNA-Motifs Association in Saccharomyces cerevisiae Regulatory Networks. AINA Workshops 2007, 1:639-644.

2 A New Approach to Predict Interactions Between Integral Membrane Proteins in Yeast¹

2.1 Introduction

Cells are multi-molecular entities whose biological functions rely on stringent regulation, both temporally and spatially. This regulation is realized through a variety of molecular interactions, including protein-DNA interactions, protein-RNA interactions and protein-protein interactions (PPIs). PPIs are extremely important in a wide range of biological functions, from enzyme catalysis and signal transduction to more structural functions. Hence, owing to advanced large-scale techniques such as yeast two-hybrid and mass spectrometry, the interactomes of several model organisms such as Saccharomyces cerevisiae [1-6], Drosophila melanogaster [7, 8] and Caenorhabditis elegans [9] have been extensively studied. Such large-scale interaction networks provide a good opportunity to explore and decipher new information.

Proteins with at least one transmembrane domain constitute 20% to 35% of all known proteins, and therefore account for an important segment of the proteins involved in biological mechanisms. However, research on membrane protein interactions lags behind for several reasons. First, although the currently available interactomes contain adequate interactions for analysis, the data sets still contain a large number of false positives.
For example, compared with a gold-standard data set, protein-protein interactions identified by three frequently used high-throughput methods (yeast two-hybrid [6], tandem affinity purification (TAP) [1, 2] and high-throughput mass spectrometry protein complex identification (HMS-PCI) [3]) yield very low accuracy, coverage and overlap [10]. Second, some large-scale experimental techniques are biased against membrane proteins. For instance, to test whether two proteins interact, some assays require the proteins to be expressed in the nucleus, which may not be their native environment.

¹ A version of this chapter has been published. Zhang KX, Ouellette BFF: A new approach to predict interactions between integral membrane proteins in yeast. IEEE Congress on Evolutionary Computation 2008:1801-1806.

Several groups have addressed the interactome of membrane proteins in yeast. Miller and colleagues [11] worked on identifying interactions between integral membrane proteins in yeast using a modified split-ubiquitin technique. To address the challenges posed by experimental techniques, Xia and colleagues [12] developed a computational method that predicts interactions between helical membrane proteins in yeast by integrating 11 genomic features, such as sequence, function, localization, abundance, regulation, and phenotype, using logistic regression. However, this method suffers from low predictive power and low overlap with experimental results.

In addition to utilizing genomic features to predict protein-protein interactions, graph theory based on network topology is an alternative approach to inferring protein-protein relationships from protein interaction networks, and it has shown interesting results [13, 14]. Here, we propose a method to predict interactions between membrane proteins using a probabilistic model based on the topology of the protein-protein interaction network and that of the domain-domain interaction network in yeast.
It has been demonstrated that the more closely a pair of proteins are functionally related to each other, the more likely they are to share interaction partners [15]. Moreover, domain-domain interactions have been shown to be indicators of protein interactions due to the binding of modular domains or motifs [16, 17]. We integrated protein-protein interaction and domain-domain interaction data from disparate sources and applied a log-likelihood scoring method to all putative integral membrane proteins in yeast to predict integral membrane protein-protein interactions based on a cut-off threshold. In addition, we performed a statistical analysis of the significance of domain-domain interactions within a reliable set of membrane proteins.

2.2 Materials and methods

2.2.1 Obtaining a list of putative membrane proteins in yeast

A list of putative membrane proteins in yeast was identified from proteins annotated as such in the SGD database [18]. An additional list was extracted from the TMHMM [19] and Phobius [20] web servers. To generate a more comprehensive list of putative membrane proteins in yeast, the union of these two lists was used. In total, we collected 1412 membrane proteins.

2.2.2 Integration of protein-protein interaction data in yeast

To increase data reliability, protein-protein interaction data were retrieved from several data sources: BIND [21], DIP [22], IntAct [23] and BioGrid [24], all of which are currently maintained in our own data warehouse, Atlas [25]. We considered those interactions appearing in at least one data source. In total, 55,145 interactions in yeast were used for our analysis.

2.2.3 Retrieving domain-domain interaction data in yeast

Domain annotations of each protein were extracted from the Pfam database [26]. There was an average of 1.5 domains per membrane protein. A total of 3,274 domain-domain interactions involving 704 proteins were taken from the iPfam database [27].
iPfam is a resource that contains domain-domain interactions based on PDB entries.

2.2.4 Creating an experimentally validated data set or “gold-standard” data set

We created a “gold-standard” protein-protein interaction data set between all putative membrane proteins using an approach similar to that of Xia and colleagues [12]. Positive interactions were derived by choosing membrane protein pairs in the same complex in the MIPS database [28], whereas negative interactions are the pairs not in the same complex. The resulting gold-standard data set contains 515 positive interactions and 1,208,234 negative interactions between integral membrane proteins.

2.2.5 A model for scoring protein-protein interactions and domain-domain interactions

A scoring model should predict how closely a pair of genes is related in a protein-protein interaction network. According to previous research, if two proteins interact with an overlapping group of proteins, they are likely to interact with each other [2, 3, 29]. Thus, for a given pair of proteins, we found the set of interactors common to both. A scoring method was employed to calculate the likelihood that a group of proteins (a pair of query proteins and the whole set of their common interactors) is more densely connected (in terms of the number of PPIs within the group) than would be expected at random [30]:

(1)

where S is the set of common interactors plus the given pair of proteins, and I is the set of protein-protein interactions among those genes. PI(x, y) is an indicator function that equals 1 if and only if the interaction (x, y) occurs in I, and 0 otherwise. For network N, interactions are expected to occur with high probability (β) for every pair of proteins in S. Following previous work, we set β to 0.9 [30].
For network N_control, the probability of observing each interaction, c_{x,y}, was determined by estimating the fraction of all randomly generated networks that contain the protein-protein interaction between a given pair of proteins. Comparable control networks were randomly generated by building interaction networks with the same number of nodes from the same gene set and the same number of DDIs, repeating the process 100 times.

Given a pair of proteins, their domain annotations and a list of domain-domain interactions between them, we employed a related log-odds scoring model (2) to evaluate the probability that the domain-domain interactions bridging these two genes and their common interaction partners were denser than random, based on the above scoring method:

(2)

Compared to the previous equation, DI(m, n) is an indicator function that equals 1 if and only if the domain-domain interaction (m, n) occurs in I, and 0 otherwise; D_x and D_y are the numbers of domains in proteins x and y, respectively; for network N_control, the probability of observing each domain-domain interaction, c_{x,y}, was determined by estimating the fraction of all randomly generated networks that contain that domain-domain interaction between two proteins. The final scoring function for a given pair of genes was then:

\[ S_{\text{final}} = S_p + S_d \tag{3} \]

2.2.6 ROC curve

The performance of the scoring method was measured by the area under the receiver operating characteristic (ROC) curve. The area was calculated with the ROC package in R [31]. The ROC curve displays the relationship between sensitivity and specificity. The curve was generated by calculating the true positive rate (sensitivity) and the false positive rate (1 - specificity) at different threshold scores, derived from the protein-protein interaction data alone and from the combined scores of both protein-protein interactions and domain-domain interactions, against the “gold-standard” data set.
If an interaction was above the threshold and in the positive data set of the “gold-standard” data set, it was regarded as a true positive; if it was not in the positive data set, it was regarded as a false positive. If an interaction was below the threshold and in the negative data set of the “gold-standard” data set, it was regarded as a true negative; if it was not in the negative data set, it was regarded as a false negative. Because of the imbalanced proportion of positive and negative samples in the “gold-standard” data set, we randomly picked the same number of negative samples as positive samples in the “gold-standard” data set 100 times and took the average of true negatives and false negatives when we calculated the sensitivity and specificity.

The true positive rate and the false positive rate were calculated as follows:

\[ \text{True Positive Rate} = \frac{\text{No. of True Positives}}{\text{No. of True Positives} + \text{No. of False Negatives}} \tag{4} \]

\[ \text{False Positive Rate} = 1 - \frac{\text{No. of True Negatives}}{\text{No. of True Negatives} + \text{No. of False Positives}} \tag{5} \]

2.3 Results and Discussion

2.3.1 Properties of the interactions between integral membrane proteins

There is debate about whether protein-protein interaction networks follow a power-law degree distribution. It has been widely accepted that protein-protein interaction networks in yeast obey this property [8, 32], but a recent report challenges this point [33]. In our data set, both the degree distribution of the nodes and the number of common interactors of each PPI pair in the integral membrane PPI network followed a power law (R-squared: 0.92, p < 2.2e-16). 91% of integral transmembrane proteins had few interactors (fewer than 50). However, the scale-free topological property of the yeast interactome needs to be further verified when more large-scale PPI data sets become available.
The reason is that partial sampling of other types of networks, such as exponential and truncated normal networks, also results in sub-networks with the scale-free topological property [34]. Therefore, it is possible that the scale-free topology of the integral transmembrane PPI network does not represent the true topology of the full interactome.

For each possible interaction between integral membrane proteins, we calculated three different scores: a PPI score, a DDI score and a combined PPI/DDI score, according to equations (1)-(3). This generated a table of 996,166 protein pairs, each with three interaction probability scores. A ROC curve was plotted by measuring sensitivity and specificity against the gold-standard data set at different cut-off values (Figure 2.1). The area under the curve (AUC) is 0.91 for the combined score, 0.85 for the PPI score and 0.74 for the DDI score, which indicates good predictive performance of the scoring method. The AUC differences between the combined score and the PPI score and between the combined score and the DDI score are statistically significant (p-value = 8.1 × 10^-3 and p-value = 4.58 × 10^-4, respectively, by Fisher's exact test), which suggests that better performance can be achieved by combining scores than by using PPI scores or DDI scores alone. It is estimated that there are 5,000 interactions between integral membrane proteins [12]. Based on that number, we achieved a 74.6% true positive rate (sensitivity) at the expense of a 9.9% false positive rate (1 - specificity) for a cut-off score of 350, which predicted 4,660 interactions between integral membrane proteins, about 0.54% coverage of all possible interactions among integral membrane proteins.

Figure 2.1 Receiver operating characteristic (ROC) curves plotted at different cut-off values when tested against the gold-standard data set. The area under the curve is 0.91 for PPIs combined with DDIs, 0.85 for PPIs and 0.74 for DDIs. Cut-off values are arbitrarily chosen from 50 to 10,000.
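The evaluation described in Section 2.2.6 can be illustrated with a short sketch. It is written in plain Python rather than with the ROC package in R used for the reported results, and it assumes that the negative class is resampled to the size of the positive class and that the resulting counts are averaged over the repetitions. The function names and the toy scores are hypothetical.

```python
import random


def roc_points(pos_scores, neg_scores, thresholds, rounds=100, seed=0):
    """Return (false positive rate, true positive rate) per threshold.

    Negatives are resampled `rounds` times to match the number of
    positives, and the true negative and false positive counts are
    averaged (an assumption about the balancing procedure).
    """
    rng = random.Random(seed)
    n_pos = len(pos_scores)
    points = []
    for t in thresholds:
        tp = sum(s > t for s in pos_scores)
        fn = n_pos - tp
        tn_total = 0
        fp_total = 0
        for _ in range(rounds):
            sample = rng.sample(neg_scores, n_pos)
            fp = sum(s > t for s in sample)
            fp_total += fp
            tn_total += n_pos - fp
        tn = tn_total / rounds
        fp = fp_total / rounds
        tpr = tp / (tp + fn)          # equation (4)
        fpr = 1 - tn / (tn + fp)      # equation (5)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Trapezoidal area under the curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Toy scores: positives and negatives are perfectly separated here.
toy_positives = [10, 9, 8]
toy_negatives = [1, 2, 3, 4]
curve = roc_points(toy_positives, toy_negatives, thresholds=[0, 5, 20])
print(curve, auc(curve))
```

The negative list must be at least as long as the positive list for the resampling step to work, which holds for a gold standard with far more negatives than positives.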
Since our scoring method depends upon interactions between a gene pair and their common protein-protein interactors, we studied the connectivity of integral membrane proteins in our predicted integral membrane PPI data set, the gold-standard positive data set and 5 random data sets. On average, a protein pair shared 16.4 common interaction partners in our predicted interaction data set, 15.6 in the gold-standard data set and 0.13 in the random data sets (Figure 2.2). The average number of common interactors between proteins in the predicted interactions is close to that in the gold-standard positive data set, whereas there is a pronounced difference between the predicted interactions and the random interactions, and between the interactions in the gold-standard data set and the random data sets (p-value = 1.43 × 10^-4). This observation supports the reliability of our scoring approach for predicting interactions between integral membrane proteins using network topological information.

2.3.2 The interactome map of integral membrane proteins in yeast

The map of the interactome of integral membrane proteins was built with Cytoscape [35] from 4,531 predicted protein-protein interactions between integral membrane proteins at a cut-off value of 455. 53.4% (281/527) of the proteins in the interactome map contain at least one transmembrane helix according to the predictions by TMHMM. 80% (392/513) of the interactions within the gold-standard data set overlap with those within the interactome map, but they account for only 8.4% of the whole interactome of integral membrane proteins.
By checking the topological properties of the interactome map, we found that 84% of the interactions in the gold-standard data set are within the same complex, such as lipid biosynthesis, energy coupled proton transport, protein biosynthesis, protein targeting to mitochondria and ATP synthesis coupled electron transport, which reflects the characteristics of the experiments performed (detecting protein-protein interactions within the same complexes). Our predicted interactions indicate new members in complexes such as transport, secretion, vesicle-mediated transport and intracellular transport, which are missed by experimental methods. For example, in the group of protein import into nucleus, KAP95 and SSA1 do not interact with other proteins within the group according to the gold-standard data set; however, they both play a critical role in nuclear localization signal (NLS)-directed nuclear transport by interacting with other proteins to guide transport across the nuclear pore complex [36, 37]. Furthermore, as observed from the map, some interactions not within the gold-standard data set are found to bridge two complexes. For example, NUP116 and ATP14 are predicted to interact with each other, connecting two groups: protein import into nucleus and energy coupled proton transport. Although there is no evidence demonstrating a direct interaction between NUP116 and ATP14, some research results indicate that ATP14 might be involved in ATP synthesis during protein import into the nucleus [38, 39]. Interestingly, we found some new complexes, such as peroxisome organization and biogenesis, related to the functions of peroxisome membrane proteins such as peroxisome biogenesis and peroxisomal matrix protein import [40-42].

Figure 2.2 The number of common interaction partners of gene pairs in our predicted interactions, the gold-standard data set and five random data sets.
On average, there are 16.4 common interaction partners between a protein pair in our predicted interactions, 15.6 in the gold-standard data set and 0.13 in the random data sets. The centre of the box is the median and the box spans from the first to the third quartile.

2.3.3 Properties of the domain-domain interactions between integral membrane proteins

After we retrieved the domain annotation of each integral membrane protein from Pfam, we plotted the distribution of the number of domains in each protein. Approximately 58% (813/1412) of the integral membrane proteins do not have a domain annotation in Pfam, yet 69% (629/911) of them contain transmembrane helices, indicating that domain-domain interaction data are biased against integral membrane proteins because of the technical difficulty of identifying interactions between integral membrane proteins. We found 426 distinct domains across 704 integral membrane proteins. A list of statistically significant domains is given in Table 2.1 and shows specific transmembrane domains such as the Mitochondrial carrier protein, Flocculin repeat and ABC transporter domains. These domains usually occur repeatedly in one integral membrane protein, whereas we did not find any integral membrane protein that contains many unique domains. We therefore speculate that domain-domain interactions between two integral membrane proteins mostly occur within these membrane-specific domains, and that proteins with these repeated domains have a greater chance to interact.

By examining domain-domain interactions from iPfam combined with the domain annotation, we found a total of 176 unique domain-domain interactions corresponding to 704 membrane proteins. A list of over-represented domain-domain interactions (Enrichment Factor (EF) > 200) is shown in Figure 2.3. We plotted how many times each type of domain-domain interaction occurs in all membrane proteins versus the number of proteins involved.
From Figure 2.3, it can be observed that the frequency of most DDIs correlates with the number of proteins involved. In addition, by comparing the probabilities of each DDI in two different data sets (integral membrane proteins vs. non-membrane proteins), we categorized all DDIs in the integral membrane protein data set as membrane-specific DDIs or non-membrane-specific DDIs based on their enrichment factor scores. It can be observed from Figure 2.3 that 86% of DDIs (130/151) are membrane-specific domain-domain interactions. These DDIs are closely related to some major functions of integral membrane proteins. For example, two DDIs (PF00005: ABC transporter <-> PF00005: ABC transporter, PF00005: ABC transporter <-> PF00664: ABC transporter transmembrane region) with very high EF scores play a critical role in binding and hydrolyzing ATP, which provides the energy that drives active transport across the membrane. Another DDI (PF00957: Synaptobrevin <-> PF05739: SNARE domain) with a high EF score is known to play a role in the process of vesicular fusion through the interaction between SNARE and Syntaxin [43]. Among the non-membrane-specific DDIs, most involve repeats such as ankyrin-like (ANK) repeats, WD40 repeats and tetratricopeptide repeats. Previous evidence indicates that these repeating units can serve as a rigid scaffold for protein interactions, where they coordinate the assembly of protein complexes [44, 45]. Besides these repeats, other non-membrane-specific DDIs, such as those involving the protein kinase domain and the pleckstrin homology (PH) domain, serve biological functions such as protein kinase activity. Therefore, we speculate that most of the PPIs between integral membrane proteins are mediated by membrane-specific DDIs, while some generalized DDIs might play an auxiliary role. To check whether domain-domain interactions are correlated with protein-protein interactions, we compared the number of predicted DDIs to that expected at random.
The number of predicted DDIs is statistically significant at a p-value of 6.45 × 10^-5 by Z-test, indicating that DDIs play an important role in integral membrane protein-protein interactions.

Table 2.1 A list of statistically significant (P-value < 0.05, Z-test) domains among integral membrane proteins in yeast.

Pfam_ID   Name                            p-value
PF00153   Mitochondrial carrier protein   3.484 × 10^-11
PF00624   Flocculin repeat                2.451 × 10^-11
PF00005   ABC transporter                 3.341 × 10^-10
PF00324   Amino acid permease             8.883 × 10^-9
PF00674   DUP family                      1.439 × 10^-9
PF00400   WD domain, G-beta repeat        6.254 × 10^-8
PF00083   Sugar (and other) transporter   5.235 × 10^-8
PF02985   HEAT repeat                     2.648 × 10^-7
PF00023   Ankyrin repeat                  6.332 × 10^-6
PF00515   Tetratricopeptide repeat        2.345 × 10^-5
PF01061   ABC-2 type transporter          6.329 × 10^-4
PF07690   Major Facilitator Superfamily   8.412 × 10^-3

Figure 2.3 2D scatter plot with marginal histograms showing the correlation between the frequency of occurrence of each domain-domain interaction (on the X-axis) and the number of integral membrane proteins involved (on the Y-axis). The total number of data points is 151. Diamonds represent membrane-specific DDIs and circles represent non-membrane-specific DDIs. A DDI is classed as membrane-specific if its enrichment factor (EF) is greater than 1. The enrichment factor is calculated based on two data sets (membrane vs. non-membrane). From the figure, it can be observed that the frequency of most DDIs correlates with the number of proteins involved. The DDIs with the top 10 EF scores are listed in the adjoining plot.

2.3.4 Comparison with other large-scale data sets

Two research projects have focused on large-scale detection of membrane protein-protein interactions in yeast. Miller and colleagues [11] identified 1,949 putative non-self interactions among 705 integral membrane proteins using a modified split-ubiquitin technique.
Xia and colleagues [12] predicted 4,145 helical membrane protein interactions among 516 proteins by using logistic regression based on genomic contexts such as sequence, function, localization, abundance, regulation and phenotype. Of the 1,949 experimental interactions from Miller et al., 438 are correctly predicted by our method, whereas 79 interactions overlap between the predictions of Xia et al. and the results of Miller et al., indicating that our approach greatly improves on the method of Xia et al. (Figure 2.4). Interestingly, only 79 protein-protein interactions overlap between the results of all three approaches. We also compared the three sets of interactions to the 515 interactions in the gold-standard data set. Here, our approach correctly predicted 393 (81%) interactions, Xia et al. predicted 196 (38%) interactions and Miller et al. detected 4 (1.7%) interactions. This further validates the performance of our approach. The differences among the three large-scale sets of membrane protein interactions may arise because each approach focuses on different aspects. The experimental result from Miller et al. is reliable but probably contains false positives and false negatives due to the intrinsic limitations of the experimental technique they employed. The approach proposed by Xia et al. focuses on interactions between complexes instead of on binary protein-protein interactions, so its result is prone to predicting interactions within complexes. Our approach emphasizes interactions through the topological properties of PPI and DDI networks, which are important features for predicting membrane protein interactions.
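The comparison above amounts to intersecting sets of unordered protein pairs. A minimal sketch follows; the function names and the placeholder protein identifiers are hypothetical, and the numbers it produces are unrelated to the results reported in this section.

```python
def as_pairs(interactions):
    """Normalize (a, b) tuples to unordered pairs, dropping self-pairs."""
    return {frozenset(pair) for pair in interactions if pair[0] != pair[1]}


def overlap_report(prediction_sets, gold_standard):
    """For each method, count hits in the gold standard and their fraction."""
    gold = as_pairs(gold_standard)
    report = {}
    for name, interactions in prediction_sets.items():
        shared = as_pairs(interactions) & gold
        fraction = len(shared) / len(gold) if gold else 0.0
        report[name] = (len(shared), fraction)
    return report


# Placeholder data: P1..P8 and X, Y, Q, R are not real proteins.
toy_gold = [("P1", "P2"), ("P3", "P4"), ("P5", "P6"), ("P7", "P8")]
toy_methods = {
    "method_a": [("P2", "P1"), ("P3", "P4"), ("P5", "P6"), ("X", "Y")],
    "method_b": [("P1", "P2"), ("Q", "R")],
}
print(overlap_report(toy_methods, toy_gold))

# Overlap between two methods uses the same normalization.
between_methods = as_pairs(toy_methods["method_a"]) & as_pairs(toy_methods["method_b"])
print(len(between_methods))
```

Storing each interaction as an unordered pair avoids undercounting when two sources list the same interaction in opposite orders.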
2.4 Conclusion

By exploring the global topological properties of protein-protein interaction data in yeast as well as DDI data, we proposed a new scoring model that identifies protein-protein interactions between integral membrane proteins and constructed the interactome map of integral membrane proteins based on the resultant scores. This was achieved by calculating the log likelihood that a pair of investigated proteins shares more interacting protein and domain partners than a random pair. We predicted 4,660 interactions between integral membrane proteins and built the interactome of integral membrane proteins in yeast. Tested on the gold-standard data set, our approach improves on other predictive approaches and achieves a 74.6% true positive rate at the expense of a 9.9% false positive rate. This study will allow us to reach a fuller understanding of the integral membrane protein network in yeast and complements the previous prediction approach based on genomic context. The resultant predictions provide testable hypotheses for experimental validation. Future work is needed to integrate other related genomic context characteristics to achieve better prediction performance and a more general and comprehensive view of the interactome of integral membrane proteins in yeast.

Figure 2.4 Comparison of the prediction results from three large-scale methods.

2.5 References

1. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
2. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147.
3. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183.
4. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574.
5. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643.
6. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
7. Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, Jacq B, Arpin M, Bellaiche Y, Bellusci S, Benaroch P, Bornens M, Chanet R, Chavrier P, Delattre O, Doye V, Fehon R, Faye G, Galli T, Girault JA, Goud B, de Gunzburg J, Johannes L, Junier MP, Mirouse V, Mukherjee A, Papadopoulo D, Perez F, Plessis A, Rosse C, Saule S, Stoppa-Lyonnet D, Vincent A, White M, Legrain P, Wojcik J, Camonis J, Daviet L: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15(3):376-384.
8. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736.
9. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543.
10. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.
11. Miller JP, Lo RS, Ben-Hur A, Desmarais C, Stagljar I, Noble WS, Fields S: Large-scale identification of yeast integral membrane protein interactions. Proc Natl Acad Sci U S A 2005, 102(34):12123-12128.
12. Xia Y, Lu LJ, Gerstein M: Integrated prediction of the helical membrane protein interactome in yeast. J Mol Biol 2006, 357(1):339-349.
13. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21 Suppl 1:i302-310.
14. Valente AX, Cusick ME: Yeast Protein Interactome topology provides framework for coordinated-functionality. Nucleic Acids Res 2006, 34(9):2812-2819.
15. Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 2003, 5(1):R6.
16. Jothi R, Cherukuri PF, Tasneem A, Przytycka TM: Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions. J Mol Biol 2006, 362(4):861-875.
17. Pawson T, Nash P: Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618):445-452.
18. Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya OG, Ideker T, Dolinski K, Batada NN, Tyers M: Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 2006, 5(4):11.
19. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305(3):567-580.
20. Kall L, Krogh A, Sonnhammer EL: A combined transmembrane topology and signal peptide prediction method. J Mol Biol 2004, 338(5):1027-1036.
21. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A, Hurrell L, Isserlin R, Jack F, Juma F, Khan A, Kon T, Konopinsky S, Le V, Lee E, Ling S, Magidin M, Moniakis J, Montojo J, Moore S, Muskat B, Ng I, Paraiso JP, Parker B, Pintilie G, Pirone R, Salama JJ, Sgro S, Shan T, Shu Y, Siew J, Skinner D, Snyder K, Stasiuk R, Strumpf D, Tuekam B, Tao S, Wang Z, White M, Willis R, Wolting C, Wong S, Wrong A, Xin C, Yao R, Yates B, Zhang S, Zheng K, Pawson T, Ouellette BF, Hogue CW: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33(Database issue):D418-424.
22. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449-451.
23. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455.
24. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535-539.
25. Shah SP, Huang Y, Xu T, Yuen MM, Ling J, Ouellette BF: Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics 2005, 6(1):34.
26. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32(Database issue):D138-141.
27. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412.
28. Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 2006, 34(Database issue):D169-172.
29. Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein networks by completing defective cliques. Bioinformatics 2006, 22(7):823-829.
30. Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 2005, 23(5):561-566.
31. The R Project for Statistical Computing [www.r-project.org/]
32. Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004, 430(6995):88-93.
Tanaka R, Yi TM, Doyle J: Some protein interaction data do not exhibit power law statistics. FEBS Lett 2005, 579(23):5140-5144. Han JD, Dupuy D, Bertin N, Cusick ME, Vidal M: Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol 2005, 23(7):839-844. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):24982504. Denning D, Mykytka B, Allen NP, Huang L, Al B, Rexach M: The nucleoporin Nup60p functions as a Gsp1p-GTP-sensitive tether for Nup2p at the nuclear pore complex. J Cell Biol 2001, 154(5):937-950. Liu SM, Stewart M: Structural basis for the high-affinity binding of nucleoporin Nup1p to the Saccharomyces cerevisiae importin-beta homologue, Kap95p. J Mol Biol 2005, 349(3):515-525. Dingwall C, Laskey RA: Protein import into the cell nucleus. Annu Rev Cell Biol 1986, 2:367-390. Vargas DY, Raj A, Marras SA, Kramer FR, Tyagi S: Mechanism of mRNA transport in the nucleus. Proc Natl Acad Sci U S A 2005, 102(47):17008-17013. Eckert JH, Erdmann R: Peroxisome biogenesis. Rev Physiol Biochem Pharmacol 2003, 147:75-121. Heiland I, Erdmann R: Biogenesis of peroxisomes. Topogenesis of the peroxisomal membrane and matrix proteins. Febs J 2005, 272(10):2362-2372. Honsho M, Hiroshige T, Fujiki Y: The membrane biogenesis peroxin Pex16p. Topogenesis and functional roles in peroxisomal membrane assembly. J Biol Chem 2002, 277(46):44513-44524. Weimbs T, Low SH, Chapin SJ, Mostov KE, Bucher P, Hofmann K: A conserved domain is present in different families of vesicular fusion proteins: a new superfamily. Proc Natl Acad Sci U S A 1997, 94(7):3046-3051. D'Andrea LD, Regan L: TPR proteins: the versatile helix. Trends Biochem Sci 2003, 28(12):655-662. Li D, Roberts R: WD-repeat proteins: structure characteristics, biological function, and their involvement in human diseases. 
Cell Mol Life Sci 2001, 58(14):2085-2097.

3 GAIA: a gram-based interaction analysis tool - an approach for identifying interacting domains in yeast

A version of this chapter has been published. Zhang KX, Ouellette BF: GAIA: a gram-based interaction analysis tool--an approach for identifying interacting domains in yeast. BMC Bioinformatics 2009, 10 Suppl 1:S60.

3.1 Background

Biological functions of cells are determined by the strict regulation, both temporal and spatial, of molecular interactions among proteins, lipids, carbohydrates and nucleic acids. Protein-protein interactions (PPIs) play important roles in biological functions ranging from enzyme catalysis and signal transduction to many structural functions. Owing to advances in large-scale techniques such as the yeast two-hybrid system and affinity purification followed by mass spectrometry, the interactomes of several model organisms such as Saccharomyces cerevisiae [1-6], Drosophila melanogaster [7, 8] and Caenorhabditis elegans [9] have recently been extensively studied. While such large-scale interaction data sets provide tremendous opportunities for data exploration, there are limitations: 1) the experimental techniques for detecting PPIs are time-consuming, costly and labor intensive; 2) the quality of certain data sets is uneven; and 3) technical limitations, such as the requirement to tag proteins of interest, still exist.

It is widely accepted that some proteins interact with each other through interactions between their domains, which are defined as independent structural and/or functional blocks of proteins. For example, some cytoskeletal proteins interact with actin through their gelsolin repeat domains [10]. It has also been reported that sets of conserved residues within WW domains can bind to proline-rich peptides [11]. Therefore, the identification of domain-domain interactions (DDIs) can potentially shed light on the mechanism underlying PPIs. Unfortunately, identifying either DDIs or PPIs through experimental approaches is far from trivial. As a complementary alternative, computational approaches that identify DDIs have been studied intensively for years, yielding some interesting results.

The currently available computational DDI prediction approaches can be categorized as follows:
1) Association-based approaches, where each DDI is scored by the association of the number of interacting domain pairs between interacting protein pairs and non-interacting protein pairs. These methods, however, only compute each DDI locally, without considering the information of other DDIs between protein pairs [12-14]. Deng et al. proposed an optimized approach, maximum likelihood estimation (MLE), which globally calculates the probabilities of interaction between two domains using the expectation-maximization (EM) algorithm [15].
2) Pattern-based approaches, where the domain interaction pattern of each interacting protein pair is utilized to predict DDIs by applying machine learning approaches such as a clustering algorithm [16] or a random forest algorithm [17].
3) Co-evolution-based approaches, where a pair of domains is regarded as interacting if the two domains share very similar phylogenetic trees [18].
However, one caveat of these DDI prediction approaches is that information on the sequences and structures of the domains is neglected; as a result, they suffer from low sensitivity and specificity.

It is known that segments of n contiguous amino acids (or n-grams) correlate with specific secondary structure elements [19, 20]. Therefore, n-gram-based methods are widely exploited to predict the secondary structure or subcellular localization of proteins and to classify protein families using machine learning techniques [21-23]. The finding that n-grams are closely related to the secondary structure of protein domains prompted us to ask whether n-grams can interact with each other. In fact, several studies have reported interactions between n-grams. For example, a molecular interaction exists between the Smurf1 WW2 domain and the PPXY motifs of Smad1 [24], and the Src-homology 3 (SH3) domain binds to a PXXP peptide [25]. Therefore, we hypothesize that some over-represented gram-gram interactions mediate DDIs and thus PPIs. In this study, we introduce a novel DDI prediction approach based on the primary sequence of proteins, which extracts n-gram frequencies from the annotated domain and DDI data sets in yeast. This approach extends a previously reported related study [26] in two substantial ways: 1) instead of predicting PPIs, this work predicts DDIs based on domain sequence and interaction data; 2) a new scoring model was developed to quantify each gram pair.

Our approach, called GAIA, improves on other prediction approaches. When tested against a gold-standard data set, GAIA achieves a true positive rate (sensitivity) of 82% with a false positive rate (1 - specificity) of 21%, and it performs most accurately when the length of the gram is set to 4 amino acids. Using GAIA, we generated a list of 4-gram pairs that are significantly over-represented in DDIs in yeast. We postulate that these pairs mediate the DDIs in yeast. Overall, we demonstrate that GAIA, a gram-based method, provides a novel and reliable way to predict DDIs that may mediate PPIs in yeast. Our results, which show the localization of interacting grams/hotspots, provide testable hypotheses for experimental validation. Complemented with other prediction methods, this study helps to elucidate the entire interactome of cells.

3.2 Methods

The aim of this work is to predict DDIs based on the frequency of each possible gram pair from a pair of query proteins. The frequencies of these gram pairs are calculated from the annotated DDI data set and from random data sets.
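The gram-pair counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published Perl implementation: the function names (`ngrams`, `count_hits`, `gram_pair_frequencies`) and the toy domain sequences are invented here, and a hit is simplified to "one domain of an interacting pair contains the first gram and the other domain contains the second", ignoring multiple occurrences within a domain.

```python
from itertools import product


def ngrams(sequence, n=4):
    """Return (gram, start) for every window of n residues; start is 1-based."""
    return [(sequence[i:i + n], i + 1) for i in range(len(sequence) - n + 1)]


def count_hits(gram_a, gram_b, domain_seqs, interacting_pairs):
    """Count interacting domain pairs where one domain holds gram_a and the other gram_b."""
    hits = 0
    for dom_x, dom_y in interacting_pairs:
        seq_x, seq_y = domain_seqs[dom_x], domain_seqs[dom_y]
        if (gram_a in seq_x and gram_b in seq_y) or (gram_a in seq_y and gram_b in seq_x):
            hits += 1
    return hits


def gram_pair_frequencies(protein_a, protein_b, domain_seqs, interacting_pairs, n=4):
    """Map (gram_a, pos_a, gram_b, pos_b) to its hit count, keeping only non-zero counts."""
    freqs = {}
    for (ga, pa), (gb, pb) in product(ngrams(protein_a, n), ngrams(protein_b, n)):
        hits = count_hits(ga, gb, domain_seqs, interacting_pairs)
        if hits:
            freqs[(ga, pa, gb, pb)] = hits
    return freqs


# Toy data, invented for illustration only (not real Pfam/iPfam records).
toy_domains = {"D1": "MKLTLAG", "D2": "GEAASQ", "D3": "PPGYPP"}
toy_ddis = [("D1", "D2"), ("D3", "D3")]
freqs = gram_pair_frequencies("AAKLTLAA", "TTEAASTT", toy_domains, toy_ddis)
for key, hits in sorted(freqs.items()):
    print(key, hits)
```

Keeping the 1-based start position of each gram alongside the count mirrors the idea of reporting where on the query proteins a contributing gram pair sits; the same counting routine could, in principle, be run on randomly paired domains to obtain the control frequencies.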
In addition to predicting DDIs, GAIA also generates a list of gram pairs, with their coordinates on the protein primary structure, that contribute to the interaction between pairs of domains on the query proteins. Details of how the GAIA algorithm works are provided in the following section, along with information about data set collection, performance evaluation, and the development environment.

3.2.1 The GAIA algorithm (Figure 3.1)

Step A. For each 4-gram Gi in query protein A, we generated a list of Pfam-annotated domains, dlistG[i], that contain this gram, together with the number of hits of this gram in each domain.

Step B. For each 4-gram Gj in query protein B, we likewise generated a list of Pfam-annotated domains, dlistG[j], that contain this gram, together with the number of hits of this gram in each domain.

Step C. For each gram pair (Gi, Gj) between query proteins A and B, we calculated the frequency of hits, freq[i][j], for this gram pair in the interacting domain-domain pairs previously established in iPfam [27]. The final hit score for this gram pair, hitScore[i][j], was then obtained by weighting the number of hits by weightScore[i][j], to determine whether the number of its occurrences in the interacting domain pairs is statistically significant. The hit scores and weight scores are calculated by the following formulas:

$$\mathrm{hitScore}[i][j] = (\text{No. of hits}) \times \mathrm{weightScore}[i][j] \quad (1)$$

$$\mathrm{weightScore}[i][j] = P_{(\mathrm{real}\mid\mathrm{random})}(\mathrm{Gram}[i][j]) \quad (2)$$

Here, P(real|random)(Gram[i][j]) is the probability that the observed number of occurrences of Gram[i][j] in the interacting domain pairs is expected at random. Comparable control domain pairs were randomly generated by pairing domains from the DDI data set.

Step D. For each gram pair generated in Step C, if the hit frequency was over the preset threshold c and the gram pair was located in a domain region, then this gram pair and its corresponding domain pair were predicted to interact. A profile containing the number of hits and the positions of the gram pairs in the input query protein pair was generated at the same time. This profile is important because it provides information on the amino acid hotspots that potentially contribute to the physical interaction between the pair of query proteins.

Figure 3.1 The general flowchart of the GAIA algorithm. In steps A and B, two lists of Pfam-annotated domains that contain the two corresponding grams were generated by searching the domain sequence data set from Pfam. In step C, all possible domain pairs derived from the two domain lists from steps A and B were searched, and the frequency of hits was calculated by counting how many of these domain pairs occur in the iPfam data set. In step D, if a pair of query proteins contained any gram pair whose hit frequency determined in step C was over a preset threshold, the two proteins were predicted to interact with each other. At the same time, a profile containing the number of hits and the localization of hits on the protein sequences was generated. The predicted protein pair and corresponding gram pairs were further validated based on their 3D structures.

3.2.2 Data set collection

We compiled 3,020 DDIs in yeast and their corresponding amino acid sequences from Pfam [28], a database containing protein domains and domain families, and iPfam [27], a database of DDIs derived from RCSB Protein Data Bank (PDB) crystal structures [29]. To evaluate prediction performance, we used a "gold-standard" data set that contained 595 PPIs compiled from a PPI data set identified by the homologous protein interaction verification (HPIV) method [30]. It is reported that the
All interacting protein pairs in our positive gold-standard dataset were expected to match three following criteria: 1) each pair is in the HPIV positive dataset; 2) each protein contains more than one domain; 3) each pair contains at least one iPfam DDI. We generated another “gold-standard” negative dataset containing 595 noninteracting protein pairs from the HPIV negative dataset. Compared to other simple approaches [31, 32], HPIV applies multiple lines of evidence such as functional, localization, expression and homology-based data [30].  3.2.3 Evaluation of the GAIA algorithm  The performance of the scoring method was measured by area under the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). ROC curve provide an indication of sensitivity and specificity. The area under the curve highlights discrimination (i.e., the correct classification of interacting and non-interacting proteins). The ROC curve was generated by calculating the true positive rate (sensitivity) and the false positive rate (1-specificity) at the different thresholds on scores derived from PPIs and DDIs in the network, and combined scores from both kinds of interactions against the “gold-standard” data set. If the number of hits of any domain pair in a protein pair was above the threshold and it was in the DDIs of the positive portion of the “goldstandard” data set, then it was regarded as a true positive. Alternatively, if it was not in the positive portion of the “gold-standard” dataset, then it was a false positive. If the number of hits of a domain pair in a protein pair was below the threshold and it was in the negative portion of the “gold-standard” data set, then it was regarded as a true  68  negative. Alternatively, if it was not in the negative portion of the “gold-standard” data set, then it was a false negative.  The sensitivity, specificity and positive prediction value (PPV) were calculated as follows:  Sensitivity =  No. of True Positives No. 
of True Positives + No. of False Negatives  (3)  Specificity =  No. of True Negatives No. of True Negatives + No. of False Positives  (4)  PPV =  No. of True Positives No. of True Positives + No. of False Positives  (5)  3.2.4 Data and program availability  The related data sets and scripts, source code, and binaries are available for download from [33]. All scripts were written in Perl language version 5.8.6 and tested on a MacOS10.4.10 with a Macintosh work station (2.4 GHz Intel Core 2 Duo with 2GB 667 MHz DDR2 SDRAM). The source code and scripts are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.  69  3.3 Results and discussion 3.3.1 Performance of the GAIA algorithm  To evaluate the performance of our algorithm, we tested the GAIA algorithm against both gold-standard positive and negative PPI data sets by setting the n-gram length to 4 and the threshold of DDI hits to 8.3. For the positive data set, 82% (886 out of 1080) of interacting domain pairs were successfully predicted. For the negative data set, 21% (161 out of 767) of non-interacting domain pairs were incorrectly predicted to interact with each other. These results indicate that our algorithm achieves a sensitivity of 82% and a specificity of 79%. A receiver operating characteristic (ROC) curve was plotted by measuring the sensitivity and specificity of GAIA tested against two gold-standard data sets at different cut-off values of DDI’s hits (Figure 3.2). The area under the curve (AUC) for the 4-gram is 0.79.  Next, we tested whether predicted DDIs could be utilized to predict PPIs. When there is at least one of our predicted DDI existing between a pair of proteins, this pair of proteins is predicted as interacting with each other. 
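As a quick check of the arithmetic, the sketch below recomputes the domain-level sensitivity and specificity from the counts reported above using equations (3) to (5), and expresses the "at least one predicted DDI" rule as a function. The function names and the strict "greater than" comparison against the 8.3 threshold are assumptions made for illustration; the sketch is not taken from the GAIA source code.

```python
def sensitivity(tp, fn):
    """Equation (3): fraction of true interactions that were predicted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Equation (4): fraction of non-interactions that were correctly rejected."""
    return tn / (tn + fp)


def ppv(tp, fp):
    """Equation (5): fraction of positive predictions that are true interactions."""
    return tp / (tp + fp)


def predict_ppi(ddi_hit_scores, threshold=8.3):
    """Call a protein pair interacting if any of its domain pairs scores above the threshold."""
    return any(score > threshold for score in ddi_hit_scores)


# Domain-pair counts reported in the text (threshold 8.3, 4-grams).
tp, fn = 886, 1080 - 886   # positive set: 886 of 1080 interacting domain pairs predicted
fp, tn = 161, 767 - 161    # negative set: 161 of 767 non-interacting pairs predicted

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

With these counts the two printed values round to 0.82 and 0.79, matching the percentages quoted in the text.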
For the positive data set, 76% (452 out of 595) of interacting protein pairs were successfully predicted. For the negative data set, 25% (149 out of 595) of non-interacting protein pairs were incorrectly predicted to interact with each other, giving a sensitivity of 76% and a specificity of 75% when the threshold of DDI hits is set to 8.3. These results suggest that GAIA compares favorably even with in vivo experimental PPI identification approaches, for which sensitivities of 60-65% [1-6, 8] have been reported in recent publications [26, 34, 35]. However, it should be noted that GAIA predicts PPIs under the assumption that interactions between given proteins are mediated by pairs of domains. Therefore, GAIA is not able to predict PPIs mediated by amino acid segments outside of known interacting domains.

Figure 3.2 The performance of the GAIA algorithm using different length gram pairs. Receiver operating characteristic (ROC) curves plotted for different thresholds when tested against the gold-standard positive and negative data sets. The area under the curve is 0.51 for 3-grams, 0.79 for 4-grams and 0.52 for 5-grams.

3.3.2 Parameters of the GAIA algorithm

The GAIA algorithm is based solely on protein sequence; no further information is needed. Only two parameters need to be tuned: (i) the length of the gram (Lg) and (ii) the threshold of the number of DDI hits (Nhit). From the ROC plots (Figure 3.2), we found that with a gram length of 3 or less, the DDI hits are not specific to the input DDI data set, yielding low true positive and high false positive rates. Conversely, with a gram length of 5 or more, the DDI hits are too specific, and therefore too few, to differentiate between the positive and negative data sets. We therefore concluded that 4-grams yield the best accuracy. Choosing a lower threshold value improves the sensitivity at the expense of the specificity.
Conversely, a higher threshold results in decreased sensitivity with an increase in specificity. Based on the ROC plots, GAIA achieves a sensitivity of 82% and a specificity of 79% when the threshold is set to 8.3 (Figure 3.2).

3.3.3 Case studies on predicted DDIs

Our predictions were directly validated for some PPIs using documented three-dimensional structures available in the literature. For example, RPB1 (YDL140C, NP_010141.1) and RPB8 (YOR224C, NP_014867.1), two subunits of RNA polymerase II, are known to interact with each other [36]. Based on the iPfam annotation, these two proteins have three DDIs: PF04983 vs. PF03870; PF05000 vs. PF03870; PF04922 vs. PF03870. GAIA successfully predicted the interaction between this pair of proteins. Interestingly, we found a 4-gram pair (KLTL:EAAS) that may contribute to the PPI. The first 4-gram, KLTL, is located at residues 533-536, which correspond to PF04983 (RNA polymerase Rpb1) on RPB1. EAAS is located at residues 27-30, which correspond to PF03870 (RNA polymerase Rpb8) on RPB8 (Figure 3.3).

We also examined the interaction between COR1 (YBL045C, NP_009508.1) and QCR2 (YPR191W, NP_015517.1), two subunits of the ubiquinol cytochrome-c reductase complex (cytochrome bc1 complex), which is involved in cell respiration as part of the mitochondrial inner membrane electron transport chain [37]. The interaction between COR1 and QCR2 has been validated by experimental approaches [1, 5, 36] and was also predicted by the GAIA algorithm. From the GAIA results, two gram pairs may contribute to this interaction. The first pair (GVSN:GGLF) is located at residues 68-71, which correspond to PF00675 (Peptidase family M16) on COR1, and at residues 282-285, which correspond to PF05193 (Peptidase M16 inactive domain) on QCR2.
The second pair (LHST:VRDQ) is located at residues 164-167, which also correspond to PF00675 on COR1, and at residues 289-292, which correspond to PF05193 on QCR2.

Figure 3.3 3D structure of the interaction between RPB1/YDL140C and RPB8/YOR224C. A 4-gram pair, KLTL:EAAS (red region), that is predicted to contribute to the DDI between PF04983 on RPB1 and PF03870 on RPB8 is highlighted. This gram pair is expanded on the right side of the figure for clarity. The figure was generated from the PDB crystal structure (PDB: 1y1v) using the protein structure viewing tool Cn3D [38].

3.3.4 Detecting new DDI-mediated PPIs and unknown domains

The GAIA tool performs well on previously reported PPIs mediated by DDIs in the gold-standard data set, with a true positive rate of 82%. We therefore sought to apply the GAIA tool to identify novel PPIs and to determine the domains through which these interactions are mediated. Recently, Smy2p (YBR172C, NP_009731.2), a yeast protein of unknown function, was found to interact with the Sec23p (YPR181C, NP_015507.1)/Sec24p (YIL109C, NP_012157.1) subcomplex and to participate in coat protein complex II (COPII) vesicle formation from the endoplasmic reticulum (ER) [39]. The interaction between Smy2p and Sec23p was successfully predicted by GAIA. According to the domain annotations from the Pfam database [28], there is one annotated domain (PF02213: GYF) in Smy2p and five annotated domains (PF04810: zf-Sec23_Sec24; PF04811: Sec23_trunk; PF08033: Sec23_BS; PF04815: Sec23_helical; PF00626: Gelsolin) in Sec23p. Currently, there is no report of DDIs between Smy2p and Sec23p in the literature. However, upon close examination of the prediction results from GAIA, we found two gram pairs that may contribute to this PPI. The first pair has 18.7 hits in the DDI data set and is located at residues 410-413 of Sec23p, which correspond to PF08033, and residues 68-71 of Smy2p.
The second pair has 15.3 DDI hits and is located at residues 409-412 of Sec23p, which correspond to PF08033, and residues 499-502 of Smy2p. These results suggest that the beta-sandwich domain on Sec23p might be involved in the PPI between Sec23p and Smy2p. Furthermore, we found that a 4-gram located at residues 616-619 in the beta-sandwich domain of Sec24p is predicted to interact with a 4-gram located at residues 713-716 of Smy2p, further supporting an important role for the beta-sandwich domain in the interaction between Sec23p/Sec24p and Smy2p. However, no known domain annotations are associated with the locations of the 4-grams on Smy2p, suggesting that potential domains of functional interest on Smy2p remain to be validated experimentally.

In addition to identifying new PPIs mediated by DDIs, we tested whether GAIA can infer new interacting domains from the predicted PPIs. Bud5 (YCR038C, NP_009967.2) and Bud8 (YLR353W, NP_013457.1) are two proteins involved in bud-site selection of diploid cells in yeast [40]. Krappmann et al. used systematic structure-function analyses to show that Bud5p physically interacts with Bud8p, and also with Bud9p (YGR041W), which is involved in the delivery of the proteins to the cell poles [41]. They found that residues 74-216 of Bud8p and residues 91-218 of Bud9p are interacting domains required to bind Bud5. GAIA predicted a 4-gram pair that might mediate this interaction. This gram pair has 12.4 hits in the DDI data set and is located at residues 183-186 of Bud8p, which lie within the newly discovered 74-216 region mentioned above. These data support our hypothesis that GAIA can be used to detect novel interacting domains from public domain-related data sets.

3.3.5 Characterizing over-represented gram pairs

In our study, we have demonstrated that gram pairs are valid elements in determining DDIs.
In order to shed light on how these gram pairs actually interact with each other, we sought to identify and characterize the gram pairs over-represented in DDIs in the yeast proteome. We generated a list (Table 3.1) of over-represented gram pairs from the DDI data by quantifying occurrences in both the DDI data set and randomized negative data sets. The randomized negative data sets contain the same number of domain pairs as the iPfam DDI data set, but these domain pairs do not exist in the iPfam DDI data set. As shown in Table 3.1, we found that most over-represented gram pairs are identical to each other.

Table 3.1 A list of the most frequent gram pairs in our domain-domain interaction data set.

Gram A   Gram B   Frequency   P-value
LKEL     LKEL     36          2.2×10^-16
ELLK     ELLK     35          7.7×10^-16
LKKI     LKKI     33          2.2×10^-16
LKKL     LKKL     32          2.2×10^-16
LSKL     LSKL     32          2.2×10^-16
DLSK     DLSK     31          2.2×10^-16
ELLN     ELLN     31          2.2×10^-16
LKSL     LKSL     31          2.2×10^-16
EKLV     EKLV     30          2.2×10^-16
LKNL     LKNL     30          2.2×10^-16

For clarity, only gram pairs whose number of occurrences is greater than 30 were listed. P-values for gram pairs were calculated using a z-test by comparing the actual frequency of each gram pair to its corresponding frequencies in 1000 randomized domain-domain interaction data sets.

This finding suggests that some types of domains tend to interact with themselves. Such self-interactions could occur between SNARE transmembrane domains that promote the hemifusion-to-fusion transition [42]. Analyzing the DDI pairs in iPfam, we found that such self-interactions between domains constitute approximately half (51%) of iPfam DDIs. It is therefore not surprising that these identical gram pairs occur so frequently in the DDI pairs. A majority of these interacting gram pairs consist of two consecutive hydrophilic (K, E or N) amino acids flanked by two hydrophobic amino acids (L, I or V), or two consecutive hydrophobic amino acids flanked by two hydrophilic amino acids.
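The flanking pattern just described can be checked mechanically against the grams listed in Table 3.1. The sketch below is an illustration only: it takes the residue classes literally from the sentence above (hydrophilic K, E, N; hydrophobic L, I, V), labels every other residue "x", and uses pattern codes ("H", "P") and function names that are assumptions introduced here rather than part of GAIA.

```python
HYDROPHILIC = set("KEN")   # residues named as hydrophilic in the text
HYDROPHOBIC = set("LIV")   # residues named as hydrophobic in the text


def residue_class(residue):
    """Return 'H' for hydrophobic, 'P' for hydrophilic, 'x' for anything else."""
    if residue in HYDROPHOBIC:
        return "H"
    if residue in HYDROPHILIC:
        return "P"
    return "x"


def pattern(gram):
    """Encode a gram as a string of residue classes, e.g. 'LKEL' becomes 'HPPH'."""
    return "".join(residue_class(r) for r in gram)


# Grams from Table 3.1 (Gram A and Gram B are identical in every row).
TABLE_GRAMS = ["LKEL", "ELLK", "LKKI", "LKKL", "LSKL",
               "DLSK", "ELLN", "LKSL", "EKLV", "LKNL"]

# Two hydrophilic residues flanked by hydrophobic ones, or the reverse.
FLANKED = {"HPPH", "PHHP"}
matching = [g for g in TABLE_GRAMS if pattern(g) in FLANKED]

for gram in TABLE_GRAMS:
    print(gram, pattern(gram), "flanked" if gram in matching else "other")
```

Under these deliberately narrow residue sets, six of the ten listed grams (LKEL, ELLK, LKKI, LKKL, ELLN and LKNL) fit one of the two flanked patterns; grams containing S or D fall outside only because those residues are not named in the sentence above.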
We reason that this kind of distribution of hydrophobicity may place the two amino acids in the middle in an environment where their hydrophobicity is reinforced by the surrounding amino acids of opposite hydrophobicity. Such reinforcement may increase the chance of this gram interacting with another gram with similar hydrophobicity reinforcement.

3.3.6 Comparison between different approaches

DDI prediction algorithms similar to GAIA, such as the association method (AM) [14], the maximum likelihood estimation approach (MLE) [15] and the relative co-evolution of domain pairs approach (RCDP) [18], have recently been reported. It is difficult to compare the prediction accuracy of these approaches directly because different testing data sets were used in each study. It is reported that AM achieves a sensitivity of 97% when tested against a small subset of interacting proteins. MLE achieved a sensitivity of 77.6% and a positive predictive value (PPV) of 42.5% when tested against a combined data set identified by the yeast two-hybrid (Y2H) system. RCDP reported a sensitivity of 63.95% against a positive data set containing interacting proteins with DDIs derived from Protein Data Bank (PDB) crystal structures [18] and a specificity of 55.19% against a data set of randomly generated protein pairs. To eliminate the possibility that our gold-standard data set is biased towards GAIA, we tested GAIA against the same testing data set (combined data from two Y2H data sets derived from Uetz et al. [6] and Ito et al. [4]) used in each approach. GAIA achieved a sensitivity of 78%, whereas AM and MLE achieved sensitivities of 42.5% and 24%, respectively, at the same specificity of 79% [15], indicating that GAIA outperforms both AM and MLE. To rule out the possibility that the improved performance is due merely to the better quality of the input data, we also trained AM and MLE on 6304 PPIs containing the identical number of DDIs as our GAIA training data set.
AM achieved a sensitivity of 51% with a specificity of 79%, and MLE achieved a sensitivity of 57% with a specificity of 79%, when tested against our gold-standard data set, indicating that protein sequence information combined with structural information derived from iPfam is a better basis for predicting DDIs. GAIA achieved a sensitivity of 83% when the specificity was set to 55% using the same testing data set as RCDP, illustrating that GAIA performs better than RCDP. In summary, GAIA has the following advantages compared to the other aforementioned approaches:
1) GAIA can achieve better sensitivity and specificity in detecting DDIs.
2) GAIA is based solely on domain sequences and DDIs derived from PDB, rather than on PPI information alone, whose poor data set quality may affect prediction performance.
3) By using protein sequences, GAIA precisely specifies the localization of interacting grams/hotspots.
We strongly believe that gram pairs such as those used in GAIA play a "signature" role in mediating the binding of a domain pair or protein pair.

3.4 Conclusions

GAIA is a novel tool for identifying DDIs that mediate PPIs. GAIA takes the public DDI data set and the domain sequence data set as inputs and predicts an interaction between a query protein pair if the DDI hit frequencies of the gram pairs across the query proteins are above the preset threshold (8.3 DDI hits). Tested against a "gold-standard" data set, GAIA achieves an 82% true positive rate at the expense of a 21% false positive rate. GAIA was used to identify a list of 4-gram pairs that are significantly over-represented in the DDI data set and that may mediate PPIs. GAIA allows us to predict currently unknown interacting domains and to identify potential interacting gram pairs/hotspots between proteins. This study complements previous prediction approaches and improves upon similar prediction modeling systems. The resultant predictions provide testable hypotheses for experimental validation.
GAIA is limited by its long computation time (10 min per protein pair), which is currently being addressed by modifying GAIA so that it can run in a distributed environment. While GAIA has good prediction capacity, increasing the size of the DDI data set would help to identify a more complete set of gram pairs within the DDI data sets. This could ultimately lead to a more complete identification of PPIs mediated by DDIs.

3.5 References

1. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
2. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147.
3. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183.
4. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574.
5. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643.
6. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
7. Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, Jacq B, Arpin M, Bellaiche Y, Bellusci S, Benaroch P, Bornens M, Chanet R, Chavrier P, Delattre O, Doye V, Fehon R, Faye G, Galli T, Girault JA, Goud B, de Gunzburg J, Johannes L, Junier MP, Mirouse V, Mukherjee A, Papadopoulo D, Perez F, Plessis A, Rosse C, Saule S, Stoppa-Lyonnet D, Vincent A, White M, Legrain P, Wojcik J, Camonis J, Daviet L: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15(3):376-384.
8. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736.
9. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543.
10. McGough AM, Staiger CJ, Min JK, Simonetti KD: The gelsolin family of actin regulatory proteins: modular structures, versatile functions. FEBS Lett 2003, 552(2-3):75-81.
11. Kato Y, Nagata K, Takahashi M, Lian L, Herrero JJ, Sudol M, Tanokura M: Common mechanism of ligand recognition by group II/III WW domains: redefining their functional classification. J Biol Chem 2004, 279(30):31833-31841.
12. Kim WK, Park J, Suh JK: Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Inform 2002, 13:42-50.
13. Ng SK, Zhang Z, Tan SH, Lin K: InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res 2003, 31(1):251-254.
14. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 2001, 311(4):681-692.
15. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12(10):1540-1548.
16. Wojcik J, Schachter V: Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics 2001, 17 Suppl 1:S296-305.
17. Chen XW, Liu M: Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 2005, 21(24):4394-4400.
18. Jothi R, Cherukuri PF, Tasneem A, Przytycka TM: Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions. J Mol Biol 2006, 362(4):861-875.
19. Pauling L, Corey RB, Branson HR: The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci U S A 1951, 37(4):205-211.
20. Vries JK, Liu X, Bahar I: The relationship between n-gram patterns and protein secondary structure. Proteins 2007, 68(4):830-838.
21. Birzele F, Kramer S: A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics 2006, 22(21):2628-2634.
King BR, Guda C: ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome Biol 2007, 8(5):R68. Wu CH, Huang H, Yeh LS, Barker WC: Protein family classification and functional annotation. Comput Biol Chem 2003, 27(1):37-47. Sangadala S, Metpally RP, Reddy BV: Molecular interaction between Smurf1 WW2 domain and PPXY motifs of Smad1, Smad5, and Smad6--modeling and analysis. J Biomol Struct Dyn 2007, 25(1):11-23. Lim WA, Richards FM, Fox RO: Structural determinants of peptide-binding orientation and of sequence specificity in SH3 domains. Nature 1994, 372(6504):375-379. Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics 2006, 7:365. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2007. PDB [http://www.pdb.org/] Saeed R, Deane C: An assessment of the uses of homologous interactions. Bioinformatics 2008, 24(5):689-695. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449453. Qi Y, Klein-Seetharaman J, Bar-Joseph Z: Random forest similarity for protein-protein interaction prediction from multiple sources. Pac Symp Biocomput 2005:531-542. GAIA [http://www.oicr.on.ca/research/ouellette/gaia] Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 2002, 20(10):991-997.  83  35. 36.  37. 
38. 39. 40. 41. 42.  Kiermer V: Protein-protein interactions: better by the dozen. Nat Methods 2007, 4(5):389. Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FC, Weissman JS, Krogan NJ: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 2007, 6(3):439450. Hunte C, Palsdottir H, Trumpower BL: Protonmotive pathways and mechanisms in the cytochrome bc1 complex. FEBS Lett 2003, 545(1):39-46. Cn3D [http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml] Higashio H, Sato K, Nakano A: Smy2p Participates in COPII Vesicle Formation Through the Interaction with Sec23p/Sec24p Subcomplex. Traffic 2007. Ni L, Snyder M: A genomic study of the bipolar bud site selection pattern in Saccharomyces cerevisiae. Mol Biol Cell 2001, 12(7):2147-2170. Krappmann AB, Taheri N, Heinrich M, Mosch HU: Distinct domains of yeast cortical tag proteins Bud8p and Bud9p confer polar localization and functionality. Mol Biol Cell 2007, 18(9):3323-3339. Hofmann MW, Peplowska K, Rohde J, Poschner BC, Ungermann C, Langosch D: Self-interaction of a SNARE transmembrane domain promotes the hemifusion-to-fusion transition. J Mol Biol 2006, 364(5):1048-1060.  84  4 Pandora, a PAthway and Network DiscOveRy Approach based on common biological evidence3 4.1 Introduction One definition of a biological pathway is a defined group of biological entities that are organized in a specified order and perform a specified biological task or function [1]. Complex structures in cells can be viewed as organizers of pathways, separating, directing and organizing the inputs and outputs of various pathways. Our understanding of how each pathway works and interacts with other pathways is, however, far from complete. Using high-throughput techniques, the internal organization of cells can be studied from a systematic perspective. 
For example, the interactomes of several model organisms such as Saccharomyces cerevisiae [2-7], Drosophila melanogaster [8, 9] and Caenorhabditis elegans [10] have been extensively studied in large-scale protein-protein interaction studies, providing rich data sets from which to map disparate functional modules in these interactomes onto biological pathways at the protein level. To complement these proteomic studies, the generation of large-scale genetic interactome data sets has helped us to interpret pathway organization in S. cerevisiae [11-14], C. elegans [15, 16] and D. melanogaster [17] at the gene-to-phenotype level. Similarly, at the transcription level, microarray techniques have generated large amounts of data enabling the construction of transcription networks for specific biological pathways under any given biological condition of interest [18]. In spite of these developments, results to date have yielded few overlapping data sets, making it difficult to infer the organization of pathways. This situation has prompted us to propose and develop a novel computational approach that integrates disparate biological information and predicts specific pathways (defined groups of proteins that are organized in a specified order and perform a specified biological task or function) and their organization.

(A version of this chapter has been published. Zhang KX, Ouellette BF: Pandora, a PAthway and Network DiscOveRy Approach based on common biological evidence. Bioinformatics 2010 Feb 15;26(4):529-35. Epub 2009 Dec 22.)

In defining a pair of proteins as the basic unit of a pathway, and by revealing the functional relevance of these pairs, biological evidence can be used to infer their roles in the context of a pathway. It is possible for us to utilize databases containing biological data sets to explore how pathways are organized.
Kelley and Ideker [19] developed a log-odds scoring model that identified 360 pathway pairs and 401 pathways in yeast by incorporating physical interactions and genetic interactions (synthetic-lethal and synthetic-sick interactions). Their study provides a starting point to reveal pathway organization and function from high-throughput data. Ulitsky and Shamir [20] proposed a modified methodology based on Kelley and Ideker’s approach and identified 140 pathway pairs and 280 pathways that contain more information regarding genetic interactions than the previous method. In both approaches, the connection of each protein pair is scored by the probability of observing this connection at random for the given networks, which might result in limited performance due to inaccurate null hypotheses of the underlying statistical tests. Further, neither of these methods considers the situation where some identified pathways contain both dense physical interactions and dense genetic interactions, resulting in large pathways that need to be further clustered. Instead of employing both physical interactions and genetic interactions, Ma and colleagues [21] designed a method using synthetic lethal interactions alone. They identified 2,590 pathway pairs and 5,180 pathways in yeast by searching for approximately complete bipartite graphs within the synthetic lethal interaction network. In a recent publication, Brady and colleagues introduced a novel approach that discovered 602 and 1,510 pathway pairs by searching for stable bipartite subgraphs on two different versions of genetic interaction networks [22]. However, since genetic interaction data are far from complete, only partial pathway organization can be inferred when using genetic interaction data alone, as the proteins outside of genetic interaction data sets are overlooked.
Thus, a more comprehensive understanding of cellular pathway organization requires more heterogeneous, functionally associated data to complement the genetic interaction data.

To address the above limitations, we incorporated four types of functionally associated data in the model organism S. cerevisiae: protein-protein interactions (PPIs), genetic interactions (GIs), domain-domain interactions (DDIs) and semantic similarity of GO terms. PPI data increase the gene coverage compared to the genetically interacting gene list. However, it has been demonstrated that the quality of large-scale PPI data is limited by high false positive and false negative rates [23, 24]. To overcome these limitations, we included DDIs to provide more biological evidence for protein pairs, as it is widely accepted that some proteins interact with each other through interactions between their respective domains, which are defined as structurally and/or functionally independent blocks of proteins [25, 26]. Semantic similarities of GO terms provide further evidence for a protein pair in terms of their biological functions. We integrated these four biological data sources for protein pairs into a weighted score that represents the pathway relevance between a pair of proteins. We developed a new graph clustering algorithm to group proteins sharing similar neighborhoods on the weighted network of yeast. By comparing our results to pathway annotations from KEGG [27], BioCyc [28] and Reactome [29], we found that our approach is able to predict biological pathways with a higher positive predictive value (PPV) than other approaches [19-22]. Our results, which revealed new members of pathways, provide testable hypotheses for experimental validation. Complemented with other predictive methods, our study makes promising progress in the process of deciphering the entire pathway organization in yeast cells.
This approach has application in other eukaryotic systems where large data sets are available.

4.2 Methods

4.2.1 Data sources

We downloaded physical interaction and genetic interaction data for S. cerevisiae from the BioGRID database (http://www.thebiogrid.org) [30] version 2.0.49. The BioGRID database is a literature-based repository containing physical interaction and genetic interaction data. For physical interactions, we selected the interactions categorized in BioGRID as “Two-hybrid”, “Affinity Capture-Luminescence”, “Affinity Capture-MS”, “Affinity Capture-RNA”, “Affinity Capture-Western”, “Biochemical Activity”, “Co-crystal Structure”, “Co-fractionation”, “Co-purification”, “Co-localization”, “Far Western”, “FRET”, “PCA”, “Protein-peptide”, “Protein-RNA” or “Reconstituted Complex”. For genetic interactions, only interactions labeled as “synthetic lethality” in BioGRID were selected. After removing redundant interactions, the interaction data contained 43,687 unique physical interactions and 10,735 genetic interactions. We also compiled 7,820 domain-domain interactions in yeast from two sources: 1) the iPfam database [31], a DDI database derived from RCSB Protein Data Bank (PDB) crystal structures (http://www.pdb.org); and 2) the list of predicted DDIs from our previously published GAIA algorithm [32], a method to identify interacting protein domains.

4.2.2 Gene ontology similarity scores

The functional relationship of proteins can be estimated from how they share protein annotation in a controlled vocabulary system, such as Gene Ontology (GO) [33]. We assigned a semantic similarity score to each protein pair to represent how closely they work together in a molecular function. We downloaded the GO terms associated with each protein from the Saccharomyces Genome Database [34], as of October 2008.
Given two groups of GO terms (G1 and G2) for two query proteins P1 and P2, semantic similarity between protein pairs was calculated by an approach similar to G-SESAME [35]:

$$\mathrm{Sim}(G_1, G_2) = \frac{\sum_{1 \le i \le |G_1|} \sum_{1 \le j \le |G_2|} \mathrm{Sim}(\mathrm{Term1}_i, \mathrm{Term2}_j)}{|G_1| \times |G_2|}$$

where |G1| and |G2| are the numbers of GO terms associated with P1 and P2, respectively, and Term1_i and Term2_j are the i-th term of G1 and the j-th term of G2. The range of semantic similarity scores lies between 0 and 1. The semantic similarity score between two GO terms t1 and t2 was calculated by the following equation:

$$\mathrm{Sim}(t_1, t_2) = \frac{\sum_{t \in \mathrm{ancestors}(t_1) \cap \mathrm{ancestors}(t_2)} \left( \mathrm{Score}_{t_1}(t) + \mathrm{Score}_{t_2}(t) \right)}{\sum_{t \in \mathrm{ancestors}(t_1)} \mathrm{Score}_{t_1}(t) + \sum_{t \in \mathrm{ancestors}(t_2)} \mathrm{Score}_{t_2}(t)}$$

Score() is the function that measures the edges (semantic relations) connecting two GO terms and is defined as:

$$\mathrm{Score}_{t_1}(t) = \max \{ \mathrm{weight} \times \mathrm{Score}_{t_1}(t') \} \quad \text{if } t \neq t_1$$

where t’ ranges over the children of the GO term t. If t = t1, the score is 1. The weight is 0.8 for the “is-a” relation and 0.6 for the “part-of” relation, as done by Wang [35].

4.2.3 Data integration to a weighted biological network

For each protein pair in the physical and genetic interaction data, we assigned a confidence score to each connection by combining four types of biological evidence: physical interaction, genetic interaction, domain-domain interaction and GO term similarity. If a physical interaction connects a pair of proteins, we assigned 1 to it, otherwise 0. If a domain-domain interaction connects a pair of proteins, we assigned 1 to it, otherwise 0. To minimize genetic interactions within pathways, we assigned 0 to a pair of proteins if a genetic interaction connects them, otherwise 1. We followed the previously described method to calculate a GO term similarity score for each pair. An integrated confidence score (c) was calculated by averaging these four scores under the assumption that the score from each type of evidence contributes equally to the association between a pair of proteins.
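The averaging step of Section 4.2.3 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `confidence_score` and its arguments are hypothetical, and the GO term similarity is assumed to have been computed beforehand as a value between 0 and 1.

```python
def confidence_score(has_ppi: bool, has_ddi: bool, has_gi: bool,
                     go_similarity: float) -> float:
    """Average four evidence scores for one protein pair (illustrative only).

    has_ppi / has_ddi : a physical or domain-domain interaction connects the pair
    has_gi            : a synthetic-lethal genetic interaction connects the pair
    go_similarity     : precomputed GO semantic similarity, assumed in [0, 1]
    """
    if not 0.0 <= go_similarity <= 1.0:
        raise ValueError("go_similarity must lie between 0 and 1")
    ppi = 1.0 if has_ppi else 0.0
    ddi = 1.0 if has_ddi else 0.0
    # Genetic interactions are scored inversely: 0 if present, 1 if absent.
    gi = 0.0 if has_gi else 1.0
    # Equal contribution of each evidence type, as assumed in the text.
    return (ppi + ddi + gi + go_similarity) / 4.0


if __name__ == "__main__":
    # A pair with PPI and DDI support, no genetic interaction, GO similarity 0.8.
    print(confidence_score(True, True, False, 0.8))
```

Under this scheme a pair can only exceed a confidence threshold such as 0.7 when most evidence types agree, which is consistent with the thresholds explored in the parameter tuning.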
Finally, we generated a biological network in which each protein is connected to other proteins by weighted edges. In total, the resultant network contained 5,280 proteins.

4.2.4 Pathway finding algorithm

We developed a new clustering algorithm based on the weighted network. Given a weighted biological network G in yeast, our algorithm performs the following steps to find clusters representing pathways {P}, in a similar fashion to previous studies [36, 37]:

Step A. For each protein in the network, a pathway protein label was applied if it had at least n topologically similar proteins. Here, n was set to 2, the minimal size of a pathway being two proteins. Given a protein x, the set Y of proteins topologically similar to x was defined by the Jaccard coefficient:

$$Y = \left\{ t_i : \frac{|\mathrm{neighbors}(t_i) \cap \mathrm{neighbors}(x)|}{|\mathrm{neighbors}(t_i) \cup \mathrm{neighbors}(x)|} > s,\ t_i \text{ is one of the neighbors of } x \right\}$$

Here, s is the threshold of topological similarity scores.

Step B. Each protein labeled as a pathway protein was used as the starting point of a pathway P by iteratively searching for proteins topologically similar to it and adding them to P unless they had already been classified.

Step C. Each remaining protein (not labeled as a pathway protein) was added to each pathway if it had connections to multiple pathways; otherwise, it was classified as a non-pathway protein.

Pathway Finding Algorithm
Input: G, s, n
Output: {P}
for each x ∈ G do
    T = neighbors(x)    // T is the set of neighbors of x
    for each t ∈ T do
        y = |neighbors(t) ∩ neighbors(x)| / |neighbors(t) ∪ neighbors(x)|
        if (y > s)
            topological_neighbors ← t
        end if
    end for
    if (|topological_neighbors| >= n)
        pathway_proteins ← x
    end if
end for
until each protein x ∈ pathway_proteins is assigned to a pathway ID do
    assign x to a pathway P
    recursively find topologically similar proteins Y of x
    until each protein y ∈ Y is assigned to a pathway ID do
        assign protein y to P
    end until
end until
return {P}

4.2.5 Evaluation of the algorithm (Adjusted Rand Index)

We utilized the adjusted Rand index (ARI) [38] to measure the similarity of our resultant pathway organization to other pathway annotation sources. The adjusted Rand index has been widely used in determining the agreement between two partitions of any network. Scores lie between 0 and 1, and when the two tested partitions agree perfectly, the score is 1. For each identified pathway from our approach, we compared it to every pathway in three pathway databases (KEGG [27], BioCyc [28] and Reactome [29]) and calculated the adjusted Rand index score for each identified pathway. Given a pathway X from our approach and an annotated pathway Y from KEGG, BioCyc or Reactome, the adjusted Rand index was calculated as:

$$\mathrm{ARI}(X, Y) = \frac{2 (A \times B - C \times D)}{(A + D) \times (D + B) + (A + C) \times (C + B)}$$

where A, denoted as (X∩Y), is the number of proteins appearing in both pathways X and Y; B, denoted as (Z-(X∪Y)), is the number of proteins appearing in neither pathway X nor Y, given the number of proteins Z in yeast (5,280 in this study); C, denoted as (X-(X∩Y)), is the number of proteins appearing in pathway X but not in Y; D, denoted as (Y-(X∩Y)), is the number of proteins appearing in pathway Y but not in X.

The final index score of pathway X is defined as the maximal score compared to all annotated pathways in the databases:

$$\mathrm{Score}_{ARI}(X) = \max_{i=0}^{n} \left( \mathrm{ARI}(X, Y_i) \right)$$

We regarded pathway X as a true positive if ScoreARI(X) is greater than or equal to 0.5, which meant that at least half of the two tested pathways agree with each other. This cutoff is significantly greater than found by chance (Wilcoxon Rank Sum test, P < $10^{-4}$).
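As an illustration of the evaluation in Section 4.2.5, the following Python sketch computes the pairwise ARI from the four counts A, B, C and D and takes the maximum over a set of annotated pathways. The function names `pathway_ari` and `score_ari`, and the representation of pathways as Python sets of protein identifiers, are assumptions made for this example and are not taken from the thesis.

```python
from typing import Iterable, Set


def pathway_ari(x: Set[str], y: Set[str], total_proteins: int) -> float:
    """ARI between a predicted pathway x and an annotated pathway y."""
    a = len(x & y)                    # proteins in both X and Y
    b = total_proteins - len(x | y)   # proteins in neither X nor Y
    c = len(x - y)                    # proteins in X but not in Y
    d = len(y - x)                    # proteins in Y but not in X
    denominator = (a + d) * (d + b) + (a + c) * (c + b)
    if denominator == 0:
        return 0.0                    # degenerate case, e.g. two empty sets
    return 2.0 * (a * b - c * d) / denominator


def score_ari(x: Set[str], annotated: Iterable[Set[str]],
              total_proteins: int) -> float:
    """Maximal ARI of x over all annotated pathways (0.0 if there are none)."""
    return max((pathway_ari(x, y, total_proteins) for y in annotated),
               default=0.0)


if __name__ == "__main__":
    predicted = {"P1", "P2", "P3"}
    database = [{"P1", "P2", "P3"}, {"P3", "P4"}]
    # Z = 5,280 proteins, the network size reported in the text.
    print(score_ari(predicted, database, 5280) >= 0.5)
```

A predicted pathway would be counted as a true positive when `score_ari` returns at least 0.5, mirroring the cutoff stated above.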
4.2.6 Network Randomization

Comparable control networks were generated by randomly rewiring a pair of edges to connect different pairs of nodes in the interaction networks and then repeating the rewiring step. The number of repeats equals the total number of edges in the networks. This method was previously reported and utilized by other groups [39, 40]. With this approach, the degree distribution of a given interaction network can be preserved. The randomization procedure was repeated 1000 times.

4.3 Results and discussion

4.3.1 Parameter tuning

Pandora identifies pathways by finding neighboring proteins based on confidence scores of protein pairs derived from multiple types of biological evidence. Only two parameters for this method require tuning: (i) the threshold of confidence scores (c); and (ii) the threshold of topological similarity scores (s). We applied our pathway finding approach using different combinations of c and s. We then evaluated the performance of our approach by calculating the positive predictive value (PPV), which is generated by comparing our identified pathways to the Reactome pathways based on ARI scores. Here, PPV is defined as: No. of True Positives / (No. of True Positives + No. of False Positives). From the performance plot (Figure 4.1), we concluded that our approach achieves the best PPV performance when c and s are set to 0.7 and 0.5, respectively. With these settings, the PPV is 12.8% when tested against the Reactome pathway annotations. Identical settings show good performance for the KEGG and BioCyc pathway annotations (Figure 4.1). In addition, when c and s were set to 0.7 and 0.5, we also observed the best recall rates (sensitivity) obtained by our approach when tested on the three pathway databases (Figure 4.2). The best recall rates for Reactome, KEGG and BioCyc are 6.6%, 8.3% and 8%, respectively. We found that with higher c and s, small sub-networks are generated, consequently lowering the PPV. In contrast, with lower c and s, the network contains high noise and generates many false positives.

Figure 4.1 3D PPV performance plot tested on different combinations of the threshold of confidence scores (c) and the threshold of topological similarity scores (s). The positive predictive values (PPVs) of our approach are plotted for different combinations of thresholds when tested against Reactome (A), KEGG (B) and BioCyc (C). For simplicity, only c ranging from 0.6 to 0.9 and s ranging from 0.2 to 0.9 are tested. The red dot represents the peak showing the best performance of our approach when c and s are set to 0.7 and 0.5, respectively.

4.3.2 Summary statistics of identified pathways

Our approach identified 195 biological pathways, which cover 31% (1,617 out of 5,280) of the yeast proteins, 38% (16,685 out of 43,687) of the physical interactions, 8.3% (890 out of 10,735) of the synthetic lethal interactions, and 18% (1,407 out of 7,820) of the DDIs involving yeast proteins. The relatively high coverage of both physical interactions and DDIs and the low coverage of genetic interactions indicate that the pathways identified in our study tend to have dense physical interactions while the genetic interactions in these pathways are sparse. It is not surprising that we identified fewer pathways than previous methods, because more constraints, such as GO term similarity scores and DDIs, were applied to ensure the reliability of the identified pathways. The size of identified pathways ranged from 2 to 407 proteins, with a strong bias towards short pathways.
The distribution of pathway size in our study is statistically consistent with that of pathways generated from two previous methods (Kelley and Ideker, 2005; Ulitsky and Shamir, 2007) based on physical interaction data and genetic interaction data, with p-values of 0.04 and $2.4 \times 10^{-5}$, respectively, by the Wilcoxon Rank Sum test (Figure 4.3). However, the distribution is not consistent with that of those approaches (Brady, et al., 2009; Ma, et al., 2008) based on genetic interactions alone, with p-values of 0.42 and 0.07, respectively, by the Wilcoxon Rank Sum test. We also found a correlation between the number of protein hubs and the size of the pathway (the Pearson correlation coefficient is 0.79 at p-value < $2.2 \times 10^{-16}$). In other words, more protein hubs were identified in pathways of larger size. Here, we defined the top 20% of proteins by degree in the PPI network of S. cerevisiae as “protein hubs”, following Yu and colleagues [41]. Taken together, we propose that such a distribution of pathway size reflects a scale-free topological property present in the network, a property that is currently supported by multiple types of biological evidence but not by the genetic interaction network alone. The identified pathways and their members are listed in Table 4.1. We found that the topological properties of the source PPI network are similar to those of the network of our identified pathways, which indicates that our approach does not appear to have a bias towards the highly connected areas of the source PPI network (Table 4.2).

Figure 4.2 3D recall rate performance plot tested on different combinations of the threshold of confidence scores (c) and the threshold of topological similarity scores (s). The recall rates of our approach were plotted for different combinations of thresholds when tested against Reactome (A), KEGG (B) and BioCyc (C). For simplicity, only c ranging from 0.6 to 0.9 and s ranging from 0.2 to 0.9 are tested. The red dot represents the peak showing the best performance of our approach when c and s are set to 0.7 and 0.5, respectively.

Figure 4.3 Distribution of pathway sizes of different approaches. The distribution of pathway sizes of Kelley and Ideker [19] is represented by the red line; the distribution of pathway sizes of Ulitsky and Shamir [20] is represented by the blue line; the distribution of pathway sizes of Ma et al. is represented by the green line; the distribution of pathway sizes of Brady et al. is represented by the brown line and the distribution of pathway sizes of our approach is represented by the black line. All pathways are non-redundant.

4.3.3 Validation of our approach

GO term enrichment analysis has been used to measure the cellular functions of identified pathways in previous studies [42, 43]. However, because GO semantic similarity scores have been integrated into our approach as one of the types of biological evidence, we used a different evaluation method to measure pathway biological function. We tested our identified pathways on three public pathway databases: KEGG [27], BioCyc [28] and Reactome [29]. The KEGG database contains manually annotated pathways based on biochemical evidence from the literature, including metabolism, genetic information processing, environmental information processing and cellular

Table 4.1 A list of identified pathways in this study. 195 pathways were identified in this study. Each pathway is numbered in the first column. Numbers of Proteins is the number of gene products in each pathway. Protein list is the list of gene products in each pathway. The annotation of each pathway was assigned by the GO term with the smallest P-value from FunctionAssociate [44]. ARI score is the highest ARI score of each identified pathway obtained by comparing it to every pathway in three pathway databases.

Pathway 0  No. 
of Proteins  Proteins  GO Annotation  407 YGR129W YPR165W YKL168C YMR197C YDL116W YIL143C 0000398: nuclear mRNA YDR167W YLR052W YPR086W YBR279W YKR055W YNL025C splicing, via YDR443C YGL092W YMR183C YER146W YPL140C YML049C spliceosome/mRNA YDR416W YDL108W YMR240C YPL115C YPL119C YLL019C YLR078C splicing/pre-mRNA splicing YML015C YER107C YPR054W YMR075W YDR088C YHR079C YGL120C YMR001C YBR055C YGR040W YGL210W YLR293C YOL115W YOR231W YPR168W YGR104C YPR057W YNL180C YDL101C YJL081C YNR031C YML010W YDR164C YDR283C YCR081W YJL176C YFR005C YFL013C YMR091C YBR065C YDL175C YOR267C YER031C YDR079C-A YNL286W YNL093W YKR022C YGL078C YHR102W YOR194C YKL095W YHR041C YHR086W YNL298W YHR165C YMR263W YOL113W YJR132W YDR192C YPR178W YER022W YKL006C-A YPL026C YLR262C YGL222C YDL030W YER112W YLR385C YKR029C YML031W YDL087C YLR335W YDR490C YHR099W YDL098C YML103C YMR104C YPL129W YJR059W YKL139W YPR082C YNL112W YDR389W YOR089C YJR022W YPR034W YIL126W YDR477W YBR188C YER114C YKL173W YFL005W YGR275W YOR244W YKL012W YAL041W YDR308C YFR037C YIL097W YNL216W YHL007C YJL050W YAR042W YIR009W YOR127W YGR116W YER164W YGR002C YHR082C YMR153W YDR498C YDR359C YAR018C YDL127W YGL238W YNR010W YLR321C YBR200W YGR091W YGL244W YGR005C YGL098W YDL209C YLR418C YDR247W YKR008W YHL031C YPR056W YOR185C YBR264C YLR116W YHR205W YBR135W YLR026C YDR507C YOL145C YOR290C YLR248W YKL126W YNL097C YML127W YBR095C YIL004C YKL203C YLR113W YFL024C YGL194C YMR135C YOR106W YMR236W YER172C YJL124C YOL051W YFL002C YLR424W YMR112C YGL212W YIL084C YNR023W YJR042W YGR074W YGL180W YPR101W YGL158W YGR186W YDL002C YDR190C YDL088C YNL161W YDR334W YLR071C YPL031C YBR059C YBR231C YDR379W YGR063C YDR189W YOR304W YOR233W YLL036C YOR123C YGL025C YDL005C YDL056W YJL165C YOL149W YCL032W YCR042C YJR082C YOR134W YIL115C YJL141C YCR038C YLR362W YGR278W YDR378C YNL307C YFL017W-A YDR311W YLR438C-A YNR011C YCR033W YKL048C YBR253W YBL016W YER009W YGL100W YCL039W YPL151C YBR289W YBR198C YHR058C YPL209C YDR364C YJL041W YKR062W YER136W YLR229C 
YGR075C YFL033C YMR005W YLR208W YBL026W YFL038C YEL018W YGL172W YMR290C YBL079W YDR460W YLR147C YIL061C YHL025W YGR056W YOL135C YGL195W YDR235W YNL289W YGR119C YPR106W YER013W YER029C YDR309C YNL236W YFR009W YOL004W YKL028W YAL011W YNL304W YHR056C YJL128C YFL047W YML098W YDR002W YBR260C YLR396C YML001W YIR005W YMR047C YLR276C YJL061W YBR193C YDL043C YJL203W YPL235W YBR119W YHR061C YLL008W YFR002W YPL122C YDR264C YBL093C YPL011C YPL082C YOR319W YDR422C YLR371W YDR303C YDR523C YOR204W YKL058W YPR023C YKL068W YMR235C YJL187C YDL019C YBR103W YLR298C YBR245C YDL028C YKR014C YLR093C YDR073W YFL029C YDL076C YMR227C YCL024W YML041C YPL153C YNL330C YGL095C YKL074C YOR308C YLR005W YBR160W YLR275W YPR025C YMR288W YOR327C YPR070W YPL232W YPR182W YNL245C YML046W YER035W YJL095W YER171W YOR098C YDR163W YLR268W YIL079C YOL018C YKL092C YIL063C YJL106W YGR006W YHR030C YNR038W YOR141C YDR473C YMR268C YGL150C YKL057C YGR152C YKR082W YMR129W YKR024C YPL204W YBR142W  100  Pathway  No. of Proteins  Proteins  GO Annotation  YOR159C YDL159W YCR052W YGL112C YPR031W YDR485C YNL136W YKR086W YPL181W YNL059C YLR033W YDR145W YLR085C YLR398C YNL090W YGR233C YIL112W YGR274C YML114C YDL031W YIL095W YER105C YER027C YLR096W YDR468C YDL240W YER155C YAL030W YPL016W YOR174W YER111C YNL118C YMR213W YBR152W YCR091W YLR357W YDR240C YNL299W YMR033W YGR092W YNL147W YKL196C YNL107W YGR013W YAR019C YLR182W YLR055C YJR050W YOR148C YOR036W YCR073C YDR255C YJR066W YPL139C 1 2  3  4 5  6  7 8  9  3 YPR156C YGR138C YLL028W 65 YMR193W YML025C YDL202W YPL118W YBR122C YGR220C YNL081C YJR113C YDR322W YGR076C YDR494W YDR237W YBL038W YKL155C YIL093C YMR188C YOR150W YNL252C YKR085C YPL013C YDR175C YML009C YDR116C YLR439W YNL177C YJR101W YKL167C YPR166C YBR146W YHL004W YOR158W YGL129C YNL005C YDR462W YGL068W YGR084C YBR251W YDR347W YBR268W YMR024W YBR282W YPL173W YJL063C YDR337W YDR405W YNR037C YDR041W YNL185C YNL137C YMR158W YDL045W-A YER050C YCR003W YLR312W-A YNL306W YNL284C YKL003C YGR165W YCR046C YHR147C YGR215W 
YKR006C YCR071C Q0140 YBL090W 25 YOR232W YGR082W YPL063W YNR017W YKL195W YMR203W YOR297C YEL020W-A YJL143W YJR135W-A YLR008C YNL131W YGR181W YKR065C YHR005C-A YPR133W-A YHR117W YNL328C YGR033C YDL217C YBR091C YIL022W YJL104W YJL054W YNL121C 7 YER087C-B YLR292C YOR254C YBR171W YDR086C YPL094C YLR378C 16 YGR162W YMR146C YDR429C YPL237W YKR059W YOR361C YLR192C YNL244C YGL049C YER025W YJL138C YJR007W YMR260C YBR079C YMR309C YOL139C 56 YPL249C-A YLR075W YFR031C-A YDR012W YGL147C YLR406C YGR034W YIL018W YPL198W YLR029C YGL135W YEL054C YOR063W YFR032C-A YOL120C YDR382W YNL069C YKL180W YBR191W YBL027W YHR141C YML073C YPL143W YDL082W YPR102C YJL177W YDL136W YMR194W YBR031W YBR084C-A YLR325C YGL103W YBL092W YDR418W YGR085C YHL033C YOL039W YLR340W YNL301C YMR242C YHL001W YOR312C YPL220W YLL045C YGL076C YDL081C YDR471W YHR010W YLR448W YOL127W YOR234C YDL075W YLR344W YIL133C YPL131W YGL031C 2 YKL025C YGL094C  0000297: spermine transporter activity 0005761: mitochondrial ribosome  0006626: protein-mitochondrial targeting/mitochondrial protein import 0031205: Sec complex 0003743: translation initiation factor activity 0005842: cytosolic large ribosomal subunit (sensu Eukaryota)/60S ribosomal subunit/cytosolic large ribosomal subunit  0004535: poly(A)-specific ribonuclease activity 0000777: condensed chromosome kinetochore  22 YLR045C YBR233W-A YIR010W YCL029C YKR083C YGL061C YIL144W YER018C YGR113W YAL034W-A YDR201W YMR117C YJR112W YKR037C YDR320C-A YPL233W YDR016C YKL052C YKL138C-A YOL069W YBR211C YJR089W 152 YDL111C YHR119W YNL232W YDL213C YDL167C YMR049C YKL214C 0007028: cytoplasm YKR060W YBL008W YLL011W YDR195W YGR128C YOR206W organization and biogenesis YNL002C YJL122W YCR035C YHR196W YCR057C YOL010W YLR186W YER006W YJR041C YLR129W YOR272W YDR365C YHR148W YGR103W YKL021C YLR115W YGR156W YDR280W YMR229C YNL075W YHR089C YLR221C YJR140C YPL190C YHR197W YGR081C YDR398W YDR381W YDR101C YGR159C YBR258C YGR158C YNL110C YOR001W YPR144C YOL077C YNL317W YGR145W YOR078W YHR085W YNL308C 
YPL217C YNL207W YPL043W YLR196W YBL004W YDR449C YBR215W YOL021C YER126C YDR432W YER082C YFR021W YAR003W YPL093W YPL126W YLR222C YPL138C YNL016W YBR267W YKL193C YMR131C YNL251C YBR175W YDL148C YGR195W YER002W YPL012W YGR090W YHR066W YHR088W YLR435W YOL041C YLR197W YDR312W YGR095C YDR060W YNL004W YHR069C YPL211W YDR087C YDL060W YKL059C YPR143W YJL069C YAL025C YKL009W YGL029W YHR072W-A YNL132W YPR137W YKL172W YJR002W YIL019W YDR301W YNR053C YDR469W YOR038C YOR250C YCR072C YLR277C YMR061W YJR093C YOL123W YCL059C YML093W YER127W  101  Pathway  No. of Proteins  Proteins  GO Annotation  10  YCL011C YGL122C YMR128W YPR112C YNL182C YLR009W YOL142W YLR074C YBR212W YDR228C YBR247C YKL018W YNL175C YOR310C YOR294W YMR093W YLR015W YIR001C YDR324C YKR081C YJL109C YOL144W YGL111W YDL208W YAL043C YGL044C YNR054C YDL229W YNL197C YPR107C YLR409C YJL010C 3 YPR018W YBR195C YML102W 0006334: nucleosome assembly  11  2 YEL009C YPL038W  unknown  12  2 YEL015W YBR094W  unknown  13  15  16 17  18 19 20  21  23  24 26  27  28 29  30  16 YDR329C YHR160C YJL210W YMR026C YLR191W YOL147C YDR244W 0007031: peroxisome YDR142C YDR265W YDL065C YOL044W YGL153W YNL214W organization and biogenesis YPL112C YAL055W YGR077C 3 YNL071W YER178W YBR221C 0045254: pyruvate dehydrogenase complex/pyruvate dehydrogenase complex (lipoamide) 3 YJL137C YFR015C YLR258W 0035251: UDPglucosyltransferase activity 14 YMR256C YHR051W YDL067C Q0045 YLR395C Q0275 YNL052W Q0250 0005746: mitochondrial YPR191W YBL045C YFR033C YEL024W YGL187C YIL111W electron transport chain/respiratory chain 7 YDL165W YNR052C YCR093W YAL021C YPR072W YIL038C YER068W 0000289: poly(A) tail shortening 2 YGR240C YMR205C 0003872: 6phosphofructokinase activity 30 YGL070C YPR187W YPR110C YPR190C YOR340C YDR156W YDL140C 0030880: RNA polymerase YNR003C YOL005C YHR143W-A YDL150W YOR210W YKR025W complex YDR404C YNL248C YIL021W YNL113W YDR045C YBR154C YJR063W YJL011C YOR116C YJL148W YNL151C YKL144C YOR151C YOR207C YPR010C YOR224C YOR341W 2 YOR230W YOR229W 
0003714: transcription corepressor activity/transcription corepressor activity 2 YDR004W YER095W 0003714: transcription corepressor activity/transcription corepressor activity 11 YBR155W YPR189W YMR186W YGR123C YDR168W YOR027W 0006457: protein YPL240C YGL213C YCR060W YBL075C YAL005C folding/chaperone activity 11 YCL008C YLR417W YLR119W YMR077C YPL065W YLR025W YPL002C 0043162: ubiquitin-dependent YGR206W YKL002W YJR102C YKL041W protein catabolism via the multivesicular body pathway/ubiquitin-dependent protein catabolism via the MVB pathway 35 YNL178W YPL090C YNL096C YML063W YGL123W YJR145C YOL040C 0005843: cytosolic small YHL015W YGR214W YOR096W YML024W YLR441C YDR447C ribosomal subunit (sensu YLR287C-A YDR064W YML026C YIL069C YBR048W YJR123W Eukaryota)/40S ribosomal YPL081W YDL083C YLR367W YGR027C YHR203C YDR450W subunit/cytosolic small YER102W YBR189W YER074W YCR031C YJL190C YBL072C YGR118W ribosomal subunit YBR181C YDR025W YMR230W 3 YHR206W YLR006C YIL147C 0042542: response to hydrogen peroxide 3 YMR080C YGR072W YHR077C 0000184: mRNA catabolism, nonsense-mediated decay/mRNA catabolism, nonsense-mediated/nonsensemediated mRNA decay 4 YGR222W YMR257C YLR067C YNR045W 0045182: translation regulator activity  102  Pathway 31  32  33  34 35 36  38  No. 
of Proteins  Proteins  4 YDL095W YAL023C YOR321W YDL093W  0004169: dolichyl-phosphatemannose-protein mannosyltransferase activity/Oglycoside mannosyltransferase/dolichylphosphate-mannose-protein Omannosyltransferase activity/protein Omannosyltransferase 2 YML075C YLR450W 0004420: hydroxymethylglutaryl-CoA reductase (NADPH) activity 11 YLR127C YDL008W YKL022C YDR118W YGL240W YOR249C YBL084C 0008054: cyclin YLR102C YNL172W YFR036W YHR166C catabolism/degradation of cyclin 3 YDR181C YMR127C YOR213C 0004406: H3/H4 histone acetyltransferase activity 3 YOR026W YJL030W YGL086W 0007094: mitotic spindle checkpoint 4 YIL035C YOR039W YOR061W YGL019W 0005956: protein kinase CK2 complex/casein kinase II complex 8 YGR047C YGR246C YAL001C YOR110W YNL039W YDR362C YPL007C 0003709: RNA polymerase III YBR123C transcription factor activity  39  10 YIL046W YFL009W YDR328C YDL132W YLR368W YDR054C YML088W YOL133W YJL204C YJR090C  40  26 YBR210W YMR292W YDL145C YNL287W YIL076W YGR172C YAL042W YGL137W YOR016C YHR110W YNL044W YGL200C YAR002C-A YCL001W YGL002W YDL018C YML012W YML067C YAL007C YNL263C YER074W-A YFR051C YPL010W YGR284C YFL048C YDR238C 6 YPR019W YLR274W YEL032W YBR202W YBL023C YGL201C  41  42  43  14 YNL126W YDR356W YAL047C YOR373W YPL255W YHR172W YKL042W YLR212C YGL075C YBL034C YGL093W YPL124W YOR257W YNL225C 8 YOR115C YDR472W YML077W YDR407C YDR246W YMR218C YBR254C YKR068C  44  3 YDL220C YDR082W YLR010C  45  6 YDR179C YDL216C YIL071C YMR025W YOL117W YJR084W  46  2 YHR157W YMR133W  47  48  GO Annotation  15 YNL243W YLR337C YDL029W YBL007C YKL129C YCR088W YMR109W YJL020C YDR388W YNL084C YJR065C YCR009C YIR006C YHR114W YOR181W 2 YLR318W YIL009C-A  0019005: SCF ubiquitin ligase complex/SCF complex/Skp1/Cul1/F-box protein complex/cullin complex 0045045: secretory pathway  0042555: MCM complex/minichromosome maintenance complex 0005200: structural constituent of cytoskeleton 0030008: TRAPP complex/transport protein particle 0016233: telomere capping 0008180: signalosome complex/COP9 
complex 0000737: DNA catabolism, endonucleolytic/endonucleolytic degradation of DNA 0015629: actin cytoskeleton  0003720: telomerase activity  49  3 YBR179C YDR470C YOR211C  0008053: mitochondrial fusion  50  2 YIL074C YER081W  51  6 YPR088C YKL122C YML105C YDL092W YPL210C YPL243W  52  3 YKL008C YHL003C YMR298W  0004617: phosphoglycerate dehydrogenase activity 0048500: signal recognition particle 0046513: ceramide biosynthesis  54  4 YAR008W YPL083C YMR059W YLR105C  55  6 YAL003W YLR249W YDR385W YPR080W YOR133W YKL081W  0000214: tRNA-intron endonuclease complex 0003746: translation elongation factor activity  103  Pathway 56 57 58  No. of Proteins  Proteins  2 YPL049C YDR480W 10 YMR095C YNL333W YDR533C YPL280W YOR391C YNL334C YMR322C YMR096W YFL060C YFL059W 2 YKL007W YIL034C  59  5 YLR148W YDR080W YMR231W YPL045W YDL077C  60  3 YNL021W YDR295C YPR179C  61  62 63 64 65 66  67 68  69 70 71 72 73  GO Annotation 0008134: transcription factor binding/TF binding 0042816: vitamin B6 metabolism 0030834: regulation of actin filament depolymerization/regulation of actin depolymerization 0030897: HOPS complex  0006476: protein amino acid deacetylation 34 YML092C YGL004C YOR261C YER021W YPR103W YDL007W 0000502: proteasome complex YMR314W YPR108W YOR259C YBL041W YGL011C YER012W (sensu Eukaryota)/26S YER094C YJL001W YHR200W YLR421C YOR117W YFR004W YGL048C proteasome YDR427W YKL145W YOR362C YDL147W YGR253C YGR135W YFR052W YGR232W YOR157C YIL075C YDR394W YOL038W YDL097C YFR050C YHR027C 2 YJR103W YBL039C 0003883: CTP synthase activity 2 YML032C YDL059C  0000739: DNA strand annealing activity 2 YLR442C YDR227W 0030527: structural constituent of chromatin 2 YPR131C YOL076W 0017196: N-terminal peptidylmethionine acetylation 11 YAL033W YGR030C YBR167C YHR062C YNL221C YNL282W YIR015W 0000172: ribonuclease MRP YLR145W YBL018C YDR478W YBR257W complex/RNase MRP complex/ribonuclease mitochondrial RNA processing complex 5 YGL087C YDR092W YCR066W YDL064W YLR032W 0008639: small protein 
conjugating enzyme activity 4 YBR126C YDR074W YML100W YMR261C 0005946: alpha,alpha-trehalosephosphate synthase complex (UDP-forming) 3 YCR020C-A YPR051W YEL053C 0031365: N-terminal protein amino acid modification 2 YNL104C YOR108W  0003852: 2-isopropylmalate synthase activity 10 YMR224C YBL097W YFL008W YJL074C YDR325W YFR031C YLR272C 0000799: nuclear condensin YLR086W YNL250W YDR369C complex 11 YPR105C YNL258C YGL223C YGL005C YGL145W YER157W YLR440C 0017119: Golgi transport YNL041C YGR120C YML071C YNL051W complex  78  7 YGR179C YLR381W YDR254W YJR135C YPR046W YDR318W 0000777: condensed YPL018W chromosome kinetochore 8 YDR225W YBL003C YNL031C YBR009C YBR010W YDR224C YBL002W 0000786: nucleosome YNL030W 8 YDR176W YDR448W YPL254W YGL066W YBR081C YOR023C 0016573: histone acetylation YGR252W YOL148C 6 YPR162C YNL261W YLL004W YHR118C YBR060C YML065W 0005664: nuclear origin of replication recognition complex/eukaryotic ORC/nuclear ORC 6 YJL154C YOR132W YJL053W YOR069W YMR004W YHR012W 0030904: retromer complex  79  4 YNL139C YDR138W YML062C YHR167W  80  2 YPL076W YDR437W  81  2 YMR294W YHR129C  0005869: dynactin complex  82  5 YLR170C YPR029C YHL019C YKL135C YPL259C  83  2 YNL283C YOL105C  0030121: AP-1 adaptor complex/HA1 0004888: transmembrane receptor activity  74 76 77  0000347: THO complex  104  Pathway 84  85  No. 
of Proteins  Proteins  GO Annotation  11 YOR332W YOR270C YPR036W YGR020C YBR127C YLR447C YDL185W 0046961: hydrogen-transporting YHR039C-A YMR054W YEL051W YKL080W ATPase activity, rotational mechanism 5 YDR097C YCR092C YMR167W YOL090W YNL082W 0005524: ATP binding  86  2 YNL288W YGR134W  0030014: CCR4-NOT complex  87  2 YHR094C YMR011W  88  2 YFL003C YDL154W  89  4 YOR125C YLR201C YDR204W YGR255C  91  3 YDR100W YMR071C YKR088C  0015578: mannose transporter activity 0007131: meiotic recombination/female meiotic recombination/gene conversion with reciprocal crossover 0006743: ubiquinone metabolism/coenzyme Q metabolism unknown  92 93  7 YDL105W YLR383W YER038C YLR007W YDR288W YML023C YOL034W 2 YGL051W YAR033W  94  4 YJL099W YMR237W YKR027W YOR299W  95  2 YFR010W YOR124C  0004843: ubiquitin-specific protease activity/UBP/UCH2  96  5 YOR260W YLR291C YGR083C YKR026C YDR211W  97  2 YIR021W Q0115  98  4 YER149C YLL021W YLR319C YNL271C  99  2 YAL051W YOR363C  0005851: eukaryotic translation initiation factor 2B complex/eIF-2B 0000376: RNA splicing, via transesterification reactions with guanosine as nucleophile 0005519: cytoskeletal regulatory protein binding 0016563: transcriptional activator activity/transcription activating factor 0005545: phosphatidylinositol binding 0008250: oligosaccharyl transferase complex/OST complex 0008023: transcription elongation factor complex 0030433: ER-associated protein catabolism/ER-associated protein degradation/ERAD/endoplasmic reticulum-associated protein catabolism 0004749: ribose-phosphate diphosphokinase activity/ribosephosphate pyrophosphokinase activity 0000146: microfilament motor activity 0030123: AP-3 adaptor complex 0016925: protein sumoylation/SUMO-protein conjugation/protein sumolation/small ubiquitinrelated protein 1 conjugation/sumoylation  100 101  4 YHR107C YCR002C YJR076C YLR314C 10 YDL232W YGL022W YJL002C YOR103C YEL002C YGL226C-A YMR149W YOR085W YPL227C YML019W  102  2 YPL046C YNL230C  103  3 YOL013C YLR207W YMR022W  
104  5 YOL061W YKL181W YBL068W YER099C YHL011C  105  6 YOR326W YPR188C YHR023W YGL106W YAL029C YPL242C  106  4 YGR261C YJL024C YBR288C YPL195W  107  4 YPL020C YDR510W YDR390C YPR180W  0030915: Smc5-Smc6 complex 0016050: vesicle organization and biogenesis unknown  105  Pathway  No. of Proteins  Proteins  108  2 YGR010W YLR328W  109  4 YOR153W YGR281W YPL058C YDR011W  110  2 YMR255W YDL207W  111  2 YKL130C YBR130C  112  2 YNL291C YGR217W  113  3 YGR108W YLR079W YMR036C  114  2 YHR026W YPL234C  115  2 YJL089W YMR280C  116  3 YOR253W YDL040C YHR013C  117  3 YHR178W YBL005W YGL013C  118  2 YGL058W YDL074C  119  5 YPL218W YHR098C YPR181C YIL109C YNL049C  120  2 YBR037C YBR024W  121  8 YJL008C YJR064W YJL111W YDR212W YDR188W YIL142W YJL014W YDL143W  122  5 YML115C YER001W YPL050C YJL183W YEL036C  123 124  126  127  128 130  GO Annotation 0019674: NAD metabolism/NAD (oxidized) metabolism/nicotinamide adenine dinucleotide metabolism/oxidized NAD metabolism/oxidized nicotinamide adenine dinucleotide metabolism 0042910: xenobiotic transporter activity 0005643: nuclear pore/NPC/nuclear pore complex/nuclear pore membrane protein 0008298: intracellular mRNA localization/intracellular mRNA positioning/mRNA localization, intracellular/mRNA positioning, intracellular 0006816: calcium ion transport 0000079: regulation of cyclin dependent protein kinase activity/regulation of CDK activity 0000220: hydrogen-transporting ATPase V0 domain 0045722: positive regulation of gluconeogenesis 0004596: peptide alpha-Nacetyltransferase activity 0042221: response to chemical substance 0016574: histone ubiquitination/histone ubiquitinylation/histone ubiquitylation 0030127: COPII vesicle coat 0008379: thioredoxin peroxidase activity/peroxiredoxin activity 0005832: chaperonin-containing T-complex/CCT particle/TriC  0000136: mannosyltransferase complex 4 YBR278W YNL262W YPR175W YDR121W 0008622: epsilon DNA polymerase complex 3 YLR433C YML057W YKL190W 0005955: calcineurin complex/calcium-dependent 
protein serine/threonine phosphatase complex 6 YKL010C YDL190C YPR093C YKL034W YDR059C YBR082C 0016567: protein ubiquitination/protein ubiquitinylation/protein ubiquitylation 7 YPL086C YLR384C YMR312W YGR200C YKL110C YHR187W YPL101W 0016944: RNA polymerase II transcription elongation factor activity/Pol II transcription elongation factor activity 2 YIR025W YDR260C 0030071: regulation of mitotic metaphase/anaphase transition 4 YOL146W YDR013W YJL072C YDR489W  0000811: GINS complex/Go, Ichi, Ni and San complex  106  Pathway  No. of Proteins  Proteins  GO Annotation  131  6 YAL024C YNL098C YBR140C YLR310C YOR101W YOL081W  133  2 YGR264C YGL245W  134  4 YBL037W YOL062C YJR058C YJR005W  136  3 YIL065C YLL001W YJL112W  0030128: clathrin coat of endocytic vesicle/clathrin coat of endocytotic vesicle 0048285: organelle fission  138  7 YLR141W YJL025W YML043C YBL014C YKL125W YBL025W YMR270C  0000120: RNA polymerase I transcription factor complex  139  2 YML013W YBR201W  140  2 YBR068C YPL265W  141  5 YBR052C YDR032C YOR086C YCR004C YML072C  0043161: proteasomal ubiquitin-dependent protein catabolism/proteasomal processing/proteasome pathway 0015171: amino acid transporter activity unknown  142  4 YJL115W YPL001W YLL022C YEL056W  143 144  145 146  147  148  149  150 151  153 154 155 156  0007265: Ras protein signal transduction/Ras mediated signal transduction 0017102: methionyl glutamyl tRNA synthetase complex  0000123: histone acetyltransferase complex/histone acetylase complex 2 YLR028C YMR120C 0003937: IMP cyclohydrolase activity/inosinicase 2 YIL125W YDR148C 0006103: 2-oxoglutarate metabolism/ketoglutarate metabolism 5 YJR068W YOL094C YBR087W YOR217W YNL290W 0005663: DNA replication factor C complex 3 YDR484W YJL029C YDR027C 0000938: GARP complex/Golgi associated retrograde protein complex/VFT tethering complex/Vps fifty three tethering complex 2 YEL037C YMR276W 0030433: ER-associated protein catabolism/ER-associated protein degradation/ERAD/endoplasmic 
reticulum-associated protein catabolism 6 YDL091C YBR273C YDL126C YJL048C YMR067C YDR330W 0019941: modificationdependent protein catabolism/protein-liganddependent protein catabolism 5 YCR039C YDR451C YMR043W YML027W YMR042W 0006357: regulation of transcription from RNA polymerase II promoter/regulation of transcription from Pol II promoter 3 YNL312W YAR007C YJL173C 0005662: DNA replication factor A complex 15 YPR020W YKL016C YBL099W Q0130 YBR039W Q0085 YPL078C 0015986: ATP synthesis YML081C-A YLR295C YJR121W YDL004W Q0080 YPL271W YDR377W coupled proton transport YDR298C 5 YGR078C YNL153C YML094W YEL003W YLR200W 0016272: prefoldin complex/GIM complex 4 YBL021C YGL237C YOR358W YKL109W 0016602: CCAAT-binding factor complex 2 YBR281C YFR044C unknown 2 YOL130W YFL050C  0015693: magnesium ion transport  107  Pathway  No. of Proteins  Proteins  GO Annotation  157  2 YMR198W YPR141C  158  5 YER110C YLR347C YGR218W YNL189W YMR308C  0003777: microtubule motor activity/dynein/kinesin 0008320: protein carrier activity  159  3 YIL132C YHL006C YLR376C  0045021: error-free DNA repair  160  2 YDR099W YER177W  0001402: signal transduction during filamentous growth  161  2 YLR432W YHR216W  0046039: GTP metabolism  162  2 YDL131W YDL182W  163  2 YPL178W YMR125W  164  7 YGL233W YLR166C YER008C YJL085W YDR166C YPR055W YIL068C  0004410: homocitrate synthase activity 0005846: snRNA cap binding complex 0000145: exocyst  165  7 YGR003W YLR234W YMR201C YJR052W YMR190C YER162C YBR114W  166  2 YMR108W YCL009C  167  2 YLR181C YKR035W-A  169  3 YMR094W YGR140W YMR168C  170  4 YER070W YJL026W YIL066C YGR180C  172  2 YGR121C YPR138C  173  3 YAL010C YOL009C YLL006W  174  3 YKL166C YPL203W YJL164C  175  2 YER116C YDL013W  unknown  176  2 YER044C-A YJR021C  0007131: meiotic recombination/female meiotic recombination/gene conversion with reciprocal crossover  177  2 YNR069C YLL049W  178  4 YGL170C YLR227C YOL091W YOR177C  0005816: spindle pole body  179  3 YBR112C YCR084C YKL213C  180  5 YGR192C 
YKL152C YCR012W YKL060C YJR009C  0016565: general transcriptional repressor activity 0006096: glycolysis  182  3 YML085C YML124C YFL037W  0045298: tubulin  183  3 YLR154C YDR279W YNL072W  184  3 YDR009W YML051W YPL248C  0004523: ribonuclease H activity/RNase H activity/calf thymus ribonuclease H activity 0006012: galactose metabolism  186  2 YHR007C YMR015C  0008204: ergosterol metabolism  188  2 YMR271C YML106W  0004588: orotate phosphoribosyltransferase activity  189  2 YOR269W YLR254C  190  2 YKL013C YLR370C  191  2 YNL048W YBR110W  192  6 YHR150W YGR004W YDR479C YLR324W YMR204C YBR168W  0000715: nucleotide-excision repair, DNA damage recognition 0005948: acetolactate synthase complex 0005770: late endosome/PVC/prevacuolar compartment 0019237: centromeric DNA binding 0004748: ribonucleosidediphosphate reductase activity/ribonucleotide reductase 0008519: ammonium transporter activity 0019867: outer membrane 0005952: cAMP-dependent protein kinase complex/PKA  0005885: Arp2/3 protein complex 0000030: mannosyltransferase activity 0005778: peroxisomal  108  Pathway  No. 
of Proteins  Proteins  GO Annotation membrane

193  2  YNL094W YHR016C  0030479: actin cortical patch/actin cortical patch (sensu Fungi)/actin cortical patch (sensu Saccharomyces)/actin patch
194  2  YDL245C YHR096C  0015578: mannose transporter activity
195  2  YHR064C YEL034W  unknown
196  5  YLR459W YHR188C YDR434W YLR088W YDR331W  0016255: attachment of GPI anchor to protein
197  2  YFR053C YGL253W  0004396: hexokinase activity
198  3  YDL192W YDR170C YDL137W  0005798: Golgi vesicle
200  2  YMR159C YPL149W  0006914: autophagy
201  2  YDR453C YML028W  0008379: thioredoxin peroxidase activity/peroxiredoxin activity
202  3  YDR093W YAL026C YER166W  0004012: phospholipid-translocating ATPase activity/aminophospholipid-transporting ATPase/flippase/magnesium-ATPase
203  2  YER159C YDR397C  0003714: transcription corepressor activity/transcription corepressor activity
204  2  YGL161C YGL198W  0017137: Rab GTPase binding/Rab interactor activity
205  2  YNL238W YGL203C  0008236: serine-type peptidase activity/serine protease
206  2  YDR160W YFR029W  0007600: sensory perception
207  2  YGR238C YHR158C  0001100: negative regulation of exit from mitosis
208  2  YMR169C YMR170C  0019482: beta-alanine metabolism
209  2  YGL250W YLR082C  unknown
210  2  YJR159W YDL246C  unknown
211  2  YOR337W YDR207C  unknown
212  2  YLR320W YHR154W  unknown
213  2  YDR233C YDL204W  unknown

Table 4.2 Summary Statistics of Topological Properties of Source PPI Network and Pathway Network.

Network                         No. of Proteins  No. of PPIs  Avg. Degree  Avg. Clustering Coefficient
Source PPI network              4,800            43,687       9.1          0.2
Identified Pathway PPI network  1,617            16,685       10.3         0.23

processes. BioCyc is a collection of metabolic pathways of 570 organisms and on average pathways in BioCyc are 4.2 times smaller than KEGG pathways. The Reactome database is another manually curated core human biological pathway database.
Pathway annotations for organisms other than human are derived by mapping the human pathways onto these organisms based on protein orthology data. Currently, KEGG, BioCyc and Reactome contain 96, 150 and 381 yeast pathways with at least two protein members, respectively. We calculated adjusted Rand index (ARI) scores to quantify the similarity between our 195 resultant pathways and the pathway annotations from each pathway database (see methods). Specifically, we computed the ARI score of each of our identified pathways against every pathway in each of the three pathway databases, and took the highest resulting score as the ARI score of the tested pathway. For the KEGG database, 4% (8 out of 195) of our identified pathways have ARI scores equal to or greater than 0.5. This percentage is low, but it is still significantly greater than that expected purely by chance (Z-test, P < 0.001). For the BioCyc database, 5.6% (11 out of 195) of the pathways have ARI scores equal to or greater than 0.5 (Z-test, P < 4.1 × 10^-3). For the Reactome database, 12.8% (25 out of 195) of the pathways have ARI scores equal to or greater than 0.5 (Z-test, P < 2.6 × 10^-4). The discrepancy among these percentages can be explained by the different ways in which KEGG, BioCyc and Reactome are curated. KEGG and BioCyc mainly emphasize metabolic and signaling pathways, whereas Reactome takes a more general approach to collecting the biological reaction data of pathways. We tested the degree of overlap between these three reference databases using ARI values. We found a 26% overlap between KEGG and BioCyc, possibly due to their similar emphasis on metabolic and signaling pathways.
In contrast, the overlap is only 14% between Reactome and KEGG and 16% between Reactome and BioCyc. This result further explains the discrepancy in PPV observed across the different databases. Furthermore, KEGG relies on Enzyme Commission (EC) numbers to map the physical polypeptides involved in metabolic reactions to public gene/protein annotation databases; as a result, mis-mapping may leave its pathway organization incomplete.

We tested whether the proteins within each identified pathway share highly similar phenotypic response patterns. We tested our identified pathways on a data set containing phenotypic response measurements under different treatments [45], as used by Ulitsky and Shamir [20]. We found that proteins within the same pathway in our study show significantly higher correlation of phenotypic response patterns than expected at random (the average Pearson correlation coefficient is 0.39, p-value < 4.2 × 10^-10).

4.3.4 Comparison between different approaches

Pathway organization derived from biological networks has been widely studied. The approaches described in previous publications can be classified into two categories: 1) statistical models with multiple data sources (physical interactions and genetic interactions); 2) graph-based models with a single data source (genetic interactions). In this study, we employed a graph-based model, but with diverse lines of biological evidence. To compare the performance of the different approaches, we computed PPV values by calculating the ARI scores between the pathways identified by each approach and the pathways from Reactome, KEGG and BioCyc. For the Reactome database, the PPV of Kelley and Ideker [19], 3.7% (15 out of 404 pathways), is very close to that of Ulitsky and Shamir [20], which is 3.2% (9 out of 280 pathways). This finding is not surprising because both methods follow the same approach.
Two other approaches have similar PPV values: 0.08% (1 out of 1297 pathways) for Ma et al. [21] and 0.9% (1 out of 108 pathways) for Brady et al. [22] on the more recent version of the genetic interaction network. Our approach achieves a PPV of 12.8%, indicating that it outperforms the other methods when tested on Reactome (Figure 4.4). For the KEGG and BioCyc pathway databases, the performance of the four aforementioned methods follows the same trend as when tested on Reactome (Figure 4.4). To compare the approaches on negative data, we tested them on randomized pathway data sets and found that all approaches achieve a negative predictive value (NPV) of 100%, further suggesting better performance of our approach at the same level of NPV. Here, NPV is defined as: No. of True Negatives/(No. of True Negatives + No. of False Negatives).

Figure 4.4 Comparison between different approaches based on PPV scores tested on Reactome, KEGG and BioCyc pathway annotations. A bar plot shows the performance of each approach tested on the three pathway annotations.

4.3.5 Biological examples of predicted pathways

In our study, we have demonstrated that our predicted pathways are biologically meaningful, as they can be validated by comparison with annotated pathways in Reactome, KEGG and BioCyc. Also, proteins in the same pathway share very similar phenotypic response patterns. The next logical step is to identify the relevance and function of these predicted pathways. We present several examples to show that biological insights can be inferred from the pathways identified in this study. One example is pathway 61, which has an ARI score of 0.89 when compared to the “Orc1 removal from chromatin” pathway in Reactome (Figure 4.5).
Pathway 61 itself is enriched for 4 GO terms (0000502: proteasome complex/26S proteasome; 0006508: proteolysis and peptidolysis; 0044257: cellular protein catabolism; and 0030163: protein catabolism/protein degradation), which is consistent with the pathway annotation in Reactome. 94% (32 out of 34) of the proteins in pathway 61 are annotated as belonging to the “Orc1 removal from chromatin” pathway in Reactome; only two proteins (YGL004C and YLR421C) are not included. In fact, YLR421C is a known member of the 26S proteasome [46, 47] according to the KEGG annotation, while YGL004C is missing from the KEGG pathway but is a protein highly related to the proteasome complex [47]. This example demonstrates the ability of our approach to identify new pathway members, thus providing testable hypotheses for experimental validation. Another interesting example is pathway 20, which matches pathway sce03020 “RNA polymerase” in KEGG with an ARI score of 0.95. Pathway 20 is enriched for the GO term 0030880 (RNA polymerase complex), indicating that it has a similar biological function to the pathway in KEGG. We found that pathway 20 contains one more protein (YKR025W) than is listed in the KEGG pathway sce03020. As a subunit of RNA polymerase, YKR025W has been extensively studied recently, and it plays an important role in the regulation of RNA polymerase III transcription [48, 49]. Therefore, it is probable that YKR025W is a missing member of the pathway involved in the function of RNA polymerase.

Figure 4.5 An example of a pathway identified by our approach. Pathway 61 yields an ARI score of 0.89 when compared to the “Orc1 removal from chromatin” pathway in Reactome. 32 out of 34 gene products (blue nodes) in pathway 61 are annotated as belonging to the pathway “Orc1 removal from chromatin” in Reactome. Two gene products (red nodes) are not included. However, YGL004C and YLR421C might be components of the cell cycle control pathway.
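
The best-match ARI scoring used in the comparisons above can be illustrated with a short sketch. The exact ARI formulation is given in the methods and is not reproduced here, so the code below rests on an assumption: each pathway is treated as a two-class labeling (member or non-member) of a shared protein universe, and the standard adjusted Rand index is computed between the two labelings. The function names (adjusted_rand_index, pathway_ari, best_match_ari, fraction_matched) are hypothetical and chosen only for illustration; the 0.5 cutoff is the threshold quoted in the text.

```python
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    cells, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        cells[(a, b)] = cells.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single class each)
    return (sum_cells - expected) / (max_index - expected)


def pathway_ari(predicted, reference, universe):
    """ARI between two protein sets, ASSUMING a member/non-member labeling."""
    proteins = sorted(universe)
    in_predicted = [p in predicted for p in proteins]
    in_reference = [p in reference for p in proteins]
    return adjusted_rand_index(in_predicted, in_reference)


def best_match_ari(predicted, reference_pathways, universe):
    """Highest ARI of one predicted pathway against all reference pathways."""
    return max(pathway_ari(predicted, ref, universe) for ref in reference_pathways)


def fraction_matched(predicted_pathways, reference_pathways, universe, threshold=0.5):
    """Fraction of predicted pathways whose best-match ARI reaches the threshold."""
    hits = sum(
        best_match_ari(p, reference_pathways, universe) >= threshold
        for p in predicted_pathways
    )
    return hits / len(predicted_pathways)
```

Under these assumptions, a predicted pathway that coincides exactly with a reference pathway scores 1.0, while a pathway that shares no members with any reference pathway scores slightly below zero, so it falls under the 0.5 threshold.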
4.3.6 Revealed redundant pathways

Since genetic interactions suggest the existence of parallel pathways, we investigated whether functionally redundant pathway pairs exist among the pathways we identified. To evaluate this, we calculated a Z-score for each possible pair of identified pathways to determine whether the difference between the observed number of genetic interactions (GIs) for the pair and the expected number of GIs for pathway pairs in a random set is statistically significant. We found 31 pathway pairs with p-value < 0.01 (Figure 4.6). These pathway pairs are listed in Table 4.3. We also found that 58% (18 out of 31) of the pathway pairs share at least one functionally enriched GO term, suggesting the presence of pathway redundancy. For example, pathways 35 and 73 are annotated as being involved in the mitotic spindle checkpoint and the condensed chromosome kinetochore, respectively. They also share 7 functionally enriched GO terms (0000777: condensed chromosome kinetochore; 0000778: condensed nuclear chromosome kinetochore; 0000780: condensed nuclear chromosome, pericentric region/condensed nuclear chromosome, centromere; 0000779: condensed chromosome, pericentric region/condensed chromosome, centromere; 0000775: chromosome, pericentric region/centromere; 0000794: condensed nuclear chromosome; and 0000793: condensed chromosome). Pathway 35 is highly similar to the Reactome pathway 504720 (Amplification of signal from unattached kinetochores via a MAD2 inhibitory signal), with an ARI score of 0.8.

Our predicted pathway pairs represent a redundancy mechanism between a pair of pathways in which the proteins can compensate for each other to perform the same or

Table 4.3 A list of discovered redundant pathway pairs in yeast. The 31 pathway pairs in our study contain significantly more genetic interactions between them than expected at random.
Pathway A and Pathway B in the table represent the two members of a redundant pathway pair. The number of genetic interactions (No. of GIs) is the number of genetic interactions between the two pathways of a pair. The P-value for each pair was calculated with a Z-test by comparing the observed number of GIs for the pair to the corresponding numbers in 1000 randomized networks.

Pathway A  Pathway B  No. of GIs  P-value
35         157        6           7.10 x 10^-16
157        81         4           8.90 x 10^-16
35         153        15          8.90 x 10^-16
153        81         10          8.90 x 10^-16
127        118        14          8.90 x 10^-16
35         73         21          8.90 x 10^-16
23         201        4           4.40 x 10^-10
81         98         8           4.40 x 10^-10
153        182        15          4.40 x 10^-10
118        95         4           3.90 x 10^-9
118        201        4           3.90 x 10^-9
175        107        8           3.90 x 10^-9
157        153        10          3.90 x 10^-9
153        73         35          3.90 x 10^-9
153        189        10          3.90 x 10^-9
175        165        13          6.30 x 10^-8
153        47         55          6.30 x 10^-8
81         182        4           6.30 x 10^-8
98         189        5           6.30 x 10^-8
157        73         8           3.50 x 10^-7
118        157        2           1.20 x 10^-6
118        211        2           1.20 x 10^-6
118        116        3           1.20 x 10^-6
35         65         3           1.20 x 10^-6
157        63         2           1.20 x 10^-6
157        189        2           1.20 x 10^-6
197        153        5           1.20 x 10^-6
153        63         5           1.20 x 10^-6
153        177        5           1.20 x 10^-6
212        201        2           1.20 x 10^-6
63         201        2           1.20 x 10^-6

Figure 4.6 The redundant pathway organization in S. cerevisiae. The redundant pathway organization in yeast was generated from the discovered pathway pairs. Each node represents a pathway and each edge represents the connection between a pair of redundant pathways. Numbers on nodes are the identifiers of our discovered pathways in Table 4.3. Each pathway is annotated with the GO term with the smallest P-value derived from FuncAssociate [44]. Pathways without GO term annotations are shown as square nodes. Pathway size is mapped to node color.

functionally related biological process. Therefore, we speculated that proteins with similar biological functions might genetically interact with each other if they appear in our identified pathway pairs. For example, pathway 175 and pathway 73 are predicted to be a pair of parallel pathways.
We found that there is one enriched GO term (0015630: microtubule cytoskeleton) common to both pathways and that there are 6 synthetic lethal interactions between this pair of pathways, suggesting functional redundancy between them. Due to technical limitations, a large number of genetic interactions in yeast either have been found to be false negatives, or have not yet been tested [14]. Thus we hypothesized that a pair of proteins found within a pathway pair might genetically interact with each other if they share at least one common GO term. We performed a 10-fold cross-validation test, in which our approach achieved a sensitivity of 72% and a specificity of 81%, suggesting a good capacity for discovering genetic interactions. For example, ADA2 (YDR448W) in pathway 76 and BRE1 (YDL074C) in pathway 118 share 2 common GO terms (0016570: histone modification and 0016569: covalent chromatin modification) yet do not genetically interact with each other according to the genetic interaction data. Our approach, however, predicts them to be a pair of genetically interacting proteins. A very recent publication [50] reported a synthetic fitness or lethality defect interaction between ADA2 and BRE1, which are involved in yeast histone acetylation and deacetylation. This finding provides a good example of the ability of our approach to predict novel genetic interactions. We generated a network of discovered redundant pathways (Figures 4.6 and 4.7). As expected, most pathways show a 1:1 redundant relationship. Interestingly, we found that several pathways, such as pathways 35, 118 and 153, demonstrate a 1:N redundant relationship. By closely examining these pathways, we found them to contain a 3.6-fold enrichment of GO annotations compared to other pathways, indicating that they are multi-tasking pathways with multiple functional redundancies with other pathways.

Figure 4.7 The redundant pathway network in yeast shown at the level of PPIs. Proteins within each pathway are connected by PPIs from our source protein interaction network. Each circle represents a redundant pathway, each node represents a protein and each edge represents a PPI. Each circle is labeled with the identifier of the corresponding discovered pathway.

4.4 Conclusion

In this study, we introduced a systematic multiple evidence-based pathway finding approach in S. cerevisiae. In contrast to previous approaches, we examined the pathway organization in yeast in terms of protein relationships scored by multiple types of biological evidence and discovered 195 biological pathways, which cover 16,685 physical interactions, 890 synthetic lethal interactions and 1,407 domain-domain interactions involving 1,617 yeast genes/proteins. Compared to other predictive approaches, our approach achieved better performance when tested against the Reactome, KEGG and BioCyc pathway databases. We discovered 31 functionally redundant pathway pairs by a probabilistic test. Analysis of the resulting pathways and pathway pairs provided a more comprehensive and reliable view of pathway organization in yeast. As the size of genetic interaction networks in other model organisms grows, our study could lead to a more complete identification of the functional interactome interpreted by pathway organization. This could shed light on the overall picture of how subsystems in cells, such as pathways, work together to determine phenotypes and functions.

4.5 References

1. Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC: Getting started in biological pathway construction and analysis. PLoS Comput Biol 2008, 4(2):e16.
2. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
3. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147.
4. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183.
5. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574.
6. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643.
7. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
8. Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, Jacq B, Arpin M, Bellaiche Y, Bellusci S, Benaroch P, Bornens M, Chanet R, Chavrier P, Delattre O, Doye V, Fehon R, Faye G, Galli T, Girault JA, Goud B, de Gunzburg J, Johannes L, Junier MP, Mirouse V, Mukherjee A, Papadopoulo D, Perez F, Plessis A, Rosse C, Saule S, Stoppa-Lyonnet D, Vincent A, White M, Legrain P, Wojcik J, Camonis J, Daviet L: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15(3):376-384.
9. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736.
10. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543.
11. Meluh PB, Pan X, Yuan DS, Tiffany C, Chen O, Sookhai-Mahadeo S, Wang X, Peyser BD, Irizarry R, Spencer FA, Boeke JD: Analysis of genetic interactions on a genome-wide scale in budding yeast: diploid-based synthetic lethality analysis by microarray. Methods Mol Biol 2008, 416:221-247.
12. Schuldiner M, Collins SR, Thompson NJ, Denic V, Bhamidipati A, Punna T, Ihmels J, Andrews B, Boone C, Greenblatt JF, Weissman JS, Krogan NJ: Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell 2005, 123(3):507-519.
13. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, Andrews B, Tyers M, Boone C: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294(5550):2364-2368.
14. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the yeast genetic interaction network. Science 2004, 303(5659):808-813.
15. Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, Welchman DP, Zipperlen P, Ahringer J: Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 2003, 421(6920):231-237.
16. Lehner B, Crombie C, Tischler J, Fortunato A, Fraser AG: Systematic mapping of genetic interactions in Caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat Genet 2006, 38(8):896-903.
17. Boutros M, Kiger AA, Armknecht S, Kerr K, Hild M, Koch B, Haas SA, Consortium HF, Paro R, Perrimon N: Genome-wide RNAi analysis of growth and viability in Drosophila cells. Science 2004, 303(5659):832-835.
18. Curtis RK, Oresic M, Vidal-Puig A: Pathways to the analysis of microarray data. Trends Biotechnol 2005, 23(8):429-435.
19. Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 2005, 23(5):561-566.
20. Ulitsky I, Shamir R: Pathway redundancy and protein essentiality revealed in the Saccharomyces cerevisiae interaction networks. Mol Syst Biol 2007, 3:104.
21. Ma X, Tarone AM, Li W: Mapping genetically compensatory pathways from synthetic lethal interactions in yeast. PLoS ONE 2008, 3(4):e1922.
22. Brady A, Maxwell K, Daniels N, Cowen LJ: Fault tolerance in protein interaction networks: stable bipartite subgraphs and redundant pathways. PLoS ONE 2009, 4(4):e5364.
23. Pitre S, North C, Alamgir M, Jessulat M, Chan A, Luo X, Green JR, Dumontier M, Dehne F, Golshani A: Global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences. Nucleic Acids Res 2008, 36(13):4286-4294.
24. Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE: Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 2008, 40(7):854-861.
25. Lim WA, Richards FM, Fox RO: Structural determinants of peptide-binding orientation and of sequence specificity in SH3 domains. Nature 1994, 372(6504):375-379.
26. McGough AM, Staiger CJ, Min JK, Simonetti KD: The gelsolin family of actin regulatory proteins: modular structures, versatile functions. FEBS Lett 2003, 552(2-3):75-81.
27. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34(Database issue):D354-357.
28. Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33(19):6083-6089.
29. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D'Eustachio P: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 2009, 37(Database issue):D619-622.
30. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535-539.
31. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412.
32. Zhang KX, Ouellette BF: GAIA: a gram-based interaction analysis tool--an approach for identifying interacting domains in yeast. BMC Bioinformatics 2009, 10 Suppl 1:S60.
33. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
34. Nash R, Weng S, Hitz B, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hong EL, Livstone MS, Oughtred R, Park J, Skrzypek M, Theesfeld CL, Binkley G, Dong Q, Lane C, Miyasato S, Sethuraman A, Schroeder M, Dolinski K, Botstein D, Cherry JM: Expanded protein information at SGD: new pages and proteome browser. Nucleic Acids Res 2007, 35(Database issue):D468-471.
35. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23(10):1274-1281.
36. Huttenhower C, Flamholz AI, Landis JN, Sahi S, Myers CL, Olszewski KL, Hibbs MA, Siemers NO, Troyanskaya OG, Coller HA: Nearest Neighbor Networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 2007, 8:250.
37. Mete M, Tang F, Xu X, Yuruk N: A structural approach for finding functional modules from large biological networks. BMC Bioinformatics 2008, 9 Suppl 9:S19.
38. Hubert L, Arabie P: Comparing partitions. Journal of Classification 1985, 2(1):193-198.
39. Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science 2002, 296(5569):910-913.
40. Royer L, Reimann M, Andreopoulos B, Schroeder M: Unraveling protein networks with power graph analysis. PLoS Comput Biol 2008, 4(7):e1000108.
41. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M: The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol 2007, 3(4):e59.
42. Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S: AmiGO: online access to ontology and annotation data. Bioinformatics 2009, 25(2):288-289.
43. Yi M, Stephens RM: SLEPR: a sample-level enrichment-based pathway ranking method -- seeking biological themes through pathway-level consistency. PLoS ONE 2008, 3(9):e3288.
44. Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with FuncAssociate. Bioinformatics 2003, 19(18):2502-2504.
45. Brown JA, Sherlock G, Myers CL, Burrows NM, Deng C, Wu HI, McCann KE, Troyanskaya OG, Brown JM: Global analysis of gene function in yeast by quantitative phenotypic profiling. Mol Syst Biol 2006, 2:2006 0001.
46. Husnjak K, Elsasser S, Zhang N, Chen X, Randles L, Shi Y, Hofmann K, Walters KJ, Finley D, Dikic I: Proteasome subunit Rpn13 is a novel ubiquitin receptor. Nature 2008, 453(7194):481-488.
47. Seong KM, Baek JH, Yu MH, Kim J: Rpn13p and Rpn14p are involved in the recognition of ubiquitinated Gcn4p by the 26S proteasome. FEBS Lett 2007, 581(13):2567-2573.
48. Flores A, Briand JF, Gadal O, Andrau JC, Rubbi L, Van Mullem V, Boschiero C, Goussot M, Marck C, Carles C, Thuriaux P, Sentenac A, Werner M: A protein-protein interaction map of yeast RNA polymerase III. Proc Natl Acad Sci U S A 1999, 96(14):7815-7820.
49. Rosonina E, Willis IM, Manley JL: Sub1 functions in osmoregulation and in transcription by both RNA polymerases II and III. Mol Cell Biol 2009, 29(8):2308-2321.
50. Lin YY, Qi Y, Lu JY, Pan X, Yuan DS, Zhao Y, Bader JS, Boeke JD: A comprehensive synthetic genetic interaction network governing yeast histone acetylation and deacetylation.
Genes Dev 2008, 22(15):2062-2074.

5 A Novel Approach to Predict Cancer Outcomes Based on the Relationship between Protein Structural Information and Protein Networks

Footnote 4. A version of this chapter will be submitted for publication. Zhang KX, Ouellette BF: A Novel Approach to Predict Cancer Outcomes Based on Relationships between Protein Structural Information and Protein Networks.

5.1 Introduction

Carcinogenesis is a complex process with multiple genetic and environmental factors contributing to its development [1]. Understanding the underlying mechanism of this process and identifying markers that assess its outcome could lead to more efficient treatment and thus significantly reduce the mortality rate of cancers. Currently, the majority of breast cancer patients are over-treated [2] due to the lack of accurate assessment of the risk of metastasis. A substantial proportion of patients receive aggressive adjuvant therapy according to the current guidelines. Although the importance of identifying prognostic signatures that predict cancer outcomes is widely appreciated, it has remained a challenging task. With the emergence of DNA microarray-based tumor gene expression profiles, molecular diagnostics and prognostics are emerging. The reported predictive tools classify cancer outcomes by identifying gene expression signatures observed in different outcomes. However, the predictive performance of these approaches is limited. For instance, in two large-scale expression studies [3, 4], approximately 70 gene markers were identified and then applied to the prediction of metastasis in breast cancer, but the accuracy only reached 60-70%. This relatively low accuracy could be explained by some intrinsic shortcomings of microarray data, as different experiment and analysis designs could yield inconsistent results due to systematic errors [5].
Therefore, novel tools with stronger predictive power are needed to identify more accurate markers associated with cancer outcomes.

Protein-protein interactions (PPIs) play an important role in both healthy cell biology and carcinogenesis. At the molecular level, genetic alterations such as somatic mutations, translocations, deletions and insertions can break down the PPI-based regulatory mechanisms that govern normal cell functions and behaviors, leading to aberrant or uncontrolled cell growth and eventually to cancer [6]. The recent availability of large-scale PPI networks may make it possible to identify better gene signatures by combining gene expression measurements with information about perturbed protein interaction networks in the cell. Chuang and colleagues developed a method to find sub-network-based signatures by incorporating the PPI network and the gene expression profiles [7]. The resultant sub-networks with their gene expression profiles were used as markers to determine whether a patient’s expression profile signifies a metastatic or non-metastatic outcome. Their study provided a starting point for revealing the usefulness of gene expression profiles in the context of the PPI network. More recently, Taylor and colleagues proposed a new methodology to predict breast cancer outcome based on the correlation of gene expression profiles between hub proteins and their interacting partners in the PPI network [8]. These studies demonstrated that PPI network information could be useful to differentiate a variety of cancer outcomes. Unfortunately, the predictive performance of these methods is not accurate enough. The methods of Chuang et al. and Taylor et al. showed only modest improvement, with accuracies of 70-72% and 76%, respectively, compared with 62% and 63% reported for approaches that do not employ network information [3, 4].
Protein-protein interactions can be mediated by interactions between protein domains, which are defined as independent structural and/or functional blocks of proteins. For example, some cytoskeletal proteins interact with actin via the interaction between the gelsolin repeat domains [9]. Disrupted domain interactions could stop the chain reaction of biological pathways at any point, leading to various diseases. Based on the relationship between a protein and its neighboring proteins in the protein interaction network, we can classify this protein as one of two types (Figure 5.1). We call the protein a ‘singlish-interface’ protein if it interacts with its neighboring proteins through the same domain-domain interaction; those domain-domain interactions are therefore mutually exclusive. Conversely, we call the protein a ‘multiple-interface’ protein if it interacts with its neighboring proteins through different domain-domain interactions, as those interactions are simultaneously possible. It has been demonstrated that singlish-interface proteins evolve faster than multiple-interface proteins [10] and that faster evolution tends to generate more mutations in genes [11]. In addition, for a ‘singlish-interface’ protein, mutations in a given domain would simultaneously affect the interactions between this protein and multiple interacting partners, and are thus more likely to disrupt protein interactions and disturb the protein interaction network. Therefore, we speculated that singlish-interface proteins are more likely to be involved in the process of carcinogenesis than multiple-interface proteins.

Figure 5.1 A schematic view of a ‘singlish-interface’ protein and a ‘multiple-interface’ protein. Given a protein (red node) and its neighboring proteins in the protein interaction network, we can define it as a ‘singlish-interface’ protein or a ‘multiple-interface’ protein. The ‘singlish-interface’ protein interacts with its neighboring proteins through the same domain (the yellow line); those domain-domain interactions are therefore mutually exclusive. Conversely, the ‘multiple-interface’ protein interacts with its neighboring proteins through different domains (blue lines), as those interactions are simultaneously possible.

Somatic mutations are one type of alteration in DNA that is neither inherited nor passed to offspring [1], and some of them, called “driver mutations”, can contribute to the development of cancers or other diseases [12, 13]. Unfortunately, we have little knowledge about how the presence of these genomic variations in interacting domains perturbs the protein interaction network in cancerous cells. Therefore, in addition to PPI data and gene expression data, we also incorporated two further types of data: domain-domain interactions (DDIs) and somatic mutations.

In this study, we propose an integrated approach for the identification of gene signatures to predict cancer outcomes using four types of data: PPIs, DDIs, gene expression profiles and somatic mutations. We first developed a model to score each protein based on its domain connections to interacting partners. A protein was identified as a gene signature if its score was above a preset threshold. We then computed the correlation of the gene expression profiles of the gene signatures and their neighboring proteins. A modified naïve Bayes classifier was used to predict cancer outcome based on this correlation. Compared to previous studies, our study has several advantages. First, besides the PPI network and the gene expression profiles, the DDI network and the somatic mutations within the interacting domains were integrated into our predictive approach, which achieved an accuracy of 86.8%, a sensitivity of 87.1% and a specificity of 85.6%.
Second, our results, a compiled list of cancer-associated gene signatures and domains, provide testable hypotheses for further experimental investigation. Third, our approach is not specific to a single type of cancer and can thus be applied to different types of cancers.

5.2 Materials and methods

5.2.1 Data set collection

We downloaded 108,307 unique human PPIs from the iRefIndex database (ftp://ftp.no.embnet.org/irefindex/data), version of June 4, 2009. The iRefIndex database [14] provides a non-redundant list of protein interactions derived from several major protein interaction databases including BIND [15], BioGRID [16], DIP [17], HPRD [18, 19], IntAct [20, 21], MINT [22], and OPHID [23]. We also used a set of DDIs downloaded from the iPfam database [24], a DDI database based on RCSB Protein Data Bank (PDB) crystal structures (http://www.pdb.org), which consists of 3,020 DDIs and 914 domains. For somatic mutations involved in cancer, a list of 88,641 somatic mutations was retrieved from the COSMIC database (version 43), which contains mutation data and associated information extracted from the primary literature [25]. We compiled a gene expression and outcome data set from a study of two groups of sporadic and non-familial breast cancer patients [26].

5.2.2 Gene signature finding algorithm

Step A. For each protein x in the query PPI network, we generated a list of Pfam-annotated domains dlist[x] of protein x and a list of neighboring/interacting proteins neighbor[x] of protein x.

Step B. For each domain Di in the domain list dlist[x], we counted the number of domain pairs between Di and the set of domains of neighbor[x] that are represented in the interacting domain-domain pairs previously established in Pfam.

Step C. A domain index score was assigned to each protein in the query PPI network by the following equation:

$$S_x = \frac{\sum_{i=1}^{|dlist[x]|} I(x) \times W(\mathrm{NoDDIs}(D_i))}{|dlist[x]|}$$

where I(x) is an indicator function that equals 1 if and only if the protein x has at least one domain and 0 otherwise. NoDDIs(Di) is the number of DDIs between Di and the set of domains of the neighboring/interacting proteins, as calculated in Step B. Here, W is an exponential function with base 2, W(n) = 2^n, which means that the weight of a domain increases exponentially with the number of DDIs it has.

Step D. For each protein x, if the domain index score was over the preset threshold c, this protein was regarded as a gene signature and was used for the neighboring gene expression analysis.

5.2.3 Calculation of neighboring gene expression profiling score

Given a gene expression data set and a gene signature, we computed a score to measure the difference in expression between the gene signature and its neighboring proteins in the PPI network using the following equation:

$$S_{diff} = \frac{\sum_{i=1}^{n} (E_i - E_x)}{n}$$

where Ex is the expression value of the gene signature x; Ei is the expression value of the interactor i of the gene signature x; and n is the number of interactors of the gene signature x. All scores were normalized to range from -1 to 1.

5.2.4 Construction of the naïve Bayes classifier

As a probabilistic model based on Bayes' theorem, the naïve Bayes classifier has been widely applied to classification problems in different fields of the biological sciences, such as inferring cellular networks [27], modeling protein signaling pathways [28] and predicting protein-protein interaction interfaces [29]. Given a training dataset and a testing dataset, each data sample is represented as an n-dimensional vector X = (x1, x2, …, xn) and belongs to one of m classes (C1, C2, …, Cm).
Here, X represents a cancer patient sample in the training or the testing dataset; n is the number of gene signatures; m is the number of outcome types, here two: ‘good outcome’ (patients who were disease free after extended follow-up) and ‘poor outcome’ (patients who died of disease). The prediction procedure is as follows.

According to Bayes' theorem, the posterior probability of the class Cgood for each cancer patient sample X is given by the following equation:

$$P(C_{good} \mid X) = \frac{\left(\prod_{k=1}^{n} P(x_k \mid C_{good})\right) P(C_{good})}{P(X)}$$

where the class prior probability P(Cgood) is calculated as Sgood/S, the number of training samples of class Cgood divided by the total number of training samples. P(x1|Cgood), P(x2|Cgood), …, P(xn|Cgood) can be calculated as Sgood(k)/Sgood, where Sgood(k) is the number of training samples of class Cgood whose gene expression profiling score xk falls into a given bin/category, and Sgood is the number of training samples belonging to Cgood. In this study, we divided the gene expression profiling score, which ranges from -1 to 1, into 20 bins.

In order to classify cancer patient samples in the testing dataset, we calculated P(X|Ci)P(Ci) for each class Ci. Sample X was then predicted as belonging to class Cgood if and only if

$$P(X \mid C_{good}) P(C_{good}) > P(X \mid C_{poor}) P(C_{poor})$$

In other words, a sample is assigned to the class Ci for which P(X|Ci)P(Ci) is the maximum.

5.3 Results and Discussion

5.3.1 Parameter tuning and validation on breast cancer data

We tested whether the identified gene signatures are good indicators to distinguish two groups of sporadic and non-familial breast cancer patients [26]. Breast cancer is extensively studied, with a set of well-documented gene expression profiles of 295 consecutive breast cancer patients with different outcomes [3].
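The scoring and classification steps of Sections 5.2.2 to 5.2.4 can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The function and variable names are hypothetical, neighbor domains are assumed to be supplied as a flat list, known DDIs are assumed to be stored as unordered pairs, and the pseudocount and log-space arithmetic are additions made only to avoid zero probabilities and numerical underflow. The normalization of the expression score to [-1, 1] is not specified in the text and is left out.

```python
import math
from collections import defaultdict

N_BINS = 20  # the text divides the score range [-1, 1] into 20 bins


def domain_index_score(domains, neighbor_domains, known_ddis):
    """S_x: mean over the domains of x of W(NoDDIs(D_i)), with W(n) = 2 ** n."""
    if not domains:  # I(x) = 0 when the protein has no annotated domain
        return 0.0
    total = 0.0
    for d in domains:
        # NoDDIs(D_i): neighbor domains forming a known DDI with d
        n_ddis = sum(1 for nd in neighbor_domains
                     if frozenset((d, nd)) in known_ddis)
        total += 2 ** n_ddis
    return total / len(domains)


def expression_diff_score(e_x, neighbor_expr):
    """S_diff: mean of (E_i - E_x) over the interactors (not normalized here)."""
    if not neighbor_expr:
        return 0.0
    return sum(e_i - e_x for e_i in neighbor_expr) / len(neighbor_expr)


def to_bin(score, n_bins=N_BINS):
    """Map a score in [-1, 1] to a bin index in 0 .. n_bins - 1."""
    idx = int((score + 1.0) / 2.0 * n_bins)
    return min(max(idx, 0), n_bins - 1)


def train(samples, labels):
    """Count class sizes S_c and per-signature bin counts S_c(k)."""
    class_count = defaultdict(int)
    bin_count = defaultdict(int)  # key: (label, signature index, bin)
    for x, y in zip(samples, labels):
        class_count[y] += 1
        for k, score in enumerate(x):
            bin_count[(y, k, to_bin(score))] += 1
    return class_count, bin_count


def predict(x, class_count, bin_count, pseudo=1.0):
    """Return the class maximizing log P(C) + sum_k log P(x_k | C)."""
    total = sum(class_count.values())
    best, best_lp = None, -math.inf
    for c, s_c in class_count.items():
        lp = math.log(s_c / total)  # class prior S_c / S
        for k, score in enumerate(x):
            hits = bin_count.get((c, k, to_bin(score)), 0)
            # assumed smoothing: pseudocount added to S_c(k) / S_c
            lp += math.log((hits + pseudo) / (s_c + pseudo * N_BINS))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

In this sketch a protein would be kept as a gene signature when `domain_index_score` exceeds the chosen threshold, and each patient vector passed to `train` and `predict` would hold one expression score per retained signature.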
We defined patients who were disease free after extended follow-up as patients with ‘good outcome’ and those who died of disease as patients with ‘poor outcome’. The patient data were filtered to remove patients who were still alive with disease or had died from other causes, as reported by Taylor [8]. The resultant dataset contained 179 patients with ‘good outcome’ and 74 patients with ‘poor outcome’. For each patient, a profile was computed based on the difference in gene expression values between the gene signatures and their neighboring proteins. We adopted a five-fold cross-validation strategy in which we used 20% of the patients’ profiles from the original set as the validation data, and the remaining patients’ profiles as the training data. This process was repeated five times such that each patient in the sample was used once as validation data. For the identification of gene signatures, we applied a scoring procedure to the domains of each gene product based on the number of mutually exclusive DDIs they contain. There is only one parameter that needs to be tuned: the threshold of domain index scores (Sd). We tested our approach on the breast cancer data set using different values of Sd. We then evaluated the performance of our approach by calculating three performance measurements: accuracy, sensitivity and specificity. In this study, accuracy is defined as: (No. of True Positives + No. of True Negatives) / (No. of True Positives + No. of False Positives + No. of True Negatives + No. of False Negatives). Sensitivity is defined as: No. of True Positives / (No. of True Positives + No. of False Negatives). Specificity is defined as: No. of True Negatives / (No. of True Negatives + No. of False Positives). A true positive is defined as the case in which a “poor outcome” patient was correctly predicted as having the “poor outcome” and a true negative is defined as the case in which a “good outcome” patient was correctly predicted as having the “good outcome”.
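As a small illustration of the definitions above, the following Python sketch computes the three measurements with “poor outcome” as the positive class, and splits patient indices into five validation folds. The function names, the label strings and the shuffling seed are assumptions made for illustration; the text does not state how patients were assigned to folds.

```python
import random


def outcome_metrics(y_true, y_pred, positive="poor"):
    """Accuracy, sensitivity and specificity with 'poor' as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds validation folds,
    so that every sample is used for validation exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [sorted(order[i::n_folds]) for i in range(n_folds)]
```

For the 253 patients retained here (179 with good and 74 with poor outcome), `k_fold_indices(253)` would yield three folds of 51 patients and two of 50, each serving once as validation data while the other four are used for training.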
From the performance plot (Figure 5.2), we concluded that our approach achieved the best performance, with an accuracy of 86.8%, a sensitivity of 87.1% and a specificity of 85.6%, when Sd was set to 50. We found that with higher Sd, a smaller set of gene signatures was generated and more true positives were missed. Conversely, with lower Sd, the gene signature list contained more false positives and negatives.

Figure 5.2 The performance of our approach using different thresholds of domain index scores (Sd). Receiver operating characteristic (ROC) curves plotted for different thresholds when our approach was tested against the breast cancer data set with and without incorporating somatic mutation data. The area under the curve (AUC) is 0.861 without somatic mutations and 0.892 with somatic mutations.

Figure 5.3 A network of 171 gene signatures identified in the breast cancer data set using our approach. Each gene is colored according to its biological function annotation derived from its gene ontology terms.

5.3.2 The identified biomarkers may be involved in carcinogenesis

A total of 171 gene signatures were identified in a breast cancer data set using the above approach. By examining the gene ontology terms associated with these gene signatures, we found that they are mainly involved in 5 major cancer-related biological processes: transcription, DNA repair, signal transduction, cell cycle and protein phosphorylation (Figure 5.3). For instance, well-known oncogenic transcription factors such as FOS, JUN and NFκB were identified as gene signatures by this study. We identified DNA repair genes including XRCC5, MSH, PCNA and others as gene signatures. These genes have been demonstrated to cause cancer because mutations in them disable the ability of DNA repair, which subsequently leads to the accumulation of mutations [30-32].
Genes involved in signal transduction, an important class of pathways in cancer development, such as MARK14, VAV1 and PIK3R1, were also identified as gene signatures in this study. In addition, a group of cyclin-dependent kinases (CDK2, CDK3, CDK4, CDK6) that control cell proliferation [33] and other genes (SRC, ABL1) related to protein phosphorylation [34] were identified. In summary, 38% (65 out of 171) of the identified gene signatures are associated with cancers in Online Mendelian Inheritance in Man (OMIM; http://www.ncbi.nlm.nih.gov/omim/). This percentage is significantly greater than that expected purely by chance (P < 10^-12, Z-test), indicating the capability of our approach to identify disease genes. Interestingly, only 15% (26 out of 171) of the identified gene signatures were known cancer susceptibility genes when compared to the list of genes downloaded from The Cancer Gene Census (http://www.sanger.ac.uk/genetics/CGP/Census/), which reports mutations causally implicated in cancer. This result is consistent with those of previous studies, which yielded 21% and 16%, respectively [7, 8]. The low percentage of known cancer susceptibility genes in the gene signature list suggests that mutations not only in these genes but also in other genes might collectively contribute to carcinogenesis by disrupting the modularity of the PPI network. We speculate that the other genes could be downstream effectors of the cancer susceptibility genes and that changes in their expression could disrupt PPIs.

5.3.3 Somatic mutations increase the accuracy of our approach

Some somatic mutations reveal the role of functional domains in cancer. For example, tumors highly sensitive to epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors often contain dominant mutations in exons that encode a portion of the tyrosine kinase (TK) domain of EGFR [35].
To investigate whether somatic mutations within domains are important indicators for differentiating the two classes of patients, we incorporated the somatic mutation data compiled from the COSMIC database into our scoring model (see methods) by searching for genes having mutually exclusive domains that harbor somatic mutations. We hypothesized that these mutations could disrupt DDIs and PPIs and consequently change the modularity of the human protein interaction network. At the threshold of Sd = 50, our approach identified 126 gene signatures and achieved accuracy of 88.3%, sensitivity of 87.2% and specificity of 88.9% when tested on the breast cancer outcome data (Figure 5.2). The improvement in performance suggests that somatic mutation data are indeed a factor in predicting cancer outcome; however, their impact is limited. Because some mutations, called “driver mutations”, contribute to the development of cancers while others, called “passenger mutations”, are effectively neutral [12], the minor performance improvement could be explained by the incompleteness of currently available somatic mutation data or by bias introduced by passenger mutations. With the help of next-generation sequencing techniques, the amount of human somatic mutation data will grow, and our approach is expected to better distinguish “driver mutations” from “passenger mutations”. We anticipate that our approach could then achieve better performance.

5.3.4 A list of over-represented domains that tend to disrupt the protein interaction network

To investigate what types of domains tend to exist in ‘singlish-interface’ proteins and disrupt protein interactions, we calculated the number of domain-domain interactions involving each domain in ‘singlish-interface’ proteins and compared it to that expected by chance. We identified 76 over-represented domains within ‘singlish-interface’ proteins (Table 5.1) (P < 0.01, Z-test).
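The comparison of an observed DDI count with the count expected by chance can be illustrated with a simple Z-test. The sketch below is only an illustration under assumptions: this section does not specify how the chance expectation was obtained, so the example assumes it is estimated from a list of counts produced by some randomization procedure (`null_counts`), and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def z_score(observed: float, null_counts: Sequence[float]) -> float:
    """Standardize an observed count against an assumed null sample.

    `null_counts` stands for counts obtained under a randomization of the
    network; how that null is generated is an assumption of this sketch.
    """
    mu = mean(null_counts)
    sigma = pstdev(null_counts)
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mu) / sigma


def upper_tail_p(z: float) -> float:
    """One-sided P-value P(Z >= z) under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_over_represented(observed: float, null_counts: Sequence[float], alpha: float = 0.01) -> bool:
    """Flag a domain as over-represented at significance level alpha."""
    return upper_tail_p(z_score(observed, null_counts)) < alpha
```

A domain would be reported as over-represented when `is_over_represented` returns True at alpha = 0.01, matching the P < 0.01 threshold quoted above. When many domains are tested at once, a multiple-testing correction may also be appropriate; the text does not state whether one was applied.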
Interestingly, 74% (56 out of 76) of the over-represented domains were annotated as cell signaling domains, such as SH2, Pkinase and Ras, according to the SMART database [36], indicating that these domains are likely to play a critical role in carcinogenesis through disruption of the protein interactions within signaling pathways. For example, the SH2 domain of the oncoprotein Src interacts with 86 domains within 57 proteins. It has been demonstrated that the SH2 domain regulates intracellular signaling cascades by binding with high affinity to phosphotyrosine-containing target peptides [37, 38] and is related to cancer cell migration and proliferation [39]. Another example is the Pkinase domain, which performs the catalytic function of protein kinases [40, 41]. Many diseases, including cancer, are caused by dysfunction of phosphorylation [42].

Table 5.1 A list of over-represented domains within the ‘singlish-interface’ proteins.

Domain    Name              DDIs  P-value
PF00017   SH2               86    1.63E-24
PF00018   SH3_1             70    1.39E-23
PF00069   Pkinase           49    1.41E-23
PF00071   Ras               45    2.13E-23
PF00170   bZIP_1            42    2.49E-23
PF07716   bZIP_2            34    2.97E-23
PF00036   efhand            32    4.48E-23
PF00562   RNA_pol_Rpb2_6    31    4.84E-23
PF01466   Skp1              29    5.14E-23
PF00130   C1_1              23    5.56E-23
PF00271   Helicase_C        23    6.34E-23
PF01193   RNA_pol_L         23    6.67E-23
PF00270   DEAD              23    7.18E-23
PF00169   PH                22    7.21E-23
PF00010   HLH               22    8.55E-23
PF00096   zf-C2H2           22    9.03E-23
PF00227   Proteasome        21    9.44E-23
PF05739   SNARE             21    9.94E-23
PF04998   RNA_pol_Rpb1_5    21    9.94E-23
PF00023   Ank               20    9.98E-23
PF01833   TIG               20    1.44E-22
PF05000   RNA_pol_Rpb1_4    19    2.05E-22
PF00433   Pkinase_C         19    2.17E-22
PF02985   HEAT              18    3.71E-22
PF00004   AAA               18    9.83E-22
PF00076   RRM_1             18    9.99E-22
PF01423   LSM               17    1.04E-21
PF04983   RNA_pol_Rpb1_3    17    1.90E-21
PF00179   UQ_con            16    3.51E-21
PF00786   PBD               16    3.90E-21
PF00400   WD40              16    6.04E-21
PF00620   RhoGAP            16    8.19E-21
PF01000   RNA_pol_A_bac     16    9.25E-21
PF01192   RNA_pol_Rpb6      16    1.69E-17
PF00995   Sec1              15    3.16E-17
PF03870   RNA_pol_Rpb8      15    4.31E-17
PF00134   Cyclin_N          14    4.60E-17
PF00102   Y_phosphatase     14    4.74E-17
PF04563   RNA_pol_Rpb2_1    14    9.54E-17
PF00022   Actin             14    9.84E-17
PF04565   RNA_pol_Rpb2_3    14    1.17E-16
PF00623   RNA_pol_Rpb1_2    13    1.32E-16
PF00996   GDI               13    1.55E-16
PF00736   EF1_GNE           13    1.62E-16
PF00804   Syntaxin          13    1.80E-16
PF00183   HSP90             13    1.88E-16
PF04560   RNA_pol_Rpb2_7    12    2.09E-16
PF04567   RNA_pol_Rpb2_5    12    8.24E-16
PF02463   SMC_N             12    2.44E-15
PF00595   PDZ               12    2.96E-15
PF08033   Sec23_BS          12    3.21E-15
PF01194   RNA_pol_N         12    3.32E-15
PF02115   Rho_GDI           12    3.36E-14
PF00125   Histone           12    3.37E-14
PF00626   Gelsolin          12    3.66E-12
PF04811   Sec23_trunk       12    3.78E-12
PF04561   RNA_pol_Rpb2_2    12    4.74E-12
PF01214   CK_II_beta        12    8.15E-12
PF00617   RasGEF            11    9.15E-12
PF00503   G-alpha           11    1.29E-11
PF00514   Arm               11    5.39E-11
PF00618   RasGEF_N          11    5.99E-11
PF00788   RA                11    6.04E-11
PF05192   MutS_III          11    6.15E-10
PF04997   RNA_pol_Rpb1_1    10    6.26E-10
PF00621   RhoGEF            10    6.40E-10
PF01138   RNase_PH          10    6.47E-08
PF00352   TBP               10    6.70E-08
PF00240   ubiquitin         10    6.72E-07
PF00515   TPR_1             10    6.73E-06
PF02984   Cyclin_C          10    7.19E-06
PF06470   SMC_hinge         10    7.25E-04
PF00046   Homeobox          10    7.57E-03
PF04566   RNA_pol_Rpb2_4    10    7.58E-03
PF03725   RNase_PH_C        10    7.79E-03
PF00917   MATH              10    8.04E-03

5.3.5 Comparison between approaches

Identifying novel prognostic markers to classify cancer outcomes has been widely studied.
The approaches described in previous publications can be categorized into three classes: i) gene expression pattern-based methods, in which markers are selected based on whether their expression profiles can differentiate groups of patients [3, 4]; ii) PPI sub-network-based methods, in which each marker, represented as a sub-network in the PPI network, is identified by maximizing the mutual information measuring the association between the expression values of the genes in the sub-network and the types of patients [7]; iii) PPI modularity-based methods, in which each gene signature is identified by comparing the gene expression values of a hub gene and its interacting partners in the PPI network [8]. In this study, we employed a novel approach that uses as markers the genes in the PPI network that have mutually exclusive domains and somatic mutations located in these domains. Table 5.2 shows a comparison of the approaches. Wang et al. [4] and van’t Veer et al. [3] reported 63% and 62% accuracy, respectively, for the prediction of metastasis using gene expression pattern-based methods. Using the PPI sub-network-based method, Chuang et al. [7] obtained accuracy of 72.2% and 70.1% using the same data set. Using the PPI modularity-based method, Taylor et al. [8] reported accuracy of 76%. We applied our approach to the same data set that Taylor et al. used and adopted the identical training and testing strategy (five-fold cross-validation). Our approach achieved accuracy of 88.3%, sensitivity of 87.2% and specificity of 88.9% when Sd was set to 50, which indicates that our method outperforms the other approaches and provides a promising solution for predicting cancer outcome (Figure 5.4A).

Table 5.2 Feature comparison between different approaches.

Columns: [3] van’t Veer et al., Nature 2002 (gene expression pattern-based method); [4] Wang et al., Lancet 2005 (gene expression pattern-based method); [7] Chuang et al., Mol Syst Biol 2007 (PPI sub-network-based method); [8] Taylor et al., Nat Biotechnol 2009 (PPI modularity-based method); this study.

Features                                                   [3]   [4]   [7]   [8]   This study
Gene expression profiles                                   √     √     √     √     √
Protein-protein interaction network                        -     -     √     √     √
Gene expression value difference of neighboring proteins   -     -     -     √     √
Domain-domain interaction network                          -     -     -     -     √
Somatic mutations                                          -     -     -     -     √
Accuracy                                                   62%   63%   72%   76%   88%

Figure 5.4 Predictive performance comparison between different approaches. (A) Our approach tested on the same data set as Taylor et al. used, with the identical training and testing strategy; (B) our approach tested on the same data set as the other approaches used, with the identical training and testing strategy.

5.3.6 The robustness of our approach

To test the robustness of our approach, we first applied it to another independent data set that included 184 breast cancer patients with metastasis and 397 breast cancer patients without metastasis [3, 4]. Using five-fold cross-validation, our approach achieved accuracy of 83.2%, sensitivity of 84.6% and specificity of 82.5% (Figure 5.4B), which is better than the performance of previous studies [3, 4] in predicting breast cancer outcome when tested on an independent data set. Next, we compiled a set of 23 patients with oral squamous cell carcinomas (OSCCs), comprising gene expression data for 8 patients with pathological lymph node positivity and 15 patients with lymph node negativity [43]. Compared to patients without lymph node metastasis, patients with lymph node metastasis have very high death rates [44]. We applied our approach to this data set using the leave-one-out cross-validation (LOOCV) strategy because of the small sample size. Our approach achieved accuracy of 92%, sensitivity of 93.3% and specificity of 87.5%, further supporting the robustness of our predictive approach.

5.4 Conclusion

Biological network information has proven to be useful for improving prognosis performance [7, 8].
In this context, our study constitutes the first predictive method to classify cancer outcomes based on information about protein interaction interfaces in protein interaction networks together with gene expression data. The favorable predictive performance of our approach suggests that an association exists between metastasis and protein interaction interfaces, probably because genetic variations within domains interrupt physical interactions and thereby cause abnormal biological functions associated with cancer progression. The potential of the approach described in this study is substantially restrained by the limitations of currently available data sources. These data sources, such as the protein interaction data, the domain interaction data, the gene expression data and the somatic mutation data, are far from complete and contain biases. As these data sets grow, we expect our method to lead to a more reliable and robust prognosis tool for assessing cancer outcome.

5.5 References

1. Weinberg RA: The Biology of Cancer, 1 edn: Garland Science; 2006.
2. Gebauer G: On the way to specifically targeting minimal residual disease? Breast Cancer Res 2008, 10(5):112.
3. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536.
4. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671-679.
5. Jaluria P, Konstantopoulos K, Betenbaugh M, Shiloach J: A perspective on microarrays: current applications, pitfalls, and potential uses. Microb Cell Fact 2007, 6:4.
6. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 2000, 100(1):57-70.
7. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3:140.
8. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 2009, 27(2):199-204.
9. McGough AM, Staiger CJ, Min JK, Simonetti KD: The gelsolin family of actin regulatory proteins: modular structures, versatile functions. FEBS Lett 2003, 552(2-3):75-81.
10. Kim PM, Lu LJ, Xia Y, Gerstein MB: Relating three-dimensional structures to protein networks provides evolutionary insights. Science 2006, 314(5807):1938-1941.
11. Zhidkov I, Livneh EA, Rubin E, Mishmar D: MtDNA mutation pattern in tumors and human evolution are shaped by similar selective constraints. Genome Res 2009, 19(4):576-580.
12. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S, O'Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G, Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D, Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R, Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A, Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F, Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR, Campbell P, Birney E, Easton DF, Chenevix-Trench G, Tan MH, Khoo SK, Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158.
13. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
14. Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 2008, 9:405.
15. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A, Hurrell L, Isserlin R, Jack F, Juma F, Khan A, Kon T, Konopinsky S, Le V, Lee E, Ling S, Magidin M, Moniakis J, Montojo J, Moore S, Muskat B, Ng I, Paraiso JP, Parker B, Pintilie G, Pirone R, Salama JJ, Sgro S, Shan T, Shu Y, Siew J, Skinner D, Snyder K, Stasiuk R, Strumpf D, Tuekam B, Tao S, Wang Z, White M, Willis R, Wolting C, Wong S, Wrong A, Xin C, Yao R, Yates B, Zhang S, Zheng K, Pawson T, Ouellette BF, Hogue CW: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33(Database issue):D418-424.
16. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535-539.
17. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449-451.
18. Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R, Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS, Pandey A: Human protein reference database--2006 update. Nucleic Acids Res 2006, 34(Database issue):D411-414.
19. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363-2371.
20. Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct--open source resource for molecular interaction data. Nucleic Acids Res 2007, 35(Database issue):D561-565.
21. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455.
22. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res 2007, 35(Database issue):D572-574.
23. Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics 2005, 21(9):2076-2082.
24. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412.
25. Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, Kok CY, Jia M, Ewing R, Menzies A, Teague JW, Stratton MR, Futreal PA: COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res, 38(Database issue):D652-657.
26. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999-2009.
27. Friedman N: Inferring cellular networks using probabilistic graphical models. Science 2004, 303(5659):799-805.
28. Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP: Causal protein-signaling networks derived from multiparameter single-cell data. Science 2005, 308(5721):523-529.
29. Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR: Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 2006, 362(2):365-386.
30. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, Szabo S, Buckhaults P, Farrell C, Meeh P, Markowitz SD, Willis J, Dawson D, Willson JK, Gazdar AF, Hartigan J, Wu L, Liu C, Parmigiani G, Park BH, Bachman KE, Papadopoulos N, Vogelstein B, Kinzler KW, Velculescu VE: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268-274.
31. Naugler WE, Karin M: NF-kappaB and cancer-identifying targets and mechanisms. Curr Opin Genet Dev 2008, 18(1):19-26.
32. Young MR, Yang HS, Colburn NH: Promising molecular targets for cancer prevention: AP-1, NF-kappa B and Pdcd4. Trends Mol Med 2003, 9(1):36-41.
33. Schwartz MA, Assoian RK: Integrins and cell proliferation: regulation of cyclin-dependent kinases via cytoplasmic signaling pathways. J Cell Sci 2001, 114(Pt 14):2553-2560.
34. Oh AS, Lahusen JT, Chien CD, Fereshteh MP, Zhang X, Dakshanamurthy S, Xu J, Kagan BL, Wellstein A, Riegel AT: Tyrosine phosphorylation of the nuclear receptor coactivator AIB1/SRC-3 is enhanced by Abl kinase and is required for its activity in cancer cells. Mol Cell Biol 2008, 28(21):6580-6593.
35. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304(5676):1497-1500.
36. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res 2006, 34(Database issue):D257-260.
37. Marengere LE, Pawson T: Structure and function of SH2 domains. J Cell Sci Suppl 1994, 18:97-104.
38. Pawson T: Protein modules and signalling networks. Nature 1995, 373(6515):573-580.
39. Porter CJ, Matthews JM, Mackay JP, Pursglove SE, Schmidberger JW, Leedman PJ, Pero SC, Krag DN, Wilce MC, Wilce JA: Grb7 SH2 domain structure and interactions with a cyclic peptide inhibitor of cancer cell migration and proliferation. BMC Struct Biol 2007, 7:58.
40. Hanks SK, Hunter T: Protein kinases 6. The eukaryotic protein kinase superfamily: kinase (catalytic) domain structure and classification. FASEB J 1995, 9(8):576-596.
41. Hanks SK, Quinn AM: Protein kinase catalytic domain sequence database: identification of conserved features of primary structure and classification of family members. Methods Enzymol 1991, 200:38-62.
42. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002, 298(5600):1912-1934.
43. Kato Y, Uzawa K, Saito K, Nakashima D, Kato M, Nimura Y, Seki N, Tanzawa H: Gene expression pattern in oral cancer cervical lymph node metastasis. Oncol Rep 2006, 16(5):1009-1014.
44. Greenberg JS, Fowler R, Gomez J, Mo V, Roberts D, El Naggar AK, Myers JN: Extent of extracapsular spread: a critical prognosticator in oral tongue cancer. Cancer 2003, 97(6):1464-1470.

6. Conclusions

6.1 Summary

In this thesis, I have described my contributions to the development of computational methods and tools for analyzing biological networks. Specifically, I introduced novel computational approaches that predict protein-protein interactions, domain-domain interactions, biological pathway organizations and cancer-related gene signatures from biomolecular networks, such as protein interaction networks, domain interaction networks, genetic interaction networks and gene expression networks. For more efficient and reliable analysis, in addition to utilizing the data in these networks, I integrated information about Gene Ontology semantic similarity into these biomolecular networks. In this final chapter, I discuss the general conclusions that can be drawn from my thesis research and propose potential future directions for biomolecular network analysis.

6.2 Knowledge discovery on the basis of networked data

A living organism or a cell is a highly organized system of interacting macromolecules and metabolites, which can be viewed as a huge biological network consisting of interactions between molecules. The study of biomolecular networks present in various biological systems therefore calls for capturing the properties of both local interactions and the global network in a quantitative fashion. A large number of diverse studies of biomolecular networks are ongoing. In my studies, I investigated the following main computational problems in this field.

6.2.1 The prediction of protein-protein interactions and domain-domain interactions from noisy and incomplete high-throughput data

High-throughput techniques have revealed a large number of protein interactions.
However, the current protein interaction data are noisy and contain an undetermined proportion of false positives [1, 2]. Furthermore, owing to practical limitations of time, labor and cost, only a part of the interactome has been investigated so far [3]. It is therefore necessary to develop sophisticated computational methods to analyze the available data. To undertake this task, I first focused on the prediction of interactions between membrane proteins, because the current protein interaction data are strongly biased against membrane proteins, which constitute as many as 20% to 35% of all known proteins. Based on the previous observation that two proteins are likely to interact with each other if they interact with a similar group of proteins, I developed a log likelihood scoring method to quantitatively measure the overlap of interacting proteins for a given pair of proteins in the protein interaction network and the domain interaction network (Chapter 2). Overall, this method predicted 4,660 interactions between integral membrane proteins, of which 68% (3,168 out of 4,660) were novel. I demonstrated that this approach improved on other predictive approaches when tested on a “gold-standard” data set, achieving a 74.6% true positive rate at the expense of a 9.9% false positive rate. Furthermore, I confirmed that two membrane proteins are more likely to interact with each other if they share common interaction partners in the networks. This study resulted in a more extensive understanding of the yeast integral membrane proteins from a network view and complemented previous prediction approaches based on genomic context.

Domains are defined as structurally and/or functionally independent building blocks of proteins. It is widely believed that some proteins interact with each other through interactions between their domains [4, 5].
As a result, studies of domain-domain interactions facilitate the identification of protein-protein interactions. Increasing effort has been devoted to this area, via either experimental or computational approaches. It is generally infeasible to study a large set of domain-domain interactions through experimental approaches such as crystallization because of their high cost and low efficiency. Computational approaches have therefore become an alternative for predicting the interactions between protein domains. Previous studies that relied purely on domain annotations yielded interesting domain-domain interactions but were limited by low sensitivity and specificity. Given this situation, I proposed a novel approach called GAIA, based on the primary sequences of the domains (Chapter 3). I hypothesized that a pair of small segments of n contiguous amino acids (n-grams) can interact with each other and mediate domain-domain interactions. I showed that GAIA had better prediction performance than other prediction approaches, with a sensitivity of 82% at a false positive rate of 21%. This result suggests that the primary sequence information of protein domains combined with domain annotations may be the optimal way to predict interacting domains. In this study, I also identified a list of significantly over-represented 4-gram pairs that may mediate DDIs. Another advantage of GAIA is that it can predict the location of interacting grams/hotspots, thereby providing testable hypotheses for experimental validation. Overall, I demonstrated that GAIA, a gram-based method, is a novel and reliable way to predict DDIs that may mediate PPIs in yeast.

6.2.2 The reconstruction of biological pathways from a large set of biomolecular interactions

A single large biological network within any cell consists of smaller units called biological pathways that perform specific biological tasks or functions. The currently available pathway data are far from complete.
Moreover, the majority of these data are metabolic pathways and signaling pathways that have been manually curated. This situation prompted me to consider reconstructing biological pathways computationally. Biological pathway reconstruction has been intensively studied for years in the bioinformatics and engineering communities, but the focus has mainly been on metabolic and signaling pathways. To overcome this bias, I developed Pandora, a pathway discovery tool for inferring biologically functional pathways (Chapter 4). Instead of using a single type of data, I incorporated four types of functionally associated data in the model organism S. cerevisiae: protein-protein interactions (PPIs), genetic interactions (GIs), domain-domain interactions (DDIs) and semantic similarity of GO terms. Because of the intrinsic properties of the tool, members of the pathways identified by Pandora are highly functionally associated. By aligning the prediction results to the pathway annotations from three pathway databases (KEGG, BioCyc and Reactome), I showed that my approach is able to predict biological pathways with a higher positive predictive value (PPV) than previously reported approaches. My results, which also revealed new members of pathways, provide testable hypotheses for experimental validation. Pandora represents promising progress toward deciphering the entire pathway organization in yeast cells and points to future directions for discovering pathways in other eukaryotic systems when more large data sets become available.

6.2.3 The identification of active sub-networks associated with the dynamic behaviors of biosystems

Biological networks have been used to investigate the relationship between biomolecular interactions and human diseases at different levels. At the transcriptome level, differential expression analysis is performed to identify the most consistently altered genes [6-8].
At the interactome level, similar topological properties from human molecular interaction networks have proven to be useful in identifying disease-related sub-networks or disease genes [9-13]. At the metabolic and signaling network level, mutated enzymes that catalyze adjacent metabolic reactions or signal transduction steps are believed to be associated with diseases [14, 15]. However, low accuracy is the major drawback of existing computational approaches, which motivated me to integrate different levels of data in order to improve the accuracy. Because human diseases such as cancer represent dynamic and multifactorial processes, it is reasonable and necessary to incorporate multiple types of data to identify active sub-networks associated with these diseases. I therefore proposed an integrated approach using various types of data, including PPIs, DDIs, gene expression profiles and somatic mutations, to predict cancer outcomes (with metastasis and without metastasis) by identifying cancer-related sub-networks (Chapter 5). The major difference between my approach and other network-based predictive approaches is that I employed a scoring model to quantify the involved DDI sub-network and somatic mutations of each protein. Tested on a set of breast cancer patients, my approach greatly improved the predictive performance, reaching accuracy of 87%, sensitivity of 87.8% and specificity of 86.6%. In particular, 171 gene signatures were identified in a breast cancer data set, and they were mainly involved in five major cancer-related biological processes: transcription, DNA repair, signal transduction, cell cycle and protein phosphorylation. These results provide testable hypotheses for the identification of pathways associated with cancer metastasis. This study can be applied to the prediction of cancer prognosis, which may help cancer patients avoid unnecessary chemotherapy.
In addition to breast cancer, this approach can also be applied to several other cancers including lung, colon and pancreatic cancers.

6.3 Limitations of computational studies on networks

Complemented by experimental methods, network-driven computational approaches provide a promising path to reveal a more complete picture of biological systems. However, several challenges in knowledge discovery based on networked data should be noted.

First, reliable data collection is still insufficient, which results in different “noisy” subsets of the complex cellular networks. Several important types of data ranging from signaling networks to the role of microRNAs in network topology and dynamics remain completely unexplored by any high-throughput technique [16]. On the other hand, available large-scale data sets usually contain experimental artifacts, biases or noise caused by human or technical limitations. For example, two high-throughput techniques for identifying physical protein interactions (Y2H and co-affinity purification followed by mass spectrometry) have sampling problems and could generate a substantial number of false positives [1, 2]. Another important shortcoming of currently available large-scale PPI data sets is that they do not include the affinity of PPIs. These data sets have no dynamic range, yet proteins are known to have very different affinities; as a result, current high-throughput methods may never be able to “see” some interactions. Data incompleteness and poor data quality also make it difficult to compile a gold-standard data set, and therefore affect the assessment of the prediction performance of various approaches.

Moreover, it is challenging to obtain direct knowledge on the dynamics of a biological network. Currently, non-continuous time-course gene expression data are the only available data source that reflects different states of the cells.
This limitation hinders the development of sophisticated theoretical models and quantitative simulation techniques.

Current computational network-based analysis is mainly focused on data generated from simple organisms such as S. cerevisiae. Theoretically, the methodologies are applicable to a variety of organisms. However, even with the fast increase of heterogeneous biological data, the data for some organisms such as Mus musculus, Drosophila melanogaster and especially Homo sapiens are still far from complete. Prediction approaches based on multiple lines of evidence, therefore, face the challenges caused by data incompleteness. One interesting contribution this work could provide is to direct the PPI measurement of specific target protein pairs. One could look at the many orthologs present in model systems and the various technologies that now exist or are in development to verify and test interactions and affinities between a number of human proteins.

6.4 Future directions of biomolecular network analysis

In this thesis, we proposed new solutions to extract knowledge from various types of biological networks. These graph-based approaches were demonstrated to be efficient and effective strategies to predict PPIs, DDIs, pathways and cancer outcomes. Despite these achievements, there is still much room for improvement so that we can elucidate the underlying mechanisms embedded in biological networks more accurately. Future progress is expected in the following four topics.

First, the issue of sampling of biological networks needs to be addressed in order to identify the correct network statistics. Computational approaches such as inferring functional modules and finding network motifs are based on the assumption that the complete network is scale free [3]. However, currently available biological networks are incomplete. Therefore, topological properties of existing biological networks may not be accurately extrapolated to the complete ones.
One possible solution is to develop methodologies that quantify the effects of inaccurate sampling caused by incomplete networks. For example, the scoring model to measure the likelihood of interaction between a pair of proteins in Chapter 2 is based on the number of PPIs and DDIs between this pair of proteins and their common interactors. Although this number is statistically significantly higher than would be expected at random, it might be attributable to sampling artifacts originating from experimental bias. In this case, the bias needs to be taken into account in a more sophisticated scoring model, so that a more accurate likelihood calculation can be achieved. Another possible solution is to increase the network coverage. The demand for increased coverage needs to be satisfied by advanced techniques developed in the foreseeable future. An example of such a technique would be a new high-throughput technology capable of identifying the PPIs overlooked by current high-throughput technologies.

Second, in order to extract fundamental rules that govern complex living systems, we need to build general predictive models by integrating networked data at different levels. For example, one possible extension of the pathway discovery tool Pandora, described in Chapter 4, is to integrate gene expression data, as genes having similar expression patterns may be within the same biological pathway involved in a biological process. In GAIA, described in Chapter 3, further integrating the three-dimensional structural information of proteins, such as the relative solvent accessible surface area, might be helpful to the prediction of interacting domains. However, we are facing the challenge of how to integrate various types of networked data. Data integration can be accomplished by first quantifying each data type and then combining them. In Pandora, I adopted a simple rule in which each data type is weighted equally.
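As an illustration only, the equal-weight rule can be sketched in Python. This is not Pandora's actual implementation: it assumes that each data type has already been quantified as a score between 0 and 1 for a given protein pair, and the function names, dictionary keys and example values are all hypothetical.

```python
def combine_equal(evidence):
    """Average the evidence scores so that every data type has the same weight."""
    return sum(evidence.values()) / len(evidence)


def combine_weighted(evidence, weights):
    """Weighted average; reduces to combine_equal when all weights are equal."""
    total_weight = sum(weights[name] for name in evidence)
    return sum(evidence[name] * weights[name] for name in evidence) / total_weight


# Hypothetical, already-normalised evidence scores for one protein pair.
pair_evidence = {"ppi": 1.0, "gi": 0.0, "ddi": 1.0, "go_similarity": 0.6}

equal_score = combine_equal(pair_evidence)
same_score = combine_weighted(pair_evidence, {name: 1.0 for name in pair_evidence})
print(equal_score, same_score)
```

The weighted variant is included only to show where per-data-type weights would enter if they were learned rather than fixed.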
To achieve optimal predictive power, however, different genomic features need to be properly integrated into a single probabilistic framework. Many machine learning methods such as Bayesian approaches, decision trees and support vector machines can be applied in Pandora to achieve better performance. In addition, a methodology is needed to evaluate the limits of data integration by examining how predictive performance changes when more types of data are integrated.

Meanwhile, we need to maximize our data collection abilities by finding the most relevant data sets to fit the biological problems we are trying to solve. For example, while it can successfully predict biological pathways, Pandora misses an important piece of information: the directionality between entities within pathways. This is because the networked data sets used in Pandora, such as PPIs, DDIs and synthetic lethal interactions, are undirected. Therefore, one future improvement for Pandora is to collect directional relationship data sets such as epistatic, conditional, and suppressive genetic interactions that provide serial information flow from one gene to another [17]. In Chapter 5, I described an approach to identify gene signatures based on disrupted sub-networks in the PPI and DDI networks caused by the somatic mutations present within specific domains. However, the impact of using somatic mutation data is limited due to the incompleteness of currently available somatic mutation data. With the help of next-generation sequencing techniques, the International Cancer Genome Consortium (ICGC) (http://www.icgc.org/) aims to generate comprehensive catalogues of genomic abnormalities such as somatic mutations, abnormal expression of genes, and epigenetic modifications in 50 different cancer types and make the data available to the entire research community.
Integrated with ICGC data sets, a more sophisticated model could be developed to quantify the differences in the DDIs and PPIs between cancerous and pre-cancerous cell types caused by the genomic variations present within specific domains. Such a model could contribute to understanding how various aspects of genomic change affect cancer development, and thereby provide insights into the different cellular mechanisms functioning in normal versus cancerous states.

Finally, we need to consider the effect of different types of topological measurements when we interpret network-based knowledge. Some topological features such as node degree, the adjacency matrix and the clustering coefficient indicate local connections in the graph, whereas other features such as a node’s betweenness centrality can identify the important nodes in terms of the whole network. The degree of a node is not necessarily correlated with its betweenness centrality [18]. To take this factor into account, the local topological measurement-based approaches described in this thesis, such as Pandora and the prediction of membrane PPIs, could be further modified to examine global topological features such as betweenness centrality, so that we could measure the biological relevance of protein pairs both locally and globally.

To conclude, the ultimate aim in this field is to construct a complete and high-resolution description of molecular topography and connect various types of interactions with physiological responses. Although systems biology-based approaches are not yet fully developed, it is widely believed that they will become more and more essential to the understanding of complex biological phenomena such as diseases.
We expect that the findings presented in this thesis, obtained by investigating the relationships and interactions between various parts of a complex living system such as protein interaction networks, domain interaction networks, gene expression networks and genetic interaction networks, will provide some new solutions to mine knowledge from a variety of biological networks, which is a critical step towards a complete understanding of the underlying mechanisms of living organisms.

6.5 References

1. Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics 2006, 7:365.
2. Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE: Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 2008, 40(7):854-861.
3. Han JD, Dupuy D, Bertin N, Cusick ME, Vidal M: Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol 2005, 23(7):839-844.
4. Kato Y, Nagata K, Takahashi M, Lian L, Herrero JJ, Sudol M, Tanokura M: Common mechanism of ligand recognition by group II/III WW domains: redefining their functional classification. J Biol Chem 2004, 279(30):31833-31841.
5. McGough AM, Staiger CJ, Min JK, Simonetti KD: The gelsolin family of actin regulatory proteins: modular structures, versatile functions. FEBS Lett 2003, 552(2-3):75-81.
6. Tuck DP, Kluger HM, Kluger Y: Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics 2006, 7:236.
7. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536.
8. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671-679.
9. Chen JY, Shen C, Sivachenko AY: Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pac Symp Biocomput 2006:367-378.
10. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006, 78(6):1011-1025.
11. Jonsson PF, Bates PA: Global topological features of cancer proteins in the human interactome. Bioinformatics 2006, 22(18):2291-2297.
12. Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using protein-protein interactions. J Med Genet 2006, 43(8):691-698.
13. Wachi S, Yoneda K, Wu R: Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 2005, 21(23):4205-4208.
14. Lee DS, Park J, Kay KA, Christakis NA, Oltvai ZN, Barabasi AL: The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci U S A 2008, 105(29):9880-9885.
15. Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol 2008, 4(11):e1000217.
16. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113.
17. Drees BL, Thorsson V, Carter GW, Rives AW, Raymond MZ, Avila-Campillo I, Shannon P, Galitski T: Derivation of genetic interaction networks from quantitative phenotype data. Genome Biol 2005, 6(4):R38.
18. Guimera R, Mossa S, Turtschi A, Amaral LA: The worldwide air transportation network: Anomalous centrality, community structure, and cities' global roles. Proc Natl Acad Sci U S A 2005, 102(22):7794-7799.

Appendices

Appendix A Detecting protein-domains DNA-Motifs association in Saccharomyces cerevisiae regulatory networks

1. Introduction

Living cells are characterized by temporally and spatially differentiated gene expression. The regulatory mechanisms of gene expression responding to various internal or external cues, such as tissue-specific development signals and environmental factors, are not completely understood. Although it is well known that proteins, including activators and repressors, play key roles in the transcriptional regulation of gene expression, questions such as how these proteins bind to specific sites across the regions of transcription start sites, enhancers and silencers still remain to be addressed [1]. The technical accessibility of Saccharomyces cerevisiae (Sc) for both advanced genetic and molecular analyses makes it an ideal organism for the study of the underlying regulation of transcription.

Many research groups have engaged in computational studies on transcriptional regulation. They have done so in the following ways. First, they have predicted individual transcription factor binding sites by applying position-specific scoring matrices (PSSMs) derived from the multiple alignments of binding site sequences. The PSSMs can be retrieved from the popular databases TRANSFAC [2, 3] and JASPAR [4]. However, this method suffers from a large number of false positive predictions due to the short, degenerate nature of transcription factor binding site (TFBS) motifs [5].
Second, computational studies have predicted statistically significant functional motifs in the promoter regions of genes based on their gene expression profiles [6]. Third, other computational studies have predicted or discovered regulatory patterns between regulators and their target genes using one data set such as gene expression profiles [7-12] or a variety of data sources such as ChIP-chip data, regulatory motif data and gene expression profile data [13-15].

A protein domain represents a conserved segment of sequence within a protein that usually corresponds to a structural or functional region [19]. Understanding the functions of protein domains is important in understanding biological processes including gene expression regulation [16]. DNA binding is one of the most essential domain functions for gene expression regulation. In order for regulators to function, they need to directly or indirectly bind to DNA through cis-regulating regions. Detecting these binding domains and their corresponding cis-regions can help biologists further understand the regulation process across species.

Here, we report a probability-based computational method to help biologists infer the regulatory modules in yeast from regulatory DNA motif data and protein domain data. Identifying these pairs would allow a life scientist to greatly reduce and enrich the search space they would need to investigate in order to understand the regulatory networks of interest to them.

This method is based on finding associations between protein domains and DNA regulatory motifs over a large number of known regulation relations. These known relations arise either from the direct involvement of a given domain in the binding of a regulator protein to some DNA motifs or from the indirect involvement of this domain in other aspects of the regulation process, such as in the protein-protein interaction upon the binding of co-regulators.
In this study, we generated a list of associations in the form of domain-motif, showing potential statistically significant relations between binding motifs and domains of regulators (activators or repressors). Some of our association results were validated by previously published biological research.

2. Definitions

An item, I, is a Boolean value that represents the existence of either a protein domain (domain item, Id) or a DNA motif (motif item, Im).

A Motif Matrix, MM, is a matrix of motif items Im. It holds information about the existence of a known motif in the upstream region of a gene, where each line corresponds to a gene and each column to a motif. The value at line/column is true if the column motif exists at least once in the 500 bp upstream region of the gene line.

A Protein Domain Matrix, PDM, is a matrix of domain items Id. It holds information about the existence of a known protein domain in a gene, where each line corresponds to a gene and each column to a protein domain. The value at line/column is true if the column domain exists at least once in the gene line.

A Regulation Array, RA, is an array of two columns, each line corresponding to a regulation relation in the form of a regulator-target. The first column corresponds to the regulator gene and the second column corresponds to the target gene of the line regulation relation. There are Nrr regulation relations in the RA (number of lines is Nrr).

3. Method

The following method will generate association rules in the form of a support [17] and a score for each possible relation between regulator domain(s) and target motif(s). This score represents the ratio of the actual support of a set of domain(s) and motif(s) to the statistically predicted support, where the statistically predicted support of a set is the product of the supports of all its items.
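The support and score defined above can be illustrated with a short Python sketch. It is a minimal reading of the definitions, not the authors' implementation: each regulation relation becomes one set of items (the regulator's domains plus the target's upstream motifs), support is the fraction of relations containing a set, and the score divides that support by the product of the single-item supports. The function names and the toy relations below are hypothetical.

```python
from math import prod


def build_association_rows(regulation_array, domains_of, motifs_of):
    """Build one item set per regulator-target relation (one row of AM)."""
    rows = []
    for regulator, target in regulation_array:
        items = {("domain", d) for d in domains_of.get(regulator, ())}
        items |= {("motif", m) for m in motifs_of.get(target, ())}
        rows.append(frozenset(items))
    return rows


def support(rows, itemset):
    """Fraction of relations whose row contains every item in itemset."""
    wanted = frozenset(itemset)
    return sum(1 for row in rows if wanted <= row) / len(rows)


def score(rows, itemset):
    """Observed support divided by the product of the single-item supports."""
    predicted = prod(support(rows, [item]) for item in itemset)
    return support(rows, itemset) / predicted if predicted > 0 else 0.0


# Toy data (hypothetical relations, not taken from YEASTRACT).
regulation_array = [("R1", "T1"), ("R1", "T2"), ("R2", "T3"), ("R3", "T4")]
domains_of = {"R1": ["PF00319"], "R2": ["PF00400"], "R3": ["PF00400"]}
motifs_of = {"T1": ["M1"], "T2": ["M1"], "T3": ["M2"], "T4": ["M1"]}

rows = build_association_rows(regulation_array, domains_of, motifs_of)
pair = [("domain", "PF00319"), ("motif", "M1")]
print(support(rows, pair), score(rows, pair))
```

In this toy example the pair occurs in two of the four relations, while its items occur in two and three relations respectively, so the observed support is only about 1.33 times the support predicted under independence.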
Those relations that have a support greater than a given support threshold value SUPP_T and a score greater than a given score threshold value SCORE_T will be predicted as output.

First, an association matrix, AM, is constructed based on RA. This AM will hold information of regulator domain(s) and target motif(s) from PDM and MM. Each line in AM corresponds to one regulation relation (one line from RA) and consists of all the regulator domain items Id and all the target gene upstream motif items Im. The association of protein domains and DNA motifs is calculated using the method presented in Figure 1 and Figure 2.

4. Input Data

Three different types of data are used: protein domains, DNA binding motifs, and a set of known regulation relationships.

Protein domains: We obtained the domain annotations of each protein in yeast from Pfam [19]. A total of 3842 proteins (comprising about 60% of the yeast genome) with at least one domain were used in this experiment. Pfam is a database containing a large collection of common protein domains and families, as well as the multiple alignments and hidden Markov models from which they are derived [19].

DNA motifs: Two types of sets of DNA motifs are used: (1) experimentally identified target genes of specific transcription factor families (TRANSFAC) and (2) computationally predicted transcription factor binding sites (Gibbs sampling). We compiled our dataset from the datasets of [15] and [20], resulting in a total of 428 TRANSFAC motifs with 5849 target genes and 356 Gibbs motifs with 5579 target genes in yeast.

Regulation relations: A set RR of 12450 known regulation relations involving 141 regulators and 4054 target genes was collected from YEASTRACT [21], and then filtered based on the GO terms of the regulators.

5. Results

Each of the Gibbs Motif Matrix (GMM) and TRANSFAC Motif Matrix (TMM) datasets was tested for association with the Pfam dataset, PM, three times.
Each time the PM was filtered by a different set of regulator GO terms (GO:0003677, DNA binding proteins; GO:0030528, transcription regulator activity proteins; and GO:0003712, transcription cofactor activity that does not bind DNA itself). A support threshold of SUPP_T = 1.2% (this threshold value was empirically chosen to give a reasonable statistical significance to our results) and a score threshold of SCORE_T = 2.0 were used. Over those 6 tests and for relations of size less than or equal to seven (k <= 7), a total of 19 domains passed both the support and score thresholds; they are presented in Table 1.

As shown in Table 1, we found that many of the same domains appear with both motif datasets (TRANSFAC and Gibbs). In order to test the ability of the algorithm to infer regulation-related associations, we compared the three resultant domain lists filtered by the three different GO terms. Our results demonstrated that most domains represent the biological characteristics of their filtering GO terms and could be divided into these two categories:

1) Domains that physically bind directly to the motif of the target genes. Examples are as follows: PF00023, PF00046, PF00072, PF00096, PF00125, PF00249, PF00250, PF00319, PF00447, PF00498, PF00533, PF02292, PF04082 and PF08618. These exist in domain lists filtered by GO:0003677 (DNA binding proteins) and GO:0030528 (transcription regulator activity proteins) but not by GO:0003712 (transcription cofactor activity that does not bind DNA itself), which indicates that those domains on the regulator protein might regulate the target gene by binding to its DNA motif.

2) Domains that indirectly participate in gene regulation.
Examples are domains PF00400 and PF08618, which exist in domain lists filtered by GO:0030528 (transcription regulator activity proteins) and GO:0003712 (transcription cofactor activity that does not bind DNA itself) but not by GO:0003677 (DNA binding proteins), which implies an indirect regulation mechanism for these domains.

In order to give an indication of a domain's function, we scanned these domains on all 3842 known genes with at least one domain in the whole yeast genome, then on the 327 genes in GO:0030528 (transcription regulator activity proteins) and finally on the 221 genes of GO:0003677 (DNA binding proteins). We counted the number of occurrences of each protein domain (Table 2). Each row indicates the putative function of a domain. If a domain exists in both the “ALL” set and the “GO:0030528” set but not in the “GO:0003677” set, then the association relation with a DNA motif is more likely to be an indirect relation through protein-protein interactions. For example, the function of PF00400 can be putatively inferred as that of a co-regulator rather than direct binding to the cis-region. This inference can be made because, of the 84 proteins in yeast containing PF00400, 11 are annotated as related to transcription regulator activity based on GO:0030528, while no protein is known to be related to DNA binding based on GO:0003677. Similarly, domains PF00125, PF00249, PF00250, PF00319, PF00447, PF00498, PF00533, PF02292 and PF04082 have about the same number of proteins annotated with GO:0030528 (transcription regulator activity) and GO:0003677 (DNA binding), which indicates that their involvement in gene regulation is by DNA binding.

On our supplementary website (http://bioinformatics.ubc.ca/PdDm) one can find 1) a list of all protein domains from Table 1, with their Pfam domain names and function descriptions, and 2) a complete list of all domain-motif associations with their score, support and predicted support (p_support).
These association rules can be interpreted in the context of the results of Tables 1 and 2. For example, the Gibbs DNA motif “CLB2_M_Cluster_orfnumA2SD_n3” is found to be associated with PF00319. In Table 1, PF00319 associations are only found among GO terms. This indicates a direct DNA binding. In Table 2, PF00319 is found to exist only in DNA binding genes, so the association relation is likely to be a direct domain-motif binding relation. Functions of the Pfam protein domain list, presented in Table 4, support our results.

Moreover, the domain PF00023 (ankyrin repeats) of GABPb is demonstrated to accommodate recognition of direct repeats of GGA or inverted repeats of the GGA core with variable spacing [22], whereas the motif “Y$CDC6_01” (gcgacgcgAGGcctcacgcgtcgg) contains the inverted repeat of the GGA core. Another example is that the domain PF00170 (bZIP1 transcription factor) recognizes and binds in vivo the target sequence ATGACTCAT, called the AP-1 site, which contains the motif “Y$HIS3_05” (TGACTC) [23]. These findings are consistent with our results.

6. Conclusion

By exploring the association relations between protein-domain and DNA-motif data, we detected some putative domain-motif direct and indirect interactions; these interactions indicated the functions for some of these domains and motifs. One of the next steps for expanding on this work will be to achieve better prediction performance by integrating protein-protein interaction data and gene co-expression data.

Table 1 A list of 19 domains that are associated with DNA motifs in set sizes less than or equal to 3 (k = 3) and that pass both support and score threshold values in one or more of the 6 tests.
GO          GO:0003677  GO:0030528  GO:0003712  GO:0003677  GO:0030528  GO:0003712
MM          TRAN        TRAN        TRAN        Gibbs       Gibbs       Gibbs
Nrr         8484        12450       168         8484        12450       168
# of genes  80          141         6           80          141         6

PF00010 X X X
PF00023 X X
PF00046 X X X
PF00072 X X
PF00096 X X X X
PF00125 X
PF00170 X X
PF00172 X X X X X
PF00249 X X
PF00250 X X X X
PF00319 X X X
PF00400 X X
PF00447 X
PF00498 X X X
PF00533 X X
PF02292 X X X
PF04082 X X
PF08601 X X
PF08618 X X

Table 2 The number of occurrences of each protein domain for three different sets of regulation relations based on their GO term.

GO        ALL    GO:0030528  GO:0003677
Size      3842   327         221
PF00010   7      6           4
PF00023   18     5           3
PF00046   6      6           4
PF00072   4      2           1
PF00096   35     28          17
PF00125   8      1           8
PF00170   8      8           3
PF00172   48     28          22
PF00249   14     6           5
PF00250   4      4           3
PF00319   4      4           4
PF00400   84     11          0
PF00447   5      4           3
PF00498   14     4           6
PF00533   10     1           2
PF02292   4      4           4
PF04082   26     14          13
PF08601   2      2           1
PF08618   1      1           0

Figure 1 Steps for calculating support and scoring the association of domains-motifs

1. For each item I ∈ AM
       O(I) = AM.count(I) / Nrr;
2. Initialize an association size variable k = 2;
3. Initialize CANDk to empty; Initialize LARGEk to empty;
4. For each pair of domain-motif items {Id, Im} ∈ AM
       If (O(Id) > SUPP_T AND O(Im) > SUPP_T)
           CANDk.add({Id, Im});
5. If CANDk is empty, terminate;
6. For each set Sj of size k in CANDk
       support(Sj) = AM.count(Sj) / Nrr;
       If (support(Sj) > SUPP_T)
           p_support(Sj) = 1;
           For each I ∈ Sj
               p_support(Sj) = p_support(Sj) * O(I);
           score(Sj) = support(Sj) / p_support(Sj);
           If (score(Sj) > SCORE_T)
               Print Sj;
           Else
               LARGEk.add(Sj);
7. Using the Counting algorithm [18] GenCand, construct CANDk+1 from LARGEk;
8. Increase the association size variable by one: k = k + 1;
9. If (CANDk is not empty and k < k_max) goto (6);

Figure 2 The Counting algorithm [18]. This algorithm takes the LARGEk set as an input and returns the CANDk+1 set as an output.
GenCand(LARGEk) {
    Initialize CANDk+1 to empty;
    For each Sj1 ∈ LARGEk {
        For every item I, COUNT[i] = 0; // where i is an index of item I
        For each Sj2 ∈ LARGEk, Sj2 <> Sj1
            if (|Sj1 ∪ Sj2| == k) COUNT[Sj1 – Sj2]++;
        For each counter COUNT[i]
            if (COUNT[i] == k) CANDk+1.add({LARGE[j1]} ∪ {I});
    }
    return CANDk+1;
}

7. References

1. Wasserman, W.W., and A. Sandelin. 2004. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 5:276-87.
2. Matys, V., E. Fricke, R. Geffers, E. Gossling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A.E. Kel, O.V. Kel-Margoulis, D.U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender. 2003. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31:374-8.
3. Matys, V., O.V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A.E. Kel, and E. Wingender. 2006. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34:D108-10.
4. Sandelin, A., W. Alkema, P. Engstrom, W.W. Wasserman, and B. Lenhard. 2004. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32:D91-4.
5. Ho Sui, S.J., J.R. Mortimer, D.J. Arenillas, J. Brumm, C.J. Walsh, B.P. Kennedy, and W.W. Wasserman. 2005. oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res. 33:3154-64.
6. Pilpel, Y., P. Sudarsanam, and G.M. Church. 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 29:153-9.
7. Malhis, N., and Ruttan, A., Predicting Gene Regulatory Networks from MicroArray Time Series Data Using Fuzzy Elimination, Proceedings of the Second Annual BioTechnology and BioInformatics Symposium BIOT-05, October 2005, Colorado Springs, Colorado, USA, pp. 37-42.
8. Malhis, N., and Ruttan, A., Detecting Gene Regulation Relations from Microarray Time Series Data, Proceedings of the 2006 International Conference on Machine Learning; Models, Technologies & Applications MLMTA'06, June, 2006, Las Vegas, Nevada, USA.
9. Nguyen, D.H., and P. D'Haeseleer. 2006. Deciphering principles of transcription regulation in eukaryotic genomes. Mol Syst Biol. 2:2006 0012.
10. Segal, E., M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. 2003. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 34:166-76.
11. Wilczynski, B., T.R. Hvidsten, A. Kryshtafovych, J. Tiuryn, J. Komorowski, and K. Fidelis. 2006. Using local gene expression similarities to discover regulatory binding site modules. BMC Bioinformatics. 7:505.
12. Filkov, V., Skiena, S., and Zhi, J., Analysis techniques for microarray time-series data, Journal Comput. Biol. 9, 2002, pp. 317-330.
13. Lemmens, K., T. Dhollander, T. De Bie, P. Monsieurs, K. Engelen, B. Smets, J. Winderickx, B. De Moor, and K. Marchal. 2006. Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol. 7:R37.
14. Monsieurs, P., G. Thijs, A.A. Fadda, S.C. De Keersmaecker, J. Vanderleyden, B. De Moor, and K. Marchal. 2006. More robust detection of motifs in coexpressed genes by using phylogenetic information. BMC Bioinformatics. 7:160.
15. Middendorf, M., Kundaje, A., Shah, M., Freund, Y., Wiggins, C., and Leslie, C., Motif Discovery through predictive modelling of gene regulation, RECOMB (Research in Computational Biology), May 2005.
16. Pawson, T., and P. Nash. 2003. Assembly of cell regulatory systems through protein interaction domains. Science. 300:445-52.
17. Agrawal R., and Srikant R. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, 1994.
18. Malhis, N., Ruttan, A., and Refai, H., An Efficient Approach for Candidate Set Generation, Journal of Information & Knowledge Management, Vol. 4, No. 4 (2005) 287-291.
19. Finn, R.D., J. Mistry, B. Schuster-Bockler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L. Sonnhammer, and A. Bateman. 2006. Pfam: clans, web tools and services. Nucleic Acids Res. 34:D247-51.
20. Thompson, W., E.C. Rouchka, and C.E. Lawrence. 2003. Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 31:3580-5.
21. Teixeira, M.C., P. Monteiro, P. Jain, S. Tenreiro, A.R. Fernandes, N.P. Mira, M. Alenquer, A.T. Freitas, A.L. Oliveira, and I. Sa-Correia. 2006. The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res. 34:D446-51.
22. Graves, B.J. 1998. Inner workings of a transcription factor partnership. Science. 279:1000-2.
23. Berger, C., I. Jelesarov, and H.R. Bosshard. 1996. Coupled folding and site-specific binding of the GCN4-bZIP transcription factor to the AP-1 and ATF/CREB DNA sites studied by microcalorimetry. Biochemistry. 35:14984-91.

Appendix B Quantum dot conjugates for targeted silencing of bcr/abl gene by RNA interference in human myelogenous leukemia K562 cells.

1. Introduction

RNA interference (RNAi) is a cellular pathway of gene silencing in a sequence-specific manner at the messenger RNA level. This phenomenon was first discovered in the nematode worm Caenorhabditis elegans [1], and investigated in plants and invertebrates [2-5].
In this mechanism, small interfering RNAs (siRNAs) are produced from long double-stranded RNAs of exogenous or endogenous origin by the Dicer enzyme (ribonuclease III type) [6]. The resulting siRNAs (usually 21–23 nucleotides) serve as guide sequences that induce target-specific mRNA cleavage by cellular nucleases [7, 8], thus blocking translation of the mRNAs into proteins. RNAi offers several major advantages over earlier methods (e.g. antisense DNA) for suppressing gene expression. Biochemical studies of the RNAi pathway in human embryonic kidney 293 and HeLa cell lines showed that transfecting mammalian cells with short RNAs could trigger the sequence-specific RNAi pathway, thus overcoming the barriers to using RNAi as a genetic tool in mammals [9]. Promising results have since been achieved with siRNAs in animal models [10-12]. For example, RNAi has been used to generate tissue-specific knockdown mice for the study of gene function in vivo [13], and it holds potential for treating diseases as diverse as age-related macular degeneration and hepatitis [14]. The rapid advancement of this powerful biotechnology has sped up the understanding of gene function in cell and organism physiology and helps investigators dissect the cascade of molecular events underlying the pathogenesis of diseases, including cancer [15, 16]. Using siRNAs and other small RNAs in mammalian cells could also contribute to solving the problems associated with gene therapy.

The potential use of mammalian RNA interference has elicited a great deal of interest for the therapeutic regulation of gene expression [17, 18]. However, a major issue for oligonucleotide-based therapeutics is the effective delivery of antisense oligonucleotides or siRNAs to their respective sites of action in the nucleus or cytoplasm.
Although many studies demonstrated that antisense or siRNA compounds can function in RNAi without a delivery agent, many investigators believe that appropriate delivery platforms could improve efficiency and thus be very helpful for oligonucleotide-based therapeutics [19, 20]. Because RNA and DNA are structurally similar, both can be delivered into cells with common delivery carriers, which can be classified as viral or nonviral [21]. Viral vectors, such as lentiviruses [22, 23], adeno-associated viruses [24] and retroviruses [25], present safety and toxicity concerns that have limited their use in humans in vivo. In contrast, nonviral methods, including cell-penetrating peptides (CPPs) and cell-targeting ligands (CTLs), are among the promising strategies for siRNA delivery [26]. These nonviral methods involve the use of liposomes [27, 28], peptides [29, 30], cationic polymers [31], nanoparticles (NPs) [32, 33] and nanotubes [34]. Tracking and monitoring the delivery of siRNA into cells is difficult without a suitable tracking reagent. The organic fluorophores commonly used to label siRNAs lose over half of their fluorescence intensity in 5–10 s [35]. It is therefore not practical to apply these dyes in either long-term or multiplexed studies [35, 36]. On the other hand, fluorescent reporter plasmids, such as those encoding GFP (green fluorescent protein), require more than 2 h for the targets to become observable [37], which limits their application in real-time monitoring of siRNA molecules.

In the present study, we applied fluorescent quantum dots (QDs) conjugated with siRNA as a self-trackable nonviral vehicle to deliver double-stranded siRNAs, designed to knock down the bcr/abl oncogene, into K562 leukemia cells.
QDs are about 10-fold brighter than most conventional fluorescent dyes and are significantly easier to detect than fluorescent reporter plasmids in vivo [35, 38]. QDs can also fluoresce more than 20 times longer than conventional fluorescent dyes [39]. These properties have opened new possibilities for advanced molecular and cellular imaging as well as for ultrasensitive bioassays and diagnostics [38, 40, 41]. We have explored the utility of combining QDs with siRNA to track intracellular transport and evaluate delivery efficiency [42]. The superior brightness and photostability of QD probes in cells allow them to withstand not only fluorescence-activated cell sorting (FACS) but also live imaging and immunostaining procedures. Our QD/siRNA co-delivery technique demonstrated that QDs can work as a vector that successfully transfects siRNA into cells, and it thus offers a new method for RNAi applications. Our experimental results also suggested that the intensity of cellular fluorescence correlated with the level of silencing, allowing a uniformly silenced cell population to be collected by FACS.

2. Materials and Methods

2.1. Materials

The K562 human chronic myelogenous leukemia cell line was purchased from the American Type Culture Collection (Rockville, MD). All cells were cultured in RPMI 1640 medium (Nikken BioMedical Laboratory, Kyoto, Japan) supplemented with 10% fetal bovine serum (Life Technologies, Grand Island, NY). The cells were kept in a fully humidified atmosphere of 5% CO2 and 95% air at 37℃ and incubated until harvesting. The K562 cell line has been determined to express the b3a2 breakpoint of the bcr-abl mRNA. The double-stranded siRNA was first designed with a modified version of the GAIA algorithm [43] and then validated against previously reported sequences and methods [44].
The siRNA sequences (sense: 5’-GCA GAG UUC AAA AGC CCU UdTdT-3’; antisense: 5’-AAG GGC UUU UGA ACU CUG CdTdT-3’) specific for the b3a2 breakpoint of the bcr-abl gene were chemically synthesized (Songon, China) (Figure 1a). A deoxythymidine dimer (dTdT) overhang was added to the 3’ end of each strand to improve stability.

2.2 bcr/abl siRNA conjugated with QDs

Carboxylate-modified, highly luminescent semiconductor CdTe QDs (emission 625 nm, diameter 5~15 nm) were conjugated to siRNAs using EDAC solution as the crosslinker. In this experiment, 10 µL of carboxylate-modified QDs (13 µM), 250 µL of amino-modified siRNA (2 µM) and 15 mg of freshly prepared EDAC (final: 50 mg/mL) were mixed and vortexed for 4 h at room temperature in the dark. Next, the QD-siRNA conjugates were purified with centrifugal concentrators (Vivaspin 500 centrifugal concentrators, 300 kDa). The supernatant was discarded and 50 µL of QD-siRNA complexes were collected and kept at 0~4℃ [45].

2.3 Determination of conjugation efficiency and transfection of QD-siRNA into K562 cells

To measure the conjugation efficiency of bcr/abl siRNA onto QDs, unconjugated siRNA and QDs were removed by centrifugation at 10,000 × g. The QD-siRNA conjugates were analyzed by high-performance liquid chromatography (HPLC) (Agilent, USA). Conjugation efficiency was assessed from the difference in peak positions between pure siRNAs and the QD-siRNA conjugates. Next, to study the effect of crosslinking on transfection, a series of QDs and QD-siRNA conjugates were co-cultured with K562 cells. The QD-siRNA conjugates (final concentration: 2 µM) were added to 2 mL of cultured cells. The mixtures were then incubated for 48 h before seeding into cell culture flasks at a concentration of 1×10^6 cells/mL (Figure 1b). The seeded cells were cultured at 37℃ with 5% CO2. MTT assays were applied to determine cell activity and proliferation.
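The base pairing of the siRNA duplex listed in Section 2.1 can be checked mechanically. The following Python sketch is only an illustration and is not part of the original protocol: the function names are hypothetical, the target string is the BCR-ABL mRNA fragment shown in Figure 1 with its spacing removed, and the dTdT overhangs are omitted because they do not take part in base pairing.

```python
# Illustrative check of the siRNA duplex from Section 2.1 (hypothetical helper names).

# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# 19-nt core sequences as given in the text, written 5'->3', without the dTdT overhangs.
SENSE = "GCAGAGUUCAAAAGCCCUU"
ANTISENSE = "AAGGGCUUUUGAACUCUGC"

# BCR-ABL mRNA fragment from Figure 1, written 5'->3', spacing removed.
TARGET = "UGGAUUUAAGCAGAGUUCAAAAGCCCUUCAGCGGCCAGUA"


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def check_duplex(sense: str, antisense: str, target: str) -> dict:
    """Report whether the strands pair fully and where the sense strand sits in the target."""
    return {
        "antisense_matches": reverse_complement(sense) == antisense,
        "target_offset": target.find(sense),  # -1 would mean the site is absent
    }


if __name__ == "__main__":
    print(check_duplex(SENSE, ANTISENSE, TARGET))
```

A check of this kind only confirms that the two strands are fully complementary and that the sense strand occurs in the quoted target fragment; it says nothing about silencing efficacy, which the experiments below address.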
2.4 MTT assay

The percentage of cell survival was measured using the MTT (3-[4,5-dimethylthiazol-2-yl]-2,5-diphenyltetrazolium bromide) colorimetric assay. After K562 cells had been treated with the QD-siRNA conjugates for 44 h, the transfected cells were seeded into a 96-well plate at a concentration of 1×10^5 cells/mL. Next, 10 µL of MTT solution (5 mg/mL) was added to each well. During incubation at 37℃ with 5% CO2 for 4 h, MTT was converted into the blue formazan product. To dissolve the formazan product and end the reaction, 100 µL of 20% SDS (v/v) solution was added to each well. Absorbance was measured using an ELISA reader at a wavelength of 570 nm.

2.5 FACS

Flow cytometry was performed on a FACS Vantage SE flow cytometer (Becton Dickinson) using a 488 nm Ar laser and FL3 bandpass emission (650±10 nm) for the red QDs. To investigate the apoptosis of K562 cells induced by QD-siRNA transfection, K562 cells were seeded at a density of 1×10^5 cells/well. siRNA-conjugated QDs were then incubated with the cells at 20 µg/well for 24 h. The cells were fixed with pre-cooled 75% ethanol (v/v) at 4℃ overnight. The fixed cells were washed twice with 500 µL of cold PBS (pH 7.4) per well and then incubated at 37℃ for 30 min with RNase (20 µg/mL). The cellular DNA was then stained with propidium iodide (PI). PI fluorescence intensities were recorded and the transfection efficiency was quantified accordingly.

3. Results

3.1 Determination of K562 siRNA-QDs conjugation efficiency

Conjugation of K562 bcr-abl siRNA onto the surface of the QDs was achieved by incubating bcr-abl siRNA with -COOH functionalized CdSe QDs using EDAC as the cross-linker.
In this procedure, siRNA was attached to the surface of the QDs through a direct bond between the amino group, which had been introduced during siRNA synthesis, and the carboxyl group on the QDs, in the presence of EDAC as the crosslinker. The HPLC spectrum (Figure 2) revealed that the peak of the QD-nucleotide complexes appeared 1 min later than that of pure siRNAs. We believe that the electronegativity of the QDs increases the negative charge of the QD-siRNA conjugates, which shifts the retention time of the conjugates in HPLC. This result indicated that QDs could be successfully conjugated with siRNAs using previously reported methods.

To investigate the physical adsorption of the QD/siRNA conjugates, we co-cultured K562 cells with QDs subjected to a series of different treatments: pure QDs, QD-siRNA, and QD-siRNA prepared without EDAC (Figure 3a). In the MTT results, pure QDs and samples without EDAC had little effect on cell viability (change <20%). There was, however, a significant decrease in cell viability when K562 cells were treated with QD-siRNA prepared in the presence of EDAC: the viability of the treated cells was less than 50% of that of the control cells. Because the designed siRNAs can induce apoptosis in the K562 cell line, this change in viability suggested that EDAC improves the linkage between QDs and siRNAs. In our study, the MTT data showed that pure QDs alone induced a slight decrease in cell viability (<10%). This observation is not surprising because QD-induced apoptosis and cell death in several cell lines have been reported previously [46]. In summary, these results confirmed that QDs could be effectively linked to siRNAs in the presence of EDAC, which was a prerequisite for the experiments described later in this study.
Also, low concentrations of QDs did not induce substantial cell apoptosis.

3.2 Transfection efficiency of QD-siRNA

To measure the transfection efficiency of QD-siRNA, K562 cells were incubated with red CdTe QDs for 48 h before their fluorescence intensity was evaluated by flow cytometry. Flow cytometry distinguished and counted the red-fluorescent cells among all viable cells. The baseline of this measurement was the region of the graph occupied by the viable, healthy cells in the negative control (nontransfected) sample. The transfected cells were then subjected to the same measuring procedure and the number of red-fluorescent cells was recorded. Compared to nontransfected cells, the fluorescence intensity of the QD-siRNA transfected cells was significantly increased, and the transfected cells accounted for 74.1% of the total cell count. These data indicated that QDs are a potent vector that can be efficiently transfected into K562 cells.

3.3 Optimization of QD concentration for siRNA transfection

To optimize the QD/siRNA ratio for siRNA transfection, we combined different amounts of siRNA with a fixed dose of 13 µM QDs (10 µL). Specifically, a series of bcr/abl siRNA volumes (125 µL, 250 µL, 375 µL, 500 µL, 550 µL and 700 µL, corresponding to 0.25 ng, 0.5 ng, 0.75 ng, 1 ng, 1.1 ng and 1.4 ng siRNA) were co-complexed with the QDs. The MTT assay was applied to evaluate cell viability as described above (Figure 3b). With increasing concentrations of siRNA in the co-culture system, cell viability decreased, suggesting higher transfection efficiency. We found, however, that the best gene silencing effect (cell viability = 22.67%) occurred at a QD:siRNA ratio of 1:55 (1.1 ng siRNA), and that higher amounts of siRNA (over 1.4 ng) did not further increase the transfection efficiency.
These optimization results suggested that the surface area of the QDs available for siRNA to occupy during complexing is limited and can thus be saturated. Meanwhile, we found that pure siRNA does not have the same effect on cell viability as the QD-siRNA complexes do, suggesting the importance of QDs as the co-transfection factor.

In addition, we explored the effects of different concentrations of siRNA in the QD-siRNA complexes on cell viability. In this procedure, amino-modified siRNAs (2 µM and 20 µM) were conjugated onto QDs at a series of final QD concentrations (0.25 nM, 0.5 nM, 0.75 nM and 1 nM). We found that, regardless of QD concentration, the lower siRNA concentration had a stronger effect on cell viability than the higher one, as evidenced by MTT assays (Figure 3b). We believe that these differences reflect different conjugation efficiencies of QDs and siRNA under the conditions used in our study.

3.4 The fluorescence performance of QD after cell transfection

The uptake of bcr-abl siRNA-conjugated QDs was monitored by laser scanning confocal microscopy (LSCM, LS 5 PASCAL ZEISS, Germany). The red fluorescence of the QDs facilitated tracking of the delivery of the conjugated bcr-abl siRNAs. Figure 4a shows the confocal fluorescence image of a K562 cell after incubation with pure red CdTe QDs for 24 h. Several previous studies have shown that QDs are cytotoxic to a range of live cell lines [47, 48]. The image, however, showed that 24 h after QD transfection the morphology of K562 cells was little changed, which suggested that the presence of CdTe QDs had little effect on K562 cells. Figure 4b and 4c show LSCM images of K562 cells at different times after they were transfected with QD-siRNA complexes.
In these two images, a high intensity of red fluorescence emitted by the QDs was detected in a region that appeared to be the nucleus of a K562 cell (marked by green circles). We speculate that the QD-siRNA complexes were specifically recognized by telomerase in the nucleus, which produced the high intensity of red fluorescence in this region. In Figure 4b, individual red QDs can also be identified in the cytoplasm. After 48 h of transfection (Figure 4c), two separate regions of high-intensity red fluorescence (green circle and yellow circle) could be identified in K562 cells, which presumably represented karyorrhexis and blebbing of the cell membrane during apoptosis, respectively.

To investigate the morphology of K562 cells after incubation with the QD-siRNA conjugates, a transmission electron microscope (TEM, JEM-2010, Japan) was used. Figure 5a shows the TEM image of a K562 cell after incubation with pure red CdTe QDs for 24 h, in which the nucleoli and other cellular structures can be clearly identified. Figure 5b and 5c show TEM images of K562 cells incubated with QD-siRNA conjugates for 24 h and 48 h, respectively. In these two figures, cell nuclei were fragmented and had undergone pyknosis. In Figure 5c especially, the structure of the K562 cell was severely disrupted, which indicated the occurrence of apoptosis at later stages of transfection. The high-density areas in Figures 4b-c and 5b-c were proposed to be telomerase in the cell nuclei that was specifically tagged by the QD-siRNA conjugates. With the help of QDs, we were able to obtain high-quality TEM and LSCM images showing the process of siRNA-induced apoptosis in K562 cells.

3.5 The effects of QD-siRNA conjugates on cell apoptosis

To detect QD-siRNA-induced apoptosis, a further analysis by flow cytometry was carried out.
As shown in Figure 7, the K562 cells treated with 50 µL QD-siRNA showed a hypodiploid DNA peak distinct from the diploid DNA peak. This peak results from the reduced DNA content characteristic of apoptotic cells. The percentage of such cells was about 10.7%. In contrast, cells treated with siRNA-free QDs and untreated K562 cells did not show a visible hypodiploid DNA peak; their percentages of apoptotic cells were 6.02% and 5.17%, respectively. These data verified that QD-siRNA induced apoptosis in K562 cells.

4. Discussion

The present study described the generation of quantum dot-siRNA conjugates and their initial application to tracking the delivery of therapeutic siRNAs into cells. Quantum dots were used as the vector and were conjugated with double-stranded siRNAs. These fluorescent NPs offered advantages in tracking and monitoring the delivery of bcr/abl siRNAs and proved valuable in the study of gene delivery. The results of this study suggested that the siRNA efficiently suppressed cell activity and induced apoptosis. In vitro confocal microscopy and flow cytometry studies also revealed that QD-siRNA was readily internalized into cells and accumulated specifically in cell nuclei. As such, using QDs as photostable probes in combination with FACS may be useful for detecting protein down-regulation and the phenotypic responses of cells to gene regulation over time. Meanwhile, recent in vitro and in vivo preclinical findings suggested that RNAi could be considered a therapeutic tool for the treatment of various pathological conditions. Tumors such as leukaemia may particularly benefit from this method. The implementation of RNAi technology in the biotherapeutic approach to cancer has already provided investigators with promising data.
Further studies, however, are still required to understand the intracellular transport and delivery of siRNAs, which may provide the basis for further applications of the QD-siRNA conjugates.

Figure 1 (A) Schematic representation of the QD-siRNA conjugates and their cotransfection into cells; (B) Bcr-abl fusion sequence and designed siRNAs. dTdT indicates the 3’-overhang deoxythymidine dimer.
BCR-ABL mRNA (target mRNA sequence): 5’…UGGAUUUAAGCAGAGUUCAA AAGCCC UUCAGCGGCCAGUA…3’
Designed siRNA, sense strand: 5’-GCAGAGUUCAAAAGCCCUUdTdT-3’
Designed siRNA, antisense strand: 3’-dTdTCGUCUCAAGUUUUCGGGAA-5’

Figure 2 The HPLC result of the nucleic acid conjugation. (a) The HPLC spectrum of the pure nucleotides; (b) The HPLC spectrum of the QD-nucleotide complexes.

Figure 3 (a) Cell viability measured by MTT assays on untreated K562 cells, cells incubated with 50 µL naked QDs, cells incubated with pure siRNA, and cells transfected with QD-siRNA at different siRNA concentrations. The viability of untreated control cells was designated as 100%. (b) Effects of siRNA concentration on the inhibition of K562 cells, measured by MTT assays.

Figure 4 LSCM images of cells cultured with the red quantum dots. (a) LSCM image of a cell cultured with pure QDs. (b)-(c) LSCM images of cells cultured with the QD-siRNA complex.

Figure 5 TEM images of the cells. (a) Normal cells; (b)-(c) apoptotic cells.

Figure 6 FACS analysis of the control K562 cells (a), the QD-siRNA transfected K562 cells (b), and the integrated figure (c). The cells were analyzed for their autofluorescence (negative control) and for the fluorescence induced by QD-siRNA transfection (positive). [Plot labels recovered from the figure: (a) No treatment, labelled 99.4 and 0.58; (b) Sample, labelled 25.9 and 74.1; (c) No treatment and Sample overlaid, labelled 74.1 and 25.9; logarithmic horizontal axis from 10^0 to 10^4.]

Figure 7 Flow cytometry analyses.
DNA fluorescence histograms of propidium iodide-stained K562 cells without treatment and treated with 50 µL pure QDs or the same concentration of QD-siRNA. Apoptotic cells are indicated by the first bar. [Plot labels recovered from the figure: panels No treatment, Pure QDs and QD-siRNA NPs; vertical axis Counts; horizontal axis from 0 to 1000.]

5. References

1. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC: Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998, 391(6669):806-811.
2. Ngo H, Tschudi C, Gull K, Ullu E: Double-stranded RNA induces mRNA degradation in Trypanosoma brucei. Proceedings of the National Academy of Sciences 1998, 95(25):14687.
3. Hamilton AJ, Baulcombe DC: A Species of Small Antisense RNA in Posttranscriptional Gene Silencing in Plants. Science 1999, 286(5441):950.
4. Lohmann JU, Endl I, Bosch TCG: Silencing of Developmental Genes in Hydra. Dev Biol 1999, 214(1):211-214.
5. Kennerdell JR, Carthew RW: Heritable gene silencing in Drosophila using double-stranded RNA. Nat Biotechnol 2000, 18(8):896-898.
6. Zamore PD, Tuschl T, Sharp PA, Bartel DP: RNAi Double-Stranded RNA Directs the ATP-Dependent Cleavage of mRNA at 21 to 23 Nucleotide Intervals. Cell 2000, 101(1):25-33.
7. Valencia-Sanchez MA, Liu J, Hannon GJ, Parker R: Control of translation and mRNA degradation by miRNAs and siRNAs. Genes Dev 2006, 20(5):515-524.
8. Ameres SL, Martinez J, Schroeder R: Molecular Basis for Target RNA Recognition and Cleavage by Human RISC. Cell 2007, 130(1):101-112.
9. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T: Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 2001, 411(6836):494-498.
10. Soutschek J, Akinc A, Bramlage B, Charisse K, Constien R, Donoghue M, Elbashir S, Geick A, Hadwiger P, Harborth J: Therapeutic silencing of an endogenous gene by systemic administration of modified siRNAs. Nature 2004, 432(7014):173-178.
11. Behlke MA: Progress towards in Vivo Use of siRNAs. Mol Ther 2006, 13(4):644-670.
12. de Fougerolles A, Vornlocher HP, Maraganore J, Lieberman J: Interfering with disease: a progress report on siRNA-based therapeutics. Nature Reviews Drug Discovery 2007, 6:443-453.
13. Carmell MA, Zhang LQ, Conklin DS, Hannon GJ, Rosenquist TA: Germline transmission of RNAi in mice. Nat Struct Biol 2003, 10(2):91-92.
14. Devi GR: siRNA-based approaches in cancer therapy. Cancer Gene Ther 2006, 13:819-829.
15. Shankar P, Manjunath N, Lieberman J: The prospect of silencing disease using RNA interference. JAMA 2005, 293(11):1367-1373.
16. Mocellin S, Costa R, Nitti D: RNA interference: ready to silence cancer? J Mol Med 2006, 84(1):4-15.
17. McManus MT, Sharp PA: Gene silencing in mammals by small interfering RNAs. Nature Reviews Genetics 2002, 3(10):737-747.
18. Kim DH, Rossi JJ: Strategies for silencing human disease using RNA interference. Nat Rev Genet 2007, 8(3):173-184.
19. Mescalchin A, Detzer A, Wecke M, Overhoff M, Wunsche W, Sczakiel G: Cellular uptake and intracellular release are major obstacles to the therapeutic application of siRNA: novel options by phosphorothioate-stimulated delivery. Expert Opinion on Biological Therapy 2007, 7(10):1531-1538.
20. Russ V, Wagner E: Cell and Tissue Targeting of Nucleic Acids for Cancer Gene Therapy. Pharm Res 2007, 24(6):1047-1057.
21. Tan WB, Jiang S, Zhang Y: Quantum-dot based nanoparticles for targeted silencing of HER2/neu gene via RNA interference. Biomaterials 2007, 28(8):1565-1571.
22. Stewart SA, Dykxhoorn DM, Palliser D, Mizuno H, Yu EY, An DS, Sabatini DM, Chen IS, Hahn WC, Sharp PA: Lentivirus-delivered stable gene silencing by RNAi in primary cells.
RNA 2003, 9(4):493-501.
23. Tiscornia G, Singer O, Ikawa M, Verma IM: A general method for gene knockdown in mice by using lentiviral vectors expressing small interfering RNA. Proceedings of the National Academy of Sciences 2003, 100(4):1844.
24. Tomar RS, Matta H, Chaudhary PM: Use of adeno-associated viral vector for delivery of small interfering RNA. Oncogene 2003, 22:5712-5715.
25. Yang G, Thompson JA, Fang B, Liu J: Silencing of H-ras gene expression by retrovirus-mediated siRNA decreases transformation efficiency and tumor growth in a model of human ovarian cancer. Oncogene 2003, 22:5694-5701.
26. Juliano R, Alam M, Dixit V, Kang H: Mechanisms and strategies for effective delivery of antisense and siRNA oligonucleotides. Nucleic Acids Res 2008, 36(12):4158.
27. Hornung V, Guenthner-Biller M, Bourquin C, Ablasser A, Schlee M, Uematsu S, Noronha A, Manoharan M, Akira S, De Fougerolles A, Endres S, Hartmann G: Sequence-specific potent induction of IFN-α by short interfering RNA in plasmacytoid dendritic cells through TLR7. Nat Med 2005, 11(3):263-270.
28. Judge AD, Sood V, Shaw JR, Fang D, McClintock K, MacLachlan I: Sequence-dependent stimulation of the mammalian innate immune response by synthetic siRNA. Nat Biotechnol 2005, 23(4):457-462.
29. Xia H, Mao Q, Paulson HL, Davidson BL: siRNA-mediated gene silencing in vitro and in vivo. Nat Biotechnol 2002, 20(10):1006-1010.
30. Morris KV, Chan SWL, Jacobsen SE, Looney DJ: Small interfering RNA-induced transcriptional gene silencing in human cells. Science 2004, 305(5688):1289-1292.
31. Sun HK, Ji HJ, Kyung CC, Sung WK, Tae GP: Target-specific gene silencing by siRNA plasmid DNA complexed with folate-modified poly(ethylenimine). J Controlled Release 2005, 104(1):223-232.
32. Schiffelers RM, Ansari A, Xu J, Zhou Q, Tang Q, Storm G, Molema G, Lu PY, Scaria PV, Woodle MC: Cancer siRNA therapy by tumor selective delivery with ligand-targeted sterically stabilized nanoparticle. Nucleic Acids Res 2004, 32(19):e149.
33. Sun TM, Du JZ, Yan LF, Mao HQ, Wang J: Self-assembled biodegradable micellar nanoparticles of amphiphilic and cationic block copolymer for siRNA delivery. Biomaterials 2008, 29(32):4348-4355.
34. Kam NW, Liu Z, Dai H: Functionalization of carbon nanotubes via cleavable disulfide bonds for efficient intracellular delivery of siRNA and potent gene silencing. J Am Chem Soc 2005, 127(36):12492-12493.
35. Wu X, Liu H, Liu J, Haley KN, Treadway JA, Larson JP, Ge N, Peale F, Bruchez MP: Immunofluorescent labeling of cancer marker Her2 and other cellular targets with semiconductor quantum dots. Nat Biotechnol 2003, 21(1):41-46.
36. Dahan M, Levi S, Luccardini C, Rostaing P, Riveau B, Triller A: Diffusion dynamics of glycine receptors revealed by single-quantum dot tracking. Science 2003, 302(5644):442-445.
37. Tsien RY: The green fluorescent protein. Annu Rev Biochem 1998, 67:509-544.
38. Gao X, Cui Y, Levenson RM, Chung LWK, Nie S: In vivo cancer targeting and imaging with semiconductor quantum dots. Nat Biotechnol 2004, 22(8):969-976.
39. Derfus AM, Chan WCW, Bhatia SN: Intracellular delivery of quantum dots for live cell labeling and organelle tracking. Advanced Materials 2004, 16(12):961-961.
40. Gao X, Nie S: Molecular profiling of single cells and tissue specimens with quantum dots. Trends Biotechnol 2003, 21(9):371-373.
41. Jovin TM: Quantum dots finally come of age. Nat Biotechnol 2003, 21(1):32-33.
42. Jia N, Lian Q, Shen H, Wang C, Li X, Yang Z: Intracellular delivery of quantum dots tagged antisense oligodeoxynucleotides by functionalized multiwalled carbon nanotubes. Nano Lett 2007, 7(10):2976-2980.
43. Zhang K, Ouellette BFF: GAIA: a gram-based interaction analysis tool–an approach for identifying interacting domains in yeast. BMC Bioinformatics 2009, 10(Suppl 1):S60.
44. Scherr M, Battmer K, Winkler T, Heidenreich O, Ganser A, Eder M: Specific inhibition of bcr-abl gene expression by small interfering RNA. Blood 2003, 101(4):1566.
45. Bakalova R, Zhelev Z, Ohba H, Baba Y: Quantum dot-conjugated hybridization probes for preliminary screening of siRNA sequences. J Am Chem Soc 2005, 127(32):11328-11335.
46. Chan WH, Shiao NH, Lu PZ: CdSe quantum dots induce apoptosis in human neuroblastoma cells via mitochondrial-dependent pathways and inhibition of survival signals. Toxicol Lett 2006, 167(3):191-200.
47. Cho SJ, Maysinger D, Jain M, Röder B, Hackbarth S, Winnik FM: Long-term exposure to CdTe quantum dots causes functional impairments in live cells. Langmuir 2007, 23(4):1974-1980.
48. Choi AO, Cho SJ, Desbarats J, Lovric J, Maysinger D: Quantum dot-induced cell death involves Fas upregulation and lipid peroxidation in human neuroblastoma cells. Journal of Nanobiotechnology 2007, 5(1):1.

Appendix C New perspectives in predicting membrane protein-protein interactions

1. Introduction

Cells are multi-molecular entities whose biological functions rely on stringent regulation, both temporally and spatially. This regulation is achieved through a variety of molecular interactions, including protein-DNA interactions, protein-RNA interactions and protein-protein interactions (PPIs) [1]. PPIs are extremely important in a wide range of biological functions, from enzyme catalysis and signal transduction to more structural functions. Owing to advanced large-scale techniques such as yeast two-hybrid and mass spectrometry, the interactomes of several model organisms such as Saccharomyces cerevisiae [2-6], Drosophila melanogaster [7, 8] and Caenorhabditis elegans [9] have recently been extensively studied. Such large-scale interaction networks provide a good opportunity to explore and decipher new information.
However, these large-scale data sets have some limitations: 1) the experimental techniques for detecting PPIs are time-consuming, costly and labor-intensive [10]; 2) the quality of certain datasets is uneven [11]; and 3) technical limitations, such as the requirement to tag proteins of interest, still exist [12]. As a complementary alternative, computational approaches that identify PPIs have been studied intensively for years and have yielded some interesting results.

Proteins with at least one transmembrane domain constitute 20% to 35% of all known proteins, and therefore account for an important fraction of the proteins involved in biological mechanisms. However, for several reasons, research on membrane protein interactions has been lagging behind [13]. First, although the currently available interactomes contain adequate interactions for analysis, the data sets still contain a large number of false positives. For example, compared to a gold-standard data set, protein-protein interactions identified by three frequently used high-throughput methods (yeast two-hybrid [6], tandem affinity purification (TAP) [2] and high-throughput mass spectrometry protein complex identification (HMS-PCI) [3]) yielded very low accuracy, coverage and overlap [14]. Second, some large-scale experimental techniques are biased against membrane proteins. For instance, to test whether proteins interact in a yeast two-hybrid (Y2H) system, they need to be expressed and be present in the nucleus, which may not be their native environment.

A modified version of Y2H called the split-ubiquitin membrane yeast two-hybrid (MYTH) system [15] was developed specifically for detecting interactions between membrane proteins. However, it is still time-consuming and labor-intensive, making it infeasible to generate a complete picture of the interactome of membrane proteins at the current stage.
Several groups have tackled this problem using computational approaches. Miller and colleagues [16] worked on identifying interactions between integral membrane proteins in yeast using a modified split-ubiquitin technique. To address the challenges presented by experimental techniques, Xia and colleagues [17] developed a computational method to predict the interactions between helical membrane proteins in yeast by integrating 11 genomic features such as sequence, function, localization, abundance, regulation and phenotype using logistic regression. However, it suffers from low prediction power and low verifiability against experimental results. In addition to utilizing genomic features to predict protein-protein interactions (PPIs), graph theory based on network topology is an alternative approach to inferring protein-protein relationships from protein interaction networks and has shown interesting results [18, 19]. Our group proposed a method to predict interactions between membrane proteins using a probabilistic model based on the topology of the protein-protein interaction network and that of the domain-domain interaction network in yeast [20].

The objective of this chapter is to provide an overview of recent computational approaches to predicting membrane protein interactions, including a new approach to predicting membrane PPIs developed in our own laboratory. We also discuss the applicability of each computational approach, as well as the strengths, weaknesses and challenges of all of them.

2. Experimental identification of PPIs between membrane proteins

Currently, the yeast two-hybrid (Y2H) system and tandem affinity purification (TAP) followed by mass spectrometry are the two mainstream experimental techniques for identifying protein-protein interactions on a large scale [10]. In the yeast two-hybrid system, a bait protein containing a DNA-binding domain is paired with a prey protein containing an activation domain.
If the reporter gene is expressed, the pair of proteins is inferred to interact, as the interaction allows the activation domain to activate transcription of the reporter gene. An alternative way is to tag a protein of interest and then express it in cells. The tagged protein and its interacting/binding proteins are purified as it binds to a column or bead. After purification, the proteins that interact with the tagged protein are analyzed and identified through SDS-PAGE followed by mass spectrometry. These approaches have provided us with a substantial number of PPIs, which makes it possible to build a more robust interactome of cells.

Apart from the intrinsic limitations of these approaches, such as high false-positive rates and the requirement to tag proteins of interest, both of them are biased against membrane proteins. In the yeast two-hybrid system, expression of the reporter gene indicates an interaction. As the activation of the transcription of the reporter gene takes place in the cell nucleus, participating proteins must be localized to the nucleus. However, membrane proteins are usually located at the cell membrane or elsewhere in the cytoplasm rather than in the cell nucleus, which excludes them from the results of the Y2H system. Due to their chemical properties, membrane proteins are also difficult to manipulate and purify. Therefore, interactions between membrane proteins are less likely to be detected by such approaches.

To overcome the drawbacks of the above methods, an approach called the split-ubiquitin membrane yeast two-hybrid (MYTH) system was first developed by Stagljar et al. [15] and has been further modified in recent years [13, 16, 21]. MYTH is a yeast-based genetic technology for detecting membrane protein interactions in vivo. This system is based on the split-ubiquitin approach, in which protein-protein interactions can direct the reconstitution of two ubiquitin halves.
In such a system (Figure 1), individual proteins are simultaneously introduced into the mutant yeast strain. The carboxy-terminal half of ubiquitin (Cub) and a LexA-VP16 transcription factor (TF) are fused onto the N- or C-terminus of a membrane protein, while the amino-terminal half of ubiquitin bearing an Ile13Gly mutation (NubG-Prey or Prey-NubG) is fused onto the N- or C-terminus of another membrane protein. The protein fused to the Cub and TF is referred to as the bait protein and is typically a known protein that the investigator is using to identify new binding partners. The protein fused to NubG (NubG-Prey or Prey-NubG) is referred to as the prey protein and can be either a single known protein or a library of known or unknown proteins. If the bait protein interacts with the prey protein, quasi-native ubiquitin is reconstituted. Ubiquitin-specific proteases (UBPs) then recognize the reconstituted ubiquitin and cleave at the C-terminus of the Cub, which releases the TF, so that reporter genes such as HIS3, ADE2 and lacZ can be transcribed in the system.

The split-ubiquitin approach has been widely applied and has yielded interesting results. Thaminy et al. [21] identified the interacting partners of the mammalian ErbB3 receptor using the split-ubiquitin approach, which demonstrated the effectiveness of such a system. Miller et al. [16] further applied this approach to construct an array of yeast expressing the fusion of membrane proteins of interest on a large scale. Recently, more applications of the split-ubiquitin approach have been reported. For example, novel interactors of the yeast ABC transporter Ycf1p [22] and the human Frizzled 1 receptor [23] have been identified using this method.

3. Computational prediction of PPIs between membrane proteins

3.1 Multiple evidence-based

Thanks to current advanced techniques, the relationships between genes can be evaluated based on various types of biological data such as protein-protein interaction data, genetic interaction data, gene co-expression data and phylogenetic profiles. These data sets help us better understand gene functions in the context of specific pathways or biological networks and also enable us to discover gene relationships too weak to be detected in an individual data type.

The first attempt to predict interactions between membrane proteins on a large scale was the work of Miller and colleagues [16]. They first generated a set of putative protein-protein interactions between membrane proteins through a modified split-ubiquitin technique. To test how reliable these putative protein-protein interactions were, they employed a machine learning approach, the support vector machine (SVM), to classify interactions into different confidence levels. For training purposes, they compiled a positive training set containing 56 PPIs between membrane proteins from their experimental results and the literature, and a negative training set containing random protein pairs. Besides 10 features derived from the experiments, such as the number of interactions in which the Cub-PLV participates, 8 other genomic features such as Gene Ontology term similarity and co-expression were included as input parameters to the SVM algorithm (Table 1). Finally, they tested 1,985 putative interactions from the experiment using the trained SVM and identified 131 highest-confidence interactions, 209 higher-confidence interactions, 468 medium-confidence interactions and 1,085 low-confidence interactions.

Xia et al. proposed a prediction method that identified 4,145 helical membrane protein interactions by optimally combining 14 genomic features (Table 1) [17].
After a fold enrichment analysis comparing interacting membrane protein pairs with all membrane protein pairs, they found that 11 features are good indicators for predicting interactions. Three features (relative protein abundance, relative mRNA expression and relative marginal essentiality) do not show a statistically significant difference between interacting membrane protein pairs and all membrane protein pairs. The authors compiled a gold-standard positive set by selecting all membrane protein pairs in the same MIPS complex and a gold-standard negative set by pairing all membrane proteins not in the MIPS complexes. They applied both the logistic regression classifier and the Naïve Bayes classifier to the gold-standard data sets using the 11 genomic features. They demonstrated that the integration-based classifier outperforms single evidence-based classifiers. In addition, the logistic regression classifier has a higher true positive rate than the Naïve Bayes classifier.

3.2 Protein primary sequence and structure-based

Helix-helix interactions within a membrane protein or between membrane proteins play a critical role in protein folding and stabilization. Therefore, it is of great importance to test whether a pair of membrane proteins could interact with each other through helix-helix interactions.

Eilers et al. proposed a method to calculate helix-helix packing values at the level of individual atoms, amino acids and entire proteins [24]. They found that packing values could be used to differentiate transmembrane proteins from soluble proteins, as transmembrane helices pack more tightly. Besides packing values, they also demonstrated that the helix contact plot, a method to calculate distances between all backbone atoms of each interacting helix pair, is another feature that can be used to distinguish transmembrane proteins from soluble proteins, because the helix contact plot of transmembrane proteins displays a broader distribution than that of soluble proteins.
This study provides us with a good starting point for predicting interactions between membrane proteins using helix packing and interhelical propensity.

Instead of using physical properties between residues, Fuchs et al. developed an approach to predict helical interactions based on the co-evolution of residues [25]. The underlying hypothesis is that residues within the same particular protein structure tend to be mutated concurrently. They first generated a set of co-evolving residues from seven different prediction algorithms; helix-helix interactions were then predicted by comparing helix pairs to their structural information in the Protein Data Bank (PDB) combined with this set of co-evolving residues. With this approach, interacting helices could be predicted at a specificity of 83% and a sensitivity of 42%. This demonstrates that evolutionarily conserved residues are a valuable feature for predicting membrane protein interactions.

As more and more structural information related to residues becomes available, more sophisticated computational approaches are needed to improve prediction performance. In a recent publication, a two-level hierarchical method based on the support vector machine (SVM) was proposed, in which two layers of SVMs were built [26]. The first layer of SVM predicted contact residues. Three input features were included at this level: residue contact propensity, evolutionary profile and relative solvent accessibility. The prediction of interactions between contact residues was implemented in the second layer of SVM, in which contact residues were used as inputs. Five different features were selected at this level: residue pair contact propensities, evolutionary profile, relative solvent accessibility, helix-helix interaction type and helical length. Tested on a set of 85 interacting helical pairs, 768 contact pairs and 939 contact residues, this method reaches a sensitivity of 67% and a specificity of 95%.
This approach further supports the notion that integrating diverse structural and sequence information with residue contact propensities is a good direction for predicting helix-helix interactions and membrane protein interactions.

3.3 Biological network-based

A network topology-based approach was proposed by our group [20]. It predicts interactions between membrane proteins using a probabilistic model based on the topology of the PPI network and that of the domain-domain interaction (DDI) network in yeast. It has been demonstrated that the more likely a pair of proteins are functionally related to each other, the more likely they are to share interaction partners [27]. Moreover, DDIs have also been shown to be indicators of protein interactions due to the binding of modular domains or motifs [28, 29]. Therefore, we sought to examine the hypothesis that two proteins that share the same interactors may themselves interact with each other. To address this question, we considered the internal protein-protein and domain-domain relationships of a pair of proteins and their PPI partners.

PPI and DDI data from disparate sources were integrated, and a log-likelihood scoring method was then applied to all putative integral membrane proteins in yeast to predict all putative integral membrane protein-protein interactions based on a cut-off threshold. Our approach improves on other predictive approaches when tested on a “gold-standard” data set [20] and achieves a 74.6% true positive rate at the expense of a 9.9% false positive rate. Furthermore, two integral membrane proteins are more likely to interact with each other if they share more common interaction partners. Recently, we proposed an improved approach to predicting membrane PPIs by incorporating one more piece of evidence: gene ontology (GO) semantic similarity.

A scoring model can infer how closely a pair of genes are related in a PPI network.
As shown in previous work, if two proteins interact with a very similar group of proteins, they are likely to interact with each other [3, 30]. Thus, for a given pair of genes, we first mapped them to a pair of proteins, and then found the set of common interactors for this pair and the protein-protein interactions within the whole set of common interactors. A scoring method was employed to calculate the likelihood that a group of genes (a pair of query genes and the whole set of their common interactors) is more densely connected (in terms of the number of PPIs within the group of proteins) than would be expected at random [31]:

(1)

where S is the set of common interactors for a given pair of genes and I is the set of protein-protein interactions among those genes. PI(x, y) is an indicator function that equals 1 if and only if the interaction (x, y) occurs in I and 0 otherwise. For network N, interactions are expected to occur with a high probability β for every pair of proteins in S. In our work, we followed previous knowledge to estimate β and set β to 0.9 [32]. For network N_control, the probability of observing each interaction, c_{x,y}, was determined by estimating the fraction of all control networks with randomly expected degree distribution which also contain that PPI. Comparable control networks were randomly generated by rewiring the interaction network while keeping the same nodes from the same gene set and the same degree for each node, repeating the process 100 times.

If a given pair of proteins has a documented list of DDIs in iPfam, we obtain two sets of domains corresponding to the two proteins. Hence, given a pair of proteins and their common interaction partners, many domain-domain pairs among these sets of domains are possible. A modified model (2) captures dense domain-domain interactions existing in a group of common interactors of a given gene pair.
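The common-interactor and control-network steps described above can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation used in this work: the network is assumed to be a list of undirected protein pairs, double-edge swaps are used as one common way to rewire a network while preserving every node's degree, and the names `common_interactors`, `rewire`, `control_probabilities` and the parameter `swaps_per_edge` are hypothetical.

```python
import random


def common_interactors(edges, a, b):
    """Return the proteins that interact with both a and b."""
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    return (neighbours.get(a, set()) & neighbours.get(b, set())) - {a, b}


def rewire(edges, n_swaps, rng):
    """Degree-preserving rewiring by repeated double-edge swaps."""
    # Store each undirected edge once, as a sorted tuple, without self-loops.
    edge_list = sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_set
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b).
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # would create a self-loop or a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_set


def control_probabilities(edges, n_networks=100, swaps_per_edge=10, seed=0):
    """Estimate c_{x,y} as the fraction of control networks containing (x, y).

    Pairs that never occur in a control network are absent from the result
    and can be treated as having an estimated probability of 0.
    """
    rng = random.Random(seed)
    n_edges = len({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    counts = {}
    for _ in range(n_networks):
        for e in rewire(edges, swaps_per_edge * n_edges, rng):
            counts[e] = counts.get(e, 0) + 1
    return {e: k / n_networks for e, k in counts.items()}
```

The default of 100 control networks mirrors the number of repetitions stated in the text; the number of swaps per network is an arbitrary choice made for this sketch.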
A related log-odds score was used to evaluate the probability that the domain-domain interactions bridging these two genes and their common interaction partners were denser than random, based on the above scoring method:

(2)

Compared to the previous equation, DI(m, n) is an indicator function that equals 1 if and only if the domain-domain interaction (m, n) occurs in I and 0 otherwise; D_x and D_y are the numbers of domains in proteins x and y, respectively; and for network N_control, the probability of observing each domain-domain interaction, c_{x,y}, was determined by estimating the fraction of all control networks with randomly expected degree distribution that also contain that domain-domain interaction occurring between the two proteins.

In order to measure the functional similarity between a pair of proteins, we developed a new scoring approach based on GO terms. Given two groups of GO terms (M, N) representing two proteins, the functional similarity between the pair of proteins was calculated by the following formula:

\[ S_{GO}(M, N) = \frac{\sum_{i=1}^{m} \max_{1 \le j \le n} GO(i, j) + \sum_{j=1}^{n} \max_{1 \le i \le m} GO(i, j)}{m + n} \qquad (3) \]

where M is the set of unique GO terms of protein x; N is the set of unique GO terms of protein y; m is the number of GO terms in the set M; n is the number of GO terms in the set N; and GO(i, j) is the similarity score between GO term i of M and GO term j of N. The similarity scores between pairs of GO terms were computed with the G-SESAME algorithm, an advanced method that measures the semantic similarity of GO terms by considering the locations of the ancestor terms of the two specific terms [33].

To put the above three types of scores together, the final scoring function for a given pair of proteins was then:

\[ S_{final} = S_p + S_d + S_{GO} \qquad (4) \]

For each possible interaction between integral membrane proteins, we calculated three different scores: a PPI score, a DDI score and a combined PPI/DDI/GO score according to (1)(2)(3)(4).
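The GO-based score (3) and the combined score (4) translate directly into a few lines of code. The sketch below is illustrative only: `term_sim` is a hypothetical callable standing in for a G-SESAME term-to-term similarity, the toy similarity values are made up, and returning 0.0 when a protein has no GO terms is an assumption of this sketch rather than something the text specifies.

```python
def go_set_similarity(terms_m, terms_n, term_sim):
    """Best-match average of term similarities between two GO term sets, as in (3)."""
    m_terms, n_terms = sorted(set(terms_m)), sorted(set(terms_n))
    if not m_terms or not n_terms:
        return 0.0  # assumption: unannotated proteins get a similarity of 0
    # For each term in M, take its best match in N, and vice versa.
    best_m = sum(max(term_sim(i, j) for j in n_terms) for i in m_terms)
    best_n = sum(max(term_sim(i, j) for i in m_terms) for j in n_terms)
    return (best_m + best_n) / (len(m_terms) + len(n_terms))


def final_score(s_p, s_d, s_go):
    """Combined score as in (4): the plain sum of the PPI, DDI and GO scores."""
    return s_p + s_d + s_go


# Toy term-to-term similarities (made-up values, NOT G-SESAME output).
_TOY_SIMILARITY = {frozenset(("GO:A", "GO:B")): 0.5}


def toy_term_sim(i, j):
    """Hypothetical stand-in for a real GO semantic similarity measure."""
    return 1.0 if i == j else _TOY_SIMILARITY.get(frozenset((i, j)), 0.0)
```

With the toy values, a protein annotated with {GO:A, GO:B} and one annotated with {GO:A} score (1 + 0.5 + 1) / 3, since each term contributes its best match in the other set.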
This generated a table with 996,166 pairs of proteins, each with three interaction probability scores. We compared the performance of our proposed approach using different types of scores: the PPI score, DDI score, GO score and combined score. A ROC curve was plotted by measuring sensitivity and specificity when tested against the gold-standard data set at different cut-off values. The area under the curve is 0.95 for the combined score, 0.85 for PPI, 0.74 for DDI and 0.8 for GO terms, which indicates the good prediction performance of the proposed scoring method. Better performance can be achieved by using combined scores rather than PPI scores or DDI scores alone. It is estimated that around 5,000 interactions exist between membrane proteins [12]. Based on that number, we achieved an 81.2% true positive rate (sensitivity) at the expense of a 9.9% false positive rate (1 – specificity) for a cut-off score of 455, which predicted 4,531 interactions between integral membrane proteins, about 0.61% coverage of all possible interactions among integral membrane proteins.

The interactome map of integral membrane proteins was built with Cytoscape [34] from the 4,531 predicted protein-protein interactions between integral membrane proteins at the cut-off value of 455 (Fig. 2). 53.4% (281/527) of the proteins in the interactome map contain at least one transmembrane helix according to TMHMM predictions. 80% (392/513) of the interactions within the gold-standard data set overlap with those within the interactome map but account for only 8.4% of the whole interactome of integral membrane proteins.
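The ROC evaluation described above (sensitivity and 1 – specificity at a sweep of cut-off values, summarized by the area under the curve) can be sketched as follows. This is a generic illustration with hypothetical function names and toy inputs, not the evaluation code or data behind the reported figures; it assumes each scored pair carries a boolean label saying whether it is a gold-standard positive.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points over all cut-offs."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Consume every pair tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def rates_at_cutoff(scores, labels, cutoff):
    """True positive rate and false positive rate when predicting score >= cutoff."""
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= cutoff)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= cutoff)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg
```

Calling `rates_at_cutoff` with a single threshold gives the kind of operating point quoted in the text (a true positive rate paired with a false positive rate), while `area_under_curve(roc_points(...))` summarizes the whole sweep.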
By checking the topological properties of the interactome map, we found that most interactions in the gold-standard data set are in the same sub-complex, such as lipid biosynthesis, energy coupled proton transport, protein biosynthesis, protein targeting to mitochondria and ATP synthesis coupled electron transport, which reflects the characteristics of the experiments performed (detecting protein-protein interactions within the same complexes). Our predicted interactions indicate some new members in some complexes such as transport, secretion, vesicle-mediated transport and intracellular transport, which is probably due to false negatives in the experimental methods. One example is that in the group of protein import into nucleus, KAP95 and SSA1 do not interact with other proteins within the group according to the gold-standard data set; however, they both play a critical role in nuclear localization signal (NLS)-directed nuclear transport by interacting with other proteins to guide transport across the nuclear pore complex [35, 36]. Interestingly, we found some new complexes such as peroxisome organization and biogenesis, related to the functions of peroxisome membrane proteins such as peroxisome biogenesis and peroxisomal matrix protein import [37-39].

4. Challenges in predicting membrane PPIs

Complemented by experimental methods, computational approaches provide us with a promising path to reveal a more complete picture of the membrane protein interactome. However, we should be aware of several challenges in predicting membrane PPIs.

First, we still lack reliable membrane PPIs, which makes it difficult to compile a gold-standard data set. Currently, positive interaction data are collected from protein pairs in the same protein complex, and negative interaction data are derived from pairs not in the same protein complex.
The data quality problem arises because the complex data themselves are limited by experimental approaches and contain false positive PPIs. In addition, the complex data are biased against membrane proteins, which makes it difficult to assess the prediction performance of various approaches, owing to the scarcity of membrane PPIs in the gold-standard data set and the small coverage of the membrane interactome. Another concern is that a large amount of negative data may introduce false negatives during training.

Moreover, it is challenging to interpret the prediction results from different approaches. Inconsistency among predicted membrane protein interactions has been observed. For example, Miller and colleagues [16] identified 1,949 putative non-self interactions among 705 integral membrane proteins. Xia and colleagues [17] predicted 4,145 helical membrane protein interactions among 516 proteins. Our group recently predicted 4,660 PPIs between integral membrane proteins using the PPI network and the DDI data [20]. Interestingly, only 79 protein-protein interactions overlap between the results from all three approaches (Figure 3). The reason for these differences among the three large-scale sets of membrane protein interactions may be that each approach focuses on different aspects. The experimental result from Miller et al. is reliable but probably contains false positives and false negatives due to the intrinsic limitations of the experimental techniques they employed. The approach proposed by Xia et al. is more focused on interactions within complexes than on binary protein-protein interactions, so it is prone to predicting interactions in the same complex. Our approach emphasizes the interactions through the topological properties of PPI and DDI networks and appears to improve on the above methods because these interactions are probably important features for membrane protein interactions.
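Overlap counts of the kind compared above can be obtained by treating each predicted interaction as an unordered protein pair and intersecting the sets. The following is a small sketch under that assumption; the function names and the placeholder protein identifiers used in any example call are hypothetical, and no real prediction data are included.

```python
from itertools import combinations


def normalise(pairs):
    """Turn (x, y) pairs into unordered pairs, dropping self-interactions."""
    return {frozenset(p) for p in pairs if len(set(p)) == 2}


def pairwise_overlaps(named_sets):
    """Count shared interactions for every pair of named prediction sets."""
    norm = {name: normalise(pairs) for name, pairs in named_sets.items()}
    return {(a, b): len(norm[a] & norm[b])
            for a, b in combinations(sorted(norm), 2)}


def shared_by_all(named_sets):
    """Return the interactions present in every prediction set."""
    sets = [normalise(pairs) for pairs in named_sets.values()]
    return set.intersection(*sets) if sets else set()
```

Using unordered pairs means that (P1, P2) and (P2, P1) count as the same interaction, which matters when different sources list bait and prey in different orders.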
Better prediction accuracy may be achieved by more sophisticated approaches that incorporate various types of biologically meaningful evidence such as network topological features, protein primary sequences and structural information.

Currently, computational membrane protein interaction prediction is intensively studied but focuses only on yeast. In theory, the methodologies are applicable to a variety of organisms. However, even with the unprecedented increase in heterogeneous biological data, the data for some organisms such as Mus musculus, Drosophila melanogaster and especially Homo sapiens are far from complete. Therefore, prediction approaches based on multiple lines of evidence face the challenge posed by data incompleteness.

5. Conclusions

In this chapter, we reviewed various computational approaches to predicting protein-protein interactions between membrane proteins. In spite of some limitations caused by the incompleteness of existing experimental data, computational methods have demonstrated reasonable prediction accuracy, which makes them good resources for providing testable hypotheses for experimental validation. The emergence of different types of high-throughput data at the systems level prompts us to develop computational methods that identify PPIs between membrane proteins by integrating these data sets. Therefore, complemented by experimental approaches, such prediction methods help us elucidate a cell’s interactome.

Table 1 A list of biological features indicating the interactions between membrane proteins. Xia et al. represents the method proposed by Xia et al. 2006, and Miller et al. represents the method proposed by Miller et al. 2005. A star sign means this feature has been applied to the corresponding approach.
Features  Biological relevance  The number of interactions that the Cub-PLV participates The number of interactions that the NubG participates Whether both spots for a given NubG were found by the Cub-PLV in either repetition Whether repeated screens by using the same Cub-PLV found this NubG The total number of times that this interaction was observed in the screen Whether a reciprocal interaction is observed Whether the reciprocal interaction was tested The total number of times that this interaction was observed in this orientation or its reciprocal The strength of growth of the yeast in the positive colonies The relative strength of growth of the yeast in the positive colonies to the controls.  A membrane protein was proved to interact with other membrane proteins in the experiment. A membrane protein was proved to interact with other membrane proteins in the experiment. A membrane protein was proved to interact with other membrane proteins in the experiment.  Xia et al.  Miller et al ∗ ∗ ∗  A membrane protein was proved to interact with other membrane proteins in the experiment.  ∗  A membrane protein was proved to interact with other membrane proteins in the experiment.  ∗  A reciprocal interaction represents the more reliable interaction. A reciprocal interaction represents the more reliable interaction. A reciprocal interaction represents the more reliable interaction.  ∗  Stronger interactions result in more growth of the yeast.  ∗  Stronger interactions result in more growth of the yeast.  ∗  ∗ ∗  Features  Biological relevance  The mutual clustering coefficients, the meet/min coefficient, the geometric coefficient, and the hypergeometric coefficient The difference in the codon enrichment correlation (CEC) between the two proteins GO functional similarity  High coefficient score indicates interactions.
MIPS functional similarity  Membrane colocalization  Total protein abundance Total mRNA expression Relative protein abundance Relative mRNA expression mRNA expression correlation  Xia et al.  ∗  Interacting proteins might have comparable codon compositions.  A pair of membrane proteins tends to interact with each other if they share very similar Gene Ontology (GO) terms. A pair of membrane proteins tends to interact with each other if they share very similar functional categories as defined in the MIPS database. A pair of membrane proteins tends to interact with each other if they are assigned to the same cellular localization based on the SGD database. A pair of membrane proteins tends to interact with each other if the sum of their protein abundance is high. A pair of membrane proteins tends to interact with each other if the sum of their mRNA expression level is high. A pair of membrane proteins tends to interact with each other if the absolute difference between their protein abundance is low. A pair of membrane proteins tends to interact with each other if the absolute difference between their mRNA expression levels is low. A pair of membrane proteins tends to interact with each other if the  Miller et al ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  ∗  222  Features  Transcriptional coregulation Co-essentiality Total marginal essentiality Relative marginal essentiality Genetic interaction  Gene fusion, phylogenetic profile, gene neighborhood, gene cluster  Biological relevance correlation of their mRNA expression profiles over timecourse experiments is high. A pair of membrane proteins tends to interact with each other if they are related by a same transcription factor. A pair of membrane proteins tends to interact with each other if they both are essential genes. A pair of membrane proteins tends to interact with each other if the sum of their marginal essentiality is high. 
A pair of membrane proteins tends to interact with each other if the absolute difference between their marginal essentiality is low. A pair of membrane proteins tends to interact with each other if they also genetically interact with each other. A pair of membrane proteins tends to interact with each other if they have high score in the Prolinks database representing functional relatedness.  Xia et al.  Miller et al  ∗  ∗  ∗  ∗  ∗  ∗  ∗  223  Figure 1 The split-ubiquitin membrane yeast two-hybrid system.  Two membrane proteins are fused to NubG and Cub-TF, respectively. They both are expressed in different mating type. If two membrane proteins interact with each other upon mating as a diploid, the two halves of ubiquitin reconstitute as a quasi-native ubiquitin, a target of ubiqutin-specific proteases that cleave the ubiqutin. The reporter gene is transcribed if two membrane proteins interact with each other as uniqutin-specific proteases release TF into the nucleus and then actives the transcription of the reporter gene.  224  Figure 2 The interactome map of membrane proteins in yeast.  Nodes are represented as membrane proteins, and edges are represented as our predicted interactions between a pair of membrane proteins. Red nodes represent membrane proteins in the gold-standard data set and red edges represent interactions in the goldstandard data set. This graph is generated by Cytoscape [34].  225  Figure 3 Comparison of the prediction results and involved proteins from three large-scale methods.  There are 438 predicted protein-protein interactions overlapping between data sets from Miller et al.(2005) and Zhang and Ouellette (2008), 79 between Miller et al. and Xia et al.(2006), 372 between Xia et al. and Zhang and Ouellette, respectively. There are 107 membrane proteins overlapping between data sets from Miller et al.(2005) and Zhang and Ouellette (2008).  226  6. References 1. 2.  3.  4. 5.  6.  7.  8.  Kirschner MW: The meaning of systems biology. 
Cell 2005, 121(4):503-504. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637643. 
6. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
7. Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, Jacq B, Arpin M, Bellaiche Y, Bellusci S, Benaroch P, Bornens M, Chanet R, Chavrier P, Delattre O, Doye V, Fehon R, Faye G, Galli T, Girault JA, Goud B, de Gunzburg J, Johannes L, Junier MP, Mirouse V, Mukherjee A, Papadopoulo D, Perez F, Plessis A, Rosse C, Saule S, Stoppa-Lyonnet D, Vincent A, White M, Legrain P, Wojcik J, Camonis J, Daviet L: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15(3):376-384.
8. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736.
9. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543.
10. Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol 2007, 3(3):e42.
11. Stelzl U, Wanker EE: The value of high quality protein-protein interaction networks for systems biology. Curr Opin Chem Biol 2006, 10(6):551-558.
12. Figeys D: Mapping the human protein interactome. Cell Res 2008, 18(7):716-724.
13. Thaminy S, Miller J, Stagljar I: The split-ubiquitin membrane-based yeast two-hybrid system. Methods Mol Biol 2004, 261:297-312.
14. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.
15. Stagljar I, Korostensky C, Johnsson N, te Heesen S: A genetic system based on split-ubiquitin for the analysis of interactions between membrane proteins in vivo. Proc Natl Acad Sci U S A 1998, 95(9):5187-5192.
16. Miller JP, Lo RS, Ben-Hur A, Desmarais C, Stagljar I, Noble WS, Fields S: Large-scale identification of yeast integral membrane protein interactions. Proc Natl Acad Sci U S A 2005, 102(34):12123-12128.
17. Xia Y, Lu LJ, Gerstein M: Integrated prediction of the helical membrane protein interactome in yeast. J Mol Biol 2006, 357(1):339-349.
18. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21 Suppl 1:i302-310.
19. Valente AX, Cusick ME: Yeast Protein Interactome topology provides framework for coordinated-functionality. Nucleic Acids Res 2006, 34(9):2812-2819.
20. Zhang KX, Ouellette BFF: A new approach to predict interactions between integral membrane proteins in yeast. IEEE Congress on Evolutionary Computation 2008:1801-1806.
21. Thaminy S, Auerbach D, Arnoldo A, Stagljar I: Identification of novel ErbB3-interacting factors using the split-ubiquitin membrane yeast two-hybrid system. Genome Res 2003, 13(7):1744-1753.
22. Paumi CM, Menendez J, Arnoldo A, Engels K, Iyer KR, Thaminy S, Georgiev O, Barral Y, Michaelis S, Stagljar I: Mapping protein-protein interactions for the yeast ABC transporter Ycf1p by integrated split-ubiquitin membrane yeast two-hybrid analysis. Mol Cell 2007, 26(1):15-25.
23. Dirnberger D, Messerschmid M, Baumeister R: An optimized split-ubiquitin cDNA-library screening system to identify novel interactors of the human Frizzled 1 receptor. Nucleic Acids Res 2008, 36(6):e37.
24. Eilers M, Patel AB, Liu W, Smith SO: Comparison of helix interactions in membrane and soluble alpha-bundle proteins. Biophys J 2002, 82(5):2720-2736.
25. Fuchs A, Martin-Galiano AJ, Kalman M, Fleishman S, Ben-Tal N, Frishman D: Co-evolving residues in membrane proteins. Bioinformatics 2007, 23(24):3312-3319.
26. Lo A, Chiu YY, Rodland EA, Lyu PC, Sung TY, Hsu WL: Predicting helix-helix interactions from residue contacts in membrane proteins. Bioinformatics 2009, 25(8):996-1003.
27. Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 2003, 5(1):R6.
28. Jothi R, Cherukuri PF, Tasneem A, Przytycka TM: Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions. J Mol Biol 2006, 362(4):861-875.
29. Pawson T, Nash P: Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618):445-452.
30. Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein networks by completing defective cliques. Bioinformatics 2006, 22(7):823-829.
31. Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 2005, 23(5):561-566.
32. Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 2006, 34(Database issue):D169-172.
33. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23(10):1274-1281.
34. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.
35. Denning D, Mykytka B, Allen NP, Huang L, Al B, Rexach M: The nucleoporin Nup60p functions as a Gsp1p-GTP-sensitive tether for Nup2p at the nuclear pore complex. J Cell Biol 2001, 154(5):937-950.
36. Liu SM, Stewart M: Structural basis for the high-affinity binding of nucleoporin Nup1p to the Saccharomyces cerevisiae importin-beta homologue, Kap95p. J Mol Biol 2005, 349(3):515-525.
37. Eckert JH, Erdmann R: Peroxisome biogenesis. Rev Physiol Biochem Pharmacol 2003, 147:75-121.
38. Heiland I, Erdmann R: Biogenesis of peroxisomes. Topogenesis of the peroxisomal membrane and matrix proteins. FEBS J 2005, 272(10):2362-2372.
39. Honsho M, Hiroshige T, Fujiki Y: The membrane biogenesis peroxin Pex16p. Topogenesis and functional roles in peroxisomal membrane assembly. J Biol Chem 2002, 277(46):44513-44524.
