UBC Theses and Dissertations
Identification and exploration of gene product annotation instability and its impact on current usages Sedeño Cortés, Adriana Estela 2014

24-ubc_2014_november_sedeno_adriana.pdf [4.67MB]
Full Text

Identification and exploration of gene product annotation instability and its impact on current usages

by

Adriana Estela Sedeño Cortés

B.Sc., National Autonomous University of Mexico (UNAM), 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2014

© Adriana Estela Sedeño Cortés 2014

Abstract

Proteins are macromolecules responsible for a wide range of activities in the structure and function of cells. Their activities have been described in different contexts as a means to elucidate their "function". These descriptions have been captured across biological databases in a standardized format called Gene Ontology Annotations (GOA), to disseminate the knowledge and extrapolate the information to other proteins whose function is still unknown. Furthermore, the annotations are used to analyse and interpret data from high-throughput studies, and also as a benchmark for the assessment of protein function prediction algorithms. Constant changes occur in GOA that can potentially impact such usages, but only limited effort has been put into exploring their instability, or into assessing the impact that these changes have on the reproducibility or interpretation of previous analyses.

In the present work, I performed the most comprehensive analysis of annotation instability for 14 representative model organisms (E. coli, fruit fly, mouse, etc.). The results showed important instability patterns that were species-specific. As such information would be of use to the community for tracing the instability of annotations of interest, a web-based visualization tool was built to track these changes on a protein-specific, functional-term-specific and species-specific basis.

Additionally, we identified artifacts in the annotation data that can be attributed to curation patterns.
We propose that such artifacts be considered for a more accurate assessment of function prediction algorithms. Furthermore, the impact that changes in the annotations have on common settings like gene set enrichment analyses was also explored. In particular, 2,000 datasets were used to assess the robustness of enrichment results over time. On average, the results would display a 60% similarity after only 2 years. However, cases were found where the similarity would drop 80% within the same year, demonstrating the impact that the instability has on such applications. In conclusion, the results of this work will prove useful for those who use the annotations to interpret their studies, allowing them to assess their reliability on a case-by-case basis.

Preface

The present work was elaborated at the Centre for High-Throughput Biology (CHiBi) in UBC's Michael Smith Laboratories (MSL) under the supervision of Paul Pavlidis.

I am responsible for the data collection, design and code implementation done in this project to pre-process and analyse the data. My supervisor Paul Pavlidis contributed to the study design, supervision and editorial suggestions for all chapters.

Jesse Gillis contributed suggestions for the evaluation of the performance of gene function prediction algorithms.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Scripts
List of Abbreviations and Definitions
Acknowledgements
Dedication
Chapter 1 Introduction
  1.1 The Gene Ontology
  1.2 GO Annotations
  1.3 Uses, Challenges and Assessments of GO
Chapter 2 Objectives
Chapter 3 Methods
  3.1 GOtrack: Pre-processing and Analysis
    3.1.1 Data Collection
    3.1.2 ID Mapping
    3.1.3 Exploratory Analyses
    3.1.4 Database Design, Creation and Management
    3.1.5 Web-based Visualization Tool: GOtrackWeb
  3.2 Proposing a Benchmark for the Assessment of Function Prediction Algorithms
    3.2.1 Data Collection
    3.2.2 GO Term Prevalence in Annotation Data
    3.2.3 Identification of Inferred Electronic Annotations Commonly Reviewed and Re-annotated by Curators in GOA
    3.2.4 Identification of GO Terms Frequently Co-annotated in GOA
    3.2.5 Evaluation of the Performance of Function Prediction Algorithms and the Proposed Benchmark
  3.3 Analysis of the Instability of Gene Set Enrichment Analysis Over Time
Chapter 4 Results and Discussion
  4.1 Exploratory Analyses
  4.2 Utility of Creating a Web-based Visualization Tool: GOtrackWeb
  4.3 The Assessment of Gene Function Prediction Algorithms
  4.4 Instability of Gene Set Enrichment Results
Chapter 5 Future Directions
Bibliography
Appendix

List of Tables

Table 1.1 Evidence codes used in GOA files.
Table 1.2 Attributes of a GO Annotation.
Table 3.1 Target sequences and species considered for the CAFA2 assessment.
Table 3.2 Target sequences and species considered for the CAFA1 assessment.
Table 4.1 The GO terms most frequently used in GOA data.
Table 4.2 Results of the function-centered performance as measured by AUROC.
Table 4.3 Results of the function-centered performance as measured by information content (molecular function ontology).
Table 4.4 Results of the function-centered performance as measured by information content (biological process ontology).
Table 4.5 Classification of gene sets by the number of significant GO terms.
Table 4.6 Classification of gene sets by the number of parental terms.

List of Figures

Figure 1.1 Illustration of the structure of the Gene Ontology graph.
Figure 1.2 Schematic representation of the protocol followed to generate gene annotations.
Figure 1.3 GO annotation statistics.
Figure 1.4 Illustration of changes in UniProtKB entries over time.
Figure 3.1 General overview of the methods and analyses done in the present study.
Figure 3.2 General overview of the mapping procedure.
Figure 3.3 Data pre-processing.
Figure 3.4 General GOtrack pipeline for exploratory analyses.
Figure 3.5 General overview to track commonly upgraded annotations.
Figure 3.6 GOtrack database model.
Figure 3.7 General overview of the GOtrackWeb implementation.
Figure 3.8 Pre-processing steps for enrichment analysis.
Figure 3.9 Pipeline to compare results of enrichment analyses over time.
Figure 4.1 Overview of species-specific biases in manual curation efforts.
Figure 4.2 Total number of gene product IDs for each species.
Figure 4.3 Average number of GO terms directly annotated to gene products across editions.
Figure 4.4 Contrasting shifts found in the number of GO terms assigned to a random set of gene products across editions.
Figure 4.5 Example of the functional instability that gene products have over time.
Figure 4.6 Average values of the semantic similarity of gene products across editions.
Figure 4.7 Exploring the association between gene sets prioritized for curation and multifunctionality.
Figure 4.8 Average score of gene multifunctionality across editions.
Figure 4.9 Average number of inferred terms over time.
Figure 4.10 Electronic annotations that are curated and annotations that are promoted.
Figure 4.11 Changes in the usage of evidence codes for the cellular component ontology.
Figure 4.12 Changes in the usage of evidence codes for the molecular function ontology.
Figure 4.13 Changes in the usage of evidence codes for the biological process ontology.
Figure 4.14 Manually curated annotations for highly popular gene products are also unstable.
Figure 4.15 GOtrackWeb: Main page.
Figure 4.16 GOtrackWeb: Tracing the history of annotations.
Figure 4.17 Results of the performance of function-prediction algorithms as measured by AUROC.
Figure 4.18 Example of an experimentally-derived hit list enriched at different time points showing problems in reproducibility and interpretation of results.
Figure 4.19 Variability of GO term overlap in the gene sets (C3).
Figure 4.20 Variability of GO term overlap in the gene sets (C2).
Figure 4.21 Semantic similarity of enriched gene sets (C3).
Figure 4.22 Semantic similarity of enriched gene sets (C2).
Figure 4.23 Percentage of overlapped genes supporting gene sets (C3).
Figure 4.24 Percentage of overlapped genes supporting gene sets (C2).

List of Scripts

5.1 The main structure of GOtrack, built to pre-process and analyse historical GO annotations.
5.2 An algorithm run by GOtrack to compute one single edition.
5.3 Program to create GOMatrix files, which list genes and the GO terms they are associated with in a particular edition.
5.4 An algorithm to compute semantic similarity based on Jaccard distance.
5.5 An algorithm to map old DB Object IDs to the most current version.
5.6 An algorithm to map old MEDLINE IDs to current PubMed IDs.
5.7 An algorithm to load the information into the database.
5.8 CAFA main algorithm.

List of Abbreviations and Definitions

AUROC      Area Under the Receiver Operating Characteristic Curve
BP         Biological Process Ontology
CAFA       Critical Assessment of Automated Function Prediction
CC         Cellular Component Ontology
ChEBI      Chemical Entities of Biological Interest
DAG        Directed Acyclic Graph
DDBJ       DNA Data Bank of Japan
DE         Differentially Expressed
EBI        The European Bioinformatics Institute
EMBL       The European Molecular Biology Laboratory
GAF        Gene Association File Format
GO         Gene Ontology
GOA        Gene Ontology Annotation
GOC        Gene Ontology Consortium
GPAD       Gene Product Association Data format
GPI        Gene Product Information Format
HPO        Human Phenotype Ontology
MF         Molecular Function Ontology
OBO        Open Biomedical Ontologies
SGD        Saccharomyces Genome Database
SwissProt  Swiss Institute of Bioinformatics Database
TrEMBL     Translated EMBL Nucleotide Database
UBERON     Integrated Cross-species Ontology
UniProtKB  The UniProt Knowledgebase

Acknowledgements

I would like to thank the Centre for High-Throughput Biology and the Michael Smith Laboratories at the University of British Columbia, where this research project was conducted.
A space where thoughts and passion are shared among and beyond its community.

To my supervisor, Paul Pavlidis, Professor in the Department of Psychiatry and Associate Director of the UBC Graduate Program in Bioinformatics: I greatly appreciate all the support, accessibility, guidance, patience to explain things, constructive feedback and critical thinking that made this an invaluable learning experience for me.

To Jesse Gillis, Assistant Professor at the Cold Spring Harbor Laboratory, previously a post-doctoral researcher in the Pavlidis lab, for all his advice, insightful thoughts and guidance.

To Ryan Brinkman and Nobuhiko Tokuriki, members of the thesis committee, for their constructive feedback.

To NSERC and the NIH as funding sources.

To the Gene Ontology Consortium (GOC), the European Bioinformatics Institute-European Molecular Biology Laboratory (EMBL-EBI) and the organizers of the Critical Assessment of Function Annotation Experiment, sources of the data used for this research project.

Dedication

To my family and friends, for all your love and support. Being far from you has been difficult, but you are always in my thoughts. Thank you for all those moments, memories and time spent together, and for supporting me in every step of the way.

"The important thing is not to stop questioning. Curiosity has its own reason for existing. One cannot help but be in awe when he contemplates the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day."
Albert Einstein. From the memoirs of William Miller, an editor, quoted in Life magazine, May 2, 1955; Expanded, p. 281

Chapter 1
Introduction

Proteins are biological macromolecules responsible for a wide range of activities in the structure and function of cells. Research has focused on describing protein activity in different contexts as a means to elucidate their "function".
This information is being captured across biological databases in a standardized format called Gene Ontology Annotations (GOA). The primary reason to create the annotations is to disseminate this knowledge, compare the information across species and extrapolate the information to other similar proteins whose function is still unknown. GOA has become over time a key resource and is increasingly used to analyse or interpret the large amount of data generated from high-throughput studies, and also as a benchmark for the assessment of protein function prediction algorithms.

However, the GO and the GOA are neither complete nor perfect. Multiple changes occur in their structure to better reflect the current knowledge. The variability derived from such modifications is likely to affect the outcome of the current uses, especially the interpretation of biological data, but critical evaluations of the limitations of GOA are limited in number and scope.

To properly assess the usefulness of the annotations for analysing or interpreting biological data, and to identify the limitations of GOA in current applications, it is crucial to understand first: 1) how these annotations are generated; 2) where annotations come from; and 3) what factors influence their changes. Furthermore, the historical information should be accessible to the community for comparison purposes. However, no tool has been developed and made available to conduct such evaluations.

In this thesis, I performed the most comprehensive analysis of the historical changes that have influenced GO and GO annotations, built a visualization tool to make this data accessible for exploration, and assessed the impact that these changes have on current applications.
Even though there is concern within the scientific community about such impacts, only a handful of studies evaluating annotation quality have been published, all with some limitations that I attempt to overcome.

In this chapter, I introduce the background for my research, with an overview of the Gene Ontology and its annotations, their properties and current usages, and describe some of the previous assessments that have been done on this data.

1.1 The Gene Ontology

The Gene Ontology project was created 16 years ago to integrate and facilitate the exploration of the biological information behind different genomic and proteomic studies. The Gene Ontology Consortium (GOC) is a set of genome database organizations and communities that have joined efforts to develop and maintain the Gene Ontology (GO), currently considered the most important ontology within bioinformatics. Its original publication [1] has over 13,371 citations based on Google Scholar (as of September 29, 2014).

The GO describes gene attributes using a standardized vocabulary (terms) in the form of a directed acyclic graph (DAG) [1]. The terms are classified in three independent aspects or domains: 1) Molecular Function Ontology (MF): activities of the gene product within the cell (e.g. binding, receptor, enzymatic or transporter activities); 2) Biological Process Ontology (BP): a series of activities or events that a gene product is involved in within the cell (e.g. cell-cell signalling, locomotion, cell death); and 3) Cellular Component Ontology (CC): subcellular locations and macromolecular complexes within the cell (e.g. membrane, pyruvate dehydrogenase complex, protein storage vacuole).
In each of those domains, terms are represented as nodes (with a name and an identifier or accession number) and are inter-connected with other parental terms (more general entities) and/or children terms (more detailed entities) by edges that represent different relationships:

• is a: represents cases where the child term B is a subtype of the parental term A (e.g. "enzyme regulator activity" is a "molecular function"; "anoikis" is a "apoptotic process").

• part of: represents cases where the child term B implies the presence of the parental term A, but given A we cannot ensure that B exists (e.g. "catalytic activity" part of "metabolic process"; "signal transduction" part of "cell communication").

• has part: represents cases where the parental term A always has the child term B as a part; if A exists, B will always exist (e.g. "protein binding transcription factor activity" has part "protein binding"; "nitrogen utilization" has part "nitrogen compound metabolic process").

• regulates: represents cases where the child term B necessarily regulates the parental term A, but A may not always be regulated by B. The regulation of a process does not need to be part of the process itself. Two sub-relations exist to represent more specific forms of regulation (e.g. "regulation of mesenchymal cell apoptotic process" regulates "mesenchymal cell apoptotic process"; "positive regulation of catalytic activity" positively regulates "catalytic activity"; "negative regulation of M phase" negatively regulates "cell cycle process").

• occurs in: used to link an occurring function or process to a location; process A necessarily occurs in component B (e.g. "mitochondrial RNA processing" occurs in "mitochondrion"; "COPII-coated vesicle budding" occurs in "Golgi membrane").

The three ontologies of GO are each represented by a root term with no common parental node, but their terms can be inter-connected through the part of or regulates relationships. For example, "catalytic activity" is a "molecular function", but is also part of "metabolic process", which is a "biological process" (Figure 1.1).

The GO structure is constantly revised and modified to cover missing links and incorporate new biological knowledge. Some modifications often found are:

1. Extensions: terms being added when missing attributes are identified.

2. Reductions: terms being deleted when definitions are vague or do not accurately represent a biological aspect.

3. Revisions: terms being split, merged, substituted, or moved to a different location within the graph.

4. Cross-products: terms combined through aggregating certain relations, including terms from other ontologies such as the Cell Ontology, Plant Ontology, Uber Anatomy Ontology (UBERON) or ChEBI (Chemical Entities of Biological Interest). Example: "DNA replication" + "occurs in" + "mitochondrion" = "mitochondrial DNA replication" [2, 3].

These modifications are included in each new release of GO. Daily and monthly versions can be found, and different formats are available:

• Basic version: includes is a, part of and regulates (positively and negatively) relationships and excludes those that inter-connect the ontologies. It is the recommended format for GO annotations and is used by most of the GO-based annotation tools available.

• Core version: available in two formats (OBO and OWL-RDF/XML). It is the non-filtered version and includes the has part and occurs in relationships, but excludes relationships to other ontologies. These relationships are recommended to be excluded for propagation, which is important to note as many enrichment tools consider the propagated terms in their results.

• Plus version: includes dependencies on other external ontologies and some inter-ontology relationships.

• GO Slim version: a subset of the ontology created to provide a broad view of the graph; the most granular terms are removed.
This version is often used in many applications as it does not include species-specific terms.

Over time, many different tools have been developed to browse GO or its annotations, each one retrieving one of these different formats. For example, tools like "CateGOrizer" [4] or "GOSlimViewer" [5] use GO Slim versions, whereas "GO::TermFinder" [6] uses the basic version but considers only terms associated to gene products.

Figure 1.1: An illustration of the structure of the Gene Ontology graph. A term can have multiple parental and children terms and different relationships between them. The term that is directly annotated to a gene product is called a "direct GO term", and the terms that can be inferred by propagation to the root node are called "inferred or parental GO terms".

1.2 GO Annotations

With the active collaboration of 36 groups, the GOC releases monthly versions of GO Annotation files (GOA) that capture the association between gene products and GO terms for different species. For a gene product to become annotated, electronic or experimental evidence must indicate that the gene product possesses an attribute, i.e. that it has a particular function, is involved in a certain process, or is located in a cellular component. Then, the most appropriate GO term to reflect that attribute (from the most up-to-date version of the GO graph at the time) is assigned to the gene and annotated with the evidence supporting it (Figure 1.2).

Figure 1.2: Schematic representation of the protocol followed to generate gene annotations. Different members of the GO Consortium link proteins stored in their databases with the GO terms that best reflect their biological attributes (based on certain evidence) and store the relationships in annotation files.

An evidence code is also incorporated into the annotation to indicate whether the source is based on experimental or computational evidence, or on a statement made by an author or curator (Table 1.1).
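The propagation behind the "inferred or parental GO terms" of Figure 1.1 can be sketched over a toy fragment of the graph. The code below is an illustrative sketch, not part of the GOtrack software described later; the term accessions are real GO identifiers, but the fragment mirrors only the "catalytic activity" example above:

```python
# Toy fragment of the GO graph: each term maps to (parent, relationship)
# pairs. Term IDs are real GO accessions, but the fragment is illustrative.
GO_PARENTS = {
    "GO:0003824": [("GO:0003674", "is_a"), ("GO:0008152", "part_of")],  # catalytic activity
    "GO:0008152": [("GO:0008150", "is_a")],  # metabolic process
    "GO:0003674": [],  # molecular_function (root)
    "GO:0008150": [],  # biological_process (root)
}

def inferred_terms(term, relations=("is_a", "part_of")):
    """Return every ancestor reachable through the given relationship types."""
    ancestors, stack = set(), [term]
    while stack:
        for parent, rel in GO_PARENTS.get(stack.pop(), []):
            if rel in relations and parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors

# A direct annotation to "catalytic activity" implies both root terms:
print(sorted(inferred_terms("GO:0003824")))
# ['GO:0003674', 'GO:0008150', 'GO:0008152']
```

Restricting the `relations` argument reproduces the recommendation above to exclude certain relationships (e.g. has part) from propagation.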
Additionally, the qualifiers "NOT", "colocalizes with" or "contributes to" can be added to the annotation to modify its interpretation.

Table 1.1: Evidence codes used in GOA files.

Reviewed by a curator:
  Experimental source
    EXP  Inferred from experiment
    IDA  Inferred from direct assay
    IPI  Inferred from physical interaction
    IMP  Inferred from mutant phenotype
    IGI  Inferred from genetic interaction
    IEP  Inferred from expression pattern
  Computational source
    ISS  Inferred from sequence or structural similarity
    ISO  Inferred from sequence orthology
    ISA  Inferred from sequence alignment
    ISM  Inferred from sequence model
    IGC  Inferred from genomic context
    RCA  Inferred from reviewed computational analysis
  Author statements
    TAS  Traceable author statement
    NAS  Non-traceable author statement
  Curator statements
    IC   Inferred by curator
    ND   No biological data available
Obsolete:
    NR   Not recorded
Not reviewed (electronic source):
    IEA  Inferred from electronic annotation

Users can browse the annotations online through website tools provided by the GOC, such as "AmiGO" [7] and "QuickGO" [8], or retrieve the information from the downloadable annotation files. Third-party tools and sources like "NCBI Gene" [9] are also used to retrieve the information, although these are not necessarily synchronized with the most up-to-date version of GO or GOA.

The Gene Association File (GAF) is the primary format created by the GOC and has had two versions: GAF1.0 (deprecated as of June 2010) and GAF2.0 (July 2010 to present) [10]. A detailed description of each format and their differences is given in Table 1.2. There are important differences between these two versions that most studies assessing annotation history do not address, especially in the protocol used to identify genes and gene products, which is crucial for interpreting annotations for each gene or gene product.
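Because both GAF versions coexist in historical data, a pipeline reading old GOA files has to detect the format first. GAF files declare their version in a header line beginning with "!gaf-version:"; a minimal detection sketch (not taken from the thesis code):

```python
def gaf_version(line):
    """Extract the format version from a GAF header line, e.g. '!gaf-version: 2.0'."""
    prefix = "!gaf-version:"
    if line.startswith(prefix):
        return line[len(prefix):].strip()
    return None  # not a version header (data line or other comment)

print(gaf_version("!gaf-version: 2.0"))   # 2.0
print(gaf_version("!generated-by: GOC"))  # None
```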
However, most assessments or tools only consider one of the two versions, or do not take into account the differences between the formats. While this thesis was being developed, in 2013, a new format was introduced: the Gene Product Association Data (GPAD) file format. This format is a simplified version that contains only annotation data, without information about the gene product (gene names or synonyms), and was proposed as a "more normalized version" that can be used across databases. To collect the gene product information, other formats such as the Gene Product Information (GPI) files were created.

Table 1.2: Attributes of a GO Annotation.

1. DB: Source of the Object ID. GAF1.0 (2001-2010): pre-merge stage, UniProt and Ensembl annotations incorporated in one GOA file. GAF2.0 (2010-current): mostly UniProtKB is used. Examples: UniProtKB, SGD, Ensembl.

2. DB Object ID: Unique identifier for a gene product. GAF1.0: able to refer to particular protein isoforms or post-translationally cleaved or modified proteins. GAF2.0: a top-level primary gene/gene product ID; isoforms no longer valid. Examples: O15072, S000038306, 1-PFK-MONOMER, FBgn0043467.

3. DB Object Symbol: A symbol/ORF name to which the DB Object ID is matched. Present in both versions. Examples: ADAMTS3, FruK, COX1, 064Ya, 14-3-3epsilon.

4. Qualifier: Flags that modify the interpretation of the annotation. Present in both versions. Examples: NOT, contributes to, co-localizes with.

5. GO ID: GO term ID attributed to the DB Object ID. Present in both versions. Example: GO:0031012.

6. DB Reference: Source of the attribution (literature, database or computational reference). Present in both versions. Examples: PMID:22261194, FB:FBrf0174215, SGD_REF:S000050955.

7. Evidence Code: Indicates how the annotation to the GO term is supported. Present in both versions. Examples: TAS, EXP, IGI.
8. With or From: Other gene products to which the annotated gene product is similar or interacts with. Present in both versions. Example: UniProtKB-SubCell:SL-0039.

9. Aspect: Refers to the ontology to which the GO term ID belongs. Present in both versions. Examples: C, F, P.

10. DB Object Name: Name of the gene/gene product. Present in both versions. Example: sonic hedgehog.

11. DB Object Synonym: Alternative gene symbols or previous gene product identifiers associated to the DB Object ID. GAF1.0: previous DB Object IDs would be gradually incorporated. GAF2.0: many that were present in GAF1.0 editions were removed. Example: DPS1_MOUSE|Pdss1|Dps1|Sps1|Tprt|IPI00123984|B8JJW9|Q9WU69.

12. DB Object Type: Describes whether the product is a gene, transcript, protein or functional RNA. Present in both versions. Examples: protein, gene.

13. Taxon: The taxonomic identifier of the organism encoding the gene product. Present in both versions. Example: taxon:9606.

14. Date: The date on which the annotation was submitted to the database (not the date of the GOA file). Present in both versions. Example: 20120228.

15. Assigned By: The database that made the annotation; can differ from DB (column 1). Present in both versions. Examples: BHF-UCL, MGI, UniProtKB, InterPro, RefGenome.

16. Annotation Extension: Contains cross references to other ontologies (e.g. the Cell Type Ontology) and targets of processes/functions, to indicate gene products or chemicals. Present in GAF2.0. Examples: part_of(UBERON:0002084), acts_on_population_of(CL:0000100), has_regulation_target(MGI:MGI:107364), occurs_in(CL:0000057).

17. Gene Product Form ID: Annotates specific variants of the gene product used as the DB Object ID (differential splicing, post-translational cleavage or post-translational modifications). Not in GAF1.0; present in GAF2.0. Example: UniProtKB:A5YKK6-2.

Most of the annotations in GOA files are derived from computational sources (Figure 1.3), mostly because the rate of scientific discovery and publication largely exceeds the amount of information that can be curated and annotated.
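The attributes in Table 1.2 correspond to the tab-separated columns of a GAF 2.0 data line. A minimal parsing sketch is shown below; the column names paraphrase the table, and the sample line is synthetic, assembled from the table's entry examples:

```python
# Column names follow Table 1.2; GAF 2.0 files are tab-delimited with
# 17 columns, and header/comment lines start with "!".
GAF2_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]

def parse_gaf2(lines):
    """Yield one annotation dict per data line, skipping '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        yield dict(zip(GAF2_COLUMNS, line.rstrip("\n").split("\t")))

# A synthetic line built from the entry examples in Table 1.2:
line = ("UniProtKB\tO15072\tADAMTS3\t\tGO:0031012\tPMID:22261194\t"
        "IDA\t\tC\t\t\tprotein\ttaxon:9606\t20120228\tUniProtKB\t\t")
ann = next(parse_gaf2(["!gaf-version: 2.0", line]))
print(ann["db_object_id"], ann["go_id"], ann["evidence_code"])
# O15072 GO:0031012 IDA
```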
To increase the "coverage" of gene products, many sources are constantly pooled together for this task. Many of those inferences are based on the assumption that a marked similarity exists between two proteins that evolved (by duplication or speciation) from the same ancestral sequence (homology).

Features commonly used for this type of assignment include: 1) structural similarity (ISS); 2) sequence similarity (ISO, ISA, ISM, IKR); 3) protein profiles and phylogenetic relationships (IBA, IBD, IRD, IGC); 4) supervised machine learning algorithms based on features from protein sequences (ISM); or 5) high-throughput studies (RCA). When the association is generated electronically but hasn't been reviewed by a curator, the evidence code IEA is assigned.

Computationally inferred annotations are often considered to have limitations in their reliability compared to human-curated evidence [11]. Errors can arise when, for example, proteins have high sequence similarity but different functions, or when they possess a similar function but their sequences are highly divergent. Some of these cases have been identified in the curation process and can be recognized in the GOA files by the NOT qualifier and the evidence code IKR, which is characterized by the lack of key sequence residues. It is important to consider that some such cases are likely to be present in inferred annotations that haven't been revised. Another problem arises in determining the level of GO term granularity that can be assigned to an annotation based on similarity alone [12]. Hence, it is common to find broad GO terms assigned in the annotations, which do not provide insight from a biological perspective, especially when root terms are assigned.

Despite these limitations, annotations are being computationally generated for more than 483,000 taxonomic groups (according to UniProtKB).
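The distinction between curator-reviewed and purely electronic (IEA) annotations recurs throughout this thesis, and can be encoded directly from Table 1.1. A small sketch; the grouping follows the table, while the function names are my own:

```python
# Evidence-code groups transcribed from Table 1.1; IEA is the only
# code that has not been reviewed by a curator (NR is obsolete).
GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author": {"TAS", "NAS"},
    "curator": {"IC", "ND"},
    "electronic": {"IEA"},
}

def evidence_group(code):
    """Return the Table 1.1 group an evidence code belongs to, or 'unknown'."""
    for name, codes in GROUPS.items():
        if code in codes:
            return name
    return "unknown"

def is_curator_reviewed(code):
    """True for every group of Table 1.1 except the electronic (IEA) source."""
    return evidence_group(code) not in ("electronic", "unknown")

print(evidence_group("ISO"), is_curator_reviewed("IEA"))
# computational False
```

Counting codes this way over a parsed GOA edition gives the experimental versus non-experimental proportions summarized in Figure 1.3.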
The GOC has grown considerably since its foundation [13] and currently comprises 32 institutions, specialist groups and major resources, all of which participate collectively in the evolution and implementation of GO and GOA.

Figure 1.3: GO annotation overview. The figures highlight the number of non-experimental annotations compared to the number of experimental annotations across all species. UniProtKB is the largest source of GO annotations [1]. (Figure taken from: August, 2014.)

Each institution or resource generates species-specific annotations and is responsible for updating them when a change is made to the annotation protocols or to the GO structure. However, there are cases where the resource that makes the annotation differs from the institution that provides long-term support. Such cases can be identified in the GAF files (with the DB and Assigned By attributes). Likewise, when the research communities for certain model species do not have an established group committed to long-term maintenance, the annotations are produced through collaborations with the UniProtKB-GO Annotation (UniProtKB-GOA) multi-species group.

Hence, some resources might have a larger or faster curation effort, or might make internal changes to their protocols that affect the annotations they handle. Together, such differences can produce species-specific annotation biases.

The GOA project is predominantly supported by the database UniProtKB, considered the largest source of protein knowledge, with over 80,370,243 protein sequence entries (Figure 1.3). These entries are derived from multiple sources and are classified in two sections:

• UniProtKB/SwissProt: This protein sequence database comprises high-quality, manually reviewed and non-redundant entries and is continuously revised and updated. Each entry contains information about one or more protein sequences derived from the same gene, to avoid redundancy.
Often, entries that are present in the UniProtKB/TrEMBL database are revised and integrated into the corresponding UniProtKB/SwissProt entry. As of July 2014, 546,000 entries for 498,088 species can be found in this database. Most of these entries were inferred from homology (70%) or have evidence at the protein or transcript level (26%); the rest are classified as predicted or putative [14] (Figure 1.4).

• UniProtKB/TrEMBL: This protein sequence database contains all the sequences that are not yet present in UniProtKB/SwissProt. These sequences are derived from public databases such as EMBL, GenBank or DDBJ, but have not been revised. Over 130 databases have also been cross-referenced. As of July 2014, this database comprises 79,824,243 entries. Most of them are bacterial (82%), a smaller proportion are eukaryotic (14%) and the rest are of archaeal or viral origin (5%). Almost 76% of these entries have been predicted, and 23% have been inferred by homology; only a small proportion has evidence at the transcript level (1.18%) or at the protein level (0.58%). Automatically inferred annotations are also assigned to every one of these entries. As mentioned above, when UniProtKB/TrEMBL entries are revised, they are often merged into a matching UniProtKB/SwissProt entry [14].

Figure 1.4: Illustration of changes in UniProtKB entries over time. The figure exemplifies a typical process of revision and upgrades of entries in the UniProtKB database. UniProtKB/TrEMBL entries that have been revised at particular time points are merged into a matching UniProtKB/SwissProt entry. Likewise, UniProtKB/SwissProt entries are revised and updated. In this example, two entries for the “ubiquitin” protein sequences were available back in 1988. The redundancy was eliminated and only one new UniProtKB/SwissProt ID remained. UniProtKB/TrEMBL entries whose sequences were derived from the same gene were gradually merged.
In 2010, four protein sequences were identified as coming from different genes, so the entry representing “ubiquitin” was demerged into four new UniProtKB/SwissProt entries.

1.3 Uses, Challenges and Assessments of GO

The current GO and GOA structure does not aim to cover aspects relevant to mutants or diseases, attributes of sequences, protein-protein interactions, anatomical or histological information, or any feature that is context-dependent (environmental). Also, the annotation format reflects a functional “independence” between gene products [3], but in reality gene products can interact and participate collaboratively in different pathways. Additionally, the incompleteness of the annotations is a concern in the community [15]. Despite this, the usage of GO and GOA for the interpretation of biological data is continuously growing (as observed from querying the Gene Ontology using PubMed Discovery tools: Results by Year graph) [16].

The increase in the number of publications using GO is in part due to the challenge that scientists have faced (ever since microarrays became available [17]) in interpreting the large volume of data generated by high-throughput technologies. In a typical setting, researchers compare experimental conditions and generate a list of differentially expressed (DE) genes. To extract meaning from those long lists, features that are common among them are searched for using gene set enrichment analysis tools.¹ As such tools often base their results on the biological information captured in a particular GOA version, the quality of the annotations acquires even more relevance.

The first exploration was made by Lord et al in 2003 [18]. The authors looked at the quality of GO annotations indirectly, by assessing the validity of using semantic similarity to compare proteins annotated in the SwissProt database at that time.
The validation of their study was based on the hypothesis that proteins with a certain sequence similarity would have similar annotations and that the quality of the evidence codes assigned should be comparable. In their study, they found that some GO annotations were incorrect or inconsistent, which was reflected in a reduced semantic similarity score. After grouping sequences that were similar (based on BLAST searches) and comparing the corresponding similarity scores, they observed that annotations with a TAS evidence code tended to score higher in “similarity” than others.

¹As of August 2014, a PubMed query with the terms “enrichment analysis/analyses” and “gene set/gene-set” returns 2553 related publications.

A year later, annotation quality was explored in bacterial and archaeal genomes. Several genome annotation inconsistencies were found, challenging the common misconception among users that “reliable annotations” could be obtained from sources like EMBL or GenBank [19]. Afterwards, several other groups described similar inconsistencies for other organisms and databases, regardless of their origin (automatically generated or manually curated). This issue further highlighted the need for standardized annotation protocols between research groups [20–22].

Furthermore, an estimation by Baumgartner et al showed that the speed of manual curation at that time was not sufficient to complete the annotation of even the most important model organisms [23], extending the problem not only qualitatively but also quantitatively.

The importance of assessing annotation quality was recognized and different groups started to propose metrics.
For example, Buza et al suggested a quality score based on two features: 1) the level of detail (depth) of the annotation, considering the longest path from the term to its root node; and 2) the evidence code used for the annotation, to which the authors assigned arbitrary rankings. To assess the “overall quality” for a gene product, the authors proposed summing the individual scores of each one of its annotations [24]. The proposed “quality score” has the limitation that both parameters are quite subjective: the length of a path in the graph does not necessarily reflect a term's specificity, and an arbitrary ranking of evidence codes does not necessarily reflect the quality of the source.

A second metric was proposed by Gross et al in 2009. They proposed that the “quality score” for an annotation should be based on five parameters: 1) how many times the evidence codes assigned to the annotation changed across editions (quality); 2) how many editions have been created since the annotation first appeared (age of the annotation); 3) the number of editions where the annotation is present (existence); 4) considering previous editions (without the current one), the “stability”, measured as the number of editions where the annotation kept the same quality relative to its existence; and 5) a “combined stability” assigned per annotation, which is essentially the minimum of the “existence” and “quality” scores [25].

Gross et al do explore (although indirectly) the effects that changes in the ontology structure have on the annotations across editions. However, the authors did not make clear whether they considered the properties of the GOA files. In particular, a specific association (gene product-GO term) can appear in multiple rows of a single GOA file, especially when multiple sources support the relationship.
I raise this concern because they do not trace the source of the annotation, only the evidence code, so an annotation could mistakenly be considered “unstable”. Furthermore, they are unable to assess whether an evidence code changed for a “better option”, as they only quantify the number of times it changed (and removed annotations that had the evidence codes ND or NR from the analyses).

Gross et al concluded that annotations derived from Ensembl were not paired with their corresponding GO releases, often using an older version, and that in general Ensembl annotations were more unstable than those derived from SwissProt. However, they did not explore the changes/updates that tend to occur in the accession numbers assigned to SwissProt entries, with the potential risk of losing track of the gene products. It is important to note that, back in 2009, Ensembl annotations could be distinguished from Uniprot ones even when both were integrated in a single GOA file; but these databases are now merged in the UniProtKB, so the conclusions derived from the “source of the data” cannot be re-explored in current versions.

Just after these assessments were made, the GOC introduced the GO Reference Genome Annotation project, implementing more rigorous annotation protocols. Since then, existing annotations have been revised and replaced with more specific experimental codes. The GOC thus acknowledged that changes in the evidence codes assigned should not be considered statements about the quality of the annotation, especially as some methods or references may have a higher confidence or specificity than others. For example, previous annotations would often be assigned EXP, which is the parental code for IDA, IPI, IMP, IGI and IEP.
However, curators were encouraged to revise old annotations with such a code and replace them with child codes of higher specificity [26].

Changes in the GO structure took place as more emphasis was put on assessing the impact of changes in GO and GOA on applications like enrichment analysis. In particular, Alterovitz et al (2010) proposed modifications to the GO because they identified terms misplaced within the graph that affected the results of enrichment analyses. Such modifications were discussed with the GOC and incorporated afterwards [27]. The quality of computationally inferred annotations also seemed to improve after such changes were made [11].

Some members of the GOC also introduced annotation efforts focused on prioritized gene sets, and the EMBL-EBI explored the impact that such prioritization had on gene set enrichment results. In particular, they observed that more of the GO groups to which such genes belonged could be retrieved [28]. An independent assessment by Clarke et al (2013) also highlighted that changes between GOA versions had a larger impact on the reproducibility of enrichment analysis results over time than changes in the GO structure alone [29].

Changes in GO/GOA also affect widely used tools for enrichment analyses, such as GSEA [30] or DAVID [31]. Tool developers have to keep up and follow the recommendation from the GOC to use the latest version of GOA available [32], but in many cases they have not. Hence, users run their analyses on annotations that are considerably outdated.
Even if users acknowledge the situation, they often forget to cite or check the version when interpreting their findings or for future reference [32].

A different set of tool-related problems arises when tools fail to remove negative associations (those with a “NOT” qualifier) [33]; or when they do not follow the same protocols for mapping gene identifiers, sources and types of relationships within GO, do not incorporate robust statistical analyses, or do not correct for data artifacts [34]. For example, GSEA [30] considers “regulates” relationships in the GO structure within its analyses, whereas ErmineJ [35] only considers “is a” or “part of” relationships and also takes into account artifacts such as gene multifunctionality (i.e., genes which have multiple functions, reflected as the number of GO terms that have been assigned to them). This is particularly relevant because multifunctional genes are often retrieved in the results without necessarily being related to the question of interest [36].

Another gap in the assessment of GO annotation quality was partially filled in 2013, when Gillis and Pavlidis explored the stability of GO annotations by measuring how genes can lose their “functional identity”. In particular, they expressed functional identity in terms of how semantically similar a gene's annotations were across editions. If a gene was most semantically similar to its previous incarnations in the GOA, compared to other genes, then it was considered to retain its “functional identity”. Loss of functional identity is expected as annotations are added, but the rate of this loss had not been previously evaluated. They found that at least 20% of genes can lose their identity after 2 years. They also characterized a circularity problem, where the same publications are used to support protein interaction databases and GO annotations, affecting the applicability of protein-protein interactions for gene function prediction [37].
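As a rough illustration of the “functional identity” idea, the sketch below compares a gene's annotation set in an old edition against every gene's set in a newer edition. It uses plain Jaccard similarity as a simplified stand-in for the semantic similarity measures used by Gillis and Pavlidis; the function names and the edition representation (a dict from gene ID to its set of GO terms) are my own assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two GO term sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retains_identity(gene, old_edition, new_edition):
    """True if `gene` in the old edition is at least as similar to its
    own newer annotations as to any other gene's — a simplified proxy
    (Jaccard instead of full semantic similarity) for the notion of
    retained 'functional identity'.
    """
    self_sim = jaccard(old_edition[gene], new_edition[gene])
    return all(
        jaccard(old_edition[gene], terms) <= self_sim
        for g, terms in new_edition.items() if g != gene
    )
```

A gene whose annotations merely grow keeps its identity under this test, while a gene whose terms are replaced wholesale (or reassigned to another gene) loses it.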
Parallel to the usage of GOA for enrichment analyses, and despite the challenges mentioned above, annotations have also been used in algorithms for gene function prediction and their assessment. For those genes whose attributes are not known or have not been processed by curators, the challenge lies in predicting their “function”, especially because experimental investigation is limited and costly. However, the issues arising from using an incomplete gold standard, such as GO, often make this task more challenging. Huttenhower et al (2009) highlighted some of these problems while assessing how the performance of machine learning algorithms is affected in this context, but also suggested that the methods were still able to make “useful predictions” out of incomplete standards [38].

To assess the performance of function prediction algorithms, the task has been set as predicting GO terms for a set of target genes [39–41]. Such assessments often use, as a benchmark set, recently curated annotations from a subset of those targets. For example, in the CAFA assessment, a 6 to 12 month waiting period (after the submission) is allowed for the accumulation of manual GO annotations. Then, a subset of those “new” annotations is selected for the evaluation, which in turn considers which GO terms were assigned to each target gene. However, Gillis and Pavlidis (2013) criticized this task, as predicting biologically meaningful gene functions may not be equivalent to predicting GO annotations. This is particularly relevant considering that patterns that can be attributed to the curation process have been used to predict “gene function” since 2002.
To give an example, commonly co-annotated GO terms have been proposed as predictive methods [42], and even popular tools like the “GeneMANIA prediction server” utilize such patterns to weight their predictions [43].

In fact, the results from the first Critical Assessment of Automated Function Prediction (CAFA) in 2013 showed that the top-performing methods incorporated existing knowledge of GO or based their algorithms on sequence similarity. As many of the existing computational annotations (either IEA or manually revised) (Table 1.1) are based on sequence similarity and incorporated into the UniProtKB annotation pipeline, many inferred annotations that have not been curated (IEA) can be considered predictions. In their results, BLAST was outperformed by what was called the naïve method, a control that assigns to all target sequences the exact same predictions, based on GO term prevalence in the GOA data. This suggests that something was wrong with the metric: having a control that performs better than a popular sequence-similarity method highlights the impact that artifacts attributable to the curation process have on the performance metrics.

Gillis and Pavlidis (2013) assessed the CAFA results independently using function-centered metrics, i.e., by asking the question “which genes should be assigned to a particular function?”. In fact, when considering a function-centric metric, BLAST was a top-performing method and many of the manually curated annotations were derived from existing electronic annotations (IEA) [37].
This result could be interpreted as an attempt to predict which targets would be curated or upgraded from existing annotations, without considering any biological attribute of the targets, which seems to fall outside the scope of the actual “function prediction” task.

In conclusion, given the increased usage of GO in different scenarios and the constant changes in GO annotations, it is of interest to make a comprehensive assessment of annotation instability and to assess the impact that these changes have on current usages.

Chapter 2

Objectives

In this thesis, I aim to:

• Run exploratory analyses to identify trends and the evolution of GOA over time for 14 different taxa.

• Apply different metrics that can be used to learn more about the instability of particular genes and annotations, including the degree to which GO assignments are distributed unequally for each gene over time, the functional identity of each gene compared to its current state, and the “existence” of an annotation considering its source.

• Assess the impact of these changes, and of artifacts that can be attributed to curation efforts, on applications such as the assessment of protein function prediction algorithms and gene set enrichment analysis.

• Build a visualization tool to extend this information to GO users who want to explore the instability of genes and annotations of their interest.

Chapter 3

Methods

The current project involved several steps. To ensure the quality of the analysis, the first part involved a careful pre-processing of the raw information and the implementation of exploratory analyses to observe changes in the annotation data. The second part involved the design and creation of a database and a website to visualize and make this information available. The third part consisted of the exploration of the impact that changes in the annotation data have on common settings (Figure 3.1).

Figure 3.1: General overview of the methods and analyses done in the present study.
3.1 GOtrack: Pre-processing and Analysis

In this section, I describe the methods used at each step, the database built to store this information and the web-based tool that was implemented to make the data accessible to other users.

3.1.1 Data Collection

To collect all the historical annotations available for each of the 14 species considered, monthly releases of Gene Association Files (“GOA”) in GAF1.0 and GAF2.0 formats were retrieved from the EMBL-EBI FTP website for: Arabidopsis thaliana (thale cress), Gallus gallus (chicken), Bos taurus (cow), Dictyostelium discoideum (slime mold), Canis familiaris (dog), Drosophila melanogaster (fruit fly), Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Sus scrofa (pig), Danio rerio (zebrafish), Caenorhabditis elegans (worm) and Saccharomyces cerevisiae (yeast) [44]. As the EBI repository only contains annotations for yeast and fruit fly dated from 2011 onwards, earlier versions of GOA files for these organisms were retrieved from FlyBase [45] (2006-2011), SGD [46] (2001-2004) and SGD [47] (2005-2010). The editions (or versions) were ordered and renumbered consecutively based on the release date. Data for Escherichia coli was solely retrieved from EcoCyc [48].

To match the annotations with the corresponding versions of the GO graph at each time point, monthly releases of the core version of the Gene Ontology database (termdb-xml files) were collected from the GO repository [49]. Each GOA file was paired with its respective GO version by using their release dates and the date embedded in each termdb file name. In cases where the GOA date did not have a matching termdb file, an earlier version of the termdb was used. The purpose of this matching was to infer parental terms in the GO hierarchy for each GO annotation, considering only “is a” and “part of” relationships and excluding root and obsolete terms.
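The GOA-to-termdb pairing described above (use the termdb matching the GOA release date, or fall back to the most recent earlier one) can be sketched as follows; the function name and the representation of releases as sorted dates are my own assumptions, not the actual GOtrack implementation.

```python
from bisect import bisect_right
from datetime import date

def pair_goa_with_termdb(goa_dates, termdb_dates):
    """Pair each GOA release with the most recent termdb release on or
    before its date, falling back to an earlier termdb when there is no
    exact match. `termdb_dates` must be sorted ascending; a GOA release
    older than every termdb maps to None.
    """
    pairs = {}
    for d in goa_dates:
        i = bisect_right(termdb_dates, d)  # number of termdbs dated <= d
        pairs[d] = termdb_dates[i - 1] if i else None
    return pairs
```

Using the matched termdb (rather than always the latest one) keeps the inferred parental terms consistent with the GO structure that was in effect when each GOA edition was released.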
3.1.2 ID Mapping

As described in the introduction, each GOA file can be created and supported by different databases. In particular, files created with the GAF1.0 format had database-specific accession codes for each gene product (DB Object IDs). When the GAF2.0 format was introduced, some DB Object IDs remained the same, but others were merged, demerged, deleted, replaced or mapped to their UniProtKB DB Object ID equivalents. These changes were implemented at different time points for each species, and some databases, like SGD, changed their internal DB identifiers even before the GAF format changed. As these changes are of major importance for tracking the historical annotations of each gene product, I implemented a procedure to map the identifiers in a robust manner (Figure 3.2 and Script 5.5).

The procedure involved the retrieval and integration of information from different sources:

• “Mapping Files” provided by the UniProtKB, which map lists of identifiers from external databases to UniProtKB accession IDs [50].

• Three custom dictionaries created from the information currently available in the UniProt documentation for E. coli, yeast and fruit fly. The dictionaries map UniProt/SwissProt entries, with gene designations, ordered locus names, SwissProt primary accession numbers, entry names and cross-reference accession numbers, to the original accession IDs assigned by EcoliWiki, SGD and FlyBase, respectively.

• A custom dictionary to track UniProt accession numbers that were once “primary” accessions and later became “secondary” because of a merging or demerging event. Before 2010, during the transition period from the GAF1.0 to the GAF2.0 format, the secondary IDs were normally incorporated as “synonyms” in the annotations. In subsequent GAF2.0 format files, these “synonyms” were removed from the annotations.
The last GOA edition of each species in which these secondary accession numbers were still found as synonyms was selected for the creation of another custom dictionary: human (edition 105); Arabidopsis (edition 56); chicken (edition 53); cow (edition 46); mouse (edition 69); E. coli (edition 95); fly (edition 30); rat (edition 72); Dictyostelium (edition 25); dog (edition 25); zebrafish (edition 57); worm (edition 25). The information stored in “DB Object Synonym” was collected and mapped to its primary “DB Object ID”.

• Automated queries to the UniProtKB website were also implemented to keep the UniProtKB DB Object IDs (primary accession numbers) as up to date as possible.²

Some protein sequences and their corresponding accession numbers are deleted from UniProtKB and disappear in subsequent GOA files.³ The deletions occur when entries correspond to open reading frames (ORFs) or pseudogenes wrongly predicted to code for proteins [14].

A mapping procedure was also implemented to track “DB Reference Objects” (sources of annotation, mostly publications) from old GOA files (GAF1.0 format). Earlier versions were found to incorporate obsolete MEDLINE IDs that in subsequent editions were replaced by PubMed IDs. The mapping files used for this process were retrieved through the National Library of Medicine [51] (Script 5.6).

Finally, old DB Object IDs and DB Reference Objects were updated in the GOA files for further analyses.

²Changes between primary and secondary accession numbers can also be found on

Figure 3.2: General schema of the mapping procedure.

The analysis by Gillis and Pavlidis (2013) only considered gene products that were consistently present in all the GOA file versions.
No mapping procedure was implemented and the instability of the identifiers was not considered.

3.1.3 Exploratory Analyses

After mapping all the identifiers, I aimed to consider as many gene products as possible, while discarding those considered “mistakes”, i.e., those disappearing from the GOA over time. For this reason, a series of lists were generated to identify:

• All Terms: all the GO terms that have been used at least once across GOA editions;

• Terms Always Present: GO terms that were present across all GOA editions;

• All Genes: all the DB Object IDs that have been used at least once (after mapping) across GOA editions;

• Genes Almost Always Present: DB Object IDs representing “genes” that are almost always annotated across GOA editions (with a user-defined threshold).

The implementation gives the option of focusing only on gene products that are always present, but including those that were not always present makes it more flexible and powerful. Therefore, in the analyses I considered “gene products almost always present”, with the threshold set when running the analysis: in particular, gene products present in at least 85% of the GOA editions.

A series of metrics was implemented to assess the instability of the annotations for the gene products “almost always present” (Figures 3.3 and 3.4, Scripts 5.1 and 5.2):

• Semantic similarity: For each GOA edition, a hash table (an associative array called gomatrix) was implemented to trace the GO terms directly associated with the gene products “almost always present”. Then, an assessment of how “functionally similar” each one of these gene products is to itself was conducted by comparing the gomatrices from previous editions vs.
the current one (Scripts 5.3 and 5.4).

• Multifunctionality: Multifunctional genes in the last edition were identified and ranked per species. Likewise, a multifunctionality score for each gene product (using ErmineJ) was calculated per GOA edition. Gene products that have been prioritized for annotation by the GOA tend to have more GO terms assigned. This metric does not reflect that certain gene products are more biologically relevant than others, but that they tend to be more studied or annotated. Differences among gene products in their multifunctionality have an important effect [36]. The score is the area under the receiver operating characteristic curve (AUROC, a comparison of the true positive rate and false positive rate at various threshold settings) obtained by comparing the genes that are members of a GO group to the ranking provided by the “GO term membership”.

Figure 3.3: General GOtrack pipeline to pre-process the data for any species.

• GO term membership: A way to assess GO term “popularity” is to count how many gene products have been associated with each GO term over time. This metric can also be interpreted as how prevalent a GO term is in GOA at each time point. This quantitative measure can also indirectly reflect when terms are incorporated into or discarded from the graph, or when the GOC decided a term was no longer suitable for annotations.

Figure 3.4: General GOtrack pipeline for exploratory analyses.

• Source Instability: If the sources supporting an association (such as experimental publications) are robust, then even if the GO terms assigned to reflect a given “function” change, these sources should remain linked to each gene product across editions.
To explore this hypothesis and assess whether there is also instability in the sources used, a “Publication history” analysis was made by linking each gene product with a publication ID (supporting at least one of its annotations), along with the release date of that paper, and tracing their connection across all editions. Hence, one can observe when a source was first used and, if applicable, when it was discarded. Tracing the date of publication is also useful to visualize the age of the sources supporting GOA on a global scale.

• Evidence code Instability: The usage of evidence codes to reflect the source of an annotation has changed over time. Since the GO Reference Genome annotation effort was established, annotations have been revised to assign more specific experimental codes. The guide to best practices for GO manual annotation also suggests that annotations with TAS codes should be replaced with codes that reflect published experimental results [52]. Hence, the evidence codes assigned to each annotation (gene product + GO term + PubMed ID) were traced across editions (“Evidence code history”) to visualize such changes on a case-by-case basis.

• Number of direct GO terms annotated per gene product: The total count of GO terms directly annotated to each gene product was traced to assess quantitatively whether it gains or loses terms over time.

• Number of propagated GO terms: The total count of GO terms that can be inferred from those directly annotated to each gene product was traced. This metric is used to assess whether the gene product gains or loses terms over time because of changes in the GO structure.
The propagation was made using ErmineJ, and only “is a” and “part of” relationships were considered.

• Number of promoted annotations: When annotations are revised by curators they can disappear; one of the reasons is that the GO terms assigned are replaced with a more granular GO term that better reflects or supports the association. Annotations whose original GO terms were “promoted” to a more granular (child) term were identified and counted across editions (Figure 3.5).

Figure 3.5: General overview to track commonly upgraded annotations.

3.1.4 Database Design, Creation and Management

A database was created to store annotation data and retrieve the information (Figure 3.6). There are tables that store general information for all the species and tables that store specific information for each species (Script 5.7).

The tables that contain information for all the species are:

• popularGenes: Stores the query history of GOtrackWeb users across all species. It aims to provide an idea of the usage of GO and which genes are of popular interest.

• edition to date: Stores the release date of each GOA file per species.

• species: Stores a catalog of all the species analyzed.

• GO names: Stores the GO term accession IDs and their corresponding GO names (human-readable names) for each termdb file over time. Therefore, if a GO term changed its name, previous names can be retrieved.

• unique go functions: Stores the relationship between the GO term accession ID and its most recent GO name.

• avgAllSpeciesCount: Stores, for each species and edition: 1) the average number of GO terms directly annotated to the DB Object IDs (gene products); 2) the average multifunctionality score; 3) the average semantic similarity score of the DB Object IDs (gene products) in
that edition with respect to the current one; 4) the average number of parental GO terms that can be inferred from the annotations; and 5) the total number of DB Object IDs (gene products) that are present in each edition.

• annotAnalysisTab: Stores for each DB Object ID (gene product) and edition per species: 1) the total number of annotations that have been promoted from an IEA evidence code to a curated evidence code; 2) the total number of annotations that have been promoted to a more granular GO term; and 3) the average number of GO annotations that have a negative NOT qualifier with respect to the total number of DB Object IDs (gene products) that have at least one negative annotation.

The tables that contain information for each species are:

• species gene annot: Stores the information from the GOA files: DB Object ID (gene product), GO term, evidence code, PubMed ID, taxon, DB Object symbol, GO term name, Ontology.

• species replaced id: Stores the relationships between the original DB Object IDs assigned to the annotations and the new DB Object IDs (gene products) they were replaced with during the mapping process.

• species evidence code: A simplified version of "species gene annot" where the PubMed identifiers have been cleaned. It was created to make queries faster. Stores the information: DB Object ID (gene product), GO term, PubMed ID, evidence code, edition.

• species count: Stores the pre-processed information for each gene and edition. Contains: DB Object symbol, total number of GO terms directly annotated, total number of GO terms inferred, the multifunctionality score, and the semantic similarity (Jaccard) score.

• species gene per go: Stores information about how many DB Object IDs (gene products) belong to a particular GO term (GO term membership) per edition.

• species avg: This table is created dynamically after all the information has been inserted into the database.
It stores the average value of the columns present in species count and annotAnalysisTab for each species.

• species unique gene symbol: This table is created dynamically from "species gene annot" after the data for one species has been inserted. It contains the relationship between the DB Object symbols and the DB Object IDs (gene product accession IDs).

Figure 3.6: GOtrack database model. See main text for description.

3.1.5 Web-based Visualization Tool: GOtrackWeb

A website was designed and implemented (Figure 3.7) and is now publicly available. On the main page, the top 10 queries made by the users and the top multifunctional gene products per species from the last edition available are displayed.

The main page was designed for users to query the historical information for a gene product of their interest. The query allows the use of UniProtKB accession number IDs, synonyms or DB Object symbols (gene symbols). If the user decides to use a symbol, all the DB Object IDs (accession numbers) that match (whether these are UniProtKB/SwissProt or UniProtKB/TrEMBL) are retrieved. If the user queries a UniProt accession number, the specific information will be retrieved. Obsolete IDs can also be queried; if the annotations from an obsolete ID are assigned to a newer ID, the information associated with the most recent accession number is retrieved. To make the query species-specific, the user must select a particular species from the list and click on the Search button. The program on the back-end does the following:

1. Searches first for the gene product (DB Object IDs) in the db table "species unique gene symbol" (list1).
2. Searches for each element from list1 in the db table "species replaced id" to retrieve the most recent DB Object IDs available (list2).
3. Retrieves the data stored in table "species count" for each element in list2.
4. Retrieves the data stored in table "species avg" for each element in list2.
5. The DB Object IDs found for the queried gene product are displayed above, linked to the UniProt website for more information.
6. A dynamic plot using the retrieved data is generated in the section "Count history", which includes the total number of inferred and direct GO terms annotated to each ID from list2 across editions, as well as its multifunctionality score and semantic similarity, which can also be compared with the average values for the queried species.
7. Updates the table "popularGenes", inserting the query made by the user.
8. Retrieves the corresponding GO terms and GO names annotated to each DB Object ID found in the db tables "GO names" and "species gene annot", separated by Ontology and displayed on the userRequest.xhtml web page.

The user can then explore the annotations associated with the queried gene by changing to the tab named "Functionality". This tab contains a browsable list of all the GO terms that have ever been assigned to that gene product, separated by Ontology. The user can select those that are of interest and click "continue". The page is redirected to functionality.xhtml with the tab "Evidence code history", which displays the historical existence of the annotation. A dynamic plot is displayed, coloured by the type of evidence code used. Each row corresponds to one annotation (DB Object ID + GO term + evidence code + source (PubMed)). The user can also click on a table with link-outs to access the papers that were used to support the associations. Another tab named "GO term membership" is also incorporated. In this section, the total number of DB Object IDs (gene products) annotated to each of the selected GO terms is displayed on a dynamic plot. Alternatively, on the main website, the user can simply search for a GO term ID, and only the plot with the GO term membership is displayed.

A different section of the website was created to provide a general panorama for each species.
The section is called "Global Trends" and can be accessed through the top panel. Two tabs are displayed, one allowing the user to select two species for comparison. Two dynamic plots are generated.

The first one retrieves information from the table "avgAllSpeciesCount" and displays for each edition and species: the average number of direct GO terms; the average multifunctionality score; the average number of parental GO terms; the average semantic similarity score; and the total number of gene products (measured by DB Object IDs found after the mapping process).

The second plot is generated from table "annotAnalysisTab" and displays for each edition and species: the total number of annotations that were replaced with a more granular GO term; the total number of electronic annotations that were revised (switching the evidence code IEA to another evidence code); and the average number of GO annotations that have a negative association (those with the NOT qualifier) relative to the total number of gene products that have at least one negative annotation in each edition.

To explore the overall historical data for one species, the information stored in "species avg" is retrieved and displayed per edition: average multifunctionality score, total number of annotations promoted from IEA to a manual evidence code and from a general to a more granular GO term, average number of direct GO terms, and the total number of unique DB Object IDs (gene products). As there were limitations in the visualization plot, to aid the comparison of changes in the multifunctionality score with other parameters, this score was multiplied by 10,000,000.

The components used to develop GOtrack and the web-based tool were: Eclipse Juno, Maven 3.1.0, NetBeans 7.3.1, mysql-connector-java-5.1.18, PrimeFaces 4, JSF 2.2, Apache Tomcat and the Google Charts API.

Figure 3.7: General overview of the GOtrackWeb implementation.
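The per-gene semantic-similarity score mentioned above is a Jaccard index between a gene product's GO-term sets in two editions; a minimal sketch (the GO IDs and sets below are arbitrary examples, not actual GOtrack data):

```python
def jaccard(terms_then, terms_now):
    """Jaccard similarity between two GO-term sets; 1.0 means identical annotation."""
    if not terms_then and not terms_now:
        return 1.0
    return len(terms_then & terms_now) / len(terms_then | terms_now)

# A gene with three terms in an old edition, two of which survive today:
old_edition = {"GO:0005634", "GO:0005737", "GO:0006412"}
new_edition = {"GO:0005634", "GO:0005737", "GO:0016020"}
jaccard(old_edition, new_edition)  # 0.5 (2 shared terms out of 4 distinct)
```

Averaging this score over all gene products in an edition gives the per-species curves shown in the plots.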
3.2 Proposing a Benchmark for the Assessment of Function Prediction Algorithms

As mentioned in the introduction, GO and GOA are used in the context of gene function prediction and in the evaluation of the performance of such algorithms. Currently, no reliable "gold standard" is available to use in this context, but GO annotations that are freshly curated are used as a benchmark set.

However, we have to consider the limitations that the annotations have for this task. The task has been defined as predicting GO terms for a set of target genes. This task can be interpreted as an assessment of the participant's ability to predict curation activity, especially as the functional information for some of the target gene products might already be published but just has not been captured in the annotations (post-dictions). Moreover, the "accumulation period" of only 6 months is a limiting step, as few actual functional discoveries would be made and annotated within the same evaluation time point. With this in mind, it is also important to consider that certain curation patterns can be found in GOA, such as: GO terms frequently used, GO terms commonly upgraded, or GO terms often co-annotated. Algorithms attempting to "predict" curation activity might use such patterns to increase their "performance". Gillis and Pavlidis (2013) observed that in the "state of the art" publication by CAFA, the participating algorithms do exploit this information [53]. However, such artifacts do not have any biological relevance and should be subtracted.

In the present study, we identify, use and propose these parameters as a baseline for a better assessment of function prediction algorithms. To test the actual "performance" of such artifacts, we submitted the "predictions" inferred from these patterns to the CAFA2 assessment, but as of September 2014, the results have not been made available to the participants.
In the meantime, I elaborated an independent analysis of the performance using target sequences and old predictions submitted by participating algorithms of the CAFA1 assessment (Script 5.8).

3.2.1 Data Collection

Targets from the current (2013) and previous (2011) CAFA assessments were used as gene sets to study patterns associated with the GO curation process (Tables 3.1 and 3.2). GOA files were retrieved for the selected species for the CAFA (2013) assessment.

• CAFA 2013-2014 target sequences were retrieved from:
• Annotations (GOA) for A. thaliana, D. discoideum, H. sapiens, M. musculus, R. norvegicus, S. cerevisiae, D. rerio and E. coli were used from the GOtrack analyses.
• S. pombe annotations were retrieved from:
• X. laevis annotations were retrieved from: \%3a8355&format=*
• H. pylori, M. genitalium, S. enterica, P. syringae, P. putida, S. pneumoniae and B. subtilis annotations were retrieved from:
• P. aeruginosa annotations were retrieved from:
• M. jannaschii, I. hospitalis, N. maritimus, H. salinarum, S. solfataricus and H. volcanii annotations were retrieved from:
• P. furiosus annotations were retrieved from: \%3a186497&format

Table 3.1: Target sequences and species considered for the CAFA2 assessment.

Species ID  Organism                                      Domain    No. Targets
3702        Arabidopsis thaliana                          Eukarya   12069
44689       Dictyostelium discoideum                      Eukarya   4126
9606        Homo sapiens                                  Eukarya   20257
10090       Mus musculus                                  Eukarya   16613
10116       Rattus norvegicus                             Eukarya   7854
559292      Saccharomyces cerevisiae                      Eukarya   6621
8355        Xenopus laevis                                Eukarya   3365
284812      Schizosaccharomyces pombe                     Eukarya   5089
7955        Danio rerio                                   Eukarya   2885
7227        Drosophila melanogaster                       Eukarya   3195
224308      Bacillus subtilis subsp. subtilis 168         Bacteria  4188
83333       Escherichia coli K12                          Bacteria  4431
85962       Helicobacter pylori ATCC 700392               Bacteria  581
243273      Mycoplasma genitalium ATCC 33530              Bacteria  483
208964      Pseudomonas aeruginosa PAO1                   Bacteria  1245
160488      Pseudomonas putida KT2440                     Bacteria  693
223283      Pseudomonas syringae pv. tomato str. DC3000   Bacteria  675
321314      Salmonella enterica                           Bacteria  88
99287       Salmonella typhimurium                        Bacteria  1771
170187      Streptococcus pneumoniae TIGR4                Bacteria  502
478009      Halobacterium salinarum R1                    Archaea   267
309800      Haloferax volcanii DS2                        Archaea   93
453591      Ignicoccus hospitalis KIN4/I                  Archaea   125
243232      Methanocaldococcus jannaschii DSM 2661        Archaea   1787
186497      Pyrococcus furiosus DSM 3638                  Archaea   480
273057      Sulfolobus solfataricus P2                    Archaea   448
436308      Nitrosopumilus maritimus strain SCM1          Archaea   91

3.2.2 GO Term Prevalence in Annotation Data

The prevalence of terms in GOA is considered both in CAFA1 and in the present study as a baseline (naïve method or null). Some GO terms are noticeably more used than others, especially those that are "generic". Prevalence was computed using the last GOA edition available (before the CAFA2 or CAFA1 submission deadline). Annotations were propagated and the size of each GO group was determined. Prevalence was computed separately for each taxon (at the species level, except for bacteria and archaea, within each of which annotations were pooled). The assessment allowed the assignment of up to 1500 predictions per target sequence. Hence, the top 1500 most prevalent GO terms were assigned to each and every target sequence after filtering out those that were already assigned as manual annotations for each gene product.

Prevalence provided the initial scores for each target-GO term association.
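The prevalence baseline described above reduces to counting how often each (propagated) GO term occurs in a taxon's annotations, ranking the terms, and assigning the top of the ranking, minus terms already manually annotated, to every target. A toy sketch (the annotation data is invented):

```python
from collections import Counter

def prevalence_predictions(goa, already_annotated, top_n=1500):
    """Rank GO terms by annotation frequency; return (term, score) predictions."""
    counts = Counter(term for terms in goa.values() for term in terms)
    total = sum(counts.values())
    return [(term, n / total)
            for term, n in counts.most_common(top_n)
            if term not in already_annotated]

# Toy GOA slice: three gene products with their propagated terms.
goa = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:A"}, "P3": {"GO:A", "GO:C"}}
preds = prevalence_predictions(goa, already_annotated={"GO:B"})
# GO:A ranks first (score 3/5 = 0.6); GO:B is filtered out.
```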
As simply predicting commonly used terms yields surprisingly strong performance according to Gillis and Pavlidis [53], this score would only be replaced by the IEA-upgrade or the co-occurrence methods if they produced a stronger prediction than prevalence.

3.2.3 Identification of Inferred Electronic Annotations Commonly Reviewed and Re-annotated by Curators in GOA

Electronic annotations for each species were identified within a two-year interval (12-2008 to 12-2012). Later dates were then used to check for annotation upgrades, which could be the same GO term but with a manual evidence code assigned, or an update to a child term with a manual evidence code. The data for each species was processed independently. The promotions were translated into probabilities by pooling the frequency with which each term is upgraded across all the taxa.

3.2.4 Identification of GO Terms Frequently Co-annotated in GOA

The probability of co-occurrence of GO terms was calculated based on the conditional likelihood of getting a GO term "B" given that a gene already has GO term "A" assigned with a manual evidence code. These probabilities were calculated by pooling gene annotation data from all the taxa, considering an interval of two years before the submission deadline.

Given the GO terms A and B, the correlation matrix M is defined as the matrix whose entries M[A,B] are the number of gene products (UniProt accession IDs) that have annotations to A and B simultaneously (integrating GOA annotations for all species from 12-2008 until the submission deadline), i.e. freq(A ∩ B).

Step 1: freq(A ∪ B) = M[A,A] + M[B,B] − M[A,B]
Step 2: P(A ∩ B) = freq(A ∩ B) / freq(A ∪ B) = M[A,B] / freq(A ∪ B)
Step 3: P(B|A) = P(A ∩ B) / P(A)

3.2.5 Evaluation of the Performance of Function Prediction Algorithms and the Proposed Benchmark

To assess the performance of the methods, predictions were generated using the same pipeline described above, but simulating the predictions that would likely have been assigned if we had participated in the CAFA1 assessment. The results were later compared with those of other algorithms submitted to CAFA1, which were provided to us anonymously.

The final list of predictions used as the "gold standard" by the organizers of CAFA1 was also considered to assess the performance of our methods. The gold standard list included 866 targets from a total of 48,298 target sequences initially set for the assessment and 1,876 annotations (which, after propagation, formed a total of 16,888 gene-GO term relations). Only the BP and MF ontologies were considered. A filtered "gold standard" list was also considered using only those GO terms that had 10 to 100 members in the true positive list (after propagation).

Table 3.2: Target sequences and species considered for the CAFA1 assessment.

Taxon ID  Organism                  Number of targets
10090     Mus musculus              231
10116     Rattus norvegicus         45
3702      Arabidopsis thaliana      86
44689     Dictyostelium discoideum  2
8355      Xenopus laevis            16
9606      Homo sapiens              285
4932      Saccharomyces cerevisiae  51
1423      Bacillus subtilis         16
83333     Escherichia coli K12      153
287       Pseudomonas aeruginosa    2
1313      Streptococcus pneumoniae  25

The primary metric used for the CAFA1 assessment is gene-centric and is called the "CAFA score". For each predicted annotation from the algorithms and each annotation present in the "gold standard" list, terms were propagated to the root.
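The co-annotation probabilities of Section 3.2.4 can be computed directly from the pair-count matrix M; a toy sketch (the counts below are invented, and M is represented as a nested dict with M[A][A] the number of gene products annotated to A):

```python
def co_annotation_prob(M, a, b):
    """P(b | a) following Steps 1-3, from pair counts in M."""
    union = M[a][a] + M[b][b] - M[a][b]  # Step 1: freq(A U B)
    p_joint = M[a][b] / union            # Step 2: P(A and B)
    p_a = M[a][a] / union                # P(A), normalized over the same union
    return p_joint / p_a                 # Step 3 (the normalization cancels)

# Toy counts: 10 genes annotated to A, 4 to B, 3 to both.
M = {"A": {"A": 10, "B": 3}, "B": {"B": 4, "A": 3}}
co_annotation_prob(M, "A", "B")  # 3/10 = 0.3
```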
Any overlap between the predicted annotations and the "gold standard" was considered a true positive. Precision, recall, thresholds and F-score were calculated using the ROCR package in R [54]. The average precision for each threshold t was calculated across targets with respect to the number of targets for which at least one prediction was made above that threshold t. The average recall was calculated across all targets regardless of the threshold. The F-measure (harmonic mean) was also calculated as defined by the CAFA organizers [41].

As proposed by Gillis and Pavlidis [53], a different gene-centric approach was considered by using the Resnik measure to explore the semantic similarity between the actual and the predicted function. In particular, this metric was used to find how many predictions were more informative than the null, i.e., how many of those predictions had a higher score than what could be assigned by prevalence alone.

A function-centric measurement was also performed by calculating the area under the receiver operating characteristic curve (AUROC). Terms were propagated to the root and the scores assigned by the prediction methods were used. The package pROC [55] was used to calculate the area under the ROC curve.

3.3 Analysis of the Instability of Gene Set Enrichment Analysis Over Time

Gene set enrichment analyses are increasingly used to analyse and interpret biological information.
For this reason, it is important to explore the impact that annotation instability has on the results of such analyses on a comprehensive scale.

In this study, more than 2,000 hit lists stored in GMT files from MSigDB [56] (collection C2: curated gene sets from online pathway databases, publications in PubMed and knowledge of domain experts; and collection C3: motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat and dog genomes) [30] were retrieved from the GSEA website. Only hit lists with more than 10 members were considered (Figure 3.8). After that initial filtering, a series of enrichment analyses was run using yearly GOA editions (from May and November). For each gene set, the same score was assigned to all the genes (0.001). Enrichment analyses were done using the software ErmineJ, using an over-representation analysis (ORA) and an FDR ≤ 0.1 (Figure 3.9).

Figure 3.8: Pre-processing steps for enrichment analysis.

Figure 3.9: Pipeline to compare results of enrichment analyses over time.

Chapter 4
Results and Discussion

4.1 Exploratory Analyses

To get a general overview of the data, the first exploration examined whether annotations were more prominent or better supported for organisms with a smaller genome size, or whether the biases previously reported were most likely influenced by curation preference. This was done by considering the genome size of each species (taken as the total number of coding genes reported in the current assembly listed in the Ensembl Genome Browser), the number of existing annotations, and the number of annotations supported by publications. The results showed that there is no association between the number of annotations and the genome size.
Within organisms that have fewer than 10,000 genes, yeast has the most annotations and also the most publications (circle size), followed by E. coli, while Dictyostelium clearly lacked annotations and publications supporting those annotations. For organisms with a genome size between 15,000 and 20,000 genes, the fruit fly had considerably fewer annotations than chicken or dog, but more of them were supported by publications. This result is expected, as chicken, dog and cow have most of their annotations inferred by homology. For organisms whose genome size is bigger than 20,000 genes, mouse was noticeably the organism with the highest number of annotations, followed by human and rat. These three organisms also had a considerable number of annotations supported by publications. However, other important model organisms like zebrafish, worm or Arabidopsis fall behind in the number of annotations available and in those supported by publications (Figure 4.1).

Figure 4.1: Overview of species-specific biases in manual curation efforts. No relationship was found between the genome size (quantified by the number of protein-coding genes), the number of total annotations and the total number of publications supporting them (circle size).

All 14 organisms were processed for analysis. However, for descriptive purposes and easier comparison, only 5 representative species are discussed: human, Arabidopsis, E. coli, yeast and fruit fly.

It is clear that there are more gene products than genes in the genome. However, one would likely expect a constant or subtle increase in the number of gene products annotated over time. Nevertheless, these numbers are highly dependent on the source and the database that provides the information. As mentioned in the introduction, DB Object IDs from UniProtKB reflect gene products mapped to each gene.
However, the instability of that database (and others) will likely impact the number of DB Object IDs available.

Consistent with this hypothesis, the number of gene products (DB Object IDs) available at each GOA version over time showed a gradual increase for organisms with larger genomes such as mouse, human or Arabidopsis. In contrast, smaller organisms like fly or yeast seemed more stable in terms of the number of gene products. However, exceptions were observed at particular time points for human and E. coli, where some gene products were clearly lost and recovered intermittently over time.

Even if the GAF 2.0 GOA format has the rule of assigning DB Object IDs to a top-level primary gene or gene product ID, the total number of protein entries is directly influenced by the source of the annotation files. In particular, the sudden decrease in the number of DB Object IDs observed for human data between 2009 and 2011 can be explained by the fact that, at the end of 2008, a draft of the complete human proteome was released specifically from UniProtKB/SwissProt. This release had approximately 20,000 putative human protein-coding genes manually reviewed. UniProtKB/TrEMBL products were also revised and 15,000 isoforms were merged with 40% of the UniProtKB/SwissProt entries, causing a large reduction in the number of annotated products. These numbers are consistent with the shifts observed (Figure 4.2).

The next increase for human was observed in 2011.
The most likely reason is that a complementary pipeline was implemented to import predictions from UniProtKB/TrEMBL sequences, which are non-revised and potentially redundant.

Similar shifts can occur when annotation pipelines are revised by other databases that are species-specific, as may be the case for the changes observed in E. coli (EcoCyc).

Figure 4.2: Total number of gene product identifiers found for each species.

Additionally, a scientist would most likely be interested in knowing how stable the annotations of each gene product are over time. This can be explored by looking at the changes in the number of GO terms directly annotated to each gene product over time.

To get a general overview of the number of "functions" one can expect for each gene product and species, the average number of GO terms assigned to all the gene products was computed. The results showed considerable variation within and between species, ranging from 3 to 10 GO terms per gene product. In this metric, no distinction was made between GO terms manually assigned and those inferred electronically. However, in the case of human, the average count is clearly influenced by electronic annotations, especially when considering the changes discussed above (where UniProtKB/TrEMBL products were removed between 2009 and 2011). From this observation, it is clear that the gene products from UniProtKB/SwissProt (in that period of time) had considerably more GO terms assigned than their non-reviewed counterparts originating from UniProtKB/TrEMBL. But, as both sets are incorporated indistinctly in the GOA files, the average value decreased when the new pipeline reincorporated UniProtKB/TrEMBL entries into the database.

Figure 4.3: Average number of GO terms directly annotated to gene products across editions.

A closer look at the data showed a high variability in the number of GO terms assigned to each gene product over time (Figure 4.3).
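A simple way to screen for such shifts is to flag, per gene product, any edition-over-edition change of more than 20 direct GO terms, the same threshold used for the colour coding in Figure 4.4. A sketch with invented counts:

```python
def flag_shifts(counts_by_edition, threshold=20):
    """Return (edition, delta) pairs where the direct-term count jumps by more than threshold."""
    editions = sorted(counts_by_edition)
    return [(cur, counts_by_edition[cur] - counts_by_edition[prev])
            for prev, cur in zip(editions, editions[1:])
            if abs(counts_by_edition[cur] - counts_by_edition[prev]) > threshold]

# Hypothetical gene: direct GO-term counts over four editions.
flag_shifts({1: 30, 2: 31, 3: 5, 4: 40})  # [(3, -26), (4, 35)]
```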
In some cases the changes would be noticeable, whereas in others no changes would be detected. An example of such changes is shown in Figure 4.4, where a random set of human genes was taken to compare the difference in the number of GO terms assigned from one edition to the next. A common "assumption" is that existing annotations remain stable and new ones are incorporated over time; a gradual increase in the number of GO terms might then be expected. However, this figure showed that some noticeable shifts can occur over short periods of time and that these differences occur in larger proportions for some gene products and in particular editions.

Figure 4.4: Contrasting shifts found in the number of GO terms assigned to a random set of gene products across editions. The gray colour indicates that no changes in the number of terms were observed compared to the previous edition. A red colour indicates that the gene product lost more than 20 terms in that particular edition compared to the previous one, and the blue colour indicates that it gained more than 20 GO terms compared to the previous edition. All genes were explored, but only a random sample is shown. The genes with the most drastic changes in human data showed a magnitude of over 100 GO terms added or removed at particular time points compared to previous editions.

However, gene products that seemed to have a relatively constant number of GO terms are not necessarily "stable". In particular, terms can appear and disappear from one edition to the next. An example of such instability is shown in Figure 4.5.

Figure 4.5: Example of the functional instability that gene products have over time. The gene product OR11H7 (olfactory receptor 11H7) with the UniProtKB/SwissProt ID Q8NGC8 is used as an example of functional instability.
The GO terms were traced, and a constant addition and removal of the same terms across editions was apparent. The schema displays the historical changes, where each directly annotated GO term has been colour-coded. Those marked with an asterisk (*) are GO terms that remained annotated to the gene product as of August 2014.

The latter example clearly reflects the problem of the instability of GO terms. A way to explore such shifts on a wider scale is by looking at how semantically similar a gene product is to itself over time. This can be done using the Jaccard distance (Figure 4.6).

Figure 4.6: Average values of how semantically similar genes are across editions.

Noticeably, for yeast, human, Arabidopsis and mouse, editions only 2 years old showed a semantic similarity of just 50% compared to the current annotations, which increased to 80% for editions within the current year. Interestingly, the fruit fly behaved differently and, on average, its gene products retained a higher functional similarity. In contrast, E. coli data showed a considerably abnormal pattern of semantic similarity, which seemed correlated with the drops observed in the number of gene product IDs. The results are similar to those reported by Gillis and Pavlidis (2013) for "genes always present" in human data [37].

Gene products that are multifunctional often tend to be prioritized for curation. In fact, the gene sets defined by different GOC projects seemed to rank high in terms of their multifunctionality score. This behaviour was particularly observed in the gene sets listed for cardiovascular processes and those derived from the Reference Genome Project (Figure 4.7).

Figure 4.7: Exploring the association between gene sets prioritized for curation and multifunctionality. Genes that have been prioritized for curation also seemed to be highly multifunctional. However, some projects seemed to rank higher than others.
Some genes are not in the top multifunctional ranking, seemingly because the curation project is still in progress.

However, from a general overview, the average multifunctionality score for each species showed a very slight gradual increase, and only yeast and E. coli data had considerable shifts. This means that, on average, the gene multifunctionality score does tend to increase over time, potentially due to the increasing multifunctionality of the prioritized genes (Figure 4.8).

Figure 4.8: Average score of gene multifunctionality across editions.

In fact, most of the gene products present in GOA data are not considered multifunctional, especially when, on average, each gene product has 3 to 10 GO terms. One particular problem that has been constantly observed and described is that most genes have very shallow GO terms assigned, especially because these were inferred computationally and, in most cases, have not been curated.

One indirect way to verify this is by exploring the overall GO term membership across editions (i.e. the number of gene products belonging to a particular GO group). The results showed that only a small set of GO terms is directly annotated to support the largest number of gene products (Table 4.1).

In particular, the top GO terms used across editions were too shallow, reflecting the lack of proper coverage in the GOA data, especially as the root terms were also the ones most commonly assigned: "biological process" (GO:0008150), "molecular function" (GO:0003674/GO:0005554), "cellular component" (GO:0008372/GO:0005575), "cytoplasm" (GO:0005737), "nucleus" (GO:0005634), "translation" (GO:0006412), "plasma membrane" (GO:0005886), "membrane" (GO:0016020), "integral component of membrane" (GO:0016021), "protein binding" (GO:0005515) and "ATP binding" (GO:0005524).

One might assume that such shallow terms are only assigned by IEA annotations.
However, when IEA annotations are removed, the terms most commonly assigned (as direct GO terms) remain mostly the same, except for a few others like "regulation of transcription, DNA-templated" (GO:0006355), "cellular response to DNA damage stimulus" (GO:0006974), "structural constituent of ribosome" (GO:0003735) or "cytosolic ribosome" (GO:0022626), which were particularly prevalent in fly data.

Table 4.1: The GO terms most frequently used in GOA data.

Species      >5,000 genes  1,000-5,000 genes  100-1,000 genes  <100 genes  Total GO terms (GOA)
Human        1-5           7-33               96-436           3244-14034  3347-14507
Arabidopsis  1-2           9-22               85-265           1534-5223   1628-5509
E.coli       0             1-2                2-38             859-3146    861-3191
Yeast        0             5-8                6-85             1396-4925   1402-4965
Fruit fly    0             4-13               61-128           4320-6238   4410-6374

The numbers in the table reflect, for each species, the range in the number of GO terms that have a certain number of genes (GO membership). The last column shows the range of GO terms that have ever been used to support the annotations across GOA editions.

Another way to observe this is by looking at the number of inferred terms for each direct annotation (Figure 4.9). The average values obtained were consistent with previous observations. In particular, for human data, the number of propagated functions between 2009 and 2011 was considerably higher than at the other time points, where UniProtKB/TrEMBL annotations (with shallow GO terms) are present. However, it is important to note that all the species seem to have a gradual increase in the number of propagated terms, which could suggest that annotations are gaining "specificity".
Nevertheless, it is important to remember that one of theproperties of the GO structure is that the depth of a term in the branchdoes not necessarily reflect how specific a term can be.Figure 4.9: Average number of inferred terms over time.However, an interesting way to interpret that subtle an gradual increaseis by arguing that in the last couple of years, the Reference Genome Project654.1. Exploratory Analysesfocused their efforts towards improving and revising existent GO annota-tions. The results reflect such efforts. Particularly, a large number of anno-tations are being curated from electronic inferences. In contrast, a small butstill appreciable proportion of previous manual annotations have also beenupgraded to more granular GO terms. However, these promotions were con-siderably different between species (Figure 4.10).664.1. Exploratory AnalysesFigure 4.10: Total number of electronic annotations that are curated andtotal number of manual annotations that are promoted with a more granularGO term.The effect of upgrading IEA annotations to manual revisions can also beexplored by observing the changes in the number of annotations assigned toother manual evidence codes for each ontology. For example, in the cellularcomponent ontology, recent annotations have been assigned to the compu-tational codes IGC, IBA, IKR, and IRD. Some codes are only used for one674.1. Exploratory Analysesspecies: ISO and IGC in fruit fly and IKR in human. Others have been usedfor longer and have increased its usage, like RCA and ISM. It seemed thatmore increases in annotations from this ontology are found for those withexperimental evidence codes, such as IPI, IDA and in a smaller proportionIGI. The usage of IEP was apparently discontinued and EXP evidence isonly found currently in E.coli data. Interestingly, TAS annotations haveremained stable, except for human data, where an important increase wasfound. 
The usage of other codes assigned by curators, such as IC or ND, has remained stable, except for Arabidopsis, where a considerable increase in gene product annotations with the code ND (no biological data available) was reported in 2011 (Figure 4.11).

Figure 4.11: Changes in the usage of evidence codes for the Cellular Component Ontology.

In the molecular function ontology, recent annotations were assigned to the computational codes ISO, ISA, IBA, IKR and IRD. Some increased in usage, like ISS (although it remained the same for fly and E. coli) and ISM in fly (but dropped in yeast), while RCA remained stable in all but yeast data, where its usage dropped. Experimental annotations have also increased slightly for IDA, IPI and IGI. The usage of IEP was also discontinued in this ontology. TAS annotations have remained stable, whereas IC has slightly increased. ND annotations have also remained stable, except again for Arabidopsis. Contrary to the cellular component, annotations assigned to ND were removed from E. coli data (Figure 4.12).

Figure 4.12: Changes in the usage of evidence codes for the Molecular Function Ontology.

In the biological process ontology, many new annotations have recently been incorporated with the computational codes ISA, ISM, IBA, IRD, IKR and RCA. However, they were mostly used for fly, yeast and human data. Experimental annotations are gradually increasing for the IDA, IGI, IMP and IC codes, while usage has remained stable for TAS and ND annotations. NR data is still present in E. coli, indicating that some genes have not been characterized (Figure 4.13).

Figure 4.13: Changes in the usage of evidence codes for the Biological Process Ontology.

It is important to remember that annotations can change not only due to
GO term upgrades or changes in the evidence codes, but also due to inconsistencies in the databases. A practical way for users to visualize the shifts observed (Figure 4.5) is by plotting the "existence" of the annotation at each time point.

Contrary to the previous methods discussed in the introduction, I considered not only the GO term and evidence code, but also the supporting publication to trace the annotation. This is arguably the most important factor to consider if one aims to properly trace the existence of an annotation; even if the annotation disappears, the association can still be validated if the source is accessible, especially for manual annotations.

Users often assume that manually curated annotations from revised gene product IDs (as is the case for those derived from the UniProtKB/SwissProt database) remain relatively present or stable in subsequent GOA editions. However, an exploration made with human data for the highly popular and multifunctional gene TP53, with a random selection of its annotations, showed the opposite. Figure 4.14 shows that, even for highly studied genes such as this one, important changes occur from one edition to the next. Old annotations can be removed completely, while others remain relatively stable, disappear after just a few editions, or exist in only a single GOA edition.

Another potential hypothesis for why manual annotations can disappear is that the source was not robust enough, was wrongly interpreted, or was even derived from a retracted paper. An exploration of how many retracted papers are used to support annotations, however, showed that the numbers are negligible (fewer than 5; data not shown).

Figure 4.14: Manually curated annotations are also unstable. A common assumption is that manually curated annotations are stable.
However, an exploration of 50 random annotations in human for the multifunctional gene TP53 highlighted that annotations can be highly volatile over time.

4.2 Utility of Creating a Web-based Visualization Tool: GOtrackWeb

Clearly, there is still a lot to learn from exploratory data analyses of gene annotation data. The previous results highlighted the high variability of GO annotations. However, the importance of assessing the instability should also be translated into an application that users can employ and, particularly, one that allows them to explore how these factors impact genes and annotations of their interest. Such a tool was not available until now.

A database and a web interface (GOtrackWeb) were built to study and extend this information to the community, and were designed to keep the information as updated as possible. This is a large contribution that can be useful not only for researchers, but also for GO curators.

On the website, users are able to explore all the UniProtKB/SwissProt or UniProtKB/TrEMBL identifiers mapped to particular gene symbols, and to compare, on a dynamic plot, parameters such as the number of direct and propagated terms for their gene product of interest over time, changes in its multifunctionality score, or how semantically similar its annotations in previous versions were to those in the latest version. Similarly, they can retrieve the top multifunctional genes for each species in the latest edition (Figure 4.15).

Figure 4.15: Explore different exploratory metrics for a gene product.

Additionally, the functionality tab is very handy, as users are able to explore all the GO terms that have ever been annotated to the queried gene product and visualize the "existence" of each annotation across editions. A "time line" with monthly squares is displayed for users to visualize this existence.
The colours of each monthly box represent the evidence code used for that annotation at that time point. When the annotation was absent in a particular edition, a gap is shown in the time line. Even if the GO term is no longer used, or if the annotation seems unstable, the user can access the sources used to support it through the table displayed below (in case there is a PubMed ID). Likewise, users can download this information or even check the time line of how many gene products have been assigned to the respective GO terms at each time point (Figure 4.16).

Figure 4.16: Explore the historical existence of annotations associated to functions for a particular gene product of interest.

4.3 The Assessment of Gene Function Prediction Algorithms

As noted in the exploratory analyses, curation effort is skewed toward certain model organisms, and within each organism, prioritized gene sets seem to favor multifunctional genes. The results from the GO term membership analysis also highlighted that mostly shallow GO terms are assigned in annotation data, and that IEA annotations are often preferred in manual revisions.

As most of the genes considered in the CAFA assessment already have at least an electronically inferred function assigned to them (mostly by sequence alignment or other similar methods), it is likely that the manual annotations that come up in the "accumulation period" (and are most likely used for the assessment) will reflect IEA upgrades.

Taking all these factors into account, a submission was made to the CAFA2 assessment by considering GO term prevalence as the null model and assigning the most prevalent terms (excluding those already assigned manually) to every single target.
Additionally, GO terms that are frequently upgraded or that commonly co-occur were also assigned as "predictions" if their probability score was higher than what could be assigned by prevalence alone.

To assess the performance of this set of methods, which can basically be attributed to annotation artifacts and do not consider any biological reality of the targets, I reproduced the predictions using, in this case, the targets included in the gold standard set of the CAFA1 assessment.

In general, co-occurrence and IEA upgrade showed a small contribution on top of prevalence. Specifically, on average across all the evaluation targets, 15% of the predictions were derived from co-occurrence and 84% from prevalence. Contrary to what was expected, only 1% of the predictions were assigned by IEA upgrade, as their probabilities rarely improved on those from prevalence. When using the gold standard set reported in the CAFA1 paper to assess how many of our predictions became true positives, an average of 10 (and a maximum of 125) true-positive annotations were derived from prevalence, an average of 0 (and a maximum of 2) from IEA upgrades, and an average of 1 (and a maximum of 41) from co-occurrence. These trends did not differ when combining prevalence with only one of the two other methods.
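The prevalence null model can be sketched as follows. The annotation data are toy examples and the scoring scheme (term frequency normalised by the number of genes) is an illustrative reading, not the exact CAFA2 submission code.

```python
from collections import Counter

# Toy gene-to-terms map; real data would come from a GOA edition.
annotations = {
    "geneA": {"GO:0005515", "GO:0005634"},
    "geneB": {"GO:0005515", "GO:0005737"},
    "geneC": {"GO:0005515", "GO:0005634", "GO:0005737"},
}

# How many gene products carry each term (the "prevalence" ranking).
prevalence = Counter(t for terms in annotations.values() for t in terms)

def predict_by_prevalence(existing_terms, k=2):
    """Predict the k most prevalent terms the target does not already have,
    scored by annotation frequency so other methods can be compared against it."""
    n_genes = len(annotations)
    ranked = [(t, c / n_genes) for t, c in prevalence.most_common()
              if t not in existing_terms]
    return ranked[:k]

# A target already annotated with "protein binding" gets the remaining
# most common terms as its "predictions".
print(predict_by_prevalence({"GO:0005515"}))
```

Co-occurrence and IEA-upgrade scores would then only replace a prediction when they exceed the prevalence score, which is why they contributed so little on top of it.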
The apparent utility derived from IEA upgrades might in part be just a reflection of the closed-world assumption of the evaluation, a limitation that has already been criticized [15] and recently explored [57]. In fact, many IEA annotations are likely to be accurate even if they are not directly upgraded by curators in the time frame used for the evaluation.

To further assess the relative "performance" of these methods, I used the 18 sets of predictions that were provided to Jesse Gillis and Paul Pavlidis by the organizers of the CAFA1 assessment for an independent evaluation. They also provided the "predictions" from the prevalence set used as a control, as well as those derived from the GOtcha and BLAST methods.

When comparing the performance of the proposed methods against the others in a function-centric evaluation, it was clearly observed that, regardless of the ontology, the combined method performed better than prevalence alone and was comparable to the others. GOtcha and BLAST were the top-performing methods. An alternative "gold standard" was used by including GO terms that have 10 to 100 genes assigned, but the results did not show any significant differences between using this or the CAFA1 gold standard (Figure 4.17; Table 4.2).

Figure 4.17: Results of the performance of function-prediction algorithms as measured by AUROC. A and B show performance using GO terms present in the gold standard list; C and D show performance using GO terms that had 10 to 100 genes assigned.
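A function-centric AUROC of the kind reported in Figure 4.17 can be computed per GO term from prediction scores and membership labels. The sketch below uses the rank-sum (Mann-Whitney) formulation with toy scores; it illustrates the metric, not the actual evaluation code used for the thesis.

```python
# Function-centric AUROC for one GO term: rank all evaluation genes by a
# method's score for that term and measure how well true members sort
# above non-members. Scores and labels below are toy values.

def auroc(scores, labels):
    """AUROC via the rank-sum identity; tied scores receive mid-ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1            # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    pos = [r for r, lab in zip(ranks, labels) if lab]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfect separation gives 1.0; a constant score (a "naive" predictor,
# like the 0.500 rows in Table 4.2) gives 0.5.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # 0.5
```

Averaging this per-term AUROC over the eligible GO terms yields the method-level numbers in Table 4.2.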
Table 4.2: Results of the function-centered performance as measured by AUROC (team data, prevalence filtered).

BP, 154 GO terms (out of 18974; terms considered if they had 10-100 members):
    BLAST                               0.680
    GOtcha                              0.698
    Naive method                        0.500
    Prevalence + IEA + co-occurrence    0.599
    Prevalence + IEA                    0.557
    Prevalence + co-occurrence          0.588
    Top score                           0.698
    Lowest score                        0.500
    Average (all teams)                 0.571

BP, 211/233 GO terms (all GO terms considered in the gold standard):
    BLAST                               0.702
    GOtcha                              0.715
    Naive method                        0.500
    Prevalence + IEA + co-occurrence    0.595
    Prevalence + IEA                    0.557
    Prevalence + co-occurrence          0.590
    Top score                           0.715
    Lowest score                        0.500
    Average (all teams)                 0.575

MF, 22 GO terms (terms considered if they had 10-100 members):
    BLAST                               0.798
    GOtcha                              0.815
    Naive method                        0.500
    Prevalence + IEA + co-occurrence    0.540
    Prevalence + IEA                    0.508
    Prevalence + co-occurrence          0.538
    Top score                           0.831
    Lowest score                        0.500
    Average (all teams)                 0.688

MF, 27/28 GO terms (all GO terms present in the gold standard):
    BLAST                               0.792
    GOtcha                              0.813
    Naive method                        0.500
    Prevalence + IEA + co-occurrence    0.530
    Prevalence + IEA                    0.505
    Prevalence + co-occurrence          0.545
    Top score                           0.831
    Lowest score                        0.500
    Average (all teams)                 0.671

Regardless of what was considered the "gold standard", our methods had a stronger performance than prevalence alone. Performance was similar to the average performance across the other algorithms considered, but could not outperform BLAST or GOtcha.

I was not able to reproduce the results of the "CAFA score" used for the original CAFA1 assessment. The results computed were highly discordant with what was reported in the publication; the reason could be that I did not have all the results of the submitted algorithms, but only a subset. The descriptions needed to compute the metrics were also not entirely clear in the original publication. I therefore took an alternative approach to evaluate the performance, as described by Gillis and Pavlidis (2013).
The evaluation is based on the semantic similarity between the predictions and the true assignments in the gold standard. This metric allows exploring how informative a prediction is, in particular by looking at those predictions that had a higher probability score than what could be assigned by prevalence alone.

Using information content (IC) as a measure of term specificity is adequate for this purpose because of the shallow-annotation problem. The metric proposed by Resnik was used to explore this. The results obtained for the molecular function ontology showed that, on average, 14.17% of the predictions were more informative than prevalence across all methods. The proposed controls yielded 16.08% informative predictions, and among the datasets, the top-performing method (labelled as team20) had 30% informative predictions. Contrary to the function-centered evaluation, BLAST and GOtcha were not the most informative, yielding between 14% and 15% informative predictions (Table 4.3).

Even though all the methods assessed here seemed to improve on the baseline set by prevalence alone, some of them had a very low performance. The results showed that some, but not all, of the algorithms can make correct and specific predictions. Such percentages, however, can be affected when some rarely used generic terms in GOA (and thus not that prevalent) acquire a high IC score that does not necessarily reflect the specificity of the term [58].
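The information-content idea can be made concrete with a short sketch. The counts below are invented; in the actual evaluation, term frequencies come from propagated GOA annotations, and the full Resnik similarity between two terms is the IC of their most informative common ancestor.

```python
import math

# IC of a term is -log of its annotation frequency after propagation.
# Counts are toy values; the root covers every annotated gene by definition.
term_counts = {"GO:0008150": 100, "GO:0051716": 40, "GO:0006974": 5}
total = term_counts["GO:0008150"]

def information_content(term):
    """Resnik-style information content: -log p(term)."""
    return -math.log(term_counts[term] / total)

# The root is uninformative (IC = 0); rarer, more granular terms score
# higher, which is why IC serves as a proxy for prediction specificity.
print(information_content("GO:0008150"))
print(information_content("GO:0006974") > information_content("GO:0051716"))
```

This is also where the caveat noted above bites: a generic term that happens to be rarely used in GOA acquires a high IC despite being semantically shallow.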
Table 4.3: Results of the function-centered performance measured by information content (molecular function ontology).

Method                              Total predictions   Informative predictions   Percentage
Prevalence + IEA + co-occurrence    690                 111                       16.1%
Prevalence + IEA                    690                 79                        11.4%
Prevalence + co-occurrence          690                 110                       15.9%
GOtcha                              690                 105                       15.2%
BLAST                               690                 102                       14.8%
Naive                               690                 0                         0.0%
Team 51                             690                 118                       17.1%
Team 50                             690                 117                       17.0%
Team 49                             690                 1                         0.1%
Team 48                             690                 13                        1.9%
Team 47                             690                 12                        1.7%
Team 45                             690                 134                       19.4%
Team 44                             690                 184                       26.7%
Team 38                             690                 97                        14.1%
Team 35                             690                 133                       19.4%
Team 28                             690                 116                       16.8%
Team 27                             690                 15                        2.2%
Team 22                             690                 147                       21.3%
Team 21                             690                 131                       19.0%
Team 20                             690                 208                       30.1%
Team 17                             690                 120                       17.4%

In contrast, the results obtained for the biological process ontology showed that, on average, only 6.3% of the predictions were more informative than prevalence. Our methods, as in MF, yielded 17% informative predictions, and again team20 was the most informative, but only with 12.73%. In this case, GOtcha yielded only 6%, while BLAST performed better with 12% (Table 4.4).
Table 4.4: Results of the function-centered performance as measured by information content (biological process ontology).

Method                              Total predictions   Informative predictions   Percentage
Prevalence + IEA + co-occurrence    1186                206                       17.4%
Prevalence + IEA                    1186                93                        7.8%
Prevalence + co-occurrence          1186                203                       17.1%
GOtcha                              1186                73                        6.2%
BLAST                               1186                142                       12.0%
Naive                               1186                0                         0.0%
Team 51                             1186                35                        3.0%
Team 50                             1186                33                        2.8%
Team 49                             1186                0                         0.0%
Team 48                             1186                2                         0.2%
Team 47                             1186                5                         0.4%
Team 45                             1186                76                        6.4%
Team 44                             1186                139                       11.7%
Team 38                             1186                49                        4.1%
Team 35                             1186                91                        7.7%
Team 28                             1186                0                         0.0%
Team 27                             1186                4                         0.3%
Team 22                             1186                59                        5.0%
Team 21                             1186                78                        6.6%
Team 20                             1186                151                       12.7%
Team 17                             1186                89                        7.5%

The results obtained are comparable to those reported by Gillis and Pavlidis (2013) and also highlight the large impact that prevalence has on the assessment. Similarly, other patterns that can be attributed to biases in the annotation process, such as gene multifunctionality or commonly annotated terms, should be considered in critical assessments such as CAFA.

However, as described earlier, such artifacts are now considered priors for prediction methods, when in fact they do not take into account any meaningful biological information. In particular, the results from the IC evaluation showed that only scarce informative terms are assigned by current function prediction algorithms, with performance equivalent to that obtained by the proposed benchmark; this partially reflects the continuing problem of assigning shallow terms that might not answer the question of what a certain target gene does in a biological context. The question of how to design an assessment unaffected by such biases remains open. However, the methods proposed in this thesis serve as a constructive baseline that any algorithm focused on function prediction should clearly outperform.
4.4 Instability of Gene Set Enrichment Results

The last part of the study aimed to explore the impact that annotation instability has on the results of gene set enrichment analyses. More than 2,000 experimentally derived hit lists from MolSigDB were analysed for this purpose. Previous studies have explored this variability on a small scale and suggested that changes in GO annotations have an important influence on the enrichment results [29]. However, to my knowledge, no analysis has been made on a large scale to identify their variability. An example of the enrichment results for one hit list from MolSigDB at 3 different time points exemplifies the problems in reproducibility: the first time point indicated vesicle transport-related terms, whereas the third time point involved terms highly associated with organ development. Such a difference in the top enriched terms may impact the interpretation of results (Figure 4.18).

A biologist would prefer to consider terms that are "robust" regardless of the changes in the annotations. Members of the GOC recommend using the latest version of GO/GOA (if possible) for analysis [32]. To explore the "robustness" of the data, I considered, for each gene set, how many enriched GO terms reaching statistical significance (with an FDR of 0.1 or less) overlap, at each time point, with those present in the "last edition" (July 2014). The results showed considerable variability in the number of significant GO terms in the "last edition" across data sets (from 1 to more than 100). To facilitate the exploration, the sets were classified into arbitrary groups in order to compare the variability of gene sets that contain fewer than 50 significant terms vs. those that have more than 150 significant terms (Table 4.5).

It is of interest to find the proportion of enrichment results that are likely to show some stability for a certain period of time.
It is also expected that shifts in the results at certain time points would correlate with reported changes in GO/GOA.

Figure 4.18: Example of a motif gene set at different time points showing problems in reproducibility and interpretation of results. The data correspond to genes with promoter regions around a transcription start site containing the motif YTTCCNNNGGAMR. The motif does not match any known transcription factor. The top 37 enriched GO terms are shown and unstable terms are coloured for comparison.

The variability in "motif gene sets" (C3 collection) between May 2014 and the last edition showed that 17% of the gene sets had less than 80% overlap in their enriched terms; in fact, in a few cases the overlap barely reached 40%. This raises the concern that some results can change considerably within short periods of time (a couple of months).

Table 4.5: Classification of gene sets by the number of significant GO terms.

C2 collection (1870 total hit lists)           C3 collection (636 total hit lists)
Group     # sig. GO terms   # gene sets        Group     # sig. GO terms   # gene sets
Group A   <= 10             160 (28)           Group A   <= 10             163 (40)
Group B   11-50             386 (26)           Group B   11-50             178 (48)
Group C   51-100            377 (9)            Group C   51-100            139 (14)
Group D   > 100             947 (1)            Group D   > 100             156 (7)

The table shows an arbitrary classification of the gene sets by the number of statistically significant GO terms in the "last edition". For example, gene sets with 11 to 50 significant GO terms belong to Group B. For reference, the numbers in parentheses show how many gene sets in each group had less than 80% overlap in May 2014 with those of July 2014.

The pre-defined groups A and B, which have fewer than 50 significant GO terms, showed the largest variability. Most gene sets (regardless of the group in which they were classified) showed only a very small overlap before 2009, which matches the period when the Reference Genome Project started to revise and improve the quality of GO/GOA. A large variation was also observed between 2010 and 2011, which coincides with major changes in GO annotations for human data, as described earlier. In general, most results show more than 50% similarity after 2012, although interesting outliers were also observed (Figure 4.19). A similar trend was observed in curated gene sets from online pathway databases (C2 collection), although these gene sets showed a higher percentage of overlap with the last edition compared to C3 (Figure 4.20).

Figure 4.19: Variability of GO term overlap in the gene sets (C3). The figure shows that gene sets with fewer than 50 significant terms tend to show more variability when comparing different time points vs. the "last edition". Outliers were identified which showed almost no overlap after just a couple of months (May vs. July 2014).

Figure 4.20: Variability of GO term overlap in the gene sets (C2). The figure shows that gene sets with fewer than 50 significant terms (Groups A and B) tend to show a small variability when comparing different time points vs. the "last edition". Outliers were identified with almost no overlap after just a couple of months (May vs. July 2014).
Hence, the gene sets are likely to contain more specific GO terms, which in turn will share many parental terms with previous results. As long as the semantic similarity of the results is maintained, the results could be considered consistent.

To further explore how similar the results are, the significant GO terms at each time point were compared, in terms of semantic similarity, against those of the "last edition" (after propagation). One limitation of this method is that changes in the GO structure are not considered. However, according to the results of Clarke et al. [29] and Gross et al. [59], changes in the GO structure have a smaller influence on the instability of the results than changes in GOA. For comparative purposes, the gene sets were again grouped by an arbitrary classification, defined by the number of parents belonging to the significant terms from the "last edition" (Table 4.6).

Table 4.6: Classification of gene sets by the number of parental terms.

C2 collection (1870 total hit lists)           C3 collection (636 total hit lists)
Group     # parental terms   # gene sets       Group     # parental terms   # gene sets
Group E   <= 50              249 (114)         Group E   <= 50              167 (79)
Group F   51-150             406 (140)         Group F   51-150             158 (86)
Group G   151-250            335 (87)          Group G   151-250            129 (53)
Group H   > 250              1026 (82)         Group H   > 250              182 (29)

The table shows an arbitrary classification of the enrichment results by considering, for each gene set, the number of parental terms linked to the significant GO terms in the "last edition". The numbers in parentheses show how many of the enrichment results in each group had a semantic similarity value of 80% or less between May 2014 and July 2014.

The results showed a notable difference between the similarity of the results from 2013 vs. the "last edition" in the motif gene sets collection (C3). For the curated gene sets (C2), the overall similarity remained slightly higher at each time point (Figures 4.21 and 4.22).
The small groups (E and F) had the largest variability across gene sets. The outliers observed at the bottom of May 2014 also demonstrate that, in some cases, the results are highly discordant between editions and thus we might be getting a completely different result. On the contrary, the outliers observed at the top of the groups from 2009-2010 show that, in some cases, the gene sets are highly similar, raising the possibility of having a "robust result" after 5 years. Changes in the GO structure are also likely to influence these measurements.

Figure 4.21: Semantic similarity of enriched gene sets by group (C3). The figure shows that gene sets with more than 150 parental terms tend to show less variability in their semantic similarity. Outliers were detected showing almost no overlap after just a couple of months (May vs. July 2014).

Figure 4.22: Semantic similarity of enriched gene sets by group (C2). The figure shows that gene sets with more than 150 parental terms tend to show less variability in their semantic similarity. Outliers were detected showing almost no overlap after just a couple of months (May vs. July 2014).

Even if the gene sets change, a biologist might be more interested in looking at the genes that support those results in order to formulate further hypotheses. It is then relevant to assess whether the same genes are actually supporting the significant results, even if the GO terms change. If the overlap is high, then the groups can be considered highly similar (although they might also be supported by multifunctional genes). If the overlap is low, the actual functional result has changed, likely due to changes in the size of the GO groups in GOA.

The results showed a trend similar to the other two metrics.
In particular, the variability observed for gene sets with fewer than a thousand genes ranged from 0 to 100% overlap in the years 2012 and 2013. Compared to the results from May 2014, most of the results showed a considerable overlap, but outliers covered the entire range. Even for the years 2005-2010, outlier gene sets were found to have a high overlap, but most of them seemed to be completely different results. Gene sets supported by more than 4,000 genes showed a high percentage of overlap and a noticeably smaller variability. This reflects that GO groups with a high number of genes are likely to contain those present in the hit lists and often show up in the results of the enrichment analysis, although most of the results are derived from GO groups with 1,000 or fewer genes (Figures 4.23 and 4.24).

Figure 4.23: Percentage of overlapping genes in the gene sets from C3. Gene sets were grouped by the number of genes supporting their significant terms in the last edition.

Figure 4.24: Percentage of overlapping genes in the gene sets from C2. Gene sets were grouped by the number of genes supporting their significant terms in the last edition.

Taken together, these results show a considerable number of gene sets with a high degree of variability, even within small time frames. Some terms might disappear in future analyses, influenced by the effect of changes in the annotations, which clearly correlated with the changes observed in these results at particular time points. The interpretation of any results derived from gene set enrichment analyses should be treated with caution and used as a complementary exploration rather than a "conclusive result", especially as the reproducibility of results for studies older than 5 years seemed to be jeopardized in most cases.
Enrichment tools that use GO annotations older than 2009 (like DAVID) might then display completely different results from what would be obtained using current GO annotations.

Chapter 5
Future Directions

There is clearly a great deal of interest in better understanding GO and GOA, both among the biologists who use them and within the GOC itself. In this thesis I have presented results obtained from the integration of historical GO annotation data for 14 different organisms. I showed differences in the annotation patterns of different species and built a tool to track and extend these analyses to the research community. The assessment of GO annotation instability is still challenging, but the work presented here provides an overall panorama of how annotation data is evolving. By nature, each change depends on decisions made by the GOC and on the constant evolution of the databases they rely on, such as UniProtKB, which in turn limits the feasibility of assigning predictive scores for future annotation instability. Such decisions also impact the traceability of protein annotations, especially when previous identifiers are removed or de-merged, or when new annotation pipelines are implemented. However, this study has addressed changes that occurred over the 14-year history of GOA, filling important gaps in the assessment of annotation quality and instability. Different metrics were implemented and incorporated into a web-based tool, along with a baseline method that could be employed for the assessment of function prediction algorithms. While the web-based tool does reflect the existence of an annotation, it does not reflect the time points at which annotations are promoted. Likewise, the influence of annotation extensions and cross-references to other ontologies on GOA instability was not addressed.
Future work in this regard would answer the question of whether such additions contribute to the interpretability of GOA annotations or add, in a detrimental way, to the problem of annotation instability. Likewise, incorporating these metrics and historical information into actual applications such as enrichment tools will contribute to the interpretability of the shared functions in further analyses. In particular, the final aim is that any user can submit a set of genes and obtain enrichment results over time for that dataset. The most stable or relevant GO terms and genes for their study could then be prioritized for further interpretation and analysis. Likewise, the possibility of a parallel exploration of the instability of the parental terms within the enrichment analyses should be considered.

In the meantime, this work has been presented to the community of interest at the fourth annual CHiBi/GSAT retreat (UBC's Loon Lake Research and Education Centre, October 3-4, 2013), at the 3rd Annual Canadian Human and Statistical Genetics Meeting (Fairmont Empress Hotel, Victoria, B.C., May 3-6, 2014), and at the Bio-Ontologies Special Interest Group and Automated Function Prediction Interest Group SIG meetings of the Intelligent Systems for Molecular Biology (ISMB) conference (Boston, MA, July 11-15, 2014), receiving positive feedback from the GOC, UniProtKB, users of gene set enrichment tools, and participants in the CAFA assessment.

Appendix

I provide a list of the algorithms used and implemented for the elaboration of this thesis.

General definitions

Let M be a hash table and k a key of the hash map. We define M(k) as the function that gets all the elements mapped to k. Let M be a hash table of type <S, E>, k a key of type S, and e an element of type E. We define Put(M, k, e) as the function that creates the relationship k -> e in M.

Program 5.1: The main structure of GOtrack, built to pre-process and analyse historical GO annotations.

1. Read user arguments.
2. Download new GOA files and TermDB files from the repository; collect the release date of each file. Output: edition2dates.txt.
3. Update DB Object IDs to the current version and old MEDLINE IDs to PubMed IDs (mapping).
4. Create a list of all the GO terms and of the GO terms always present in the GO graph.
5. Create a file listing all DB Object IDs and the DB Object IDs almost always present across editions given the user-provided threshold. For each GOA file, get the parental GO terms of each direct GO term annotated.
6. Retrieve the PubMed date of each publication.
If possible, use the pre-computed file per species or query the website.
7. Track changes in the evidence codes assigned to each annotation. Output: evidencecodehistory.txt.
8. Compute parameters using the EDITIONANALYSIS algorithm.
9. Count the number of GO terms annotated to each DB Object ID per GOA file. Output: countGenseperGoTerm.txt.
10. Count the number of direct GO terms assigned to each DB Object ID per GOA file. Output: countDirectTermspergene.txt.
11. Count the number of parental GO terms with "is a" and "part of" relationships to the direct GO terms assigned to each DB Object ID per GOA file. Output: countInferredTermspergene.txt.
12. Count the number of parental GO terms with "is a" and "part of" relationships to the direct GO terms assigned to each DB Object Symbol per GOA file. Output: countInferredTermsperSymbol.txt.
13. Generate a file with four columns: the DB Object ID, the number of direct GO terms per DB Object ID, the number of inferred GO terms per DB Object ID, and the edition. This file will be loaded into the database.
14. Compute the JACCARD algorithm for all GOA editions.
15. Count the number of GO annotations that are replaced in future editions with a more granular GO term.

Program 5.2: An algorithm run by GOtrack to compute one single edition.

1. Get the parental GO terms of each direct GO term in the GOA file.
2. Get a list of DB Object IDs in the GOA file. Output: genes.<edition>.txt.
3. Compute the multifunctionality score per DB Object ID.
4. Create the gomatrix file (GOMATRIX algorithm).

Program 5.3: Program to create GOMatrix files.
They list the genes and the GO terms they are associated with in a particular edition.

Data: GenesAlmostAlwaysPresent.txt (GAAP): list of genes.
Data: Gene Association File (GOA): file with GO annotations.

hashGenesTerms ← a hash table that maps a gene g to the set of GO terms annotated to it
for DB Object ID g ∈ GAAP do
    if g ∈ GOA then
        for term ∈ hashGenesTerms(g) do
            print DB Object ID + GO term
    else
        print DB Object ID + "-1"
        (used to indicate that the DB Object ID is not in GOA)
Output: gomatrix.*.txt

Program 5.4: An algorithm to compute semantic similarity based on the Jaccard distance.

Data: Gene Association Files (GOA): files with GO annotations.
Data: gomatrix.*.txt files.

for edition i ∈ 1..n do
    let edA be edition i
    let edB be edition n
    genesA ← all DB Object IDs in edA
    genesB ← all DB Object IDs in edB (done only once)
    for each DB Object ID g ∈ genesA do
        if g ∉ edB then
            sim ← -1  // gene not in the last edition
        else
            goA ← GO terms associated to g in edA (gomatrix file)
            goB ← GO terms associated to g in edB (gomatrix file)
            jaccardScore ← |goA ∩ goB| / |goA ∪ goB|
            // jaccardScore is the similarity score for gene g in edition i
Output: jaccardpergeneovertime.txt

Program 5.5: An algorithm to map old DB Object IDs to the most current version.

Data: IdMap: DB Object ID mapping for non-UniProt IDs.
Data: Gene Association File (GOA): file with annotations.
Data: List of genes (allgenes.txt): file with all the DB Object IDs annotated across all editions of GOA.
Data: Genes Always Present (GAP): list of DB Object IDs present in all GOA editions.
Data: Genes Last Edition (GLE): list of DB Object IDs present in the current GOA edition.
Result: GOA files with updated DB Object IDs.

// Create the most updated version of the IDs
for gene ∈ allgenes.txt do
    IdWebMap ← value returned by the UniProt website for gene
// Build a dictionary using a pre-selected edition of the GOA files
s ← pre-selected edition for the current species
for annot ∈ GOAs do
    customDic ← hash table that maps the DB Object ID to the DB Object Symbol and the
synonyms present in annot
// Update IdMap and customDic using IdWebMap
for match ∈ IdMap ∪ customDic do
    if match points to a different ID in IdWebMap then
        update match
for GOAi ∈ GOA do
    for annot ∈ GOAi do
        get the DB Object ID and synonyms
        // for E. coli the symbol is used
        if the DB Object ID is already a UniProt ID then
            if DB Object ID ∈ GAP or GLE then
                leave the current DB Object ID
            else if DB Object ID ∈ customDic then
                replace the old DB Object ID with the new one
            else
                search the synonyms in annot and replace the DB Object ID with the most prevalent candidate
        else if the DB Object ID is in IdMap then
            replace the old DB Object ID with the matching DB Object ID
        else
            search the synonyms in annot and replace the DB Object ID with the most prevalent candidate
        if DB Object ID ∈ IdWebMap then
            update the ID with the latest version in IdWebMap
        write the latest version of annot to the *syn file
rename the *syn files to *gz

Program 5.6: An algorithm to map old MEDLINE IDs to current PubMed IDs.

Data: Dictionaries (MuID-PmID.ids.gz): files with conversion information from older IDs used in MEDLINE to new PubMed IDs.
Data: Gene Association Files (GOA): files with GO annotations.
Result: GOA files with updated PubMed IDs.

Procedure:
read Dictionaries
for GOAi ∈ GOA do
    for annot ∈ GOAi do
        PubMedId ← the PubMed ID annotated in annot
        if PubMedId is in Dictionaries then
            replace the DB Reference from annot with the new DB Reference in a new copy (*syn) of the GOA file
rename the *syn file to *gz

Program 5.7: An algorithm to load the information into the database.

Data: GO tree files (termdb).
Data: Gene Association Files (GOA): files with GO annotations.
Data: countGenesPerGoTerm.txt: contains the number of genes annotated to each GO term.

Procedure:
1. Load the GO term names in the termdb files.
2. Load the <species> count table.
3. Load the number of genes per GO term.
4. Load the evidence code history.
5. Load the relationship between each GOA file and its publication date.
6. Load the DB Object IDs that were updated.
7. Load the GOA files.
8. Load the analysis of annotations.
9.
Execute post-load procedures.

Program 5.8: The CAFA main algorithm.

Data: go.*.txt: prevalence of GO terms.

ids ← a hash set that maps partialId to CafaId
annot ← read the annotation file; create a hash map that associates a CafaId ∈ ids to an arrayList of annotations
annot ← read the children and parents of each annotation
topGoterms ← read the prevalence GO terms from go.*.txt
for gene ∈ annot do
    predictionsForThisGene ← will hold all predictions for this target
    predictionsForThisGene ← call predictGoTerms if the co-occurrence method is active
    predictionsForThisGene ← add all items in topGoterms if the prevalence method is active
    predictionsForThisGene ← call predictGoTerms if the IEA-upgrade method is active
    predictionsForThisGene ← remove duplicates, if any, and order the predictions by score
    if two or more methods predict the same GO term then
        take the one with the highest score
    if a prediction is already manually annotated to gene then
        do not take that prediction
    print only the first 1500 predictions in predictionsForThisGene
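The merging logic at the end of Program 5.8 — keep the highest score when methods agree, drop GO terms that already have manual annotations, and cap the output at 1500 predictions — can be sketched in Python as follows. The function and variable names are hypothetical illustrations of the combination step, not the GOtrack/CAFA code itself.

```python
def merge_predictions(method_outputs, known_terms, limit=1500):
    """Combine per-method (GO term, score) predictions: keep the highest
    score when methods agree, skip terms already manually annotated,
    and return at most `limit` predictions ordered by decreasing score."""
    best = {}
    for predictions in method_outputs:
        for term, score in predictions:
            if term in known_terms:
                continue  # already manually annotated to this gene
            if score > best.get(term, float("-inf")):
                best[term] = score
    ranked = sorted(best.items(), key=lambda pair: -pair[1])
    return ranked[:limit]

# Invented outputs from two prediction methods for one target gene.
cooccurrence = [("GO:1", 0.9), ("GO:2", 0.4)]
prevalence = [("GO:2", 0.7), ("GO:3", 0.5)]
known = {"GO:3"}  # GO:3 is already manually annotated

print(merge_predictions([cooccurrence, prevalence], known))
# [('GO:1', 0.9), ('GO:2', 0.7)]
```

Here GO:2 keeps the higher of its two scores and GO:3 is suppressed because it is already manually annotated.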

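Program 5.4 above reduces, for each gene, to a Jaccard similarity between its GO-term sets in two editions. A minimal Python sketch of that loop follows, with invented gomatrix contents, keeping the -1 sentinel for genes absent from the reference edition.

```python
def jaccard_per_gene(edition_a, edition_b):
    """Jaccard similarity of each gene's GO-term set between an earlier
    edition and the reference edition; genes missing from the reference
    edition are scored -1, as in Program 5.4."""
    scores = {}
    for gene, go_a in edition_a.items():
        if gene not in edition_b:
            scores[gene] = -1.0  # gene not in the last edition
        else:
            go_b = edition_b[gene]
            scores[gene] = len(go_a & go_b) / len(go_a | go_b)
    return scores

# Invented gomatrix contents: gene -> set of GO terms.
ed_2010 = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
ed_2014 = {"P1": {"GO:1", "GO:2", "GO:4"}}

print(jaccard_per_gene(ed_2010, ed_2014))
```

In this toy example P1 scores 2/3 (two shared terms out of three in the union) and P2 scores -1 because it is absent from the reference edition.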

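Program 5.6 amounts to a dictionary lookup over the DB:Reference column of each annotation line. A hedged sketch assuming GAF 2.0-style tab-separated lines (the IDs, the mapping, and the sample row below are invented):

```python
def update_references(lines, muid_to_pmid):
    """Rewrite the DB:Reference column (index 5 in GAF 2.0) of each
    annotation line, replacing old MEDLINE IDs with PubMed IDs.
    References without a dictionary entry are left untouched."""
    updated = []
    for line in lines:
        cols = line.split("\t")
        ref = cols[5]
        if ref.startswith("MEDLINE:"):
            muid = ref.split(":", 1)[1]
            if muid in muid_to_pmid:
                cols[5] = "PMID:" + muid_to_pmid[muid]
        updated.append("\t".join(cols))
    return updated

# Invented MuID-to-PMID mapping and a single invented GAF-style row.
mapping = {"94123456": "7654321"}
rows = ["UniProtKB\tP00001\tgeneX\t\tGO:0005515\tMEDLINE:94123456\tIPI"]

print(update_references(rows, mapping)[0].split("\t")[5])  # PMID:7654321
```

Writing the updated lines to a *syn copy and renaming it, as in Program 5.6, would complete the step.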