Open Collections

UBC Theses and Dissertations

Building and inferring knowledge bases using biomedical text mining Lever, Jake 2018

Full Text

Building and inferring knowledge bases using biomedical text mining

by

Jake Lever

B.Eng., University of Edinburgh, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

September 2018

© Jake Lever 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Building and inferring knowledge bases using biomedical text mining

submitted by Jake Lever in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics.

Examining Committee:
Steven Jones, Bioinformatics, Supervisor
Inanc Birol, Bioinformatics, Supervisory Committee Member
Sohrab Shah, Bioinformatics, Supervisory Committee Member
Kendall Ho, Emergency Medicine, University Examiner
Roger Tam, Radiology, University Examiner

Additional Supervisory Committee Member:
Art Cherkasov, Bioinformatics, Supervisory Committee Member

Abstract

Biomedical researchers have the overwhelming task of keeping abreast of the latest research. This is especially true in the field of personalized cancer medicine where knowledge from different areas such as clinical trials, preclinical studies, and basic science research needs to be combined. We propose that automated text mining methods should become a commonplace tool for researchers to help them locate relevant research, assimilate it quickly and collate for hypothesis generation. To move towards this goal, we focus on extracting relations from published abstracts and full-text papers. We first explore the use of co-occurrences in sentences and develop a method for inferring new co-occurrences that can be used for hypothesis generation. We next explore more advanced relation extraction methods by developing a supervised learning method, VERSE, which won part of the BioNLP 2016 Shared Task.
Our classical method outperforms a deep learning method showing its applicability to text mining problems with limited training data. We develop it further into the Kindred Python package which integrates with other biomedical text mining resources and is easily applied to other biomedical problems. Finally, we examine the applicability of these methods in personalized cancer research. The specific role of genes in different cancer types as drivers, oncogenes, and tumor suppressors is essential information when interpreting an individual cancer genome. We built CancerMine, a high-quality knowledgebase, using the Kindred classifier and annotations from a team of annotators. This allows for quantifiable comparisons of different cancer types based on the importance of different genes. The clinical relevance of cancer mutations is generally locked in the raw text of literature and was the focus of the CIViCmine project. As a collaboration with the Clinical Interpretation of Variants in Cancer (CIViC) project team, we built methods to prioritise relevant papers for curation. Through this work, we have focussed on different ways to extract structured knowledge from individual sentences in biomedical publications. The methods, guidelines, and results developed will aid biomedical text mining research and the personalized cancer treatment community.

Lay Summary

There are too many publications for a single researcher to read. This is particularly true in cancer research where the knowledge can be spread across many journals. We develop computational methods to automatically read published papers and extract important sentences. We first look at co-occurrences, where two terms appear in the same sentence, and build a system for inferring new ones. We then build a system that, provided with enough examples, can extract the meaning from a sentence. This competed in and won a specific problem in the BioNLP Shared Task 2016 community competition.
Finally, we use these methods to extract knowledge relevant for personalized cancer treatment, to understand the role of different genes in cancer, and the relevance of different mutations to clinical decisions. Our methods can be generalized to other problems in biology and our results will be kept up-to-date to remain valuable to cancer researchers and clinicians.

Preface

All the work presented henceforth was conducted at Canada’s Michael Smith Genome Sciences Centre, part of the BC Cancer Agency, in the laboratory of Dr. Steven J.M. Jones with the collaboration of the Griffith Lab at Washington University in St Louis. I was personally funded by a Vanier Canada Graduate Scholarship, the MSFHR/CIHR Bioinformatics training program, a UBC four year fellowship and funding from the OpenMinTeD Horizon 2020 project. This work was also supported through Compute Canada infrastructure.

A version of Chapter 2 has been published in the Bioinformatics journal and the citation is below. A licence to reuse the text and figures from this paper has been gained from Oxford University Press through the Copyright Clearance Center.

Lever J, Gakkhar S, Gottlieb M, Rashnavadi T, Lin S, Siu C, Smith M, Jones M, Krzywinski M, Jones SJ. A collaborative filtering based approach to biomedical knowledge discovery. Bioinformatics. 2017 Sep 26.

I created the experimental design, did all the analysis and wrote the full initial draft. Dr. Jones came up with the concept of the project. Early versions of the research were undertaken by Martin Krzywinski, Maia Smith, Mike Gottlieb, Celia Siu, Santina Lin and Tahereh Rashnavadi. All authors contributed edits to the final manuscript.

The contents of Chapter 3 have been published as two separate papers listed below. Both papers were presented at BioNLP workshops and are published as part of the ACL anthology.
This anthology is made available through a Creative Commons 4.0 BY (Attribution) license which allows for reuse (https://aclanthology.coli.uni-saarland.de/faq).

Lever J, Jones SJ. VERSE: Event and relation extraction in the BioNLP 2016 Shared Task. In Proceedings of the 4th BioNLP Shared Task Workshop 2016 (pp. 42-49).

Lever J, Jones S. Painless Relation Extraction with Kindred. BioNLP 2017. 2017:176-83.

For both these works, I was the main researcher and developed all code and analysis. These works were written entirely by myself and supervised by Dr. Jones.

A version of Chapter 4 has been published on bioRxiv and will be submitted for publication in a journal. It is available with CC-BY 4.0 International license which allows sharing and adaptation.

Lever, Jake, et al. “CancerMine: A literature-mined resource for drivers, oncogenes and tumor suppressors in cancer.” bioRxiv (2018): 364406.

I was the primary researcher for this work. Dr. Martin R. Jones and I came up with the concept and Dr. Steven Jones supervised the work. Dr. Eric Zhao and Jasleen Grewal worked on the annotation of data for this work. I developed the methods, annotated data and led the writing efforts. All authors made edits to the manuscript.

A version of Chapter 5 will be submitted for publication as:

Lever J, Jones MR, Krysiak K, Danos A, Bonakdar M, Grewal J, Culibrk L, Griffith O, Griffith M, Jones SJM, Text-mining clinically relevant cancer biomarkers for curation into the CIViC database

I was the lead researcher for this work. I developed the concept and experimental design for the work. All authors contributed to the writing of the paper. The work was primarily supervised by Drs Obi Griffith, Malachi Griffith, and Steven J.M. Jones. CIViC is supported by the National Cancer Institute (NCI) of the National Institutes of Health (NIH) under award number U01CA209936 to O.L.G. (with M.G. and E.R.M. as co-investigators). M.G. was supported by the NHGRI under award number R00HG007940. O.L.G. was supported by the NCI under award number K22CA188163. The authors would like to thank Compute Canada for the computational infrastructure used.

The Introduction and Conclusion chapters are original work and have not been published or submitted for publication elsewhere.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1 Introduction
  1.1 Objective
  1.2 Background
    1.2.1 Biomedical text mining
    1.2.2 Information Retrieval
    1.2.3 Information Extraction
    1.2.4 Applications of Deep Learning
    1.2.5 Knowledge Bases and Knowledge Graphs
    1.2.6 Personalized Cancer Genomics
  1.3 Chapter Overviews
2 A collaborative filtering-based approach to biomedical knowledge discovery
  2.1 Introduction
  2.2 Materials and Methods
    2.2.1 Word List
    2.2.2 Positive Data
    2.2.3 Sampling and Negative Data
    2.2.4 SVD Method
    2.2.5 Evaluation
  2.3 Results
    2.3.1 Methods comparison
    2.3.2 Predictions over time
    2.3.3 Comparison of predictions between SVD and Arrowsmith methods
  2.4 Discussion
  2.5 Conclusions
3 Relation extraction with VERSE and Kindred
  3.1 Introduction
    3.1.1 VERSE
    3.1.2 Kindred
  3.2 VERSE Methods
    3.2.1 Pipeline
    3.2.2 Text processing
    3.2.3 Candidate generation
    3.2.4 Features
    3.2.5 Classification
    3.2.6 Filtering
    3.2.7 Evaluation
  3.3 Kindred Methods
    3.3.1 Package development
    3.3.2 Data Formats
    3.3.3 Parsing and Candidate Building
    3.3.4 Vectorization
    3.3.5 Classification
    3.3.6 Filtering
    3.3.7 Precision-recall tradeoff
    3.3.8 Parameter optimization
    3.3.9 Dependencies
    3.3.10 PubAnnotation integration
    3.3.11 PubTator integration
    3.3.12 BioNLP Shared Task integration
    3.3.13 API
  3.4 Results and discussion
    3.4.1 Datasets
    3.4.2 Cross-validated results
    3.4.3 Competition results
    3.4.4 Multi-sentence analysis
    3.4.5 Error propagation in events pipeline
    3.4.6 Kindred
  3.5 Conclusion
4 A literature-mined resource for drivers, oncogenes and tumor suppressors in cancer
  4.1 Introduction
  4.2 Methods
    4.2.1 Corpora Processing
    4.2.2 Entity recognition
    4.2.3 Sentence selection
    4.2.4 Annotation
    4.2.5 Relation extraction
    4.2.6 Web portal
    4.2.7 Resource comparisons
    4.2.8 CancerMine profiles and TCGA analysis
  4.3 Results
    4.3.1 Role of 3,775 unique genes catalogued in 426 cancer types
    4.3.2 60 novel putative tumor suppressors are published in literature each month
    4.3.3 Text mining provides voluminous complementary data to Cancer Gene Census
    4.3.4 CancerMine provides insights into cancer similarities
  4.4 Discussion
5 Text-mining clinically relevant cancer biomarkers for curation into the CIViC database
  5.1 Introduction
  5.2 Methods
    5.2.1 Corpora
    5.2.2 Term Lists
    5.2.3 Entity extraction
    5.2.4 Sentence selection
    5.2.5 Annotation Platform
    5.2.6 Annotation
    5.2.7 Relation extraction
    5.2.8 Evaluation
    5.2.9 Precision-recall Tradeoff
    5.2.10 Application to PubMed and PMCOA
    5.2.11 CIViC Matching
    5.2.12 User interface
  5.3 Results
    5.3.1 Use Cases
  5.4 Discussion
6 Conclusions
  6.1 Contributions
  6.2 Lessons Learnt
    6.2.1 Inaccessible and out-of-date results
    6.2.2 User Interfaces
  6.3 Limitations and Future Directions
  6.4 Final Words
Bibliography

List of Tables

2.1 Summary of methods for comparison.
2.2 Summary of performance for the initial steps for the ANNI and SVD algorithms.
2.3 Summary of performance for the different algorithms.
2.4 Thresholds used for different methods to select prediction set.
3.1 Overview of the various features that VERSE can use for classification
3.2 Parameters used for BB3 and SeeDev subtasks
3.3 Cross-validated results of BB3 event subtask using optimal parameters
3.4 Cross-validated results of SeeDev event subtask using optimal parameters
3.5 Averaged cross-validated F1-score results of GE4 event subtask with entities, relations and modifications trained separately
3.6 Cross-validated results (Fold1/Fold2) and final test set results for VERSE and Kindred predictions in Bacteria Biotope (BB3) event subtask with test set results for the top three performing tools: VERSE, TurkuNLP and LIMSI.
3.7 Cross-validated results (Fold1/Fold2) and final test set results for Kindred predictions in Seed Development (SeeDev) binary subtask with test set results for the top three performing tools: LitWay, UniMelb and VERSE.
3.8 Final reported results for GE4 subtask split into entity, relations and modifications results
5.1 The five groups of search terms used to identify sentences that potentially discussed the four evidence types.
Strings such as “sensitiv” are used to capture multiple words including “sensitive” and “sensitivity”.
5.2 Number of annotations in the training and test sets
5.3 The selected thresholds for each relation type with the high precision and lower recall trade-off.
5.4 Four example sentences for the four evidence types extracted by CIViCmine. The associated PubMed IDs are also shown for reference.

List of Figures

2.1 Violin plots of the different scores calculated using each method for the positive and negative test co-occurrences shown separately.
2.2 The methods evaluated using 1,000,000 co-occurrences extracted from publications after the year 2010, and 1,000,000 co-occurrences randomly generated as negative data.
2.3 The corresponding precision-recall curves show similar trade-offs for precision and recall for each method.
2.4 Evaluation of SVD predictions on test co-occurrences from publications further into the future using recall as the metric.
2.5 An UpSet plot showing the overlap in predictions made by the three most successful systems.
2.6 The methods evaluated using 1,000,000 abstract-level co-occurrences extracted from publications after the year 2010, and 1,000,000 abstract-level co-occurrences randomly generated as negative data.
2.7 The class balance in the dataset can affect the resulting classifier metrics making interpretation of score distributions challenging. The dataset has a class balance of 0.14% which is at the far left. Arrowsmith overtakes SVD at a class balance of ~5% which is an implausibly high class balance for a knowledge discovery dataset.
3.1 Overview of VERSE pipeline
3.2 Relation candidate generation for the example text which contains a single Lives_In relation (between bacteria and habitat). The bacteria entity is shown in bold and the habitat entities are underlined. Relation example generation creates pairs of entities that will be vectorised for classification. (a) shows all pairs matching without filtering for specific entity types (b) shows filtering for entity types of bacteria and habitat for a potential Lives_In relation
3.3 Dependency parsing of the shown sentence provides (a) the dependency graph of the full sentence which is then reduced to (b) the dependency path between the two multi-word terms. This is achieved by finding the subgraph which contains all entity nodes and the minimum number of additional nodes.
3.4 An example of a relation between two entities in the same sentence and the representations of the relation in four input/output formats that Kindred supports.
3.5 The precision-recall tradeoff when trained on the training set for the BB3 and SeeDev results and evaluating on the development set using different thresholds. The numbers shown on the plot are the thresholds.
3.6 An illustration of the greedy approach to selecting feature types for the BB3 dataset.
3.7 Analysis of performance on binary relations that cross sentence boundaries. The classifier was trained on the BB3 event training set and evaluated using the corresponding development set.
4.1 The supervised learning approach of CancerMine involves manual annotation by experts of sentences discussing cancer gene roles. Machine learning models are then trained and evaluated using this data set. (a) Manual text annotation of 1,500 randomly selected sentences containing genes and cancer types show a similar number of Oncogene and Tumor Suppressor annotations. (b) The inter-annotator agreement (measured using F1-score) was high between three expert annotators. (c) The precision recall curves show the trade-off of false positives versus false negatives. (d) Plotting the precision-recall data in relation to the threshold applied to the classifier’s decision function provides a way to select a high-precision threshold.
4.2 Overview of the cancer gene roles extracted from the complete corpora. (a) The counts of the three gene roles extracted. (b) and (c) show the most frequently extracted genes and cancer types in cancer gene roles. (d) The most frequent journal sources for cancer gene roles with the section of the paper highlighted by color. (e) illustrates a large number of cancer gene roles have only a single citation supporting it but that a large number (3917) have multiple citations.
4.3 Examination of the sources of the extracted cancer gene roles with publication date. (a) More cancer gene roles are extracted each year but the relative proportion of novel roles remains roughly the same. (b) Roles extracted from older papers tend to focus on oncogenes, but mentions of driver genes have become more frequent since 2010. (c) The full text article is becoming a more important source of text mined data. (d) Different sections of the paper, particularly the Introduction and Discussion parts, are key sources of mentions of cancer gene roles.
4.4 (a) Cancer gene roles first discussed many years ago have a longer time to accrue further mentions. (b) Some cancer gene roles grow substantially in discussion while others fade away. (c) CancerMine can further validate the dual roles that some genes play as oncogenes and tumor suppressors. Citation counts are shown in parentheses.
4.5 A comparison of CancerMine against resources that provide context for cancer genes. (a) The CancerMine resource contains substantially more cancer gene associations than the Cancer Gene Census resource. (b) Surprisingly few of the cancer gene associations are overlapping between the IntOGen resource and CancerMine. CancerMine overlaps substantially with the genes listed in the (c) TSGen and (d) ONGene resources.
4.6 CancerMine data allows the creation of profiles for different cancer types using the number of citations as a weighting for each gene role. (a) The similarities between the top 30 cancer types in CancerMine are shown through hierarchical clustering of cancer types and genes using weights from the top 30 cancer gene roles. (b) All samples in seven TCGA projects are analysed for likely loss-of-function mutations compared with the CancerMine tumor suppressor profiles and matched with the closest profile. Percentages shown in each cell are the proportion of samples labelled with each CancerMine profile that are from the different TCGA projects. Samples that match no tumor suppressor in these profiles or are ambiguous are assigned to none. The TCGA projects are breast cancer (BRCA), colorectal adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC), prostate adenocarcinoma (PRAD), low grade glioma (LGG), lung adenocarcinoma (LUAD) and stomach adenocarcinoma (STAD).
5.1 A screenshot of the annotation platform that allowed expert annotators to select the relation types for different candidate relations in all of the sentences. The example sentence shown would be tagged using “Predictive/Prognostic” as it describes a prognostic marker.
5.2 An overview of the annotation process. Sentences are identified from the literature that describe cancers, genes, variants and optionally drugs and then filtered using search terms. The first test phase tried complex annotation of biomarker and variants together but was unsuccessful. The annotation task was split into two separate tasks for biomarkers and variants separately. Each task had a test phase and then the main phase on the 800 sentences that were used to create the gold set.
5.3 The inter-annotator agreement for the main phase for 800 sentences, measured with F1-score, showed good agreement in the two sets of annotations for biomarkers (a) and (b) and very high agreement in the variant annotation task (c). The sentences from the multiple test phases are not included in these numbers and are discarded from the further analysis.
5.4 (a) The precision-recall curves illustrate the performance of the five relation extraction models built for the four evidence types and the associated variant prediction. (b) This same data can be visualised in terms of the threshold values on the logistic regression to select the appropriate value for high precision with reasonable recall.
5.5 A Shiny-based web interface allows for easy exploration of the CIViCmine biomarkers with filters and overview piecharts. A main table shows the list of biomarkers and links to a subsequent table showing the list of supporting sentences.
5.6 The entirety of PubMed and PubMed Central Open Access subset were processed to extract the four different evidence types shown.
5.7 An overview of the top 20 (a) genes, (b) cancer types, (c) drugs and (d) variants extracted as part of evidence items.
5.8 A comparison of the evidence items curated in CIViC and automatically extracted by CIViCmine by (a) exact biomarker information and by (b) paper.
List of Abbreviations

Short    Long
AMW      average minimum weight
AUPRC    area under the precision recall curve
BRCA     breast cancer
CGC      Cancer Gene Census
CIHR     Canadian Institutes of Health Research
CIViC    Clinical Interpretation of Variants in Cancer
COAD     colorectal adenocarcinoma
CRF      conditional random fields
CUID     concept unique identifier
DO       Disease Ontology
GBM      glioblastoma multiforme
GFP      green fluorescent protein
HCI      human-computer interaction
HMM      hidden Markov model
IE       information extraction
IR       information retrieval
KBC      knowledge base construction
LBD      literature-based discovery
LGG      low grade glioma
LIHC     liver hepatocellular carcinoma
LSA      latent semantic analysis
LSI      latent semantic indexing
LTC      linked term count
LUAD     lung adenocarcinoma
MeSH     Medical Subject Headings
MSFHR    Michael Smith Foundation for Health Research
NCI      National Cancer Institute
NER      named entity recognition
NIH      National Institutes of Health
NLM      National Library of Medicine
NLP      natural language processing
NSCLC    non-small-cell lung carcinoma
OpenIE   open information extraction
PMCOA    PubMed Central Open Access
PMID     PubMed identifier
POG      Personalized OncoGenomics
PPI      protein-protein interaction
PRAD     prostate adenocarcinoma
RBF      radial basis function
ROC      receiver operator characteristic
SeeDev   Seed Development
SQL      structured query language
STAD     stomach adenocarcinoma
SVD      singular value decomposition
SVM      support vector machine
TCGA     The Cancer Genome Atlas
TFIDF    term frequency–inverse document frequency
UMLS     unified medical language system
VERSE    Vancouver Event and Relation System for Extraction
WGS      whole genome sequencing

Acknowledgements

I would like to thank my Ph.D. supervisor, Dr. Steven J. M. Jones, who provided a wonderful environment to explore text mining and cancer genomics. He encouraged me to pursue countless opportunities to present my work and build collaborations. Through this, I learned a great deal about the practicalities of team science that I hope will benefit my career for years to come.
I am very thankful that I chose to do my graduate work withhim. Many thanks go to Sharon Ruschkowski and Louise Clarke for makingmy time in graduate school go so smoothly. I would also like to thank mysupervisory committee, Drs. Sohrab Shah, Art Cherkasov, and Inanc Birol,for their support and guidance during this research.I was very lucky to have the opportunity to conduct research alongside thelead researchers of the Personalized Oncogenomics (POG) program. Shar-ing an office with Drs. Yaoqing Shen, Erin Pleasance, Martin Jones andLaura Williamson was one of the key factors for my enjoyment of my en-tire Ph.D. Their friendship, anecdotes, and discussions created a wonderfulatmosphere. All the other members of the Jones lab, including Eric Zhao,Jasleen Grewal, Luka Culibrk, Celia Siu and Santina Lin, have been goodfriends whom I hope to stay connected with as our careers progress. They allcontributed invaluable work to this research. I am thankful for all the staffat the Genome Sciences Centre and involved in the Personalized Oncoge-nomics (POG) project for the incredible work that continues to be done.The other members of my Bioinformatics program cohort, including ShaunJackman and Sarah Perez, have been great friends who helped me settleeasily into the program and life in Canada.I offer Martin Krzywinski great thanks for our conversations and the fan-tastic opportunities to contribute to the Points of Significance column andseveral visualization projects. These were wonderful learning experiences incollaboration and communication. His leadership of the coffee club at theGenome Sciences Centre was also an essential component of my researchsuccess. I am also thankful for the many conversations with Dr. MorganxxiAcknowledgementsBye and Amir Zadeh.I would also like to thank Drs. Obi Griffith, Malachi Griffith, KilanninKrysiak, Arpad Danos and other members of the Griffith lab at WashingtonUniversity at St. 
Louis for all their support and also their hard work that provided data for this work. I would like to thank Dr. Ben Busby at the National Center for Biotechnology Information (NCBI) for his encouragement in my research and for welcoming me to Bethesda for a short research placement during my Ph.D. as a visiting bioinformatician.

Many thanks go to the various funding agencies that have supported this research. This includes a Vanier Canada Graduate Scholarship, a UBC Four Year Fellowship, a scholarship from the MSFHR/CIHR Bioinformatics training program and funding from the OpenMinTeD Horizon 2020 project. The BC Cancer Foundation supports the POG project which has further enabled this research. This work would not have been possible without the incredible online community of developers and researchers who were willing to answer questions, particularly at StackOverflow and RStudio.

Lastly, I would like to thank my family and my partner Chantal for the unending support that I have received.

Dedication

To my mother Ann, my late father David, my sister Sarah and my partner Chantal.

Chapter 1
Introduction

Who was the last person who fully understood all areas of biology and medicine? As the fields have grown, it has become impossible for one researcher or doctor to keep track of the latest research across such broad fields. The equivalent question is a popular topic of discussion amongst mathematicians, as there are several arguable candidates for the last mathematician who truly understood all the branches of the field in their time. Famous minds, like Euler or Gauss, are commonly cited. Some of the most remarkable work in recent mathematics, such as Andrew Wiles's proof of Fermat's Last Theorem (Wiles, 1995), has required the use of multiple branches of mathematics. Uniting knowledge from diverse areas of biology will become essential to solving applied medical problems (Altman, 2018; Council and others, 2014). Getting the right research to the right researchers is a major bottleneck.
The topics in the field range from the micro-scale of protein interactions and genetic modifications to the macro-scale of clinical trials and healthcare systems. An individual researcher needs to know more about different areas of biology in order to plan out an experiment and interpret the results. This problem is complicated by the increasing rate of publications in biomedicine (Lu, 2011). These problems necessitate automated text mining methods to help digest and disseminate research results.

The primary driving forces for text mining development are challenges in research communication, which are illustrated by three anecdotes. First, Gregor Mendel's seminal genetics work on pea plants was published in 1866 (Mendel and Tschermak, 1866). It lay dormant, acquiring a small number of citations over the following thirty years until being rediscovered in the early 20th century. This indicates the importance of the venue used for publication and clarity of language. Could a cure for an important disease have already been published but not been recognized by appropriate researchers? The second anecdote describes the fear that a huge problem in one field is equivalent to a solved problem in another field. Temple F. Smith and Michael S. Waterman, having published the now famous Smith-Waterman algorithm for local sequence alignment (Smith and Waterman, 1981), discovered the similar problem of aligning stratigraphic sequences in geology. This discovery came through serendipity, as the two researchers walked through a geology department and saw a research poster visualizing the alignment problem (Smith, 2015), and not because either of them was involved in geology research. They were quickly able to publish a paper using a similar algorithm for this problem (Smith and Waterman, 1980). The third anecdote describes a case where valuable information is mentioned as a small section of a paper.
As it is not a key result of the paper, it is not mentioned in the abstract and is overlooked by most researchers. The discovery and understanding of green fluorescent protein (GFP) is an example of this phenomenon. The bioluminescent properties of a protein are discussed only as a footnote in the paper describing the purification of the aequorin protein from the Aequorea jellyfish (Shimomura et al., 1962). These three anecdotes are unlikely to be isolated incidents. There will be numerous further cases of “undiscovered public knowledge” (a term popularised by Dr. Don Swanson (Swanson, 1986b)) where the solution to a research or clinical question already exists within published literature.

Text mining research and the larger natural language processing (NLP) research area use computers to understand human-written text and provide new ways for humans to interact with digital media. Text mining processes large corpora of text for particular types of knowledge, and can both direct users towards relevant knowledge and structure that knowledge for easy assimilation. Researchers should use text mining methods in their everyday work to collate relevant knowledge and stay up-to-date. The goal of this work is to understand the problems impeding this goal and solve several of them.

One area in which this need to combine knowledge from the macro to the micro is especially pressing is personalized cancer treatment, also known as precision oncology. This approach aims to use genome sequencing data of individual patient tumors to guide clinical decision making. By identifying the genetic mistakes that are causing the uncontrolled cell growth of a tumor and integrating knowledge from across biology, clinicians will be able to understand the reasons behind a cancer's development and hopefully find weaknesses that can be targeted.
The knowledge for this is contained within basic biological research studies of protein function and cell biology, larger sequencing studies and statistical analysis, clinical trial data, clinical guidelines and pharmacological recommendations and many other sources of knowledge. Most of this knowledge is contained within published academic literature which is indexed in PubMed.

This thesis develops and carefully evaluates approaches to applying text mining technology for extracting and inferring biomedical knowledge from published literature. We turn these methods to problems faced in personalized cancer treatment research in order to create valuable knowledge bases that condense tens of thousands of papers for easier survey. This combined work moves us closer to a research world where scientists work with text mining tools in order to keep up-to-date with distilled knowledge relevant to their research.

1.1 Objective

The overall objective of this thesis is to develop generalizable methods for extracting and inferring knowledge directly from published biomedical literature that will provide a lasting benefit to both the text mining and larger bioinformatics community. This work will move us one step closer to a world in which researchers use text mining tools and results in their everyday research. The subgoals of the thesis are (1) to explore methods for identifying relations between biomedical concepts (e.g. drugs, genes and diseases) and (2) to apply these approaches to build knowledge bases relevant to precision cancer medicine.

1.2 Background

The following sections will outline the current status of research into biomedical text mining and the relevant problems faced in personalized cancer medicine.

1.2.1 Biomedical text mining

Text mining is the application of informatics to process text documents to retrieve or extract information (Ananiadou and Mcnaught, 2006).
In the biomedical field, this can focus on text from published literature, electronic health records, clinical guidelines and any other text source that contains knowledge about medicine or biology. The field broadly focuses on two main applications: information retrieval (IR) for identifying relevant documents and information extraction (IE) for extracting relevant knowledge in a structured fashion.

In order to extract and structure knowledge from published literature on a large scale, computers must be able to process the raw text. However, computers are designed to deal with numerical data. Text data does not translate well into a form for easy computation. It is stored as a list of characters, encoded as ASCII, Unicode or another encoding. Computers cannot glean any level of understanding from the raw bytes of a sentence. Various steps need to happen in order to build structure from this raw data. These approaches are common in all natural language processing (NLP) solutions and are not specific to the biomedical domain.

The first challenge is generally to split the series of characters into sentences. In most writing, a period is a good predictor of the end of a sentence and rules can be used to catch exceptions. Exceptions include acronyms and titles (such as U.K. and Dr.). These sentences are then split into tokens, generally individual words, which can be treated as independent elements. These tokens can be further processed to identify the part-of-speech (e.g. noun), to strip suffixes through stemming (e.g. -ing) or to reduce words to a base form through lemmatization (e.g. plural to singular). These further steps depend on the downstream analysis to be performed on the data. Statistical systems have been built that combine these steps, such as the Stanford CoreNLP parser (Manning et al., 2014). These parsers can then identify substructures within sentences such as noun or verb phrases.
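The sentence splitting and tokenization steps described above can be sketched in a few lines. This is a minimal rule-based illustration and not the approach taken by CoreNLP; the abbreviation list and the function names `split_sentences` and `tokenize` are assumptions made for the example.

```python
import re

# Hypothetical exception list: chunks whose trailing period does not end a sentence.
ABBREVIATIONS = {"dr.", "u.k.", "e.g.", "i.e."}

def split_sentences(text):
    """Treat a period, question mark or exclamation mark as a sentence end
    unless the whitespace-delimited chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk.endswith((".", "?", "!")) and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens, keeping known abbreviations whole."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

text = "Dr. Smith moved to the U.K. in 2009. She studies EGFR."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A rule like this fails when a sentence genuinely ends with an abbreviation, which is one reason statistical systems are generally preferred in practice.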
Furthermore, additional structure such as dependency parses can build information about how different tokens within a sentence relate to each other (e.g. a noun may be a subject of a verb).

1.2.2 Information Retrieval

Information retrieval (IR) is the task of finding and prioritizing relevant documents for a particular search task. Researchers use these methods daily by searching for academic papers using tools such as PubMed and Google Scholar. Advances aim to improve the relevance of search results. A single researcher has a practical limit on the number of papers that they can read in one year, so it is of paramount concern how they select those papers. IR methods can be used to search other text corpora such as clinical guidelines, but the largest research focus is on academic paper retrieval.

Biomedical IR work has benefitted from the approaches developed for web search. Most methods require a set of keywords as input and then return a prioritized list of papers. Older web search tools, such as AltaVista, used direct keyword matching and simple heuristics based on keyword frequency (Lawrence and Giles, 2000). Search tools encouraged website developers to add hidden metadata into the header information of a web page. Both these methods placed significant trust in the content developers to provide relevant information. Google's PageRank method dramatically changed how search results were prioritized (Brin and Page, 1998). By treating the web links as a graph, they could model the “importance” of a website by the number of websites linking to it.

In the biomedical domain, similar challenges existed for search. Many journals required (and still require) authors to provide keywords for their paper. These keywords could be used to help index and search papers but were not associated with a standardized ontology. This created inconsistency. In order to solve this, the National Library of Medicine (NLM) developed the Medical Subject Headings ontology (MeSH).
Highly skilled annotators use it to manually annotate all citations in NLM's PubMed indexing service. With this information, PubMed's search can return highly relevant papers for a provided topic. Advanced search functionality allows control of the journals to search, years, authors and many other factors. For a long time, their search facility returned results in reverse chronological order. Recent advances have introduced a relevance ranking method that uses different factors including publication type, year and data on how the search term matches the document (Fiorini et al., 2017). A PageRank-like approach is more challenging in academic literature as the only links between papers are citations, which are not truly analogous to links between web pages. PubMed recently implemented a relevance ranking system that combines various data types to improve the relevance of the search results (Fiorini et al., 2018).

A similar IR problem to search is the identification of similar documents. In this case, the input is a current document (either published or free text) and the output is a prioritized list of published works that are similar. This document similarity metric is a feature of PubMed through their “Similar articles” option. One solution to this problem uses ideas from document clustering. The basic concept is to group documents that discuss similar topics. The route to extract the topics discussed in a paper can be quite varied. For biomedical abstracts, the associated MeSH terms provide a rich, high-quality and manually curated resource to allow for document clustering. Simple overlap metrics based on MeSH terms can provide good quality results for similar document classification (Zhu et al., 2009). Alternatively, the textual content of the document can be interrogated directly. The simplest method groups documents that share similar words. This data is often very sparse as the English vocabulary is very large and similar ideas can be expressed using very different words.
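The simplest method above, grouping documents that share similar words, can be sketched by counting the words in each document and comparing the counts. Cosine similarity is used here as one common comparison; the function names and the three toy documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count each alphabetic word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents, invented for this example.
d1 = word_counts("EGFR mutations in lung cancer")
d2 = word_counts("Lung cancer and EGFR inhibitors")
d3 = word_counts("Seed development in plants")

print(cosine_similarity(d1, d2))  # shares three words with d1
print(cosine_similarity(d1, d3))  # shares only the word "in"
```

Storing counts in a `Counter` rather than a full-length vector reflects the sparsity noted above: only words that actually occur in a document take up space.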
Preprocessing methods that standardize case, turn plurals into singulars and apply other such steps can reduce the sparsity. Term normalization can be used to group together different synonyms that describe the same term. Each document is represented by a numeric vector. This is either counts of associated metadata terms or counts of words within the document. This numeric count data is known as a “bag-of-words”. To find similar documents, these vectors are then compared, often with Euclidean distance or cosine distance.

A popular document clustering method designed for this problem is Latent Semantic Analysis (LSA) (Deerwester et al., 1990), which treats document clustering as an unsupervised learning problem, specifically as a learning-by-compression problem. It transforms the text documents into a word frequency matrix where documents are along one axis and each word in the vocabulary is along the other axis. Every occurrence of a word i in a document j increments the value of x_ij. Hence most of the matrix will be zero. It uses low-rank singular value decomposition (SVD) to compress the sparse data into a small dense space where similar topics will be represented by similar latent variables. One other way to detect similarity between papers is by looking at the similarity of their citations. Papers that cite similar papers, or at least papers with similar topics themselves, likely have similar topics. However, citation networks are challenging to build due to paper and author ambiguity and duplicates (Carpenter and Thatcher, 2014).

Document classification can be invaluable for problems in information retrieval. It uses the content and potentially metadata of a document to predict the specific topic of the document. Similar to document clustering methods, it uses word frequencies within the document represented as sparse count vectors. However, as a supervised method, it requires sample documents that have been annotated with specific classes (e.g.
the topic of the paper, or whether the document is of interest to the researcher). A traditional binary classifier then attempts to identify the words that make the most accurate predictions. In the biomedical space, there is particular interest in predicting the MeSH terms for a paper to assist in the laborious task undertaken by the National Library of Medicine to annotate all biomedical abstracts with terms from the MeSH ontology. Given the huge number of existing annotated abstracts as training data, several methods have been developed for this task as part of a regular competition, BioASQ (Tsatsaronis et al., 2015).

1.2.3 Information Extraction

Information extraction (IE) methods identify structured information from a span of text, an entire document or even a large corpus of documents. This allows text to be transformed into a standardized format that can be easily searched, queried and processed by other algorithms. These methods are valuable in the biomedical field for extracting knowledge from published literature, automating the analysis of electronic medical records and many other applications. There are three main problems that information extraction methods try to solve: coreference resolution, named entity recognition and relation extraction.

Coreference resolution addresses the problem of anaphora. Pronouns and non-specific terms are frequently used to refer back to entities named in previous sentences (e.g. “he was first prescribed the drug in 2007”). Coreference resolution attempts to link these terms to the original mention of the entity. This can be challenging as there can be many candidate coreferences for a single pronoun in a sentence. For example, the word “it” in a sentence could refer to any of the previous objects mentioned in a document. A naive approach would simply use the most recent noun but this is often wrong.
Context must be used to infer which coreferences are most likely (Soon et al., 2001). Furthermore, by processing all coreference decisions at the same time, more optimal solutions can be found that don't create contradictions where the same person is both the subject and object of an inconsistent action (e.g. “she passed her the newspaper”) (Clark and Manning, 2015).

Named entity recognition (NER) identifies mentions of specific entities such as genes and drugs. Basic approaches can use exact string matching with a list of entity names (e.g. synonyms of genes provided by the UMLS metathesaurus (Bodenreider, 2004)). NER methods can make use of context within a sentence to predict tokens that would likely be a certain entity type. For instance, a token that comes before “expression” and is all capitals, as in “EGFR expression”, is likely a gene. NER methods often make use of approaches based on Hidden Markov Models (HMM) or Conditional Random Fields (CRF). These are finite-state based methods that can assign labels to tokens in a sequence provided a set of training data. Exact string matching can provide very high recall but with lower precision due to high levels of ambiguity for frequently used English words (e.g. “ICE” is a gene name, but is frequently “ice” in non-gene contexts). HMM/CRF methods will provide better precision as they can take the context into account, but they require a good training set for the associated entity type. Entity normalization approaches take a tagged entity in a sentence and connect it back to an ontology using the context and a set of synonyms associated with each ontology item. Successful NER tools include BANNER (Leaman and Gonzalez, 2008) for many entity types, DNorm (Leaman et al., 2013) for diseases and tmChem (Leaman et al., 2015) for chemicals.

Relation extraction predicts whether a relation exists between two or more entities, given text in which these entities appear.
These methods may also try to differentiate the type of relationship between these terms (e.g. whether a drug treats or causes a disease). The most basic approach to identify whether a relationship exists between two entities is the use of co-occurrences. At its most basic, this method states that a relation exists between entities if they ever appear together within the same span of text. The text length can vary depending on the application, but sentences and abstracts are common. This binary decision will lead to very high recall of relations but also likely a high false positive rate.

Alternative metrics go beyond the simple binary decision of whether a co-occurrence ever appears. Intuitively, two terms that appear together in many sentences are more likely to be part of a relationship. When taken across a large corpus of documents, e.g. all publications in a journal or even all accessible biomedical literature, the frequency of co-occurrences can be very high. However, for a single document, these methods may not be applicable. A threshold can be used to cut off co-occurrences that appear too infrequently. These infrequent co-occurrences may be false positives. However, a small number may be valuable information that is simply not commonly discussed.

Co-occurrences will be affected by the frequency of the individual terms. Frequently mentioned terms, such as “breast cancer”, will have higher co-occurrence numbers than rarely discussed terms such as “ghost cell carcinoma”. Hence a normalization approach that takes into account the background frequency of individual terms can help identify spurious co-occurrences driven by the fact that one or the other term occurs a lot. “Breast cancer” appears in many papers and so is more likely to co-occur with other terms. By taking the frequency of the words “breast cancer” into account, we can reduce the false positives. At the same time, we can put greater importance on the few co-occurrences of terms with “ghost cell carcinoma”.
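The effect of normalizing by background frequency can be illustrated with a small sketch. Pointwise mutual information is used here purely as one example of such a normalization and is an assumption of this sketch rather than a method taken from the text; the tagged sentences are invented toy data.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_sentences):
    """Count, per sentence, each term once and each unordered pair of terms once."""
    term_counts, pair_counts = Counter(), Counter()
    for terms in tagged_sentences:
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts

def pmi(term_a, term_b, term_counts, pair_counts, n_sentences):
    """Pointwise mutual information: log of observed over expected co-occurrence rate.
    Assumes the pair has been observed at least once."""
    pair = tuple(sorted((term_a, term_b)))
    p_pair = pair_counts[pair] / n_sentences
    p_a = term_counts[term_a] / n_sentences
    p_b = term_counts[term_b] / n_sentences
    return math.log(p_pair / (p_a * p_b))

# Toy corpus: each inner list holds the entity mentions found in one sentence.
sentences = [
    ["breast cancer", "BRCA1"],
    ["breast cancer", "TP53"],
    ["breast cancer", "EGFR"],
    ["breast cancer"],
    ["ghost cell carcinoma", "CTNNB1"],
    ["EGFR"],
]
terms, pairs = cooccurrence_counts(sentences)
n = len(sentences)
print(pmi("breast cancer", "EGFR", terms, pairs, n))
print(pmi("ghost cell carcinoma", "CTNNB1", terms, pairs, n))
```

Both pairs co-occur exactly once, yet the pair involving the rarely mentioned term receives the higher score because its single co-occurrence is less expected by chance.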
This concept is used in the term frequency–inverse document frequency (TF-IDF) approach to normalization. Term frequency is the count of a term, and inverse document frequency normalizes that count by how common the term is across documents in general.

The power of co-occurrences really comes from aggregated information across a large corpus. For individual documents, more advanced relation extraction methods can be used. These can take the form of supervised approaches (which require substantial example text data), semi-supervised approaches (which require less example data that is easier to acquire) or unsupervised approaches (which use no prior knowledge).

Supervised learning approaches to relation extraction involve a training set of text with annotated entities and relations. The general goal is to transform the text and annotations into a form amenable to traditional classification methods. A common method is to vectorize the candidate relation within a sentence so that it is represented by a numerical (often sparse and very large) vector that can be fed into a standard binary classifier (e.g. logistic regression or support vector machine). These methods use bag-of-words approaches similar to the document clustering discussed previously. This transforms the sentence into a vector representation of word counts. Bi-grams, tri-grams (or n-grams to generalize) capture neighboring two, three, or more words. They can also transform subsections of the sentence, e.g. the clause that contains the relation, or a window of words around each entity. The entity types can also be represented with one-hot vectors (where the vector is as long as the number of entity types with a value of one at the location corresponding to the entity type and zeroes elsewhere).
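The vectorization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the entity type list, the function names and the two training sentences are hypothetical, and a real system would use many more features.

```python
ENTITY_TYPES = ["drug", "gene", "disease"]  # hypothetical set of entity types

def ngrams(tokens, n):
    """All runs of n neighbouring tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(token_lists):
    """Map every unigram and bigram seen in training to a vector index."""
    features = sorted({f for tokens in token_lists for f in tokens + ngrams(tokens, 2)})
    return {feature: index for index, feature in enumerate(features)}

def one_hot(entity_type):
    """Vector as long as the number of entity types, with a single one."""
    return [1 if t == entity_type else 0 for t in ENTITY_TYPES]

def vectorize(tokens, type_1, type_2, vocabulary):
    """Unigram and bigram counts followed by one-hot encodings of both entity types."""
    counts = [0] * len(vocabulary)
    for feature in tokens + ngrams(tokens, 2):
        if feature in vocabulary:  # features unseen in training are ignored
            counts[vocabulary[feature]] += 1
    return counts + one_hot(type_1) + one_hot(type_2)

training = [
    ["erlotinib", "inhibits", "EGFR"],
    ["EGFR", "drives", "lung", "cancer"],
]
vocabulary = build_vocabulary(training)
vector = vectorize(training[0], "drug", "gene", vocabulary)
print(len(vector), vector)
```

Even in this toy case most entries of the vector are zero, which is why sparse representations are normally used once the vocabulary grows.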
These methods produce very sparse and large vectors and often p >> n, where p is the number of features and n is the number of examples used for training. These vectors can then be processed by classifiers such as logistic regression, support vector machines or random forests.

Support vector machines (SVMs) offer an alternative method that avoids vectorizing the relations. A support vector machine attempts to find the hyperplane that separates the training examples. However, the power of SVMs really comes down to the “kernel trick”, which allows SVMs to be solved by using comparisons between training examples instead of vectorizing them and placing them in N-dimensional space. A kernel is simply a similarity function that takes in two examples and returns a similarity value. Without a complex kernel, an SVM is known as a linear SVM and behaves very similarly to logistic regression. Popular kernels include polynomial functions and radial basis functions (RBF). These kernels implicitly transform the example data into another space where a separating hyperplane is easier to find. For text mining purposes, support vector machines are valuable for the ability of kernel functions to accept example data which aren't numerical vectors. A string comparison kernel can accept two text strings as input and output a similarity measure based on metrics such as Hamming distance or edit distance. This means that a classifier can be built using a similarity measure and no vectorization is required. Furthermore, support vector machines do not require each input example to be compared with every single training example. The SVM identifies the training examples (known as the support vectors) that can be used to define the separating hyperplane.
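The role of the kernel and of the support vectors can be made concrete with the standard form of the kernel SVM decision function. The notation below is introduced only for this illustration: S is the set of support vectors, y_i in {-1, +1} are their labels, alpha_i and b are learned parameters, and K is the chosen kernel.

```latex
% Decision function: a new example x is only compared with the support vectors
f(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i \, y_i \, K(x_i, x) + b \right)

% Two popular kernels for numerical vectors
K_{\text{poly}}(x, x') = (x \cdot x' + c)^d
\qquad
K_{\text{RBF}}(x, x') = \exp\left( -\gamma \, \lVert x - x' \rVert^2 \right)
```

Because x enters the decision function only through K, any similarity function over the raw examples, including one defined on strings, can take the place of an explicit vector representation.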
When applied to test data, comparisons are only needed against these “support vector” examples, which allows for a high-performance classifier.

Dependency parsing provides information about the basic relations between words, such as the subjects and objects of a verb and the modifiers that apply to a noun. When these parsers were developed, relation extraction methods quickly began to make use of the information. Bunescu and Mooney specifically argue that the main information about the relationship is contained within the dependency path, which is the shortest path between two entities within the dependency parse tree (Bunescu and Mooney, 2005). Kernels that use this information, such as the dependency path kernel, allow comparison of the dependency paths instead of the full sentences. These use a simple similarity metric based on the number of shared words, parts of speech, and entity types at each place within the two dependency paths being compared.

Deep learning methods have made great headway into non-biomedical information extraction problems, with the main computational linguistics research venues being dominated by deep learning methods. These methods exploit the concept of distributional semantics. This is the idea that individual words can be represented as numerical vectors where similar words will have similar vectors. The bag-of-words approach to word representation does not fit this idea, as each word is represented by a one-hot vector which is as wide as the vocabulary and only has a single one. Each word is therefore orthogonal to all other words in the vocabulary. Deep learning techniques depend on large amounts of annotated data as the model complexity of deep learning is very high and methods are liable to overfit. Due to lack of data, deep learning has had a hard time gaining traction in biomedical text mining research.

Event extraction is a special type of relation extraction, sometimes denoted as complex relation extraction.
It extracts events described in a sentence which may involve multiple relations. Some of these relations have other relations as arguments instead of entities. There are three relations in this example sentence: “upregulation of one gene decreases phosphorylation of another protein”. The upregulation would be one relation, the phosphorylation would be the second relation, and the decrease would be a compound relation connecting the other two relations. Event extraction has been the focus of several shared tasks such as GENIA (Kim et al., 2003). The standard approach involves breaking the task down into a series of binary relation extractions which can be built up into a full event (Björne and Salakoski, 2015).

When fully annotated training data is not available, there are two possible options. Semi-supervised methods use partially annotated data or so-called silver-annotated data. This silver-annotated data is generated using a procedure known as distant supervision (Mintz et al., 2009). When no annotations exist, existing knowledge bases which contain some relevant relations can be used to automatically annotate sentences. For instance, if erlotinib is known to inhibit EGFR, then all sentences which contain both terms could be annotated with this relation. This will produce a large number of false positive annotations. But if there are enough “seed facts” in the knowledge base, a well-trained classifier may be able to identify the key patterns that link all the sentences and reduce the false positive rate. A fully unsupervised method based on clustering can also be used to group potential relations that look similar. Percha et al. grouped relations based on their dependency path and then used a distant-supervision-like approach to tag different relation clusters (Percha et al., 2018).

All of these relation extraction methods will annotate a span of text with the location of the relationship and the entities associated with it.
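The distant supervision procedure described above can be sketched in a few lines. This is a simplified illustration: the seed facts dictionary, the function name and the example sentences are hypothetical, and substring matching stands in for proper entity recognition.

```python
# Hypothetical seed facts taken from an existing knowledge base.
SEED_FACTS = {("erlotinib", "EGFR"): "inhibits"}

def distant_label(sentences, seed_facts):
    """Annotate every sentence that mentions both entities of a seed fact
    with that fact's relation, producing silver-annotated training data."""
    labelled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for (entity_1, entity_2), relation in seed_facts.items():
            if entity_1.lower() in lowered and entity_2.lower() in lowered:
                labelled.append((sentence, entity_1, entity_2, relation))
    return labelled

sentences = [
    "Erlotinib inhibits EGFR signalling.",
    "EGFR status was unknown when erlotinib was stopped.",  # false positive
    "Gefitinib targets EGFR.",                              # no seed fact applies
]
for example in distant_label(sentences, SEED_FACTS):
    print(example)
```

The second sentence is labelled as an inhibition statement even though it does not describe one, which illustrates the false positive annotations that a downstream classifier has to tolerate.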
Depending on the application, these annotated documents could then be presented to the user, or the relations could be aggregated to allow easier searching.

In order to drive research in relation extraction and other areas of biomedical information extraction, there are regular shared tasks organized by the research community. These are competitions where one group releases an annotated training set for other groups to build machine learning systems for. A held-out test set is then used to evaluate the competing algorithms. These competitions have included the BioNLP Shared Tasks (Kim et al., 2011; Kim et al., 2009), BioCreative tasks (Hirschman et al., 2005) and many others. They provide a good measure of the latest algorithms in the field. They are especially valuable as biomedical information extraction is hampered by small annotation sets (compared to non-biomedical domains). Biomedical annotation often requires expert level knowledge and can be difficult to organize. These events encourage the development of methods that can work with few examples.

The documents for these shared tasks are often based on PubMed abstracts and full-text articles from PubMed Central. These resources are often used for text mining because they are the easiest to access, and ease of access is a common limiting factor in biomedical text mining. In contrast, it is very difficult to get access to a large corpus of electronic health records, which limits the research opportunities in this area. In biomedicine, abstracts are easily accessible through PubMed and can be downloaded in bulk through the NCBI's FTP service. However, full-text articles are often challenging to access. The PubMed Central Open Access Subset provides full text in XML format for over a million articles. This is, however, a fraction of the publications in PubMed. Other researchers have tried mass downloading of the PDFs of published literature.
Publishers often limit this in their terms of use contracts and have been known to limit access to their resources for entire universities to encourage individual researchers to desist from mass downloading (Bohannon, 2016). Even with a large set of PDFs, the conversion to processible text is incredibly challenging. PDF is a format designed for standardized viewing and printing across platforms and is not structured for easy extraction of text. Many tools have been developed to try to make this task easier (Ramakrishnan et al., 2012). But with different journal formats, even simple tasks such as linking paragraphs across pages and removing page numbers are challenging.

1.2.4 Applications of Deep Learning

Deep learning methods have exploded in popularity in recent years and have been broadly applied in many fields including computer vision and speech recognition (LeCun et al., 2015). Deep learning involves multilayer and often complex structured neural networks. The backpropagation method, which is used to fit the underlying parameters that control when these artificial neurons fire, has been around for several decades (Rumelhart et al., 1986). However, only with the vast aggregation of data in the last decade has the performance of these networks eclipsed other classification methods. This dependence on data is the reason that this thesis does not focus on deep learning methods. For high quality results, a very large set of annotated data is required. Co-occurrence data provides very noisy data that can easily be overfit with very complex models. Furthermore, biomedical relation data sets are normally counted in the hundreds, perhaps thousands, of annotations, which is several orders of magnitude lower than is needed to fully see the benefit of deep learning.
Finally, deep learning is also computationally costly, which makes it challenging to create a high-quality knowledge base that can be quickly updated.

1.2.5 Knowledge Bases and Knowledge Graphs

Information extraction methods provide a means to extract relations between different entities. By applying these methods to a well-defined problem and using large biomedical text as the input corpus, a variety of knowledge bases have been constructed. These include the STRING database, which uses co-occurrence methods to identify likely protein-protein interactions (Szklarczyk et al., 2014). The PubTator resource provides automatically annotated PubMed abstracts which are valuable for advanced searching and further text mining efforts (Wei et al., 2013b). An example of information extraction for a very specific domain is the miRTex database, which collates information on microRNA targets (Li et al., 2015).

The relations within knowledge bases are often represented as triples. These triples are two entities and the relation that connects them. The set of triples can, therefore, be viewed as a directed graph where vertices are entities and directed labeled edges are relations. Knowledge bases that contain triples can then be queried using SPARQL (Prud’hommeaux and Seaborne, 2006). This is a database query language based on the structured query language (SQL) format used in normal relational databases. The key improvements of SPARQL are the ease with which multiple databases (known as endpoints) can be queried and diverse data sets connected together (assuming they can be linked by appropriate unique identifiers).

A growing area of research is inference on knowledge bases. This can involve asking questions of the knowledge base by traversing it (Athenikos and Han, 2010). It can also involve making predictions of additions to the knowledge base, particularly new edges.
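The view of a triple set as a directed, labeled graph, and the idea of answering a question by traversing it, can be illustrated with a minimal Python sketch. The entity and relation names below are hypothetical placeholders and are not taken from any real knowledge base; the helper names `build_graph` and `reachable` are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object) for illustration only.
triples = [
    ("drug_X", "targets", "gene_A"),
    ("gene_A", "interacts_with", "gene_B"),
    ("gene_B", "associated_with", "disease_Y"),
]


def build_graph(triples):
    """Map each subject to its outgoing (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def reachable(graph, start):
    """Return every entity reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        # .get avoids creating empty entries for entities with no outgoing edges
        for _relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


graph = build_graph(triples)
print(sorted(reachable(graph, "drug_X")))
```

A real knowledge base would normally be queried through a SPARQL endpoint rather than an in-memory dictionary; the sketch only shows the underlying graph structure.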
Most knowledge inference work has focussed on non-biomedical knowledge graphs such as Freebase (Bollacker et al., 2008). The TransE (Bordes et al., 2013) and RESCAL (Nickel et al., 2012) methods focussed on the problem of knowledge base completion (KBC) where there are known to be edges missing. By using different latent-based approaches, they are able to prioritize missing edges. Several knowledge graphs have been built for biomedical knowledge either through manual curation or automated methods. The WikiData knowledge graph is the structured data backend for all of Wikipedia (Vrandečić and Krötzsch, 2014). It contains a large amount of biological data that is mostly manually curated (Burgstaller-Muehlbacher et al., 2016) and provides a SPARQL endpoint for querying. Other knowledge graphs include KnowLife (Ernst et al., 2014) and GNBR (Percha et al., 2018), which are extracted from text.

1.2.6 Personalized Cancer Genomics

Cancer is a disease of uncontrolled cell growth caused by genomic abnormalities. These abnormalities include small point mutations, copy number variation, structural rearrangements, and epigenetic changes. These affect regulation of growth signaling, control of apoptosis, angiogenesis and many other factors that together are known as the hallmarks of cancer (Hanahan and Weinberg, 2000). These abnormalities can be caused by exogenous mutagens such as smoking or UV radiation, or endogenous mutagens such as oxidation and deamination. Certain chemotherapies can also be mutagenic as damaging DNA can prove lethal to the fast-dividing tumor cells.

With the advances in sequencing technology, genomic interrogation of cancers has become commonplace. These investigations are confounded by the driver/passenger mutation paradigm, which states that only a small fraction of genomic abnormalities are actually involved in the development of a cancer.
These abnormalities (known as drivers) can inactivate key protective genes, or overactivate other genes that normally require careful regulation. The other abnormalities (known as passengers) do not have an oncogenic effect and have “come along for the ride” (Haber and Settleman, 2007).

The goal of personalized (or precision) medicine is to provide a treatment plan that is tailored to an individual patient. This idea holds great promise in cancer treatment as every patient’s cancer is different. No two cancers contain the exact same set of genomic abnormalities. By sequencing an individual tumor, researchers hope to identify which genomic aberrations are driver events to understand which pathways are essential to the growth of a cancer. Using this information, combined with knowledge of pharmacogenomics, individualized treatments can be identified.

The Personalized Oncogenomics (POG) project, based at the BC Cancer Agency, began in 2008. Through whole genome sequencing (WGS) and transcriptome sequencing (RNAseq), the genome and transcriptome are analyzed. Over time, the costs of sequencing have reduced dramatically (Weymann et al., 2017). However, the cost of informatics and genome interpretation has remained stable. This is mostly due to the laborious and manual steps involved in understanding the relevance of important genomic abnormalities within the sequencing data.

There are limited databases that provide some context on whether a likely mutation is a driver or passenger (Forbes et al., 2014) and how to clinically interpret variants (Tamborero et al., 2018). Much of this data is derived from the genomic survey provided by the Cancer Genome Atlas project (Weinstein et al., 2013). This means that analysts must search the vast biomedical literature to understand the latest research for many genes and variants.
This area would benefit greatly from the development of new text mining approaches and resources to collate information on the relevance of genes and variants to different cancer types.

1.3 Chapter Overviews

In Chapter 2, we begin by exploring the power of co-occurrences between biomedical terms within sentences. We propose a method for building knowledge graphs using co-occurrences and inferring new knowledge that will likely appear in future publications. With the recent development of recommendation systems, we were inspired to assess a matrix decomposition method against the leading methods in the field. By building a dataset of biomedical co-occurrences from the PubMed and PubMed Central Open Access datasets, we are able to construct a knowledge graph using publications up to the year 2010. A test set is then constructed using publications after 2010 and different prediction methods are compared against it. A comparison of our matrix decomposition method with the other leading solutions to this knowledge inference problem shows that our approach gives dramatically improved performance and provides a step towards automated hypothesis generation for biologists.

Chapter 3 moves past co-occurrences as the method for extracting knowledge and towards full relation extraction based on a supervised learning approach. As part of the BioNLP 2016 Shared Task, we developed a generalizable relation extraction method that builds features from the sentence containing a candidate relation and uses support vector machines. We build upon the leading work that has shown the power of vectorized dependency-path-based methods. This tool, known as VERSE, went on to win the Bacteria Biotope subtask, came third in the Seed Development subtask and outperformed deep learning based methods.
The chapter includes our further development of generalizable relation extraction tools with the Kindred Python package that integrates with many other biomedical text mining platforms including PubTator (Wei et al., 2013b) and PubAnnotation (Kim and Wang, 2012).

Chapter 4 begins to look at applying the information extraction methods to problems faced in personalized cancer treatment. In order to automate the analysis of individual patient tumors, a knowledge base of known drivers, oncogenes, and tumor suppressors is essential. In order to understand the purpose of a particular genomic aberration, the role of the associated gene must be known for the cancer. Unfortunately, this has previously required manual searching of literature. In this chapter, we describe the development of the CancerMine resource using a supervised learning approach. We hypothesized that the necessary information for drivers, oncogenes and tumor suppressors would be contained within single sentences and that our previously developed methods could be used to extract this information en masse from published literature. To this end, a team of annotators has curated a set of sentences related to the roles of different genes in cancer. By using the methods developed in Chapter 3, we build a machine learning pipeline that can efficiently process the entire biomedical literature and extract cancer gene roles. This data is kept up-to-date and is available to the precision cancer community for easy searching. This data can be integrated into precision oncology pipelines to flag genomic aberrations that are within relevant genes for that cancer type. The annotated set of sentences is also available to the text mining community as a dataset on which to evaluate future relation extraction methods.

Chapter 5 advances our knowledge of clinically relevant biomarkers in cancer.
The Clinical Interpretation of Variants in Cancer (CIViC) database is a community-curated knowledge base for diagnostic, prognostic, predisposing and drug resistance biomarkers in cancer (Griffith et al., 2017). This information is invaluable in automating a precision oncology analysis and providing actionable information to clinicians. In order to identify gaps in the CIViC knowledge and prioritize biomarkers that should be curated, we identify published sentences that likely contain all the relevant information. A team of eight curators worked to annotate sentences to link cancers, genes, drugs, and variants as biomarkers. This complex dataset is used to develop a multi-stage extraction system. We provide further advances with a ternary relation extraction system to integrate drug information. Through validation by the CIViC curation team, we illustrate the power of this methodology for extracting high-quality complex biological knowledge in bulk. This approach is able to provide a vast dataset of very high quality and can easily be applied to other problems in biology and medicine. Furthermore, the dataset of cancer biomarkers is valuable to all groups curating knowledge in precision medicine and also all analysts that are interrogating the genomes of patient tumors.

Finally, Chapter 6 concludes the thesis and discusses the successes and limitations of the research approaches taken. It explores interesting future directions that could be taken with the generalized and high performing methods developed in this thesis and with the valuable precision oncology datasets extracted from the literature.

Chapter 2

A collaborative filtering-based approach to biomedical knowledge discovery

2.1 Introduction

A scientist relies on knowledge contained in many published articles when developing a new hypothesis. Generating new hypotheses automatically based on extracting knowledge from academic publications is the problem faced by literature-based discovery (LBD) algorithms.
These approaches are becoming more important as knowledge is spread out across a larger number of publications. Text mining tools, including LBD methods, will likely become an essential tool to biology researchers as they explore new research ideas in their specific domains (Ananiadou and Mcnaught, 2006). Most approaches to LBD predict associations between two biomedical concepts that are not frequently discussed in the literature but are predicted to be strongly associated in the future.

Research in the LBD field was first prompted by Swanson’s discussions of undiscovered knowledge and his associations of dietary fish oil and Raynaud’s disease (Swanson, 1986a). This early technique proposed the concept of open discovery in which a starting term (A) is selected and novel target terms (C) are predicted that are likely associated with A. Swanson’s method proposed using intermediate terms (B) that are associated with A and C. For instance, dietary fish oil is mentioned in articles with blood viscosity and vascular reactivity. These two terms are also mentioned with Raynaud’s disease. Swanson proposes that it is reasonable that dietary fish oil and Raynaud’s disease may be associated, possibly as a treatment. This result has been validated experimentally (DiGiacomo et al., 1989). William Hersh provides an excellent overview of the different steps involved in the literature-based knowledge discovery problem (Hersh, 2008).

Various tools have been developed to pursue this idea of predicting associations between previously unlinked biomedical terms. All these methods generate a score for a potential association, which allows potential associations to be ranked. Swanson’s Arrowsmith tool used co-occurrence of biomedical terms in titles from MEDLINE abstracts to identify known associations (Swanson and Smalheiser, 1997).
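Swanson’s open discovery model described above can be illustrated with a minimal Python sketch. The four co-occurrence pairs restate the fish oil example from the text; the function names and the ranking by number of shared intermediate terms are assumptions made for illustration and are not the implementation of any published tool.

```python
from collections import defaultdict

# Toy co-occurrence pairs restating the fish oil example in the text.
cooccurrences = [
    ("dietary fish oil", "blood viscosity"),
    ("dietary fish oil", "vascular reactivity"),
    ("blood viscosity", "Raynaud's disease"),
    ("vascular reactivity", "Raynaud's disease"),
]


def neighbours(pairs):
    """Map each term to the set of terms it co-occurs with."""
    linked = defaultdict(set)
    for a, b in pairs:
        linked[a].add(b)
        linked[b].add(a)
    return linked


def open_discovery(pairs, start):
    """Rank novel target terms (C) for a start term (A) by the number of
    intermediate terms (B) that co-occur with both A and C."""
    linked = neighbours(pairs)
    known = linked[start]
    counts = defaultdict(int)
    for b in known:
        for c in linked[b]:
            # Skip the start term itself and terms already linked to it.
            if c != start and c not in known:
                counts[c] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(open_discovery(cooccurrences, "dietary fish oil"))
```

In this toy data, Raynaud’s disease is the only candidate target and is supported by two intermediate terms.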
The Arrowsmith system required the user to input a starting term, gave them choices on the appropriate intermediate terms and ranked the predicted target terms based on the number of intermediate terms. Co-occurrences have proven a valuable metric for gauging concept associations and have been used in several systems including CoPub (Frijters et al., 2008) and STRING (Szklarczyk et al., 2016). Many other systems have been developed using this concept with different methods for ranking the predictions, and most systems generally use the text from the abstract, not just the title. Notable systems include FACTA+, which uses the probability of two terms appearing together in a publication given the frequency of the individual terms (Tsuruoka et al., 2011). The BITOLA system uses the number of intermediate terms as well as the number of papers that support these intermediate links (Hristovski et al., 2013). The ANNI approach uses a comparison of concept vectors to predict novel associations (Jelier et al., 2008b). These concept vectors, based on the symmetric uncertainty coefficient (William, 2007), give a summary of the known associations of each concept with every other concept. The recent Implicitome project makes use of the same methodology as ANNI and has been integrated into the knowledge.bio project (Hettne et al., 2016; Bruskiewich et al., 2016). These methods largely make use of local knowledge, which we define as knowledge of the intermediate terms that co-occur with the starting term and the target terms.

A thorough evaluation procedure has previously been proposed to evaluate the different scoring methods (Yetisgen-Yildiz and Pratt, 2009). It uses publications before a certain year as the input to each approach and evaluates their scoring of novel associations in newer publications.
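The time-split evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record format `(term_a, term_b, year)`, the function name `time_split` and the toy records are hypothetical, and pairs are treated as unordered.

```python
def time_split(records, cutoff_year):
    """Split (term_a, term_b, year) records into a training set of pairs seen
    up to and including `cutoff_year` and a test set of novel pairs that first
    appear afterwards."""
    train, later = set(), set()
    for term_a, term_b, year in records:
        pair = tuple(sorted((term_a, term_b)))  # co-occurrence is symmetric
        if year <= cutoff_year:
            train.add(pair)
        else:
            later.add(pair)
    # A later pair is novel only if it never appeared in the training period.
    return train, later - train


# Hypothetical records for illustration only.
records = [
    ("a", "b", 2008),
    ("b", "c", 2010),
    ("b", "a", 2012),  # already known from 2008, so not novel
    ("a", "c", 2013),  # first seen after the cutoff, so novel
]

train_pairs, novel_pairs = time_split(records, cutoff_year=2010)
print(sorted(train_pairs), sorted(novel_pairs))
```

Each scoring method would then be given only `train_pairs` and judged on how highly it ranks the pairs in `novel_pairs`.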
Yetisgen-Yildiz and Pratt also propose using precision-recall curves as a metric for success, which is supported by analysis of the similar link prediction problem (Lichtnwalter and Chawla, 2012).

Recommendation systems are used in many commercial products such as Amazon and Netflix to suggest relevant products to a customer given their previous purchasing or viewing history. These systems often rely on collaborative filtering algorithms which use the combined history of many users and products. The success of these approaches is largely down to their use of global knowledge, with which they can implicitly learn about types of users or products based on this combined history and not any individual user, product or user-product interaction. The Netflix Prize spurred development of new recommendation algorithms and many of the most successful techniques were based on matrix decomposition (Bennett et al., 2007). We propose that similar techniques should be used for literature-based discovery. Instead of associations between users and products, these techniques could be reformulated to predict associations between biomedical terms. They would therefore be able to use global knowledge about the co-occurrence patterns of all entities and be able to implicitly learn about different types of entities. Latent semantic indexing (LSI), a matrix-based approach for finding term similarity, has previously been examined for recapitulating Swanson’s fish oil discovery but was limited by computational cost (Gordon and Dumais, 1998).

The literature-based discovery problem can be thought of as an implicit feedback problem, also known as one-class collaborative filtering (Pan et al., 2008). Implicit feedback problems, such as user purchase history, have only positive data points. Missing data may be negative or real missing data. In LBD, we have known associations between biomedical concepts, as they are discussed in the same publications.
However, the lack of a co-occurrence between two terms can mean two different things: either this is an association that has not yet been discovered, or the two concepts are definitely not associated.

In this chapter, we present the singular value decomposition (SVD) method as the best method for predicting associations between biomedical concepts. To create a gold standard data set, we use an approach similar to that of the previous comprehensive comparison of knowledge discovery methods (Yetisgen-Yildiz and Pratt, 2009). We build up a training set of co-occurrences extracted from PubMed abstracts and PubMed Central full-text articles up to the year 2010. We then compare methods using their predictions on novel co-occurrences that appear in literature after the year 2010. We also explore the predictive power of this approach to discover associations that appear in literature at various time-points after 2010. Finally, we delve into several specific associations to examine the strengths and limitations of our SVD method compared to the commonly cited Arrowsmith method.

2.2 Materials and Methods

In order to evaluate different knowledge discovery systems, we extracted a set of co-occurrence relationships to use as training and test sets. These co-occurrences are between different biomedical concepts extracted from the Unified Medical Language System (UMLS) within the same sentence.

2.2.1 Word List

A list of controlled vocabulary terms with synonyms was generated using the UMLS Metathesaurus (version 2016AB - Active Set). The terms selected were filtered from the Semantic Medline groups Anatomy (ANAT), Chemicals and Drugs (CHEM), Disorders and diseases (DISO), Genetics (GENE) and Physiology (PHYS) (Kilicoglu et al., 2008). The Findings group (T033) was removed due to a large number of vague terms.
This generated a list of 1,345,346 terms which was filtered using a set of stop words combined from the NLTK toolkit (Bird, 2006) and the most frequent 5,000 words based on the Corpus of Contemporary American English (Davies, 2009). Notably, only ~26% of the terms were found to appear within the downloaded article and abstract text.

All terms in the UMLS Metathesaurus are associated with multiple synonyms and contain alternative spellings and other wordings for the same term. All synonyms were used and matched to a single term ID in the generated word list. When a word (or multiple words) was found in a sentence which was associated with multiple concepts, the co-occurrences were counted for all possible concepts.

2.2.2 Positive Data

Co-occurrence relationships were extracted from biomedical literature to identify potential associations between biomedical terms. Raw text was extracted from titles and abstracts from MEDLINE citations and the titles, abstracts and full texts from PMC Open Access Subset articles. Many relationships may be mentioned in the full paper but not in the abstract (Van Landeghem et al., 2013). Therefore, full articles where available, as well as abstracts, were used to identify the largest possible number of relationships. In total, 13,153,418 abstracts and 1,503,065 full articles were downloaded from MEDLINE and PubMed Central (downloaded through FTP on 12th Feb 2017). In order to avoid duplication, articles that appear in PMC were filtered out of the MEDLINE data set.

These texts were filtered to remove HTML tags and Unicode special characters. They were split into sentences using LingPipe v4.1.0 (downloaded from http://alias-i.com/lingpipe) and tokenized using the GENIA part-of-speech tagger v3.0.1 (Tsuruoka et al., 2005). Exact string matching was used to identify entities from the UMLS-based word list. Longer terms were extracted first and removed from the sentence.
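The longest-first exact matching described above can be sketched in Python as follows. This is a simplified illustration, not the pipeline used in this chapter: the vocabulary is a hypothetical dictionary from token tuples to made-up term IDs, each synonym maps to a single ID (ambiguous terms are not handled), and whitespace splitting stands in for the GENIA tokenizer.

```python
def match_terms(tokens, vocabulary):
    """Greedy exact matching on token boundaries, longest terms first.

    `vocabulary` maps a tuple of tokens to a term ID. Matched tokens are
    masked so that shorter terms cannot match inside a longer match.
    """
    remaining = list(tokens)
    found = []
    for term in sorted(vocabulary, key=len, reverse=True):
        width = len(term)
        i = 0
        while i + width <= len(remaining):
            if tuple(remaining[i:i + width]) == term:
                found.append(vocabulary[term])
                # Mask rather than delete so neighbouring tokens on either
                # side of the match do not become falsely adjacent.
                remaining[i:i + width] = [None] * width
                i += width
            else:
                i += 1
    return found


# Hypothetical vocabulary with made-up term IDs.
vocabulary = {
    ("tumor", "necrosis", "factor"): "TNF",
    ("tumor", "necrosis"): "TUMOR_NECROSIS",
}

tokens = "tumor necrosis factor levels were high".split()
print(match_terms(tokens, vocabulary))
```

Because matching operates on whole tokens, a token such as "non-cancerous" never matches the vocabulary entry "cancerous".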
Extracting longer terms first meant that a sentence discussing “tumor necrosis factor” would be flagged for “tumor necrosis factor” and not for “tumor necrosis”. The tokenization was used to identify word boundaries, such that “non-cancerous tumor” was not flagged as “cancerous tumor”. When multiple terms appear in a sentence, all pair-wise co-occurrences were recorded.

2.2.3 Sampling and Negative Data

Ideally, to evaluate a scoring method, we would calculate the scores for all possible novel co-occurrences, which are defined as co-occurrences that do not appear in the training set. We would then evaluate the difference in scores for known novel co-occurrences in the test set compared with negative co-occurrences, which are those that do not occur in the test set. It is important to note that while all LBD methods discussed in this chapter use only positive data (co-occurrences that do occur in literature) to calculate scores, our evaluation methodology will require the generation of negative data (co-occurrences that appear in neither the training nor the test data).

The training set, from publications published up to and including the year 2010, contains 101,139,316 unique co-occurrences between 305,077 unique biomedical concepts. The size of the set of co-occurrences that could be predicted as novel is ~46.4 billion. The test set contains 65,680,905 novel co-occurrences observed in publications published after the year 2010 and therefore makes up only 0.14% of possible novel co-occurrences.

It is computationally infeasible to evaluate the full space of possible co-occurrences, so instead a large sampling approach is taken. 1,000,000 random co-occurrences are selected from the test set that represent known novel associations (also referenced as positive co-occurrences) and do not overlap with the training set. To match the 1,000,000 positive co-occurrences, the same number of “negative” co-occurrences are randomly generated. These are co-occurrences that don’t appear in the training or test data and are very likely not real associations.

2.2.4 SVD Method

The SVD approach treats the co-occurrence data as a binary adjacency matrix $X$ where $X_{ij}$ is 1 if the terms $i$ and $j$ have appeared in a sentence together and 0 if they have not. The matrix is square, symmetric, generally very sparse and has the dimension of the number of terms in the vocabulary. A complete SVD decomposes it into three matrices such that $X = U \Sigma V^{T}$ where $X$ is the adjacency matrix, $U$ and $V$ contain the singular vectors and $\Sigma$ is a diagonal matrix containing the singular values.

We use a truncated form of SVD in which we only use a small number of the singular values in order to create a low-rank approximation of the matrix. In this case, we decompose $X \approx U_k \Sigma_k (V_k)^{T}$ in which we keep the first $k$ singular values and vectors. This means that each term $i$ has a dense representation given by the $i$th rows of $U_k$ and $V_k$.

By reducing the dimensionality, this approach is able to summarize the original matrix (Eckart and Young, 1936). We used the Graphlab implementation v2.2 (Low et al., 2014) (built from the Dato Powergraph Github repository at https://github.com/dato-code/PowerGraph) which uses the Lanczos algorithm. When the truncated SVD is used to reconstruct the matrix, every possible co-occurrence is given a real-valued score which we designate the SVD score. The SVD method gives co-occurrences that are predicted to not appear in future literature a score close to zero, and those that will appear a score closer to one.

There is only one parameter for the SVD method, which is the number of singular values $k$ to use for reconstructing the matrix. In order to choose the value for this, we take a cross-validation approach in which we use a further time-split data set. Publications up to the year 2009 are used to generate a co-occurrence training set.
Then 1,000,000 novel co-occurrences are randomly sampled from publications in the year 2010. The same negative data generation and sampling approaches are used and precision-recall curves are generated for each rank parameter. By selecting the parameter that gave the largest area under the precision-recall curve, 132 was chosen as the number of singular values.

The SVD method provides scores with a range of approximately zero to one. By setting a different threshold on these scores in order to select the set of predictions, a trade-off of precision and recall can be made. With $k = 132$, the associated precision-recall curve is examined to identify the optimal trade-off, which is equivalent to maximizing the F1-score. We find the score threshold that gives the largest F1-score is 0.44.

Table 2.1: Summary of methods for comparison.

Algorithm                       Equation for score(x, z)
Average Minimum Weight (AMW)    $\frac{1}{|c_x \cup c_z|} \sum_{y \in c_x \cup c_z} \min(|c_x \cup c_y|, |c_y \cup c_z|)$
ANNI                            $v_x \cdot v_z$
Arrowsmith (LTC)                $|c_x \cup c_z|$
BITOLA                          $\sum_{y \in c_x \cup c_z} |c_x \cup c_y| \times |c_y \cup c_z|$
FACTA+                          $1 - \prod_{y \in c_x \cup c_z} (1 - D(x, y) D(y, z))$, where $D(i, j) = \max(P(i|j), P(j|i))$ and $P(i|j) = |c_i \cup c_j| / |s_j|$
Jaccard                         $|c_x \cap c_z| / |c_x \cup c_z|$
Preferential Attachment         $|c_x| + |c_z|$
SVD                             $(U_k)_x \Sigma_k ((V_k)_z)^{T}$

2.2.5 Evaluation

Based on previous literature we selected 8 other knowledge discovery algorithms for benchmarking. These methods are based on the number of co-occurrences of terms and occurrences of individual terms. Table 2.1 gives an overview of the equations implemented for the scoring methods. score(x, z) is the score calculated between terms $x$ and $z$. $c_i$ is the set of terms that co-occur with term $i$. $v_i$ is the concept profile vector as defined by Jelier et al. (2008a). FACTA+ requires knowledge of the set of sentences that contain term $i$, which is defined as $s_i$. The SVD method uses truncated versions of the decomposed matrices $U$, $\Sigma$ and $V$. $U_k$ is the truncated $U$ matrix with only the first $k$ columns kept. $(U_k)_x$ is the $x$th row of the truncated matrix $U_k$ from the SVD decomposition.
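The SVD score in Table 2.1 can be illustrated with a minimal sketch. Note the assumptions: this uses NumPy's dense `numpy.linalg.svd` on a small toy matrix rather than the Graphlab Lanczos implementation applied to the full sparse matrix, and the matrix values, the function name `svd_scores` and the choice of k = 2 are hypothetical.

```python
import numpy as np

# Toy symmetric binary co-occurrence matrix for four hypothetical terms.
X = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)


def svd_scores(X, k):
    """Return the rank-k reconstruction U_k Sigma_k (V_k)^T of X.

    Entry [x, z] of the result is the SVD score for the term pair (x, z).
    """
    U, singular_values, Vt = np.linalg.svd(X)
    U_k = U[:, :k]                      # first k left singular vectors
    Sigma_k = np.diag(singular_values[:k])
    Vt_k = Vt[:k, :]                    # first k right singular vectors
    return U_k @ Sigma_k @ Vt_k


scores = svd_scores(X, k=2)
# Score for the pair of terms 0 and 3, which do not co-occur in X.
print(scores[0, 3])
```

Pairs with a zero entry in `X` still receive a real-valued score from the low-rank reconstruction, and it is these scores that are thresholded to produce predictions.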
The same terminology is used for the $\Sigma$ and $V$ matrices.

The Arrowsmith algorithm counts the number of intermediate terms, also known as the linked term count (LTC). The average minimum weight (AMW) method calculates the path with minimum support between two concepts. An amalgamation of LTC-AMW, in which LTC is used to rank first and then AMW is used as a secondary ranking criterion, was identified as the top performing method in a previous comparison of literature-based discovery (Yetisgen-Yildiz and Pratt, 2009). We implement LTC-AMW by simply scaling the LTC score up so that the smallest LTC score is larger than the largest AMW score and then add the AMW score and order accordingly. We also compare two successful methods from the link prediction literature, the Jaccard Index and Preferential Attachment (Liben-Nowell and Kleinberg, 2007). Finally, we compare three methods from more recent literature-based discovery methods: ANNI, BITOLA and the FACTA+ reliability measure.

A “time-split” approach was used to create a training and test set. This approach has been used previously for literature-based knowledge discovery (Yetisgen-Yildiz and Pratt, 2009) instead of a traditional cross-validation for two reasons. First, each data point is not unique as would normally be the case in a classification problem. By randomly assigning each co-occurrence to the training or test set, the structure of the implicit knowledge graph for training and test would be dramatically altered. The second reason to use the “time-split” method is that it strongly reflects the intended use of these methods, in order to predict future co-occurrences and the so-called “undiscovered public knowledge”.

Precision-recall curves were chosen as the evaluation procedure due to the large class imbalance. Previous analysis has shown that receiver operating characteristic (ROC) curves are not appropriate for problems with large class imbalance (Lichtnwalter and Chawla, 2012).
When calculating the precision, the prior known class balance, based on the training set, is taken into account. While our test data of positive and negative sampled co-occurrences shows a 50% class balance, the real training data shows a class balance, $b$, of approximately 0.14% positive co-occurrences within all possible co-occurrences. This information is used to reweight the precision calculation as below, where $TP$ is the count of true positives and $FP$ is the count of false positives.

$$\text{precision} = \frac{b \times TP}{b \times TP + (1 - b) \times FP}$$

Recall is calculated as normal and does not require any correction. The F1-score is calculated using the normal recall and the corrected precision.

Table 2.2: Summary of performance for the initial steps for the ANNI and SVD algorithms.

Method                                Run-time (h:m:s)    RAM usage (GB)
ANNI Vector Generation                2:31:06             5.8
SVD (with publications up to 2009)    6:21:10             14.8
SVD (with publications up to 2010)    6:09:27             15.6

Table 2.3: Summary of performance for the different algorithms.

Method                     Run-time (h:m:s)    RAM usage (GB)
AMW                        0:53:15             43.3
ANNI                       9:52:13             347.0
Arrowsmith                 0:12:58             14.8
BITOLA                     0:51:49             43.3
FACTA+                     1:30:01             43.4
Jaccard                    0:46:18             14.8
LTC-AMW                    0:52:22             43.3
Preferential Attachment    0:06:45             14.8
SVD                        0:07:32             7.8

2.3 Results

2.3.1 Methods comparison

The 9 methods were compared on the same data set of 2,000,000 randomly sampled positive and negative co-occurrences. In order to visualize the different scoring methods more intuitively, we show violin plots of the various scores for the positive and negative sets in Figure 2.1. The perfect knowledge discovery algorithm would display two separable distributions for the positive and negative sets. However, none of the distribution pairs are easily separable, showing that none of the algorithms are capable of completely differentiating positive and negative co-occurrences. The performance metrics for the runs of the algorithms are shown in Tables 2.2 and 2.3.

In order to quantitatively compare the different sets of scores, we used the area under the precision-recall curves (AUPRC) which are shown in Figure 2.2. Notably, SVD outperforms all the other methods. This suggests that the SVD approach, which is a form of dimensionality reduction, is able to compress the knowledge into a reduced form and generalize the knowledge of the matrix. The associated precision-recall curves, shown in Figure 2.3, highlight that SVD can gain surprisingly high precision if a low recall is acceptable to the user. Arrowsmith gives the second best performance, showing that the simple count of intermediate terms gives a strong measure of association between two terms.

Figure 2.1: Violin plots of the different scores calculated using each method for the positive and negative test co-occurrences shown separately.

Figure 2.2: The methods evaluated using 1,000,000 co-occurrences extracted from publications after the year 2010, and 1,000,000 co-occurrences randomly generated as negative data.

Figure 2.3: The corresponding precision-recall curves for each method show similar trade-offs for precision and recall for each method.

While Figure 2.1 suggests that FACTA+ does have different distributions for the positive and negative co-occurrences, the performance shown in Figure 2.2 is surprisingly low. Further analysis showed FACTA+ predicts associations between many extremely rare terms with high probability, a result that disagrees with all other scoring methods. For example, the terms “discorhabdin Y” and “aspernidine A” are predicted to be associated with a probability of 1.0. However, both of them only appear in a single sentence each. Given the extreme rarity of these terms, this is a very weak association and likely not helpful.
They share a single intermediate term, “alkaloids”, that appears in 32,749 sentences, including the single sentences that contain the rare terms. The high probability score is due to the max function used to combine the conditional probabilities P(i|j) and P(j|i) to calculate D(i, j). The conditional probability P(i|j) represents the probability of one term i appearing in a sentence that also contains term j. Given a common term i (e.g. “alkaloids”) that occurs in a high proportion of the sentences in which a rare term j (e.g. “discorhabdin Y”) appears, P(i|j) will be very large and P(j|i) will be extremely small. The max value will always use P(i|j) and these high values skew the results.

The previous comparison analysis (Yetisgen-Yildiz and Pratt, 2009) concluded that LTC-AMW was the best knowledge discovery method. Our analysis shows that LTC-AMW performs similarly to Arrowsmith, which is equivalent to the linked term count (LTC). This suggests that the improvement of LTC-AMW over AMW previously shown is based entirely on the linked term count and that AMW doesn’t contribute at all.

2.3.2 Predictions over time

Figure 2.4: Evaluation of SVD predictions on test co-occurrences from publications further into the future using recall as the metric.

We also explored predictions for novel co-occurrences that appear in publications at different time points. We again used the data set of co-occurrences from papers up to and including the year 2010. We then found all novel discoveries after this period and grouped them by the year in which they first appear. There were on average 10.9 million novel co-occurrences in each year from 2011 to 2016 inclusive. Using the optimal parameters (k = 132) for the SVD model, we then calculate the scores using 1 million randomly sampled co-occurrences from each year (for computational reasons).
Using the previously selected threshold value of 0.44 on the scores to filter out predictions, we calculate the recall values for each year. These are presented in Figure 2.4.

The model is best able to predict co-occurrences in the year immediately after the data set ends (2011). The recall then decreases each year. This means that novel co-occurrences that appear further in the future are harder to predict. This result makes sense as a large proportion of next year’s discoveries will be based closely on existing discoveries. This could be a new drug tested on a similar disease to the current use of the drug or a different member of a gene family being associated with the same disease. However, co-occurrences further into the future are based on more complicated interpretations of the current research or, more likely, new research that has yet to be published.

Importantly, this model should not create so many predictions as to overwhelm a researcher and artificially inflate recall values. The SVD approach makes 12,242,242 co-occurrence predictions with a score above the required threshold. This number of predictions seems reasonable as it is smaller in magnitude than the known number of real novel co-occurrences (65,680,905) in the same time period. One further comment is that a number of the predictions that don’t match a novel discovery in the years up to 2016 will likely appear in future years after 2016.

2.3.3 Comparison of predictions between SVD and Arrowsmith methods

In order to explore the strengths and weaknesses of the SVD approach, we examine four results from the SVD system with comparisons to the output of the Arrowsmith system. The Arrowsmith system is used for comparison as it is the second best performing system. The associated UMLS Concept Unique Identifier (CUID) is noted for each term.

The first case examines the highest scoring prediction from the SVD from our test set.
This is an association between “Obstruction” (C0028778) and “Structure of anulus fibrosus of intervertebral disc” (C0223087). SVD gives this association a score of 1.320. The Arrowsmith method also gives this a high score with 1804 intermediate terms. This prediction turns out to be correct and is found in 7 separate sentences in publications after 2010. One of the papers (Kang et al., 2014) discusses using a block (a synonym of the “Obstruction” term) to interfere with the “annulus fibrosus” as an experimental model. It is common to block or obstruct parts of the spine to understand developmental biology, hence it is understandable that both SVD and Arrowsmith would make this prediction.

The next case to examine is one in which the SVD method predicts an association which is missed by the Arrowsmith method. Here we find all associations with an SVD score above the previously defined threshold of 0.44 and seek the association with the lowest Arrowsmith score. This is the association of “Proteins” (C0033684) and “hydantoin racemase” (C0168561). This association has SVD score=0.464 and Arrowsmith score=55. The association is also correct as it is found in a publication during the test period. Hydantoin racemase is an enzyme encoded by a gene in several strains of bacteria. It is unsurprising that there would be discussion of the protein product of this gene and that this association would occur. The SVD method likely implicitly identifies that hydantoin racemase is an enzyme as the pattern of co-occurrences between the enzyme and other terms is similar to other enzymes. Other enzymes are commonly discussed with the word “proteins” as most enzymes are proteins. Arrowsmith likely fails to generate a high score because this is an infrequently discussed enzyme (only appearing in 37 sentences in our corpus and cooccurring with 57 other terms).
This suggests that the SVD method may be more successful for infrequently discussed terms.

Next we examine a case where the SVD method failed to predict an association that Arrowsmith found. We look for a case where the Arrowsmith score is above the thresholds defined in Table 2.4 but has the lowest SVD score. This association is between “Surgical Flaps” (C0038925) and “MAP2 gene” (C1417006). Note that “Surgical Flaps” also has the synonyms “Flap” and “Flaps”. Arrowsmith gives this a high score of 2327, but SVD gives a very low score of -0.175. This association is deemed correct as it appears as a positive association in the test set. However, the article in which it appears (Chu et al., 2013) uses “FLAP” to refer to a particular protein and not the expected context of surgical flaps. This shows the limitation of using exact string matching to identify biomedical terms using the UMLS set of synonyms. The question remains why Arrowsmith gives a high score, but the SVD method provides a low score. One likely explanation is that the “Surgical Flaps” term cooccurs with a large number of terms (15,374) of which only 2,327 (~15%) cooccur with the “MAP2 gene” term. The Arrowsmith method only takes those ~15% into account whereas SVD takes into account the complete co-occurrence pattern when predicting associations. Most of these co-occurrences will be related to “flaps” and “surgical flaps” and not to gene/protein related terms.

Table 2.4: Thresholds used for different methods to select prediction set.

Method                     Threshold
AMW                        5.530
ANNI                       2.416e-05
Arrowsmith                 2188
BITOLA                     4355364
FACTA+                     0.029
Jaccard                    0.192
LTC-AMW                    2188.0
Preferential Attachment    34159
SVD                        0.441

Lastly we look at the association with the highest SVD score that was deemed a negative association within our test set, that is, one that did not occur in any publications within our corpus. This association is between “Kidney Failure, Acute” (C0022660) and “Thalassemia” (C0039730).
The SVD method gave this a score of 0.895 and Arrowsmith also gave a very high score of 2987. Thalassemia is a group of disorders associated with low haemoglobin production. A publication in 2011 (Quinn et al., 2011) notes that “[l]ittle is known about the effects of thalassaemia on the kidney” and goes on to study the association of thalassemia with renal issues, finding strong links. This suggests that this association is a valid prediction and exemplifies the power of knowledge discovery methods to identify valid links between biomedical terms.

These examples have highlighted several strengths and weaknesses of the SVD and Arrowsmith approaches. Firstly, Arrowsmith can be confused by very frequently appearing terms (such as the “Flap” term). It can miss infrequently mentioned terms (such as “Hydantoin racemase”). SVD is able to identify important characteristics of a term, even with infrequent mentions (as was the case for “Hydantoin racemase”). On the other hand, SVD can also be confused by terms that have a lot of synonyms. If one of the synonyms is a frequently occurring and ambiguous term, the SVD method can put too much weight on co-occurrences from this synonym. This limitation may be improved with the development of a named entity recognition (NER) system that can distinguish the context for different UMLS terms.
A method built upon the NER systems evaluated in (Funk et al., 2014) would be an interesting direction for a future LBD system.

2.4 Discussion

The success of singular value decomposition over the other current methods for knowledge discovery suggests that the matrix decomposition approach may be the best avenue for further improvements in knowledge discovery. By compressing the co-occurrence information down to a dense representation of each concept (the row Ui of the U matrix that corresponds to term i), SVD is able to deal with the sparsity inherent in the co-occurrence data. Furthermore, it can handle two concepts that aren’t frequently discussed together but share the same pattern of co-occurrences with other biomedical concepts. An example would be a drug with generic name and brand names as separate terms in the wordlist (e.g. erlotinib and Tarceva). It would be sensible to merge these entities; however, most knowledge discovery techniques would not be able to do this automatically. Because the two concepts share similar co-occurrence patterns, singular value decomposition will decompose them to similar dense representations and make use of both their co-occurrence patterns to predict new associations. From the recommendation systems perspective, this can be viewed as two customers that watch the same genres of movies but have never watched the exact same movie. The matrix decomposition method is able to identify that these customers share similar tastes and use each other’s viewing history to make recommendations.

SVD does, however, have several drawbacks. The first is that it is still computationally expensive. Our SVD runs required ~16GB of memory and about 6 hours per run (on a machine with quad Intel E5-4640 processors). This could be ameliorated through trimming very rare terms, thereby reducing the size of the matrix for decomposition. Furthermore, this will become less of a problem as memory costs decrease.
Another issue with singular value decomposition is interpretability: it is difficult for a user to understand why a prediction is made. Classic methods, such as the Arrowsmith approach, allow the user to view the intermediate concepts that were used to generate the prediction. As there are no intermediate concepts in the SVD model, it is more challenging to display the rationale for a prediction. One approach would be to show the concepts with similar dense representations in order to give context to the user of why these two concepts are predicted to cooccur; this presents an interesting future direction for research.

There are many general terms in the UMLS word lists, such as “Local Anesthetic”, which may not prove to be useful drug associations. One approach would be to attempt to filter these terms out of the word lists entirely. However, it could be argued that these terms are valuable in understanding the context of other concepts, and in creating their implicit relationships. Hence it would likely be more valuable to filter them out later in the process so that they are not shown as predictions but are used during the singular value decomposition.

The evaluation approach of making predictions using a training set and comparing predictions to a test set (as previously used by Yetisgen-Yildiz and Pratt, 2009) does have several limitations. The most important for a knowledge discovery algorithm is that many of the predictions deemed as false positives may prove to be true positives as new research is published. This limitation is hard to overcome. Knowledge base completion algorithms make use of a ranking evaluation where the ranking of randomly sampled known positive associations within the full set of predictions is calculated (as used in Lin et al., 2015).
This is used to compare systems and avoids the problem of false positives but is also very challenging to interpret correctly. By using a training/test split approach, the associated metrics of recall and precision give a lower limit to the performance of each system which is easier to interpret. However, a testing methodology that avoids the issue of negative data really being positive data (that will appear in future publications) but is also easy to interpret remains an open problem.

Each of the systems generates scores for each association and does not make a binary decision. In order to create a finite set of “predictions”, a threshold is chosen for each method and those associations with scores above the threshold are selected. The threshold is chosen in the same manner as for the SVD method. Each method is trained using co-occurrences in publications up to 2009 and evaluated on the co-occurrences that appear for the first time in publications during the year 2010. The threshold that gives the best F1-score using this data split method is selected. Table 2.4 shows the thresholds selected for each method. The predictions shown in Figure 2.5 are based on the scores generated for the test set of 2,000,000 associations (half of which are positive and half are negative cases). The scores are thresholded and the associations collected for each method.

Figure 2.5: An UpSet plot showing the overlap in predictions made by the three most successful systems.

While the SVD method clearly outperforms the other methods, an obvious question is whether the different systems make similar predictions. Figure 2.5 examines the overlap of the top performing systems. LTC-AMW and Arrowsmith give very similar predictions so only Arrowsmith is included. There is a core set of predictions that is shared by each method. However, a large number of predictions are made by each system individually.
This points towards the development of a meta-method that combines the different predictions of multiple systems and is an interesting direction for future work.

Figure 2.6: The methods evaluated using 1,000,000 abstract-level co-occurrences extracted from publications after the year 2010, and 1,000,000 abstract-level co-occurrences randomly generated as negative data.

It is worthwhile to note that our decision to focus on sentence-level co-occurrence, as opposed to abstract-level co-occurrence, was based on reducing potential incorrect associations. These happen between terms that cooccur but do not have any real biological relationship. By increasing the amount of text within which a co-occurrence can happen (e.g. to a full abstract), there are likely many more incorrect associations. However, to check that this decision didn’t bias our results, we reran the entire analysis pipeline using abstract-level co-occurrences. In this case a co-occurrence occurs when two terms appear in the same abstract. The results (shown in Figure 2.6) show a similar pattern to the sentence-level results and that SVD is the best performing system for this type of co-occurrence.

Figure 2.7: The class balance in the dataset can affect the resulting classifier metrics, making interpretation of score distributions challenging. The dataset has a class balance of 0.14% which is at the far left. Arrowsmith overtakes SVD at a class balance of ~5% which is an implausibly high class balance for a knowledge discovery dataset.

Finally, we examined the effect that the extreme class imbalance (0.14% positive data) has on the classification metrics. An inspection of the violin plots in Figure 2.1 seems to conflict with the results shown in Figure 2.2. For instance, the AMW results seem to have a bulbous positive distribution that has scores clearly larger than the negative distribution. Meanwhile, the SVD method has an obvious difference between the positive and negative distributions but it is not as well defined. However, the AUPRC results in Figure 2.2 show that SVD outperforms AMW. We examined the effect that the class balance had on the resulting AUPRC scores in Figure 2.7. This shows that the class balance, which is a property of the dataset, does have an effect on the AUPRC score. This means that visual comparison of score distributions (as in Figure 2.1) is much more challenging. With a very low class balance, more emphasis is put on co-occurrences with high scores. Any false positives with high scores will quickly drop the precision, with a knock-on effect on the AUPRC. This drop increases with larger class imbalance. The ~5% increase in class balance that would be needed to cause Arrowsmith to be the better performing system is very unrealistic for a knowledge discovery problem. Nevertheless, this is an important illustration that the class balance plays an important role in the classification metrics and also in interpreting the score distributions.

2.5 Conclusions

Our study has shown that the singular value decomposition technique provides the best scoring method for predicting future co-occurrences when compared to the leading methods in the knowledge discovery problem. The method is best able to predict co-occurrences that occur in publications in the near future and slowly reduces in predictive power for the far future. We hope this analysis will benefit the knowledge discovery research community in developing tools that will be beneficial for molecular biology researchers.

Chapter 3

Relation extraction with VERSE and Kindred

3.1 Introduction

Extracting knowledge from biomedical literature is a huge challenge in the natural language parsing field and has many applications including knowledge base construction and question-answering systems.
In this chapter, we describe our competition-winning event extraction system (VERSE) and its follow-up, a highly interoperable relation extraction Python package (Kindred).

Event extraction systems focus on this problem by identifying specific events and relations discussed in raw text. Events are described using three key concepts: entities, relations and modifications. Entities are spans of text that describe a specific concept (e.g. a gene). Relations describe a specific association between two (or potentially more) entities. Together entities and relations describe an event or set of events (such as complex gene regulation). Modifications are changes made to events such as speculation.

The BioNLP Shared Tasks have encouraged research into new techniques for a variety of important NLP challenges. Occurring in 2009, 2011 and 2013, the competitions were split into several subtasks (Kim et al., 2009, 2011; Nédellec et al., 2013). These subtasks provided annotated texts (commonly abstracts from PubMed) of entities, relations and events in a particular biomedical domain. Research groups were then challenged to generate new tools to better predict new relations and events in test data.

The BioNLP 2016 Shared Task contains three separate parts, the Bacteria Biotope subtask (BB3), the Seed Development subtask (SeeDev) and the Genia Event subtask (GE4). The BB3 and SeeDev subtasks have separate parts that specialise in entity recognition and relation extraction. The GE4 subtask focuses on full event extraction of NFkB related gene events.

Previous systems for relation and event extraction have taken two main approaches: rule-based and feature-based. Rule-based methods learn specific patterns that fit different events, for instance, the word “expression” following a gene name generally implies an expression event for that gene. The pattern-based tool BioSem (Bui et al., 2013) in particular performed well in the Genia Event subtask of the BioNLP’13 Shared Task.
Feature-based approaches translate the textual content into feature vectors that can be analysed with a traditional classification algorithm. Support vector machines (SVMs) have been very popular with successful relation extraction tools such as TEES (Björne and Salakoski, 2013).

3.1.1 VERSE

We will first present the Vancouver Event and Relation System for Extraction (VERSE) for the BB3 event subtask, the SeeDev binary subtask and the Genia Event subtask. Utilising a feature-based approach, VERSE builds on the ideas of the TEES system. It offers control over the exact semantic features to use for classification, allows feature selection to reduce the size of feature vectors and uses a stochastic optimisation strategy with k-fold cross-validation to identify the best parameters. We examine the competitive results for the various subtasks and also analyse VERSE’s capability to predict relations across sentence boundaries.

The VERSE method came first in the BB3 event subtask and third in the SeeDev binary subtask in the BioNLP Shared Task 2016. An analysis of the two systems that outperformed VERSE in the SeeDev subtask points to interesting directions for further development. The SeeDev subtask differs greatly from the BB3 subtask as there are 24 relation types compared to only 1 in BB3 and the training set size for each relation is drastically smaller. The LitWay approach, which came first, uses a hybrid of rule-based and vector-based methods (Li et al., 2016). For “simpler” relations, defined using a custom list, a rule-based approach uses a predefined set of patterns. The UniMelb approach created individual classifiers for each relation type and was able to predict multiple relations for a candidate relation (Panyam et al., 2016). This approach of treating relation types differently suggests that there may be large differences in how a relation should be treated in terms of the linguistic cues used to identify it and the best algorithmic approach to identify it.

3.1.2 Kindred

There are several shortcomings in the approaches to the BioNLP Shared Tasks, the greatest of which is the small number of participants that provide code. It is also clear that the advantages of some of the most successful tools are tailored specifically to these datasets and may not be able to generalize easily to other relation extraction tasks. Some tools that do share code, such as TEES and VERSE, have a large number of dependencies, though TEES ameliorates this problem with an excellent installation tool that manages dependencies. These tools can also be computationally costly, with both TEES and VERSE taking a parameter optimization strategy that requires a cluster for reasonable performance.

The biomedical text mining community is endeavoring to improve consistency and ease-of-use for text mining tools. In 2012, the BioCreative BioC Interoperability Initiative (Comeau et al., 2014) encouraged researchers to develop biomedical text mining tools around the BioC file format (Comeau et al., 2013). More recently, one of the BioCreative BeCalm tasks focuses on “technical interoperability and performance of annotation servers” for named entity recognition systems. This initiative encourages an ecosystem of tools and datasets that will make text mining a more common tool in biology research. PubAnnotation (Kim and Wang, 2012), which is part of this approach, is a public resource for sharing annotated biomedical texts. The hope of this resource is to provide data to improve biomedical text mining tools and to act as a launching point for future shared tasks. The PubTator tool (Wei et al., 2013b) provides PubMed abstracts with various biomedical entities annotated using several named entity recognition tools including tmVar (Wei et al., 2013a) and DNorm (Leaman et al., 2013).

In order to overcome some of the challenges in the relation extraction community in terms of ease-of-use and integration, we present Kindred which is a successor to VERSE.
Kindred is an easy-to-install Python package for relation extraction using a vector-based approach. It abstracts away much of the underlying algorithms in order to allow a user to easily start extracting biomedical knowledge from sentences. However, the user can easily use individual components of Kindred in conjunction with other parsers or machine learning algorithms. It integrates seamlessly with PubAnnotation and PubTator to allow easy access to training data and to text to which a trained model can be applied. Furthermore, we show that it performs very well on the BioNLP Shared Task 2016 relation subtasks.

3.2 VERSE Methods

The VERSE system competed in the BioNLP Shared Task 2016 and the methods are outlined here.

3.2.1 Pipeline

Figure 3.1: Overview of VERSE pipeline

VERSE breaks event extraction into five steps outlined in the pipeline shown in Figure 3.1. Firstly the input data is passed through a text processing tool that splits and tags text and associates the parsed results with the provided annotations. This parsed data is then passed through three separate classification steps for entities, relations and modifications. Finally, the results are filtered to make sure that all relations and modifications fit the expected types for the given task.

3.2.2 Text processing

VERSE can accept input in the standard BioNLP-ST format or the PubAnnotation JSON format (Kim and Wang, 2012). The annotations describe entities in the text as spans of text and relations and modifications of these entities.

The input files for the shared subtasks are initially processed using the Stanford CoreNLP toolset. The texts are split into sentences and tokenized. Parts-of-speech and lemmas are identified and a dependency parse is generated for each sentence. CoreNLP also returns the exact positions of each token. Using this data, an interval tree is created to identify intersections of text with entities described in the associated annotation.
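The matching of token positions to annotated entity spans can be illustrated with a minimal sketch. It is not the VERSE implementation: the function names, the example sentence and the offsets are all hypothetical, and a plain linear scan stands in for the interval tree, which would answer the same overlap query more efficiently on long documents.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def tokens_for_entities(token_spans: List[Span],
                        entity_spans: Dict[str, List[Span]]) -> Dict[str, List[int]]:
    """Map each entity ID to the indices of the tokens that it overlaps.

    An entity may have several spans, which covers non-contiguous entities.
    Assumption: a linear scan replaces the interval tree purely for brevity.
    """
    result = {}
    for entity_id, spans in entity_spans.items():
        result[entity_id] = [
            index for index, token in enumerate(token_spans)
            if any(overlaps(token, span) for span in spans)
        ]
    return result


# Hypothetical example (not taken from the shared task data).
text = "Coxiella burnetii lives in freshwater lakes"
token_spans = [(0, 8), (9, 17), (18, 23), (24, 26), (27, 37), (38, 43)]
entity_spans = {"T1": [(0, 17)], "T2": [(27, 43)]}

entity_tokens = tokens_for_entities(token_spans, entity_spans)
print(entity_tokens)
```

Storing token indices rather than character offsets is what later allows an entity to be treated as a set of words, including words that are not adjacent.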
The specific sentence and locations of each associated word are then stored for each entity. Relations and modifications described in the associated annotations are also loaded, retaining information on which entities are involved.

Figure 3.2: Relation candidate generation for the example text which contains a single Lives_In relation (between bacteria and habitat). The bacteria entity is shown in bold and the habitat entities are underlined. Relation example generation creates pairs of entities that will be vectorised for classification. (a) shows all pairs matching without filtering for specific entity types; (b) shows filtering for entity types of bacteria and habitat for a potential Lives_In relation.

The entities in the BB3 and SeeDev subtasks are generally sets of full words but can be non-contiguous. Entities are stored as a set of associated words rather than a span of words. The GE4 task also contains entities that contain only partial words, for example, “PTEN” is tagged as an entity within “PTEN-deficient”. A list of common prefixes and suffixes from the GE4 task is used to separate these words into two words so that the example would become “PTEN deficient”. Furthermore, any annotation that divides a word that contains a hyphen or forward slash causes the word to be separated into two separate words.

For easier interoperability, the text parsing code was developed in Jython (Developers, 2008) (a version of Python that can load Java libraries, specifically the Stanford CoreNLP toolset). This Jython implementation is then able to export easily processed Python data structures. Due to incompatibility between Jython and various numerical libraries, a separate Python-only implementation loads the generated data structures for further processing and classification.

3.2.3 Candidate generation

For all three classification steps (entities, relations and modifications), the same machine learning framework is used.
All possible candidates are generated for entities, relations or modifications. For relations, this means all pairs of entities are found (within a certain sentence range). For the training step, the candidates are associated with a known class (i.e. the type of relation), or the negative class if the candidate is not annotated in the training set. For testing, the classes are unknown. Candidates can contain one argument (for entity extraction and modification) or two arguments (for relation extraction). These arguments are stored as references to sentences and the indices of the associated words.

3.2.3.1 Entity extraction

Entity extraction aims to classify individual or sets of words as a certain type of entity, given a set of training cases. Entities may contain non-contiguous words. The set of all possible combinations of words that could compose an entity is too large for the classification system. Hence VERSE filters for only combinations of words that are identified as entities in the training set. This means that if the term “Lake Como” is annotated as a Habitat entity in the training set, any instance of “Lake Como” will be flagged as a candidate Habitat entity. However, if a term (e.g. “the River Thames”) never appears as an entity in the training set, it will be ignored for all test data.

3.2.3.2 Relation extraction

VERSE can predict relations between two entities, also known as binary relations. Candidates for each possible relation are generated for every pair of entities that are within a fixed sentence range. Hence when using the default sentence range of 0, only pairs of entities within the same sentence are analysed. VERSE can optionally filter pairs of entities using the expected types for a set of relations as shown in Figure 3.2.

Each candidate is linked with the locations of the two entities. If the two entities are already annotated to be in a relation, then they are labelled with the corresponding class. Otherwise, the binary relation candidate is annotated with the negative class.

3.2.3.3 Modification extraction

VERSE supports modification of entities in the form of event modification but currently does not support modification of individual relations. A modification candidate is created for all entities that form the base of an event. These entities are often known as the triggers of the event. In the JSON format, these entities traditionally have IDs that start with “E”. If a modification exists in the training set for that entity, the appropriate class is associated with it. Individual binary classifiers are generated for each modification type. This allows an event to be classified with more than one modification.

3.2.4 Features

For each generated candidate, a variety of features (controllable through a parameter) is calculated. The features focus on characteristics of the full sentence, dependency path or individual entities. The full set is shown in Table 3.1. Each feature group, shown in the table, can be included or excluded with a binary flag. It should also be noted that a term frequency-inverse document frequency (TFIDF) transform is also an option for all bag-of-words based features.

Table 3.1: Overview of the various features that VERSE can use for classification

Feature Name                    Target
unigrams                        Entire Sentence
unigrams & parts-of-speech      Entire Sentence
bigrams                         Entire Sentence
skipgrams                       Entire Sentence
path edges type                 Dependency Path
unigrams                        Dependency Path
bigrams                         Dependency Path
unigrams                        Each Entity
unigrams & parts-of-speech      Each Entity
nearby path edge types          Each Entity
lemmas                          Each Entity
entity types                    Each Entity
unigrams of windows             Each Entity
is relation across sentences    N/A

3.2.4.1 Full sentence features

N-gram features (unigrams and bigrams) use a bag-of-words approach to count the word occurrences across the whole sentence. The words are transformed to lowercase but notably are not filtered for stop words.
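As a rough illustration of the lowercased bag-of-words counting just described, the following sketch counts unigrams and bigrams for a single tokenized sentence. The function name and the example tokens are hypothetical and this is not the VERSE code; it only shows that stop words such as “of” are retained in the counts.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count lowercased n-grams in a tokenized sentence (no stop word filtering)."""
    lowered = [token.lower() for token in tokens]
    grams = [" ".join(lowered[i:i + n]) for i in range(len(lowered) - n + 1)]
    return Counter(grams)


# Hypothetical tokenized sentence used only for illustration.
sentence = ["Regulation", "of", "EGFR", "by", "regulation", "of", "MYC"]

unigrams = ngram_counts(sentence, 1)
bigrams = ngram_counts(sentence, 2)
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

Each distinct n-gram would become one dimension of the feature vector, with the count as its value before any optional TFIDF transform.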
A version combining the individual words with part-of-speech information is also used. A bag-of-words vector is also generated for lemmas of all words in the sentence. Skip-gram-like features are also generated using two words separated by a fixed window of words. Hence the terms “regulation of EGFR” and “regulation with EGFR” would match the same feature of “regulation * EGFR”.

3.2.4.2 Dependency path features

The dependency path is the shortest path between the two entities in a dependency parse graph and has been shown to be important for relation extraction (Bunescu and Mooney, 2005). Features generated from the set of edges and nodes of the dependency graph include unigram and bigram representations. The specific edge types in the dependency path are also captured with a bag-of-words vector. In order to give specific information about the location of the entity in the dependency path, the types of the edges leaving the entity nodes are recorded separately for each entity.

Interestingly, an entity may span multiple nodes in the dependency graph. An example of a dependency path with the multi-word entities “coxiella burnetii” and “freshwater lakes” is shown in Figure 3.3. In this case, the minimal subgraph that connects all entity nodes in the graph is calculated. This problem was transformed into a minimal spanning tree problem as follows and solved using the NetworkX Python package (Hagberg et al., 2008). The shortest paths through the graph were found for all pairs of entity nodes (nodes associated with the multi-word entities). The path distance between each pair was totalled and used to generate a new graph containing only the entity nodes. The minimal spanning tree was calculated and the associated edges recovered to generate the minimal subgraph. This approach would allow for a dependency path-like approach for relations between more than two entities.

Figure 3.3: Dependency parsing of the shown sentence provides (a) the dependency graph of the full sentence which is then reduced to (b) the dependency path between the two multi-word terms. This is achieved by finding the subgraph which contains all entity nodes and the minimum number of additional nodes.

3.2.4.3 Entity features

The individual entities are also used to generate specific features. Three different vectorised versions use a unigrams approach, a unigrams approach with parts-of-speech information and lemmas respectively. A one-hot vector approach is used to represent the type of each entity. Unigrams of words around each entity within a certain window size are also generated.

3.2.4.4 Multi-sentence and single entity features

VERSE is also capable of generating features for relations between two entities that are in different sentences. In this case, all sentence features are generated for both sentences together and no changes are made to the entity features.

The dependency path features are treated differently. The dependency path for each entity is created as the path from the entity to the root of the dependency graph, generally the main verb of the sentence. This creates two separate paths, one per sentence, and the features are generated in similar ways using these paths. Finally, a simple binary feature is generated for relation candidates that span multiple sentences.

For entities and modifications, candidates contain only a single argument. The dependency path is created in a similar manner to candidates of relations that span across sentences.

3.2.5 Classification

All candidates are vectorized using the same framework, with minor changes depending on whether a candidate has one or two arguments. These vectorized candidates are then used for training a traditional classifier. The vectors may be reduced using feature selection. Most importantly, the parameters used for the feature generation and classifier can easily be varied to find the optimal results. Classification uses the scikit-learn Python package (Pedregosa et al., 2011b).

3.2.5.1 Feature selection

VERSE implements optional feature selection using a chi-squared test on individual parameters against the class variable. The highest ranking features are then filtered based on the percentage of features desired.

3.2.5.2 Classifier parameters

Classification uses either a support vector machine (SVM) or logistic regression. When using the SVM, the linear kernel is used due to lower time complexity. The multi-class classification uses a one-vs-one approach. The additional parameters of the SVM that are optimised are the penalty parameter C, class weighting approach and whether to use the shrinking heuristic. The class weighting is important as the negative samples greatly outnumber the positive samples for most problems.

3.2.5.3 Stochastic parameter optimisation

VERSE allows adjustment of the various parameters including the set of features to generate, the classifier to use and the associated classification parameters. The optimisation strategy involves initially seeding 100 random parameter sets. After this initial set, the top 100 previous parameter sets are identified each iteration and one is randomly selected. This parameter set is then tweaked as follows. With a probability of 0.05, an individual parameter is changed. In order to avoid local maxima, an entirely new parameter set is generated with a probability of 0.1. For the subtasks, a 500-node cluster using Intel X5650s was used for optimisation runs.

The optimal parameters are determined for the entity extraction, relation extraction and each possible modification individually. In order to balance precision and recall equally at each stage, the F1-score is used.

3.2.6 Filtering

Final filtering is used to remove any predictions that do not fit into the task specification.
Firstly, all relations are checked to see that the types of the arguments are appropriate. Any entities that are not included in relations are removed. Finally, any modifications that do not have appropriate arguments or were associated with removed entities are also trimmed.

3.2.7 Evaluation

An evaluation system was created that generates recall, precision, and associated F1-scores for entities, relations and modifications. The system works conservatively and requires exact matches. It should be noted that our internal evaluation system gave similar but not exactly matching results to the online evaluation system for the BB3 and SeeDev subtasks.

K-fold cross-validation is used in association with this evaluation system to assess the success of the system. The entity, relation and modification extractors are trained separately. For the BB3 and SeeDev subtasks, two-fold cross-validation is used, using the provided split of training and development sets as the training sets for the first and second fold respectively. For the GE4 task, five-fold cross-validation is used. The average F1-score of the multiple folds is used as the metric of success.

3.3 Kindred Methods

The Kindred package was built as a follow-up to the VERSE system. It is designed for generalizable relation extraction, is integrated with a wide variety of biomedical text mining resources and is distributed as a self-contained Python package for easy use.

Kindred is a Python package that builds upon the Stanford CoreNLP framework (Manning et al., 2014) and the scikit-learn machine learning library (Pedregosa et al., 2011a). The decision to build a package was based on the understanding that each text mining problem is different. It seemed more valuable to make the individual features of the relation extraction system available to the community than to provide a bespoke tool that was designed to solve a fixed type of biomedical text mining problem. Python was selected due to the excellent support for machine learning and the easy distribution of Python packages.

The ethos of the design is based on the scikit-learn API that allows complex operations to occur in very few lines of code, but also gives detailed control of the individual components. Individual computational units are encapsulated in separate classes to improve modularity and allow easier testing. Nevertheless, the main goal was to allow the user to download annotated data and build a relation extraction classifier in as few lines of code as possible.

3.3.1 Package development

The package has been developed for ease-of-use and reliability. The code for the package is hosted on GitHub. It was also developed using the continuous integration system Travis CI in order to improve the robustness of the tool. This allows regular tests to be run whenever code is committed to the repository. This will enable further development of Kindred and ensure that it continues to work with both Python 2 and Python 3. Coveralls and the Python coverage tool are used to evaluate code coverage and assist in test evaluation.

These approaches were in line with the recent recommendations on improving research software (Taschuk and Wilson, 2017). We hope these techniques will allow for and encourage others to make use of and contribute to the Kindred package.

3.3.2 Data Formats

As illustrated in Figure 3.4, Kindred accepts data in four different formats: the standoff format used by BioNLP Shared Tasks, the JSON format used by PubAnnotation, the BioC format (Comeau et al., 2013) and a simple tag format. The standoff format uses three files: a TXT file that contains the raw text, an A1 file that contains information on the tagged entities and an A2 file that contains information on the relations between the entities. The JSON, BioC and simple tag formats integrate this information into single files.
The input text in each of these formats must have already been annotated for entities.

The simple tag format was implemented primarily for simple illustrations of Kindred and for easier testing purposes. It is parsed using an XML parser to identify all tags. A relation tag should contain a “type” attribute that denotes the relation type (e.g. causes). All other attributes are assumed to be arguments for the relation and their values should be IDs for entities in the same text. A non-relation tag is assumed to be describing an entity and should have an ID attribute that is used for associating relations.

Figure 3.4: An example of a relation between two entities in the same sentence and the representations of the relation in four input/output formats that Kindred supports.

3.3.3 Parsing and Candidate Building

The text data is loaded, and where possible, the annotations are checked for validity. In order to prepare the data for classification, the first step is sentence splitting and tokenization. We use the Stanford CoreNLP toolkit for this, which is also used for dependency parsing for each sentence.

Once parsing has completed, the associated entity information must then be matched with the corresponding sentences. An entity can contain non-contiguous tokens as was the case for the BB3 event dataset in the BioNLP 2016 Shared Task. Therefore each token that overlaps with an annotation for an entity is linked to that entity.

Any relations that occur entirely within a sentence are associated with that sentence. The decision to focus on relations contained within sentence boundaries is based on the poor performance of relation extraction systems in the past. The VERSE tool explored predicting relations that spanned sentence boundaries in the BioNLP Shared Task and found that the false positive rate was too high.
The sentence is also parsed to generate a dependency graph which is stored as a set of triples (token_i, token_j, dependency_ij) where dependency_ij is the type of edge in the dependency graph between tokens i and j. The edge types use the Universal Dependencies format (Nivre et al., 2016).

Relation candidates are then created by finding every possible pair of entities within each sentence. The candidates that are annotated relations are stored with a class number for use in the multiclass classifier. The class zero denotes no relation. All other classes denote relations of specific types. The types of relations, and therefore how many classes are required for the multiclass classifier, are based on the training data provided to Kindred.

3.3.4 Vectorization

Each candidate is then vectorized in order to transform the tokenized sentence and set of entity information into a numerical vector that can be processed using the scikit-learn classifiers. In order to keep Kindred simple and improve performance, it only generates a small set of features as outlined below.

• Entity types in the candidate relation
• Unigrams between entities
• Bigrams for the full sentence
• Edges in dependency path
• Edges in dependency path that are next to each entity

The entity type and dependency edge features are stored in a one-hot format. Entity specific features are created for each entity. For instance, if there are three entity types for relations between two arguments, then six binary features would be required to capture the entity types.

The unigrams and bigrams use a bag-of-words approach. Term-frequency inverse-document frequency (TF-IDF) is used for all bag-of-words based features.
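The candidate building and one-hot entity type encoding described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Kindred's actual code: the toy entities, the `annotated` dictionary, the helper `entity_type_features` and the choice to treat pairs as ordered (so that argument order matters) are all hypothetical.

```python
from itertools import permutations

# Hypothetical toy sentence: entity ID -> entity type.
entities = {"T1": "bacteria", "T2": "habitat", "T3": "habitat"}
# Hypothetical annotations: (argument 1, argument 2) -> relation type.
annotated = {("T1", "T2"): "Lives_In"}

# Class 0 is reserved for "no relation"; other classes come from the training data.
class_ids = {rel: i for i, rel in enumerate(sorted(set(annotated.values())), start=1)}

# Assumption: candidates are ordered pairs, so (T1, T2) and (T2, T1) are distinct.
candidates = []
for arg1, arg2 in permutations(entities, 2):
    label = class_ids.get(annotated.get((arg1, arg2)), 0)
    candidates.append((arg1, arg2, label))

entity_types = sorted(set(entities.values()))

def entity_type_features(arg1, arg2):
    """One-hot encode the entity type of each argument separately."""
    vec = []
    for arg in (arg1, arg2):
        vec.extend(1 if entities[arg] == t else 0 for t in entity_types)
    return vec
```

With two entity types and two arguments, `entity_type_features` produces four binary features; with three entity types it would produce six, matching the count given in the text.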
The dependency path, using the same method as VERSE, is calculated as the minimum spanning tree between the nodes in the dependency graph that are associated with the entities in the candidate relation.

3.3.5 Classification

Kindred has in-built support for the support vector machine (SVM) and logistic regression classifiers implemented in scikit-learn. By default, the SVM classifier is used with the vectorized candidate relations. The linear kernel has been shown to give good performance and is substantially faster to train than alternative SVM kernels such as radial basis function or exponential.

The success of the LitWay and UniMelb entries to the SeeDev shared task suggested that individual classifiers for unique relation types may give improved performance. This may be due to the significant differences in complexity between different relation types. For instance, one relation type may require information from across the sentence for good classification, whereas another relation type may require only the neighboring word.

Using one classifier per relation type, instead of a single multiclass classifier, means that a relation candidate may be predicted to be multiple relation types. Depending on the dataset, this may be the appropriate decision as relations may overlap. Kindred offers this functionality of one classifier per relation type. However, for the SeeDev dataset, we found that the best performance was actually through a single multiclass classifier.

3.3.6 Filtering

The predicted set of relations is then filtered using the associated relation type and types of the entities in the relation. Kindred uses the set of relations in the training data to infer the possible argument types for each relation.

3.3.7 Precision-recall tradeoff

The importance of precision and recall depends on the specific text mining problem. The BioNLP Shared Task has favored the F1-score, giving an equal weighting to precision and recall.
Other text mining projects may prefer higher precision in order to avoid biocurators having to manually filter out spurious results. Alternatively, projects may require higher recall in order to not miss any possibly important results. Kindred gives the user the control of a threshold for making predictions. In this case, the logistic regression classifier is used as it allows for easier thresholding. This is because the underlying predicted values can be interpreted as probabilities. We found that logistic regression achieved performance very close to the SVM classifier. By selecting a higher threshold, the classifier will become more conservative, decrease the number of false positives and therefore improve precision at the cost of recall. By using cross-validation, the user can get an idea of the precision-recall tradeoff. The tradeoffs for the BB3 and SeeDev tasks are shown in Figure 3.5. This allows the user to select the appropriate threshold for their task.

Figure 3.5: The precision-recall tradeoff when trained on the training set for the BB3 and SeeDev results and evaluating on the development set using different thresholds. The numbers shown on the plot are the thresholds.

3.3.8 Parameter optimization

TEES took a grid-search approach to parameter optimization and focused on the parameters of the SVM classifier. VERSE had a significantly larger selection of parameters and grid search was not computationally feasible, so a stochastic approach was used. Both approaches are computationally expensive and generally need a computer cluster.

Kindred takes a much simpler approach to parameter optimization and can work out of the box with default values. To improve performance, the user can choose to do minor parameter optimization. The only parameter optimized by Kindred is the exact set of features used for classification. This decision was made with the hypothesis that some relations potentially require words from across the sentence and others need only the information from the dependency parse.

The feature choice optimization uses a greedy algorithm. It calculates the F1-score using cross-validation for each feature type. It then selects the best one and tries adding the remaining feature types to it. It continues growing the feature set until the cross-validated F1-score does not improve.

Figure 3.6 illustrates the process for the BB3 subtask using the training set and evaluating on the development set. At the first stage, the entity types feature is selected. This is understandable as the types of entity are highly predictive of whether a candidate relation is reasonable for a particular candidate type, e.g. two gene entities are unlikely to be associated in a ‘IS_TREATMENT_FOR’ relation. At the next stage, the unigrams between entities feature is selected. At the third stage, no improvement is made. Hence for this dataset, two features are selected. We use this approach for the BB3 dataset but found that the default feature set performed best for the SeeDev dataset.

3.3.9 Dependencies

The main dependencies of Kindred are the scikit-learn machine learning library and the Stanford CoreNLP toolkit. Kindred will check for a locally running CoreNLP server and connect if possible. If none is found, then the CoreNLP archive file will be downloaded. After checking the SHA256 checksum to confirm the file integrity, it is extracted. It will then launch CoreNLP as a background process and wait until the toolkit is ready before proceeding to send parse requests to it. It also makes sure to kill the CoreNLP process when the Kindred package exits.
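The integrity check on a downloaded archive can be sketched with the standard library `hashlib` module. This is only an illustration of SHA256 verification in general: the function names `sha256_of_file` and `verify_download` are hypothetical and are not taken from the Kindred codebase.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_sha256):
    """Raise ValueError if the file does not match the expected checksum."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError("Checksum mismatch for %s: got %s" % (path, actual))
    return True
```

A caller would run `verify_download` on the archive before extracting it, so that a corrupted or truncated download is rejected rather than unpacked.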
Kindred also depends on the wget package for easy downloading of files, the IntervalTree Python package for identifying entity spans in text and NetworkX for generating the dependency path (Schult and Swart, 2008).

3.3.10 PubAnnotation integration

In order to make use of existing resources in the biomedical text mining community, Kindred integrates with PubAnnotation. This allows annotated text to be downloaded from PubAnnotation and used to train classifiers.

Figure 3.6: An illustration of the greedy approach to selecting feature types for the BB3 dataset.

The PubAnnotation platform provides a RESTful API that allows easy download of annotations from a given project. Kindred will initially download the listing of all available text sources with annotation for a given project. The listing is provided as a JSON data file. It will then download the complete set of texts with annotations.

3.3.11 PubTator integration

Kindred can also download a set of PubMed abstracts that have already been annotated with named entities through the PubTator framework using the RESTful API. This requires the user to provide a set of PubMed IDs which are then requested from the PubTator server using the JSON data format. The same loader used for PubAnnotation data is then used for the PubTator data.

3.3.12 BioNLP Shared Task integration

Kindred gives easy access to the data from the most recent BioNLP Shared Task. By providing the name of the task and specific data set (e.g. training, development or testing), Kindred manages the download of the appropriate archive, unzipping and loading of the data. As with the CoreNLP dependency, the SHA256 checksum of the downloaded archive is checked before unzipping occurs.

3.3.13 API

One of the main goals of Kindred is to open up the internal functionality of a relation extraction system to other developers. The API is designed to give easy access to the different modules of Kindred that may be used independently.
For instance, the candidate builder or vectorizer could easily be integrated with functionality from other Python packages, which would allow for other machine learning algorithms or deep learning techniques to be tested. Other parsers could easily be integrated and tested with the other parts of Kindred in order to understand how the parser performance affects the overall performance of the system. We hope that this ease-of-use will encourage others to use Kindred as a baseline method for comparison in future research.

3.4 Results and discussion

The VERSE tool as described was applied to three subtasks: the BB3 event subtask, the SeeDev binary subtask and the GE4 subtask. The Kindred tool, which only focuses on relation extraction, is also compared to the top performing tools for the BB3 and SeeDev tasks.

3.4.1 Datasets

The BB3 event dataset provided by the BioNLP-ST 16 organizers contains a total of 146 documents (with 61, 34 and 51 documents in the training, development and test sets respectively). These documents are annotated with entities of the following types and associated total counts: bacteria (932), habitat (1,861) and geographical (110). Only a single relation type (Lives_In) is annotated which must be between a bacteria and habitat or a bacteria and a geographical entity.

The dataset for the SeeDev binary subtask contains 20 documents with a total of 7,082 annotated entities and 3,575 relations. There are 16 entity types and 22 relation types.

The GE4 dataset focuses on NFkB gene regulation and contains 20 documents. After filtering for duplicates and cleanup, it contains 13,012 annotated entities of 15 types. These entities are in 7,232 relations of 5 different types. It also contains 81 negation and 121 speculation modifications for events. Coreference data is also provided but was not used.

3.4.2 Cross-validated results

Both BB3 event and SeeDev binary subtasks required only relation extraction.
VERSE was trained through cross-validation using the parameter optimising strategy and the optimal parameters are outlined in Table 3.2. Both tasks were split into training and development sets by the competition organisers. The training set contained roughly twice as many annotations as the development set. We used this existing split for the two-fold cross-validation. A linear kernel SVM was found to perform the best in both tasks. For both subtasks, relation candidates were generated ignoring the argument types as shown in Figure 3.2.

Table 3.2: Parameters used for BB3 and SeeDev subtasks

Parameter            BB3 event                         SeeDev binary
Features             unigrams                          unigrams
                     unigrams POS                      unigrams POS
                     bigrams of dependency path        path edges types
                     unigrams of dependency path       path edges types near entities
                     path edges types                  entity types
                     entity types
                     entity lemmas
                     entity unigrams POS
                     path edges types near entities
Feature Selection    No                                Top 5%
Use TFIDF            Yes                               Yes
Sentence Range       0                                 0
SVM Kernel           linear                            linear
SVM C Parameter      0.3575                            1.0 (default)
SVM Class Weights    Auto                              5 for positive and 1 for negative
SVM Shrinking        No                                No

Table 3.3: Cross-validated results of BB3 event subtask using optimal parameters

Metric       Fold 1    Fold 2    Average
Recall       0.552     0.610     0.581
Precision    0.469     0.582     0.526
F1-score     0.507     0.596     0.552

Table 3.4: Cross-validated results of SeeDev binary subtask using optimal parameters

Metric       Fold 1    Fold 2    Average
Recall       0.363     0.386     0.375
Precision    0.261     0.246     0.254
F1-score     0.303     0.301     0.302

The classifiers for the two tasks use two very different sizes of feature vectors. The BB3 parameter set has a significant amount of repeated unigrams data, with unigrams for the dependency path and whole sentence with and without parts of speech.
This parameter set also does not use feature selection, meaning that the feature vectors are very large (14,862 features). Meanwhile, the SeeDev parameters use feature selection to select the top 5% of features which reduces the feature vector from 7,140 features down to only 357. This size difference is very interesting and warrants further exploration of feature selection for other tasks.

Unfortunately, both classifiers performed best with a sentence range of zero, meaning that only relations within sentences could be detected. Tables 3.3 and 3.4 show the optimal cross-validated results that were found with these parameters. Notably, the F1-scores for the two folds of the SeeDev dataset are very similar, which is surprising given that the datasets are different sizes.

For the GE4 subtask, the cross-validation based optimisation strategy was used to find parameters for the entity, relation and modification extractions independently. Due to the larger dataset, filtering was applied to the argument types of relation candidates as shown in Figure 3.2. Table 3.5 outlines the resulting F1-scores from the five-fold cross-validations. As these extractors are trained separately, their performance in the full pipeline would be expected to be worse. This is because any errors during entity extraction are passed on to relation and modification extraction.

Table 3.5: Averaged cross-validated F1-score results of GE4 event subtask with entities, relations and modifications trained separately

Metric       Entities    Relations    Mods
Recall       0.703       0.695        0.374
Precision    0.897       0.736        0.212
F1-score     0.786       0.715        0.266

3.4.3 Competition results

The official results for the BB3 and SeeDev tasks are shown in Tables 3.6 and 3.7. Only VERSE competed in the competition as Kindred was developed at a later date. VERSE performed well in both tasks and was ranked first for the BB3 event subtask and third for the SeeDev binary subtask.
The worse performance for the SeeDev dataset may be explained by the added complexity of many additional relation and entity types.

Table 3.8 shows the final results for the test set for the Genia Event subtask using the online evaluation tool. As expected, the F1-scores of the relation and modification extraction are lower for the full pipeline compared to the cross-validated results. Nevertheless, the performance is very reasonable given the more challenging dataset.

Table 3.6: Cross-validated results (Fold1/Fold2) and final test set results for VERSE and Kindred predictions in Bacteria Biotope (BB3) event subtask with test set results for the top three performing tools: VERSE, TurkuNLP and LIMSI.

Data        Precision    Recall    F1 Score
Fold 1      0.319        0.715     0.441
Fold 2      0.460        0.684     0.550
Kindred     0.579        0.443     0.502
VERSE       0.510        0.615     0.558
TurkuNLP    0.623        0.448     0.521
LIMSI       0.388        0.646     0.485

Table 3.7: Cross-validated results (Fold1/Fold2) and final test set results for Kindred predictions in Seed Development (SeeDev) binary subtask with test set results for the top three performing tools: LitWay, UniMelb and VERSE.

Data       Precision    Recall    F1 Score
Fold 1     0.333        0.411     0.368
Fold 2     0.255        0.393     0.309
Kindred    0.344        0.479     0.400
LitWay     0.417        0.448     0.432
UniMelb    0.345        0.386     0.364
VERSE      0.273        0.458     0.342

Table 3.8: Final reported results for GE4 subtask split into entity, relations and modifications results

Metric       Entities    Relations    Mods
Recall       0.71        0.23         0.11
Precision    0.94        0.60         0.38
F1-score     0.81        0.33         0.17

3.4.4 Multi-sentence analysis

29% of relations span sentence boundaries in the BB3 event dataset and 4% in the SeeDev dataset. Most relation extraction systems do not attempt to predict these multi-sentence relations. Given the higher proportion in the BB3 set, we use this dataset for further analysis of VERSE’s ability to predict relations that span sentence boundaries. It should be noted that some of these relations may be artifacts due to false identification of sentence boundaries by the CoreNLP pipeline.

Using the optimal parameters for the BB3 problem, we analysed prediction results using different values for the sentence range parameter. The performance, shown in Figure 3.7, is similar for relations within the same sentence using different sentence range parameters. However, as the distance of the relation increases, the classifier predicts larger ratios of false positives to true positives. With sentence range = 3, the overall F1-score for the development set has dropped to 0.326 from 0.438 when sentence range = 1.

The classifier is limited by the small numbers of multi-sentence relations to use as a training set. With a suitable amount of data, it would be worthwhile exploring the use of separate classifiers for relations that are within sentences and those that span sentences as they likely depend on different features.

3.4.5 Error propagation in events pipeline

It should be noted that at each stage of the event extraction pipeline (Figure 3.1), additional errors can be introduced. If entities are not identified, then relations cannot be built upon them. And if entities or relations are missed, modifications cannot be predicted for them. At each stage, we targeted optimal F1-score with equal balance of precision and recall. An interesting future direction would be an exploration of different methods to reduce this error propagation, either targeting high recall (with lower precision) at each stage with a final cleanup method, or a unified approach that solves all three steps together.

3.4.6 Kindred

In order to show the efficacy of Kindred, we evaluate the performance on the BioNLP 2016 Shared Task data for the BB3 event extraction subtask and the SeeDev binary relation subtask. Parameter optimization was used for the BB3 subtask but not for the SeeDev subtask, which used the default set of feature types. Both tasks used a single multiclass classifier.
Tables 3.6 and 3.7 show both the cross-validated results using the provided training/development split as well as the final results for the test set.

The results are in line with the best performing tools in the shared task. It is to be expected that it does not achieve the best score in either task. VERSE, which achieved the best score in the BB3 subtask, utilized a computational cluster to test out different parameter settings for vectorization as well as classification. LitWay, the winner of the SeeDev subtask, used hand-crafted rules for a number of the relation types. Given the computational speed and simplicity of the system, Kindred is a valuable contribution to the community.

Figure 3.7: Analysis of performance on binary relations that cross sentence boundaries. The classifier was trained on the BB3 event training set and evaluated using the corresponding development set.

These results suggest several possible extensions of Kindred. Firstly, a hybrid system that mixes a vector-based classifier with some hand-crafted rules may improve system performance. This would need to be implemented to allow customization in order to support different biomedical tasks. Kindred is also geared towards PubMed abstract text, especially given the integration with PubTator. Using PubTator’s API to annotate other text would allow Kindred to easily integrate other text sources, including full-text articles where possible. Given the open nature of the API, we hope that these improvements, if desired by the community, could be easily developed and tested.

Kindred has several weaknesses that we hope to improve. It does not properly handle entities that lie within tokens. For example, a token “HER2+”, with “HER2” annotated as a gene name, denotes a breast cancer subtype that is positive for the HER2 receptor. Kindred will currently associate the full token as a gene entity and will not properly deal with the “+”.
This is not a concern for the BioNLP Shared Task problem but may become important in other text mining tasks.

3.5 Conclusion

We have presented VERSE, a full event extraction system that performed very well in the BioNLP 2016 Shared Task, and its successor, the Kindred Python package.

The VERSE system builds upon the success of previous systems, particularly TEES, in several important ways. Firstly, it gives full control of the specific semantic features used to build the classifier. In combination with the stochastic optimisation strategy, this control has been shown to be important given the differing parameter sets found to be optimal for the different subtasks. Secondly, VERSE allows for feature selection, which is important in reducing the size of the large sparse feature vectors and avoiding overfitting. Lastly, VERSE can predict relations that span sentence boundaries, which is likely to be an important avenue of research for future relation extraction tasks. We hope that this tool will become a valuable asset in the biomedical text-mining community.

Kindred is designed for ease-of-use to encourage more researchers to test out relation extraction in their research. By integrating a selection of file formats and connecting to a set of existing resources including PubAnnotation and PubTator, Kindred will make the first steps for a researcher less cumbersome. We also hope that the codebase will allow researchers to build upon the methods to make further improvements in relation extraction research.

Chapter 4

A literature-mined resource for drivers, oncogenes and tumor suppressors in cancer

4.1 Introduction

As sequencing technology becomes more widely integrated into clinical practice, genomic data from cancer samples is increasingly being used to support clinical decision making as part of precision medicine efforts.
Many initiatives use targeted panels that focus on well understood cancer genes; however, more comprehensive approaches such as exome or whole genome sequencing, which often uncover variants in genes of uncertain relevance to cancer, are increasingly being employed. Interpreting individual cancer samples requires knowledge of which mutations are significant in cancer development. The importance of a particular mutation depends on the role of the associated gene and the specific cancer type. The terms “oncogene” and “tumor suppressor” are commonly used to denote genes (or aberrated forms) that respectively promote or inhibit the development of cancer. Genes of special significance to a particular cancer type or subtype are often described as “drivers”. A deletion or loss-of-function mutation in a tumor suppressor gene associated with the cancer type of the sample is potentially an important event for this cancer. Furthermore, amplifications and gain-of-function mutations in oncogenes, and any somatic activity in known driver genes, may be valuable information in understanding the mutational landscape of a given cancer sample. This knowledge can then help select therapeutic options and improve our understanding of markers of resistance in the particular cancer type.

A variety of methods exist to identify a gene as a driver, oncogene or tumor suppressor given a large set of genomic data. Many methods use the background mutation rate and gene lengths to calculate a p-value for the observed number of somatic events in a particular gene (Kristensen et al., 2014).
Other studies use the presence of recurrent somatic deletions or low expression to deduce that a gene is a tumor suppressor (Cheng et al., 2017). In-vitro studies that examine the effect of gene knockdowns on the cancer’s development are also used (Zender et al., 2008).

Structured databases with information about the role of different genes in cancer, specifically as drivers, oncogenes and tumor suppressors, are necessary for automated analysis of patient cancer genomes. The Cancer Genome Atlas (TCGA) project has provided a wealth of information on the genomic landscape of over 30 types of primary cancers (Weinstein et al., 2013). Data from TCGA (and other resources) are presented in the IntOGen resource to provide easy access to lists of driver genes (Gonzalez-Perez et al., 2013). The Cancer Gene Census has been curated using data from COSMIC to provide known oncogenes and tumor suppressors (Futreal et al., 2004) but faces the huge cost of manual curation. The Network of Cancer Genes (Ciccarelli et al., 2018) builds on top of the Cancer Gene Census and integrates a wide variety of additional contextual data including cancer types in which the genes are frequently mutated. Other resources that provide curated information about cancer genes include TSGene (Zhao et al., 2015) and ONGene (Liu et al., 2017), but these do not match genes with specific cancer types. There are also two other resources that are no longer accessible for unknown reasons (NCI Cancer Gene Index and MSKCC Cancer Genes database).

Text mining approaches can be used to automatically curate the role of genes in cancer, by identifying mentions of genes and cancer types, and extracting their relations from abstracts and full-text articles. Machine learning methods have shown great success in building protein-protein interaction (PPI) networks using such data (Szklarczyk et al., 2016).
We present CancerMine, a robust and regularly updated resource that describes drivers, oncogenes and tumor suppressors in all cancer types using the latest ontologies. By weighting gene roles by the number of supporting papers and using a high-precision classifier, we mitigate the noise in biomedical corpora and extract highly relevant structured knowledge.

4.2 Methods

4.2.1 Corpora Processing

PubMed abstracts and full-text articles from the PubMed Central Open Access (PMCOA) subset and Author Manuscript Collection (PMCAMC) were downloaded from the NCBI FTP website using the PubRunner framework (paper in preparation - https://github.com/jakelever/pubrunner). They were then converted to BioC format using PubRunner’s convert functionality. This strips out formatting tags and other metadata and retains the Unicode text of the title, abstract and, for PMCOA, the full article. The source of the text (title, abstract, article) is also encoded.

4.2.2 Entity recognition

Lists of cancer types and gene names were built using a subset of the Disease Ontology (DO) and NCBI gene lists. These were complemented by matching to the Unified Medical Language System (UMLS). For cancer types, this was achieved using the associated ID in DO or through exact string matching on the DO item title. For gene names, the Entrez ID was used to match with UMLS IDs. The cancer type was then associated with a DO ID, and the gene names were associated with their HUGO gene name. These cancer and gene lists were then pruned with a manual list of stop-words with several custom additions for alternate spellings/acronyms of cancers. All cancer terms with fewer than four letters were removed except for a selected set of abbreviations, e.g. GBM for glioblastoma multiforme.

The corpus text was loaded in BioC format and processed using the Kindred Python package which, as of v2.0, uses the Spacy IO parser (described in Chapter 3). Using the tokenization, entities were identified through exact string matching against tokens.
Longer entity names with more tokens were prioritised and removed from the sentence as entities were identified. Fusion terms (e.g. BCR-ABL1) were identified by finding gene names separated by a hyphen or slash. Non-fusions, which are mentions with multiple gene symbols that actually refer to a single gene (e.g. HER2/neu), were then identified when two genes with matching HUGO IDs were attached, and were combined into a single non-fusion gene entity. Genes mentioned in the context of pathways were also removed (e.g. MTOR pathway) using a list of pathway related keywords.

4.2.3 Sentence selection

After Kindred parsing, the sentences with tagged entities were searched for those containing at least one cancer type and at least one gene name. These sentences were then filtered using the terms “tumor suppress”, “oncogen” and “driv” to enrich for sentences that were likely discussing these gene roles.

4.2.4 Annotation

From the complete set, 1,600 of the sentences were then randomly selected and output into the BioNLP Shared Task format for ingestion into an online annotation platform. This platform was then used by three expert annotators who are all PhD students actively engaged in precision cancer projects. The platform presents each possible pair of a gene and cancer, and the user must annotate this as driver, oncogene or tumor suppressor. The first 100 sentences were used to help the users understand the system, evaluate initial inter-annotator agreement, and adjust the annotation guidelines (available at the Github repository). The results were then discarded and the complete 1,500 sentences were annotated by the first two annotators. The third annotator then annotated the sentences that the first two disagreed on. The inter-annotator agreement was calculated using the F1-score.
A gold corpus was created using the majority vote of the annotations of the three annotators.

4.2.5 Relation extraction

To create a training and test split, 75% of the 1,500 sentences were used as a training set and a Kindred relation classifier was trained with an underlying logistic regression model for all three gene roles (Driver, Oncogene and Tumor_Suppressor). The threshold was varied to generate the precision-recall curves with evaluation on the remaining 25% of sentences. With the selection of the optimal thresholds, a complete model was trained using all 1,500 sentences. This model was then applied to all sentences found in PubMed, PMCOA and PMCAMC that fit the sentence requirements. The associated gene and cancer type IDs were extracted, entity names were normalized and the specific sentence was extracted.

4.2.6 Web portal

The resulting cancer gene roles data were aggregated by the triples (gene, cancer, role) in order to count the number of citations supporting each cancer gene role. This information was then presented in tabular and chart form using a Shiny web application.

4.2.7 Resource comparisons

The data from the Cancer Gene Census (CGC), IntOGen, TSGene and ONGene resources were downloaded for comparison. HUGO gene IDs in CancerMine were mapped to Entrez gene IDs. CGC data was mapped to Disease Ontology cancer types using a combination of the cancer synonym list created for CancerMine and manual curation. Oncogenes and tumor suppressors were extracted using the presence of “oncogene” or “TSG” in the “Role in Cancer” column. The mapped CGC data was then compared against the set of oncogenes and tumor suppressors in CancerMine. IntOGen cancer types were manually mapped to corresponding Disease Ontology cancer types and compared against all of CancerMine.
The TSGene and ONGene gene sets were compared against the CancerMine gene sets without an associated cancer type.

4.2.8 CancerMine profiles and TCGA analysis

For each cancer type, the citation counts for each gene role that were in the top 30 cancer genes were then log10-transformed and rescaled so that the most important gene had the value of 1 for each cancer type. Gene roles with values lower than 0.2 for all cancer types were trimmed. The top 30 cancer types and genes were then hierarchically clustered for the associated heatmap.

The open-access VarScan somatic mutation calls for the seven TCGA projects (BRCA, COAD, LIHC, PRAD, LGG, LUAD, STAD) were downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov). They were filtered for mutations that contained a stop gain or were classified as probably damaging or deleterious by PolyPhen. Tumor suppressor specific CancerMine profiles were generated that used all tumor suppressors for each cancer type. The citation counts were again log10-transformed and rescaled to produce the CancerMine tumor suppressor profile. Each TCGA sample was represented as a binary vector matching the filtered mutations. The dot-product of a sample vector and a CancerMine profile vector produced the sum of citation weightings and gave the score. For each sample, the score was calculated for all seven cancer types and the highest score was used to label the sample. A sample that did not contain tumor suppressor mutations associated with any of the seven profiles or could not be labelled unambiguously was labelled as ‘none’.

4.3 Results

4.3.1 Role of 3,775 unique genes catalogued in 426 cancer types

The entire PubMed, PubMed Central Open Access subset (PMCOA) and PubMed Central Author Manuscript Collection (PMCAMC) corpora were processed to identify sentences that discuss a gene and a cancer type within titles, abstracts and, where accessible, full-text articles.
By filtering for a customized set of keywords, these sentences were enriched for those likely discussing the genes’ role and 1,500 randomly selected sentences were manually annotated by three expert annotators. Using a custom web interface and a well-defined annotation manual, the annotators tagged sentences that discussed one of three gene roles (driver, oncogene and tumor suppressor) with a mentioned type of cancer (Fig 4.1A). An example of a simple relation that was annotated as “Tumor Suppressor” is: “DBC2 is a tumor suppressor gene linked to breast and lung cancers” (PMID: 17023000). A more complex example illustrates a negative relation: “KRAS mutations are frequent drivers of acquired resistance to cetuximab in colorectal cancers” (PMID: 24065096). In this case, the KRAS mutations are drivers of drug resistance, and not of cancer development as required for annotation of driver relations.

With high inter-annotator agreement (Fig 4.1B), the data were split into 75%/25% training and test sets. A machine learning model was built for each of the three roles and precision-recall curves were generated (Fig 4.1C) using the test set. Receiver operating characteristic (ROC) curves were not used as the class balance for each relation was below 20%. A high threshold was selected for each gene role in order to provide high-precision prediction with the accepted trade-off of low recall (Fig 4.1D).

Figure 4.1: The supervised learning approach of CancerMine involves manual annotation by experts of sentences discussing cancer gene roles. Machine learning models are then trained and evaluated using this data set. (a) Manual text annotation of 1,500 randomly selected sentences containing genes and cancer types shows a similar number of Oncogene and Tumor Suppressor annotations. (b) The inter-annotator agreement (measured using F1-score) was high between three expert annotators: 0.803 between Annotators 1 and 2, 0.791 between Annotators 1 and 3, and 0.723 between Annotators 2 and 3. (c) The precision-recall curves show the trade-off of false positives versus false negatives. (d) Plotting the precision-recall data in relation to the threshold applied to the classifier’s decision function provides a way to select a high-precision threshold.

The trade-off of higher precision with lower recall was made based on the hypothesis that there exists a large amount of redundancy within the published literature. The same idea is often stated multiple times in different papers in slightly different ways. Therefore, for frequently stated ideas, a method with lower recall would likely identify at least one occurrence. Nevertheless, we also distribute a version with thresholds of 0.5 for researchers who are willing to accept a higher level of noise.

We apply the models to all sentences selected from PubMed abstracts and PMCOA/PMCAMC full-text articles, identifying 35,951 sentences from 26,767 unique papers that mention gene roles in cancer. We extract the unique gene/cancer pairs for each role (Fig 4.2A) and find that 3,775 genes and 426 cancer types are covered.
These capture the commonly discussed cancer genes and types (Fig 4.2B/C) from a large variety of journals (Fig 4.2D). We provide coverage of 21% (426/2,044) of the cancer types described in the Disease Ontology (Schriml et al., 2011), each having at least one gene association. These results are made accessible through a web portal which can be explored through a gene or cancer-centric view. The resulting data are stored with Zenodo for versioning and download. This storage will provide the results in perpetuity. The results are licensed under the Creative Commons Public Domain (CC0) license to allow this data to be easily integrated with precision cancer workflows.

Our hypothesis of high levels of redundancy within the literature is supported by the frequent extraction of commonly-known gene roles such as ERBB2 as an oncogene in breast cancer (421 citations) and APC as a tumor suppressor in colorectal cancers (107 citations). On the other hand, there is a long tail of gene roles with only a single citation: 10,903 of 14,820 (73.6%) extracted cancer gene roles (Fig 4.2E). For researchers that are accepting of a higher false positive rate, we provide an additional less stringent dataset using a lower prediction threshold and estimated average precision and recall of 0.5 and 0.6 respectively. The individual prediction scores, akin to probabilities, are included so that users can further refine the results if needed.

Figure 4.2: Overview of the cancer gene roles extracted from the complete corpora. (a) The counts of the three gene roles extracted. (b) and (c) show the most frequently extracted genes and cancer types in cancer gene roles. (d) The most frequent journal sources for cancer gene roles with the section of the paper (abstract, article or title) highlighted by color. (e) A large number of cancer gene roles have only a single supporting citation, but a large number (3,917) have multiple citations.

4.3.2 60 novel putative tumor suppressors are published in literature each month

By examining the publication dates of the articles containing the mined cancer gene roles, we can see that the rate of published cancer gene roles is increasing over time (Fig 4.3A). In 2017, there were 6,851 mentions of cancer gene roles in publications, translating to approximately 571 each month. Approximately 69% of these are gene roles that have been published previously, but more importantly, the remaining 31% are novel. A breakdown by the role shows that oncogene and tumor suppressor gene mentions greatly outnumber driver genes. In 2017, 1,358, 3,632 and 1,861 genes were mentioned as drivers, oncogenes and tumor suppressors respectively (Fig 4.3B). Combining this data, we find that there were, on average, 22 novel drivers, 96 novel oncogenes and 60 novel tumor suppressors described in literature each month. This emphasizes the need to update these text mining results more often than once a year. To this end, we have integrated the CancerMine resource with the PubRunner framework to execute intelligent updates once a month (paper forthcoming - https://github.com/jakelever/pubrunner).

Figure 4.3: Examination of the sources of the extracted cancer gene roles with publication date. (a) More cancer gene roles are extracted each year but the relative proportion of novel roles remains roughly the same. (b) Roles extracted from older papers tend to focus on oncogenes, but mentions of driver genes have become more frequent since 2010. (c) The full text article is becoming a more important source of text mined data. (d) Different sections of the paper, particularly the Introduction and Discussion parts, are key sources of mentions of cancer gene roles.

Unhindered access to the full-text of articles for text mining purposes remains a key challenge. A larger number of cancer gene role mentions are extracted from the full text (25,641) than from the abstract alone (15,291), with a smaller number extracted from the titles (4,150). As can be seen in Fig 4.3C, the number extracted from full text articles is increasing dramatically over time. This is likely linked to the increasing number of publications included in the PubMed Central Open Access subset and Author Manuscript Collection.
This strengthens the need for publishers to provide open access and for funding agencies to require publications in platforms that allow text mining. From the full text articles, we extract, where possible, the in-text location of the relationship captured within the paper (Fig 4.3D). Interestingly, a substantial number of the mentions are found in the Introduction section, suggesting that the cancer gene’s role is usually discussed as background information and not a result of the paper. Knowing the subsection that a relationship is captured from can be valuable information for CancerMine users, since a user can then quickly ascertain if the discussed cancer role is prior knowledge or a likely result from the publication. This also highlights the important point that the scientific quality of a paper cannot be verified automatically by text mining technologies, since these methods rely on the statements made by the original author. Hence, any use of text-mined resources will always require users to access the original papers to evaluate the assertion of a gene’s role in a particular cancer.

Figure 4.4: (a) Cancer gene roles first discussed many years ago have a longer time to accrue further mentions. (b) Some cancer gene roles grow substantially in discussion while others fade away. (c) CancerMine can further validate the dual roles that some genes play as oncogenes and tumor suppressors (genes shown: FOXP1, ID4, MEN1, NOTCH1, PTCH1, RAB25, TCF3 and WT1). Citation counts are shown in parentheses.

Cancer gene roles that are first mentioned at earlier timepoints have more time to accrue additional citations (Fig 4.4A). Thus, it is no surprise that while most cancer gene roles have fewer than 10 associated citations, those with very large citation counts tend to have been first published more than 10 years ago. For instance, ERBB2’s role as an oncogene in breast cancer is first extracted from a publication in 1988 and has accumulated 421 citations that fit our extraction criteria in literature since then. However, there are some cancer gene roles that were first extracted from publications within the last decade but have already accrued a great number of additional mentions. For instance, KRAS driving non-small cell lung carcinoma is first extracted from a paper published in 2010, and already has 92 other papers mentioning this role since.
Lastly, there are 691 cancer gene roles that are mentioned in literature before 2000, but are not extracted in papers after that period. The most frequently mentioned cancer gene role that reflects this pattern is MYC as an oncogene in cervix carcinoma, with 10 papers mentioning it before 2000 but no further citations afterwards.

With the knowledge of date of publication, we have gleaned a historical perspective on the gene relations captured in literature. Fig 4.4B summarizes three trends of citations that we observe, as exemplified by three gene associations with breast cancer. ERBB2 is an example of the small number of well established oncogenes that are more frequently discussed year upon year. NRAS in breast cancer exemplifies a gene that continues to be discussed in a single paper every few years, but has never gained importance in this cancer. RUNX3 has been discussed as a tumor suppressor in breast cancer in many papers in just the last few years. Its mechanism of action was elucidated after aggregated data from cell-line sequencing projects revealed its likely role as a tumor suppressor (Huang et al., 2012).

The cancer type is important when trying to understand the context of somatic mutations. This is underscored by examples such as NOTCH1. NOTCH1 is a commonly-cited gene that behaves as an oncogene in one cancer (acute T cell leukemia) and as a tumor suppressor in another (head and neck squamous cell carcinoma) (Radtke and Raj, 2003). We further validate CancerMine by querying the resource for the set of genes that are (i) strongly identified as an oncogene in at least one cancer type (>90% of >=4 citations) and (ii) strongly identified as a tumor suppressor in at least one other cancer type. This method successfully identifies NOTCH1 along with several other genes that are reported to play dual roles in different cancer types (Fig 4.4C).
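The dual-role query described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for CancerMine: the function name `find_dual_role_genes`, the input layout (a dictionary keyed by gene, cancer and role) and the toy gene and cancer names are all hypothetical, and the criterion ">90% of >=4 citations" is interpreted here as a share of the combined oncogene and tumor suppressor citations for one gene in one cancer type.

```python
from collections import defaultdict

ROLES = ("Oncogene", "Tumor_Suppressor")

def find_dual_role_genes(citations, min_citations=4, min_fraction=0.9):
    """Return genes that are strongly oncogenic in one cancer type and
    strongly tumor suppressive in another.

    citations: dict mapping (gene, cancer, role) -> citation count.
    Assumption: "strongly" means the role holds more than min_fraction of
    the oncogene + tumor suppressor citations for that gene/cancer pair,
    and the pair has at least min_citations such citations.
    """
    # Collect oncogene and tumor suppressor counts per (gene, cancer) pair.
    per_pair = defaultdict(lambda: dict.fromkeys(ROLES, 0))
    for (gene, cancer, role), count in citations.items():
        if role in ROLES:
            per_pair[(gene, cancer)][role] += count

    # Record the cancer types in which each gene has a strong role.
    strong = defaultdict(lambda: {role: set() for role in ROLES})
    for (gene, cancer), counts in per_pair.items():
        total = sum(counts.values())
        if total < max(min_citations, 1):
            continue
        for role, count in counts.items():
            if count / total > min_fraction:
                strong[gene][role].add(cancer)

    # A cancer type can only be strong for one role, so having both sets
    # non-empty implies two different cancer types.
    return {gene: roles for gene, roles in strong.items()
            if roles["Oncogene"] and roles["Tumor_Suppressor"]}

# Toy input with made-up counts (not taken from CancerMine).
toy_citations = {
    ("GENE_A", "cancer_1", "Oncogene"): 10,
    ("GENE_A", "cancer_2", "Tumor_Suppressor"): 5,
    ("GENE_B", "cancer_1", "Oncogene"): 3,           # too few citations
    ("GENE_B", "cancer_2", "Tumor_Suppressor"): 8,
    ("GENE_C", "cancer_1", "Oncogene"): 5,           # evenly split role
    ("GENE_C", "cancer_1", "Tumor_Suppressor"): 5,
    ("GENE_C", "cancer_2", "Tumor_Suppressor"): 6,
}
dual = find_dual_role_genes(toy_citations)
print(sorted(dual))
```

In this toy input only GENE_A satisfies both conditions; GENE_B lacks enough oncogene citations and GENE_C has no cancer type in which the oncogene role dominates.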
Figure 4.5: A comparison of CancerMine against resources that provide context for cancer genes. Each panel plots intersection sizes and set sizes for CancerMine and one other resource. (a) The CancerMine resource contains substantially more cancer gene associations than the Cancer Gene Census resource. (b) Surprisingly few of the cancer gene associations are overlapping between the IntOGen resource and CancerMine. CancerMine overlaps substantially with the genes listed in the (c) TSGene and (d) ONGene resources. Intersection-size bar labels as extracted: (a) 12,726, 960, 347; (b) 13,001, 2,043, 278; (c) 1,295, 742, 475; (d) 1,885, 536, 267.

4.3.3 Text mining provides voluminous complementary data to Cancer Gene Census

The Cancer Gene Census (CGC) (Futreal et al., 2004) provides manually curated information about cancer genes with mutation types and their roles in cancer. CancerMine contains information on 3,775 genes (compared to 554 in CGC) and 426 cancer types (compared to 201 in CGC). CancerMine overlaps with roughly a quarter of the oncogenes and tumor suppressors in the CGC when comparing specific cancer types (Fig 4.5A). When the CGC is compared to the less stringent CancerMine dataset, a further 202 cancer gene roles were found to match. This indicates that CGC contains curated information not easily captured using the sentence extraction method and that CancerMine represents an excellent complementary resource to work with CGC. Our resource also provides the sentence in which the gene role is discussed, and citations that link to the corresponding published literature are made available to help the user easily evaluate the evidence supporting the gene’s role.
CancerMine would be an excellent resource for prioritizing future curation of literature for resources such as CGC.

The IntOGen resource leverages a number of cancer sequencing projects, including the Cancer Genome Atlas (TCGA), to index genes inferred to contain driver mutations. A comparison of the genes with their cancer types in CancerMine shows surprising differences (Fig 4.5B). CancerMine includes a much larger set of genes but has little overlap with the IntOGen resource. This suggests that many of the genes identified through the projects included in IntOGen are not yet frequently discussed in the literature with respect to the specific cancer types in the IntOGen resource.

ONGene and TSGene2 provide lists of oncogenes and tumor suppressors. Unfortunately these gene names are not associated with specific cancer types, which is an important aspect for precision oncology. When trying to differentiate between driving and passenger mutations, the lack of cancer type context would likely cause a high false positive rate. CancerMine contains ~67% of the genes in ONGene and ~61% of TSGene2, and contains substantially more genes than both resources (Fig 4.5C/D). These results lend more weight to the use of automated text mining approaches for the population of knowledge bases, since no curation is required to keep the resource up-to-date.
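The overlap percentages quoted above can be checked with simple arithmetic. The sketch below assumes that the three bar values in each panel of Figure 4.5 are, in order, the CancerMine-only count, the shared count and the other-resource-only count for panels (c) and (d), and the CancerMine-only count, the CGC-only count and the shared count for panel (a). That reading of the figure is an assumption made for illustration (it is the reading that reproduces the percentages in the text), and the helper name `coverage` is hypothetical.

```python
def coverage(shared, other_only):
    """Fraction of the other resource's entries that CancerMine also contains."""
    return shared / (shared + other_only)

# Assumed reading of the Figure 4.5 bar values (see the note above).
ongene = coverage(shared=536, other_only=267)   # panel (d)
tsgene = coverage(shared=742, other_only=475)   # panel (c)
cgc = coverage(shared=347, other_only=960)      # panel (a)

for name, value in [("ONGene", ongene), ("TSGene", tsgene), ("CGC", cgc)]:
    print(f"{name}: {value:.1%} also found in CancerMine")
```

Under this reading the ONGene and TSGene fractions round to 67% and 61%, and the CGC fraction is a little over a quarter, in line with the statements in the text.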
Figure 4.6: CancerMine data allows the creation of profiles for different cancer types using the number of citations as a weighting for each gene role. (a) The similarities between the top 30 cancer types in CancerMine are shown through hierarchical clustering of cancer types and genes using weights from the top 30 cancer gene roles. (b) All samples in seven TCGA projects are analysed for likely loss-of-function mutations compared with the CancerMine tumor suppressor profiles and matched with the closest profile. Percentages shown in each cell are the proportion of samples labelled with each CancerMine profile that are from the different TCGA projects. Samples that match no tumor suppressor in these profiles or are ambiguous are assigned to none. The TCGA projects are breast cancer (BRCA), colorectal adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC), prostate adenocarcinoma (PRAD), low grade glioma (LGG), lung adenocarcinoma (LUAD) and stomach adenocarcinoma (STAD).

4.3.4 CancerMine provides insights into cancer similarities

Oncology often takes an organ-centric view of cancer types, which is reflected by the numerous disease ontologies that exist for the categorization and nomenclature of cancer, including the Disease Ontology used in this project. However, modern medicine is beginning to consider some cancers based purely on the genetic underpinnings, developing basket trials and approving treatment regimens based on genetic indications only (as shown with the successful approval of Pembrolizumab for PD-1 positive cancer patients). The CancerMine resource allows for the creation of a gene-centric view of cancers, by clustering cancers based on the role of different genes. A gene-centric view has the potential to reveal treatment regimes that could be transferred to other genetically similar cancer types. To allow for visualisation, we selected the top 30 cancers (based on citation count in CancerMine) and extracted the number of citations mentioning the role of the top 30 genes. This produces a profile for each cancer type showing the importance of each gene and its associated role.
A heatmap that illustrates these profiles for the top 30 cancer types and genes is shown in Figure 4.6A.

The clustering puts biologically similar or equivalent cancers together that are separate entities in the Disease Ontology. For example, it groups colorectal with colon cancer and malignant glioma with glioblastoma multiforme. Some of these clusters also highlight known gene-cancer associations; for example, lung cancer, non-small cell lung carcinoma and lung adenocarcinoma all cluster together, and are heavily associated with the KRAS and EGFR oncogenes. In fact, the strong cluster of genes on the left side separates cancers that are strongly associated with KRAS, EGFR, and TP53 (such as lung cancer) from those that are less so (such as thyroid medullary carcinoma). Put together, this approach is able to explain biological similarity of cancer types using shared gene associations. As an example, leukemia clusters closely with the more specific subtype, acute myeloid leukemia, and it is evident that this is driven by extracted associations of these cancers with MYC, ABL1 and many other genes. Several gene associations are noticeably low frequency compared to overall patterns, for instance KRAS in glioblastoma multiforme (GBM). While there are a small number of papers discussing KRAS in GBM, it is an infrequently discussed gene compared to EGFR and PTEN. Overall this visualisation presents an easy method to explore the similarities and differences between cancer types.

In order to validate the cancer genes identified in CancerMine, we compare results to somatic mutation data from the Cancer Genome Atlas (TCGA) project. We hypothesise that the genes denoted to be tumor suppressors would likely be affected by loss-of-function mutations. Oncogenes may be affected by gain-of-function mutations which are harder to identify, hence our focus on tumor suppressors.
Using CancerMine profiles based on tumor suppressor genes, we compare somatic calls for all samples with mutation data within seven TCGA projects. For each sample, we match the somatic calls against the set of CancerMine tumor profiles and sum the importance of the tumor suppressors found to be mutated. Figure 4.6B shows the percentages of top matches to each CancerMine profile. Six of the seven CancerMine profiles have their highest proportion of matches with the corresponding TCGA project. Interestingly, a large number of breast cancer and prostate cancer samples cannot be unambiguously labelled with one of the CancerMine profiles. For prostate cancer, roughly one third of the samples do not have any loss-of-function (LoF) mutations that match against any tumor suppressors for any of the seven types, suggesting that prostate cancer tumor suppressors are disabled through other mechanisms or that there are more tumor suppressors involved which have not been captured by CancerMine.

The glioma (LGG) result is the most prominent with 70.5% of TCGA LGG samples being most closely identified with the CancerMine malignant glioma profile. This is largely due to the high prevalence of IDH1 (390/503) mutations identified in the LGG cohort. While this data would not be enough for a tumor type classifier on its own, this result shows there is substantial signal that can be leveraged for interpreting the genomic data and could be combined further with other mutational data. This is underscored when examining breast cancer tumor suppressors with only a single citation, genes that are hypothetically not well known to be tumor suppressors in breast cancer.
Seven of these genes (ARID1B, FGFR2, KDM5B, SPEN, TBX3, PRKDC and KMT2C) are mutated in at least 10 TCGA BRCA samples, providing extra support for the importance of these genes in breast cancer. In fact, the mechanism through which KMT2C inactivation drives tumorigenesis was recently elaborated in ER-positive breast cancer (Gala et al., 2018).

4.4 Discussion

This work contributes a much needed resource of known drivers, oncogenes and tumor suppressors in a wide variety of cancer types. The text mining approach taken is able to discern complicated descriptions of cancer gene roles with a high level of precision. This provides for a continually updated resource with little need for human intervention. This generalizable method could extract other types of biological knowledge with only minor changes.

However, there are several limitations to this approach that present interesting but challenging avenues for further investigation. Firstly, this method focuses on single sentence extraction due to the challenge of anaphora and coreference resolution across sentences. In Chapter 3, we showed that a high false positive rate occurs when identifying knowledge across multiple sentences. Our approach requires that authors discuss the gene name, role and cancer name all within the same sentence. This is a problem of writing style and probabilities that gets greatly diluted with the large number of publications processed. Furthermore, our approach focuses on individual genes in isolation and is unable to capture complex interactions between cancer genes discussed in papers, e.g. mutual exclusivity. More of these complex relationships will likely be identified in future research and play a part in interpreting the somatic events in an individual cancer patient.
Text mining approaches face growing challenges with extracting complex events like these, which may span multiple sentences or even paragraphs.

One important concept when interpreting CancerMine data is that our methodology does not force a definition of a driver, oncogene or tumor suppressor and relies on the assertion of individual authors. A decision was made to not extract discussion of genes frequently mutated in cancer. This was due to the acknowledged problem of huge genes (e.g. TTN) that frequently accrue many somatic mutations but likely don’t play a part in cancer. Instead we rely on the authors’ assertions of the role a gene plays in cancer. The level of evidence differs greatly as some assertions are based on interventional studies (e.g. knockdowns) while others use observational studies (e.g. mutation frequency or expression experiments).

As has been noted, many attempts have been made to create knowledge bases on this topic. Hosting the data through Zenodo and the code through GitHub provides a level of continuity that will guarantee that the project code and data stay accessible for the foreseeable future. Furthermore, the PubRunner integration makes it easier to keep the results up-to-date. All data and analysis for this chapter are open source and documented. We hope others will explore this data in order to infer new knowledge of cancer types and their associated genes.

Chapter 5

Text-mining clinically relevant cancer biomarkers for curation into the CIViC database

5.1 Introduction

The ability to stratify patients into groups that are clinically related is an important step towards a personalized approach to cancer. Over time, a growing number of biomarkers have been developed in order to select patients who are more likely to respond to certain treatments. These biomarkers have also been valuable for prognostic purposes and for understanding the underlying disease biology by defining different molecular subtypes of cancers that should be treated in different ways (e.g.
ERBB2/ER/PR testing in breast cancer (Onitilo et al., 2009)). Immunohistochemistry techniques are the primary approach for testing samples for diagnostic markers (e.g. CD15 and CD30 for Hodgkin’s disease (Rüdiger et al., 1998)). Recently, the lower cost and increasing speed of sequencing has allowed the DNA and RNA of individual patient samples to be characterized for clinical applications (Prasad et al., 2016). Throughout the world, this technology is beginning to inform clinician decisions on which treatments to use (Shrager and Tenenbaum, 2014). Such efforts are dependent on a comprehensive and current understanding of the clinical relevance of variants. For example, the Personalized Oncogenomics project at the BC Cancer Agency identifies somatic events in the genome such as point mutations, copy number variations and large structural changes and, in conjunction with gene expression data, generates a clinical report to provide an ‘omic picture of a patient’s tumor (Jones et al., 2010).

The huge genomic variability in cancers means that each patient sample includes a huge number of new mutations, many of which have never been documented before (Chang et al., 2016). The phenotypic impact of most of these mutations is difficult to discern. This problem is exacerbated by the driver/passenger mutation paradigm where only a fraction of mutations are essential to the cancer (drivers) while many others have occurred through mutational processes that are irrelevant to the cancer and are deemed to have simply come along for the ride (passengers). An analyst trying to understand a new patient sample typically performs a literature review for each gene and specific variant.
This is needed to understand its relevance in a cancer type, characterize the driver/passenger role of its observed mutations, and gauge the relevance for clinical decision making.

Several groups have built their own in-house knowledge bases which are developed as analysts examine increasing numbers of cancer patient samples. This tedious and largely redundant effort represents a substantial interpretation bottleneck impeding the progress of precision medicine (Good et al., 2014). To encourage a collaborative effort, the CIViC database (https://civicdb.org) was launched to provide a wiki-like editable online resource where edits and additions are moderated by experts in order to maintain high quality (Griffith et al., 2017). The resource provides information about clinically-relevant variants in cancer. Variants include protein-coding point mutations, copy number variations, epigenetic marks, gene fusions, aberrant expression levels and other ‘omic events. It supports four types of biomarkers (also known as evidence types).

Diagnostic evidence items describe variants that can help a clinician diagnose or exclude a cancer. For instance, the JAK2 V617F mutation is a major diagnostic criterion for myeloproliferative neoplasms to identify polycythemia vera, essential thrombocythemia and primary myelofibrosis. Predictive evidence items describe variants that help predict drug sensitivity or response and are valuable in deciding further treatments. Predictive evidence items often explain mechanisms of resistance in patients who progressed on a drug treatment. For example, the ABL1 T315I missense mutation in the BCR-ABL fusion predicts poor response to imatinib, a tyrosine kinase inhibitor that would otherwise effectively target BCR-ABL, in patients with chronic myeloid leukemia.
Predisposing evidence items describe germline variants that increase the likelihood of developing a particular cancer, such as BRCA1 mutations for breast/ovarian cancer or RB1 mutations for retinoblastoma. Lastly, prognostic evidence items describe variants that predict survival outcome. As an example, colorectal cancers that harbor a KRAS mutation are predicted to have worse survival.

CIViC presents this information in a human-readable text format consisting of an ‘evidence statement’ such as the sentence describing the ABL1 T315I mutation above together with data in a structured, programmatically accessible format. A CIViC ‘evidence item’ includes this statement, ontology-associated disease name (Schriml et al., 2011), evidence type as defined above, drug (if applicable), PubMed ID and other structured fields. Evidence items are manually curated and associated in the database with a specific gene (defined by Entrez Gene) and variant (defined by the curator).

Several other groups have created knowledge bases to aid clinical interpretation of cancer genomes. Many of these projects have joined the Variant Interpretation for Cancer Consortium (VICC, http://cancervariants.org/) to coordinate these efforts and have created a federated search mechanism to allow easier analysis across multiple knowledge bases (Wagner et al., 2018). The CIViC project is co-leading this effort along with OncoKB (Chakravarty et al., 2017), the Cancer Genome Interpreter (Tamborero et al., 2018), Precision Medicine Knowledge base (Huang et al., 2017), Molecular Match, JAX-Clinical Knowledge base (Patterson et al., 2016) and others.

Most of these projects focus on clinically-relevant genomic events, particularly point mutations, and provide associated clinical information tiered by different levels of evidence. Only CIViC includes RNA expression-based biomarkers.
These may be of particular value for childhood cancers which are known to be ‘genomically quiet’, having accrued very few somatic mutations. Consequently, their clinical interpretation may rely more heavily on transcriptomic data (Adamson et al., 2014). Epigenomic biomarkers will also become more relevant as several cancer types are increasingly understood to be driven by epigenetic misregulation early in their development (Baylin and Ohm, 2006). For example, methylation of the MGMT promoter is a well known biomarker in brain tumors for sensitivity to the standard treatment, temozolomide (Hegi et al., 2005).

The literature on clinically relevant cancer mutations is growing at an extraordinary rate. For instance, there were only 5 publications with BRAF V600E in title or abstract in PubMed in 2004 compared to 454 citations in 2017. In order to maintain a high quality and up-to-date knowledge base, a curation pipeline must be established. This typically involves a queue for papers, triaging those that should be curated and then assignment to a highly experienced curator. This prioritisation step is immensely important given the limited time of curators and the potentially vast number of papers to be reviewed. Prioritisation must identify papers that contain knowledge that is of current relevance to users of the knowledge base. For instance, selecting papers for drugs that are no longer clinically approved would not be valuable to the knowledge base.

Text mining methods have become a common approach to help prioritise papers. These methods fall broadly into two main categories, information retrieval (IR) and information extraction (IE). IR methods focus on paper-level information and can take multiple forms. Complex search queries for specific terms or paper metadata (helped by the MeSH term annotations of papers in biomedicine) are common tools for curators.
More advanced document clustering and topic modelling systems can use semi-supervised methods to predict whether a paper would be relevant for curation. Examples of this approach include the document clustering method used for the ORegAnno project (Aerts et al., 2008).

IE methods extract structured knowledge directly from the papers. This can take the form of entity recognition, by explicitly tagging mentions of biomedical concepts such as genes, drugs and diseases. A further step can involve relation extraction to understand the relationship discussed between tagged biomedical entities. This structured information can then be used to identify papers relevant for the knowledge base. IE methods are also used for automated knowledge base population without a manual curation step. For example, the mirTex knowledge base, which collates microRNA and their targets, uses automated relation extraction methods to populate the knowledge base (Li et al., 2015). Protein-protein interaction networks (such as STRING (Szklarczyk et al., 2016)) are often built using automatically generated knowledge bases.

The main objective of this project was to identify frequently discussed cancer biomarkers which fit the CIViC model but are not yet included in the CIViC knowledge base. We developed an IE-based method to extract key parts of the evidence item: cancer type, gene, drug (where applicable) and the specific evidence type from published literature. This allows us to count the number of mentions of specific evidence items in abstracts and full text articles and compare against the CIViC knowledge base. This chapter will present our methods to develop this resource, known as CIViCmine (http://bionlp.bcgsc.ca/civicmine/). The main contributions of this work are an approach for knowledge base construction that could be applied to many areas of biology and medicine, a machine learning method for extracting complicated relationships between four entity types, and extraction of relationships across the largest possible publicly accessible set of abstracts and full text articles. This resource, containing 70,655 biomarkers, is valuable to all cancer knowledge bases to aid their curation and also as a tool for precision cancer analysts searching for biomarkers not yet included in any other resource.

5.2 Methods

5.2.1 Corpora

The full PubMed and PubMed Central Open Access (PMCOA) subset corpora were downloaded from the NCBI FTP website using the PubRunner infrastructure (Anekalla et al., 2017). These documents were converted to the BioC format for processing with the Kindred package (described in Chapter 3). HTML tags were stripped out and HTML special characters converted to Unicode. Metadata about the papers were retained including PubMed IDs, titles, journal information and publication date. Subsections of the paper were extracted using a customised set of acceptable section headers such as “Introduction”, “Methods”, “Results” and many synonyms of these. The corpora were downloaded in bulk in order to not overload the EUtils RESTful service that is offered by the NCBI. In order to avoid duplications of publications in PMCOA and PubMed, the PMIDs of all documents included in PMCOA were used to filter out abstracts from the PubMed corpus. The update files from PubMed were also processed to identify the latest version of each abstract to process.

5.2.2 Term Lists

Term lists were curated for genes, diseases and drugs based on several resources. The cancer list was curated from a section of the Disease Ontology (Schriml et al., 2011). All terms under the “cancer” (DOID:162) parent term were selected and then filtered to remove unspecific names of cancer (e.g. “neoplasm” or “carcinoma”).
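The selection of terms under a parent term can be sketched as a simple graph traversal. This is a minimal illustration under stated assumptions, not the curation code used here: the ontology is represented as a plain parent-to-children dictionary, every identifier other than DOID:162 is a made-up placeholder, and the set of unspecific names contains only the two examples given in the text.

```python
def descendants(children, root):
    """Collect every term below `root` in a parent -> children mapping."""
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# Only the two examples from the text; the real filter list is not reproduced.
UNSPECIFIC_NAMES = {"neoplasm", "carcinoma"}


def cancer_terms(children, names, root="DOID:162"):
    """Select terms under `root` and drop those with unspecific names."""
    return {
        term: names[term]
        for term in descendants(children, root)
        if names[term].lower() not in UNSPECIFIC_NAMES
    }


# Toy ontology: the TOY:* identifiers are hypothetical placeholders.
toy_children = {
    "DOID:162": ["TOY:1", "TOY:2"],
    "TOY:1": ["TOY:3"],
    "TOY:9": ["TOY:10"],
}
toy_names = {
    "DOID:162": "cancer",
    "TOY:1": "carcinoma",
    "TOY:2": "leukemia",
    "TOY:3": "lung carcinoma",
    "TOY:9": "infectious disease",
    "TOY:10": "influenza",
}
selected_terms = cancer_terms(toy_children, toy_names)
```

In this toy example the unspecific term "carcinoma" is dropped while its more specific child "lung carcinoma" is kept, and terms outside the cancer subtree are never visited.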
These cancer types were then matched with synonyms from the Unified Medical Language System (UMLS) Metathesaurus (2017AB release; Bodenreider, 2004), either through existing external reference links in the Disease Ontology or through exact string-matching on the main entity names. The additional synonyms in the UMLS were then added through this link. The genes list was built from the Entrez gene list and complemented with UMLS terms. Terms that overlapped with common words found in scientific literature (e.g. ice) were removed.

The drug list was curated from the Wikidata resource (Vrandečić and Krötzsch, 2014). All Wikidata entities that are drug instances (Wikidata identifier: Q12140) were selected using a SPARQL query. The generic name, brand name and synonyms were extracted where possible. This list was complemented by a custom list of general drug categories (e.g. chemotherapy, tyrosine kinase inhibitors, etc.) and a list of inhibitors built using the previously discussed gene list. This allowed for the extraction of terms such as “EGFR inhibitors”. This was done because analysts are often interested in biomarkers associated with drug classes that target a specific gene, in addition to specific drugs.

All term lists were filtered with a stopwords list. This was based on the stopword list from the Natural Language Toolkit (Bird, 2006) and the most frequent 5,000 words found in the Corpus of Contemporary American English (Davies, 2009) as well as a custom set of terms. It was then merged with common words that occur as gene names (such as ICE).

A custom variant list was built that captured the main types of point mutations (e.g. loss of function), copy number variation (e.g. deletion), epigenetic marks (e.g. promoter methylation) and expression changes (e.g. low expression). These variants were complemented by a synonym list.

5.2.3 Entity extraction

The BioC corpora files were processed by the Kindred package.
This NLP package used Stanford CoreNLP (Manning et al., 2014) for processing in the original published version (Lever and Jones, 2017). For this project, it was changed to spaCy (Honnibal and Johnson, 2015) in version 2 for the improved Python bindings. This provided easier integration and execution on a cluster without running a Java subprocess. spaCy was used for sentence splitting, tokenization and dependency parsing of the corpora files.

Exact string matching was then used against the tokenized sentences to extract mentions of cancer types, genes, drugs and variants. Longer terms were prioritised during extraction so that “non small cell lung cancer” would be extracted instead of just “lung cancer”. Variants were also extracted with a regular expression system for extracting protein coding point mutations (e.g. V600E).

Table 5.1: The five groups of search terms used to identify sentences that potentially discussed the four evidence types. Strings such as “sensitiv” are used to capture multiple words including “sensitive” and “sensitivity”.

General   Diagnostic   Predictive   Predisposing   Prognostic
marker    diagnostic   sensitiv     risk           survival
                       resistance   predispos      prognos
                       efficacy                    DFS
                       predict

Gene fusions (such as BCR-ABL1) were detected by identifying mentions of genes separated by a forward slash, hyphen or colon. If the two entities had no overlapping HUGO IDs, then it was flagged as a possible gene fusion and combined into a single entity. If there were overlapping IDs, it was deemed likely to be referring to the same gene. An example is HER2/neu which is frequently seen and refers to a single gene (ERBB2) and not a gene fusion.

Acronyms were also detected where possible by identifying terms in parentheses and checking the term before it, for instance “non-small cell lung carcinoma (NSCLC)”. This was done to remove entity mistakes where possible.
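The longest-term-first matching and the point mutation pattern described earlier in this section can be sketched as follows. This is an illustrative sketch and not the Kindred implementation: the function `tag_terms`, the lowercasing of tokens, the toy term list and the exact regular expression (one-letter amino acid, position, one-letter amino acid) are all assumptions made for the example.

```python
import re

# Assumed pattern for protein coding point mutations such as V600E or T315I:
# a one-letter amino acid code, a position, and another amino acid code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b[{AMINO_ACIDS}][1-9][0-9]*[{AMINO_ACIDS}]\b")


def tag_terms(tokens, term_list):
    """Greedy exact matching over tokens, trying the longest term first.

    term_list maps a term (space-separated tokens) to an entity type.
    Matching on lowercased tokens is an assumption of this sketch.
    """
    lookup = {tuple(term.lower().split()): label for term, label in term_list.items()}
    longest = max(map(len, lookup), default=0)
    lowered = [token.lower() for token in tokens]

    matches, i = [], 0
    while i < len(tokens):
        # Try the longest window first so longer terms win over shorter ones.
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + size])
            if key in lookup:
                matches.append((" ".join(tokens[i:i + size]), lookup[key]))
                i += size
                break
        else:
            i += 1  # no term starts at this token
    return matches


toy_terms = {
    "lung cancer": "cancer",
    "non small cell lung cancer": "cancer",
    "BRAF": "gene",
}
toy_tokens = "BRAF V600E mutations in non small cell lung cancer".split()
toy_matches = tag_terms(toy_tokens, toy_terms)
toy_mutations = POINT_MUTATION.findall(" ".join(toy_tokens))
```

With this toy input the longer term "non small cell lung cancer" is tagged as a single cancer mention rather than the shorter "lung cancer", and V600E is picked up by the pattern.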
The acronym detection method takes the short form (the term in brackets) and iterates backwards through the long form (the term before brackets) looking for potential matches for each letter. If the long form and short form have overlapping associated ontology IDs, they likely refer to the same thing and can be combined, as in the example above. If only one of the long form or short form has an associated ontology ID, they are combined and assigned the associated ontology ID. If both long form and short form have ontology IDs but there is no overlap, the short form is disregarded as the long form is more likely to identify the specific term correctly.

Gene mentions that are likely associated with signalling pathways and not specific genes (e.g. “MTOR signalling”) are also removed using a simple pattern based on the words after the gene mention. One final post-processing step merges neighbouring terms that refer to the same entity. So “HER2 neu” would be combined into one entity as the two terms (HER2 and neu) refer to the same gene.

5.2.4 Sentence selection

With all biomedical documents parsed and entities tagged, all sentences were selected that mention at least one gene, at least one cancer and at least one variant. A drug was not required as only one (Predictive) of the four evidence types involves a drug entity. These sentences were enriched by filtering with certain keywords that are strongly associated with the different evidence items. The full list and groupings of keywords are shown in Table 5.1. This grouping is done to make sure that each evidence type is represented reasonably equally in the training data. The General category with the keyword “marker” is included to catch additional sentences that discuss markers, which may relate to any of the four evidence types. Several of the keywords are stems in order to capture different forms of the word, e.g.
prognosis or prognostic. The acronym “DFS”, which means “disease free survival”, is also included as it was found in many sentences describing prognosis.

5.2.5 Annotation Platform

A web platform for simple relation annotation was built using Bootstrap (https://getbootstrap.com/). This allowed annotators to work using a variety of devices, including their smartphones. The annotation system could be loaded with a set of sentences with entity annotations stored in a separate file (also known as standoff annotations). When provided with a relation pattern, for example “Gene/Cancer”, the system would search the input sentences and find all pairs of the given entity types in the same sentence. It would make sure that the two entities are not the same term, as in some sentences a token (or set of tokens) could be annotated as both a gene and a cancer, for instance “retinoblastoma”. For a sentence with 2 genes and 2 cancer types, it would find all four possible pairs of gene and cancer type.

Each sentence, with all the possible candidate relations matching the relation pattern, would be presented to the user, one at a time (Fig 5.1). The user can then select various toggle buttons for the type of relation that these entities are part of. They can also use these to flag entity extraction errors or mark contentious sentences for discussion with other annotators.

Figure 5.1: A screenshot of the annotation platform that allowed expert annotators to select the relation types for different candidate relations in all of the sentences. The example sentence shown would be tagged using “Predictive/Prognostic” as it describes a prognostic marker.

Figure 5.2: An overview of the annotation process. Sentences are identified from the literature that describe cancers, genes, variants and optionally drugs and then filtered using search terms. The first test phase tried complex annotation of biomarker and variants together but was unsuccessful.
The annotation task was split into two separate tasks for biomarkers and variants. Each task had a test phase and then the main phase on the 800 sentences that were used to create the gold set.

5.2.6 Annotation

For the annotation stage (outlined in Fig 5.2), an equal number of sentences were selected from each of the groups outlined in Table 5.1. This guaranteed coverage of all four evidence types, which was needed because the Prognostic type otherwise dominated the other groups. If this step was not done, 100 randomly selected sentences would only contain 2 (on average) from the Diagnostic group. However, this sampling provided poor coverage of sentences that describe specific point mutations. Many precision oncology projects only focus on point mutations and so a further requirement was that 50% of sentences for annotation include a specific point mutation. All together, this sampling provides better coverage of the different omic events and evidence types that were of interest. Special care is required when evaluating models built on this customized training set as an unweighted evaluation would not be representative of the real literature.

Sentences that contain many permutations of relationships (e.g. a sentence with 5 genes and 5 cancer types mentioned) were removed. An upper limit of 5 possible relations was enforced for each sentence. This was done with the knowledge that the subsequent relation extraction step would have a greater false positive rate for sentences with a very large number of possible relations. It was also done to make the annotation task more manageable. An annotation manual was constructed with examples of sentences that would and would not match the four evidence types. This was built in collaboration with CIViC curators.
The annotation manual is available in our GitHub repository.

The annotation began with a test phase of 100 sentences that had poor annotator agreement and required a refinement of the annotation task outlined in this paragraph. The test phase allows the annotators to become accustomed to the annotation platform and make adjustments to the annotation manual to clarify misunderstandings. The first test phase (Biomarker + Variant) involved annotating sentences for ternary (gene, cancer, variant) or quaternary (gene, cancer, variant, drug) relationships. The ternary relationships included Diagnostic, Prognostic and Predisposing and the quaternary relationship was Predictive. A low F1-score inter-annotator agreement (average of 0.52) forced us to reconsider the annotation approach. This poor agreement was likely due to including variants within the annotations, which created a large combinatorial problem of exactly which entity mentions to include within a relationship. In order to simplify the problem, the task was split into two separate annotation tasks, the biomarker annotation and the variant annotation. The biomarker annotation involved binary (gene, cancer) and ternary (gene, cancer, drug) relations that described one of the evidence types. The Predictive and Prognostic evidence types were merged (as shown in Figure 5.2), to further reduce the annotation complexity. The Predictive/Prognostic annotations could be separated after tagging as relationships containing a drug would be Predictive and those without would be Prognostic. Any Prognostic relationship for a gene and cancer type that were also in a Predictive relationship was removed. The variant annotation task (gene, variant) focused on whether a variant (e.g. deletion) was associated with a specific gene in the sentence.

With the redefined annotation task, six annotators were involved in biomarker annotation, all with knowledge of the CIViC platform and experience interpreting patient cancer genome samples.
Three annotators were involved in variant annotation, all with experience in cancer genomics. Both annotation tasks started with a new 100-sentence test phase to evaluate the redefined annotation tasks and resolve any ambiguity within the annotation manuals. Good inter-annotator agreement was achieved at this stage for both the biomarker annotation (average F1-score = 0.68) and variant annotation (average F1-score = 0.95). These 100 sentences were discarded as they exhibited a learning curve as annotators became comfortable with the task.

[Figure 5.3: three pairwise agreement matrices among Annotators 1 to 3; the pairwise F1-scores shown are 0.73 to 0.74 in panel (a), 0.78 to 0.85 in panel (b) and 0.96 in panel (c).]

Figure 5.3: The inter-annotator agreement for the main phase for 800 sentences, measured with F1-score, showed good agreement in the two sets of annotations for biomarkers (a) and (b) and very high agreement in the variant annotation task (c). The sentences from the multiple test phases are not included in these numbers and are discarded from the further analysis.

After a video-conference discussion, the annotation manuals were refined further. The main phase of biomarker annotation involved three annotators working on 400 sentences and the other three working on a different 400 sentences. Separately, three annotators worked on variant annotation with the 800 sentence set. Figure 5.3 shows the inter-annotator agreement for these tasks for the full 800 sentences. Each sentence is annotated by three annotators and a majority vote system is used to resolve conflicting annotations. The biomarker and variant annotations are then merged to create the gold corpus of 800 sentences used for the machine learning system.

5.2.7 Relation extraction

The sentences annotated with relations were then processed using the Kindred relation extraction Python package.
Relation extraction models were built for all five of the relation types: the four evidence types (Diagnostic, Predictive, Predisposing and Prognostic) and one AssociatedVariant relation type. Three of the four evidence type relations are binary between a Gene entity and a Cancer entity. The AssociatedVariant relation type is also binary between a Gene entity and a Variant entity. The Predictive evidence item type was ternary between a Gene, a Cancer Type and a Drug.

Most relation extraction systems focus on binary relations (Björne and Salakoski, 2013; Bui et al., 2013) and use features based on the dependency path between those two entities. The recent BioNLP Shared Task 2016 series included a subtask for non-binary relations (i.e. relations between three or more entities) but no entries were received (Chaix et al., 2016). Relations between 2 or more entities are known as n-ary relations where n ≥ 2. The Kindred relation extraction package, based on the VERSE relation extraction tool (described in Chapter 3) which won part of the BioNLP Shared Task 2016, was enhanced to allow prediction of n-ary relations. First, the candidate relation builder was adapted to search for relations of a fixed n which may be larger than 2. This meant that sentences with 5 non-overlapping tagged entities would generate 60 candidate relations with n = 3. These candidate relations would then be pruned by entity types. Hence, for the Predictive relation type (with n = 3), the first entity must be a Cancer Type, the second a Drug and the third a Gene. Two of the features used are based on the path through the dependency graph between the entities in the candidate relation. For relations with more than two entities, Kindred made use of a minimal spanning tree within the dependency graph.

The default Kindred features (outlined below) were then constructed for this subgraph and the associated entities and sentences.
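The candidate generation and type pruning steps described above can be checked with a short sketch. This is not Kindred's own candidate builder: the function `candidate_relations`, the `(text, type)` entity representation and the toy sentence entities are assumptions for illustration. Ordered selections of n entities are used because 5 × 4 × 3 = 60 matches the count quoted for five entities with n = 3.

```python
from itertools import permutations


def candidate_relations(entities, n, accepted_types=None):
    """Enumerate ordered n-tuples of entities, optionally pruned by type.

    entities: list of (text, entity_type) pairs from one sentence.
    accepted_types: required entity types in argument order, or None.
    """
    candidates = list(permutations(entities, n))
    if accepted_types is not None:
        wanted = tuple(accepted_types)
        candidates = [
            candidate for candidate in candidates
            if tuple(entity_type for _, entity_type in candidate) == wanted
        ]
    return candidates


# Toy sentence with five non-overlapping tagged entities (hypothetical).
toy_entities = [
    ("lung cancer", "cancer"),
    ("erlotinib", "drug"),
    ("gefitinib", "drug"),
    ("EGFR", "gene"),
    ("KRAS", "gene"),
]
all_candidates = candidate_relations(toy_entities, n=3)
predictive_candidates = candidate_relations(
    toy_entities, n=3, accepted_types=("cancer", "drug", "gene")
)
```

For this toy sentence the 60 unpruned candidates reduce to 4 once the (Cancer Type, Drug, Gene) ordering is enforced, which shows how much of the candidate space the type pruning removes before classification.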
All features were represented with 1-hot vectors or bag-of-words representations.

• Entity types in the relation
• Unigrams between each pair of entities within the relation
• Bigrams of the entire sentence
• All edge types within the minimal spanning tree of the dependency graph that links all entity nodes
• Edge types of edges that are attached to entity nodes within the dependency graph

During training, candidate relations are generated with the same arity (n) as the relations in the training set. Those candidate relations that match a training example are flagged as positive examples with all others as negative. These candidate relations are vectorized and a logistic regression classifier is trained against them. The logistic regression classifier outputs an interpretable score akin to a probability for each relation, which was later used for filtering. Kindred also supports a Support Vector Machine (SVM) classifier or can be extended with any classifier from the scikit-learn package (Pedregosa et al., 2011a). The logistic regression classifier was more amenable to adjustment of the precision-recall tradeoff.

Table 5.2: Number of annotations in the training and test sets

Annotation          Train   Test
AssociatedVariant   768     270
Diagnostic          156     62
Predictive          147     43
Predisposing        125     57
Prognostic          232     88

For generation of the knowledge base, the four evidence type relations were predicted first which provided relations including a Gene. The AssociatedVariant relation was then predicted and attached to any existing evidence type relation that included that gene.

5.2.8 Evaluation

With the understanding that the annotated sentences were selected randomly from customised subsets and not randomly from the full population, care was taken in the evaluation process.

First, the annotated set of 800 sentences was split 75%/25% into a training and test set that had similar proportions of the four evidence types (Table 5.2). Each sentence was then tracked with the group it was selected from (Table 5.1).
Each group has an associated weight based on the proportion of the entire population of possible sentences that it represents. Hence, the Prognosis group, which dominates the others, has the largest weight. When comparing predictions against the test set, the weighting associated with each group was then used to adjust the confusion matrix values. The goal of this weighting scheme was to provide performance metrics which would be representative for randomly selected sentences from the literature and not for the customised training set.

[Figure 5.4 panels: (a) Precision against Recall and (b) Precision and Recall against Threshold, each with one sub-plot per relation type (AssociatedVariant, Diagnostic, Predictive, Predisposing, Prognostic).]

Figure 5.4: (a) The precision-recall curves illustrate the performance of the five relation extraction models built for the four evidence types and the associated variant prediction. (b) This same data can be visualised in terms of the threshold values on the logistic regression to select the appropriate value for high precision with reasonable recall.

Table 5.3: The selected thresholds for each relation type with the high precision and lower recall trade-off.

Extracted Relation   Threshold   Precision   Recall
AssociatedVariant    0.70        0.941       0.794
Diagnostic           0.63        0.957       0.400
Predictive           0.93        0.891       0.141
Predisposing         0.86        0.837       0.218
Prognostic           0.65        0.878       0.414

5.2.9 Precision-recall Tradeoff

Figure 5.4a shows precision-recall curves for all five of the relation types. The Diagnostic and Predisposing tasks are obviously the most challenging for the classifier.
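One way to read the weighting scheme of Section 5.2.8 together with these curves is sketched below: each test example adds its group's weight, rather than 1, to the confusion matrix, and precision and recall are then recomputed at each threshold. The function name, the group names other than Prognosis, the weights and the scores are all made-up assumptions for illustration and are not the values used in this work.

```python
def weighted_precision_recall(examples, group_weights, threshold):
    """Compute precision and recall with group-weighted confusion-matrix counts.

    examples: iterable of (score, is_positive, group) tuples, where score is
    the classifier output and group is the sentence group of the example.
    """
    tp = fp = fn = 0.0
    for score, is_positive, group in examples:
        weight = group_weights[group]
        predicted = score >= threshold
        if predicted and is_positive:
            tp += weight
        elif predicted and not is_positive:
            fp += weight
        elif not predicted and is_positive:
            fn += weight
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Illustrative weights and scores only (assumptions, not thesis data).
group_weights = {"prognosis": 3.0, "other": 1.0}
examples = [
    (0.95, True, "prognosis"),
    (0.80, False, "other"),
    (0.70, True, "other"),
    (0.40, True, "prognosis"),
    (0.20, False, "other"),
]

for threshold in (0.5, 0.9):
    p, r = weighted_precision_recall(examples, group_weights, threshold)
    print(threshold, round(p, 3), round(r, 3))
```

In this toy data, raising the threshold from 0.5 to 0.9 lifts precision from 0.8 to 1.0 while recall drops from about 0.571 to about 0.429, which is the kind of trade-off the threshold plots are used to inspect.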
This same data can be visualised using the threshold values used against the output of the logistic regression for each metric (Fig 5.4b).

In order to provide a high quality resource, we decided on a trade-off of high precision with low recall. We hypothesised that the most commonly discussed cancer biomarkers, which are the overall goal of this project, would appear in many papers using different wording. These frequently mentioned biomarkers would then likely be picked up even with lower recall. This also reduces the burden on CIViC curators to sift through false positives. With this, we selected thresholds that would give precision as close as possible to 0.9, given the precision-recall curves for the four evidence types. We require a higher precision for the variant annotation (0.94). The thresholds and associated precision-recall tradeoffs are shown for all five extracted relations in Table 5.3.

5.2.10 Application to PubMed and PMCOA

With the thresholds selected, the final models were applied to all sentences extracted from PubMed and PMCOA. This is a reasonably large computational problem and was tasked to the compute cluster at the Genome Sciences Centre.

In order to manage this compute and provide infrastructure for easy updating with new publications in PubMed and PMCOA, we made use of the updated Pubrunner infrastructure (paper in preparation - https://github.com/jakelever/pubrunner). This allows for easy distribution of the work across a compute cluster. The resulting data was then pushed to Zenodo (https://zenodo.org/) for perpetual and public hosting. The data is released with a Creative Commons Public Domain (CC0) license so that other groups can easily make use of it.

5.2.11 CIViC Matching

In order to make comparisons with CIViC, we downloaded the nightly data file from CIViC (https://civicdb.org/releases) and matched evidence items against each other. The evidence type and IDs for genes and cancers were used for matching.
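A minimal sketch of this kind of key-based matching is given below, under stated assumptions: the dictionary field names, the placeholder identifiers and the helper names (match_key, split_by_overlap) are hypothetical and do not reflect the CIViC file format or the CIViCmine code. The key uses the evidence type with gene and cancer identifiers, adds the drug name as a plain string for Predictive items, and leaves the variant out.

```python
def match_key(item):
    """Build a comparison key for one evidence item (a dict).

    The variant field is deliberately not part of the key.
    """
    key = (item["evidence_type"], item["gene_id"], item["cancer_id"])
    if item["evidence_type"] == "Predictive":
        # Drug names are compared as exact strings.
        key += (item["drug"],)
    return key

def split_by_overlap(mined_items, curated_items):
    """Return keys found only in the mined set, in both, and only in the curated set."""
    mined = {match_key(item) for item in mined_items}
    curated = {match_key(item) for item in curated_items}
    return mined - curated, mined & curated, curated - mined

# Placeholder records (assumptions for illustration only).
mined_items = [
    {"evidence_type": "Predictive", "gene_id": "gene:1", "cancer_id": "cancer:1",
     "drug": "gefitinib", "variant": "mutation"},
    {"evidence_type": "Prognostic", "gene_id": "gene:2", "cancer_id": "cancer:2",
     "drug": None, "variant": "overexpression"},
]
curated_items = [
    {"evidence_type": "Predictive", "gene_id": "gene:1", "cancer_id": "cancer:1",
     "drug": "gefitinib", "variant": "L858R"},
    {"evidence_type": "Predictive", "gene_id": "gene:1", "cancer_id": "cancer:1",
     "drug": "tyrosine kinase inhibitors", "variant": "mutation"},
]

only_mined, shared, only_curated = split_by_overlap(mined_items, curated_items)
print(len(only_mined), len(shared), len(only_curated))  # 1 1 1
```

Because drug names are compared as exact strings, the drug-family record in the second list does not match the specific-drug record in the first.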
Direct string matching was used to compare drug names for Predictive biomarkers. The exact variant was not used for comparison in order to find genes that contain any biomarkers that match between the two resources.

Some mismatches occurred with drug names. For example, CIViCmine may capture information about the drug family while CIViC contains information on specific drugs, or a list of drugs. Another challenge with matching with CIViCmine is related to the similarity of cancer types in the Disease Ontology. There are several pairs of similar cancer types that are used interchangeably by some researchers and not by others, e.g. stomach cancer and stomach carcinoma. CIViC may contain a biomarker for stomach cancer and CIViCmine matches all the other details except it relates it to stomach carcinoma.

5.2.12 User interface

In order to make the data easily explorable, we provide a Shiny based front-end (Fig 5.5) (RStudio, Inc, 2013). This shows a list of biomarkers which can be filtered by the Evidence Type, Gene, Cancer Type, Drug and Variant. In order to help prioritize the biomarkers, we use the number of unique papers that the variants are mentioned in as a metric. By default, the listed biomarkers are shown with the highest citation count first. Whether the biomarker is found in CIViC is also shown as a column and is an additional filter. This allows CIViC curators to quickly navigate to biomarkers not currently discussed in CIViC and triage them efficiently.

With filters selected, the user is presented with pie-charts that illustrate the representation of different cancer types, genes and drugs. When the user clicks on a particular biomarker, an additional table is populated with the citation information. This includes the journal, publication year, section of the publication (e.g. title, abstract or main body), subsection (if cited from the main body) and the actual text of the sentence. This table can further be searched and sorted, for example to look for older citations or citations from a particular journal. The PubMed ID is also provided with a link to the citation on PubMed.

Figure 5.5: A Shiny-based web interface allows for easy exploration of the CIViCmine biomarkers with filters and overview piecharts. A main table shows the list of biomarkers and links to a subsequent table showing the list of supporting sentences.

5.3 Results

[Figure 5.6 panel: bar chart of the number of biomarkers for each evidence type (Diagnostic, Predictive, Predisposing, Prognostic).]

Figure 5.6: The entirety of PubMed and PubMed Central Open Access subset were processed to extract the four different evidence types shown.

From the full PubMed and PMCOA corpus, we extracted 70,655 biomarkers with a breakdown into the four types (Figure 5.6). As expected, there are many more Prognostic evidence items than the other three types. Table 5.4 outlines examples of all four of these evidence types. 34.9% of sentences (33,491/95,871) contain more than one evidence item, such as the Predictive example which relates EGFR as a predictive marker in NSCLC to both erlotinib and gefitinib.

Table 5.4: Four example sentences for the four evidence types extracted by CIViCmine. The associated PubMed IDs are also shown for reference.

Type           PMID       Sentence
Diagnostic     29214759   JAK2 V617F is the most common mutation in myeloproliferative neoplasms (MPNs) and is a major diagnostic criterion.
Predictive     28456787   In non-small cell lung cancer (NSCLC) driver mutations of EGFR are positive predictive biomarkers for efficacy of erlotinib and gefitinib.
Predisposing   28222693   Our study suggests that one BRCA1 variant may be associated with increased risk of breast cancer.
Prognostic     28469333   Overexpression of Her2 in breast cancer is a key feature of pathobiology of the disease and is associated with poor prognosis.

In total, we extracted 153,435 mentions of biomarkers from 54,274 unique papers.
These biomarkers relate to 6,591 genes, 510 cancer types and 334 drugs.

EGFR and TP53 stand out as the most frequently extracted genes in different evidence items (Fig 5.7a). Over 50% of the EGFR evidence items are associated with lung cancer or non-small cell lung carcinoma (NSCLC). CDKN2A has a larger proportion of diagnostic biomarkers associated with it than most of the other genes in the top 20. CDKN2A expression is a well-established marker for distinguishing HPV+ versus HPV- cervical cancers. Its expression or methylation are discussed as diagnostic biomarkers in a variety of other cancer types including colorectal cancer and stomach cancer.

Breast cancer is, by far, the most frequently discussed cancer type (Fig 5.7b). A number of the associated biomarkers focus on predisposition, as breast cancer has one of the strongest hereditary components associated with germline mutations in BRCA1 and BRCA2. NSCLC shows the largest relative number of predictive biomarkers, consistent with the previous figure showing the importance of EGFR.

[Figure 5.7 panels: bar charts of the number of biomarkers, broken down by evidence type (Prognostic, Diagnostic, Predictive, Predisposing). Axis categories as listed:
(a) Gene: TP53, EGFR, CDKN2A, ERBB2, PTEN, CDH1, VEGFA, BCL2, BIRC5, KRAS, MYC, BRCA1, CD274, PTGS2, BRAF, CCND1, ERCC1, CD44, AKT1, KIT.
(b) Cancer Type: breast cancer, colorectal cancer, hepatocellular carcinoma, non-small cell lung carcinoma, stomach cancer, ovarian cancer, lung cancer, prostate cancer, malignant glioma, esophagus squamous cell carcinoma, acute myeloid leukemia, glioblastoma multiforme, pancreatic cancer, melanoma, colon cancer, cervical cancer, urinary bladder cancer, ovarian carcinoma, lung adenocarcinoma, head and neck squamous cell carcinoma.
(c) Drug: chemotherapy, cisplatin, radiation therapy, tyrosine kinase inhibitors, paclitaxel, gefitinib, tamoxifen, doxorubicin, floxuridine, cetuximab, trastuzumab, erlotinib, gemcitabine, docetaxel anhydrous, sorafenib, imatinib, egfr tyrosine kinase inhibitors, bevacizumab, phenobarbital, temozolomide.
(d) Variant Type: expression, [unknown], overexpression, underexpression, mutation, protein expression, methylation, loss, substitution, amplification, single nucleotide polymorphism, promoter methylation, phosphorylation, deletion, knockdown, wildtype, inactivation, structural variant, germline mutation, protein overexpression.]

Figure 5.7: An overview of the top 20 (a) genes, (b) cancer types, (c) drugs and (d) variants extracted as part of evidence items.

For the predictive evidence type, we see a disproportionately large number associated with the general term chemotherapy and specific types of chemotherapy including cisplatin, paclitaxel and doxorubicin (Fig 5.7c). Many targeted therapies are also frequently discussed such as the EGFR inhibitors, gefitinib, erlotinib and cetuximab. More general terms such as “tyrosine kinase inhibitor” capture biomarkers related to drug families.

Lastly, we see that expression related biomarkers dominate the variant types (Fig 5.7d). Markers based on expression are more likely to be prognostic than those using non-expression data (81.3% versus 45.6%).
The easiest method to explore the importance of a gene in a cancer type is to correlate expression levels with patient survival. With the accessibility of large transcriptome sets and survival data (e.g. TCGA), such assertions have become very common. The ‘mutation’ variant type has a more even split across the four evidence types. The mutation term covers very general phrasing without a specific mention of the actual mutation. The substitution variant type does capture this information but there are far fewer. This reflects the challenge of extracting all the evidence item information from a single sentence. It is more likely for an author to define a mutation in another sentence and then use a general term (e.g. EGFR mutations) when discussing its clinical relevance. There are also a substantial number of evidence items where the variant cannot be identified and are flagged as ‘[unknown]’. These are still valuable but may require more in-depth curation in order to tease out the actual variant.

Of all the biomarkers extracted, 21.1% (14,931/70,655) are supported by more than one citation. In fact, the most cited biomarker is BRCA1 mutation as a predisposing marker in breast cancer with 545 different papers discussing this. The initial priority for CIViC annotation is on highly cited biomarkers that have not yet been curated into CIViC, in order to eliminate obvious information gaps. However, the single citations may also represent valuable information for precision cancer analysts and CIViC curators focused on specific genes or diseases.

We compared the 70,655 biomarkers extracted for CIViCmine with the 2,055 in the CIViC resource as of 05 June 2018. Figure 5.8a shows the overlap of exact evidence items between the two resources. The overlap is quite small and the number in CIViCmine not included in CIViC is very large. We next compare the cited publications using PubMed ID.
Despite not having used CIViC publications in training CIViCmine, we find that a substantial number of papers cited in CIViC (253/1,325) were identified automatically by CIViCmine (Fig 5.8b). Altogether, CIViCmine includes 5,568 genes, 388 cancer types and 272 drugs or drug families not yet included in CIViC.

[Figure 5.8 panels: overlap diagrams between CIViCmine and CIViC. (a) 62,610 in CIViCmine only, 154 shared, 1,901 in CIViC only; (b) 54,021 in CIViCmine only, 253 shared, 1,072 in CIViC only.]

Figure 5.8: A comparison of the evidence items curated in CIViC and automatically extracted by CIViCmine by (a) exact biomarker information and by (b) paper.

5.3.1 Use Cases

There are two use cases of this resource that have already been realised by CIViC curators at the McDonnell Genome Institute and analysts at the BC Cancer Agency.

Knowledge base curation use case: The main purpose of this tool is to assist in curation of new biomarkers in CIViC. A CIViC curator, looking for a frequently discussed biomarker, would access the CIViCmine Shiny app through a web browser. This would present the table, pie charts and filter options on the left. They would initially filter the CIViCmine results for those not already in CIViC. If they had a particular focus, they may filter by Evidence Type. For example, some CIViC curators may be more interested in Diagnostic, Predictive and Prognostic biomarkers than Predisposing. This is due to the focus on somatic events in many cancer types. They would then look at the table of biomarkers, already sorted by citation count in descending order, and select the top one. This would then populate a table further down the page. Assuming that this is a frequently cited biomarker, there would be many sentences discussing it, which would quickly give the curator a broad view of whether it is accepted in the community. They would then open multiple tabs on their web browser to start looking at several of the papers discussing it.
They might select an older paper, close to when it was first established as a biomarker, and a more recent paper from a high-impact journal to gauge the current view of the biomarker. Several of the sentences may obviously cite other papers as being important to establishing this biomarker. The curator would look at these papers in particular, as they may be the most appropriate to curate. Importantly, the curator may want the primary literature source(s), which includes the experimental data supporting this biomarker.

Personalized cancer analyst use case: While interpreting an individual patient tumor sample, an analyst typically needs to interpret a long list of somatic events. Instead of searching PubMed for each somatic event, they can initially check CIViC and CIViCmine for existing structured knowledge on the clinical relevance of each somatic event. First, they should check CIViC given the high level of pre-existing curation there. This would involve searching the CIViC database through their website or API. If the variant does not appear there, they would then progress to CIViCmine. By using the filters and search functionality, they could quickly narrow down the biomarkers for their gene and cancer type of interest. If a match is found, they can then move to the relevant papers that are listed below to understand the experiments that were done to make this assertion. If they agree with the biomarker, they could then suggest it as a curated biomarker for the CIViC database. Both CIViC and CIViCmine reduce curation burden by aggregating likely applicable data across multiple synonyms for the gene, disease, variant or drug not as easily identified through PubMed searches.

5.4 Discussion

This work provides several significant contributions to the fields of biomedical text mining and precision oncology. Firstly, the annotation method is drastically different from previous approaches. Most annotation projects (such as the BioNLP Shared Tasks (Kim et al., 2009; Kim et al., 2011) and the CRAFT corpus (Bada et al., 2012)) have focused on abstracts or entire documents. The biomarkers of interest for this project appear sparsely in papers so it would have been inappropriate to annotate full documents and a focus on individual sentences was necessary. We identified sentences that contained the appropriate entities and then filtered them further in order to provide a rich set that contained similar numbers of relevant sentences as irrelevant sentences that could then be annotated. This approach could be applied to many other biomedical topics.

We also made use of a simpler annotation system than the often used brat (Stenetorp et al., 2012) which allowed for fast annotation by restricting the possible annotation options. Specifically, annotators did not select the entities but were shown all appropriate permutations that matched the possible relation types. Issues of incorrect entity annotation were reported through the interface, collated and used to make improvements to the underlying wordlists for genes, cancer types and drugs. We found that once a curator became familiar with the task, they could curate sentences relatively quickly. Expert annotation is key to providing high quality data to build and evaluate a system. Therefore, reducing the time required for expert annotators is essential.

The supervised learning approach differs from co-occurrence based (STRING) or rule-based (mirTex) methods. Firstly, the method is able to extract complex meaning from the sentence, providing results that would be impossible with a co-occurrence method. A rule-based method would require enumerating the possible ways of describing each of the diverse evidence types. Our approach is able to capture a wide variety of biomarker descriptions. Furthermore, most relation extraction methods aim for optimal F1-score (Chaix et al., 2016), placing equal emphasis on precision and recall.
With the goal of minimizing false positives, our approach of high precision and low recall would be an appropriate model for other information extraction methods applied to the vast PubMed corpus.

Apart from the advantages outlined previously, several other factors led to the decision to use a supervised learning approach to build this knowledge base. The CIViC knowledge base could have been used as training data in some form. The papers already in CIViC could have been searched for the sentences discussing the relevant biomarker, which could then have been used to train a supervised relation extraction system. An alternative approach to this problem would have been to use a distant supervision method using the CIViC knowledge base as seed data. This approach was taken by Peng et al., who also attempted to extract relations across sentence boundaries (Peng et al., 2017). They chose to focus only on point mutations and extracted 530 within-sentence biomarkers and 1,461 cross-sentence biomarkers. These numbers are drastically smaller than the 70,655 extracted in CIViCmine.

The decision not to use the CIViC knowledge base in the creation of the training data was taken to avoid any curator-specific bias that may have formed in the selection of papers and biomarkers to curate. This was key to providing a broad and unbiased view of the biomarkers discussed in the literature. CIViC evidence items include additional information such as directionality of a relationship (e.g. does a mutation cause drug sensitivity or resistance), the level of support for it (from preclinical models up to FDA guidelines) and several other factors. It is highly unlikely that all this information will be included within a single sentence. Therefore, we did not try to extract this information concurrently.
Instead, it is an additional task for the curator as they process the CIViCmine prioritised list.

A robust named entity recognition solution does not exist for a custom term list of cancer types, drugs and variants. For instance, the DNorm tool does not capture many cancer subtypes. A decision was made to go for high recall for entity recognition, including genes, as the relation extraction step would then filter out many incorrect matches based on context. This decision is further supported by the constant evolution of cancer type ontologies as demonstrated by workshops at recent Biocuration conferences.

Finally, this research provides a valuable addition to the precision oncology informatics community. CIViCmine can be used to assist curation of other precision cancer knowledge bases and can be used directly by precision cancer analysts to search for biomarkers of interest. As this resource will be kept up-to-date with the latest research, it will likely constantly change as new cancer types and drug names enter the lexicon. We hope that the methods described can be used in other biomedical domains and that the resources provided will be valuable to the biomedical text mining and precision oncology fields.

Chapter 6

Conclusions

At inception of this thesis work, we hoped that text mining could someday become an everyday tool for the biomedical research community. We were specifically interested in the use of text mining to collate knowledge for the personalized oncology field. This final chapter will discuss how the work undertaken has contributed to these goals and what hurdles remain. We will broadly discuss the lessons learnt during this thesis and suggest interesting future directions to pursue, particularly to overcome some of the limitations acknowledged within this work.

6.1 Contributions

Many research areas are overwhelmed by potential hypotheses to test and automated hypothesis generation methods are designed to provide prioritized lists to researchers.
Several factors limit these methods being embraced by the biology research community, including the predictive performance, explainability, and poor awareness that these methods exist. Our work in Chapter 2 pushed forward the predictive performance by developing and evaluating a new approach using co-occurrence data. We showed that our SVD-based method outperformed the previously best performing methods and explored the explainability of some of the successful and failed predictions.

Supervised relation extraction is an important step past co-occurrences in information extraction. Our work with the VERSE and Kindred tools in Chapter 3 illustrated that vectorized dependency path-based approaches are the best method for biomedical relation extraction and that deep learning does not yet achieve the benefits seen in other fields with larger training dataset sizes. The VERSE system won part of the BioNLP Shared Task 2016. Furthermore, the packaging of Kindred makes it easier for other researchers to use our methods for their own problems.

Our CancerMine resource, described in Chapter 4, will benefit all cancer biology researchers as a valuable tool to understand the role of different genes in cancer. The high-precision knowledge extraction pipeline proves that single sentences do contain enough information for large-scale knowledge base construction. By examining the frequently cited gene roles, we were able to build profiles for each cancer type that can be used to find similarities between cancers and were validated by comparison to data in the Cancer Genome Atlas (TCGA) project.

Finally, Chapter 5 describes the CIViCmine resource designed specifically for curating information about the growing field of precision oncology and the clinical relevance of mutations in cancer. This resource will prove increasingly valuable in the coming years as more medical centres develop precision oncology programs.
The methods for annotating the training data and building a classifier that can scale to PubMed provide valuable guidelines for other groups interested in building a high-precision knowledge base in another area of biology.

6.2 Lessons Learnt

The stated goal of much biomedical text mining research is to help biologists and medical researchers absorb research and identify potential hypotheses for study. With the information overload present in published literature, automated methods should be used to guide researchers to the knowledge that they need. Throughout this thesis work, I have identified several key problems that frequently occur in biomedical text mining. These problems are fruitful areas for future research.

6.2.1 Inaccessible and out-of-date results

Firstly, and importantly, access to text mined results is key to adoption by researchers. Many research papers develop text mining methods where the code and/or data are not shared. These papers may benefit other text mining researchers with algorithmic improvement ideas or approaches that could be generalized to other text mining problems. But they do not help biologists.

Text mining published literature has been a focus of research for several decades. Advances in computational power within the last 15 years have made it possible to do large-scale processing of a large number of PubMed abstracts and full-text papers. Hence there have been multiple analyses of PubMed data, but very few are kept up-to-date as new publications are added to the corpus.

The reason for this lack of updating is primarily that researchers move on to other projects after publication and potentially move to other institutions (especially graduate students). The additional engineering required to maintain text mining results can be too much for a research group.
But if text mining is to become a ubiquitous tool for biologists, this must be a problem that is overcome and would be a valuable direction for future work.

6.2.2 User Interfaces

The way that a biologist can interact with the text-mined data is key. Even if the data is public, most biologists do not understand the value of text mining and would not go to the effort of downloading data and searching it themselves. Hence a user interface is absolutely essential for this development. To be more specific, a graphical user interface is required as few biologists would be willing to use a command-line application.

There are three common paths for building applications with graphical user interfaces. First, the tool can be implemented as a standalone desktop application. These require installation and are often operating system specific (e.g. only running on Windows). The second is as a Java application that can be launched from a website. More web browsers are blocking Java applications by default due to the high security risks involved in executing a Java application (e.g. access to full file system).

This brings us to the third option which I would argue is the only real option these days. With advances in web technologies, specifically AJAX-like libraries, that provide responsive websites for high-quality user experiences, web apps are the best solution. These can be client-side only where all calculations and analysis are done using Javascript code. Or more commonly, with a server-side component and a database, text mining results can be queried quickly. Several bioinformatics analysis tools have been frequently used due to their implementation as web applications. The DAVID tool for gene set enrichment analysis (Dennis et al., 2003) is a classic example of a tool that is frequently used when other more up-to-date tools exist but are hard to use.

These arguments led us to build web apps for the CancerMine and CIViCmine projects.
We used the Shiny web technology for its ease of implementation and visually attractive interfaces. Unfortunately, Shiny may not scale well to a larger number of users and these interfaces may be revisited if the resources prove very popular. We would encourage other text mining developers to consider providing a web interface to navigate text-mined data.

There is a huge area of research in human-computer interaction (HCI). It could easily be argued that there should be more integration between text mining and HCI research in order to understand what features make a tool easier to use. If a biologist finds a tool frustrating to use, or the results unreliable, they may never use the tool again. The CancerMine and CIViCmine research, fortunately, took place in an environment close to potential users of these resources which provided the opportunity to discuss their design. Understanding the real needs of users and the challenges they face interpreting text-mined data would enable text mining to become a more valuable part of the research process.

6.3 Limitations and Future Directions

One of the main limitations of our work is the focus on the knowledge contained within single sentences. For all of our projects, we only capture co-occurrences or relations that are discussed within a sentence and do not capture knowledge that is spread across multiple sentences. This is a common limitation of many text mining tools at the moment due to the challenge presented by anaphora. Coreference resolution methods still provide noisy results when identifying which specific term a pronoun (or general noun) refers to. We examined the ability to extract relations across sentence boundaries but found (as others have) that the false positive rate skyrockets as more sentences are included. This is largely due to the decrease in class balance, as the positive examples become a small fraction of all possible candidate relations.
Overcoming this limitation with a high-quality coreference resolution method would provide the largest gain for relation extraction methods used to populate knowledge bases (as in Chapters 4 and 5).

We are also limited by access to text corpora for information extraction. We chose to focus on PubMed and PubMed Central Open Access subset (PMCOA) as they contain the largest set of published abstracts and full-text articles while also being the easiest to access. Several publishers are beginning to make other smaller corpora accessible through limited APIs (and often requiring special permissions) (Westergaard et al., 2018). However, these new corpora provide additional challenges with unique file formats and rights permissions when sharing the results of text mining. This will be the primary stumbling block of biomedical text mining in the coming decades. Several universities have shown the desire to change their relationships with publishers to encourage easier access to literature, both for text mining and for researchers in general. We hope these efforts progress quickly.

In Chapters 4 and 5, we faced a common problem in biomedical text mining. For supervised learning, annotated training data is needed to build a classifier. The size of the training data is a limiting factor for the complexity of the classifier that can be built. The recent successes of deep learning in other fields, particularly computer vision, have been led by the development of vast training sets (e.g. ImageNet (Deng et al., 2009)). In fact, Google acquired reCAPTCHA in order to generate human annotated image data to improve their computer vision algorithms for Google Streetview and Project Gutenberg (Von Ahn et al., 2008). For the biomedical field, expert annotators may be needed for specific tasks. Some researchers have tried crowdsourcing (e.g. Mark2theCure (Tsueng et al., 2016)) either through volunteers or Mechanical Turk paid workers (Buhrmester et al., 2011).
These crowdsourc-ing efforts have shown that many non-expert annotators must look at thesame sentence in order to get a good consensus. This increases the anno-tation cost and drove our decision to use expert annotators for CancerMineand CIViCmine. However, it created the limitation of a smaller training setsize. This smaller training set size meant that a deep learning based ap-proach wasn’t a viable approach given the currently established issue withoverfitting small data set sizes (Mehryary et al., 2016). The BioNLP SharedTasks showed that more classical approaches, as taken in Chapter 3, werestill the most reliable approach for relation extraction given smaller trainingset sizes.An interesting angle that should be pursued is active learning in which thedata for annotation is continuously updated to identify the most confusingsentences for the system. This approach is impeded by the need to usemultiple annotators and would likely require small batch active learninginstead of continually updated active learning.The decision to focus on a limited set of relations between the biomedical1206.4. Final Wordsentities of interest (e.g. genes and cancers) has advantages and disadvan-tages. In Chapter 4, we were interested in only three relation types (Drivers,Oncogenes and Tumor Suppressors). There are many other relations thatcan exist between a gene and a cancer type, e.g. “frequently mutated in”. Byfocussing on only three relation types, we could provide a tightly controlledannotation process with a specific annotation manual. This meant that theannotation task was feasible and could be completed by annotators withinan acceptable amount of time. However, we may be missing interesting re-lations between these entities. Other approaches take an Open InformationExtraction (OpenIE) approach where no assumptions are made about thetypes of relations that may exist (Percha et al., 2018). 
An approach that could bridge the two methods would be a valuable addition to the biomedical text mining field.

6.4 Final Words

Biomedical text mining should be an everyday tool that researchers use to keep up to date with research and to help guide their hypothesis generation. To move towards this goal, we have contributed several key ideas, methods, and datasets, including high-precision relation extraction for knowledge base construction. This is an exciting period for the field, with the convergence of affordable computational resources, web technologies and advances in the biomedical sciences. We must work closely with biomedical researchers to understand the problems that matter to them and enable them to interrogate biomedical knowledge in a form suited to them.

Bibliography

Adamson, P. C., Houghton, P. J., Perilongo, G., and Pritchard-Jones, K. (2014). Drug discovery in paediatric oncology: roadblocks to progress. Nature Reviews Clinical Oncology, 11(12):732.
Aerts, S., Haeussler, M., Van Vooren, S., Griffith, O. L., Hulpiau, P., Jones, S. J., Montgomery, S. B., and Bergman, C. M. (2008). Text-mining assisted regulatory annotation. Genome Biology, 9(2):R31.
Altman, R. B. (2018). Challenges for training translational researchers in the era of ubiquitous data. Clinical Pharmacology & Therapeutics, 103(2):171–173.
Ananiadou, S. and McNaught, J. (2006). Text mining for biology and biomedicine. Artech House London.
Anekalla, K. R., Courneya, J., Fiorini, N., Lever, J., Muchow, M., and Busby, B. (2017). Pubrunner: a light-weight framework for updating text mining results. F1000Research, 6.
Athenikos, S. J. and Han, H. (2010). Biomedical question answering: A survey. Computer methods and programs in biomedicine, 99(1):1–24.
Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and others (2012). Concept annotation in the craft corpus. BMC bioinformatics, 13(1):161.
Baylin, S. B. and Ohm, J. E. (2006).
Epigenetic gene silencing in cancer–a mechanism for early oncogenic pathway addiction? Nature Reviews Cancer, 6(2):107.
Bennett, J., Lanning, S., and others (2007). The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, NY, USA.
Bird, S. (2006). Nltk: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for Computational Linguistics.
Björne, J. and Salakoski, T. (2013). TEES 2.1: Automated annotation scheme learning in the BioNLP 2013 Shared Task. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 16–25.
Björne, J. and Salakoski, T. (2015). Tees 2.2: biomedical event extraction for diverse corpora. BMC bioinformatics, 16(16):S4.
Bodenreider, O. (2004). The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270.
Bohannon, J. (2016). Who’s downloading pirated papers? everyone.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795.
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117.
Bruskiewich, R., Huellas-Bruskiewicz, K., Ahmed, F., Kaliyaperumal, R., Thompson, M., Schultes, E., Hettne, K. M., Su, A. I., and Good, B. M. (2016). Knowledge.bio: A web application for exploring, building and sharing webs of biomedical relationships mined from pubmed. bioRxiv, page 055525.
Buhrmester, M., Kwang, T., and Gosling, S. D. (2011). Amazon’s mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5.
Bui, Q.-C., Campos, D., van Mulligen, E., and Kors, J. (2013). A fast rule-based approach for biomedical event extraction. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 104–108. Association for Computational Linguistics.
Bunescu, R. C. and Mooney, R. J. (2005). A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pages 724–731. Association for Computational Linguistics.
Burgstaller-Muehlbacher, S., Waagmeester, A., Mitraka, E., Turner, J., Putman, T., Leong, J., Naik, C., Pavlidis, P., Schriml, L., Good, B. M., and others (2016). Wikidata as a semantic framework for the gene wiki initiative. Database, 2016.
Carpenter, T. and Thatcher, S. G. (2014). The challenges of bibliographic control and scholarly integrity in an online world of multiple versions of journal articles. Against the Grain, 23(2):5.
Chaix, E., Dubreucq, B., Fatihi, A., Valsamou, D., Bossy, R., Ba, M., Deléger, L., Zweigenbaum, P., Bessieres, P., Lepiniec, L., and others (2016). Overview of the regulatory network of plant seed development (seedev) task at the bionlp shared task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, pages 1–11.
Chakravarty, D., Gao, J., Phillips, S., Kundra, R., Zhang, H., Wang, J., Rudolph, J. E., Yaeger, R., Soumerai, T., Nissan, M. H., and others (2017). Oncokb: a precision oncology knowledge base. JCO precision oncology, 1:1–16.
Chang, M. T., Asthana, S., Gao, S. P., Lee, B. H., Chapman, J. S., Kandoth, C., Gao, J., Socci, N. D., Solit, D. B., Olshen, A. B., and others (2016). Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nature biotechnology, 34(2):155.
Cheng, J., Demeulemeester, J., Wedge, D. C., Vollan, H. K. M., Pitt, J. J., Russnes, H. G., Pandey, B. P., Nilsen, G., Nord, S., Bignell, G. R., and others (2017). Pan-cancer analysis of homozygous deletions in primary tumours uncovers rare tumour suppressors. Nature Communications, 8(1):1221.
Chu, J., Lauretti, E., Di Meco, A., and Pratico, D. (2013). Flap pharmacological blockade modulates metabolism of endogenous tau in vivo. Translational psychiatry, 3(12):e333.
Ciccarelli, F. D., Venkata, S. K., Repana, D., Nulsen, J., Dressler, L., Bortolomeazzi, M., Tourna, A., Yakovleva, A., and Palmieri, T. (2018). The network of cancer genes (ncg): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. bioRxiv, page 389858.
Clark, K. and Manning, C. D. (2015). Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1405–1415.
Comeau, D. C., Batista-Navarro, R. T., Dai, H.-J., Doğan, R. I., Yepes, A. J., Khare, R., Lu, Z., Marques, H., Mattingly, C. J., Neves, M., and others (2014). Bioc interoperability track overview. Database, 2014.
Comeau, D. C., Doğan, R. I., Ciccarese, P., Cohen, K. B., Krallinger, M., Leitner, F., Lu, Z., Peng, Y., Rinaldi, F., Torii, M., and others (2013). Bioc: a minimalist approach to interoperability for biomedical text processing. Database, 2013.
Council, N. R. and others (2014). Convergence: facilitating transdisciplinary integration of life sciences, physical sciences, engineering, and beyond. National Academies Press.
Davies, M. (2009). The 385+ million word corpus of contemporary american english (1990–2008+): Design, architecture, and linguistic insights. International journal of corpus linguistics, 14(2):159–190.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.
Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., Lane, H. C., and Lempicki, R. A. (2003). David: database for annotation, visualization, and integrated discovery. Genome biology, 4(9):R60.
Developers, J. (2008). Jython implementation of the high-level, dynamic, object-oriented language python written in 100% pure Java. Technical report, Technical report (1997-2016), http://www.jython.org/ (accessed May 2016).
DiGiacomo, R. A., Kremer, J. M., and Shah, D. M. (1989). Fish-oil dietary supplementation in patients with raynaud’s phenomenon: a double-blind, controlled, prospective study. The American journal of medicine, 86(2):158–164.
Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218.
Ernst, P., Meng, C., Siu, A., and Weikum, G. (2014). Knowlife: a knowledge graph for health and life sciences. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 1254–1257. IEEE.
Fiorini, N., Canese, K., Starchenko, G., Kireev, E., Kim, W., Miller, V., Osipov, M., Kholodov, M., Ismagilov, R., Mohan, S., and others (2018). Best match: new relevance search for pubmed. PLoS biology, 16(8):e2005343.
Fiorini, N., Lipman, D. J., and Lu, Z. (2017). Cutting edge: Towards pubmed 2.0. eLife, 6:e28801.
Forbes, S. A., Beare, D., Gunasekaran, P., Leung, K., Bindal, N., Boutselakis, H., Ding, M., Bamford, S., Cole, C., Ward, S., and others (2014). Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic acids research, 43(D1):D805–D811.
Frijters, R., Heupers, B., van Beek, P., Bouwhuis, M., van Schaik, R., de Vlieg, J., Polman, J., and Alkema, W. (2008).
Copub: a literature-based keyword enrichment tool for microarray data analysis. Nucleic acids research, 36(suppl_2):W406–W410.
Funk, C., Baumgartner, W., Garcia, B., Roeder, C., Bada, M., Cohen, K. B., Hunter, L. E., and Verspoor, K. (2014). Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC bioinformatics, 15(1):59.
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N., and Stratton, M. R. (2004). A census of human cancer genes. Nature Reviews Cancer, 4(3):177.
Gala, K., Li, Q., Sinha, A., Razavi, P., Dorso, M., Sanchez-Vega, F., Chung, Y. R., Hendrickson, R., Hsieh, J., Berger, M., and others (2018). Kmt2c mediates the estrogen dependence of breast cancer through regulation of erα enhancer function. Oncogene.
Gonzalez-Perez, A., Perez-Llamas, C., Deu-Pons, J., Tamborero, D., Schroeder, M. P., Jene-Sanz, A., Santos, A., and Lopez-Bigas, N. (2013). Intogen-mutations identifies cancer drivers across tumor types. Nature methods, 10(11):1081.
Good, B. M., Ainscough, B. J., McMichael, J. F., Su, A. I., and Griffith, O. L. (2014). Organizing knowledge to enable personalization of medicine in cancer. Genome biology, 15(8):438.
Gordon, M. D. and Dumais, S. (1998). Using latent semantic indexing for literature based discovery. Journal of the American Society for Information Science.
Griffith, M., Spies, N. C., Krysiak, K., McMichael, J. F., Coffman, A. C., Danos, A. M., Ainscough, B. J., Ramirez, C. A., Rieke, D. T., Kujan, L., and others (2017). Civic is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nature genetics, 49(2):170.
Haber, D. A. and Settleman, J. (2007). Cancer: drivers and passengers. Nature, 446(7132):145.
Hagberg, A. A., Schult, D. A., and Swart, P. J. (2008). Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA.
Hanahan, D. and Weinberg, R. A. (2000). The hallmarks of cancer. Cell, 100(1):57–70.
Hegi, M. E., Diserens, A.-C., Gorlia, T., Hamou, M.-F., de Tribolet, N., Weller, M., Kros, J. M., Hainfellner, J. A., Mason, W., Mariani, L., and others (2005). Mgmt gene silencing and benefit from temozolomide in glioblastoma. New England Journal of Medicine, 352(10):997–1003.
Hersh, W. (2008). Information retrieval in literature-based discovery. In Literature-based Discovery, pages 153–172. Springer.
Hettne, K. M., Thompson, M., van Haagen, H. H., Van Der Horst, E., Kaliyaperumal, R., Mina, E., Tatum, Z., Laros, J. F., Van Mulligen, E. M., Schuemie, M., and others (2016). The Implicitome: A resource for rationalizing gene-disease associations. PloS one, 11(2):e0149621.
Hirschman, L., Yeh, A., Blaschke, C., and Valencia, A. (2005). Overview of biocreative: critical assessment of information extraction for biology.
Honnibal, M. and Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.
Hristovski, D., Rindflesch, T., and Peterlin, B. (2013). Using literature-based discovery to identify novel therapeutic approaches. Cardiovascular & Hematological Agents in Medicinal Chemistry (Formerly Current Medicinal Chemistry-Cardiovascular & Hematological Agents), 11(1):14–24.
Huang, B., Qu, Z., Ong, C. W., Tsang, Y. N., Xiao, G., Shapiro, D., Salto-Tellez, M., Ito, K., Ito, Y., and Chen, L.-F. (2012). Runx3 acts as a tumor suppressor in breast cancer by targeting estrogen receptor α. Oncogene, 31(4):527.
Huang, L., Fernandes, H., Zia, H., Tavassoli, P., Rennert, H., Pisapia, D., Imielinski, M., Sboner, A., Rubin, M. A., Kluk, M., and others (2017). The cancer precision medicine knowledge base for structured clinical-grade mutations and interpretations. Journal of the American Medical Informatics Association, 24(3):513–519.
Jelier, R., Schuemie, M. J., Roes, P.-J., van Mulligen, E. M., and Kors, J. A. (2008a). Literature-based concept profiles for gene annotation: the issue of weighting. International journal of medical informatics, 77(5):354–362.
Jelier, R., Schuemie, M. J., Veldhoven, A., Dorssers, L. C., Jenster, G., and Kors, J. A. (2008b). Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome biology, 9(6):R96.
Jones, S. J., Laskin, J., Li, Y. Y., Griffith, O. L., An, J., Bilenky, M., Butterfield, Y. S., Cezard, T., Chuah, E., Corbett, R., and others (2010). Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome biology, 11(8):R82.
Kang, R., Li, H., Ringgaard, S., Rickers, K., Sun, H., Chen, M., Xie, L., and Bünger, C. (2014). Interference in the endplate nutritional pathway causes intervertebral disc degeneration in an immature porcine model. International orthopaedics, 38(5):1011–1017.
Kilicoglu, H., Fiszman, M., Rodriguez, A., Shin, D., Ripple, A., and Rindflesch, T. C. (2008). Semantic medline: a web application for managing the results of pubmed searches. In Proceedings of the third international symposium for semantic mining in biomedicine, volume 2008, pages 69–76.
Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., and Tsujii, J. (2009). Overview of bionlp’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.
Kim, J.-D., Ohta, T., Tateisi, Y., and Tsujii, J. (2003). Genia corpus–a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):i180–i182.
Kim, J.-D., Pyysalo, S., Ohta, T., Bossy, R., Nguyen, N., and Tsujii, J. (2011). Overview of bionlp shared task 2011. In Proceedings of the BioNLP shared task 2011 workshop, pages 1–6. Association for Computational Linguistics.
Kim, J.-D. and Wang, Y. (2012). Pubannotation: a persistent and sharable corpus and annotation repository. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pages 202–205. Association for Computational Linguistics.
Kristensen, V. N., Lingjærde, O. C., Russnes, H. G., Vollan, H. K. M., Frigessi, A., and Børresen-Dale, A.-L. (2014). Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer, 14(5):299–313.
Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. intelligence, 11(1):32–39.
Leaman, R. and Gonzalez, G. (2008). Banner: an executable survey of advances in biomedical named entity recognition. In Biocomputing 2008, pages 652–663. World Scientific.
Leaman, R., Islamaj Doğan, R., and Lu, Z. (2013). Dnorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.
Leaman, R., Wei, C.-H., and Lu, Z. (2015). tmchem: a high performance approach for chemical named entity recognition and normalization. Journal of cheminformatics, 7(1):S3.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.
Lever, J. and Jones, S. (2017). Painless relation extraction with kindred. BioNLP 2017, pages 176–183.
Li, C., Rao, Z., and Zhang, X. (2016). LitWay, Discriminative Extraction for Different Bio-Events. Proceedings of the 4th BioNLP Shared Task Workshop, page 32.
Li, G., Ross, K. E., Arighi, C. N., Peng, Y., Wu, C. H., and Vijay-Shanker, K. (2015). mirtex: a text mining system for mirna-gene relation extraction. PLoS computational biology, 11(9):e1004391.
Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031.
Lichtenwalter, R. and Chawla, N. V. (2012). Link prediction: fair and effective evaluation.
In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 376–383. IEEE Computer Society.
Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187.
Liu, Y., Sun, J., and Zhao, M. (2017). Ongene: a literature-based database for human oncogenes. Journal of Genetics and Genomics, 44(2):119–121.
Low, Y., Gonzalez, J. E., Kyrola, A., Bickson, D., Guestrin, C. E., and Hellerstein, J. (2014). Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041.
Lu, Z. (2011). Pubmed and beyond: a survey of web tools for searching biomedical literature. Database, 2011.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., and McClosky, D. (2014). The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
Mehryary, F., Björne, J., Pyysalo, S., Salakoski, T., and Ginter, F. (2016). Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016. Proceedings of the 4th BioNLP Shared Task Workshop, page 73.
Mendel, G. and Tschermak, E. (1866). Versuche über pflanzen-hybriden.
Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics.
Nédellec, C., Bossy, R., Kim, J.-D., Kim, J.-J., Ohta, T., Pyysalo, S., and Zweigenbaum, P. (2013). Overview of bionlp shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7.
Nickel, M., Tresp, V., and Kriegel, H.-P. (2012). Factorizing yago: scalable machine learning for linked data. In Proceedings of the 21st international conference on World Wide Web, pages 271–280. ACM.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., and others (2016). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.
Onitilo, A. A., Engel, J. M., Greenlee, R. T., and Mukesh, B. N. (2009). Breast cancer subtypes based on er/pr and her2 expression: comparison of clinicopathologic features and survival. Clinical medicine & research, 7(1-2):4–13.
Pan, R., Zhou, Y., Cao, B., Liu, N. N., Lukose, R., Scholz, M., and Yang, Q. (2008). One-class collaborative filtering. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 502–511. IEEE.
Panyam, N. C., Khirbat, G., Verspoor, K., Cohn, T., and Ramamohanarao, K. (2016). SeeDev Binary Event Extraction using SVMs and a Rich Feature Set. Proceedings of the 4th BioNLP Shared Task Workshop, page 82.
Patterson, S. E., Liu, R., Statz, C. M., Durkin, D., Lakshminarayana, A., and Mockus, S. M. (2016). The clinical trial landscape in oncology and connectivity of somatic mutational profiles to targeted therapies. Human genomics, 10(1):4.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., and others (2011a). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011b). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Peng, N., Poon, H., Quirk, C., Toutanova, K., and tau Yih, W. (2017). Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115.
Percha, B., Altman, R. B., and Wren, J. (2018). A global network of biomedical relationships derived from text. Bioinformatics, 1:11.
Prasad, V., Fojo, T., and Brada, M. (2016). Precision oncology: origins, optimism, and potential. The Lancet Oncology, 17(2):e81–e86.
Prud’hommeaux, E. and Seaborne, A. (2006). SPARQL query language for RDF. Technical report.
Quinn, C. T., Johnson, V. L., Kim, H.-Y., Trachtenberg, F., Vogiatzi, M. G., Kwiatkowski, J. L., Neufeld, E. J., Fung, E., Oliveri, N., Kirby, M., and others (2011). Renal dysfunction in patients with thalassaemia. British journal of haematology, 153(1):111–117.
Radtke, F. and Raj, K. (2003). The role of notch in tumorigenesis: oncogene or tumour suppressor? Nature Reviews Cancer, 3(10):756.
Ramakrishnan, C., Patnia, A., Hovy, E., and Burns, G. A. (2012). Layout-aware text extraction from full-text pdf of scientific articles. Source code for biology and medicine, 7(1):7.
RStudio, Inc (2013). Easy web applications in R. URL: http://www.rstudio.com/shiny/.
Rüdiger, T., Ott, G., Ott, M. M., Müller-Deubert, S. M., and Müller-Hermelink, H. K. (1998). Differential diagnosis between classic hodgkin’s lymphoma, t-cell-rich b-cell lymphoma, and paragranuloma by paraffin immunohistochemistry. The American journal of surgical pathology, 22(10):1184–1191.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533.
Schriml, L. M., Arze, C., Nadendla, S., Chang, Y.-W. W., Mazaitis, M., Felix, V., Feng, G., and Kibbe, W. A. (2011). Disease ontology: a backbone for disease semantic integration. Nucleic acids research, 40(D1):D940–D946.
Schult, D. A. and Swart, P. (2008). Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conferences (SciPy 2008), volume 2008, pages 11–16.
Shimomura, O., Johnson, F. H., and Saiga, Y. (1962). Extraction, purification and properties of aequorin, a bioluminescent protein from the luminous hydromedusan, aequorea. Journal of Cellular Physiology, 59(3):223–239.
Shrager, J. and Tenenbaum, J. M. (2014). Rapid learning for precision oncology. Nature Reviews Clinical Oncology, 11(2):109–118.
Smith, T. (2015). How far can biology’s big data take us? Vancouver Bioinformatics User Group (VanBUG) Seminar Series.
Smith, T. F. and Waterman, M. S. (1980). New stratigraphic correlation techniques. The Journal of Geology, 88(4):451–457.
Smith, T. F. and Waterman, M. S. (1981). Comparison of biosequences. Advances in applied mathematics, 2(4):482–489.
Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational linguistics, 27(4):521–544.
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., and Tsujii, J. (2012). Brat: a web-based tool for nlp-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107. Association for Computational Linguistics.
Swanson, D. R. (1986a). Fish oil, raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1):7–18.
Swanson, D. R. (1986b). Undiscovered public knowledge. The Library Quarterly, 56(2):103–118.
Swanson, D. R. and Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial intelligence, 91(2):183–203.
Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., and others (2014). String v10: protein–protein interaction networks, integrated over the tree of life.
Nucleic acids research, 43(D1):D447–D452.
Szklarczyk, D., Morris, J. H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N. T., Roth, A., Bork, P., and others (2016). The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, page gkw937.
Tamborero, D., Rubio-Perez, C., Deu-Pons, J., Schroeder, M. P., Vivancos, A., Rovira, A., Tusquets, I., Albanell, J., Rodon, J., Tabernero, J., and others (2018). Cancer genome interpreter annotates the biological and clinical relevance of tumor alterations. Genome medicine, 10(1):25.
Taschuk, M. and Wilson, G. (2017). Ten Simple Rules for Making Research Software More Robust. PLOS Computational Biology, 13(4).
Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., and others (2015). An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138.
Tsueng, G., Nanis, S., Fouquier, J., Good, B., and Su, A. (2016). Citizen science for mining the biomedical literature. Citizen Science: Theory and Practice, 1(2).
Tsuruoka, Y., Miwa, M., Hamamoto, K., Tsujii, J., and Ananiadou, S. (2011). Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics, 27(13):i111–i119.
Tsuruoka, Y., Tateishi, Y., Kim, J.-D., Ohta, T., McNaught, J., Ananiadou, S., and Tsujii, J. (2005). Developing a robust part-of-speech tagger for biomedical text. In Advances in informatics, pages 382–392. Springer.
Van Landeghem, S., Björne, J., Wei, C.-H., Hakala, K., Pyysalo, S., Ananiadou, S., Kao, H.-Y., Lu, Z., Salakoski, T., Van de Peer, Y., and others (2013). Large-scale event extraction from literature with multi-level gene normalization. PloS one, 8(4):e55814.
Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., and Blum, M. (2008). recaptcha: Human-based character recognition via web security measures. Science, 321(5895):1465–1468.
Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
Wagner, A. H., Walsh, B., Mayfield, G., Tamborero, D., Sonkin, D., Krysiak, K., Pons, J. D., Duren, R., Gao, J., McMurry, J., and others (2018). A harmonized meta-knowledgebase of clinical interpretations of cancer genomic variants. bioRxiv, page 366856.
Wei, C.-H., Harris, B. R., Kao, H.-Y., and Lu, Z. (2013a). tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics.
Wei, C.-H., Kao, H.-Y., and Lu, Z. (2013b). Pubtator: a web-based text mining tool for assisting biocuration. Nucleic acids research, 41(W1):W518–W522.
Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J. M., Network, C. G. A. R., and others (2013). The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113.
Westergaard, D., Stærfeldt, H.-H., Tønsberg, C., Jensen, L. J., and Brunak, S. (2018). A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS computational biology, 14(2):e1005962.
Weymann, D., Laskin, J., Roscoe, R., Schrader, K. A., Chia, S., Yip, S., Cheung, W. Y., Gelmon, K. A., Karsan, A., Renouf, D. J., and others (2017). The cost and cost trajectory of whole-genome analysis guiding treatment of patients with advanced cancers. Molecular genetics & genomic medicine, 5(3):251–260.
Wiles, A. (1995). Modular elliptic curves and fermat’s last theorem. Annals of mathematics, 141(3):443–551.
William, H. (2007). Numerical recipes: The art of scientific computing. 3rd edition.
Yetisgen-Yildiz, M. and Pratt, W. (2009). A new evaluation methodology for literature-based discovery systems. Journal of biomedical informatics, 42(4):633–643.
Zender, L., Xue, W., Zuber, J., Semighini, C. P., Krasnitz, A., Ma, B., Zender, P., Kubicka, S., Luk, J. M., Schirmacher, P., and others (2008). An oncogenomics-based in vivo rnai screen identifies tumor suppressors in liver cancer. Cell, 135(5):852–864.
Zhao, M., Kim, P., Mitra, R., Zhao, J., and Zhao, Z. (2015). Tsgene 2.0: an updated literature-based knowledgebase for tumor suppressor genes. Nucleic acids research, 44(D1):D1023–D1031.
Zhu, S., Zeng, J., and Mamitsuka, H. (2009). Enhancing medline document clustering by incorporating mesh semantic similarity. Bioinformatics, 25(15):1944–1951.
