Open Collections

UBC Theses and Dissertations

Support vector machines predict advanced cancer patient response to therapies from bulk RNA sequencing… Erhan, Halid Emre 2020


SUPPORT VECTOR MACHINES PREDICT ADVANCED CANCER PATIENT RESPONSE TO THERAPIES FROM BULK RNA SEQUENCING DATA

by

Halid Emre Erhan

B.Sc., Simon Fraser University, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2020

© Halid Emre Erhan, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled:

Support Vector Machines Predict Advanced Cancer Patient Response to Therapies from Bulk RNA Sequencing Data

submitted by Halid Emre Erhan in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics

Examining Committee:
Dr. Steven Jones, Medical Genetics, UBC (Supervisor)
Dr. Inanc Birol, Medical Genetics, UBC (Supervisory Committee Member)
Dr. Stephen Yip, Pathology, UBC (Supervisory Committee Member)

Abstract

Personalized medicine approaches for cancer therapy seek to determine optimal therapies for cancer patients based on the molecular profile of their tumour. The motivation is to target oncogenomic alterations in tumours with the appropriate therapies. However, it is currently infeasible to determine the optimal therapy from the genomic profile of a tumour alone. There has been significant recent work on using machine learning to predict tumour drug response. Machine learning methods have been used successfully for drug response prediction in cancer cell lines and have even been extended to predicting individual cancer patient response to a small number of chemotherapies. This work uses support vector machines (SVMs) to predict the response to chemotherapies of 570 advanced cancer patients from the BC Cancer Personalized OncoGenomics program using the transcriptomic profile of their tumours.
This dataset of advanced cancers comprises over 20 cancer types and 130 unique chemotherapies. F-measures for the SVM predictions were found to be as high as 1.0 for some cohorts. Further analysis of the genes most important to the SVMs revealed biological mechanisms that may explain the SVM predictions. This work demonstrates the value of large-scale sequencing projects and the potential of data mining and machine learning in personalized cancer medicine.

Lay Summary

Advanced cancer patients are enrolled in therapies based on knowledge of similar cancers, which often overlooks peculiarities of the tumours of individual patients. Personalized cancer medicine seeks to determine therapies for cancer patients based on their individual tumour. Much work has been put towards using machine learning, a branch of artificial intelligence, to determine the best therapies using genetic information from a tumour. Previous work on this topic has been successful in determining how cancer cells respond to therapy in a laboratory setting. However, cancer cells in a lab are not fully representative of cancer in the human body. This work successfully applies machine learning to a dataset from the BC Cancer Personalized OncoGenomics (POG) program. The dataset describes the genetics of 570 advanced cancer patients and their response to therapies. This project demonstrates the potential of machine learning in determining cancer treatments.

Preface

I worked on this project under the supervision of Dr. Steven Jones at the BC Cancer Genome Sciences Centre as a part of the Personalized OncoGenomics Project (NCT012155621). I designed the study approach and conducted all of the experiments myself. The work is approved by and conducted under the University of British Columbia – BC Cancer Agency Research Ethics Board (H12-00137, H14-00681) and approved by the institutional review board at UBC.
The genomic library construction and biospecimen handling were overseen by Dr. Andrew Mungall and Dr. Richard Moore. The bioinformatics preprocessing was overseen by Karen Mungall, Eric Chuah, Tina Wong and Reanne Bowlby. The transcript quantification was conducted by Dr. Jahanshah Ashkani. The Personalized OncoGenomics program is based at the BC Cancer Agency. Patients are enrolled after written informed consent. Patient information is completely anonymous. The data used in this project can be found at https://bcgsc.ca/downloads/POG570/.

This work has not yet been published.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements: I thank…
Dedication
Chapter 1: Introduction
1.1 Background on cancer and metastatic cancer
1.2 History of DNA sequencing technology
1.2.1 RNA sequencing and transcriptomics
1.3 Background on machine learning
1.3.1 Supervised and unsupervised machine learning
1.3.2 The fundamental trade-off
1.3.3 Support vector machines for clinical transcriptomics
1.4 Machine learning for personalized cancer medicine
1.5 Research question
Chapter 2: Support vector machines predict metastatic cancer patient response
2.1 Methods
2.1.1 An overview of the POG570 cohort
2.1.2 Cancer type distribution
2.1.3 Strand-specific RNA library construction
2.1.4 Processing RNA-seq data
2.1.5 Treatment data description
2.1.6 Time on treatment as a proxy for response
2.1.7 Processing time on treatment data
2.1.8 Recursive feature elimination
2.1.9 Machine learning pipeline and implementation
2.1.10 Analysis of RFE genes for biomarkers
2.2 Results
2.2.1 Fixed effects model of time on treatment
2.2.2 RFE results and SVM performance on therapies
2.2.3 Genes and feature coefficients for SVM predicting carboplatin response
2.2.4 Genes and feature coefficients for SVM predicting cisplatin response
2.2.5 Genes and feature coefficients for SVM predicting gemcitabine response
2.3 Discussion
2.3.1 Fixed effects model of time on treatment
2.3.2 RFE and SVM performance
2.3.3 Some RFE selected genes are known to be biomarkers for prognosis
2.3.4 Examining the RFE selected genes for predicting carboplatin response
2.3.5 Examining the RFE selected genes for predicting cisplatin response
2.3.6 Examining the RFE selected genes for predicting gemcitabine response
2.3.7 The expression of pseudogenes may be prognostic biomarkers
Chapter 3: Conclusion
Bibliography

List of Tables

Table 1 The results of the RFE and performance of the SVMs on predicting patient response to the therapies with at least 30 patients.
Table 2 The RFE selected genes for an SVM predicting carboplatin response in the POG cohort.
Table 3 The top 30 RFE selected genes for an SVM predicting cisplatin response in the POG cohort.
Table 4 The RFE selected genes for an SVM predicting gemcitabine response in the POG cohort.

List of Figures

Figure 1 An example of feature matrix X and label vector y for an email spam detection problem. Feature matrix X uses a bag-of-words model for encoding the words in an email. The label vector y denotes whether an email is spam (1 for "spam" and 0 for "not spam").
Figure 2 A visual representation of the bias-variance trade-off. As model complexity increases, the testing error decreases along with the training error initially, but when the model starts overfitting to the training set, testing error and training error diverge, increasing the approximation error.
Figure 3 The red line is a support vector determined by an SVM. The red line is the maximum marginal separator separating the two groups in this dataset.
Figure 4 Distribution of patients across cancer cohorts.
Figure 5 The frequency of drug usage across the POG cohort for the 30 most commonly used therapies.
Figure 6 The days on treatment for the 30 most frequently used therapies separated by physician assessed response to the therapies. In the boxplots, the lower and upper limits on the boxes correspond with the first and third quartiles respectively. The whiskers on the boxplots extend to the furthest value no greater than 1.5 times the inter-quartile range. Data beyond the whiskers are denoted with points. The actual data is overlaid on the boxplots in translucent blue. The PR and SD assessed patients appear to be on treatments for longer than patients assessed for PD.
Figure 7 Response predictions with SVMs on the 9 most commonly used drugs in the POG cohort.

List of Abbreviations

5-FU – 5-Fluorouracil
ACC – Adenoid cystic carcinoma
ANN – Artificial Neural Network
BC – British Columbia
BRC – Breast cancer
CAGE – Cap Analysis of Gene Expression
cDNA – Complementary Deoxyribonucleic Acid
ceRNA – Competitive endogenous ribonucleic acid
CNS – Central nervous system cancer
CR – Complete Response
ddNTP – Dideoxynucleoside triphosphates
DNA – Deoxyribonucleic acid
GIC – Gastrointestinal cancer
GUC – Genitourinary cancer
GYN – Gynecologic cancer
H&N – Head and neck cancer
HEM – Hematologic cancer
HPB – Hepatobiliary cancer
HPC – Hemangiopericytoma
lincRNA – Long intergenic noncoding ribonucleic acid
lncRNA – Long noncoding ribonucleic acid
NEU – HER2-positive breast cancer
NGS – Next-Generation Sequencing
ONT – Oxford Nanopore Technologies
OTH – Other cancers
OVA – Ovarian cancer
PAN – Pancreatic cancer
PD – Progressive Disease
POG – Personalized OncoGenomics
PR – Partial Response
PUO – Pyrexia of Unknown Origin
RFE – Recursive Feature Elimination
RNA – Ribonucleic acid
SAGE – Serial Analysis of Gene Expression
SARC – Sarcoma
SD – Stable Disease
SKN – Skin cancer
SVM – Support Vector Machine
THR – Throat cancer
TPM – Transcripts per Million

Acknowledgements: I thank…

My supervisor, Dr. Steven Jones, for sharing his vision with me and providing me with the space to enthusiastically explore my curiosities.

My supervisory committee, Drs. Inanç Birol and Stephen Yip, who guided me through my work. In earlier years, Inanç instilled in me a love for bioinformatics for which I am eternally grateful.

Jasleen Grewal and Luka Culibrk, two of the brightest people I have ever met, who welcomed me to the Jones lab with open arms and with whom I shared many late nights discussing our scientific struggles.

Harwood Kwan, Jenny Yang, Micha Disyak, Zoltan Bozoky, Jahanshah Ashkani, Erin Pleasance, Jean-Michel Garant, Kieran O’Neill, Cara Reisle and Vahid Akbari, who worked closely with me, for pointing me in the right direction when I needed a nudge.

Dr. Shaun Jackman, a passionate scientist and engineer, who mentored me through my studies and who modelled an unbridled scientific curiosity that continues to inspire me.

Baraa Orabi, my wonderful and compassionate friend, for his undying support and for sharing with me his passion for science.

Cem Erkli, Figali Taho and Pedro Pessoa for always lending me an ear or a hand when I needed it. Thank you for supporting me in this journey.

Aaryaman Girish, Alex Land, Evan Chisholm and Trevor Clelland, the Coffee Boys, for your unwavering friendship. May our mugs never empty.

Christine Fei, my best friend and partner, who always has my back, for always being on my team. Thank you for lending me your ears when my work became frustrating. I would not be here without you.

My parents, Nesil and Halil Erhan, and my brother, Efe Erhan, who have always believed in me and pushed me in my career. Thank you for your patience and your unending support.

The patients in the POG program who entrusted their private data with us for the benefit of all cancer patients.

Dedication

To my anne and baba, Nesil and Halil, and my kardeş, Efe,

Chapter 1: Introduction

1.1 Background on cancer and metastatic cancer

Cancer rates in Canada are increasing and are expected to continue increasing due to Canada’s aging population. Just under half of all people in Canada are expected to receive a cancer diagnosis within their lifetime [1].
Although age-standardized cancer-related mortality has been decreasing, cancer remains the leading cause of death in Canada and is expected to remain so in 2020 [1,2]. This trend also holds on a global scale. As countries develop economically and access to health care improves, the burden of disease shifts from infectious diseases to non-infectious diseases, including cancer [3]. In a longitudinal study from 2005 to 2016 of 11,307 participants from 21 countries, cancer was found to be the second leading cause of death, surpassed only by heart disease [4]. Cancer continues to be a significant contributor to the Canadian and global disease burden.

Cancer metastasis refers to the spread of cancer cells from the original tumour, called the primary tumour, to the surrounding tissues. Metastasis often happens in the late stages of cancer and has drastic implications for prognosis; the large majority of cancer-related deaths are thought to be from metastatic disease [5]. The proportion of cancer deaths due to metastatic cancers is estimated to be from 66.7% to as much as 90% [6,7]. Consequently, research on metastatic cancers is important for improving diagnosis and treatment.

1.2 History of DNA sequencing technology

DNA sequencing is the process of determining the order of nucleotides in DNA molecules. The technology developed for DNA sequencing has grown tremendously since 1977, when Fred Sanger’s method of DNA sequencing with chain-terminating dideoxynucleotides (ddNTPs) was first published [8]. The invention of Sanger sequencing led the way to several landmark sequencing projects in the late 20th century. Amongst these projects were those for the genomes of model organisms such as C. elegans [9] and D. melanogaster [10], as well as for the human genome in the influential Human Genome Project [11].
Driven by a need to reduce sequencing costs, Sanger sequencing was rapidly improved through automation and commercialized by Applied Biosystems. It was then used by the J. Craig Venter Institute in their private venture to sequence the human genome in competition with the Human Genome Project. Indeed, increasing computational power to store and analyze biological sequences created a need for ever more efficient DNA sequencing methods.

In 2006, the biotechnology company Solexa released the Genome Analyzer, a sequencing machine that used a novel method based on massively parallelized sequencing with reversible dye-terminators. This allowed the machine to rapidly sequence up to a gigabase of DNA in a single run. Around the same time, 454 Life Sciences released the 454 sequencer and Agencourt released the SOLiD sequencer. These three sequencing machines had similar specifications for the quality of DNA reads produced and marked the beginning of next-generation sequencing (NGS), or second-generation sequencing, technologies [12]. Although the Solexa Genome Analyzer was neither the first nor necessarily the best sequencer at the time, after Solexa was bought by Illumina the platform quickly outpaced and outcompeted the other technologies, and Illumina became the dominant sequencing company. As of 2020, Illumina holds roughly 75% of the global sequencing market [13].

NGS quickly made Sanger sequencing technology obsolete because of the orders of magnitude larger scale of data produced. Modern NGS technologies produce comparatively short reads of 50 to 150 base pairs, but with as many as 6 billion reads per experiment. By comparison, the Human Genome Project generated a total of 30 million reads with Sanger sequencing technologies [14]. However, the short length of reads produced by NGS presents many technical challenges due to the repetitiveness of genomes.
Large genomes such as those from animals and plants have significant repetitive sequence content; up to half of the human genome consists of these repetitive sequences [15]. Repetitive elements become a problem when the element is longer than the reads, because reads coming from such elements cannot be placed unambiguously in the genome.

Third-generation sequencing technologies, marked by technology that directly sequences single DNA molecules, can generate reads of thousands of base pairs [16]. These technologies were released in the late 2000s but only grew to prominence towards the mid-2010s, when Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) started producing sequencing machines that were both commercially viable and scientifically ground-breaking [17,18]. However, the cost of sequencing and the high error rate of third-generation sequencing technologies limit their adoption for large-scale clinical sequencing projects [19].

1.2.1 RNA sequencing and transcriptomics

RNA sequencing (RNA-Seq) is the process of determining the order of nucleotides in RNA molecules [20]. While third-generation sequencing technologies are also used for sequencing RNA molecules, in this work RNA-Seq will refer to the use of high-throughput second-generation sequencing technologies to sequence RNA molecules. RNA-Seq has allowed researchers to understand the transcriptome at a scale that was previously unimaginable. Older RNA assaying technologies, such as the hybridization techniques used in microarrays or the sequence-based approaches used in serial analysis of gene expression (SAGE) and cap analysis of gene expression (CAGE), had major limitations [21,22]. Hybridization technologies require established knowledge of the RNA sequence to design fluorescent probes, cannot discover novel sequences, and suffer from noise due to the cross-hybridization of probes [23].
While the sequence-based approaches of CAGE and SAGE overcome these limitations, they must cope with the expense and severe lack of throughput inherent to Sanger sequencing technologies.

RNA-Seq simply adds a few extra steps to the library preparation of a DNA sequencing experiment. This is often done relatively cheaply and quickly with commercial kits, such as the Ribo-Zero and ScriptSeq kits offered by Illumina [24]. Once the RNA library is prepared, it is converted to a complementary DNA (cDNA) library using a reverse transcriptase enzyme. This library can be sequenced with an NGS machine just like any other DNA library.

The throughput of RNA-Seq has facilitated the quantification and functional analysis of transcriptomes. One such use is gene expression quantification, which measures the relative abundances of transcripts in a sample. Once a cDNA library is sequenced, the reads are mapped to a reference genome or transcriptome, often with a splice-aware aligner like STAR or HISAT [25,26]. The reads mapped to each gene are then counted and normalized for transcript length and for the total number of reads in the sequencing experiment. This normalization allows comparisons of gene expression across different sequencing experiments. The result of this process is a gene expression matrix that can be used to understand the transcriptome of the samples. These matrices are used most often for differential gene expression experiments and increasingly in machine learning approaches for understanding transcriptomes [27,28].

1.3 Background on machine learning

Machine learning refers to a subset of computational methods from the field of artificial intelligence that aims to understand and replicate the relationships in data without explicitly programming those relationships [29].
Machine learning methods are used very frequently in our increasingly technological world, for example in email spam detection and in self-driving cars [30,31]. In the field of machine learning, data refers to collections of samples and their associated features. This can be represented with a matrix, denoted as 𝑋, where the rows correspond to samples and the columns to features. Many machine learning algorithms aim to learn the association between samples from a dataset and their labels, denoted by a vector 𝑦. For example, in email spam detection, the samples are the individual emails with features corresponding to the content of the emails, and each email is labelled as either “spam” or “not spam” (Figure 1) [30]. In this case, a machine learning model would learn to associate email content with whether the email should be classified as spam.

Figure 1 An example of feature matrix X and label vector y for an email spam detection problem. Feature matrix X uses a bag-of-words model for encoding the words in an email. The label vector y denotes whether an email is spam (1 for "spam" and 0 for "not spam").

1.3.1 Supervised and unsupervised machine learning

There are several different types of machine learning algorithms. The most commonly used algorithms, especially in the field of genomics, belong to the classes of supervised and unsupervised machine learning algorithms [32]. In supervised machine learning, the model learns the relationship between a set of samples with associated features and a corresponding set of labels. This kind of learning is often used for genome annotation. In genome annotation, the samples are sequences of interest from a genome, with corresponding labels that denote their annotation (e.g. transcription start site, enhancer, promoter). The features of these samples may be the actual nucleotide content of the sequence, encoded such that it can be used by a machine learning model [32].
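
The encoding step mentioned above can take many forms. One common choice is one-hot encoding, in which each base becomes four 0/1 values. The following is a minimal sketch of that idea and is not taken from the thesis; the function name `one_hot_encode`, the example sequences and the annotation labels are all hypothetical.

```python
# Illustrative sketch: one-hot encoding of DNA sequences into a feature matrix X.
# The function name, sequences and labels are hypothetical examples.

NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Encode a DNA sequence as a flat list with four 0/1 values per base."""
    features = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        # One position is set to 1 for the observed base, the other three to 0.
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


# Each row of X is one sample (a sequence); y holds the matching labels.
sequences = ["ACGT", "TTAA", "GGCC"]
X = [one_hot_encode(s) for s in sequences]
y = ["promoter", "enhancer", "promoter"]  # made-up annotations

for row, label in zip(X, y):
    print(label, row)
```

Every sequence of the same length maps to a row of the same width, so the rows stack into the sample-by-feature matrix 𝑋 described earlier.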
Then the entire dataset is divided into two sets: a training set and a testing set. During the training stage of supervised machine learning, the model learns a generalizable association between the samples and labels in the training set and is blind to the samples and labels in the testing set [32]. During the testing stage, the trained model predicts the labels associated with the samples in the testing set, based on what it learned during training. The predicted labels are compared with the true labels, resulting in a performance calculation for the predictions [32].

There are problem areas where the labels of a dataset are unknown, or where the goal is to discover what labels best explain the data. This approach is called unsupervised machine learning and can also be applied to genome annotation [33,34]. In this approach, the model again takes genomic sequences as samples and their sequence content as features. However, the model is not given annotations for the samples to train on; instead, it is given the number of annotation classes expected in the dataset [32]. The unsupervised machine learning model determines which genomic sequences belong together under an annotation based on their sequence similarities with each other [32].

1.3.2 The fundamental trade-off

Machine learning methods attempt to minimize the error of the model on the training set, known as the training error, while trying to have the training error approximate the error of the model on the testing set, or the test error. The difference between the training error and the test error is called the approximation error and refers to how well the training error approximates the testing error (Equation 1).

$$E_{approx} = E_{train} - E_{test}$$

Equation 1 The approximation error is the difference between the testing error and training error. It describes how well the trained model generalizes to a holdout set.
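
As a small worked illustration of Equation 1, the sketch below fits a 1-nearest-neighbour classifier, which memorizes its training set, and then computes the training error, the testing error and their difference. This is an assumption-laden toy example and not part of the thesis: the data, the choice of classifier and the helper names `predict_1nn` and `error_rate` are all hypothetical, and the sign convention follows E_approx = E_train − E_test.

```python
# Illustrative sketch of Equation 1 using a 1-nearest-neighbour classifier.
# The toy data and helper names are hypothetical, not from the thesis.

def predict_1nn(train, x):
    """Return the label of the training sample whose feature is closest to x."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]


def error_rate(train, data):
    """Fraction of samples in `data` that the 1-NN model mislabels."""
    mistakes = sum(predict_1nn(train, x) != label for x, label in data)
    return mistakes / len(data)


# (feature, label) pairs; the sample at 2.4 sits close to the class boundary.
train_set = [(1.0, 0), (2.0, 0), (2.4, 1), (3.0, 1), (4.0, 1)]
test_set = [(1.4, 0), (2.1, 0), (2.3, 0), (3.6, 1)]

e_train = error_rate(train_set, train_set)  # zero: each sample is its own neighbour
e_test = error_rate(train_set, test_set)    # the held-out point at 2.3 is mislabelled
e_approx = e_train - e_test                 # sign convention of Equation 1

print(e_train, e_test, e_approx)
```

Here the training error alone would suggest a perfect model; only the held-out samples show that the model has fitted the particularities of its training set.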
Inherently, machine learning training algorithms optimize for training error. They try to minimize a loss function, a function that measures how far the predictions of a model are from the truth. However, the more a model learns the particularities of a training set, the less generalizable it becomes and the worse it performs on a test set, increasing the approximation error. This phenomenon is called overfitting. To handle overfitting, there are several methods for increasing the generalizability of machine learning models, such as introducing a regularization parameter or reducing the complexity of the model. These methods decrease approximation error at the cost of training error. Thus, machine learning theory leads to a fundamental trade-off, called the bias-variance trade-off (Figure 2): optimizing for training error increases approximation error and produces a less generalizable, overfit model, while methods for decreasing approximation error produce a more generalizable model at the cost of an increased training error [35].

Figure 2 A visual representation of the bias-variance trade-off. As model complexity increases, the testing error decreases along with the training error initially, but when the model starts overfitting to the training set, testing error and training error diverge, increasing the approximation error.

1.3.3 Support vector machines for clinical transcriptomics

Clinical transcriptomic datasets often pose an issue for machine learning tasks because of what is called “the curse of dimensionality”. Datasets with high dimensionality are known to cause many problems for machine learning tasks. As the number of features increases, it becomes more likely that the model learns spurious associations with irrelevant features during training. This results in less generalizability of the model and poorer performance on the test set.
In terms of the fundamental trade-off, adding dimensionality to a dataset increases model complexity, thus decreasing training error while increasing approximation error. The curse of dimensionality for clinical genomics is further compounded by the fact that clinical genomics datasets often have far fewer samples than features. An often-touted rule of thumb in machine learning is that there should be five times as many samples as features [36]. Since there are over 50,000 Ensembl gene identifiers, transcriptomic datasets that have been through gene expression quantification have over 50,000 features. However, the number of samples in clinical transcriptomics datasets ranges from as few as a handful of patients to a couple of thousand patients. This makes it impossible to follow the rule of thumb, except perhaps with some extremely stringent feature selection measures.

Figure 3 The red line is a support vector determined by an SVM. The red line is the maximum marginal separator separating the two groups in this dataset.

Support vector machines (SVMs) are supervised machine learning models that do binary classification. During training, SVMs find the optimal separating hyperplane that maximises the distance between the two groups in the dataset, called the maximum marginal separator (Figure 3). The loss function for a typical SVM is a hinge loss expression with an L2-regularization term (Equation 2).

$$f(w) = C \sum_{i=1}^{n} \max\{0,\ 1 - r_i\} + \frac{1}{2}\lVert w \rVert^2; \quad r_i = y_i w^T x_i$$

Equation 2 A hinge loss function with an L2-regularization term. $r_i$ is the residual of the trained SVM for sample $i$.

SVMs are known to be more robust to high dimensional data, while not requiring large numbers of samples to train compared to methods such as artificial neural networks (ANNs) [37]. The robustness of SVMs to high dimensionality is a consequence of the regularization term in the loss function.
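
To make the objective in Equation 2 concrete, the following minimal sketch evaluates it for a fixed weight vector on three toy samples. It is an illustration only and not the implementation used in this work: the function name `svm_objective` and the data are hypothetical, labels are assumed to be +1 or -1, and no bias term is included.

```python
# Illustrative sketch of the objective in Equation 2: hinge loss plus an L2 penalty.
# Function and variable names are hypothetical; labels are assumed to be +1 or -1.

def svm_objective(w, X, y, C=1.0):
    """Return C * sum_i max(0, 1 - r_i) + 0.5 * ||w||^2, with r_i = y_i * (w . x_i)."""
    hinge_total = 0.0
    for x_i, y_i in zip(X, y):
        r_i = y_i * sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        # Samples with r_i >= 1 lie outside the margin and contribute nothing.
        hinge_total += max(0.0, 1.0 - r_i)
    l2_penalty = 0.5 * sum(w_j * w_j for w_j in w)
    return C * hinge_total + l2_penalty


X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, -1]
w = [1.0, -1.0]

# Only the third sample violates the margin (r = -0.5), so the hinge total is 1.5.
print(svm_objective(w, X, y, C=1.0))
```

Raising C makes margin violations more expensive relative to the penalty on the weights, which is how the trade-off between training error and regularization is tuned.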
An L2-regularization term, for example, adds the L2-norm of the trained weights, denoted by a vector $w$, to the loss function (Equation 2). This tends to shrink the weights of irrelevant features, leading to a lower approximation error. While SVMs are not impervious to the curse of dimensionality, they are better suited than other models and have been used frequently in clinical genomics machine learning tasks 37,38.

1.4 Machine learning for personalized cancer medicine

Personalized or precision cancer medicine aims to determine treatments for cancer patients based on the molecular profile of their tumour. These kinds of personalized approaches are becoming more common and have been found to improve the outcomes of cancer patients 39. The modest amount of publicly available cancer sequencing data along with associated clinical data is slowly increasing. Machine learning methods are particularly well suited for data mining these datasets to enable researchers and clinicians to start to understand how a tumour may respond to a given cancer therapy 40,41. Machine learning has already been used extensively for understanding how preclinical models (i.e. cancer cell lines and xenograft models) respond to therapies 42-43. Preclinical genomics datasets are much more available than clinical datasets from patients because it is much simpler and cheaper to test therapies and determine response on preclinical models. However, clinical genomics datasets from real patients are the most important datasets for determining clinically relevant biomarkers for drug response 43.

1.5 Research question

The Personalized OncoGenomics (POG) program is a research initiative at the British Columbia Cancer Agency that uses genomic sequencing technologies to understand how to better treat metastatic cancer patients in British Columbia. The program makes available the transcriptomic profiles of 570 advanced cancer patients and their clinical history.
The work described in this thesis uses feature selection methods and SVMs to try to predict the drug response of patients from the POG570 cohort using the transcriptomic profiles of their tumours. The selected genes and their associated weights are examined to determine if the genes may be biomarkers for treatment response.

Chapter 2: Support vector machines predict metastatic cancer patient response

2.1 Methods

2.1.1 An overview of the POG570 cohort

The data for this project come from the POG570 cohort from the British Columbia (BC) Cancer Personalized OncoGenomics (POG) program. The details of this cohort and corresponding analyses can be found in a publication by Pleasance et al. 44. The clinical program aims to understand the genomic and transcriptomic landscape of advanced metastatic cancers in the province of BC and to improve patient outcomes by informing treatment planning with genomics-driven insights. Patients are enrolled in the program if they meet certain criteria, such as cancer stage and life expectancy. These criteria can be found in the publication by Pleasance et al. 44. Biopsies are taken from the metastatic sites of each cancer using ultrasound-guided, CT-guided or needle core biopsies. Pathologists examine the biopsy slides to estimate tumour content. Matched normal liquid blood biopsies are conducted as well.

2.1.2 Cancer type distribution

The POG570 cohort is a diverse pan-cancer cohort composed of 25 unique histologies. The three largest tumour groups, constituting 394 patients (70%), were the breast cancer (BRC), gastrointestinal (GIC) and pancreatic cancer (PAN) groups (Figure 4).

Figure 4 Distribution of patients across tumour groups.
The tumour groups are breast cancer (BRC), gastrointestinal cancer (GIC), pancreatic cancer (PAN), throat cancer (THR), sarcoma (SAR), ovarian cancer (OVA), skin cancer (SKN), gynecological cancer (GYN), head and neck cancer (H&N), pyrexia of unknown origin (PUO), genitourinary cancer (GUC), hematologic cancer (HEM), HER2-positive breast cancer (NEU), hemangiopericytoma (HPC), adenoid cystic carcinoma (ACC), central nervous system cancer (CNS), other cancers (OTH).

2.1.3 Strand-specific RNA library construction

Since the analysis described here only uses the transcriptomic data from the POG program, the genomic DNA library preparation is not described. The in-depth details for both genomic and transcriptomic library construction can be found in the Supplementary Information in Pleasance et al. 44.

In brief, the transcriptomic libraries were constructed using the BC Cancer Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). The libraries were sequenced to generate 150-200 million 75-base paired-end reads on the Illumina HiSeq2500, or on the NextSeq500 using version 2 chemistry.

2.1.4 Processing RNA-seq data

RNA-seq data were available for 559 of 570 (98%) POG patients. The RNA-seq data were aligned to the GRCh38 reference genome using the splice-aware RNA-seq alignment software STAR (version 2.5.2b) 25. Transcript expression quantification was then performed with the software RSEM (version 1.3.0) to produce normalized transcript quantification in transcripts per million (TPM) 45. This process yielded a 559 x 58,053 TPM feature matrix representing the transcriptomic profiles of the POG570 cohort, which was used for all machine learning tasks. The TPM values were then standardized by converting the TPM values of each gene to Z-scores.

2.1.5 Treatment data description

There are 130 unique therapies that were used for the first post-biopsy treatment.
The 30 most commonly prescribed post-biopsy drugs and the number of patients on each treatment are described in Figure 5.

Figure 5 The frequency of first post-biopsy drug usage across the POG cohort for the 30 most commonly used therapies.

2.1.6 Time on treatment as a proxy for response

Physician-assessed response to therapy was only available for 161 patients (28%) of the POG570 cohort. This is too small a sample size to train a meaningful machine learning model. However, treatment history and time on treatment were available for every patient. Time on treatment was used as a proxy for patient response for this study. The assumption made is that patients who remain on a treatment for the full course either responded well or at least did not have a poor response to the treatment. However, if a patient was taken off a treatment quickly, they may have had a poor response or even an adverse reaction. Some literature shows that time on treatment is indeed correlated with therapy response for some cancers 46,47.

Physician-assessed treatment responses for the 161 POG patients were assessed according to the RECIST response criteria 48. The response criteria in order of poor to good response are progressive disease (PD), stable disease (SD), partial response (PR) and complete response (CR). Only 1 POG patient was assessed as CR. The POG patients who had physician-assessed response as PR or SD for a drug appear to have had longer time on treatment than POG patients who had physician-assessed response as PD (Figure 6).

The hypothesis that patients assessed as SD or PR were on treatments for longer than patients assessed as PD was tested using a fixed-effects regression model (Equation 3). In the model, the effect of each patient is represented by $n - 1$ dummy variables $D2, D3, \ldots, Dn$, where $n = 570$ is the total number of patients. The panels are indexed by $t \in \{1, 2, \ldots, T\}$, where $T = 130$ is the number of distinct first post-biopsy treatments.
The effect of the dummy variable for physician-assessed response, $\mathit{AssessedResponse}$, which equals 1 if the patient has been assessed as PR or SD, or 0 if the patient has been assessed as PD, is measured by the coefficient $\beta_1$. The error for the model is represented by $u_{it}$. The model was fit using the entity-demeaned ordinary least squares algorithm implemented in the plm package for R 49. After fitting the model, the hypothesis $\beta_1 \neq 0$ was tested against the null hypothesis $\beta_1 = 0$ using a z-test with the lmtest package for R 50.

$$\mathit{TimeOnTreatment}_{it} = \beta_0 + \beta_1 \mathit{AssessedResponse}_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it}$$

Equation 3 The fixed effects regression model for determining the effect on first post-biopsy treatment time of a patient assessed as having a partial response or stable disease.

Figure 6. The days on treatment for the 30 most frequently used therapies separated by physician-assessed response to the therapies. In the boxplots, the lower and upper limits of the boxes correspond to the first and third quartiles respectively. The whiskers extend to the furthest value no more than 1.5 times the inter-quartile range from the box. Data beyond the whiskers are denoted with points. The actual data are overlaid on the boxplots in translucent blue. The PR and SD assessed patients appear to be on treatments for longer than patients assessed as PD.

2.1.7 Processing time on treatment data

Almost all POG patients were on therapies before and after biopsy. This study only examines the response of patients to therapies that immediately follow biopsy. It is possible that the tumour at the time of biopsy would have developed resistance to therapies preceding the biopsy. In this case, the length of time on treatment would have no bearing on how the tumour at time of biopsy may respond to another round of the therapy. Similarly, the time on treatment for therapies following the first post-biopsy therapy is not as representative of the susceptibility of the tumour at the time of biopsy.
The patients were divided into two groups to allow binary classification. The two groups were those who were on the treatment for longer than the median treatment time (good responders) and those who were on the treatment for less than the median treatment time (poor responders). These are the labels that were used for the machine learning task.

2.1.8 Recursive feature elimination

It is important to reduce the dimensionality of datasets so that only relevant features are used for machine learning tasks. Higher-dimensional datasets increase the chance of the model fitting to spurious relationships in the training set, leading to overfitting, as described in section 1.3.2. This is a significant challenge for many clinical genomics datasets since the dimensionality of the datasets tends to be very large while sample numbers tend to be small.

The feature selection method of recursive feature elimination (RFE) was used to eliminate unimportant features in this dataset. RFE is an iterative method where a model is trained on the full training set and the least informative features are removed from the model. This process is repeated until the termination condition is met, which is a threshold on the minimal size of the feature set. Additionally, the training set is divided into k equally sized cross-validation sets. At each iteration of the RFE, the model is trained on k - 1 of the cross-validation sets and an accuracy score is calculated from the predictions of the model on the remaining cross-validation set. This is repeated k times such that the model’s performance can be evaluated on every set of the cross-validation split. This is called k-fold cross-validation. The selected features from the iteration of the RFE with the highest cross-validation score are used as the best features for the prediction task. RFE is a greedy algorithm for feature selection since it removes the least informative features at each step in the iteration.
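The RFE loop with k-fold cross-validation can be sketched as follows. This is an illustrative outline only, not the modified scikit-learn implementation used in this work: the small subgradient-descent trainer `train_linear_svm` is a stand-in for a linear SVM, all function names and the toy data are assumptions, and the defaults (5 folds, a step of 0.15 applied to the features remaining at each iteration, stopping at one feature) follow the parameters reported for this work.

```python
import random


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_linear_svm(X, y, C=1.0, epochs=30, lr=0.01):
    """Subgradient descent on C * sum(hinge) + 0.5 * ||w||^2 (no bias term)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)  # gradient of the L2 term
        for x_i, y_i in zip(X, y):
            if y_i * dot(w, x_i) < 1.0:  # sample violates the margin
                for j, x_ij in enumerate(x_i):
                    grad[j] -= C * y_i * x_ij
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w


def cv_accuracy(X, y, k):
    """Mean held-out accuracy over k folds (fold f holds out samples with i % k == f)."""
    scores = []
    for f in range(k):
        train = [i for i in range(len(y)) if i % k != f]
        test = [i for i in range(len(y)) if i % k == f]
        w = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        hits = sum(1 for i in test if (1 if dot(w, X[i]) >= 0 else -1) == y[i])
        scores.append(hits / len(test))
    return sum(scores) / k


def rfe(X, y, step=0.15, k=5, min_features=1):
    """Recursive feature elimination; returns (best feature indices, per-iteration history)."""
    kept = list(range(len(X[0])))
    history = []
    while True:
        X_kept = [[row[j] for j in kept] for row in X]
        history.append((list(kept), cv_accuracy(X_kept, y, k)))
        if len(kept) <= min_features:
            break
        w = train_linear_svm(X_kept, y)
        n_drop = min(max(1, int(step * len(kept))), len(kept) - min_features)
        # Rank features by |coefficient| and drop the n_drop least informative.
        order = sorted(range(len(kept)), key=lambda pos: abs(w[pos]))
        dropped = set(order[:n_drop])
        kept = [f for pos, f in enumerate(kept) if pos not in dropped]
    best_features, _ = max(history, key=lambda item: item[1])
    return best_features, history


# Toy data (assumed): feature 0 carries signal, the other 19 features are noise.
rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[2.0 * label + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(19)] for label in y]
best_features, history = rfe(X, y)
print([len(kept) for kept, _ in history])  # feature-set size at each iteration
print(best_features)                       # features from the best-scoring iteration
```

The loop is greedy in the sense described above: a feature removed at one iteration is never reconsidered, so the returned set is the best among the visited feature sets rather than among all possible subsets.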
The final selected feature set is not guaranteed to be the optimal set of features. In the context of drug response prediction, the selected features are also not necessarily the most biologically relevant set of genes. However, RFE is a simple and fast feature selection method that has been shown to be effective for drug response prediction tasks in prior studies 40,41. The algorithm itself is deterministic given a training set and a deterministic machine learning algorithm. However, the feature set selected by RFE has been shown to be unstable when the training set is modified 51.

The RFE implementation used for this work is modified from the scikit-learn Python package 52. The least informative features at each iteration of the RFE were determined by training an SVM on the training set and ranking the features by the absolute value of the coefficients associated with each feature. The features with the smallest absolute coefficients were determined to be the least informative features. The parameters for the RFE were: 5-fold cross-validation at each iteration; a step size of 0.15, meaning that 15% of the features present at each iteration were removed; and a stopping criterion of a feature set of size 1, at which point no further features can be eliminated.

2.1.9 Machine learning pipeline and implementation

To determine for which therapies machine learning methods can predict patient response, the POG570 cohort was divided into samples of patients who were all on the same therapy. The patients who took capecitabine and 5-fluorouracil (5-FU) were combined into a single group because capecitabine is converted to 5-FU in the body. The groups of patients on the same therapy were divided into a training and testing set consisting of 80% and 20% of the total number of samples respectively. The preprocessing steps described above were applied to the transcriptomic and time on treatment data for these patients.
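The preprocessing for a single therapy group can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline code used in this work: the helper names, the random seed and the toy values are hypothetical, patients whose time on treatment equals the median are assigned to the poor-responder group (the text does not specify how ties are handled), and the Z-score statistics are computed over whichever rows are passed in.

```python
import random
from statistics import mean, median, pstdev


def median_split_labels(days_on_treatment):
    """1 = on treatment longer than the group median (good responder), else 0."""
    cutoff = median(days_on_treatment)
    return [1 if days > cutoff else 0 for days in days_on_treatment]


def train_test_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def zscore_genes(tpm_rows):
    """Standardize each gene (column) to mean 0 and standard deviation 1 across patients."""
    columns = list(zip(*tpm_rows))
    means = [mean(col) for col in columns]
    # Genes with constant expression get a divisor of 1 to avoid dividing by zero.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in tpm_rows]


# Toy therapy group (assumed): 10 patients, days on treatment and 3 genes of TPM values.
days = [30, 200, 45, 365, 90, 120, 15, 400, 60, 250]
tpm = [[float(i), 5.0, float((i * 7) % 10)] for i in range(10)]

labels = median_split_labels(days)  # the median is 105 days
train_idx, test_idx = train_test_indices(len(days))
z = zscore_genes(tpm)
print(labels)
print(len(train_idx), len(test_idx))
```

Splitting at the median yields two groups of roughly equal size, so the classifier is trained on balanced labels within each therapy group.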
After RFE was used to reduce the dimensionality of the transcriptomic data, SVMs were trained on the training data with reduced features and then tested on the testing set. The performance of each SVM was calculated using the F1-score metric. The SVM implementation used for this project is from the LIBLINEAR project and wrapped in Python by scikit-learn 52,53. All SVMs were L2-regularized with a linear kernel.

2.1.10 Selection of machine learning method

SVMs were selected after some simple benchmarking comparisons with random forest, logistic regression and boosted decision tree machine learning methods. SVMs were found to have the best performance for the binary classification task. The modified SVM method of support vector regression was also tested by fitting the regression model on days on treatment.

2.1.11 Analysis of RFE genes for biomarkers

The RFE selected genes for the best performing SVMs were further analyzed to determine if they are known biomarkers associated with either prognosis or treatment response. The genes were used as search terms on Google Scholar and on CIViCmine, a database of biomarkers text mined from all available published literature 54.

2.2 Results

2.2.1 Fixed effects model of time on treatment

The fit regression function for the fixed effects model is in Equation 4. The coefficient on $\mathit{AssessedResponse}$ was determined to be 207.68 ($p = 2.051 \times 10^{-11}$). The standard error for the coefficient was 28.76. The coefficient for $\mathit{AssessedResponse}$ is positive with a statistically significant p-value.

$$\widehat{\mathit{TimeOnTreatment}} = 207.68 \times \mathit{AssessedResponse} + \mathit{StateFixedEffects}$$

Equation 4 The fit fixed effects regression model. The coefficients for the dummy variables are not reported due to the limitations of the entity-demeaned ordinary least squares algorithm used to fit the model. These dummy variables are represented by the $\mathit{StateFixedEffects}$ variable.
2.2.2 RFE results and SVM performance on therapies

The results of the RFE and subsequent SVM predictions are described in Table 1. Only the therapies with more than 29 patients were examined. The RFE reduced the dimensionality of the TPM matrices to as few as 7 genes for the carboplatin group. The 3 best performing SVMs were those for predicting carboplatin, cisplatin and gemcitabine response. The SVM predictions on the test sets are visualized in Figure 7. The test set patients are separated by tumour group.

| Drug Name | Pathway | Drug type | Number of Patients | Number of Genes | Best iteration CV F1 score | Test set F1 Score |
|---|---|---|---|---|---|---|
| Capecitabine/5-FU | DNA synthesis | DNA synthesis inhibitor | 112 | 9 | 0.875 | 0.632 |
| Gemcitabine | DNA synthesis | DNA synthesis inhibitor | 75 | 36 | 0.866 | 0.777 |
| Irinotecan | Topoisomerase I inhibitor | DNA synthesis | 57 | 36 | 0.880 | 0.615 |
| Paclitaxel | Mitotic inhibitor | Taxane | 50 | 1916 | 0.925 | 0.333 |
| Cisplatin | DNA alkylating | DNA damage | 47 | 4313 | 0.889 | 0.823 |
| Carboplatin | DNA alkylating | DNA damage | 41 | 7 | 0.942 | 1.00 |
| Oxaliplatin | DNA alkylating | DNA damage | 36 | 42 | 0.960 | 0.333 |
| Leucovorin | | Antidote | 33 | 91 | 0.960 | 0.333 |
| Bevacizumab | VEGF inhibitor (targeted) | Kinase inhibitor | 29 | 1002 | 0.960 | 0.400 |

Table 1. The results of the RFE and performance of the SVMs on predicting patient response to the therapies with at least 30 patients.

Figure 7. Response predictions with SVMs on the 9 most commonly used drugs in the POG cohort.

2.2.3 Genes and feature coefficients for SVM predicting carboplatin response

The 7 RFE selected genes for predicting carboplatin response are described in Table 2. The magnitude of a coefficient determines its impact on the SVM decision. A positive coefficient for a gene means that high expression of the gene influences the SVM to predict good response to the drug of interest and low expression of the gene influences the SVM to make a poor response prediction.
If the coefficient is negative, high expression of the gene is associated with poor response and low expression is associated with good response. The RFE selected genes were used as search terms in CIViCmine and Google Scholar to determine if they are previously known biomarkers. The search found that expression of the genes TSC22D1 and ALDOC was generally positively associated with prognosis 55,56. None of the genes were found to be biomarkers for carboplatin response.

| ENSG ID | HUGO GENE ID | DESCRIPTION | COEFFICIENT |
|---|---|---|---|
| ENSG00000278642 | AC015813.4 | novel transcript | 0.97 |
| ENSG00000102804 | TSC22D1 | TSC22 domain family member 1 | 0.77 |
| ENSG00000187715 | KBTBD12 | kelch repeat and BTB domain containing 12 | 0.66 |
| ENSG00000227014 | AC007285.1 | novel transcript | 0.54 |
| ENSG00000224025 | AL353616.1 | dermatan sulfate epimerase (DSE) pseudogene | 0.25 |
| ENSG00000109107 | ALDOC | aldolase, fructose-bisphosphate C | 0.05 |
| ENSG00000254400 | AC091564.2 | novel transcript | -0.12 |

Table 2 The RFE selected genes for an SVM predicting carboplatin response in the POG cohort.

2.2.4 Expression of novel transcripts for carboplatin

The TPM values for the three novel transcript genes, AC015813.4, AC007285.1 and AC091564.2, were plotted to understand their expression levels. The expression of all three genes was very low. The most highly expressed of the three genes was AC015813.4, with a maximum TPM value of 8.15 (Figure 8).

Figure 8 Expression of AC015813.4 separated by discretized days on treatment

The coefficient on AC015813.4 was very large and positive. Figure 8 shows that patients who were on carboplatin for longer than the median time on treatment had greater expression of the gene than patients who were on carboplatin for less than the median time on treatment.

Figure 9 Expression of AC007285.1 separated by discretized days on treatment

The gene AC007285.1 had a large and positive coefficient.
Figure 9 shows that patients whose time on treatment was greater than the median had greater expression of the gene. The expression of the gene was very low for all patients. The maximum TPM value for the gene across all patients was 0.98.

Figure 10 Expression of AC091564.2 separated by discretized days on treatment

The coefficient of AC091564.2 was very small and negative. Figure 10 shows that several patients who were on carboplatin for less than the median time had relatively high expression of the gene. Overall expression of the gene was generally low, with a maximum TPM value of 1.87.

2.2.5 Genes and feature coefficients for SVM predicting cisplatin response

The RFE selected 4313 genes for predicting cisplatin response. The top 30 RFE selected genes are described in Table 3. The Google Scholar and CIViCmine search found that the expression of the gene DERL3 was positively associated with prognosis 57. The genes TMPRSS4 and SOX11 were negatively associated with prognosis 58,59.
| ENSG ID | HUGO GENE ID | DESCRIPTION | COEFFICIENT |
|---|---|---|---|
| ENSG00000099958 | DERL3 | derlin 3 | 0.0037 |
| ENSG00000159199 | ATP5MC1 | ATP synthase membrane subunit c locus 1 | 0.0032 |
| ENSG00000251151 | HOXC-AS3 | HOXC cluster antisense RNA 3 | 0.0031 |
| ENSG00000211938 | IGHV3-7 | immunoglobulin heavy variable 3-7 | 0.0031 |
| ENSG00000260298 | ACTG1P16 | actin gamma 1 pseudogene 16 | 0.0030 |
| ENSG00000282793 | AC090164.4 | novel transcript | 0.0030 |
| ENSG00000213801 | ZNF321P | zinc finger protein 321, pseudogene | 0.0030 |
| ENSG00000156398 | SFXN2 | sideroflexin 2 | 0.0030 |
| ENSG00000137648 | TMPRSS4 | transmembrane serine protease 4 | 0.0030 |
| ENSG00000154537 | NA | NA | 0.0030 |
| ENSG00000254731 | AP003059.1 | novel transcript | 0.0029 |
| ENSG00000236484 | RRM2P2 | ribonucleotide reductase M2 polypeptide pseudogene 2 | 0.0029 |
| ENSG00000283268 | TEX54 | testis expressed 54 | 0.0028 |
| ENSG00000261427 | AC099518.3 | novel transcript | 0.0028 |
| ENSG00000230631 | AL353152.1 | novel transcript | 0.0028 |
| ENSG00000205865 | FAM99B | family with sequence similarity 99 member B | 0.0028 |
| ENSG00000143199 | ADCY10 | adenylate cyclase 10 | 0.0027 |
| ENSG00000258972 | NDUFB8P1 | NADH:ubiquinone oxidoreductase subunit B8 pseudogene 1 | 0.0027 |
| ENSG00000274395 | NA | NA | -0.0028 |
| ENSG00000218020 | THAP12P5 | THAP domain containing 12 pseudogene 5 | -0.0028 |
| ENSG00000052802 | MSMO1 | methylsterol monooxygenase 1 | -0.0028 |
| ENSG00000206535 | LNP1 | leukemia NUP98 fusion partner 1 | -0.0029 |
| ENSG00000154027 | AK5 | adenylate kinase 5 | -0.0030 |
| ENSG00000232355 | AL603650.1 | pseudogene similar to part of a ficolin family protein | -0.0030 |
| ENSG00000233688 | BX842559.2 | zinc finger-like protein 9 (ZPR) pseudogene | -0.0030 |
| ENSG00000176887 | SOX11 | SRY-box transcription factor 11 | -0.0030 |
| ENSG00000263489 | AC127029.2 | novel transcript | -0.0031 |
| ENSG00000275852 | LINC01742 | long intergenic non-protein coding RNA 1742 | -0.0031 |
| ENSG00000272324 | AC012629.2 | novel transcript, antisense to DAP | -0.0032 |
| ENSG00000248161 | AC098487.1 | novel transcript | -0.0034 |

Table 3 The top 30 RFE selected genes for an SVM predicting cisplatin response in the POG cohort.
2.2.6 Genes and feature coefficients for SVM predicting gemcitabine response

The 36 RFE selected genes for gemcitabine response prediction are described in Table 4. The Google Scholar and CIViCmine search found that expression of the genes A2M and SEC14L2 was positively associated with prognosis 60,61. The expression of the genes VPS9D1-AS1, COP1 and UFC1 was negatively associated with prognosis 62-63. The expression of the gene PAIP2 was found not to be associated with prognosis 64.

| ENSG ID | HUGO GENE ID | DESCRIPTION | COEFFICIENT |
|---|---|---|---|
| ENSG00000226629 | LINC00974 | long intergenic non-protein coding RNA 974 | 0.55 |
| ENSG00000166289 | PLEKHF1 | pleckstrin homology and FYVE domain containing 1 | 0.51 |
| ENSG00000226928 | RPS14P4 | ribosomal protein S14 pseudogene | 0.42 |
| ENSG00000266891 | AP000902.1 | ribosomal protein L27a (RPL27A) pseudogene | 0.40 |
| ENSG00000100003 | SEC14L2 | SEC14 like lipid binding 2 | 0.32 |
| ENSG00000269970 | AL162424.1 | novel transcript, sense intronic to PTGS1 | 0.32 |
| ENSG00000273234 | OR2A13P | olfactory receptor family 2 subfamily A member 13 pseudogene | 0.31 |
| ENSG00000251533 | LINC00605 | long intergenic non-protein coding RNA 605 | 0.28 |
| ENSG00000113312 | TTC1 | tetratricopeptide repeat domain 1 | 0.28 |
| ENSG00000115363 | EVA1A | eva-1 homolog A, regulator of programmed cell death | 0.21 |
| ENSG00000280649 | AC245100.8 | TEC | 0.20 |
| ENSG00000276645 | AL020995.2 | None | 0.18 |
| ENSG00000136305 | CIDEB | cell death inducing DFFA like effector b | 0.16 |
| ENSG00000114786 | ABHD14A-ACY1 | ABHD14A-ACY1 readthrough | 0.13 |
| ENSG00000249590 | AC004832.3 | novel SEC14-like 2 (S. cerevisiae) (SEC14L2) and mitochondrial protein 18 kDa (MTP18) protein | 0.09 |
| ENSG00000249249 | AC010226.1 | novel transcript | 0.09 |
| ENSG00000175899 | A2M | alpha-2-macroglobulin | 0.05 |
| ENSG00000248187 | AC078850.1 | novel transcript | 0.03 |
| ENSG00000178177 | LCORL | ligand dependent nuclear receptor corepressor like | 0.03 |
| ENSG00000184925 | LCN12 | lipocalin 12 | 0.02 |
| ENSG00000261373 | VPS9D1-AS1 | VPS9D1 antisense RNA 1 | 0.02 |
| ENSG00000260025 | CRIM1-DT | CRIM1 divergent transcript | 0.02 |
| ENSG00000120727 | PAIP2 | poly(A) binding protein interacting protein 2 | 0.00 |
| ENSG00000254475 | OR2AT1P | olfactory receptor family 2 subfamily AT member 1 pseudogene | -0.01 |
| ENSG00000250504 | KRT18P51 | keratin 18 pseudogene 51 | -0.01 |
| ENSG00000250221 | KRT8P32 | keratin 8 pseudogene 32 | -0.02 |
| ENSG00000274330 | AL160191.3 | ADAM metallopeptidase domain 20 (ADAM20) pseudogene | -0.10 |
| ENSG00000256060 | TRAPPC2B | trafficking protein particle complex 2B | -0.11 |
| ENSG00000153037 | SRP19 | signal recognition particle 19 | -0.13 |
| ENSG00000229001 | ACTBP14 | ACTB pseudogene 14 | -0.15 |
| ENSG00000143207 | COP1 | COP1 E3 ubiquitin ligase | -0.15 |
| ENSG00000164303 | ENPP6 | ectonucleotide pyrophosphatase/phosphodiesterase 6 | -0.15 |
| ENSG00000116586 | LAMTOR2 | late endosomal/lysosomal adaptor, MAPK and MTOR activator 2 | -0.21 |
| ENSG00000143222 | UFC1 | ubiquitin-fold modifier conjugating enzyme 1 | -0.24 |
| ENSG00000269388 | AC018755.3 | putative ATP-binding domain-containing protein 3-like protein (ABP3L) pseudogene | -0.31 |
| ENSG00000187583 | PLEKHN1 | pleckstrin homology domain containing N1 | -0.35 |

Table 4 The RFE selected genes for an SVM predicting gemcitabine response in the POG cohort.

2.3 Discussion

2.3.1 Fixed effects model of time on treatment

The fixed effects model demonstrates that there is a positive and statistically significant relationship between physician-assessed response and time on treatment across all drugs used in POG. This result provides strong justification for using time on treatment as a proxy for treatment response.
2.3.2 Possible overfitting from benchmarking

Since several machine learning methods were tested prior to selecting SVMs, it is possible that the good performance of the SVMs in this work is partially due to overfitting. However, it is unlikely that this explains the entirety of their performance since the benchmarking was not done systematically on the entire dataset. It is worth noting that this is a possible source of bias. SVMs combined with RFE have been shown to perform well for clinical genomics datasets in several other studies 38,40,41.

2.3.3 RFE and SVM performance

Remarkably, the RFE reduced the dimensionality of the TPM matrices to very small sets of genes that are effective for the prediction task. The dimensionality was reduced from 58,053 genes to as few as 7 genes in the dataset of patients prescribed carboplatin. The small size of the gene sets increases the performance of the SVM and facilitates manual analysis of the genes. The top 3 SVMs that were explored in this work achieved exceedingly good F1 scores on the test sets. These high scores give credence to the subsequent analyses of the coefficients for the features in the SVM. The SVMs appear to be learning some generalizable patterns in the transcriptomes that are predictive of time on treatment. The coefficients of the features may therefore be useful for understanding these patterns. The F1 scores of the other SVMs on the test sets were not high enough to warrant further analysis.

2.3.4 Some RFE selected genes are known to be biomarkers for prognosis

The analysis of the genes selected by the RFE revealed that several of the genes were already known to be biomarkers. None of the genes selected by the RFE were known to interact with the particular therapy that was being examined; instead, they were general biomarkers for prognosis. This may indicate that the SVMs learned general prognostic biomarkers instead of treatment-specific biomarkers.
2.3.5 Examining the RFE selected genes for predicting carboplatin response

For the SVM trained on the carboplatin patients, the RFE selected 7 genes, of which 3 were novel transcripts and 2 were previously known as biomarkers for prognosis. One of these previously studied genes, TSC22D1, had a large and positive coefficient in the SVM. This means that expression of TSC22D1 influenced the SVM to predict good response to carboplatin. The TSC22D1 gene encodes a transcription factor from the TSC22 domain family of leucine zippers 65. TSC22D1 is understood to be a tumour suppressor that is known to cause programmed cell death when overexpressed in mammary tissue 65,66. Furthermore, TSC22D1 was found to be downregulated in mouse liver tumours, suggesting a role in liver cancer genesis 67. The SVM correctly predicted the response of the two BRC and two HPB patients in the test set.

The other gene selected by the RFE that was previously recorded to be a biomarker is the ALDOC gene. The coefficient for this gene was small and positive. The small value indicates that expression of this gene has a weaker impact on the prediction of the SVM compared to the expression of other genes. The protein encoded by ALDOC belongs to the aldolase family, which is involved in glycolysis and is responsible for the repair of injured tissue 56. Expression of this gene predicts favourable prognosis for patients with glioblastomas 56. There were no glioblastoma patients prescribed carboplatin.

Interestingly, the other 5 genes appeared to have no known association with cancer or cancer prognosis in the literature. Of these 5 genes, 3 were novel transcripts with little to no information in the literature. The coefficient with the largest absolute value was for the gene AC015813.4. The transcript for this gene is a long noncoding RNA (lncRNA) that has not been studied extensively. LncRNAs are now understood to have important functional roles in the cell 68.
Consequently, lncRNAs play a surprisingly extensive role in cancer 69. This may indicate that the RFE and SVM are finding novel biomarkers for prognosis that have not been recorded in the literature.

2.3.6 Examining the RFE selected genes for predicting cisplatin response

The RFE process for patients treated with cisplatin selected 4313 genes. This is far more genes than the RFE selected for the other treatment groups. This work only examined the 30 genes most important to the response prediction. It is possible that there are genes used by the SVM that may have been previously studied in the literature that were not examined here. Of the 30 most important genes, 3 genes were found to be previously recorded biomarkers in the literature: DERL3, TMPRSS4 and SOX11.

The coefficient in the SVM for the DERL3 gene had the largest value and was positive. The proteins encoded by DERL3 are found in the endoplasmic reticulum and are responsible for the degradation of misfolded proteins 70. The gene is known to be an important tumour suppressor in gastric cancer 57. There were 2 GIC patients treated with cisplatin and the SVM correctly predicted the response to treatment of both. The fact that the expression of DERL3 played a large role in the SVM prediction provides evidence that the SVM is finding relevant biomarkers for prediction.

The other two genes, TMPRSS4 and SOX11, are known to be negatively associated with prognosis. SOX11 has a negative coefficient with a large magnitude. The SOX11 gene encodes transcription factors essential to stem cell maintenance and differentiation 71. The expression of the gene is known to be elevated in many tumours 72,73. Although SOX11 is classified as an oncogene and thought to suppress apoptosis and promote tumorigenesis, other studies have found that expression of SOX11 is associated with improved prognosis 73,74.
This speaks to the complexity of the transcriptomic landscape of cancers and how an oncogene in one context may be a tumour suppressor in another context. Another result that confounds expectations is that the coefficient for the gene TMPRSS4 was positive in the SVM even though its expression is associated with a poor prognosis in several cancer types 58. The TMPRSS4 gene encodes a serine protease that is thought to facilitate metastasis and cell growth 75. It is not clear why the SVM gave a positive coefficient to this gene.

The four genes with the largest-magnitude negative coefficients encode noncoding RNAs that have not been previously studied in the literature. These genes are AC127029.2, LINC01742, AC012629.2 and AC098487.1. The first gene, AC127029.2, encodes a lncRNA of which there appear to be no references within the literature. The second gene, LINC01742, encodes a long intergenic noncoding RNA (lincRNA). The only mention of this gene in the literature appears to be in a study finding that the protein Nrf2, a molecule that regulates redox homeostasis and is dysregulated in cancers, binds to the LINC01742 gene 76. The third gene, AC012629.2, encodes a lncRNA that is antisense to DAP, a mediator of cell death 77. Antisense RNAs are known to regulate genes by blocking transcription. It is therefore plausible that high expression of AC012629.2 may prevent DAP-regulated cell death. The final gene, AC098487.1, encodes another lncRNA that does not appear to have been studied in the literature. These noncoding RNAs may be novel biomarkers for prognosis.

2.3.7 Examining the RFE selected genes for predicting gemcitabine response

The RFE for patients treated with gemcitabine selected 36 genes. Of these genes, 5 were previously recorded as biomarkers in the literature: A2M and SEC14L2 are positively associated with prognosis, VPS9D1-AS1 and UFC1 are negatively associated with prognosis and PAIP2 is not associated with prognosis 60-64.
The A2M gene encodes a pan-proteinase inhibitor that is known to inhibit malignancy in tumour cells 60,78. The coefficient for the A2M gene in the SVM was positive with a small absolute value. The SEC14L2 gene encodes a lipid-binding protein that is known to be downregulated in breast cancers, suggesting that it may be a tumour suppressor 61. The coefficient for SEC14L2 was positive with a large value. It seems that the SVM was able to correctly associate the expression of these genes with a good prognosis.

The VPS9D1-AS1 and UFC1 genes encode lncRNAs that have been associated with poor prognosis in non-small cell lung cancer 62,79. UFC1 expression is also associated with progression of breast and gastric cancers 63,80. In a contradictory result, low expression of VPS9D1-AS1 is associated with poorer overall survival in gastric cancer patients 81. The coefficient in the SVM for VPS9D1-AS1 was positive with a small value. The positive coefficient suggests that within the POG cohort the gene was associated with a good prognosis, agreeing with the latter study. The coefficient in the SVM for UFC1 was negative with a large magnitude. This agrees with the work in the literature suggesting that expression of UFC1 is associated with poor prognosis.

The only other gene found in the literature was PAIP2. This gene encodes an RNA-binding protein that binds to the poly(A) region of an mRNA molecule and prevents translation 82. It was thought that expression of PAIP2 may regulate the translation of oncogenes; however, PAIP2 expression was found not to be associated with prognosis 64. The coefficient in the SVM for PAIP2 was positive and very close to 0, and the gene would have been removed in the next iteration of the RFE. This suggests that the expression of PAIP2 may have some role in predicting prognosis, but not a large one.
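The way these coefficients are read can be made concrete with a short sketch. The numeric values below are hypothetical and were chosen only to mirror the qualitative description above (sign and rough magnitude); they are not the coefficients obtained in this work, and the helper names `rank_by_magnitude` and `next_eliminated` are illustrative.

```python
# Hypothetical coefficients, NOT the values from this work: only the signs and
# the rough ordering of magnitudes follow the description in the text.
coefficients = {
    "SEC14L2": 0.80,     # positive, large
    "A2M": 0.10,         # positive, small
    "VPS9D1-AS1": 0.05,  # positive, small
    "PAIP2": 0.001,      # positive, very close to 0
    "UFC1": -0.75,       # negative, large magnitude
}

def rank_by_magnitude(coefs):
    """Return (gene, coefficient, direction) rows, most influential gene first."""
    ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        (gene, c, "towards the positive class" if c > 0 else "towards the negative class")
        for gene, c in ranked
    ]

def next_eliminated(coefs):
    """Gene a one-at-a-time RFE round would drop next: the smallest |coefficient|."""
    return min(coefs, key=lambda gene: abs(coefs[gene]))

for gene, c, direction in rank_by_magnitude(coefficients):
    print(f"{gene:12s} {c:+.3f}  pushes the prediction {direction}")
print("next to be eliminated:", next_eliminated(coefficients))
```

In a linear SVM the sign of a coefficient gives the direction in which higher expression moves the decision, and the magnitude gives how strongly, which is why a near-zero coefficient such as the one for PAIP2 marks the gene as the next candidate for elimination.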
Several genes selected by the RFE were not found in the literature as biomarkers but appear to have a role in cancer that has not been studied extensively: EVA1A, CIDEB and LAMTOR2. The EVA1A gene plays a role in regulating programmed cell death 83. It is known that the gene promotes autophagy and is downregulated in cancers 83. However, no studies have shown EVA1A to be a biomarker for prognosis. Similarly, CIDEB is known to play a role in programmed cell death and to promote apoptosis 84. However, the role of CIDEB in cancer regulation, and whether it can be used as a biomarker for prognosis, remains to be studied. The LAMTOR2 gene is known to be a MAPK and MTOR pathway activator. These two pathways are very well studied and known to be dysregulated in cancers 85,86. The expression of other genes from the LAMTOR family, LAMTOR3 and LAMTOR5, has been associated with poor prognosis in brain, esophageal, breast, cervical and ovarian cancers 87,88. There has been no study linking LAMTOR2 expression to prognosis for cancer patients. The expression of these genes may be a biomarker for prognosis.

2.3.8 The expression of pseudogenes may be prognostic biomarkers

Pseudogenes are a class of evolutionarily conserved genes, derived from other functional genes, that encode lncRNAs 89. Pseudogenes had been thought to be non-functional; however, there is evidence that they have a regulatory function in gene expression 89. Perhaps the most notable example of a functional pseudogene is PTENP1, a pseudogene of the well-studied tumour suppressor gene PTEN 90. The lncRNA expressed by PTENP1 competes for silencing microRNAs that target PTEN transcripts; thus the lncRNAs of PTENP1 have tumour suppressive activity as competitive endogenous RNAs (ceRNAs) 90. Other studies have also discovered ceRNAs with both tumour suppressing and promoting activity 91–93.
Given these prior studies, it is not surprising to find several pseudogenes selected by the RFEs that may be biomarkers. In Table 2, which describes the RFE selected genes for predicting carboplatin response, the pseudogene for DSE, called AL353616.1, was used by the SVM as a positive predictor of response. DSE is known to be downregulated in hepatocellular cancer cells, and restoring DSE expression suppresses tumour growth 94. It is possible that the transcript of AL353616.1 acts as a ceRNA for DSE and thereby as a tumour suppressor.

In Table 3, which describes the top 30 RFE selected genes for cisplatin response, the selected pseudogenes that may have a function in cancer were ACTG1P16 and RRM2P2, which had positive coefficients, and BX842559.2, which had a negative coefficient. ACTG1, the functional copy of pseudogene ACTG1P16, is known to be overexpressed in skin cancers and hepatocellular cancers 95,96. This seems contradictory to the positive coefficient of ACTG1P16, which would suggest a tumour suppressing function if the transcript of the gene acts as a ceRNA. RRM2, the functional copy of pseudogene RRM2P2, is implicated in poorer patient prognosis in several cancers 97–99. Remarkably, RRM2 knockdown with siRNA has been found to increase cisplatin sensitivity in cancers 100–103. This would imply that tumours with high expression of RRM2 would be resistant to cisplatin, which seems contradictory to the positive coefficient on the pseudogene RRM2P2. ZPR1, the functional copy of BX842559.2, is a tumour promoter in breast cancer 104. Thus, the interaction of BX842559.2 transcripts as ceRNA for ZPR1 transcripts would be tumour promoting. This corroborates the negative coefficient of the pseudogene.
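The reasoning applied to each pseudogene above follows one rule: if a pseudogene transcript acts as a ceRNA, it protects its parent gene's transcripts, so a pseudogene of a tumour suppressor would be expected to carry a positive coefficient and a pseudogene of a tumour promoter a negative one. The sketch below encodes that rule for the pseudogenes discussed so far. The `suppressor` and `promoter` labels are a simplification of the literature summaries in the text, and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Pseudogene:
    name: str
    parent: str
    parent_role: str       # "suppressor" or "promoter", simplified from the text
    coefficient_sign: int  # +1 or -1, sign of the SVM coefficient as reported

def expected_sign(p: Pseudogene) -> int:
    """Sign predicted by the ceRNA hypothesis: the pseudogene mirrors its parent."""
    return 1 if p.parent_role == "suppressor" else -1

def is_consistent(p: Pseudogene) -> bool:
    """True when the reported sign matches the ceRNA expectation."""
    return p.coefficient_sign == expected_sign(p)

# Entries taken from the discussion of Tables 2 and 3 above.
entries = [
    Pseudogene("AL353616.1", "DSE", "suppressor", +1),
    Pseudogene("ACTG1P16", "ACTG1", "promoter", +1),
    Pseudogene("RRM2P2", "RRM2", "promoter", +1),
    Pseudogene("BX842559.2", "ZPR1", "promoter", -1),
]

consistency = {p.name: is_consistent(p) for p in entries}
for p in entries:
    verdict = "consistent" if consistency[p.name] else "contradictory"
    print(f"{p.name:11s} (parent {p.parent}, {p.parent_role}): {verdict}")
```

Under this rule AL353616.1 and BX842559.2 agree with the ceRNA hypothesis, while ACTG1P16 and RRM2P2 do not, matching the assessment in the text.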
In Table 4, which describes the RFE selected genes for predicting gemcitabine response, the pseudogenes with evidence for function in cancer that had positive coefficients were RPS14P4 and AP000902.1, and the pseudogenes with negative coefficients were KRT8P32, KRT18P51 and ACTBP14. RPS14P4 and AP000902.1 are pseudogenes for the ribosomal protein encoding genes RPS14 and RPL27A, respectively. Both of these functional genes are known tumour suppressors 105–107. The interaction of the pseudogene transcripts as ceRNA may explain their positive coefficients. The keratin genes KRT8 and KRT18 are the functional copies of pseudogenes KRT8P32 and KRT18P51. Both keratin genes are well studied, and their expression is known to be a biomarker for poor prognosis 108–112. Both pseudogenes have a negative coefficient, which supports the findings in the literature. ACTB is the functional gene for pseudogene ACTBP14. The gene is known to be upregulated in many cancers and is associated with metastasis 113. This corroborates the negative coefficient of ACTBP14.

Other pseudogenes were selected by RFE with large-magnitude coefficients but are not described here because no cancer-related function has been described for them in the literature. Those pseudogenes may be novel biomarkers for treatment response or prognosis. It is possible that the pseudogenes in this section are not functional in cancer. It may be that pseudogene expression is highly dysregulated in advanced cancers, which causes associations with prognosis that the SVMs are learning. It is known that general transcriptome dysregulation is associated with aggressive tumours 114. More studies are needed to determine how pseudogenes may be regulating transcriptomes.

Chapter 3: Conclusion

Since the era of genome sequencing, we have understood that the genome is essentially a data system. However, technology has not always allowed us to query the genome as data.
This work demonstrates the power of data-driven approaches to understanding clinical genomics datasets. With the proliferation of high-throughput sequencing and the growth of large clinical databases, the work of finding genomic patterns in diseases has shifted from the manual labour of wet lab science to the powerful automation of data science. As shown in the discussion section, the data mining and machine learning approaches used in this work provide evidence for many new hypotheses for cancer prognosis. The RFE method was able to profoundly reduce the set of genes in this clinical dataset to a size that could be searched for meaningful patterns, and the combination of RFE for feature selection with SVMs for classification has proven to be a powerful and interpretable machine learning pipeline. This work set out to aid in finding optimal treatments for cancer patients. It is not clear whether the SVMs in this work are learning to find biomarkers for treatment response or just for prognosis. The genes used in the SVMs did not have evidence in the literature for being biomarkers for treatment, but there was evidence that the genes were biomarkers for prognosis. This leaves open the question of whether machine learning methods can help determine optimal treatments for cancer patients. Perhaps a more stringent feature selection method could be used that focusses on finding treatment biomarkers. Nevertheless, this work found many previously unstudied genes that may be important for cancer prognosis. The primary limitation of this study is the sample size. More robust assessments require many more patients.

Looking to the future, it is clear that the effort toward personalized medicine through cancer genomics is a fruitful endeavour. This work sits alongside many other data-driven and machine learning-based projects that demonstrate the potential utility of large clinical sequencing projects.
As these sequencing projects grow in number and as sequencing technologies and machine learning methods improve, so will the data-driven approaches for understanding cancer. The future of personalized cancer medicine lies in a Kuhnian paradigm shift. Since cancer is a disease of the genome and the genome is a data system, cancer requires data-driven approaches to be fully understood.

Bibliography

1. Government of Canada, S. C. Deaths and age-specific mortality rates, by selected grouped causes. https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310039201 (2018).
2. Brenner, D. R. et al. Projected estimates of cancer in Canada in 2020. CMAJ 192, E199–E205 (2020).
3. Kocarnik, J. Cancer’s global epidemiological transition and growth. The Lancet 395, 757–758 (2020).
4. Dagenais, G. R. et al. Variations in common diseases, hospital admissions, and deaths in middle-aged adults in 21 countries from five continents (PURE): a prospective cohort study. The Lancet 395, 785–794 (2020).
5. Seyfried, T. N. & Huysentruyt, L. C. On the Origin of Cancer Metastasis. Crit. Rev. Oncog. 18, 43–73 (2013).
6. Dillekås, H., Rogers, M. S. & Straume, O. Are 90% of deaths from cancer caused by metastases? Cancer Med. 8, 5574–5576 (2019).
7. Chaffer, C. L. & Weinberg, R. A. A Perspective on Cancer Cell Metastasis. Science 331, 1559–1564 (2011).
8. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A. 74, 5463–5467 (1977).
9. Waterston, R. & Sulston, J. The genome of Caenorhabditis elegans. Proc. Natl. Acad. Sci. U. S. A. 92, 10836–10840 (1995).
10. Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
11. Venter, J. C. et al. The Sequence of the Human Genome. Science 291, 1304–1351 (2001).
12. Dames, S., Durtschi, J., Geiersbach, K., Stephens, J. & Voelkerding, K. V.
Comparison of the Illumina Genome Analyzer and Roche 454 GS FLX for Resequencing of Hypertrophic Cardiomyopathy-Associated Genes. J. Biomol. Tech. JBT 21, 73–80 (2010).
13. Sequencing giant Illumina scraps $1.2 billion PacBio acquisition. San Francisco Business Times https://www.bizjournals.com/sanfrancisco/news/2020/01/03/sequencing-giant-illumina-scraps-1-2-billion.html.
14. Treangen, T. J. & Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2011).
15. Jurka, J., Kapitonov, V. V., Kohany, O. & Jurka, M. V. Repetitive sequences in complex genomes: structure and evolution. Annu. Rev. Genomics Hum. Genet. 8, 241–259 (2007).
16. Third generation sequencing: technology and its potential impact on evolutionary biodiversity research: Systematics and Biodiversity: Vol 14, No 1. https://www.tandfonline.com/doi/abs/10.1080/14772000.2015.1099575?journalCode=tsab20.
17. Mikheyev, A. S. & Tin, M. M. Y. A first look at the Oxford Nanopore MinION sequencer. Mol. Ecol. Resour. 14, 1097–1102 (2014).
18. Introducing the Sequel System: The Scalable Platform for SMRT Sequencing. PacBio https://www.pacb.com/blog/introducing-the-sequel-system-the-scalable-platform-for-smrt-sequencing/ (2015).
19. Mantere, T., Kersten, S. & Hoischen, A. Long-Read Sequencing Emerging in Medical Genetics. Front. Genet. 10, (2019).
20. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
21. Clark, T. A., Sugnet, C. W. & Ares, M. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296, 907–910 (2002).
22. Kodzius, R. et al. CAGE: cap analysis of gene expression. Nat. Methods 3, 211–222 (2006).
23. Okoniewski, M. J. & Miller, C. J. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 7, 276 (2006).
24.
Pease, J. & Sooknanan, R. A rapid, directional RNA-seq library preparation workflow for Illumina® sequencing. Nat. Methods 9, i–ii (2012).
25. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinforma. Oxf. Engl. 29, 15–21 (2013).
26. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
27. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
28. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
29. Mehta, N. & Devarakonda, M. V. Machine learning, natural language programming, and electronic health records: The next step in the artificial intelligence journey? J. Allergy Clin. Immunol. 141, 2019-2021.e1 (2018).
30. Dada, E. G. et al. Machine learning for email spam filtering: review, approaches and open research problems. Heliyon 5, e01802 (2019).
31. [1604.07316] End to End Learning for Self-Driving Cars. https://arxiv.org/abs/1604.07316.
32. Libbrecht, M. W. & Noble, W. S. Machine learning in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
33. Day, N., Hemmaplardh, A., Thurman, R. E., Stamatoyannopoulos, J. A. & Noble, W. S. Unsupervised segmentation of continuous genomic data. Bioinforma. Oxf. Engl. 23, 1424–1426 (2007).
34. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
35. Kohavi, R. & Wolpert, D. Bias plus variance decomposition for zero-one loss functions. in Proceedings of the Thirteenth International Conference on International Conference on Machine Learning 275–283 (Morgan Kaufmann Publishers Inc., 1996).
36. Pattern Recognition - 4th Edition. https://www.elsevier.com/books/pattern-recognition/theodoridis/978-1-59749-272-0.
37. Clarke, R. et al.
The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer 8, 37–49 (2008).
38. Huang, S. et al. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genomics Proteomics 15, 41–51 (2017).
39. Schwaederle, M. et al. Impact of Precision Medicine in Diverse Cancers: A Meta-Analysis of Phase II Clinical Trials. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 33, 3817–3825 (2015).
40. Huang, C., Mezencev, R., McDonald, J. F. & Vannberg, F. Open source machine-learning algorithms for the prediction of optimal cancer drug therapies. PLOS ONE 12, e0186906 (2017).
41. Huang, C. et al. Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci. Rep. 8, 16444 (2018).
42. Xia, F. et al. Predicting tumor cell line response to drug pairs with deep learning. BMC Bioinformatics 19, 486 (2018).
43. Adam, G. et al. Machine learning approaches to drug response prediction: challenges and recent progress. Npj Precis. Oncol. 4, 1–10 (2020).
44. Pleasance, E. et al. Pan-cancer analysis of advanced patient tumors reveals interactions between therapy and genomic landscapes. Nat. Cancer 1, 452–468 (2020).
45. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
46. Blumenthal, G. M. et al. Analysis of time-to-treatment discontinuation of targeted therapy, immunotherapy, and chemotherapy in clinical trials of patients with non-small-cell lung cancer. Ann. Oncol. 30, 830–838 (2019).
47. Huang, B. et al. Evaluating Treatment Effect Based on Duration of Response for a Comparative Oncology Study. JAMA Oncol. 4, 874–876 (2018).
48. Eisenhauer, E. A. et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur. J. Cancer Oxf. Engl. 1990 45, 228–247 (2009).
49. Croissant, Y. & Millo, G.
Panel Data Econometrics in R: The plm Package. J. Stat. Softw. 27, 1–43 (2008).
50. Zeileis, A. & Hothorn, T. Diagnostic Checking in Regression Relationships. 5.
51. Dittman, D., Khoshgoftaar, T. M., Wald, R. & Wang, H. Stability Analysis of Feature Ranking Techniques on Biological Datasets. in 2011 IEEE International Conference on Bioinformatics and Biomedicine 252–256 (2011). doi:10.1109/BIBM.2011.84.
52. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
53. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: A Library for Large Linear Classification. 31.
54. Lever, J. et al. Text-mining clinically relevant cancer biomarkers for curation into the CIViC database. Genome Med. 11, 78 (2019).
55. Nakamura, M. et al. Transforming growth factor-β-stimulated clone-22 is a negative-feedback regulator of Ras / Raf signaling: Implications for tumorigenesis. Cancer Sci. 103, 26–33 (2012).
56. Chang, Y.-C. et al. Enrichment of Aldolase C Correlates with Low Non-Mutated IDH1 Expression and Predicts a Favorable Prognosis in Glioblastomas. Cancers 11, (2019).
57. Li, Y. et al. DERL3 functions as a tumor suppressor in gastric cancer. Comput. Biol. Chem. 84, 107172 (2020).
58. Villalba, M. et al. TMPRSS4: A Novel Tumor Prognostic Indicator for the Stratification of Stage IA Tumors and a Liquid Biopsy Biomarker for NSCLC Patients. J. Clin. Med. 8, (2019).
59. Yang, Z. et al. SOX11: friend or foe in tumor prevention and carcinogenesis? Ther. Adv. Med. Oncol. 11, 1758835919853449 (2019).
60. Kurz, S. et al. The anti-tumorigenic activity of A2M—A lesson from the naked mole-rat. PLoS ONE 12, (2017).
61. Wang, X. et al. Reduced expression of tocopherol-associated protein (TAP/Sec14L2) in human breast cancer. Cancer Invest. 27, 971–977 (2009).
62. Han, X., Huang, T. & Han, J.
Long noncoding RNA VPS9D1-AS1 augments the malignant phenotype of non-small cell lung cancer by sponging microRNA-532-3p and thereby enhancing HMGA2 expression. Aging 12, 370–386 (2020).
63. Zhang, X. et al. Long non-coding RNA UFC1 promotes gastric cancer progression by regulating miR-498/Lin28b. J. Exp. Clin. Cancer Res. CR 37, 134 (2018).
64. Onesto, C. et al. Vascular endothelial growth factor-A and Poly(A) binding protein-interacting protein 2 expression in human head and neck carcinomas: correlation and prognostic significance. Br. J. Cancer 94, 1516–1523 (2006).
65. Huser, C. et al. TSC22 in mammary gland development and breast cancer. Breast Cancer Res. BCR 10, P17 (2008).
66. Hömig-Hölzel, C. et al. Antagonistic TSC22D1 variants control BRAFE600-induced senescence. EMBO J. 30, 1753–1765 (2011).
67. Iida, M., Anna, C. H., Gaskin, N. D., Walker, N. J. & Devereux, T. R. The Putative Tumor Suppressor Tsc-22 is Downregulated Early in Chemically Induced Hepatocarcinogenesis and may be a Suppressor of Gadd45b. Toxicol. Sci. 99, 43–50 (2007).
68. Schmitt, A. M. & Chang, H. Y. Long Noncoding RNAs in Cancer Pathways. Cancer Cell 29, 452–463 (2016).
69. Emerging roles of lncRNA in cancer and therapeutic opportunities. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6682721/.
70. Oda, Y. et al. Derlin-2 and Derlin-3 are regulated by the mammalian unfolded protein response and are required for ER-associated degradation. J. Cell Biol. 172, 383–393 (2006).
71. Sarkar, A. & Hochedlinger, K. The Sox Family of Transcription Factors: Versatile Regulators of Stem and Progenitor Cell Fate. Cell Stem Cell 12, 15–30 (2013).
72. Shepherd, J. H. et al. The SOX11 transcription factor is a critical regulator of basal-like breast cancer growth, invasion, and basal-like gene expression. Oncotarget 7, 13106–13121 (2016).
73. Wasik, A. M. et al. SOXC transcription factors in mantle cell lymphoma: the role of promoter methylation in SOX11 expression. Sci. Rep. 3, 1400 (2013).
74. Qu, Y.
et al. The metastasis suppressor SOX11 is an independent prognostic factor for improved survival in gastric cancer. Int. J. Oncol. 44, 1512–1520 (2014).
75. Wang, C.-H. et al. TMPRSS4 facilitates epithelial-mesenchymal transition of hepatocellular carcinoma and is a predictive marker for poor prognosis of patients after curative resection. Sci. Rep. 5, 12366 (2015).
76. Namani, A., Zheng, Z., Wang, X. J. & Tang, X. Systematic Identification of Multi Omics-based Biomarkers in KEAP1 Mutated TCGA Lung Adenocarcinoma. J. Cancer 10, 6813–6821 (2019).
77. Deiss, L. P., Feinstein, E., Berissi, H., Cohen, O. & Kimchi, A. Identification of a novel serine/threonine kinase and a novel 15-kD protein as potential mediators of the gamma interferon-induced cell death. Genes Dev. 9, 15–30 (1995).
78. Lindner, I. et al. Alpha2-macroglobulin inhibits the malignant properties of astrocytoma cells by impeding beta-catenin signaling. Cancer Res. 70, 277–287 (2010).
79. Zang, X. et al. Exosome-transmitted lncRNA UFC1 promotes non-small-cell lung cancer progression by EZH2-mediated epigenetic silencing of PTEN expression. Cell Death Dis. 11, 1–13 (2020).
80. Xie, R. et al. Long Non-Coding RNA (LncRNA) UFC1/miR-34a Contributes to Proliferation and Migration in Breast Cancer. Med. Sci. Monit. Int. Med. J. Exp. Clin. Res. 25, 7149–7157 (2019).
81. Chen, M. et al. Decreased expression of lncRNA VPS9D1-AS1 in gastric cancer and its clinical significance. Cancer Biomark. Sect. Dis. Markers 21, 23–28 (2017).
82. Khaleghpour, K. et al. Translational Repression by a Novel Partner of Human Poly(A) Binding Protein, Paip2. Mol. Cell 7, 205–216 (2001).
83. Hu, J. et al. TMEM166/EVA1A interacts with ATG16L1 and induces autophagosome formation and cell death. Cell Death Dis. 7, e2323 (2016).
84. Li, H. et al. Cell death-inducing DFF45-like effector b (Cideb) is present in pancreatic beta-cells and involved in palmitate induced beta-cell apoptosis. Diabetes Metab. Res. Rev. 28, 145–155 (2012).
85.
Dhillon, A. S., Hagan, S., Rath, O. & Kolch, W. MAP kinase signalling pathways in cancer. Oncogene 26, 3279–3290 (2007).
86. Tian, T., Li, X. & Zhang, J. mTOR Signaling in Cancer and mTOR Inhibitors in Solid Tumor Targeting Therapy. Int. J. Mol. Sci. 20, (2019).
87. Kwon, S.-J. et al. Role of MEK partner-1 in cancer stemness through MEK/ERK pathway in cancerous neural stem cells, expressing EGFRviii. Mol. Cancer 16, 140 (2017).
88. HBXIP Over Expression as an Independent Biomarker for Cervical Cancer - PubMed. https://pubmed.ncbi.nlm.nih.gov/28093193/.
89. Balakirev, E. S. & Ayala, F. J. Pseudogenes: are they ‘junk’ or functional DNA? Annu. Rev. Genet. 37, 123–151 (2003).
90. Poliseno, L. et al. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465, 1033–1038 (2010).
91. Poliseno, L. Pseudogenes: newly discovered players in human cancer. Sci. Signal. 5, re5 (2012).
92. Poliseno, L., Marranci, A. & Pandolfi, P. P. Pseudogenes in Human Cancer. Front. Med. 2, 68 (2015).
93. Karreth, F. A. et al. The BRAF pseudogene functions as a competitive endogenous RNA and induces lymphoma in vivo. Cell 161, 319–332 (2015).
94. Liao, W.-C. et al. DSE regulates the malignant characters of hepatocellular carcinoma cells by modulating CCL5/CCR1 axis. Am. J. Cancer Res. 9, 347–362 (2019).
95. Gao, B., Li, S., Tan, Z., Ma, L. & Liu, J. ACTG1 and TLR3 are biomarkers for alcohol-associated hepatocellular carcinoma. Oncol. Lett. 17, 1714–1722 (2019).
96. Dong, X., Han, Y., Sun, Z. & Xu, J. Actin Gamma 1, a new skin cancer pathogenic gene, identified by the biological feature-based classification. J. Cell. Biochem. 119, 1406–1419 (2018).
97. Fatkhutdinov, N. et al. Targeting RRM2 and Mutant BRAF Is a Novel Combinatorial Strategy for Melanoma. Mol. Cancer Res. MCR 14, 767–775 (2016).
98. Mazzu, Y. Z. et al.
A novel mechanism driving poor-prognosis prostate cancer: overexpression of the DNA repair gene, ribonucleotide reductase small subunit M2 (RRM2). Clin. Cancer Res. (2019) doi:10.1158/1078-0432.CCR-18-4046.
99. Zhang, K. et al. Overexpression of RRM2 decreases thrombspondin-1 and increases VEGF production in human cancer cells in vitro and in vivo: implication of RRM2 in angiogenesis. Mol. Cancer 8, 11 (2009).
100. Zhang, M., Wang, J., Yao, R. & Wang, L. Small interfering RNA (siRNA)-mediated silencing of the M2 subunit of ribonucleotide reductase: a novel therapeutic strategy in ovarian cancer. Int. J. Gynecol. Cancer Off. J. Int. Gynecol. Cancer Soc. 23, 659–666 (2013).
101. Xue, T. et al. SiRNA-Mediated RRM2 Gene Silencing Combined with Cisplatin in the Treatment of Epithelial Ovarian Cancer In Vivo: An Experimental Study of Nude Mice. Int. J. Med. Sci. 16, 1510–1516 (2019).
102. Krajewski, A. et al. Cyclin F is involved in response to cisplatin treatment in melanoma cell lines. Oncol. Rep. 43, 765–772 (2020).
103. Su, Y.-F. et al. The Expression of Ribonucleotide Reductase M2 in the Carcinogenesis of Uterine Cervix and Its Relationship with Clinicopathological Characteristics and Prognosis of Cancer Patients. PLOS ONE 9, e91644 (2014).
104. Liu, B. et al. ZNF259 promotes breast cancer cells invasion and migration via ERK/GSK3β/snail signaling. Cancer Manag. Res. 10, 3159–3168 (2018).
105. Alkhatabi, H. A. et al. RPL27A is a target of miR-595 and may contribute to the myelodysplastic phenotype through ribosomal dysgenesis. Oncotarget 7, 47875–47890 (2016).
106. Virgilio, M., Pietka, G. & Payne, E. M. Ribosomal Proteins Rps19 and Rps14 Cooperate As Tumor Suppressor Genes with p53. Blood 124, 2943–2943 (2014).
107. Zhou, X., Hao, Q., Liao, J.-M., Liao, P. & Lu, H. Ribosomal protein S14 negatively regulates c-Myc activity. J. Biol. Chem. 288, 21793–21801 (2013).
108. Tan, H.-S. et al.
KRT8 upregulation promotes tumor metastasis and is predictive of a poor prognosis in clear cell renal cell carcinoma. Oncotarget 8, 76189–76203 (2017).
109. Xie, L. et al. High KRT8 Expression Independently Predicts Poor Prognosis for Lung Adenocarcinoma Patients. Genes 10, (2019).
110. Zhang, J., Hu, S. & Li, Y. KRT18 is correlated with the malignant status and acts as an oncogene in colorectal cancer. Biosci. Rep. 39, (2019).
111. Wang, D. et al. Knockdown of cytokeratin 8 overcomes chemoresistance of chordoma cells by aggravating endoplasmic reticulum stress through PERK/eIF2α arm of unfolded protein response and blocking autophagy. Cell Death Dis. 10, 887 (2019).
112. Park, B., Lee, W., Park, I. & Han, K. Finding prognostic gene pairs for cancer from patient-specific gene networks. BMC Med. Genomics 12, 179 (2019).
113. Guo, C., Liu, S., Wang, J., Sun, M.-Z. & Greenaway, F. T. ACTB in cancer. Clin. Chim. Acta 417, 39–44 (2013).
114. Ali, H. E. A. et al. Dysregulated gene expression predicts tumor aggressiveness in African-American prostate cancer patients. Sci. Rep. 8, 16335 (2018).
