Open Collections

UBC Theses and Dissertations

A multi-task machine learning pipeline for the classification and analysis of cancers from gene expression… Disyak, Michael 2021

Full Text

A MULTI-TASK MACHINE LEARNING PIPELINE FOR THE CLASSIFICATION AND ANALYSIS OF CANCERS FROM GENE EXPRESSION DATA

by

Michael Disyak

B.Sc., Trent University, 2016
B.Sc., Brock University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

February 2021

© Michael Disyak, 2021

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled:

A Multi-task Machine Learning Pipeline for the Classification and Analysis of Cancers from Gene Expression Data

submitted by Michael Disyak in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Dr. Steven Jones, Professor, Department of Medical Genetics, UBC
Supervisor

Dr. Inanc Birol, Professor, Department of Medical Genetics, UBC
Supervisory Committee Member

Dr. Andrew Roth, Assistant Professor, Department of Molecular Oncology, UBC
Supervisory Committee Member

Abstract

The work contained within this thesis sought to accurately classify 55 primary cancer subtypes, 20 metastatic cancer subtypes, and 16 normal tissues using gene expression data. The classification was done using a multiple learning task approach in which an artificial neural network model makes four distinct classifications at varying levels of biological hierarchy for each input sample. These learning tasks were the organ system of origin, the disease state, the cancer type, and the cancer subtype.
The model achieved classification performance ranging from a macro F1-score of 0.987 within the disease state learning task to 0.831 within the cancer subtype learning task on a test set composed of primary cancer, metastatic cancer, and normal tissue samples.

Having shown good classification performance of the model, the second part of the thesis focused on leveraging what the model has learned to extract biological information about the various cancers present in the data set. A backpropagation-based tool called DeepLift was used to generate a list of importance scores for each gene within every class of each learning task. The list of scores was then analyzed for trends that could be utilized to infer biological insight about specific cancer types and subtypes, and between primary and metastatic cancers as individual groups. The lists provide a means to functionally annotate enriched pathways and to quantify and compare the role of RNA genes and pseudogenes across various classes and learning tasks. Some of the results output by DeepLift were validated for their biological relevance by presenting supporting evidence from relevant scientific literature. The ultimate product of this thesis research is a tool with which one can quantify the role of a variety of genes within cancers spanning both primary and metastatic cancer types. Further analysis of the output generated by the tool could provide a better understanding of the role of genetic expression, including RNA and pseudogenes, within a variety of different cancers.

Lay Summary

The purpose of this thesis work was to leverage machine learning to learn about a variety of cancers from their gene expression data. A machine learning model was created that was able to accurately classify a variety of cancers. Once the model was validated for sufficient accuracy and performance, a second tool was utilized to determine the importance of every gene used by the model in determining the classification for each type of cancer.
By examining which genes were indicated as important, along with their relative rankings, the role of different types of genes and their functions in cancer was investigated. The significance of the genes identified was supported by relevant scientific literature. The combination of tools utilized in this thesis, and the output they produce, was established as a source of data with which we can improve our understanding of cancer biology.

Preface

This thesis work was conducted under the supervision of Dr. Steven Jones at Canada’s Michael Smith Genome Sciences Centre. No explicit ethics approval was required or received for this thesis work. However, this work utilizes data from the Personalized OncoGenomics (POG) project which was approved by and conducted under the University of British Columbia – British Columbia Cancer Agency Research Ethics Board (H12-00137, H14-00681), and approved by the institutional review board (IRB). The POG program whose data is used herein is registered under clinical trial number NCT02155621. Patients involved in the POG program have given consent for tumour profiling using RNASeq as well as whole-genome sequencing.

The thesis approach was designed by me with inspiration for the idea coming from Jasleen Grewal. I conducted all of the experiments contained herein myself. All of the external code libraries used to generate this thesis work and all of the data sources have been referenced appropriately. Where no reference is given, the work is all my own.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
1 Introduction
  1.1 Background of Cancer
    1.1.1 The Role of Gene Expression in Cancer
  1.2 Background of Genetic Sequencing
    1.2.1 RNA Sequencing
  1.3 The Role of RNA Genes in Cancer
  1.4 The Role of Pseudogenes in Cancer
  1.5 Machine Learning
    1.5.1 Supervised Learning
    1.5.2 Neural Networks
    1.5.3 Gradient Descent
    1.5.4 Learning Rate
    1.5.5 Neuron
    1.5.6 Activation Functions
    1.5.7 Loss Functions
    1.5.8 Backpropagation
    1.5.9 Initialization
    1.5.10 Over-fitting
    1.5.11 Early Stopping and Patience
    1.5.12 Dropout
  1.6 Deeplift
2 The Data and the Model
  2.1 The Data
    2.1.1 Training and Test Sets
    2.1.2 Data Preprocessing: Validation
    2.1.3 Data Preprocessing: Testing
  2.2 The Model
    2.2.1 Model Settings and Hyperparameters
    2.2.2 Evaluating the Effect of Multiple Tasks on Classification Performance
3 Classification of Cancers from Transcriptome Data
  3.1 Results: Mixed Held-Out Test Set
    3.1.1 Organ System of Origin
    3.1.2 Disease State
    3.1.3 Cancer Type
    3.1.4 Cancer Subtype
  3.2 Discussion: Held-out Test Set Classification
    3.2.1 Normal Tissue
    3.2.2 Complete Misclassifications
    3.2.3 Cancer Type and Subtype Performance Comparison by Disease State
  3.3 Results: Metastatic-Only External (POG) Test Set
    3.3.1 Organ System of Origin
    3.3.2 Disease State
    3.3.3 Cancer Type
    3.3.4 Cancer Subtype
  3.4 Discussion: POG Test Set Classification
  3.5 Discussion: Summary
4 Deeplift Analysis
  4.1 Methods
    4.1.1 DeepLift
    4.1.2 Interpreting Gene Lists
    4.1.3 Over and Underexpression Calculation
  4.2 Results and Discussion
    4.2.1 Validation of Results: Normal Tissues
    4.2.2 Number of Important Genes
    4.2.3 Expression Levels
    4.2.4 Enriched Pathways: Metastatic Cancer Disease State
    4.2.5 Enriched Pathways: Primary Cancer in the Disease State Task
    4.2.6 RNA Genes
    4.2.7 Pseudogenes
  4.3 The Implications of Batch Effect
    4.3.1 Batch Effect Implications on the Interpretation of Metastatic Cancers
    4.3.2 Batch Effect Implications on the Interpretation of Primary Cancers
    4.3.3 Batch Effect Conclusion
  4.4 Summary
5 Conclusion
  5.1 Summary of Findings
  5.2 Future Work
Bibliography

List of Tables

2.1 List of data sources for primary, normal, and metastatic data
2.2 Number and composition of classes for each classification task
2.3 Organ system of origin classes and frequencies within the full set of preprocessed data (including both train and test data)
2.4 Tissue type classes and frequencies within the full set of preprocessed data (including both train and test data)
2.5 Cancer type class abbreviations and frequency within the full set of preprocessed data (including both train and test data)
2.6 Cancer subtype class abbreviations and frequency within the full set of preprocessed data (including both train and test data)
2.7 Organ system of origin classes and frequencies within the POG dataset
2.8 Cancer type class abbreviations and frequency within the POG dataset
2.9 Cancer subtype class abbreviations and frequency within the POG dataset
2.10 Hyperparameter settings
3.2 The precision, recall, F1-score, and support for each organ system of origin class with testing conducted using the mixed held-out test set.
3.4 The precision, recall, F1-score, and support for each disease state class with testing conducted using the mixed held-out test set.
3.6 The precision, recall, F1-score, and support for each cancer type class with testing conducted using the mixed held-out test set.
3.8 The precision, recall, F1-score, and support for each cancer subtype class with testing conducted using the mixed held-out test set.
3.10 The precision, recall, F1-score, and support for each organ system of origin class with testing conducted using the metastatic-only external (POG) test set.
3.12 The precision, recall, F1-score, and support for each disease state class with testing conducted using the metastatic-only external (POG) test set.
3.14 The precision, recall, F1-score, and support for each cancer type class with testing conducted using the metastatic-only external (POG) test set.
3.16 The precision, recall, F1-score, and support for each disease state class with testing conducted using the metastatic-only external (POG) test set.
4.2 A table listing the number of positive important genes identified by DeepLift for the organ system of origin classes along with how many of those genes are over and underexpressed.
4.4 A table listing the number of positive important genes identified by DeepLift for the disease state classes along with how many of those genes are over and underexpressed.
4.6 A table listing the number of positive important genes identified by DeepLift for the cancer type classes along with how many of those genes are over and underexpressed.
4.8 A table listing the number of positive important genes identified by DeepLift for the cancer subtype classes along with how many of those genes are over and underexpressed.
4.9 List of RNA genes found by the model that are also implicated in Medulloblastoma
4.10 List of cancer types and the non-TCGA data cohorts from which they came.
4.11 List of DLBC cancer subtypes and the data cohorts from which they came.

List of Figures

1.1 A diagram depicting the basic structure and layering of a feed-forward neural network model. This figure was taken from the web [45].
1.2 A plot depicting the bias-variance trade-off, the U-shaped generalization error curve, the optimal capacity, under-fitting, and over-fitting zones. Note: This figure was taken from Goodfellow et al. (2016) [42].
1.3 A plot of the hyperbolic tangent function. Note: This figure was taken from MathWorld [48].
1.4 (a) A standard fully-connected neural network without dropout. (b) A sub-network created by dropping out some of the connections in the standard neural network. Note: This figure was adapted from Wang et al. (2018) [54].
2.1 High level diagram of the multi-task neural network
2.2 The macro F1-scores of various models using validation sets containing both primary and metastatic samples from different organ systems of origin.
2.3 The macro F1-scores of various models on validation sets containing both primary and metastatic samples at the disease state classification level
2.4 The macro F1-scores of various models on validation set data containing both primary and metastatic cancer type samples.
2.5 The macro F1-scores of various models on validation set data containing both primary and metastatic cancer subtype samples.
3.1 The macro F1-scores of each organ system of origin when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.2 A confusion matrix depicting the organ system of origin classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.
3.3 The macro F1-scores of each disease state when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.4 A confusion matrix depicting the disease state classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.
3.5 The macro F1-scores comparing the classification performance of cancer type and cancer subtype samples broken down by disease state as tested on the held-out mixed test set.
3.6 The macro F1-scores of each cancer type when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.7 A confusion matrix depicting the cancer type classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.
3.8 The macro F1-scores of each cancer subtype when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.9 A confusion matrix depicting the cancer subtype classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.
3.10 The macro F1-scores of each organ system of origin when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.11 A confusion matrix depicting the organ system of origin classification performance on the metastatic-only external (POG) test set.
3.12 The macro F1-scores of each disease state when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.13 A confusion matrix depicting the disease state classification performance on the metastatic-only external (POG) test set.
3.14 The macro F1-scores of each cancer type when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.15 A confusion matrix depicting the cancer type classification performance on the metastatic-only external (POG) test set.
3.16 The macro F1-scores of each cancer subtype when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.
3.17 A confusion matrix depicting the disease state classification performance on the metastatic-only external (POG) test set.
4.1 A screen capture of the top 10 functional annotations (ordered by descending p-value) as determined by the DAVID functional annotation tool using the important positive genes for the normal thyroid tissue class within the cancer type classification task.
4.2 A screen capture of the top 14 functional annotations (ordered by descending p-value) as determined by the DAVID functional annotation tool using the important positive genes for the normal thyroid tissue class within the cancer type classification task.
4.3 A plot showing the number of important positive genes for each class within the organ system of origin classification task in blue and the F1-score of each class in red.
4.4 A plot showing the number of important positive genes for each class within the disease state classification task.
4.5 A plot showing the number of important positive genes for each class within the cancer type classification task.
4.6 A plot showing the number of important positive genes for each class within the cancer subtype classification task.
4.7 A stacked bar plot showing the number of important positive genes and the number of over and underexpressed genes for each class within the organ system of origin classification task.
4.8 A stacked bar plot showing the number of important positive genes and the number of over and underexpressed genes for each class within the disease state classification task.
4.9 A stacked bar chart showing the number of important positive genes and the number of over and underexpressed genes for each class within the cancer type classification task.
4.10 A stacked bar chart showing the number of important positive genes and the number of over and underexpressed genes for each class within the cancer subtype classification task.
4.11 A screen capture of the top 10 functional annotations (ordered by descending p-value) as determined by the DAVID functional annotation tool using the important positive genes for the metastatic class within the disease state classification task.
4.12 A screen capture of the top 10 functional annotations (ordered by descending p-value) as determined by the DAVID functional annotation tool using the top 25% of important positive genes for the primary class within the disease state classification task.
4.13 A screen capture of the top 3 functional annotation clusters (ordered by descending enrichment score) as determined by the DAVID functional annotation cluster tool using the top 25% of important positive genes for the primary class within the disease state classification task.
4.14 A scatter plot showing the proportion of RNA genes within the positive important genes identified for the organ system of origin classes.
4.15 A scatter plot showing the proportion of RNA genes within the positive important genes identified for the disease state classes.
4.16 A scatter plot showing the proportion of RNA genes within the positive important genes identified for the cancer type classes.
4.17 A scatter plot showing the proportion of RNA genes (black) within the positive important genes identified by DeepLift and the corresponding F1 classification scores (red) for primary cancer types whose proportions were greater than 0.06
4.18 A scatter plot showing the proportion of RNA genes within the positive important genes identified for the cancer subtype classes.
4.19 A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the organ system of origin task.
4.20 A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the disease state task.
4.21 A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the cancer type task.
4.22 A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the cancer subtype task.
4.23 A t-SNE plot of the transcriptome data for the full training data set coloured by data cohort.
4.24 A t-SNE plot of the transcriptome data for the metastatic cancer types from the training data set coloured by cancer type.
4.25 A t-SNE plot of the transcriptome data for the primary cancer types from the training data set coloured by cancer type.
4.26 A t-SNE plot of the transcriptome data for the DLBC cancer type coloured by data cohort.

List of Abbreviations

ANN - artificial neural network
BC - British Columbia
CAGE - cap analysis of gene expression
cDNA - complementary deoxyribonucleic acid
DNA - deoxyribonucleic acid
EMT - epithelial–mesenchymal transition
FPKM - fragments per kilobase million
GPH - Genomics Precision Health Database
GSC - Canada’s Michael Smith Genome Sciences Centre at BC Cancer
GTEx - Genotype-Tissue Expression Project
IG - immunoglobulin
KEGG - Kyoto Encyclopedia of Genes and Genomes
lncRNA - long non-coding RNA
miRNA - microRNA
ML - machine learning
MLP - multilayer perceptron
mRNA - messenger ribonucleic acid
MRP - mitoribosomal proteins
NCI - National Cancer Institute
ncRNA - non-coding RNA
NGS - next-generation sequencing
NIH - National Institutes of Health
NN - neural network
POG - Personalized OncoGenomics (POG) program at BC Cancer
RNA - ribonucleic acid
RNA-seq - ribonucleic acid sequencing
RPKM - reads per kilobase million
SAGE - serial analysis of gene expression
SGD - stochastic gradient descent
TARGET - Therapeutically Applicable Research to Generate Effective Treatments project
TCGA - The Cancer Genome Atlas
TFRI - Terry Fox Research Institute
TPM - transcripts per kilobase million

Acknowledgements

I want to acknowledge all of the great people at Canada’s Michael Smith Genome Sciences Centre, with a particular emphasis on my supervisor Dr. Steven Jones and my fellow lab mates. Thank you for the support and for answering all of my random questions. Without you all I would not have been able to complete this thesis. Karen Mungall deserves a shout-out here as well. Without her support, I would not have entered into the Master's program at UBC, and I would not have met all the great people at the GSC. Thank you Karen!

I also want to give special acknowledgement to Jasleen Grewal for helping me get this work off the ground. I hope you achieve all of your ambitions. You have certainly helped me reach mine. Thank you to Jean-Michel Garant for helping me with my RNA understanding and analysis.
I wish you the best of luck with your professional pursuits (and all those cats). Thank you to Luka Culibrk for your helpful questions. You sir, are a genius. Thank you to Emre Erhan for being so nice and connecting me with so many people, keep on crushing it. Finally, thank you to Jenny Yang for being my lab bestie. Good luck on your PhD journey!

Of course, my thanks also extend to my whole family but especially to Pepa, Pop, Sascha, and Opa. Thank you all for the support while I finally complete my school journey. Love you guys.

Last but not least, thank you to my supervisory committee for taking the time to help me achieve my goals. Your time and effort are much appreciated.

Chapter 1

Introduction

The purpose of this thesis is to create a machine learning tool that can aid in the diagnosis of cancers from gene expression (transcriptome) data and subsequently leverage this tool to better characterize the underlying biology. The approach taken here trains a single machine learning model on multiple learning tasks. The four learning tasks selected for the machine learning model represent a biological hierarchy that may help the model to better classify cancers. If a model can be trained to understand the features of cancers at the gene expression level, we can then work to extract any insights the model has gleaned. Ultimately, the goal of this work is to characterise and quantify (where possible) the role of gene expression in a variety of cancers.

The focus of this research can be summarized by the three goals below:

1. Create a multi-task neural network model to accurately classify four categories of biological interest from gene expression data:
   • Organ System of Origin
   • Disease State: primary cancer, metastatic cancer, or normal tissue
   • Cancer Type
   • Cancer Subtype
2. Identify and extract the genes utilized by the model to determine the classification of each category
3. Utilize the identified genes to validate and infer biological information about cancer

The following sections will introduce some important background information to motivate this thesis work and place it in the current context of cancer research and machine learning.

1.1 Background of Cancer

Cancer is a group of diseases defined by the abnormal growth of cells [1]. This uncontrolled growth is often caused by acquired (somatic) or inherited (germline) genetic mutations that circumvent the normal cell life cycle, resulting in the formation of abnormal tissue growths (tumours). The abnormal growth is driven by mutations that inhibit cell growth suppressors, activate growth factors, and/or improve cell proliferation and motility [1]. Identifying mutations responsible for driving tumourigenesis (the creation of tumours) is the focus of many research endeavours world-wide, including this thesis.

Cancer is the second leading cause of death in Canada and the world [1, 2]. In 2018, it accounted for one sixth of all mortalities world-wide (9.5 million deaths) and there are 83,300 cancer-related deaths expected in Canada in 2020 [1, 2]. It is estimated that almost half of Canadians will be diagnosed with cancer at some point in their lives, and while the cancer mortality rates have decreased over the last 40 years, the overall number of new cases and fatalities has been increasing along with the average age of Canadians [2, 3]. Clearly, cancer remains a prominent health issue both in Canada and around the world. On-going cancer research should remain a prime focus for improving the health and life-span of human beings.

Cancer-related death is often the result of complications caused by metastasis. In fact, 67-90% of all cancer-related deaths are attributed to metastases (secondary tumours), which are defined by the spread of tumour cells beyond the primary site of origin into the surrounding tissues and/or to distant regions of the body [1, 4, 5].
Once a tumour has spread to a critical organ, such as the brain or lungs, unchecked growth ultimately results in organ failure and death. Given the prominent role metastasis plays in cancer deaths, we must prevent the formation and proliferation of metastatic cancers in order to reduce the impact of this disease.

For metastases to arise, the cells from the primary tumour must not only physically disseminate from the primary site, but must also adapt to the new micro-environment present at the secondary site [6, 7]. The ability to disseminate and adapt is a key feature of metastatic cancers. It can be postulated that there are genetic characteristics underlying these abilities. In order to properly identify the origin of these abilities and subsequently hinder them, we must be able to effectively characterize the underlying genetic causes of cancers [6]. Often there are multiple genetic factors influencing the ability of tumours to spread, and the primary site of origin can predispose some tumours to higher aggression and adaptability [6]. For example, tumours of the lungs often spread widely and rapidly, whereas tumours of the prostate and breast are typically much more docile and limited in secondary site proliferation [8, 9]. It is partly for this reason that identifying the site of origin is a key step in diagnosing cancer type and ultimately deciding on the most effective treatment protocol [6, 7].

Prior to the advent of genetic sequencing, we had no ability to directly characterize the genetics of cancer and thus relied solely on morphological and immunohistochemical analysis to determine a cancer's site of origin and type. This approach is problematic as the accuracy of diagnosis using these techniques can be less than ideal, particularly with metastatic cancers [10].
A meta-analysis by Anderson and Weiss in 2010 found that only 65% of metastatic cancers had their site of origin correctly identified through immunohistochemical analysis, compared to 82% with a mixture of primary and metastatic samples [10]. This is clear evidence for the need to improve our capacity to characterize cancers in new ways. In the era of genetic sequencing, we have the ability to look at the genome of different cancers and attempt to categorize and quantify the role of genetics in the cause and characteristics of various cancers. By leveraging machine learning tools, as exemplified in this thesis, we can aim to identify key genetic markers of cancers and ultimately work to improve diagnosis and treatment.

1.1.1 The Role of Gene Expression in Cancer

The human genome contains two major genomic regions: coding and non-coding regions [11]. Coding regions are areas of the genome comprised of genes whose nucleic acid sequence encodes the information necessary to build proteins. The level at which a protein coding gene is transcribed into RNA is said to be the expression level of that gene. The set of RNA transcripts generated by both coding and non-coding regions is collectively referred to as the transcriptome [12]. The expression levels of genes are a quantitative measure of the rate of transcription of each gene. The rate of transcription can have an impact on the amount of protein generated from the RNA produced by transcription. Proteins are pivotal to life, and the overabundance or unplanned absence of them can cause a myriad of problems including the formation of cancers [13]. For this reason, quantifying and analyzing the expression levels of genes is a valuable resource in the cancer research space.

Numerous studies report differential expression of genes as being a potential source of tumourigenesis [14, 15, 16, 17].
Over and underexpression of genes, particularly those with functionality linked to cell division, propagation, and apoptosis, are hallmarks of many cancers [16, 18, 19, 20]. This knowledge can be leveraged to detect susceptibility to and the characteristics of cancers [16, 17, 21]. Through categorization of gene expression patterns in cancer types, we can begin to work towards treating the causes and/or effects of differential expression. This thesis work is in part motivated by this goal. If we can detect novel patterns of genetic expression within cancers, we can provide more potential therapeutic targets.

1.2 Background of Genetic Sequencing

Genetic sequencing (DNA sequencing) refers to the determination of the sequence of nucleic acids within a given piece of DNA. The ultimate goal of sequencing is to rapidly and accurately determine the entire sequence of an organism's genome with the intention being to understand the location, composition, and function of all of its genomic regions.

The current state of genetic sequencing arrived as a result of two major breakthroughs. The first of which was the invention of the first-generation of sequencing technology called Sanger sequencing [22]. Sanger sequencing was created in 1977 and utilizes radio or dye-labelled chain terminating nucleotides in conjunction with DNA polymerase to grow fragments of the DNA of interest that incorporate the labelled nucleotides [22]. By capturing a large enough set of labelled fragments, we will eventually have one fragment with a labelled nucleotide at each position in the DNA sequence of interest. We can then determine which nucleotide exists at each location across all of the fragments and combine the information to obtain the whole DNA sequence. In Sanger sequencing, the visualization process involves gel electrophoresis and is limited by the number of lanes within the gel.
It can only sequence one fragment of DNA per gel and can only grow as many dyed fragments as there are lanes in the gel. Furthermore, the labelled nucleotides used are chain-terminating nucleotides, which prevent the addition of nucleotides following the dye. Therefore, only a single position in the DNA fragment will be labelled and read. This results in accurate but slow and costly sequencing, particularly when sequencing multiple DNA fragments. If one desires to sequence multiple fragments, a gel must be prepared and run for each fragment and a dyed fragment must be produced for each position in the DNA of interest.

The second generation of sequencing technology, also known as next-generation sequencing (NGS), was created in 2006 by Solexa [22]. One of the most common forms of NGS today is Illumina sequencing, which is the source of all sequence data for this thesis. Illumina sequencing addresses the limitations of Sanger sequencing by allowing multiple reversible dye-labelled nucleotides to be attached to a single fragment and by not relying on gel electrophoresis to read the fragments. Instead, Illumina sequencing grows millions of DNA fragments simultaneously, each with many dyed nucleotides. This allows for parallel sequencing to be conducted. To accomplish this, Illumina sequencing utilizes a flow cell with millions of wells that can each read a dyed fragment of DNA. The features of Illumina sequencing provide the ability to rapidly sequence an entire genome with a single prepared sample of DNA. This significantly reduces the cost of sequencing both in preparation time and dollars per DNA fragment. For these reasons, it is widely used in genomic studies of cancer.

1.2.1 RNA Sequencing

RNA sequencing (RNA-Seq) refers to determining the presence and order of nucleotides found in an RNA molecule [24]. With each iteration of DNA sequencing technology, there have been techniques developed to apply them to RNA as well.
The earlier techniques such as serial analysis of gene expression (SAGE) and cap analysis of gene expression (CAGE) shared the limitations present in Sanger sequencing, namely high cost and low throughput [24]. Likewise, these limitations have been mitigated significantly with the advent of the second generation of sequencing technology. In order to perform RNA sequencing using Illumina sequencing, the RNA sample goes through an additional sample preparation phase to transcribe the RNA to cDNA (complementary DNA) using a reverse transcriptase enzyme [25]. Having been converted to cDNA, the sample can now undergo the normal Illumina sequencing process. The reads obtained from RNA sequencing are then mapped to a reference genome/transcriptome using one of a number of genome alignment tools such as STAR [26]. The reads found at each region of the genome are counted and normalized to determine the expression of that region of the genome.

The purpose of normalization is to overcome potential biases introduced by differing read depth and gene lengths. Normalization to a standard format allows expression data to be compared between studies and attempts to remove technical bias introduced by the sequencing process [27, 28]. The three most popular normalization formats are RPKM, FPKM, and TPM [27]. Each has a slightly different way of implementing normalization. The RPKM format, however, is the relevant format for this thesis work as all of the data used herein was presented as RPKM values. The RPKM value is a within-sample normalization expressed as reads per kilobase per million reads mapped. It is calculated by dividing the number of reads mapped to each gene by the product of the total number of mapped reads (in millions) and the gene length (in kilobases) [27].

1.3 The Role of RNA Genes in Cancer

A large fraction of the human genome is composed of non-coding regions and has historically been considered "junk" [29, 30].
This non-coding region comprises DNA that is either not transcribed to RNA or whose RNA transcripts do not code for proteins (non-coding RNAs). In recent years, studies have begun to show the important role non-coding RNAs (ncRNA) play in tumourigenesis and the malignancy of cancers [29, 31, 32, 33]. Non-coding RNAs influence many cellular processes involved in cancer, such as cell growth, differentiation, proliferation, and apoptosis [31, 32]. MicroRNAs (miRNA), a class of non-coding RNA, play a key role in these cellular processes by binding messenger RNA (mRNA) targets and either inhibiting their translation or promoting their degradation [34, 35]. Through this functional modulation, miRNAs are able to effect large changes in gene expression. In fact, miRNAs control nearly one third of all human genes, and for this reason, play an important part in our growing understanding of cancer [33]. It should be noted that in the context of cancer, miRNAs are generally considered to be either tumour suppressors or oncomiRs depending on their target mRNAs [36]. OncomiRs downregulate tumour suppressor genes and thus act to promote tumourigenesis. In contrast, tumour suppressing miRNAs act to suppress the effects of genes that promote tumours [36].

Studies have exemplified the role of miRNAs in cancer and have shown that characterization of cancers is feasible using miRNA biomarkers [36, 37, 38]. For example, Calin et al. (2002) showed that two miRNA genes (miR15 and miR16) were deleted or downregulated in more than half of their samples of B-cell chronic lymphocytic leukemias (CLL), and thus play a significant role in the pathogenesis of CLL [37]. Furthermore, it has been shown that alterations in the miRNA signature between normal and malignant cells can be utilized to accurately classify cancer types and the organ system of origin in poorly differentiated cancers [33].

Given the key role RNA genes play in cancer, it is reasonable to expect the model utilized in this thesis to highlight their role.
Since the multi-task model used in this work is attempting to learn patterns of expression in cancer, we would expect that some of the genes highlighted by the model should support the findings of previous work on RNA genes in cancer.

1.4 The Role of Pseudogenes in Cancer

Pseudogenes are decayed versions of functional genes that may arise when a duplicated gene copy accumulates disabling changes such as point mutations, insertions, deletions, and/or frameshifts, among others [39]. They have, until recently, been considered entirely non-functional regions of the genome and looked upon as "junk DNA" [40, 41]. In recent years, however, studies have shown that thousands of pseudogenes are transcribed and hundreds are translated [41]. For this reason, they are considered part of the set of RNA genes known as long non-coding RNA genes (lncRNA). Transcribed pseudogenes can be detected through RNA-seq and have been shown to have diagnostic power as biomarkers in some human cancers [41]. A pan-cancer analysis of RNA-seq data has demonstrated that pseudogenes show cancer subtype-specific expression patterns, and in some cases can be used to differentiate between subtypes [41]. In light of recent evidence surrounding the utility of pseudogenes in cancer, the implication of pseudogene expression within the context of this thesis will be explored.

1.5 Machine Learning

Machine learning refers to a set of computer algorithms that can independently acquire knowledge and extract patterns from raw data [22]. These algorithms are often used to create predictive models. The process of teaching a model to make accurate predictions is referred to as training and requires the use of training data. Training data is a set of data that is representative of the kind of data we wish to understand and make predictions about. Once the model has sufficiently learned a representation of the training data, the model is said to be trained.
It can then be used to make predictions about new, unseen data of the same format as the training data. Machine learning models can be particularly useful for pattern recognition tasks in which the data of interest is too large or complex to be sufficiently analyzed by human beings. One such example is the characterization of cancers based on gene expression data, which is the focus of this thesis. There are three major types of machine learning: supervised, unsupervised, and reinforcement learning [43]. Supervised learning is the relevant form of machine learning for this thesis work and is described in the following section.

1.5.1 Supervised Learning

Supervised learning is a form of machine learning that utilizes labelled data [42, 43]. This is in contrast to unsupervised learning, in which there are no labels present. Labelled data refers to a set of data that contains not only features, but also a label (target) [42]. For example, a set of images may each be labelled with the type of object shown in the image. When the labels are discrete, they are often referred to as classes, and using machine learning to predict classes of data can be referred to as classification. The goal of classification is to generate a model that can accurately label the given training data and make predictions of the labels for new, unseen data [42, 43]. In order for a machine learning model to learn to accurately label the given data, it needs to learn (via the training process) the underlying features that best represent the data for every possible label.

Training a supervised learning model is an iterative process of assigning predictions for each sample of data given in the training set, calculating the loss (error) of these predictions, and adjusting the parameters of the model to reduce the loss, thereby improving accuracy. This process is repeated until the loss reaches a plateau or a desired value.
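The iterative procedure described above can be illustrated with a minimal sketch. It is not the model used in this thesis: it assumes a single sigmoid neuron, a binary cross-entropy loss, full-batch gradient descent, and a made-up six-sample data set, and the names `train`, `mean_loss`, `learning_rate`, and `tolerance` are hypothetical. The loop repeats the three steps (predict, compute the loss, adjust the parameters) until the loss improves by less than `tolerance`, which stands in here for reaching a plateau.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, xs, ys):
    """Average binary cross-entropy of the predictions sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def train(xs, ys, learning_rate=0.5, tolerance=1e-6, max_steps=10_000):
    """Full-batch gradient descent that stops once the loss plateaus."""
    w, b = 0.0, 0.0
    loss = mean_loss(w, b, xs, ys)
    steps = 0
    while steps < max_steps:
        # Steps 1 and 2: predict each sample and accumulate the loss gradient.
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        # Step 3: move the parameters against the averaged gradient.
        w -= learning_rate * grad_w / len(xs)
        b -= learning_rate * grad_b / len(xs)
        steps += 1
        new_loss = mean_loss(w, b, xs, ys)
        plateau = loss - new_loss < tolerance
        loss = new_loss
        if plateau:
            break
    return w, b, loss, steps


# Hypothetical toy data: one feature per sample and a binary label.
features = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
weight, bias, final_loss, steps_taken = train(features, labels)
print(f"stopped after {steps_taken} steps with loss {final_loss:.4f}")
```

The stopping rule is deliberately simple. In practice the plateau is usually judged on held-out validation data rather than on the training loss.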
Once a plateau is reached, the model is considered trained and should offer some capacity to accurately predict labels for data previously unseen by the model. The calculation of the loss of a model is discussed in more detail as part of the neural network section below.

In order for a model to accurately classify data, it must learn to accurately represent the data. This requires building a function that takes the input features of the data and transforms them into an output classification. This is the core of any machine learning model. There are many algorithmic processes to produce a representative function, and an artificial neural network (ANN or NN) is one example of these. Neural networks are described in the following section and are the relevant class of machine learning models for this thesis.

1.5.2 Neural Networks

A neural network model is at its core a mathematical function. It takes inputs and maps them to outputs. It is composed of a collection of neurons (described in further detail below), often referred to as nodes. These collections of nodes are connected by weighted edges. These weights (along with a bias term for each neuron) are the adjustable parameters learned as part of the training process [43]. The process of training a neural network involves iterative updates to the model parameters (weights and biases) in order to reduce the overall error rate (loss) of the model's predictions. This process ultimately results in the model being able to more accurately approximate a representative function of the training data. As the representation improves, the loss of the model should decrease and the prediction accuracy should increase.

The way in which nodes are connected together defines a neural network's architecture and can have a significant effect on its performance [42, 44]. The canonical example of a neural network is a feed-forward neural network, where the connections within it are all weighted, directed, and acyclic (Figure 1.1) [42, 44].
There are three standard parts to a basic feed-forward network: the input layer, the hidden layer, and the output layer. A layer is simply any number of nodes that exist at the same depth within the network. The depth of a network is the number of layers it contains. The term "deep" is used to refer to networks that have multiple hidden layers [42]. The model developed in this thesis contains four hidden layers and is thus considered a deep neural network.

Figure 1.1: A diagram depicting the basic structure and layering of a feed-forward neural network model. This figure was taken from the web [45].

The input layer of a neural network is where the data is given to the model and will contain as many nodes as there are input features. In this thesis work, for example, the number of input nodes is equal to the number of genes in each sample. In an image classification task, the input layer would have as many nodes as there are pixels in the image. The hidden layer is named as such because it does not provide information in the form of the desired output [42]. The function of the hidden layer is to provide a means with which to apply nonlinear transformations to the input. A neural network with a single hidden layer that contains a sufficient number of nodes (width) can approximate any continuous function to arbitrary accuracy [42, 44]. The output layer is the layer of the network where predictions are given. The width of the output layer in a classification task will be such that it can represent the desired number of labels or classes to be assigned to the input data. For example, if a neural network is trying to decide if an image is a car, a bus, or a truck, the output layer may have 3 nodes: one for a car, one for a bus, and one for a truck. A classification is determined by which corresponding output node has the highest output value (activation).
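To make the layer structure and the final decision rule concrete, the following is a minimal sketch of a forward pass through a tiny feed-forward network. All weights, biases, layer sizes, and function names (`dense`, `forward`, `predict`) are hypothetical and chosen only for illustration, and the use of tanh in the hidden layer is an assumption of the sketch. The predicted class is simply the output node with the highest activation.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per node."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def forward(features, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one tanh hidden layer -> output layer activations."""
    hidden = [math.tanh(v) for v in dense(features, hidden_w, hidden_b)]
    return dense(hidden, output_w, output_b)


def predict(activations, class_names):
    """Return the class whose output node has the highest activation."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return class_names[best]


# Hypothetical network: 2 input features, 2 hidden nodes, 3 output classes.
class_names = ["car", "bus", "truck"]
hidden_w = [[1.0, 0.0], [0.0, 1.0]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
output_b = [0.0, 0.0, 0.0]

activations = forward([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
print(predict(activations, class_names))
```

In a trained network the weights and biases would be learned rather than written by hand, and the input layer would be as wide as the number of input features.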
1.5.3 Gradient Descent

There are two types of components (parameters) being learned by a neural network during training. These components are the weights w and biases b for each neuron (see Section 1.5.5) within a neural network [42, 43]. One way that these parameters can be learned in a feed-forward neural network is through gradient descent. Gradient descent is the process of calculating the loss of a model followed by updating its parameters in the direction of the negative gradient of that loss. In other words, it is a descent in the loss of the model through iterative steps in the direction of the negative gradient. The actual calculation of the gradient is done using an algorithm called backpropagation (see the relevant section below) [42]. The negative gradient for a neural network is composed of the set of partial derivatives for each parameter and represents the direction of steepest change in the parameters required to minimize the loss of the network. By determining the negative gradient of the model, we can effectively update each parameter to move in the direction that minimizes the loss of the model.

There are two basic forms of gradient descent [42, 43]. Standard batch gradient descent calculates the loss of the model using an average of the losses for every sample (batch) in the training set. Stochastic gradient descent (SGD) uses a single sample selected at random to determine the loss [43, 46]. Stochastic gradient descent can also be performed on multiple samples (mini-batch) whose losses are averaged together [46]. This method is referred to as mini-batch gradient descent. Regardless of the number of samples used, the negative gradient is calculated using backpropagation and then used to update each parameter of the model so as to minimize its loss. The formula for updating a parameter can be seen in Equation 1.1 [42].
In this equation, x is the current value for a given parameter, x′ is the updated value for the same parameter, α is the learning rate, and $\nabla_x f(x)$ represents the gradient of the loss f with respect to parameter x.

\[ x' = x - \alpha \nabla_x f(x) \tag{1.1} \]

1.5.4 Learning Rate

The learning rate used when training a neural network can have a significant effect on the performance of the model and is often considered the most important hyperparameter [42]. The learning rate affects how large a step in the direction of the negative gradient the model makes for each parameter (see Equation 1.1). The direction and magnitude of the update are determined by the gradient as calculated via backpropagation. The learning rate scales this gradient and can be used to decrease or increase the size of the parameter update. If the rate is too high, the model may not converge to the best solution (minimal loss), and the loss may increase as a result of the update. If we consider the U-shaped generalization error curve (Figure 1.2), we can think of this as overshooting the goal (optimal capacity) [42]. If the rate is too small, the model could take a very long time to converge or may not converge to a good solution at all [42]. Finding a good learning rate is about balancing the training time with trying to converge as close as possible to the optimal solution for the given architecture and problem space. One technique to balance these two needs of learning is to use an adaptive learning rate, in which training begins with an initial learning rate that is subsequently modified by some algorithmic approach. The adaptation algorithm, the corresponding initial learning rate, the rate/type of reduction, and the frequency at which the reduction rate is applied are all aspects of the learning rate that must be explored as part of hyperparameter tuning.

Figure 1.2: A plot depicting the bias-variance trade-off, the U-shaped generalization error curve, the optimal capacity, under-fitting, and over-fitting zones.
Note: This figure was taken from Goodfellow et al. (2016) [42].

1.5.5 Neuron

Each neuron within a neural network is at its core a mathematical function. To be more specific, it is a nonlinear function [43]. The function that comprises a neuron is defined in Equation 1.2 [43]. In this equation, x is the input feature vector (all the incoming connections to the neuron), w are the weights associated with each input feature connection, b is a bias term, and σ is an activation function applied to the output of the linear equation (w × x + b). Also note the summation over each w × x, which makes this term a weighted sum of the input features. The activation function σ of this equation is used to transform the output of the contained linear combination of inputs (w × x + b) into either a value between 0 and 1 or a value between -1 and 1 depending on the activation function used [42, 43]. The resulting value f(x) for a neuron is considered its "activation".

\[ f(x) = \sigma\left(\sum (w \times x) + b\right) \tag{1.2} \]

1.5.6 Activation Functions

Activation functions within the nodes of a neural network serve to apply a non-linear transformation to the output of the linear combination within a neuron (see Section 1.5.5) [42]. They are selected on the basis of the task at hand [43]. The two relevant activation functions for this thesis are the hyperbolic tangent and softmax functions. These are described below.

Hyperbolic Tangent Function

The Hyperbolic Tangent (tanh) function is a mathematical function suitable for use as an activation function with the nodes of a neural network. It has properties such that −1 < tanh(x) < 1 and tanh(0) = 0 that make it ideal for use as an activation function within the hidden layers of a neural network [47, 43]. The tanh formula is given below (Equation 1.3) and a graphical representation of it can be seen in Figure 1.3 [48].

\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{1.3} \]

Figure 1.3: A plot of the hyperbolic tangent function.
Note: This figure was taken from MathWorld [48].

Softmax Function

The softmax function is often used in neural networks within the output layer [42]. The softmax function effectively converts the output of the output nodes into a normalized probability-like distribution over the classes in the output [42]. The output value of each node then corresponds to the share of the final classification assigned to that class. For example, if a three-class model using the softmax function in the output layer has output values of 0.2, 0.2, and 0.6, the model is assigning a probability-like score of 60% to the third class and 20% to each of the first two classes. We would interpret this output as a predicted class of the third type because the third class has the largest output value. The softmax function (σ) is defined in Equation 1.4 below [42, 43]. In this equation, z is the vector of score outputs from the output nodes and K is the number of possible classes.

\[ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } z = (z_1, \dots, z_K) \in \mathbb{R}^K \tag{1.4} \]

1.5.7 Loss Functions

The loss function (cost function) of a model is used to determine its performance relative to the training data. The loss is a value representative of how much error there is between the predicted and true output of the model [49]. The choice of loss function will be dictated by the type of learning task [42]. For this thesis work, since a multi-class classification is being conducted, the categorical cross-entropy loss is used. This function is detailed below.

Categorical Cross-Entropy Loss

The categorical cross-entropy (CCE) loss function is the standard loss function used for multi-class classification [43, 50]. The formula for the categorical cross-entropy (CCE) loss is shown below in Equation 1.5 [43, 50].
In this formula, K is the total number of classes, N is the number of samples, $t_{nk}$ is the target for sample n and class k, and $y_{nk}$ is the corresponding predicted value.

\[ \mathrm{CCE} = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk} \tag{1.5} \]

1.5.8 Backpropagation

Backpropagation refers to the algorithm used for efficiently calculating the gradient of the loss function with respect to each parameter within a neural network [42]. Since a neural network is a function composed of other functions (each node in the network is a nonlinear function, see Section 1.5.5), the backpropagation algorithm applies the chain rule (from calculus) in a specific manner to efficiently compute the partial derivative for each composing function [42]. The set of partial derivatives for each parameter results in the gradient of the network as a whole with respect to its loss. These partial derivatives can then be utilized as part of gradient descent to update each parameter and minimize the loss of the network (see Section 1.5.3 for details). Further details on the use of the chain rule for backpropagation can be found in the relevant section of the Deep Learning textbook by Goodfellow, Bengio, and Courville (2016) [42].

1.5.9 Initialization

The weights and biases of a neural network need to be initialized to a set of values prior to training. One method of weight initialization is to use Glorot uniform initialization. The Glorot uniform initializer (Equation 1.6) samples from a uniform distribution between the negative and positive limit seen in Equation 1.7 [51, 52]. The biases of a neural network can be, and often are, simply set to zeros.

\[ \text{glorot uniform initializer} = \text{sample}[-\text{glorot limit},\ \text{glorot limit}] \tag{1.6} \]

\[ \text{glorot limit} = \sqrt{\frac{6}{\text{number of input nodes} + \text{number of output nodes}}} \tag{1.7} \]

1.5.10 Over-fitting

Over-fitting is a phenomenon in machine learning where a model will learn to represent the training data with increasing accuracy while the accuracy of the model with predictions made on unseen data, such as validation data, decreases [42].
There is a point at which the model fits the training data so well that its ability to generalize to unseen data is hindered (see the over-fitting zone in Figure 1.2). Preventing over-fitting is about striking a balance between accurately learning the training data and maintaining an ability to generalize to unseen data. There are many techniques used to prevent over-fitting and these methods are generally termed regularization methods. Regularization aims to prevent over-fitting, often by penalizing increasing model complexity [42, 53]. As model complexity goes up, the ability of the model to fit to the training data increases and, at a point, the ability to generalize goes down. Two examples of regularization used in this thesis are early stopping and dropout. These are described in their respective sections below.

1.5.11 Early Stopping and Patience

Early stopping refers to halting the training of a machine learning model when a desired metric, often validation loss or accuracy, reaches a minimum or maximum respectively [42]. Early stopping is the most common form of regularization used in deep learning and is utilized to help prevent over-fitting [42, 54]. Patience refers to the number of training epochs without improvement in the monitored metric that will pass before training is halted, and is a hyperparameter of the early stopping process that must be selected.

1.5.12 Dropout

Dropout is an effective and computationally inexpensive technique for the regularization of neural networks [42, 55]. It is implemented by setting the weights of connections between a random subset of nodes to 0 at each training step. A visualization of the effect of dropout can be seen in Figure 1.4 [54]. Conceptually, dropout can be thought of as training multiple models within a single network [42].
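The random masking behind dropout can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this thesis: it uses the common "inverted dropout" formulation, which zeroes the activations of randomly chosen nodes (equivalent to silencing all of their outgoing connections) and rescales the survivors by 1 / (1 − rate) so that the expected activation is unchanged. The function name `dropout` and its parameters are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the survivors.

    Outside of training the activations are returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in the interval [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # A fresh random mask is drawn on every call, i.e. at every training step.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical layer output with eight nodes and a dropout rate of 0.5.
rng = random.Random(0)
layer_output = [1.0] * 8
print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, rate=0.5, rng=rng, training=False))
```

Because a different mask is drawn at each call, each training step effectively updates a different sub-network.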
By dropping out different connections, we are essentially creating sub-networks at each iteration and forcing the model to learn solutions that do not rely too heavily on any single connection (or set of connections) and thus are more robust [42]. The rate of dropout is typically a value between 0 and 1 that indicates what fraction of the connections are set to 0 (dropped out) at any given training iteration.

Figure 1.4: (a) A standard fully-connected neural network without dropout. (b) A sub-network created by dropping out some of the connections in the standard neural network. Note: This figure was adapted from Wang et al. (2018) [54].

Class Weighting

Class weighting is a method of weighting the training loss that can be used to combat the potentially detrimental effects of training machine learning models using imbalanced data sets [56, 57]. When class weights are appropriately applied, the effect can be to aid in classification performance on minority classes [58]. This can be accomplished by weighting the loss for each sample by either an arbitrary value, or by a value that bears some relationship to the size of the class to which the sample belongs. The weight for each class is the class weight and is the value used to weight the loss for each sample of that class.

Multi-Task Learning

Multi-task learning refers to machine learning in which there is more than one learning objective being learned in parallel. The effect of multi-task learning is improved generalization performance as a result of shared parameters. However, this holds true only under the assumption that there exists a valid relationship between tasks [42]. Each shared parameter is utilized for multiple objectives and therefore the associated parameters are less likely to over-fit the variation in the data related to any singular task [42].

1.6 DeepLift

DeepLift is a backpropagation-based tool developed for the interpretation of trained neural networks [59].
It uses a backpropagation-like algorithm to determine the effect of a selected set of input nodes on the resulting activation of a set of selected nodes of interest (output nodes in this case). A baseline activation level for the output nodes is established using a reference value for each input node (the default reference is 0). The baseline activation is established by passing the reference value through the network to the output nodes via the selected input nodes. The activation level seen at each output node is then recorded as the baseline activation for that node. The set of training data is then passed through the network inputs. The difference from the reference activation value is calculated at each output node and then propagated back through the network to each input node. The larger the difference from the reference activation caused by a particular input node, the greater the perceived effect of that input node is. The larger the effect of an input node, the higher the score DeepLift will assign to it. Similarly, if an input node reduces the activation of an output node, a negative score is returned. This is repeated for all of the training samples across each input and output node combination. The result is a matrix of positive and negative scores that indicate how important each input node is to the output nodes' activation. In the context of this thesis, we receive a set of scores for each gene in the input data that correspond to how important they are to the classification of each of the output classes. We have essentially asked DeepLift to determine how important each gene is in classifying each of the classes within each learning task of the multi-task model. Further details on how DeepLift works can be found in the DeepLift paper and accompanying videos on YouTube [59].

One additional consideration in the analysis of the DeepLift data for this thesis work pertains to gene expression.
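The sign of a score is easiest to see on a toy example. For a single linear output node, the DeepLift contribution of an input reduces to its weight multiplied by its difference from the reference, and the contributions sum to the change in the output. The sketch below is not the DeepLift library and does not reproduce the thesis analysis: the weights, the non-zero reference of 0.5, and the three "genes" are hypothetical values chosen so that one input lies below its reference.

```python
def linear_output(weights, bias, inputs):
    """Activation of one linear output node."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def linear_contributions(weights, inputs, reference):
    """Per-input contribution for a purely linear node: w_i * (x_i - ref_i)."""
    return [w * (x - r) for w, x, r in zip(weights, inputs, reference)]


# Hypothetical values for three "genes" feeding one output node.
weights = [0.8, -1.5, 0.0]
bias = 0.1
reference = [0.5, 0.5, 0.5]   # assumed reference expression (not the default of 0)
sample = [0.9, 0.1, 0.7]      # gene 0 above reference, gene 1 below, gene 2 above

scores = linear_contributions(weights, sample, reference)
delta = linear_output(weights, bias, sample) - linear_output(weights, bias, reference)

for gene, score in enumerate(scores):
    print(f"gene {gene}: expression {sample[gene]:.1f}, score {score:+.2f}")
print(f"sum of scores {sum(scores):.2f} vs change in output {delta:.2f}")
```

In this toy case gene 1 is expressed below its reference yet receives a positive score, because its weight is negative. Gene 2 is expressed above its reference but receives a score of zero because the node ignores it.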
Since each input feature represents a gene, we must be careful to properly interpret positive scores assigned by DeepLift. A positive importance score for a particular gene does not necessitate that the gene has higher than normal expression. It simply means there is something about the expression of this gene that has a positive influence on the model selecting the current class being examined by DeepLift. This could be under or overexpression of a gene.

Chapter 2

The Data and the Model

2.1 The Data

All of the data used for this thesis consists of RPKM gene expression values for 26668 genes. The genes selected were those that were found in the intersection of all of the genes available across all of the different data sources. The list of these data sources is outlined in Table 2.1. The data can be thought of as two separate sets. The largest set consists of a mix of primary cancer, metastatic cancer, and normal tissue samples. The second, smaller data set contains only metastatic cancer samples and was used only for testing the trained machine learning model.

Primary & Normals
TCGA
NIH-NCI non-Hodgkin lymphoma dataset including FL and DLBC
Non-cell-line GBM data from the TFRI's Glioblastoma Multiforme project
MESO dataset from GenenTech
MB-Adult data from the GSC
Follicular lymphoma data from the GSC
CML data from the TARGET project
CLL and DLBC data from the GPH project

Metastatic
Met500
POG

Table 2.1: List of data sources for primary, normal, and metastatic data

Mixed Data Set

Within the mixed data set, the vast majority of the primary cancer samples are from The Cancer Genome Atlas (TCGA) data set. The TCGA data was supplemented with primary mesothelioma, glioblastoma, non-Hodgkin's lymphoma, medulloblastoma, follicular lymphoma, and leukemia data sets from a variety of other sources detailed in Table 2.1 [11]. There are 375 metastatic cancer samples included in the mixed set that came from the Met500 cohort gathered by the University of Michigan.
Details of this cohort can be found in the associated paper by Robinson et al. [62]. With all of the sources combined, the large mixed data set consists of 11588 samples of which 10493 were primary cancer samples, 715 were normal tissue samples, and 375 were metastatic cancer samples.

The mixed data set was annotated to include labels for the 4 different classification tasks within the model architecture: organ system of origin, disease state, cancer type, and cancer subtype. A summary of the categories for the mixed data set can be found in Tables 2.2, 2.3, 2.4, 2.5, and 2.6. There are 3 labels for the disease state category corresponding to primary cancer, metastatic cancer, and normal tissue samples. The other classification tasks consist of 11 organ systems of origin, 68 cancer types, and 91 cancer subtypes. The number of classes presented here reflects those remaining after the preprocessing/filtering steps outlined in Section 2.1.2. Within both the cancer type and subtype labels, there are 20 metastatic cancers and 16 normal classes.
Within the organ system of origin task there are 8 classes that have normal samples included.

Total Number of Cancer Subtypes  91
Number of Primary Subtypes  55
Number of Metastatic Subtypes  20
Number of Normal Subtypes  16
Total Number of Cancer Types  68
Number of Primary Types  32
Number of Metastatic Types  20
Number of Normal Types  16
Total Number of Organ Systems of Origin  11
Number of Organ Systems with Normal Samples  8
Total Number of Tissue Types  3

Table 2.2: Number and composition of classes for each classification task

Organ System of Origin
Full Name  Number of Cancer Samples  Number of Normal Samples
Breast  1268  112
Central Nervous System  1024  0
Endocrine  1005  59
Gastrointestinal  1756  146
Gynecologic  883  24
Head and Neck  650  44
Hematologic  615  0
Skin  484  0
Soft Tissue  294  0
Thoracic  1436  110
Urological  2071  200
Total Number of Cancer Samples  10496
Total Number of Normal Samples  695
Total Number of Samples  11486

Table 2.3: Organ system of origin classes and frequencies within the full set of preprocessed data (including both train and test data)

Tissue Type
Full Name  Number of Samples
Primary Tumour  10496
Metastatic Tumour  295
Normal Tissue  695
Total Number of Samples  11486

Table 2.4: Tissue type classes and frequencies within the full set of preprocessed data (including both train and test data)

Cancer Types
Abbreviation  Full Name  Number of Samples
ACC T Metastatic Metastatic Adrenocortical Carcinoma  8
ACC T Tumor Adrenocortical Carcinoma  79
ALL T Metastatic Acute Lymphocytic Leukemia  13
BLCA N Normal Bladder Tissue  19
BLCA T Metastatic Metastatic Bladder Urothelial Carcinoma  14
BLCA T Tumor Bladder Urothelial Carcinoma  408
BRCA T Metastatic Metastatic Breast Invasive Carcinoma  56
BRCA N Normal Breast Tissue  112
BRCA T Tumor Breast Invasive Carcinoma  1100
CESC T Tumor Endocervical Adenocarcinoma  300
CHOL T Metastatic Extrahepatic Cholangiocarcinoma  19
CHOL N Normal Bile Duct Tissue  9
CHOL T Tumor Cholangiocarcinoma  36
CLL T Tumor Chronic Lymphocytic Leukemia  29
CML T Tumor Chronic Myelogenous Leukemia  102
COADREAD N Normal Colorectal Tissue  51
COADREAD T Metastatic Metastatic Colorectal Adenocarcinoma  10
COADREAD T Tumor Colorectal Adenocarcinoma  386
DLBC T Tumor Lymphoid Neoplasm Diffuse Large B-cell Lymphoma  170
ESCA T Metastatic Metastatic Esophageal Adenocarcinoma  9
ESCA T Tumor Esophageal Carcinoma  169
FL T Tumor Follicular Lymphoma  50
GBM T Tumor Glioblastoma  219
HNSC N Normal Head and Neck Tissue  44
HNSC T Metastatic Metastatic Head and Neck Squamous Cell Carcinoma  9
HNSC T Tumor Head and Neck Squamous Cell Carcinoma  517
KICH N Normal Kidney Tissue  25
KICH T Tumor Kidney Chromophobe  66
KIRC N Normal Kidney Tissue  72
KIRC T Tumor Kidney Renal Clear Cell Carcinoma  532
KIRP N Normal Kidney Tissue  32
KIRP T Tumor Kidney Renal Papillary Cell Carcinoma  291
LAML T Metastatic Metastatic Acute Myeloid Leukemia  8
LAML T Tumor Acute Myeloid Leukemia  123
LGG T Tumor Brain Lower Grade Glioma  530
LIHC N Normal Liver Tissue  50
LIHC T Metastatic Metastatic Liver Hepatocellular Carcinoma  7
LIHC T Tumor Liver Hepatocellular Carcinoma  373
LUAD N Normal Lung Tissue  59
LUAD T Metastatic Metastatic Lung Adenocarcinoma  9
LUAD T Tumor Lung Adenocarcinoma  518
LUSC N Normal Lung Tissue  51
LUSC T Tumor Lung Squamous Cell Carcinoma  501
MB-Adult T Tumor Medulloblastoma  275
MESO T Tumor Mesothelioma  298
OV T Metastatic Metastatic Ovarian Serous Cystadenocarcinoma  13
OV T Tumor Ovarian Serous Cystadenocarcinoma  308
PAAD T Metastatic Metastatic Pancreatic Adenocarcinoma  7
PAAD T Tumor Pancreatic Adenocarcinoma  179
PCPG T Tumor Pheochromocytoma and Paraganglioma  184
PRAD N Normal Prostate Tissue  52
PRAD T Metastatic Metastatic Prostate Adenocarcinoma  62
PRAD T Tumor Prostate Adenocarcinoma  498
NET T Metastatic Metastatic Neuroendocrine Tumour  6
SARC T Metastatic Metastatic Sarcoma  33
SARC T Tumor Sarcoma  261
SKCM T Metastatic Metastatic Skin Cutaneous Melanoma  12
SKCM T Tumor Skin Cutaneous Melanoma  472
STAD N Normal Stomach Tissue  36
STAD T Tumor Stomach Adenocarcinoma  415
TGCT T Tumor Testicular Germ Cell Tumors  156
THCA N Normal Thyroid Tissue  59
THCA T Tumor Thyroid Carcinoma  513
THYM T Tumor Thymoma  120
UCEC N Normal Uterine Tissue  24
UCEC T Tumor Uterine Corpus Endometrial Carcinoma  181
UCS T Tumor Uterine Carcinosarcoma  57
UVM T Tumor Uveal Melanoma  80
Total Number of Primary Samples  10496
Total Number of Metastatic Samples  295
Total Number of Normal Samples  695
Total Number of Samples  11486

Table 2.5: Cancer type class abbreviations and frequency within the full set of preprocessed data (including both train and test data)

Cancer Subtypes
Abbreviation  Full Name  Number of Samples
ACC T Metastatic Metastatic Adrenocortical Carcinoma  8
ACC T Tumor Adrenocortical Carcinoma  79
ALL T Metastatic Acute Lymphocytic Leukemia  13
BLCA N Normal Bladder Tissue  19
BLCA T Metastatic Metastatic Bladder Urothelial Carcinoma  14
BLCA T Tumor Bladder Urothelial Carcinoma  408
BRCA Basal T Tumor Basal Breast Invasive Carcinoma  176
BRCA HER2like Tumor HER2-like Breast Invasive Carcinoma  80
BRCA IDC T Metastatic Metastatic Invasive Ductal Breast Carcinoma  46
BRCA ILC T Metastatic Metastatic Invasive Lobular Breast Carcinoma  10
BRCA LuminalA T Tumor Luminal A Breast Invasive Carcinoma  538
BRCA LuminalB T Tumor Luminal B Breast Invasive Carcinoma  207
BRCA N Normal Breast Tissue  112
BRCA T Tumor Breast Invasive Carcinoma  99
CESC CAD T Tumor Endocervical Adenocarcinoma  47
CESC SCC T Tumor Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma  253
CHOL EHCH T Metastatic Metastatic Extrahepatic Cholangiocarcinoma  10
CHOL IHCH T Metastatic Metastatic Intrahepatic Cholangiocarcinoma  9
CHOL N Normal Bile Duct Tissue  9
CHOL T Tumor Cholangiocarcinoma  36
CLL T Tumor Chronic Lymphocytic Leukemia  29
CML T Tumor Chronic Myelogenous Leukemia  102
COADREAD N Normal Colorectal Tissue  51
COADREAD T Metastatic Metastatic Colorectal Adenocarcinoma  10
COADREAD T Tumor Colorectal Adenocarcinoma  386
DLBC BM T Tumor Bone Marrow Lymphoid Neoplasm Diffuse Large B-cell Lymphoma  11
DLBC T Tumor Lymphoid Neoplasm Diffuse Large B-cell Lymphoma  159
ESCA EAC T Metastatic Metastatic Esophageal Adenocarcinoma  9
ESCA EAC T Tumor Esophageal Adenocarcinoma  63
ESCA SCC T Tumor Squamous Cell Esophageal Carcinoma  93
ESCA T Tumor Esophageal Carcinoma  13
FL T Tumor Follicular Lymphoma  50
GBM T Tumor Glioblastoma  219
HNSC N Normal Head and Neck Tissue  44
HNSC T Metastatic Metastatic Head and Neck Squamous Cell Carcinoma  9
HNSC T Tumor Head and Neck Squamous Cell Carcinoma  517
KICH N Normal Kidney Tissue  25
KICH T Tumor Kidney Chromophobe  66
KIRC N Normal Kidney Tissue  72
KIRC T Tumor Kidney Renal Clear Cell Carcinoma  532
KIRP N Normal Kidney Tissue  32
KIRP T Tumor Kidney Renal Papillary Cell Carcinoma  291
LAML T Metastatic Metastatic Acute Myeloid Leukemia  8
LAML T Tumor Acute Myeloid Leukemia  123
LGG T Tumor Brain Lower Grade Glioma  530
LIHC N Normal Liver Tissue  50
LIHC T Metastatic Metastatic Liver Hepatocellular Carcinoma  7
LIHC T Tumor Liver Hepatocellular Carcinoma  373
LUAD N Normal Lung Tissue  59
LUAD T Metastatic Metastatic Lung Adenocarcinoma  9
LUAD T Tumor Lung Adenocarcinoma  518
LUSC N Normal Lung Tissue  51
LUSC T Tumor Lung Squamous Cell Carcinoma  501
MB Group3 T Tumor Group 3 Medulloblastoma  39
MB Group4 T Tumor Group 4 Medulloblastoma  69
MB SHH T Tumor Sonic Hedgehog Medulloblastoma  136
MB WNT T Tumor Wingless Medulloblastoma  31
MESO T Tumor Mesothelioma  298
OV T Metastatic Metastatic Ovarian Serous Cystadenocarcinoma  13
OV T Tumor Ovarian Serous Cystadenocarcinoma  308
PAAD T Metastatic Metastatic Pancreatic Adenocarcinoma  7
PAAD T Tumor Pancreatic Adenocarcinoma  179
PCPG T Tumor Pheochromocytoma and Paraganglioma  184
PRAD N Normal Prostate Tissue  52
PRAD T Metastatic Metastatic Prostate Adenocarcinoma  62
PRAD T Tumor Prostate Adenocarcinoma  498
PrNET T Metastatic Metastatic Pancreatic Neuroendocrine Tumour  6
SARC DDL T Tumor Dedifferentiated Sarcoma  58
SARC LMS T Metastatic Leiomyosarcoma  9
SARC LMS T Tumor Dedifferentiated Liposarcoma  106
SARC MPNST T Tumor Malignant Peripheral Nerve Sheath Tumour  10
SKCM T Metastatic Metastatic Skin Cutaneous Melanoma  12
SKCM T Tumor Skin Cutaneous Melanoma  472
STAD CIN T Tumor Chromosomal Instability Stomach Adenocarcinoma  211
STAD EBV T Tumor EBV-positive Stomach Adenocarcinoma  31
STAD GS T Tumor Genomically Stable Stomach Adenocarcinoma  70
STAD MSI T Tumor Microsatellite Instability Stomach Adenocarcinoma  76
STAD N Normal Stomach Tissue  36
STAD T Tumor Stomach Adenocarcinoma  27
TGCT T Tumor Testicular Germ Cell Tumors  156
THCA N Normal Thyroid Tissue  59
THCA T Tumor Thyroid Carcinoma  513
THYM T Tumor Thymoma  120
UCEC N Normal Uterine Tissue  24
UCEC T Tumor Uterine Corpus Endometrial Carcinoma  181
UCS T Tumor Uterine Carcinosarcoma  57
UVM T Tumor Uveal Melanoma  80
Total Number of Primary Samples  10496
Total Number of Metastatic Samples  295
Total Number of Normal Samples  695
Total Number of Samples  11486

Table 2.6: Cancer subtype class abbreviations and frequency within the full set of preprocessed data (including both train and test data)

Metastatic-Only Data Set

The second data set was derived from the Personalised OncoGenomics (POG) project at BC Cancer and contains only metastatic cancer samples. Throughout this thesis, this data set is referred to as the external test set, the POG data set, or the metastatic-only test set. Extensive details of the POG project can be found in the paper by Pleasance et al. [60].
A summary of its composition as utilized in this thesis can be found in Tables 2.7, 2.8, and 2.9. There are 461 metastatic cancer samples that span 15 cancer subtypes, 13 cancer types, and 10 organ systems of origin. The 461 samples were selected from a larger set of POG data and were chosen on the basis that each of their labels for all four classification tasks was also present in the training data.

Organ System of Origin
Full Name  Number of Samples
Breast  134
Endocrine  6
Gastrointestinal  163
Gynecologic  34
Head and Neck  5
Hematologic  2
Skin  14
Soft Tissue  56
Thoracic  44
Urological  3
Total Number of Organ Systems  10
Total Number of Samples  461

Table 2.7: Organ system of origin classes and frequencies within the POG dataset

Cancer Types
Abbreviation  Full Name  Number of Samples
ACC T Metastatic Metastatic Adrenocortical Carcinoma  6
BRCA T Metastatic Metastatic Breast Invasive Carcinoma  134
CHOL T Metastatic Metastatic Cholangiocarcinoma  3
COADREAD T Metastatic Metastatic Colorectal Adenocarcinoma  85
HNSC T Metastatic Metastatic Head and Neck Squamous Cell Carcinoma  5
LAML T Metastatic Metastatic Acute Myeloid Leukemia  2
LIHC T Metastatic Metastatic Liver Hepatocellular Carcinoma  3
LUAD T Metastatic Metastatic Lung Adenocarcinoma  44
OV T Metastatic Metastatic Ovarian Serous Cystadenocarcinoma  34
PAAD T Metastatic Metastatic Pancreatic Adenocarcinoma  72
PRAD T Metastatic Metastatic Prostate Adenocarcinoma  3
SARC T Metastatic Metastatic Sarcoma  56
SKCM T Metastatic Metastatic Skin Cutaneous Melanoma  14
Total Number of Cancer Types  13
Total Number of Samples  461

Table 2.8: Cancer type class abbreviations and frequency within the POG dataset

Cancer Subtypes
Abbreviation  Full Name  Number of Samples
ACC T Metastatic Metastatic Adrenocortical Carcinoma  6
BRCA IDC T Metastatic Metastatic Invasive Ductal Breast Carcinoma  125
BRCA ILC T Metastatic Metastatic Invasive Lobular Breast Carcinoma  9
CHOL IHCH T Metastatic Metastatic Intrahepatic Cholangiocarcinoma  3
COADREAD T Metastatic Metastatic Colorectal Adenocarcinoma  85
HNSC T Metastatic Metastatic Head and Neck Squamous Cell Carcinoma  5
LAML T Metastatic Metastatic Acute Myeloid Leukemia  2
LIHC T Metastatic Metastatic Liver Hepatocellular Carcinoma  3
LUAD T Metastatic Metastatic Lung Adenocarcinoma  44
OV T Metastatic Metastatic Ovarian Serous Cystadenocarcinoma  34
PAAD T Metastatic Metastatic Pancreatic Adenocarcinoma  72
PRAD T Metastatic Metastatic Prostate Adenocarcinoma  3
SARC LMS T Metastatic Metastatic Leiomyosarcoma  11
SARC T Metastatic Metastatic Sarcoma  45
SKCM T Metastatic Metastatic Skin Cutaneous Melanoma  14
Total Number of Cancer Subtypes  15
Total Number of Samples  461

Table 2.9: Cancer subtype class abbreviations and frequency within the POG dataset

The POG data class labels were annotated using the same class labels that were found in the training data and correspond to the TCGA naming convention. The most appropriate TCGA label was determined as part of the analysis conducted for the POG project and considered genomic, pathological, and clinical factors [61, 62].

2.1.1 Training and Test Sets

The mixed data set described above was divided into training and test sets. The training set used to train the model(s) was generated by utilizing 85% of the whole mixed data set and contained 9763 samples. The remaining 15%, 1723 samples, constitutes the held-out test data set and contains primary, metastatic, and normal samples in proportions equal to those found in the training data set (i.e., it is stratified). This held-out data was excluded from all aspects of training including cross-validation.

The POG data as described above was utilized in its entirety for testing only. The resulting test set is 461 metastatic cancer samples.
The value of this data set as a test set is that all of its samples were processed at a facility that is different from any of the metastatic samples in the training and held-out test set. This should add some objectivity to the testing results.

2.1.2 Data Preprocessing: Validation

The RPKM values of the data were ranked and normalized to lie between 0 and 1 using the rank function from the pandas Python package [63]. Samples were then removed if they belonged to a cancer subtype class that contained fewer than 6 samples. Since the intention was to utilize five-fold cross-validation for model validation and optimization, it was important to keep this minimum number of samples to ensure class ratios remained the same across all folds.

Following the filtering of samples, 15% of the data was separated into a held-out test set using the train_test_split function found in the scikit-learn Python package [64]. The option to stratify the classes was enabled to ensure proper class representation. The remaining 85% of the data not used for the held-out test set was then divided into five folds for use in cross-validation. The StratifiedKFold function from the scikit-learn package was used to generate the folds and maintain class ratios. The result of the data splitting is that at least one sample of each subtype was present in the test set and at least five samples of each subtype were available to be split among the five folds generated for cross-validation.

2.1.3 Data Preprocessing: Testing

The preprocessing steps for testing differ slightly from those of validation outlined in Section 2.1.2. Since the model has been validated using cross-validation, multiple training folds are no longer necessary for training the final model. The advantage of this is that more data can be used for training the model as a validation set is not needed.
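The ranking, filtering, and stratified splitting steps of Section 2.1.2 can be sketched in plain Python as follows. This is a simplified stand-in written for illustration, not the pandas and scikit-learn calls used in the thesis. The helper names, the choice to rank within a single sample, the average-rank handling of ties, and the rounding rule for the test fraction are all assumptions.

```python
import random
from collections import Counter, defaultdict


def rank_normalize(values):
    """Map values to (0, 1] by rank; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / len(values)
        i = j + 1
    return ranks


def filter_small_classes(labels, min_size=6):
    """Indices of samples whose class has at least `min_size` members."""
    counts = Counter(labels)
    return [i for i, y in enumerate(labels) if counts[y] >= min_size]


def group_by_class(indices, labels):
    groups = defaultdict(list)
    for i in indices:
        groups[labels[i]].append(i)
    return groups


def stratified_holdout(indices, labels, test_fraction, rng):
    """Hold out roughly `test_fraction` of every class (at least one sample)."""
    train, test = [], []
    for members in group_by_class(indices, labels).values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def stratified_folds(indices, labels, n_folds, rng):
    """Deal the members of every class round-robin into `n_folds` folds."""
    folds = [[] for _ in range(n_folds)]
    for members in group_by_class(indices, labels).values():
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % n_folds].append(i)
    return folds


rng = random.Random(0)
labels = ["A"] * 20 + ["B"] * 6 + ["C"] * 3  # class "C" is too small and is dropped
kept = filter_small_classes(labels)
train, test = stratified_holdout(kept, labels, test_fraction=0.15, rng=rng)
folds = stratified_folds(train, labels, n_folds=5, rng=rng)
print(rank_normalize([3.0, 1.0, 2.0, 2.0]))
print(len(train), len(test), [len(fold) for fold in folds])
```

With a minimum class size of 6, the smallest class in this toy example contributes one sample to the held-out set and one sample to each of the five folds, which is the property the filtering threshold was chosen to guarantee.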
The preprocessing for testing therefore excludes splitting the data for cross-validation but still includes ranking and normalization via the rank function within the pandas Python package. The result is a held-out test set containing 15% of the mixed data and a single training data set containing the remaining 85%. The metastatic-only test data (derived from POG) underwent the same ranking and normalization described above.

2.2 The Model

The model used in this thesis is a fully-connected feed-forward artificial neural network. The model is a multi-task model in that it has four classification output layers used to make classifications within four distinct tasks. A visualization of the model can be seen in Figure 2.1. These four classification tasks are:

1. Organ System of Origin
2. Disease State
3. Cancer Type
4. Cancer Subtype

Figure 2.1: High level diagram of the multi-task neural network

Each classification task in the neural network model has a hidden layer directly connected with it. Each hidden layer connects to a classification task output layer as well as the next hidden layer (with the exception of the final hidden layer). Each hidden layer is a fully connected (dense) layer. The effect of this network architecture is that as information moves from the first hidden layer (associated with the organ system of origin classification) to the final hidden layer (associated with the cancer subtype classification), the model has more hidden layers to utilize in making the classification. For example, with the organ system of origin classification there is only a single hidden layer available to encode information, but at the cancer subtype classification there are four. By having more hidden layers for learning tasks that are more complex (cancer subtype being more complex than organ system), we are providing the model with a greater learning capacity for these more complex tasks.

The rationale behind using a model with this multi-task architecture is fourfold.
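The chained layout described above, in which each hidden layer feeds both its own output layer and the next hidden layer, can be sketched as a forward pass in plain Python. This is an untrained toy illustration and not the Keras model used in the thesis: the function names and the tiny input and hidden sizes are hypothetical. The class counts per task, the tanh hidden activations, the softmax outputs, and the Glorot uniform initialization with zero biases follow the settings reported in this chapter.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Glorot (Xavier) uniform weights and zero biases."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_genes, n_hidden, task_sizes, rng):
    """One hidden layer plus one output head per task, chained in order."""
    layers = []
    n_in = n_genes
    for name, n_classes in task_sizes:
        hidden = random_layer(n_in, n_hidden, rng)
        head = random_layer(n_hidden, n_classes, rng)
        layers.append((name, hidden, head))
        n_in = n_hidden
    return layers


def forward(layers, sample):
    """Return class probabilities for every task from a single sample."""
    outputs = {}
    activation = sample
    for name, (hidden_w, hidden_b), (head_w, head_b) in layers:
        # Each hidden layer feeds its own output head and the next hidden layer.
        activation = [math.tanh(z) for z in dense(activation, hidden_w, hidden_b)]
        outputs[name] = softmax(dense(activation, head_w, head_b))
    return outputs


rng = random.Random(0)
tasks = [("organ_system", 11), ("disease_state", 3),
         ("cancer_type", 68), ("cancer_subtype", 91)]
model = build_model(n_genes=20, n_hidden=8, task_sizes=tasks, rng=rng)  # toy sizes
predictions = forward(model, [rng.random() for _ in range(20)])
for name, probabilities in predictions.items():
    print(name, len(probabilities), round(sum(probabilities), 6))
```

The organ system head sees the input through one hidden layer, while the cancer subtype head sees it through four, which is the increase in capacity for the more complex tasks described above.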
The four rationales are described below.

The first rationale is that the multiple task setup forces the model to learn increasingly granular features of the data. The first dense layer must encode all of the information necessary to accurately classify an organ system of origin. The effect of this is that in subsequent layers (down-stream from the organ system layer), the model is encouraged to learn features that will help to distinguish the disease state, and can, at least in part, ignore features needed to distinguish the organ system of origin.

The increasing granulation of feature learning described above contributes to the second rationale behind this architecture: mitigating tissue bias. A single learning task that requires a model to only classify cancer types and seeks to do so with cancers from different organ systems will, at least in part, learn features that define the organ system. This is tissue bias. We can imagine trying to distinguish stomach cancer from brain cancer. During training, the model can increase its baseline classification accuracy if it can learn what makes a brain different from a stomach. This does not necessarily require learning anything about cancer specifically. The background gene expression levels of the relevant organ systems can be leveraged in distinguishing brain cancer from stomach cancer and may be enough information to accurately classify some samples. Thus, the model is encouraged to identify the expression patterns of the organ system of origin. By forcing the model to learn to distinguish organ systems with the first layer of the multi-task model, we are providing a mechanism of encouragement for it to learn patterns of expression specific to cancers in subsequent learning tasks.

The third rationale behind this multi-task architecture is that we are imbuing a biological hierarchy into the decision making process. The order of classification tasks is such that it follows a biological hierarchy.
The model first questions what organ system is involved, then if this is normal or cancerous tissue, then what type of cancer it is, followed by what subtype. This is a biologically relevant series of decisions and may help to improve the classification ability of the model. In fact, convolutional neural networks used for image recognition are thought to show improved performance as a result of the hierarchical nature of their structure and learning [65]. It is reasonable to attempt to utilize this approach in the domain of this thesis work.

The fourth and final rationale for this multi-task model is simply the volume of output. The more tasks we have, the greater the wealth of data being output. This is an advantage when conducting post-classification analysis of the trained model as it provides access to more decision levels of the model and may provide a means with which to ask more interesting biological questions.

2.2.1 Model Settings and Hyperparameters

The optimization of the model's hyperparameters was done using five-fold cross-validation and a combination of manual search and limited grid search. The mean performance across all five validation sets was examined to determine the hyperparameter values of the model. The hyperparameters experimented with included the number of nodes in the hidden layers of the model, the learning rate, optimizer, dropout rate, batch sizes and various learning rate decay schedules.
Ultimately, the hyperparameter settings seen in Table 2.10 were the best values found for this particular model architecture and problem space.

Hyperparameter  Value
Number of Nodes in Hidden Layers  2000
Optimizer  Mini-Batch Gradient Descent
Batch size  32
Learning Rate (reduce on plateau)  0.001
Learning Rate Reduction Factor  0.95
Learning Rate Reduction Patience  20
Early Stopping Patience  40
Dense Layer Activation Function  Tanh
Dropout (every dense layer)  0.2
Class Weighting  True

Table 2.10: Hyperparameter settings

Initialization

The weights were initialized using the glorot_uniform (also known as the Xavier uniform) initializer as implemented in the Keras Python package [51]. The biases were initialized to zeros.

Mini-Batch Gradient Descent

Mini-batch gradient descent was used as the optimizer for the model. The implementation used was the one found in the Keras Python package [51]. This is simply the SGD optimizer with the batch size set to 32.

Learning Rate Reduction

For this work, the validation loss on the cancer subtype classification task was used as the observed metric for learning rate reduction patience. The initial learning rate was set to 0.001, the reduction factor to 0.95, and the reduction patience to 20 (Table 2.10). The learning rate reduction was implemented using the ReduceLROnPlateau callback from the Keras Python package [51].

Early Stopping and Patience

When determining if training should be stopped at any given epoch due to a lack of improvement in the validation loss, a patience of 40 was utilized. This means that the model would allow 40 epochs to complete without an improvement in the validation loss before halting the training. The EarlyStopping callback from the Keras Python package was utilized to achieve early stopping and patience for the models used in this thesis [51].

Activation Function

The hyperbolic tangent function was used as the activation function for each node within the hidden layers of all of the models presented in this thesis.
The softmax activation function was used for each node in the output layers. Both activation functions were used as implemented in the Keras Python package [51].

Loss Function

The categorical cross-entropy loss was utilized for the models in this thesis. The implementation used was the standard one found in the Keras Python package [51].

Dropout

A dropout rate of 0.2 or 20% was used for the models in this thesis and was implemented using the Dropout layers from the Keras Python package [51].

Class Weighting

Class weighting was implemented for the models in this thesis using the compute_class_weight and compute_sample_weight functions from the scikit-learn Python package [16]. Due to limitations of the Keras package in a multi-task environment, it was not possible to directly apply the class weights during training. As a workaround, the compute_class_weight function was used to calculate the proper weight values, which were then applied on a per-sample level at training time using the compute_sample_weight function. The weights were chosen to reflect the relative sizes of the classes. The largest class was given a weight of 1 and all other classes were given a weight reflecting how much smaller they were than the largest class. For example, if a minority class had half the number of samples as the majority class, it was assigned a weight of 2.

2.2.2 Evaluating the Effect of Multiple Tasks on Classification Performance

The classification performance of the multi-task model was evaluated against models containing fewer learning tasks and one model using just a single task (cancer subtype only). The evaluation of the models was conducted as part of the cross-validation process and thus the results presented here are an average of the performance across five validation folds. The validation folds contained normal tissue, primary cancer, and metastatic cancer samples as described in Section 2.1.
The validation folds were used to ensure that the test sets remained untouched during the validation stage. The following sections will present the validation results for each of the learning tasks: organ system of origin, disease state, cancer type, and cancer subtype.

Organ System of Origin

The F1-scores presented in Figure 2.2 range from 0.981639 for the "All Tasks" model to 0.984891 for a multi-task model containing organ system of origin, disease state, and cancer subtype and not containing a cancer type learning task ("No Cancer Type"). This represents a performance reduction of 0.003252 for the "All Tasks" model versus the best performing set of tasks. The variation in performance seen between models is largest between the "All Tasks" model and the other three models. We note that the smallest standard deviation is seen with the "No Disease State" model and the largest with the "No Cancer Type" model.

Figure 2.2: The macro F1-scores of various models using validation sets containing both primary and metastatic samples from different organ systems of origin.

Disease State

The F1-scores presented in Figure 2.3 range from 0.978703 for a model with the cancer type task removed to 0.981184 for a model missing the organ system of origin learning task ("No Organ System"). This represents a performance reduction of 0.002481. The performance of the "All Tasks" model sits in between the other two with an F1-score of 0.979349. The standard deviation is largest with the "No Cancer Type" model and smallest with the "No Organ System" model.

Figure 2.3: The macro F1-scores of various models on validation sets containing both primary and metastatic samples at the disease state classification level

Cancer Type

The F1-scores presented in Figure 2.4 range from 0.861412 for the "All Tasks" model to 0.862186 for a multi-task model containing organ system of origin, cancer type, and cancer subtype, and not containing a disease state learning task ("No Disease State").
This represents a performance improvement of 0.000774 over the "All Tasks" model, which had the poorest performance of the models tested. Note, however, that the "All Tasks" model had the smallest standard deviation. The variation in performance seen between models is much smaller for cancer type classification when compared with the cancer subtype classifications.

Figure 2.4: The macro F1-scores of various models on validation set data containing both primary and metastatic cancer type samples.

Cancer Subtype

The F1-scores presented in Figure 2.5 range from 0.80281 for the single task ("Subtype Only") model to 0.812746 for a multi-task model missing the disease state learning task ("No Disease State"). This represents a performance improvement of 0.009936 over the single task model. The performance of the "All Tasks" model is approximately in the middle of the other models with an F1-score of 0.806287. The performance decrease between the "All Tasks" model and the best performing model is 0.006459. Note that the "All Tasks" model had the largest standard deviation and the "No Cancer Type" model had the smallest.

Figure 2.5: The macro F1-scores of various models on validation set data containing both primary and metastatic cancer subtype samples.

Discussion of Task-Dependent Performance

While the "All Tasks" model did not provide the best performance in any of the classification categories, the difference in performance was relatively small. In each classification category, the "All Tasks" model's mean performance was within an F1-score of 0.01 of the best performing set of tasks (the differences reported above are 0.003252, 0.001835, 0.000774, and 0.006459). The inclusion of all four tasks within the "All Tasks" model provides a greater opportunity to leverage more fine-grained information in down-stream analyses than it would if tasks were removed.
Given that the classification performance is similar between all sets of tasks, the additional information gained from including all of the learning tasks justifies the slight loss of potential performance, and thus the "All Tasks" model can be used for further analysis.

Chapter 3

Classification of Cancers from Transcriptome Data

The following results were obtained from a single trained model. The first section presents the results from the mixed primary, metastatic, and normal data that comprise the held-out test set as described in Chapter 2. The second section presents results using the external metastatic-only data set derived from POG data. Where an F1-score is reported, it is the macro F1-score, in which each class is weighted equally in the calculation of the score regardless of class size.

3.1 Results: Mixed Held-Out Test Set

3.1.1 Organ System of Origin

The organ system of origin classification scores can be seen in Table 3.2 and graphically in Figure 3.1. The F1-score was above 0.95 for all organ systems, with the poorest performer being the soft tissue class. The soft tissue class was misclassified as thoracic and gastrointestinal at a rate of approximately 2% and 3% respectively. The misclassification of organ systems can be seen in Figure 3.2.

Organ System of Origin  Precision  Recall  F1-score  Support
Breast  1  1  1  190
Central Nervous System  1  1  1  154
Endocrine  0.993  0.993  0.993  151
Gastrointestinal  0.996  0.992  0.994  261
Gynecologic  0.97  0.985  0.978  133
Head and Neck  0.98  0.99  0.985  98
Hematologic  0.978  1  0.989  91
Skin  1  0.986  0.993  73
Soft Tissue  0.976  0.932  0.953  44
Thoracic  0.986  0.986  0.986  216
Urologic  0.987  0.984  0.986  312
accuracy  0.99  0.99  0.99  0.99
macro avg  0.988  0.986  0.987  1723
weighted avg  0.99  0.99  0.99  1723

Table 3.2: The precision, recall, F1-score, and support for each organ system of origin class with testing conducted using the mixed held-out test set.

Figure 3.1: The macro F1-scores of each organ system of origin when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.2: A confusion matrix depicting the organ system of origin classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.

3.1.2 Disease State

The disease state classification scores can be seen in Table 3.4 and graphically in Figure 3.3. Each disease state had an F1-score above 0.95, with the poorest performer being the normal class. Referring to Figure 3.4, we can see that the normal class was misclassified as the primary cancer class at a rate of approximately 5%.

Disease State  Precision  Recall  F1-score  Support
Metastatic  1  1  1  41
Normal  0.98  0.934  0.957  106
Primary  0.996  0.999  0.997  1576
accuracy  0.995  0.995  0.995  0.995
macro avg  0.992  0.978  0.985  1723
weighted avg  0.995  0.995  0.995  1723

Table 3.4: The precision, recall, F1-score, and support for each disease state class with testing conducted using the mixed held-out test set.

Figure 3.3: The macro F1-scores of each disease state when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.4: A confusion matrix depicting the disease state classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.

3.1.3 Cancer Type

The classification performance of the cancer type task resulted in a total F1-score of 0.885 and an accuracy of 96.5% across all 68 types. The F1-scores for the classification of the primary, metastatic, and normal types individually were 0.964, 0.683, and 0.925 respectively (see Figure 3.5).

Figure 3.5: The macro F1-scores comparing the classification performance of cancer type and cancer subtype samples broken down by disease state as tested on the held-out mixed test set.

The F1-scores, precision, and recall of each cancer type can be seen in Table 3.6, and the class-wise F1-scores are presented graphically in Figure 3.6. The classification performance decreases along with the number of training samples. Unlike the subtype classes (Section 3.1.4), there are no outliers from the trend observed in Figure 3.6. However, there are four cancer types with an F1-score of 0.0. The cancer type classification accuracy and the predicted classes of misclassified types can be seen in the confusion matrix in Figure 3.7. The four cancer types with F1-scores of 0 and the source of their misclassifications are presented below:

ESCA T Metastatic

The ESCA T Metastatic type was completely misclassified as CHOL T Metastatic. Note that only a single test sample was available for this class.

HNSC T Metastatic

The HNSC T Metastatic type was completely misclassified as PAAD T Metastatic. Note that only a single test sample was available for this class.

LIHC T Metastatic

The LIHC T Metastatic type was completely misclassified as CHOL T Metastatic. Note that only a single test sample was available for this class.

NET T Metastatic

The NET T Metastatic type was completely misclassified as PRAD T Metastatic. Note that only a single test sample was available for this class.

Cancer Type  Precision  Recall  F1-score  Support
ACC T Metastatic  1  1  1  1
ACC T Tumor  1  1  1  12
ALL T Metastatic  1  0.5  0.667  2
BLCA N Normal  1  1  1  3
BLCA T Metastatic  1  1  1  2
BLCA T Tumor  0.951  0.951  0.951  61
BRCA N Normal  1  1  1  17
BRCA T Metastatic  1  1  1  8
BRCA T Tumor  1  1  1  165
CESC T Tumor  0.907  0.867  0.886  45
CHOL N Normal  1  1  1  1
CHOL T Metastatic  0.5  1  0.667  2
CHOL T Tumor  0.714  1  0.833  5
CLL T Tumor  1  1  1  4
CML T Tumor  1  1  1  15
COADREAD N Normal  1  1  1  8
COADREAD T Metastatic  1  1  1  1
COADREAD T Tumor  1  0.966  0.982  58
DLBC T Tumor  1  1  1  26
ESCA T Metastatic  0  0  0  1
ESCA T Tumor  0.92  0.92  0.92  25
FL T Tumor  0.875  1  0.933  7
GBM T Tumor  0.97  0.97  0.97  33
HNSC N Normal  0.875  1  0.933  7
HNSC T Metastatic  0  0  0  1
HNSC T Tumor  0.973  0.936  0.954  78
KICH N Normal  0.667  1  0.8  4
KICH T Tumor  0.769  1  0.87  10
KIRC N Normal  1  0.909  0.952  11
KIRC T Tumor  0.951  0.963  0.957  80
KIRP N Normal  1  0.8  0.889  5
KIRP T Tumor  0.95  0.864  0.905  44
LAML T Metastatic  0.5  1  0.667  1
LAML T Tumor  1  1  1  18
LGG T Tumor  0.988  0.988  0.988  80
LIHC N Normal  1  1  1  7
LIHC T Metastatic  0  0  0  1
LIHC T Tumor  1  0.964  0.982  56
LUAD N Normal  0.8  0.889  0.842  9
LUAD T Metastatic  1  1  1  1
LUAD T Tumor  0.927  0.974  0.95  78
LUSC N Normal  0.857  0.75  0.8  8
LUSC T Tumor  0.932  0.92  0.926  75
MB-Adult T Tumor  1  1  1  41
MESO T Tumor  1  1  1  45
NET T Metastatic  0  0  0  1
OV T Metastatic  1  1  1  2
OV T Tumor  1  1  1  46
PAAD T Metastatic  0.5  1  0.667  1
PAAD T Tumor  0.964  1  0.982  27
PCPG T Tumor  1  1  1  28
PRAD N Normal  0.7  0.875  0.778  8
PRAD T Metastatic  0.9  1  0.947  9
PRAD T Tumor  0.986  0.96  0.973  75
SARC T Metastatic  1  1  1  5
SARC T Tumor  1  0.974  0.987  39
SKCM T Metastatic  1  1  1  2
SKCM T Tumor  1  0.986  0.993  71
STAD N Normal  1  0.8  0.889  5
STAD T Tumor  0.954  0.984  0.969  63
TGCT T Tumor  1  1  1  23
THCA N Normal  1  1  1  9
THCA T Tumor  1  1  1  77
THYM T Tumor  1  1  1  18
UCEC N Normal  1  1  1  4
UCEC T Tumor  0.862  0.926  0.893  27
UCS T Tumor  0.889  0.889  0.889  9
UVM T Tumor  1  1  1  12
accuracy  0.965  0.965  0.965  0.965
macro avg  0.879  0.905  0.885  1723
weighted avg  0.966  0.965  0.965  1723

Table 3.6: The precision, recall, F1-score, and support for each cancer type class with testing conducted using the mixed held-out test set.

Referring to Figure 3.7, we can see that the majority of types were accurately classified, with a few exceptions. The most poorly performing classes are noted above; however, as with the cancer subtype classes, we also observe misclassifications within the normal lung and kidney tissue classes. We can also see some significant misclassifications of ALL T Metastatic and STAD N Normal.

ALL T Metastatic

ALL T Metastatic consisted of only two test samples. One of these samples was misclassified as LAML T Tumor.

STAD N Normal

STAD N Normal was misclassified as ESCA T Tumor in 20% of the samples.

Figure 3.6: The macro F1-scores of each cancer type when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.7: A confusion matrix depicting the cancer type classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.

3.1.4 Cancer Subtype

The classification performance of the cancer subtype task resulted in a total F1-score of 0.831 and an accuracy of 93.3% across all 91 subtypes. The F1-scores for the classification of the primary, metastatic, and normal subtypes individually were 0.851, 0.704, and 0.927 respectively (see Figure 3.5). The F1-scores, precision, and recall of each subtype can be seen in Table 3.8, and the class-wise F1-scores are presented graphically in Figure 3.8.

Cancer Subtype  Precision  Recall  F1-score  Support
ACC T Metastatic  1  1  1  1
ACC T Tumor  1  1  1  12
ALL T Metastatic  1  1  1  2
BLCA N Normal  1  1  1  3
BLCA T Metastatic  0.667  1  0.8  2
BLCA T Tumor  0.952  0.967  0.959  61
BRCA Basal T Tumor  0.852  0.885  0.868  26
BRCA HER2like Tumor  0.818  0.75  0.783  12
BRCA IDC T Metastatic  0.857  0.857  0.857  7
BRCA ILC T Metastatic  0  0  0  1
BRCA LuminalA T Tumor  0.819  0.951  0.88  81
BRCA LuminalB T Tumor  0.724  0.677  0.7  31
BRCA N Normal  1  1  1  17
BRCA T Tumor  0.667  0.267  0.381  15
CESC CAD T Tumor  1  0.571  0.727  7
CESC SCC T Tumor  0.921  0.921  0.921  38
CHOL EHCH T Metastatic  0  0  0  1
CHOL IHCH T Metastatic  0.5  1  0.667  1
CHOL N Normal  1  1  1  1
CHOL T Tumor  0.714  1  0.833  5
CLL T Tumor  1  1  1  4
CML T Tumor  1  1  1  15
COADREAD N Normal  1  1  1  8
COADREAD T Metastatic  1  1  1  1
COADREAD T Tumor  1  0.983  0.991  58
DLBC BM T Tumor  1  1  1  2
DLBC T Tumor  1  0.958  0.979  24
ESCA EAC T Metastatic  0  0  0  1
ESCA EAC T Tumor  0.889  0.889  0.889  9
ESCA SCC T Tumor  0.824  1  0.903  14
ESCA T Tumor  0  0  0  2
FL T Tumor  0.778  1  0.875  7
GBM T Tumor  1  0.97  0.985  33
HNSC N Normal  0.875  1  0.933  7
HNSC T Metastatic  0  0  0  1
HNSC T Tumor  0.974  0.949  0.961  78
KICH N Normal  0.8  1  0.889  4
KICH T Tumor  0.75  0.9  0.818  10
KIRC N Normal  1  1  1  11
KIRC T Tumor  0.939  0.963  0.951  80
KIRP N Normal  1  0.8  0.889  5
KIRP T Tumor  0.95  0.864  0.905  44
LAML T Metastatic  1  1  1  1
LAML T Tumor  1  1  1  18
LGG T Tumor  0.988  1  0.994  80
LIHC N Normal  1  1  1  7
LIHC T Metastatic  1  1  1  1
LIHC T Tumor  1  0.964  0.982  56
LUAD N Normal  0.778  0.778  0.778  9
LUAD T Metastatic  1  1  1  1
LUAD T Tumor  0.938  0.974  0.956  78
LUSC N Normal  0.75  0.75  0.75  8
LUSC T Tumor  0.958  0.92  0.939  75
MB Group3 T Tumor  1  1  1  6
MB Group4 T Tumor  1  1  1  10
MB SHH T Tumor  1  1  1  20
MB WNT T Tumor  1  1  1  5
MESO T Tumor  1  1  1  45
OV T Metastatic  0.667  1  0.8  2
OV T Tumor  1  1  1  46
PAAD T Metastatic  1  1  1  1
PAAD T Tumor  1  1  1  27
PCPG T Tumor  1  1  1  28
PRAD N Normal  0.7  0.875  0.778  8
PRAD T Metastatic  0.9  1  0.947  9
PRAD T Tumor  0.986  0.96  0.973  75
PrNET T Metastatic  0  0  0  1
SARC DDL T Tumor  0.857  0.667  0.75  9
SARC LMS T Metastatic  1  1  1  1
SARC LMS T Tumor  0.722  0.813  0.765  16
SARC MFS T Tumor  0.5  0.75  0.6  4
SARC MPNST T Tumor  0  0  0  1
SARC Synovial T Tumor  1  1  1  1
SARC T Metastatic  1  1  1  4
SARC UPS T Tumor  0.333  0.25  0.286  8
SKCM T Metastatic  1  1  1  2
SKCM T Tumor  1  0.986  0.993  71
STAD CIN T Tumor  0.813  0.813  0.813  32
STAD EBV T Tumor  0.714  1  0.833  5
STAD GS T Tumor  0.692  0.818  0.75  11
STAD MSI T Tumor  0.75  0.818  0.783  11
STAD N Normal  1  0.8  0.889  5
STAD T Tumor  0  0  0  4
TGCT T Tumor  1  1  1  23
THCA N Normal  1  1  1  9
THCA T Tumor  1  1  1  77
THYM T Tumor  1  1  1  18
UCEC N Normal  1  1  1  4
UCEC T Tumor  0.893  0.926  0.909  27
UCS T Tumor  1  1  1  9
UVM T Tumor  1  1  1  12
accuracy  0.933  0.933  0.933  0.933
macro avg  0.826  0.846  0.831  1723
weighted avg  0.929  0.933  0.929  1723

Table 3.8: The precision, recall, F1-score, and support for each cancer subtype class with testing conducted using the mixed held-out test set.

Figure 3.8: The macro F1-scores of each cancer subtype when testing on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.
Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

The observed trend is that the F1-scores decrease as the number of training samples per class decreases (from left to right in Figure 3.8). The classification performance is generally better with a larger number of training samples. The largest outliers from this trend are the primary breast carcinoma (BRCA T Tumor) and primary undifferentiated pleomorphic sarcoma (SARC UPS T Tumor) subtypes with F1-scores of 0.381 and 0.286 respectively. There are eight subtypes with F1-scores of 0.0, each of which had fewer than 20 training examples. The cancer subtype classification accuracy and the predicted classes of misclassified subtypes can be seen in the confusion matrix in Figure 3.9. The eight subtypes with F1-scores of 0 and the source of their misclassifications are presented below:

STAD T Tumor

STAD T Tumor was misclassified completely with the model predicting STAD CIN T Tumor, STAD GS T Tumor, and STAD MSI T Tumor instead. Nearly half of the STAD T Tumor samples were misclassified as STAD CIN T Tumor.

ESCA T Tumor

ESCA T Tumor was misclassified in half the samples as ESCA SCC T Tumor and the other half as STAD EBV T Tumor.

SARC MPNST T Tumor

SARC MPNST T Tumor was misclassified completely as LUAD T Tumor. Note that there is only a single test sample for this subtype.

BRCA ILC T Metastatic

BRCA ILC T Metastatic was completely misclassified as BRCA IDC T Metastatic. Note that there is only a single test sample for this subtype.

CHOL EHCH T Metastatic

CHOL EHCH T Metastatic was misclassified completely as CHOL IHCH T Metastatic. Note that there is only a single test sample for this subtype.

ESCA EAC T Metastatic

ESCA EAC T Metastatic was misclassified completely as OV T Metastatic. Note that there is only a single test sample for this subtype.

HNSC T Metastatic

HNSC T Metastatic was misclassified completely as BLCA T Metastatic.
Note that there is only a single test sample for this subtype.

PrNET T Metastatic

PrNET T Metastatic was misclassified completely as PRAD T Metastatic. Note that there is only a single test sample for this subtype.

Figure 3.9: A confusion matrix depicting the cancer subtype classification performance on the held-out test set containing primary cancer, metastatic cancer, and normal tissue samples.

Referring to Figure 3.9, we can see that the majority of cancer subtypes were accurately classified, with a few exceptions. The most poorly performing classes are noted above; however, there were also misclassifications observed within the normal lung and kidney tissues and the sarcoma, stomach adenocarcinoma, and breast adenocarcinoma subtypes.

Normal Tissues

The normal tissues for the lungs and kidneys all see some misclassifications between their respective, related normal counterparts. For example, normal LUAD is misclassified as normal LUSC and normal LUSC is misclassified as normal LUAD.

STAD Subtypes

The primary stomach adenocarcinoma without a subtype annotation (STAD T Tumor) resulted in the largest error rate. The model completely misclassified all four test samples as either the GS, CIN, or MSI primary cancer subtypes. There were other misclassifications between STAD subtypes that can be observed in Figure 3.9. One notable observation is that the normal STAD subtype had one of five (20%) test samples incorrectly classified as ESCA SCC T Tumor. This is one of two normal classes that had a misclassification as a primary cancer, with the other misclassification being PRAD (one sample called as primary PRAD).

BRCA Subtypes

The largest primary cancer offender was the primary breast cancer class without a subtype annotation (BRCA T Tumor). It consisted of 15 test samples and was misclassified 75% of the time as a mixture of the other primary cancer subtypes. As mentioned above, BRCA ILC T Metastatic was classified incorrectly as BRCA IDC T Metastatic for every test sample.

SARC Subtypes

Sarcoma subtypes had misclassifications observed within the primary LMS, DDL, MFS, and MPNST subtypes. As mentioned above, MPNST was completely misclassified. The UPS subtype was the only subtype in which the majority of its samples were classified incorrectly. All of the other subtypes had fewer than half of their samples misclassified.

ESCA Subtypes

The primary ESCA subtypes generally had much better classification performance than the BRCA, STAD, and SARC subtypes. The poorest performer was again the class without a subtype annotation. Primary ESCA EAC had a single sample misclassified as STAD CIN, and the metastatic ESCA EAC was completely misclassified (as mentioned above).

MB Subtypes

The medulloblastoma subtypes were all classified with perfect accuracy and F1-scores.

3.2 Discussion: Held-out Test Set Classification

All of the misclassifications that occurred within the held-out test set remained within the same disease type. We do not see any primary cancer samples classified as metastatic cancers or vice versa. Even when a sample is classified to a subtype in a completely different organ system, like with the DLBC T Metastatic subtype, it is misclassified as another metastatic cancer. This implies that the model has learned to distinguish differences in the expression patterns of metastatic samples when compared to primary ones. We will see, however, that this does not hold true for samples from the POG test set.

3.2.1 Normal Tissue

Cross-calling was observed within the normal lung and kidney classes in both the cancer type and subtype tasks. Cross-calling in this context refers to two or more classes that are misclassified as each other in at least some of the samples. This cross-calling was expected behaviour within the normal classes and serves as a sanity check of sorts. The normal kidney and lung classes are labelled based on the adjacent tumour type/subtype. For example, a patient with a primary LUSC tumour (LUSC T Tumor) would have their tumour-adjacent normal lung sample labelled as LUSC N Normal. A patient with primary LUAD (LUAD T Tumor) would have their normal lung sample labelled as LUAD N Normal. So, while these patients' normal lung samples have different labels, they are both normal lung tissue and could have been given identical class labels. Keeping these labels separate by type/subtype allows us to confirm that the model is indeed learning the underlying biology of the samples. Seeing cross-calling between normals of the same tissue (i.e., lung or kidney) indicates that the model has learned what characterizes these tissues and is interchanging their labels as a result.

3.2.2 Complete Misclassifications

The misclassifications of greatest concern are those in which a class was completely misclassified with no apparent biological underpinning (see Sections 3.1.3 and 3.1.4). We saw examples of these using both the held-out and external test sets on the cancer type and cancer subtype learning tasks. In the held-out test set on the cancer type task, we saw four examples of total misclassifications. Each of these was a metastatic cancer type consisting of only a single sample. With such small sample sizes, these results may be aberrations and are not conclusive enough to be considered outright failures of the learning task. For the cancer subtype task, however, we had more test samples with which to assess the poor performance of the model. We did again see metastatic subtypes that contained only single samples, so I will exclude these from further discussion. I will also exclude the SARC MPNST T Tumor subtype as there was also only a single test sample for this class. The two remaining multi-sample complete misclassifications (STAD T Tumor and ESCA T Tumor) were for subtypes that did not have a subtype annotation. It is not entirely clear what subtype would be the correct annotation for these samples, because TCGA subtyping excludes some samples outright and, in other cases, does not assign a subtype when a sample does not match any of the defined subtype annotations entirely [66, 67, 68, 69]. Further in-depth analysis of each sample and the TCGA subtyping protocols may reveal further information pertaining to these samples and may help to clarify the model's performance on them.

3.2.3 Cancer Type and Subtype Performance Comparison by Disease State

Figure 3.5 illustrates that primary cancers were better classified within the cancer type task than within the cancer subtype task. The metastatic cancers were classified better within the cancer subtype task, and the normal samples were classified with similar performance in both tasks. It is important to note that there are very few metastatic subtypes when compared to primary subtypes. As a result, the class sizes decrease more within the primary cancers when moving from cancer type to subtype, which may have contributed to the reduced classification performance seen for the primary cancers within the subtype task.

3.3 Results: Metastatic-Only External (POG) Test Set

3.3.1 Organ System of Origin

The classification F1-scores of the model can be seen in Table 3.10 and Figure 3.10. A confusion matrix depicting the rate and subject of classification errors can be seen in Figure 3.11.

Overall, the performance of the organ system classifications on the POG test set resulted in an F1-score that is 0.095 less than we saw with the held-out test set. The F1-scores range from 0.6 to 1.0 (Figure 3.10). The poorest performing class is head and neck, and the two best performing classes (F1-scores of 1.0) are endocrine and hematologic. The head and neck class was misclassified approximately 20% of the time as either gynecologic or soft tissue (Figure 3.11).

Organ System of Origin  Precision  Recall  F1-score  Support
Breast  0.964  1  0.982  134
Endocrine  1  1  1  6
Gastrointestinal  0.988  1  0.994  163
Gynecologic  0.962  0.735  0.833  34
Head and Neck  0.6  0.6  0.6  5
Hematologic  1  1  1  2
Skin  0.933  1  0.966  14
Soft Tissue  0.959  0.839  0.895  56
Thoracic  0.889  0.909  0.899  44
Urologic  0.6  1  0.75  3
macro avg  0.889  0.908  0.892  461
weighted avg  0.958  0.948  0.951  461

Table 3.10: The precision, recall, F1-score, and support for each organ system of origin class with testing conducted using the metastatic-only external (POG) test set.

Figure 3.10: The macro F1-scores of each organ system of origin when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.11: A confusion matrix depicting the organ system of origin classification performance on the metastatic-only external (POG) test set.

3.3.2 Disease State

The classification performance of this learning task was the greatest of the four tasks. Out of the 495 test samples, 461 were correctly classified and resulted in an F1-score of 0.886 (see Table 3.12). This is a decrease in the F1-score of 0.114 when compared to the metastatic classification performance with the mixed held-out test set.

Disease State  Precision  Recall  F1-score  Support
Metastatic  1  0.796  0.886  461
macro avg  1  0.796  0.886  461
weighted avg  1  0.796  0.886  461

Table 3.12: The precision, recall, F1-score, and support for each disease state class with testing conducted using the metastatic-only external (POG) test set.

Figure 3.12: The macro F1-scores of each disease state when testing on the metastatic-only external (POG) test set.
Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.13: A confusion matrix depicting the disease state classification performance on the metastatic-only external (POG) test set.

3.3.3 Cancer Type

The classification performance of the models on the metastatic-only test set resulted in an overall F1-score of 0.761 (Table 3.14). The F1-scores, precision, and recall of each class can be seen in Table 3.14 and the F1-scores are presented visually in Figure 3.14. A confusion matrix presenting the predicted classes can be seen in Figure 3.15. Overall, the classification performance on the cancer type learning task using the metastatic-only test set resulted in an F1-score that was lower by 0.124 when compared to the results obtained using the held-out set. We also see no clear downward trend of classification performance as a function of training class size like we saw with the held-out test set (Figures 3.6 and 3.14).

The F1-scores ranged from 0.0 to 1.0, with LAML T Metastatic being completely misclassified and the ACC T Metastatic and PRAD T Metastatic classes being classified with complete accuracy. Referring to Figure 3.15, we observe that seven primary tumour classes and one normal class were incorrectly predicted. The LUAD T Metastatic class is observed to have the largest number of incorrectly predicted classes, though it still maintained an F1-score of 0.725.

We can also observe a large portion of gastric cancers being misclassified as another gastric cancer.
PAAD T Metastatic samples were predicted with high frequency as CHOL T Metastatic or ESCA T Metastatic, and COADREAD T Metastatic was frequently predicted incorrectly as ESCA T Metastatic.

Cancer Type  Precision  Recall  F1-score  Support
ACC T Metastatic  1  1  1  6
BRCA T Metastatic  0.978  0.993  0.985  134
CHOL T Metastatic  0.176  1  0.3  3
COADREAD T Metastatic  0.957  0.788  0.865  85
HNSC T Metastatic  0.8  0.8  0.8  5
LAML T Metastatic  0  0  0  2
LIHC T Metastatic  0.75  1  0.857  3
LUAD T Metastatic  1  0.568  0.725  44
OV T Metastatic  1  0.824  0.903  34
PAAD T Metastatic  0.822  0.514  0.632  72
PRAD T Metastatic  1  1  1  3
SARC T Metastatic  0.94  0.839  0.887  56
SKCM T Metastatic  0.875  1  0.933  14
macro avg  0.792  0.794  0.761  461
weighted avg  0.933  0.803  0.852  461

Table 3.14: The precision, recall, F1-score, and support for each cancer type class with testing conducted using the metastatic-only external (POG) test set.

Figure 3.14: The macro F1-scores of each cancer type when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

Figure 3.15: A confusion matrix depicting the cancer type classification performance on the metastatic-only external (POG) test set.

3.3.4 Cancer Subtype

The classification performance of the models on the metastatic-only data set resulted in an F1-score of 0.683 and an accuracy of 67.0%.
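To make the distinction between the macro and support-weighted averages concrete, the sketch below recomputes both from the per-class F1-scores and supports listed for this task in Table 3.16. It is an arithmetic check only, not the evaluation code used in this work; the dictionary and variable names are illustrative assumptions.

```python
# (F1-score, support) for each cancer subtype class, copied from Table 3.16
# (metastatic-only external POG test set).
table_3_16 = {
    "ACC T Metastatic": (1.0, 6),
    "BRCA IDC T Metastatic": (0.784, 125),
    "BRCA ILC T Metastatic": (0.0, 9),
    "CHOL IHCH T Metastatic": (0.273, 3),
    "COADREAD T Metastatic": (0.837, 85),
    "HNSC T Metastatic": (0.727, 5),
    "LAML T Metastatic": (0.0, 2),
    "LIHC T Metastatic": (1.0, 3),
    "LUAD T Metastatic": (0.778, 44),
    "OV T Metastatic": (0.889, 34),
    "PAAD T Metastatic": (0.593, 72),
    "PRAD T Metastatic": (1.0, 3),
    "SARC LMS T Metastatic": (0.72, 11),
    "SARC T Metastatic": (0.709, 45),
    "SKCM T Metastatic": (0.933, 14),
}

f1_scores = [f1 for f1, _ in table_3_16.values()]
total_support = sum(n for _, n in table_3_16.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in table_3_16.values()) / total_support

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
print(f"support:     {total_support}")
```

The macro value reproduces the 0.683 quoted above, while the weighted value is higher because the small classes with F1-scores of 0.0 carry little weight once support is taken into account.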
The F1-scores, precision, and recall of each class can be seen in Table 3.16 and the F1-scores are presented visually in Figure 3.16.

Cancer Subtype  Precision  Recall  F1-score  Support
ACC T Metastatic  1  1  1  6
BRCA IDC T Metastatic  0.897  0.696  0.784  125
BRCA ILC T Metastatic  0  0  0  9
CHOL IHCH T Metastatic  0.158  1  0.273  3
COADREAD T Metastatic  0.941  0.753  0.837  85
HNSC T Metastatic  0.667  0.8  0.727  5
LAML T Metastatic  0  0  0  2
LIHC T Metastatic  1  1  1  3
LUAD T Metastatic  1  0.636  0.778  44
OV T Metastatic  0.966  0.824  0.889  34
PAAD T Metastatic  0.889  0.444  0.593  72
PRAD T Metastatic  1  1  1  3
SARC LMS T Metastatic  0.643  0.818  0.72  11
SARC T Metastatic  0.824  0.622  0.709  45
SKCM T Metastatic  0.875  1  0.933  14
macro avg  0.724  0.706  0.683  461
weighted avg  0.879  0.67  0.75  461

Table 3.16: The precision, recall, F1-score, and support for each cancer subtype class with testing conducted using the metastatic-only external (POG) test set.

Figure 3.16: The macro F1-scores of each cancer subtype when testing on the metastatic-only external (POG) test set. Classes are ordered from left to right by the number of training samples available with colours representing bins of 20 samples.

In contrast to the held-out test set results, there appears to be no downward trend relating the number of training samples to classification performance. However, eight of the 15 classes have fewer than 20 training samples, with two of the three worst performing classes being in this category (depicted in dark blue in Figure 3.16). The majority of classes (12 of 15 subtypes) achieved F1-scores between 0.593 and 1.0, with the exception of BRCA ILC T Metastatic, LAML T Metastatic, and CHOL IHCH T Metastatic, which obtained F1-scores of 0.0, 0.0, and 0.273 respectively. However, there were three subtypes that had perfect classification accuracy: ACC T Metastatic, LIHC T Metastatic, and PRAD T Metastatic.

Looking at the predicted classes in Figure 3.17, we see a number of misclassifications that span disease types.
There are 15 primary tumour subtypes called and one normal.The largest offending class for misclassifying samples as primary cancer subtypes was theSARC T Metastatic class with 5 primary subtypes called.Figure 3.17: A confusion matrix depicting the disease state classification performance on themetastatic-only external (POG) test set.86Figure 3.17 also shows a much higher incidence of gastric cancers being classified as othergastric cancers than we observed in the held-out test set. We can see COADREAD T Metastaticwas called as ESCA T Tumor and PAAD T Metastatic called as CHOL IHCH T Metastatic,CHOL EHCH T Metastatic, ESCA EAC T Metastatic, and PAAD T Tumor cancer.Generally, the overall classification performance observed with the external, metastatic-only test set is worse than the performance seen with the mixed held-out test set. Thespread of subtype misclassications also generally spans more incorrect subtypes with greaterfrequency and variety than the misclassifications within the held-out set.3.4 Discussion: POG Test Set ClassificationAside from the decrease in classification performance using the metastatic-only test setwhen compared to the held-out test set, there are two other aberrations. The first is thenumber of predictions of classes outside the same disease type as the true class. The sec-ond aberration is that there is a larger range of incorrect classes predicted per test classthan testing on the held-out set. On the POG test set, the cancer subtype learning taskhad 15 primary tumour classes predicted and one normal, whereas on the held-out test setwe saw no classifications of metastatic cancers as primary or normal classes. To the sec-ond point, the metastatic-only test set had classes that predicted 9 or 10 different classes(SARC T Metastatic and BRCA IDC T Metastatic) whereas the held-out set only had atmost 2 predicted classes within the metastatic subtypes (BRCA IDC T Metastatic). 
Thesetwo aberrations are significant because they indicate the uncertainty with some sampleswas so great that it overcame the information learned by the model at upstream learningtasks. For example, we can observe in Figure 3.17 that approximately one quarter of themetastatic LUAD samples were misclassified as primary LUAD. This same classification er-ror is repeated at the cancer type level. However, the amount of misclassification seen withinthe disease state task does not seem to indicate the same level of confusion as fewer than7% of all samples were classified as primary cancer. This implies that the model was able87to correctly identify many samples as metastatic, but that in downstream learning tasks,the level of uncertainty was high enough to outweigh the information passed from upstreamlayers. This is an effect not seen using the held-out test set even within the metastaticsamples.What these aberrations imply is that there is something about the external test set thatis making it more difficult for the model to make accurate predictions. A potential cause forthis is that the external test set data was derived at a different facility than the data usedin training. There are differences in the preparation and sequencing process that may haveproduced variations in the output [60, 62]. It is possible the model is suffering the negativeconsequences of batch effect. There are two ways to potentially mitigate this effect in thecontext of metastatic samples. The first would be to include a larger variety of trainingsamples from different facilities and sequencing protocols and the second would be to applybatch correction to the data prior to training. Essentially the model is, at least to someextent, fitting to noise in the data produced by the sequencing process (ie. 
batch effect).

3.5 Discussion: Summary

As the granularity of the classification task increases with respect to biological complexity, we see a coinciding decrease in the classification performance of the model. As such, the cancer subtype learning task resulted in the poorest classification performance. In addition to the increased complexity of this learning task relative to the other tasks, the model is also faced with smaller class sizes for each subtype. The class size reduction is a result of increasing label granularity while maintaining the size of the training data. The combination of these two factors produces smaller classes and thus reduces the training data available for the affected classes. In general, with all else being equal and over-fitting being carefully managed, a smaller training data set will tend to result in poorer performance than a larger one. We can speculate that having access to more subtype data could mitigate the effect of increased classification task complexity.

The ability of the model to learn each task and each class was not uniform. As we saw above, some classes, particularly within the cancer type and subtype tasks, were classified with complete accuracy, while others were entirely misclassified. One contributing factor to this outcome is the class distribution. The training data ranges from classes of size 6 (Primary PrNET) to classes of size 532 (Primary KIRC). The impact of class distribution is well studied in machine learning and there are techniques to overcome it [70, 71, 72]. However, these are often unable to fully mitigate the effect of class imbalance, and in the domain of this thesis the problem was compounded by the high dimensionality of the data. High dimensionality in machine learning is defined as having a much larger set of input features than the number of input samples for a given data set [73]. The effect of high dimensionality as it relates to machine learning often hinges on the complexity of the model. 
Complex models, like large neural networks, that perform well in a high-dimensional domain are quite often over-fitting to certain features of the data and suffer a reduced ability to generalize during prediction when compared with simpler models [73, 74]. By coupling high dimensionality and large class imbalances together, we have two factors that play a significant part in the poor performance of the model on the minority classes. Further experimentation with techniques that address class imbalance and with feature selection may improve the performance of this model going forward. If feature selection is considered, however, we would have to carefully consider its impact on post-classification analysis.

Within the held-out test set results, the majority of subtypes had the bulk of their samples correctly classified. With the exception of medulloblastomas, the model shows room for improvement among subtype classifications. We saw particularly poor performance in the cancer subtype task for the classes that lacked subtype annotations. The TCGA data set excludes subtype annotation for samples for a number of reasons, including sample duplication, unknown subject identity, low DNA/RNA yield, unacceptable histology, or failed pathology review, among others [67]. These criteria are not fully explained and may differ between cancer types. The ambiguous nature of the underlying biology of these classes, coupled with their mixed classifications, suggests that the model may be improved by excluding these classes in future iterations of training in order to better train the model to recognize subtypes. By excluding these classes in the future, we would remove one source of ambiguity for the model as well as improve our ability to analyse the model's true subtype learning performance.

The classification performance of the model is generally satisfactory across all learning tasks in both mixed and metastatic-only domains. 
There is certainly room for improvement, but the classification performance is higher than we would expect to see using random class assignment. Because metastatic cancers suffer from the highest misdiagnosis rates and also showed the poorest classification performance with the neural network model in this thesis, they provide a good basis for comparing the model's efficacy with current diagnostic practices. Studies have indicated that the misdiagnosis rates of metastatic cancers using standard pathological analysis can range from 45% to 94% [61, 74]. The model used in this thesis has shown performance that improves upon this misdiagnosis rate, achieving an F1-score of 0.724 at its poorest performing classification task (cancer subtype using the external test set). Furthermore, according to a 2010 literature review by Anderson and Weiss, correct tissue identification from metastatic-only samples was 65.6% using immunohistochemical analysis [75]. Again, the performance of the neural network model exceeds this, achieving an F1-score of 1.0 on a similar diagnostic task (organ system of origin classification from metastatic-only samples).

The model's overall classification performance indicates that we can, with some confidence, state that it has learned to represent the biology of, and predict the correct class for, a variety of cancers. With sufficient performance established, it is viable to use the trained model to extract and analyze the important features learned by the model in search of biological insights. This is the focus of the remainder of this thesis, beginning with Chapter 4.

Chapter 4

DeepLift Analysis

4.1 Methods

4.1.1 DeepLift

After creating the fully trained multi-task neural network model described in the previous chapters, DeepLift was run on the model to obtain a sample-gene importance score matrix. DeepLift was run sample-wise on each classification task separately. 
The samples provided to DeepLift were all of the samples in the training set used to train the model. This data is described in Section 2.1. Since DeepLift computes a score for each gene, for each sample, and for each class within a single classification task, the resulting output is n matrices for each classification task, where n is the number of classes within that classification task. For clarity, 91 matrices were created for the cancer subtype task because there are 91 cancer subtypes as possible output classifications of the model. Each matrix is of size s × g, where s is the number of samples and g is the number of genes. In this case, s = 9736 and g = 26668, resulting in matrices of size 9736 × 26668. In total, DeepLift produced 3 matrices of size 9736 × 26668 for the disease state task, 11 for the organ system of origin task, 68 for the cancer type task, and 91 for the cancer subtype task.

Due to the stochastic nature of neural network training (SGD and weight initialization), as well as the complexity of both the problem domain and the models themselves, training multiple iterations of the same model with the same data would most likely produce differing solutions for the classification tasks outlined in this thesis. For this reason, to improve the robustness of the results from DeepLift, five models were trained and DeepLift was run on each. Each model was trained using the same set of training data and hyperparameters as outlined in Section 2.2, and DeepLift was run on each model in the fashion described above. This resulted in five sets of matrices (one set for each model) corresponding to each classification task, with each matrix containing per-sample importance scores. These scores were then averaged within each matrix across all samples, giving five sets of averaged scores with a single value per gene for each class. The averaged scores were then averaged across all five models. 
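
As an illustration only, the two averaging steps described above (first across samples within each matrix, then across the five models) can be sketched with NumPy. The array layout, the function name average_importance, and the toy dimensions are assumptions made for this sketch and are not taken from the thesis code; the real matrices are 9736 × 26668 for each class.

```python
import numpy as np


def average_importance(scores_per_model):
    """Collapse per-sample importance scores into one vector per class.

    scores_per_model: list with one array per trained model, each shaped
    (n_classes, n_samples, n_genes). This layout is an assumption of the
    sketch, not the thesis's actual storage format.
    """
    # Step 1: average over samples within each class matrix -> (n_classes, n_genes)
    per_model = [np.asarray(s).mean(axis=1) for s in scores_per_model]
    # Step 2: average the per-model results over the models -> (n_classes, n_genes)
    return np.mean(per_model, axis=0)


# Toy stand-in for five models on a 3-class task (hypothetical sizes).
rng = np.random.default_rng(0)
n_models, n_classes, n_samples, n_genes = 5, 3, 4, 6
toy_scores = [rng.normal(size=(n_classes, n_samples, n_genes)) for _ in range(n_models)]

avg = average_importance(toy_scores)
print(avg.shape)

# Number of class matrices produced per model across the four tasks in the text.
n_matrices = 3 + 11 + 68 + 91
print(n_matrices)
```

Because every model here is scored on the same samples, averaging over samples and then over models gives the same values as a single mean over both axes; the two-step form simply mirrors the order used in the text.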
The final result is a single set of importance scores for each gene, for each of the classes across each classification task, derived from all five models. The final format of the data is 11, 3, 68, and 91 vectors of length 26668, corresponding to the classes of the organ system of origin, disease state, cancer type, and cancer subtype learning tasks respectively. In other words, we have importance scores for each gene that reflect how important that gene is for a positive classification of each class across all of the classification tasks.

4.1.2 Interpreting Gene Lists

For the following results and discussion, the list of important genes for each class excludes genes with importance scores at or below 0. This means that for each class, we are only considering the genes that were influential in making a positive classification of the class in question.

In order to gain insight into the functionality of gene lists, annotation of functionally enriched pathways was conducted using DAVID [76, 77]. The default settings were used in all cases. When considering enrichment scores in the results presented below, pathways or clusters were considered significant if they had an enrichment score > 1.3, as that is equivalent to a p-value < 0.05 [76].

4.1.3 Over and Underexpression Calculation

In the subsections to follow, some results are presented on the over and underexpression of genes. A gene was considered overexpressed when its mean RPKM value within the class in question was more than two standard deviations above the mean RPKM value across all other classes within the same classification task. Similarly, a gene was considered underexpressed if its mean RPKM value was more than two standard deviations below the mean across all other classes in the same task. 
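
A minimal sketch of the rule above, assuming that the mean and standard deviation for "all other classes" are taken over the pooled samples of those classes (the text does not say whether pooling is per sample or per class). The function name expression_flags, the gene names, and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd


def expression_flags(rpkm, labels, target, n_sd=2.0):
    """Flag over- and underexpressed genes for one class.

    rpkm:   DataFrame of RPKM values, rows = samples, columns = genes.
    labels: Series of class labels sharing rpkm's index.
    target: the class in question.
    """
    in_class = labels == target
    class_mean = rpkm[in_class].mean()   # per-gene mean within the class
    rest = rpkm[~in_class]               # pooled samples of all other classes
    rest_mean = rest.mean()
    rest_sd = rest.std()                 # pandas default: sample std (ddof=1)
    over = class_mean > rest_mean + n_sd * rest_sd
    under = class_mean < rest_mean - n_sd * rest_sd
    return over, under


# Toy data: six samples, three hypothetical genes, three classes.
rpkm = pd.DataFrame(
    {
        "GENE_A": [10.0, 11.0, 1.0, 2.0, 1.5, 2.5],
        "GENE_B": [5.0, 5.0, 5.1, 4.9, 5.0, 5.2],
        "GENE_C": [0.0, 0.1, 9.0, 10.0, 9.5, 10.5],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
labels = pd.Series(["X", "X", "Y", "Y", "Z", "Z"], index=rpkm.index)

over, under = expression_flags(rpkm, labels, target="X")

# Restrict the counts to genes with a positive importance score (Section 4.1.2).
importance = pd.Series({"GENE_A": 0.02, "GENE_B": -0.01, "GENE_C": 0.003})
positive = importance[importance > 0].index
n_over = int(over[positive].sum())
n_under = int(under[positive].sum())
print(len(positive), n_over, n_under)
```

With pooled samples, large classes dominate the reference mean and standard deviation, which is one way the skew towards majority classes could arise.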
The mean and standard deviation were calculated using the mean and std functions of the pandas Python package [63].

4.2 Results and Discussion

The results presented here were taken from the averaged, positively scored genes (identified by DeepLift) as described in the preceding section. It is important to note that the scale of the data available for analysis is large and the analysis conducted here is far from exhaustive. This section focuses primarily on larger trends, with some examples within specific classes. In addition, the results are divided into subsections and are interleaved with their relevant discussion for better readability.

The results have been examined in five different ways. Each of these exemplifies one aspect of the data that can be challenged and studied in further depth in future work. The first is examining how many important genes were identified for each class. The second is looking for any patterns of over or underexpression within the identified important genes. The third is examining the functionality of highly enriched pathways within different classes. The fourth and fifth analyses involve quantifying the role of RNA genes and pseudogenes within the important genes for various tasks and classes.

4.2.1 Validation of Results: Normal Tissues

It is important to ensure that the model is indeed finding genes that accurately reflect the biology of the included classes. One way we can do this is to leverage the normal tissue classes embedded in each classification. We can examine the important genes for classifying normal breast tissue, for example, and look for genes involved in lactation. By doing this for a few tissues with specialized functions, we can offer some validity to the results as they pertain to the cancer classes and further increase our confidence in the DeepLift results. 
To obtain the functional annotations, the DAVID functional annotation tool was used and the functional annotation chart was examined for relevant pathways [76].

It is important to note that the secreted, signal, signal peptide, and extracellular region pathways are highly enriched in the normal tissue classes, as identified using DAVID on the disease state important genes for the normal class. In the results that follow, I do not discuss these pathways as being significant because they are not tissue specific and are likely used by the model to distinguish normal classes from cancer classes.

Thyroid

The first normal class examined was normal thyroid tissue: THCA. Since the thyroid performs a small set of very specific functions, the genes used to identify it should, at least in part, reflect this functionality. The important genes from the cancer type classification task were used to assess whether the model is identifying genes of biological relevance. The cancer type task was selected because it is the first task that forces the model to identify thyroid tissue explicitly. In theory, it is here that the model will need to distinguish thyroid genes. The functions for the top 10 enriched pathways (ordered by p-value, most significant first) can be seen in Figure 4.1. DAVID identified 4 of the 123 important genes identified by DeepLift as being part of a thyroid hormone generation pathway, with a highly significant p-value of 3.8E-5. There are two other notable indications of thyroid tissue. The first is the neuropeptide hormone activity pathway. A number of neuropeptides are found within the thyroid and thus can be indicative of thyroid tissue [78]. The second is the olfactory transduction pathway. A cursory search showed that the genes involved in this pathway (according to KEGG) are highly expressed in the thyroid according to the Genotype-Tissue Expression (GTEx) Project [79, 80, 81]. 
The significance of these pathways according to DAVID indicates that the identified genes have biological relevance to thyroid tissue.

Figure 4.1: A screen capture of the top 10 functional annotations (ordered by p-value, most significant first) as determined by the DAVID functional annotation tool using the important positive genes for the normal thyroid tissue class within the cancer type classification task.

Lung

The second normal class examined was lung tissue: LUSC and LUAD. Again, the DAVID functional annotation tool was used. The genes for the LUSC and LUAD normal classes were taken from the cancer type results and combined. For these results, the functional annotations were clustered, producing 71 functional clusters. Clustering was used in this case because many of the top annotations in the annotation chart were related to keratinization. Keratinization is a by-product of lung distress and is common in the pathology of lung cancer patients. Since this lung tissue was all obtained from tumour-adjacent normal tissue, the presence of keratinization is expected here. Of the 71 clusters identified by DAVID, 24 were significantly enriched. Within these significant clusters, four were directly related to lung function. These clusters pertained to gaseous exchange, oxygen binding, oxygen transport, and saposin proteins (involved in the pulmonary surfactant complex) [82]. Each of these is a clear indication of the function of lung tissue and serves to validate the results found by the model.

Breast

The third and final tissue examined was normal breast tissue. Again, the DAVID annotation tool was used to produce an annotation chart. The top 14 results are presented in Figure 4.2. Fourteen results were included here because the pathways ranked between 9th and 14th are significant in normal breast tissue. The obvious pathway of significance is the one pertaining to milk proteins. A literature search revealed that keratin is also significant [83, 84]. 
A study conducted in 1989 reported that "The luminal and basal epithelial cells in the human mammary gland can be distinguished in tissue sections on the basis of the pattern of keratins they express" [83]. Finally, the Iroquois-class homeobox proteins have been detected in breast tissue [85, 86]. It should be noted that pathways for lactation and prolactin signalling were also identified by DAVID as significant, though not in the top 14 annotations. These results support the conclusion that the model is able to learn genes relevant to breast tissue and its function.

Figure 4.2: A screen capture of the top 14 functional annotations (ordered by p-value, most significant first) as determined by the DAVID functional annotation tool using the important positive genes for the normal breast tissue class within the cancer type classification task.

4.2.2 Number of Important Genes

Number of Important Genes Results

The results presented here focus on the number of positive scoring genes for each class. See Figures 4.3, 4.4, 4.5, and 4.6. Each of these figures illustrates how many genes (in blue) the model considered important for the classification of each class. 
The F1-score (in red) for each class is also presented, though it is scaled by the total number of genes (26668 genes) for the sake of visualization.

Figure 4.3: A plot showing the number of important positive genes for each class within the organ system of origin classification task in blue and the F1-score of each class in red.

Figure 4.4: A plot showing the number of important positive genes for each class within the disease state classification task.

Figure 4.5: A plot showing the number of important positive genes for each class within the cancer type classification task.

Figure 4.6: A plot showing the number of important positive genes for each class within the cancer subtype classification task.

The first notable observation from the above figures is that there appear to be three distinct plateaus in the number of important genes within the disease state, cancer type, and cancer subtype plots (Figures 4.4, 4.5, and 4.6). We see that the number of important genes is generally much higher in the primary classes (left) compared to the metastatic and normal classes (center and right respectively). The second observation is that we do not see an obvious correlation between the performance (in red) of each class and the number of important positive genes (in blue). If there were a relationship between these two values, we would expect the F1-scores to also form three tiers reflecting the tiers seen in the number of genes.

If we unpack these results further into different tasks (as seen in the above figures), we note that for the disease state classification, the primary cancer class considers almost all of the genes as important. When compared to the approximately 2400 genes for metastatic cancers and fewer than 100 genes for normal tissues, this is a substantial difference. In the cancer type task, the highest concentration of genes is around 19000 genes for primary cancers, 7500 genes for metastatic cancers, and around 0 for normal tissues. 
A similar observation can be made at the cancer subtype level. However, the range of important genes for primary cancer subtypes increases. Notably, we see the appearance of six primary cancer subtypes that have fewer than 7000 important genes, which is much closer to the values seen within the metastatic subtypes.

The single genes with the highest importance among all classes were RPL19P12, LYVE1, PGA4, and SFTA3, in the Soft Tissue, Normal, STAD N Normal, and LUAD N Normal classes of the organ system, disease state, cancer type, and cancer subtype tasks respectively. The associated scores were 0.001563, 0.009981, 0.026148, and 0.075350. These scores indicate the percentage of the classification made using each gene for their respective classes.

Number of Important Genes Discussion

These observations raise the question of what fewer important genes mean in this context. Firstly, fewer positively important genes indicates a greater number of genes that are not positively important. The model has 26668 genes to consider, and if it deems only 1000 as positively important for a particular class, the remaining 25668 genes have importance scores at or below 0; that is, they either do not contribute to, or push the model away from, classifying a sample as that class. If we consider that the performance of each class is not directly related to the number of positive important genes and that the classification performance is acceptable (as we have seen in Chapter 3), then it is reasonable to conclude that fewer genes indicates a sufficient, compressed representation of the class for each of the specified classification tasks. In biological terms, this could mean that fewer genes are involved, that each of the genes identified plays a more significant role than the others, or that some genes have tumour suppressing properties. As we have seen in the section on the validation of results using normal tissues, it does appear as though the model and DeepLift have identified genes of biological relevance. This further supports the idea that the model is able to learn genes of value. 
Being able to classify cancers using fewer genes suggests that these cancers have a more distinct expression pattern and that some of the identified genes likely play an important role in these cancers. Through further analysis of the identified genes, we could validate existing oncogenes and potentially identify new therapeutic targets.

4.2.3 Expression Levels

This section considers whether the model has identified any trends in the levels of gene expression for different cancers. Over and underexpression were determined as described in the methods section above.

Expression Levels Results

The following figures and tables show the number of over and underexpressed genes within the positive important genes for each classification task. The number of important genes, overexpressed genes, and underexpressed genes are presented in black, red, and blue respectively.

Figure 4.7: A stacked bar plot showing the number of important positive genes and the number of over and underexpressed genes for each class within the organ system of origin classification task.

Organ System of Origin    Important positive genes  Overexpressed  Underexpressed
Breast                    2563                      341            1
Central Nervous System    4178                      2306           11
Endocrine                 3215                      420            3
Gastrointestinal          4402                      509            10
Gynecologic               2389                      207            1
Head and Neck             2012                      394            3
Hematologic               4476                      3500           34
Skin                      1691                      434            1
Soft Tissue               1620                      234            0
Thoracic                  5777                      456            2
Urologic                  5145                      332            8

Table 4.2: A table listing the number of positive important genes identified by DeepLift for the organ system of origin classes along with how many of those genes are over and underexpressed.

Figure 4.8: A stacked bar plot showing the number of important positive genes and the number of over and underexpressed genes for each class within the disease state classification task.

Disease State             Important positive genes  Overexpressed  Underexpressed
Primary                   26156                     2137           1307
Metastatic                2060                      2060           0
Normal                    166                       115            2

Table 4.4: A table listing the number of positive important genes identified by DeepLift for the disease state classes along with how many of those genes are over and underexpressed.

Figure 4.9: A stacked bar chart showing the number of important positive genes and the number of over and underexpressed genes for each class within the cancer type classification task.

Cancer Type               Important positive genes  Overexpressed  Underexpressed
ACC T Tumor               18092                     364            123
BLCA T Tumor              19575                     62             9
BRCA T Tumor              18635                     87             163
CESC T Tumor              12770                     126            9
CHOL T Tumor              18023                     110            29
CLL T Tumor               8281                      1684           20
CML T Tumor               17766                     3787           628
COADREAD T Tumor          17277                     222            53
DLBC T Tumor              24610                     2266           466
ESCA T Tumor              18819                     156            73
FL T Tumor                8048                      2564           0
GBM T Tumor               18124                     661            17
HNSC T Tumor              10588                     298            15
KICH T Tumor              17595                     704            263
KIRC T Tumor              23246                     210            38
KIRP T Tumor              18252                     241            34
LAML T Tumor              15919                     1938           950
LGG T Tumor               17713                     1059           205
LIHC T Tumor              17242                     501            244
LUAD T Tumor              20840                     46             5
LUSC T Tumor              20762                     125            7
MB-Adult T Tumor          9744                      2009           1
MESO T Tumor              26178                     1311           81
OV T Tumor                15269                     223            21
PAAD T Tumor              19016                     119            2
PCPG T Tumor              18688                     774            128
PRAD T Tumor              18903                     365            39
SARC T Tumor              11971                     22             3
SKCM T Tumor              18430                     206            24
STAD T Tumor              15087                     134            15
TGCT T Tumor              16356                     890            171
THCA T Tumor              21425                     277            30
THYM T Tumor              20287                     343            128
UCEC T Tumor              17689                     81             37
UCS T Tumor               8040                      151            6
UVM T Tumor               13141                     790            139
ACC T Metastatic          6771                      640            0
ALL T Metastatic          5476                      831            0
BLCA T Metastatic         7372                      523            0
BRCA T Metastatic         3340                      133            0
CHOL T Metastatic         5022                      124            0
COADREAD T Metastatic     6943                      445            0
ESCA T Metastatic         6670                      239            0
HNSC T Metastatic         2439                      229            0
LAML T Metastatic         1970                      309            0
LIHC T Metastatic         743                       116            0
LUAD T Metastatic         6655                      333            0
NET T Metastatic          3904                      232            0
OV T Metastatic           7783                      289            0
PAAD T Metastatic         1000                      217            0
PRAD T Metastatic         6436                      258            0
SARC T Metastatic         6487                      159            0
SKCM T Metastatic         7987                      342            0
BLCA N Normal             160                       33             0
BRCA N Normal             354                       97             0
CHOL N Normal             179                       107            0
COADREAD N Normal         128                       79             0
HNSC N Normal             212                       130            0
KICH N Normal             169                       62             0
KIRC N Normal             454                       93             0
KIRP N Normal             303                       115            0
LIHC N Normal             94                        78             0
LUAD N Normal             110                       53             0
LUSC N Normal             771                       117            0
PRAD N Normal             182                       63             0
STAD N Normal             42                        14             0
THCA N Normal             132                       61             0
UCEC N Normal             121                       29             0

Table 4.6: A table listing the number of positive important genes identified by DeepLift for the cancer type classes along with how many of those genes are over and underexpressed.

Figure 4.10: A stacked bar chart showing the number of important positive genes and the number of over and underexpressed genes for each class within the cancer subtype classification task.

Cancer Subtype            Important positive genes  Overexpressed  Underexpressed
ACC T Tumor               18931                     389            149
BLCA T Tumor              19870                     69             164
BRCA Basal T Tumor        18312                     79             21
BRCA HER2like Tumor       15877                     93             10
BRCA LuminalA T Tumor     17415                     144            164
BRCA LuminalB T Tumor     22827                     162            44
BRCA T Tumor              18096                     37             1
CESC CAD T Tumor          15584                     76             21
CESC SCC T Tumor          14633                     183            21
CHOL T Tumor              18615                     118            26
CLL T Tumor               11088                     2150           192
CML T Tumor               21014                     3833           939
COADREAD T Tumor          18196                     199            56
DLBC BM T Tumor           11562                     4224           178
DLBC T Tumor              25123                     1954           455
ESCA EAC T Tumor          14650                     149            47
ESCA SCC T Tumor          14587                     333            21
ESCA T Tumor              17074                     89             59
FL T Tumor                8038                      2320           0
GBM T Tumor               22553                     588            40
HNSC T Tumor              14119                     305            148
KICH T Tumor              18276                     765            300
KIRC T Tumor              24190                     242            208
KIRP T Tumor              20279                     296            49
LAML T Tumor              18362                     1888           1626
LGG T Tumor               19417                     1023           490
LIHC T Tumor              18558                     549            296
LUAD T Tumor              21219                     49             156
LUSC T Tumor              22062                     135            163
MB Group3 T Tumor         10754                     1823           11
MB Group4 T Tumor         10149                     2249           10
MB SHH T Tumor            8905                      1523           0
MB WNT T Tumor            8838                      1210           0
MESO T Tumor              26434                     1117           62
OV T Tumor                18123                     195            22
PAAD T Tumor              19057                     101            0
PCPG T Tumor              19645                     755            180
PRAD T Tumor              19308                     358            193
SARC DDL T Tumor          476                       13             0
SARC LMS T Tumor          16773                     113            20
SARC MFS T Tumor          1520                      45             0
SARC MPNST T Tumor        21674                     59             8
SARC Synovial T Tumor     10513                     395            33
SARC UPS T Tumor          2800                      44             0
SKCM T Tumor              19016                     204            175
STAD CIN T Tumor          18355                     119            22
STAD EBV T Tumor          16224                     108            47
STAD GS T Tumor           650                       53             0
STAD MSI T Tumor          5089                      107            0
STAD T Tumor              3432                      167            0
TGCT T Tumor              18825                     858            237
THCA T Tumor              21175                     295            186
THYM T Tumor              20496                     362            141
UCEC T Tumor              18618                     83             47
UCS T Tumor               16603                     119            42
UVM T Tumor               19136                     801            423
ACC T Metastatic          5757                      490            0
ALL T Metastatic          5825                      729            0
BLCA T Metastatic         6172                      409            0
BRCA IDC T Metastatic     4517                      84             0
BRCA ILC T Metastatic     1453                      227            0
CHOL EHCH T Metastatic    2830                      156            0
CHOL IHCH T Metastatic    4955                      152            0
COADREAD T Metastatic     6747                      352            0
ESCA EAC T Metastatic     8244                      300            0
HNSC T Metastatic         2704                      225            0
LAML T Metastatic         3093                      409            0
LIHC T Metastatic         1345                      134            0
LUAD T Metastatic         7609                      320            0
OV T Metastatic           7270                      223            0
PAAD T Metastatic         1422                      227            0
PRAD T Metastatic         6544                      202            0
PrNET T Metastatic        4117                      213            0
SARC LMS T Metastatic     3537                      208            0
SARC T Metastatic         7531                      193            0
SKCM T Metastatic         7957                      306            0
BLCA N Normal             62                        14             0
BRCA N Normal             64                        34             0
CHOL N Normal             287                       150            2
COADREAD N Normal         65                        39             0
HNSC N Normal             156                       117            0
KICH N Normal             91                        37             0
KIRC N Normal             443                       87             0
KIRP N Normal             243                       118            0
LIHC N Normal             72                        66             0
LUAD N Normal             59                        38             0
LUSC N Normal             977                       112            0
PRAD N Normal             126                       56             0
STAD N Normal             3                         0              0
THCA N Normal             147                       67             0
UCEC N Normal             31                        11             0

Table 4.8: A table listing the number of positive important genes identified by DeepLift for the cancer subtype classes along with how many of those genes are over and underexpressed.

Within the disease state task, the most notable observation regarding the expression levels of the genes identified by the model pertains to the metastatic cancer and normal tissue classes. If we look at Figure 4.8 and Table 4.4, we see that the model has selected 2060 genes as important for metastatic cancers and that all of them are overexpressed. Similarly, the majority of genes within the normal tissue class (115 of 166) are considered overexpressed.

Within the cancer type and subtype tasks, the majority of genes across all primary classes are categorized as neither over nor underexpressed. None of the metastatic or normal classes had underexpressed genes among their important genes, with the exception of the CHOL N Normal subtype, in which 2 genes were underexpressed.

Expression Levels Discussion

The results above suggest that the model has found the overexpression of genes to be more informative in the context of metastatic and normal classes than of primary ones. The caveat is that the metastatic and normal classes both use far fewer genes than the primary ones and make up the minority of classes in the data set. Since the expression categories were determined using the mean expression values across all samples, the mean will be skewed towards the majority classes' gene values. So while the resultant gene lists would remain the same, redefining the boundaries of over and underexpression individually for each class would likely provide better insight into the relevant expression levels. This becomes particularly important when trying to connect expression levels with expected gene functionality. For example, if a gene is a known tumour suppressor, carefully determining its expected expression level for a given class would allow insight into whether or not it is being underexpressed and thus driving the growth of a particular class of cancers. 
Given this caveat, the only real conclusion we can draw from the results is that genes expressed above the level seen in the majority of classes tend to have high importance and may contribute to needing fewer genes for classification. Until a better defined methodology for calculating the true mean expression values on a class-wise level is developed, biological interpretation of these results should be reserved.

4.2.4 Enriched Pathways: Metastatic Cancer Disease State

DAVID Functional Annotation Chart Results

Figure 4.4 and Table 4.4 show that the disease state task identified 2060 genes that contribute positively to the classification of metastatic cancers by the model. These genes were input into the DAVID functional annotation tool to identify enriched functional pathways [76, 77]. The DAVID functional annotation chart returned 163 records, and the top 10 results are presented in Figure 4.11. This figure is ordered by p-value, most significant first.

Figure 4.11: A screen capture of the top 10 functional annotations (ordered by p-value, most significant first) as determined by the DAVID functional annotation tool using the important positive genes for the metastatic class within the disease state classification task.

DAVID Functional Annotation Chart Discussion

The first observation to note is that within the top annotations we see 'MicroRNAs in cancer'. According to KEGG, the pathway highlighted here corresponds to 'a cluster of small non-encoding RNA molecules of 21 - 23 nucleotides in length, which controls gene expression post-transcriptionally either via the degradation of target mRNAs or the inhibition of protein translation' [79, 80, 81]. This corresponds with the information presented in Chapter 1 about the role of microRNA in tumourigenesis. Furthermore, studies have suggested that specific miRNAs play a role in metastatic cancers [87, 88]. 
The fact that this is a highly enriched pathway for this class is a promising result that is supported by the literature. Through further in-depth analysis, we may be able to identify metastatic-specific miRNA genes that could make suitable therapeutic targets.

The second observation to note is that four of the top 10 functional annotations are related to the ribosome in some way. Studies have indicated that changes in the regulation of ribosomal proteins can be associated with poor prognosis for cancer patients [89, 90, 91, 92]. Additionally, increased expression of ribosomal RNAs has been correlated with the development of some cancer types and in some cases poor prognosis and/or metastasis [89, 93, 94, 95, 96]. The prevalence of genes related to ribosomal function further supports the evidence presented in the literature and further validates the biological significance of the identified genes. Ribosomal function plays a key role in the development of cancers and, through careful examination of the genes identified here, ribosomal genes could be further explored as therapeutic targets [97, 98].

Finally, the top four annotations strongly suggest that immunoglobulin (IG) is important to classifying metastatic cancers. In particular, the V-set of immunoglobulin is listed twice with the highest p-values. This set of immunoglobulin has been found to be overexpressed in and indicative of poor prognosis for patients with advanced gastric cancers [99]. If increased expression of V-set IG is found to prognosticate advanced cancers, then IG should be explored for causal links to metastatic cancers, as these are themselves an advanced form of the disease. Furthermore, a second study also validates the importance of the IG family of genes (which includes the V-set) and found that dysregulated expression of IG shows prognostic value for breast cancers [100].
We can speculate that the model has learned to detect changes in the expression of IG genes and utilizes them to inform the classification of metastatic cancers.

DAVID Functional Annotation Clustering Results

The DAVID functional annotation clustering tool was used to cluster the functional annotations for the metastatic cancer genes within the disease state task and returned 53 clusters. Of the 53 clusters, only nine had a significant enrichment score (above 1.3) [76]. The top three enriched pathways include groups involving immunoglobulin, ribosomal translation, and mitochondrial translation/mitochondrial ribosome pathways, with enrichment scores of 12.58, 5.79, and 2.71 respectively.

DAVID Functional Annotation Clustering Discussion

The clustering results further support the important role of IG and ribosomes discussed in the preceding subsection. Mitochondria were not discussed earlier in this thesis but have been shown to play a role in the formation of cancers and metastases [101]. In particular, a number of mitochondrial ribosomal proteins (mitoribosomal proteins or MRPs) have been implicated in the development of various metastatic cancers [101, 102]. The presence of all three of these annotation groups, both in the cancer literature and in the results presented here, is an excellent further indicator of the ability of the model to identify important biological features of metastatic cancer. The implication of each of these pathways should be investigated further.

Summary of Enriched Pathways for Metastatic Cancer in the Disease State

The results presented above have indicated that immunoglobulin, microRNA, and the ribosome play significant roles in the classification of metastatic cancers within the disease state task. Furthermore, the scientific literature seems to indicate that there is some validity to this observation made by the model.
The value of this insight is that it was made within the disease state task and thus applies across multiple metastatic cancer types. The list of pathways identified within this task could be further examined and exploited to try to better understand common characteristics of metastatic cancers as a whole. One caveat to consider with these results is the implication of batch effect as a result of all the metastatic cancers being from a single data source external to the bulk of the training data (see the section on batch effect below).

4.2.5 Enriched Pathways: Primary Cancer in the Disease State Task

The genes identified as important for the primary cancer class comprise almost all of the total available genes (26668 genes). In this case, it would be uninformative to use the full gene list for enriched pathway analysis. However, since each gene is given a score, it is possible to rank the importance of each gene as a percentage of the total classification decision. Therefore, enriched pathway analysis results are presented using the top 25% of genes ranked by descending importance score. Note that individual cancer type and subtype results will not be presented but rather the focus will remain on the more general categories of primary and metastatic cancers.

DAVID Functional Annotation Chart Results

The top 25% of important genes for primary cancer within the disease state task consists of 2780 genes. These genes were input into DAVID and the functional annotation chart tool was utilized to generate the results presented in Figure 4.12.

Figure 4.12: A screen capture of the top 10 functional annotations (ordered by descending p-value) as determined by the DAVID functional annotation tool using the top 25% of important positive genes for the primary class within the disease state classification task.

DAVID Functional Annotation Chart Discussion

Looking at the results in Figure 4.12, we see that the most significant pathway listed is cancer-related.
The cancer/testis antigens are a group of proteins that are normally expressed only in testicular germ cells but are found to be expressed in numerous cancers [103]. These antigens span a number of gene groups, including GAGE, MAGE, and BAGE [103]. GAGE genes were found to also be significant within the primary cancer results and are discussed below in the context of functional annotation clustering. Another notable result in the top 10 enriched pathways presented in Figure 4.12 is the V-type IG-like pathway. We have seen that the V-set IG pathway was important for the metastatic cancer class (see above) and the results presented here, along with the literature presented above, support the significance of this pathway in cancers in general.

DAVID Functional Annotation Cluster Results

The DAVID functional annotation clustering tool returned 13 significant clusters out of 198 identified functional clusters. The top three clusters had enrichment scores of 5.31, 3.41, and 2.94, corresponding to functional pathways involving DNA repair/damage, G antigens (GAGE), and putative proteins (see Figure 4.13).

Figure 4.13: A screen capture of the top 3 functional annotation clusters (ordered by descending enrichment score) as determined by the DAVID functional annotation cluster tool using the top 25% of important positive genes for the primary class within the disease state classification task.

DAVID Functional Annotation Cluster Discussion

The first functional cluster identified by the clustering tool pertains to DNA repair and damage. This result is somewhat difficult to reconcile. Numerous studies report that the underexpression of DNA repair genes is associated with an increased likelihood of tumourigenesis as a result of increased genomic instability [104, 105, 106]. However, underexpression of DNA repair genes in patients already afflicted with cancer is associated with poorer prognoses and treatment outcomes [106]. The consensus seems to contradict the results shown here.
We would expect to see overexpression of DNA repair genes in the context of primary cancers and underexpression, if at all, in the metastatic cancers.

The second functional cluster pertains to G antigens (GAGE). GAGE genes have been found to be upregulated across numerous cancers, which supports the importance placed on this functional cluster by the model [107, 108, 109]. They are expressed in response to epigenetic dysregulation in cancer cells but are otherwise inactive [107, 108, 109, 110]. The only exceptions to this are during the developmental period and within testicular germ cells [107, 108, 109, 110]. The exact mechanism by which GAGE genes impact tumourigenesis is unclear but they are being explored as potential therapeutic targets [107, 108, 109, 110]. It should be noted that GAGE genes are within the same category of genes as the cancer/testis antigens discussed above, further supporting their relevance within the model.

The third functional cluster involves a set of putative proteins. By nature of being putative, we cannot speculate on the value or function of these genes. However, these genes could be noted for future experimentation to determine their functionality where possible.

Summary of Enriched Pathways for Primary Cancer in the Disease State Task

Generally, the list of functional annotations for primary cancers was less informative of the underlying biology than within the metastatic cancer class. This outcome was expected given that the model utilized almost all of the genes to make a primary cancer classification within the disease state task. The high gene usage results in each gene’s contribution being significantly reduced and thus having weaker importance. This increased gene contribution confounds the resulting functional annotations as 75% of the genes identified were excluded from the results presented here. Using only 25% of genes is effectively an arbitrary cut-off point and may not have any real underlying biological significance.
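The arbitrariness of such a cut-off can be sketched as follows. This is a minimal illustration and not the procedure used in the thesis: the gene names and scores are invented, and a simple rank-based rule (keep the top fraction of the ranked list) is assumed, since the exact selection rule is not restated here.

```python
def importance_shares(scores):
    """Convert positive importance scores into shares of the total importance."""
    total = sum(scores.values())
    return {gene: score / total for gene, score in scores.items()}

def top_fraction(scores, fraction):
    """Return the highest-scoring genes making up `fraction` of the ranked list."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical, nearly uniform importance scores for eight placeholder genes,
# mimicking a class in which almost every gene contributes a little.
scores = {
    "g1": 1.30, "g2": 1.25, "g3": 1.20, "g4": 1.15,
    "g5": 1.10, "g6": 1.05, "g7": 1.00, "g8": 0.95,
}

shares = importance_shares(scores)
selected = top_fraction(scores, 0.25)

print(selected)
for gene in ("g2", "g3"):
    # g2 is the last gene kept and g3 the first gene dropped.
    print(gene, round(shares[gene], 4))
```

When importance is spread this thinly, the last gene kept and the first gene dropped differ only marginally in their share of the total, so moving the cut-off slightly would change the input to the enrichment analysis without reflecting any real change in importance.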
Given the vast number of genes used, these are simply genes ranked slightly above the others.

4.2.6 RNA Genes

The results presented in this section serve to quantify and present the role that RNA genes play in different cancers. The majority of the discussion that follows will focus on the disease state and cancer type classification tasks as the trends remain similar within the cancer subtype task. These tasks sufficiently exemplify the larger trends across cancers. It should be noted that there are 2890 RNA genes in the full set of genes available to the model and that this comprises 10.8% of the available genes.

Organ System of Origin Task Results

Figure 4.14 presents the proportion of RNA genes identified as important within each class of the organ system of origin task.

Figure 4.14: A scatter plot showing the proportion of RNA genes within the positive important genes identified for the organ system of origin classes.

We note that the highest RNA gene proportions are found in the central nervous system, hematologic, thoracic, and soft tissue classes, with values of 0.26, 0.17, 0.13, and 0.095 respectively. The other classes have values ranging from 0.025 to 0.053.

Organ System of Origin Task Discussion

The two highest RNA gene proportions are found in the central nervous system and hematologic classes. When we consider the cancer types with the highest RNA gene proportions (see sections below), we find that the top three are cancer types of the central nervous system and hematologic organ systems (CLL, FL, and MB).
Since these organ systems already include higher RNA gene proportions in their important genes, this may have contributed to the high RNA gene proportions found in the related cancer types.

Disease State Task Results

Figure 4.15 presents the proportion of RNA genes identified as important within each class of the disease state task.

Figure 4.15: A scatter plot showing the proportion of RNA genes within the positive important genes identified for the disease state classes.

Figure 4.15 illuminates a significant difference between the number of RNA genes deemed important for the metastatic class when compared to the primary and normal ones. We note that the RNA gene involvement is approximately three times as high in the metastatic class as in the primary one, with proportions of 0.32 and 0.11 respectively. We also note that the RNA gene proportion is approximately 0.02 within the normal class.

Disease State Task Discussion

The small number of RNA genes utilized in the normal class coincides with the small number of important genes seen in Section 4.2.2. This suggests that the model has learned to ignore the majority of genes, RNA or otherwise, for normal tissues. In fact, the model has deemed most genes as an indication of non-normal classes. This implies that there are very few genes that are not important to the classification of cancer within the context of this model.

With regards to RNA genes in the primary cancer class, it is important to remember that the model has deemed almost all of the genes available (see Figure 4.4) as important. As a result, the number of RNA genes closely reflects the number available within the entire gene set (10.8% RNA genes).
While this fact reduces the value of examining the RNA gene proportion within the primary class, the opposite is true of the metastatic class.

Given the otherwise high gene exclusion rate within the metastatic class, the fact that the model has elected to deem such a high proportion of RNA genes as important is significant. This suggests that RNA genes have a very strong impact on the classification of metastatic samples within the context of the disease state learning task and make up almost one third of the genes contributing to a classification decision. When conducting further analysis on the genes highlighted for metastatic cancers, special attention should be paid to the interaction between RNA (such as miRNA) and non-RNA genes. It may be possible to look for correlations between the expression patterns of RNA genes and their related coding genes.

It is worth noting here that the disease state task has the highest potential for being negatively affected by batch effect. Since all of the metastatic cancers are from a single data source that differs from the source of the bulk of the training data (TCGA), batch effect can pose a serious issue for biological interpretation. At this level of classification task, the susceptibility to learning how to simply differentiate data sources is high and thus any biological interpretation of results should consider this implication. Batch effect is discussed in more detail in Section 4.3.

Cancer Type Task Results

Figure 4.16 presents the proportion of RNA genes identified as important within each class of the cancer type task.

Figure 4.16: A scatter plot showing the proportion of RNA genes within the positive important genes identified for the cancer type classes.

In Figure 4.16, we see that the proportion of RNA genes utilized in the metastatic cancer types ranges from 0.23 to 0.37 and is spread relatively evenly throughout this range. The range for primary cancer types is 0.025 to 0.275, with the vast majority having an RNA proportion of approximately 0.03.
When comparing these two proportions, the vast majority of primary cancers use 7 times fewer RNA genes than metastatic cancers. The normal tissue classes share a similar range to that of the primary cancers, with proportions from near 0.0 to 0.245. The KIRC normal class presents as the largest outlier with an RNA gene proportion of 0.25.

The metastatic cancer types with the lowest and highest proportion of RNA genes are LAML and OV respectively. As a whole, the metastatic cancer type proportions form one cluster. Within the primary cancers, there are three classes whose RNA gene proportions are within the range of metastatic cancers, making them outliers within the primary class. These classes are CLL, FL, and MB, and they have RNA proportions of 0.265, 0.29, and 0.25 respectively.

Cancer Type Task Discussion

Given that the total gene set available to the model contains 10.8% RNA genes, the fact that the majority of normal and primary cancer type classes rely on fewer than 4% RNA genes suggests that RNA genes were selected against. Within metastatic cancer classifications we saw the opposite effect, with high proportions of RNA gene involvement. We therefore have not only evidence of RNA gene importance in metastatic classes, but evidence of RNA gene aversion in primary and normal ones. Combined, these observations strongly suggest that RNA genes play a much more significant role in the classification of metastatic cancers when compared to primary cancers and normal tissues.

Regarding the normal tissues, there is no easily observable correlation between the RNA gene usage of a normal class and that of its corresponding primary cancer type. For example, we see that the two lung normal classes, LUAD and LUSC, have the highest and lowest RNA gene usage within the normal classes. Their corresponding primary cancers both show only slightly elevated RNA gene usage, with proportions just above 0.05.
This indicates that the model is able to detect and learn the differences in RNA expression patterns between normal and primary cancer tissues. We can go one step further and speculate that, since the normal classes are defined by their adjacent cancer types, perhaps the RNA expression patterns differ between lung tissues that have developed adenocarcinoma versus squamous cell carcinoma. One study by Shi et al. (2014) has found 2961 microRNAs that are differentially expressed between lung cancers and normal lung tissue [111]. Following up on this work, another study by Venugopal et al. (2019) detected fundamental differences in the gene expression patterns between lung adenocarcinoma and squamous cell carcinoma [112]. It is plausible that, given the high number of microRNA genes implicated in lung cancer and the differential gene expression between lung cancer types, at least some of these changes in expression are driven by non-coding genes. Further functional analysis of the RNA genes identified in each class may serve to validate this speculation.

There are two aspects of these results that should be further analyzed. The first aspect is what the increased RNA presence in metastatic cancers means in terms of biology, and the second involves examining the RNA proportions for the outlying primary cancers. These will be discussed in the subsections below.

Cancer Type Task Discussion: RNA Genes in Metastatic Cancers

The high level of RNA gene importance in metastatic cancer types indicates that these genes play a significant role in differentiating metastatic cancers from primary ones. The question is whether or not there is biological significance to this. Scientific evidence is beginning to suggest that non-coding RNA plays an important role in regulating the developmental transitions of cells [113]. In particular, the epithelial to mesenchymal transition (EMT) is a key developmental transition that is indicated at the start of metastasis.
EMT is the mechanism by which cells can reactivate embryonic morphogenesis and ultimately contributes to the ability of cells to propagate and migrate to distant organ systems [113]. Non-coding RNAs, as they pertain to cancer, are also implicated in the disruption of the cellular signalling pathways involved in the proliferation, migration, and survival of cells [113, 114, 115].

There are two main types of RNA genes that are most often cited in relation to cancer: long non-coding RNAs (lncRNA) and microRNAs (miRNA) [116]. It should be noted that work has also been done on the role of circular non-coding RNA in cancer but these have been excluded from this thesis [117, 118]. The presence of relevant lncRNAs and microRNAs will be briefly discussed below.

Recent studies looking into lncRNA have found several genes that are implicated in the metastasis of breast cancers (HOTAIR), lung/cervical cancers (MALAT1), and prostate cancers (PRNCR1 and PCGEM1) [116]. Examining the DeepLift results revealed that indeed HOTAIR was identified in metastatic BRCA and PCGEM1 was identified in metastatic PRAD. Note that MALAT1 was not found by the model within the metastatic BRCA cancer type. The model’s results correlate with the literature and suggest that the model has, at least in part, the ability to extract useful biological insight from lncRNA genes, as well as coding genes. The inclusion of more lncRNA in the data set might provide a means by which to further expand the knowledge base surrounding the role of lncRNA in various cancer types. Furthermore, the examples listed here are a subset of lncRNA genes available for study and simply show that this is a feasible line of inquiry for further analysis.

With regards to microRNAs, there are a number that have been implicated across multiple cancers (discussed in Chapter 1) and have been shown to play a role in metastasis [116, 119].
To reiterate, studies have found that miRNAs influence metastasis in a wide range of ways including by targeting oncogenes and/or tumour suppressors, modulating cancer stem cell properties, regulating EMT, and by influencing changes in the microenvironment. Furthermore, genes involved in the regulation of miRNA biogenesis have been implicated in cancer as well, adding an additional layer by which miRNAs themselves can be dysregulated [119]. Given the variety of ways in which miRNAs have been shown to influence metastasis (also see Chapter 1), the results obtained from the model seem to reflect this influence. With such a large number of cellular functions being affected by miRNA expression and such a high proportion of RNA gene utilization in the model, we can speculate that there are RNA expression patterns that can be learned to classify and potentially prognosticate metastatic cancers. In-depth analysis of each cancer type’s identified RNA genes and their functional annotations could be conducted to further glean biological insights.

Cancer Type Task Discussion: RNA Genes in Primary Cancer Types

It should be noted that all of the primary cancer types, with the exception of LAML, that are part of the organ systems that showed elevated RNA gene proportions (see the organ system of origin section above) have at least slightly elevated RNA gene proportions (above 0.03) at the cancer type level. The cancer types that are a part of the thoracic, hematologic, central nervous system, and soft tissue organ systems are as follows: CLL, CML, DLBC, FL, LUAD, LUSC, MB, MESO, and PCPG.
We can speculate that some of the RNA genes identified as important within these cancer types are reflective of the organ systems from which they came.

Cancer Type Task Discussion: RNA Genes in Outlying Primary Cancer Types

There are three primary cancer types (Figure 4.16) that show high levels of RNA gene involvement consistent with the levels seen in metastatic cancers (between 23% and 37% RNA genes): CLL, FL, and MB. These RNA proportion values are high enough to be considered outliers from the rest of the primary cancer types. There are an additional 9 primary cancer types that have at least two times the RNA involvement when compared to the primary cancer types as a whole. These types can be seen in Figure 4.17. To validate the significance of these findings, there are two metrics that should be considered. The first is the classification performance (F1-score) and the second is the total number of important genes identified by the model. It is important to see if these cancer types differ from the other primary cancers in terms of either metric, as the significance of the increase in RNA gene proportions may be related.

Figure 4.17: A scatter plot showing the proportion of RNA genes (black) within the positive important genes identified by DeepLift and the corresponding F1 classification scores (red) for primary cancer types whose proportions were greater than 0.06.

Figure 4.17 illustrates the RNA gene proportions in conjunction with the F1-scores for a subset of primary cancer types. The average RNA gene proportion across all primary cancer types is approximately 0.03. The subset of cancer types presented in Figure 4.17 was selected on the basis of having an RNA gene proportion of at least 0.06 (two times the average value for primary cancer types). The figure shows that the F1-scores for each of the primary cancer types listed remain close to 1.0. This seems to indicate that the RNA proportions are not a contributing factor in the classification performance.
If this were the case, we would expect that cancer types with higher RNA gene usage would have poor performance relative to the others.

To observe the impact of the total number of important genes on the RNA proportions reported, we need to look a bit more carefully at the results in Figure 4.16. Specifically, MESO and DLBC utilize nearly all of the available 26668 genes and as such, the proportion of RNA genes identified closely reflects the total number of RNA genes available within the data set (2890 genes or 10.8%). This effectively eliminates the significance of what appears to be higher RNA gene involvement for these two classes. As a result, we should exclude them from further analysis pertaining to RNA gene significance.

Having identified CLL, FL, and MB as primary cancer outliers, when we consider Figure 4.16, we note that they make up the primary cancer types with the lowest number of identified important genes, each with fewer than 10000. When compared to the bulk of the primary cancer types, which have approximately 18000 genes, this is a significant decrease. This suggests that perhaps the proportion of RNA genes is specific to these cancer types and may reflect the underlying biology. It also suggests that patterns of gene expression involving RNA genes are more easily learned by the model with the use of fewer genes. It may also indicate that the expression patterns of RNA genes are more informative for classification than those of non-RNA genes.

The following sections will examine the biological underpinnings of RNA gene involvement for CLL, FL, and MB. As we will see below, each of these cancer types has literature to support that RNA genes, particularly miRNAs, play a significant role.
It should be noted that miRNAs have been more widely studied and thus the literature and model results presented lean heavily towards miRNAs and away from other types of RNA genes (like long non-coding or circular).

Cancer Type Task Discussion: RNA Genes in Follicular Lymphoma (FL)

The involvement of RNA genes in lymphomas has been studied over the past decade and has been shown to have prognostic value [120-127]. Specifically, miRNA expression can be utilized to produce unique miRNA signatures that have indications with regards to treatment response for lymphomas [120, 121, 122, 123]. A number of miRNAs have been indicated in the development of follicular lymphomas through the regulation of BCL2 with miR-15 and miR-16, hematopoiesis with miR-150 and miR-155, and tumour development with miR-210, miR-10a, miR-17-5P and miR-145 [120, 125, 126, 127]. The genes identified by the model for FL contain examples from each of these regulatory categories and include miR-15, miR-16, miR-150, and miR-210. This illustrates the model’s ability to learn some of the underlying miRNA signatures of FL. Note that there has been at least one long non-coding RNA gene (RP11-625L16.3) identified as playing a pathogenic role in FL, but this gene was not present in the set of genes available to the model [128].

Cancer Type Task Discussion: RNA Genes in Chronic Lymphocytic Leukemia (CLL)

MicroRNA expression profiles have been shown to be of value in assessing the prognosis, progression, and drug resistance of CLL [129]. The following have been identified as the most deregulated miRNAs in CLL: miR-15/16 cluster, miR-34b/c, miR-29, miR-181b, miR-17/92, miR-150, and miR-155. The model identified miR-15b, miR-16-1/2, miR-34b/c, miR-29b2, miR-181b1/2, and miR-150 as being significant RNA genes in CLL [128, 129]. These identified genes correspond well with the deregulated miRNAs from the literature on CLL and support the ability of the model to identify known, relevant RNA genes.
Further analysis could seek to determine the expression levels of the relevant microRNA. Note that there exist some long non-coding RNAs and circular RNAs that may be implicated in CLL but none identified in the literature were found by the model [128].

Cancer Type Task Discussion: RNA Genes in Medulloblastoma (MB)

Recent studies have begun to establish the role of RNA genes in the development of medulloblastomas. One such study identified that MB can be differentiated from normal brain tissue using the expression profiles of the miR-9 and miR-125a microRNA genes [131]. The model in this thesis supported the importance of these miRNAs and selected both of them for use in classifying MB. There are a number of other miRNAs that have been identified as either tumour suppressing or oncogenic and can be found in papers by Mollashahi et al. (2019), Cho et al. (2010), and Joshi et al. (2019) [130, 131, 132]. For example, Mollashahi et al. (2019) identified miR-125b, miR-324-5p, and miR-32 as tumour suppressors within MB and indicated that their dysregulation contributes to the development of MB [130]. Joshi et al. (2019) discuss the role of long non-coding RNA in MB and indicate that they are key regulators of cell proliferation and differentiation and that their dysregulation contributes to the development of many other cancers as well [132]. While data on lncRNA in MB is limited, the Joshi et al. (2019) paper lists 8 lncRNA genes implicated in MB.
The model selected 3 of the 8 genes and the results are presented in Table 4.9 below.

RNA Gene Type                 RNA Genes Found
Oncomir                       miR-30b/d, miR-10b, miR-367, miR-106b
Tumour Suppressor MicroRNA    miR-193, miR-32, miR-124, miR-199b, miR-324, miR-326, miR-125a/b, miR-218, miR-31, miR-135a, miR-494, miR-221
Long Non-Coding RNA           CRNDE, LOXL1-AS1, NKX2-2AS1

Table 4.9: List of RNA genes found by the model that are also implicated in medulloblastoma

Cancer Type Task Discussion Summary: RNA Genes in Outlying Primary Cancers

Given the results presented in the preceding sections, it is clear that the model is learning something about the role of RNA genes within primary CLL, FL, and MB. It is encouraging that the results span multiple types of RNA genes (lncRNA, oncomirs, and tumour suppressing miRNAs) and overlap with genes identified in the relevant scientific literature. We may be able to identify key functions that are disrupted in each of the cancer types as a result of dysregulation within the identified RNA genes. It would also be interesting to compare and contrast the identified RNA genes with those found in the metastatic cancer types.

Further analysis of these outliers should look into the correlation between RNA type and expression levels to determine if the patterns match what is to be expected from a biological perspective. For example, we would expect to see oncomirs being overexpressed and tumour suppressing miRNAs being underexpressed.
This will require refining how the gene expression categories are defined.

Cancer Subtype Task Results

Figure 4.18 presents the proportion of RNA genes identified as important within each class of the cancer subtype task.

Figure 4.18: A scatter plot showing the proportion of RNA genes within the positive important genes identified for the cancer subtype classes.

In Figure 4.18, we see that the proportion of RNA genes utilized in the metastatic cancer subtypes ranges from 0.23 to 0.37 and is spread relatively evenly throughout this range. This is very similar to the results seen at the cancer type level as there are only 3 metastatic cancer types with subtype annotations.

The range for primary cancer types is 0.025 to 0.29, with the vast majority having an RNA proportion of approximately 0.03. When comparing the metastatic and primary subtype proportions, the majority of primary cancers again use 7 times fewer RNA genes than metastatic cancers. When comparing the cancer type and subtype proportions for primary cancers, we note that the proportions have risen in approximately half of the subtypes, making their values 0.05 or above. We also note that in addition to the three outlier cancer types (MB, CLL, and FL) seen in the previous section, at the subtype level, DLBC BM should now be considered an outlier as well. DLBC BM has an RNA gene proportion of 0.22, up from the DLBC proportion of 0.10 at the cancer type level.

The normal tissue classes remain identical to the proportions seen in the cancer type results.

Primary DLBC Bone Marrow Discussion

One notable observation of RNA gene involvement within cancer subtypes was the high RNA gene proportion for the primary DLBC BM subtype. The RNA gene proportion in DLBC with bone marrow involvement (DLBC BM) is more than two times that of the DLBC subtype without. If we look more carefully at the number of genes involved (see Figure 4.6), we see that the number of genes utilized by DLBC is double that of DLBC BM.
So while DLBC BM utilizes fewer genes, it retains the same number of RNA genes. This suggests two things about DLBC BM. First, that it is easier to classify, as it requires fewer genes. Second, non-RNA genes were excluded in favour of RNA genes, implying an important role for RNA genes in bone marrow involvement of DLBC. According to the literature regarding DLBC, there are several miRNAs identified as being responsible for B cell development in bone marrow [135]. The first of these to be identified was miR-181a, which was indeed highlighted by the model as important [135]. The literature presents a thorough understanding of B cell development and how it correlates with miRNA expression. Further investigation into the miRNAs identified in the literature and those found within the DLBC subtypes by the model could provide further insight into the effect of differential expression on the progression of DLBC from a functional perspective. This would allow linking our understanding of normal and abnormal B cell development to the expression patterns of RNA genes.

Cancer Subtype Task Discussion

RNA Expression Summary

The results presented in the above sections attempt to validate the increased RNA gene importance seen in some primary and all metastatic cancer types. We have seen clear evidence that the model has uncovered significant differences in the contribution of RNA genes between metastatic cancers and the majority of primary ones. While the analysis given is far from exhaustive, it serves to show the potential for this line of research. We have identified a number of RNA genes across a variety of functions and cancer types that correlate, at least in part, with the relevant scientific literature. This has shown the capacity of the machine learning pipeline developed as part of this thesis to identify patterns in genes, coding or not, and that these genes may be biologically relevant.

There is an important caveat to consider when analyzing the RNA gene importance.
All of the cancer classes (primary and metastatic) with the highest RNA gene proportions are classes in which the data has come from sources outside of the TCGA data set. Since TCGA data composes the bulk of the training data, it is possible that these results reflect some artifacts in the data that exist as a result of sequencing protocol (i.e. batch effect). If this is the case, determining the true biological significance of increased RNA gene proportions requires careful further analysis of the results and the methods of data generation for each source. The implication of batch effect is discussed in more detail in its own section below.

4.2.7 Pseudogenes

The results presented in this section serve to quantify the role that pseudogenes play in different cancers. The majority of the discussion that follows will, as with the RNA genes, focus on the disease state and cancer type classification tasks as the trends remain similar within the cancer subtype task. These classification tasks should sufficiently exemplify the larger trends across cancers. It should be noted that there are 5280 pseudogenes in the full set of genes available to the model and that this comprises 19.8% of the available genes.

Organ System of Origin Task Results

Figure 4.19 presents the proportion of pseudogenes identified as important within each class of the organ system of origin task. We note that, as in the RNA gene results, the central nervous system, hematologic, and thoracic classes have the highest proportion of pseudogenes important for classification.
We also note that these three classes are outliers from the other organ system classes, with at least two and a half times the pseudogene involvement.

Figure 4.19: A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the organ system of origin task.

Disease State Task Results

Figure 4.20 presents the proportion of pseudogenes identified as important within each class of the disease state task. We note again that, as in the RNA gene results, the metastatic class has the highest proportion. The metastatic class utilizes more than two times the number of pseudogenes in classification when compared to the primary class. Normal class classifications use very few pseudogenes.

Figure 4.20: A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the disease state task.

Cancer Type Task Results

Figure 4.21 presents the proportion of pseudogenes identified as important within each class of the cancer type task. We observe that the metastatic classes utilize significantly more pseudogenes than most primary and normal classes. The bulk of the metastatic classes have pseudogene proportions between 0.5 and 0.55. The majority of primary cancer types have a pseudogene proportion of approximately 0.045. There are some notable exceptions among the primary cancer types. The CLL, FL, and MB classes are outliers with pseudogene proportions of 0.55, 0.585, and 0.48 respectively. There are another 11 primary cancer types that show at least two times the pseudogene proportion of the majority.
These primary cancer types are as follows: CML, DLBC, ESCA, GBM, KIRC, LGG, LUAD, LUSC, MESO, PCPG, THCA, and THYM.

Figure 4.21: A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the cancer type task.

Cancer Type Task Discussion

The results presented here show that pseudogenes play a large role in classifying metastatic cancer. In fact, at or near 50% of a classification decision is made using pseudogenes for all metastatic cancers. This value exceeds that of the RNA gene proportions seen in the previous section. Studies have suggested that the diagnostic and prognostic power of pseudogenes is in some cases higher than that of miRNAs [136]. They have also found that there are specific signatures of pseudogenes that correlate with poor survival and as a result could suggest a predisposition to metastasis [136, 137]. The implications of these studies support the results presented above for all of the metastatic cancers. In other words, the model has placed a high importance on pseudogenes for differentiating metastatic from primary cancers, and these genes may prove useful in future work for determining a predisposition for metastasis.

Similar to the caveat placed on the RNA gene results, we must consider the implication of the data source on the pseudogene content of cancer type classifications. We again see that the metastatic cancers and the largest primary cancer outliers (CLL, FL, and MB) are all cancer types that come from non-TCGA datasets. The implication of this is that there may be some batch effect occurring. This consideration somewhat diminishes the reliability of the trends shown in these results and requires further investigation to alleviate.
However, when we examine the batch effect (see the relevant section) based on the data sources, we note that the metastatic cancers appear to be the most problematic. We also note that there are 11 other primary cancers (see above) that show at least two times the pseudogene proportions of the majority of primary cancers. Of these 11 cancer types, eight of them were sourced from TCGA. This suggests that while batch effect may play a role in pseudogene importance, there are a number of examples where this is not the case and patterns of pseudogene expression may be strictly a result of underlying biology. Further in-depth analysis of the genes identified for the eight TCGA cancer types with high pseudogene importance could serve to better support the biological relevance of pseudogene expression.

Cancer Subtype Task Results

Figure 4.22 presents the proportion of pseudogenes identified as important within each class of the cancer subtype task. Generally, the trends seen here are very similar to those found within the cancer type task. The metastatic classes all have significantly higher pseudogene proportions than the bulk of the primary ones, with values ranging from 0.405 to 0.57. The outlier primary cancer subtypes correspond to the outlier primary cancer types, with CLL, FL, and MB subtypes composing this group. There are, however, more classes that show higher levels (above 0.045) of pseudogene involvement than we saw at the cancer type level. We note that 24 of the 55 primary cancer subtypes have values close to 0.045 and that the range of values is much more varied than in the cancer type results. This means the majority of subtypes show some elevated levels of pseudogene involvement compared to their corresponding cancer types. The significance of this is that some subtypes show elevated pseudogene involvement when compared to their related subtypes (i.e. within the same type).
For example, Luminal B BRCA has 3 times higher pseudogene involvement than the other BRCA subtypes and this elevated pseudogene involvement was not visible at the cancer type level. We again see this in the STAD subtypes with STAD EBV and STAD MSI being elevated and in ESCA with ESCA SCC being elevated.

Figure 4.22: A scatter plot showing the proportion of pseudogenes within the positive important genes identified for the classes within the cancer subtype task.

Cancer Subtype Task Discussion

We noted similar trends in the cancer subtype pseudogene proportions to those seen at the cancer type level. The implication for metastatic cancers is still that pseudogenes play an important role in their classification. We again have the caveat that the influence of data sources should be considered because some of the subtypes with elevated pseudogene importance are from non-TCGA data sources.

The most noteworthy change seen in the results from cancer type to subtype pertains to the elevated pseudogene levels of one or more subtypes where the other related subtype proportions remain low. One clear example of this is with the primary Luminal B BRCA subtype. Luminal B BRCA shows 3 times the level of pseudogene usage when compared to the other primary BRCA subtypes. One study has supported the use of pseudogenes for discerning breast cancers from normal tissue and other breast cancer subtypes [138]. This study suggests that pseudogenes are a valid line of inquiry for discerning breast cancer subtypes [138]. Given that the primary BRCA subtypes all come from the same data set and that there is a marked change in one subtype compared to both the related cancer type and subtypes, this is the strongest indication for a biological interpretation of the impact of pseudogenes. Couple these facts with supporting evidence of the value of pseudogenes in cancer diagnosis from the literature, and pseudogene analysis seems to be a viable avenue by which to further characterize and diagnose cancers.
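The within-type comparison described above (one subtype standing out from its sibling subtypes) can be sketched in a few lines. This is a minimal illustration only and not the analysis code used for this thesis: the function name, the subtype labels, the proportions, and the two-fold cutoff are all hypothetical choices made for the example.

```python
from statistics import median

def flag_elevated_subtypes(proportions, cancer_type_of, fold=2.0):
    """Return {subtype: fold change} for subtypes whose proportion is at least
    `fold` times the median proportion of their sibling subtypes."""
    flagged = {}
    for subtype, value in proportions.items():
        # Sibling subtypes share the same parent cancer type.
        siblings = [v for s, v in proportions.items()
                    if s != subtype and cancer_type_of[s] == cancer_type_of[subtype]]
        if not siblings:
            continue  # a type with a single subtype has nothing to compare against
        baseline = median(siblings)
        if baseline > 0 and value >= fold * baseline:
            flagged[subtype] = value / baseline
    return flagged

# Hypothetical pseudogene proportions for four subtypes of one cancer type.
example_proportions = {"X_a": 0.045, "X_b": 0.135, "X_c": 0.045, "X_d": 0.045}
example_types = {name: "X" for name in example_proportions}

elevated = flag_elevated_subtypes(example_proportions, example_types)
```

Because sibling subtypes of a single cancer type typically come from the same data cohort, a subtype flagged this way is less likely to be explained by the data source alone.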
Continued analysis of the role of pseudogenes on cancer diagnosis and the characterization of cancer subtypes should focus on the kinds of examples where individual subtypes differ from their related subtypes in order to remove batch effect implications.

4.3 The Implications of Batch Effect

This section will present and discuss the implications of batch effect on the biological interpretation of the above results. Figure 4.23 presents a t-SNE of the transcriptome data from the full training data set.

Figure 4.23: A t-SNE plot of the transcriptome data for the full training data set coloured by data cohort.

Given the results shown by Figure 4.23, there appears to be a potential issue with batch effect within the training data set. In order to understand the implication of possible batch effect, we must look at the cancer types within each data cohort. Table 4.10 shows the relevant data.

Data Cohort   Cancer Type
GPH           CLL T Tumor & DLBC T Tumor
NIH           FL T Tumor & DLBC T Tumor
MET500        All Metastatic Cancer Types
GenenTech     MESO T Tumor
TARGET        CML T Tumor
TFRI          GBM T Tumor
MAGIC         MB-Adult T Tumor

Table 4.10: List of cancer types and the non-TCGA data cohorts from which they came.

4.3.1 Batch Effect Implications on the Interpretation of Metastatic Cancers

When we consider Table 4.10, we see that batch effect is potentially a serious problem for the metastatic cancers (MET500). All of the cancer types group together based on data cohort. This observation has a negative implication on the ability to interpret any biological features of metastatic cancers within the disease state task. Within this task, the model would likely be able to accurately classify metastatic cancers on the basis of features present within the data source alone.
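The confounding described above can be checked directly from sample metadata: a label is fully confounded with its source when all of its samples come from one cohort and that cohort contributes no other label. The sketch below assumes a simple list of (cohort, label) pairs; the function name and the cohort names are hypothetical and the toy composition is not the thesis's actual sample sheet.

```python
from collections import defaultdict

def confounded_labels(samples):
    """Return labels whose samples all come from a single cohort that
    contributes no other label, so cohort identity alone predicts the label."""
    cohorts_by_label = defaultdict(set)
    labels_by_cohort = defaultdict(set)
    for cohort, label in samples:
        cohorts_by_label[label].add(cohort)
        labels_by_cohort[cohort].add(label)

    result = []
    for label, cohorts in cohorts_by_label.items():
        if len(cohorts) == 1:
            (cohort,) = cohorts
            if labels_by_cohort[cohort] == {label}:
                result.append(label)
    return sorted(result)

# Hypothetical metadata: one cohort supplies every metastatic sample and nothing else.
toy_samples = [
    ("cohort_A", "Primary"), ("cohort_A", "Normal"),
    ("cohort_B", "Primary"),
    ("cohort_C", "Metastatic"), ("cohort_C", "Metastatic"),
]

confounded = confounded_labels(toy_samples)
```

In this toy layout only the metastatic label is flagged, which mirrors the situation in which every metastatic sample is drawn from a single cohort.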
The interpretation of metastatic cancer results within the disease state classification task should consider these effects carefully.

The batch effect is such that the shared trend of high RNA gene and pseudogene importance across all metastatic cancers may not be a reliable source for biological interpretation. The implication is not that all of the important genes found for each metastatic cancer type are biologically irrelevant, but rather that a portion of them may be shared among all metastatic types and arise as an artifact of the data set. The advantage to having multiple learning tasks, however, is that we are able to follow the important genes from the disease state (covering all metastatic cancers at once) down into the cancer type and subtype levels and potentially filter out common features whose presence may be due to batch effect. Also, since the model is encouraged to learn disease state features upstream of cancer type and subtype classifications, the important genes within the cancer type and subtype classes are genes used to differentiate between not only metastatic and primary cancers as a group, but also between individual metastatic cancers. This means that for interpretation purposes we can still expect the model to learn some unique features of each metastatic cancer type and subtype. However, we must be careful in considering overall trends seen across all metastatic cancers because, as a whole, some of their important genes will be artifacts of the data source from which they came.

To support the conclusion that some of the biological features of metastatic cancers can be learned at the cancer type level, we can visualize the metastatic samples separately from the primary ones. Figure 4.24 is a t-SNE plot of only the metastatic samples from the training data and is coloured according to cancer type. This figure illustrates that there is at least some grouping together of cancer types in the metastatic domain.
This suggests that there may be some common features for the model to learn within each cancer type and that the data is not entirely useless. Figure 4.25 is a t-SNE plot of only the primary cancer samples from the training data and is also coloured according to cancer type. Comparing Figures 4.24 and 4.25, we see that the metastatic cancer types are much less well defined and have greater overlap between types than the primary cancer types seen in Figure 4.25. It is possible that if we had more metastatic samples from each cancer type we would have better established groupings within the metastatic cancers. Given the results shown here with the current data set, we would expect worse classification performance on the metastatic cancers than the primary ones. This is, in fact, what is observed throughout this thesis and can partially be explained by batch effect and the feature set contained within the data.

Figure 4.24: A t-SNE plot of the transcriptome data for the metastatic cancer types from the training data set coloured by cancer type.

Figure 4.25: A t-SNE plot of the transcriptome data for the primary cancer types from the training data set coloured by cancer type.

4.3.2 Batch Effect Implications on the Interpretation of Primary Cancers

For the primary cancers from TCGA-external data sources, we must also consider the possible implications of batch effect. The cancer types listed in Table 4.10, with the exception of DLBC T Tumor, are all predisposed to suffering from batch effect. The reason for this is that they are each sourced from single data sources that are unique to their cancer type.
This offers the machine learning model an opportunity to identify and leverage features present in the data stemming from the source as opposed to features relevant to the underlying biology. Looking at Figure 4.26, we can exclude DLBC T Tumor from batch effect as it has multiple data sources with which to encourage the model to identify biologically representative features.

Data Cohort   Cancer Subtype
TCGA          DLBC T Tumor
GPH           DLBC T Tumor
NIH           DLBC BM T Tumor

Table 4.11: List of DLBC cancer subtypes and the data cohorts from which they came.

Figure 4.26: A t-SNE plot of the transcriptome data for the DLBC cancer type coloured by data cohort.

We can also identify a correlation between single data sources for primary cancers and the outlier types identified within the RNA gene and pseudogene results. CLL, FL, and MB each have a single data source and may have elevated importance of these gene types as a result of their data source. In light of this, future analysis of the RNA gene and pseudogene trends seen within primary cancer types should focus on those cancer types that do not appear to suffer from batch effect (i.e. those from within TCGA). There were 11 other cancer types that showed elevated RNA gene and/or pseudogene importance, with eight of them being from within the TCGA data set.

4.3.3 Batch Effect Conclusion

The overall conclusion is that batch effect is potentially a problem within this data set. The nature of having limited and unique data sources for some cancer types has implications on the ability to interpret the biological implications of the results presented in this thesis. There are techniques that could be applied to the training data to try and mitigate these effects. For example, ComBat-seq is a recently published (January, 2020) tool for mitigating batch effect on RNA-seq data [139]. The model used for this thesis could be trained on batch corrected data and the results re-evaluated.
Barring batch correction and retraining, the multi-task nature of the model provides a mechanism by which the filtration of gene results can be conducted and the significance of genes biologically interpreted. This could be done by leveraging the features identified within the disease state against the cancer type and subtype task results. Finally, the inclusion of at least two data sources for each cancer of interest could help to mitigate batch effect by encouraging the model to find common features between the sets.

The feature set (genes) is another aspect to consider as part of the batch effect analysis. The feature set for this thesis work contains the intersection of genes that are found within all of the data cohorts combined. Given the results presented in Figures 4.24 and 4.25, we can see that primary cancer types are better defined by the present feature set when compared to the metastatic types. The metastatic types may need to include a different set of genes in order to be better differentiated from each other. By restricting the set of genes for the data set to the intersecting genes from each data source, we are almost certainly discarding potentially useful information. It may be the case that the metastatic cancers suffered a greater loss of information than the primary ones as a result of the gene exclusion conducted to generate the data set for this thesis.

4.4 Summary

The results presented in Chapter 4 were given in five sections. First, the biological validity of the gene results was examined using the normal tissue classes. The lists of genes for some normal classes were functionally annotated using DAVID to identify enriched pathways that are indicative of the expected biological functions of the classes in question.
Thyroid tissue showed enrichment of pathways involved in neuropeptide hormone production, lung tissue showed enrichment of gas exchange and oxygen binding/transport pathways, and breast tissue showed enrichment of pathways involved in milk and keratin production.

Following the biological validation of normal tissues, the number of important genes identified by the model for each class was presented. We noted that the number of important genes used for classification was significantly smaller for the metastatic and normal classes than for the primary ones. These results suggest that metastatic and normal tissues had more distinct patterns of expression and thus required fewer genes to identify. These results also suggest that as a whole, metastatic cancers have unique expression signatures that differentiate them from primary cancers. Determining whether or not this is biologically significant will require addressing the batch effect noted in the previous section.

The expression levels of the identified important genes were also evaluated. We noted that the model favoured genes with high expression levels in metastatic and normal classes within the disease state task. We again saw this trend at the cancer type and subtype level. We concluded here that further analysis of the implication of the expression levels of important genes would require redefining the over- and underexpression categories on a per-class basis in order to confirm the biological impact of expression levels.
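To illustrate why the choice of reference matters when defining over- and underexpression, the sketch below labels a single expression value against two different reference groups using a two-standard-deviation cutoff. It is one possible reading of "per-class" categories and not the thesis's implementation: the function name, the category labels, and all expression values are hypothetical.

```python
from statistics import mean, stdev

def categorize(value, reference, k=2.0):
    """Label `value` as over, under, or typical relative to `reference`,
    using a cutoff of k sample standard deviations from the reference mean."""
    mu = mean(reference)
    sigma = stdev(reference)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        return "typical"      # no spread in the reference, so no basis for a call
    z = (value - mu) / sigma
    if z > k:
        return "over"
    if z < -k:
        return "under"
    return "typical"

# Hypothetical expression values for one gene. Primary samples dominate the
# pooled reference, as they dominate the training data.
primary = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0] * 2
metastatic = [3.0, 3.2, 2.8]
pooled = primary + metastatic

pooled_label = categorize(3.1, pooled)         # reference skewed towards primary samples
per_class_label = categorize(3.1, metastatic)  # reference restricted to the sample's own class
```

With these toy numbers the same value is labelled "over" against the pooled reference but "typical" against its own class, which is the kind of shift a per-class definition would be expected to reveal.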
The only conclusion that can be made from these results is that the model seems to need fewer genes for classification when those genes have expression levels far above the mean (over two standard deviations). Whether this is the result of biological or computational factors remains to be determined.

After identifying the number of important genes and their expression levels, the functionality of the important genes for the non-normal classes was examined. DAVID was again used for the functional annotation of enriched pathways. Here we correlated the enrichment of particular functional pathways (the top 10 most enriched) within metastatic and primary classes of interest with scientific literature that supports the function and presence of these pathways within each class.

Significant enrichment of microRNAs was found in the enriched pathways of the metastatic class within the disease state task. This prompted further analysis of the RNA gene content of each class's important genes. The role of RNA genes was quantified and presented. We noted that there was a significant increase in the number of RNA genes identified as important within all metastatic cancers and three primary cancer types (CLL, FL, and MB). The role of RNA genes in each specific primary cancer type was discussed and shown to have support from the relevant scientific literature. The role of RNA genes in metastatic cancers was also discussed and the foundation for future research in this area was laid. The overall conclusion was that the model elected to make approximately 30% of a metastatic cancer classification using RNA genes. This suggests that RNA genes play a large enough role in metastatic cancers when compared to primary ones to elicit a recognizable pattern of expression. This pattern can be effectively leveraged by a machine learning model and further analysis of the genes involved could result in novel insights on the progression of metastatic cancers.
We noted the caveat that sequencing protocol and data generation may have impacted the apparent important role of RNA genes in classification (i.e. batch effect). While this caveat does not negate the presence of the particular RNA genes noted in the discussion, it could impact the perceived strength of the correlation between cancer type and the number of RNA genes present in the important genes listed for each class.

Finally, the role of pseudogenes in classification was presented. We noted a similar trend to the one found in RNA genes. The model elected to identify metastatic and normal classes with a high proportion of pseudogenes relative to both coding and RNA genes. This suggests that there is an expression pattern within pseudogenes that is of higher importance in classifying metastatic cancers than in primary ones. However, as with RNA gene importance, the biological significance of this needs to be further elucidated while carefully considering the implications of batch effect.

These five ways of examining the model's results have provided some examples of the kinds of data contained within the model's output. The value of these results lies in the large amount of data being output and that it appears to have some biological significance. Further study of the model's output should provide a means with which to gain insight into the biological functionality of genes within and across a variety of both metastatic and primary cancers.

Chapter 5

Conclusion

5.1 Summary of Findings

The first goal of this thesis research was to demonstrate the ability to classify with reasonable accuracy a set of normal, primary cancer, and metastatic cancer samples using gene expression data. Chapter 2 presented detailed information on the data set utilized for this work along with the methodology used to generate the machine learning model to be used for this task.
This chapter also presented the results of model validation using five-fold cross-validation and a set of multi-task models with different combinations of learning tasks including organ system of origin, disease state, cancer type, and cancer subtype. We noted that the performance was relatively similar between the different models and since the eventual goal of this thesis was to produce and analyze the largest amount of data with as much granularity as possible, a multi-task model (referred to in Chapter 2 as the "all task" model) that included all four listed learning tasks was deemed appropriate for further use.

Having validated the model architecture and set of learning tasks in Chapter 2, Chapter 3 was focused on presenting the results of classification using a model trained on the full training data set. The classification performance was evaluated across each of the learning tasks using two test sets. The test sets included one held-out data set composed of normal, primary cancer, and metastatic cancer samples, and an external test set (POG) composed of only metastatic samples. The overall trend was such that as the learning task increased in complexity and biological granularity (i.e. from organ system to cancer subtype), the classification performance declined on both test sets. The model also performed worse on metastatic cancer classification than on primary cancer. This can be, at least in part, explained by batch effect. We noted that in a t-SNE plot the metastatic cancers do not separate into cancer types as well as the primary cancers do. In addition, the training data set contains a large class imbalance that most certainly contributes to the reduction in performance on the metastatic classes, as they are in the minority. Furthermore, as the learning tasks increase in biological granularity (i.e. from organ system to cancer subtype) the class sizes decrease, because the same data set is divided among more classes. Overall, while there were some exceptions noted, the majority of classes within each learning task were classified reasonably well. We determined that the classification performance was sufficient to warrant further downstream analysis of the model using DeepLift.

Chapter 4 of this thesis focused on extracting biological information from the trained multi-task model in an attempt to glean insight about the characteristics of various cancers. This task was accomplished using a backpropagation-based tool called DeepLift. DeepLift is designed to query the impact of input features on the output of a trained neural network. In the context of this thesis, it provided a mechanism to query the importance of each gene to the classification of each class within each learning task. Using DeepLift, we were able to score the importance of each gene on the classification of cancers and use these scores to examine five aspects of the results.

The first aspect was to examine the number of important genes used for the positive classification of each class and it was determined that metastatic cancers had many fewer genes involved. This suggests that they have a unique pattern of expression that easily differentiates them from primary cancers. This may simply be the result of batch effect, particularly within the disease state task.

The second aspect of the DeepLift results examined was the expression values of important genes selected by the model. We noted that when differentiating metastatic cancers within the disease state level the model selected to use only overexpressed genes.
The caveat of this observation was that the definition of mean gene expression was skewed towards primary cancers and should be redefined to further refine the analysis of these results.

The third aspect of the DeepLift results investigated was the functional annotation of enriched pathways for some classes of interest. Three normal tissue types were used to validate the results of the functional annotations and the genes selected by the model. Following this, the enriched pathways within the primary and metastatic classes from the disease state task were presented and discussed.

The fourth analysis of the results looked at RNA gene importance for each class across every learning task. The results indicated that all metastatic and three outlier primary cancer types (CLL, FL, and MB) had significantly increased RNA importance assigned by the model when compared to the bulk of the primary and normal classes. A literature search was conducted to try and validate some of the RNA genes identified for metastatic cancers in the disease state task and the three outlier primary types in the cancer type task outputs. We noted that the model identified a number of microRNA genes within each of the cancers investigated and that the literature supported their role in each type. We did, however, observe that there was an apparent correlation between increased RNA gene importance and classes whose data came from non-TCGA sources. The implication of this is that some batch effect may be present. These results should be further examined to determine if the model is truly learning biologically relevant trends in RNA gene involvement or simply learning to differentiate something in the sequencing process that is representative of the source of the data.

Finally, the results were examined for pseudogene importance in each class across each learning task.
As with the RNA genes, we observed that all metastatic and three outlier primary cancer types (CLL, FL, and MB) had significantly increased pseudogene importance when compared to the bulk of the primary and normal classes. We again saw a correlation between non-TCGA data sources and an increase in pseudogene usage. This trend again raises concerns about batch effect. However, we also noted that there were 11 cancer types (excluding CLL, FL, and MB) that showed at least two times the pseudogene importance of most of the primary cancers. Within these 11 cancer types, there were eight that came from the TCGA data set and thus would be a good place to begin further analysis into the value of pseudogene characteristics. By focusing on these eight TCGA-sourced cancer types, we could get around some of the negative implications of batch effect and be more comfortable in making a biological interpretation of pseudogene and RNA gene trends.

5.2 Future Work

There are many avenues available for future work related to aspects of this thesis work. The first avenue to explore would be trying to batch correct the data. ComBat-seq was released in January of 2020 and may prove to be a useful tool for correcting the training data [145]. Following this, we should iterate on and improve the model used for classification. There may be changes to hyperparameters or architecture that could provide improved performance given the batch corrected data.

Following this, the way in which learning tasks interact within the model could be explored. For example, each non-terminal learning task's prediction could be included as an input feature to the downstream tasks' corresponding layers. The effect of this would be such that within each learning task, the model would be informed of the previous task's classification. This may help to mitigate compounded errors caused by having kept each learning task's prediction separate.
In other words, the model would have the ability to correct upstream learning task classification errors in downstream tasks by learning the relationship between accuracy and the previous task's prediction.

Finally, the utility of the learning tasks could be further interrogated. For example, we saw that in some cases a model including all learning tasks performed worse than a model missing the cancer type task. Depending on the intended use and desired output of the DeepLift data, we may benefit from excluding particular learning tasks and obtaining better accuracy on fewer tasks.

The bulk of the DeepLift results are left to be further investigated. There is an opportunity to conduct an in-depth review of the important genes for each cancer type and subtype. This thesis noted some larger trends that separate primary and metastatic cancers, but these results may be confounded by batch effect. Regardless, there remain numerous lists of genes available for each cancer type and subtype in which the negative implications of batch effect should be mitigated. Functional annotation of the important genes for each cancer type and subtype can and should be investigated for existing and novel pathways that may prove to be cancer driving.

Another avenue of research may be to try to differentiate metastatic and primary cancers at the disease state level using subsets of the gene types. We could attempt to classify the same cancers using only coding genes or only non-coding genes and compare the results to gain a more granular understanding of the impact of each gene type. Given that the genes used for each metastatic cancer classification were at least 24% RNA genes and at least 40% pseudogenes, there is reason to believe these types of genes encode a significant amount of information that can be used to differentiate these cancers. However, by focusing on these types of genes independently, we would be encouraging the model to learn more complex expression patterns within these genes.
We may also be able to use these new results to better observe the implications of batch effect on the current set of results.

Finally, for post-classification analysis it may be valuable to split the primary and metastatic cancer classifications into separate models. This would encourage each model to learn more distinctive features of each cancer type, as the broad, significant differences seen between primary and metastatic cancers, such as RNA gene and pseudogene expression, could not be as easily leveraged. In its current state, the model is able to use broad categories of genes such as pseudogenes to make the bulk of a classification decision for metastatic cancers, as these cancers differ vastly from the primary ones in this respect. It would be interesting to see the impact on the results when the model is forced to choose between cancer types that are more closely related as far as non-coding genes are concerned. This would also mitigate issues related to batch effect.

Bibliography

[1] Brücher, B. L., Jamall, I. S. (2014). Epistemology of the origin of cancer: a new paradigm. BMC cancer, 14(1), 1-15.
[2] Brenner, D. R., Weir, H. K., Demers, A. A., Ellison, L. F., Louzado, C., Shaw, A., ... Smith, L. M. (2020). Projected estimates of cancer in Canada in 2020. CMAJ, 192(9), E199-E205.
[3] Government of Canada, Statistics Canada. (2020, January 29). Number and rates of new cases of primary cancer, by cancer type, age group and sex. https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310011101
[4] Seyfried, T. N., Flores, R. E., Poff, A. M., D'Agostino, D. P. (2014). Cancer as a metabolic disease: Implications for novel therapeutics. Carcinogenesis (New York), 35(3), 515-527. doi:10.1093/carcin/bgt480
[5] Dillekås, H., Rogers, M. S., Straume, O. (2019). Are 90% of deaths from cancer caused by metastases? Cancer medicine, 8(12), 5574–5576. https://doi.org/10.1002/cam4.2474
[6] Chaffer, C. L., Weinberg, R. A. (2011). A perspective on cancer cell metastasis. Science (American Association for the Advancement of Science), 331(6024), 1559-1564. doi:10.1126/science.1203543
[7] Martin, T. A., Ye, L., Sanders, A. J., Lane, J., Jiang, W. G. (2013). Cancer invasion and metastasis: molecular and cellular perspective. In Madame Curie Bioscience Database [Internet]. Landes Bioscience.
[8] Staub, E., Buhr, H. J., Gröne, J. (2010). Predicting the site of origin of tumors by a gene expression signature derived from normal tissues. Oncogene, 29(31), 4485-4492.
[9] Hess, K. R., Varadhachary, G. R., Taylor, S. H., Wei, W., Raber, M. N., Lenzi, R., Abbruzzese, J. L. (2006). Metastatic patterns in adenocarcinoma. Cancer, 106(7), 1624-1633.
[10] Anderson, G. G., Weiss, L. M. (2010). Determining tissue of origin for metastatic cancers: meta-analysis and literature review of immunohistochemistry performance. Applied immunohistochemistry & molecular morphology: AIMM, 18(1), 3–8.
[11] Brosius, J. (2009). The fragmented gene. Annals of the New York Academy of Sciences, 1178(1), 186-193.
[12] Brown, T. A. (2018). Genomes 4. CRC Press. https://doi.org/10.1201/9781315226828
[13] Shin, B. K., Wang, H., Yim, A. M., Naour, F. L., Brichory, F., Jang, J. H., Zhao, R., Puravs, E., Tra, J., Michael, C. W., Misek, D. E., Hanash, S. M. (2003). Global profiling of the cell surface proteome of cancer cells uncovers an abundance of proteins with chaperone function. The Journal of Biological Chemistry, 278(9), 7607-7616. https://doi.org/10.1074/jbc.M210455200
[14] Zhang, L., Zhou, W., Velculescu, V. E., Kern, S. E., Hruban, R. H., Hamilton, S. R., ... Kinzler, K. W. (1997). Gene expression profiles in normal and cancer cells. Science, 276(5316), 1268-1272.
[15] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., ... Bloomfield, C. D. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
[16] Croce, C. M. (2009). Causes and consequences of microRNA dysregulation in cancer. Nature reviews. Genetics, 10(10), 704–714. https://doi.org/10.1038/nrg2634
[17] Shin, D. M., Kim, J., Ro, J. Y., Hittelman, J., Roth, J. A., Hong, W. K., Hittelman, W. N. (1994). Activation of p53 gene expression in premalignant lesions during head and neck tumorigenesis. Cancer research, 54(2), 321-326.
[18] Ambrosini, G., Adida, C., Altieri, D. C. (1997). A novel anti-apoptosis gene, survivin, expressed in cancer and lymphoma. Nature medicine, 3(8), 917-921.
[19] Thompson, C. B. (1995). Apoptosis in the pathogenesis and treatment of disease. Science, 267(5203), 1456-1462.
[20] Altieri, D. C. (2003). Survivin, versatile modulation of cell division and apoptosis in cancer. Oncogene, 22(53), 8581-8589.
[21] Missiaglia, E., Blaveri, E., Terris, B., Wang, Y. H., Costello, E., Neoptolemos, J. P., ... Lemoine, N. R. (2004). Analysis of gene expression in cancer cell lines identifies candidate markers for pancreatic tumorigenesis and metastasis. International journal of cancer, 112(1), 100-112.
[22] Shendure, J., Balasubramanian, S., Church, G. M., Gilbert, W., Rogers, J., Schloss, J. A., Waterston, R. H. (2017). DNA sequencing at 40: past, present and future. Nature, 550(7676), 345-353.
[23] Barba, M., Czosnek, H., Hadidi, A. (2014). Historical perspective, development and applications of next-generation sequencing in plant virology. Viruses, 6(1), 106–136. https://doi.org/10.3390/v6010106
[24] Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-628.
[25] Zhong, S., Joung, J. G., Zheng, Y., Chen, Y. R., Liu, B., Shao, Y., ... Giovannoni, J. J. (2011). High-throughput illumina strand-specific RNA sequencing library preparation. Cold spring harbor protocols, 2011(8), pdb-prot5652.
[26] Dobin, A., Gingeras, T. R. (2015). Mapping RNA-seq Reads with STAR. Current protocols in bioinformatics, 51, 11.14.1–11.14.19. https://doi.org/10.1002/0471250953.bi1114s51
[27] Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., ... Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome biology, 17(1), 13.
[28] Dillies, M. A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., ... Guernec, G. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics, 14(6), 671-683.
[29] Gutschner, T., Diederichs, S. (2012). The hallmarks of cancer: a long non-coding RNA point of view. RNA biology, 9(6), 703-719.
[30] Palazzo, A. F., Gregory, T. R. (2014). The case for junk DNA. PLoS Genet, 10(5), e1004351.
[31] Gutschner, T., Hämmerle, M., Eißmann, M., Hsu, J., Kim, Y., Hung, G., ... Zörnig, M. (2013). The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells. Cancer research, 73(3), 1180-1189.
[32] Anastasiadou, E., Jacob, L. S., Slack, F. J. (2018). Non-coding RNA networks in cancer. Nature reviews. Cancer, 18(1), 5–18. https://doi.org/10.1038/nrc.2017.99
[33] Di Leva, G., Croce, C. M. (2013). miRNA profiling of cancer. Current opinion in genetics & development, 23(1), 3-11.
[34] Rupaimoole, R., Slack, F. J. (2017). MicroRNA therapeutics: towards a new era for the management of cancer and other diseases. Nature reviews Drug discovery, 16(3), 203.
[35] Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2), 281-297.
[36] Oberg, A. L., French, A. J., Sarver, A. L., Subramanian, S., Morlan, B. W., Riska, S. M., ... Smyrk, T. C. (2011). miRNA expression in colon polyps provides evidence for a multihit model of colon cancer. PloS one, 6(6), e20465.
[37] Calin, G. A., Dumitru, C. D., Shimizu, M., Bichi, R., Zupo, S., Noch, E., ... Rassenti, L. (2002). Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proceedings of the National Academy of Sciences, 99(24), 15524-15529.
[38] Dulak, A. M., Schumacher, S. E., Van Lieshout, J., Imamura, Y., Fox, C., Shim, B., ... Tabernero, J. (2012). Gastrointestinal adenocarcinomas of the esophagus, stomach, and colon exhibit distinct patterns of genome instability and oncogenesis. Cancer research, 72(17), 4383-4393.
[39] Tutar, Y. (2012). Pseudogenes. Comparative and functional genomics, 2012, 424526. https://doi.org/10.1155/2012/424526
[40] Pei, B., Sisu, C., Frankish, A., Howald, C., Habegger, L., Mu, X. J., ... Reymond, A. (2012). The GENCODE pseudogene resource. Genome biology, 13(9), R51.
[41] Poliseno, L., Marranci, A., Pandolfi, P. P. (2015). Pseudogenes in Human Cancer. Frontiers in medicine, 2, 68. https://doi.org/10.3389/fmed.2015.00068
[42] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning (Vol. 1, p. 2). Cambridge: MIT press.
[43] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
[44] de Villiers, J., Barnard, E. (1993). Backpropagation neural nets with one and two hidden layers. IEEE Transactions on Neural Networks, 4(1), 136-141. doi:10.1109/72.182704
[45] Putri, O. (2018, December 21). Titanic Prediction with Artificial Neural Network in R. Retrieved November 04, 2020, from https://laptrinhx.com/titanic-prediction-with-artificial-neural-network-in-r-3087367370/
[46] Li, M., Zhang, T., Chen, Y., Smola, A. J. (2014, August). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 661-670).
[47] Anastassiou, G. A. (2011). Multivariate hyperbolic tangent neural network approximation. Computers & Mathematics with Applications, 61(4), 809-821.
[48] Weisstein, Eric W. "Hyperbolic Tangent." From MathWorld–A Wolfram Web Resource. https://mathworld.wolfram.com/HyperbolicTangent.html
[49] Murugan, P. (2018). Implementation of deep convolutional neural network in multi-class categorical image classification. arXiv preprint arXiv:1801.01397.
[50] Rusiecki, A. (2019). Trimmed categorical cross-entropy for deep learning with label noise. Electronics Letters, 55(6), 319-320.
[51] Chollet, François. Keras. https://github.com/fchollet/keras, 2015.
[52] Glorot, X., Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256).
[53] Larsen, J., Hansen, L. K. (1994). Generalization performance of regularized neural network models. Paper presented at the 42-51. doi:10.1109/NNSP.1994.366065
[54] Wang, S., Wang, X., Zhao, P., Wen, W., Kaeli, D., Chin, P., Lin, X. (2018, November). Defensive dropout for hardening deep neural networks under adversarial attacks. In Proceedings of the International Conference on Computer-Aided Design (pp. 1-8).
[55] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
[56] Buda, M., Maki, A., Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249-259. doi:10.1016/j.neunet.2018.07.011
[57] Zhou, Z., Liu, X. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1), 63-77. doi:10.1109/tkde.2006.17
[58] Johnson, J. M., Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6(1), 1-54. doi:10.1186/s40537-019-0192-5
[59] Shrikumar, A., Greenside, P., Kundaje, A. (2017). Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685.
[60] Pleasance, E., Titmuss, E., Williamson, L., Kwan, H., Culibrk, L., Zhao, E. Y., ... Shen, Y. (2020). Pan-cancer analysis of advanced patient tumors reveals interactions between therapy and genomic landscapes. Nature Cancer, 1(4), 452-468.
[61] Grewal, J. K., Tessier-Cloutier, B., Jones, M., Gakkhar, S., Ma, Y., Moore, R., ... Lim, H. (2019). Application of a Neural Network Whole Transcriptome–Based Pan-Cancer Method for Diagnosis of Primary and Metastatic Cancers. JAMA network open, 2(4), e192597-e192597.
[62] Robinson, D. R., Wu, Y. M., Lonigro, R. J., Vats, P., Cobain, E., Everett, J., ... Schuetze, S. (2017). Integrative clinical genomics of metastatic cancer. Nature, 548(7667), 297-303.
[63] McKinney, W., others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).
[64] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.
[65] Seo, Y., Shin, K. S. (2019). Hierarchical convolutional neural networks for fashion image classification. Expert Systems with Applications, 116, 328-339.
[66] Cancer Genome Atlas Research Network. (2017). Integrated genomic characterization of oesophageal carcinoma. Nature, 541(7636), 169-175.
[67] Cancer Genome Atlas Research Network. (2014). Comprehensive molecular characterization of gastric adenocarcinoma. Nature, 513(7517), 202-209.
[68] Zhang, W. (2014). TCGA divides gastric cancer into four molecular subtypes: implications for individualized therapeutics. Chinese journal of cancer, 33(10), 469–470. https://doi.org/10.5732/cjc.014.10117
[69] Zeng, D., Li, M., Zhou, R., Zhang, J., Sun, H., Shi, M., ... Liao, W. (2019). Tumor microenvironment characterization in gastric cancer identifies prognostic and immunotherapeutically relevant gene signatures. Cancer immunology research, 7(5), 737-750.
[70] Buda, M., Maki, A., Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249-259.
[71] Guo, X., Yin, Y., Dong, C., Yang, G., Zhou, G. (2008, October). On the class imbalance problem. In 2008 Fourth international conference on natural computation (Vol. 4, pp. 192-201). IEEE.
[72] Japkowicz, N. (2000, June). The class imbalance problem: Significance and strategies. In Proc. of the Int'l Conf. on Artificial Intelligence (Vol. 56).
[73] Dudoit, S., Fridlyand, J., Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American statistical association, 97(457), 77-87.
[74] Meyer, A. N., Payne, V. L., Meeks, D. W., Rao, R., Singh, H. (2013). Physicians' diagnostic accuracy, confidence, and resource requests: a vignette study. JAMA internal medicine, 173(21), 1952-1958.
[75] Anderson, G. G., Weiss, L. M. (2010). Determining tissue of origin for metastatic cancers: meta-analysis and literature review of immunohistochemistry performance. Applied immunohistochemistry & molecular morphology: AIMM, 18(1), 3–8. https://doi.org/10.1097/PAI.0b013e3181a75e6d
[76] Sherman, B. T., Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1), 44.
[77] Huang, D. W., Sherman, B. T., Lempicki, R. A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1), 1-13.
[78] Lewinski, A. (1988). Neuropeptides and Thyroid Function and Growth II. Intrathyroidal Peptidergic Nerves and Neuropeptides Located in Parafollicular (C) Cells. In Progress in Neuropeptide Research (pp. 65-72). Birkhäuser, Basel.
[79] Kanehisa, M., Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1), 27-30.
[80] Kanehisa, M. (2019). Toward understanding the origin and evolution of cellular organisms. Protein Science, 28(11), 1947-1951.
[81] Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., Tanabe, M. (2020). KEGG: integrating viruses and cellular organisms. Nucleic Acids Research.
[82] Patthy, L. (1991). Homology of the precursor of pulmonary surfactant-associated protein SP-B with prosaposin and sulfated glycoprotein 1. Journal of Biological Chemistry, 266(10), 6035-6037.
[83] Taylor-Papadimitriou, J., Stampfer, M., Bartek, J., Lewis, A., Boshell, M., Lane, E. B., Leigh, I. M. (1989). Keratin expression in human mammary epithelial cells cultured from normal and malignant tissue: relation to in vivo phenotypes and influence of medium. Journal of cell science, 94 (Pt 3), 403–413.
[84] Nanashima, N., Horie, K., Yamada, T., Shimizu, T., Tsuchida, S. (2017). Hair keratin KRT81 is expressed in normal and breast cancer cells and contributes to their invasiveness. Oncology reports, 37(5), 2964-2970.
[85] Lewis, M. T., Ross, S., Strickland, P. A., Snyder, C. J., Daniel, C. W. (1999). Regulated expression patterns of IRX-2, an Iroquois-class homeobox gene, in the human breast. Cell and tissue research, 296(3), 549-554.
[86] Chen, H., Sukumar, S. (2003). Role of homeobox genes in normal mammary gland development and breast tumorigenesis. Journal of mammary gland biology and neoplasia, 8(2), 159-175.
[87] Baffa, R., Fassan, M., Volinia, S., O'Hara, B., Liu, C. G., Palazzo, J. P., ... Rosenberg, A. (2009). MicroRNA expression profiling of human metastatic cancers identifies cancer gene targets. The Journal of Pathology: A Journal of the Pathological Society of Great Britain and Ireland, 219(2), 214-221.
[88] Pencheva, N., Tavazoie, S. F. (2013). Control of metastatic progression by microRNA regulatory networks. Nature cell biology, 15(6), 546-554.
[89] Bastide, A., David, A. (2018). The ribosome, (slow) beating heart of cancer (stem) cell. Oncogenesis, 7(4), 1-13.
[90] Artero-Castro, A., Kondoh, H., Fernandez-Marcos, P. J., Serrano, M., y Cajal, S. R., Lleonart, M. E. (2009). Rplp1 bypasses replicative senescence and contributes to transformation. Experimental cell research, 315(8), 1372-1383.
[91] Kim, J. H., You, K. R., Kim, I. H., Cho, B. H., Kim, C. Y., Kim, D. G. (2004). Over-expression of the ribosomal protein L36a gene is associated with cellular proliferation in hepatocellular carcinoma. Hepatology, 39(1), 129-138.
[92] Yang, S., Cui, J., Yang, Y., Liu, Z., Yan, H., Tang, C., ... Wang, W. (2016). Over-expressed RPL34 promotes malignant proliferation of non-small cell lung cancer cells. Gene, 576(1), 421-428.
[93] Zhou, H., Wang, Y., Lv, Q., Zhang, J., Wang, Q., Gao, F., ... Li, L. (2016). Overexpression of ribosomal RNA in the development of human cervical cancer is associated with rDNA promoter hypomethylation. PLoS One, 11(10), e0163340.
[94] Uemura, M., Zheng, Q., Koh, C. M., Nelson, W. G., Yegnasubramanian, S., De Marzo, A. M. (2012). Overexpression of ribosomal RNA in prostate cancer is common but not linked to rDNA promoter hypomethylation. Oncogene, 31(10), 1254-1263.
[95] Tsoi, H., Lam, K. C., Dong, Y., Zhang, X., Lee, C. K., Zhang, J., ... Fang, J. (2017). Pre-45s rRNA promotes colon cancer and is associated with poor survival of CRC patients. Oncogene, 36(44), 6109-6118.
[96] Ebright, R. Y., Lee, S., Wittner, B. S., Niederhoffer, K. L., Nicholson, B. T., Bardia, A., ... Mai, A. (2020). Deregulation of ribosomal protein expression and translation promotes breast cancer metastasis. Science, 367(6485), 1468-1473.
[97] Penzo, M., Montanaro, L., Treré, D., Derenzini, M. (2019). The Ribosome Biogenesis-Cancer Connection. Cells, 8(1), 55. https://doi.org/10.3390/cells8010055
[98] Catez, F., Dalla Venezia, N., Marcel, V., Zorbas, C., Lafontaine, D. L., Diaz, J. J. (2019). Ribosome biogenesis: An emerging druggable pathway for cancer therapeutics. Biochemical pharmacology, 159, 74-81.
[99] Kim, S. W., Roh, J., Lee, H. S., Ryu, M. H., Park, Y. S., Park, C. S. (2020). Expression of the immune checkpoint molecule V-set immunoglobulin domain-containing 4 is associated with poor prognosis in patients with advanced gastric cancer. Gastric Cancer, 1-14.
[100] Li, Y., Guo, M., Fu, Z., Wang, P., Zhang, Y., Gao, Y., Yue, M., Ning, S., Li, D. (2017). Immunoglobulin superfamily genes are novel prognostic biomarkers for breast cancer. Oncotarget, 8(2), 2444–2456. https://doi.org/10.18632/oncotarget.13683
[101] Kim, H. J., Maiti, P., Barrientos, A. (2017). Mitochondrial ribosomes in cancer. Seminars in cancer biology, 47, 67–81. https://doi.org/10.1016/j.semcancer.2017.04.004
[102] Lyng, H., Brøvig, R. S., Svendsrud, D. H., Holm, R., Kaalhus, O., Knutstad, K., ... Stokke, T. (2006). Gene expressions and copy numbers associated with metastatic phenotypes of uterine cervical cancer. BMC genomics, 7(1), 268.
[103] Caballero, O. L., Chen, Y. T. (2012). Cancer/testis antigens: potential targets for immunotherapy. In Innate Immune Regulation and Cancer Immunotherapy (pp. 347-369). Springer, New York, NY.
[104] Lahtz, C., Pfeifer, G. P. (2011). Epigenetic changes of DNA repair genes in cancer. Journal of molecular cell biology, 3(1), 51–58. https://doi.org/10.1093/jmcb/mjq053
[105] Kappil, M. A., Liao, Y., Terry, M. B., Santella, R. M. (2016). DNA Repair Gene Expression Levels as Indicators of Breast Cancer in the Breast Cancer Family Registry. Anticancer research, 36(8), 4039–4044.
[106] Sample, K. M. (2020). DNA repair gene expression is associated with differential prognosis between HPV16 and HPV18 positive cervical cancer patients following radiation therapy. Scientific reports, 10(1), 1-9.
[107] Gjerstorff, M. F., Terp, M. G., Hansen, M. B., Ditzel, H. J. (2016). The role of GAGE cancer/testis antigen in metastasis: the jury is still out. BMC cancer, 16, 7. https://doi.org/10.1186/s12885-015-1998-y
[108] Gjerstorff, M. F., Ditzel, H. J. (2008). An overview of the GAGE cancer/testis antigen family with the inclusion of newly identified members. Tissue antigens, 71(3), 187–192. https://doi.org/10.1111/j.1399-0039.2007.00997.x
[109] De Backer, O., Arden, K. C., Boretti, M., Vantomme, V., De Smet, C., Czekay, S., ... Van den Eynde, B. (1999). Characterization of the GAGE genes that are expressed in various human cancers and in normal testis. Cancer research, 59(13), 3157-3165.
[110] Chao, N. X., Li, L. Z., Luo, G. R., Zhong, W. G., Huang, R. S., Fan, R., Zhao, F. L. (2018). Cancer-testis antigen GAGE-1 expression and serum immunoreactivity in hepatocellular carcinoma. Nigerian journal of clinical practice, 21(10), 1361-1367.
[111] Shi, W. Y., Liu, K. D., Xu, S. G., Zhang, J. T., Yu, L. L., Xu, K. Q., Zhang, T. F. (2014). Gene expression analysis of lung cancer. Eur Rev Med Pharmacol Sci, 18(2), 217-28.
[112] Venugopal, N., Yeh, J., Kodeboyina, S. K., Lee, T. J., Sharma, S., Patel, N., Sharma, A. (2019). Differences in the early stage gene expression profiles of lung adenocarcinoma and lung squamous cell carcinoma. Oncology Letters, 18(6), 6572-6582.
[113] Parsons, C., Tayoun, A. M., Benado, B. D., Ragusa, G., Dorvil, R. F., Rourke, E. A., ... Habibian, M. (2018). The role of long noncoding RNAs in cancer metastasis. J Cancer Metastasis Treat, 4(4), 19.
[114] Adams, B. D., Parsons, C., Walker, L., Zhang, W. C., Slack, F. J. (2017). Targeting noncoding RNAs in disease. The Journal of clinical investigation, 127(3), 761-771.
[115] Adams, B. D., Slack, F. J. (2015). MicroRNA Signatures as Biomarkers in Cancer. eLS, 1-20.
[116] Huang, T., Alvarez, A., Hu, B., Cheng, S. Y. (2013). Noncoding RNAs in cancer and cancer stem cells. Chin J Cancer, 32(11), 582-593. doi:10.5732/cjc.013.10170
[117] Li, Y., Zheng, Q., Bao, C., Li, S., Guo, W., Zhao, J., ... Huang, S. (2015). Circular RNA is enriched and stable in exosomes: a promising biomarker for cancer diagnosis. Cell research, 25(8), 981-984.
[118] Hansen, T. B., Kjems, J., Damgaard, C. K. (2013). Circular RNA and miR-7 in cancer. Cancer research, 73(18), 5609-5612.
[119] Lu, J., Getz, G., Miska, E. A., Alvarez-Saavedra, E., Lamb, J., Peck, D., ... Downing, J. R. (2005). MicroRNA expression profiles classify human cancers. Nature, 435(7043), 834-838.
[120] Sandhu, S. K., Croce, C. M., Garzon, R. (2011). Micro-RNA expression and function in lymphomas. Advances in hematology, 2011.
[121] Navarro, A., Gaya, A., Martinez, A., Urbano-Ispizua, A., Pons, A., Balagué, O., ... Montserrat, E. (2008). MicroRNA expression profiling in classic Hodgkin lymphoma. Blood, The Journal of the American Society of Hematology, 111(5), 2825-2832.
[122] Zhang, J., Jima, D. D., Jacobs, C., Fischer, R., Gottwein, E., Huang, G., ... Weinberg, J. B. (2009). Patterns of microRNA expression characterize stages of human B-cell differentiation. Blood, 113(19), 4586-4594.
[123] Li, C., Kim, S. W., Rai, D., Bolla, A. R., Adhvaryu, S., Kinney, M. C., ... Aguiar, R. C. (2009). Copy number abnormalities, MYC activity, and the genetic fingerprint of normal B cells mechanistically define the microRNA profile of diffuse large B-cell lymphoma. Blood, The Journal of the American Society of Hematology, 113(26), 6681-6690.
[124] Baranwal, S., Alahari, S. K. (2010). miRNA control of tumor cell invasion and metastasis. Int J Cancer, 126(6), 1283-1290. doi:10.1002/ijc.25014
[125] Singh, R., Saini, N. (2012). Downregulation of BCL2 by miRNAs augments drug-induced apoptosis–a combined computational and experimental approach. Journal of cell science, 125(6), 1568-1578.
[126] Pekarsky, Y., Balatti, V., Croce, C. M. (2018). BCL2 and miR-15/16: from gene discovery to treatment. Cell death and differentiation, 25(1), 21–26. https://doi.org/10.1038/cdd.2017.159
[127] Roehle, A., Hoefig, K. P., Repsilber, D., Thorns, C., Ziepert, M., Wesche, K. O., ... Matolcsy, A. (2008). MicroRNA signatures characterize diffuse large B-cell lymphomas and follicular lymphomas. British journal of haematology, 142(5), 732-744.
[128] Li, J., Zou, J., Wan, X., Sun, C., Peng, F., Chu, Z., Hu, Y. (2020). The Role of Noncoding RNAs in B-Cell Lymphoma. Frontiers in oncology, 10, 577890. https://doi.org/10.3389/fonc.2020.577890
[129] Balatti, V., Pekarky, Y., Croce, C. M. (2015). Role of microRNA in chronic lymphocytic leukemia onset and progression. J Hematol Oncol, 8, 12. doi:10.1186/s13045-015-0112-x
[130] Mollashahi, B., Aghamaleki, F. S., Movafagh, A. (2019). The Roles of miRNAs in Medulloblastoma: A Systematic Review. J Cancer Prev, 24(2), 79-90. doi:10.15430/JCP.2019.24.2.79
[131] Cho, W. C. (2010). MicroRNAs: potential biomarkers for cancer diagnosis, prognosis and targets for therapy. The international journal of biochemistry & cell biology, 42(8), 1273-1281.
[132] Joshi, P., Katsushima, K., Zhou, R., Meoded, A., Stapleton, S., Jallo, G., ... Perera, R. J. (2019). The therapeutic and diagnostic potential of regulatory noncoding RNAs in medulloblastoma. Neuro-oncology advances, 1(1), vdz023.
[133] Mollashahi, B., Aghamaleki, F. S., Movafagh, A. (2019). The Roles of miRNAs in Medulloblastoma: A Systematic Review. J Cancer Prev, 24(2), 79-90. doi:10.15430/JCP.2019.24.2.79
[134] Joshi, P., Katsushima, K., Zhou, R., Meoded, A., Stapleton, S., Jallo, G., ... Perera, R. J. (2019). The therapeutic and diagnostic potential of regulatory noncoding RNAs in medulloblastoma. Neuro-oncology advances, 1(1), vdz023.
[135] Zheng, B., Xi, Z., Liu, R., Yin, W., Sui, Z., Ren, B., ... Liu, C. (2018). The function of microRNAs in B-cell development, lymphoma, and their potential in clinical practice. Frontiers in immunology, 9, 936.
[136] Poliseno, L., Marranci, A., Pandolfi, P. P. (2015). Pseudogenes in Human Cancer. Frontiers in medicine, 2, 68. https://doi.org/10.3389/fmed.2015.00068
[137] Gao, K. M., Chen, X. C., Zhang, J. X., Wang, Y., Yan, W., You, Y. P. (2015). A pseudogene-signature in glioma predicts survival. Journal of experimental & clinical cancer research, 34(1), 23.
[138] Welch, J. D., Baran-Gale, J., Perou, C. M., Sethupathy, P., Prins, J. F. (2015). Pseudogenes transcribed in breast invasive carcinoma show subtype-specific expression and ceRNA potential. BMC genomics, 16(1), 113.
[139] Zhang, Y., Parmigiani, G., Johnson, W. E. (2020). ComBat-Seq: batch effect adjustment for RNA-Seq count data. bioRxiv.
