Utility of machine learning approaches for cancer diagnosis and analysis from RNA sequencing

by

Jasleen K Grewal

B.Sc., Simon Fraser University, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

August 2020

© Jasleen K Grewal 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Utility of machine learning approaches for cancer diagnosis and analysis from RNA sequencing

submitted by Jasleen K Grewal in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics.

Examining Committee:

Dr. Steven JM Jones, Bioinformatics
Supervisor

Dr. Ryan Morin, Bioinformatics
Supervisory Committee Member

Dr. Marianne Sadar, Pathology and Laboratory Medicine
University Examiner

Dr. Wan Lam, Pathology and Laboratory Medicine
University Examiner

Additional Supervisory Committee Members:

Dr. Inanc Birol, Bioinformatics
Supervisory Committee Member

Dr. Ryan Morin, Bioinformatics
Supervisory Committee Member

Dr. Sohrab Shah, Bioinformatics
Supervisory Committee Member

Dr. Stephen Yip, Pathology and Laboratory Medicine
Supervisory Committee Member

Abstract

The highest number of cancer-associated deaths are attributable to metastasis. These include rare cancer types that lack established treatment guidelines, or cancers that become resistant to established lines of therapy. Precision oncology projects aim to develop treatment options for these patients by obtaining a detailed molecular view of the cancer. Scientists use sequencing data like whole-genome sequencing and RNA-sequencing to understand the biology of the cancer.
A significant challenge in this process is diagnosing the cancer type of the sample since the observed measurements are best understood with this context.

Routine histopathology relies on tissue morphology and can fail to provide a determinative diagnosis when the cancer metastasizes, presents biology attributable to multiple different cancer types, or presents as a rare cancer type. Molecular data has revealed differences in the genetic makeup of cancers that appear morphologically similar, motivating the use of molecular diagnostics. Nevertheless, no existing tools utilize the output from these sequencing modalities in its entirety (that is, without feature selection). There is also limited work evaluating the utility of pan-cancer molecular diagnostics in a precision oncology trial.

In this work we review an ongoing precision oncology trial and identify the impact of sequencing-based approaches on cancer diagnosis. We develop SCOPE, a machine-learning method that uses RNA-Seq profiles of tumours for automated cancer diagnosis. We show that this method, which uses over 17,688 gene measurements as input, has better classification accuracy than when using statistically prioritized marker genes, can deconvolve cancer-types with mixed histology, and has high performance in metastatic cancers and cancers of unknown origin. In precision oncology, manual analysis of the tumour’s genomic profile is used to understand tumour biology and driver pathways. We find that by assessing the classifier’s dependence on gene subsets, we can automatically calculate the importance of various biological programs in individual tumours. Pathways prioritized through this tool - called PIE - show a high overlap with manual integrative analysis performed by expert bioinformaticians to identify clinically important genomic changes.
Lastly, we demonstrate that PIE facilitates cohort-wide cancer analysis and discovery of novel sub-groups in advanced cancers.

Lay summary

Diagnosis is an important early step in the management of cancer, and is usually provided by expert doctors based on the cells’ appearance. This process can be easy or difficult based on various factors. Cancers are diseases of the genome. In this thesis we show that in many advanced cancers we can use molecular measurements like DNA and RNA sequencing to provide a diagnosis. We develop a computational cancer diagnosis tool that uses expression measurements of all the genes in a cancer. We then show that this method can be used to learn which biological changes are important for an individual cancer. We compare these automatically identified changes with what expert computational biologists found manually, and find a significant overlap. This approach automates the way we use sequencing data for diagnosing and understanding cancers, and expands our ability to understand rare and understudied cancers.

Preface

All the work presented herein was conducted at Canada’s Michael Smith Genome Sciences Centre, part of BC Cancer, in the laboratory of Dr. Steven J.M. Jones. Data from the Personalized OncoGenomics (POG) clinical trial was obtained after institutional review board (IRB) approval. This work was approved by and conducted under the University of British Columbia – British Columbia Cancer Agency Research Ethics Board (H12-00137, H14-00681), and approved by the institutional review board (IRB). The POG program is registered under clinical trial number NCT02155621. As part of this trial, cancer patients with advanced disease who failed conventional treatment and fulfilled the inclusion criteria were consented for tumour profiling using RNA-Seq (tumour) as well as whole-genome sequencing (tumour and blood).

Patients were referred to the POG program through their treating oncologist and enrolled into the program through a POG-trained oncologist or study nurse.
Sample collection was performed by the overseeing surgical oncologist. Dr. Andrew Mungall was responsible for the processing and library construction of the samples. Dr. Richard Moore oversaw the sequencing of the samples. Eric Chuah, Karen Mungall, Tina Wong and Reanne Bowlby supervised the alignment and variant calling of the samples.

A version of Chapter 2 has been published in Cold Spring Harbor’s Molecular Case Studies and the citation is below. A license to reuse the text and figures was not necessary since the authors retain the copyright for the publication, licensed under CC-BY. Drs. SJM Jones, Marra, Laskin and Karnezis contributed to the conception and design of the study. Dr. Anna V Tinker referred the patient to the study. Initial pathology at the Vancouver General Hospital was led by Dr. Chen Zhou. Validation pathology was led by Dr. Anthony N Karnezis and assisted by Drs. Kenrry Chiu and Basile Tessier-Cloutier. Dr. Andy Mungall contributed to the collection and assembly of data. Drs. Martin Jones and Peter Eirew contributed to data analysis and interpretation. I led the manuscript writing efforts in collaboration with Drs. Kenrry Chiu, Basile Tessier-Cloutier, Martin Jones, Anna V Tinker, and Anthony N Karnezis. All authors approved the final manuscript.

Grewal JK, Eirew P, Jones M, Chiu K, Tessier-Cloutier B, Karnezis AN, Karsan A, Mungall A, Zhou C, Yip S et al. Detection and genomic characterization of a mammary-like adenocarcinoma. Molecular Case Studies. 2017 November 21.

The work described in Chapter 3 was written entirely by myself and developed jointly by Dr. Basile Tessier-Cloutier and myself. I was the main bioinformatics analyst and Dr. Tessier-Cloutier led the pathology analysis and evaluation. The study was jointly designed by me and Dr. Tessier-Cloutier with supervision from Dr. Steven J.M. Jones and Dr. Stephen Yip. All other authors contributed equally to study design, implementation, interpretation, and writing.
A version of Chapter 3 highlighting the clinical implications will be submitted for publication as follows. '*' indicates co-first authors.

Tessier-Cloutier B*, Grewal JK*, Jones M, Pleasance E, Shen Y, Cai E, Dunham C, Hoang L, Horst B, Huntsman D, Ionescu D, Karnezis AN, Lee A, Lee CH, Lee TH, Mungall A, Mungall K, Naso JR, Ng T, Schaeffer DF, Sheffield BS, Skinnider B, Smith T, Williamson L, Zhong E, Laskin J, Marra M, Gilks CB, Jones SJM, Yip S. The impact of whole genome and transcriptome sequencing on diagnostic accuracy and treatment planning.

A version of Chapter 4 has been published in JAMA Network Open. A license to reuse the text and figures was not necessary since the article was published under the CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the author and journal are credited. The study was conceptualized and designed by Drs. SJM Jones, Martin Jones, Marco Marra, Michael Taylor and myself. I assisted in the collection and interpretation of the data in conjunction with Drs. Tessier-Cloutier, M Jones, Gakkhar, Ma, Moore, Mungall, Zhao, Mungall, Gelmon, Lim, Renouf, Laskin, and Yip. I performed the statistical analysis in collaboration with Sita Gakkhar. I created the experimental design for the classifier development, undertook the analysis and wrote the full initial draft. Dr. Jones devised the concept of the project. Early versions of the research design and assessment were developed in collaboration with Dr. SJM Jones, Sita Gakkhar, and Dr. Basile Tessier-Cloutier. All other authors contributed equally to the manuscript.

Grewal JK, Tessier-Cloutier B, Jones M, Gakkhar S, Ma Y, Moore R, Mungall AJ, Zhao Y, Taylor MD, Gelmon K et al. Application of a neural network whole transcriptome–based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Network Open.
2019 April 26.

The published case study referenced and re-analyzed in Chapter 5 was originally published in Cold Spring Harbor’s Molecular Case Studies and the citation is provided below. I assisted in collation of the genomic findings and provided expertise on interpretation of results from the supervised cancer-type classifier used in this analysis. A license to reuse the text and figures was not necessary since the authors retain the copyright for the publication, licensed under CC-BY.

Ko JJ, Grewal JK, Ng T, Lavoie JM, Thibodeau ML, Shen Y, Mungall AJ, Taylor G, Schrader KA, Jones SJM et al. Whole-genome and transcriptome profiling of a metastatic thyroid-like follicular renal cell carcinoma. Molecular Case Studies. 2018 December 17.

I conceptualized the study in Chapter 5 jointly with Dr. Jones. I was the main researcher for this work and developed all the presented code and analysis. This work was written entirely by myself, supervised by Dr. Jones. Drs. Pleasance, Csizmok and Williamson provided guidance for interpretation of results and development of statistical analysis methods. All other authors contributed equally to the editing and review. A version of this chapter has been submitted for publication as follows:

Grewal JK, Pleasance E, Csizmok V, Williamson L, Wee K, Bleile D, Shen Y, Tessier-Cloutier B, Yip S, Renouf DJ, Laskin J, Marra M, Jones SJM. Single-sample pathway analysis using Pathway Impact Evaluation (PIE) of machine-learning based cancer classifiers.

The Introduction and Conclusion chapters are original work and have not been published or submitted for publication elsewhere.

Table of Contents

Abstract
Lay summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1 Introduction
1.1 Pathology and cancer diagnosis
1.1.1 The history of cancer diagnosis
1.1.2 Cancer classification using histopathology
1.1.3 Pathology in recent decades
1.1.4 Diagnostic challenges in pathology
1.1.5 Impact of diagnosis on treatment
1.2 Genomics and cancer diagnosis
1.2.1 Computational algorithms for cancer classification
1.2.2 Beyond the diagnosis - identifying biological changes in individual tumours
1.3 Objectives and chapters overview
2 Background
2.1 Case study
2.1.1 Clinical background
2.1.2 Methods
2.1.3 Pathology analysis and findings
2.1.4 Genomic analyses
2.1.5 Clinical decision and outcome
2.2 Summary
2.3 Conclusion
3 Impact of genomics on diagnostic pathology in a precision oncology trial
3.1 Methods
3.1.1 Consent and institutional review board process
3.1.2 Tissue biopsy and processing
3.1.3 Library construction and sequencing
3.1.4 Determination of tumour type
3.1.5 Assessment of clinical input of whole-genome and transcriptome analysis in pathology
3.2 Results
3.2.1 Cohort demographics, clinical metrics, and sequencing data
3.2.2 Correlation of histopathologic diagnosis and next generation sequencing results
3.3 Discussion
3.4 Conclusion
4 Development and validation of SCOPE - supervised cancer origin prediction using expression
4.1 Background
4.2 Methods
4.2.1 Training data
4.2.2 Test data
4.2.3 Model training
4.2.4 Algorithmic model selection
4.2.5 Ensemble selection
4.2.6 Feature weights analysis for neural network
4.3 Results
4.3.1 Association of classification anomalies and biological similarities in held-out set
4.3.2 Prioritization of known diagnostic gene features without prior knowledge
4.3.3 External validation on primary cancers
4.3.4 Providing diagnosis for pre-treated metastases
4.3.5 Identification of putative primary tumour type for cancers of unknown primary
4.3.6 Impact of feature removal on classification
4.4 Conclusion
5 Enabling cancer transcriptome analysis from SCOPE using single-sample pathway impact evaluation (PIE)
5.1 Background
5.2 Methods
5.2.1 Test Data
5.2.2 Classifier used for PIE measurements
5.2.3 Pathway analysis for individual samples
5.2.4 Cohort-level pathway analysis
5.2.5 Statistical selection of top pathways associated with each cancer type
5.2.6 Statistical identification of important pathways for single-sample analysis
5.3 Results
5.3.1 Pathway impact profiles allow clustering and analysis of samples by cancer type
5.3.2 Pathway impact scores reveal prostate cancer subgroups
5.3.3 PIE independently recovers sample-level findings from integrative genomic analysis
5.3.4 PIE enables sample-level genomic analysis of cancers with unknown primary
5.4 Discussion
6 Conclusions
6.1 Contributions
6.1.1 Impact of genomic information on diagnosis of advanced cancers
6.1.2 Algorithmic advances in cancer classifier development
6.1.3 Interpreting cancer classification decisions and performing single-sample pathway analysis
6.2 Limitations of developed tools
6.3 Broader challenges in clinical translation
6.3.1 Management of diagnostic inaccuracies in clinical practice
6.3.2 Facilitating adoption in routine practice
6.3.3 Ensuring equitable access to developed tools
6.3.4 Keeping classifiers up-to-date
6.3.5 Incorporating other -omics technologies in automated diagnosis
6.3.6 Utilizing single-cell sequencing for interrogation of cancer genomes
6.4 Final words
Bibliography
Appendices
Appendix

List of Tables

1.1 Diagnostic challenges in cancer histopathology
2.1 Details of sequencing experiments.
2.2 SNVs of interest are listed, along with details on the counts of the supporting reads spanning the tumour genome at the mutated and reference bases, in the tumour genome (transcriptome).
2.3 Copy number variants of interest in the tumour genome are listed, along with percentile values and fold changes calculated from the respective RPKMs against a background of TCGA Breast cancers.
3.1 Classification outcome from SCOPE for the cancer cohorts.
4.1 Cancer types used for training, with abbreviations referenced in text.
4.2 Breakdown of cancer types in the external metastatic cohort.
4.3 Architecture, identifying names, and additional information for each neural network in the SCOPE ensemble.
4.4 Performance of SCOPE on the Genentech cohort of primary mesotheliomas. The training cohort was composed of epithelioid mesotheliomas, whereas the testing cohort was composed of epithelioid mesotheliomas and sarcoma-like mesotheliomas. Mesotheliomas that also show sarcoma-like histology are either predicted correctly as part sarcoma, part mesothelioma ("sarcomatoid mesothelioma"), or otherwise, usually as mesothelioma alone ("epithelioid mesothelioma"), or as sarcoma alone ("sarcoma").
4.5 Performance of SCOPE on the metastatic cohort. Number of mis-predictions are listed in brackets if more than one.
1 Important genes based on frequency analysis of gene weights for each neural network in SCOPE.
2 Top 25 statistically identified pathways for each common cancer and normal tissue category in TCGA, based on PIE scores.
3 Top 25 statistically identified pathways for each common cancer category in the POG and MET500 cohorts, based on PIE scores.

List of Figures

1.1 Histologic classification of cancers based on their organ-system of origin. Various organ-systems of origin have multiple cancer-types associated with them, differing by the cell-type they originate from.
1.2 Thesis overview and key contributions. In this thesis we explore the utility of bulk RNA-Seq as a diagnostic and analysis aid in personalized oncogenomics initiatives. In a detailed retrospective study we review the frequency of diagnostic changes motivated by genomic data and molecular observations. We develop an automated, open-access tool (SCOPE) for cancer classification using large, representative RNA-Seq profiles. We then extend this method to provide pathway-level profiles of individual cancer samples, also made available as an open-access tool (PIE).
2.1 Clinical history and pathology sampling timepoints for MLAV patient. Initial treatment is indicated in orange, tumour biopsies at various time-points following metastasis indicated with purple lines, and treatments provided based on genomic analysis are shown with purple drug symbols over dark-grey timeline bars. Tumour biopsies on which immunohistochemistry was performed are shown with open circle termination of corresponding line. Abbreviations: IHC - Immunohistochemistry test, POG - Personalized OncoGenomics clinical trial (Clinical Trial number: NCT02155621).
2.2 Histopathology of biopsies retrieved from MLAV Patient. A) The biopsy of the vulvar mass shows a poorly differentiated tumour composed of nests and cords of pleomorphic tumour cells. B) The HER2 immunostain on the initial vulvar mass biopsy is equivocal, compatible with score 2+ based on predominantly incomplete, weak and moderate membrane staining within greater than 10% of tumour cells. C) The fine needle aspirate of the recurrence lesion from the supraclavicular lymph node shows clusters of pleomorphic tumour cells (H&E stain). D) The HER2 immunostain of the supraclavicular lymph node shows tumour cells with complete, intense membrane staining in greater than 10% of tumour cells compatible with score 3+.
2.3 ERBB2 gene’s genomic locus is shown in the patient’s tumour. A) A lollipop plot showing the coordinates of the S310F gain-of-function mutation observed in this case. B) A plot of the copy number landscape of Chromosome 17 in the tumour. The ERBB2 copy-number gain is indicated.
2.4 Correlation plots of the cancer’s RNA-Seq profile with TCGA cancer datasets. A) Boxplot distribution of the pairwise Spearman correlation of the recurrence biopsy’s gene expression profile and all TCGA samples. The x-axis represents cancer types following TCGA naming conventions. TCGA breast cancer cohort is indicated by BRCA.
B) Boxplot distribution of the pairwise Spearman correlation between the recurrence biopsy and the TCGA breast cancer cohort based on the PAM50 set of genes. The pairwise correlations with adjacent normal are shown in blue.
3.1 Cohort selection for the assessment of the impact of DNA and RNA sequencing analysis on histopathologic diagnosis in the POG clinical trial.
3.2 Tumour types in the cohort are shown, along with the type of genomic data guiding major outcomes from the retrospective analysis evaluating the diagnostic utility of RNA-Seq and WGS.
3.3 Detection of clinically relevant molecular alterations by whole-genome and RNA sequencing in the POG cohort. (A-C) Detection of HER2 amplification in a colorectal carcinoma is shown, as indicated by immunohistochemistry (IHC) staining for HER2 (overexpression, 3+) in the tumour sections in panels A) and B), and with FISH testing for additional copies of HER2 (HER2 to chromosome 17 centromere (CEP17) ratios > 2.0) in panel C). (D-F) ALK fusion identified in a lung adenocarcinoma, missed on initial FISH analysis. H&E staining of the tumour sample is shown in D). ALK IHC testing results showing equivocal ALK staining are represented in E), with the original negative FISH results (break apart probe test, less than 15% of cells showed break apart probes) shown in F). (G and H) Detection of an IDH1 mutation in a CUP supported the putative diagnosis of cholangiocarcinoma. The H&E staining is shown in G). Panel H) shows a snapshot of the Integrative Genomics Viewer track for the mutation location with proportional read-counts supporting the reference (G, in orange) and mutation (A, in green) in the tumour genome. This supported the putative diagnosis of this CUP as a cholangiocarcinoma in the clinical context, as aided by RNA-Seq analysis.
3.4 The outcome from genomic analysis is shown separated by A) the site of biopsy of the tumour, and B) the organ-system of origin of the cancer. M and P indicate the number of metastatic and primary/relapse samples respectively.
3.5 The final diagnoses for the 15 CUP cases and 2 cases with revised diagnosis are shown, along with the type of genomic data guiding each of the outcomes. WGS = Whole-genome sequencing.
3.6 Impact of tumour content on the ability of RNA-Seq to provide the correct putative diagnosis in the POG cohort. The majority of samples arose from 3 biopsy sites - lymph node, lung, and liver, indicated in each of the panels. Wilcoxon test for significance between SCOPE outcome matching final diagnosis, versus each of the other categories: * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; ns p > 0.05
3.7 Impact of tumour content on the ability of RNA-Seq to provide the correct putative diagnosis in the POG cohort, agnostic of biopsy site. Wilcoxon test for significance between SCOPE outcome matching final diagnosis, versus each of the other categories: * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; ns p > 0.05
4.1 Performance of SMOTE as compared to other class expansion methods. Cross-validation results on the TCGA training dataset are shown. Abbreviations: dup - duplication of samples in small classes, none - no class expansion applied, weight - inverse cost for misclassification of smaller classes during training.
4.2 Results from algorithm and feature selection experiments, and performance on held-out test set. A) Feature selection does not improve pan-cancer classification. B) Comparison of algorithms - performance of single neural network on held-out set is higher than other algorithms.
C) Validation of SCOPE on TCGA held-out set demonstrates high discriminatory power amongst most cancer types. Point with bar represents average F1-score and standard deviation spread for corresponding category. Incorrect predictions for more than 10% of samples belonging to a given cancer type are shown by curved directed edges. Curve width indicates relative fraction of samples in misprediction set. Mispredictions occur amongst cancer types with the same organ-system of origin. Specific trends are discussed further in Section 4.3.1.
4.3 Performance of various models that make up SCOPE, on the cross-validation and held-out sets. The x-axis is ordered by increasing class size. Performance is reported as precision for the test-folds from CV (in black) and for all samples in the held-out set (in yellow). Number of samples in training are shown in the upper histogram panel. Cancer codes follow TCGA nomenclature and are defined in Table A.1, with _TS samples indicating tumours and _NS samples indicating adjacent normal tissues. The difference between CV-fold performance and held-out performance is typically larger for small classes. The difference becomes insignificant as class size approaches N > 100. When the classifier is augmented with addition of synthetic samples in the training folds (last panel), we observe an overall increase in performance for the smaller classes with a concomitant reduction in the performance gap between mean-CV-precision and held-out precision. The line of best fit (loess) is indicated for each model, with standard error bounds in grey. The spread of performance across different CV folds is shown by the black point (mean) with 1 standard deviation bars.
4.4 t-SNE plot of transcriptomic data in TCGA training cohorts. The relevant gynecologic and gastrointestinal cancer types are shown, and reflect the trends of cross-calling observed in SCOPE.
Esophageal adenocarcinoma (ESCA_EAC) and stomach adenocarcinoma (STAD) cluster together, as do uterine carcinosarcomas (UCS) with uterine corpus endometrial carcinomas (UCEC).
4.5 Performance of SCOPE on external metastatic cohort. A) Two-sided t-tests show a significant association of tumour content on general diagnosis as organ system, for biopsies sampled from the site of metastasis. B) Two-sided t-tests show no effect of tumour content on misclassification to organ system, for biopsies sampled from the cancer’s site of origin. C) SCOPE has improved performance compared with baseline linear comparator trained from a statistically filtered feature subset. Abbreviations: AC - adenocarcinoma, CA - carcinoma, SCC - squamous cell carcinoma, CESCAC - cervical/endocervical adenocarcinoma, UCEC - uterine corpus endometrial carcinoma.
4.6 SCOPE prediction and putative primary for cancers with unknown primary site. A confusion matrix of predictions is shown, where the size of the circles represents relative number of samples in each category. Case count for CUPs by putative origin is shown with a histogram on the right. Correct predictions are indicated in yellow whereas incorrect ones are shown in black. Salivary carcinoma, neuroendocrine tumours, and Ewing sarcomas were not present in SCOPE training, explaining the inability of the method to identify these accurately. Abbreviations: CA - carcinoma, AC - adenocarcinoma.
5.1 UMAP projections of PIE profiles for 3,963 biochemical pathways, for samples in the TCGA cohort of primary tumours. For ease of readability, the projections in panel A) show TCGA tumour types coloured by their organ system of origin. The spread of sample-specific silhouette indices, grouped by cancer type, is shown in panel B).
5.2 Pathways commonly associated with multiple cancer types in the TCGA cancers are shown.
Grey bars indicate total number of tumour samples evaluated, whereas coloured bars indicate the number of tumour samples from the respective organ-system of origin. Panel A) shows the most common cell-function pathways. Panel B) shows the most common cancer-associated pathways.
5.3 Statistically significant pathways in TCGA cancers. Panels show important pathways for each tumour and normal category, grouped by organ system of origin. Each group shows the top-5 pathways associated exclusively with cancers in the relevant organ-systems of origin, ordered by number of samples in which the pathway had a positive PIE score. Coloured bars indicate fraction of tumour samples from the organ-system where the respective pathway was positively scored by PIE.
5.4 Determination of pathway-level activities for TCGA primary cancers, using PIE. Panel A) shows the number of statistically significant pathways positively associated with each cancer-type, from the group of 3,963 pathways evaluated using PIE. Panel B) shows the number of pathways with statistically significant PIE scores per sample.
5.5 Clustering of POG cohort samples by cancer-type using PIE profiles for 3,963 biochemical pathways. Cancer types with at least 10 samples (N = 510/602) are shown for ease of readability. A) UMAP projections of pathway profiles are shown. Using pathway importance scores, samples cluster by their diagnosed cancer type. B) Silhouette indices of samples are shown, grouped by cancer type. A positive silhouette index indicates that a sample clusters with its assigned cancer-type.
5.6 Clustering of MET500 cohort samples by cancer-type using PIE profiles for 3,963 biochemical pathways.
For ease of readability, the projections only show cancer types with at least 10 samples in the MET500 cohort (N = 259/375). A) UMAP projections of pathway profiles are shown. Using pathway importance scores, samples cluster by their diagnosed cancer type. Of note, we observe 3 distinct clusters of prostate adenocarcinoma (PRAD, in dark-green). B) Silhouette indices of samples are shown, grouped by cancer type. A positive silhouette index indicates that the sample clusters with its assigned cancer-type.

5.7 Silhouette index spread for the MET500 cohort subtypes. Silhouette metrics are calculated from the UMAP projections initialized with the first two principal components; clusters are evaluated based on cancer type annotation. A positive silhouette index indicates that the sample clusters with its assigned cancer-type. Abbreviations: BLCA – Bladder cancer, BRCA – Breast cancer, IDC – invasive ductal carcinoma, ILC – invasive lobular carcinoma, CHOL – Cholangiocarcinoma, EHCH – extrahepatic CHOL, IHCH – intrahepatic CHOL, COADREAD – colorectal adenocarcinoma, ESCA – esophageal carcinoma, SCC – squamous cell carcinoma, EAC – adenocarcinoma, OV – ovarian cancer, PRAD – prostate adenocarcinoma, SARC – sarcoma, RHBD – rhabdoid, LMS – leiomyosarcoma, EW – Ewing sarcoma, UPS – undifferentiated pleomorphic sarcoma, DDL – dedifferentiated sarcoma, SKCM – skin cutaneous melanoma.

5.8 Cohort comparison between the top 25 pathways associated with breast cancer, for The Cancer Genome Atlas (TCGA) cohort of primary cancers, the POG cohort of metastatic tumours, and the MET500 cohort of metastatic tumours. Panel A) shows the number of unique and shared pathways between each of the cohorts. The MET500 and POG cohorts are grouped as ‘metastatic’.
Pathways common between primary and metastatic cancers (in purple), exclusive to primary cancers (in orange), and common within the metastatic cohorts (in light blue) are shown in panel B) with the corresponding mean PIE score across samples on the y-axis.

5.9 UMAP projections of the MET500 cohort are shown, filtered to view only the prostate adenocarcinoma samples. UMAP projections (initialized by the first two principal components) are calculated based on A) sample pathway importance profiles calculated automatically by PIE for 3,963 pathways, and B) gene expression profiles of the samples (RPKM values). Panel B) also suggests a non-random separation of the samples.

5.10 Top 25 pathways driving the 3 distinct clusters observed for the prostate adenocarcinomas in the MET500 cohort.

5.11 Top 25 pathways from PIE-based pathway analysis of a mammary-like vulvar adenocarcinoma. 40% of the pathways shown here overlap with the integrative pathway analysis (in yellow, green), and 16% are associated with paclitaxel therapy that the patient had received previously (in green, blue). Size of pathways is indicated in brackets next to the pathway name on the y-axis. The right panel shows the number of genes shared between the integrative analysis (N = 50) and the indicated pathways. Distribution of PIE scores for the remaining 65 classes is shown in grey, for each pathway.

5.12 Top 25 pathways identified by automated pathway impact analysis using PIE, for a cancer of unknown primary that was later diagnosed as a rare thyroid-like follicular renal cell carcinoma. Size of pathways is indicated in brackets next to the pathway name on the y-axis. Panel on the right shows the number of genes from integrative analysis (N = 34) that overlap with the genes in each of the pathways.
48% of the pathways in the main panel overlap with manual integrative pathway analysis findings (in yellow, red), of which 12% are associated with the actual rare cancer type that this cancer represented (in red). Distribution of PIE scores for the remaining 65 output classes is shown in grey, for each pathway.

5.13 Comparison of pathway importance scores for two different output categories – renal clear cell carcinoma (KIRC) and renal papillary carcinoma (KIRP). Scores were calculated by PIE using the SCOPE classifier output. The input was the RNA-Seq profile of a cancer of unknown primary, later diagnosed as a rare follicular renal cell carcinoma that molecularly aligned with KIRP. The pathways that were important for classification of the sample as KIRP instead of KIRC are highlighted in yellow. Pathways important for classification of the sample as KIRC instead of KIRP are shown in blue. As is evident, the magnitude of the pathway importance is higher for pathways driving the classification of KIRP over KIRC. Relevant pathways have been labelled.

1 Example output from SCOPE for a sarcomatoid mesothelioma, predicted with split confidence as mesothelioma and sarcoma.

2 Mean prediction accuracy of SCOPE as RPKM values of various fractions of genes are set to 0 in the input RNA-Seq data. Grey bars around mean points indicate standard error bounds. Black line indicates the line of best fit (loess). At a given threshold, n% of genes in the input were randomly set to zero. This was repeated 10 times for each n in (10, 20, 30, 40, 50, 60, 70, 80, 90, 99).

3 UMAP projections of PIE profiles for 3,963 biochemical pathways, for samples in the TCGA cohort of primary tumours. The projections are coloured by tumour-type.

4 Pathway importance for various Androgen Receptor associated pathways for the MET500 Prostate Adenocarcinoma samples, separated by observed cluster groups.
5 Manual integrative analysis of a mammary-like vulvar adenocarcinoma. Colour of circles shows fold expression change of the respective gene in the sample, relative to a background of all healthy normal tissues from GTEx. Box adjacent to circle indicates percentile expression compared to the Cancer Genome Atlas’ cohort of breast cancers. Over-expression is shown in red, and loss of expression in blue. The key oncogenic pathways impacted in this case are shown with grey boxes and red border. Manual analysis identified activation of ERBB2/ERBB3, mTOR pathway, and the MAPK pathway. Overexpression of various genes participating in transcriptional regulation and metabolism was also identified (shown in red borders).

6 Manual integrative analysis of a cancer with unknown primary, which was diagnosed as a thyroid-like follicular renal cell carcinoma molecularly similar to renal papillary carcinoma. Colour of circles shows fold expression change of the respective gene in the sample, relative to a background of healthy renal tissues. Box adjacent to circle indicates percentile expression compared to the Cancer Genome Atlas’ cohort of renal papillary carcinomas. Over-expression is shown in red, and loss of expression in blue. The key oncogenic pathways impacted in this case are shown with grey boxes and red border.
List of Abbreviations

Short           Long
ACC             Adrenocortical Carcinoma
ACYC            Adenoid Cystic Carcinoma
BCCA            British Columbia Cancer Agency
BLCA            Bladder Carcinoma
BRCA            Breast Carcinoma
CESC            Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL            Cholangiocarcinoma
CI              Confidence interval
CIHR            Canadian Institutes of Health Research
CNV             Copy Number Variant
COADREAD        Colorectal adenocarcinoma
COSMIC          Catalog of Somatic Mutations in Cancer
CUP             Cancer with Unknown Primary
CV              Cross-validation
DLBC            Diffuse Large B-cell Lymphoma
DLBC-BM         Diffuse Large B-cell Lymphoma (Bone Marrow)
EMPD            Extra-mammary Paget’s disease
ESCA            Esophageal carcinoma
ESCA-EAC        Esophageal adenocarcinoma
ESCA-SCC        Esophageal Squamous Cell Carcinoma
ET              Extra-trees
FL              Follicular lymphoma
GBM             Glioblastoma multiforme
GSC             Genome Sciences Centre
GTEx            Genotype-Tissue Expression
HNSC            Head and Neck squamous cell carcinoma
HPV             Human Papillomavirus
ICGC            International Cancer Genome Consortium
IHC             Immunohistochemistry
INDEL           Insertion/Deletion event
IQR             Inter-quartile range
KICH            Kidney Chromophobe
KIRC            Kidney renal clear cell carcinoma
KIRP            Kidney renal papillary cell carcinoma
LAML            Acute Myeloid Leukemia
LGG             Brain Lower Grade Glioma
LIHC            Liver hepatocellular carcinoma
LOH             Loss of heterozygosity
LUAD            Lung adenocarcinoma
LUSC            Lung squamous cell carcinoma
MB-Adult        Adult medulloblastoma
MESO            Mesothelioma
MISC            Miscellaneous
MLAV            Mammary-like adenocarcinoma of the vulva
NCI-GPH-DLBCL   Diffuse Large B-cell Lymphoma (National Cancer Institute cohort)
NN              Neural network
OV              Ovarian serous cystadenocarcinoma
PAAD            Pancreatic adenocarcinoma
PCPG            Pheochromocytoma and Paraganglioma
PIE             Pathway Impact Evaluation
POG             Personalized OncoGenomics
PRAD            Prostate adenocarcinoma
READ            Rectum adenocarcinoma
RF              Random forest
RNA-Seq         RNA Sequencing
RPKM            Reads Per Kilobase of transcript per Million mapped reads
SARC            Sarcoma
SCOPE           Supervised Cancer Origin Prediction using Expression
SKCM            Skin Cutaneous Melanoma
SMOTE           Synthetic Minority Oversampling Technique
SNV             Single Nucleotide
Variant
STAD            Stomach adenocarcinoma
SVM             Support vector machine
TCGA            The Cancer Genome Atlas
TFRI-NCL-GBM    Glioblastoma multiforme (Terry Fox Research Institute cohort, non-cell line)
TGCT            Testicular Germ Cell Tumours
THCA            Thyroid carcinoma
THYM            Thymoma
UCEC            Uterine Corpus Endometrial Carcinoma
UCS             Uterine Carcinosarcoma
UVM             Uveal Melanoma
WES             Whole exome sequencing
WGS             Whole genome sequencing

Acknowledgements

I would like to thank my supervisor, Dr. Steven Jones, for the supportive environment and mentorship he has provided me throughout my degree. The independence granted me to pursue collaborations, research directions, and other opportunities beyond academia has enabled me to explore my research and teaching interests to their limits, and for that, I am extremely grateful. Steve’s wisdom and foresightedness - percolated through numerous long hours in the office and at the pub - continue to serve as life-long lessons in mentorship, team management, problem-solving, and interpersonal relationships. Special thanks to Louise Clarke and Sharon Ruschkowski for their administrative support through the years. Meetings would be impossible to schedule and paperwork a torturous exercise if not for Louise’s ever-smiling presence and incredible organizational skills.

I would also like to thank my committee members, Drs. Stephen Yip, Ryan Morin, Sohrab Shah, and Inanc Birol, for their mentorship during these years. Their unique perspectives and eternal wisdom have helped me appreciate the practicalities of interdisciplinary research and provided me the motivation to keep learning beyond what I know. Dr. Morin was instrumental in giving me my first proper taste of bioinformatics research in my undergraduate days - without his support and advice during that time, I would not have been nearly as confident embarking upon my PhD. Thank you to Dr. Birol for constantly teaching me the humble lesson that every ‘cancer sample’ we study is a person with dreams and aspirations. Dr.
Shah’s meticulousness, eye for detail, and command over mathematics principles have kept me on my toes, providing impetus on those rare days of procrastination. My thanks to Dr. Yip in particular for being akin to a second supervisor to me, guiding me into the world of histopathology and spearheading our various collaborations. The visits to the cancer genomics laboratory and dissections have provided me with a better understanding of cancer genomics in the clinic and a deep appreciation for the medical profession. The opportunity to work with residents like Drs. Basile Tessier-Cloutier and Adrian Levine has been a privilege and contributed significantly to this research.

The last few years would have been very dull without the guidance and cheerleading abilities of several current and former lab members, particularly Drs. Jake Lever, Erin Pleasance, Yaoqing Shen, Martin Jones, Laura Williamson, Zoltán Bozóky, Martin Krzywinski, and Eric Zhao. The long days spent coding, debugging, and thinking about thinking about writing were made shorter by Jake and Eric’s constant concern and encouragement. Thank you, Erin, Yaoqing, Laura, Zoltán, and both Martins, for providing me the much needed guidance in research efforts, faith in my abilities, and abstract musings to keep the gears well-oiled along this journey. To Martin K in particular, thank you for opening up my eyes to the importance of data visualization, providing me the opportunity to contribute to the Points of Significance series, relating much-needed philosophical quips and musings during stressful times, and often, as a consequence of our invigorating long chats, making me late for other meetings.

My fellow graduate trainees in the lab, including Luka Culibrk, Michael Disyak, Emre Erhan, Jenny Yang, and Vahid Akbari, have been a constant source of inspiration.
I am particularly thankful to have worked alongside Luka, Micha, Emre, and Jenny in their research endeavors. Their curiosity, intelligence, and enthusiasm have helped preserve the wide-eyed child in me throughout. Drs. Robin Coope, Andy Mungall, Gordon Robertson and Kieran O’Neill have been invaluable in helping me understand the larger world of scientific research and the gears that keep it moving. I am grateful to them for their valuable feedback and advice in situations where I could not map out a clear way forward myself. I hope we continue to keep in touch as we all advance through life.

My friends and family have been vital in making me the person I am today. My immense love to my nanaji (grandfather) Jaswant Sran and my naniji (grandmother) Gurdev Sran, who have constantly believed in me. My deepest gratitude to my parents, Jaswinder and Gurmukh, for their vision and guidance, and to my mamiji and mamaji, Sarb and Jasdev, who have been excellent aunt and uncle + proxy parents for the last 10 years. This degree is as much theirs as it is mine. I would not have nearly the same drive and enthusiasm for my research areas if not for the mentorship of my high-school teachers Ms. Ekta Bali, Ms. Hampreet Sidana, and Mr. Doug Barham. Their dedication to the profession and their eagerness to do best by their students give me lots to aspire to. My sister Baldeep and my cousins Shub, Sehaj, and Seerat have brightened up my years with their laughter and smiles - an important factor in rainy Vancouver. The company of life-long friends - Kern, Archit, Onishma, Bhavit, and Varun - and that of my peers in the bioinformatics training program has also served as vital motivation and support. Most crucially, a big thanks to my boyfriend and rock, John Dupuis, for his love, patience, and confidence in me throughout comps, thesis writing, and the years in-between.
His beyond-par baking skills and Wednesday beer club have fuelled me throughout this degree, and his family has been a warm and welcoming refuge from the worries of research obligations.

Lastly, I must thank the many members of the personalized oncogenomics (POG) team, including the patients and their families who put their trust in this initiative. The exemplary leadership of Drs. Marco Marra, Steven Jones, and Robyn Roscoe has been an inspiration to become the best scientist version of myself. Thank you to Jessica Nelson for putting up with and resolving my frequently obtuse inquiries about the clinical data. The vast majority of this work would not be possible without the various online contributors and developers of StackOverflow, RStudio, BioRender, and Python, and the various members of the GSC’s systems team. Various funding agencies have supported this research, including a UBC Four Year Fellowship and travel fellowships from Canada’s Michael Smith Genome Sciences Centre, CIHR, and the Canadian Cancer Society. Part of the presented work was supported by the BC Cancer Foundation and Genome British Columbia (project B20POG). I also acknowledge contributions towards equipment and infrastructure from Genome Canada and Genome BC (projects 202SEQ, 212SEQ, and 12002), Canada Foundation for Innovation (projects 20070, 30981, 30198, and 33408), and the BC Knowledge Development Fund.

Dedication

To my father Gurmukh, my mother Jaswinder, my sister Baldeep, and my boyfriend John. This thesis is sponsored by the many hugs, conversations, and delicious food supplied by them.

Chapter 1
Introduction

At the microscopic level, organs are distinguished from each other by the cells that make them up. Cells in different tissues and organs behave differently from each other. This behaviour can be characterized by differences in gene expression. It is an inherent biological property of each cell type. Tumours begin when healthy cells start dividing in an uncontrolled manner.
This can happen due to exposure to carcinogens that cause DNA damage, or from small changes in DNA that accumulate over time. Tumours can be benign, as is the case with moles that arise from skin cells called melanocytes. In certain cases, tumours can become cancerous, starting to compete with surrounding healthy cells for space and resources. Cancerous lesions also exhibit local and distant invasion. When cancers move away from their site of origin (primary site) to other sites in the body, they are said to have metastasized. Metastasis can be facilitated by the lymphatic or circulatory systems. Cancer metastasis is the primary cause of cancer morbidity and mortality. Advances in cancer treatment have resulted in effective management or even complete cure of cancers if detected and diagnosed prior to metastasis. However, metastatic cancers are challenging to treat, resulting in about 90% of all cancer-associated deaths in North America.

Cancer treatment is conventionally based on the site of origin of the tumour. Primary and metastatic cancers usually retain certain biological features characteristic of their primary cell type. These features can be physical (the shape of cells, how cells organize themselves), or molecular (expression patterns of genes). The identification of the site and cell type a cancer originated from is dependent on various factors. In most cases, a cancer diagnosis - characterizing the type of cancer and accompanying treatment option - is provided to patients using guidelines that consider the tumour cells’ appearance and the patient’s clinical history. However, morphology-based diagnosis is a challenging task, involving sequential interrogation of the tissue sample using various established histopathology methods. Previous studies have found that histopathology-based misdiagnosis rates can range from 10% to 20% simply due to differences in interpretation of results by different pathologists.
Other studies have found that the misdiagnosis rate can go up to 50% depending on the type of cancer [39, 69, 198].

Cancer can be considered a disease of the genome. Presumably then, the genomic profile of a cancer provides a more reliable diagnostic assessment than manual inspection of tissue morphology. The work presented in this thesis explores the relevance of cancer diagnosis in precision oncology and outlines the utility of gene expression data in diagnostic pathology. It then proposes machine-learning approaches to integrate RNA-sequencing experiments into diagnostic workflows and genomic analysis of rare and advanced cancers.

In this chapter the reader will be familiarized with the essential aspects of cancer diagnosis - how it is obtained, why it is needed, and what some of the recent technical and biological advances in the field are. The first section deals with the origin and evolution of diagnostic pathology over the last 3 centuries. The second section outlines the contribution of genomics to cancer diagnosis in recent decades and introduces the reader to technical concepts of relevance in subsequent chapters. We conclude the chapter by outlining the key objectives of this thesis accompanied by a brief overview of the following chapters.

1.1 Pathology and cancer diagnosis

Evidence-based medicine bases itself on the bedrock of medical guidelines. These constantly evolving guidelines form the criteria for diagnosis and management of diseases. Particularly in cancer, the appearance of cells and nuclei determines whether abnormal tissue growths are cancerous or not. The field of diagnostic pathology establishes these morphology-based guidelines in cancer care.

The widely used Pap smear test is the most familiar example of this primary means of cancer diagnosis. Dr. George Papanicolaou developed the test in 1923 to identify cervical cancer. It involves exfoliating cells from the cervix through scraping and examining these cells microscopically for cancerous behaviour.
The low cost, ease of administration, and accurate interpretation made this test accessible, significantly reducing global cervical cancer incidence. The approach itself is known as histopathology and forms the basis for modern-day cancer diagnosis. Two key ideas that established the field - cell staining and cell theory - emerged from extensive investigation of disease origin and characterization in the 17th to 19th centuries.

1.1.1 The history of cancer diagnosis

By the early 1600s, it was well-recognized that many illnesses are correlated with changes inside the patient’s body. Giovanni Morgagni, an Italian pathologist and anatomist, was the first to use broad anatomical findings for routine diagnosis of several diseases - including tumours. In his seminal work of 1769, “The Seats and Causes of Diseases Investigated by Anatomy”, he noted tumourous growths on the maxillary gland of a patient who had trouble swallowing. Dr. Morgagni’s work also documented the spread of cancer, observing that the same patient had some tumours in the pharynx and larynx. In 1775, the occurrence of cancers of the scrotum in chimney sweeps was reported - the first documented instance of an occupational cancer. The biological cause of cancer was as yet unknown.

The 1800s saw the emergence of two notable theories on the origin of cells. In 1838, a German pathologist, Dr. Johannes Müller, proposed that cancer is made up of abnormal cells originating from the body itself. Through examination of several microscopic samples of tumours, he determined that cancers possess distinct microscopic features that can be used to identify them, thereby establishing the field of histopathology. In 1852, Dr. Robert Remak, a prominent embryologist working with Dr. Müller, leveraged membrane staining to observe cellular changes. Tracing the origin of new cells using this membrane staining technique, Dr.
Remak was able to show that all cells arise from cells - ‘omnis cellula a cellula’. This was a significant advancement on Dr. Müller’s original hypothesis that cancerous cells arose from bodily fluids spontaneously. The ensuing debate, accompanied by the introduction of the compound microscope in medical research in the 1850s, shifted the assessment protocol for cancers from gross anatomic features to cellular changes. Combined with Dr. Rudolph Virchow’s subsequent work supporting this cell theory, the findings established the domain of cellular pathology.

Tissue excision

The use of tissue excisions to understand diseases was not entirely novel for the time. The first documented case of physicians studying the appearance of a disease dates back to the 900s, when the Arab physician Albucasis (936-1013) used a thin needle to retrieve a tissue sample from the throat of a goiter patient - an approach known today as the fine-needle aspiration biopsy. A few centuries later, in 1848, a German dermatopathologist used microscopic studies of tissue excisions to distinguish normal and abnormal skin. The term ‘biopsy’, used commonly today to refer to tumour tissue specimens taken for histopathology analysis, was coined in 1879 by French dermatologist Ernest Besnier.

Over the last few decades, various approaches for biopsying an abnormal lesion have been established. Cytologic examination of scraped cells can provide a quick overview of the tissue morphology. Detailed histologic examination is done using core-needle biopsies, whereby a special hollow needle is used to take a small cylinder-shaped (core) sample of the lesion. If the lesion cannot be excised easily or if the excision could lead to functional impairment, an incisional biopsy or a fine-needle aspirate is drawn to permit evaluation. In some cases, the tumour can be excised using surgical tools, providing an excisional biopsy for pathology.
In challenging cases, the process can be guided by an imaging procedure like X-ray or computerized tomography.

Tissue sectioning

Once the tissue is biopsied, it needs to be treated so that it can endure long-term storage and study. Until the 1860s, tissue specimens would be prepared with fluid smears, or by scraping the cut surface of tissues. As the microscopic resolution of lenses increased, so did the need to improve the preservation quality of tissue specimens and facilitate fine-grained study. Edwin Klebs introduced paraffin embedding of tissues in 1869. Since then, various embedding techniques using waxes and resins have been developed to analyze different tissue specimens, but paraffin remains most suitable for embedding the broadest range of tissues. If the tissue is calcified (for example, bone), minerals have to be removed via decalcification first. Prior to paraffin embedding, tissues need to be fixed to impart mechanical rigidity and withstand subsequent processing. No ideal fixative has been found to date - that is, one that preserves cellular morphology perfectly without compromising the specimen’s composition or reactivity of proteins in the cell. Formalin, a fixative discovered in the early 1900s, is used most commonly in the fixation process. Currently, formalin-fixed paraffin-embedded (FFPE) tissue sectioning is one of the two prominent methods used to prepare tissue for analysis in pathology laboratories.

Dr. Virchow’s student, Julius Cohnheim, was able to preserve tissue excisions as frozen sections in 1864. Unlike FFPE samples, frozen sections could be prepared within minutes and used to undertake additional work-ups (for example, special fixation and tissue staining). Today, frozen sections are preferred over FFPE for genomic analysis (DNA or RNA sequencing, for example) since the FFPE process damages nucleic acids, causing artefacts that compromise their quality.
FFPE tissues require 12-24 hours to prepare, but have better morphologic quality and are hence preferred over frozen tissues for archival purposes.

Several pre-analytical factors also influence the morphologic quality of prepared tissue sections from the FFPE process. The time interval between the tissue’s removal from the patient and the time it is fixed in formalin - known as the cold ischemia time - impacts antigen viability for various protein binding assays. The fixation time in formalin also impacts the availability of the antigens detected by pathology assays like immunohistochemistry. In addition, the time and temperature of formalin fixation impact how well the preserved tissue can be used for molecular tests like in situ hybridization (discussed later).

Tissue staining

After tissue sectioning (FFPE or frozen), thin sections of the tissue can be cut using a microtome, and stained to facilitate histopathology analysis. These stains aid in tumour differential diagnosis and classification. Based on the type of stain used, various morphologic attributes in the tissue specimen can be highlighted. In the 1800s, Dr. Cohnheim used a silver stain to outline frozen sections of nerve endings. A few years later, it was found that hematoxylin and eosin stain different parts of the cell. Hematoxylin is a naturally occurring chemical discovered in 1502. It was found in the 1800s that the dye binds to nuclear proteins (specifically, histones) in its oxidized form, imparting a deep blue to black colour. Eosin Y is an acidic dye that binds to positively charged components of the cytoplasm, imparting a pink colour to the non-nuclear components of the cell. It is used as a ‘counterstain’ along with hematoxylin. H&E is the most widely used mixture of dyes in histology. Hematoxylin binds to the nuclei, painting them a bright blue, and eosin stains the extracellular matrix and cytoplasm pink. Other cell organelles take on a combination of these hues.
In the case of FFPE samples, the method used for decalcification can impact the effectiveness of the H&E stain.

Additional stains exist to specifically identify bio-molecules like mucins, amyloids, lipids, glycogen, and elastic tissues. Van Gieson’s stain or the Masson trichrome method, for example, can be used to distinguish collagen and muscle. The stained sections are reviewed by trained pathologists, who draw upon their expertise in the area to determine the tumour characteristics. The triad of histology, microscopy, and pathologist examination forms the crux of modern-day cancer diagnosis. It is the responsibility of the surgical pathologist to use available stains as needed and synthesize the resultant findings provided by tissue morphology and by each stain to provide a comprehensive diagnosis for the cancer patient.

1.1.2 Cancer classification using histopathology

The assessment of cell morphology to study diseases organically led to the development of classifications for healthy and malignant cells. In 1858, Dr. Rudolph Virchow observed that some patients had an abnormal number of white blood cells, naming the condition leukämie. It would be a few more decades before leukemia was classified as a cancer, in 1938. In the interim, histopathology remained broadly the same.

The healthy human body has various tissue types. The primary ones are the epithelium (outer layer of skin, mucosal tissues), supportive tissue (bone, cartilage, connective tissues), nerve tissues, lymphatic tissue, and the bone marrow. Abnormal tissue changes (lesions) are classified based on the cell-type of origin, and in some cases, using additional attributes that indicate how different the cancer looks from its counterpart normal cells. The nomenclature for all tumours, benign or malignant, is based on the normal tissue the tumour originates from.
Current tumour classification systems encompass terms on biologic behaviour, cellular function, histology, embryonic origin, and anatomic locations. These designations carry well-defined clinical implications and are also important for communicating a diagnosis.

Benign lesions can give rise to localized tissue masses. Fibromas, chondromas, and adenomas fall into this group. They are distinguished from each other based on micro- and macro-scopic patterns that characterize their cells of origin. While not invasive, these masses can compete for space and resources, requiring their removal. For example, benign tumours in the nervous system are common. Due to their location, they can be harmful to the patient but their removal is also extremely difficult.

Malignant lesions, or tumours, are broadly classified into carcinomas, sarcomas, lymphomas, and leukaemias. Cancers that arise from epithelial cells are called carcinomas, and further subtyped based on the appearance the tumour takes. Particularly, malignant epithelial tumours with a glandular growth pattern are called adenocarcinomas, whereas those with a stratified distribution of cells are called squamous cell carcinomas. Sarcomas arise from mesenchymal tissues and are further subtyped based on their histogenesis. For example, cancers arising from fibrous tissues are called fibrosarcomas. Cancers arising in lymph nodes and other parts of the lymphatic system are called lymphomas, whereas cancers arising from the bone marrow are commonly grouped under leukaemias. These classifications are further described in Figure 1.1.

The cells’ appearance can be graded along a scale of 1-3 indicating the degree of differentiation. Well-differentiated tumours closely resemble the healthy cells they originated from in form and structure, and have a low grade. Anaplastic tumours - those that do not display differentiation - can be poorly differentiated (with minor resemblances to the primary tissue) or undifferentiated.
These tumours have a high grade. The more anaplastic a tumour, the less likely it is to have specialized functional activity. Since these tumours also lack the defining morphologic features that well-differentiated tumours possess, they are challenging to diagnose using histopathology. Low grade tumours typically have a good prognosis, and as the grade increases, the tumours tend to grow faster, metastasize more easily, and have a poorer prognosis. Tumour grading is important when determining the most suitable drug therapy or any other post-operative medical treatment.

At the time a tumour is detected, the international TNM classification protocol is used to place it along 5 stages - 0, I, II, III, and IV. T refers to the primary tumour’s size, N denotes the extent to which it has spread to regional lymph nodes, and M indicates the existence of distant metastases. The lymph nodes are most frequently involved in distant metastases of carcinomas since cancer typically spreads to other sites through the bloodstream or the lymphatic system. The liver and lungs are also frequently involved in secondary metastasis as all portal area drainage flows to the liver, while all caval blood flows to the lungs.

Figure 1.1: Histologic classification of cancers based on their organ-system of origin. Various organ-systems of origin have multiple cancer-types associated with them, differing by the cell-type they originate from.

1.1.3 Pathology in recent decades

Over the centuries, histomorphology has driven the establishment of clinical protocols for cancer classification. However, since the invention of microscopy in the 1600s, and the discovery of H&E stains in the late 1800s, the field itself has remained broadly unchanged. Unsurprisingly, these two tools of the trade can prove insufficient for providing an accurate cancer diagnosis today.
There are well-established diagnostic guidelines when the cancer presents as a well-differentiated mass of cells, with minimal invasion of blood vessels into the system (low vasculature), and high tumour content in the biopsied tissue. If any of these conditions is not met, cancer diagnosis using traditional histomorphology becomes a challenging task. Advances in genomics and computer technology have contributed to improvements in routine histopathology in recent decades. Three innovations in particular, namely in-situ hybridization, immunohistochemistry, and digital pathology, have enhanced the amount of information that can be extracted from a tissue biopsy. The following sub-sections describe these methods and summarize the benefits and drawbacks of each.

In-situ hybridization (ISH)

ISH uses a complementary DNA or RNA probe to identify whether a certain DNA or RNA sequence exists in a biological specimen. The technique, developed in the late 1960s, allows pathologists to detect changes in the DNA directly on cytology specimens or FFPE slides. By the early 1980s, fluorescent tags on RNA probes could be used to identify complementary DNA sequences in prepared tissue, which were then visualized using fluorescence microscopy. Since then, advances in microscopy, digital imaging, and genomics have led to significant improvements in the resolution, sensitivity, specificity, and accessibility of fluorescence in-situ hybridization (FISH). Common FISH-detected alterations include chromosomal deletions, gains, translocations, amplifications, and polysomy [32, 53].

In particular, gene fusions arise in carcinomas and sarcomas due to genomic rearrangements. When these fusions happen between fusion partners that are otherwise distantly located on the genome, FISH can easily and reliably identify these events. However, fusions or translocations involving small chromosomal distances can be difficult to resolve at the resolution provided by a light microscope.
For example, chronic myelogenous leukemia (CML) has a known marker, the Philadelphia translocation, which joins the 5’ portion of the BCR gene (chromosome 22) to the 3’ portion of the ABL gene (chromosome 9). However, up to 6% of children and 20% of adults with acute lymphoblastic leukemia (ALL) also have this chimeric gene, but at a slightly different location (breakpoint). In this case, molecular techniques are the only way to reliably distinguish CML from ALL (DeVita et al.). When FISH cannot resolve translocations and fusions easily, sequencing approaches like transcriptome sequencing (RNA-Seq) are favoured (Cheng et al.). FISH remains the gold standard for detection of chromosomal abnormalities in routine diagnostic pathology. A key limitation of FISH is the need for sequence-specific probes, which can miss novel fusions that RNA-Seq can easily detect. As RNA-Seq is also able to detect gene mutations simultaneously, provide a precise mapping of fusion break-points and translocations, and discover cryptic or novel fusions at high resolution, it is increasingly being favoured over current molecular test methods like FISH.

Immunohistochemistry

Depending on the tissue in which they began, cancers express certain characteristic proteins. While the detection of many clinically relevant biomarkers is best performed with nucleic acids (DNA or RNA), protein expression can be used when antibodies exist for specific proteins or mutated protein domains. The protein expression is visualized through an enzyme-linked antibody that activates a fluorescent reporter, and the stain intensity is used as the assessment metric. Monoclonal antibodies can detect single amino-acid changes, such as the BRAF V600E mutation. By sequentially testing a series of known marker proteins, pathologists can narrow down the diagnosis to a set of possible cancer types (differential diagnosis). The differential diagnoses guide the selection of antibodies.
This area of diagnostic pathology, called immunohistochemistry (IHC), is an indispensable tool of the trade.

Refining a broad differential with immunohistochemistry means performing a series of tiered analyses. Primary cytokeratin markers (specifically, CK7 and CK20) are used to stratify cases into subgroups of differential diagnoses, which can then be refined based on additional exclusionary markers like carcinoembryonic antigen and urothelin. Candidate proteins are tested one at a time (single-plexed). It is no surprise, then, that this approach can still fail in cancer types where the diagnostic protein markers are not known, or in cases where the marker stain is inaccurate or assessed with inconsistent rules across different labs. For example, dedifferentiated melanomas can stain negative for melanoma-specific markers like S100. HER2 receptor status is an important prognostic factor in breast cancer and guides treatment selection. It has been shown that various HER2 IHC tests in breast cancer can be wrong because of differing criteria for HER2 positivity among pathologists, or because of variability in positivity across different sections of the same tumour.

At the analytical level, IHC protocols have to be optimized every time a new marker is included. The staining results can also vary depending on the preparation method and testing site. Immunostaining tests are generally performed at specialized independent laboratories using expensive special stains, resulting in high costs for each tested slide. The amount of available tissue also places a physical limitation on immunohistochemistry experiments. The number of slides processed for diagnostic biopsies can range from a few slides to more than 20, with an average total cost of $2000 USD (varying by cancer type).
IHC studies using 10-12 stains for CUP diagnosis have not been shown to increase diagnostic accuracy, and a meta-analysis investigating the use of IHC to diagnose metastatic samples showed that the approach led to an accurate diagnosis only 64-67% of the time. Other factors impacting the reliability of IHC testing include tissue antigenicity, inter- and intra-observer variability of interpretation, and tissue heterogeneity. As a result, IHC staining is typically not used as a stand-alone diagnostic tool.

Digital pathology

Advancements in commercially available digital cameras and scanners in the late 1900s led to an innovative replacement for glass histology slides. Instead of requiring extensive storage and preservation of prepared histology slides, electronic scanning could now be used to generate whole slide images (WSI). WSI systems have been tested in diagnostic, teaching, and research domains. These studies have found that scanned H&E slides can be used to render primary diagnoses, aid manual review of frozen sections, or share slides across remote locations for consultation and review. In recent years, machine learning methods have been developed to use these slides for cancer classification as well. However, the prohibitive cost of scanners, additional time required for generating and reviewing clean scans, digital storage costs, and the inability to resolve tissue artifacts like folds and bubbles, all mean that this technology is still in nascent stages for clinical-scale application [16, 222].

In challenging cases, -ISH and IHC can still result in a wide differential with many plausible cancer types. Pathologists are routinely confronted by tumours that cannot be accurately classified despite extensive work-up and expert reviews of the specimen. Furthermore, all of these methods rely on histomorphology, keeping diagnostic pathology in the realm of visual inspection of fixed tissue.
Depending on the specimen type and the expertise of the pathologist, a high degree of discordance exists among the diagnoses - a 2015 study revealed a diagnostic discordance rate of 25% among breast pathologists (N = 60 biopsies, 115 pathologists). The subjective nature of the exercise has led to an increasingly complex system of classifications based on histomorphologic differences and an increased use of sequential, single-plexed diagnostic tests that impact the patient both in terms of time and monetary costs. We will now describe the biological factors impacting diagnostic pathology.

1.1.4 Diagnostic challenges in pathology

Cancers can lose the expression of predictive IHC markers, preventing accurate diagnosis. Other ancillary histopathology tests can be confounded by several biological issues. Table 1.1 summarizes biological issues that can impact routine diagnosis. Specific challenges of cell-type mimicking, lack of established pathology criteria, and complex phenotypes are outlined briefly as follows:

• Soft tissue sarcomas arise from connective tissue like fat, muscle, nerves, and blood vessels. They often mimic other cancers like melanomas or carcinomas, particularly when they have epithelioid origins [147, 150].
• Cholangiocarcinomas arise from the bile duct epithelia. Distinguishing this pancreato-biliary cancer from pancreatic cancers and liver malignancies remains a challenge due to the absence of pathology criteria. Late detection and diverse phenotypes within the disease have led to a paucity of effective markers to distinguish cholangiocarcinomas from secondary adenocarcinomas and hepatic cancers.
• Several carcinomas and lymphomas can present as a mixture of different tissue morphologies, or as an intermediate phenotype between two distinct subtypes of the same cancer [59, 226].

The eventual diagnosis in these cases is usually based on the exclusion of other possible cancer types.
3-5% of metastatic tumours identified annually still cannot be definitively diagnosed with traditional histopathology [120, 176]. These cancers with unknown primary (CUPs) are usually adenocarcinomas (90%), with one-third of these adenocarcinomas being poorly differentiated. The remaining 10% of CUPs present as poorly differentiated neoplasms - squamous cell carcinomas and undifferentiated carcinomas, and rarely, as neuroendocrine cancers or mixed tumours. Cholangiocarcinomas can also go undiagnosed if IHC findings are nonspecific.

Currently, the management of CUPs is based on serial exclusionary diagnosis with IHC protocols. These diagnostics are sequential and require a vast array of immunohistochemical stains, and the stain reactivity status is based on subjective analysis. Since the tissue available for testing is exhaustible, this places an additional constraint on the diagnosis. This is a perplexing situation, given that CUPs are the 4th most common cause of cancer-related deaths globally, with a 5-year survival rate of 11% [58, 120]. Standard IHC workups are unsuccessful in ~75% of the cases, meaning a putative diagnosis is only possible for 20-30% of these cases. In the absence of a definitive primary (80% of cases), the standard treatment is broad-spectrum chemotherapy. This is typically a platinum-based regimen like paclitaxel-carboplatin-gemcitabine. In about 20% of the total cases, patients can be matched with a ‘treatable subset’, whereby the clinical features are used to suggest a specific diagnosis despite the inability to identify a primary site. The former approach of empiric chemotherapy results in significantly poorer outcomes than the latter, where the cases are treated based on a putative primary. Nevertheless, the treatment outcomes for most patients in this group remain poor, with a median survival of 9 months.
Recent work in the area has attempted to address the development of improved diagnostics instead of evaluating new chemotherapy regimens.

Particularly for adenocarcinomas of unknown origin, there is considerable variability in the selection of IHC markers that are informative for diagnosis. Except for prostate specific antigen (PSA) - used to identify prostate cancer - individual IHC stains prove inadequate in differentiating various adenocarcinoma CUPs. Treatment decisions in such cases are largely based on clinical features that may be indicative of a specific diagnosis. A uniform approach to IHC driven diagnostics for CUPs is lacking, with the expectation being that more refined molecular approaches can help identify biomarkers with diagnostic relevance.

Table 1.1: Diagnostic challenges in cancer histopathology

Cancer Type                       Challenges with histopathology presentation
Cancers with known primaries      Limited/improper staining, poor differentiation
Cancers with unknown primaries    5 broad histologic categories (non origin-specific)
Gastrointestinal tumours          Shared histological features and immunophenotypes
Sarcomas                          Diagnosis of exclusion

1.1.5 Impact of diagnosis on treatment

Cancer diagnosis plays an important role in selecting treatment options for a patient. Systemic cancer therapy covers a variety of treatment approaches - chemotherapy, hormonal treatment, and surgical removal of the tumour. Surgical resections are the most common course of treatment for primary cancers, if the loss of the tissue mass does not negatively impact the patient’s wellbeing. Chemotherapy and hormonal treatments are based on the anatomic site and histopathology of the tumour.

The NCI Molecular Analysis for Therapy Choice (NCI-MATCH) trial, initiated in April 2015, aimed to find genomic evidence that could match effective targeted therapies with patient molecular profiles.
While this and several related studies have shown promise for targeting cancers based on individual mutations irrespective of their site of origin, a diagnosis remains essential for treating the vast majority of cancers. Especially in the case of rare cancers (defined as fewer than 6 diagnosed cases per 100,000 people), CUPs, and other malignancies where actionable mutations may not be found, a diagnosis can provide biological context and motivate the realignment of treatment options based on the putative primary. Cell-type of origin can also influence the response of a tumour to therapy - this is demonstrated by the variable response to vemurafenib of melanomas and colorectal cancers carrying the oncogenic BRAF V600E mutation.

Clinical actionability based on genomic profiles

In 2004, it was shown that gefitinib, a drug that inhibits the activity of a gene called EGFR, worked particularly well at treating lung cancer patients if their cancer harboured certain small mutations in this gene. These mutations led to increased expression of EGFR in the cancer cells. Gefitinib binds to the protein product of EGFR, preventing it from performing its role within the cell, consequently killing the cell. The high levels of EGFR caused by these mutations made cancer cells particularly susceptible to gefitinib. The findings demonstrated the potential for clinical actionability of a cancer drug based on the cancer’s molecular profile. Gefitinib, and other tyrosine kinase inhibitor drugs like it, have since been in clinical trials for other cancers. A counterpart, lapatinib, yielded positive results in EGFR-mutated breast cancer patients, and another similar drug, erlotinib, is approved for metastatic pancreatic cancers in combination with another drug, gemcitabine.
Contrary findings, although rare, have also emerged. In a recent clinical trial for epithelial ovarian cancer patients, it was found to have limited clinical activity when used as a stand-alone therapy.

Context-specific behaviour of vemurafenib

In 2010, another drug, vemurafenib, was discovered through a large-scale drug screen. Vemurafenib selectively killed skin cancer (melanoma) cells when these cells harbored a specific mutation in the kinase gene, BRAF. This V600E mutation, which causes the 600th amino acid of the gene’s protein product to change from valine to glutamic acid, was quite common in melanomas and well known since the early 2000s. Clinical trials showed the drug to be effective in 80-90% of melanomas harboring BRAF V600 mutations. Due to the efficacy of the drug, vemurafenib advanced rapidly through clinical trials, receiving FDA approval for BRAF mutated melanomas in 2011.

Several other cancers, including hairy cell leukemias, colorectal cancers, gastric cancers, and papillary thyroid carcinomas, were found to contain similar activating BRAF gene mutations. Unlike gefitinib though, vemurafenib failed to be as effective in other BRAF V600 mutated cancers. For example, in recently finished clinical trials, the drug yielded a response rate of <5% in metastatic BRAF V600 mutated colorectal cancers and 25-50% in iodine-resistant thyroid cancers. It appears that at least in the case of BRAF V600 mutated cancers, the histologic context determines response to this drug for unknown reasons.

1.2 Genomics and cancer diagnosis

The Human Genome Project finished in 2003, and shortly thereafter in 2005, The Cancer Genome Atlas (TCGA) project was launched by the National Institutes of Health in the United States to catalogue changes that occur at the genomic level in cancers. The consortium sequenced the genomes of over 10,000 cancer patients, generating high-resolution molecular profiles of their cancer and the healthy counterparts in their body.
At the same time, short-read sequencing technologies started becoming increasingly routine. It is estimated that more than 1.5 million individual human genomes had been sequenced by 2018 - a significant increase from just the one genome in 2003.

Large-scale projects like TCGA and ICGC have profiled thousands of primary, untreated cancers and identified changes that occur at the DNA and RNA level in over 40 different cancer types. Extensive research arising from these projects has shed light on molecular changes that characterize various cancers, which in turn has advanced our understanding of what causes these diseases. As shown by various clinical trials evaluating targets identified through these analyses, this approach can also uncover diagnostic markers and cancer subtypes with different molecular profiles.

The integration of genetic testing into clinical diagnostic pathology has benefitted cancer management. Companion diagnostic tests in the United States are usually co-developed with a particular drug, and approved by the Food and Drug Administration for prescribing the safe use of the drug. For example, the MSK-IMPACT clinical assay and the FoundationOne CDx diagnostic both profile clinically actionable genomic variants in a pre-defined set of biologically relevant genes to enable informative selection of FDA-approved targeted therapies [30, 158]. Prognostic multi-gene expression based panels also exist for specific common cancer types, such as MammaPrint (now used routinely to predict metastasis risk for young breast-cancer patients) and Oncotype-Dx (used to evaluate risk for aggressive breast cancer and for treatment stratification in breast cancer patients).

In general, these tests rely on biomarkers - known molecular events that are robustly associated with a specific cancer or predictive of favourable response to certain drugs.
These molecular events can be at the gene expression level (for example, HER2 overexpression testing is required before patients can be treated with trastuzumab) or at the genomic level (BCR-ABL fusion testing for CML patients on Tasigna). Some of these changes can also be hereditary, i.e. present in the patient’s healthy tissues from the time of conception. Mutations in the CDH1 gene, for example, are associated with hereditary diffuse gastric cancer. These familial syndromes are rare, and germline testing is generally not required unless a patient’s family history is suggestive of their existence. More commonly, the germline effects can be diffuse, or absent entirely, in which case the genetic changes within the tumour cells are more informative instead. Since the aim of these tests is to provide treatment planning support in the clinic, for most companion diagnostics prior information about the cancer diagnosis is either provided by the healthcare provider (for example in MammaPrint, Slodkowska and Ross) or is not necessary for medical oncology decision-making and treatment planning (for example, using pembrolizumab for solid tumours with high microsatellite instability, Marcus et al.).

Efforts in improving the efficiency of cancer diagnosis itself also aim to leverage molecular profiles generated from sequencing data to classify cancers. Different types of genomics data can be used to develop computational algorithms for cancer-type classification. However, 3 main data-types have been used consistently since the first classifier papers were published in 2001 [163, 187] - DNA methylation, gene expression, and genomic changes including somatic mutations and copy number changes. DNA methylation measures the epigenetic landscape - changes in methylation (addition of a methyl group) in the genome, which in turn can regulate the activity of a DNA segment without modifying the actual sequence.
Alteration of DNA methylation patterns is a known hallmark of cancer, with the ability to distinguish cancer cells from normal tissue. Gene expression is a broad term encompassing two main platforms for measuring expressed genes, namely microarrays and RNA-sequencing. Both of these platforms can measure messenger RNA (mRNA) that is translated into proteins, or smaller RNA species such as microRNAs (miRNA). Genomic changes can be single-base mutations in the tumour, or large-scale gains and deletions in segments of the genome, known as copy number changes. More complex rearrangements of genomic fragments, called structural variants, are known to drive certain types of cancers. Specific structural variants typify a subset of known cancers, and can be tested easily through -ISH approaches. The next sub-section discusses the development of ’omics based classifiers, and the differences in the classification performance as noted in literature.

In practice, molecular diagnostic tests can be classified as either commercial panel-based diagnostics or laboratory developed high-throughput sequencing diagnostics that aid research efforts. Typically, the scope of commercial diagnostics is restricted to a pre-selected set of genomic changes with rigorously evaluated clinical actionability, whereas laboratory-developed tests are developed and used by specific institutions as part of pharmacogenomics-focused clinical trials and research efforts, not linked to a particular drug. Regardless of the approach though, each test can be characterized by sensitivity, specificity, and limit of detection. In the next two sub-sections we discuss the research and development behind current-day ’omics based classifiers and contextualize them against commercial assays available for cancer diagnosis.

Genomic data used for cancer classification

Existing cancer diagnostics test changes in multiple genes simultaneously.
The main types of molecular data commonly used for this purpose are exome sequencing, whole-genome sequencing, RNA sequencing, and bisulfite sequencing (BS-Seq). The first two interrogate the DNA in a sample, RNA sequencing can be used to capture transcribed RNA (mRNA) or non-coding RNA species such as microRNA (miRNA) and long non-coding RNA, and BS-Seq is used to identify methylated regions of the DNA.

Early efforts for algorithm-based cancer classifiers focused on using gene expression measurements from tissue microarrays to find a representative feature set, and then classify a small set of tumours. These methods tried to distinguish 11-14 different types of tumours using supervised machine learning algorithms, achieving 78% (N = 218) to 90% (N = 175) classification accuracy. These early methods used support vector machines (SVMs) with optional recursive feature elimination. The classification groups themselves were quite high-level. For example, stomach and esophageal cancers would usually be combined into a single gastroesophageal group, and lung adenocarcinomas and squamous cell carcinomas combined as lung carcinomas.

Efforts shortly thereafter used slightly different algorithms to classify round blue cell tumours (Burkitt lymphoma, Ewing sarcoma, neuroblastoma, and rhabdomyosarcoma) and leukaemias [103, 196]. The two main algorithms used at the time were neural networks and nearest-centroids. The emphasis was on finding the smallest number of predictive genes that can facilitate the development of new diagnostic tools easily. In 2004, it was demonstrated that neural networks can be applied to two completely different microarray platforms and build tumour classifiers with 83-85% accuracy across the board (N = 120). A key insight was that feature scaling helps build a robust classifier.

The flurry of papers around cancer-type classification in the early 2000s was followed by concerns about the reproducibility of gene expression data obtained from microarrays.
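As an illustrative aside on the classifiers discussed in this sub-section, the nearest-centroid approach and the feature-scaling step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in any of the cited studies: the function names (zscore_columns, fit_centroids, predict_centroid), the two-gene expression matrix, and the class labels "A" and "B" are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale each gene (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Guard against constant genes, whose standard deviation is zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    scaled = [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]
    return scaled, mus, sds


def fit_centroids(rows, labels):
    """Compute the mean expression profile (centroid) of each class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [mean(c) for c in zip(*members)]
            for label, members in groups.items()}


def predict_centroid(centroids, row):
    """Assign a sample to the class with the closest centroid (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


# Hypothetical toy data: rows are tumours, columns are two genes.
X = [[10.0, 200.0], [12.0, 180.0], [30.0, 210.0], [28.0, 190.0]]
y = ["A", "A", "B", "B"]

X_scaled, mus, sds = zscore_columns(X)
centroids = fit_centroids(X_scaled, y)

# A new sample must be scaled with the training means and deviations.
new_sample = [(v - m) / s for v, m, s in zip([11.0, 205.0], mus, sds)]
prediction = predict_centroid(centroids, new_sample)
print(prediction)
```

Scaling each gene before computing distances keeps genes with large absolute intensities from dominating the distance calculation, which is one reason scaling matters when measurements come from different platforms.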
Several studies at the time showed that different commercial microarray platforms yielded different intensity measurements. A 2007 study using a refined microarray approach observed that the probe sequence can impact the relationship between observed gene intensity and actual gene expression. Background levels of hybridization of a probe can also limit the accuracy of expression measurements from microarrays, particularly in the case of low-abundance transcripts.

Most of these tests, while touting high overall accuracy, also failed to improve classification accuracy where it mattered most - in cancers that were refractory to histopathology-based diagnosis. For example, in a 2013 study, which later evolved into the commercially available Cancer Genetics Incorporated Tissue of Origin Test, researchers used a gene expression panel to distinguish 15 different types of cancers [81, 157]. In their set of 160 test cases, they found that the method failed in metastases of gastric cancers (<30% accuracy) and had equivocal performance as compared to routine IHC for non-adenocarcinomas. Others yet would aggregate these challenging categories into their high-level grouping, limiting clinical benefit from the application of these methods.

RNA sequencing, particularly sequencing of transcribed RNAs through RNA-Seq, has been steadily replacing microarray based methods for interrogation of gene expression profiles. The methodology extends our ability to quantify splice variants, non-coding RNAs, and isoform specific gene expression, and to gain insight into variation that is not captured at the genomic level.
Systematic analyses of the utility of RNA-Seq based diagnostic assays and classifiers have shown that RNA-Seq based classifiers outperform arrays in characterizing cancer transcriptomes. Furthermore, retrospective meta-analyses on these projects have revealed that the large amount of biological diversity in CUPs is not fully captured within panel-based assays, emphasizing the need to build up from a global gene expression set in order to identify a comprehensive set of site of origin markers that are universally expressed [118, 162].

Besides gene expression, other data modalities like DNA methylation, somatic variations, and microRNA expression can be used to classify cancers. A microRNA-based classifier that used feature selection embedded in the Least Absolute Shrinkage and Selection Operator (LASSO) classification algorithm obtained 88% accuracy on a cohort of 48 metastatic samples, but failed to classify metastases to the liver, and to distinguish between stomach and esophageal cancers. MiRview mets, an RT-PCR based miRNA assay classifying 25 different tumour types using decision trees and K-nearest neighbours, obtained 86% accuracy on the held-out set (N = 83). An independent test cohort of 80 samples only showed performance gains for distinguishing biopsies from the liver or otherwise, and biopsies of gastrointestinal tumours or otherwise, falling well short of generalization. This test was later expanded into a commercially available kit (Cancer Origin Test by Rosetta Genomics), subsequently withdrawn from market because of bankruptcy. Another recent study leveraged gene mutations and copy number alterations to distinguish 28 different cancer types, yielding an overall accuracy of 78%.
Another group has shown 91% accuracy to classify 17 different tumours using miRNA data, and 95% using DNA methylation data.

Methods have been developed that use microRNA expression to achieve 70-90% classification accuracy for identifying a putative primary in CUPs [127, 170]. Several small-scale studies have also shown the utility of microarray panels to identify the site of origin of certain types of CUPs [77, 161, 201]. Demonstrating the potential for pan-cancer based diagnostic approaches for CUPs, Tothill et al. undertook a comprehensive interrogation of 13 cases of cancers of unknown primary against a background of 229 primary and metastatic cancers spanning 14 tumour types. Leveraging a support vector machine (SVM) based training model, the authors were able to demonstrate that an expression-based test could strongly characterize 11 of the 13 CUPs, validated by pathologic evaluation and clinical outcome information. The authors also found that if they left out different primary tumour types during training, and used them for testing only, the left-out class was predicted most often as a trained cancer type that was the most similar to it biologically. However, all these analyses leveraged narrow representations of tissue types, and have been unable to demonstrate generalizability and clinical utility of the diagnostic method to external validation cohorts of CUPs.

Commercial assays for cancer diagnosis

Various -omics data have been leveraged to identify the tissue of origin of metastatic cancers (including copy number variants, nonsynonymous point mutations, and single nucleotide substitution frequencies), but methods that rely on gene expression information have been shown to outperform methods based on other data types. The two commercially available pan-cancer diagnostic assays rely on expression profiling using microarrays or reverse transcriptase-polymerase chain reaction (RT-PCR). These two assays provide a diagnosis for cancer types and subtypes at varying degrees of resolution - CancerTYPE ID (bioTheranostics, Inc.) and the Tissue of Origin Test (Cancer Genetics Incorporated).

The Tissue of Origin Test leverages the gene expression profile (microarray) of 2,000 genes, covering 15 tumour types. While the details of the algorithms used for gene selection and classification are unclear, the commercial assay achieved a performance of 89% on the test set (N = 462, 15 cancer types). The tumour types, in this case, indicate the anatomical site of origin instead of nuanced histologic categories, for example kidney, gastric, pancreas, sarcoma, and non-small cell lung cancer. Notably, the test performed better on metastatic samples than primary tumours, but had a marked drop in performance for lung metastases (N = 3/5).

The CancerTYPE ID assay uses RT-PCR to assay the expression of 92 genes from FFPE samples, profiling 30 main tumour types indicating the anatomical site of origin (for example, brain, breast, cervix, and skin) and 54 subtypes indicating the histologic classification (for example, pancreatic carcinoma, melanoma, renal clear cell carcinoma). In leave-one-out cross-validation, the method demonstrated 87% accuracy for the main tumour types and 85% for the histological subtypes (N = 2,206). The generalizable performance of this approach was 83% (N = 187, 28/30 main cancer types). The genes themselves were selected for optimal multi-tumour classification from a set of 578 tumour samples using a genetic algorithm, and include 5 genes for normalization in the RT-PCR experiment. While the method was able to obtain robust performance, both training and test samples were optimized to reach 80% tumour content prior to RT-PCR. The RT-PCR method requires a high RNA quantity and quality as well, which can preclude several cancer samples from analysis (95-98% success rate for meeting criteria, as reported in the paper). In routine practice, an entire FFPE block or biopsy core is required.
Based on retrospective abstractions of pathology reports from the test cohort, the diagnostic utility of the assay was found to extend beyond aiding diagnosis of CUPs, to those samples where multiple differential diagnoses existed but a single definitive diagnosis was absent. This highlights another important implication of pan-cancer classifiers besides providing a diagnosis for CUPs - resolving differential diagnoses that may come up routinely. A recent retrospective study using this method showed successful application to render a diagnosis in 56 patients with neuroendocrine tumours of unknown primary.

Two challenges still exist with direct adoption of these tests in precision oncology. Firstly, precision oncology projects emphasize the study of molecular data, particularly sequencing of the DNA and RNA, to characterize and treat the cancer. Nucleic acid sequencing preferentially requires fresh frozen tissue, whereas the existing commercial assays necessitate FFPE blocks. Collection of additional tumour biopsies places an extra burden on the patient, can be excessively invasive (as is the case for brain cancers), and may not be feasible in many cases (for example, in pancreatic adenocarcinomas).

Secondly, only one of these tests reached a diagnostic resolution beyond the anatomical site of origin, limiting the diagnostic utility. For example, in clinical practice it is not sufficient to indicate that a tumour arises from the kidney - several clinically relevant and biologically distinct cancer types exist within kidney cancers, with different treatment regimens. This test also requires a large amount of input RNA for assessment, returning to the original challenge of limited tissue and the value of building a molecular profile of a cancer.
There is a current need to develop classifiers that are easy to scale, integrate within precision oncology projects that emphasize granular study of advanced cancers, and can leverage the deluge of information provided by genome sequencing.

Limitations of present-day ’omics based classifiers

Despite years of work in this area, several problems remain with immediate adoption of the resulting methods to all cancer types. The selection of representative features for specific cancer types can lead to overfitting or restrict the application of the method to those cancer types only. As these computational methods typically require a lot of training data, as discussed in Section 1.2.1, rarer cancer types are excluded from the classification task, or amalgamated into a broader anatomic category with limited clinical actionability. These issues can also stem from the type of genomic data used for the classification, since representative datasets of whole-genome sequence, RNA-Seq, miRNA, and methylation are not available for all cancer types.

1. Selection of feature subsets can prohibit generalization to rare and complex aetiologies

   Clinically predictive gene expression signatures are not always reproducible across multiple studies of the same tumour type. This can arise from differences in technical platforms, standardization of data, and, at a biological level, the complex phenotype of survival. A 2010 review of gene expression based diagnostic panels for diagnosis of CUPs noted that many available panels have decreased accuracy for poorly differentiated tumours, or for specific tumour types like lung and pancreatic adenocarcinomas. This is particularly important for the various heterogeneous cancer types reflected in CUPs, which may not always have the same stable markers that are enriched for during feature selection optimized to differentiate known primary tumours.

2. Existing classification approaches lack granularity and exclude rare cancer types

   Cancers arising from the hepatobiliary and pancreatic systems are particularly challenging to diagnose using both histopathology and genomics-based approaches. Some work in this domain sidesteps these diagnostic challenges when developing classifiers, excluding these cancer types from the classification task. In other work, these clinically distinct cancer types are merged together. This removes the challenge of granularity in their prediction task, but negatively impacts the clinical utility of such methods [81, 221]. At a broader level in these studies, while the reported accuracy ranges from 87-93%, diagnostically challenging cohorts are simply combined, and/or samples refractory to the method are excluded from the reported calculations.

3. Limited data availability across all sequencing platforms prevents efficient training of rare cancer types

   A limitation in scaling the use of miRNA and BS-seq based classification to other cancer types has been the unavailability of these types of molecular data for rarer cancers. The challenge remains, with genome sequencing and RNA-Seq being the most commonly available molecular profiles across the widest range of cancer types. Additionally, previous results using various data modalities show that cancers refractory to diagnosis by gene-expression based algorithms show similar trends using other data types as well.

4. Interpretation of diagnostic decisions from classification platforms is limited

   Cancer classifiers based on gene panels and molecular assays can provide a classification decision, but understanding the biological changes contributing to the decision is difficult. Methods for feature prioritization in the underlying algorithms have been proposed, but it is unclear how to interpret these at a single-sample level. Since the algorithms themselves are based on marker selection, the breadth of biological changes that they can interrogate and prioritize remains limited.

In the next section we describe the computational approaches underpinning existing classification tools, and how these algorithms can be leveraged to address some of the challenges highlighted above.

1.2.1 Computational algorithms for cancer classification

Five main algorithms dominate the field of ’omics-based cancer type classifiers - support vector machines (SVMs), random forests (RFs), k-nearest neighbours (kNNs), naive Bayes (NB), and neural networks (NNs). They broadly differ in the way they compare features to establish classification thresholds.

Linear classifiers distinguish two or more classes using a linear combination of features. An example algorithm is Naive Bayes. In a two-dimensional setting (when using two features for classification), a linear classifier will be a line. In higher dimensions, this line becomes a decision hyperplane. Linear classifiers attempt to learn the parameters describing slope and threshold for this hyperplane, leveraging examples from the training set. For example, the $\theta$ weights for discriminating between $\mathit{class}_1$ and $\mathit{class}_2$ will be learnt through multiple training examples in the following equation,

$$b = \theta_0 + x_1\theta_1 + x_2\theta_2 + \cdots$$

such that, for a new sample $\mathbf{x}$,

$$b < \theta_0 + x_1\theta_1 + \cdots \implies \mathbf{x} \in \mathit{class}_1, \text{ else } \mathbf{x} \in \mathit{class}_2$$

Linear classifiers work quite well when a linear combination of the distinguishing features can separate the classes of interest (i.e. the classes are linearly separable). If linear separation holds, then we can find an infinite number of linear separators. Selecting the most suitable decision hyperplane, that is, one that generalizes the best to new data, is an inherent challenge when training linear classifiers.
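The decision rule above can be illustrated with a minimal sketch. The function names `linear_score` and `classify`, the default threshold `b = 0.0`, and the weight values are all assumptions made for illustration; they are not taken from any classifier discussed in this chapter, and no training step is shown.

```python
from typing import Sequence


def linear_score(x: Sequence[float], theta: Sequence[float]) -> float:
    """Return theta[0] + x[0]*theta[1] + x[1]*theta[2] + ... for one sample."""
    if len(theta) != len(x) + 1:
        raise ValueError("theta must hold one intercept plus one weight per feature")
    return theta[0] + sum(xi * ti for xi, ti in zip(x, theta[1:]))


def classify(x: Sequence[float], theta: Sequence[float], b: float = 0.0) -> str:
    """Apply the rule from the text: class1 if b < score, otherwise class2."""
    return "class1" if b < linear_score(x, theta) else "class2"


# Hypothetical weights for two features (not learnt from any real data).
theta = [-1.0, 2.0, 0.5]
print(classify([1.0, 1.0], theta))  # score = -1 + 2 + 0.5 = 1.5, so class1
print(classify([0.0, 1.0], theta))  # score = -1 + 0 + 0.5 = -0.5, so class2
```

With two features the boundary `linear_score(x, theta) == b` is a line; adding features turns the same expression into a hyperplane, which is why many different weight vectors can separate the same linearly separable training set.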
Additionally, when the training data is noisy, the decision hyperplane can easily overfit to the training samples and generalize poorly to new data.

If a classification problem is nonlinear, the class boundaries cannot be approximated well using linear hyperplanes. In this case, nonlinear classifiers are a suitable alternative. Some non-linear models like kNNs can also subsume linear models in special cases. kNNs have a decision boundary that can take more complex shapes than a hyperplane. During classification, the test sample is assigned the majority class of its k nearest neighbours. Neural networks are a supervised machine learning approach that has been shown to be quite powerful at non-linear classification tasks. These algorithms require labelled data and are trained iteratively using a training dataset to distinguish the classes of interest.

Nevertheless, there is no known universally optimal classification strategy. The learning algorithm can be selected a priori if it is known whether the classes are linearly separable. Further selection can be guided by parametrization, feature subsetting, and by assessing the trained models’ performance on a set of held-out samples. However, the generalizability of a classification algorithm can be truly assessed only using external datasets and new samples drawn from the real world.

Evaluation of performance

The performance of a classification algorithm can be evaluated on the training data itself, on part of the training data that the algorithm has never ‘seen’, or on an external dataset reflecting the type of data the algorithm will encounter in practice. These three types of data can reveal different aspects of the algorithm. Performance on the training data can help the user determine optimal model parameters, especially if this is wrapped within a resampling approach like cross-validation.
k-fold cross-validation can be used to train a model on k-1 folds of the data and test on the remaining fold, repeating this procedure k times and leaving a different fold out for validation each time. The held-out dataset is generated by keeping a part of the training data away from the entire training process (including resampling strategies). This dataset reflects the type of data the algorithm was trained on, and performance on it indicates whether the trained model has a high variance to maximize prediction of the training samples (overfitting), or if it has learnt representative features that can generalize to new data from the same underlying data distribution. The optimal test set is an external dataset generated independently of the training data; performance on it is the true measure of the generalizability of the trained model on new data.

When comparing the classification results across multiple classes, various metrics can be used to evaluate the performance of a classifier on held-out or independent test sets. We can measure the proportion of correctly classified samples (accuracy). However, aggregate accuracy can be confounded by an over-representation of certain classes compared to others. Accuracy calculated after adjusting for smaller classes is called balanced accuracy. However, balanced accuracy measurements give equal weight to the presence of false negatives (FN, or Type 2 error - the number of samples from category ‘A’ incorrectly classified as another category) and false positives (FP, or Type 1 error - the number of samples not from category ‘A’ incorrectly classified as category ‘A’). This can be misleading when the class imbalance is severe and evaluation needs to take into account the higher number of ‘negative samples’ for each class.

In these cases, the F1-score is considered a suitable alternative since it takes both FP and FN into account.
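As a minimal sketch of how these class-wise metrics behave under class imbalance, the functions below compute accuracy, balanced accuracy, and the macro-averaged F1-score from paired lists of true and predicted labels. The function names and the toy labels are hypothetical. Balanced accuracy is taken here to be the mean per-class recall, which is a common definition and an assumption about the adjustment meant in the text; the per-class F1-score is computed as 2TP/(2TP + FP + FN).

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (TP, FP, FN)} for every label seen in truth or prediction."""
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred).values()
    recalls = [tp / (tp + fn) for tp, fp, fn in counts if tp + fn > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (0.0 when a class has no counts)."""
    f1_scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: a classifier that always predicts the majority class 'A'.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(round(macro_f1(y_true, y_pred), 3)) # 0.444
```

In this toy case the aggregate accuracy looks acceptable, while balanced accuracy and macro-F1 both expose that the minority class is never recovered.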
For a given class, precision ($TP/(TP+FP)$) and recall ($TP/(TP+FN)$) measure the positive predictive value and sensitivity of a classifier respectively. Here, TP indicates true positives - the number of samples from a particular category ‘A’ correctly classified as such, with FP and FN as previously defined. These metrics can be combined as $2TP/(2TP+FP+FN)$ to get the F1-score for a class, with a high F1-score being desirable. The mean F1-score across all classes gives a macro-F1 score metric for a multi-class classifier with class imbalance.

A common issue with cancer classification methods is the lack of demonstration of classification accuracy on samples that are not simply held out from the training data or generated by laboratory protocols identical to those of the training data. Systematic biases from data extraction and processing can easily lead to over-fitting of the trained model on the training data, wherein the classifier ends up learning the technical noise and artefacts inherent to the training dataset instead of identifying features that remain relevant in independent datasets. Cross-validation within the training dataset can demonstrate the ability of the classifier to learn about the training dataset itself, but the generalizability of these classification methods can only be assessed by their performance on independent datasets generated from different protocols and laboratory environments. Many of the reported studies for cancer classification measure the performance of their trained algorithms on the held-out samples, incorrectly reporting the resultant performance as a metric of generalizability [163, 183, 187, 191, 196]. Even more misleadingly, what many of these methods label as ‘independent test sets’ are in fact held-out samples from the training dataset, following the same tissue preparation and sequencing protocols [13, 187].

The use of held-out sets to represent cancer classification performance is particularly questionable since a first step in all these methods is typically to select a small subset of representative genes prior to model fitting. Feature selection can bias the performance of supervised machine learning methods. The selection is usually statistically driven, based on Pearson correlation to uncover associations between features and labels, Wilcoxon rank score, or recursive feature elimination. If the features are selected to increase the separation of training data, as they typically are, then performance on held-out samples will be expected to be better by default, as compared to performance on a test set where the representative features may be different depending on the data preparation and processing protocols. We now discuss the various feature selection methods and their resultant impact on classification.

1.2.1.1 Feature reduction

Feature selection is used in classification algorithms to reduce dimensionality, discard noisy features, reduce computational costs, or to incorporate prior information about the system to increase the likelihood of generalization. Analytical methods for feature selection can be filtering based, wrapper based, or embedded within the classification algorithm. Ensemble feature selection with bootstrapping samples can be used on top of any of these strategies. The feature selection method of choice is run on several random subsamples of the training data, and different lists of variables are selected. Eventually, these lists are merged into a subset that is most representative of the various classes of interest. Alternatively, feature selection can be performed based on prior knowledge about biological programs characterizing the relevant signals.
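The ensemble procedure described above (run a selection method on random subsamples, then merge the resulting lists) can be sketched as follows. This is only an illustration under stated assumptions: the filter statistic `separation_score`, the two-class restriction, the voting threshold `min_freq`, and the toy expression matrix are all hypothetical choices, not the method of any study cited here.

```python
import random
from statistics import mean, pstdev


def separation_score(a, b):
    """Absolute difference of the two group means, scaled by the overall spread."""
    spread = pstdev(a + b) or 1e-12  # avoid division by zero for constant features
    return abs(mean(a) - mean(b)) / spread


def top_features(X, y, n_keep):
    """Filter step: rank features of a two-class dataset and keep the best n_keep."""
    first, second = sorted(set(y))
    scored = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == first]
        b = [row[j] for row, label in zip(X, y) if label == second]
        scored.append((separation_score(a, b), j))
    return [j for _, j in sorted(scored, reverse=True)[:n_keep]]


def ensemble_select(X, y, n_keep=1, n_rounds=50, min_freq=0.6, seed=0):
    """Run the filter on bootstrap resamples and keep features chosen often enough."""
    if len(set(y)) != 2:
        raise ValueError("this sketch handles exactly two classes")
    rng = random.Random(seed)
    votes = [0] * len(X[0])
    completed = 0
    while completed < n_rounds:
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both classes to be scored
        for j in top_features([X[i] for i in idx], ys, n_keep):
            votes[j] += 1
        completed += 1
    return [j for j, v in enumerate(votes) if v / n_rounds >= min_freq]


# Toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
X = [[0.1, 5.0, 1.0], [0.2, 3.0, 1.0], [0.0, 4.0, 1.0], [0.3, 6.0, 1.0],
     [5.1, 4.0, 1.0], [5.0, 5.0, 1.0], [4.9, 3.0, 1.0], [5.2, 6.0, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(ensemble_select(X, y))
```

Merging by selection frequency is one of several possible merging rules; the threshold trades the stability of the final list against its size.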
We discuss the prominent types of algorithms for both approaches here.

Analytical approaches for feature reduction

Filtering based approaches include analysis of variance (ANOVA), or statistical thresholding from pairwise t-tests between categories. These methods can lead to false positives if applied without care and do not provide an intuitive cross-validation approach to optimize for a set of discriminatory features. However, they assess features independently of the type of classifier used, reducing the risk of overfitting when identifying an optimal subset of features.

Some classifiers can incorporate feature selection as part of the classifier construction. These embedded methods include lasso regression and elastic nets. Lasso regression controls the sparsity of a solution by encouraging the selection of individual discriminatory features for each category. The method tries to avoid redundant genes in a given signature and is not expected to be stable in data where typically many genes encode for functionally related proteins. Ridge regression, on the other hand, distributes importance over multiple input features when optimizing classification. Elastic net tries to combine both these strategies to select groups of correlated genes that are useful for classification. However, an investigative study assessing the impact of feature selection methods on stability and accuracy of molecular signatures found that the performance was equivalent from any of these strategies, with very unstable outputs in all cases.

Wrapper methods jointly select sets of variables with good predictive power for a classifier. They perform a greedy search in the space of sets of features. Extensive cross-validation is required to estimate the accuracy of the selected feature subsets. An example of this is recursive feature elimination, a strategy used in SVMs and random forests.
This method is not necessarily stable, but can be combined with ensemble feature selection to improve the stability of features. These wrapper methods are less prone to local optima, but are computationally intense with a high risk of overfitting.

Most feature selection methods make an assumption about the kind of feature coefficients that will be encountered. For example, in ridge or lasso regression, it is expected that most feature coefficients will be zero or near zero, enabling the selection of important discriminatory features. In practice though, it has been found that features selected through the Student’s t-test seem to provide the most robust and stable selection of features. When extended to a multi-class setting, the ANOVA implementation is used.

Biological approaches for feature reduction

Oftentimes, well known clinical and genomic features may also be utilized in the development of a cancer diagnostic. These selections are guided by observed molecular associations within cancer types of interest, which may arise from known protein markers or previous studies that aimed to characterize the genomic underpinnings of various cancers [89, 160, 192]. These features may include genomic or transcriptomic events like HER2 expression evaluation for breast cancer stratification, APC gene mutation status for colorectal cancer diagnosis, and 1p19q locus mutations and IDH1 mutations for brain cancer classifications. The Catalogue of Somatic Mutations in Cancer (COSMIC) database lists the various genes mutated frequently in different types of cancers, and can be utilized for cancer classification and prognosis. Genome-wide mutations can also be consolidated into distinct mutation signatures, many of which are associated with cancers or exposure to carcinogens. The biological approaches for feature reduction are not necessarily separate from the automated approaches.
Features can be selected through a combination of the two to ensure a balanced representation of prior information and automatically discovered rules in the training data.

Feature selection, if done correctly, can reduce computational costs and avoid overfitting to noise in the training data. However, it requires a large, representative dataset that encompasses the various heterogeneous ‘genotypes’ (in the form of mutations, expression, or other input data type) for each of the cancer types. Typically, sequencing data of untreated primary cancers from TCGA and ICGC is used for this purpose. These datasets have an over-representation of common cancers that have been studied extensively over the decades and have well-established molecular subtypes. Cancer types that are refractory to routine diagnosis fall outside of these well-defined criteria. This includes poorly differentiated cancers, rare cancer types, and cases presenting with mixed cancer phenotypes (like sarcomatoid mesotheliomas). In such scenarios it is possible that samples do not display the same representative markers as the well-differentiated primary counterparts.

There has also been a dearth of literature on using an approach devoid of feature selection for cancer classification. A 2011 study has shown that it is possible to retain all genes for a pan-cancer microarray expression driven classification of CUPs, and still obtain high performance - in this case, obtaining 89% accuracy on a validation set of adenocarcinoma CUPs. Additional work in this area is required to evaluate if feature selection is indeed a necessity for cancer classification, and whether bypassing this step can provide any additional biological insights for cancers refractory to routine histopathology diagnosis. One useful outcome will be the ability to not just classify various cancers, but to further interrogate the trained model and obtain insights into the decision-making process.

1.2.2 Beyond the diagnosis - identifying biological changes in individual tumours

The impact of individual changes in genes plays out through various cellular pathways. Placing genomic alterations in the context of the oncogenic pathways they impact can help us understand the biology of the tumour, identify potential causal mechanisms, and prioritize therapeutically relevant targets. It is particularly relevant for personalized analysis of cancer genomes and transcriptomes, as biological interactions and patient-specific sources of variability (germline influences, prior treatment etc.) can easily confound our ability to identify relevant features for diagnosis and therapy. Molecular analysis based on sets of statistically or biologically selected genes can help detect known patterns of positive selection across various tumours. On the other hand, this approach can overlook functional changes that are important for a rare subset of tumours, simply due to the limited power of a typical cancer cohort.

Analysis of genomic changes through the lens of cellular mechanisms can help distill the multitude of changes that happen in a single tumour into oncogenic pathways. Aggregating the molecular changes into a view of the most dysregulated pathways in an individual tumour typically requires manual prioritization of known tumour suppressors and oncogenes, extensive literature review to align observations against known biological pathways, and integration of genomic changes with expression dysregulation. One prevalent approach for automated pathway analysis is network-based, whereby genes are overlaid with observed genomic changes (evaluated against a set of controls for gene expression changes, or a healthy tissue reference for mutations) to identify ‘hubs’ of biological activity from known pathways [68, 208] or to recover novel interactions and topologies.
Another popular approach is one of statistical enrichment, whereby case-control comparisons are made to identify statistically significant pathways and gene clusters that are differentially expressed in the tumours. Statistical enrichment may be combined with network analysis to prioritize pathways in a set of samples. However, there are very few reference-free approaches available for automated prioritization of important pathways in individual cancers [44, 116, 208].

1.2.2.1 Findings from clinical trials utilizing genomic analysis for cancer management

Clinical trials exploring targeted therapy in advanced cancers

Various clinical trials have been conducted over recent years to evaluate the benefit of treating patients with therapies that are targeted to specific molecular alterations in their tumour. While these programs have found marginal to significant benefit in different scenarios when providing targeted therapy, the fraction of tested populations where actionable mutations were found was typically quite low. The MOSCATO 01 clinical trial in 2017 aimed to evaluate clinical benefit of targeted therapy in a cohort of 843 adult patients using RNA-Seq and whole exome sequencing techniques. In the subset of 199 patients that eventually received targeted therapy, progression free survival (PFS) was 1.3-fold higher as compared to patients on prior therapy. Notably, PFS was lowest in a subset of 36 ‘ill defined primary tumours’ (pathognomonic cancers of unknown primary) in this cohort, regardless of being on matched therapy or otherwise.
The 2014 multi-center SHIVA trial screened 741 patients of any tumour type, finding a slight difference in median PFS between the matched treatment and the prior therapy arms (2.3 versus 2.0 months respectively), but a significantly higher average PFS in the matched treatment group of patients (maximum PFS 3.8 months versus 2.1 months).

One key limitation of actionable mutation based clinical trials is the limited subset of patients that can eventually benefit from this approach. In the MOSCATO 01 trial, for example, scientists found actionable mutations in 411 patients but only 199 patients were eventually treated with a targeted therapy (based on a matching genomic alteration). A recent multi-center study across 2,579 patients also found that only 6% of patients were able to receive matched therapy, with a very low overall response rate (0.9%). Similar findings emerged from the Precision in Pediatric Sequencing (PIPseq) program for children with hematologic or solid cancers at Columbia University Medical Centre, where only a small fraction (16%) of successfully screened patients obtained matched therapy. These findings suggest that molecular screening is not a viable approach for routine clinical practice at present, but predictive biomarkers should continue being evaluated for efficacy. Interestingly, in the PIPseq program researchers found that genomic data - particularly RNA-Seq data - was useful for prognosis, diagnosis, or pharmacogenomics in an additional 38% of cases, suggesting that detailed molecular profiling beyond mutation panels may still have wider benefits beyond treatment selection.

Clinical trials on CUPs

Ongoing clinical trials on genomics-guided cancer treatment also suggest that further research may be required to transpose existing targeted therapy approaches to cancers of unknown primary (CUPs).
In 2017, a large-scale clinical trial at Memorial Sloan Kettering Cancer Center evaluated the utility of molecular/genomic profiling along with pathology and clinical information in improving treatment options and outcome for 333 CUP patients in particular. A total of 150 patients (45%) had 34-410 cancer-associated genes sequenced in a panel in an effort to identify clinically actionable mutations. Of the 45 patients (34%) where an actionable mutation was found, only 15 received targeted therapy. Factors limiting the use of targeted therapy included limited availability of suitable drugs, poor performance status, and/or rapid clinical decline. In the small subset that benefited though, overall survival was 13 months, as compared to historical observations of 3 to 8 months in previous studies. These findings, accompanied by other recent basket trials that aim to group and treat patients based on similar molecular profiles, have motivated the need for finding actionable mutations in CUPs.

In a systematic review of clinical trial outcomes from 2002-2009, Pentheroudakis et al. assessed outcome and survival in patients with CUPs, where a putative primary could be identified using molecular platforms like gene expression arrays. They found that patients with CUPs of putative lung or pancreatic origin had equivalent tumour shrinkage and median survival as those with known metastatic lung and pancreatic cancers. However, within the metastatic breast and bowel cancer groups, putatively diagnosed CUPs had a significantly inferior response rate compared to patients with known primaries of breast and bowel, suggesting that while CUPs can be accurately classified by molecular approaches, in some cancer types they may be molecularly distinct from their known primary counterparts.

A recent clinical trial compared site-specific therapy (based on a gene expression based diagnostic) with empirical chemotherapy for 113 CUP patients.
They found that there was no significant improvement in 1-year survival between the two groups, but median overall survival (16.6 versus 10.6 months) and progression-free survival (5.5 versus 3.9 months) were improved in patients treated based on the putative primary site.

Pancreatic cancers presenting as CUPs have a four-fold incidence of bone metastasis, and 30% higher incidence of lung metastases compared to known primary pancreatic cancers. The higher incidence of metastatic occurrence (compared to putative primary cancer) has been reasoned to be driven by immunosuppression and aggressive metastatic potential of early progenitor cells of CUPs. Various projects focused on molecular characterization of CUPs have found that while they do not typically display activating point mutations in oncogenes or tumour suppressors [60, 120, 207], they are typically characterized by angiogenesis activation (in 50-89% of cases), oncogene overexpression (10-30% of cases), hypoxia-related proteins (25% of cases), epithelial-mesenchymal transition markers (16% of cases) and activation of intracellular signals like AKT or MAPK (20-35% of cases) [120, Rassy et al.]. Emerging evidence on putative molecular hallmarks of CUPs has also indicated the need to further elucidate the role of mechanisms like growth factor independence, immune evasion, chromosomal instability, and telomerase activity in these elusive cancers.

1.2.2.2 Role of cancer diagnosis in genomic analysis

When undertaking an individualized analysis approach for a cancer patient, scientists need to contextualize the genomic and transcriptomic findings from the cancer against a background of healthy tissue and tumour samples with similar histology. This is particularly true for gene expression data, where a background set of samples is used to determine if certain genes are over- or under-expressed.
A precise cancer diagnosis is, therefore, an important step before a suitable background or comparators can be selected. Classification models incorporate various molecular measurements of tumours to provide a quantifiable prediction. Presumably then, if we can understand the rationale for cancer classification, we can obtain biological insights into the pathways and mechanisms driving a cancer.

Feature selection prior to training an algorithm is one approach to aid interpretation of classifier results. Manual or statistical prioritization of relevant features can indicate which genes or pathways are frequently associated with a particular group of cancers. Alternatively, correlated or orthogonal feature sets guiding the machine-learning based diagnostic can be inferred using post-hoc inference of feature importance from trained models. Given a small set of features, we can measure the impact of each feature on the output through approaches like recursive feature elimination (for random forests, SVMs), and integrated gradients (for neural networks). However, feature selection may not always be the most robust approach, especially when studying cancer types where the markers may be redundant, confounded by another biological signal, or absent. Statistical approaches may also lead to the selection of technical confounders or covariate genes.

A comprehensive sample-level approach to cancer diagnosis and pathway analysis can obviate the need for choosing an appropriate background (a challenge when studying rare aetiologies), and avoid any bias towards known gene candidates and pathways in the integrative analysis.
Such a method can also help compare results from related samples and datasets, identify new subtypes based on common patterns of network alterations, and propose cancer mechanisms.

Automated tools for single-sample analysis from RNA-Seq

Single-subject analysis of transcriptomes is an underappreciated approach for individual-level analysis of diseases. Current approaches leverage cohort-based population analysis and either require large representative sets of each disease for comparison or rely on a case-control approach to find differentially expressed pathways. In these tools, the input (gene expression values) is either used as is, or transformed via ranking, z-score calculation, or statistical thresholding like log-likelihood. After data transformation, the pathway activity is measured by aggregating the values of all genes in a given pathway arithmetically or through the enrichment of gene-level perturbations. Results can be confounded depending on the type of statistical metric used. For example, in an extensive review of these methods, the authors found that the enrichment-based methods were highly sensitive to the way the pathway was defined.

The current approach for single-subject transcriptomic analysis has clear limitations. The requirement of controls and background samples makes it difficult to analyze cancers that present with mixed histology, lack suitable comparator datasets (rare cancers, post-treatment cancers), or have important individual signals that characterize the tumour. The analysis can be further impacted by platform biases and, in the event of small case/control studies, be severely underpowered.
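The transform-then-aggregate pattern shared by the tools described above can be illustrated with a minimal sketch. It assumes a z-score transform of each gene against a background cohort, followed by an arithmetic mean over the genes of each pathway. The function name, the gene names, and the pathway definitions are hypothetical and are not taken from any of the reviewed tools.

```python
from statistics import mean, stdev


def pathway_scores(sample, background, pathways):
    """Score pathways for one sample by averaging per-gene z-scores.

    sample:     dict mapping gene -> expression value in the sample of interest
    background: dict mapping gene -> list of expression values in a reference cohort
    pathways:   dict mapping pathway name -> list of member genes (hypothetical sets)
    """
    z_scores = {}
    for gene, value in sample.items():
        reference = background.get(gene, [])
        if len(reference) < 2:
            continue  # spread cannot be estimated from fewer than two references
        spread = stdev(reference)
        if spread == 0:
            continue  # gene is constant in the background, so the z-score is undefined
        z_scores[gene] = (value - mean(reference)) / spread

    scores = {}
    for name, genes in pathways.items():
        members = [z_scores[g] for g in genes if g in z_scores]
        if members:  # pathways with no scorable genes are left out
            scores[name] = mean(members)
    return scores
```

Even in this toy form the dependence on the background cohort is visible: a gene with no usable reference values drops out of every pathway that contains it, and the score of a pathway changes with the way its gene set is defined.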
1.3 Objectives and chapters overview

Studying the molecular profile of cancers is becoming standard practice for patients with advanced disease, propelling an era of precision medicine. Molecular profiling has expanded from disease-specific tests to broader panels that interrogate multiple genomic changes simultaneously and link to clinical data [42, 53]. These changes can be at the DNA, RNA, or protein level in a patient’s tumour. Through numerous research consortia, thousands of high-resolution profiles of cancers have also been generated using whole-genome sequencing, exome sequencing, and RNA sequencing. These modalities capture the genomic and transcriptomic landscape of each individual tumour. When associated with clinical data, this high-resolution sequencing information can provide us with a portrait of various cancer types [43, 215] and identify functionally important genes in common cancers [144, 168]. They can also automate clinical tasks like cancer classification [73, 114] and guide treatment protocols. Analyzing the molecular landscape of rare tumours can indicate a rationale for transposing lines of therapy that align with widely studied and curated cancer types.

The overall objectives of this thesis are to investigate the utility of RNA-Sequencing as a standalone diagnostic modality for common and rare cancers and to develop machine learning methods that utilize all available gene expression information to resolve differential diagnoses, provide a putative diagnosis for CUPs, and guide genomic analysis of rare cancers, while also providing a molecular rationale for the resultant decision. An overview of the main contributions is shown in Figure 1.2.
This work will move us closer to a world where routine cancer diagnosis is based on detailed molecular profiles of cancers, and where the diagnosis decision-making can be easily broken down in terms of biological pathways and networks driving the tumour.

The next chapter will begin with a background of an ongoing precision oncology trial at BC Cancer, the Personalized OncoGenomics project (POG), from which the vast majority of our research data is drawn (ClinicalTrials.gov ID: NCT02155621). This chapter will motivate the need for a cancer diagnosis in genomics-based cancer profiling. Challenges with diagnosis and their relation with cancer treatment will be highlighted through a published case-study where transcriptomics was used as a diagnostic aid to contextualize analysis and revise diagnosis.

Figure 1.2: Thesis overview and key contributions. In this thesis we explore the utility of bulk RNA-Seq as a diagnostic and analysis aid in personalized oncogenomics initiatives. In a detailed retrospective study we review the frequency of diagnostic changes motivated by genomic data and molecular observations. We develop an automated, open-access tool (SCOPE) for cancer classification using large, representative RNA-Seq profiles. We then extend this method to provide pathway-level profiles of individual cancer samples, also made available as an open-access tool (PIE).

Chapter 3 scales up this investigation to the POG cohort, asking when and how whole-genome sequencing and RNA sequencing can impact cancer diagnosis. Through a retrospective analysis of >300 POG patients, we identify cancer types that are most frequently refractory to pathology-based diagnosis.
We further review how sequencing information could be used to detect, review, and resolve incidences of misdiagnosis, differential diagnosis, and indeterminate diagnosis associated with advanced and rare cancers. This includes the use of SCOPE, a published neural network based cancer classification method, the development and validation of which is described in detail in Chapter 4.

A pan-cancer classifier like SCOPE, that uses large transcriptomic profiles for decision-making, provides a method for quantitative, robust orthogonal cancer diagnosis. In the process, does this model learn any biological properties of each cancer? How does this automated learning compare to the manual genomic analyses commonplace in precision oncology? In Chapter 5 we address these questions using PIE, a tool for extracting pathway-level impact scores from SCOPE. We find that PIE can recover known cancer biology for primary cancers from >10,000 samples in TCGA. The resultant pathway profiles can be used to cluster cancers by their diagnosed cancer type. We show that it can perform single-sample analysis: it identifies therapeutically-relevant pathways for the case described in Chapter 2 and, in another case-study, characterizes the biology of cancers of unknown origin.

Finally, Chapter 6 concludes the thesis by discussing the strengths and limitations of the research presented in Chapters 3-5. It outlines outstanding challenges and interesting directions for future research in this area, including the utility of the methods developed herein.

Chapter 2

Background

Various genomic indications for cancer drug selection and treatment stratification have been translated to the clinic in the past decade.
These include the evaluation of hormonal markers in breast cancer for selecting suitable drugs, identification of treatment options for colorectal cancers based on microsatellite instability status, and treating cancer patients with high tumour mutation burden using pembrolizumab. Many of these indications rely on an accurate cancer diagnosis prior to administration. What role does a cancer diagnosis play in cancer management, and how can molecular data contribute to the same? Here we explore this question within the realms of a precision oncology trial for managing treatment-resistant cancers.

In bioinformatics analysis of cancers, an accurate diagnosis must occur after the raw sequencing data has been processed into sample-specific gene expression values, and before aligning the changes to a reference tissue type. In the case study that follows, the patient presented with an adenocarcinoma of the vulva that was refractory to established lines of therapy. The initial diagnosis of vulvar adenocarcinoma, provided by a pathologist, was re-evaluated after the analysis of gene expression data. The resultant putative diagnosis was validated against clinical records and follow-up validation through immunohistochemistry. The molecular diagnosis was compared with the diagnosis from an experienced pathologist to determine the correct cancer type, subsequently guiding the selection of treatment options for the patient.

The Personalized OncoGenomics (POG) project at BC Cancer was established in 2011 with the aim to sequence and treat patients with advanced cancers. Patients were enrolled after their cancer no longer responded to standard lines of therapy. The project analyzes the genomic and gene expression profiles of each patient’s cancer in order to identify drugs that can target the individual cancer. The project has
had considerable demonstrated success, guiding targeted therapy and elucidating novel resistance mechanisms in highly aggressive cancers [96, 108].

Patients in the POG project are recruited and biopsied at BC Cancer and affiliated hospitals. Clinical laboratories at BC Cancer assess the site of origin from these biopsies according to established protocols. Subsequently, standard Illumina protocols are followed for whole-genome sequencing (WGS) of the tumour and peripheral blood (as control), and transcriptome sequencing (RNA-Seq) for the tumour. Sequencing is performed using Illumina HiSeq 2000 sequencers. The raw genomic and transcriptomic sequences are processed through a series of software tools to quantify the expression of genes, identify structural variants, mutations, and copy number changes. These findings need to be contextualized against a reference tumour type before we can draw inferences about clinically relevant changes in the patient’s cancer. The following case study will help the reader gain an appreciation for a routine precision oncology workflow, and to appreciate the implications of expression based cancer diagnosis.

Patients in the POG project have typically received multiple lines of chemotherapy prior to enrolment in the program. As a result, their cancers usually acquire complex molecular profiles, and oftentimes have moved away from their site of origin (metastasized). Due to this, identifying the cancer’s site of origin is a major challenge.

2.1 Case study

Mammary-like glands in the vulva were first reported in 1872, and thought to be supernumerary breast tissue remnants located along the milk lines. Current understanding suggests that these are modified vulvar eccrine glands that can give rise to vulvar adenocarcinomas. Vulvar cancers represent 5% of gynecologic cancers and <1% of all cancers in women. Approximately 90% of vulvar cancers are squamous cell carcinomas, and are associated primarily with high-risk human papilloma virus (HPV).
Most of the remaining vulvar cancers are not associated with HPV, and are typically vulvar adenocarcinomas. This category includes primary tumours arising from the vulva, and metastases to the vulva. A determinative diagnosis of vulvar adenocarcinomas is complicated, and can encompass primary adenocarcinomas (mammary-like, mucinous, adenoid cystic, Bartholin gland, and extramammary Paget disease) and metastatic disease. Metastatic adenocarcinomas to the vulva form 5-8% of all vulvar cancers.

Breast cancers are the most common malignancy affecting women in North America. In contrast, mammary-like adenocarcinomas of the vulva (MLAV) are rare, locally aggressive tumours that arise from the vulva but strongly resemble primary breast carcinomas. They were initially reported in 1875, and at the time were thought to be breast tissue remnants located along the milk line. Current understanding suggests that they are modified vulvar eccrine glands that can give rise to several different tumours, including vulvar adenocarcinomas. They can metastasize to lymph nodes in approximately 60% of cases, and recur frequently after treatment. Treatment guidelines are traditionally the same across vulvar carcinomas, but more evidence now suggests the transposition of breast cancer treatment regimens to MLAV; these include sentinel node biopsy, molecular subtyping, and adjuvant therapy [1, 21, 153].

Herein we describe the case study of a patient who presented with a poorly differentiated vulvar adenocarcinoma. The tumour was subsequently reclassified as a HER2+ MLAV upon transcriptomic analysis. The molecular profile of the case also aligned more strongly with breast cancer over gynecologic cancers. Put together, these findings suggest molecular likeness between breast cancers and MLAV, adding further support to the transposition of breast cancer regimens to this rare cancer type [1, 153].
The case also highlights the utility of genomics in resolving complex diagnoses. To our knowledge, this is the first case of a mammary-like adenocarcinoma of the vulva being described with detailed whole-genome and transcriptome sequencing analysis.

2.1.1 Clinical background

A 60-year-old woman presented with a poorly differentiated, bleeding mass in the vulva. There was no family history of malignancy. PET/CT scans determined it to be a stage IV malignancy with a regional spread from the bilateral inguinal area to the retrocaval, external iliac, and common iliac lymph nodes. In total, three masses were noted on physical examination: a 2.5 cm firm right labium maius mass, a 2.5 cm bleeding vaginal introital mass, and bilateral inguinal lymphadenopathy up to 3.0 cm. The entire vulva was severely flaky, but no specific skin lesion was observed. The Bartholin gland, from where Bartholin gland carcinoma can arise in 1% of all vulvar neoplasms, was not proximal to the right labium maius. The largest lymph node was a left external iliac lymph node measuring 3.6 cm. No breast masses were identified in the physical examination. Furthermore, the patient reported that remote mammograms had shown no malignancy. Bilateral mammograms were also negative for breast malignancy. Chest X-ray showed no metastatic pulmonary disease. A hypermetabolic mass in the medial aspect of the right labium maius (3.2 x 2.1 cm, maximal standardized uptake value (SUV) of fluorodeoxyglucose (FDG) 24.0) and an adjacent FDG-positive circumferential mass in the vaginal introitus (3.0 x 2.0 cm, maximal SUV 23.6) were detected using positron emission tomography/computed tomography (PET/CT) scans. The uterus was enlarged but showed only physiologic uptake of FDG. High SUV of FDG is indicative of cells having a high metabolic rate, and can reveal differences in glucose consumption of various cancerous lesions. No other putative primary tumour sites were identified.
The clinical and radiologic findings were consistent with a stage IV vulvar cancer, with metastasis to the bilateral inguinal, retrocaval, external iliac, and common iliac lymph nodes. An initial vulvar biopsy was taken at this point and the pathology findings reported “poorly differentiated infiltrating carcinoma, favor [sic] poorly differentiated adenocarcinoma” (see Section 2.1.3).

The patient was treated with four rounds of carboplatin and paclitaxel, with a positive response observed in all areas apart from the inguinal lymph nodes. The response was assessed using repeat imaging by PET/CT. Sequential radiotherapy was then given to the entire spread of the disease at baseline. This was accompanied by additional radiotherapy boosts to the still FDG-avid inguinal lymph nodes. A rapid recurrence was observed in the patient’s left supraclavicular lymph nodes six weeks after the completion of radiotherapy. A fine needle aspirate confirmed this to be a poorly differentiated adenocarcinoma consistent with metastatic vulvar adenocarcinoma.

In the absence of subsequent standard treatment options, the patient was enrolled in the POG project at BC Cancer to identify actionable targets and to validate the clinical diagnosis of vulvar adenocarcinoma. At the time of enrollment in POG, the patient had no family history of cancer. No additional genetic testing was done, and no other treatment was received between the initial vulvar biopsy and presentation of metastasis in the left supraclavicular lymph node. The sample from the left supraclavicular lymph node (subsequently referred to as the recurrence biopsy) was submitted to POG for sequencing and analysis. A detailed clinical time-line is shown in Figure 2.1.

Figure 2.1: Clinical history and pathology sampling timepoints for MLAV patient. Initial treatment is indicated in orange, tumour biopsies at various time-points following metastasis are indicated with purple lines, and treatments provided based on genomic analysis are shown with purple drug symbols over dark-grey timeline bars. Tumour biopsies on which immunohistochemistry was performed are shown with open circle termination of the corresponding line. Abbreviations: IHC - Immunohistochemistry test, POG - Personalized OncoGenomics clinical trial (Clinical Trial number: NCT02155621).

2.1.2 Methods

Ultrasound guided core-needle biopsies were obtained for the POG study. FISH assays and IHC were performed by the clinical laboratories at BC Cancer according to established protocols. The rabbit monoclonal antibody for HER2 (clone 4B5; Ventana Medical Systems) was used for HER2 protein staining. Immunostaining was performed on the Ventana Benchmark Ultra automated system (Ventana Medical Systems) with 36 minutes of ULTRA CC1 before being incubated with the prediluted HER2 antibody for eight minutes at 36 °C. The ultraView DAB detection kit was used with an ultraWash step.

RNA and DNA were extracted and sequence libraries constructed using standard protocols (summarized in Table 2.1). Sequencing was performed on an Illumina HiSeq2500 platform at Canada’s Michael Smith Genome Sciences Centre (GSC). One microgram each of DNA from normal blood and tumour biopsy were separately used as input to the GSC Polymerase Chain Reaction (PCR)-free WGS protocol, and sequenced to 43x and 90x coverage respectively. 1.725 micrograms of total RNA from the tumour was treated with the strand-specific messenger RNA sequencing protocol with poly-adenylated reads capture, and sequenced to a total of 291 million reads. The reads were aligned to the GRCh37 reference human genome using BWA v0.5.7. Duplicate reads were marked using Picard (v1.38, https://github.com/broadinstitute/picard/). Microbial and viral integration detection analysis was done using an in-house pipeline and BioBloom Tools.
WGS variants were identified using Samtools v0.1.7 mpileup.

The tumour and normal samples were compared to identify somatic events. Somatic single nucleotide variants (SNVs) were called using Strelka v0.4.62 and MutationSeq v1.0.2. Strelka v0.4.62 was also used to call small insertions and deletions. The somatic variant annotation was done with the Ensembl database (v69), and the effect calculation was assisted by annotations from snpEff 3.2, COSMIC v64, and dbSNP v137. LOH events and tumour content were determined with APOLLOH v0.1.1. Copy number variants were identified using CNAseq v0.0.6 (https://www.bcgsc.ca/platform/bioinfo/software/cnaseq).

RNA-Seq data was analyzed using JAGuaR v2.0.3. The RNA-Seq data was subsequently processed by an in-house pipeline for Whole-Transcriptome Shotgun Sequencing coverage analysis, to yield exon- and transcript-level read counts and normalized expression values (Reads Per Kilobase of transcript per Million mapped reads, RPKM). Gene-level RPKM values were then calculated based on a collapsed gene model. Fold change for each gene was calculated by dividing the gene’s RPKM value by the average of the RPKM values for the gene in a compendium of adjacent normal tissue samples from the Illumina Human BodyMap 2.0 project. A percentile ranking of the RPKM of each gene against the compendium of breast cancer transcriptomes from TCGA was used to identify genes with aberrant expression and to prioritize genes of interest.

Expression correlation analysis for tumour typing was undertaken relative to the entire set of normal and tumour transcriptomes in TCGA. Two-way Analysis of Variance (ANOVA) was used to identify genes that distinguished each pair of TCGA tumour types. This resulted in a set of 3,000 genes that were the most informative in explaining patterns of variance amongst all TCGA tumour types. A Spearman correlation was calculated for this set of genes from the tumour sample against each TCGA sample.
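A minimal sketch of this correlation-based typing step, including the per-cohort median summary, might look as follows. It is an illustration only and not the in-house pipeline: the function names are hypothetical, the informative gene set is assumed to have been selected already, and every profile is assumed to be non-constant and listed in the same gene order.

```python
from collections import defaultdict
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # assumes neither profile is constant


def closest_cohort(sample, references):
    """Pick the cohort whose samples have the highest median correlation.

    sample:     expression values over the informative gene set
    references: list of (cohort_label, expression values in the same gene order)
    """
    by_cohort = defaultdict(list)
    for label, profile in references:
        by_cohort[label].append(spearman(sample, profile))
    medians = {label: median(rs) for label, rs in by_cohort.items()}
    return max(medians, key=medians.get), medians
```

Summarizing each cohort by its median, rather than by its single best-matching sample, keeps one unusually similar reference sample from deciding the call.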
These pairwise correlations were clustered by the disease status (tumour or normal) and cancer type of the TCGA samples. The cancer set with the highest median correlation was determined to be representative of the closest cancer type for the sample.

Table 2.1: Details of sequencing experiments.

Sample          Type   Input (micrograms)   Library protocol   Coverage   Reads total
Biopsy tumour   DNA    1.000                PCR-free WGS       90x        NA
Biopsy tumour   RNA    1.725                ssRNA-Seq          NA         291 million
Normal blood    DNA    1.000                PCR-free WGS       43x        NA

2.1.3 Pathology analysis and findings

Pathology analysis was conducted on two biopsies: the initial vulvar mass, and an aspirate of the recurrence in the supraclavicular lymph node. The initial vulvar mass was evaluated twice: once at the time of initial biopsy, and subsequently following the indication of a mammary-like adenocarcinoma from the POG project’s genomic analysis.

Initial vulvar biopsy assessment

An initial biopsy of the vulvar mass showed nests and cords of large pleomorphic epithelioid cells with eosinophilic cytoplasm and hyperchromatic nuclei. Gland formation, papillary structures, or intra-luminal vacuoles were not evident, except for some occasional cells showing possible signet-ring features and intra-luminal vacuoles. Mitotic figures were easily identified. No mammary-like glands or overlying epidermis were present. Based on immunohistochemical (IHC) workup at the receiving hospital, the tumour was assessed positive for CK7 and Ber-EP4, and negative for CEA, CK5, CK20, MART-1 and S100. PAS-diastase was negative for definitive intra-luminal mucin.

These initial findings were consistent with a poorly differentiated adenocarcinoma.
Moreover, the poorly differentiated morphology and nonspecific IHC profile in the initial biopsy resulted in a broad differential diagnosis that included poorly differentiated vulvar squamous cell carcinoma, poorly differentiated vulvar adenocarcinoma (Bartholin gland adenocarcinoma, MLAV, or adenocarcinoma arising from extramammary Paget disease), metastatic adenocarcinoma (from the gastrointestinal tract or gynecologic organs), and melanoma.

Post-POG pathology analysis

Subsequent to the genomic analysis favoring a diagnosis of mammary-type carcinoma (described in Section 2.1.4), follow-up validation stains were performed at BC Cancer. This validation work was carried out on the initial vulvar biopsy and on a repeat aspirate from the left supraclavicular lymph node.

Initial biopsy

Additional stains on the vulvar biopsy were performed at BC Cancer to evaluate the POG-indicated diagnosis of mammary-type carcinoma. IHC for HER2, the protein product of the ERBB2 gene, was equivocal (score 2+ as evaluated according to ASCO/CAP guidelines), as indicated in Figure 2.2. This prompted reflex testing for HER2 amplification in the genome, using Fluorescent In Situ Hybridization (FISH) testing. This showed the HER2/CEP17 ratio was 2.0 with 20 cells counted, and an average HER2 copy number per cell of 6.35. Although the HER2/CEP17 ratio was equivocal, HER2 was interpreted as amplified based on a HER2 copy number equal to or greater than 6.0 signals per cell, as per the 2013 ASCO/CAP guidelines. These guidelines require “complete, intense staining” of the circumferential membranes of >10% of tumour cells.

Additional IHC testing revealed that the tumour was negative for ER, PAX8, GCDFP-15 and mammaglobin, and focally positive for vimentin. The negative CEA, GCDFP-15 and PAS-diastase excluded the diagnosis of adenocarcinoma arising from extramammary Paget disease (EMPD). The negative CK5 ruled out squamous cell carcinoma.
The negative ER, GCDFP, and mammaglobin discounted the possibility of luminal A/B-types of primary MLAV. The negative PAX8 and CK20 ruled out metastatic gynecologic and lower gastrointestinal tract carcinomas, respectively. The negative MART-1 and S100 (tested previously) were contrary to expected indications for melanoma. The overall pathologic findings from the post-POG workup of the initial vulvar biopsy were in keeping with high-grade ER-negative, HER2-positive mammary-type carcinoma of the vulva.

Recurrence biopsy

A fine-needle aspiration (FNA) of the recurrent disease in the left supraclavicular lymph node was collected and showed poorly cohesive irregular glandular clusters of pleomorphic malignant cells consistent with metastatic carcinoma. A repeat FNA of the site, collected for IHC testing, demonstrated that the sample was strongly positive for GATA3, a relatively recent marker for breast cancer. HER2 IHC was positive (score 3+) based on 30% of tumour cells showing strong circumferential membranous staining. ER was negative, identical to the initial vulvar biopsy. Histological estimates placed the tumour content of the recurrence biopsy at 69%, with 85% cellularity.

Alternative diagnoses and exclusion of differential diagnoses

HER2 overexpression can also be observed in adenocarcinomas arising from EMPD. A 2005 study of patients with mammary and extramammary Paget disease observed co-expression of ERBB2 and AR in 88% (51/58) of cases with mammary Paget; the genomic analysis for this case also showed high expression of ERBB2 and AR. With this observation in mind, invasive carcinoma arising from primary EMPD, rare anogenital tumours with proposed precursors that include Toker cells, pluripotent germinative cells, eccrine or apocrine glands, and mammary-like glands [102, 217] were considered for differential diagnosis during the validation pathology workup of the recurrence.
The positive staining for CK7, GATA3, and HER2, and the negative staining for ER overlapped with previous reports of EMPD [48, 133, 154]. However, CEA, a nonspecific immuno-stain that is positive in most cases of primary EMPD, was negative in this patient’s tumour. Furthermore, the overlying epidermis was not seen in the vulvar biopsy. Thus, the presence of pagetoid cells, which are necessary for the diagnosis of Paget disease and carcinomas arising therein, could not be assessed. The confluence of the infiltrating tumour also favoured mammary-type carcinoma over Paget disease, which is characteristically more superficial. In summary, the IHC findings supported the diagnosis of MLAV.

Figure 2.2: Histopathology of biopsies retrieved from MLAV patient. A) The biopsy of the vulvar mass shows a poorly differentiated tumour composed of nests and cords of pleomorphic tumour cells. B) The HER2 immunostain on the initial vulvar mass biopsy is equivocal, compatible with score 2+ based on predominantly incomplete, weak and moderate membrane staining within greater than 10% of tumour cells. C) The fine needle aspirate of the recurrence lesion from the supraclavicular lymph node shows clusters of pleomorphic tumour cells (H&E stain). D) The HER2 immunostain of the supraclavicular lymph node shows tumour cells with complete, intense membrane staining in greater than 10% of tumour cells, compatible with score 3+.

2.1.4 Genomic analyses

The recurrence sample was submitted to POG for whole-genome and transcriptome sequencing and analyses. In the absence of subsequent options for standard treatment for the patient, the aim of this exercise was to a) identify potentially actionable genomic targets, and b) clarify the diagnosis and evaluate the validity of the initial diagnosis as vulvar adenocarcinoma. A constitutional blood sample and tumour from the lymph node biopsy were sequenced to a redundant sequence coverage depth of 43-fold and 90-fold, respectively.
A transcriptome of 291 million sequence reads was also generated from the same tumour sample. The genomic and transcriptomic findings indicated the cancer was a mammary-like adenocarcinoma. The association of specific genomic events and transcriptomic changes with breast cancer was confirmed upon detailed literature review and integrative analysis.

Somatic Variants

Somatic variants were identified by comparison of paired-end WGS results from the tumour sample and the blood (germline reference). These variants were subsequently filtered to discard known artifacts and low-confidence variants, resulting in 375 single nucleotide variants (SNVs) and 15 insertion/deletion events (INDELs). A subset of 16 non-synonymous protein-coding SNVs were present in the COSMIC database and were considered to be the most relevant variants. No structural variants of clinical significance were identified. Copy number variant (CNV) analysis indicated a triploid karyotype with an estimated 68% tumour content, consistent with the pathology estimate of 69%. Focal copy number amplifications were detected on chromosomes 2, 8, 9, 17, and X. Screening for microbial and viral sequences was negative for any microbial contaminants. There was no evidence of HPV genomic integration either.

A gain of function mutation (p.S310F) was observed in ERBB2 with prevalence level of 86%. This event was accompanied by a copy number gain (five copies), as shown in Figure 2.3. The S310F mutation has been identified in several cancers including breast, lung, and ovarian.

Loss of function mutations were observed in TP53 and RB1 tumour suppressor genes. These genes overlapped with regions of loss of heterozygosity (LOH) in the copy number landscape. Variants of unknown significance were also noted in PIK3CA, AKT3, and GNAS. All SNVs of
interest are summarized in Table 2.2.

A triploid model with an estimated 68% tumour content was inferred based on the prediction of allelic imbalance and loss of heterozygosity in the sample, as already described in Section 2.1.2. Copy-number variants were estimated with respect to this ploidy model. Focal copy-number amplifications were detected in Chromosomes 2, 8, 9, 17, and X. Of particular interest, copy number gains were observed for ERBB2, AKT3, PIK3CA, CDK1, CCNB1, and AR. Loss of heterozygosity events (LOH), arising from the loss of a single copy of the respective gene, were detected for the tumour suppressor genes BRCA2, RB1, and TP53. Additionally, RB1 and TP53 had loss of function mutations in the two remaining (homozygous) copies. These findings are summarized in Table 2.3.

Table 2.2: SNVs of interest are listed, along with details on the counts of the supporting reads spanning the tumour genome at the mutated and reference bases, in the tumour genome (transcriptome).

Gene      Chr   DNA Change      Variant   Alt/Ref (Alt_RNA/Ref_RNA)
AKT3      1     244006441 C>A   VUS       18/113 (0/11)
ERBB2     17    37868208 C>T    GoF       165/26 (4181/625)
GNAS      20    57430298 C>G    VUS       7/39 (0/4)
PIK3CA    3     178938934 G>A   VUS       69/40 (102/24)
RB1       13    49033844 C>T    LoF       35/14 (294/59)
TP53      17    7577082 C>T     LoF       30/13 (230/24)
MAP3K12   12    53877268 C>T              7/62 (4/44)
OR14A16   1     247978827 G>T
PAF1      19    39876915 G>A              10/37 (76/208)
PCDHA6    5     140208403 G>A             25/38 (0/0)
PPM1B     2     44428594 A>G              20/72 (71/232)
SLCO3A1   15    92669422 G>A              18/37 (9/21)
SNTG2     2     1271197 G>C               13/57
THBS2     6     169629714 C>T             8/53 (0/102)
UPF3A     13    115047496 G>C             9/15 (0/25)
ZNF830    17    332893990 C>T             15/39 (51/74)
ZXDB      X     57618845 G>A              11/22 (2/16)
ZXDB      X     57618849 A>C              11/21 (3/17)

Abbreviations: AA, amino acid; Alt, coverage of alternative allele; Alt_RNA, RNA reads mapping alternative allele; Chr, chromosome; GoF, Gain of function; LoF, Loss of function; Ref, coverage of reference allele; Ref_RNA, RNA reads mapping reference allele; SNV, single-nucleotide variant; VUS, variant of unknown significance.

Table 2.3: Copy number variants of interest in the tumour genome are listed, along with percentile values and fold changes calculated from the respective RPKMs against a background of TCGA Breast cancers.

Gene     Chr   Copy type       TCGA expression   Fold expression   Copy change in ploidy corrected
                               percentile        change            model (versus 3n, triploid tumour)
AKT3     1     Gain            21                -5.18             +1 (HET)
ERBB2    17    Amplification   98                32.73             +8 (ALOH)
GNAS     20    Gain            2                 -1.55             +1 (NLOH)
PIK3CA   3     Gain            49                -1.40             +1 (HET)
RB1      13    Loss            88                1.76              -1 (DLOH)
TP53     17    Loss            21                -1.10             -1 (DLOH)
AR       X     Amplification   100               26.39             +8 (NLOH)
BIRC5    17    Amplification   100               40.87             +5 (BCNA)
BRCA2    13    Loss            94                2.21              -1 (DLOH)
CDK12    17    Amplification   99                5.00              +8 (ALOH)
CCNE2    8     Amplification   100               25.71             +17 (ALOH)

Abbreviations: ALOH, amplification with loss of heterozygosity; BCNA, balanced amplification; Chr, chromosome; DLOH, deletion with loss of heterozygosity; exprn, expression; GoF, gain of function; HET, heterozygous; LoF, loss of function; NLOH, neutral with loss of heterozygosity.

De novo assembly of the genome and transcriptome was performed to identify structural rearrangements of potential biological and clinical significance.
However, none were detected.

Transcriptomic analysis

A pairwise expression correlation analysis was undertaken to compare the gene RPKM values from the sample with The Cancer Genome Atlas tumour samples (TCGA; see Section 2.1.2). This analysis, done across 27 different tumour types available from TCGA, indicated that the tumour sample correlated the most with the breast cancer (BRCA) cohort. Based on this observation, we replicated the PAM50 test by selecting the PAM50 set of genes to correlate the sample’s transcriptome against TCGA breast cancer samples with known BRCA molecular subtype status. Consistent with the amplification and gain-of-function mutation in the ERBB2 gene, the tumour sample correlated the highest with the HER2 enriched and Luminal B subtypes. These results are shown in Figure 2.4.

Based on the findings from the correlation analysis, the genomic events and RNA-level changes were considered against a background of breast cancers. A fold-change value for each gene was calculated against a normal breast tissue transcriptome (Illumina Human BodyMap 2.0) and a percentile rank of expression calculated in comparison to the breast cancer cohort from TCGA (detailed in Section 2.1.2). A fold change of -1.1 in TP53 gene expression was consistent with the loss-of-function mutation identified in TP53. The ERBB2 gene, which had a gain-of-function mutation and copy number gains, also had an RNA-level fold change of 32.7 relative to the compendium average. Expression outliers were identified and evaluated in conjunction with mutational status, copy-number state, and known biological function. Nine genes of interest were identified having gains of more than three copies each, and also ranked in the 98th–100th percentile versus TCGA breast cancers (Table 2.3). Of particular interest among these genes were ERBB2 (98th percentile), CDK12 (99th percentile), AR (100th percentile), and CCNE2 (100th percentile). The extreme outlier expression of ERBB2 (33-fold overexpression, 98th percentile of BRCA), combined with the observed gain-of-function mutation (p.S310F) and estimated five copy gain (Figure 2.3), further supported a diagnosis of a HER2+ mammary-like cancer and identified HER2 as a likely driver of the disease.

Figure 2.3: ERBB2 gene’s genomic locus is shown in the patient’s tumour. A) A lollipop plot showing the coordinates of the S310F gain-of-function mutation observed in this case. B) A plot of the copy number landscape of Chromosome 17 in the tumour. The ERBB2 copy-number gain is indicated.

Figure 2.4: Correlation plots of the cancer’s RNA-Seq profile with TCGA cancer datasets. A) Boxplot distribution of the pairwise Spearman correlation of the recurrence biopsy’s gene expression profile and all TCGA samples. The x-axis represents cancer types following TCGA naming conventions. TCGA breast cancer cohort is indicated by BRCA. B) Boxplot distribution of the pairwise Spearman correlation between the recurrence biopsy and the TCGA breast cancer cohort based on the PAM50 set of genes. The pairwise correlations with adjacent normal are shown in blue.

Mutational signatures

To further evaluate the differential diagnosis of the tumour as a breast cancer, we considered the WGS mutational data against previously catalogued mutational signatures. A strong APOBEC signature was observed (Signatures 2 and 13), and no HPV was detected. The APOBEC family of cytidine deaminases generates mutations of a specific pattern (the APOBEC signature mutation pattern), which has been reported in several cancers, including HER2+ breast cancers.
APOBEC activation has also been associated with HPV; however, screening for microbial and viral sequences was negative for any microbial contaminants and no evidence for genomic integration of HPV was detected.

2.1.5 Clinical decision and outcome

The diagnosis of the tumour as a mammary-like adenocarcinoma, and the presence of a clearly druggable target (HER2), led to the patient being treated with a first-line standard of care for HER2+ breast cancer. A strong APOBEC signature is associated with HER2+ breast cancers and, because of its association with PDL1, has been positively correlated with response to immunotherapy in other cancers. Unfortunately, at the time of this analysis, immunotherapy was not available as an accessible line of treatment and was not pursued further.

Transformed Ba/F3 cells harboring the S310F mutation in HER2 (as in this tumour) have been shown to be sensitive to neratinib, afatinib, lapatinib and trastuzumab. In consideration of these findings, the patient was treated with trastuzumab, pertuzumab, and vinorelbine, followed by capecitabine and lapatinib. The patient had a poor clinical response to all targeted therapies tried, at best achieving short-term disease stabilization but never achieving regression of disease (see 2.3 Conclusion for a potential explanation of this behaviour). Subsequent disease progression included the development of cutaneous lesions in the left shoulder and back. An FNA sample of the region confirmed this to be a metastatic carcinoma. The patient passed away two years and five months after her initial diagnosis.

2.2 Summary

The diagnosis and classification of vulvar adenocarcinomas is a complicated and understudied area, as this is a rare histologic subtype of vulvar cancers. The differential diagnoses include MLAV, adenocarcinoma arising from EMPD, mucinous carcinoma, Bartholin gland adenocarcinomas, and metastatic adenocarcinomas from various sites.
In our case, the clinical, radiologic, and histologic features indicated a poorly differentiated (high-grade) primary vulvar adenocarcinoma. The patient was enrolled in the BC Cancer POG project upon the development of new metastases, with the two aims of characterizing the underlying genomics of this poorly differentiated cancer and identifying actionable therapeutic targets.

The poorly differentiated morphology and non-specific immunoprofile of the initial biopsy had resulted in a broad pathologic differential diagnosis including mammary-type carcinoma, vulvar adenocarcinoma, and upper gastrointestinal adenocarcinoma. The subsequent bioinformatics analysis findings indicated the patient’s tumour was most consistent with a HER2+ breast cancer profile. Post-hoc histopathologic investigations on the initial biopsy and a repeat aspirate from the supraclavicular lymph node site corroborated the determinative finding of HER2 overexpression from the bioinformatics analysis, and supported the diagnosis of MLAV. IHC was negative for the mammary markers mammaglobin and GCDFP-15, as is the case for most ER-negative breast carcinomas. GATA3, a more recently established marker that is positive in breast carcinomas, was positive in the recurrence biopsy.

Additional genomic events, specifically the GoF S310F mutation in ERBB2, the LoF mutations combined with LOH in TP53 and RB1, and the co-expression of ERBB2 and AR at high levels, pointed to a mammary-like cancer. It has been shown that RB1 LOH occurs at a higher frequency in basal-like and luminal B breast cancers, suggesting a potential role of RB1 LOH as a predictive marker in these subtypes. No definitive molecular support was found for ER over-expression.

2.3 Conclusion

As demonstrated in this case, a visual inspection of gene expression correlation patterns against independent primary cancer datasets can give us clues about the molecular behaviour of a cancer.
Insights driven by RNA-Seq comparisons with established cancer types can lead to a change in management of a patient with terminal disease, providing a brief but eventually insufficient respite from a rare malignancy. More broadly, this study adds to the body of evidence for treating these rare cancers, MLAVs, as an ectopic breast tissue malignancy.

At the biological level, the findings emerging from this analysis lend support to existing observations in the literature about mammary-like carcinomas. AR overexpression is found in 60% of breast cancers and is generally observed more frequently in ER+ breast cancers than in ER− ones. However, when present, AR expression is significantly correlated with HER2 expression in ER− breast cancers, and a proliferative role for AR has been suggested recently in ER−, HER2+ patients [129, 156]. An overexpression of CDK2 and a high-percentile expression of CCNE2 were observed in the recurrence sample’s transcriptomic analysis. These have been suggested as potential resistance markers for trastuzumab, and we can speculate on their potential role in rendering the treatment ineffective.

Additionally, post-genomic analysis findings from the pathology workup and validation tests provide evidence for the improved ability of emerging histopathology markers like AR and GATA3 in diagnosing breast cancers, as compared to more established ones like mammaglobin and GCDFP-15. A recent IHC study showed that MLAV can be classified into four breast intrinsic subtypes, including a HER2+/ER− group [193, 202]. During the initial pathology workup for the vulvar biopsy from this patient, the diagnosis of MLAV was not favored on the basis of ER, mammaglobin, and GCDFP-15 negativity. However, primary breast carcinomas with high nuclear grade are mostly ER−, and a recent study has proposed GATA3 as a more sensitive marker for HER2+/ER− breast carcinomas than mammaglobin and GCDFP-15.
The analyses presented herein agree with this body of evidence and confirm that ER negativity is consistent with the diagnosis of HER2+ MLAV.

How often does realignment of diagnosis happen as a result of detailed genomic analysis? Are there certain cancer types that are particularly refractory to routine histopathology-based diagnosis? If so, can we develop diagnostic approaches that incorporate high-dimensional sequencing data to provide robust and confident assessments of a cancer type? The next chapter delves deeper into these questions through a cohort-wide analysis of cases that have been subject to similar detailed genomic analysis using whole-genome sequencing and RNA sequencing.

Chapter 3

Impact of genomics on diagnostic pathology in a precision oncology trial

Health care institutions now offer a varied selection of genetic testing to guide treatment options for cancer patients in the clinic. As discussed in Chapter 1, this includes prognostic panels like MammaPrint and Oncotype Dx [24, 181], and single-marker tests for prognostic genomic changes like mutations (for example, in KRAS, EGFR, or IDH1), and fusions like ABL-BCR (DeVita et al.) or EWSR1 (for sarcomas). However, in the absence of a corollary cancer diagnosis to guide the companion test selection, or when none of the informative and actionable targets assessed by these panel-based approaches are present in a cancer, whole-genome and RNA sequencing provide an unbiased and detailed molecular view of the disease. In recent years, these sequencing modalities have enabled a dramatic expansion in the scale and content of cancer characterization and management within the scope of research investigations.

Precision oncology can be described as the use of DNA and RNA sequencing to facilitate discovery and analysis of molecular changes that impact patient management and treatment.
The sequencing modalities vary, ranging from targeted deep sequencing of a subset of genes, to sequencing the mRNA-coding regions (exome), the entire genome (whole-genome), or the transcribed RNA (RNA-Seq). Precision oncology trials typically use one of exome (WES) or whole-genome (WGS) sequencing, combined with RNA-Seq [131, 146, 169]. As the price of sequencing decreases and more healthcare research facilities develop sequencing capabilities at scale, routine comprehensive DNA and RNA sequencing can be expected to make its way into cancer management [35, 169].

While advanced molecular techniques are being adopted steadily through precision oncology clinical trials, the interpretation of these assays is an ongoing challenge. Analysis and identification of therapeutically relevant genomic and transcriptomic changes is a tiered process. The raw sequencing data, in the form of reads, is filtered for microbial contamination and assembled and aligned to the human reference genome. Depending on the sequencing modality, either genomic variants are identified or the reads mapping to loci of interest (exons, transcripts) are quantified. Genomic variants specific to the tumour are identified through comparison with the assembled healthy normal tissue genome obtained from the patient, also known as the germline genome. The quantified reads obtained from RNA-Seq are normalized, usually to the library size and the gene length, yielding reads per kilobase per million mapped reads (RPKM) values for each gene/transcript. However, these expression measurements cannot be interpreted in isolation and have to be compared against a healthy tissue that matches the tissue of origin of the cancer type. The healthy tissue RNA-Seq reflects what a normal expression profile from that primary site would look like, helping us pinpoint genes and pathways that are aberrantly expressed. The degree of aberration is placed in context with other primary cancers reflecting the same cancer biology.
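The normalization and comparison steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the pseudocount, and the signed fold-change convention (negative values for under-expression) are all hypothetical choices made for the example.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def signed_fold_change(tumour_rpkm, normal_rpkm, pseudocount=1.0):
    """Fold change of the tumour versus a matched normal tissue.

    Assumed convention: ratios below 1 are reported as negative
    reciprocals, so a ratio of 0.25 becomes -4.0. The pseudocount
    (also an assumption) avoids division by zero for unexpressed genes.
    """
    ratio = (tumour_rpkm + pseudocount) / (normal_rpkm + pseudocount)
    return ratio if ratio >= 1.0 else -1.0 / ratio


def percentile_rank(value, cohort_values):
    """Percentage of cohort values less than or equal to `value`."""
    at_or_below = sum(1 for v in cohort_values if v <= value)
    return 100.0 * at_or_below / len(cohort_values)


# Hypothetical numbers, for illustration only.
tumour = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=25_000_000)
print(tumour)
print(signed_fold_change(tumour_rpkm=7.0, normal_rpkm=1.0))
print(percentile_rank(tumour, [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]))
```

The fold change answers "how far is this gene from healthy tissue", while the percentile answers "how unusual is this gene among cancers of the same type"; the two are kept as separate functions because they use different reference cohorts.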
Changes at the gene level (genomic or transcriptomic) are then assessed manually by a computational biologist, who summarizes and contextualizes them within biological pathway diagrams. These diagrams pinpoint important biological associations that have been impacted in the tumour, and point to putative therapeutic options.

This process provides a vast amount of information about a single cancer sample after drawing upon pre-existing datasets of healthy tissues and primary cancers. Recently published clinical trials that utilized sequencing data to do this have found that this approach can provide a hitherto unmatched level of insight into the mechanisms driving metastatic cancers [169, 225]. No studies to date have assessed the effectiveness of combining large-scale sequencing protocols with histomorphology in guiding the analysis and management of advanced cancers. Successful precision oncology based analyses require the identification of a suitable healthy tissue comparator, which in turn requires knowledge of the cancer type. We reviewed a series of cases in which whole-genome and transcriptome sequencing was performed as part of the POG trial in Canada, considering this data in the context of tumour histomorphology. Our goal was to evaluate the impact of integrating WGS and RNA-Seq on pathological diagnosis, and to understand how this data impacted subsequent biomarker testing and selection of targeted therapy.

3.1 Methods

3.1.1 Consent and institutional review board process

This research project was approved by the BC Cancer Agency Research Ethics Board (protocols H14-00681 and H12-00137). Cancer patients with advanced disease who failed conventional treatment and fulfilled the inclusion criteria were consented for tumour profiling using RNA-Seq (tumour) as well as whole-genome sequencing (tumour and blood) (ClinicalTrials.gov ID: NCT02155621).

3.1.2 Tissue biopsy and processing

A fresh tissue biopsy was mandatory for all patients participating in this study.
Samples were taken from metastatic or recurrent tumours through needle-core biopsies or surgical resection, with guidance from imaging. Matching normal DNA was extracted from peripheral blood leukocytes. The samples were snap-frozen and anchored in a small amount of optimal-cutting-temperature (OCT) compound for cryo-sectioning. These sections were used for DNA and RNA extraction, after being assessed for histologic correlation. The snap-frozen tissue specimens were cryo-sectioned at 50 μm for nucleic acid and protein extraction, and at 5 μm for hematoxylin-eosin (H&E) staining every 200 μm. Cases were excluded if the tumour content of the sections was less than 40% by pathology review. The intervening sections were placed into RNAse-free Eppendorf tubes. Only a small amount of OCT compound was used to bind the tissue to the chuck of the cryostat since OCT is known to inhibit downstream nucleic acid extraction and PCR steps.

3.1.3 Library construction and sequencing

Paired-end DNA and RNA sequencing libraries were generated at Canada’s Michael Smith Genome Sciences Centre, and sequencing was performed using the HiSeq platform (version 3: Illumina, San Diego, CA, U.S.A). Average coverage for WGS was 80-100x on frozen tumour tissue and on germline DNA from blood. cDNA libraries for RNA sequencing were prepared from biopsy samples using strand specific RNA-Seq Sample Preparation kit (stranded, polyA+) from Illumina. Sequencing was performed on the Illumina HiSeq 2500 platform. A minimum of 200x coverage per sample was required for the targeted amplicon reads. RNA-Seq data was analyzed with JAGuar, and subsequently processed with previously published in-house pipelines to yield exon- and transcript-level read counts and RPKM values. Gene level RPKM values were calculated using a collapsed gene model.
The bioinformatics analysis pipelines used for this study are as described already in Chapter 2, Section 2.1.2.

3.1.4 Determination of tumour type

Each case was reviewed during weekly tumour board meetings by a multidisciplinary group of pathologists, medical oncologists, bioinformaticians, and computational biologists. Data for determining the pathologic diagnosis of the case was gathered from all four main components of each case’s comprehensive analysis: 1) clinical background, 2) histomorphology of the tumour specimen, 3) gene expression, and 4) mutation profiling and structural rearrangements.

1. Relevant clinical history was presented by the treating medical oncologist for each case. This information was provided to the genome analysts/bioinformaticians and pathologists at least four weeks prior to the discussion of the case so as to facilitate any relevant clinical interpretation.

2. The histomorphology assessment was gathered from previous pathology reports including diagnostic biopsies and/or resection specimens collected as part of routine management of the patient. All cases were reviewed centrally by an expert specialty-practice pathologist prior to being analyzed by the computational biologists and being presented to the multidisciplinary board.

3. Expression correlation analysis for tumour typing was undertaken relative to the entire set of normal and tumour transcriptomes in TCGA, as described in Section 2.1.2. Supervised Cancer Origin Prediction using Expression (SCOPE), described in detail in Chapter 4, was also evaluated as an ancillary automated tool to predict the cancer type from RNA-Seq expression profiles. Based on these two approaches, a reference cancer type was established. Fold-level changes in gene expression for the sample were calculated against a background of GTEx samples whose biology most closely matched the site of origin of the reference cancer type.
Individual gene changes in the sample were placed in the context of TCGA samples from the reference cancer type, and aberrant genes and pathways were highlighted through manual analysis by a computational biologist.

4. Mutation profiling was used to identify mutations and other genomic events that are known to be associated with certain cancer types. These associations arose from an internal knowledgebase curated through extensive literature review performed as part of the analytic pipeline for each case. When present, COSMIC variants were evaluated to assess support or confirmation of a particular diagnosis. The number of mutations present in the sample was compared against the TCGA reference cancer dataset to determine if the mutation burden was high or low.

Overall, a combination of mutation profile, mutation burden, gene expression based cancer type classification, integrative pathway analysis of gene expression changes and mutations, histomorphology report, immunohistochemistry, imaging, and other clinical metrics was used to suggest a tumour type/site of origin to the oncology clinical team. The cases were labelled as cancers of unknown primary (CUPs) when a specific diagnosis, including site and tissue type, could not be rendered after the pathology work-up of tumour tissue from the biopsied specimen.

3.1.5 Assessment of clinical input of whole-genome and transcriptome analysis in pathology

Genomic contribution to pathological diagnosis was assessed by reviewing the POG reports presented to the tumour board. These reports highlighted the pathway-level changes and key targetable genes based on the integrative analysis from bioinformatics. They also included information on the suggested diagnoses based on genomic and transcriptomic data, pathway analysis, and recommended lines of therapy. Two pathologists asked the following three questions.

Firstly, was the tumour diagnosis confirmed or re-aligned after the POG analysis, performed as described above?
This was determined by comparing the final diagnosis arising from the POG analysis (as determined from the POG reports) to the initial pathology diagnosis (see Section 3.1.4).

Secondly, was the molecular subtype significantly changed after the POG analysis? This pertained to genomic events known to be clinically associated with specific cancer subtypes; for example, a lung adenocarcinoma that tested negative for an ALK fusion through FISH but was found to harbor an actionable ALK gene fusion on WGS.

Thirdly, did the POG analysis augment the pathologic diagnosis workflow? It was said to ‘augment’ the pathologic diagnosis if a molecular alteration led to additional pathologic assessment (not already part of routine testing guidelines) of potential prognostic or predictive value without affecting the diagnosis. Of note, cases where genetic event(s) affected the oncologic management but could not be validated by histopathology methods were not said to augment the pathologic diagnosis workflow. Only those molecular abnormalities that could be confirmed by orthogonal clinically validated biomarker assays, in an accredited clinical laboratory and using tests already in clinical use, were considered. It should be noted that potentially ‘actionable’ molecular abnormalities were identified in most cases, but these would arise within the context of off-label therapies, the analysis of which was outside the scope of this study.

3.2 Results

3.2.1 Cohort demographics, clinical metrics, and sequencing data

An initial 492 cases were selected for this study from the POG trial based on material availability and completion of POG analysis. Of these cases, initial pathology diagnosis could be matched to a corresponding TCGA cancer type in 389 unique cases, which constituted the study cohort. Patients were 36% male and 64% female, and the 5-year survival was 45% with a median survival of three years. The cohort selection process is summarized in Figure 3.1.
Breakdown of the outcomes, separated by tumour types, is shown in Figure 3.2.

Figure 3.1: Cohort selection for the assessment of the impact of DNA and RNA sequencing analysis on histopathologic diagnosis in the POG clinical trial.

A detailed breakdown of the 389 cases included in this study, separated by the site of biopsy and by the diagnosed cancer type, is provided in Figure 3.4. As can be seen in this figure, the majority of tumours sampled were metastases to the liver, lung, or the lymph node. Cancer metastases accounted for 352/389 (90%) of the cases. In the remaining set, two (~1% of total) samples were of undetermined origin, 29 (7% of total) were primary tumours, and six (2% of total) were recurrences. Metastases sampled from the liver, lymph node, pelvis, soft tissue (like muscle), and the abdominal cavity were the most likely to have the initial diagnosis revised by genomic analysis. Among the cancer types that were prevalent in the cohort, breast cancer, colorectal adenocarcinoma, sarcomas, and lung adenocarcinomas formed the majority of cases.

3.2.2 Correlation of histopathologic diagnosis and next generation sequencing results

3.2.2.1 Impact of genomic analysis on histopathologic diagnosis and prognostic marker identification

Most pathologic diagnoses were supported by the genomic analysis and clinical findings

Of the 389 total cases reviewed, 15 cases presented as CUPs. The integrated genomics analysis agreed with the original pathologic diagnosis in 346 of the remaining 374 cases (92.51%). In most cases, RNA-Seq provided evidence towards the cancer diagnosis.

In 18 cases (4.81% of cases with a determinative initial pathologic diagnosis), the detailed genomic and transcriptomic analysis did not find any evidence that supported or challenged the initial histopathology diagnosis.
The majority of these cases were pancreato-biliary cancers (two cholangiocarcinomas, four pancreatic adenocarcinomas), followed by five gynecologic cancers with non-specific carcino-sarcomatoid histomorphology, and three metastatic breast cancers. Tumour content of these cases, as estimated through bioinformatics analysis, ranged from 25-90%, similar to the rest of the cohort. Biopsy site and patient characteristics were also not significantly different from the rest of the cohort (Figure 3.2).

Figure 3.2: Tumour types in the cohort are shown, along with the type of genomic data guiding major outcomes from the retrospective analysis evaluating the diagnostic utility of RNA-Seq and WGS.

Genomic analysis serves as a robust refractory tool for identifying prognostic molecular markers

In a further eight (2.14%) cases, the molecular features of the cancer were significantly redefined without a histopathology diagnosis change. These changes were significant to the extent of supporting different therapeutic regimens. Of those, five cases were breast carcinomas where the HER2 status was changed after the genomic analysis, from HER2 negative (determined using IHC, and FISH assays for equivocal cases) to HER2 amplified, based on the amplification/overexpression identified from copy number analysis and the gene expression profile. Two cases were lung adenocarcinomas where no driver mutations were identified on the routine IHC and gene sequencing panels, but where activating mutations in EGFR (L858R and an exon 20 insertion) were identified through the genomic analysis.
The last molecularly redefined case was a lung adenocarcinoma where the anaplastic lymphoma kinase (ALK) fusion oncogene, an important marker of a molecular subtype of lung adenocarcinomas, was negative on initial pathology testing through FISH, but genomic analysis revealed an ALK rearrangement (Figure 3.3).

3.2.2.2 Impact of genomic analysis on diagnosis of CUPs

Comprehensive whole-genome and transcriptome analysis identified misdiagnoses and putative primaries for CUPs

In two cases (0.53% of cases with a determinative initial pathologic diagnosis), the original pathology report diagnosis was found to be incorrect after molecular analysis and review. One case was initially diagnosed as a vulvar adenocarcinoma, but gene expression analysis closely aligned with breast ductal adenocarcinomas. This triggered a detailed pathology review and validation through IHC; the diagnosis was adjusted to a HER2 amplified mammary-like adenocarcinoma of the vulva, and the patient was treated with an ERBB2 inhibitor (described in Chapter 2). The second case was initially diagnosed as adenocarcinoma of likely ovarian origin, but comprehensive analysis of the gene expression and mutation profiles supported the diagnosis of ovarian clear cell carcinoma. Evidence included high expression of HNF1β, NAPSA (Napsin A), GPC3 (Glypican-3), an inactivating mutation in ARID1A, and copy gains in HNF1β and ERBB2 genes.

In 15 cases (3.9% of total cases), the initial pathology workup and clinical assessments could not confidently assign a tumour site of origin or a histomorphologic category. Nine cases were initially diagnosed as adenocarcinomas of unknown origin, two as squamous cell carcinomas, three as carcinomas, and one as an unclassifiable malignancy. The comprehensive genomic analysis was able to pinpoint the site of origin of these cancers confidently.

Figure 3.3: Detection of clinically relevant molecular alterations by whole-genome and RNA sequencing in the POG cohort. (A-C) Detection of HER2 amplification in a colorectal carcinoma is shown, as indicated by immunohistochemistry (IHC) staining for HER2 (overexpression, 3+) in the tumour sections in panels A) and B), and with FISH testing for additional copies of HER2 (HER2 to chromosome 17 centromere (CEP17) ratios > 2.0) in panel C). (D-F) ALK fusion identified in a lung adenocarcinoma, missed on initial FISH analysis. H&E staining of the tumour sample is shown in D). ALK IHC testing results showing equivocal ALK staining are represented in E), with the original negative FISH results (break apart probe test, less than 15% of cells showed break apart probes) shown in F). (G and H) Detection of an IDH1 mutation in a CUP supported the putative diagnosis of cholangiocarcinoma. The H&E staining is shown in G). Panel H) shows a snapshot of the Integrative Genomics Viewer track for the mutation location with proportional read-counts supporting the reference (G, in orange) and mutation (A, in green) in the tumour genome. This supported the putative diagnosis of this CUP as a cholangiocarcinoma in the clinical context, as aided by RNA-Seq analysis.

Figure 3.4: The outcome from genomic analysis is shown separated by A) the site of biopsy of the tumour, and B) the organ-system of origin of the cancer. M and P indicate the number of metastatic and primary/relapse samples respectively.

Figure 3.5: The final diagnoses for the 15 CUP cases and 2 cases with revised diagnosis are shown, along with the type of genomic data guiding each of the outcomes. WGS = Whole-genome sequencing.

Overall, within the highly frequent cancer types, breast cancer (N = 110) was almost always correctly diagnosed. Cholangiocarcinomas, esophageal squamous cell carcinomas, and colorectal adenocarcinomas were among the cancer types most refractory to initial histopathology-based diagnosis.
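As a quick consistency check, the case counts reported in this section can be tallied directly. The sketch below uses only numbers quoted in the text above; the variable names and category labels are illustrative and are not part of the study's analysis code.

```python
total_cases = 389
cup_cases = 15
with_initial_diagnosis = total_cases - cup_cases

# Outcomes for cases that had a determinative initial pathologic diagnosis.
outcomes = {
    "diagnosis supported": 346,
    "no evidence for or against": 18,
    "molecular subtype redefined": 8,
    "diagnosis revised": 2,
}

# The four outcome groups should account for every non-CUP case.
assert sum(outcomes.values()) == with_initial_diagnosis


def percent(part, whole):
    """Percentage rounded to two decimal places."""
    return round(100.0 * part / whole, 2)


for label, count in outcomes.items():
    share = percent(count, with_initial_diagnosis)
    print(f"{label}: {count}/{with_initial_diagnosis} = {share}%")
print(f"CUPs: {cup_cases}/{total_cases} = {percent(cup_cases, total_cases)}%")
```

Note that the outcome percentages use the 374 cases with an initial diagnosis as the denominator, whereas the CUP percentage uses all 389 cases.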
The type of genomic information used to determine a diagnosis for all 17 cases (including the 15 CUPs), and the resultant cancer diagnoses, are indicated in Figure 3.5.

Utility of SCOPE, an automated RNA-Seq based cancer diagnostic, for confirming diagnoses

The SCOPE algorithm was used retrospectively on these cases to assess the potential of automated tools in aligning diagnoses from RNA-Seq data in precision oncology workflows. SCOPE matched the final diagnosis in 273 of the 374 cases where an initial histopathology diagnosis was available (73%). When looking at tumour types with 10 or more cases, the SCOPE algorithm alone had the highest rate of success with breast carcinoma (BRCA, 87% accuracy, N = 96/110), ovarian carcinoma (OV, 84% accuracy, N = 16/19), and lung adenocarcinoma (LUAD, 80% accuracy, N = 28/35), as opposed to pancreatic adenocarcinomas (PAAD, 22% accuracy, N = 5/23), which were missed most often by the method.

Eleven tumour types had an accuracy of <50% when using SCOPE (N = 46 samples), including pancreatic adenocarcinoma (PAAD, 22% accuracy), cholangiocarcinoma (CHOL, 33% accuracy), uterine carcinosarcoma (UCS, 33% accuracy), head and neck squamous cell carcinoma (HNSC, 25% accuracy), lower grade gliomas (LGG, 33% accuracy), uveal melanoma (UVM, 20% accuracy), esophageal adenocarcinoma (ESCA_EAC, 20% accuracy), thyroid carcinoma (THCA, 40% accuracy), and follicular lymphoma (FL, 0% accuracy). Among these cancers, we found that the incorrect predictions were typically mis-identification of the cancer as a histologically similar cancer (46% of mispredictions, N = 21/46, Table 3.1). For example, the two follicular lymphoma (FL) cases were predicted as diffuse large B-Cell lymphomas (DLBCL) from the National Cancer Institute’s cohort (NCI_DLBCL), which contains some DLBCLs with FL-like features.
Two head and neck malignancies were predicted as other types of squamous cell carcinomas instead.

SCOPE is impacted by low tumour content in liver biopsies

Eight of the 46 mis-predictions matched the site of biopsy instead. In particular, for some liver biopsies the highest prediction was hepatocellular carcinoma (LIHC), and the second-highest prediction was the correct cancer type. Hypothesizing that these observations may be due to dilution of signal from the tumour itself, we decided to investigate the impact of biopsy site and tumour content on SCOPE’s outcome.

We observed a significant association between biopsy site and SCOPE outcome (p = 0.008, chi-square test; only biopsy sites with a minimum of 10 samples were considered, N = 282/374). In biopsy sites with at least 10 samples, tumour content was found to be significantly associated with SCOPE prediction in liver biopsies only (N = 124/282, adjusted p = 0.017, unpaired t-test). In Figure 3.6 we show the differences in outcome from SCOPE for the three most frequent sites of metastasis: the lymph node, lung, and liver.

SCOPE is a suitable automated method for providing a diagnosis for CUPs

In practice, the effect of possible tissue contamination from the biopsy site could be accounted for during the case discussion meetings where molecular data was reviewed. There, SCOPE classifications became a useful tool in combination with orthogonal analyses, especially to establish cell lineage in CUPs (RNA-Seq contributed to resolution in all 15 cases; SCOPE’s predictions were accurate in 8/15 cases, and in 3/15 additional cases matched the revised diagnosis confidently when accounting for biopsy site bias).
When considering the performance of cancer types that had low accuracy (nine cancer types), SCOPE’s predictions were not found to be significantly influenced by tumour content (Figure 3.7), reflecting a systemic inability in classifying these cancer types accurately.

Table 3.1: Classification outcome from SCOPE for the cancer cohorts. Columns 3 to 6 give the category that the SCOPE outcome matched.

| Cancer type | Class size | Final diagnosis | Cancer type from same organ system | Biopsy site | Other cancer type |
|---|---|---|---|---|---|
| BRCA | 110 | 96 (87%) | 0 (0%) | 10 (9%) | 4 (4%) |
| COADREAD | 58 | 41 (71%) | 7 (12%) | 8 (14%) | 2 (3%) |
| SARC | 42 | 29 (69%) | 2 (5%) | 2 (5%) | 9 (21%) |
| LUAD | 35 | 28 (80%) | 0 (0%) | 0 (0%) | 7 (20%) |
| PAAD | 23 | 5 (22%) | 6 (26%) | 5 (22%) | 7 (30%) |
| OV | 19 | 16 (84%) | 0 (0%) | 0 (0%) | 3 (16%) |
| SKCM | 9 | 9 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| CHOL | 6 | 2 (33%) | 3 (50%) | 0 (0%) | 1 (17%) |
| STAD | 6 | 5 (83%) | 1 (17%) | 0 (0%) | 0 (0%) |
| UCEC | 6 | 4 (67%) | 0 (0%) | 0 (0%) | 2 (33%) |
| UCS | 6 | 2 (33%) | 1 (17%) | 0 (0%) | 3 (50%) |
| ESCA_EAC | 5 | 1 (20%) | 4 (80%) | 0 (0%) | 0 (0%) |
| GBM | 5 | 4 (80%) | 1 (20%) | 0 (0%) | 0 (0%) |
| MESO | 5 | 4 (80%) | 0 (0%) | 0 (0%) | 1 (20%) |
| THCA | 5 | 2 (40%) | 0 (0%) | 1 (20%) | 2 (40%) |
| UVM | 5 | 1 (20%) | 3 (60%) | 1 (20%) | 0 (0%) |
| HNSC | 4 | 1 (25%) | 2 (50%) | 0 (0%) | 1 (25%) |
| LUSC | 4 | 4 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| ACC | 3 | 3 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| KIRP | 3 | 2 (67%) | 0 (0%) | 0 (0%) | 1 (33%) |
| LGG | 3 | 1 (33%) | 0 (0%) | 0 (0%) | 2 (67%) |
| FL | 2 | 0 (0%) | 2 (100%) | 0 (0%) | 0 (0%) |
| PRAD | 2 | 2 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| TGCT | 2 | 1 (50%) | 0 (0%) | 0 (0%) | 1 (50%) |
| THYM | 2 | 2 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| CESC_CAD | 1 | 0 (0%) | 0 (0%) | 0 (0%) | 1 (100%) |
| DLBC | 1 | 1 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| LIHC | 1 | 1 (100%) | 0 (0%) | 0 (0%) | 0 (0%) |
| PCPG | 1 | 0 (0%) | 0 (0%) | 1 (100%) | 0 (0%) |
| Total | 374 | 267 | 32 | 28 | 47 |

Percent of cases in cancer type shown in brackets with corresponding outcome.

Impact of genomic analysis on clinical workflow

In 14 cases (3.6%), the genomic and transcriptomic analysis prompted downstream testing by pathologists for markers
that had the potential to alter patient management. This group was composed of six lung adenocarcinomas, three breast carcinomas, three colorectal adenocarcinomas, one esophageal adenocarcinoma, and one anal squamous cell carcinoma. Lung adenocarcinomas had the highest fraction (N = 6/36, 17%) of cases where additional testing was suggested through this analysis. All other major tumour groups (class size > 10) had rates of less than 5%. Among all cases where the biomarker status was called under review as a consequence of the integrative analysis, 12 (86%) had HER2 overexpression (Figure 3.3). The remaining two cases were the identification of a ROS1 fusion in a lung adenocarcinoma, at a time when screening for ROS1 was not yet standard of care, and the detection of HPV integration in the DNA of an anal squamous cell carcinoma.

Figure 3.6: Impact of tumour content on the ability of RNA-Seq to provide the correct putative diagnosis in the POG cohort. The majority of samples arose from 3 biopsy sites - lymph node, lung, and liver, indicated in each of the panels. Wilcoxon test for significance between SCOPE outcome matching final diagnosis, versus each of the other categories: * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; ns p > 0.05

Figure 3.7: Impact of tumour content on the ability of RNA-Seq to provide the correct putative diagnosis in the POG cohort, agnostic of biopsy site. Wilcoxon test for significance between SCOPE outcome matching final diagnosis, versus each of the other categories: * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; ns p > 0.05

3.3 Discussion

In this study we assessed how whole-genome and RNA sequencing impact tumour diagnosis and biomarker assessment in the current clinical laboratory environment. We studied this within an ongoing clinical trial that leveraged sequencing data to profile advanced, treatment-resistant cancers and suggest alternative, targeted lines of therapy.
We showed that this approach has particular benefit for diagnosing and managing cancers of unknown primary. At the clinical level, integrative genomic analysis did not provide additional routine clinical guidance, with the exception of rare genomic events (ROS1 fusion, HPV integration) and HER2 marker status.

Integrative genomic analysis is important for diagnosing and assessing complex presentations like CUPs

Our capacity to generate high-resolution multi-omic data from tumour tissues has advanced to the point where genomic and transcriptomic analysis can be considered for integration in routine clinical oncology. This precision oncology approach has shown itself to be extremely valuable for understanding disease onset and advancement, determining clinically valuable molecular subtypes, and identifying targeted treatment hypotheses that have potential for clinical translation [30, 35, 146, 169]. It also has immense potential for resolving CUPs - indeed, in our series it was able to identify a tissue of origin in all our CUP cases. These cases formed a significant proportion of the analyzed cases (3.9%), and based on our analysis, RNA-Seq was important in determining a putative primary in all 15 of these cases. It is important to note that due to the often poorly differentiated and advanced nature of these cases, there were no gold standard diagnoses. Although post-POG testing using IHC markers could be done, some uncertainty about the tissue of origin could still remain. Nevertheless, knowing the site of origin of a cancer facilitates treatment decisions, helps the patient understand their disease, and provides some closure to those impacted.

Gene expression profiling for CUP diagnosis is not a novel approach. Commercial assays have been developed numerous times but have struggled to reach the clinic. The ESMO clinical guideline for CUP diagnosis and
treatment does not recommend gene expression profiling as an ancillary diagnostic test, but earlier studies in this domain agree that gene expression profiles are a useful prognostic marker. The microarray platforms used in the previous studies, however, are eclipsed by the dynamic range and quantifiable measurements obtained from RNA-Seq. These studies have also discounted the integration of gene expression and mutation analysis together for resolving CUPs. Our data show that mutational analysis added value in determining tissue of origin for most CUPs. The clinical value in CUPs of whole transcriptome RNA-Seq and mutation analysis is unclear, given the absence of approved therapies that match the potentially actionable changes that this process uncovers, but should be evaluated further.

In addition to considering the gene expression measurements and mutation profiles by themselves, we explored the utility of SCOPE, an automated method that could provide a diagnosis from the RNA-Seq data without requiring an expert bioinformatician as an intermediary. While the method was quite successful for a variety of tumours, it failed to perform well on certain challenging gastrointestinal malignancies like pancreatic adenocarcinomas and esophageal adenocarcinomas. We were able to determine that in these cases the method was often confounded by low tumour content, and upon accounting for the effect of the biopsy site, many of the predictions from the method reflected the ground truth.

Integrative genomic analysis provides improved testing options in patients already assayed through conventional histopathology tests

In this series of cases, integrative genomic analysis had the potential to improve management options over the conventional workup in 6% of cases, most of which consisted of identifying false-positive and false-negative findings from conventional assays.
The contribution of genomics to HER2 status determination is particularly valuable considering its importance as a prognostic factor, and given the high prevalence of false-positives and false-negatives in IHC-based screening of HER2 status in the clinic. In general, by identifying potentially actionable targets, the analysis uncovered non-traditional routes for cancer management (in the form of clinical trials for unapproved therapies, off-label drugs). Outcome analysis and assessment of off-label treatment options were outside of the scope of our study.

As expected, when considering clinically validated treatment options only, our results suggest that integrative genomic analysis did not offer any additional insights to supplement the current standard of evidence in cancer management. While these findings may point towards the efficiency of current testing guidelines for well-defined tumour histologies, based on our experience and observations within this cohort, we believe they underscore the gap between the small number of therapeutic options for which there is level 1 evidence to support usage, and the huge number of potential treatment targets determined through integrative genomic analysis for which there is no established efficacy data yet.

It is quite likely that as our understanding of oncologic mechanisms at the level of an individual advances, we will reach the point where targeted panels and single-target gene assays will prove insufficient to evaluate the sheer number of evidence-based actionable targets. In that scenario, whole-genome and transcriptome sequencing may become the most effective test, accompanied by automated analysis methods like SCOPE that complement this type of data. Our findings show promise in this direction, with 2% of our genomically analyzed cases being redefined histopathologically and 3.6% of cases impacting clinical workflow directly through additional testing and revised patient management.
Large-scale genomic and transcriptomic clinical trials, like the one described here and others conducted across the world [30, 169], will be an essential part of the progression of whole-genome and transcriptome sequencing into a clinical standard.

3.4 Conclusion

Our experience with whole-genome and transcriptome sequencing as part of a clinical trial has defined its strengths and limitations in the area of cancer diagnosis and resultant impact on clinical management. As previously demonstrated by other projects like this one, integrative genomic analysis is an excellent hypothesis-generating tool. It also has extreme value in identifying a putative primary for CUPs, guiding the management of these patients. What we find in addition to these previous studies is that in a noteworthy fraction of cases whole-genome and transcriptome sequencing also revised the findings of ancillary histopathology tests (FISH, IHC), revised patient management, motivated additional downstream testing through histopathology, and realigned the molecular diagnosis for well-established entities like breast cancer subtypes. Translation of this approach in the clinic requires evolution in the predictive testing landscape to the point where integrative analyses like these add significantly more benefit to the current histopathology-based approach.

There were several rare cancers that were excluded from this case series as the representative genomic datasets most suitable for their analysis are currently unavailable. The study of rare and complex cancers based on a reference dataset of cancers is a challenge in itself, with no evident solutions. The use of automated tools like SCOPE can help find the closest cancer type with available data, as has been shown in a published case study discussed later in Chapter 5. Aligning those findings to a clinical outcome still requires the generation and integrative analysis of these rare cancer types.
The development and validation of SCOPE are discussed next in Chapter 4.

Chapter 4

Development and validation of SCOPE - supervised cancer origin prediction using expression

Identification of the site of origin of a tumour in a patient is currently used to guide cancer treatment. It also informs any subsequent analysis through alignment with relevant tumour literature and expected molecular background. Currently, established pathology approaches are used for cancer diagnosis and are considered the gold standard. In most cases, this includes morphology- and histochemistry-guided approaches which also determine eligibility for drug regimens and clinical trials. Modern pathology is a process of sequential exclusion and prioritization across candidate diagnoses, but an exhaustive search is rendered impossible by limited tissue and diagnostic stains.

The efficiency of cancer diagnostics can be vastly improved if an automated method can be developed to approach this task with some knowledge of cancer biology, similar to a pathologist. A machine-learning method trained across diverse tumours and normal tissues will learn what characterizes each cancer, rather than its tissue site. Training on high-resolution molecular data will allow it to discover such tissue- and tumour-specific biological patterns from the entire transcriptome.

The use of gene expression data has outperformed traditional pathology workflows for cancer diagnosis in several landmark studies [3, 127, 132, 152]. Recent studies have also shown that transcriptome-wide profiling offers greater information about tumours than microarrays [74, 186], with utility in precision oncology [30, 96]. We can therefore use high-resolution transcriptomic data as an orthogonal approach to improve diagnostic accuracy in many cancers [58, 146].
While analyzing such high-dimensional data within a diagnostic workflow is not manually feasible, machine-learning methods can be trained to do so instead.

Here we describe the methods underlying Supervised Cancer Origin Prediction using Expression (SCOPE), a set of neural networks that use the transcriptome to identify the closest match for a tumour from amongst 40 cancer types and 26 normal tissues, and which was used as an ancillary tool for cancer diagnosis in the previous Chapter. We account for the influence of differentiation and biopsy site by including normal tissues (classes) from TCGA in our training dataset. We determine genes weighted heavily for decision-making and show that SCOPE is able to prioritize genes relevant to each class without any prior information.

SCOPE is trained devoid of any feature selection, and is able to achieve high precision and recall within The Cancer Genome Atlas (TCGA) primary cancer and adjacent normal cohorts. Our results suggest that using the entire transcriptome in a pan-cancer classification approach performs better than using feature selection. We also validate the classifier on an independent cohort of primary mesotheliomas, where we achieve classification accuracy of up to 100%. Lastly, we show high performance when using this method in external use cases, to identify the site of origin of (a) treatment-resistant metastatic cancers, biopsied from their site of origin, (b) cancers that are refractory to standard histopathology techniques for diagnosis, and (c) treatment-resistant metastatic cancers, biopsied from their site of metastasis. Another valuable application of this method lies in providing an objective, orthogonal source of differential diagnoses in cancers that are refractory to standard diagnostic practice.

4.1 Background

Pathology protocols for cancer diagnosis work best when the tissue specimens display high quality and recognizable histological features in a substantial number of cells.
Generic histological features alone are often not sufficient to subtype a tumour, hence the confirmation of cell-of-origin - typically via IHC - remains the bedrock of modern pathology practice. Therefore, diagnosis can become a challenging task of tiered, single-plex IHC analyses for lineage-specific proteins, iteratively evaluating the next likely diagnostic candidates. Limited tissue availability and a limited list of unambiguous IHC antibodies restrict the extent of validation work-ups. Inter-observer variability in pathology-based diagnoses, sample-related challenges, and limited tissue samples for immunohistochemical analyses can further restrict the ability to identify the underlying pathology of a biopsy sample. This is especially true for metastatic and poorly differentiated (high-grade) cancers.

Misdiagnosis rates for metastases in clinical practice can range between 45-94% in the event of challenging presentation (suboptimal sample quality, histologic similarity between tissues, poor differentiation). This is concerning since metastases can form up to 60% of distant recurrences and cause upwards of 90% of cancer-associated deaths for cancers detected in the gastrointestinal tract and across certain gynecological cancers [33, 144, 209]. Biomarker conversion in metastases can confound diagnosis from IHC and from biomarker-based assays. The site of biopsy is yet another confounder, particularly in the case of the liver. Previous work utilizing expression microarrays has indicated that the microenvironment can contribute to the enrichment of hepatic genes’ expression in liver metastases, confounding an accurate diagnosis. These issues are magnified in CUPs, where developing specific diagnostic protocols remains a challenge for pathology [10, 132, 206].

Inclusion of rare cancer types and providing a refined diagnosis remain challenges for current computational diagnostics.
In order to optimize training, rare cancer types are often excluded, and geographically proximal cancers are merged. This inevitably leads to loss of granularity and limited scope in the application of the models trained [55, 132]. Performance is evaluated on the test set, which can either be held out from the initial cohort, or preferably (but rarely) a cohort of samples generated and processed at different centers.

RNA-Seq has largely replaced microarrays for transcriptome-wide profiling. However, the current repertoire of diagnostics does not draw upon the high dynamic range and comprehensive coverage provided by RNA-Seq [30, 224]. Large-scale sequencing projects (The Cancer Genome Atlas, TCGA; International Cancer Genome Consortium, ICGC) have amassed RNA-Seq data from upwards of 10,000 patients with untreated primary cancers. This provides an unprecedented opportunity to apply machine-learning approaches to improve the classification of all cancer types. With the availability of high-performance computing systems, it is now also possible to train models using information about the transcriptional status of all genes.

4.2 Methods

4.2.1 Training data

Multi-platform RNA-Seq data was obtained from TCGA (multi-platform - Illumina Hi-Seq 2000 and Genome Analyzer II, processed with TCGA RNA-Seq v2 RSEM processing pipeline), the National Cancer Institute (NCI) non-Hodgkin lymphoma dataset (sequenced with Illumina Genome Analyzer II, median normalized), and non-cell-line primary tumour data from the Terry Fox Research Institute’s Glioblastoma Multiforme (GBM) project. Two in-house cancer cohorts, adult medulloblastoma (MB-Adult) and follicular lymphoma (FL), further supplemented this dataset (sequenced with Illumina HiSeq 2500).
Colon and rectum adenocarcinomas from TCGA were combined into a single cohort (COADREAD) due to their geographical proximity in primary lesions, supported by findings from our initial quality control that showed insufficient decomposition of these two cancer types based on their transcriptomic data. The TCGA RNA-Seq libraries were prepared by various different sequencing centers, but to facilitate harmonization across samples, the TCGA RNA-Seq v2 RSEM processing pipeline aligned all RNA-Seq reads in an unstranded manner.

Table 4.1: Cancer types used for training, with abbreviations referenced in text.

| Code Name | Full Name | Normal | Tumour |
|---|---|---|---|
| ACC | Adrenocortical Carcinoma | | 79 |
| BLCA | Urothelial Bladder Carcinoma | 19 | 408 |
| BRCA | Breast Ductal Carcinoma | 113 | 1095 |
| CESC_CAD | Cervical and Endocervical Adenocarcinoma | 3 | 47 |
| CESC_SCC | Cervical Squamous Cell Carcinoma | 6 | 257 |
| CHOL | Cholangiocarcinoma | 27 | 36 |
| COADREAD | Colorectal Adenocarcinoma | 51 | 372 |
| DLBC | Diffuse Large B-Cell Lymphoma | | 48 |
| DLBC_BM | DLBCL Blood/Bone Marrow | | 11 |
| ESCA | Esophageal carcinoma | 3 | 15 |
| ESCA_EAC | Esophageal Adenocarcinoma | 24 | 79 |
| ESCA_SCC | Esophageal Squamous Cell Carcinoma | 6 | 90 |
| FL | Follicular Lymphoma | | 50 |
| GBM | Glioblastoma Multiforme | 15 | 161 |
| HNSC | Head and Neck Squamous Cell Carcinoma | 44 | 520 |
| KICH | Kidney Chromophobe Carcinoma | 25 | 66 |
| KIRC | Clear Cell Kidney Carcinoma | 72 | 533 |
| KIRP | Papillary Kidney Carcinoma | 32 | 290 |
| LAML | Acute Myeloid Leukemia | | 173 |
| LGG | Lower Grade Glioma | | 516 |
| LIHC | Liver Hepatocellular Carcinoma | 50 | 371 |
| LUAD | Lung Adenocarcinoma | 59 | 515 |
| LUSC | Lung Squamous Cell Carcinoma | 50 | 501 |
| MB-Adult | Adult Medulloblastoma | | 143 |
| MESO | Mesothelioma | | 87 |
| NCI_GPH_DLBCL | Diffuse Large B-Cell Lymphoma (NCI cohort) | | 111 |
| OV | Ovarian Serous Cystadenocarcinoma | | 305 |
| PAAD | Pancreatic Ductal Adenocarcinoma | 12 | 178 |
| PCPG | Paraganglioma & Pheochromocytoma | 9 | 179 |
| PRAD | Prostate Adenocarcinoma | 52 | 497 |
| SARC | Sarcoma | 6 | 259 |
| SKCM | Cutaneous Melanoma | 3 | 469 |
| STAD | Stomach Adenocarcinoma | 35 | 415 |
| TFRI_GBM_NCL | Glioblastoma Multiforme (TFRI cohort) | | 52 |
| TGCT | Testicular Germ Cell Cancer | | 150 |
| THCA | Thyroid Carcinoma | 59 | 505 |
| THYM | Thymoma | 6 | 120 |
| UCEC | Uterine Corpus Endometrial Carcinoma | 24 | 177 |
| UCS | Uterine Carcinosarcoma | | 57 |
| UVM | Uveal Melanoma | | 80 |

Abbreviations: NCI - National Cancer Institute; TFRI - Terry Fox Research Institute

This resulted in a dataset of 10,822 transcriptomes spanning 40 different untreated primary tumour types and 26 adjacent normal tissue types (66 ‘classes’), with individual class sizes ranging from three to 1095 samples. No feature selection was done on the consolidated set of transcriptomes besides filtering for (a) genes with a recorded RPKM value in every sample (N = 21,220), and (b) genes that overlapped with available annotations for our independent test sets (N = 17,688).
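The two filtering steps above can be illustrated with a minimal sketch. It assumes each sample is a mapping from gene symbol to RPKM, with `None` standing in for a missing measurement; the function name `filter_genes`, the toy gene list, and the toy values are hypothetical and are not taken from the thesis pipeline.

```python
from typing import Dict, List, Optional, Set


def filter_genes(samples: Dict[str, Dict[str, Optional[float]]],
                 test_genes: Set[str]) -> List[str]:
    """Keep genes that (a) have a recorded value in every training sample
    and (b) are also annotated in the independent test sets."""
    shared: Optional[Set[str]] = None
    for profile in samples.values():
        # Genes with a recorded (non-missing) value in this sample.
        recorded = {gene for gene, value in profile.items() if value is not None}
        shared = recorded if shared is None else shared & recorded
    return sorted((shared or set()) & test_genes)


# Hypothetical toy data: EGFR is missing in s2, GENE_X is absent from the test annotations.
training = {
    "s1": {"TP53": 12.1, "EGFR": 3.4, "KRAS": 0.0, "GENE_X": 1.0},
    "s2": {"TP53": 8.7, "EGFR": None, "KRAS": 2.2, "GENE_X": 0.5},
}
kept = filter_genes(training, test_genes={"TP53", "KRAS", "EGFR"})
print(kept)
```

Note that a value of 0.0 counts as recorded in this sketch; only absent measurements remove a gene.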
Each of the 10,822 samples across the 66 tumour and adjacent normal classes was therefore represented by 17,688 distinct median-normalized gene RPKM values. Table 4.1 shows the annotations used, following TCGA nomenclature. The table also shows the number of training samples available for each cancer type.

4.2.2 Test data

The trained ensemble of neural networks was validated retrospectively on a set of primary mesotheliomas (MESO) published as part of an independent study by Genentech. Test sets for adult metastatic disease and 15 cancers of unknown primary were obtained retrospectively from the Personalized OncoGenomics clinical trial at BC Cancer. The attributes of these datasets are as follows.

Genentech primary mesothelioma dataset

Mesothelioma is a rare and aggressive cancer arising in the linings of the lung, abdomen, or heart. An independent set of 211 adult primary untreated mesotheliomas was obtained from the Genentech Mesothelioma cohort. 126 of these samples are classic epithelioid mesotheliomas, while 85 are sarcomatoid variants. As the training set of mesotheliomas consisted of histologically classic epithelioid mesotheliomas, testing was as follows. For the epithelioid mesotheliomas, we tested whether the classification was exclusively for mesothelioma. For the biphasic and sarcomatoid variants, we tested whether the classification was split between sarcomas and mesotheliomas, as would be expected based on the mixed histology of the samples. The Genentech Mesothelioma dataset has 211 transcriptomes from untreated, primary lung biopsies of mesothelioma, spanning 4 distinct molecular subtypes of mesothelioma: sarcomatoid (N = 29), epithelioid (N = 54), biphasic-epithelioid (N = 72), and biphasic-sarcomatoid (N = 56).
The RNA-Seq libraries were prepared using the TruSeq RNA Sample Preparation kit (unstranded, polyA+) from Illumina, and sequenced on the HiSeq 2500 (~66 million paired-end reads per sample).

Personalized OncoGenomics (POG) trial metastatic disease dataset

Cases of adult metastatic disease were selected based on the following criteria: (a) a primary of origin was identified, based on a joint consideration of clinical/pathology/genomic data, and (b) cDNA libraries prepared from the biopsy sample passed in-house quality control. Based on these criteria, we identified 201 samples spanning 26 different cancer types, summarized in Table 4.2, which follows TCGA nomenclature and shows the number of samples for each cancer type. 168 of the 201 metastases were biopsied from their site of metastasis (24 cancer types), and the remaining 33 from their site of origin (12 cancer types).

Table 4.2: Breakdown of cancer types in the external metastatic cohort.

| Code Name | Organ-System | Full Name | Count |
|---|---|---|---|
| ACC | Endocrine | Adrenocortical Carcinoma | 2 |
| BRCA | Breast | Breast Ductal Carcinoma | 70 |
| CESC_CAD | Gynecologic | Cervical and Endocervical Adenocarcinoma | 1 |
| CHOL | Gastrointestinal | Cholangiocarcinoma | 5 |
| COADREAD | Gastrointestinal | Colorectal Adenocarcinoma | 22 |
| SKCM | Skin | Cutaneous Melanoma | 3 |
| DLBC | Hematologic | Diffuse Large B-Cell Lymphoma | 1 |
| ESCA_EAC | Gastrointestinal | Esophageal Adenocarcinoma | 2 |
| ESCA_SCC | Gastrointestinal | Esophageal Squamous Cell Carcinoma | 4 |
| FL | Hematologic | Follicular Lymphoma | 1 |
| GBM | CNS | Glioblastoma Multiforme | 4 |
| LIHC | Gastrointestinal | Liver Hepatocellular Carcinoma | 1 |
| LGG | CNS | Lower Grade Glioma | 2 |
| LUAD | Thoracic | Lung Adenocarcinoma | 18 |
| LUSC | Thoracic | Lung Squamous Cell Carcinoma | 1 |
| MESO | Thoracic | Mesothelioma | 5 |
| OV | Gynecologic | Ovarian Serous Cystadenocarcinoma | 7 |
| PAAD | Gastrointestinal | Pancreatic Ductal Adenocarcinoma | 11 |
| KIRP | Urologic | Papillary Kidney Carcinoma | 2 |
| PRAD | Urologic | Prostate Adenocarcinoma | 1 |
| SARC | Soft Tissue | Sarcoma | 23 |
| STAD | Gastrointestinal | Stomach Adenocarcinoma | 3 |
| TGCT | Urologic | Testicular Germ Cell Cancer | 1 |
| THYM | Hematologic | Thymoma | 1 |
| UCS | Gynecologic | Uterine Carcinosarcoma | 5 |
| UCEC | Gynecologic | Uterine Corpus Endometrial Carcinoma | 6 |

Abbreviations: CNS - Central Nervous System

Cancers of unknown primary

Additionally, the POG cohort contained 15 cases where the primary site of origin could not be determined by initial pathology analysis. Genomic and transcriptomic analysis as part of the POG project determined the corresponding cancer type for all 15 of these cases, which was used as the gold standard for assessing the prediction from the classifier. The classification was performed retroactively after the closest suitable cancer type had been determined based on detailed pathway-level and genomic analysis of the cancer.

4.2.3 Model training

Figure 4.1: Performance of SMOTE as compared to other class expansion methods. Cross-validation results on the TCGA training dataset are shown. Abbreviations: dup - duplication of samples in small classes, none - no class expansion applied, weight - inverse cost for misclassification of smaller classes during training.

For the initial selection of the optimal classification algorithm, gene RPKMs were used as input. Support vector machines, random forests, extra trees, and a fully connected neural network were compared. Five-fold cross-validation with grid search was used to identify the best parameters for each of these algorithms.
The trained models were subsequently tested on the one-fifth held-out set.

Because the other ensemble models (random forest, extra trees) had near-equivalent 5-fold cross-validation results to the neural network during training, we evaluated the utility of extending the neural network model. An ensemble was developed by training multiple neural networks with different linear transformations of the data. The resultant classifier (SCOPE) contained five neural networks. For one of these neural networks, we synthetically generated additional samples to expand the rarer classes during training (SMOTE - Synthetic Minority Oversampling Technique). We compared SMOTE with other commonly used class expansion strategies in machine learning and found it to outperform the others (Figure 4.1). The differences in the five networks are described in detail in Table 4.3.

Training data was randomly split up into 4/5th ‘model training’ data, and 1/5th ‘held-out test’ data. The training and test splits maintained relative class frequencies. In classes with less than five samples (six classes, all adjacent normal), one sample was randomly assigned to the held-out test set, and the remaining samples were kept in the model training set. All models discussed in this paper were trained on the 4/5th model training data, with the held-out test set used as the first external validation of performance of the fully trained models. Stratified 5-fold Cross Validation (5-CV) was used for hyperparameter selection for each algorithm. The StratifiedKFold function in the scikit-learn package in Python was used to generate class-balanced CV folds.

Data normalization and feature selection

Data transformation and feature selection

Technical artefacts in the training data can cause over-fitting while training a classifier. This results in a classifier that performs quite well on the training data, but does not generalize to samples that it has not seen during training.
Two main approaches to overcome over-fitting prior to training are (a) data pre-processing, and (b) feature selection. Data pre-processing is generally done by re-scaling the input data to fall within a certain range of values, or by forcing it to follow a certain distribution (e.g. a normal distribution for expression data). Feature selection can be done by several methods, but usually, a subset of features that are critical to distinguishing the training cohorts is selected using feature reduction methods like Principal Component Analysis (PCA) or pair-wise analysis of variance (ANOVA). This subset is then used to train the classifier.

We assessed the utility of data transformation and feature selection in improving the best performing model in the previous step, the shallow neural network. To this end, various scaling and data transformation methods, namely minmax scaling, L2 norm scaling, and rank normalization (average), were assessed separately. The performance of each approach was assessed by stratified 5-CV.

Subsequent to selection of the optimal algorithm as described, we tested the utility of feature selection in improving classification performance. Guided by previous work, we used pair-wise ANOVA of log-transformed training data to identify a subset of 3,000 genes that are statistically significant at discriminating the training classes. We also trained a classifier using COSMIC’s list of 552 genes harboring somatic mutations.
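As an illustration of the three data transformations compared above, the sketch below applies minmax scaling, L2 norm scaling, and average-rank normalization to a single expression profile. Applying each transformation per sample (rather than per gene) is an assumption made here for illustration, as is the toy profile; the function names are hypothetical.

```python
import math
from typing import List


def minmax_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def l2_normalize(values: List[float]) -> List[float]:
    """Divide by the Euclidean norm so the squared values sum to 1."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0:
        return [0.0 for _ in values]
    return [v / norm for v in values]


def rank_average(values: List[float]) -> List[float]:
    """Replace values by 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


profile = [0.0, 5.0, 5.0, 10.0]  # hypothetical RPKM values for four genes
scaled = minmax_scale(profile)
unit = l2_normalize(profile)
ranked = rank_average(profile)
```

Rank normalization discards the magnitude of expression differences and keeps only their order, which is one reason the transformations can behave differently on the same classifier.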
Neural network architectures optimal for each input space were identified using grid search across parameters, and trained with 5-CV for comparison.

Class expansion using Synthetic Minority Oversampling

A supervised machine-learning classifier works by seeing multiple different samples representing each cancer/tissue type and steadily learning which genes (features) are most valuable in identifying each type of interest. A common problem with this approach is that a classifier can sometimes fail to appreciate the features that characterize the smaller cancer/tissue types. This class imbalance can be overcome by pre-processing the training set in specific ways: by duplicating some of the samples in the smaller class(es), by ‘punishing’ the classifier more for making a mistake with the smaller classes, or by supplementing the smaller classes with synthetic samples. One such method for adding synthetic samples to smaller classes is Synthetic Minority Oversampling (SMOTE).

We trained and assessed the performance of the RPKM-based neural network classifier using three different class expansion approaches: (a) duplicating samples randomly in the smaller cohorts to inflate their total sample size to that of the largest class, (b) adding an inverse weight factor for mis-classification of smaller classes (i.e. making it more expensive for the classifier to mislabel a sample from a smaller class during training), and (c) adding synthetic samples using SMOTE. We compared these three approaches to (d) doing no class expansion. Duplicated/synthetic samples were only added to the training folds, so that the cross-validation test fold always contained only non-synthetic samples that were absent from the training folds.
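The idea of expanding only the training fold can be sketched as follows. This is a simplified, SMOTE-style interpolation written in plain Python for illustration, not the implementation used in this work: the function names, the choice to grow every class to the size of the largest class, and the toy data are all assumptions.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def smote_like(minority: Sequence[Vector], n_new: int, k: int = 5,
               rng: Optional[random.Random] = None) -> List[Vector]:
    """Create n_new synthetic samples, each interpolated between a random
    minority sample and one of its k nearest same-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        # Sort the remaining class members by squared Euclidean distance to base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k]) if others else base
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def expand_training_fold(X: List[Vector], y: List[str],
                         seed: int = 0) -> Tuple[List[Vector], List[str]]:
    """Grow every class in a training fold to the size of the largest class
    (assumed target size). Test folds are never passed through this function."""
    rng = random.Random(seed)
    by_class: Dict[str, List[Vector]] = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        extra = smote_like(rows, target - len(rows), rng=rng)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Hypothetical training fold with one large and one small class.
X_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0], [12.0, 12.0]]
y_train = ["big"] * 4 + ["small"] * 2
X_aug, y_aug = expand_training_fold(X_train, y_train)
```

Because the original samples are kept at the front of the returned lists, the synthetic samples are the trailing entries, and each one lies on the segment between two real samples of the same class.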
The synthetic sampling algorithm of SMOTE was adopted from Chawla et al.

Metrics for evaluation

Since the training cohorts vary widely in size (N = 3-1025), accuracy alone would not necessarily reflect the ability of the classifier to discriminate between all 66 output classes. Precision, recall, and F1 score were therefore used to evaluate models and demonstrate their performance. Aggregate precision and F1 scores, where reported in text, are accompanied by 95% CIs. Precision is defined as (true-positives)/(true-positives + false-positives), and intuitively represents the classifier's ability to avoid labelling negative cases as positive. Recall is defined as (true-positives)/(true-positives + false-negatives), and intuitively represents the classifier's ability to correctly identify all positive cases. The F1 score is the harmonic mean of the precision and recall. These metrics are calculated for each individual class, and the mean is reported as the cohort metric. Accuracy is reported as (true-positives + true-negatives)/(total cases), and is calculated for the entire cohort.

A paired χ² test for association between prediction accuracy and tumour content was performed on the metastatic test cohort, with the null hypothesis being, "the classification accuracy of SCOPE is independent of tumour content." Tumour content was determined by pathology analysis. Student's paired t-test was used to test the association between prediction accuracy and confidence score (null hypothesis: no correlation exists between prediction accuracy and confidence score). The level of significance was 2-sided P = 0.05 for all tests of association. Pearson correlation was used to evaluate association between class-specific accuracy and training class size.
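The per-class precision, recall, and F1 definitions given in this section, and their unweighted mean across classes, can be sketched as below. The function names and the toy labels are illustrative assumptions; this is not the evaluation code used in this work.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} using one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) over classes, so that
    small classes count as much as large ones."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy example with three classes of unequal size.
y_true = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "UCS"]
y_pred = ["BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "BRCA"]
by_class = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(by_class)
print(macro_average(by_class), accuracy)
```

In this toy case the single UCS sample is missed, which pulls the macro F1 well below the accuracy. That gap is the reason a class-averaged metric is more informative than accuracy when class sizes are imbalanced.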
Statistical tests were conducted using the base statistics package available in R (R version 3.5.0; RStudio API version 1.1.442; R Project for Statistical Computing).

For a given input, the ensemble generates a pooled confidence score for each of the 66 output classes. Predicted classes are jointly ordered by the confidence score and the number of machines in agreement. This max vote-pooling method was used to obtain a quantitative confidence score for each category. This confidence score was taken as a proxy for differential diagnosis when assessing metastatic samples. Thus, in the event that the prediction from the ensemble classifier was split between different cancer types, the correctness of the prediction was assessed by comparing the diagnosed cancer type against the pool of confident predictions.

4.2.4 Algorithmic model selection

For the initial selection of the optimal classification algorithm, RPKMs were used as input. Support Vector Machines (SVM), Random Forests (RF), Extra Trees (ET), and a fully-connected neural network (NN) were compared. 5-fold cross-validation (5-CV) with grid search was used to identify the best parameters for each of these algorithms. The trained models were subsequently tested on the held-out set of 1/5th of the total samples.

A shallow neural network (with a hidden layer of 17,000 units, tanh activation, learning rate = 0.001, L2 regularization cost = 0.0001) was found to be the top performing model on the held-out test set. As the other ensemble models (RF, ET) had near-equivalent 5-CV results to the NN during training, we evaluated the utility of extending the NN model. An ensemble was developed by training multiple neural networks with different linear transformations of the data. The resultant classifier (SCOPE) contained five neural networks.
The F1-Score was used as the main metric of assessment, to account for class imbalances in the training and test sets.

4.2.5 Ensemble selection

Based on our observations from Section 4.2.3, we built an ensemble of neural networks that used both RPKM and rank-normalized training data as input across varying architectures and regularizations. This extended our selected classification model to include five additional neural network architectures. The additional neural networks were selected using the 5-CV approach discussed already. Furthermore, these neural networks were evaluated on the held-out test set (1/5th of the training data), which was set aside prior to cross-validation, and networks (machines) that performed on par with or better than the RPKM-only, transcriptome-wide neural network on the held-out test set were used to build an ensemble classifier. The resultant ensemble classifier contained five neural networks, with each 'neural network machine' in the ensemble assigning a confidence score (as represented by class probability) for each output class.

For the assessment of metastatic samples biopsied from the site of metastasis, the confidence of prediction was taken into account as an evaluation of differential diagnosis. In the event that the prediction from the classifier was split between different cancer types, the 'correctness' of the prediction was assessed by comparing the diagnosed cancer type against the split pool of predictions. This cohort represents advanced, treatment-resistant metastatic disease that has undergone multiple rounds of selective pressure from its local environment and chemotherapy regimens. We included a previously published baseline linear comparator for our classification method on this dataset, in order to identify the lower bound for transcriptomics-based characterization of these cases using primary cancer data.
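The pooling of per-network class probabilities into a ranked list of candidate diagnoses can be sketched as follows. The text states that classes are ordered jointly by confidence score and by the number of machines in agreement; the exact ordering rule is not given, so this sketch assumes votes are compared first and mean confidence second. The function name and toy probabilities are hypothetical.

```python
def pool_predictions(member_probs):
    """member_probs: list with one dict per network, {class_label: probability}.
    Returns (label, votes, mean_confidence) tuples, best candidate first."""
    labels = list(member_probs[0].keys())
    n_members = len(member_probs)
    # Mean probability across ensemble members is the pooled confidence.
    mean_conf = {c: sum(p[c] for p in member_probs) / n_members for c in labels}
    # A network "votes" for the class it ranks highest.
    votes = {c: 0 for c in labels}
    for probs in member_probs:
        votes[max(probs, key=probs.get)] += 1
    # Assumed ordering: number of votes first, pooled confidence second.
    ranked = sorted(labels, key=lambda c: (votes[c], mean_conf[c]), reverse=True)
    return [(c, votes[c], mean_conf[c]) for c in ranked]


# Toy output of five networks for a sample with split evidence.
members = [
    {"MESO": 0.7, "SARC": 0.2, "LUAD": 0.1},
    {"MESO": 0.6, "SARC": 0.3, "LUAD": 0.1},
    {"MESO": 0.5, "SARC": 0.4, "LUAD": 0.1},
    {"MESO": 0.3, "SARC": 0.6, "LUAD": 0.1},
    {"MESO": 0.2, "SARC": 0.7, "LUAD": 0.1},
]
for label, n_votes, confidence in pool_predictions(members):
    print(label, n_votes, round(confidence, 2))
```

Keeping the full ranked list, and not only the top class, is what allows a split prediction such as mesothelioma plus sarcoma to be read as a differential diagnosis.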
Table 4.3: Architecture, identifying names, and additional information for each neural network in the SCOPE ensemble.

Model name | Architecture | Data pre-processing | Additional rules
none17k | 17688 x 17000 x 66 | None (RPKM) | None
none17kdropout | 17688 x 17000 x 17000 x 66 | None (RPKM) | Dropout (10%) input in training
smotenone17k | 17688 x 17000 x 66 | None (RPKM) with SMOTE samples in training | None
rm500 | 17688 x 500 x 66 | Rank norm + minmax(0,1) scaling | None
rm500dropout | 17688 x 500 x 500 x 66 | Rank norm + minmax(0,1) | Dropout (10%) input in training

4.2.6 Feature weights analysis for neural network

Following training, the weights and biases for each layer were extracted using the lasagne.layers.get_all_param_values(network) function. Subsequently, following the rules of weight propagation in fully connected neural networks, a forward multiplication loop was evaluated, resulting in a matrix of dimensions [Number of genes, Number of output categories]. For each output category, the resultant network weights were sorted, and the top 100 genes with the highest weights for the class were saved. This was done over the five cross-validation models for each neural network, resulting in 25 lists of top-100 genes. For a given neural network, genes found to be top-ranked in at least three out of five CV folds were identified. Subsequently, for each category, the NN-specific top genes were filtered for occurrence in at least three of the five neural networks, resulting in a set of important genes for each cancer type and normal tissue in the classification categories (Appendix Table 1).

4.3 Results

A total of 10,688 adult patient samples representing 40 untreated primary tumour types and 26 adjacent-normal tissues were used for training. Among the training data set, 5,157 of 10,244 (50.3%) were male and the mean (SD) age was 58.9 (14.5) years. Testing was performed on 211 patients with untreated primary mesothelioma (173 [82.0%] male; mean [SD] age 64.5 [11.3] years); 201 patients with treatment-resistant cancers (141 [70.1%]
female; mean [SD] age, 55.6 [12.9] years); and 15 patients with cancers of unknown primary. Among the treatment-resistant cancers, 168 were metastatic, and 33 were the primary presentation. In our study, SCOPE achieved 97% accuracy and a macro F1-score of 0.92 on the 2,780 cases in the TCGA held-out set.

Using the whole transcriptome as input gave better performance than using the COSMIC cancer gene set or ANOVA-selected genes (Figure 4.2 A). The single neural network outperformed other machine-learning algorithms (Figure 4.2 B). For 46 out of the 66 classes, 80-100% of the samples in each class were correctly classified (Figure 4.2 C). We found that seven classes were refractory to appropriate classification, among which three were cancer types (esophageal carcinomas and adenocarcinomas and cervical cancers), and all seven had fewer than 50 training examples (class size range, 3-50). On closer investigation of the five neural networks in the ensemble, we found that the neural network trained with SMOTE-supplemented training examples showed improved performance on smaller classes compared with the other four (Figure 4.3).

The performance of the model on the held-out set was better than that quantified through cross-validation on the training dataset (Figure 4.2 A). This is because of the difference in training set size between cross-validation and held-out evaluation. The class-specific metrics show a positive difference between the cross-validation runs and the held-out set, but only for the classes where the total number of samples is extremely low (N < 50, Figure 4.3). As cross-validation only happens on 80% of the data, which in turn is split into an 80% training fold and a 20% cross-validation test fold, smaller classes have fewer samples to train on in a cross-validation run. As a result, the performance of the classifier is poorer on the cross-validation test folds for such classes. However, when testing on the 20% held-out set, we are training the model on the entire remaining 80% of the data.
While this is of little consequence to classes that are well represented, the smaller classes are more thoroughly learnt during training.

Figure 4.2: Results from algorithm and feature selection experiments, and performance on held-out test set. A) Feature selection does not improve pan-cancer classification. B) Comparison of algorithms: performance of single neural network on held-out set is higher than other algorithms. C) Validation of SCOPE on TCGA held-out set demonstrates high discriminatory power amongst most cancer types. Point with bar represents average F1-score and standard deviation spread for corresponding category. Incorrect predictions for more than 10% of samples belonging to a given cancer type are shown by curved directed edges. Curve width indicates relative fraction of samples in misprediction set. Mispredictions occur amongst cancer types with the same organ-system of origin. Specific trends are discussed further in the section on classification anomalies and biological similarities in the held-out set.

Figure 4.3: Performance of various models that make up SCOPE, on the cross-validation and held-out sets. The x-axis is ordered by increasing class size. Performance is reported as precision for the test-folds from CV (in black) and for all samples in the held-out set (in yellow). Number of samples in training are shown in the upper histogram panel. Cancer codes follow TCGA nomenclature and are defined in Table A.1, with _TS samples indicating tumours and _NS samples indicating adjacent normal tissues. The difference between CV-fold performance and held-out performance is typically larger for small classes. The difference becomes insignificant as class size approaches N > 100. When the classifier is augmented with addition of synthetic samples in the training folds (last panel), we observe an overall increase in performance for the smaller classes with a concomitant reduction in the performance gap between mean-CV-precision and held-out precision. The line of best fit (loess) is indicated for each model, with standard error bounds in grey. The spread of performance across different CV folds is shown by the black point (mean) with 1 standard deviation bars.

4.3.1 Association of classification anomalies and biological similarities in held-out set

Among the poor-performing classes in the TCGA held-out set, certain patterns were evident. The three kidney adjacent-normal classes (KICH, KIRC, KIRP) had significant cross-calling, which was as expected because all three represent healthy kidney tissue. Esophageal carcinomas and adenocarcinomas were often misclassified as stomach adenocarcinomas. For cervical cancers, which can be squamous, adenosquamous, and adenocarcinomas, subtypes were also challenging to distinguish by SCOPE. We found these trends were replicated in unsupervised clustering of the RNA sequencing data, suggesting biological rationale for the same (Figure 4.4).

As further evidence, we observed other molecular patterns previously noted in literature in our results. The endometrium is a common site of occurrence for uterine carcinosarcomas, and an endometrioid carcinoma-like profile is a well-documented molecular subtype of uterine carcinosarcomas. We found that uterine carcinosarcoma was frequently misclassified as uterine corpus endometrial carcinoma. The Cancer Genome Atlas analysis has found that a majority of uterine carcinosarcoma samples had serous-like endometrial carcinoma precursors. This cross-calling was also observed by another group using this data set for classification.

4.3.2 Prioritization of known diagnostic gene features without prior knowledge

Manual review of the high-importance genes summarized in Appendix Table 1 showed that the genes prioritized for each class were biologically relevant to the corresponding cancer or normal tissue type.
For example, two kidney-specific genes, UMOD and AQP2, were exclusively associated with the adjacent normal tissues from all three renal cancer types in training. Known diagnostic markers for renal clear cell carcinoma, namely CA9 and CA12, were associated with renal clear cell carcinoma. Important genes for testicular germline cancers, POU5F1, GDF3, and NANOG, are known and proposed biomarkers. High POU5F1 (OCT4) and NANOG expression is associated with spermatogenesis dysregulation. Unexpectedly, in the absence of a healthy tissue class corresponding to a primary tumour type, some important genes for the cancer reflect biological characteristics of the progenitor healthy tissue, such as DPPA3/5 for testicular germline cancers, and TYR and MLANA for uveal melanomas. These observations underscore the value of including adjacent normal tissues for a high-dimensional pan-cancer classifier.

Figure 4.4: t-SNE plot of transcriptomic data in TCGA training cohorts. The relevant gynecologic and gastrointestinal cancer types are shown, and reflect the trends of cross-calling observed in SCOPE. Esophageal adenocarcinoma (ESCA_EAC) and stomach adenocarcinoma (STAD) cluster together, as do uterine carcinosarcomas (UCS) with uterine corpus endometrial carcinomas (UCEC).

4.3.3 External validation on primary cancers

Mesothelioma is a cancer that arises in the pleura, which lines the lungs. Three main histologic categories have been defined within mesothelioma: epithelioid, sarcomatoid, and a biphasic type that presents a combination of features from the former two.
Subtype diagnosis in mesothelioma influences patient prognosis and disease management, but without specialized histopathologist training, there is low agreement between diagnoses. We applied SCOPE to a previously published cohort of primary, untreated mesothelioma subtypes.

Characterizing cancers with mixed histology

We obtained 99.2% accuracy (125 of 126) in identifying epithelioid mesotheliomas and biphasic-epithelioid cancers in this cohort. This is as expected, because SCOPE was trained to identify epithelioid mesotheliomas (this subtype was exclusively represented in the mesothelioma training set). Twenty-three of 29 sarcomatoid mesotheliomas (79.3%) and 55 of 56 biphasic-sarcomatoid mesotheliomas (98.2%) were predicted with split confidence between mesothelioma and sarcoma (Table 4.4). In addition, four of the remaining six sarcomatoid subtype samples were predicted confidently as sarcomas. Appendix Figure 1 shows an example of what these split predictions look like as an output from SCOPE.

Table 4.4: Performance of SCOPE on the Genentech cohort of primary mesotheliomas. The training cohort was composed of epithelioid mesotheliomas, whereas the testing cohort was composed of epithelioid mesotheliomas and sarcoma-like mesotheliomas. Mesotheliomas that also show sarcoma-like histology are either predicted correctly as part sarcoma, part mesothelioma ("sarcomatoid mesothelioma"), or otherwise, usually as mesothelioma alone ("epithelioid mesothelioma"), or as sarcoma alone ("sarcoma").

Mesothelioma subtype | Case count | Precision | Recall | F1-Score | Predicted category | Count
Biphasic epithelioid-like | 72 | 1 | 1.00 | 1.00 | Epithelioid mesothelioma | 72
Epithelioid | 54 | 1 | 0.98 | 0.99 | Epithelioid mesothelioma | 53
Sarcomatoid | 29 | | | | Sarcomatoid mesothelioma | 18
 | | | | | Epithelioid mesothelioma | 5
 | | | | | Sarcoma | 4
 | | | | | Other | 2
Biphasic sarcoma-like | 56 | | | | Epithelioid mesothelioma | 38
 | | | | | Sarcomatoid mesothelioma | 17
 | | | | | Other | 1

Abbreviations: NA - not applicable; SCOPE - Supervised Cancer Origin Prediction using Expression

4.3.4 Providing diagnosis for pre-treated metastases

In an independent set of 201 post-treatment metastatic cancers, SCOPE performed well above the baseline linear classifier, achieving an overall accuracy (SD) of 86% (11%), and a mean (SD) F1 score of 0.79 (0.12) (Figure 4.5 A; Table 4.5). Among the 41 mispredictions, seven (17.1%) matched the site of biopsy (for example, predicting hepatocellular carcinoma for a breast cancer biopsy specimen from the liver), and 13 of the 41 (31.7%) matched a cancer type with the same organ system of origin instead (for example, predicting uterine carcinosarcoma as ovarian cancer, or predicting stomach adenocarcinoma as esophageal adenocarcinoma). For the remaining 21 cases, no obvious explanation was found for misclassification. Because our method provided a confidence score for each prediction, we found that in the set of confident diagnoses from the ensemble (118 of 201, confidence score of 80%, spanning 20 cancer types) accuracy went up to 92%.

Table 4.5: Performance of SCOPE on the metastatic cohort.
Number of mispredictions are listed in brackets if more than one. Columns Cases through F1-Score are cohort metrics; columns Diagnosis through Other give the cases predicted as each category.

Cancer type | Cases | Precision | Recall | F1-Score | Diagnosis | Biopsy Site | Organ System | Other
Metastatic biopsies
Adrenocortical CA | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Follicular Lymphoma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Mesothelioma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Prostate AC | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Testicular Germ Cell Tumour | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Thymoma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Colorectal AC | 21 | 1.00 | 0.81 | 0.89 | 17 | LIHC | STAD(2) | CHOL_n
Papillary Kidney AC | 2 | 1.00 | 0.50 | 0.67 | 1 | - | - | LUAD
UCEC | 5 | 1.00 | 0.40 | 0.57 | 2 | - | BRCA | BLCA, STAD
Uterine Carcinosarcoma | 4 | 1.00 | 0.25 | 0.40 | 1 | - | OV, SARC | HNSC
Breast CA | 65 | 0.97 | 0.97 | 0.97 | 63 | LIHC_n | - | BLCA
Lung AC | 14 | 0.93 | 1.00 | 0.97 | 14 | - | - | -
Sarcoma | 17 | 0.90 | 0.53 | 0.67 | 9 | LIHC | - | BRCA, DLBC(2), GBM, SKCM(2), KIRC
Ovarian CA | 7 | 0.86 | 0.86 | 0.86 | 6 | - | - | PAAD
Prostate AC | 9 | 0.75 | 0.33 | 0.46 | 3 | LIHC | CHOL(3), LUSC | BLCA
Cholangio-CA | 5 | 0.67 | 0.80 | 0.73 | 4 | - | STAD | -
Cutaneous Melanoma | 2 | 0.50 | 1.00 | 0.67 | 2 | - | - | -
Diffuse Large B-Cell Lymphoma | 1 | 0.33 | 1.00 | 0.50 | 1 | - | - | -
Stomach AC | 3 | 0.25 | 0.67 | 0.36 | 2 | LIHC | - | -
CESC-AC | 1 | 0.00 | 0.00 | 0.00 | - | - | - | STAD
Esophageal AC | 2 | 0.00 | 0.00 | 0.00 | - | LIHC | STAD | -
Esophageal SCC | 4 | 0.00 | 0.00 | 0.00 | - | LUSC(1) | - | CESC_SCC(2), LUSC(1)
Primary site biopsies
Adrenocortical CA | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Breast CA | 4 | 1.00 | 1.00 | 1.00 | 4 | - | - | -
Colorectal AC | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Glioblastoma Multiforme | 4 | 1.00 | 1.00 | 1.00 | 4 | - | - | -
Brain Glioma | 2 | 1.00 | 1.00 | 1.00 | 2 | - | - | -
Liver Hepatocarcinoma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Pancreatic AC | 2 | 1.00 | 1.00 | 1.00 | 2 | - | - | -
Cutaneous Melanoma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Uterine Carcinosarcoma | 1 | 1.00 | 1.00 | 1.00 | 1 | - | - | -
Sarcoma | 6 | 1.00 | 0.83 | 0.91 | 5 | - | - | HNSC
Lung AC | 4 | 1.00 | 0.75 | 0.86 | 3 | - | LUSC | -
Mesothelioma | 4 | 1.00 | 0.75 | 0.86 | 3 | - | - | KIRC
Lung SCC | 1 | 0.50 | 1.00 | 0.67 | 1 | - | - | -
UCEC | 1 | 0.00 | 0.00 | 0.00 | - | - | CESC_SCC | -
Total | 201 | 0.80 | 0.76 | 0.75 | 160 | 7 | 13 | 21

Prediction categories: cases where the predicted cancer type matched the pathology diagnosis (Diagnosis), was the same as the tissue type of the biopsy site (Biopsy Site), matched a cancer type with the same organ-system of origin (Organ System), or did not match any of the above (Other).
Abbreviations: AC - adenocarcinoma, CA - carcinoma, SCC - squamous cell carcinoma, CESC AC - cervical/endocervical adenocarcinoma, UCEC - uterine corpus endometrial carcinoma

In our assessment of this cohort, we found no association between classification accuracy and tumour content (P = 0.59), and a weak correlation with the size of the training class (Pearson correlation coefficient, 0.39). There was an association between classification accuracy and confidence score (N = 201; P < 0.001). In metastatic site biopsies (N = 168), an association was found between low tumour content and the diagnosis of another cancer type with the same organ system of origin (Figure 4.5C). This association was absent in primary site biopsies (Figure 4.5B, Table 4.5).

4.3.5 Identification of putative primary tumour type for cancers of unknown primary

We retrospectively predicted the cancer type for 15 cancers where the primary site of origin was unknown after initial pathology assessment.

Figure 4.5: Performance of SCOPE on external metastatic cohort.
A) Two-sided t-tests show a significant association of tumour content with general diagnosis as organ system, for biopsies sampled from the site of metastasis. B) Two-sided t-tests show no effect of tumour content on misclassification to organ system, for biopsies sampled from the cancer's site of origin. C) SCOPE has improved performance compared with a baseline linear comparator trained from a statistically filtered feature subset. Abbreviations: AC - adenocarcinoma, CA - carcinoma, SCC - squamous cell carcinoma, CESC AC - cervical/endocervical adenocarcinoma, UCEC - uterine corpus endometrial carcinoma.

Figure 4.6: SCOPE prediction and putative primary for cancers with unknown primary site. A confusion matrix of predictions is shown, where the size of the circles represents the relative number of samples in each category. Case count for CUPs by putative origin is shown with a histogram on the right. Correct predictions are indicated in yellow whereas incorrect ones are shown in black. Salivary carcinoma, neuroendocrine tumours, and Ewing sarcomas were not present in SCOPE training, explaining the inability of the method to identify these accurately. Abbreviations: CA - carcinoma, AC - adenocarcinoma.

These tumours were therefore refractory to standard pathology protocols. Subsequent diagnosis was determined by analysis of whole-genome sequencing and RNA-Seq data, and validated by pathology review and immunohistochemistry. The prediction by SCOPE was compared against this putative diagnosis. As shown in Figure 4.6, the classifier's prediction matched all putative diagnoses except one Ewing sarcoma, one neuroendocrine tumour, and one salivary carcinoma; these three cancer types were not present in training.

4.3.6 Impact of feature removal on classification

In order to evaluate whether there is a thresholding effect on accuracy based on the number of genes provided as input to SCOPE, we performed a
gene 'blanking' experiment for the 81 samples in the metastatic cohort that SCOPE predicted with high confidence. The RPKM values of a percentage of genes were randomly set to 0 before the RNA-Seq data was passed as input to SCOPE, and this process was repeated 10 times for each percentage threshold. Ten percentage thresholds were tested: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 99%. The resultant accuracy was measured for all 81 samples. We found that as more genes are set to zero in the input, fewer samples were predicted correctly (Pearson correlation -0.80) (Appendix Figure 2). The decline in performance happens sharply when 40% of the genes are set to 0. Surprisingly, accuracy did not entirely go to 0 when 99% of the genes were set to 0 (only the expression of 177 random genes passed as input), staying at around 12.5%.

4.4 Conclusion

In this chapter, we describe the development and validation of a cancer-type classifier that leverages the entire gene-expression profile of a tumour sample to correctly identify its site of origin. Our method achieves 97% overall accuracy and a mean (SD) F1-score of 0.92 (0.06) on our held-out set. This performance level is maintained on external cohorts, with an overall accuracy of 99% on primary mesotheliomas and mean (SD) accuracy of 86% (11%) for a dataset of various metastatic cancers. We use the confidence score values (equivalent to probabilities) for predictions to characterize cancers with mixed histology.

Metastatic cancers form 12-15% of cancer diagnoses worldwide, but account for 90% of cancer-associated deaths. While in some cases this is because of a paucity of research in identifying appropriate targets for rare cancers, often this can occur as a result of delayed diagnosis or misdiagnosis (in 15-28% of the cases) [39, 69].
SCOPE can be easily deployed for automated diagnosis from RNA-Seq data, facilitate analysis of rare cancers, and support re-alignment of diagnosis [105, retrospectively evaluated for the case-study described in Chapter 2; 72]. As shown by its performance on CUPs, it is particularly useful in expediting precision oncology workflows and in clinical laboratories where access to a plethora of immunostains for sequential diagnosis may be limited. The method is available online as a Python package, cancerscope.

Since the method spans 40 established primary cancer types and 26 normal tissue types, it is able to consider multiple differential diagnoses and provide a quantifiable prediction of the most likely primary of origin, guiding our efforts in using personalized cancer genomics to investigate the underlying biology of morphologically challenging cancers and cancers with multiple differential diagnoses from histopathology. Since it leverages whole transcriptome profiles, the impact of known biological programs can be characterized at a sample-specific level as well. This application of SCOPE is described in greater detail in the next chapter, along with published case-studies where SCOPE was used for biological analysis and interpretation.

Chapter 5

Enabling cancer transcriptome analysis from SCOPE using single-sample pathway impact evaluation (PIE)

Precision oncology necessitates detailed characterization of aberrant pathways and driver biology of individual tumours. This process requires an accurate cancer diagnosis to align the observed changes against the appropriate background and select comparators. Cancer classifiers leveraging transcriptome and genomic data can optimize the task of distinguishing various cancers, but understanding the biological underpinning of the diagnosed cancer still remains a complex and laborious manual task.
We have developed a method that uses classifiers trained with whole-transcriptome data to provide single-sample biological pathway importance scores. This unsupervised exclusion analysis approach for pathway impact evaluation (PIE) recapitulates cancer-specific biology and clustering of the classifier training data from The Cancer Genome Atlas, performs single-sample analysis of treatment-resistant cancers to help explain diagnosis and subtyping, identifies biological pathways associated with drug response, and reflects known biology for metastatic cancers. PIE provides a score for each biological pathway across forty cancer types, for a given sample. It is available as a Python package ('cancerscope') that includes an RNA-Seq based pan-cancer classifier.

5.1 Background

Cancers acquire oncogenic potential through somatic mutations, epigenetic modifications, genomic rearrangements, copy number alterations, and gene expression changes. Many genes have been identified that either promote or restrict the growth of cancer cells. Their downstream effectors and upstream regulators have been curated and studied, and the key oncogenic genes and their interaction partners have been placed in the context of regulatory networks. Researchers are able to identify genes that may be driving an individual cancer (and hence serve as therapeutic targets) by aligning observed changes against this complex set of oncogenic pathways [68, 213]. Evaluation of the scope and impact of these changes on specific cellular pathways and protein networks at the single-sample level is a key challenge.

In most existing approaches for single-sample analysis, comparator cohorts and samples are required every time an analysis is performed, and results can vary depending on the type of statistical metric used [116, 228].
The requirement of controls and background samples for every single-sample analysis makes it difficult to analyze cancers that lack suitable comparator datasets (rare cancers, post-treatment cancers), present with mixed histology, or have important individual signals that characterize the tumour. The analysis can be further impacted by platform biases and, in the event of small case/control studies, be severely underpowered.

PARADIGM, the only equivalent approach for measuring patient-specific pathway activities in a pathway network of interest, requires users to manually define the nature of interactions between each of the member genes for each pathway. The tool itself has since been commercialized, and the utility of the publicly available implementation is limited in the absence of accompanying pathway networks.

We have developed an approach for single-sample pathway impact evaluation (PIE) by quantification of the impact of various gene sets (representing pathways) on classification confidence from pan-cancer classifiers trained with large feature representations. By encapsulating all required comparator information into a classifier, we forego the need for comparator datasets, and provide quantification of pathway impact across a large number of cancer types simply using pathway gene lists. In a previous work we had trained SCOPE, an ensemble of neural networks, to distinguish forty primary cancer types (represented by the site of origin and well-established cancer subtypes). As described in Chapter 4, SCOPE uses large gene expression profiles from bulk RNA-Seq as input (over 17,688 genes).
Using this tool as the core classification model, we tested the impact on classification for 3,963 biological pathways and gene groups that represent various regulatory and biochemical functions in eukaryotic cells. For a given sample, pathway impact scores were calculated by setting the expression values of the pathway-specific genes to zero, and then calculating the difference in classification confidence relative to the original sample.

We demonstrate the utility of PIE for pathway-level analysis and interpretation of classification results from SCOPE through four analyses: (a) validating that the method recovers relevant biological pathways from the training data of the underlying classifier, (b) conducting cohort-wide pathway analysis and recovering clustering by site of origin based on pathway importance scores for two independent cohorts of metastatic cancers that were not included in classifier training, (c) independently identifying oncogenic pathways important for cancer maintenance and a pathway associated with paclitaxel treatment by analyzing a previously published case of vulvar adenocarcinoma, and (d) refining the diagnosis and recovering known biological programs driving the cancer for a previously published case-study of a cancer of unknown primary (CUP) that was originally diagnosed using SCOPE. Lastly, after using PIE to generate pathway-level representations of samples, we discover new subtypes in a previously analyzed cohort of metastatic prostate adenocarcinoma.

5.2 Methods

5.2.1 Test Data

Cohort-level validation of PIE was performed using three publicly available cohorts of advanced cancers. In all these cases, the gene expression data (bulk RNA-Seq) was filtered to select the 17,688 genes that overlapped with the required input for SCOPE, unless otherwise indicated below. No other normalization or pre-processing was done on the RPKM values. The first cohort of 10,156 primary tumours and healthy tissues was drawn from The Cancer Genome Atlas.
Additional details about this data are included in Chapter 4, Section 4.2.1.

The second cohort, of 651 advanced cancers, was obtained from the Personalized OncoGenomics (POG) clinical trial at BC Cancer. These cases were selected based on the following criteria: (a) a primary site of origin was identified, based on a joint consideration of clinical, pathology, and genomic data, and (b) cDNA libraries prepared from the biopsy sample passed in-house quality control.

The third cohort was the MET500 cohort of advanced, metastatic patients. Only 375 of the 500 patients in this cohort had available RNA-Seq data with a confirmed diagnosis that mapped to a TCGA category. 17,347 of the 58,450 annotated genes overlapped with the 17,688 genes required as input to SCOPE. The missing genes were set to 0 for all subsequent analyses.

For sample-level analysis, PIE was used retrospectively to profile two individual cases where detailed integrative pathway analysis based on whole-genome and transcriptomic sequencing was available. These two cases, a vulvar adenocarcinoma and a rare thyroid-like renal carcinoma, have been previously published [72, 105].

5.2.2 Classifier used for PIE measurements

Since PIE's scores are calculated by blanking representative pathway genes and measuring changes in quantified classification scores from a multi-class classifier, we wanted to use a previously published, open-access classifier that leverages large transcriptomic profiles (so as to facilitate blanking of all relevant gene sets) and provides classification probabilities across a vast number of cancer types (so as to provide insights about as many cancer types as possible). SCOPE is a previously validated cancer-type classifier. It is trained on primary cancers and adjacent normal samples from The Cancer Genome Atlas. SCOPE is an ensemble of five different neural network classifiers, each of which provides a probability value between 0.0 and 1.0 for each output class (40 cancer types, 26 healthy tissues).
The average probability value across the five ensemble members was used as the confidence score for a given class. The baseline confidence score for each class was calculated using the default sample input (RPKM values for 17,688 genes).

5.2.3 Pathway analysis for individual samples

Calculation of pathway importance

Pathways were curated manually at the Michael Smith Genome Sciences Centre from KEGG, Reactome, PathCards, TarBase, Consensus Pathway Database, and the Pathway Interaction Database. This resulted in a set of 3,952 pathways. An additional 11 canonical oncogenic pathways representing common signaling cascades that are disrupted in cancer were curated in-house, resulting in a set of 3,963 pathways. Pathways were represented as the set of their member genes. The impact of each pathway on classification was calculated by setting the RPKM values for the pathway genes to 0.0 in the input and calculating the resultant confidence scores across the 66 output classes. Pathways that were important for classification of the sample as class ‘m’ would have a reduced confidence score for class ‘m’ when the relevant genes were removed from the input. Conversely, pathways that were preventing the sample from being predicted as class ‘m’ would have a higher confidence score for class ‘m’ upon being blanked in the input.

For a given pathway-sample pair, the confidence score obtained with the pathway blanked was subtracted from the baseline sample confidence score to obtain the pathway importance score for each output class. A positive score for a given output class ‘m’ meant the pathway was important for classification of the sample as class ‘m’. A negative score meant the pathway was confounding the sample classification as class ‘m’.

Calculation of number of important sample-level pathways in TCGA cohorts

For each sample, positively scoring pathways (PIE score > 0.0) were selected. The inter-quartile range (IQR) was calculated as the difference between the 25th and 75th percentiles of these scores.
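The blanking calculation and the IQR-based selection of this section can be illustrated with a minimal Python sketch. It is not the PIE implementation: `toy_classifier` is a hypothetical stand-in for SCOPE's ensemble-averaged confidence scores, the gene weights and pathway labels are invented for illustration, and the quartile definition (`method="inclusive"`) is an assumption, since the text does not state which quantile method was used. The selection step applies the 1.5 * IQR rule of this subsection to the positive scores.

```python
from statistics import quantiles


def toy_classifier(expression):
    """Hypothetical stand-in for SCOPE: returns one confidence score per class."""
    # Invented weights for illustration; a real classifier is a trained model.
    weights = {
        "BRCA": {"ESR1": 1.0, "GATA3": 1.0},
        "PRAD": {"AR": 1.0, "KLK3": 1.0},
    }
    raw = {
        cls: 1.0 + sum(w * expression.get(gene, 0.0) for gene, w in genes.items())
        for cls, genes in weights.items()
    }
    total = sum(raw.values())
    return {cls: value / total for cls, value in raw.items()}


def pie_scores(expression, pathway_genes, classify):
    """Baseline confidence minus confidence after zeroing the pathway genes."""
    baseline = classify(expression)
    blanked = {g: (0.0 if g in pathway_genes else v) for g, v in expression.items()}
    perturbed = classify(blanked)
    return {cls: baseline[cls] - perturbed[cls] for cls in baseline}


def important_pathways(score_by_pathway):
    """Keep positive scores, then select those above 1.5 * IQR of the positives."""
    positive = {p: s for p, s in score_by_pathway.items() if s > 0.0}
    if len(positive) < 2:
        return []  # assumption: the IQR is undefined for fewer than two scores
    q1, _, q3 = quantiles(list(positive.values()), n=4, method="inclusive")
    threshold = 1.5 * (q3 - q1)
    return [p for p, s in positive.items() if s > threshold]


# Toy RPKM-like profile; blanking ESR1 plays the role of blanking one pathway.
sample = {"ESR1": 8.0, "GATA3": 6.0, "AR": 1.0, "KLK3": 0.0}
scores = pie_scores(sample, {"ESR1"}, toy_classifier)
```

In this toy setting, blanking ESR1 lowers the BRCA confidence, so the BRCA score is positive (the gene set supports that class), while the PRAD score is negative (the gene set was working against that class).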
Pathways with a PIE score > 1.5 * IQR were selected as being important in the sample.

5.2.4 Cohort-level pathway analysis

Pathway profiles of each sample were generated for 3,963 pathways using the pathway importance scores. A sample * (pathway, output class) matrix was generated from the 651 POG samples, resulting in a matrix of size [651, 261558] (3,963 pathways * 66 output classes = 261,558 columns). Similarly, a matrix of size [375, 261558] was generated from the 375 MET500 samples.

Visualization of samples in the TCGA, POG, and MET500 metastatic cancer cohorts

The (pathway, output class) pairs where the output class matched the cancer type of diagnosis were selected for each sample. This reduced the number of ‘features’ per sample to 3,963, matching the number of unique pathways. Uniform Manifold Approximation and Projection (UMAP) was used for dimensionality reduction and visualization of this high-dimensional data. The UMAP decomposition of each cohort was generated using the umap package in Python with Manhattan distance, n_neighbors set to 15, n_components set to 2, and initialization with the first two principal components of the PCA decomposition of the matrix, as recommended elsewhere. All other function arguments were set to the default. No normalization of the input was done prior to generating the decomposition, as the input values were already scaled measurements in the range [-1, 1].

Measurement of cluster metrics by cancer-type in the TCGA, POG, and MET500 cancer cohorts

Silhouette indices were used to quantify the quality of clusters obtained from UMAP projections of cancer cohorts. Given cluster assignments, the silhouette index measures how similar a given sample is to its own cluster, compared to other clusters. A high positive value indicates the sample is well placed in its present cluster, whereas a high negative value indicates it is poorly placed in the current cluster (i.e. has more similarity to a different cluster).
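As a minimal illustration of the silhouette index defined above, the following pure-Python sketch computes the per-sample value s = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function name, the toy two-dimensional coordinates, the labels, and the choice to return 0.0 in degenerate cases are assumptions made for illustration; this is not the code used in the analysis.

```python
from math import dist


def silhouette_per_sample(points, labels):
    """Return the silhouette index of every sample, using Euclidean distance."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # Collect distances from this sample to all others, grouped by cluster.
        by_cluster = {}
        for j, (other, other_label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other_label, []).append(dist(point, other))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            # Assumed convention: singleton cluster or only one cluster present.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0.0 else 0.0)
    return scores


# Toy 2-D projection: the last sample is labelled "B" but sits inside cluster "A".
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (0.0, 0.5)]
cancer_types = ["A", "A", "B", "B", "B"]
sil = silhouette_per_sample(coords, cancer_types)
```

The first sample receives a positive index because it lies close to the other "A" sample, whereas the last sample receives a negative index because it is closer to cluster "A" than to its labelled cluster "B".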
For the silhouette scores presented in this analysis, the diagnosed cancer type was used as the default cluster label, and the principal-component-initialized UMAP projections were used as the sample measurements. The ‘silhouette_samples’ function in sklearn's metrics module was used to calculate the sample-level cluster correspondence score from the PCA-initialized UMAP projections, as measured against the cancer types. Euclidean distance was used along with all other defaults in the function.

Identification of important pathways distinguishing the various prostate adenocarcinoma (PRAD) clusters in the MET500 cohort

For each cluster of PRAD samples, the cancer-type specific pathway importance scores were selected. The mean importance of each pathway across all the samples in the cluster was calculated. Pathways that had a positive mean importance compared to the other two PRAD clusters were filtered and sorted by decreasing mean importance. The mean pathway importance of the top pathways in each of the clusters was then plotted.

5.2.5 Statistical selection of top pathways associated with each cancer type

For a given pan-cancer cohort X (X being TCGA, POG, or MET500), pathways positively associated with each cancer type were identified using a one-sided test of significance across bootstrapped samples. For a given cancer type C, the subset of samples in X belonging to C was identified, and bootstrapped datasets were generated from the resultant subset of the sample * pathway matrix, $X_c$. Sampling with replacement was done to form bootstrapped datasets with 20% of the samples in $X_c$.
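The resampling test of this section can be sketched in a few lines of Python for a single pathway. This is an illustrative sketch only: the function name, the fixed random seed, the rounding of the 20% bootstrap size, and the toy score lists are assumptions, and the p-value is the fraction of bootstraps whose mean falls below the cohort-wide mean, following the indicator-based formula given in this section.

```python
import random
from statistics import mean


def bootstrap_pvalue(cohort_scores, subset_scores, n_iter=1000, fraction=0.2, seed=0):
    """One-sided empirical p-value that a pathway is NOT higher in the subset.

    cohort_scores: PIE scores of one pathway across all samples in cohort X.
    subset_scores: PIE scores of the same pathway for samples of cancer type C.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    size = max(1, round(fraction * len(subset_scores)))  # assumed rounding rule
    cohort_mean = mean(cohort_scores)
    count = 0
    for _ in range(n_iter):
        boot = rng.choices(subset_scores, k=size)  # sampling with replacement
        if cohort_mean > mean(boot):
            count += 1
    return count / n_iter


# Toy data: the pathway scores 0.5 in every sample of cancer type C, 0.0 elsewhere.
type_c = [0.5] * 10
cohort = type_c + [0.0] * 40
p_enriched = bootstrap_pvalue(cohort, type_c)
p_absent = bootstrap_pvalue(cohort, [0.0] * 10)
```

A small p-value (here for `type_c`) indicates that the bootstrap means consistently exceed the cohort mean, so the pathway is positively associated with the cancer type; a p-value near 1 indicates the opposite.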
1000 iterations were performed, indexed with j.

Next, for each pathway, the p-value for rejecting the null hypothesis (that the PIE score of the pathway is not higher in cancer type C than in the entire cohort X) was calculated as follows:

\[
\hat{P}^{\star}(\mu^{\star}_{j}) = \frac{1}{B} \sum_{j=1}^{B} I\left(\hat{\mu}_{X} > \hat{\mu}_{j}\right)
\]

where $\hat{\mu}_{X}$ is the mean PIE score across all samples in X, $\hat{\mu}_{j}$ is the mean PIE score of the samples in the $j$-th bootstrap from $X_c$, $B$ is the number of bootstrap iterations sampled from $X_c$, and $I$ is an indicator function that equals 1 when its argument is true and 0 otherwise.

The pathways were ordered by increasing p-value and then by decreasing mean value in $X_c$, and the top n pathways (of highest statistical significance and high quantitative impact) were selected. Q-values were calculated using the ‘qvalue’ package at a false discovery rate (FDR) threshold of 0.05. Statistically significant, positively associated pathways for a given cancer type were selected with the criteria Q-value <= 0.001 and mean PIE score in the cancer type > 0.01.

5.2.6 Statistical identification of important pathways for single-sample analysis

For a given sample, for each pathway, a two-sided Grubbs test was applied to identify single positive or negative outliers from across the 66 classification categories. Positive (right-tail) outlier (pathway, cancer-type) pairs were filtered to identify outlier pathways in the classification category of interest. These pathways were then ordered by their PIE scores and the top 25 pathways were used to generate the associated visualizations.

5.3 Results

SCOPE uses large gene expression profiles from bulk RNA-Seq as input (over 17,688 genes). Using this tool as the core classification model, we tested the impact on classification for 3,963 biological pathways and gene groups that represent high-level cellular functions.
For a given sample, pathway impact scores were calculated by setting the expression values of the pathway-specific genes to zero, and then calculating the difference in classification confidence against the original sample.

5.3.1 Pathway impact profiles allow clustering and analysis of samples by cancer type

Firstly, we tested whether pathway profiles generated by PIE reflected the biology of samples in SCOPE's training data from The Cancer Genome Atlas (TCGA; Weinstein et al.). A uniform manifold approximation and projection (UMAP) from the 3,963 pathways for each sample retrieved the expected clustering for 71% (N = 25/35) of the cancer types (mean silhouette score > 0.0, where a positive value indicates support for clustering with other samples in the same cancer type), demonstrating that the pathway profiles were sufficient to recover most cancer types (Figure 5.1, Appendix Figure 3). Box plots illustrate the median (centre black line), with the lower and upper hinges indicating the 25th and 75th percentiles respectively. The upper whisker shows the largest value at most 1.5 times the interquartile range (IQR) from the hinge, and the lower whisker shows the smallest value at most 1.5 times the IQR from the hinge. The IQR is calculated as the distance between the first and third quartiles. Data points outside these ranges are plotted as individual points. Poorly clustered cancer types primarily included gastrointestinal and gynecologic malignancies.

Figure 5.1: UMAP projections of PIE profiles for 3,963 biochemical pathways, for samples in the TCGA cohort of primary tumours. For ease of readability, the projections in panel A) show TCGA tumour types coloured by their organ system of origin. The spread of sample-specific silhouette indices, grouped by cancer type, is shown in panel B).

Pathways common to multiple cancer types across various organ systems of origin reflected cancer biology and pathways relevant to cellular function (Figure 5.2). The latter included pathways such as translation initiation and termination, metabolism, gene expression, adaptive immune system, and signal transduction. Various oncogenic pathways were also globally represented, notably the cell cycle, NOTCH, MAPK, VEGF and TNFalpha pathways.

For individual cancer types, statistically significant pathways from PIE (FDR adjusted Q-value < 0.001) overlapped with known biology (Figure 5.3, Appendix Table 2). Examples included the C-MYB and estrogen receptor networks in breast cancer, steroid hormone pathways in endocrine tumours, tubulin folding and neural crest differentiation in gliomas, androgen receptor and prostate cancer pathways for prostate adenocarcinoma, and T-Cell Receptor signaling in hematologic cancers.

Next we investigated trends for pathway complexity in different cancer types. We observed that complex and heterogeneous cancers like glioblastoma multiforme (GBM), sarcoma (SARC) and various hematologic malignancies had over 200 unique statistically significant pathways associated with them, whereas other tumours like pheochromocytoma (PCPG) and thyroid cancer (THCA) needed fewer than 20 pathways (Figure 5.4a). We found a similar trend when looking at the number of important pathways at the sample level (Figure 5.4b). Comparing this distribution with mutation burden data for these cohorts, we found that sample-level pathway counts separated the cancer types by known mutation frequencies. On average, several cancer types with low mutation burden also had fewer important pathways per sample (PCPG, uveal melanoma (UVM), lower grade glioma (LGG), and acute myeloid leukemia (LAML)), and ones with high mutation burden had a higher number of pathways per sample (esophageal squamous (ESCA_SCC), lung squamous (LUSC), skin cutaneous melanoma (SKCM)).
Notable exceptions were testicular germ cell tumours (TGCT) and thyroid cancer (THCA), which harbor a low mutation burden but had a high count of sample-level pathways, and, on the inverse trend (having very few important pathways per sample, but typically having a high mutation burden), stomach adenocarcinoma (STAD) and lung adenocarcinoma (LUAD).

Figure 5.2: Pathways commonly associated with multiple cancer types in the TCGA cancers are shown. Grey bars indicate total number of tumour samples evaluated, whereas coloured bars indicate the number of tumour samples from the respective organ system of origin. Panel A) shows the most common cell-function pathways. Panel B) shows the most common cancer-associated pathways.

Figure 5.3: Statistically significant pathways in TCGA cancers. Panels show important pathways for each tumour and normal category, grouped by organ system of origin. Each group shows the top 5 pathways associated exclusively with cancers in the relevant organ systems of origin, ordered by number of samples in which the pathway had a positive PIE score. Coloured bars indicate the fraction of tumour samples from the organ system where the respective pathway was positively scored by PIE.

Figure 5.4: Determination of pathway-level activities for TCGA primary cancers, using PIE. Panel A) shows the number of statistically significant pathways positively associated with each cancer type, from the group of 3,963 pathways evaluated using PIE. Panel B) shows the number of pathways with statistically significant PIE scores per sample.

5.3.1.1 PIE profiles enable cohort analysis of metastatic cancers

We also interrogated PIE profiles for a cohort of 651 metastatic cancers from the Personalized OncoGenomics (POG) project and 375 metastatic cancers from the MET500 cohort. UMAP profiles showed a consistent ability to cluster samples by cancer type in both these cohorts, with an average cohort silhouette distance of 0.15 (+/- 0.35) in POG and 0.13 (+/- 0.41) in MET500, and a mean silhouette score > 0.0 for 7/9 and 6/9 common cancer types respectively (Figure 5.5, Figure 5.6). Not surprisingly, malignancies absent in the SCOPE classifier, primarily fusion-associated sarcomas, had poorer clustering with the corresponding cancer type (silhouette index < 0.0) (Figure 5.7). Top pathways for all cancer types in both cohorts are listed in the Appendix.

5.3.1.2 Important pathways for breast cancer within and across primary and metastatic cohorts

Examining the statistically significant pathways for breast cancer samples in the TCGA, POG and MET500 cohorts (N = 1100 for TCGA, N = 160 in POG, N = 60 in MET500) revealed features of breast cancer biology (Figure 5.8). The prioritization of estrogen signaling in all three cohorts is consistent with the role of the estrogen receptor and the estrogen signaling pathway in breast cancer progression. The PI3K-Akt pathway has an important role in cell growth and tumour proliferation, and is dysregulated in most common cancers. PIE identified the extracellular matrix (ECM), which plays a vital role in breast cancer progression and metastasis, as being relevant to both primary (TCGA) and metastatic (POG, MET500) breast cancers. It also recapitulates recent findings about differing roles of insulin signaling and retinoic acid signaling in breast cancer biology.
Interestingly, the Reactome pathway ‘Miscellaneous transport and binding events’, which contains the AZGP1 gene associated with breast cancer and lipoprotein regulation, was also prioritized in all three cohorts.

Several inflammation-associated pathways were associated only with metastatic breast cancer samples, including IL27 mediated signaling and inflammasome pathways, particularly the NLRP3 inflammasome. These pathways reflect emerging knowledge about the role of the NLRP3 inflammasome and recruitment of myeloid cells through interleukin signaling in metastatic breast cancers. This indicates that the underlying classifier, SCOPE, had learnt the importance of tumour-associated inflammation in this tumour type despite being trained on primary tumours.

While no established support could be found for endochondral ossification and warfarin metabolism (other important pathways in the metastatic group), previous research utilizing the TCGA breast cancers highlighted the prognostic value of COL10A1, a key member of the endochondral ossification pathway, and dysregulation of Vitamin K pathway genes, which act antagonistically to warfarin. These pathways could indicate potential new directions to better understand breast cancer biology.

Figure 5.5: Clustering of POG cohort samples by cancer type using PIE profiles for 3,963 biochemical pathways. Cancer types with at least 10 samples (N = 510/602) are shown for ease of readability. A) UMAP projections of pathway profiles are shown. Using pathway importance scores, samples cluster by their diagnosed cancer type. B) Silhouette indices of samples are shown, grouped by cancer type. A positive silhouette index indicates that the sample clusters with its assigned cancer type.

Figure 5.6: Clustering of MET500 cohort samples by cancer type using PIE profiles for 3,963 biochemical pathways. For ease of readability, the projections only show cancer types with at least 10 samples in the MET500 cohort (N = 259/375). A) UMAP projections of pathway profiles are shown. Using pathway importance scores, samples cluster by their diagnosed cancer type. Of note, we observe 3 distinct clusters of prostate adenocarcinoma (PRAD, in dark green). B) Silhouette indices of samples are shown, grouped by cancer type. A positive silhouette index indicates that the sample clusters with its assigned cancer type.

Figure 5.7: Silhouette index spread for the MET500 cohort subtypes. Silhouette metrics are calculated from the UMAP projections initialized with the first two principal components; clusters are evaluated based on cancer type annotation. A positive silhouette index indicates that the sample clusters with its assigned cancer type. Abbreviations: BLCA – bladder cancer, BRCA – breast cancer, IDC – invasive ductal carcinoma, ILC – invasive lobular carcinoma, CHOL – cholangiocarcinoma, EHCH – extrahepatic CHOL, IHCH – intrahepatic CHOL, COADREAD – colorectal adenocarcinoma, ESCA – esophageal carcinoma, SCC – squamous cell carcinoma, EAC – esophageal adenocarcinoma, OV – ovarian cancer, PRAD – prostate adenocarcinoma, SARC – sarcoma, RHBD – rhabdoid, LMS – leiomyosarcoma, EW – Ewing sarcoma, UPS – undifferentiated pleomorphic sarcoma, DDL – dedifferentiated liposarcoma, SKCM – skin cutaneous melanoma.

5.3.2 Pathway impact scores reveal prostate cancer subgroups

We observed three distinct clusters of prostate adenocarcinoma (PRAD) samples in the UMAP projection of PIE profiles from the MET500 cohort (Figure 5.9a, Figure 5.6a). We compared these groups against a previously published analysis that included 15/62 PRADs analyzed here, and found that the Group-1 samples all harbored fusions associated with prostate cancer. The overlapping Group-2 samples had a high frequency of CDK12 copy loss and TP53 mutations (5/6, where 6/32 had published mutation data in Wu et al.). We further validated these clusters through unsupervised Principal Component Analysis and UMAP decomposition visualizations of the samples' gene expression profiles.
We observed separation of Group-1 and Group-2 in both the principal component analysis and the UMAP of the original gene expression data, suggesting that the observed differences have a biological basis (Figure 5.9b).

For the Group-1 samples with available data (N = 7/20), 100% of the samples harbored ETS fusions (primarily TMPRSS2-ERG) and 4/7 harbored PTEN copy loss and mutations. This was in comparison to only 3/8 samples with ETS fusions and 3/8 samples with PTEN copy loss and mutation in the other two groups combined (N = 8/42). A comparison of the top 25 pathways distinguishing each of the clusters from the other two (Figure 5.10) revealed that the Group-1 samples had a high impact of immune signaling (N = 6/25 pathways) and cell-surface signaling pathways (N = 4/25 pathways), including the T-cell receptor complexes and ECM-receptor interaction respectively. Recent work has suggested a strong association between high fusion burden and immunogenicity in prostate cancer. Additional key oncogenic pathways characterizing this group were the PI3K-Akt signaling axis, Notch pathway, and Jak pathway.

The Group-2 samples were strongly driven by high PIE scores for the Celecoxib pathway and the calnexin/calreticulin cycle. Celecoxib is a COX-2 inhibitor drug commonly used in treatment of relapsed patients with prostate adenocarcinomas. It inhibits androgen receptor (AR) and ErbB signaling; notably, we also observe a relatively higher average pathway importance for various pathways associated with androgen receptor signaling in this group (Appendix Figure 4). Suppression of calnexin by Celecoxib has been observed in cell lines, providing a rationale for why this pathway was also considered important for this group. Other oncogenic pathways associated with Group-2 included p53 effectors, Wingless-related integration site (WNT) signaling, and the MAP kinase pathway. Group-3 was characterized by immune response to viral infections, allografts, and DAP12 signaling.

Figure 5.8: Cohort comparison between the top 25 pathways associated with breast cancer, for The Cancer Genome Atlas (TCGA) cohort of primary cancers, the POG cohort of metastatic tumours, and the MET500 cohort of metastatic tumours. Panel A) shows the number of unique and shared pathways between each of the cohorts. The MET500 and POG cohorts are grouped as ‘metastatic’. Pathways common between primary and metastatic cancers (in purple), exclusive to primary cancers (in orange), and common within the metastatic cohorts (in light blue) are shown in panel B) with the corresponding mean PIE score across samples on the y-axis.

5.3.3 PIE independently recovers sample-level findings from integrative genomic analysis

We used PIE with the SCOPE classifier to identify prominent pathways for a previously studied rare mammary-like vulvar adenocarcinoma (described in Chapter 2). This rare tumour initially presented as a poorly differentiated malignancy of the vulva. Subsequent genomic analysis and expression comparison with TCGA tumours determined it to be most similar to a HER2+ breast cancer.

5.3.3.1 Pathways prioritized by PIE overlap with integrative analysis

Integrative manual pathway analysis was performed at the time of case presentation, aggregating all observed changes into important pathways that could explain the oncogenesis and provide potential therapeutic options (Appendix Figure 5). We used PIE to retrospectively identify the most important pathways driving the classification of this tumour as a breast cancer by SCOPE. The top 25 pathways, statistically prioritized through the Grubbs test for outliers, recovered many of the pathways identified through genomic analysis (Figure 5.11).
Quantitatively, 48% of the top 25 pathways included genes identified through the manual analysis. This included signal transduction, ErbB, MAPK, c-MYC, and various pathways involved with metabolism. The FOXA1 transcription factor network was also prioritized, supporting the observed overexpression of AR and CDKN1B in the sample (these two genes are important members of this pathway (Belinky et al., pathway 3538)). The automated approach of PIE in determining the involvement of cancer-driving pathways appeared to be consistent with manual expert analyses of the molecular profiles. This suggests that the approach could be a relevant starting point for such analyses.

Figure 5.9: UMAP projections of the MET500 cohort are shown, filtered to view only the prostate adenocarcinoma samples. UMAP projections (initialized by the first two principal components) are calculated based on A) sample pathway importance profiles calculated automatically by PIE for 3,963 pathways, and B) gene expression profiles of the samples (RPKM values). Panel B) also suggests a non-random separation of the samples.

Figure 5.10: Top 25 pathways driving the 3 distinct clusters observed for the prostate adenocarcinomas in the MET500 cohort.

5.3.3.2 PIE discovers pathways associated with paclitaxel treatment

We also observed a high PIE score for endoplasmic reticulum (ER) stress signaling mediated by ATF6, ‘Protein processing in endoplasmic reticulum – Homo sapiens (human)’, and HTLV-I infection pathways, which were not described in the manual genomic analysis. Review of clinical history revealed that the patient had been treated with paclitaxel, a microtubule stabilizing and ER stress inducing agent used widely for breast cancer treatment, and subsequently developed resistance. A 2013 study has shown an association between ER stress response and resistance to paclitaxel through activation of TRAP1.
ATF6 also controls expression of the ER chaperone GRP78, which prevents activation of AKT by upstream kinases. AKT3 was under-expressed in the tumour. A recent study found that certain TCGA breast cancers (included in training the underlying classifier) show gene expression profiles consistent with those of breast cancer cell lines harboring paclitaxel resistance, potentially explaining why SCOPE was able to recognize the ER stress response pathways as important for breast cancer classification.

Figure 5.11: Top 25 pathways from PIE-based pathway analysis of a mammary-like vulvar adenocarcinoma. 40% of the pathways shown here overlap with the integrative pathway analysis (in yellow, green), and 16% are associated with paclitaxel therapy that the patient had received previously (in green, blue). Size of pathways is indicated in brackets next to the pathway name on the y-axis. The right panel shows the number of genes shared between the integrative analysis (N = 50) and the indicated pathways. Distribution of PIE scores for the remaining 65 classes is shown in grey, for each pathway.

5.3.4 PIE enables sample-level genomic analysis of cancers with unknown primary

In a previously published study, we described the clinical and genomic presentation of a rare thyroid-like follicular renal cell carcinoma in a 27-year-old patient. The tumour initially presented as a bone metastasis of unknown primary, and no chemotherapy was given prior to genomic sequencing. Using a pairwise expression correlation approach, the tumour sample correlated strongly with both the renal clear cell carcinomas (KIRC) and renal papillary carcinomas (KIRP) from TCGA, while SCOPE strongly indicated this to be similar to KIRP. We retrospectively evaluated the important pathways driving the classification of this case using PIE.

5.3.4.1 PIE correctly recovers putative driver pathways for rare cancer

Integrative genomic analysis of this tumour at the time of case presentation showed a single copy loss in the tumour suppressor gene TP53 and aberrant expression or copy changes in CDK6, MYC, AR, PDGFRA, PDGFRB, and MAP2K2. The MAPK pathway, WNT pathway, and cell cycle pathway were identified as putative drivers (Appendix Figure 6).

We used PIE to retrospectively identify the most important pathways driving SCOPE's classification of this tumour as a KIRP. As shown in Figure 5.12, 60% of the top 25 pathways overlapped with the genes and pathways prioritized through manual genomic analysis, specifically the p53, PDGF signaling, and TGF pathways (the last being an activator of the MAPK pathway, among others). Upon filtering for signaling pathways important for oncogenesis, we found a high impact of the p53, TGF, and PI3K pathways, consistent with genomic findings and previously known attributes of renal papillary carcinomas.

Toll-like receptor pathways were also prioritized by PIE. These pathways are known to be upregulated in follicular thyroid cancers, induced by MAPK signaling. The MAPK pathway was also observed to be highly dysregulated in the genomic analysis of this case. Subsequent histologic analysis guided by the genomic alignment of the CUP as KIRP had also classified the cancer as a thyroid-like follicular renal cell carcinoma, lending further biological rationale for the high rank of this pathway from PIE.

5.3.4.2 PIE provides biological insights into subtyping of cancer of unknown origin

Comparison of the resultant pathway scores between the two subtypes using PIE revealed that the PI3K/Akt pathway and the NRF2 pathway were both important for the classification of the sample as KIRP over KIRC (Figure 5.13).
TCGA reported NRF2/ARE pathway mutations in both KIRC and KIRP (3.2% and 9.0% respectively), and frequent mutations in the PI3K/AKT pathway within these subtypes (16.2% of KIRC and 9.8% of KIRP). Comparison of PIE scores also suggested that, compared to the NRF2 pathway, the PI3K signaling pathway has a stronger negative impact on classification of this sample as KIRC.

Figure 5.12: Top 25 pathways identified by automated pathway impact analysis using PIE, for a cancer of unknown primary that was later diagnosed as a rare thyroid-like follicular renal cell carcinoma. Size of pathways is indicated in brackets next to the pathway name on the y-axis. The panel on the right shows the number of genes from integrative analysis (N = 34) that overlap with the genes in each of the pathways. 48% of the pathways in the main panel overlap with manual integrative pathway analysis findings (in yellow, red), of which 12% are associated with the actual rare cancer type that this cancer represented (in red). Distribution of PIE scores for the remaining 65 output classes is shown in grey, for each pathway.

Figure 5.13: Comparison of pathway importance scores for two different output categories: renal clear cell carcinoma (KIRC) and renal papillary carcinoma (KIRP). Scores were calculated by PIE using the SCOPE classifier output. The input was the RNA-Seq profile of a cancer of unknown primary, later diagnosed as a rare follicular renal cell carcinoma that molecularly aligned with KIRP. The pathways that were important for classification of the sample as KIRP instead of KIRC are highlighted in yellow. Pathways important for classification of the sample as KIRC instead of KIRP are shown in blue. The magnitude of the pathway importance is higher for pathways driving the classification of KIRP over KIRC. Relevant pathways have been labelled.

5.4 Discussion

PIE performs single-sample pathway analysis using classifiers trained with representative cancer profiles.
It recovers biological pathways of relevance to a large set of cancer types. We show that these pathways are relevant to the biology of primary and metastatic cancers, enabling clustering by cancer type and identifying known biological programs. It also has potential in identifying therapeutic targets in individual cancer samples: we show examples where it recovers known important pathways in two published case studies of rare cancers.

We show that PIE recovers biologically relevant pathways of primary cancers from TCGA, and of metastatic cancers from two large cohorts of advanced, post-treatment tumours. Not only do the sample-specific pathway scores allow clustering of the samples by their cancer type, but pathways with high importance for individual cancer types accurately reflect known biological mechanisms of action in the disease. We also identified new sub-groups in a previously studied cohort of prostate adenocarcinomas using sample-clustering patterns observed from PIE scores and validated in the gene expression space. The observed groupings seem to separate based on high immune signaling and previous exposure to Celecoxib. Put together, these findings suggest that PIE also has utility in automated pathway profiling and for discovery of new subtypes.

As we found with the mammary-like vulvar adenocarcinoma case study (classified by SCOPE as a breast cancer), PIE was able to identify pathways that were prioritized through independent expert analysis of the genomic and transcriptomic profiles of the patient's tumour. It also identified pathways associated with prior paclitaxel treatment. This is an interesting finding, since the classifier from which PIE calculated the pathway impact scores was trained using primary, untreated cancers.

Thyroid-like follicular carcinoma of the kidney (TLFCK) is a rare renal cancer subtype that bears a high histologic resemblance to follicular thyroid carcinomas.
In this case, PIE was able to recover findings from manual genomic analysis and flag pathways that could explain this rare cancer’s follicular-like histology. While little is known about the characteristic molecular biology of TLFCKs, recent work has indicated that Toll-like receptor signaling is overexpressed in follicular thyroid carcinomas and in KIRPs. PIE independently identified Toll-like receptor pathways as a vital component of the tumour’s molecular identity.

Here we demonstrate a novel approach for using classifiers to obtain biological insights and identify biological programs that characterize individual cancers. As long as a classifier trained with a sufficiently representative feature set is available, the analysis can be performed at a single-sample level without needing a matching normal or a dataset of cancers against which the pathway changes need to be aligned. Our findings also show that the underlying classifier does not necessarily have to be trained with post-treatment cancers in order to prioritize pathways associated with prior therapy and cancer metastasis. This is a surprising finding, since it suggests that classifiers trained with these large, informative feature representations can simultaneously learn characteristics of the cancers that are not part of the optimization function. The tool for pathway impact evaluation from such classifiers, PIE, is a straightforward algorithm and can be easily repurposed to use different classifiers as the backend, such as random forests, SVMs, or linear classifiers. It is also available as an extension in the original python package for the SCOPE classifier, www.github.com/jasgrewal/cancerscope.

Chapter 6

Conclusions

Genomic analysis of cancers has prompted a new design and implementation paradigm for clinical trials in recent years.
The inclusion of genomic profiling in clinical cancer care has shown demonstrable benefits, and the generation of molecular data is aiding our understanding of cancer. We are moving towards an era where the molecular experiments driving these efforts become accessible and affordable enough to enable routine clinical usage for advanced cancer patients. For this to happen, the development of technologies that enable quick and insightful interpretation of individual cancer genomes and transcriptomes is essential. In routine cancer management, cancer diagnosis is one common task that has remained manual with limited incorporation of molecular changes. Precision oncology has helped us realise the need to go beyond gross cytologic observations to molecular changes, but the complex task of identifying and assessing important molecular changes has also remained largely manual. We can prioritize diagnostic workflows based on decision charts and pre-select relevant genes based on existing cohort-wide analyses, but both these tasks become severely complicated when presented with rare cancers or cancers that might present as a mix of two or more established histo-types (mixed histology cancers).

The tools presented in this thesis enable better diagnoses based on molecular profiles and expedite personalized cancer analysis efforts by automatically identifying important molecular events driving the cancer classification. We curated the frequency with which personalized oncology initiatives encounter rare cancers or require a review of the initial pathology-based diagnosis (Chapter 3). We found that, as also observed in the literature, RNA-Seq is a common modality driving the diagnosis reviews and changes. In Chapter 4 we developed SCOPE, a pan-cancer classifier that is trained with large feature representations of the cancer samples, and validated it on advanced cancers. The method uses an RNA-Seq profile without additional processing or feature selection, and provides a quantitative classification decision from across 40 cancer-types in under two minutes. Lastly, in Chapter 5 we showed that by using a pathway impact evaluation tool we developed (PIE), we can identify pathway-level importance scores from these classifier results, automatically generating pathway profiles of advanced cancers on a per-sample basis. Our results provide an impetus for building classifiers with more comprehensive feature representations, instead of pre-selecting features that optimize the desired class separation. They also suggest that a cancer classifier that uses high-dimensional feature representations can learn the biological underpinnings of the classified cancers.

This final chapter will summarize and discuss the implications of the work presented in this thesis. We will review the current limitations of the developed tools and suggest interesting future directions of research to help redress some of these limitations. We will also comment on broader challenges in practice and highlight lessons learned in the process of integrating this work in precision oncology and diagnostic pathology.

6.1 Contributions

6.1.1 Impact of genomic information on diagnosis of advanced cancers

Our assessment of the role of sequencing data in guiding diagnosis changes (Chapter 3) highlights an important component of precision oncology that often goes unreviewed. As per our findings, rigorous evaluation of pathologist-provided diagnoses using molecular data led to changes from histopathologic findings for 2-4% of advanced cancer cases. While forming a small fraction overall, at the individual level these pieces of molecular evidence can severely impact downstream choices for biological comparators and resultant interpretation of genomic findings.
Using -omics information to revise diagnosis was particularly relevant in 15 cancers of unknown origin, a rare set of cancers that are typically refractory to routine histopathology analysis.

Our findings from SCOPE (Chapter 4) support observations in literature that physiologically proximal and morphologically similar cancer types (such as stomach adenocarcinomas and esophageal adenocarcinomas arising near the gastric junction) are also quite similar at the transcriptome-wide level despite having distinct clinical designations [33, 144]. It also reflects the existing challenge with using glass-based pathology (even with the aid of immunohistochemistry) to discern these tumours.

CUPs form 3-5% of all cancer diagnoses, and the identification of a putative primary in these cancers has important implications for their management and treatment. Immunohistochemistry-based assessment of these cancers is often inconclusive, or yields a wide differential diagnosis. Our method provides an orthogonal approach to narrow down the diagnostic candidates and identify the most likely primary site of such cancers. Since we include adjacent normals as separate classes in our trained ensemble classifier, the system learns to distinguish different tumours, and also identifies expression patterns that are indicative of a normal tissue profile.

6.1.2 Algorithmic advances in cancer classifier development

Computational advances that reduce training time costs for high dimensional datasets, and deep learning algorithms that automatically prioritize relevant features during the training process, are making it even easier to scale computational solutions to broad classification tasks without requiring feature selection.
We demonstrate that it is possible to train large neural networks with >17,000 genes as input and maintain a high level of classification accuracy.

Rare cancers form a difficult subset of cancers to try to classify using supervised machine learning methods, simply because of the limited availability of training datasets reflecting these cancer types. To address this issue, we demonstrate that SMOTE is a viable approach to generate additional training samples in these under-represented classes, resulting in improved classification ability compared to using the best-performing models trained on the class-imbalanced data alone. We also observe performance gains for well-represented classes when using rank representations of the large transcriptome inputs.

Feature selection is not always a stable process: the selected features can vary based on the selection dataset, criteria, and algorithms. Biological changes like biomarker conversion can occur in the transition from primary to metastatic disease, for example in the case of HER2/ERBB2. This can confound results from gene expression panels that were built from a subset of features discriminating primary tumour types and is particularly detrimental when attempting to diagnose rare cancers where characteristic biological features are still being researched or are ill-defined. We find that SCOPE performs better than neural networks trained with feature sets selected either through pairwise t-tests or curated based on prevalent knowledge about cancer biology.

6.1.3 Interpreting cancer classification decisions and performing single-sample pathway analysis

In routine clinical practice, a cancer diagnosis carries a vast amount of information about cancer biology with it. For example, knowing a breast cancer is HER2+ suggests that pathways associated with the ErbB family of transmembrane receptor tyrosine kinases are activated.
Using PIE, we extend the inference obtained from SCOPE to automatically identify and provide activity scores for biological pathways driving the classification decision. These pathway representations recapitulate known biology and allow clustering of samples by tumour-type. In the event sub-clusters are identified within a group of samples, we show that we can easily revert to the individual pathway scores and identify the biological rationale behind each sub-group.

PIE analysis also empowers precision oncology efforts by allowing prioritization of pathways reflecting the biology and therapeutically-relevant pathways of a single tumour sample. We demonstrate its utility in a precision oncology workflow with the analysis of two different rare cancer cases. The first was a rare vulvar adenocarcinoma that was found to be more similar to breast cancers than other gynecologic cancers. We described the manual analysis process for this tumour in Chapter 2. PIE was able to recover the oncogenic pathways known to be impacted in this tumour. It also independently uncovered pathways associated strongly with the patient’s prior treatment with paclitaxel, a finding that we then validated with a larger cohort of breast cancers. The second case was a cancer of unknown origin, where genomic analysis and application of SCOPE determined a putative diagnosis of thyroid-like follicular renal carcinoma. PIE was again able to pinpoint the known oncogenic pathways associated with this sample in particular, and with this rare cancer-type at large. Since an early point of contention in diagnosing this case had been similarity with another renal cancer subtype, we also compared the pathway scores between these two renal cancer subtypes.
We found evidence of known pathways that made this sample look like the diagnosed subtype over the other, and prioritized other pathways that distinguish the profile of this cancer between the two.

Single-sample analysis is an essential and largely manual component of personalized medicine. The prioritization of driver changes and therapeutic targets remains a complex and vital exercise performed by expert computational biologists. Using PIE we can extrapolate classifier results into interpretable sets of pathways, identify biological programs that distinguish different samples and cohorts, and generate automated biological profiles of advanced cancers on a per-sample basis. The proposed approach is a proof-of-concept for automated interpretation of gene expression data with minimal requirements for expert manual review, comparator datasets, and controls. As additional training datasets become available across other -omics data-types, we expect to scale the scope and depth of interpretation provided by PIE.

6.2 Limitations of developed tools

A limitation of SCOPE is the lack of external validation sets for all classes. A challenge for general application of this method is transcriptomic data that has been generated from RNA extracted from formalin-fixed paraffin-embedded (FFPE) tissue, rather than from snap-frozen tissue. Formalin-fixed paraffin-embedded specimens are persistent morphologic records of tissue biopsies, and highly prevalent in pathology laboratories worldwide. However, controllable and uncontrollable variables, including tissue characteristics, fixation technique, and storage conditions, can affect the yield and quality of total RNA in FFPE blocks. We obtained 100% accuracy on five in-house primary FFPE samples (three primary pancreatic ductal adenocarcinomas, one sarcoma, and one primary colon/esophageal adenocarcinoma), correctly identifying the primary cancer type in every case.
Nonetheless, FFPE application of this method will require additional validation.

Unlike prior work in the development of cancer diagnostics using molecular profiles, we did not pre-select our testing samples to have uniformly high tumour content. As a consequence, we obtained insights into the role of tumour content in empowering automated cancer diagnostics. Specifically, we observed that SCOPE has difficulty discriminating metastatic cancers that share the same organ system of origin, if the tumour content of the sequenced sample is low. It is possible that although there is diluted signal for the correct cancer type, low tumour content limits an accurate prediction. Interestingly, based on our observations it appears this bias is only significant in metastatic cancers to the liver (Figures 3.6, 3.7). Other areas where SCOPE performed poorly in pretreated metastases may be driven by known biological differences in the metastatic space (for example, pancreatic cancers). Training classification models with metastatic tumours and examples of advanced cancers is a possible workaround for these problems in the future.

The pathway analysis tool, PIE, does not take into account directional regulatory effects of various genes in a pathway. Nor is it able to propose a rationale for prioritized pathways, requiring manual interpretation to determine which prioritized pathways are associated with cancer biology versus prior therapy or site of biopsy. The method is also limited in its ability to flag novel or rare cancer-types that may not be present in the core classifier. While this can be overcome with a combination of a highly discriminant classifier and manual expertise, automating the process remains an active area of research.

The output from PIE is a function of the quality of the classifier that is used as the foundation.
It can flag biological artefacts output by a given classifier, but the effectiveness of the tool can potentially be impacted if such artefacts dominate the classification results. The impact of sequencing protocols on how different pathways are marked important or otherwise remains another unexplored area in this body of work. Since the core classifier used to demonstrate PIE in these analyses (SCOPE) was trained with RNA-Seq data from poly-adenylated transcripts, we anticipate that the impact of pathways where members of the ribosomal RNA families are overrepresented may not be as easily quantified in RNA-Seq data from ribo-depletion protocols.

6.3 Broader challenges in clinical translation

During the work summarized in this thesis, we came across some common hurdles in data curation, methods validation and application in precision oncology. Some of these relate to clinical practice and diagnostic pathology, whereas others stem from the need for understanding the rationale behind classification decisions from algorithms. The ability to share tools through simple programming libraries and distributed healthcare networks like CanDIG is vital for widespread access, evaluation, and adoption of these methods. The following section summarizes some of the key insights related to these based on our experience. Technologies like single-cell sequencing and methylome profiling offer exciting prospects for sample-level characterisation of tumour biology. Understanding the definition of a ‘cancer type’ and incorporating cell-of-origin information, when available, will also help improve our ability to identify and characterize new cancer etiologies. These potential areas of future work are also highlighted below.

6.3.1 Management of diagnostic inaccuracies in clinical practice

Dr.
Eric Topol, Director of the Scripps Research Translational Institute, identifies certain attributes of doctors that contribute to diagnostic inaccuracy: cognitive bias through representative heuristic (taking shortcuts to decisions based on past experiences), availability bias (diagnosing based on the options “available” to them mentally), overconfidence, and confirmation bias (embracing information supporting one’s beliefs and rejecting contradictory information). Making it even more difficult to recognize these pitfalls is the feedback mechanism, or lack thereof, in modern medicine.

A retrospective study on diagnostic uncertainty was conducted by the Mayo Clinic, evaluating diagnostic agreement between the referral and second opinion diagnoses for 286 patients referred from primary care practitioners. They found that the second opinion diagnosis agreed with the referral physician diagnosis in only 12% of patients. This is incredibly low, but even more concerning is the observation that second opinions are rare enough as is, either due to cost, access, or availability of expert physicians. In cancer management for patients with aggressive, treatment-resistant disease, second opinions may also not be an option due to the extra time needed to identify, communicate, and reach a decision.

Incorporating diagnostic changes from genomic analysis

Obtaining feedback on the accuracy of one’s decision is limited in medicine, and it takes years of experience for doctors to gain this perspective. Dr. Michael Lewis, a medical doctor from the University of Toronto, summarises this concisely: “The entire profession had arranged itself as if to confirm the wisdom of its decisions”. Relying on broad-based histopathology assessments as a way to reject granular diagnostic insights is dangerous.
It enables practitioners to discount information obtained from molecular analysis and diminishes the demonstrable impact of precision oncology efforts that genuinely advance the standard of care in this area.

Acknowledging diagnostic changes arising from genomic analysis is another challenge that we often observed within the framework of a personalized oncology clinical trial (POG). The creation of a list of suspect diseases, also called a differential diagnosis, is a commonplace clinical practice. By definition, the differential diagnosis facilitates the inclusion of common and rare suspect diagnoses into a single list. This hypothesis-generating method is intended to guide pathologists in their attempt to classify a tumour sample following established histopathologic rules for cancer classification. The presence of a wide range of cancer categories in a differential diagnosis does not indicate that any of them is the final diagnosis. The exclusion of a cancer from the differential diagnosis does not discount its possibility.

We observed an alternative use of the differential diagnosis within POG. In some cases, the genomics-guided diagnosis, while disagreeing with the final diagnosis from the pathologist, overlapped with the differential diagnosis list instead. In such cases, the differential diagnosis list was typically assimilated into the gold-standard for comparison between pathologist diagnosis and genomics-guided diagnosis. While doing so is a reasonable approach for pooling various sources of knowledge together, anecdotally we found that it opened up room for practitioners to repudiate or ignore informative contributions towards diagnosis from genomics.
The work presented in Chapter 3 attempted to avoid this bias by including indications for when and how genomic data indicated the final diagnosis. Formalizing this process in other personalized oncology trials would provide valuable evidence-based orthogonal diagnostics and help quantify the potential contributions of genomic analysis to routine diagnostic pathology. It can also support and advance the pathology practice by providing an automated feedback mechanism and evidence-based arguments for the adoption of emerging molecular markers or high-throughput sequencing modalities for complex presentations.

Curating diagnostic changes from genomic analysis

Anecdotally, we observed that the curation of diagnosis changes in precision oncology efforts faces organizational hurdles and aversion to accepting observations from molecular data. In our grand review of diagnostics in the POG program, we undertook detailed manual review of pathologist notes, tumour board findings, and clinical management system records in order to confidently determine cases that underwent a change in diagnosis or molecular status through genomic analysis. The dispersion of records across multiple sites and organizations is an evident barrier to recording such changes.

6.3.2 Facilitating adoption in routine practice

In practice, physician uptake of multiplexed sequencing approaches for cancer diagnosis remains a challenge. A 2014 review of 160 physicians at the Dana-Farber Cancer Institute, an NCI-designated comprehensive cancer center, revealed that physicians have low confidence in findings from high-throughput genomic sequencing. The researchers found that high genomic confidence was associated with being a medical oncologist, a researcher, and using available genomic tests more frequently. These observations are consistent with previous studies on the subject.
In our experience as well, the gains in diagnostic assessment provided by SCOPE were best evaluated by scientists, pathologists, and oncologists working together. An important driver was the ability to support observed cases of refractory diagnoses with established genomic events associated with the revised histopathology and recognized in clinical oncology.

Curating clinical annotations for improved cancer analysis and management

Findings from genomic analysis can be maximally utilized for patient care if the important molecular changes are linked with patient clinical data such as history of previous treatments and malignancies, duration of previous treatments, and other clinicopathological information. Particularly in the case of treatment-resistant cancers, detailed clinical annotations are severely lacking, consequently impacting the validation and clinical translation of actionable genomic changes. Clinical data recorded over a sustained duration of follow-up, with treatment details recorded in harmonized formats and standardized languages, can facilitate research efforts across cancer centres, aid text-mining and natural language processing efforts for cancer characterization, and significantly improve our ability to utilize genomic data for clinical care in general. A good model for this is the Genomics Evidence Neoplasia Information Exchange (GENIE), a pan-cancer registry initiative launched in 2016 to link existing and future clinical sequencing efforts (typically panel-based sequencing) with longitudinal patient outcomes to empower clinical decision-making. Replicating this model in research consortia that utilize more comprehensive sequencing data like WGS and RNA-Seq is essential for advancing our understanding of treatment resistance and improving the management of advanced cancers.

Making machine-learning models interpretable

Predictive ML algorithms have certain benefits compared to a trained physician: they can be queried inexhaustibly with no impact on performance, have high consistency between each re-run on the same input, and provide a mathematical representation of the resultant output. A physician can have varying degrees of expertise in couching their clinical decision-making with known facts and reasoning, but they possess the ability to provide a narrative that explains their actions and behaviour. The inability to provide a flow of logic from the input to the output is a fundamental aspect limiting the uptake of present-day machine learning-based diagnostics in clinical practice.

Providing a biological rationale for classification decisions from algorithms can be extremely valuable for analysis of individual cancers and our understanding of cancer in general. In Chapter 4 we extract important genes associated with each classification category of SCOPE using integrated gradients. We show that these genes reflect tumour and healthy tissue biology, and overlap with known diagnostic markers of different cancer types. We extend the interpretation beyond individual genes in Chapter 5, using PIE to automatically calculate the impact of groups of genes on classification by SCOPE.
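
As a rough illustration of the gene-level attribution step referred to above, the following minimal sketch computes integrated gradients for a toy logistic model and then sums the attributions over a gene set. Everything in it is an assumption made for illustration: the model, weights, gene names, all-zero baseline, and the `gene_set_score` helper are hypothetical, and the summation is only one simple way of moving from genes to gene groups. It is not the SCOPE network and not PIE's scoring procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, weights, bias):
    """Toy logistic classifier: probability of one class for expression vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def model_grad(x, weights, bias):
    """Analytic gradient of the toy model's output with respect to each input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients along the
    straight path from the baseline to the input x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [x0 + alpha * (xi - x0) for xi, x0 in zip(x, baseline)]
        grad = model_grad(point, weights, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - x0) * t / steps for xi, x0, t in zip(x, baseline, total)]

def gene_set_score(attributions, genes, gene_set):
    """Hypothetical helper: sum the gene-level attributions over one gene set."""
    return sum(a for a, g in zip(attributions, genes) if g in gene_set)

# Hypothetical genes, weights, and expression values (illustration only).
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
weights = [2.0, -1.0, 0.5, 0.0]
bias = -0.5
x = [1.0, 0.5, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
set_score = gene_set_score(attributions, genes, {"GENE_A", "GENE_C"})
```

A useful sanity check on any such implementation is the completeness property of integrated gradients: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.
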
Distilling the causative features into pathways enables immediate characterization of individual tumours without the need to manually determine biological associations behind genes important for classification.

Adding interpretability for classification decisions can also reveal the biological pathways influencing anomalous classification decisions. For example, a key pitfall of SCOPE is its inability to confidently resolve diagnoses in liver biopsies with low tumour content (Sections 126.96.36.199 and Section 4.3.4). We hypothesized that contamination from healthy tissue might be leading to this effect. Using PIE we found that for cases where the biopsy site (liver) had confounded prediction, liver-associated pathways had significantly high PIE scores for hepatocellular carcinoma (LIHC) classification. On the other hand, pathways reflecting the expected tumour biology drove the sample’s classification as the correct cancer-type. These observations, while preliminary, enable a unique insight into the rationale for anomalous classification results, and can help characterize the biological implications of low tumour content in bulk RNA-Seq experiments.

6.3.3 Ensuring equitable access to developed tools

Research is not limited to discovery. Technology development from findings is an essential component of innovative research, allowing better access and benefits to tax-payers from discoveries arising through funded projects. The machine-learning methods developed in this project had an immediate benefit for a large fraction of cancers with unknown primary, and for several presentations of advanced cancers refractory to histopathology. At the very minimum, patients were able to obtain a diagnosis (valuable for personal closure, guiding treatment, or a better understanding of a rare etiology) because of their enrollment in the POG trial. How do we ensure access to and support for using such tools for patients in remote communities?
How do we enable adoption of these observations into clinical practice?

Precision oncology is a fast-growing area of research, and several bioinformatics tools emerge from these cross-disciplinary investigations with potentially widespread benefit. In our work we have made the developed tools available as plug-and-play python packages, available through GitHub or the traditional ‘pip’ install option. In the long term, ensuring equal access to the underlying sequencing platforms and resultant analysis tools requires the development of strong collaboration networks across the country with a heavy emphasis on preserving patient privacy. Projects like CanDIG, which aims to develop a national platform for distributed analysis of locally-controlled private genome data, provide the essential infrastructure for enabling equitable access to these tools in urban and rural communities alike. However, currently in Canada there are no clear policies in place to support commercialization or public translation of research software and technologies of value in personalized healthcare.

At present, researchers have limited access to time and resources for developing new technologies, validating them, and ensuring that their use continues beyond the term of the lead graduate student or post-doctoral scholar. Funding mechanisms need to be developed to support the deployment, maintenance, and updating of these technologies in the long run. Institute-level initiatives like the Canada Foundation for Innovation funds are now emerging to support the evolving need of Canadian researchers. Another area for future work is the development of policy frameworks that support researchers beyond the early phases of research and discovery, assisting them in fulfilling the continual requirements of developing and maintaining bioinformatics tools for universal access.

Cancer classifiers are one example of bioinformatics tools that have limited longevity in the research domain. Many published works in the field lack open-access releases of the developed models and become out-of-date quickly as classifications and subtypes with nuanced clinical implications are discovered. Few of the published tools have become commercialized, and the limited set that are open-access are rendered unusable due to deprecated dependencies and deployment platforms. Some of these issues can be addressed by funding policies that encourage and support the maintenance of bioinformatics tools with a demonstrable impact in personalized healthcare. Ensuring continued relevance of the information provided by the tools is another challenge altogether.

6.3.4 Keeping classifiers up-to-date

Training cancer classification algorithms continuously to accommodate edge-cases and out-of-distribution test samples also poses computational and algorithmic challenges. In recent years, promising advances in machine learning have been made that can overcome these issues. Using methods like active learning and reinforcement learning, machine learning models can be trained with new data and effectively learn from their mistakes. Recent work has also suggested improvements in neural networks that enable them to reliably accommodate new examples and classes without ‘forgetting the past’.
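
To make the idea of adding classes without retraining concrete, the sketch below shows one of the simplest class-incremental schemes, an incremental nearest-class-mean classifier that keeps a running centroid per class. It is an illustrative assumption, not one of the neural-network methods referred to above and not part of SCOPE; the class labels and two-dimensional feature vectors are hypothetical.

```python
import math

class NearestClassMean:
    """Toy classifier that keeps one running mean (centroid) per class."""

    def __init__(self):
        self.means = {}   # class label -> centroid (list of floats)
        self.counts = {}  # class label -> number of samples seen so far

    def update(self, x, label):
        """Fold one labelled sample into its class centroid.
        A label that has not been seen before simply creates a new class."""
        if label not in self.means:
            self.means[label] = list(x)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        self.means[label] = [m + (xi - m) / n
                             for m, xi in zip(self.means[label], x)]

    def predict(self, x):
        """Return the label whose centroid is closest in Euclidean distance."""
        return min(self.means, key=lambda c: math.dist(x, self.means[c]))

# Initial training on two hypothetical classes.
clf = NearestClassMean()
for features, label in [([1.0, 0.0], "TYPE_A"), ([0.9, 0.1], "TYPE_A"),
                        ([0.0, 1.0], "TYPE_B"), ([0.1, 0.9], "TYPE_B")]:
    clf.update(features, label)
before = clf.predict([0.95, 0.05])

# Later, a previously unseen class is added without revisiting old samples.
clf.update([5.0, 5.0], "TYPE_C")
clf.update([5.2, 4.8], "TYPE_C")
after_old = clf.predict([0.95, 0.05])
after_new = clf.predict([5.1, 5.0])
```

Because each class is summarised independently, adding a class leaves the existing centroids untouched. The trade-off is that such a model is far less expressive than a neural network trained on the full transcriptome.
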
Adopting these best practices for model development can help ensure that classification models remain dynamic and can robustly learn from past mistakes and new data without requiring extensive retraining.

As we found in our analysis, studying the outcomes from machine learning algorithms in the context of known biological artifacts and confounding sources of noise is important before the output from these tools can be taken at face value. Re-training models with examples that reflect these edge-cases is one approach for fixing such issues as low tumour-content and rare cancer-types not included in training the original classifiers. The ability of these systems to make reliable decisions can only improve as more quality data becomes available and the edge-cases get absorbed into training.

6.3.5 Incorporating other -omics technologies in automated diagnosis

A key limitation of the work presented in this thesis is the limited ability to compare efficacy of mRNA sequencing against other modalities like methylation, miRNA, and lnc-RNA. The precision oncology project used to evaluate the contributions of genomics in cancer diagnosis (Chapter 3) was limited to whole-genome sequencing and poly-adenylated RNA sequencing. The training data used in Chapter 4, which also formed the basis for Chapter 5’s analysis, was limited to RNA-Seq, partly because this was the sequencing modality for which the largest number of cancers had representative data. A multi-omic approach for cancer diagnosis can capture novel subtypes and prognostic groups based on miRNA signatures or methylome profiles, potentially improving the diagnostic ability of associated methods. Emerging work has also shown that methylation profiles captured from circulating tumour DNA can effectively classify brain, gastrointestinal, and gynecologic cancers [82, 117].
Another benefit of alternative sequencing modalities is that they can extend beyond tissue biopsies to any bodily fluid, better capturing the inherent heterogeneity in a tumour, and enabling less invasive early detection and diagnosis for the patient.

To date, these discoveries have been limited in their translational value because the available analyses are limited and focused on primary cancers, and because large, representative validation datasets for accurately evaluating the generalizability of the underlying machine learning models are absent. Existing pan-cancer research in these alternative modalities and multi-omic diagnostics has also been extremely small-scale and limited to common cancer-types like breast cancer, gliomas, and pancreatic cancers. Nevertheless, the increasing exploration and generation of cancer profiles using alternative -omics technologies only serves to increase the future availability of such datasets. Going forward, the emerging body of work in this area will enable us to rigorously train and evaluate multi-omics models for classification of rare cancer-types and metastatic, treatment-resistant cancers as well.

6.3.6 Utilizing single-cell sequencing for interrogation of cancer genomes

In the pathology review (Chapter 3) we observed that, compared to whole-genome sequencing, RNA-Seq information was utilized more often and contributed to the vast majority of revised diagnoses, in alignment with its acceptance as a suitable molecular experiment for cancer diagnostics. Comprehensive, upfront RNA sequencing can eliminate the need for individual assays, providing transcriptome-wide measurements of gene expression. This is particularly attractive for cases where transcriptomic signatures might not have been established (such as rare cancers, cancers of unknown origin) or in retrospective clinical trials. RNA-Seq is also able to connect genotypes (DNA profile) with the resultant phenotypes (cancer subtypes, drug response).
It can be used to establish transcriptomic signatures related to observed cell types. It can also profile genetic changes like structural rearrangements, mutations, fusions, and viral integration. As a stand-alone experimental modality, RNA-Seq goes beyond what can be learnt from genetic testing alone, be it through comparative genomic hybridization or DNA sequencing.

So far, bulk short-read sequencing-based genomic analyses have shaped most of our understanding of oncogenesis and treatment resistance in cancer patients. These methods simultaneously profile the tumour and the diverse microenvironment including normal, immune, and stromal components. Deconvolution of these components has been an important but challenging area of research. Over the last decade, single-cell profiling has revealed new and interesting directions beyond bulk tissue sequencing for studying cancers. Cancers are composed of a heterogeneous set of cell types and cell states. The complex interaction of these cells (which can be malignant or otherwise) impacts inter-tumour and intra-tumour heterogeneity, which in turn has implications for treatment selection and prognosis [100, 141]. Single-cell sequencing provides a powerful new approach to understand the fine-grained differences between various cancers.

Combining single-cell sequencing with multiplexed imaging techniques can also provide interesting insights into cancer subtypes with implications for therapeutic strategies. This approach can preserve spatial context and help us better understand the role of the microenvironment in tumour progression. Understanding cancer heterogeneity at the single-cell level can also provide a granular view of the differentiation hierarchy and reveal differences between cells of origin for histologically similar cancer-types [47, 100].
Once pan-cancer datasets become available and the technology itself becomes more robust to noise and drop-out, two immediate areas of utility would be accurate diagnosis of tumours from low tumour content biopsies, and determination of the cell-of-origin and its functional differences in different cancer types.

6.4 Final words

The work presented in this thesis provides proof-of-principle for the utility of machine learning methods in profiling advanced cancers using bulk RNA sequencing data, and highlights the potential of RNA-Seq as a stand-alone diagnostic and single-sample cancer analysis tool. We find that in practice, when treating advanced cancers, RNA-Seq profiles are used much more frequently than genomic information to determine the site of origin of the cancer. Researchers have long shied away from using high-dimensional profiles for cancer classification, rightly citing the risks of overfitting to noise, poor generalization, and high computational costs. We show that these potential pitfalls can be addressed by training with a sufficiently large training set, utilizing synthetic samples to supplement under-represented classes, using appropriate measures like regularization and early stopping to prevent overfitting during training, and leveraging graphics processing units for training models. Recognizing that not all research and healthcare centres will have access to identical bioinformatics pipelines and compute infrastructure, we train the models without extensive feature selection and without requiring any other prior processing.

When a pathologist provides a diagnosis, they are able to associate certain expectations of tumour biology with the assessment, based either on the morphology or on the findings of ancillary tests like FISH and IHC. This immediate biological interpretation is lacking in present-day computational classifiers used for cancer diagnosis.
The validation and interpretation of the diagnosis from molecular data is another challenging task for precision oncology. Understanding the molecular changes characterizing each individual tumour has been a manually intensive downstream task that uses the diagnosis to make comparisons and identify important molecular events. We address these interpretability problems by developing PIE. We show that this approach can help delineate the biology, prior exposures, and therapeutic targets for individual patients. Applied at scale to large cancer cohorts, it uncovers novel subtypes and provides biological rationale for such groupings. We make the resultant tools available for common use through GitHub and Zenodo. It is reasonable to suspect that the usage of such machine learning-based interpretable diagnostics in clinical care will have to become part of best practices if these tools continue being demonstrably better than human-level assessment and become accessible at low cost.

Bibliography

[1] Abbott, J. J. and Ahmed, I. (2006). Adenocarcinoma of mammary-like glands of the vulva: report of a case and review of the literature. The American journal of dermatopathology, 28(2):127–133.
[2] Adetiba, E. and Olugbara, O. O. (2015). Improved classification of lung cancer using radial basis function neural network with affine transforms of voss representation. PloS one, 10(12):e0143542.
[3] Agwa, E. and Ma, P. C. (2013). Overview of various techniques/platforms with critical evaluation of each. Current treatment options in oncology, 14(4):623–633.
[4] Ahmed, A. A. and Abedalthagafi, M. (2016). Cancer diagnostics: the journey from histomorphology to molecular profiling. Oncotarget, 7(36):58696.
[5] Alligood-Percoco, N. R., Kessler, M. S., and Willis, G. (2015). Breast cancer metastasis to the vulva 20 years remote from initial diagnosis: a case report and literature review. Gynecologic oncology reports, 13:33.
[6] Alshareeda, A. T., Al-Sowayan, B. S., Alkharji, R. R., Aldosari, S. M., et al. (2020).
Cancer of unknown primary site: Real entity or misdiagnosed disease? Journal of Cancer, 11(13):3919.
[7] Bass, B. P., Engel, K. B., Greytak, S. R., and Moore, H. M. (2014). A review of preanalytical factors affecting molecular, protein, and morphological analysis of formalin-fixed, paraffin-embedded (ffpe) tissue: how well do you know your ffpe specimen? Archives of pathology and laboratory medicine, 138(11):1520–1530.
[8] Beaudin, S., Kokabee, L., and Welsh, J. (2019). Divergent effects of vitamins k1 and k2 on triple negative breast cancer cells. Oncotarget, 10(23):2292.
[9] Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M., and Lancet, D. (2015). Pathcards: multi-source consolidation of human biological pathways. Database, 2015.
[10] Bender, R. A. and Erlander, M. G. (2009). Molecular classification of unknown primary cancer. In Seminars in oncology, volume 36, pages 38–43. Elsevier.
[11] Berger, M. F. and Mardis, E. R. (2018). The emerging clinical relevance of genomics in cancer medicine. Nature Reviews Clinical Oncology, 15(6):353–365.
[12] Blechacz, B., Komuta, M., Roskams, T., and Gores, G. J. (2011). Clinical diagnosis and staging of cholangiocarcinoma. Nature reviews. Gastroenterology & hepatology, 8(9):512–22.
[13] Bloom, G., Yang, I. V., Boulware, D., Kwong, K. Y., Coppola, D., Eschrich, S., Quackenbush, J., and Yeatman, T. J. (2004). Multi-platform, multi-site, microarray-based human tumor classification. The American journal of pathology, 164(1):9–16.
[14] Board, P. A. T. E. (2017). Vulvar cancer treatment (pdq®): Health professional version. PDQ.
[15] Boots-Sprenger, S. H., Sijben, A., Rijntjes, J., Tops, B. B., Idema, A. J., Rivera, A. L., Bleeker, F. E., Gijtenbeek, A. M., Diefes, K., Heathcock, L., et al. (2013). Significance of complete 1p/19q co-deletion, idh1 mutation and mgmt promoter methylation in gliomas: use with caution. Modern Pathology, 26(7):922–929.
[16] Boyce, B. (2015). Whole slide imaging: uses and limitations for surgical pathology and teaching.
Biotechnic & Histochemistry, 90(5):321–330.
[17] Brcic, L., Vlacic, G., Quehenberger, F., and Kern, I. (2018). Reproducibility of malignant pleural mesothelioma histopathologic subtyping. Archives of pathology & laboratory medicine, 142(6):747–752.
[18] Brose, M. S., Cabanillas, M. E., Cohen, E. E., Wirth, L. J., Riehl, T., Yue, H., Sherman, S. I., and Sherman, E. J. (2016). Vemurafenib in patients with brafv600e-positive metastatic or unresectable papillary thyroid cancer refractory to radioactive iodine: a non-randomised, multicentre, open-label, phase 2 trial. The Lancet Oncology, 17(9):1272–1282.
[19] Brown, H. M. and Wilkinson, E. J. (2002). Uroplakin-iii to distinguish primary vulvar paget disease from paget disease secondary to urothelial carcinoma. Human pathology, 33(5):545–548.
[20] Bueno, R., Stawiski, E. W., Goldstein, L. D., Durinck, S., De Rienzo, A., Modrusan, Z., Gnad, F., Nguyen, T. T., Jaiswal, B. S., Chirieac, L. R., et al. (2016). Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations. Nature genetics, 48(4):407.
[21] Butler, B., Leath III, C. A., and Barnett, J. C. (2014). Primary invasive breast carcinoma arising in mammary-like glands of the vulva managed with excision and sentinel lymph node biopsy. Gynecologic oncology case reports, 7:7.
[22] Butterfield, Y. S., Kreitzman, M., Thiessen, N., Corbett, R. D., Li, Y., Pang, J., Ma, Y. P., Jones, S. J., and Birol, I. (2014). Jaguar: junction alignments to genome for rna-seq reads. PloS one, 9(7):e102398.
[23] Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D., and Craig, D. W. (2016). Translating rna sequencing into clinical diagnostics: opportunities and challenges. Nature Reviews Genetics, 17(5):257.
[24] Carlson, J. J. and Roth, J. A. (2013). The impact of the oncotype dx breast cancer assay in clinical practice: a systematic review and meta-analysis. Breast cancer research and treatment, 141(1):13–22.
[25] Chaffer, C. L. and Weinberg, R. A.
(2011). A perspective on cancer cell metastasis. science, 331(6024):1559–1564.
[26] Chahal, M., Pleasance, E., Grewal, J., Zhao, E., Ng, T., Chapman, E., Jones, M. R., Shen, Y., Mungall, K. L., Bonakdar, M., et al. (2018). Personalized oncogenomic analysis of metastatic adenoid cystic carcinoma: using whole-genome sequencing to inform clinical decision-making. Molecular Case Studies, 4(2):a002626.
[27] Chauhan, A., Farooqui, Z., Silva, S. R., Murray, L. A., Hodges, K. B., Yu, Q., Myint, Z. W., Raajesekar, A. K., Weiss, H., Arnold, S., et al. (2019). Integrating a 92-gene expression analysis for the management of neuroendocrine tumors of unknown primary. Asian Pacific journal of cancer prevention: APJCP, 20(1):113.
[28] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.
[29] Chen, J., Ge, X., Zhang, W., Ding, P., Du, Y., Wang, Q., Li, L., Fang, L., Sun, Y., Zhang, P., et al. (2020). Pi3k/akt inhibition reverses r-chop resistance by destabilizing sox2 in diffuse large b cell lymphoma. Theranostics, 10(7):3151.
[30] Cheng, D. T., Mitchell, T. N., Zehir, A., Shah, R. H., Benayed, R., Syed, A., Chandramohan, R., Liu, Z. Y., Won, H. H., Scott, S. N., et al. (2015). Memorial sloan kettering-integrated mutation profiling of actionable cancer targets (msk-impact): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. The Journal of molecular diagnostics, 17(3):251–264.
[31] Cheng, L., Lopez-Beltran, A., Massari, F., MacLennan, G. T., and Montironi, R. (2018). Molecular testing for braf mutations to inform melanoma treatment decisions: a move toward precision medicine. Modern Pathology, 31(1):24.
[32] Cheng, L., Zhang, S., Wang, L., MacLennan, G. T., and Davidson, D. D. (2017). Fluorescence in situ hybridization in surgical pathology: principles and applications. The Journal of Pathology: Clinical Research, 3(2):73–99.
[33] Cherniack, A.
D., Shen, H., Walter, V., Stewart, C., Murray, B. A., Bowlby, R., Hu, X., Ling, S., Soslow, R. A., Broaddus, R. R., et al. (2017). Integrated molecular characterization of uterine carcinosarcoma. Cancer cell, 31(3):411–423.
[34] Chu, J., Sadeghi, S., Raymond, A., Jackman, S. D., Nip, K. M., Mar, R., Mohamadi, H., Butterfield, Y. S., Robertson, A. G., and Birol, I. (2014). Biobloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. Bioinformatics, 30(23):3402–3404.
[35] Cieślik, M. and Chinnaiyan, A. M. (2018). Cancer transcriptome profiling at the juncture of clinical translation. Nature Reviews Genetics, 19(2):93.
[36] Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., and Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2):80–92.
[37] Clark, A. M., Ma, B., Taylor, D. L., Griffith, L., and Wells, A. (2016). Liver metastases: Microenvironments and ex-vivo models. Experimental Biology and Medicine, 241(15):1639–1652.
[38] Clark, B. Z., Beriwal, S., Dabbs, D. J., and Bhargava, R. (2014). Semiquantitative gata-3 immunoreactivity in breast, bladder, gynecologic tract, and other cytokeratin 7–positive carcinomas. American Journal of Clinical Pathology, 142(1):64–71.
[39] Clegg, L. X., Feuer, E. J., Midthune, D. N., Fay, M. P., and Hankey, B. F. (2002). Impact of reporting delay and reporting error on cancer incidence rates and trends. Journal of the National Cancer Institute, 94(20):1537–1545.
[40] Connolly, J. L., Schnitt, S. J., Wang, H. H., Longtine, J. A., Dvorak, A., and Dvorak, H. F. (2003). Role of the surgical pathologist in the diagnosis and management of the cancer patient. In Holland-Frei Cancer Medicine. 6th edition. BC Decker.
[41] Connolly, R. M., Nguyen, N. K., and Sukumar, S. (2013).
Molecular pathways: current role and future directions of the retinoic acid pathway in cancer prevention and treatment. Clinical Cancer Research, 19(7):1651–1659.
[42] Consortium, A. P. G. et al. (2017). Aacr project genie: powering precision medicine through an international consortium. Cancer discovery, 7(8):818–831.
[43] Consortium, I. C. G. et al. (2010). International network of cancer genome projects. Nature, 464(7291):993.
[44] Creixell, P., Reimand, J., Haider, S., Wu, G., Shibata, T., Vazquez, M., Mustonen, V., Gonzalez-Perez, A., Pearson, J., Sander, C., et al. (2015). Pathway and network analysis of cancer genomes. Nature methods, 12(7):615.
[45] Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar, M. R., et al. (2014). The reactome pathway knowledgebase. Nucleic acids research, 42(D1):D472–D477.
[46] Dabney, A., Storey, J. D., and Warnes, G. (2010). qvalue: Q-value estimation for false discovery rate control. R package version, 1(0).
[47] Darmanis, S., Sloan, S. A., Croote, D., Mignardi, M., Chernikova, S., Samghababi, P., Zhang, Y., Neff, N., Kowarsky, M., Caneda, C., et al. (2017). Single-cell rna-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell reports, 21(5):1399–1410.
[48] De Leon, E. D., Carcangiu, M. L., Prieto, V. G., McCue, P. A., Burchette, J. L., To, G., Norris, B. A., Kovatich, A. J., Sanchez, R. L., Krigman, H. R., et al. (2000). Extramammary paget disease is characterized by the consistent lack of estrogen and progesterone receptors but frequently expresses androgen receptor. American journal of clinical pathology, 113(4):572–575.
[49] Del Pino, M., Rodriguez-Carunchio, L., and Ordi, J. (2013). Pathways of vulvar intraepithelial neoplasia and squamous cell carcinoma. Histopathology, 62(1):161–175.
[50] Dennis, J. L., Vass, J. K., Wit, E. C., Keith, W. N., and Oien, K. A. (2002). Identification from public data of molecular markers of adenocarcinoma characteristic of the site of origin.
Cancer research, 62(21):5999–6005.
[51] DeVita, V. T., Rosenberg, S. A., and Lawrence, T. S. (2018). DeVita, Hellman, and Rosenberg’s cancer. Lippincott Williams & Wilkins.
[52] Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., Haffari, G., Hirst, M., Marra, M. A., Condon, A., et al. (2011). Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data. Bioinformatics, 28(2):167–175.
[53] El-Deiry, W. S., Goldberg, R. M., Lenz, H.-J., Shields, A. F., Gibney, G. T., Tan, A. R., Brown, J., Eisenberg, B., Heath, E. I., Phuphanich, S., et al. (2019). The current state of molecular testing in the treatment of patients with solid tumors, 2019. CA: a cancer journal for clinicians.
[54] Elmore, J. G., Longton, G. M., Carney, P. A., Geller, B. M., Onega, T., Tosteson, A. N., Nelson, H. D., Pepe, M. S., Allison, K. H., Schnitt, S. J., et al. (2015). Diagnostic concordance among pathologists interpreting breast biopsy specimens. Jama, 313(11):1122–1132.
[55] Erlander, M. G., Ma, X.-J., Kesty, N. C., Bao, L., Salunga, R., and Schnabel, C. A. (2011). Performance and clinical evaluation of the 92-gene real-time pcr assay for tumor classification. The Journal of Molecular Diagnostics, 13(5):493–503.
[56] Ershaid, N., Sharon, Y., Doron, H., Raz, Y., Shani, O., Cohen, N., Monteran, L., Leider-Trejo, L., Ben-Shmuel, A., Yassin, M., et al. (2019). Nlrp3 inflammasome in fibroblasts links tissue damage with inflammation in breast cancer progression and metastasis. Nature communications, 10(1):1–15.
[57] Etheridge, T., Liou, J., Downs, T. M., Abel, E. J., Richards, K. A., and Jarrard, D. F. (2018). The impact of celecoxib on outcomes in advanced prostate cancer patients undergoing androgen deprivation therapy. American journal of clinical and experimental urology, 6(3):123.
[58] Ettinger, D. S., Agulnik, M., Cates, J. M. M., Cristea, M., Denlinger, C. S., Eaton, K. D., Fidias, P. M., Gierada, D., Gockerman, J. P., Handorf, C.
R., Iyer, R., Lenzi, R., Phay, J., Rashid, A., Saltz, L., Shulman, L. N., Smerage, J. B., Varadhachary, G. R., Zager, J. S., Zhen, W. K., and National Comprehensive Cancer Network (2011). NCCN Clinical Practice Guidelines Occult primary. Journal of the National Comprehensive Cancer Network: JNCCN, 9(12):1358–95.
[59] Evens, A. M., Kanakry, J. A., Sehn, L. H., Kritharis, A., Feldman, T., Kroll, A., Gascoyne, R. D., Abramson, J. S., Petrich, A. M., Hernandez-Ilizaliturri, F. J., Al-Mansour, Z., Adeimy, C., Hemminger, J., Bartlett, N. L., Mato, A., Caimi, P. F., Advani, R. H., Klein, A. K., Nabhan, C., Smith, S. M., Fabregas, J. C., Lossos, I. S., Press, O. W., Fenske, T. S., Friedberg, J. W., Vose, J. M., and Blum, K. A. (2015). Gray zone lymphoma with features intermediate between classical Hodgkin lymphoma and diffuse large B-cell lymphoma: Characteristics, outcomes, and prognostication among a large multicenter cohort. American Journal of Hematology, 90(9):778–783.
[60] Fizazi, K., Greco, F., Pavlidis, N., Daugaard, G., Oien, K., and Pentheroudakis, G. (2015a). Cancers of unknown primary site: Esmo clinical practice guidelines for diagnosis, treatment and follow-up. Annals of Oncology, 26(suppl_5):v133–v138.
[61] Fizazi, K., Greco, F. A., Pavlidis, N., Daugaard, G., Oien, K., and Pentheroudakis, G. (2015b). Cancers of unknown primary site: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Annals of oncology: official journal of the European Society for Medical Oncology / ESMO, 26:v133–v138.
[62] Fodde, R. (2002). The apc gene in colorectal cancer. European journal of cancer, 38(7):867–871.
[63] Forbes, S. A., Beare, D., Boutselakis, H., Bamford, S., Bindal, N., Tate, J., Cole, C. G., Ward, S., Dawson, E., Ponting, L., et al. (2017). Cosmic: somatic cancer genetics at high-resolution. Nucleic acids research, 45(D1):D777–D783.
[64] Froomkin, A. M., Kerr, I., and Pineau, J. (2019).
When ais outperform doctors: confronting the challenges of a tort-induced over-reliance on machine learning. Ariz. L. Rev., 61:33.
[65] Frost, A. R., Hurst, D. R., Shevde, L. A., and Samant, R. S. (2012). The influence of the cancer microenvironment on the process of metastasis. International journal of breast cancer, 2012.
[66] Garrison Jr, L. P., Babigumira, J. B., Masaquel, A., Wang, B. C., Lalla, D., and Brammer, M. (2015). The lifetime economic burden of inaccurate her2 testing: estimating the costs of false-positive and false-negative her2 test results in us patients with early-stage breast cancer. Value in Health, 18(4):541–546.
[67] Gatta, G., Van Der Zwan, J. M., Casali, P. G., Siesling, S., Dei Tos, A. P., Kunkler, I., Otter, R., Licitra, L., Mallone, S., Tavilla, A., et al. (2011). Rare cancers are not so rare: the rare cancer burden in europe. European journal of cancer, 47(17):2493–2511.
[68] Good, B. M., Ainscough, B. J., McMichael, J. F., Su, A. I., and Griffith, O. L. (2014). Organizing knowledge to enable personalization of medicine in cancer. Genome biology, 15(8):438.
[69] Graber, M. L. (2013). The incidence of diagnostic error in medicine. BMJ Qual Saf, 22(Suppl 2):ii21–ii27.
[70] Graber, M. L., Wachter, R. M., and Cassel, C. K. (2012). Bringing diagnosis into the quality and safety equations. Jama, 308(12):1211–1212.
[71] Gray, S. W., Hicks-Courant, K., Cronin, A., Rollins, B. J., and Weeks, J. C. (2014). Physicians’ attitudes about multiplex tumor genomic testing. Journal of Clinical Oncology, 32(13):1317.
[72] Grewal, J. K., Eirew, P., Jones, M., Chiu, K., Tessier-Cloutier, B., Karnezis, A. N., Karsan, A., Mungall, A., Zhou, C., Yip, S., et al. (2017). Detection and genomic characterization of a mammary-like adenocarcinoma. Molecular Case Studies, 3(6):a002170.
[73] Grewal, J. K., Tessier-Cloutier, B., Jones, M., Gakkhar, S., Ma, Y., Moore, R., Mungall, A. J., Zhao, Y., Taylor, M. D., Gelmon, K., et al. (2019).
Application of a neural network whole transcriptome–based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA network open, 2(4):e192597–e192597.
[74] Gröschel, S., Bommer, M., Hutter, B., Budczies, J., Bonekamp, D., Heining, C., Horak, P., Fröhlich, M., Uhrig, S., Hübschmann, D., et al. (2016). Integration of genomics and histology revises diagnosis and enables effective therapy of refractory cancer of unknown primary with pdl1 amplification. Molecular Case Studies, 2(6):a001180.
[75] Ha, G., Roth, A., Lai, D., Bashashati, A., Ding, J., Goya, R., Giuliany, R., Rosner, J., Oloumi, A., Shumansky, K., et al. (2012). Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome research, 22(10):1995–2007.
[76] Hagström, J., Heikkilä, A., Siironen, P., Louhimo, J., Heiskanen, I., Mäenpää, H., Arola, J., and Haglund, C. (2012). Tlr-4 expression and decrease in chronic inflammation: indicators of aggressive follicular thyroid carcinoma. Journal of clinical pathology, 65(4):333–338.
[77] Hainsworth, J. D. and Greco, F. A. (2014). Gene expression profiling in patients with carcinoma of unknown primary site: from translational research to standard of care. Virchows Archiv, 464(4):393–402.
[78] Hajdu, S. I. (2006). Thoughts about the cause of cancer. Cancer: Interdisciplinary International Journal of the American Cancer Society, 106(8):1643–1649.
[79] Hajdu, S. I. (2011). Microscopic contributions of pioneer pathologists. Annals of Clinical & Laboratory Science, 41(2):201–206.
[80] Hanahan, D. and Weinberg, R. A. (2011). Hallmarks of cancer: the next generation. cell, 144(5):646–674.
[81] Handorf, C. R., Kulkarni, A., Grenert, J. P., Weiss, L. M., Rogers, W. M., Kim, O. S., Monzon, F. A., Halks-Miller, M., Anderson, G. G., Walker, M. G., et al. (2013).
A multicenter study directly comparing the diagnostic accuracy of gene expression profiling and immunohistochemistry for primary site identification in metastatic tumors. The American journal of surgical pathology, 37(7):1067.
[82] Hao, X., Luo, H., Krawczyk, M., Wei, W., Wang, W., Wang, J., Flagg, K., Hou, J., Zhang, H., Yi, S., et al. (2017). Dna methylation markers for diagnosis and prognosis of common cancers. Proceedings of the National Academy of Sciences, 114(28):7414–7419.
[83] Hartung, E. (1875). Über einen Fall von Mamma accessoria. E. Th. Jacob.
[84] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
[85] Haury, A.-C., Gestraud, P., and Vert, J.-P. (2011). The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PloS one, 6(12):e28210.
[86] Hayashi, H., Kurata, T., Takiguchi, Y., Arai, M., Takeda, K., Akiyoshi, K., Matsumoto, K., Onoe, T., Mukai, H., Matsubara, N., et al. (2019). Randomized phase ii trial comparing site-specific treatment based on gene expression profiling with carboplatin and paclitaxel for patients with cancer of unknown primary site. Journal of Clinical Oncology, 37(7):570–579.
[87] Herschkowitz, J. I., He, X., Fan, C., and Perou, C. M. (2008). The functional loss of the retinoblastoma tumour suppressor is a common event in basal-like and luminal b breast carcinomas. Breast Cancer Research, 10(5):R75.
[88] Herter-Sprie, G. S., Greulich, H., and Wong, K.-K. (2013). Activating mutations in erbb2 and their impact on diagnostics and treatment. Frontiers in oncology, 3:86.
[89] Hoadley, K. A., Yau, C., Hinoue, T., Wolf, D. M., Lazar, A. J., Drill, E., Shen, R., Taylor, A. M., Cherniack, A. D., Thorsson, V., et al. (2018). Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell, 173(2):291–304.
[90] Hoadley, K. A., Yau, C., Wolf, D. M., Cherniack, A.
D., Tamborero, D., Ng, S., Leiserson, M. D., Niu, B., McLellan, M. D., Uzunangelov, V., et al. (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158(4):929–944.
[91] Huang, K.-H., Kuo, K.-L., Chen, S.-C., Weng, T.-I., Chuang, Y.-T., Tsai, Y.-C., Pu, Y.-S., Chiang, C.-K., and Liu, S.-H. (2012). Down-regulation of glucose-regulated protein (grp) 78 potentiates cytotoxic effect of celecoxib in human urothelial carcinoma cells. PLoS One, 7(3).
[92] Huang, P.-J., Chiu, L.-Y., Lee, C.-C., Yeh, Y.-M., Huang, K.-Y., Chiu, C.-H., and Tang, P. (2018). msignaturedb: a database for deciphering mutational signatures in human cancers. Nucleic acids research, 46(D1):D964–D970.
[93] Hylebos, M., Van Camp, G., van Meerbeeck, J. P., and de Beeck, K. O. (2016). The genetic landscape of malignant pleural mesothelioma: results from massively parallel sequencing. Journal of Thoracic Oncology, 11(10):1615–1626.
[94] Hyman, D. M., Puzanov, I., Subbiah, V., Faris, J. E., Chau, I., Blay, J.-Y., Wolf, J., Raje, N. S., Diamond, E. L., Hollebecque, A., et al. (2015). Vemurafenib in multiple nonmelanoma cancers with braf v600 mutations. New England Journal of Medicine, 373(8):726–736.
[95] Jamshidi, F., Pleasance, E., Li, Y., Shen, Y., Kasaian, K., Corbett, R., Eirew, P., Lum, A., Pandoh, P., Zhao, Y., et al. (2014). Diagnostic value of next-generation sequencing in an unusual sphenoid tumor. The oncologist, 19(6):623.
[96] Jones, S. J., Laskin, J., Li, Y. Y., Griffith, O. L., An, J., Bilenky, M., Butterfield, Y. S., Cezard, T., Chuah, E., Corbett, R., Fejes, A. P., Griffith, M., Yee, J., Martin, M., Mayo, M., Melnyk, N., Morin, R. D., Pugh, T. J., Severson, T., Shah, S. P., Sutcliffe, M., Tam, A., Terry, J., Thiessen, N., Thomson, T., Varhol, R., Zeng, T., Zhao, Y., Moore, R. A., Huntsman, D. G., Birol, I., Hirst, M., Holt, R. A., and Marra, M. A. (2010). Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors.
Genome biology, 11(8):R82.
[97] Ju, Y. S., Martincorena, I., Gerstung, M., Petljak, M., Alexandrov, L. B., Rahbari, R., Wedge, D. C., Davies, H. R., Ramakrishna, M., Fullam, A., et al. (2017). Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature, 543(7647):714.
[98] Kamburov, A., Pentchev, K., Galicka, H., Wierling, C., Lehrach, H., and Herwig, R. (2011). Consensuspathdb: toward a more complete picture of cell biology. Nucleic acids research, 39(suppl_1):D712–D717.
[99] Kanehisa, M. and Goto, S. (2000). Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30.
[100] Karaayvaz, M., Cristea, S., Gillespie, S. M., Patel, A. P., Mylvaganam, R., Luo, C. C., Specht, M. C., Bernstein, B. E., Michor, F., and Ellisen, L. W. (2018). Unravelling subclonal heterogeneity and aggressive disease states in tnbc through single-cell rna-seq. Nature communications, 9(1):1–10.
[101] Kaufman, L. and Rousseeuw, P. J. (1990). Partitioning around medoids (program pam). Finding groups in data: an introduction to cluster analysis, 344:68–125.
[102] Kazakov, D. V., Spagnolo, D. V., Kacerovska, D., and Michal, M. (2011). Lesions of anogenital mammary-like glands: an update. Advances in anatomic pathology, 18(1):1–28.
[103] Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6):673.
[104] Kim, A. and Cohen, M. S. (2016). The discovery of vemurafenib for the treatment of braf-mutated metastatic melanoma. Expert opinion on drug discovery, 11(9):907–916.
[105] Ko, J. J., Grewal, J. K., Ng, T., Lavoie, J.-M., Thibodeau, M. L., Shen, Y., Mungall, A. J., Taylor, G., Schrader, K. A., Jones, S. J., et al. (2018). Whole-genome and transcriptome profiling of a metastatic thyroid-like follicular renal cell carcinoma. Molecular Case Studies, 4(6):a003137.
[106] Kobak, D.
and Linderman, G. C. (2019). Umap does not preserve global structure any better than t-sne when using the same initialization. bioRxiv.
[107] Kopetz, S., Desai, J., Chan, E., Hecht, J. R., O’Dwyer, P. J., Maru, D., Morris, V., Janku, F., Dasari, A., Chung, W., et al. (2015). Phase ii pilot study of vemurafenib in patients with metastatic braf-mutated colorectal cancer. Journal of clinical oncology, 33(34):4032.
[108] Laskin, J., Jones, S., Aparicio, S., Chia, S., Ch’ng, C., Deyell, R., Eirew, P., Fok, A., Gelmon, K., Ho, C., Huntsman, D., Jones, M., Kasaian, K., Karsan, A., Leelakumari, S., Li, Y., Lim, H., Ma, Y., Mar, C., Martin, M., Moore, R., Mungall, A., Mungall, K., Pleasance, E., Rassekh, S. R., Renouf, D., Shen, Y., Schein, J., Schrader, K., Sun, S., Tinker, A., Zhao, E., Yip, S., and Marra, M. A. (2015). Lessons learned from the application of whole-genome analysis to the treatment of patients with advanced cancers. Cold Spring Harbor molecular case studies, 1(1):a000570.
[109] Le Tourneau, C., Delord, J.-P., Gonçalves, A., Gavoille, C., Dubot, C., Isambert, N., Campone, M., Trédan, O., Massiani, M.-A., Mauborgne, C., et al. (2015). Molecularly targeted therapy based on tumour molecular profiling versus conventional therapy for advanced cancer (shiva): a multicentre, open-label, proof-of-concept, randomised, controlled phase 2 trial. The lancet oncology, 16(13):1324–1334.
[110] Lever, J., Krzywinski, M., and Altman, N. (2016). Classification evaluation.
[111] Lewis, M. (2019). Bias in the er: Doctors suffer from the same cognitive distortions as the rest of us. Nautilus.
[112] Li, C., Dong, H., Fu, W., Qi, M., and Han, B. (2015). Thyroid-like follicular carcinoma of the kidney and papillary renal cell carcinoma with thyroid-like feature: comparison of two cases and literature review. Annals of Clinical & Laboratory Science, 45(6):707–712.
[113] Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with burrows–wheeler transform. Bioinformatics, 26(5):589–595.
[114] Li, Y., Kang, K., Krahn, J.
M., Croutwater, N., Lee, K., Umbach,D. M., and Li, L. (2017). A comprehensive genomic pan-cancerclassification using the cancer genome atlas gene expression data. BMCgenomics, 18(1):508. Liegl, B., Horn, L.-C., and Moinfar, F. (2005). Androgen receptorsare frequently expressed in mammary and extramammary paget’s disease.Modern pathology, 18(10):1283. Lim, S., Lee, S., Jung, I., Rhee, S., and Kim, S. (2020). Comprehensiveand critical evaluation of individualized pathway activity measurementtools on pan-cancer data. Briefings in bioinformatics, 21(1):36–46. Locke, W. J., Guanzon, D., Ma, C., Liew, Y. J., Duesing, K. R.,Fung, K. Y., and Ross, J. P. (2019). Dna methylation cancer biomarkers:translation to the clinic. Frontiers in Genetics, 10. Löffler, H., Pfarr, N., Kriegsmann, M., Endris, V., Hielscher, T.,Lohneis, P., Folprecht, G., Stenzinger, A., Dietel, M., Weichert, W., et al.(2016). Molecular driver alterations and their clinical relevance in cancerof unknown primary site. Oncotarget, 7(28):44322.164Bibliography Loison, L. (2016). The microscope against cell theory: Cancer researchin nineteenth-century parisian anatomical pathology. Journal of thehistory of medicine and allied sciences, 71(3):271–292. Losa, F., Soler, G., Casado, A., Estival, A., Fernández, I., Giménez, S.,Longo, F., Pazo-Cid, R., Salgado, J., and Seguí, M. (2018). Seom clinicalguideline on unknown primary cancer (2017). Clinical and TranslationalOncology, 20(1):89–96. Lynch, T. J., Bell, D. W., Sordella, R., Gurubhagavatula, S., Okimoto,R. A., Brannigan, B. W., Harris, P. L., Haserlat, S. M., Supko, J. G.,Haluska, F. G., et al. (2004). Activating mutations in the epidermalgrowth factor receptor underlying responsiveness of non–small-cell lungcancer to gefitinib. New England Journal of Medicine, 350(21):2129–2139. Maddalena, F., Sisinni, L., Lettini, G., Condelli, V., Matassa, D. S.,Piscazzi, A., Amoroso, M. R., La Torre, G., Esposito, F., and Landriscina,M. (2013). 
Resistance to paclitxel in breast carcinoma cells requiresa quality control of mitochondrial antiapoptotic proteins by trap1.Molecular oncology, 7(5):895–906. Marcus, L., Lemery, S. J., Keegan, P., and Pazdur, R. (2019). Fdaapproval summary: pembrolizumab for the treatment of microsatelliteinstability-high solid tumors. Clinical Cancer Research, 25(13):3753–3758. Marquard, A. M., Birkbak, N. J., Thomas, C. E., Favero, F.,Krzystanek, M., Lefebvre, C., Ferté, C., Jamal-Hanjani, M., Wilson, G. A.,Shafi, S., Swanton, C., André, F., Szallasi, Z., and Eklund, A. C. (2015).TumorTracer: a method to identify the tissue of origin from the somaticmutations of a tumor specimen. BMC Medical Genomics, 8(1):58. Massard, C., Michiels, S., Ferté, C., Le Deley, M.-C., Lacroix, L.,Hollebecque, A., Verlingue, L., Ileana, E., Rosellini, S., Ammari, S., et al.(2017). High-throughput genomics and clinical outcome in hard-to-treatadvanced cancers: results of the moscato 01 trial. Cancer discovery,7(6):586–595. McMaster, J., Dua, A., and Dowdy, S. C. (2013). Primary breastadenocarcinoma in ectopic breast tissue in the vulva. Case reports inobstetrics and gynecology, 2013. Meiri, E., Mueller, W. C., Rosenwald, S., Zepeniuk, M., Klinke, E.,Edmonston, T. B., Werner, M., Lass, U., Barshack, I., Feinmesser, M.,165Bibliographyet al. (2012). A second-generation microrna-based assay for diagnosingtumor tissue origin. The oncologist, 17(6):801–812. Meyer, A. N., Payne, V. L., Meeks, D. W., Rao, R., and Singh, H.(2013). Physicians’ diagnostic accuracy, confidence, and resource requests:a vignette study. JAMA internal medicine, 173(21):1952–1958. Micello, D., Marando, A., Sahnane, N., Riva, C., Capella, C.,and Sessa, F. (2010). Androgen receptor is frequently expressedin her2-positive, er/pr-negative breast cancers. Virchows Archiv,457(4):467–476. Miettinen, M., Cue, P. A. M., Sarlomo-Rikala, M., Rys, J., Czapiewski,P., Wazny, K., Langfort, R., Waloszczyk, P., Biernat, W., Lasota, J., et al.(2014). 
Gata 3–a multispecific but potentially useful marker in surgicalpathology–a systematic analysis of 2500 epithelial and non-epithelialtumors. The American journal of surgical pathology, 38(1):13. Mody, R. J., Wu, Y.-M., Lonigro, R. J., Cao, X., Roychowdhury, S.,Vats, P., Frank, K. M., Prensner, J. R., Asangani, I., Palanisamy, N., et al.(2015). Integrative clinical sequencing in the management of refractory orrelapsed cancer in youth. Jama, 314(9):913–925. Monzon, F. A. and Koen, T. J. (2010). Diagnosis of metastaticneoplasms: molecular approaches for identification of tissue of origin.Archives of pathology & laboratory medicine, 134(2):216–224. Morbeck, D., Tregnago, A. C., Netto, G. B., Sacomani, C., Peresi,P. M., Osório, C. T., Schutz, L., Bezerra, S. M., de Brot, L., andCunha, I. W. (2017). Gata 3 expression in primary vulvar pagetdisease: a potential pitfall leading to misdiagnosis of pagetoid urothelialintraepithelial neoplasia. Histopathology, 70(3):435–441. Morganella, S., Alexandrov, L. B., Glodzik, D., Zou, X., Davies, H.,Staaf, J., Sieuwerts, A. M., Brinkman, A. B., Martin, S., Ramakrishna,M., et al. (2016). The topography of mutational processes in breast cancergenomes. Nature communications, 7:11383. Morin, R. D., Johnson, N. A., Severson, T. M., Mungall, A. J., An,J., Goya, R., Paul, J. E., Boyle, M., Woolcock, B. W., Kuchenbauer, F.,et al. (2010). Somatic mutations altering ezh2 (tyr641) in follicular anddiffuse large b-cell lymphomas of germinal-center origin. Nature genetics,42(2):181.166Bibliography Mufti, A. and Jackson, R. (2016). Biopsy—what’s in the name? JAMAdermatology, 152(2):190–190. Muirhead, D., Aoun, P., Powell, M., Juncker, F., and Mollerup, J.(2010). Pathology economic model tool: a novel approach to workflowand budget cost analysis in an anatomic pathology laboratory. Archivesof pathology & laboratory medicine, 134(8):1164–1169. Mullane, S. A., Werner, L., Rosenberg, J., Signoretti, S., Callea, M.,Choueiri, T. K., Freeman, G. 
J., and Bellmunt, J. (2016). Correlationof apobec mrna expression with overall survival and pd-l1 expression inurothelial carcinoma. Scientific reports, 6:27702. Nadji, M., Gomez-Fernandez, C., Ganjei-Azar, P., and Morales, A. R.(2005). Immunohistochemistry of estrogen and progesterone receptorsreconsidered: experience with 5,993 breast cancers. American journal ofclinical pathology, 123(1):21–27. Nallanthighal, S., Heiserman, J. P., and Cheon, D.-J. (2019). Therole of the extracellular matrix in cancer stemness. Frontiers in Cell andDevelopmental Biology, 7. Navin, N. E. (2015). The first five years of single-cell cancer genomicsand beyond. Genome research, 25(10):1499–1507. Neto, A. G., Deavers, M. T., Silva, E. G., and Malpica, A. (2003).Metastatic tumors of the vulva: a clinicopathologic study of 66 cases. TheAmerican journal of surgical pathology, 27(6):799–804. Network, C. G. A. R. (2016). Comprehensive molecularcharacterization of papillary renal-cell carcinoma. New England Journalof Medicine, 374(2):135–145. Network, C. G. A. R. et al. (2017). Integrated genomic characterizationof oesophageal carcinoma. Nature, 541(7636):169–175. Nojadeh, J. N., Sharif, S. B., and Sakhinia, E. (2018). Microsatelliteinstability in colorectal cancer. EXCLI journal, 17:159. Oberg, J. A., Bender, J. L. G., Sulis, M. L., Pendrick, D., Sireci,A. N., Hsiao, S. J., Turk, A. T., Cruz, F. S. D., Hibshoosh, H.,Remotti, H., et al. (2016). Implementation of next generation sequencinginto pediatric hematology-oncology practice: moving beyond actionablealterations. Genome medicine, 8(1):133.167Bibliography Obiorah, I. E. and Ozdemirli, M. (2019). Clear cell sarcoma in unusualsites mimicking metastatic melanoma. World journal of clinical oncology,10(5):213. Ojala, K. A., Kilpinen, S. K., and Kallioniemi, O. P. (2011).Classification of unknown primary tumors with a data-driven methodbased on a large microarray reference database. Genome medicine,3(9):63. Onaiwu, C. O., Salcedo, M. 
P., Pessini, S. A., Munsell, M. F., Euscher,E. E., Reed, K. E., and Schmeler, K. M. (2017). Paget’s disease of thevulva: A review of 89 cases. Gynecologic oncology reports, 19:46–49. Peccerillo, F., Mandel, V. D., Di Tullio, F., Ciardo, S., Chester, J.,Kaleci, S., De Carvalho, N., Del Duca, E., Giannetti, L., Mazzoni, L., et al.(2019). Lesions mimicking melanoma at dermoscopy confirmed basal cellcarcinoma: Evaluation with reflectance confocal microscopy. Dermatology,235(1):35–44. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B.,Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.(2011). Scikit-learn: Machine learning in python. Journal of machinelearning research, 12(Oct):2825–2830. Pentheroudakis, G., Greco, F., and Pavlidis, N. (2009). Molecularassignment of tissue of origin in cancer of unknown primary may notpredict response to therapy or outcome: a systematic literature review.Cancer treatment reviews, 35(3):221–227. Perrone, G., Altomare, V., Zagami, M., Vulcano, E., Muzii, L.,Battista, C., Rabitti, C., and Muda, A. O. (2009). Breast-like vulvarlesion with concurrent breast cancer: a case report and critical literaturereview. in vivo, 23(4):629–634. Perrotto, J., Abbott, J. J., Ceilley, R. I., and Ahmed, I. (2010).The role of immunohistochemistry in discriminating primary fromsecondary extramammary paget disease. The American Journal ofDermatopathology, 32(2):137–143. Pfeifer, J. D. and Wick, M. R. (1995). The pathologic evaluation ofneoplastic diseases. Clinical Oncology, 2nd ed. Edited by Murphy GP,Lawrence W, Lenhard RE. Atlanta: American Cancer Society, 75:95. Pietri, E., Conteduca, V., Andreis, D., Massa, I., Melegari, E., Sarti,168BibliographyS., Cecconetto, L., Schirone, A., Bravaccini, S., Serra, P., et al. (2016).Androgen receptor signaling pathways as a target for breast cancertreatment. Endocrine-related cancer, 23(10):R485–R498. Pillai, R., Deeter, R., Rigl, C. T., Nystrom, J. S., Miller, M. 
H.,Buturovic, L., and Henner, W. D. (2011). Validation and reproducibilityof a microarray-based gene expression test for tumor identification informalin-fixed, paraffin-embedded specimens. The Journal of moleculardiagnostics, 13(1):48–56. Piros, E., Petak, I., Erdos, A., Hautman, J., and Lisziewicz, J. (2016).Market opportunity for molecular diagnostics in personalized cancertherapy. Handbook of clinical nanomedicine. Law, business, regulation,safety, and risk. Stanford: Taylor & Francis, pages 273–301. Posadas, E. M., Liel, M. S., Kwitkowski, V., Minasian, L., Godwin,A. K., Hussain, M. M., Espina, V., Wood, B. J., Steinberg, S. M., andKohn, E. C. (2007). A phase ii and pharmacodynamic study of gefitinibin patients with refractory or recurrent epithelial ovarian cancer. Cancer:Interdisciplinary International Journal of the American Cancer Society,109(7):1323–1330. Pstrąg, N., Ziemnicka, K., Bluyssen, H., and Wesoły, J. (2018).Thyroid cancers of follicular origin in a genomic light: in-depth overviewof common and unique molecular marker candidates. Molecular cancer,17(1):116. Quon, G. and Morris, Q. (2009). Isolate: a computational strategyfor identifying the primary origin of cancers using high-throughputsequencing. Bioinformatics, 25(21):2882–2889. Raghav, K., Mhadgut, H., McQuade, J. L., Lei, X., Ross, A.,Matamoros, A., Wang, H., Overman, M. J., and Varadhachary, G. R.(2016). Cancer of unknown primary in adolescents and young adults:Clinicopathological features, prognostic factors and survival outcomes.PLoS One, 11(5). Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H.,Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., et al.(2001). Multiclass cancer diagnosis using tumor gene expression signatures.Proceedings of the National Academy of Sciences, 98(26):15149–15154. Rapin, N., Bagger, F. O., Jendholm, J., Mora-Jensen, H., Krogh,A., Kohlmann, A., Thiede, C., Borregaard, N., Bullinger, L., Winther,169BibliographyO., et al. (2014). 
Comparing cancer vs normal gene expression profilesidentifies new disease entities and common transcriptional programs inaml patients. Blood, The Journal of the American Society of Hematology,123(6):894–904. Rapiti, E., Verkooijen, H. M., Vlastos, G., Fioretta, G.,Neyroud-Caspar, I., Sappino, A. P., Chappuis, P. O., and Bouchardy,C. (2006). Complete excision of primary breast tumor improves survivalof patients with metastatic breast cancer at diagnosis. J clin oncol,24(18):2743–2749. Rassy, E., Assi, T., and Pavlidis, N. (2020). Exploring the biologicalhallmarks of cancer of unknown primary: where do we stand today?British Journal of Cancer, pages 1–9. Reyna, M. A., Leiserson, M. D., and Raphael, B. J. (2018).Hierarchical hotnet: identifying hierarchies of altered subnetworks.Bioinformatics, 34(17):i972–i980. Ricketts, C. J., De Cubas, A. A., Fan, H., Smith, C. C., Lang,M., Reznik, E., Bowlby, R., Gibb, E. A., Akbani, R., Beroukhim,R., et al. (2018). The cancer genome atlas comprehensive molecularcharacterization of renal cell carcinoma. Cell reports, 23(1):313–326. Robinson, D. R., Wu, Y.-M., Lonigro, R. J., Vats, P., Cobain, E.,Everett, J., Cao, X., Rabban, E., Kumar-Sinha, C., Raymond, V., et al.(2017). Integrative clinical genomics of metastatic cancer. Nature,548(7667):297–303. Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y.,Zepeniuk, M., Benjamin, H., Shabes, N., Tabak, S., Levy, A., et al. (2008).Micrornas accurately identify cancer tissue origin. Nature biotechnology,26(4):462. Ross, J. S., Wang, K., Gay, L., Otto, G. A., White, E., Iwanik, K.,Palmer, G., Yelensky, R., Lipson, D. M., Chmielecki, J., Erlich, R. L.,Rankin, A. N., Ali, S. M., Elvin, J. A., Morosini, D., Miller, V. A., andStephens, P. J. (2015). Comprehensive Genomic Profiling of Carcinomaof Unknown Primary Site: New Routes to Targeted Therapies. JAMAoncology, 1(1):40–9. Rostoker, R., Abelson, S., Bitton-Worms, K., Genkin, I., Ben-Shmuel,S., Dakwar, M., Orr, Z. 
S., Caspi, A., Tzukerman, M., and LeRoith,170BibliographyD. (2015). Highly specific role of the insulin receptor in breast cancerprogression. Endocrine-related cancer, 22(2):145–157. Saadatpour, A., Lai, S., Guo, G., and Yuan, G.-C. (2015). Single-cellanalysis in cancer genomics. Trends in Genetics, 31(10):576–586. Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J.,and Cheetham, R. K. (2012). Strelka: accurate somatic small-variantcalling from sequenced tumor–normal sample pairs. Bioinformatics,28(14):1811–1817. Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay,T., and Buetow, K. H. (2009). Pid: the pathway interaction database.Nucleic acids research, 37(suppl_1):D674–D679. Schroten-Loef, C., Verhoeven, R., de Hingh, I., van de Wouw, A., vanLaarhoven, H., and Lemmens, V. (2018). Unknown primary carcinomain the netherlands: decrease in incidence and survival times remain poorbetween 2000 and 2012. European Journal of Cancer, 101:77–86. Seshacharyulu, P., Ponnusamy, M. P., Haridas, D., Jain, M., Ganti,A. K., and Batra, S. K. (2012). Targeting the egfr signaling pathway incancer therapy. Expert opinion on therapeutic targets, 16(1):15–31. Sethupathy, P., Corda, B., and Hatzigeorgiou, A. G. (2006). Tarbase:A comprehensive database of experimentally supported animal micrornatargets. Rna, 12(2):192–197. Shaoxian, T., Baohua, Y., Xiaoli, X., Yufan, C., Xiaoyu, T.,Hongfen, L., Rui, B., Xiangjie, S., Ruohong, S., and Wentao, Y. (2017).Characterisation of gata3 expression in invasive breast cancer: differencesin histological subtypes and immunohistochemically defined molecularsubtypes. Journal of clinical pathology, 70(11):926–934. Shendure, J., Findlay, G. M., and Snyder, M. W. (2019). Genomicmedicine–progress, pitfalls, and promise. Cell, 177(1):45–57. Slodkowska, E. A. and Ross, J. S. (2009). Mammaprint™ 70-genesignature: another milestone in personalized medical care for breast cancerpatients. 
Expert review of molecular diagnostics, 9(5):417–422. Sobin, L. H., Gospodarowicz, M. K., and Wittekind, C. (2011). TNMclassification of malignant tumours. John Wiley & Sons. Soh, K. P., Szczurek, E., Sakoparnig, T., and Beerenwinkel, N. (2017).171BibliographyPredicting cancer type from tumour dna signatures. Genome medicine,9(1):104. Søkilde, R., Vincent, M., Møller, A. K., Hansen, A., Høiby, P. E.,Blondal, T., Nielsen, B. S., Daugaard, G., Møller, S., and Litman, T.(2014). Efficient identification of mirnas for classification of tumor origin.The Journal of Molecular Diagnostics, 16(1):106–115. Song, H.-W. and Wilkinson, M. F. (2014). Transcriptional control ofspermatogonial maintenance and differentiation. In Seminars in cell &developmental biology, volume 30, pages 14–26. Elsevier. Stefanovic, S., Wirtz, R., Deutsch, T. M., Hartkopf, A., Sinn, P., Varga,Z., Sobottka, B., Sotiris, L., Taran, F.-A., Domschke, C., et al. (2017).Tumor biomarker conversion between primary and metastatic breastcancer: mrna assessment and its concordance with immunohistochemistry.Oncotarget, 8(31):51416. Su, A. I., Welsh, J. B., Sapinoso, L. M., Kern, S. G., Dimitrov, P.,Lapp, H., Schultz, P. G., Powell, S. M., Moskaluk, C. A., Frierson, H. F.,et al. (2001). Molecular classification of human carcinomas by use of geneexpression signatures. Cancer research, 61(20):7388–7393. Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attributionfor deep networks. In Proceedings of the 34th International Conferenceon Machine Learning-Volume 70, pages 3319–3328. JMLR. org. Suvà, M. L. and Tirosh, I. (2019). Single-cell rna sequencing in cancer:lessons learned and emerging challenges. Molecular cell, 75(1):7–12. Tan, S. Y. and Tatsumura, Y. (2015). George papanicolaou(1883–1962): discoverer of the pap smear. Singapore medical journal,56(10):586. Tang, W., Wan, S., Yang, Z., Teschendorff, A. E., and Zou, Q. (2017).Tumor origin detection with tissue-specific mirna and dna methylationmarkers. 
Bioinformatics, 34(3):398–406. Tang, W., Wan, S., Yang, Z., Teschendorff, A. E., and Zou, Q. (2018).Tumor origin detection with tissue-specific mirna and dna methylationmarkers. Bioinformatics, 34(3):398–406. Tessier-Cloutier, B., Asleh-Aburaya, K., Shah, V., McCluggage, W. G.,Tinker, A., and Gilks, C. B. (2017). Molecular subtyping of mammary-like172Bibliographyadenocarcinoma of the vulva shows molecular similarity to breastcarcinomas. Histopathology, 71(3):446–452. Thavarajah, R., Mudimbaimannar, V. K., Elizabeth, J., Rao, U. K.,and Ranganathan, K. (2012). Chemical and physical basics of routineformaldehyde fixation. Journal of oral and maxillofacial pathology:JOMFP, 16(3):400. Tian, R., Basu, M. K., and Capriotti, E. (2014). Contrastrank: anew method for ranking putative cancer driver genes and classification oftumor samples. Bioinformatics, 30(17):i572–i578. Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002).Diagnosis of multiple cancer types by shrunken centroids of geneexpression. Proceedings of the National Academy of Sciences,99(10):6567–6572. Titford, M. (2005). The long history of hematoxylin. Biotechnic &histochemistry, 80(2):73–78. Topol, E. (2019). Deep medicine: how artificial intelligence can makehealthcare human again. Hachette UK. Tormo, E., Adam-Artigues, A., Ballester, S., Pineda, B., Zazo, S.,González-Alonso, P., Albanell, J., Rovira, A., Rojo, F., Lluch, A., et al.(2017). The role of mir-26a and mir-30b in her2+ breast cancertrastuzumab resistance and regulation of the ccne2 gene. Scientific reports,7:41309. Toth, C. D., O’Rourke, J., and Goodman, J. E. (2017). Handbook ofdiscrete and computational geometry. Chapman and Hall/CRC. Tothill, R. W., Shi, F., Paiman, L., Bedo, J., Kowalczyk, A.,Mileshkin, L., Buela, E., Klupacs, R., Bowtell, D., and Byron, K. (2015).Development and validation of a gene expression tumour classifier forcancer of unknown primary. Pathology, 47(1):7–12. Tran, T. A., Deavers, M. T., Carlson, J. 
A., and Malpica, A. (2015).Collision of ductal carcinoma in situ of anogenital mammary-like glandsand vulvar sarcomatoid squamous cell carcinoma. International Journalof Gynecological Pathology, 34(5):487–494. Trédan, O., Wang, Q., Pissaloux, D., Cassier, P., de la Fouchardière,A., Fayette, J., Desseigne, F., Ray-Coquard, I., de la Fouchardière,C., Frappaz, D., et al. (2019). Molecular screening program to select173Bibliographymolecular-based recommended therapies for metastatic cancer patients:analysis from the profiler trial. Annals of Oncology, 30(5):757–765. Van der Putte, S. (1994). Mammary-like glands of the vulvaand their disorders. International Journal of Gynecological Pathology,13(2):150–160. Van Such, M., Lohr, R., Beckman, T., and Naessens, J. M. (2017).Extent of diagnostic agreement among medical referrals. Journal ofevaluation in clinical practice, 23(4):870–874. Varadhachary, G. R., Raber, M. N., Matamoros, A., and Abbruzzese,J. L. (2008). Carcinoma of unknown primary with a colon-cancerprofile—changing paradigm and emerging definitions. The lancet oncology,9(6):596–599. Varghese, A., Arora, A., Capanu, M., Camacho, N., Won, H., Zehir, A.,Gao, J., Chakravarty, D., Schultz, N., Klimstra, D., et al. (2017). Clinicaland molecular characterization of patients with cancer of unknownprimary in the modern era. Annals of Oncology, 28(12):3015–3021. Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu,J., Haussler, D., and Stuart, J. M. (2010). Inference of patient-specificpathway activities from multi-dimensional cancer genomics data usingparadigm. Bioinformatics, 26(12):i237–i245. Vennalaganti, P., Kanakadandi, V., Goldblum, J. R., Mathur, S. C.,Patil, D. T., Offerhaus, G. J., Meijer, S. L., Vieth, M., Odze, R. D.,Shreyas, S., et al. (2017). Discordance among pathologists in the unitedstates and europe in diagnosis of low-grade dysplasia for patients withbarrett’s esophagus. Gastroenterology, 152(3):564–570. Vitali, F., Li, Q., Schissler, A. 
G., Berghout, J., Kenost, C.,and Lussier, Y. A. (2019). Developing a ‘personalome’for precisionmedicine: emerging methods that compute interpretable effect sizes fromsingle-subject transcriptomes. Briefings in bioinformatics, 20(3):789–805. Vogelstein, B. and Kinzler, K. W. (2004). Cancer genes and thepathways they control. Nature medicine, 10(8):789–799. Wagle, M.-C., Castillo, J., Srinivasan, S., Holcomb, T., Yuen, K. C.,Kadel, E. E., Mariathasan, S., Halligan, D. L., Carr, A. R., Bylesjo, M.,et al. (2020). Tumor fusion burden as a hallmark of immune infiltrationin prostate cancer. Cancer Immunology Research.174Bibliography Wagner, A. H., Walsh, B., Mayfield, G., Tamborero, D., Sonkin, D.,Krysiak, K., Pons, J. D., Duren, R., Gao, J., McMurry, J., et al. (2018).A harmonized meta-knowledgebase of clinical interpretations of cancergenomic variants. BioRxiv, page 366856. Wang, H. L., Kim, C. J., Koo, J., Zhou, W., Choi, E. K., Arcega,R., Chen, Z. E., Wang, H., Zhang, L., and Lin, F. (2017). Practicalimmunohistochemistry in neoplastic pathology of the gastrointestinaltract, liver, biliary tract, and pancreas. Archives of Pathology andLaboratory Medicine, 141(9):1155–1180. Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M.,Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J. M.,Network, C. G. A. R., et al. (2013). The cancer genome atlas pan-canceranalysis project. Nature genetics, 45(10):1113. Weiss, L. M., Chu, P., Schroeder, B. E., Singh, V., Zhang, Y., Erlander,M. G., and Schnabel, C. A. (2013). Blinded comparator study ofimmunohistochemical analysis versus a 92-gene cancer classifier in thediagnosis of the primary site in metastatic tumors. The Journal ofMolecular Diagnostics, 15(2):263–269. Willman, J. H., Golitz, L. E., and Fitzpatrick, J. E. (2005). Vulvarclear cells of toker: precursors of extramammary paget’s disease. TheAmerican journal of dermatopathology, 27(3):185–188. Wolff, A. C., Hammond, M. E. H., Hicks, D. 
G., Dowsett, M.,McShane, L. M., Allison, K. H., Allred, D. C., Bartlett, J. M., Bilous, M.,Fitzgibbons, P., et al. (2013). Recommendations for human epidermalgrowth factor receptor 2 testing in breast cancer: American societyof clinical oncology/college of american pathologists clinical practiceguideline update. Archives of Pathology and Laboratory Medicine,138(2):241–256. Wu, Y.-M., Cieślik, M., Lonigro, R. J., Vats, P., Reimers, M. A.,Cao, X., Ning, Y., Wang, L., Kunju, L. P., de Sarkar, N., et al. (2018).Inactivation of cdk12 delineates a distinct immunogenic class of advancedprostate cancer. Cell, 173(7):1770–1782. Xia, A., Zhang, X.-Y., Wang, J., Yin, T., and Lu, X.-J. (2019). Tcell dysfunction in cancer immunity and immunotherapy. Frontiers inimmunology, 10:1719. Xu, Q., Chen, J., Ni, S., Tan, C., Xu, M., Dong, L., Yuan, L., Wang,175Q., and Du, X. (2016). Pan-cancer transcriptome analysis reveals a geneexpression signature for the identification of tumor tissue origin. ModernPathology, 29(6):546. Yagi, Y. and Gilbertson, J. R. (2008). A relationship between slidequality and image quality in whole slide imaging (wsi). In Diagnosticpathology, volume 3, page S12. BioMed Central. Yang, J., Nie, J., Ma, X., Wei, Y., Peng, Y., and Wei, X. (2019).Targeting pi3k in cancer: mechanisms and advances in clinical trials.Molecular cancer, 18(1):26. Zararsız, G., Goksuluk, D., Korkmaz, S., Eldem, V., Zararsiz, G. E.,Duru, I. P., and Ozturk, A. (2017). A comprehensive simulation study onclassification of rna-seq data. PloS one, 12(8). Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R.,Srinivasan, P., Gao, J., Chakravarty, D., Devlin, S. M., et al. (2017).Mutational landscape of metastatic cancer revealed from prospectiveclinical sequencing of 10,000 patients. Nature medicine, 23(6):703. Zhang, F., Chen, X. P., Zhang, W., Dong, H. H., Xiang,S., Zhang, W. G., and Zhang, B. X. (2008). 
Appendix

Additional materials for Chapter 4

Table 1: Important genes based on frequency analysis of gene weights for each neural network in SCOPE.

Cancer Code | Tissue | Organ-System | Genes
ACC | Tumour | Endocrine | CYP11A1, CYP17A1, CYP21A2, DLK1, GSTA1, IGF2, NPTX2, STAR
BLCA | Adjacent Normal | Urologic | ACTG2, CNN1, DES, DHRS2, GPX2, KRT13, KRT5, LY6D, OLFM4, PLA2G2A, S100P, SPRR3, UPK2
BLCA | Tumour | Urologic | AKR1C2, DES, DHRS2, GATA3, GPX2, KRT13, KRT17, KRT5, PSCA, S100P, SPINK1, UPK1B, UPK2
BRCA | Adjacent Normal | Breast | ADH1B, ADIPOQ, APOD, AZGP1, GATA3, KRT14, LPL, MUCL1, PIP, PLIN1, S100A1, SAA1, SCGB1D2, SCGB2A2, TFF1
BRCA | Tumour | Breast | AGR3, AZGP1, CALML5, CRABP2, EFHD1, FABP4, GATA3, KRT14, KRT6B, LTF, MMP11, MUCL1, NPY1R, PIP, SCGB2A2, SERPINA3, SPDEF, TFF1
CESC_CAD | Adjacent Normal | Gynecologic | DES
CESC_CAD | Tumour | Gynecologic | CEACAM5, CLDN3, KRT7, MMP11, PIGR, SCGB2A1
CESC_SCC | Adjacent Normal | Gynecologic | CNN1
CESC_SCC | Tumour | Gynecologic | CALML3, KRT13, KRT14, KRT19, KRT5, KRT6A, MMP11
CHOL | Tumour | Gastrointestinal | AGT, ALB, AMBP, CEACAM6, CRP, FGA, FGB, FGG, ORM1, REG1A, TM4SF4, TTR
COADREAD | Adjacent Normal | Gastrointestinal | AQP8, CA1, CEACAM7, CLCA1, DES, FABP1, FAM3D, GPX2, GUCA2A, KRT20, SLC26A3, SPINK4, ZG16
COADREAD | Tumour | Gastrointestinal | CDH17, CDX2, CEACAM5, CEACAM6, DPEP1, FABP1, FAM3D, GPX2, LGALS4, MUC13, MUC2, PLA2G2A, PPP1R1B, REG4, S100P, SPINK4, TSPAN8, VIL1
ESCA_EAC | Adjacent Normal | Gastrointestinal | ACTG2, DES, LIPF, PGA3, PGA4
ESCA_EAC | Tumour | Gastrointestinal | CEACAM5, CST1, KRT13, LGALS4, MALAT1, MUC13, PIGR, PLA2G2A, S100A7, SPRR3, TSPAN8, UBD
ESCA | Adjacent Normal | Gastrointestinal | KRT13
ESCA_SCC | Adjacent Normal | Gastrointestinal | SPRR1B
ESCA_SCC | Tumour | Gastrointestinal | CALML3, CST1, DES, KRT14, KRT5, LY6D, MALAT1, S100A7, SPRR1B, SPRR3, TRIM29
ESCA | Tumour | Gastrointestinal | CLDN18, CST1, MALAT1, REG1A, REG3A, SPINK1
FL | Tumour | Hematologic | CCL21
GBM | Tumour | CNS | AQP4, CHI3L1, GFAP
HNSC | Adjacent Normal | Head and Neck | ACTA1, CALML5, CKM, KRT13, KRT4, MB, MUC7, MYH2, MYL1, MYL2, MYLPF, PIP, PRB3, SAA1, SCGB3A1, SMR3B, STATH, TCAP, TGM3, TNNC2
HNSC | Tumour | Head and Neck | ACTA1, CALML3, CALML5, KRT13, KRT14, LGALS7, MMP1, SPRR2A, SPRR3
KICH | Adjacent Normal | Urologic | ALDOB, AQP2, FXYD2, UMOD
KICH | Tumour | Urologic | ATP6V0A4, ATP6V0D2, CDH16, DEFB1, RHCG, SPINK1, SPP1, TMEM213
KIRC | Adjacent Normal | Urologic | AQP2, CDH16, SLC34A2, UMOD
KIRC | Tumour | Urologic | ANGPTL4, CA12, CA9, DEFB1, EGLN3, ESM1, FXYD2, GSTA1, NAT8
KIRP | Adjacent Normal | Urologic | AQP2, PIGR, UMOD
KIRP | Tumour | Urologic | C19orf33, MAL, MMP7, PIGR, SST, WFDC2
LAML | Tumour | Hematologic | AZU1, CSF3R, FOSB, MPO, PRTN3, RNASE2, S100A8
LGG | Tumour | CNS | EEF1A1P9, GFAP, PTPRZ1
LIHC | Adjacent Normal | Gastrointestinal | IGFBP1
LIHC | Tumour | Gastrointestinal | ALB, APCS, APOA2, APOC3, CRP, FGA, FGB, GC, HULC, ITIH2, RBP4, TF, TM4SF4, UBD, VTN
LUAD | Adjacent Normal | Thoracic | HBA1, NAPSA, SCGB1A1, SCGB3A1, SCGB3A2, SFTPA1, SFTPB, SFTPC, SFTPD, SLPI
LUAD | Tumour | Thoracic | C8orf4, CEACAM5, CRABP2, FGG, NAPSA, PGC, S100P, SCGB1A1, SCGB3A1, SCGB3A2, SFTA2, SFTPA1, SFTPA2, SFTPB, SLC34A2, SPINK1
LUSC | Adjacent Normal | Thoracic | CCL21, NAPSA, RPS4Y1, SCGB1A1, SCGB3A1, SCGB3A2, SFTA2, SFTPA1, SFTPA2, SFTPB, SFTPC, SFTPD, SLC34A2
LUSC | Tumour | Thoracic | AKR1C2, CALML3, CES1, KRT15, KRT16, KRT19, KRT5, KRT6A, KRT6B, NAPSA, NTS, SCGB1A1, SCGB3A2, SFTPA1, SFTPA2, SFTPB, SFTPC, SPRR2A
MB-Adult | Tumour | CNS | GFAP, STMN2
MESO | Tumour | Thoracic | C19orf33, CALB2, EFEMP1, ITLN1, KRT19, KRT7, MSLN, UPK3B
OV | Tumour | Gynecologic | CHI3L1, CLDN3, FOLR1, KLK6, KLK7, MALAT1, MSLN, PAX8, SCGB2A1, SOX17, SST
PAAD | Adjacent Normal | Gastrointestinal | CELA3A, CPA1, CPB1, CRP, CTRB1, CTRB2, CTRC, CTSE, GCG, INS, PNLIP, PPY, PRSS1, REG1A, REG3A, TTR
PAAD | Tumour | Gastrointestinal | AGR2, CEACAM5, CHGB, CTRB1, CTRB2, GCG, INS, PNLIP, PPY, REG1A, REG4, S100P, SFRP2, SPINK1, SST, TFF1, TFF2, TTR
PCPG | Adjacent Normal | Endocrine | CYP11B1, CYP17A1, DLK1, GSTA1, STAR
PCPG | Tumour | Endocrine | CHGA, CHGB, DBH, DLK1, NPY, PENK
PRAD | Adjacent Normal | Urologic | ACPP, KLK2, KLK3, KLK4, NPY, OLFM4, PIP, SEMG1
PRAD | Tumour | Urologic | ACTG2, AZGP1, DES, FOLH1, FOXA1, KLK2, KLK3, KLK4, NKX3-1, NPY, PLA2G2A, SLC45A3
SARC | Tumour | Soft Tissue | DLK1
SKCM | Adjacent Normal | Skin | DCT, MLANA, PRAME, TYR
SKCM | Tumour | Skin | APOD, DCT, EDNRB, KRT6B, MLANA, PLP1, PRAME, S100A1, SERPINE2, SOX10, TYR, TYRP1, VGF
STAD | Adjacent Normal | Gastrointestinal | ACTG2, APOA1, APOA4, CLDN18, DES, GKN1, HSPB6, PGA4, PGC, PI3, REG3A
STAD | Tumour | Gastrointestinal | ACTG2, CEACAM6, CST1, MALAT1, PGC, REG4, SPINK1, TFF1, TFF3
TFRI_GBM_NCL | Tumour | CNS | MALAT1, PCDHGA1, PCDHGA8, PCDHGC4, PMP2
TGCT | Tumour | Urologic | DPPA3, DPPA5, GDF3, NANOG, POU5F1
THCA | Adjacent Normal | Endocrine | CCL21, HBA2, MT1G, PAX8, TG, TPO
THCA | Tumour | Endocrine | C16orf89, CLIC3, NKX2-1, S100A1, SFTA3, SFTPB, TG, TPO, ZCCHC12
THYM | Adjacent Normal | Hematologic | CALML3, CCL25, KRT5
THYM | Tumour | Hematologic | CALML3, CCL25, DNTT, KRT14, KRT15, KRT17, KRT19, KRT5, PAX1
UCEC | Adjacent Normal | Gynecologic | CNN1, DES
UCEC | Tumour | Gynecologic | MMP11, MSX1, PAX8, SCGB1D2, SCGB2A1, SFN, VTCN1
UCS | Tumour | Gynecologic | CRABP1, DLK1, PCOLCE, PRAME
UVM | Tumour | Head and Neck | CITED1, MLANA, SOX10, TYR, TYRP1

Figure 1: Example output from SCOPE for a sarcomatoid mesothelioma, predicted with split confidence as mesothelioma and sarcoma.

Figure 2: Mean prediction accuracy of SCOPE as RPKM values of various fractions of genes are set to 0 in the input RNA-Seq data. Grey bars around mean points indicate standard error bounds. Black line indicates the line of best fit (loess). At a given threshold n% genes in input were randomly set to zero.
This was repeated 10 times for each n in (10, 20, 30, 40, 50, 60, 70, 80, 90, 99).

Additional materials for Chapter 5

Figure 3: UMAP projections of PIE profiles for 3,963 biochemical pathways, for samples in the TCGA cohort of primary tumours. The projections are coloured by tumour-type.

Table 2: Top 25 statistically identified pathways for each common cancer and normal tissue category in TCGA, based on PIE scores.
Organ-System of origin | Cancer Code | Type | Top 25 pathways
Breast | BRCA | Normal | A tetrasaccharide linker sequence is required for GAG synthesis, AP-1 transcription factor network, bone remodeling, cadmium induces dna synthesis and proliferation in macrophages, Chondroitin sulfate biosynthesis, Chondroitin sulfate/dermatan sulfate metabolism, Dermatan sulfate biosynthesis, EGFR Inhibitor Pathway, Pharmacodynamics, Estrogen signaling pathway, Heparan sulfate/heparin (HS-GAG) metabolism, Hormone-sensitive lipase (HSL)-mediated triacylglycerol hydrolysis, il 3 signaling pathway, Lipid digestion, mobilization, and transport, Miscellaneous transport and binding events, nerve growth factor pathway (ngf), PPAR signaling pathway - Homo sapiens (human), Quercetin and Nf-kB- AP-1 Induced Cell Apoptosis, Regulation of lipolysis in adipocytes - Homo sapiens (human), RIG-I/MDA5 mediated induction of IFN-alpha/beta pathways, TRAF6 mediated NF-kB activation, Transcriptional regulation of white adipocyte differentiation, Translation Factors, Transport of fatty acids, tsp-1 induced apoptosis in microvascular endothelial cell, west nile virus
Breast | BRCA | Tumour | Amoebiasis - Homo sapiens (human), Collagen biosynthesis and modifying enzymes, Collagen formation, ECM proteoglycans, Endothelins, IL4-mediated signaling events, Inflammatory Response Pathway, Insulin Signaling, Integrins in angiogenesis, Iron metabolism in placenta, Methionine and cysteine metabolism, miR-targeted genes in muscle cell - TarBase, miRNA targets in ECM and membrane receptors, Miscellaneous transport and binding events, pi3k_pathway, PI3K-Akt signaling pathway - Homo sapiens (human), Platelet Aggregation Inhibitor Pathway, Pharmacodynamics, Protein processing in endoplasmic reticulum - Homo sapiens (human), Senescence and Autophagy in Cancer, Signaling by Retinoic Acid, Syndecan-1-mediated signaling events, Validated nuclear estrogen receptor alpha network, Validated targets of C-MYC transcriptional repression, VEGFR3 signaling in lymphatic endothelium, Vitamin A and Carotenoid Metabolism
Central Nervous System | GBM | Normal | Acetylcholine Neurotransmitter Release Cycle, Asymmetric localization of PCP proteins, Autodegradation of the E3 ubiquitin ligase COP1, beta-catenin independent WNT signaling, Budding and maturation of HIV virion, CREB phosphorylation through the activation of Ras, Dopamine Neurotransmitter Release Cycle, Effects of Botulinum toxin, GABA synthesis, release, reuptake and degradation, Gastrin-CREB signalling pathway via PKC and MAPK, Generic Transcription Pathway, Ion channel transport, Neuronal System, Neurotransmitter Release Cycle, NGF signalling via TRKA from the plasma membrane, Norepinephrine Neurotransmitter Release Cycle, Serotonin Neurotransmitter Release Cycle, signal dependent regulation of myogenesis by corepressor mitr, Signaling by EGFR, Signaling by ERBB2, Synaptic vesicle cycle - Homo sapiens (human), Synaptic Vesicle Pathway, Toxicity of botulinum toxin type C (BoNT/C), Transmission across Chemical Synapses, Uptake and actions of bacterial toxins
Central Nervous System | GBM | Tumour | Anchoring of the basal body to the plasma membrane, Assembly of the primary cilium, Cell Cycle, Cell Cycle, Mitotic, cell_cycle_pathway, EPO signaling, G2/M Transition, Gap junction - Homo sapiens (human), IL-7 signaling, JAK STAT pathway and regulation, Loss of Nlp from mitotic centrosomes, Loss of proteins required for interphase microtubule organization from the centrosome, Mammary gland development pathway - Involution (Stage 4 of 4), miR-targeted genes in epithelium - TarBase, miR-targeted genes in leukocytes - TarBase, miR-targeted genes in squamous cell - TarBase, Mitotic G2-G2/M phases, Organelle biogenesis and maintenance, p73 transcription factor network, Parkin-Ubiquitin Proteasomal System pathway, Pyrimidine metabolism, Regulation of PLK1 Activity at G2/M Transition, RMTs methylate histone arginines, stathmin and breast cancer resistance to antimicrotubule agents, tgf_pathway
Central Nervous System | LGG | Tumour | 3'-UTR-mediated translational regulation, Cap-dependent Translation Initiation, Chylomicron-mediated lipid transport, Cytoplasmic Ribosomal Proteins, Eukaryotic Translation Initiation, Eukaryotic Translation Termination, GTP hydrolysis and joining of the 60S ribosomal subunit, HDL-mediated lipid transport, L13a-mediated translational silencing of Ceruloplasmin expression, Lipid digestion, mobilization, and transport, Lipoprotein metabolism, Neural Crest Differentiation, Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC), Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC), Nonsense-Mediated Decay (NMD), Platelet degranulation, prion pathway, Response to elevated platelet cytosolic Ca2+, Retinoid metabolism and transport, Spinal Cord Injury, SRP-dependent cotranslational protein targeting to membrane, Statin Pathway, Statin Pathway, Pharmacodynamics, Visual phototransduction, Vitamin B12 Metabolism
Endocrine | ACC | Tumour | 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, Activated NOTCH1 Transmits Signal to the Nucleus, Adipogenesis, Adrenal Hyperplasia Type 3 or Congenital Adrenal Hyperplasia due to 21-hydroxylase Deficiency, Adrenal Hyperplasia Type 5 or Congenital Adrenal Hyperplasia due to 17 Alpha-hydroxylase Deficiency, Apparent mineralocorticoid excess syndrome, Cardiac Progenitor Differentiation, Congenital Lipoid Adrenal Hyperplasia (CLAH) or Lipoid CAH, Corticosterone methyl oxidase I deficiency (CMO I), Corticosterone methyl oxidase II deficiency - CMO II, Cytochrome P450 - arranged by substrate type, Endogenous sterols, FOXA2 and FOXA3 transcription factor networks, IGF-Core, Metabolism of steroid hormones and vitamin D, Notch signaling pathway, Ovarian steroidogenesis - Homo sapiens (human), Posttranslational regulation of adherens junction stability and dissassembly, Pregnenolone biosynthesis, SHC-related events triggered by IGF1R, Steroid hormones, superpathway of steroid hormone biosynthesis
Endocrine | PCPG | Normal | 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3'-UTR-mediated translational regulation, Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S, Adaptive Immune System, Adherens junction - Homo sapiens (human), Adrenal Hyperplasia Type 3 or Congenital Adrenal Hyperplasia due to 21-hydroxylase Deficiency, Adrenal Hyperplasia Type 5 or Congenital Adrenal Hyperplasia due to 17 Alpha-hydroxylase Deficiency, Androgen and estrogen biosynthesis and metabolism, Androgen and Estrogen Metabolism, androgen biosynthesis, Androgen biosynthesis, antigen processing and presentation, Antigen processing and presentation - Homo sapiens (human), Apparent mineralocorticoid excess syndrome, Aromatase deficiency, Arrhythmogenic Right Ventricular Cardiomyopathy, Arrhythmogenic right ventricular cardiomyopathy (ARVC) - Homo sapiens (human), Axon guidance, Bacterial invasion of epithelial cells - Homo sapiens (human), BCR, Biological oxidations
Endocrine | PCPG | Tumour | FGFR2, FGFR3, FGFR4, β-arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 2-Methyl-3-Hydroxybutryl CoA Dehydrogenase Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3-Hydroxy-3-Methylglutaryl-CoA Lyase Deficiency, 3-hydroxyisobutyric acid dehydrogenase deficiency, 3-hydroxyisobutyric aciduria, 3-Methylcrotonyl Coa Carboxylase Deficiency Type I, 3-Methylglutaconic Aciduria Type I, 3-Methylglutaconic Aciduria Type III, Activated NOTCH1 Transmits Signal to the Nucleus, Adipogenesis, FOXA2 and FOXA3 transcription factor networks, Metabolism, Notch signaling pathway, notch_pathway, Signal Transduction, Signaling by NOTCH, Signaling by NOTCH1
Endocrine | THCA | Normal | Autoimmune thyroid disease - Homo sapiens (human), BCR, cadmium induces dna synthesis and proliferation in macrophages, Calcineurin-regulated NFAT-dependent transcription in lymphocytes, calcium signaling by hbx of hepatitis b virus, Calcium signaling in the CD4+ TCR pathway, Choline metabolism in cancer - Homo sapiens (human), Colorectal cancer - Homo sapiens (human), HIF-1-alpha transcription factor network, IL2-mediated signaling events, IL4-mediated signaling events, LPA receptor mediated events, mets affect on macrophage differentiation, Oncostatin M Signaling Pathway, oxidative stress induced gene expression via nrf2, Quercetin and Nf-kB- AP-1 Induced Cell Apoptosis, RANKL-RANK Signaling Pathway, repression of pain sensation by the transcriptional regulator dream, role of egf receptor transactivation by gpcrs in cardiac hypertrophy, T cell receptor signaling pathway - Homo sapiens (human), Thyroid hormone synthesis, Thyroid hormone synthesis - Homo sapiens (human), Thyroxine (Thyroid Hormone) Production, trefoil factors initiate mucosal healing, tsp-1 induced apoptosis in microvascular endothelial cell
Endocrine | THCA | Tumour | FGFR2, FGFR3, FGFR4, β-arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), Assembly of collagen fibrils and other multimeric structures, Collagen degradation, Collagen formation, Complement and Coagulation Cascades, cpdb_cancer_related_pathway, Degradation of the extracellular matrix, Extracellular matrix organization, Gastric cancer network 2, Hemostasis, Lysosome - Homo sapiens (human), Platelet activation, signaling and aggregation, Platelet degranulation, Prostaglandin Synthesis and Regulation, Renin secretion - Homo sapiens (human), Response to elevated platelet cytosolic Ca2+, Thyroid hormone synthesis - Homo sapiens (human), Trafficking and processing of endosomal TLR, Validated targets of C-MYC transcriptional repression, Vitamin D Receptor Pathway
Gastro-intestinal | CHOL | Normal | ABC transporters in lipid homeostasis, ABC-family proteins mediated transport, Bile acid and bile salt metabolism, Binding and Uptake of Ligands by Scavenger Receptors, EGFR1, Folate Metabolism, FOXA2 and FOXA3 transcription factor networks, HDL-mediated lipid transport, Hemostasis, Human Complement System, Lipid digestion, mobilization, and transport, Lipoprotein metabolism, Metabolism of lipids and lipoproteins, Platelet activation, signaling and aggregation, Platelet degranulation, Recycling of bile acids and salts, Response to elevated platelet cytosolic Ca2+, Scavenging of heme from plasma, Selenium Micronutrient Network, SLC-mediated transmembrane transport, Transmembrane transport of small molecules, Transport of organic anions, Transport of vitamins, nucleosides, and related molecules, Urokinase-type plasminogen activator (uPA) and uPAR-mediated signaling, Vitamin B12 Metabolism
Gastro-intestinal | CHOL | Tumour | Activation of C3 and C5, Alpha4 beta1 integrin signaling events, Alpha9 beta1 integrin signaling events, Alternative complement activation, alternative complement pathway, BDNF signaling pathway, Complement cascade, EGFR1, FGF signaling pathway, G alpha (i) signalling events, GPCR downstream signaling, Human Complement System, IL-6 signaling pathway, Immune System, Initial triggering of complement, Innate Immune System, lectin induced complement pathway, Osteopontin Signaling, Osteopontin-mediated events, Overview of nanoparticle effects, Regulation of toll-like receptor signaling pathway, regulators of bone mineralization, Toll-like receptor signaling pathway, Validated nuclear estrogen receptor beta network, Vitamin D Receptor Pathway
Gastro-intestinal | COADREAD | Normal | cAMP signaling pathway - Homo sapiens (human), cGMP-PKG signaling pathway - Homo sapiens (human), DAP12 interactions, DAP12 signaling, EPH-Ephrin signaling, EPHA-mediated growth cone collapse, Host Interactions of HIV factors, IL12-mediated signaling events, IL4-mediated signaling events, Integrin-linked kinase signaling, Intestinal immune network for IgA production - Homo sapiens (human), Muscle contraction, Nef-mediates down modulation of cell surface receptors by recruiting them to clathrin adapters, Pancreatic secretion - Homo sapiens (human), RHO GTPases activate CIT, RHO GTPases activate PAKs, RHO GTPases activate PKNs, RHO GTPases Activate ROCKs, Sema4D in semaphorin signaling, Sema4D induced cell migration and growth-cone collapse, Semaphorin interactions, Smooth Muscle Contraction, The role of Nef in HIV-1 replication and disease pathogenesis, Vascular smooth muscle contraction - Homo sapiens (human), Vitamin D Receptor Pathway
Gastro-intestinal | COADREAD | Tumour | 3'-UTR-mediated translational regulation, Cap-dependent Translation Initiation, Cytoplasmic Ribosomal Proteins, Eukaryotic Translation Elongation, Eukaryotic Translation Initiation, Eukaryotic Translation Termination, Formation of a pool of free 40S subunits, Formation of the ternary complex, and subsequently, the 43S complex, GTP hydrolysis and joining of the 60S ribosomal subunit, IL11, IL2 signaling events mediated by PI3K, Interleukin-11 Signaling Pathway, L13a-mediated translational silencing of Ceruloplasmin expression, mtor signaling pathway, mTORC1-mediated signalling, Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC), Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC), Nonsense-Mediated Decay (NMD), Peptide chain elongation, PI3K Cascade, PKB-mediated events, Ribosomal scanning and start codon recognition, Ribosome - Homo sapiens (human), SRP-dependent cotranslational protein targeting to membrane, Translation
Gastro-intestinal | ESCA | Tumour | Adaptive Immune System, Antigen processing and presentation - Homo sapiens (human), B cell receptor signaling pathway - Homo sapiens (human), Class I MHC mediated antigen processing & presentation, DAP12 signaling, Disease, Glycine, serine, alanine and threonine metabolism, GPCR ligand binding, HIV Infection, HIV Life Cycle, Host Interactions of HIV factors, Immune System, Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell, Infectious disease, Integrin-mediated Cell Adhesion, Metabolism of non-coding RNA, Notch Signaling Pathway, notch_pathway, Pyrimidine metabolism, Regulation of Telomerase, Signaling by GPCR, snRNP Assembly, TGF-beta signaling pathway - Homo sapiens (human), Vascular smooth muscle contraction - Homo sapiens (human), Vitamin E metabolism
Gastro-intestinal | ESCA_EAC | Tumour | a6b1 and a6b4 Integrin signaling, Adherens junction - Homo sapiens (human), Aminosugars metabolism, Arrhythmogenic Right Ventricular Cardiomyopathy, Arrhythmogenic right ventricular cardiomyopathy (ARVC) - Homo sapiens (human), Bacterial invasion of epithelial cells - Homo sapiens (human), Collagen degradation, Dilated cardiomyopathy - Homo sapiens (human), eukaryotic protein translation, FAS pathway and Stress induction of HSP regulation, Fibroblast growth factor-1, Focal Adhesion, Focal adhesion - Homo sapiens (human), Hippo signaling pathway - Homo sapiens (human), Hypertrophic cardiomyopathy (HCM) - Homo sapiens (human), Integrin, Mitotic Prometaphase, Myometrial Relaxation and Contraction Pathways, Pathways in cancer - Homo sapiens (human), Protein processing in endoplasmic reticulum - Homo sapiens (human), Proteoglycans in cancer - Homo sapiens (human), Rap1 signaling pathway - Homo sapiens (human), Resolution of Sister Chromatid Cohesion, RHO GTPases Activate Formins, Shigellosis - Homo sapiens (human)
Gastro-intestinal | ESCA_SCC | Tumour | Adherens junction - Homo sapiens (human), AndrogenReceptor, Arrhythmogenic Right Ventricular Cardiomyopathy, Arrhythmogenic right ventricular cardiomyopathy (ARVC) - Homo sapiens (human), AUF1 (hnRNP D0) destabilizes mRNA, Axon guidance, Bacterial invasion of epithelial cells - Homo sapiens (human), Developmental Biology, Dilated cardiomyopathy - Homo sapiens (human), EPH-Ephrin signaling, EPHB-mediated forward signaling, FAS pathway and Stress induction of HSP regulation, Fcgamma receptor (FCGR) dependent phagocytosis, Focal Adhesion, Focal adhesion - Homo sapiens (human), Hippo signaling pathway - Homo sapiens (human), Hypertrophic cardiomyopathy (HCM) - Homo sapiens (human), IL6, Influenza A - Homo sapiens (human), Leukocyte transendothelial migration - Homo sapiens (human), Myometrial Relaxation and Contraction Pathways, Oxytocin signaling pathway - Homo sapiens (human), p38 mapk signaling pathway, p38 MAPK Signaling Pathway, Signal Transduction
Gastro-intestinal | LIHC | Normal | Adenine phosphoribosyltransferase deficiency (APRT), Adenosine Deaminase Deficiency, Adenylosuccinate Lyase Deficiency, AICA-Ribosiduria, Androgen and estrogen biosynthesis and metabolism, Chemical carcinogenesis - Homo sapiens (human), Chylomicron-mediated lipid transport, Complement and Coagulation Cascades, Complement and coagulation cascades - Homo sapiens (human), Drug metabolism - cytochrome P450 - Homo sapiens (human), Galactose metabolism, Gout or Kelley-Seegmiller Syndrome, Heart Development, Lesch-Nyhan Syndrome (LNS), mechanism of gene regulation by peroxisome proliferators via ppara, Metabolism of nucleotides, Metabolism of xenobiotics by cytochrome P450 - Homo sapiens (human), Mitochondrial DNA depletion syndrome, Molybdenum Cofactor Deficiency, Myoadenylate deaminase deficiency, PPAR Alpha Pathway, Purine Metabolism, Purine Nucleoside Phosphorylase Deficiency, Regulation of lipid metabolism by Peroxisome proliferator-activated receptor alpha (PPARalpha), Retinoid metabolism and transport
Gastro-intestinal | LIHC | Tumour | Adaptive Immune System, Alzheimer's disease - Homo sapiens (human), Alzheimers Disease, Apoptosis-related network due to altered Notch3 in ovarian cancer, Cell Cycle, cell_cycle_pathway, Chylomicron-mediated lipid transport, Clathrin derived vesicle budding, Complement and Coagulation Cascades, cpdb_cancer_related_pathway, Golgi Associated Vesicle Biogenesis, Immune System, Innate Immune System, Integrated Pancreatic Cancer Pathway, Iron uptake and transport, Membrane Trafficking, Metabolism of proteins, Mineral absorption - Homo sapiens (human), notch_pathway, NRF2 pathway, Nuclear Receptors Meta-Pathway, Peptide chain elongation, Scavenging by Class A Receptors, Statin Pathway, trans-Golgi Network Vesicle Budding
Gastro-intestinal | PAAD | Normal | FGFR2, FGFR3, FGFR4, 3'-UTR-mediated translational regulation, activation of csk by camp-dependent protein kinase inhibits signaling through the t cell receptor, Activation of gene expression by SREBF (SREBP), Adaptive Immune System, Adherens junction - Homo sapiens (human), African trypanosomiasis - Homo sapiens (human), Alendronate Action Pathway, Alpha4 beta1 integrin signaling events, Alpha9 beta1 integrin signaling events, Alzheimer's disease - Homo sapiens (human), Amyotrophic lateral sclerosis (ALS), Amyotrophic lateral sclerosis (ALS) - Homo sapiens (human), Androgen receptor signaling pathway, AndrogenReceptor, Angiogenesis overview, antigen processing and presentation, Antigen processing and presentation - Homo sapiens (human), Arginine and proline metabolism - Homo sapiens (human), Arrhythmogenic Right Ventricular Cardiomyopathy, Arrhythmogenic right ventricular cardiomyopathy (ARVC) - Homo sapiens (human), Aryl Hydrocarbon Receptor Pathway, Atorvastatin Action Pathway
Gastro-intestinal | PAAD | Tumour | AGE-RAGE pathway, Anti-diabetic Drug Potassium Channel Inhibitors Pathway, Pharmacodynamics, Arf6 trafficking events, Differentiation Pathway, FoxO signaling pathway - Homo sapiens (human), Gastric cancer network 2, Glibenclamide Action Pathway, Gliclazide Action Pathway, GPCR signaling-cholera toxin, GPCR signaling-G alpha i, GPCR signaling-G alpha q, GPCR signaling-G alpha s Epac and ERK, GPCR signaling-G alpha s PKA and ERK, GPCR signaling-pertussis toxin, growth hormone signaling pathway, Insulin Pathway, Insulin processing, Insulin receptor recycling, Insulin secretion - Homo sapiens (human), insulin signaling pathway, Insulin Signalling, IRS activation, Leucine Stimulation on Insulin Signaling, Maturity onset diabetes of the young - Homo sapiens (human), Senescence and Autophagy in Cancer
Gastro-intestinal | STAD | Normal | Alcoholism - Homo sapiens (human), Alpha-defensins, Arachidonic acid metabolism, Aryl Hydrocarbon Receptor Pathway, Cytokine-cytokine receptor interaction - Homo sapiens (human), EPHA-mediated growth cone collapse, Fatty acid, triacylglycerol, and ketone body metabolism, Gastric pepsin release, GPCR signaling-cholera toxin, GPCRs, Class A Rhodopsin-like, Leukotriene metabolism, Metabolism, Metabolism of lipids and lipoproteins, Mitochondrial translation, Mitochondrial translation elongation, Mitochondrial translation initiation, Mitochondrial translation termination, Neuroactive ligand-receptor interaction - Homo sapiens (human), RHO GTPases activate CIT, RHO GTPases activate PAKs, RHO GTPases activate PKNs, RHO GTPases Activate ROCKs, Smooth Muscle Contraction, Transcription, Vascular smooth muscle contraction - Homo sapiens (human)
Gastro-intestinal | STAD | Tumour | Adipogenesis, Alzheimer's disease - Homo sapiens (human), Alzheimers Disease, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, gluconeogenesis, Glycerol Phosphate Shuttle, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, Glycogenosis, Type IC, Glycogenosis, Type VII. Tarui disease, glycolysis, Glycolysis, Glycolysis Gluconeogenesis, Huntington's disease - Homo sapiens (human), Mitochondrial Electron Transport Chain, NADH repair, Oxidative phosphorylation - Homo sapiens (human), PCP/CE pathway, Phosphoenolpyruvate carboxykinase deficiency 1 (PEPCK1), repair_pathway, SIDS Susceptibility Pathways, Signaling by Wnt, Triosephosphate isomerase
Gynecologic | CESC_CAD | Tumour | Cell Cycle, Cell Cycle, Mitotic, cell_cycle_pathway, EGF-EGFR Signaling Pathway, Fatty acid, triacylglycerol, and ketone body metabolism, Gastric cancer network 2, IL6, insulin Mam, Insulin Signaling, Interferon type I signaling pathways, JAK STAT pathway and regulation, jak_pathway, Kit receptor signaling pathway, Leptin signaling pathway, M Phase, MAPK Signaling Pathway, MAPK signaling pathway - Homo sapiens (human), mapk_pathway, Metabolism of amino acids and derivatives, miR-targeted genes in lymphocytes - TarBase, miR-targeted genes in squamous cell - TarBase, Prostaglandin Synthesis and Regulation, Signaling mediated by p38-alpha and p38-beta, Signalling by NGF, Vitamin D Receptor Pathway
Gynecologic | CESC_SCC | Tumour | a6b1 and a6b4 Integrin signaling, Activation of BAD and translocation to mitochondria, Activation of BH3-only proteins, Cell Cycle, Cell cycle - Homo sapiens (human), cell cycle: g2/m checkpoint, cell_cycle_pathway, Chk1/Chk2(Cds1) mediated inactivation of Cyclin B:Cdk1 complex, Class I PI3K signaling events mediated by Akt, DNA Damage Response, DNA strand elongation, E2F transcription factor network, EGF, Epigenetic regulation of gene expression, estrogen responsive protein efp controls cell cycle and breast tumors growth, Fas, FGF, FoxO family signaling, G1 to S cell cycle control, Intrinsic Pathway for Apoptosis, Meiosis, miRNA Regulation of DNA Damage Response, Mitotic Prometaphase, Nucleotide Excision Repair, Nucleotide excision repair - Homo sapiens (human)
Gynecologic | OV | Tumour | Alzheimer's disease - Homo sapiens (human), Alzheimers Disease, Cori Cycle, downregulated of mta-3 in er-negative breast tumors, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, Gluconeogenesis, Glucose metabolism, Glycerol Phosphate Shuttle, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, Glycogenosis, Type IC, Glycogenosis, Type VII. Tarui disease, glycolysis, Glycolysis / Gluconeogenesis - Homo sapiens (human), Glycolysis Gluconeogenesis, HIF-1 signaling pathway - Homo sapiens (human), Iron metabolism in placenta, Methionine and cysteine metabolism, Mitochondrial Electron Transport Chain, NADH repair, Phosphoenolpyruvate carboxykinase deficiency 1 (PEPCK1), Triosephosphate isomerase, Validated targets of C-MYC transcriptional repression
Gynecologic | UCEC | Normal | Alpha6Beta4Integrin, Apoptosis-related network due to altered Notch3 in ovarian cancer, Apoptotic cleavage of cellular proteins, Apoptotic execution phase, Aurora B signaling, Caspase Cascade in Apoptosis, Caspase-mediated cleavage of cytoskeletal proteins, Cellular response to heat stress, EGFR1, HSF1 activation, Legionellosis - Homo sapiens (human), MicroRNAs in cancer - Homo sapiens (human), Muscle contraction, notch_pathway, PDGFR-beta signaling pathway, PLK1 signaling events, Primary Focal Segmental Glomerulosclerosis FSGS, Programmed Cell Death, Spinal Cord Injury, Striated Muscle Contraction, TCR Signaling Pathway, west nile virus, Wnt Canonical, Wnt Mammals, Wnt signaling pathway - Homo sapiens (human)
Gynecologic | UCEC | Tumour | Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, Glucose metabolism, Glycerol Phosphate Shuttle, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, Glycogenosis, Type IC, Glycogenosis, Type VII. Tarui disease, glycolysis, Glycolysis, Glycolysis and Gluconeogenesis, Glycolysis Gluconeogenesis, Human Complement System, Iron metabolism in placenta, Methionine and cysteine metabolism, NADH repair, Phosphoenolpyruvate carboxykinase deficiency 1 (PEPCK1), Protein processing in endoplasmic reticulum - Homo sapiens (human), superpathway of conversion of glucose to acetyl CoA and entry into the TCA cycle, Triosephosphate isomerase, Validated targets of C-MYC transcriptional activation, Validated targets of C-MYC transcriptional repression, Warburg Effect
Gynecologic | UCS | Tumour | Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S, Adipogenesis, Amoebiasis - Homo sapiens (human), Apoptosis, Apoptosis-related network due to altered Notch3 in ovarian cancer, Cardiac Progenitor Differentiation, Collagen biosynthesis and modifying enzymes, Collagen formation, Endochondral Ossification, Extracellular matrix organization, Formation of the ternary complex, and subsequently, the 43S complex, GPCR signaling-cholera toxin, GPCR signaling-G alpha i, GPCR signaling-G alpha q, GPCR signaling-G alpha s Epac and ERK, GPCR signaling-G alpha s PKA and ERK, GPCR signaling-pertussis toxin, IGF-Core, miRNA targets in ECM and membrane receptors, Posttranslational regulation of adherens junction stability and dissassembly, Protein digestion and absorption - Homo sapiens (human), Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs), SHC-related events triggered by IGF1R, Syndecan-1-mediated signaling events, Translation initiation complex formation
Head and Neck | HNSC | Normal | Class A/1 (Rhodopsin-like receptors), Corticotropin-releasing hormone, Endogenous TLR signaling, Fatty acid, triacylglycerol, and ketone body metabolism, G alpha (q) signalling events, Huntington's disease - Homo sapiens (human), IL1 and megakaryotyces in obesity, Metabolism, Metabolism of amino acids and derivatives, Metabolism of lipids and lipoproteins, Metabolism of proteins, Mitochondrial translation, Mitochondrial translation initiation, Non-alcoholic fatty liver disease (NAFLD) - Homo sapiens (human), O-linked glycosylation, O-linked glycosylation of mucins, Organelle biogenesis and maintenance, Phase 1 - Functionalization of compounds, Phospholipid metabolism, Respiratory electron transport, Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins., RhoA signaling pathway, TCR, The citric acid (TCA) cycle and respiratory electron transport, Vitamin D Receptor Pathway
Organ-System of origin: Head and Neck; Cancer Code: HNSC; Type: Tumour
Top 25 pathways: Activation of APC/C and APC/C:Cdc20 mediated degradation of mitotic proteins, APC/C:Cdc20 mediated degradation of mitotic proteins, Assembly Of The HIV Virion, Assembly of the pre-replicative complex, Association of licensing factors with the pre-replicative complex, CDK-mediated phosphorylation and removal of Cdc6, CDT1 association with the CDC6:ORC:origin complex, Cell Cycle Checkpoints, Corticotropin-releasing hormone, degradation of AXIN, degradation of DVL, DNA Replication Pre-Initiation, Ectoderm Differentiation, Fanconi Anemia pathway, G1/S DNA Damage Checkpoints, G2/M Checkpoints, G2/M DNA damage checkpoint, Glucocorticoid receptor regulatory network, Hh mutants abrogate ligand secretion, Hh mutants that don't undergo autocatalytic processing are degraded by ERAD, IKK complex recruitment mediated by RIP1, M/G1 Transition, Membrane binding and targetting of GAG proteins, NOTCH2 Activation and Transmission of Signal to the Nucleus, Validated transcriptional targets of deltaNp63 isoforms
Organ-System of origin: Head and Neck; Cancer Code: UVM; Type: Tumour
Top 25 pathways: 3' -UTR-mediated translational regulation, Apoptotic cleavage of cellular proteins, Apoptotic execution phase, Cap-dependent Translation Initiation, Caspase-mediated cleavage of cytoskeletal proteins, Cytoplasmic Ribosomal Proteins, Degradation of Superoxides, Eukaryotic Translation Elongation, Eukaryotic Translation Initiation, Eukaryotic Translation Termination, eumelanin biosynthesis, Formation of a pool of free 40S subunits, Gene Expression, GTP hydrolysis and joining of the 60S ribosomal subunit, L13a-mediated translational silencing of Ceruloplasmin expression, Melanogenesis - Homo sapiens (human), Metabolism of proteins, MicroRNAs in cancer - Homo sapiens (human), Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC), Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC), Nonsense-Mediated Decay (NMD), Peptide chain elongation, Ribosome - Homo sapiens (human), SRP-dependent cotranslational protein targeting to membrane, Translation
Organ-System of origin: Hematologic; Cancer Code: DLBC; Type: Tumour
Top 25 pathways: Antigen activates B Cell Receptor (BCR) leading to generation of second messengers, B cell receptor signaling, B Cell Receptor Signaling Pathway, BCR signaling pathway, Cleavage of Growing Transcript in the Termination Region, ctcf: first multivalent nuclear factor, DNA Damage Response, erk and pi-3 kinase are necessary for collagen binding in corneal epithelia, Gene Expression, Herpes simplex infection - Homo sapiens (human), IL4, MHC class II antigen presentation, miRNA Regulation of DNA Damage Response, mRNA Processing, mRNA Splicing, mRNA Splicing - Major Pathway, Post-Elongation Processing of the Transcript, Processing of Capped Intron-Containing Pre-mRNA, rho cell motility signaling pathway, RNA Polymerase II Transcription, RNA Polymerase II Transcription Termination, Signaling by the B Cell Receptor (BCR), Transcription, Transport of Mature Transcript to Cytoplasm, Tuberculosis - Homo sapiens (human)
Organ-System of origin: Hematologic; Cancer Code: LAML; Type: Tumour
Top 25 pathways: FGFR2, FGFR3, FGFR4, -arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 2-Methyl-3-Hydroxybutryl CoA Dehydrogenase Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3-Hydroxy-3-Methylglutaryl-CoA Lyase Deficiency, African trypanosomiasis - Homo sapiens (human), C-MYB transcription factor network, Erythrocytes take up carbon dioxide and release oxygen, Erythrocytes take up oxygen and release carbon dioxide, Folate Metabolism, hemoglobins chaperone, Malaria - Homo sapiens (human), Metabolism, O2/CO2 exchange in erythrocytes, RNA transport - Homo sapiens (human), Salivary secretion - Homo sapiens (human), Scavenging of heme from plasma, Selenium Micronutrient Network, Vitamin B12 Metabolism
Organ-System of origin: Hematologic; Cancer Code: THYM; Type: Normal
Top 25 pathways: FGFR2, FGFR3, FGFR4, -arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 2-Methyl-3-Hydroxybutryl CoA Dehydrogenase Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3-Hydroxy-3-Methylglutaryl-CoA Lyase Deficiency, 3-hydroxyisobutyric acid dehydrogenase deficiency, 3-hydroxyisobutyric aciduria, 3-Methylcrotonyl Coa Carboxylase Deficiency Type I, 3-Methylglutaconic Aciduria Type I, 3-Methylglutaconic Aciduria Type III, 3-Methylglutaconic Aciduria Type IV, 3-Methylthiofentanyl Action Pathway, 3-Phosphoglycerate dehydrogenase deficiency, 3' -UTR-mediated translational regulation, A tetrasaccharide linker sequence is required for GAG synthesis, a6b1 and a6b4 Integrin signaling, Abacavir Pathway, Pharmacokinetics/Pharmacodynamics, ABC transporters - Homo sapiens (human), ABC transporters in lipid homeostasis
Organ-System of origin: Hematologic; Cancer Code: THYM; Type: Tumour
Top 25 pathways: Adrenergic signaling in cardiomyocytes - Homo sapiens (human), Alcoholism - Homo sapiens (human), Amphetamine addiction - Homo sapiens (human), Calcium signaling pathway - Homo sapiens (human), Chemokine receptors bind chemokines, Chemokine signaling pathway - Homo sapiens (human), Circadian entrainment - Homo sapiens (human), Cytokine-cytokine receptor interaction - Homo sapiens (human), Dopaminergic synapse - Homo sapiens (human), G alpha (i) signalling events, Glioma - Homo sapiens (human), Glucagon signaling pathway - Homo sapiens (human), GnRH signaling pathway - Homo sapiens (human), GPCR ligand binding, Inflammatory mediator regulation of TRP channels - Homo sapiens (human), Intestinal immune network for IgA production - Homo sapiens (human), Long-term potentiation - Homo sapiens (human), Melanogenesis - Homo sapiens (human), Neurotrophin signaling pathway - Homo sapiens (human), Olfactory transduction - Homo sapiens (human), Peptide ligand-binding receptors, Phosphatidylinositol signaling system - Homo sapiens (human), Phototransduction - Homo sapiens (human), PPAR Alpha Pathway, Renin secretion - Homo sapiens (human)
Organ-System of origin: Skin; Cancer Code: SKCM; Type: Normal
Top 25 pathways: FGFR2, FGFR3, FGFR4, -arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 2-Methyl-3-Hydroxybutryl CoA Dehydrogenase Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3-Hydroxy-3-Methylglutaryl-CoA Lyase Deficiency, 3-hydroxyisobutyric acid dehydrogenase deficiency, 3-hydroxyisobutyric aciduria, 3-Methylcrotonyl Coa Carboxylase Deficiency Type I, 3-Methylglutaconic Aciduria Type I, 3-Methylglutaconic Aciduria Type III, 3-Methylglutaconic Aciduria Type IV, 3-Methylthiofentanyl Action Pathway, 3-Phosphoglycerate dehydrogenase deficiency, 3' -UTR-mediated translational regulation, A tetrasaccharide linker sequence is required for GAG synthesis, a6b1 and a6b4 Integrin signaling, Abacavir Pathway, Pharmacokinetics/Pharmacodynamics, ABC transporters - Homo sapiens (human), ABC transporters in lipid homeostasis
Organ-System of origin: Skin; Cancer Code: SKCM; Type: Tumour
Top 25 pathways: Allograft Rejection, Alpha6Beta4Integrin, Apoptosis, Apoptosis-related network due to altered Notch3 in ovarian cancer, Apoptotic cleavage of cellular proteins, Apoptotic execution phase, Aurora B signaling, Caspase Cascade in Apoptosis, Caspase-mediated cleavage of cytoskeletal proteins, Common Pathway of Fibrin Clot Formation, Dissolution of Fibrin Clot, Epstein-Barr virus infection - Homo sapiens (human), Inflammasomes, Intrinsic Pathway of Fibrin Clot Formation, MicroRNAs in cancer - Homo sapiens (human), notch_pathway, Nucleotide-binding domain, leucine rich repeat containing receptor (NLR) signaling pathways, Primary Focal Segmental Glomerulosclerosis FSGS, Programmed Cell Death, Regulation of Ras family activation, Sema3A PAK dependent Axon repulsion, Spinal Cord Injury, TCR Signaling Pathway, The NLRP3 inflammasome, Transport of fatty acids
Organ-System of origin: Soft Tissue; Cancer Code: SARC; Type: Normal
Top 25 pathways: FGFR2, FGFR3, FGFR4, -arrestins in gpcr desensitization, 11-beta-hydroxylase deficiency (CYP11B1), 17-alpha-hydroxylase deficiency (CYP17), 17-Beta Hydroxysteroid Dehydrogenase III Deficiency, 2-Methyl-3-Hydroxybutryl CoA Dehydrogenase Deficiency, 21-hydroxylase deficiency (CYP21), 3-Beta-Hydroxysteroid Dehydrogenase Deficiency, 3-Hydroxy-3-Methylglutaryl-CoA Lyase Deficiency, 3-hydroxyisobutyric acid dehydrogenase deficiency, 3-hydroxyisobutyric aciduria, 3-Methylcrotonyl Coa Carboxylase Deficiency Type I, 3-Methylglutaconic Aciduria Type I, 3-Methylglutaconic Aciduria Type III, 3-Methylglutaconic Aciduria Type IV, 3-Methylthiofentanyl Action Pathway, 3-Phosphoglycerate dehydrogenase deficiency, 3' -UTR-mediated translational regulation, A tetrasaccharide linker sequence is required for GAG synthesis, a6b1 and a6b4 Integrin signaling, Abacavir Pathway, Pharmacokinetics/Pharmacodynamics, ABC transporters - Homo sapiens (human), ABC transporters in lipid homeostasis
Organ-System of origin: Soft Tissue; Cancer Code: SARC; Type: Tumour
Top 25 pathways: Activation of Matrix Metalloproteinases, Alpha6Beta4Integrin, Amoebiasis - Homo sapiens (human), Apoptotic cleavage of cellular proteins, Apoptotic execution phase, Beta1 integrin cell surface interactions, Caspase Cascade in Apoptosis, Caspase-mediated cleavage of cytoskeletal proteins, Classical antibody-mediated complement activation, Collagen biosynthesis and modifying enzymes, Collagen formation, ECM-receptor interaction - Homo sapiens (human), Extracellular matrix organization, Inflammatory Response Pathway, Integrins in angiogenesis, MicroRNAs in cancer - Homo sapiens (human), miRNA targets in ECM and membrane receptors, NCAM1 interactions, pi3k_pathway, PI3K-Akt signaling pathway - Homo sapiens (human), Platelet Aggregation Inhibitor Pathway, Pharmacodynamics, Protein digestion and absorption - Homo sapiens (human), Regulation of Ras family activation, Syndecan-1-mediated signaling events, TCR Signaling Pathway
Organ-System of origin: Thoracic; Cancer Code: LUAD; Type: Normal
Top 25 pathways: a6b1 and a6b4 Integrin signaling, Angiopoietin receptor Tie2-mediated signaling, Antigen Presentation: Folding, assembly and peptide loading of class I MHC, Apoptosis - Homo sapiens (human), Aryl Hydrocarbon Receptor, Bladder Cancer, C-type lectin receptors (CLRs), Chronic myeloid leukemia - Homo sapiens (human), Class I MHC mediated antigen processing & presentation, DNA Damage Response (only ATM dependent), ErbB Signaling Pathway, ErbB signaling pathway - Homo sapiens (human), Factors involved in megakaryocyte development and platelet production, Hedgehog, miR-targeted genes in epithelium - TarBase, miR-targeted genes in lymphocytes - TarBase, Nef mediated downregulation of MHC class I complex cell surface expression, p75(NTR)-mediated signaling, ras-independent pathway in nk cell-mediated cytotoxicity, Signaling Pathways in Glioblastoma, Sphingolipid signaling pathway - Homo sapiens (human), Wnt Canonical, Wnt Mammals, Wnt Signaling Pathway, Wnt signaling pathway - Homo sapiens (human)
Organ-System of origin: Thoracic; Cancer Code: LUAD; Type: Tumour
Top 25 pathways: Allograft rejection - Homo sapiens (human), Antigen processing and presentation - Homo sapiens (human), Autoimmune thyroid disease - Homo sapiens (human), Binding and Uptake of Ligands by Scavenger Receptors, Cell adhesion molecules (CAMs) - Homo sapiens (human), Clathrin derived vesicle budding, Gastric cancer network 2, Golgi Associated Vesicle Biogenesis, Graft-versus-host disease - Homo sapiens (human), Herpes simplex infection - Homo sapiens (human), Integrated Pancreatic Cancer Pathway, Iron uptake and transport, Lysosome - Homo sapiens (human), Membrane Trafficking, MHC class II antigen presentation, Mineral absorption - Homo sapiens (human), NRF2 pathway, Nuclear Receptors Meta-Pathway, Prostaglandin Synthesis and Regulation, Scavenging by Class A Receptors, trans-Golgi Network Vesicle Budding, Transmembrane transport of small molecules, Tuberculosis - Homo sapiens (human), Type I diabetes mellitus - Homo sapiens (human), Viral myocarditis - Homo sapiens (human)
Organ-System of origin: Thoracic; Cancer Code: LUSC; Type: Normal
Top 25 pathways: Activation of gene expression by SREBF (SREBP), Alpha-synuclein signaling, Bile secretion - Homo sapiens (human), Chemokine receptors bind chemokines, Class A/1 (Rhodopsin-like receptors), Formation of Fibrin Clot (Clotting Cascade), FOXA1 transcription factor network, Fructose Mannose metabolism, G alpha (i) signalling events, G alpha (q) signalling events, Glycine, serine and threonine metabolism - Homo sapiens (human), GPCR downstream signaling, GPCR ligand binding, Human Complement System, Intrinsic Pathway of Fibrin Clot Formation, Peptide ligand-binding receptors, Pertussis - Homo sapiens (human), Phagosome - Homo sapiens (human), Primary immunodeficiency - Homo sapiens (human), Proton Pump Inhibitor Pathway, Pharmacodynamics, Regulation of CDC42 activity, Regulation of cholesterol biosynthesis by SREBP (SREBF), Regulation of toll-like receptor signaling pathway, SREBP signalling, Steroid biosynthesis - Homo sapiens (human)

Organ-System of origin: Thoracic; Cancer Code: LUSC; Type: Tumour
Top 25 pathways: Alzheimer's disease - Homo sapiens (human), Alzheimers Disease, Chromosome Maintenance, Cori Cycle, Ectoderm Differentiation, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, Glucagon signaling pathway - Homo sapiens (human), gluconeogenesis, Gluconeogenesis, Glucose metabolism, Glucose transport, Glycerol Phosphate Shuttle, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, Glycogenosis, Type IC, Glycogenosis, Type VII. Tarui disease, glycolysis, Glycolysis, Glycolysis / Gluconeogenesis - Homo sapiens (human), Hexose transport, hypoxia-inducible factor in the cardivascular system, Inositol phosphate metabolism - Homo sapiens (human), Olfactory transduction - Homo sapiens (human)

Organ-System of origin: Thoracic; Cancer Code: MESO; Type: Tumour
Top 25 pathways: A tetrasaccharide linker sequence is required for GAG synthesis, Acenocoumarol Action Pathway, Activation of C3 and C5, Alteplase Action Pathway, Alternative complement activation, Classical antibody-mediated complement activation, classical complement pathway, Collagen biosynthesis and modifying enzymes, Collagen formation, Complement Activation, Classical Pathway, Complement and Coagulation Cascades, Complement and coagulation cascades - Homo sapiens (human), Complement cascade, Copper homeostasis, Creation of C4 and C2 activators, Dermatan sulfate biosynthesis, ECM proteoglycans, Inflammatory Response Pathway, Initial triggering of complement, Osteoblast Signaling, Regulation of Complement cascade, Scavenging by Class H Receptors, Senescence and Autophagy in Cancer, Signaling mediated by p38-alpha and p38-beta, Syndecan-1-mediated signaling events
Organ-System of origin: Urologic; Cancer Code: BLCA; Type: Normal
Top 25 pathways: ATF-2 transcription factor network, Cholinergic synapse - Homo sapiens (human), EPO signaling, erbb_pathway, Hepatitis B - Homo sapiens (human), IL-6 signaling pathway, IL-7 signaling, IL2-mediated signaling events, IL6-mediated signaling events, JAK STAT pathway and regulation, jak_pathway, Jak-STAT signaling pathway - Homo sapiens (human), MAPK signaling pathway - Homo sapiens (human), Oncostatin_M, PI3K/AKT activation, Prolactin, Prolactin signaling pathway - Homo sapiens (human), Signaling by EGFR, Signaling by ERBB4, Signaling by the B Cell Receptor (BCR), Signalling by NGF, TGF_beta_Receptor, TGF-beta signaling pathway - Homo sapiens (human), TNF signaling pathway - Homo sapiens (human), VEGF

Organ-System of origin: Urologic; Cancer Code: BLCA; Type: Tumour
Top 25 pathways: AUF1 (hnRNP D0) destabilizes mRNA, cell_cycle_pathway, Deadenylation-dependent mRNA decay, Direct p53 effectors, FoxO signaling pathway - Homo sapiens (human), Gastric Cancer Network 1, Gastric cancer network 2, Gastrin, IL1 and megakaryotyces in obesity, Insulin signaling pathway - Homo sapiens (human), mRNA Processing, mRNA Splicing, mRNA Splicing - Major Pathway, mtor_pathway, p73 transcription factor network, Processing of Capped Intron-Containing Pre-mRNA, Prostaglandin Synthesis and Regulation, Regulation of mRNA stability by proteins that bind AU-rich elements, Signaling mediated by p38-alpha and p38-beta, skeletal muscle hypertrophy is regulated via akt-mtor pathway, TP53 Regulates Metabolic Genes, Transcriptional Regulation by TP53, Validated transcriptional targets of TAp63 isoforms, Viral carcinogenesis - Homo sapiens (human), Vitamin D Receptor Pathway

Organ-System of origin: Urologic; Cancer Code: KICH; Type: Normal
Top 25 pathways: Basal transcription factors - Homo sapiens (human), Cell cycle - Homo sapiens (human), cell_cycle_pathway, Cellular Senescence, DNA Damage Response, Eukaryotic Transcription Initiation, G1 to S cell cycle control, HIV Life Cycle, HIV Transcription Initiation, Late Phase of HIV Life Cycle, MAPK family signaling cascades, mapk_pathway, miRNA Regulation of DNA Damage Response, Pyrimidine metabolism - Homo sapiens (human), Pyrimidine nucleotides nucleosides metabolism, RNA Polymerase II HIV Promoter Escape, RNA Polymerase II Pre-transcription Events, RNA Polymerase II Promoter Escape, RNA Polymerase II Transcription, RNA Polymerase II Transcription Initiation, RNA Polymerase II Transcription Initiation And Promoter Clearance, RNA Polymerase II Transcription Pre-Initiation And Promoter Opening, Senescence and Autophagy in Cancer, Senescence-Associated Secretory Phenotype (SASP), Transcription
Organ-System of origin: Urologic; Cancer Code: KICH; Type: Tumour
Top 25 pathways: adenosine ribonucleotides de novo biosynthesis, Beta defensins, Ceramide signaling pathway, Collecting duct acid secretion - Homo sapiens (human), Defensins, Electron Transport Chain, Epithelial cell signaling in Helicobacter pylori infection - Homo sapiens (human), Formation of ATP by chemiosmotic coupling, Huntington's disease - Homo sapiens (human), Latent infection of Homo sapiens with Mycobacterium tuberculosis, LKB1 signaling events, Metabolism of Angiotensinogen to Angiotensins, Oxidative phosphorylation - Homo sapiens (human), Peptide hormone metabolism, Phagosomal maturation (early endosomal stage), Prolactin, Prolactin Signaling Pathway, purine nucleotides de novo biosynthesis, Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins., Sphingolipid signaling pathway - Homo sapiens (human), Synaptic vesicle cycle - Homo sapiens (human), The citric acid (TCA) cycle and respiratory electron transport, thyroid hormone biosynthesis, Transferrin endocytosis and recycling, Validated nuclear estrogen receptor alpha network
Organ-System of origin: Urologic; Cancer Code: KIRC; Type: Normal
Top 25 pathways: 3-Methylthiofentanyl Action Pathway, Alfentanil Action Pathway, Alvimopan Action Pathway, Amiloride Action Pathway, Anileridine Action Pathway, Basigin interactions, Bendroflumethiazide Action Pathway, Benzocaine Action Pathway, Bile secretion - Homo sapiens (human), Blue diaper syndrome, Bumetanide Action Pathway, Bupivacaine Action Pathway, Buprenorphine Action Pathway, Carbohydrate digestion and absorption - Homo sapiens (human), Carfentanil Action Pathway, Chloroprocaine Action Pathway, Chlorothiazide Action Pathway, Chlorthalidone Action Pathway, Cocaine Action Pathway, Cyclothiazide Action Pathway, Cystinuria, Desipramine Action Pathway, Dezocine Action Pathway, Dibucaine Action Pathway, Dihydromorphine Action Pathway

Organ-System of origin: Urologic; Cancer Code: KIRC; Type: Tumour
Top 25 pathways: Allograft Rejection, Apoptotic cleavage of cellular proteins, Apoptotic execution phase, Aurora B signaling, Caspase-mediated cleavage of cytoskeletal proteins, Celecoxib Pathway, Pharmacodynamics, Cori Cycle, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, Glutathione metabolism, glutathione redox reactions I, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, Glycogenosis, Type IC, Glycogenosis, Type VII. Tarui disease, glycolysis, Glycolysis / Gluconeogenesis - Homo sapiens (human), Glycolysis and Gluconeogenesis, Glycolysis Gluconeogenesis, HIF-1 signaling pathway - Homo sapiens (human), MicroRNAs in cancer - Homo sapiens (human), PLK1 signaling events, reactive oxygen species degradation, Validated targets of C-MYC transcriptional activation

Organ-System of origin: Urologic; Cancer Code: KIRP; Type: Normal
Top 25 pathways: ABC transporters - Homo sapiens (human), Aflatoxin activation and detoxification, Clathrin derived vesicle budding, Detoxification of Reactive Oxygen Species, Folate metabolism, Glycine Serine metabolism, Glycine, serine and threonine metabolism - Homo sapiens (human), Glyoxylate and dicarboxylate metabolism - Homo sapiens (human), Golgi Associated Vesicle Biogenesis, Integrated Pancreatic Cancer Pathway, Iron uptake and transport, LKB1 signaling events, Lysosome - Homo sapiens (human), Membrane Trafficking, Metabolism of amino acids and derivatives, Metabolism of Angiotensinogen to Angiotensins, N-Glycan biosynthesis, NRF2 pathway, Nuclear Receptors in Lipid Metabolism and Toxicity, Nuclear Receptors Meta-Pathway, Scavenging by Class A Receptors, SLC-mediated transmembrane transport, thyroid hormone biosynthesis, trans-Golgi Network Vesicle Budding, Transmembrane transport of small molecules
Organ-System of origin: Urologic; Cancer Code: KIRP; Type: Tumour
Top 25 pathways: Alpha4 beta1 integrin signaling events, Alpha9 beta1 integrin signaling events, BDNF signaling pathway, Beta1 integrin cell surface interactions, Beta3 integrin cell surface interactions, Degradation of the extracellular matrix, Direct p53 effectors, ECM-receptor interaction - Homo sapiens (human), Endochondral Ossification, FGF signaling pathway, Human Complement System, Integrin cell surface interactions, Integrins in angiogenesis, Osteoclast Signaling, Osteopontin Signaling, Osteopontin-mediated events, p53_pathway, PI3K-Akt signaling pathway - Homo sapiens (human), Protein processing in endoplasmic reticulum - Homo sapiens (human), Regulation of toll-like receptor signaling pathway, regulators of bone mineralization, Signaling by PDGF, TGF Beta Signaling Pathway, Toll-like receptor signaling pathway, Toll-like receptor signaling pathway - Homo sapiens (human)
Organ-System of origin: Urologic; Cancer Code: PRAD; Type: Normal
Top 25 pathways: Alendronate Action Pathway, Atorvastatin Action Pathway, Cerivastatin Action Pathway, CHILD Syndrome, Cholesterol biosynthesis, Cholesterol Biosynthesis, Cholesteryl ester storage disease, Chondrodysplasia Punctata II, X Linked Dominant (CDPX2), Circadian entrainment - Homo sapiens (human), Desmosterolosis, EPHA-mediated growth cone collapse, Fluvastatin Action Pathway, FOXM1 transcription factor network, Glutathione conjugation, glutathione-mediated detoxification, Miscellaneous transport and binding events, Muscle contraction, RHO GTPases activate CIT, RHO GTPases activate PAKs, RHO GTPases Activate ROCKs, Sema4D in semaphorin signaling, Sema4D induced cell migration and growth-cone collapse, Smooth Muscle Contraction, Viral RNP Complexes in the Host Cell Nucleus, Vitamin B2 (riboflavin) metabolism
Organ-System of origin: Urologic; Cancer Code: PRAD; Type: Tumour
Top 25 pathways: Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3, Acyl chain remodelling of PC, Acyl chain remodelling of PE, Acyl chain remodelling of PG, Acyl chain remodelling of PI, Acyl chain remodelling of PS, alpha-Linolenic acid metabolism - Homo sapiens (human), Androgen receptor signaling pathway, antigen processing and presentation, Cardiac Hypertrophic Response, Coregulation of Androgen receptor activity, DroToll-like, Ether lipid metabolism - Homo sapiens (human), FOXA1 transcription factor network, Glycerophospholipid metabolism - Homo sapiens (human), IGF signaling, Linoleic acid metabolism - Homo sapiens (human), Metabolism of proteins, Pathways in cancer - Homo sapiens (human), phospholipases, Prostate Cancer, Prostate cancer - Homo sapiens (human), Regulation of Androgen receptor activity, Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs), RHO GTPases activate PKNs

Organ-System of origin: Urologic; Cancer Code: TGCT; Type: Tumour
Top 25 pathways: Alzheimer's disease - Homo sapiens (human), Alzheimers Disease, Cori Cycle, downregulated of mta-3 in er-negative breast tumors, Fanconi-bickel syndrome, Fructose-1,6-diphosphatase deficiency, gluconeogenesis, Gluconeogenesis, Glucose metabolism, Glycerol Phosphate Shuttle, Glycogen Storage Disease Type 1A (GSD1A) or Von Gierke Disease, Glycogenosis, Type IA. Von gierke disease, Glycogenosis, Type IB, glycolysis, Glycolysis, Glycolysis / Gluconeogenesis - Homo sapiens (human), Glycolysis and Gluconeogenesis, Glycolysis Gluconeogenesis, Metabolism of carbohydrates, POU5F1 (OCT4), SOX2, NANOG activate genes related to proliferation, POU5F1 (OCT4), SOX2, NANOG repress genes related to differentiation, superpathway of conversion of glucose to acetyl CoA and entry into the TCA cycle, Transcriptional regulation of pluripotent stem cells, Validated targets of C-MYC transcriptional activation, Warburg Effect

Table 3: Top 25 statistically identified pathways for each common cancer category in the POG and MET500 cohorts, based on PIE scores.

Cohort: MET500; Cancer Type: BRCA; Organ-System of origin: Breast; Number of samples: 60
Top 25 pathways: Amphetamine addiction - Homo sapiens (human), Aromatase Inhibitor Pathway (Breast Cell), Pharmacodynamics, Astrocytic Glutamate-Glutamine Uptake And Metabolism, ATF6-alpha activates chaperone genes, Dopaminergic synapse - Homo sapiens (human), Estrogen signaling pathway - Homo sapiens (human), FTO Obesity Variant Mechanism, IL27-mediated signaling events, Inflammasomes, Inflammatory bowel disease (IBD) - Homo sapiens (human), Inflammatory mediator regulation of TRP channels - Homo sapiens (human), Insulin Signaling, miRNA targets in ECM and membrane receptors, Miscellaneous transport and binding events, Neurotransmitter uptake and Metabolism In Glial Cells, Oocyte meiosis - Homo sapiens (human), PI3K-Akt signaling pathway - Homo sapiens (human), Progesterone-mediated oocyte maturation - Homo sapiens (human), Protein processing in endoplasmic reticulum - Homo sapiens (human), Signaling by Retinoic Acid, Signaling events mediated by VEGFR1 and VEGFR2, The NLRP3 inflammasome, Validated nuclear estrogen receptor alpha network, Vitamin A and Carotenoid Metabolism, Warfarin Pathway, Pharmacodynamics

Cohort: POG; Cancer Type: BRCA; Organ-System of origin: Breast; Number of samples: 160
Top 25 pathways: Amphetamine addiction - Homo sapiens (human), Aromatase Inhibitor Pathway (Breast Cell), Pharmacodynamics, Astrocytic Glutamate-Glutamine Uptake And Metabolism, ATF6-alpha activates chaperone genes, Dopaminergic synapse - Homo sapiens (human), Estrogen signaling pathway - Homo sapiens (human), FTO Obesity Variant Mechanism, IL27-mediated signaling events, Inflammasomes, Inflammatory bowel disease (IBD) - Homo sapiens (human), Inflammatory mediator regulation of TRP channels - Homo sapiens (human), Insulin Signaling, miRNA targets in ECM and membrane receptors, Miscellaneous transport and binding events, Neurotransmitter uptake and Metabolism In Glial Cells, Oocyte meiosis - Homo sapiens (human), PI3K-Akt signaling pathway - Homo sapiens (human), Progesterone-mediated oocyte maturation - Homo sapiens (human), Protein processing in endoplasmic reticulum - Homo sapiens (human), Signaling by Retinoic Acid, Signaling events mediated by VEGFR1 and VEGFR2, The NLRP3 inflammasome, Validated nuclear estrogen receptor alpha network, Vitamin A and Carotenoid Metabolism, Warfarin Pathway, Pharmacodynamics
| MET500 | CHOL | Gastrointestinal | 21 | ABC transporter disorders, alternative complement pathway, Anchoring fibril formation, Basal transcription factors - Homo sapiens (human), Beta oxidation of myristoyl-CoA to lauroyl-CoA, BMP Signalling and Regulation, BMP2 signaling TAK1, choline degradation, control of skeletal myogenesis by hdac and calcium/calmodulin-dependent kinase (camk), Defective CFTR causes cystic fibrosis, Disorders of transmembrane transporters, Felodipine Metabolism Pathway, folate polyglutamylation, geranylgeranyldiphosphate biosynthesis, glycine cleavage, Glycine Metabolism, glycine/serine biosynthesis, Heroin metabolism, Leukotriene modifiers pathway, Pharmacodynamics, LRR FLII-interacting protein 1 (LRRFIP1) activates type I IFN production, lysine degradation I (saccharopine pathway), lysine degradation II (pipecolate pathway), Resolution of AP sites via the single-nucleotide replacement pathway, Trafficking of myristoylated proteins to the cilium, WNT ligand secretion is abrogated by the PORCN inhibitor LGK974 |
| POG | CHOL | Gastrointestinal | 17 | ABC transporter disorders, alternative complement pathway, Anchoring fibril formation, Basal transcription factors - Homo sapiens (human), Beta oxidation of myristoyl-CoA to lauroyl-CoA, BMP Signalling and Regulation, BMP2 signaling TAK1, choline degradation, control of skeletal myogenesis by hdac and calcium/calmodulin-dependent kinase (camk), Defective CFTR causes cystic fibrosis, Disorders of transmembrane transporters, Felodipine Metabolism Pathway, folate polyglutamylation, geranylgeranyldiphosphate biosynthesis, glycine cleavage, Glycine Metabolism, glycine/serine biosynthesis, Heroin metabolism, Leukotriene modifiers pathway, Pharmacodynamics, LRR FLII-interacting protein 1 (LRRFIP1) activates type I IFN production, lysine degradation I (saccharopine pathway), lysine degradation II (pipecolate pathway), Resolution of AP sites via the single-nucleotide replacement pathway, Trafficking of myristoylated proteins to the cilium, WNT ligand secretion is abrogated by the PORCN inhibitor LGK974 |
| MET500 | COADREAD | Gastrointestinal | 10 | Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S, Advanced glycosylation endproduct receptor signaling, bupropion degradation, Carnitine Synthesis, Conjugation of phenylacetate with glutamine, creatine-phosphate biosynthesis, DEx/H-box helicases activate type I IFN and inflammatory cytokines production, Dorso-ventral axis formation - Homo sapiens (human), EGFR Transactivation by Gastrin, Eicosanoid Synthesis, ErbB receptor signaling network, Formation of the ternary complex, and subsequently, the 43S complex, G-protein mediated events, Glycosphingolipid biosynthesis - lactoseries, HIF-1-alpha transcription factor network, icosapentaenoate biosynthesis II (metazoa), insulin Mam, mucin core 1 and core 2 O-glycosylation, Mucin type O-Glycan biosynthesis - Homo sapiens (human), PCSK9-mediated LDLR degradation, pentose phosphate pathway, Pentose phosphate pathway (hexose monophosphate shunt), Phenylacetate Metabolism, PLC beta mediated events, yaci and bcma stimulation of b cell immune responses |
| POG | COADREAD | Gastrointestinal | 98 | Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S, Advanced glycosylation endproduct receptor signaling, bupropion degradation, Carnitine Synthesis, Conjugation of phenylacetate with glutamine, creatine-phosphate biosynthesis, DEx/H-box helicases activate type I IFN and inflammatory cytokines production, Dorso-ventral axis formation - Homo sapiens (human), EGFR Transactivation by Gastrin, Eicosanoid Synthesis, ErbB receptor signaling network, Formation of the ternary complex, and subsequently, the 43S complex, G-protein mediated events, Glycosphingolipid biosynthesis - lactoseries, HIF-1-alpha transcription factor network, icosapentaenoate biosynthesis II (metazoa), insulin Mam, mucin core 1 and core 2 O-glycosylation, Mucin type O-Glycan biosynthesis - Homo sapiens (human), PCSK9-mediated LDLR degradation, pentose phosphate pathway, Pentose phosphate pathway (hexose monophosphate shunt), Phenylacetate Metabolism, PLC beta mediated events, yaci and bcma stimulation of b cell immune responses |
| MET500 | ESCA | Gastrointestinal | 14 | Adherens junction - Homo sapiens (human), Arrhythmogenic Right Ventricular Cardiomyopathy, Arrhythmogenic right ventricular cardiomyopathy (ARVC) - Homo sapiens (human), Chaperonin-mediated protein folding, chromatin remodeling by hswi/snf atp-dependent complexes, Cooperation of Prefoldin and TriC/CCT in actin and tubulin folding, EGFR Transactivation by Gastrin, Folding of actin by CCT/TriC, Gastric acid secretion - Homo sapiens (human), Gastric Histamine Release, Hippo signaling pathway - Homo sapiens (human), hop pathway in cardiac development, Influenza A - Homo sapiens (human), Interleukin-6 signaling, IRAK1 recruits IKK complex, IRAK1 recruits IKK complex upon TLR7/8 or 9 stimulation, MAP kinase cascade, MAPK1 (ERK2) activation, miRs in Muscle Cell Differentiation, NOTCH2 intracellular domain regulates transcription, pkc-catalyzed phosphorylation of inhibitory phosphoprotein of myosin phosphatase, Rap1 signaling pathway - Homo sapiens (human), Regulation of Actin Cytoskeleton, the information processing pathway at the ifn beta enhancer, WNT mediated activation of DVL |
| MET500 | OV | Gynecologic | 13 | Abciximab Action Pathway, Adipogenesis, Biosynthesis of A2E, implicated in retinal degradation, BMP signaling Dro, Cell-Cell communication, Diseases associated with visual transduction, Dopamine receptors, Eptifibatide Action Pathway, ERKs are inactivated, Gastric pepsin release, Generic Transcription Pathway, Glycosylphosphatidylinositol(GPI)-anchor biosynthesis - Homo sapiens (human), Inhibition of PKR, Lipoic acid metabolism - Homo sapiens (human), Monoamine GPCRs, NHR, NOTCH2 intracellular domain regulates transcription, Retinoic acid receptors-mediated signaling, Retinoid cycle disease events, reversal of insulin resistance by leptin, S1P4 pathway, Signal attenuation, signal dependent regulation of myogenesis by corepressor mitr, TCA Cycle Nutrient Utilization and Invasiveness of Ovarian Cancer, thymine degradation |
| POG | OV | Gynecologic | 33 | Abciximab Action Pathway, Adipogenesis, Biosynthesis of A2E, implicated in retinal degradation, BMP signaling Dro, Cell-Cell communication, Diseases associated with visual transduction, Dopamine receptors, Eptifibatide Action Pathway, ERKs are inactivated, Gastric pepsin release, Generic Transcription Pathway, Glycosylphosphatidylinositol(GPI)-anchor biosynthesis - Homo sapiens (human), Inhibition of PKR, Lipoic acid metabolism - Homo sapiens (human), Monoamine GPCRs, NHR, NOTCH2 intracellular domain regulates transcription, Retinoic acid receptors-mediated signaling, Retinoid cycle disease events, reversal of insulin resistance by leptin, S1P4 pathway, Signal attenuation, signal dependent regulation of myogenesis by corepressor mitr, TCA Cycle Nutrient Utilization and Invasiveness of Ovarian Cancer, thymine degradation |
| MET500 | SKCM | Skin | 12 | APEX1-Independent Resolution of AP Sites via the Single Nucleotide Replacement Pathway, Apoptosis-related network due to altered Notch3 in ovarian cancer, BMP receptor signaling, Codeine and Morphine Metabolism, Codeine and Morphine Pathway, Pharmacokinetics, Common Pathway of Fibrin Clot Formation, Conversion from APC/C:Cdc20 to APC/C:Cdh1 in late anaphase, Drug Induction of Bile Acid Pathway, Fanconi anemia pathway, FGFR4 mutant receptor activation, Formation of Fibrin Clot (Clotting Cascade), Inactivation of APC/C via direct inhibition of the APC/C complex, Inhibition of the proteolytic activity of APC/C required for the onset of anaphase by mitotic spindle checkpoint components, Intrinsic Pathway of Fibrin Clot Formation, lanosterol biosynthesis, Maturity onset diabetes of the young - Homo sapiens (human), Melanin biosynthesis, NICD traffics to nucleus, Nicotine Pathway, Pharmacokinetics, Notch-HLH transcription pathway, NOTCH2 intracellular domain regulates transcription, Post-transcriptional silencing by small RNAs, spermine biosynthesis, Transport of fatty acids, WNT ligand secretion is abrogated by the PORCN inhibitor LGK974 |
| POG | SKCM | Skin | 15 | APEX1-Independent Resolution of AP Sites via the Single Nucleotide Replacement Pathway, Apoptosis-related network due to altered Notch3 in ovarian cancer, BMP receptor signaling, Codeine and Morphine Metabolism, Codeine and Morphine Pathway, Pharmacokinetics, Common Pathway of Fibrin Clot Formation, Conversion from APC/C:Cdc20 to APC/C:Cdh1 in late anaphase, Drug Induction of Bile Acid Pathway, Fanconi anemia pathway, FGFR4 mutant receptor activation, Formation of Fibrin Clot (Clotting Cascade), Inactivation of APC/C via direct inhibition of the APC/C complex, Inhibition of the proteolytic activity of APC/C required for the onset of anaphase by mitotic spindle checkpoint components, Intrinsic Pathway of Fibrin Clot Formation, lanosterol biosynthesis, Maturity onset diabetes of the young - Homo sapiens (human), Melanin biosynthesis, NICD traffics to nucleus, Nicotine Pathway, Pharmacokinetics, Notch-HLH transcription pathway, NOTCH2 intracellular domain regulates transcription, Post-transcriptional silencing by small RNAs, spermine biosynthesis, Transport of fatty acids, WNT ligand secretion is abrogated by the PORCN inhibitor LGK974 |
| MET500 | SARC | Soft Tissue | 53 | Activation of BMF and translocation to mitochondria, Amoebiasis - Homo sapiens (human), AMPK Signaling, Assembly of collagen fibrils and other multimeric structures, Asymmetric localization of PCP proteins, Beta1 integrin cell surface interactions, BMP Signalling Pathway, Collagen biosynthesis and modifying enzymes, Collagen formation, ECM-receptor interaction - Homo sapiens (human), Extracellular matrix organization, miRNA targets in ECM and membrane receptors, NCAM signaling for neurite out-growth, NCAM1 interactions, Post-translational protein modification, role of erk5 in neuronal survival pathway, Signaling by FGFR2 in disease, Signaling by FGFR2 mutants, Signaling by FGFR3 in disease, Signaling by FGFR3 mutants, Signaling by FGFR4 mutants, Smooth Muscle Contraction, Stimuli-sensing channels, Syndecan-1-mediated signaling events, Uptake and actions of bacterial toxins |
| POG | SARC | Soft Tissue | 60 | Activation of BMF and translocation to mitochondria, Amoebiasis - Homo sapiens (human), AMPK Signaling, Assembly of collagen fibrils and other multimeric structures, Asymmetric localization of PCP proteins, Beta1 integrin cell surface interactions, BMP Signalling Pathway, Collagen biosynthesis and modifying enzymes, Collagen formation, ECM-receptor interaction - Homo sapiens (human), Extracellular matrix organization, miRNA targets in ECM and membrane receptors, NCAM signaling for neurite out-growth, NCAM1 interactions, Post-translational protein modification, role of erk5 in neuronal survival pathway, Signaling by FGFR2 in disease, Signaling by FGFR2 mutants, Signaling by FGFR3 in disease, Signaling by FGFR3 mutants, Signaling by FGFR4 mutants, Smooth Muscle Contraction, Stimuli-sensing channels, Syndecan-1-mediated signaling events, Uptake and actions of bacterial toxins |
| POG | LUAD | Thoracic | 53 | -oxidation (unsaturated, odd number), Aflatoxin B1 metabolism, Arachidonate production from DAG, Cilostazol Action Pathway, Clathrin derived vesicle budding, Dipyridamole (Antiplatelet) Action Pathway, dolichol and dolichyl phosphate biosynthesis, Estrogen metabolism, extrinsic prothrombin activation pathway, Golgi Associated Vesicle Biogenesis, Inhibition of PKR, Insulin processing, Iron uptake and transport, Lipoic acid metabolism - Homo sapiens (human), Membrane Trafficking, mucin core 1 and core 2 O-glycosylation, Mucin type O-Glycan biosynthesis - Homo sapiens (human), Nicotine Metabolism, Nicotine Metabolism Pathway, NRF2 pathway, Opsins, pkc-catalyzed phosphorylation of inhibitory phosphoprotein of myosin phosphatase, Termination of O-glycan biosynthesis, trans-Golgi Network Vesicle Budding, Transmembrane transport of small molecules |
| MET500 | BLCA | Urologic | 14 | (S)-reticuline biosynthesis, [2Fe-2S] iron-sulfur cluster biosynthesis, adenine and adenosine salvage II, biotin-carboxyl carrier protein assembly, Cap-dependent Translation Initiation, CDC6 association with the ORC:origin complex, estradiol biosynthesis I, estradiol biosynthesis II, Estrogen biosynthesis, Eukaryotic Translation Initiation, Joubert syndrome, L-dopachrome biosynthesis, Myometrial Relaxation and Contraction Pathways, NAD phosphorylation and dephosphorylation, Phosphatidylinositol Phosphate Metabolism, phospholipase c delta in phospholipid associated cell signaling, Porphyrin metabolism, Reuptake of GABA, Ribosome - Homo sapiens (human), RNA degradation - Homo sapiens (human), Shigellosis - Homo sapiens (human), Thiamine metabolism - Homo sapiens (human), Thyroxine (Thyroid Hormone) Production, Vitamin A (retinol) metabolism, Xenobiotics metabolism |
| MET500 | PRAD | Urologic | 62 | -oxidation (unsaturated, odd number), Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3, Amino Acid conjugation, Androgen receptor signaling pathway, antigen processing and presentation, Conjugation of benzoate with glycine, Conjugation of carboxylic acids, Conjugation of phenylacetate with glutamine, Coregulation of Androgen receptor activity, DroToll-like, FOXA1 transcription factor network, IGF signaling, Metabolism of proteins, Pathways in cancer - Homo sapiens (human), Phenylacetate Metabolism, Prostate Cancer, Prostate cancer - Homo sapiens (human), Regulation of Androgen receptor activity, Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs), Renin-angiotensin system - Homo sapiens (human), RHO GTPase Effectors, RHO GTPases activate PKNs, Secretion of Hydrochloric Acid in Parietal Cells, Signaling by Rho GTPases, spermine and spermidine degradation I |
| POG | PRAD | Urologic | 74 | -oxidation (unsaturated, odd number), Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3, Amino Acid conjugation, Androgen receptor signaling pathway, antigen processing and presentation, Conjugation of benzoate with glycine, Conjugation of carboxylic acids, Conjugation of phenylacetate with glutamine, Coregulation of Androgen receptor activity, DroToll-like, FOXA1 transcription factor network, IGF signaling, Metabolism of proteins, Pathways in cancer - Homo sapiens (human), Phenylacetate Metabolism, Prostate Cancer, Prostate cancer - Homo sapiens (human), Regulation of Androgen receptor activity, Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs), Renin-angiotensin system - Homo sapiens (human), RHO GTPase Effectors, RHO GTPases activate PKNs, Secretion of Hydrochloric Acid in Parietal Cells, Signaling by Rho GTPases, spermine and spermidine degradation I |

Figure 4: Pathway importance for various Androgen Receptor associated pathways for the MET500 Prostate Adenocarcinoma samples, separated by observed cluster groups.

Figure 5: Manual integrative analysis of a mammary-like vulvar adenocarcinoma. Colour of circles shows fold expression change of the respective gene in the sample, relative to a background of all healthy normal tissues from GTEx. Box adjacent to circle indicates percentile expression compared to the Cancer Genome Atlas’ cohort of breast cancers. Over-expression is shown in red, and loss of expression in blue. The key oncogenic pathways impacted in this case are shown with grey boxes and red border. Manual analysis identified activation of ERBB2/ERBB3, mTOR pathway, and the MAPK pathway. Overexpression of various genes participating in transcriptional regulation and metabolism was also identified (shown in red borders).

Figure 6: Manual integrative analysis of a cancer with unknown primary, which was diagnosed as a thyroid-like follicular renal cell carcinoma molecularly similar to renal papillary carcinoma. Colour of circles shows fold expression change of the respective gene in the sample, relative to a background of healthy renal tissues. Box adjacent to circle indicates percentile expression compared to the Cancer Genome Atlas’ cohort of renal papillary carcinomas. Over-expression is shown in red, and loss of expression in blue. The key oncogenic pathways impacted in this case are shown with grey boxes and red border.
UBC Theses and Dissertations
Utility of machine learning approaches for cancer diagnosis and analysis from RNA sequencing Grewal, Jasleen K 2020