Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Computational methods for systems biology data of cancer Ding, Jiarui 2016

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2016_september_ding_jiarui.pdf [ 8.66MB ]
Metadata
JSON: 24-1.0303119.json
JSON-LD: 24-1.0303119-ld.json
RDF/XML (Pretty): 24-1.0303119-rdf.xml
RDF/JSON: 24-1.0303119-rdf.json
Turtle: 24-1.0303119-turtle.txt
N-Triples: 24-1.0303119-rdf-ntriples.txt
Original Record: 24-1.0303119-source.json
Full Text
24-1.0303119-fulltext.txt
Citation
24-1.0303119.ris

Full Text

Computational Methodsfor Systems Biology Data of CancerbyJiarui DingA THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)May 2016c© Jiarui Ding 2016AbstractHigh-throughput genome sequencing and other techniques provide a cost-effective way to studycancer biology and seek precision treatment options. In this dissertation I address three chal-lenges in cancer systems biology research: 1) predicting somatic mutations, 2) interpretingmutation functions, and 3) stratifying patients into biologically meaningful groups.Somatic single nucleotide variants are frequent therapeutically actionable mutations in can-cer, e.g., the ‘hotspot’ mutations in known cancer driver genes such as EGFR, KRAS, andBRAF. However, only a small proportion of cancer patients harbour these known driver mu-tations. Therefore, there is a great need to systematically profile a cancer genome to identifyall the somatic single nucleotide variants. I develop methods to discover these somatic muta-tions from cancer genomic sequencing data, taking into account the noise in high-throughputsequencing data and valuable validated genuine somatic mutations and non-somatic mutations.Of the somatic alterations acquired for each cancer patient, only a few mutations ‘drive’ theinitialization and progression of cancer. To better understand the evolution of cancer, as wellas to apply precision treatments, we need to assess the functions of these mutations to pinpointthe driver mutations. I address this challenge by predicting the mutations correlated with geneexpression dysregulation. The method is based on hierarchical Bayes modelling of the influenceof mutations on gene expression, and can predict the mutations that impact gene expression inindividual patients.Although probably no two cancer genomes share exactly the same set of somatic muta-tions because of the stochastic nature of acquired mutations across the three billion base pairs,some cancer patients share common driver mutations or disrupted pathways. These patientsmay have similar prognoses and potentially benefit from the same kind of treatment options. Idevelop an efficient clustering algorithm to cluster high-throughput and high-dimensional bio-logical datasets, with the potential to put cancer patients into biologically meaningful groupsfor treatment selection.iiPrefaceA version of Chapter 2 has been published in Bioinformatics as an original paper [46]. I im-plemented the mutationSeq algorithm and performed all computational analyses except DNAsequencing and most of the work of aligning short DNA sequences to a reference human genome.I co-wrote the text with Dr. Sohrab Shah and Dr. Anne Condon. The genome sequencing datawere provided by project leaders Drs. Samuel Aparicio and Sohrab Shah as part of the study“Shah et al. (2012). The clonal and mutational evolution spectrum of primary triple-negativebreast cancers. Nature, 486(7403):395-399.”A version of Chapter 3 has been published in Nature Communications as an article [47]. Mostof the data used in this chapter were from The Cancer Genome Atlas project. The METABRICdata were from the study “Curtis, et al. (2012). The genomic and transcriptomic architecture of2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346-352.” I implemented thexseq software and performed all computational analyses described in this chapter. I co-wrotethe text with Dr. Sohrab Shah and Dr. Anne Condon.A version of Chapter 4 has been published in Bioinformatics as an original paper [49]. Iimplemented the densityCut software and performed all the computational analyses describedin this chapter. I co-wrote the text with Dr. Anne Condon and Dr. Sohrab Shah. All the dataused in this chapter were from previously published works.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Cancer and cancer genome sequencing studies . . . . . . . . . . . . . . . . . . . 11.2 Systems biology approach to profile cancer . . . . . . . . . . . . . . . . . . . . . 41.3 Modelling and prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Challenges in detecting and interpreting mutations from genomic sequencing data 131.5 Research contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.5.1 Predicting somatic single nucleotide variants . . . . . . . . . . . . . . . . 151.5.2 Predicting somatic mutations that correlate with gene dysregulation . . . 161.5.3 Clustering high-dimensional, high-throughput biological datasets . . . . . 162 Predicting somatic single nucleotide variants from paired normal-tumour se-quencing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.1 Feature construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21ivTable of Contents2.2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.2.4 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3.1 Classifiers outperform standard approaches . . . . . . . . . . . . . . . . . 282.3.2 Robustness of classifiers to different feature sets . . . . . . . . . . . . . . 332.3.3 Discriminative features are different for tumour and normal data . . . . . 342.3.4 Sources of errors and sub-classification of wildtypes . . . . . . . . . . . . 352.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382.4.1 Progress in SNV prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Predicting mutations that influence gene expression . . . . . . . . . . . . . . 403.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.2.1 Modelling the effects of mutations on expression with xseq . . . . . . . . 423.2.2 Inference and learning in xseq . . . . . . . . . . . . . . . . . . . . . . . . 453.2.3 Conditional distributions of gene expression values . . . . . . . . . . . . . 533.2.4 Influence graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.2.5 Collecting bona fide driver genes . . . . . . . . . . . . . . . . . . . . . . . 543.2.6 Modelling loss-of-function mutations and hotspot mutations . . . . . . . 553.2.7 Expression information in predicting driver mutations . . . . . . . . . . . 563.2.8 Compensating for the cis-effects of copy number alterations . . . . . . . . 583.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.3.2 Computational benchmarking and validation of xseq . . . . . . . . . . . 603.3.3 Cis-effects loss-of-function mutations across the TCGA data . . . . . . . 703.3.4 Trans-effect mutations across the TCGA data . . . . . . . . . . . . . . . 733.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803.4.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814 densityCut: an efficient and versatile topological approach for automatic clus-tering of biological data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83vTable of Contents4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864.2.1 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874.2.2 A random walk based density refinement . . . . . . . . . . . . . . . . . . 894.2.3 Local-maxima based clustering . . . . . . . . . . . . . . . . . . . . . . . . 904.2.4 Hierarchical stable clustering . . . . . . . . . . . . . . . . . . . . . . . . . 914.2.5 Complexity analysis and implementation . . . . . . . . . . . . . . . . . . 934.2.6 Parameter setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 944.2.7 Comparing clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.2.8 Comparing algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 984.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994.3.1 Benchmarking against state-of-the-art algorithms . . . . . . . . . . . . . 994.3.2 Inferring clonal architectures of individual tumours . . . . . . . . . . . . 1034.3.3 Clustering single-cell gene expression datasets . . . . . . . . . . . . . . . 1074.3.4 Clustering single-cell mass cytometry datasets . . . . . . . . . . . . . . . 1094.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.2 Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122viList of Tables2.1 The definitions of features x1 to x20 . . . . . . . . . . . . . . . . . . . . . . . . . 222.2 The definitions of features x41 to x60 . . . . . . . . . . . . . . . . . . . . . . . . . 222.3 The definitions of features x98 to x106 . . . . . . . . . . . . . . . . . . . . . . . . 232.4 The accuracy of classifiers by using different feature sets . . . . . . . . . . . . . . 342.5 The BART model selected features. . . . . . . . . . . . . . . . . . . . . . . . . . 353.1 Description of the random variables in xseq . . . . . . . . . . . . . . . . . . . . . 443.2 Description of the conditional distributions in xseq . . . . . . . . . . . . . . . . . 463.3 List of the twelve cancer types analyzed . . . . . . . . . . . . . . . . . . . . . . . 61viiList of Figures1.1 DNA sequencing and genetic aberration detection . . . . . . . . . . . . . . . . . . 31.2 Example probabilistic graphical models for both generative and discriminativemodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.3 Overview of the work done during my PhD study . . . . . . . . . . . . . . . . . . 152.1 The mutationSeq workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.2 The training data projected onto the three-dimensional space spanned by the firstthree principal components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.3 Cross-validation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.4 Predicting mutations in whole genome sequencing data . . . . . . . . . . . . . . . 322.5 The performance of RF and BART on different feature sets . . . . . . . . . . . . 332.6 The heatmap of wildtype features . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.7 Group two wildtype sequence motifs centred at error sites . . . . . . . . . . . . . 383.1 Overview of the xseq modelling framework . . . . . . . . . . . . . . . . . . . . . 433.2 A simple xseq model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.3 Mixture-of-Binomial modelling of loss-of-function mutations and hotspot mutations 573.4 Detecting highly-expressed genes based on mixture-of-Gaussian distributions . . 583.5 Scatter plots of PTEN copy number alterations and expression across 12 cancertypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593.6 Scatter plots of PTEN copy number alterations and expression after cis-effectremoving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.7 The xseq model used to sample T mutated genes . . . . . . . . . . . . . . . . . . 623.8 Theoretical performance of xseq on simulated datasets . . . . . . . . . . . . . . . 633.9 xseq parameter trace plots during Expectation-Maximization (EM) iterations forthe most challenging case when H is known . . . . . . . . . . . . . . . . . . . . . 64viiiList of Figures3.10 xseq parameter trace plots during EM iterations for the most challenging case(H is estimated offline) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653.11 The xseq-simple model prediction ROC curves from different simulated datasets 663.12 xseq-simple parameter trace plots during EM iterations for the most challengingcase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673.13 Permutation analysis of the TCGA acute myeloid leukemia datasets . . . . . . . 683.14 CCNE1 amplifications in high-grade serous ovarian cancer predicted by xseq-simple 703.15 CCNE1 amplifications in high-grade serous ovarian cancer predicted by xseq . . 703.16 The 65 genes harboured loss-of-function mutations with strong cis-effects on theexpression of these genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723.17 KPNA2 amplifications in breast cancer correlated with a set of gene up-regulations 753.18 NFE2L2 mutations and FECH up-regulation . . . . . . . . . . . . . . . . . . . . 773.19 Boxplots showing NFE2L2 mutations and FECH up-regulation . . . . . . . . . . 783.20 Patients harbouring the same gene mutations but with variations in trans-associatedgene expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794.1 Major steps of the densityCut algorithm . . . . . . . . . . . . . . . . . . . . . . 874.2 Tree merging and efficiency of densityCut . . . . . . . . . . . . . . . . . . . . . 924.3 The influence of densityCut parameter K and α on the final clustering results . 964.4 The role of the valley height adjustment step . . . . . . . . . . . . . . . . . . . . 974.5 Results on the synthetic benchmark datasets consisting of irregular, non-convex,or overlapped clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004.6 Clustering microarray gene expression data . . . . . . . . . . . . . . . . . . . . . 1034.7 Clustering variant allele frequencies (VAF) of somatic mutations . . . . . . . . . 1054.8 Clustering variant allele frequencies of somatic mutations using competing algo-rithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1064.9 sciClone clustering a lung/pancreas metastasis pair of a melanoma patient . . . . 1074.10 densityCut clustering silhouette values . . . . . . . . . . . . . . . . . . . . . . . . 1084.11 Clustering single-cell gene expression data . . . . . . . . . . . . . . . . . . . . . . 1104.12 Performance measures on clustering single-cell gene expression data . . . . . . . 1114.13 Visualizing the mouse brain stem cell expression data by t-SNE . . . . . . . . . . 1124.14 Comparing densityCut and PhenoGraph in clustering CyTOF data . . . . . . . 113ixList of Figures4.15 Comparing the time used by PhenoGraph and densityCut in clustering CyTOFdatasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1145.1 Genes connected to IDH1 and dysregualted in IDH1 mutated glioblastoma patients118xAcknowledgementsI’m greatly indebted to my supervisor Professor Sohrab Shah, who introduced me to compu-tational cancer biology when I was a novice in the field. During the years in his lab, Sohrabinspired me to gain a deep understanding of cancer biology and statistical modelling. I’m ex-tremely excited to have the opportunity to work with real patient data and to know that myresearch could benefit patient care. I would like to extend my deepest gratitude to my co-supervisor Professor Anne Condon, for all her guidance, dedication, as well as the challengingquestions in our meetings. Her intellectual rigour pushes me to think critically of the researchproblems.I would also like to express my gratitude and appreciation for Professors Sam Aparicio, WillEvans, and David Huntsman for their additional supervision, insightful comments, questions,and feedback. Special thanks go to collaborators Melissa McConechy, Gavin Ha, Ali Bashashati,Andrew Roth, Fong Chun Chan, Anthony Mathelier, Calvin Lefebvre, Hugo Horlings, TylerFunnell, Sarah Mullaly, Ryan Morin, Alicia Tone, Gholamreza Haffari, Ju¨ri Reimand, GaryBader, Pier-Luc Clermont, and other coauthors. Thanks also go to the people help me outin one way or another, especially Chris Thachuk from the Condon lab and Professor LaksLakshmanan.Finally, I would like to thank my wife, Rachel, for putting up with my nights and weekendsin the office for so long, and for giving me the motivation and encouragement to finish thisjourney.xiDedicationTo my parentsandto Rachel and Anna LuxinxiiChapter 1Overview of the thesis“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age offoolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light,it was the season of Darkness, it was the spring of hope, it was the winter of despair, we hadeverything before us, we had nothing before us, we were all going direct to Heaven, we were allgoing direct the other way - in short, the period was so far like the present period, that some ofits noisiest authorities insisted on its being received, for good or for evil, in the superlativedegree of comparison only.”– Charles Dickens, A Tale of Two Cities, 18591.1 Cancer and cancer genome sequencing studiesIn a multicellular organism, the cells behave in a social, cooperative manner: they differentiate,grow, proliferate, rest or die for the benefit of the organism as a whole [3]. Cancer cells, onthe contrary, violate these fundamental rules of cell behaviour by which multicellular organismsare built and maintained [3]: they grow selfishly at the expense of normal cells; they invadeneighbour tissues and build colonies in other parts of the body; eventually they destroy the entireorganism. Douglas Hanahan and Robert Weinberg summarized these cancer cell behavioursinto ten hallmarks such as sustaining proliferative signals, evading growth inhibitory signals andresisting cell deaths [84].Cancer cells obtain these hallmarks by accumulating a set of (as small as three in adult solidtumours) critical heritable genetic and epigenetic aberrations [9, 196, 213, 214]. For many typesof cancer (originated in different sites of the human body), it takes years for a cell to finally obtainthe set of critical genetic and epigenetic aberrations. Therefore, cancer is an evolutionary processinvolving successive rounds of random mutations (or epigenetic alterations, but mutations appearto be the fundamental and universal feature) followed by natural selection [8, 79, 154, 213]: acell randomly acquires a mutation (as a result of an un-repaired DNA replication error or anun-repaired DNA damage) which gives the cell some growth advantages; this cell grows and11.1. Cancer and cancer genome sequencing studiesproliferates slightly more vigorously than its neighbours to form an adenoma (benign tumour);the second round of clonal expansion is triggered by an additional critical mutation in a cellof the adenoma; several rounds of a critical mutation followed by clonal expansion transforma normal cell to a cancerous cell. The progression of colorectal cancer is a classic example ofclonal evolution in cancer [213]: for example, it takes about 20 years for a normal epithelial cellto sequentially acquire mutations in genes APC, then KRAS, and finally PIK3CA to becomecancerous.Because of the limitations of various DNA repair mechanisms, a daughter cell can accumu-late a few mutations [202, 230]. Many years of cell proliferation produces a full-blown cancerconsisting of billions of cells with heterogeneous genomes in different spatial sections of a solidcancer. It is also possible that spatially distinct regions of a primary solid tumour harbour dif-ferent sets of critical mutations that drive the development of cancer. Similarly, heterogeneitiesare common within a metastasis and among metastases of a patient, as well as between twocancer genomes of different patients [4, 213]. Given the universal feature of genetic heterogene-ity in cancer, what is extraordinary about cancer is that billions of cancer cells of a typicaltumour share the same set of critical mutations that drive the initialization and progression ofcancer [193, 214].Our understanding of cancer biology has grown enormously thanks to technical advancessuch as sequencing technologies. Current high-throughput massively parallel sequencing devicesprovide a cost-effective way to sequence the whole genomes of a bulk of cancer cells to identifyall the genetic aberrations. These instruments typically generate billions of short reads of dozensto hundreds of nucleotides. Each read corresponds to a DNA segment from the original cancercells (Fig. 1.1). By aligning the short reads to a reference genome [113, 114, 121, 122], differentkinds of genetic abnormalities can be detected [54, 143]: 1) Single nucleotide variants (SNVs) –nucleotide substitution mutations. SNVs in specific sites (mutation ‘hotspots’) of genes such asIDH1, KRAS, NRAS, BRAF, PIK3CA and in two hotspots of the promoter region of the geneTERT have been demonstrated to promote cancer. 2) Small insertions and deletions (indels) of aDNA segment. Indels in genes such as PTEN, APC, ARID1A, CDH1, and BRCA1/2 have beenshown to promote cancer [70]. 3) Large copy number alterations (CNAs) – deletions or gainsof extra copies of DNA segment spanning a gene, multiple genes or a whole chromosome [83].Copy number homozygous deletions of gene CDKN2A are frequent events in the aggressive braincancer – Glioblastoma [137], and loss of CDKN2A is a shared event in Glioblastoma patients21.1. Cancer and cancer genome sequencing studiesFigure 1.1: DNA sequencing and genetic aberration detection. a) A DNA pool from multiplecells is fragmented. The ends of each double strand DNA fragment are repaired. Then adaptersare ligated to both ends of each DNA fragment. These DNA fragments are attached to aflowcell and amplified. Finally to sequence to generate paired-end reads. b) Aligning the paired-end reads to a reference genome to detect various genetic aberrations. Figure modified fromMeyerson et al [143]harbouring other various copy number alterations driving the progression of Glioblastoma, andthus represents an early event in primary Glioblastoma [191]. 4) Genomic rearrangements –inversions or translocations of DNA segments. The classic example is the BCR-ABL fusion genethat results from the translocation between chromosome 9 and chromosome 22 in a subtype ofblood cancer – chronic myeloid leukemia [55].The genes with cancer promoting mutations (‘driver’) can be broadly grouped into twocategories based on whether inactivation or activation of these genes help drive a cell towardcancer. Oncogenes, when activated by mutations or epigenetic events, promote cancer formation.On the contrary, tumour suppressor genes prevent cancer formation in their active form - anti-oncogenes. Most oncogenes require only one hit or mutation to promote cancer (they tendto be genetically dominant) whereas tumour suppressor genes require biallelic inactivation topromote cancer. (They are genetically recessive.) Oncogenes harbour mostly gain-of-functionmutations (e.g., some missense mutations, in-frame indels, or copy number gains), which confernew functions or enhance the activities of the final gene products. Tumour suppressor genesfrequently accumulate loss-of-function mutations (e.g., some nonsense mutations, frame-shiftindels, splice-site mutations, or copy number deletions), which reduce or abrogate the functions31.2. Systems biology approach to profile cancerof the final gene products. Oncogenes and tumour suppressor genes can contribute to theonset of cancer indirectly. For example, defects in genes normally repairing DNA damages (e.g.,BRCA1 ) confer cell growth advantages by allowing cells that have chromosomal changes favoringgrowth to survive and divide [213]. In addition, some genes such as NOTCH1 can act as eitheroncogenes or tumour suppressor genes depending on mutation types and cell types.1.2 Systems biology approach to profile cancerGenome sequencing could become routine in the future standard of cancer care as the sequencingcost keeps decreasing [219]. However, it is a challenging or impossible task to detect all the‘driver’ mutations of each cancer patient based solely on statistical analysis of mutation datafrom DNA sequencing alone because of the far larger number of ‘passenger’ mutations thatare immaterial to neoplasia [71, 195, 196, 213] co-acquired during tumourigenesis [116]. Moreimportantly, we still need to understand the mechanism of how the driver mutations lead totumorigenesis and the progression of cancer in each patient, e.g., how a dysfunctional proteinresulting from a driver mutation confers the cell growth advantages, and how a cancer cell obtainsthe ‘hallmarks’ through the set of driver mutations. This knowledge is essential to understand themechanism of drug resistance and seek better treatment options [77]. We suggest that a systemsbiology [28, 88, 102, 159, 224] approach to integrative analysis of various high-throughput datasuch as mutation and gene expression data could be invaluable to both nominate candidatecancer driver mutations and interpret mutation functions. Below I briefly discuss some high-throughput data used in this dissertation.Transcriptome data Different types of cells of a single multicellular organism (e.g., a livercell and a muscle cell) express different set of genes, and the resulting RNA and protein productsdetermine the physical characters and the functions of the cells. Directly measuring all the finalprotein products of protein-coding genes is possible by techniques such as mass spectrometrybut currently is limited by the data quality (reproducibility issue [220]) and the lack of efficientcomputational annotation tools [226]. In contrast, microarray and more expensive but higherresolution RNA sequencing (RNA-Seq), which directly measure the abundance of each RNAmolecule by counting the number of reads mapped to the exon region of each gene, can generatehighly reproducible measurements. In addition, although a cell can control protein content afterRNA molecules are produced by directly degrading RNA or protein molecules or controlling RNA41.2. Systems biology approach to profile cancertranslation, transcription regulation is paramount because it can minimize the energy wastedto produce unnecessary molecules [3]. Correspondingly, mRNA gene expression data have beenwidely used to study biological problems. In cancer research, mRNA gene expression data havebeen used to stratify morphologically (under the microscope) indistinguishable tumours intoclinically relevant subgroups (e.g., the expression of a panel of 50 genes to stratify breast cancerpatients into five subgroups).Single-cell transcriptome data Instead of analyzing a bulk of cells to measure the popula-tion average, current high-throughput single-cell techniques have made it possible to efficientlymeasure the levels of all mRNA molecules in a specific cell [59, 90, 103, 128, 194]. Typically,these techniques include several steps: first cells are separated and the mRNA molecules of eachcell are separately captured; then the mRNA molecules of each cell are reverse transcribed tocDNA, which is amplified (e.g., by polymerase chain reaction) to generate enough materials forsequencing [194]. Inaccuracy is introduced by the low capture ratio (the proportion of mRNAmolecule captured) and PCR-amplification bias [90] (e.g., different cDNA molecules are ampli-fied differently.) Another factor influencing single-cell mRNA measurements is the cell state atthe time of sequencing (a cell can be at rest, in different stages of the cell cycle or in apoptosis).Single-cell mass cytometry data Mass cytometry (CyTOF) uses antibodies labeled withrare-earth element isotopes (with different masses) to bind to target epitopes of protein antigens(marker proteins) of cells [15, 90]. Cells with bound antibody-isotope conjugates are thennebulized into single cell droplets, and each cell is further vaporized to produce ions of itsatomic constituents. The ions are mass filtered to only keep the rare-earth element ions, whichare then mass measured by passing through a time-of-flight mass spectrometer (heavier ionstake longer to ‘fly through’ a fixed distance to reach a detector). The abundance of rare-earthelements is thus a measure of the marker protein expression. Mass cytometry is limited by thenumber of proteins that can be simultaneously measured (around 40 at present) because of thelimited number of rare-earth element isotopes. Compared to single-cell gene expression, masscytometry has higher throughput to measure protein expression of tens of thousands to evenmillions of cells (currently can process hundreds of cell per second) [15].Molecular networks A typical human cell expresses thousands of genes, and the proteinproducts interact with other molecules (e.g., O2, glucose, ATP) or ions (e.g., Na+) to perform51.3. Modelling and predictiondifferent reactions necessary for the cell, e.g., to break down foodstuffs or synthesize molecules.An ordered series of reactions (the products of one reaction serve as inputs or substrates ofthe next) are connected to form a pathway, e.g., the glycolysis pathway takes many steps toconvert glucose to pyruvate and release ATP. Three kinds of pathways have been widely studiedand documented in databases: metabolic pathways [99, 100, 132] about a cell obtaining energyand synthesizing molecules, gene regulation pathways [96] to turn on a set of genes and shutdown another set of genes at the right time and the right place, and signalling transductionpathways [132, 180] to sense intracellular or extracellular signals and respond accordingly. Thesepathway databases are still far from complete [18, 28, 54], miss tissue context information, andare biased by manual curation [28]. Different pathways are inter-linked to form a network. Oneof the most widely studied biological networks is the protein-protein interaction (PPI) networkwhere a node in the network represents a protein and a link between two nodes representphysical binding of the two proteins. A protein typically binds to other molecules to participatein biological processes. Therefore, the PPI networks are useful in interpreting protein functionsbecause two bound proteins tend to share common functions [3]. Large scale detecting of PPInetworks can be performed by yeast two-hybrid screens [172] (the standard yeast two-hybridscreen system has low sensitivity and specificity and can only detect nucleus interactions) oraffinity purification coupled to mass spectrometry [3]. Computational methods have long beenused to predict functional PPIs [64], which means two predicted interacted proteins may notphysically interact but participate in the same biological process.1.3 Modelling and predictionIn cancer systems biology research, we frequently reason under uncertainty [176], e.g., to inferwhether a signalling pathway is highly active in a patient given noisy measurements of DNAmutations and mRNA expression in that patient. We can summarize the uncertainty in datafrom measurement noise using probability distributions, e.g., the observed expression of a genein skin across people is modelled by a Gaussian distribution. The Bayesian approach allows usto further describe the uncertainty in model parameters through prior distributions and eventhe uncertainty in choosing different kinds of models (e.g., a spherical or an unconstrained twodimensional Gaussian distribution) [75]. A joint distribution of all the random variables specifiesthe space of possible outcomes and the corresponding probability masses or probability densities61.3. Modelling and predictionfor continuous distributions. After observing the values of a subset of random variables (evidencevariables), we can calculate the marginal distributions of some query variables of interest, e.g.,after observing a CTNNB1 gene mutation in a patient, we can update the marginal distributionof Wnt signalling pathway activity in that patient.For distributions that cannot be specified by a small number of parameters, even for simplebinary random variables, directly enumerating and assigning probabilities to the combinationsof outcomes take time and space exponential in the number of random variables. Unfortunatelythe joint distribution of a real problem is typically flexible, we therefore need to explore thestructures (conditional independences) in the joint distribution by factoring it into the productsof distributions, each over a small subset of random variables. Then we only need to specify theset of local distributions each involving a small subset of random variables to fully determine thejoint distribution. The probabilistic graphical model framework [20, 21, 22, 107, 150] providesa pictorial way for modulators (e.g., domain experts) to explore the conditional independentstructures in the joint distribution.One type of probabilistic graphical models, the directed graphical model (Bayesian network)is a directed acyclic graph G(V,E) whose vertices V correspond to random variables. A vertexX ∈ V and its parents Parent(X) (directly connected to X by an edge ∈ E and can be empty)consist a family, which has a conditional distribution p(x | Parent(x)) that quantifies the effectsof the parents on X. The joint distribution encoded by the graph is the products of the localconditional distributions:p(x | G) =|V |∏v=1p(xv | Parent(xv)) (1.1)where |V | means the cardinality of the vertex set V . The vector x = (x1, x2, . . . , xv, . . . , x|V |)T isany possible values of random variables (X1, X2, . . . , Xv, . . . , X|V |)T , where ‘T ’ means transposeof a vector. We typically specify the structures (topology) of a directed probability graphicalmodel, and learn the parameters (the conditional distributions) given data.Many classic probabilistic models can be represented as probabilistic graphical models. Theseclassic models could serve as building blocks to develop more sophisticated probabilistic models.71.3. Modelling and predictionFor example, the mixture model used throughput this thesis has the following joint distribution:p(x | pi,Θ) =K∑k=1pikp(x | θk) (1.2)where p(x | θk) is the distribution of the mixture component k governed by parameter θk ∈Θ = {θ1, . . . ,θK} = {θk}Kk=1, and pik is the mixture coefficient of component k. We use thevector pi = (pi1, . . . , pik)T to denote the mixture coefficients, which are nonnegative and sum toone. Here (pi1, . . . , pik)T means the transpose of the row vector (pi1, . . . , pik). In the literature, acontinuous distribution such as p(x | θk) in Equation 1.2 is commonly written asfX|Θ(x | θk) (1.3)In addition, Equation 1.2 (the notations used in this thesis) has the same meaning as Equation 1.4with slightly different notations:p(x;pi,Θ) =K∑k=1pikp(x;θk) (1.4)The density function in Equation 1.2 is not in the conventional form of the products ofconditional distributions as in Equation 1.1. If we introduce a latent variable Z, such thatZ = z indicates that the source component of data point x is z, then the joint distribution isp(x, z | pi,Θ) = p(z | pi,Θ)p(x | z,pi,Θ) = p(z | pi)p(x | z,Θ) = p(z | pi)p(x | θz) (1.5)where p(z | pi) is the categorical distribution such that P (Z = k | pi) = pik. Then we can get themarginal distribution p(x | pi,Θ) = ∑z p(x, z | pi,Θ) = ∑Kz=1 p(z | pi)p(x | θz) = ∑Kk=1 pikp(x |θk). The probabilistic graphical model in Fig. 1.2 (a) specifies the mixture model with latentvariable Z.We can use a probabilistic graphical model to generate artificial data by sequentially drawingsamples of random variables from parents to children, using the conditional distribution at eachrandom variable. For example, for the mixture model in Fig. 1.2(a), given the mixture coefficientspi and the component parameters Θ, an artificial data point x is generated in two steps: 1) thecomponent indicator variable z is sampled from the categorical distribution p(z | pi), and 2) theactuarial data point x is generated from the distribution p(x | θz).81.3. Modelling and prediction(a)Z X θkpi K(b)Zn Xn θkpi N K(c)Zn Xn θkα δpi N K(d)Xn Yn wαNFigure 1.2: Example probabilistic graphical models for both generative and discriminative mod-els. Circles represent random variables. Squares represent deterministic parameters. Shadednodes are observed, and unshaded nodes are hidden. Here we use the plate notation, i.e., nodesinside each box will get repeated when the node is unpacked (the number of repeats is on thebottom right corner of each box). Each node and its parents constitute a family. Given the par-ents, a random variable is independent of the ancestors. Therefore, the joint distribution of allthe random variables is the products of the family conditional distributions. (a) The structureof a mixture model with known and fixed parameters. (b) A mixture model with K mixturecomponents. We assume that the mixture coefficients pi and the parameters of each mixturecomponent θk are unknown but fixed parameters. (c) A Bayesian mixture model where bothpi and θk are random vectors and follow specific distributions with hyper-parameters α andδ, respectively. (d) A discriminative regression model with observed training data. Althoughparameters are also random variables from a Bayesian perspective, they are represented by lowercase letters in the graphical models.In addition to as a knowledge representation system (to encode conditional independencerelationships and to specify the joint distribution), graphical models also serve as powerfulreasoning engines to infer unknown quantities after observing some variables (evidence vari-ables) [176]. Given a dataset X = {x1,x2, . . . ,xN} = {xn}Nn=1, where a data point is a Ddimensional column vector xn = (xn,1, xn,2, . . . , xn,D)T whose component xn,j is a real number,we can add those data points as observed random variables as in the probabilistic graphicalmodel of Fig. 1.2(b). The data point specific latent variables z = {z1, z2, . . . , zN} = {zi}Nn=1.Both pi and Θ are unknown but fixed parameters of the mixture model. Then we can write the91.3. Modelling and predictionjoint distribution for the mixture model in Fig. 1.2(b) asp(z,X | pi,Θ) =N∏n=1p(zn | pi)p(xn | Θ, zn) (1.6)For probabilistic graphical models, we typically separate inference from learning. The infer-ence problem is to compute the marginal distributions of latent variables of interest, and thelearning problem is to learn the model parameters. For example, in Fig. 1.2(b), the inferenceis to compute p(zn | xn,pi,Θ), and the learning problem is to estimate pˆi and Θˆ, typically bymaximizing the likelihood. For the mixture model in Fig. 1.2(b), given the parameters, theinference problem is simple because Zn is independent of {Zi,Xi}i 6=n. Therefore, we can processeach data point independently as follows:P (Zn = k | xn,pi,Θ) = P (Zn = k | pi,Θ)p(xn | Zn = k,pi,Θ)∑Kc=1 P (Zn = c | pi,Θ)p(xn | Zn = c,pi,Θ)=pikp(xn | θk)∑Kc=1 picp(xn | θc)(1.7)For a general graphical model with many latent variables and observed variables, a simple in-ference algorithm is by summing out (or integrating out) the latent random variables that arenot interested one by one given the factorized joint distribution (the variable elimination algo-rithm). A more efficient inference algorithm (the belief propagation algorithm) uses dynamicprogramming to store and reuse intermediate results, and it can compute the marginal distri-butions of all the latent variables by passing messages (distributions) along the graph twice.For complex models when exact influence (e.g., both the variable elimination algorithm and thebelief propagation algorithm) is intractable, we have to use approximate inference algorithms.For a graphical model without latent variables, maximum likelihood learning of the modelparameters is simple. For example, for the mixture model in Fig. 1.2(b) when Zn is given,pˆik =∑Nn=1 Zn=kN , and θˆk can be estimated from the data points {xn | zn = k} the same way asestimating the parameters of a single component distribution from data. For a graphical modelwith latent variables, the learning problem is typically solved iteratively by the Expectation-Maximization (EM) algorithm, which involves first solving the inference problem in each itera-tion. For example, for the mixture model in Fig. 1.2(b), given the current guess of the modelparameters pit and Θt, we can first compute the distribution p(zn | xn,pit,Θt). Then the mix-ture coefficient can be estimated as pˆit+1k =∑Nn=1 P (Zn=k|xn,pit,Θt)N . The parameter θˆt+1k can be101.3. Modelling and predictionupdated similarly for the case without latent variables by considering a ‘soft’ assignment of datapoints to clusters. This process is repeated until the likelihood increases very small in eachiteration.We can take a Bayesian perspective to treat parameters as random variables and furtherspecify prior distributions govern by hyperparameters for these random variables. A Bayesianversion of the mixture model is in Fig. 1.2(c), which specifies a joint distributionp(pi,Θ, z,X | α, δ) = p(pi | α)K∏k=1p(θk | δ)N∏n=1p(zn | pi)p(xn | Θ, zn) (1.8)In the Bayesian setting, there is no distinctions between inference and learning since the param-eters are also random variables. By so doing, we can quantify the uncertainty of the parametersthrough the posterior distributions. For example, we can compute the posterior distributionp(pi,Θ, z | α, δ,X) = p(pi,Θ, z,X | α, δ)∫Θ∫pi∑z p(pi,Θ, z,X | α, δ)dpidΘ(1.9)The integral (summation) in Equation 1.9 is typically difficult to compute. Instead of computingthe posterior distribution analytically, the Markov chain Monte Carlo (MCMC) method [20, 127,150] draws samples from the posterior distribution. Specifically, MCMC constructs a Markovchain whose stationary distribution is the target posterior distribution in Equation 1.9. Thenwe can collect (approximately independent) samples from the Markov chain, these samples areused to construct an empirical estimation of the posterior distribution as these samples can beconsidered as from the target posterior distribution. However, MCMC tends to be slow. Insteadof drawing samples from the target distribution, the variational inference [20, 23, 127, 150] picksa distribution in a family q(pi,Θ, z) that most ‘similar’ to p(pi,Θ, z | α, δ,X) by minimizing theKullback-Leibler divergenceq∗(pi,Θ, z) = arg minq(pi,Θ,z)KL(q(pi,Θ, z)||p(pi,Θ, z | α, δ,X)− log(p(X)) (1.10)= arg minq(pi,Θ,z)∫∫pi,Θ∑zq(pi,Θ, z) logq(pi,Θ, z)p(pi,Θ, z | α, δ,X)dpidθ − log(p(X)) (1.11)Compared to MCMC, variational inference is relatively new and how the algorithm behaves isstill not well understood [23]. Instead of computing the posterior distribution, a computationallyefficient alternative is to do maximum a posteriori (MAP) estimation to find a mode of the pos-111.3. Modelling and predictionterior distribution. In the EM framework, the expectation step is the same as that in maximumlikelihood estimation of parameters. The prior distribution only influence the maximization stepin updating model parameters. Other optimization algorithms other than EM can also be usedto find a local maximum of the posterior distribution.Sometimes, we are given training data with ground truth information, i.e., a set of input-output pairs {(xn, yn)}Nn=1. Our task is to learn a mapping f(x) from x to y. From Bayestheorem we can get p(y | x) = p(x,y)p(x) = p(y)p(x|y)∑y p(y)p(x|y) for classification problems where y is a cate-gorical variable. This is called a generative model since we implicitly model the joint distributionp(x, y). To solve these problems, we can use discriminative models, i.e., directly modelling theconditional distribution of hidden variables given observed variables p(y | x), or simply findinga discriminative function f(x) (so called algorithmic modelling by Leo Breiman [27], includingpopular algorithms such as support vector machines, artificial neural networks [119], and ran-dom forests). Because the joint distribution p(x | y) is typically a multivariante distribution andis difficult to model, discriminative approaches bypass the difficulties of modelling p(x | y) andsolve simpler problems of either modelling p(y | x) or adopting the algorithmic modelling to finda discriminative function f(x). Compared to generative models, labelled ground-truth data aretypically required to train discriminative models. For example, Fig. 1.2(d) shows a regressionmodel. Let y = {y1, y2, . . . , yN} and w = (w0, w1, . . . , wD)T , where D is the dimensionality ofeach data point xn, the joint distribution for the statistical model in Fig. 1.2(d) isp(y,w | α,X) =N∏n=1p(yn | xn,w)p(w | α) (1.12)For ease of presenting the results, we add a constant one to each data point xn, e.g., xn =(1, xn,1, xn,2, . . . , xn,D)T . For binary logistic regression (Chapter 2) where y ∈ {0, 1}, p(yn |xn,w) = Ber(yn | sigm(wTxn)), which is a Bernoulli distribution with parameter sigm(wTxn) =11+exp(−wTxn) . For linear regression where y ∈ R, p(yn | xn,w) = N (yn | wTxn, σ2), which is aunivariate normal distribution with mean wTxn and variance σ2.121.4. Challenges in detecting and interpreting mutations from genomic sequencing data1.4 Challenges in detecting and interpreting mutations fromgenomic sequencing dataAlthough in principle we can sequence the whole genome of a cancer call to detect all the geneticaberrations, in practice we only observe part of these aberrations because of technical issues andexperimental design limitations in current high-throughput sequencing [213] (Chapter 2). First,typical sequencing platforms such as Illumina (Solexa), ABI SOLiD, Ion Torrent, SMRT ofPacific Bioscience and Roche 454 have platform-specific bias [170, 173], e.g., SMRT platformyields long single molecular reads that are subjected to insertions and deletions errors [170].Sequence errors can originate from sample preparation, e.g., low tumour cell content or formalinfixation that causes nucleotide mutations in samples for sequencing. Each step in sequencinglibrary preparation can introduce bias [170, 206]. Subclonal mutations (mutations only in someof the cancer cells) and mutations in copy number alterations regions can have very low variantallele frequencies (of the reads covering a mutation, the fraction of these reads with the variantallele) and make it difficult to detect them. Short reads from repetitive regions of the genomeare difficult to unambiguously map to the reference genome [46]. GC-rich regions such as thefirst exon of many genes can be difficult to fragment (Fig. 1.1(a)) or amplify by polymerasechain reaction (PCR) and are thus underrepresented in the sequence reads [17, 173]. Inadequatesequencing depth (or coverage, the average number of times a site is sequenced) can be a majorfactor in missing variants. Griffith et al show that current strategies for whole genome sequencingstudies missed many somatic mutations (mutations in the somatic cells but not the germ lineof a patient) [80]. By increasing the sequencing depths from 30x in their original study [53] to300x and using a consensus of somatic single nucleotide variant (SNV) callers, the number ofidentified SNVs increased from 118 to 1343. Moreover, an additional 2500 SNVs were highlylikely to be genuine somatic SNVs but still without enough evidence even at 300x coverage.Current cancer genome sequencing studies detected dozens to hundreds of mutations for atypical cancer genome, and even tens of thousands of mutations for the genomes with DNA-repair deficiency [80, 213]. To detect and interpret the functions of the small number of ‘driver’mutations is thus the focus of current cancer genomic sequencing studies [14, 29, 44, 117, 130,142, 164, 168, 203, 207, 208, 235] and of targeted therapies informed by patient genomic in-formation. If the driver mutation directly produces an overactive, or neomorphic (result in adominant gain of gene function that is different from the normal function) mutant protein or131.5. Research contributionsexcessive amount of protein products, drugs can possibly be designed to inhibit the activities ofthe mutant protein, e.g., BRAF inhibitors to treat melanoma patients with activating BRAFmutations [187]. For a driver mutation that results in non-functional protein products, the in-activation of this protein is expected to activate some growth-related proteins downstream ofa signal pathway [213]. The cancer cells are thus dependent on these proteins to survive andtherefore pathway information and synthetic lethality (mutations in two or more genes com-bined cause cell death, but a single mutated gene does not cause cell death) can be harnessed toselectively kill the cancer cells [138]. For example, ovary and breast cancer cells losing BRCA1or BRCA2 depend on PARP1 to repair single-strand DNA breaks, and inhibiting PARP causesaccumulated double-strand breaks in cancer cells, and leads to cancer cell death. To overcomethe almost inevitable developed resistance to single drug based treatments, multiple drugs canbe applied simultaneously to block several cancer signaling pathways to increase the possibilitiesof eliminating all the cancer cells [25]. However, to successfully apply these therapies, we needto separate driver mutations from passenger mutations in individual patients (Chapter 3).Although mutations in hundreds of genes are implicated in promoting cancer, these genesare components of a much smaller number of signal pathways through which cancer cells con-fer growth advantages [213]. Thus, it is common to observe similar messenger RNA (mRNA)expression profiles for two genetically different cancers as they share the same set of disruptedsignaling pathways (and show similar clinical outcomes). The large numbers of passenger mu-tations do not contribute to cancer development, but some of them result in aberrant mutantproteins that are foreign to the patient immune system. Therefore, the mutation data could actas markers to stratify patients for immunotherapy [118, 169, 181, 186, 192]. Recent technologyadvances, especially single cell technologies, enable us to profile cancer at single cell resolutions,e.g., mutations, copy number alterations, mRNA expression measurements from single cell se-quencing data, and protein expression from mass cytometry data. These data could revolutionizeour understanding of tumour heterogeneity and tumour microenvironment, and possibly helppatient stratification (Chapter 4).1.5 Research contributionsBelow, I summarize the research contributions of my work (Fig. 1.3). These points are expandedin subsequent chapters.141.5. Research contributionsMachine Learning,Algorithmsrhsmm [48, 50]mutationSeq[46], Chapter 2Analyzeendometrialcancer[134] Analyzebreastcancer[185]Analyzeovariancancer[13, 135]xseq [47],Chapter 3DriverNet [14]Analyzelymphoma[30, 131,148]densityCut[49], Chapter 4Figure 1.3: Overview of the work done during my PhD study. First author publications are inblue [46, 47, 48, 49, 50], co-first author publications in orange [14, 134], and other co-authoredpapers in grey [13, 135, 185].1.5.1 Predicting somatic single nucleotide variantsAdvances in high-throughput sequencing technology allow for rapid sequencing of tumour-normalpairs of DNA, where both tumour DNA and adjacent healthy tissue (or blood) DNA fromthe same individual are sequenced. These tumour-normal matched pairs allow for comparingbillions of DNA bases from a whole genome to identify somatic mutations. mutationSeq (http://compbio.bccrc.ca/software/mutationseq/) is among the first available tools specificallydesigned for the prediction of somatic single nucleotide variants – by integrating both normaland tumour sequencing data. mutationSeq is a feature-based supervised discriminative classifierapproach that takes advantage of valuable validated somatic mutations to predict novel single151.5. Research contributionsnucleotide variants. We further identified and classified the technical noise in high-throughputDNA sequencing data. mutationSeq has been used to profile 70 triple-negative breast cancers[185], high-grade serous ovarian cancer [13], endometrial cancer [134, 135] and many others. Sincethe publication of mutationSeq, large numbers of tools have been developed and are underdevelopment for discovering (calling) somatic single nucleotide variants from paired normal-tumour sequencing data. mutationSeq remains one of the best methods for single nucleotidevariant calling. mutationSeq has undergone three years of active development and has beenextended in several ways, and has been used in our lab and Canada’s Michael Smith GenomeScience Centre to profile thousands of cancer genomes and exomes.1.5.2 Predicting somatic mutations that correlate with gene dysregulationWe present a novel hierarchical Bayes statistical model, xseq (http://compbio.bccrc.ca/software/xseq/), to systematically quantify the impact of somatic mutations on expressionprofiles. We establish the theoretical framework and robust inference characteristics of themethod using computational benchmarking. We then apply xseq for the analyses of thousandsof tumour datasets available through The Cancer Genome Atlas, to systematically quantify so-matic mutations impacting expression profiles. We identify 30 novel cis-effect tumour suppressorgene candidates, enriched in loss-of-function mutations and bi-allelic inactivation. Analysis oftrans-effects of mutations and copy number alterations with xseq identifies mutations in 150genes impacting expression networks, with 89 novel predictions. We reveal two important novelcharacteristics of mutation impact on expression: (1) patients harbouring known driver mu-tations exhibit different downstream gene expression consequences; (2) expression patterns forsome mutations are stable across tumour types. These results have critical implications for iden-tification and interpretation of mutations with consequent impact on transcription in cancer.1.5.3 Clustering high-dimensional, high-throughput biological datasetsAs measurement technology advances have drastically enhanced our ability to generate varioushigh-throughput datasets, there is a great need to develop efficient and robust clustering al-gorithms to analyze large N (the number of data points), large D (the dimentionality of datapoints) datasets, with the ability to detect arbitrary shape clusters and automatically deter-mine the number of clusters. Therefore, in this chapter, we developed and implemented a novelclustering algorithm called densityCut (https://bitbucket.org/jerry00/densitycut_dev/161.5. Research contributionsoverview) to cluster high-dimensional, high-throughput biological datasets. Experimental re-sults on synthetic benchmark datasets demonstrate the robustness of densityCut in clusteringdatasets consisting of complex shape clusters. We apply densityCut to three real-world ap-plications, namely to cluster variant allele frequencies of somatic mutations to reveal clonalarchitectures of individual tumours, to cluster single-cell gene expression data to uncover cellpopulation compositions, and to cluster single-cell mass cytometry data to detect communitiesof cells of the same functional states or types.17Chapter 2Predicting somatic single nucleotidevariants from paired normal-tumoursequencing data“It’s difficult to make predictions, especially about the future”– Niels Bohr2.1 IntroductionThe genome-wide search for functionally important somatic mutations in cancer by emergent,cost-effective high throughput sequencing (HTS) technology has revolutionized our understand-ing of tumour biology. The discovery of diagnostic mutations [183], new cancer genes (ARID1A[225], PBRM1 [209], PPP2R1A [133], IDH1 [234], EZH2 [146]), insights into tumour evolutionand progression [51, 184] and definitions of mutational landscapes in tumor types (CLL [163],myeloma [32], lymphoma [147]) amongst many others, provide important examples of the powerand potential of HTS in furthering our knowledge of cancer biology.Using HTS to interrogate cancers for somatic mutations usually involves sequencing tumourDNA and DNA derived from non-malignant (or normal) tissue (often blood) from the samepatient. Consequently, cancer-focused HTS experiments differ considerably in experimentaldesign from the study of Mendelian disorders or normal human variation. In cancer studies,sequence reads from the two matched samples are aligned to a reference human genome, and listsof predicted variants using single nucleotide variant (SNV) predictors (callers) (e.g., Samtools[123], SOAPsnp [124], VarScan [105], SNVMix [78], GATK [136], VipR [5]) are compared inthe tumour and normal data. Using naive approaches, those variants appearing in the tumour,but not the normal sample would be considered putative somatic mutations and provide theinvestigator with a list of candidates to follow up for functional impact and clinical relevance.Unfortunately, such naive approaches often result in false predictions and we suggest herein that182.1. Introductionthe problem of computational identification of somatic mutations from HTS data derived fromtumour and matched normal DNA remains an unsolved challenge. As a result, labour-intensiveand often costly validation experiments are still required to confirm the presence of predictedsomatic mutations for both research purposes and clinical interpretation.Although some false predictions may be due to under-sampled alleles, most can be attributedto detectable artefacts that we argue can be leveraged in principled inference techniques toimprove computational predictions. Many different approaches to the problem of SNV discoveryfrom HTS have been implemented. Model-based methods such as SNVMix and SOAPsnp aimto probabilistically model the allelic distributions present in the data and infer the most likelygenotype from allelic counts. These methods avoid imposing ad-hoc depth-based thresholds onallelic distributions, but their accompanying software packages do not explicitly handle knownsources of technical artefacts and they must rely on pre- or post-processing of the data toproduce reliable predictions. Examples of features that can indicate artefacts include strandbias whereby all variant reads are sequenced in the same orientation [32], mapping quality - howwell each read aligns to its stated position, base quality - the signal to noise ratio of the basecall and average distance of mismatched bases to the end of the read, amongst many others (seeMethods). Many of these features are readily available from aligned data in packages such asGATK and Samtools and it is generally accepted that applying filters on these quality metricsis necessary to remove false signals. Some software tools such as VarScan, Samtools, and GATKaim to leverage these features in their SNV prediction routines; however, they are often guidedby heuristics, whereby somewhat arbitrary decision boundaries are implemented.We propose that training feature-based classifiers using robust classification methods fromthe machine learning literature will better optimize the contribution of each feature to thediscrimination of true and false positive somatic mutation predictions. Fitting such classifiers tolarge sets of ground truth data should allow us to discriminate classes of false positives that maybe predicted for different reasons, enabling a more thorough understanding of HTS machine,alignment and biology related artefacts that are informed by data. We suggest that features thatbest identify somatic mutations will differ in importance in the normal data compared to thetumour data, and so integrated analysis of the tumour and normal data will yield better resultsthan independent treatment of the two datasets. To our knowledge, this notion is currently notconsidered in any published somatic mutation detection method. Finally, flexible feature-basedclassifiers that can use any number of features can combine features from different software192.2. Methodspackages and therefore leverage newly discovered discriminative features to continually improvesomatic mutation prediction accuracy as the bioinformatics literature and methodology matures.In this chapter, we study the use of discriminative, feature-based classifiers and investigatecomputational features from aligned tumour and normal data that can best separate somaticSNVs from non-somatic SNVs. We implemented four standard machine learning algorithms:random forest, Bayesian additive regression tree, support vector machine, and logistic regres-sion, and compared their performance to each other and to standard methods for somatic mu-tation prediction. We trained the classifiers on a set of 106 features computed from tumour andnormal data on a set of ∼ 3400 ground truth positions from 48 primary breast cancer genomessequenced with exome capture technology, while simultaneously estimating the importance offeatures. Classifiers were evaluated in a cross validation scheme using robust quantitative accu-racy measurements of sensitivity and specificity on labeled training data, and on independentheld-out test data derived from four additional cases sequenced with a different technology. Weshow that principled, feature-based classifiers significantly improve somatic mutation predic-tion in both sensitivity and specificity over standard approaches such as Samtools and GATK.Finally, using discriminative features, we show how false positive (wildtype) positions can besegregated into several distinct types of systematic artefacts that contribute to false positivepredictions.2.2 MethodsFig. 2.1 shows the workflow of the feature-based classifier for somatic mutation prediction. Weused supervised machine learning methods fit to validated, ground truth training data originallypredicted using naive methods (see below for details). Using deep sequencing to validate predic-tions, we define positions as somatic mutations where the variant was found in the tumour butnot the normal, germline variants where the variant was found in the tumour and the normal,or wildtypes (no variants found in either the tumour or the normal, i.e., false positive predic-tions). The germline and wildtype positions are classed as non-somatic positions so that binaryclassifiers can be used. Features are constructed for each SNV in the training data using theexome capture bamfiles from the tumour and normal alignments. As explained below, we usefeatures available in Samtools, GATK and a set of features we have defined ourselves. Thesefeatures along with their somatic/non-somatic labels are the inputs to train classifiers. Given202.2. MethodsmutationSeq predictionmutationSeq trainingbamfilesA list of validated somatic,germline, and wildtype mutationsbamfilesconstruct features for each positiontrain classifierconstruct features for each positionpredictionsomaticprobabilityFigure 2.1: The workflow of the feature-based classifier for somatic mutation prediction fromDNA sequencing data. Given test bamfiles, each candidate site is represented by a featurevector, and the classifier trained on validated ground truth data is applied to make a prediction.The classifier outputs the probability of each site being somatic.test bamfiles, we construct features for each candidate site, and apply the trained classifier topredict the somatic mutation probability for each site.2.2.1 Feature constructionWe formalize the somatic mutation prediction problem as a classification problem. Each candi-date mutation site of the genome is represented by a feature vector x with 106 feature components{x1, . . . , x106}. The problem is to predict the label y of the feature represented site. y is definedas “1” if the site is a somatic mutation, and “0” otherwise. Below we first define each componentof the feature vector in detail, and then compare different models to predict y given x.Features x1 to x20 are constructed from the normal data and their definitions are given inTable 2.1. This table is based on the table in http://samtools.sourceforge.net/mpileup.shtml. Features x21 to x40 have the same definitions but are constructed from the tumour. Fea-tures x41 to x60 are constructed from the normal data and their definitions are given in Table 2.2.These features are constructed based on GATK. Features x61 to x80 have the same definitionsbut are constructed from the tumour. We show in Section 2.3.3 how simultaneous treatment ofthe tumour and normal data allow the classifiers to differentially weight corresponding features212.2. MethodsTable 2.1: The definitions of features x1 to x20. Q13 means base quality bigger or equal toPhred score 13; D represents the three-dimensional vector (depth, number of reference basesand number of non-reference bases) at the current site; Gi ∈ {aa, ab, bb} means the genotypeat site i, where a, b ∈ {A,C, T,G} and a is the reference allele and b is the non-reference allele.These features are constructed from Samtools.1. number of reads covering or bridging the site2. number of reference Q13 bases on the forwardstrand3. number of reference Q13 bases on the reversestrand4. number of non-reference Q13 bases on the for-ward strand5. number of non-reference Q13 bases on the re-verse strand6. sum of reference base qualities7. sum of squares of reference base qualities8. sum of non-reference base qualities9. sum of squares of non-reference base qualities10. sum of reference mapping qualities11. sum of squares of reference mapping qualities12. sum of non-reference mapping qualities13. sum of squares of non-reference mappingqualities14. sum of tail distances for reference bases15. sum of squares of tail distance for refer-ence bases16. sum of tail distances for non-referencebases17. sum of squares of tail distance for non-reference bases18. P (D | Gi = aa), phred-scaled, i.e., x istransformed to −10 log(x)19. maxGi 6=aa(P (D | Gi)), phred-scaled20.∑Gi 6=aa(P (D | Gi)), phred-scaledTable 2.2: The definitions of features x41 to x60. These features are constructed from GATK.41. QUAL: phred-scaled probability of the callgiven data42. allele count for non-ref allele in genotypes43. AF: allele frequency for each non-ref allele44. total number of alleles in called genotypes45. total (unfiltered) depth over all samples46. fraction of reads containing spanning deletions47. HRun: largest contiguous homopolymer run ofvariant allele in either direction48. HaplotypeScore: estimate the probability thatthe reads at this locus are coming from no morethan 2 local haplotypes49. MQ: root mean square mapping quality50. MQ0: total number of reads with mappingquality zero51. QD: variant confidence/unfiltered depth52. SB: strand bias (the variation being seenon only the forward or only the reversestrand)53. sumGLbyD54. allelic depths for the ref-allele55. allelic depths for the non-ref allele56. DP: read depth (only filtered reads usedfor calling)57. GQ: genotype quality computed based onthe genotype likelihood58. P (D | Gi = aa), phred-scaled59. P (D | Gi = ab), phred-scaled60. P (D | Gi = bb), phred-scaledso as to emphasize tumour-specific and normal-specific features that best discriminate betweenreal and false predictions.To account for variance in depth across the data, features that scale with depth (e.g., featurex2 to x17) are first normalized by dividing by the depth. In addition to Samtools and GATKwe added several features that we noticed may contribute to systematic errors. For example,222.2. MethodsTable 2.3: The definitions of x98 to x106. These features are used to boost weak mutation signalsin the tumour and decrease the influence of germline polymorphism. In this table, Fi means thenormalized version of the ith feature.98. Forward strand non-reference base ratioF24/F499. Reverse strand non-reference base ratio F25/F5100. Sum of non-reference base quality ratio F28/F8101. Sum of squares of non-reference base qualityratio F29/F9102. Sum of non-reference mapping quality ratioF32/F12103. Sum of squares of non-reference mappingquality ratio F33/F13104. Sum of non-reference tail distance ratioF36/F16105. Sum of squares of non-reference tail dis-tance ratio F37/F17106. Non-reference allele depth ratio F75/F55in [139], the authors found that GGT sequences are often erroneously sequenced as GGG. Tocapture this artefact, we computed the difference between the sum of the base qualities of thecurrent site and the next site, the sum of the square of the base qualities of the current site andthe next site, for both normal and tumour. These features are defined as features x81−84. Inaddition, the reference base, the alternative base of the normal as well the alternative base ofthe tumour are included as features x85−95 (by dummy representation of categorical variables).In addition, to combine strand bias effects from the tumour and normal data, we define featurex96 and feature x97 to estimate the strand bias from the pooled normal and tumour data.To boost weak signals such as rare somatic mutations that may be undersampled or representa mutation occurring in a small proportion of cells in the tumour, and to decrease the influence ofgermline polymorphism, another nine features are introduced. The definitions of these featuresare given in Table 2.3. Note in the table, Fi means the normalized version of the ith feature(dividing by the depth feature). All features are standardized to have zero mean and unitvariance prior to training and testing.2.2.2 ModelsAfter constructing the feature value vector x for each candidate somatic position, the problemis to find a discriminative function f(x) which optimally separates the true somatic positionsfrom false somatic positions. In so doing, we wished to simultaneously learn the features thatbest discriminate the two classes. Numerous tools have been developed to solve this problemin the statistics and machine learning community. Here we compare four methods: logisticregression (Logit), support vector machines (SVM), random forests (RF) [89], and Bayesianadditive regression tree (BART) [35]. These methods (described below) differ in their underlying232.2. Methodsmethodology and generally represent broad classes of classifiers present in the machine learningliterature. We set out to compare performance of the different approaches in the specific contextof predicting somatic mutations from HTS data.L1 regularized logistic regressionBinary logistic regression is a generalization of the linear regression for classification problemsand models the conditional probability of y ∈ {0, 1} given the feature vector x, representing acandidate mutation site, asp(y | w,x) = Ber(y | sigm(wTx + w0)) (2.1)where w = (w1, . . . , wD) is the weight for each feature, D = 106 is the dimensionality of eachfeature vector and the function sigm(wTx) = 11+exp(−wTx) is the logistic function.Here we put a prior p(w) on the weight of the logit model, and do maximum a posteriorestimation (MAP) of the parameter w. We focus on the factorized Laplace priorp(w) =D∏j=1p(wj |ρ) =D∏j=112ρexp(−|wj |ρ) (2.2)GivenN independent and identically distributed (i.i.d.) validated mutation pairsD = {(xi, yi)Ni=1},the posterior of w is:p(w | D) ∝ p(w)p(D | w)= p(w)N∏n=1sigm(wTxn + w0) (2.3)− log(p(w | D)) ∝ D log(2ρ) +D∑j=1|wj |ρ−N∑n=1(yn log(p(yn|w,xn)) + (1− yn) log(1− p(yn|w,xn)))(2.4)The Laplace prior introduces a penalty term 1ρ∑Dj=1|wj | to the negative log-likelihood func-tion. Because the L1 norm is defined as ‖w‖1:=∑Dj=1|wj |, the above logistic regression modelwith Laplace prior on the weights is called L1 regularized (penalized) logistic regression model.The L1 regularized logistic regression model has the property of shrinking the weights of irrele-242.2. Methodsvant features to zero.Support vector machineThe support vector machine (SVM) classifier is a generalization of logistic regression by basisexpansion and finds a linear discriminative function of the formf(x) = wTΦ(x) + w0 (2.5)where Φ is a basis function which maps x from a 106 dimension space to a new feature space.Although f(x) is a linear function of Φ(x), it may be a nonlinear function of x. When Φ(x) = x,the discriminative function is a hyperplane in the original space spanned by the set of trainingvectors x, and the corresponding classifier is called a linear SVM.The SVM assumes that the optimal discriminative function is the one which leaves thelargest possible margin on both sides of the feature space (for binary classification; The featurespace is the original 106 dimensional space for a linear SVM). When the data points are notlinear separable in the feature space, the points that are misclassified or inside the margin arepenalized. A parameter C is used to control the trade-off between the margin width and thenumber of points that are misclassified or inside the margin (A large C tends to produce anarrow margin).Random forestsRandom forests (RF) learns the set of basis functions Φ(·) from data and is a tree-based methodfor classification and regression analyses. Given a training dataset consisting N input-outputpairs {(xn, yn)Nn=1}, where xn = (xn,1, . . . , xn,j , . . . , xn,D)T is a feature vector and yn ∈ {0, 1} isthe corresponding output label, a classification tree g is grown by repeatedly binarily splitting thefeature space into disjoint hyper-rectangle regions {Um}Mm=1. The splitting rules are typicallybased on a single component x,j of the whole training feature vectors and are of the form{x,j ≤ s} vs {x,j > s}, where s is a real number determined in training. After the classificationtree is grown, a leaf node m represents a region Um with Nm observations from the trainingdata, and the points in this region are assigned a label based on the majority vote:km = arg maxk1Nm∑xn∈UmI(yn = k) (2.6)252.2. Methodswhere k ∈ {0, 1} in our case, and I(·) is the indicator function. Here a leaf node m defines abasis function Φm(·). Let x = (x1, . . . , xD)T is a D-dimensional feature vector,Φm(x | θm) =1, if x ∈ Um.0, otherwise(2.7)where θm is the parameter vector for the leaf node m of the tree (an ordered list of splittingrules on the path from the root to the leaf node). The discriminative function can be written interms of the basis functions f(x) =∑Mm=1 kmΦm(x | θm). RF makes a prediction based on theoutputs of an ensemble of B trees:p(y = k | x,Θ1, . . . ,Θb, . . . ,ΘB) =∑Bb=1 I(gb = k)B(2.8)where Θb is the set of parameters of tree b, and gb = k means that tree b’s prediction is k.Specifically, B bootstrap samples (random sampling with replacement, each sample includes Ndata points) are drawn from the training data, and a tree is grown on each bootstrap sample.To reduce the dependence of the B trees, only p features out of all the features (106 for ourcase) are selected as candidates for splitting at each intermediate node to grow a tree.Bayesian additive regression treeGenerally speaking, the Bayesian additive regression tree (BART) can be considered as aBayesian version of the random forests model and is based on regression trees. As for theclassification trees used for random forests classification, a regression tree is grown by repeatedbinarily splitting the feature space into disjoint hyper-rectangle regions {Um}Mm=1. The discrim-inative function of a regression tree with M leaf nodes isf(x) = g(x | T,Θ) +  =M∑m=1µmI(x ∈ Um) +  (2.9)where  ∼ N ( | 0, σ2), T represents the structure of the binary regression tree, Θ represents theset ofM parameter vectors, each for a leaf node (an ordered list of splitting rules on the path fromthe root to the leaf node), and µm is the mean response in region Um (i.e., µm =1Nm∑xn∈Um yn).For binary classification, e.g., to classify a site x as somatic (y = 1) or non-somatic (y = 0),262.2. Methodsthe class probability is defined as:p(y | x,Θ1, . . . ,ΘB) = Φ(B∑b=1gb) (2.10)where Φ(.) is the standard Gaussian cumulative distribution function.The BART model differentiates from RF because BART is a fully Bayesian model. All theparameters (tree structures T and leaf parameters Θ) are given hyper-priors and Markov chainMonte Carlo (MCMC) sampling is used for inference.2.2.3 DatasetsWe used two independent datasets to train and test the performance of the models for somaticmutation prediction. The first dataset (exome capture data) consists of 48 triple negative breastcancer Agilent SureSelect v1 exome capture tumour/normal pairs sequenced using the Illuminagenome analyzer as 76bp pair-end reads. These data were generated as part of a large-scale se-quencing project [185] whereby 3369 variants were predicted using only allelic counts and liberalthresholds. Follow-up re-sequencing experiments achieving ∼6000x coverage for the targetedpositions revalidated 1015 somatic mutations, 471 germline and 1883 wildtype (false positive)positions. The exome capture data are subdivided into two groups consisting of non-overlappingpositions:• SeqVal1 (somatic: 775 germline: 101 wildtype: 487, total: 1363)• SeqVal2 (somatic: 269, germline: 428, wildtype: 1410, total: 2107)SeqVal1 positions were obtained by aligning the reads to the whole human genome, while SeqVal2positions were obtained by aligning the reads to a reference limited to the targeted human exons.SeqVal2 was considerably noisier due to misalignments. (Note that 101 positions overlapped inthe two datasets therefore we removed redundant sites from the combined dataset.)The second dataset (whole genome shotgun data) is from four whole human genome tu-mour/normal pairs sequenced using the Life Technologies SOLiD system as 25− 50bp pair-endreads. These data were aligned to the human genome by using the BioScope aligner. Groundtruth for these samples was obtained from orthogonal Illumia exome capture experiments fol-lowed by targeted resequencing on the same DNA samples resulting in 113 somatic mutations,57 germline mutations and 337 wildtypes. These four samples are completely independent of272.3. Resultsthe training samples in SeqVal1+SeqVal2.2.2.4 Experimental designFor each of the four classifiers, we used the exome capture data for classifier training, and testedon the whole genome shotgun data. For training, we used the following procedure. Since each ofthe models accepts hyper-parameters, we applied a 10-fold cross-validation analysis on a rangeof hyper-parameters to approximate the optimal settings by the one-standard error rule. Weapplied the resulting settings on all of the exome capture data in the final training step. Weobtained a set of discriminative features using ensemble feature selection [1] after training using40 bootstrap samples and finally computed a feature set aggregated from these 40 samples foreach classifier. (For each bootstrap sample, Logit selected the features with nonzero weights,SVM and RF used background elimination to gradually remove the least informative features,and BART selected the set of features with appearance frequencies great than average of 1/106.)To test the robustness of each classifier to the input set of features, we trained each classifierusing each of the other classifiers’ feature sets, producing 4 × 4 = 16 results, which were thenassessed using sensitivity, specificity and accuracy metrics.To compare our classification methods to standard approaches for SNV detection, we usedtwo popular methods: GATK v1.0.5543M and Samtools v1.16. Samtools mpileup and bcftoolswere run independently on the tumour and normal bamfiles to produce SNV calls at the 3369positions in the exome capture data. Those SNVs present in the tumour list, but not thenormal were considered as somatic mutations, otherwise they were considered as non-somatic.For GATK, we used the UnifiedGenotyper tool in a similar fashion to classify the positionsin the exome capture data. We also compared the results after removing small indel-inducedartefacts using GATKs local realignment and base quality recalibration tool. We then comparedall methods using accuracy and receiver operator characteristic curves (ROC).2.3 Results2.3.1 Classifiers outperform standard approachesTo visually investigate the discriminative ability of the features, we used principal componentanalysis (PCA) to project the 106-dimensional feature vectors to a three-dimensional spacespanned by the first three principal components. Fig. 2.2 shows that somatic mutations were282.3. Results−10 −5  0  5 10 15 20 −15−10 −5  0  5 10 15 20−15−10 −5  0  5 10 15 20PC3PC1PC2l SomaticWildtypeGermlineFigure 2.2: The training data projected onto the three-dimensional space spanned by the firstthree principal components. The somatic mutations (red) are reasonably well-separated fromnon-somatic mutations (germline - green, wildtype - blue).reasonably separated from non-somatic mutations, suggesting that accurate classifiers couldpotentially be learned from the set of features we chose to examine. Comparison of accuracyon the combined dataset SeqVal1+2 of the different classifiers, GATK’s UnifiedGenotyper andSamtools bcftools (Fig. 2.3(a)) showed that BART was most accurate (0.9679) followed by RF(0.9567), SVM (0.9555) and Logit (0.9065). All classifiers were better than Samtools (0.8103)and GATK (0.7551). We next evaluated the contribution of specificity to the accuracy resultsusing a high fixed sensitivity of 0.99 to establish a probability threshold for each of the classifiers.Specificity at this threshold was 0.9584, 0.9422, 0.9405 and 0.8704 for BART, RF, SVM andLogit respectively, suggesting that Logit had less discriminative power than the other classifiers.The comparative sensitivity and specificity breakdown for Samtools was 0.8631 and 0.7876, andfor GATK it was 0.9842 and 0.6563. Thus, Samtools had more balanced misclassifications,whereas GATK was very sensitive but with a lower specificity than the other methods.Similar patterns were observed for independent analysis of SeqVal1 (Fig. 2.3(b)), although292.3. Resultsresults for GATK (sens: 0.9819, spec: 0.9031) and Samtools (sens: 0.8245, spec: 0.9218) werebetter than for the SeqVal1+2 results. We also tested whether local realignment reads aroundinsertions and deletions and base quality re-calibration (post alignment processing tools in theGATK package) improved results. The classifier results were nearly identical to those shown inFig. 2.3(b) for SeqVal1 (Fig. 2.3(c)). However, while results for Samtools and GATK both showedan improvement in specificity, there was a substantial reduction in sensitivity (Fig. 2.3(c)). Wealso assessed results on SeqVal2 independently (Fig. 2.3(d)) and found that accuracy was highestfor BART (0.9312) followed by RF (0.9282), SVM (0.9160), Logit (0.8677), Samtools (0.7651)and GATK (0.6208). All methods were worse on this dataset than on SeqVal1, although thedifference for the classifiers was more moderate than the other methods. The markedly worseperformance of Samtools and GATK for this dataset was mainly due to considerably decreasedspecificity; this dataset was generated from constrained alignments to exons that likely inducemany false alignments, thus the classifier methods may be more robust to artefacts introducedby misalignments.We then assessed the statistical significance of the observed differences between methodsusing the best performing results for Samtools and GATK (SeqVal1). For each cross valida-tion fold, we fixed sensitivity according to the Samtools results and computed the specificityof the other methods. We then compared the specificity distributions of the methods over allfolds using a one-way ANOVA test. Similarly, we evaluated sensitivity distributions by fixingspecificity. A similar procedure was then applied to the GATK results. The classifiers were notstatistically different from each other in any comparison. However, all classifiers were statisti-cally significantly higher in specificity and sensitivity (ANOVA, p< 0.00001) than Samtools andGATK.To test the generalization performance of the trained classifiers, we applied them to the testdata: four cases with whole genome shotgun sequencing from tumour and normal DNA on theSOLiD platform. Despite being trained on exome data, the classifiers performed extremely welland recapitulated the results seen in the cross validation experiments (Fig. 2.4). The accuracyfor the classifiers by using the same thresholds for the training data was 0.9487, 0.9487, 0.9369and 0.9191 for BART, SVM, RF and Logit respectively. Samtools accuracy was 0.9053 followedby GATK at 0.8738. These results indicate that, on a limited dataset, the trained parametersshould generalise well to other platforms and are likely robust to overfitting. On the orthogonaltest data, all classifiers outperformed GATK and Samtools in both sensitivity and specificity302.3. Results0 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  RF (0.9567)BART (0.9679)SVM (0.9555)Logit (0.9065)Samtools (0.8103)GATK (0.7551)(a) SeqVal1+20 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  RF (0.9699)BART (0.9735)SVM (0.9628)Logit (0.964)Samtools (0.8665)GATK (0.9479)(b) SeqVal10 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  RF (0.9735)BART (0.9699)SVM (0.9628)Logit (0.9628)Samtools (0.8048)GATK (0.9384)(c) SeqVal1 realignment0 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  RF (0.9282)BART (0.9312)SVM (0.916)Logit (0.8677)Samtools (0.7651)GATK (0.6208)(d) SeqVal20.750.80.850.90.951RF BARTSVM Logit STSensitivity ANOVA p−value = 4.1884e−260.750.80.850.90.951RF BARTSVM Logit STSpecificity ANOVA p−value = 5.2109e−09(e) SeqVal1, compare with Samtools (ST)0.750.80.850.90.951RF BARTSVM LogitGATKSensitivity ANOVA p−value = 9.8084e−090.750.80.850.90.951RF BARTSVM LogitGATKSpecificity ANOVA p−value = 5.3159e−09(f) SeqVal1, compare with GATKFigure 2.3: (a) Accuracy results from cross-validation experiments on all the exome capturedata (SeqVal1+2). All classifiers showed better results than Samtools and GATK’s predictionresults in terms of ROC comparison. The numbers in parentheses are the prediction accuracyby fixing the sensitivity at 0.99, except for Samtools and GATK’s prediction results becausetheir outputs are deterministic. (b) Accuracy results from cross-validation experiments on theexome capture data of SeqVal1. (c) Accuracy results from cross-validation experiments on theexome capture data of SeqVal1 after GATK’s local realignment around indels and base qualityrecalibration. (d) Accuracy results from cross-validation experiments on the exome capture dataof SeqVal2. (e) Comparison of classifiers and Samtools’s (ST) performance at the specificity andsensitivity level given by Samtools. (f) Comparison of classifiers and GATK’s performance atthe specificity and sensitivity level given by GATK.312.3. Results0 0.1 0.2 0.3 0.40.70.750.80.850.90.9511−SpecificitySensitivity  RF (0.9369)BART (0.9487)SVM (0.9487)Logit (0.9191)Samtools (0.9053)GATK (0.8738)Figure 2.4: ROC curves derived from the held-out whole genome shotgun independent test datafrom four cases show different classifiers’ prediction results as well as Samtools and GATK’sprediction results. The numbers in the parentheses are the prediction accuracy by using thesame threshold as for the exome capture data (except for Samtools and GATK’s predictionresults.)with BART exhibiting the best overall performance.We next investigated whether the use of classifiers or the expanded set of features contributedto increased performance of our methods. Using the SeqVal1 dataset, we restricted the analysisto only Samtools-derived features (x1−40) and compared the results of classifiers with those ofthe Samtools caller. All classifiers performed statistically better than the Samtools caller. Forthe second experiment, we restricted the analysis to only the GATK features (x41−80). As forSamtools, the classifiers showed statistically significantly better results than those of the GATKcaller. These results suggest that the classifiers on the same set of features for both Samtoolsand GATK better approximated the “optimal” decision boundary without the use of heuristicthresholds employed by the naive methods and demonstrate the clear advantages of the machinelearning approaches we used.Finally, we studied the effect of the additional 26 features we introduced (Table 2.3, x81−106)to the Samtools and GATK features in order to boost weak mutation signals in the tumourand decrease the influence of germline polymorphisms. We compared the performance of RFand BART on different feature sets: Samtools alone (x1−40), GATK alone (x41−80), our 26features alone (x81−106), and all features combined (x1−106). As shown in Fig. 2.5, by using allthe features, RF and BART showed the best performance in terms of accuracy. (Although the322.3. Results0 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  RF All (0.9699)RF ST (0.9509)RF GATK (0.9557)RF Other (0.9604)Samtools (0.8665)GATK (0.9479)0 0.1 0.2 0.3 0.4 0.50.650.70.750.80.850.90.9511−SpecificitySensitivity  BART All (0.9735)BART ST (0.9557)BART GATK (0.9533)BART Other (0.9545)Samtools (0.8665)GATK (0.9479)Figure 2.5: (a) The performance of RF on different feature sets. RF All: RF’s accuracy byusing all the features, RF ST: RF’s accuracy by using only the Samtools features, RF GATK:RF’s accuracy by using only the GATK features, RF Other: RF’s accuracy by using all the newconstructed 26 features. (2) The performance of BART on different feature sets. BART Allmeans BART’s accuracy by using all the features. BART ST, BART GATK and BART Otherare similarly defined.26 novel features did not dramatically increase the performance, partially because the originalfeature sets already captured some information in the new feature sets.) We therefore suggestthat while the use of the machine learning classifiers accounts for the majority of improvementover naive methods, further improvement is achievable with the introduction of novel featuresthus illustrating the power of the flexible framework we used in this study.2.3.2 Robustness of classifiers to different feature setsWe used ensemble feature selection to output a set of the most salient, discriminative featuresfor each classifier, leading to four feature sets overall. We then fit each classifier to the exomecapture data using only the four selected feature sets output from the ensemble feature selectionmethod and classified the SOLiD data. We noted (Table 2.4) that overall, BART and RF weremost robust to the initial set of features and performed similarly for each of the four sets, showingstable performance in the presence of variable initial feature sets. Interestingly, all four methodsperformed equally well on the features chosen by ensemble feature selection applied to BART(Table 2.4) with a mean accuracy of 0.9453 compared to 0.9359, 0.9339 and 0.9157 for SVM,Logit, and RF chosen features, respectively.In summary, all classifiers performed better than the naive methods for somatic SNV predic-tion, with RF and BART showing the best performance. RF classifier is slightly more sensitive(Fig. 2.4), while BART has slightly higher specificity (Fig. 2.3(d)). The performance of SVM332.3. ResultsTable 2.4: The accuracy of classifiers on the SOLiD data by using different feature sets. HereRF Feature means the feature selected by RF classifier. BART Feature, SVM Feature andLogit Feature are similarly defined. The numbers in parentheses are the number of featureselected.PPPPPPPPModelFeatureRF Feature (18) BART Feature (23) SVM Feature (17) Logit Feature (17)RF 0.9369 0.9487 0.9448 0.9329BART 0.9369 0.9428 0.9369 0.9310SVM 0.9034 0.9408 0.9369 0.9408Logit 0.8856 0.9487 0.9250 0.9310Mean 0.9157 0.9453 0.9359 0.9339and Logit is relatively poor, especially in the presence of outliers as can be seen from Fig. 2.3(d).Importantly, both RF and BART are less sensitive to different feature sets compared to SVMand Logit (Table 2.4). Overall, the data support using RF and BART over SVM and Logitand suggest that RF may achieve better sensitivity while BART will achieve higher specificity,though both methods are extremely comparable.2.3.3 Discriminative features are different for tumour and normal dataThe description of the set of features selected by BART is given in Table 2.5. The features fellinto five broad categories: i) allelic count distribution likelihoods: provided by both Samtoolsand GATK; ii) base qualities such as the sum of reference base qualities, sum of non-referencebase qualities, sum of squares of non-reference base quality ratio; iii) strand bias such as sumof the pooled estimation of strand bias on both strands; iv) mapping qualities such as themean square mapping quality; and v) tail distance such as sum of squares of tail distance fornon-reference bases and sum of squares of non-reference tail distance (min distance of variantbase to the ends of the read) ratio. Notably, the features are often different in the tumourand normal. For example, the reference base qualities (x6) are selected in the normal, but fortumour both reference (x26) and non-reference base qualities (x28) are selected. Other tumour-specific features included sum of tail distance of the non-reference bases (x37), allele frequencyfor each non-reference allele (x63), and variant confidence normalized by depth (x71). Therefore,BART assigned unequal weights to the features in the normal and tumour, suggesting that theimproved accuracy is due to treating the tumour and normal data differentially to optimize thecontribution of the discriminant features. We note in Table 2.5 that BART selected several ofthe new features we designed (x83, x96, x97, x99, x101, x102, x105). These were not in Samtools342.3. ResultsTable 2.5: The BART model selected features. As expected, the likelihoods provided by bothSamtools and GATK, the base quality, mapping quality, strand bias, tail distance features arerelevant. The features selected from the normals and tumours are different.Index Feature definition Tumour Samtools GATK6 sum of reference base qualities√10 sum of reference mapping qualities√19 maxGi 6=aa(P (D | Gi))√26 sum of reference base qualities√ √28 sum of non-reference base qualities√ √37 sum of squares of tail distance for non-referencebases√ √38 P (D | Gi = aa) √ √41 QUAL: phred-scaled probability of the call givendata√53 sumGLbyD√57 P (D | Gi = aa) √60 P (D | Gi = bb) √63 AF: allele frequency for each non-ref allele√ √69 MQ: root mean square mapping quality√ √71 QD: variant confidence/unfiltered depth√ √73 sumGLbyD√ √77 GQ: genotype quality computed based on thegenotype likelihood√ √83 the difference between the sum of the base qual-ities of the current site and the next site96 sum of the pooled estimation of strand bias onboth strands max(forward, reverse)97 sum of the pooled estimation of strand bias onboth strands∑(forward, reverse)99 Reverse strand non-reference base ratio101 sum of squares of non-reference base quality ra-tio102 Sum of non-reference mapping quality ratio105 Sum of squares of non-reference tail distance ra-tioor GATK, and some were a combined calculation from the tumour and normal data. Thisillustrates the advantage of the classifiers’ ability to add arbitrary features and the importanceof simultaneous (not independent) treatment of the tumour and normal data.2.3.4 Sources of errors and sub-classification of wildtypesWe subgrouped the wildtype positions (false positives from the original predictions) by theirfeature vectors in order to characterize false positives due to distinct sources of error. Using352.3. Resultsthe wildtype positions from Seqval1, we identified the features which were not unimodal withthe dip statistic [87] and selected 28 features with p-value < 0.1. We then used PCA to projectthe features to a seven-dimensional space spanned by the first seven principal components, andmodelled the wildtypes in the seven-dimensional space (the first seven principal components ac-count for about 95% of the variance) using a Gaussian mixture model based clustering algorithmfit with the Expectation-Maximization algorithm (EM) [63] (Chapter 1). We used the Bayesianinformation criteria (BIC) score to select six clusters. The number of wildtypes in Group 1 toGroup 6 was 37, 189, 43, 181, 6 and 31, respectively. We attributed the six events in Group 5 tooutliers and excluded this group from further analysis. We then identified discriminant featuresof the different groups, using an ANOVA test followed by a multiple comparison test on eachfeature.Broadly the groups had the following characteristics. Group 1 (black) featured high valuesfor x102 and x103 indicating disproportionate mapping qualities in the tumour compared to thenormal. Thus, the tumour reads harbouring variants mapped with higher qualities than thenormal reads harbouring variants at the same genomic location. In addition, Group 1 exhibitedstrand bias as shown by high values of (x96 and x97). The events in this group had low valuesfor x57 and x77 which indicated low genotype qualities (confidence in the genotype call). Takentogether, these data suggest that the combination of poor mapping quality of the normal readsand the strand bias may be affecting the callers’ ability to accurately call these variants.Group 2 (red) is characterized by high values of x96 suggesting strand bias (Fig. 2.6). Weexamined the surrounding sequence content around these variants and found the majority of thevariants in this group had a common tri-nucleotide sequence GGT, changing to GGG (Fig. 2.7),a pattern which has been discovered in whole genome methyl-Seq experiments [139]. Thus, weexpect the false positive events in this group to be induced by systematic artefacts owing tosequencing errors at specific tri-nucleotide sequences.The discriminative features for Group 3 (green) events were characterized by mappingquality-related features (x10, x11, x50, x70, x49, x69). Thus, these wildtypes may be the re-sult of misaligned reads, or simply repetitive regions for unambiguously aligning short readsto. To investigate these wildtypes, we computed the UCSC mapability ( http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeMapability ) of each site. The mappability of a site depictsthe uniqueness of the reference genome in a window size of 35. Overall, Group 3 wildtypes haveconsiderably lower mappability scores than the other groups and therefore can be best explained362.3. Results28 29 101 33 32 25 105104102103 98 96 97 62 63 73 71 72 67 42 43 47 53 51 55 52 75 106 61 18 4 48 94 50 70 68 39 40 16 17 12 13 8 9 5 11 10 69 49 21 65 74 76 31 77 57 19 20 1 45 60 41 56 54−4 0 2 4Value01000025000ColorKeyandHistogramCountFeatureWildtypeGroup1Group2Group3Group4Group6SomaticFigure 2.6: The feature heatmap obtained with K = 6 clusters. The six events in Group 5 werenot shown here.by characteristics of the genome at these positions that make variant calling error prone.For the wildtypes in Group 4 (blue), the pooled estimated strand-biases are zero (x96 andx97). This is because these two features were computed after passing Samtools internal basequality filter threshold of 13, and the variant alleles had small base qualities so they didn’t passthis filter. The Samtools caller utilizes this base quality filter and therefore did not call thesepositions as variants (large x39 and x40). Group 4 wildtypes were also characterized by the GGTto GGG systematic sequencing artefact we observed for Group 2 and therefore is fundamentallysimilar to Group2, but may be easier to detect owing to poor base qualities at the site of thesequencing error.Interestingly, Group 6 (magenta) exhibited very similar patterns to the true somatic muta-372.4. Discussion1 2 3 4 5 6 7 8 9 10 11012Sequence PositionBitsNumber of sites:1891 2 3 4 5 6 7 8 9 10 11012Sequence PositionBitsNumber of sites:1091 2 3 4 5 6 7 8 9 10 11012Sequence PositionBitsNumber of sites:64Figure 2.7: Group two wildtype sequence motifs centred at error sites. Errors occur at the 6thbases. (a) The logo for all the group 2 wildtypes, (b) the wildtypes whose reference base is ‘A’,(c) the wildtypes whose reference base is ‘T’.tions (yellow) and thus made them challenging to interpret. Upon inspection, many of thesepositions had weak signals for a variant in the normal data, but perhaps not enough to inducea variant call. The tumour data, conversely (as shown by (x62, x63, x71, x73)) exhibited strongsignals for a variant. Thus, the weak signals in the normal data were likely being prematurelythresholded out by the naive methods. Indeed, the Samtools caller called 13 of the 31 Group6 events as somatic while GATK called 29 of the 31 events as somatic. The characteristics ofthe positions in Group 6 underscore the strength of simultaneously considering the tumour andnormal features that we suspect enhances the ability of the classifier to choose better decisionboundaries.2.4 DiscussionWe studied the use of feature-based classifiers for the purpose of somatic mutation detection intumour/normal pair HTS data. Using an extensive set of ground truth positions, we trained fourdifferent machine learning classifiers using features extracted from existing software tools andnovel features we computed ourselves. All four classifiers statistically significantly outperformedpopular software packages that use a naive way to detect somatic mutations, treating the tumourand normal data independently. Results were consistent between a cross-validation analysisof the training data and a completely independent test data set derived from an orthogonalsequencing platform. Our results encapsulate three key results: i) classifiers can be trained usingprincipled machine learning techniques to significantly improve somatic mutation detection; ii)feature selection analysis revealed that our classification method selects different features in thetumour and normal datasets to optimize classification ability, underscoring that simultaneous382.4. Discussionrather than independent analysis of the paired data is important; and iii) we identified fivedistinct groups of false positive results. This last result indicates that feature-based analysis of‘negative’ or wildtype positions can be helpful to guide future developments in software pipelinesthat operate upstream of variant calling.2.4.1 Progress in SNV predictionSince the publication of mutationSeq, dozens of tools have been developed to call somaticSNVs from HTS data [37, 73, 106, 111, 115, 174, 179]. Probabilistic model based approachedsuch as JoinSNVMix [174] may have difficulties in modelling the combinations of errors in HTSsequencing data, thus discriminative features such as strand-bias are typically ignored in themodelling framework. To achieve high specificity for these methods, heuristic filters based onsome features can be used to remove false-positives in HTS data [37, 179]. Some other methodsare designed for special cases, e.g., calling mutations from targeted high-coverage sequencingdata [73]. An interesting direction is to systematically compare different SNV callers on extensivesimulation data through a challenge competition [58]. However, we should notice that the resultsmay not only reflect the performance of different algorithms, but also how much efforts each teamputs on the competition since the results are also correlated with the number of participants ineach team [58]. A promising approach for getting high performance may be to aggregate theresults from different somatic mutation callers [80].39Chapter 3Predicting mutations that influencegene expression“What I cannot create, I do not understand.”– Richard Feynman, 19883.1 IntroductionHuman cancers acquire malignant properties following a stepwise accumulation of somatic ge-nomic alterations [213] and subsequent evolutionary selection on resultant phenotypic changes.Genomic mutations (loosely classified as single nucleotide variants (SNVs), small insertionsand deletions (indels), copy number alterations and genomic rearrangements) show widespreadvariation in their functional impacts on gene products, biochemical pathways, and phenotypicproperties. Consequently, the effect of a mutation is often difficult to predict. Previous compu-tational approaches to predict functional effects of mutations include: evolutionary conservationof the mutation sites across species, the chemical properties of amino acid substitutions [168],and the frequency of mutations of a gene of interest relative to its expected background rateof mutations [117]. These approaches rely on interpretation of DNA sequences alone and donot consider other molecular measurements such as gene expression, methylation or proteomemeasurements that are co-acquired from the same tumour samples. Thus, histological or molec-ular context of mutations is often ignored in their interpretation. To address this deficiency,we propose that additional patterns representing functional consequences of mutations can bedetermined through simultaneous analysis of mutation and gene expression data.In this chapter, we study the impact of mutations on gene expression as a means of quan-tifying their potential functional effects. This concept is motivated by biological hypothesespredicting that some functional mutations will exhibit a “transcriptional shadow”. For exam-ple, loss-of-function mutations such as nonsense mutations, frame-shift small insertions and dele-tions (indels), and splice-site mutations can result in loss of expression of genes harbouring these403.1. Introductionmutations (cis-effects). This is primarily due to nonsense-mediated mRNA decay mechanism toeliminate transcripts with premature stop codons [152]. By contrast, some mutations can disruptthe expression of numerous other genes in the same biochemical pathway (trans-effects). Forexample, functional mutations in genes that encode signalling proteins or transcription factorscan dysregulate downstream target genes. Mutations in chromatin modifying and methylationfactors can promote or inhibit the accessibility of DNAs to proteins thus influencing transcrip-tion. These functional mutations tend to cast a long transcriptional shadow over the genes acrossthe genome [39]. The dysregulation of gene expression indicates the disruptions of specific genesor pathways, and may help pinpoint functional mutations, and even cancer driver mutations.Furthermore, association of mutation with gene expression in individual tumours may identifypatient-specific characteristics that could be exploited with a personalized treatment approach.There are few computational tools available [164] to systematically identify mutations im-pacting gene expression. CONEXIC [2] is a probabilistic approach to detect driver copy numberregulators and their target genes. EPoC [97] derives driver copy number alterations and theirtarget genes by using differential equations to model the expression synthesis rate of a gene as afunction of its copy number and the regulatory effects of other genes. MOCA [130] detects dif-ferently expressed genes in the presence of mutations in a gene, and tests the significance of thecorrelation (between mutation and gene differential expression). PARADIGM [210] integratescopy number and expression to identify disrupted pathways. DriverNet [14] uses a combinato-rial approach and a greedy algorithm to nominate cancer driver genes. However, none of thesemethods can identify individual mutations that correlate with dysregulated gene networks.We present a novel statistical model, xseq using a hierarchical Bayes approach and apply it tothe analysis of thousands of tumour datasets available through TCGA, systematically examiningthe impact of somatic mutations on expression profiles across 12 tumour types. We demonstratethe robustness of xseq by conducting extensive computational benchmarking, and by testingxseq on an independent breast cancer dataset. We identify 30 novel cis-effect tumour suppres-sor gene candidates, statistically enriched in loss-of-function mutations and frequent bi-allelicinactivations. We identify 150 genes from trans-effect analysis impacting expression networksin the 12 cancer types, with 60 known cancer genes and 89 novel predictions. Notably, 29 ofthese newly predicted genes are known interacting partners of cancer driver genes. Based onthe trans-analysis, we find two important characteristics of mutations impacting gene expres-sion that could not be revealed with other methods: (1) a stratification (partition) of patients413.2. Methodsharbouring known driver mutations, but that exhibit different downstream gene expression con-sequences; (2) identification of mutations driving expression patterns that are stable acrosstumour types, thereby nominating important molecular targets for therapeutic intervention,transcending anatomic sites of origin.3.2 Methods3.2.1 Modelling the effects of mutations on expression with xseqThe xseq model is predicated on the idea that mutations with functional effects on transcrip-tion will exhibit measurable signals in mRNA transcripts biochemically related to the mutatedgene –thus imposing a transcriptional shadow across part (or all) of a pathway. To infer thisproperty, three key inputs are required for the model (Fig. 3.1(a)): a patient-gene matrix en-coding the presence/absence of a mutation (any form of somatic genomic aberrations that canbe ascribed to a gene, e.g., SNVs, indels, or copy number alterations); a patient-gene expressionmatrix encoding continuous value expression data (e.g., from RNA-Seq or microarrays); anda graph structure encoding whether two genes are known to be functionally related (e.g., ob-tained through literature, databases, or co-expression data). xseq uses a precomputed ‘influencegraph’ [207] as a means to incorporate prior gene-gene relationship knowledge into its modellingframework. For analysis of mutation impact in-cis, the graph reduces to the simple case wherethe mutated gene is only connected to itself. Given the inputs, we get the xseq structures andthe actual expression value of the nth gene connected to mutated gene g in patient m, denotedby Yg,m,n (Table 3.1).The output of xseq consists of: a) the probability that a mutated gene g influences geneexpression across the population of patients with mutations in g (denoted by P (Dg = 1 | D),where D represents the observed gene expression as the Y variable in Fig. 3.1(b); Table 3.1);and b) the probability that a mutation in gene g in an individual patient m influences expressionwithin that patient (denoted by P (Fg,m = 1 | D)).In addition to the random variables Dg, Fg,m, and Yg,m,n, xseq also models the gene expres-sion distribution over the patient population with gene-specific three component mixture modelsof Student’s t emission densities. The three mixture components represent down-regulation, neu-tral, or up-regulation, respectively (Fig. 3.1(a)). Gg,m,n ∈ {L = −1,N = 0,G = 1} denotes thestatus of the nth gene connected to gene g in patient m, where L,N , and G represent down-423.2. MethodsFigure 3.1: Overview of the xseq modelling framework. (a) The inputs to the xseq model:a mutation matrix typically from next generation sequencing, a gene-interaction network anda gene expression matrix. xseq models the expression of a gene across all the patients bymixture distributions. The three mixture components represent down-regulation, neutral, andup-regulation, respectively. (b) The graphical model representation of xseq with the platenotation. Circles represent random variables and arrows denote dependencies between variables.Boxes are plates which represent replicates. For example, the graph represents a gene mutatedin M patients (we assume that a gene is mutated only once in a patient), and the gene isconnected to N genes. (c) xseq predicts the conditional probabilities (given observations of Y ) ofeach gene (P (D)), each mutation (P (F )) influencing expression and the regulatory probabilitiesof the genes connected to the mutated gene in a patient (P (G)). P (D) = P (D = 1 | D);P (F ) = P (F = 1 | D); P (G) = P (G = g | D), g ∈ {−1, 0, 1}.regulation, neutral, and up-regulation, respectively. The number of genes connected to g isobtained from the influence graph. The central assumption is that a mutation in gene g ofpatient m impacting gene expression (denoted by Fg,m=1) more frequently co-associates withnon-neutral states in its connected genes, compared to the mutations that do not impact ex-pression. The specific direction of expression is encoded by Hg,n ∈ {L = −1,G = 1} to denotethe nth gene connected to gene g is up-regulated or down-regulated when mutations in g in-fluence expression. (We also consider a simplified model, xseq-simple without modelling thedirectionality of gene-regulation for a specific gene, i.e., without the H variable in Fig. 3.1(b) for433.2. MethodsTable 3.1: Description of the random variables in xseqName Description RangeDg Whether mutations in gene g across pa-tients impact expressionBinary (0 means not impacting expres-sion, and 1 means impacting expres-sion)Fg,m Whether gene g’s mutation specifically inpatient m impact expressionBinary (0 means not impacting expres-sion, and 1 means impacting expres-sion)Gg,m,n The nth gene connected to g in pa-tient m is up-regulated, neutral, or down-regulatedTernary (L = −1, N = 0 andG = 1 represent expression down-regulation, neutral and up-regulation,respectively)Hg,n The nth gene connected to g is up-regulated or down-regulated across the pa-tients harbouring gene g mutations thatcorrelated with connected gene dysregula-tionBinary (L = −1 and G = 1 meandown- and up-regulation of expression,respectively)Yg,m,n The expression of the nth gene connectedto g in patient mReal (observed)simplicity of inference.) To represent a recurrent pattern of expression impact across multiplepatients, we consider information across all patients with a mutation in gene g. This allows forborrowing of statistical strength across multiple gene expression patterns associated with mu-tations in order to generalize whether a mutated gene impacts expression across the population(denoted by Dg = 1). Fig. 3.2(a) shows a simple xseq model for a mutated gene.Based on the xseq model structure in Fig. 3.1(b), for a mutated gene g, xseq specifies ajoint distribution [150] with both discrete and continuous random variables (assuming that g ismutated in M patients and g has N connected genes):P (D = d)M∏m=1P (Fm = fm | D = d)N∏n=1p(ym,n | Gm,n = gm,n)P (Gm,n = gm,n | Fm = fm, Hn = hn)=θD=dM∏m=1θF=fm|D=dN∏n=1p(ym,n | Gm,n = gm,n)θG=gm,n|F=fm,H=hn (3.1)In Equation 3.1, we write the joint distribution in terms of the model parameters (the θs inTable 3.2). Notice that when fm = 0, θG=gm,n|F=0,H=hn = θG=gm,n|F=0 as in Table 3.2. Forsimplicity, we consider the case with only one mutated gene and we remove ‘g’ in the notations.443.2. MethodsFigure 3.2: A simple xseq model. This xseq model has just one mutated gene g, two patientsharbouring mutations in gene g, and three genes connected to g. Here we drop ‘g’ in the figurefor simplicity. (a) H is hidden and (b) H is observed.Here we assume that a gene is just mutated once in a specific patient, i.e., M equals to thenumber of accumulated mutations in gene g. Therefore, m is the patient (mutation) index, andn is the gene index, e.g., Ym,n represents the expression of the n-th gene connected to g in them-th patients harbouring mutations in g. We now explain how we execute parameter learningand inference over this joint distribution.3.2.2 Inference and learning in xseqWe use the Belief Propagation algorithm [158] for inference and the Expectation-Maximization(EM) algorithm for parameter learning [45]. The inference problem is to compute the conditionaldistributions (given observations of the gene expression Y ) p(dg | D), p(fg,m | D), and p(gg,m,n |D) given the expression data. The learning problem is to estimate the conditional probabilities ofa variable given its parents, e.g., θF=1|D=1 –the probability of a mutation impacting expression ina specific patient given that this gene’s mutations impact expression across patients (Fig. 3.1(b),Table 3.2). For clarity of presentation, we have removed the subscripts and directly refer to D,F , and G. For example, we use P (D), P (F ) and P (G) to denote the conditional probabilitiesP (Dg = 1 | D), P (Fg,m = 1 | D), and P (Gg,m,n = gg,m,n | D), where gg,m,n ∈ {−1, 0, 1},respectively.Pearl’s Belief Propagation algorithm (polytree algorithm) was first proposed by Judea Pearlin 1982 [158] to solve inference problems for tree structure Bayesian networks. (There is only453.2. MethodsTable 3.2: Description of the conditional distributions in xseq. Here g refers to a mutated genein general.Name Description RangeθD=1def= P (D = 1) Probability of gene g’s mutations impacting expressionacross patientsRealθF=1|D=0def= P (F = 1|D = 0) Probability of gene g’s mutation impacting expression inpatient m given that gene g’s mutations do not impact ex-pression across patientsRealθF=1|D=1def= P (F = 1|D = 1) Probability of gene g’s mutation impacting expression inpatient m given that gene g’s mutations impact expressionacross patientsRealθG=L|F=0def= P (G = L|F = 0) Probability of a gene connected to g is down-regulated inpatient m given that g’s mutation in m do not impact ex-pressionRealθG=N |F=0def= P (G = N |F = 0) Probability of a gene connected to g is neutral in patientm given that g’s mutation in m do not impact expressionRealθG=L|F=1,H=0def=P (G = L|F = 1, H = 0) Probability of a gene connected to g is down-regulated inpatient m given that g’s mutation in m impact expressionand this gene is generally down-regulated when mutationsin g impact expressionRealθG=N |F=1,H=0def=P (G = N |F = 1, H = 0) Probability of a gene connected to g is neutral in patientm given that g’s mutation in m impact expression and thisgene is generally down-regulated when mutations in g im-pact expressionRealθG=L|F=1,H=1def=P (G = L|F = 1, H = 1) Probability of a gene connected to g is down-regulated inpatient m given that g’s mutation in m impact expressionand this gene is generally up-regulated when mutations ing impact expressionRealθG=N |F=1,H=1def=P (G = N |F = 1, H = 1) Probability of a gene connected to g is neutral in patient mgiven that g’s mutation in m impacts expression and thisgene is generally up-regulated when mutations in g impactexpressionRealp(y|G = L) The likelihood of observing expression of y in a gene giventhat this gene is down-regulated (the conditional probabil-ity density function at y)Realp(y|G = N ) The likelihood of observing expression of y in a gene giventhat this gene is neutralRealp(y|G = G) The likelihood of observing expression of y in a gene giventhat this gene is up-regulatedRealAlthough in probability theory, the notation D = 1 represents the set of events {D = 1}, here forsimplicity we use the random variables to index the conditional probabilities.one undirected path between any two nodes. In addition, a child node has only one parent.) Itwas then extend to do inference for polytrees [101]. (A child node can have multiple ‘parents’.)Notice that xseq-simple is a tree, while xseq is not a tree or a polytree, e.g., for the xseq463.2. Methodsmodel in Fig. 3.2(a), there is an undirected loop D → F1 → G11 → H1 → G21 → F2 → D.Therefore, generally speaking inference in xseq is difficult, i.e., the time and memory complexityis exponential in the number of mutations in g. However, if H is given, xseq is equivalent to apolytree because we can “break” H to convert the toy xseq model in Fig. 3.2(a) to a polytreeas given in Fig. 3.2(b).In addition to simplify the inference problem, there are many situations where H is ob-served. First, from pathway information, we may know that gene g up-regulates gene a. Thedirectionality of gene regulation information can be added to the xseq model as an observednode H. Then we can use xseq to search the mutations in gene g that correlate with gene a’sup-regulation. Secondly, if we observe that gene g up-regulates gene a in a discovery dataset,given a validation dataset, we also want to inspect whether the mutations in gene g up-regulategene a.The Belief Propagation algorithm computes the conditional distributions p(dg | D), p(fm,n |D) and p(gg,m,n | D) in two phases of message passing: 1) A leaf node (an observed geneexpression value node Yg,m,n) sends to its parent Gg,m,n a message, namely the likelihoodsp(yg,m,n | Gg,m,n) over parent Gg,m,n (for different values that Gg,m,n can take). After receivingall the messages from its children, a node can construct and send messages to its parents. Again,a message sent to a parent is a distribution over the parent, e.g., the message Gg,m,n sent toFg,m is a distribution over Fg,m. The detailed definitions of these messages are described below.This process is repeated until the roots of the tree - the nodes without any parents receive allthe messages from their children. 2) A root node can update its conditional distribution bynormalizing the product of all the incoming messages and its own prior distribution. Then aroot node can construct and send messages down to its children. A top-down message is alsoa distribution of the parent, e.g., the message Dg sent to Fg,m is the conditional distributionp(dg | D) divided by the message Fg,m sent to Dg before. This process is again repeated untilthe leaf nodes have received all the messages from their parents. After receiving all the incomingmessages from its neighbours (both parents and children), a node can update its belief – theconditional distributions.Details of the message passing steps in xseq To simplify our description of the inferencealgorithm, we use a simple example with just one mutated gene, and H is given. In addition,this gene is mutated in M patients and this gene is connected to N genes. Formally, at the473.2. Methodsfirst message collection phase, the bottom-up messages (with superscript (−)) consist of (insequential order):m(−)Ym,n→Gm,n = p(ym,n | Gm,n) (3.2)m(−)Gm,n→Fm =∑gm,nθG=gm,n|F=fm,H=hnbel(−)Gm,n(3.3)m(−)Fm→D =∑fmθF=fm|D=dbel(−)Fm(3.4)where bel(−)Gm,nand bel(−)Fmare the bottom-up belief states of Gm,n and Fm, respectively. Thebottom-up belief of a node is updated after the node receives all the bottom-up messages fromits children, for example,bel(−)Gm,n∝ m(−)Ym,n→Gm,n (3.5)bel(−)Fm∝N∏n=1m(−)Gm,n→Fm (3.6)Once the root D receives all the incoming messages, it can update its belief by normalizingthe product of these messages as well as the ‘prior’ distribution:belD = p(d | D) ∝ θD=d∏mm(−)Fm→D (3.7)Then D sends top-down messages to its children (with superscript (+)):m(+)D→Fm ∝belDm(−)Fm→D(3.8)m(+)Fm→Gm,n ∝belFmm(−)Gm,n→Fm(3.9)Here the ‘division’ operator for distributions is only defined for two distributions of the same setof random variables, and it is defined as element-wise divisions. The conditional distributions483.2. Methodsof Fm and Gm,n can be updated bybelFm = p(fm | D) ∝∑dθF=fm|D=d ∗m(+)D→Fm∏nm(−)Gm,n→Fm (3.10)belGm,n = p(gm,n | D) ∝∑fmθG=gm,n|F=fm,H=hn ∗m(+)Fm→Gm,nm(−)Ym,n→Gm,n (3.11)Details of parameter learning The EM algorithm [45] is used to learn the parameters(Table 3.2) in xseq. Given an initial guess of the model parameters (the θs in Table 3.2), theEM algorithm iterates between the E-step (computing the conditional distributions of hiddenrandom variables Dg, Fg,m, and Gg,m,n given the current guess of model parameters and theobservations D) and the M-step (update the model parameters) to find a local maximum ofthe objective likelihood function. To describe the EM algorithm for paramter learning, we firstextend the joint distribution in Equation 3.1 to the case with T mutated genes:T∏g=1θD=dgMg∏m=1θF=fg,m|D=dgNg,m∏n=1p(yg,m,n | Gg,m,n = gg,m,n)θG=gg,m,n|F=fg,m,H=hg,n (3.12)The corresponding complete log-likelihood (since we assume the hidden variables are given)function given data D islog p(D | θ) =T∑g=1log(θD=dg)+T∑g=1Mg∑m=1log(θF=fg,m|D=dg)+T∑g=1Mg∑m=1Ng,m∑n=1(log(θG=gg,m,n|F=fg,m,H=hg,n)+ log(p(yg,m,n | Gg,m,n = gg,m,n)))(3.13)Here θ is the vector of model parameters as in Table 3.2. Given the current ‘guess’ of modelparameters θold, we can compute the conditional distributions of the unknown variables in thexseq model. Then we take the expectation of the complete log-likelihood function with respect493.2. Methodsto the conditional distribution of the hidden variables to getQ(θ,θold) =E(log p(D | θ))=1∑i=0T∑g=1log(θD=i)P (Dg = i|D,θold)+1∑i=01∑j=0T∑g=1Mg∑m=1log(θF=j|D=i)P (Fg,m = j,Dg = i|D,θold)+1∑j=0∑k∈{L,N ,G}∑l∈{L,G}T∑g=1Mg∑m=1Ng,m,n∑n=1log(θG=k|F=j,H=l)P (Gg,m,n = k, Fg,m = j,Hg,n = l|D,θold) (3.14)We can maximize the above log likelihood function by Lagrange multipliers. For example, forthe first term of Equation 3.14,L(θD, λ) =1∑i=0T∑g=1log(θD=i)P (Dg = i | D,θold) + λ(θD=0 + θD=1 − 1) (3.15)θD=0 + θD=1 = 1∑Tg=11θD=0P (Dg = 0 | D,θold) + λ = 0∑Tg=11θD=1P (Dg = 1 | D,θold) + λ = 0(3.16)We can easily solve Equation 3.16 to getθD=1 =∑Tg=1 P (Dg = 1 | D,θold)∑Tg=1 P (Dg = 0 | D,θold) +∑Tg=1 P (Dg = 1 | D,θold)(3.17)In summary, the expectation step is to compute the conditional distributions of latent variablesgiven current guess of model parameters using the introduced belief propagation algorithm, and503.2. Methodsspecifically, to compute the following terms:NˆD=i =T∑g=1P (Dg = i|D) (3.18)NˆF=j|D=i =T∑g=1Mg∑m=1P (Fg,m = j,Dg = i|D) (3.19)NˆG=k|F=j,H=l =T∑g=1Mg∑m=1Nm,g∑n=1P (Gg,m,n = k, Fg,m = j,Hg,n = l|D) (3.20)By solving a series of constraint optimization problems to maximize Equation 3.14 to get themaximization step equations:θˆD=i =NˆD=i∑i NˆD=i(3.21)θˆF=j|D=i =NˆF=j|D=i∑j NˆF=j|D=i(3.22)θˆG=k|F=0 =∑l NˆG=k|F=0,H=l∑k∑l NˆG=k|F=0,H=l(3.23)θˆGk|Fj=1,Hl =NˆG=k|F=1,H=l∑k NˆG=k|F=1,H=l(3.24)Pseudo-counts are added to both the numerators and denominators of the above maximizationsteps to do maximum a posteriori estimation of parameters.Adding constraints in learning xseq parameters Parameter learning is a challengingproblem in using Bayesian networks to solve real world problems. When the Bayesian networkshave multiple hidden nodes, parameter learning is extremely difficult, and learning algorithmscan easily get trapped into local maxima. One solution to mitigate these local maximum prob-lems is to add prior distributions for parameters and use maximum a posteriori estimation ofparameters. However, we still need to set the hyper-parameters of the prior distributions. Wemay have very limited knowledge of these parameters. Here in addition to selecting good hyper-parameters for the prior distributions, we also constrain the parameter space in learning xseqconditional distributions. More specifically, in trans analysis, we require the parameters of the513.2. Methodsxseq-simple model:θG=L|F = θG=G |F (3.25)These constraints indicate that the genes connected to a mutated gene g are equally likely to beup-regulated or down-regulated, given the F status of the mutation in g of a patient. For thexseq model, we require the following constraints in analyzing the TCGA data:θG=L|F=1,H=L = θG=G |F=1,H=G (3.26)θG=G |F=1,H=L = θG=L|F=1,H=G (3.27)Incorporating network connection weights in xseq inference Our inference graph isa directed weighted graph. We can easily incorporate the weight between two genes/proteinsinto xseq inference. As before, we consider a simple case with just one mutated gene g. Theconditional distribution of D given gene expression Y (by summarizing out F , G, and H) canbe written as:P (D = d | Y1,1 = y1,1, . . . , Y1,N = y1,N , . . . , YM,1 = yM,1, . . . , YM,N = yM,N )∝P (D = d, Y1,1 = y1,1, . . . , Y1,N = y1,N , . . . , YM,1 = yM,1, . . . , YM,N = yM,N )=∑fm,gm,n,hnθD=dM∏m=1θF=fm|D=dN∏n=1p(ym,n | G = gm,n)θG=gm,n|F=fm,H=hn (3.28)The condotional distribution p(d | D) of observing expression ym,n,m ∈ 1, . . . ,M, n ∈ 1, . . . , Nis determined by each factor (or “bucket”), e.g., p(ym,n | Gm,n). Each factor can be consideredas a feature, and the features that are most relevant should play a larger role in computing theconditional probabilities. Assuming that the connection strength between gene g and the nthgene connected to g is wg,n, then we can replace the factor p(ym,n | G = gm,n) by p(ym,n | G =gm,n)wg,n in Equation 3.28.Let’s look at how the weight influences predictions. When wg,n = 0, the factor p(ym,n | G =gm,n)wg,n has no influence on the conditional probabilities. When 0 < wg,n ≤ 1, the relativevalues among p(ym,n | Gm,n = G), p(ym,n | Gm,n = N ), and p(ym,n | Gm,n = L) will decrease523.2. Methodsafter weighting, for example,p(ym,n | Gm,n = L)∑k p(ym,n | Gm,n = k)≤(p(ym,n | Gm,n = L)∑k p(ym,n | Gm,n = k))wg,n≤ 1 (3.29)As the conditional distribution p(d | D) should be normalized so they sum to 1 (∑d P (D =d | D) = 1), the conditional distribution is only sensitive to the relative magnitude amongp(ym,n | Gm,n ∈ {L,N ,G}). That means, the more discriminative of p(ym,n | Gm,n ∈ {L,N ,G}),the more important the nth is gene in predicting the conditional distribution p(d | D). Therefore,a small weight for wg,n means that the nth gene is less important.3.2.3 Conditional distributions of gene expression valuesThe conditional distributions p(y | G)1 are modelled as Student’s t-distributions and estimatedoffline. For example, the conditional distribution of gene g expression distribution is modelledas a Student’s t-distribution when gene g is down-regulated:p(y | G = L) =∫N (y | µL , σL/z)Gamma(z | ν2,ν2)dz (3.30)where y is the expression level of gene g, µL , σL and ν are the parameters of the Student’s t-distribution. As the parameter ν increases, the Student’s t distribution approaches a Gaussiandistribution N (y | µL , σL). Compared to Gaussian distributions, the Student’s t-distributionsare more robust to outliers, especially when ν is small. Now the gene expression distribution isa mixture of three Student’s t-distribution:p(y) =∑k∈{L,N ,G}ωkp(y | G = k) (3.31)where ωk is the mixture coefficient of mixture component k. We also use the EM algorithm touncover the parameters of the Student’s t-distributions.3.2.4 Influence graphThe influence graph can be any such graph encoding gene regulation. The ith vertex of thegraph represents gene (protein) gi and edge wi,j represents the association strength between1 For a given G = g , the probability density function of the expression of a gene. More formally written asfY |G(y|g)533.2. Methodsgene (protein) gi and gj . For analyses presented in this thesis, we constructed a combinedfunctional gene association network by merging the STRING v9.1 [64] functional protein asso-ciation network, the pathway datasets from KEGG [99], WikiPathway [160] and BioCyc [100],and transcription factors-targets networks. The pathways have already been integrated into theIntPath database [240]. The transcription factors-targets network [236] is download throughthe transcription factor encyclopedia web API. The ENCODE [72] transcription factor ‘prox-imal’ and ‘distal’ networks are also included in the combined network (download from http://encodenets.gersteinlab.org/). The majority of these interactions are transcription factor–target gene interactions (about 1% between transcription factor interactions in the ‘proximal’network [72]).For each dataset, we construct a weighted network. The weight of an interaction representsour prior confidence of the interaction. For the datasets that do not provide weights for inter-actions, a default weight of 0.8 is used. The STRING protein-protein interaction network isalready a weighted network so we use their provided weights. To merge these networks, for aspecific interaction, we take the largest weight for this interaction across different networks. Wethen only keep the interactions with at least median confidence (threshold of 0.4, the defaultthreshold suggested for the STRING database). In this combined network, 17, 258 genes (pro-teins) connect to 19, 070 genes (proteins) through 898, 032 interactions. This network is almostweakly connected (22 genes do not connect to rest genes.)Then for each mutated gene, we test whether the genes connected to it are differentiallyexpressed with adjusted p-value (BH method) threshold of 0.05. If there exist differentiallyexpressed genes, we only keep these genes and set their connection weights to 1. In addition, ifa gene is not differently expressed in a specific tumour type but differently expressed in othertumour types based on Fishers’ combined p-value FDR ≤ 0.05, we also set their connectionweights to 1. If no differentially expressed genes exist for a given mutated gene, we use thenetwork from the original weighted network.3.2.5 Collecting bona fide driver genesWe collected the genes from the manually curated, and widely used Cancer Gene Census (CGC)database [70] and two major recent review papers [116, 213] as our reference bona fide cancerdriver genes. Specifically, we collected 519 genes from CGC [70], 125 genes predicted by the“20/20 rule” [213] (more details below), 66 recently discovered frequently mutated genes collected543.2. Methodsin the Supplementary Table 4 of Lawrence et al [116], and 33 genes predicted by MutSig andhave strong and consistent connections to cancer [116]. In summary, these datasets include 603unique genes in total. The 127 significantly mutated genes predicted by the MuSiC suite [98]are not counted because we use this dataset for several comparisons. Notice that in our analysis,we use the samples with expression, copy number alterations, and mutations. Therefore xseqanalyzes a subset of mutations used by the MuSiC suite.3.2.6 Modelling loss-of-function mutations and hotspot mutationsThe mutation patterns of most known tumour suppressor genes and oncogenes are highly charac-teristic and non-random [213]. In a recent review [213], a “20/20 rule” is used to identify drivergenes: for oncogenes, at least 20% of all the mutations are required to be hotspot missensemutations or in-frame indels; for tumour suppressor genes, at least 20% of all the mutations arerequired to be loss-of-function mutations.Here we extend the “20/20 rule” by using mixture-of-binomial modelling of loss-of-functionmutations and hotspot mutations. We analyze oncogenes and tumour suppressor genes sep-arately. When predicting oncogenes, we first count the number of hotspot mutations ng,rec(recurrent missense mutations and in-frame indels) in gene g, and the total number of muta-tions Ng in gene g. Then we model the mutation count distribution as a mixture of two binomialdistributions: one component for oncogenes and the other component for non-oncogenes:P (ng,rec | Ng) = ω1Binomial(ng,rec | Ng, p1) + ω2Binomial(ng,rec | Ng, p2)P (OCG) = ω1Binomial(ng,rec | Ng, p1)/P (ng,rec | Ng)Then P (OCG) (or P (OCG = 1)) is defined as the probability of ng,rec in the mixture componentwith higher success rate, namely p1 here. The mixture parameters ω = (ω1, ω2) and successrates p = (p1, p2) are estimated by the EM algorithm.Similarly when predicting tumour suppressor genes, we first count the number of loss-of-function mutations ng,loss in gene g, and Ng. Then we model the count distribution as amixture of two binomial distributions: one component representing tumour suppressor genes553.2. Methodsand the other representing non-tumour suppressor genes:P (ng,loss | Ng) = ω1Binomial(ng,loss | Ng, p1) + ω2Binomial(ng,loss | Ng, p2)P (TSG) = ω1Binomial(ng,loss | Ng, p1)/P (ng,loss | Ng)P (TSG) (or P (TSG = 1) ) is defined as the probability of ng,loss in the mixture componentwith higher success rate (p1 here). Again, the mixture parameters ω and success rates p areestimated by the EM algorithm.The mixture-of-binomial approach can be considered as a generation of the “20/20 rule”because of its ability to estimate the parameters from data, and to account for the total numberof mutations to compute probabilities of genes to be oncogenes or tumour suppressor genes. Tomake the estimated parameters more accurate, we also added extra genome wide screen datafrom COSMIC v64 [62], downloaded from syn1855816, and the Pan-Cancer data downloadedfrom syn1710680. Fig. 3.3 shows the Binomial mixture modelling of oncogenes and tumoursuppressor genes for all the genome-wide screen somatic mutation data. We used a threshold of0.2 for P (TSG) and P (OCG) to call genes with tumour suppressor gene properties and oncogeneproperties, respectively.3.2.7 Expression information in predicting driver mutationsSome mutated genes are not expressed in cancer cells, and therefore the mutations in these genesare less likely to be pathogenic. Currently the expression information has not yet been fullyexplored for the identification of driver mutations [40], and only a few methods take expressioninformation into account to assess the background mutation rates [117].We adopt a mixture modelling approach to predict whether a gene is “highly-expressed” in atumour type. We first log2 transform the tumour gene RSEM abundance estimation values. Toprevent taking the log2 of 0, we remove the gene expression values if they are less than 0 beforelog2-transformation. We then compute the 90th percentile of the expression of a gene acrosspatients in a tumour type to represent the overall expression of that gene. Here we use the 90thpercentile instead of median considering the gene dosage effects of copy number deletions onexpression in cancer. It is unlikely that a gene is deleted in 90% of all the analyzed samples(e.g., for the Pan-Cancer datasets, the mostly frequently homozygous deleted gene is CDKN2A,which is deleted in 57% of patients in GBM). If we also consider heterozygous deletions, then563.2. MethodsFigure 3.3: Mixture-of-Binomial modelling of loss-of-function mutations and hotspot mutations.(a) P (TSG = 1) (tumour suppressor gene) histogram. We labeled the P (TSG) of the 30 neg-ative control genes (blue colour, Results section), as well as 21/23 predicted cis-effect tumoursuppressor genes with high P (TSG) (orange colour, Results section). (b) The proportion of loss-of-function mutations distribution. The two vertical lines represent the estimated success ratesof the two binomial distributions. (c) P (OCG = 1) (oncogene) histogram. We also labeled the30 negative control genes (blue colour), as well as 20 selected oncogenes for illustration purpose(orange colour). (d) The proportion of recurrent mutation distribution. The two vertical linesrepresent the estimated success rates of the two binomial distributions.the most frequently deleted gene is EBF3, which is deleted in 90% of patients in GBM. The 90thpercentile expression of a gene may overestimate the expression level of the gene in the studiedtumour type (since we are more concerned about losing important genes). Next, we model the90th percentile expression of genes as a mixture of two Gaussian distributions: one componentrepresenting “highly-expressed” and the other component representing “lowly-expressed”. Agene is considered to be “highly-expressed” if its probability in the “highly-expressed” group is573.2. Methods2 4 6 8 12 160.000.100.20BLCA2 4 6 8 12 160.000.100.20BRCA5 10 150.000.050.100.15COAD2 4 6 8 12 160.000.100.20 GBM2 4 6 8 12 160.000.10HNSC2 4 6 8 12 160.000.100.20KIRC5 10 150.000.10LAML2 4 6 8 12 160.000.100.20LUAD2 4 6 8 12 160.000.100.20LUSC2 4 6 8 12 160.000.100.20OV5 10 150.000.050.100.15READ2 4 6 8 12 160.000.10UCECmRNA expressionDensityFigure 3.4: Detecting highly-expressed genes based on mixture-of-Gaussian distributions. Foreach plot, the left blue curve is the Gaussian fit to the “lowly-expressed” genes, and the rightgrey curve is the Gaussian fit to the “highly-expressed” genes.greater or equal to 0.8. In the presence of outliers (some genes are expressed at extremely high orlow levels), we first remove outliers based on the boxplot rule, and then fit the data. We assigna probability of 1 to the highly-expressed outliers, and a probability of 0 to the lowly-expressedoutliers. Fig. 3.4 shows Gaussian mixture modelling of expression across the 12 cancer types.3.2.8 Compensating for the cis-effects of copy number alterationsBefore analyzing the trans-effects of somatic mutations, we first remove the cis-effects of copynumber alterations on expression; copy number alterations are common in cancer and the ma-jority have cis-effects on expression. Numerous studies have done integrative analysis of copynumber and gene-expression data [10, 74, 92, 145]. Here we use the Gaussian process (GP)regression to model the expression yi of a gene in a patient i, as a function of its copy numberlog2 value xi. GP regression is flexible to add extra variables such as DNA methylation data asindependent variables if necessary, and can capture nonlinear relationships between copy numberalterations and expression.583.2. Methods−0.6 −0.4 −0.2 0.0 0.2 0.4891011BLCA−1.0 −0.5 0.0 0.589101112BRCA−1.5 −1.0 −0.5 0.0891011COAD−2.0 −1.5 −1.0 −0.5 0.0891011 GBM−1.5 −1.0 −0.5 0.0 0.589101112 HNSC−0.8 −0.6 −0.4 −0.2 0.0 0.29101112KIRC−0.8 −0.4 0.0 0.2 0.411.012.013.0LAML−1.0 −0.5 0.0 0.589101112LUAD−1.0 −0.5 0.0891011LUSC−2.0 −1.5 −1.0 −0.5 0.0 0.5789101112 OV−1.0 −0.5 0.08.08.59.09.510.011.0READ−1.5 −1.0 −0.5 0.0 0.5789101112 UCECPTEN CNA log2PTEN mRNA expressionFigure 3.5: Scatter plots of PTEN copy number alterations and expression across 12 cancertypes. The black curves are the regression curves estimated by Gaussian process regression, andthe region between the two red curves for each plot is the 95% confidence interval.GP regression models the joint distribution of yi as a joint Gaussian distribution. The co-variance matrix is constructed based on the given copy number data, cov(xi, xj) = k(xi, xj),where k is the squared exponential kernel function. The hyper-parameters of the kernel func-tion are computed by optimizing the log-marginal likelihood function using scaled conjugategradient algorithms. To remove the cis-effects of copy number alterations, we subtract the re-gression values from the original expression values to get the residuals that are considered to beregulated by trans-effect mutations. Fig. 3.5 shows the scatter plots of copy number alterationand expression values for PTEN across the 12 cancer types. The GP regression lines and the95% confidence intervals of the regression lines are overlaid on the scatter plots. Fig. 3.6 showsthe scatter plots of PTEN expression across cancer types after removing the cis-effects of copynumber alterations on expression.593.3. Results−0.6 −0.4 −0.2 0.0 0.2 0.4−2−101 BLCA−1.0 −0.5 0.0 0.5−3−2−101BRCA−1.5 −1.0 −0.5 0.0−1.5−1.0−0.50.00.51.01.5COAD−2.0 −1.5 −1.0 −0.5 0.0−1.5−1.0−0.50.00.51.0GBM−1.5 −1.0 −0.5 0.0 0.5−1.0−0.50.00.51.0 HNSC−0.8 −0.6 −0.4 −0.2 0.0 0.2−1.5−1.0−0.50.00.51.0 KIRC−0.8 −0.4 0.0 0.2 0.4−1.00.00.51.01.5 LAML−1.0 −0.5 0.0 0.5−1.5−1.0−0.50.00.51.0 LUAD−1.0 −0.5 0.0−1.0−0.50.00.51.0LUSC−2.0 −1.5 −1.0 −0.5 0.0 0.5−1.5−0.50.00.51.0 OV−1.0 −0.5 0.0−1.0−0.50.00.5READ−1.5 −1.0 −0.5 0.0 0.5−1.5−0.50.51.01.5 UCECPTEN CNA log2PTEN mRNA expressionFigure 3.6: Scatter plots of PTEN copy number alterations and expression after cis-effect re-moving. The cis-effects of copy number alterations on gene expression have been removed bysubtracting the regression values from the original expression values.3.3 Results3.3.1 DatasetsWe used somatic point mutation, copy number alteration, and gene expression data in 12 cancertypes, from the TCGA Pan-Cancer project [223] (Table 3.3). A total of 2,786 patients withall three types of data were included in our analyses. For trans-analysis, focal copy numberhomozygous deletions and amplifications (with four or more copies) predicted by GISTIC [142,237] were also encoded as inputs to xseq. We did not analyze the cis-effects of copy numberalterations since the majority of them have cis-effects on gene expression [199].3.3.2 Computational benchmarking and validation of xseqSimulation analysis We examined the theoretical performance of xseq via simulation andpermutation analyses. To investigate the performance of xseq under different noise levels, wesimulated data (Fig. 3.7) from nine hyper-parameter sets, and ten independent realizations603.3. ResultsTable 3.3: List of the twelve cancer types analyzedData Mutation RNA-Seq SNP6.0 OverlapBLCA 99 96 125 94BRCA 772 822 879 743COAD 155 192 414 149GBM 291 167 576 144HNSC 306 303 306 295KIRC 417 428 452 390LAML 196 173 197 167LUAD 230 355 358 169LUSC 178 220 342 177OV 316 266 581 159READ 69 71 163 65UCEC 248 333 493 235The numbers are the sample counts.BLCA, Bladder urothelial carcinoma; BRCA, Breast invasive carcinoma; COAD, Colon adenocar-cinoma; GBM, Glioblastoma multiforme; HNSC, Head and neck squamous cell carcinoma; KIRC,Kidney renal clear cell carcinoma; LAML, Acute myeloid leukemia, also denoted as AML; LUAD,Lung adenocarcinoma; LUSC, Lung squamous cell carcinoma; OV, Ovarian serous cystadenocarci-noma; READ, Rectum adenocarcinoma; UCEC, Uterine corpus endometrioid carcinomaTotally 563,024 somatic mutations in the overlapped samples (363,676 missense mutations, 132,981synonymous mutations, 33,838 nonsense mutations, 13,260 frameshift indels, 6,952 non-coding RNAmutations, 8,699 splice-site mutations, 3,141 in-frame indels, and 477 stop-gain mutations).In trans-analysis, we added the 37,308 homozygous deletions in 2,084 genes (focal copy number dele-tion peaks), and 69,643 amplifications in 960 genes (focal copy number amplification peaks).of data for each hyper-parameter set. We used xseq (Fig. 3.7) to generate synthetic data.The number of mutated genes T was fixed to 300, and the number of patients was fixed to200. The number of accumulated mutations for gene g was sampled from a shifted geometricdistributions: P (Mg = m) = 0.1m−15 × 0.9 with m ≥ 15. The number of genes connected to gwas also sampled from a shifted geometric distribution: P (Ng = n) = 0.1n−10×0.9 with n ≥ 10.The hyper-parameters of the beta distribution for θD were set to (70, 30), which means thataround 30% of the genes accumulated mutations impacting expression. The hype-parametersof the beta distribution for θF |D=0 were set to (105, 5), and (10, 100) for θF |D=1. The Dirichlethyper-parameters for θG|F=0 were set to (10, 180, 10). The gene expression data were sampledfrom Gaussian distributions. The variances of the Gaussian distributions were sampled from aninverse-gamma distribution: IG(σ2 | ν, σ20). We fixed the hyper-parameter ν = 3, σ20 = 0.12.We first analyzed the influence of the degree of gene regulation (up- or down-regulation)correlated with mutations. For high degree of gene regulation correlated with mutations, we setthe Dirichlet parameters to (143, 55, 2) for θG|F=1,H=L . For θG|F=1,H=G , the Dirichlet parameterswere set to (2, 55, 143) (Fig. 3.8, column 1). For moderate gene regulation correlated with613.3. ResultsDg✓D ↵Fg,mGg,m,nYg,m,nµg,m,n µi2g,m,n⌫20✓F |D↵dd✓G|F=1,H=h↵1h↵2h↵3h✓G|F=0↵10↵20↵303i=1:3n =1:Ng,mm=1:Mgg=1:Td=0:1h 2 {L,G}1Figure 3.7: The xseq model used to sample T mutated genes. The gth gene has Mg mutations.Of the mth patient with a mutation in gene g, Ng,m genes connected to g. Here the shadednodes represent observed variables and the unshaded nodes represent hidden variables. Circlesrepresent random variables and shaded squares represent hyper-parameters.mutations, we set the Dirichlet parameters to (93, 105, 2) for θG|F=1,H=L . For θG|F=1,H=G ,the Dirichlet parameters were set to (2, 105, 93) (Fig. 3.8, column 2). Finally, for low levelgene regulation correlated with mutations, we set the Dirichlet parameters to (43, 155, 2) forθG|F=1,H=L . For θG|F=1,H=G , the Dirichlet parameters were set to (2, 155, 43) (Fig. 3.8, column3). As expected, as we decreased the degree of gene regulation correlated with mutations, thexseq prediction performance declined, especially in recovering the specific mutation variables F(the ROC curves in Fig. 3.8).We then analyzed the influence of the discrimination of expression of down-regulation, neu-tral, and up-regulation on predictions. For high discriminative down-regulation, neutral andup-regulation expression distributions, we set the mean parameters µi of Gaussian distributions623.3. ResultsFigure 3.8: Theoretical performance of xseq on simulated datasets. Each plot depicts a receiveroperating characteristic (ROC) curve, which displays the true positive rate as a function of falsepositive rate. (a) The expression of genes which are down-regulated, neutral, and up-regulatedis highly discriminative (first row), (b) moderately discriminative (second row), and (c) poorlydiscriminative (third row, see the enclosed figures, where cyan is down-regulation, grey is neutral,and red is up-regulation, respectively). The ROC curves in the first column, second column, andthe third column were computed when the degree of dysregulation of the expression of connectedgenes by mutations was high, moderate, and low, respectively.to (-2, 0, 2) (Fig. 3.8, the first row). For moderate discriminative down-regulation, neutral, andup-regulation expression distributions, we set the mean parameters of Gaussian distributionsto (-1, 0, 1) (Fig. 3.8, the second row). For poor discriminative down-regulation, neutral, andup-regulation expression distributions, we set the mean parameters of Gaussian distributions to(-0.5, 0, 0.5) (Fig. 3.8, the third row). The decreasing in the discrimination of down-regulation,neutral, and up-regulation expression distributions mostly declined the performance of recover-ing the gene regulation variables G. We also drew the trace plots for the most challenging caseby given the true H (Fig. 3.9) and estimating H offline (Fig. 3.10). When H was given, the EMalgorithm found the parameters that were very close to the true values. While H was unknown,the estimated parameters could have high bias because the three mixture models were highly633.3. ResultsFigure 3.9: xseq parameter trace plots during EM-iterations for the most challenging case whenH is known. (a) The ROC curves of the predictions based on the true parameters. (b) The ROCcurves of xseq predictions based on the parameters learned from the EM algorithm. (c)-(k) Thetrace plots of xseq parameters during EM iterations.overlap (Fig. 3.8(c), enclosed figures).We next used the xseq-simple model to analyze the simulation datasets (Fig. 3.11). Gener-ally speaking, the xseq-simple performed quite well. Compared to xseq predictions given thetrue values of H variables, we could see a slightly drop in performance, especially for the mostchallenging case (Fig. 3.8 and Fig. 3.11, bottom-right ROC curves). Because xseq-simple doesnot model the H variables, it substituted the two conditional distributions θG=L|F=1,H=L andθG=L|F=1,H=G with a single conditional distribution θG=L|F=1. We could see this phenomenonfrom the parameter trace plots during EM algorithm iterations (Fig. 3.12).Permutation analysis We performed several permutations to investigate the influence ofeach component of xseq on the final predictions. First, we switched the patient names within643.3. ResultsFigure 3.10: xseq parameter trace plots during EM-iterations for the most challenging case (His estimated offline). (a) The ROC curves of the predictions based on the true parameters. (b)The ROC curves of xseq predictions based on the parameters learned from the EM algorithm.(c)-(k) The trace plots of xseq parameters during EM iterations. The estimated parameterscould have high bias because the three mixture models were highly overlap (Fig. 3.8(c), enclosedfigures).the mutation matrix (Fig. 3.13(a), permute sample). Even after permutation, some mutationswere still predicted to have high probabilities P (F ). To help explain this phenomenon, wegenerated an expression heatmap (Fig. 3.13(b)), which showed the expression of genes connectedto RUNX1. We can see that some patients without RUNX1 mutations still showed similarexpression pattern to those with RUNX1 mutations. This “phenocopy” [222] effect could resultin some patients without mutations but high predicted probabilities P (F ). Phenocopying maybe a common event in cancer because of DNA methylation and other epigenetic alterations, andit may suggest novel treatment opportunities. For example, there is increasing evidence thattreating of patients based on phenotypes (expression) instead of genotypes (DNA mutations)653.3. ResultsFigure 3.11: The xseq-simple model prediction ROC curves from different simulated datasets.(a) The expression of genes which are down-regulated, neutral, and up-regulated is highly dis-criminative (first row), (b) moderately discriminative (second row), and (c) poorly discriminative(third row, see the enclosed figures, where blue is down-regulation, grey is neutral, and red isup-regulation, respectively). The ROC curves in the first column, second column, and the thirdcolumn were computed when the degree of dysregulation of the expression of connected genesby mutations was high, moderate, and low, respectively.produces better outcomes in some types of cancer [231]. We also switched the gene names withinthe mutation matrix (Fig. 3.13(a), permute gene), and the results showed similar performanceto those by switching patients.Next, we randomly drew the same number of connected genes as given by the combinednetwork (Fig. 3.13(a), permute network). Because the gene-regulation information is sparseand some master regulators can influence the expression of huge number of genes, the modelmay still predict a few mutations with high probability P (F ) because these “randomly-drawn”genes might be truly regulated by the mutated genes. Finally, if both the mutation matrix andthe network were shuffled, there were very few predicted high probability mutations (dashed red663.3. ResultsFigure 3.12: xseq-simple parameter trace plots during EM-iterations for the most challengingcase. (a) The ROC curves of the predictions based on the true parameters. (b) The ROC curvesof xseq-simple predictions based on the parameters learned from the EM algorithm. (c)-(i)The trace plots of xseq-simple parameters during EM iterations.curves in Fig. 3.13(a), permute all). We performed the same processing steps after permutationsthus minimizing the possibility of introducing bias. In addition, we kept the expression matrixthe same in all permutation analyses.Cross-validation analysis We executed a cross-validation analysis by splitting each TCGAdataset into approximately equally sized discovery and validation datasets. We trained a modelon the discovery dataset, and used the trained model to predict the validation dataset, with tenrepeats for each tumour type. We defined the validation rate as the proportion of high probabilitypredicted genes (P (D) ≥ 0.8, see next section on picking the threshold) in the training dataalso predicted to have high probabilities in the validation data. For bona fide cancer genes, themedian validation rate was 0.625 across all the 12 tumour types. For all of the predictions from673.3. ResultsFigure 3.13: Permutation analysis of the TCGA acute myeloid leukemia datasets. (a) Left panelshows the empirical distribution functions of P (D), and the right panel shows the empiricaldistribution functions of P (F ) estimated from different permuted datasets. (b) Heatmap showsthe expression of genes connected to RUNX1 : red represents high expression and blue representslow expression. Here columns represent patients and rows represent genes. For the patientswithout RUNX1 mutations, we “assume” the mutations still exist and estimate the probabilitiesof individual mutations P (F ). The mutation type ‘complex’ of a gene in a patient representsthe gene harbouring multiple types of mutations in the patient.the discovery data, the median validation rate was 0.492. The validation rate is sensitive to thenumber of patients (e.g., the median validation rate for the predicted bona fide cancer genes inCOAD and BRCA was 0 and 0.73, respectively). Restricting analysis to genes with at least fivemutations in both the discovery and validation datasets, median validation rates for the bonafide cancer genes and all the predicted genes increased to 0.68 and 0.63, respectively.Tested on independent datasets To examine how the model would translate to indepen-dently generated data, we used the METABRIC data [39] to validate the predicted copy num-ber alterations from TCGA in breast cancer. METABRIC copy number alterations [39] weregenerated with Affymetrix SNP 6.0, however gene expression was generated using Illumina mi-croarrays. We applied the xseq model trained on the TCGA breast cancer data to analyze683.3. Resultsthe METABRIC breast cancer data. This analysis generated 14 genes with high probability(P (D) ≥ 0.8), representing a strict subset of the 42 genes predicted in the TCGA breast cancerdata [47].Method comparison Finally, we quantitatively benchmarked xseq against CONEXIC [2]and xseq-simple. We ran CONEXIC algorithm on the TCGA breast cancer copy numberand RNA-Seq expression data. We first used the default parameter setting and initialized theclustering using K-means with K = 50 clusters. The candidate driver CNAs were the focal copynumber deletions and amplifications from the Pan-Cancer analysis [237]. Because of the largenumber of candidate driver genes (focal copy number alterations in 2891/3044 genes which haveEntrez gene symbols), CONEXIC generated 1385 modules, with median module size of six andmodulator number of three. These modules selected uniquely 313 modulators, and 21/313 geneswere putative cancer driver genes. Another 46/292 genes were directly connected partners ofthe bona fide cancer driver genes.To compare with CONEXIC, we re-ran xseq on the TCGA breast cancer data using onlyDNA copy number and RNA-Seq expression data. This analysis generated 40 high-probabilitygenes (P (D) ≥ 0.8), and 13/40 genes were bona fide cancer driver genes. Another 13/27 geneswere directly connected partners of the bona fide cancer driver genes. Ten genes overlappedwith CONEXIC predictions: BYSL, CALML3, CALML5, CCNE1, EGFR, ERBB2, EXOSC4,KPNA2, MRPS28, and PPP2R2A. Therefore, xseq was more specific but potentially less sensi-tive than CONEXIC in predicting copy number alterations influencing expression.xseq increased sensitivity without loss of specificity of results relative to xseq-simple. Ap-plication of xseq-simple to the Pan-Cancer datasets predicted a strict subset of 106 genes(relative to the 150 genes predicted by xseq. An example gene predicted only by xseq isCCNE1 in OV cancer as can seen from Fig. 3.14 and Fig. 3.15. The xseq-simple modelonly picked the CCNE1 amplifications in 5/30 patients correlated with extreme expression dys-regulation (P (F ) ≥ 0.5). By considering the direction of gene regulation, xseq picked 19/30amplification in CCNE1 (P (F ) ≥ 0.5). In addition, considering the direction of gene-regulationdoes not sacrifice specificity, e.g., 89/149 predicted genes were either putative cancer genes ordirectly connected parters of putative cancer genes for the xseq model, compared to 65/105xseq-simple predicted genes were either putative cancer genes or directly connected partnersof putative cancer genes (p-value = 0.80).693.3. ResultsFigure 3.14: CCNE1 amplifications in high-grade serous ovarian cancer predicted byxseq-simple. CCNE1 amplification probabilities P (F ) predicted by the simplified versionof xseq without considering the directionality of gene regulation.Figure 3.15: CCNE1 amplifications in high-grade serous ovarian cancer predicted by xseq.CCNE1 amplification probabilities P (F ) predicted by xseq when the directionality of generegulation was considered.3.3.3 Cis-effects loss-of-function mutations across the TCGA dataWe began analysis of the TCGA data by focusing on the cis-effect impacts of loss-of-functionmutations (frameshift, nonsense, and splice-site mutations) on gene expression, yielding 65 genesacross the 12 datasets with P (D) ≥ 0.8 (Fig. 3.16(a)). (We chose the threshold of 0.8 forP (D) to balance prediction of novel genes with introduction of false positives. Changes to the703.3. Resultsresults with thresholds in the range 0.75 to 0.85 were minor.) To place these predictions in thecontext of known cancer genes, we compiled a list of 603 bona fide cancer genes from the CancerGene Census (CGC) database [70] (Fig. 3.16(a), black coloured genes), Vogelstein et al [213],and Lawrence et al [116] (Fig. 3.16(a), blue coloured genes). In total, 34/65 xseq predictionsoverlapped bona fide cancer genes. We compared xseq predictions to those [98] predicted by anorthogonal method, MuSiC [44], which computes the statistical significance of the populationmutation frequency of a gene above an expected background mutation rate to predict its roleas a cancer gene. As MuSiC uses only mutation data, and not expression data, we used it as abenchmark to determine the effect of integrating gene expression data on cancer gene discovery.MuSiC predicted 22/65 genes as significantly mutated. Importantly, 13/43 of the xseq genesthat were not predicted by MuSiC were present in the list of bona fide cancer genes, suggestingthat integrating gene expression information can complement the existing mutation frequency-based methods to identify mutated cancer genes.We next characterized the tumour suppressor properties of the 65 xseq cis-effect predictionsfor consistency with known patterns of enrichment for loss-of-function mutations. We found51/65 genes with tumour suppressor characteristics (P (TSG) ≥ 0.2). Results were robust to amore conservative threshold, yielding 47/65 genes with P (TSG) >= 0.5. The cis-effect loss-of-function mutations were co-associated with genomic copy number (one-way ANOVA test p-value< 0.001, Fig. 3.16(c-d)), with xseq cis-effect genes enriched for coincidence with hemizygousdeletion. (Fisher’s exact test p-value < 0.001, Fig. 3.16(c-d). The statistical test method whenreporting p-values is omitted if Fisher’s exact test is used.)Additional biological characterization of the cis-effect genes suggested strong enrichment fortranscription factors, phosphoproteins, and X chromosome genes. Nearly half (30/65) of thecis-effect genes encode transcription factors (Fig. 3.16(a), p-value < 0.001), as annotated in theCheckpoint database [204]. Most of the cis-effect genes (54/65, p-value < 0.001) encode humanphosphoproteins, consistent with recent work predicting cancer driver genes based on enrichedmutations in phosphorylation regions [166]. Finally, cis-effect genes were disproportionatelyfound on chromosome X [42] (8/65, p-value < 0.01, Fig. 3.16(a)). Taken together, these dataindicate xseq cis-effect predicted genes’ properties are well aligned with known characteristicsof tumour suppressor genes.For the 30 novel predictions (not in our bona fide cancer driver gene list, nor significantlymutated based on MuSiC analysis), we searched for literature in support of their tumour sup-713.3. ResultsFigure 3.16: The 65 genes harboured loss-of-function mutations with strong cis-effects on theexpression of these genes. (a) The predicted cis-effect loss-of-function mutations across 12 tu-mour types (P (D) ≥ 0.8 in at least one tumour type). (TSG-probability, tumour suppressorgene probability; MuSiC-SMG, significantly mutated genes predicted by MuSiC; TF, transcrip-tion factors). (b) The histograms of conditional probabilities of mutations and genes given geneexpression data across tumour types. (c) The conditional probabilities of mutations given geneexpression data separated based on copy number status. (d) The loss-of-function mutationsin the 65 cis-effect genes (All-cis), 30 novel predictions (Novel-cis), 23 cis-effect tumour sup-pressor genes (TSG-cis), 108 non cis-effect TSGs (TSG-other), and 30 negative control genes(Negative controls) segregated based on copy number status. (e) A “novel” tumour-suppressorgene AMOT is not significantly mutated based on frequency-based methods, but AMOT is en-riched in loss-of-function mutations (tumour suppressor gene probability P (TSG) = 0.92). (f)The loss-of-function mutations in STAG2 typically correlate with lower expression, except fora splice donor site mutation GT → GC mutation. (both GT and GC are used by the splicingmachinery.)723.3. Resultspressor roles in cancer. In total, we found strong connections to tumour suppressor genes forat least 17 genes. The tumour suppressor roles of several of these genes have recently beenelucidated (e.g., UBQLN1 [182] and MED23 [189]). Notably, three genes (AMOT, AMOTL1,and ITCH ) encode proteins in the Hippo signalling pathway, involved in restraining cell divisionand promoting apoptosis. We further characterized the 30 genes according to criteria presentedabove for all the 65 genes. Of the 30 novel predictions, 18 genes accumulated enriched loss-of-function mutations (P (TSG) ≥ 0.2, p-value < 0.001), ten genes encode transcription factors(p-value < 0.05), 21 genes (p-value < 0.001) encode human phosphoproteins, three genes resideon the X chromosome (p-value < 0.1). All of the novel genes were rarely mutated in the 12studied cancer types (based on MuSiC results). 51/252 loss-of-function mutations in these geneswere in hemizygous deletion regions (p-value< 0.05).As a comparison to a negative control group of genes, we used the 30 genes flagged asfalse-positive cancer driver genes in a recent study [117]. These genes are not significantlymutated after correction for gene length, DNA replication time, and other factors in estimatingthe background mutation rates [117]. All 30 genes had P (TSG) < 0.1. In addition, loss-of-function mutations in these genes were not enriched in hemizygous deletion regions (p-value =0.7, Fig. 3.16(c-d)) and all 30 were predicted to have probabilities P (D) < 0.6 by the xseqmodel, suggesting that the false discovery rate for xseq is relatively low in the TCGA data, asshown in the permutation analysis.We next estimated the proportion of known tumour suppressor genes harbouring cis-effectloss-of-function mutations. We began by enumerating a set of 131 known tumour suppressorgenes from both CGC [70] and Vogelstein et al [213]. We found that 23/131 genes (signifi-cant enrichment of cis-effect genes, p-value < 0.001) were predicted to exhibit cis-expressioneffects indicating that loss-of-function mutations in ∼17.6% of tumour suppressor genes yieldconcomitant changes in mRNA expression levels.3.3.4 Trans-effect mutations across the TCGA dataApplication of xseq to predict mutations impacting expression in-trans resulted in a total of 150genes across the 12 cancer types (P (D) ≥ 0.8). Sixty of the 150 (40%) genes are bona fide cancergenes. We characterized these 60 trans-effect genes with annotated roles in cancer according tobiological functions and found that 30/60 genes encode transcription factors (p-value < 0.001),14/60 genes encode protein kinases (p-value < 0.001) and 4/60 genes (ATRX, BAP1, KDM5A,733.3. ResultsSETD2, p-value < 0.01) are chromatin regulatory factors. Moreover, 26/60 genes encode cellcycle proteins (gene ontology term: GO:0007049). By comparison, MuSiC only predicted 35/60of these genes. One gene (ACVR2A) was predicted by both xseq and MuSiC but was not inthe bona fide cancer gene list. Taken together, xseq uniquely predicted 89 novel genes throughtrans-impacting expression analysis.Further investigation revealed that 29/89 of the novel predicted genes were known interactingpartners of previously characterized bona fide cancer genes. The gene (protein) interactions wereassessed based on the high-quality protein-protein interaction networks [172] (downloaded fromhttp://interactome.dfci.harvard.edu/H_sapiens/index.php?page=download). As for the60 genes above, 23/89 genes encode transcription factors (p-value < 0.1), 7/89 genes encode pro-tein kinases (p-value < 0.05), 3/89 genes are chromatin regulatory factors (p-value < 0.1), 18/89genes encode proteins of the cell cycle process (p-value < 0.01). Pathway analysis indicated thesegenes encode proteins involved in major cancer pathways such as cell proliferation, apoptoticprocess, mitotic cell cycle, chromatin modification, cell migration, and focal adhesion. Nineteengenes were predicted to have P (TSG) or P (OCG) ≥ 0.2. The gene which harboured the largestnumber of high probability mutations was KPNA2 in breast cancer. (Fig. 3.17, 43 mutationswith P (F ) ≥ 0.5. Results are similar if choosing slightly different thresholds for P (F ).)To examine other sources of evidence of functional effects, we analyzed high-probability mis-sense mutations across all the tumour types, and computed the enrichment of phosphorylation-related SNVs (pSNVs) [166, 167]. (We found 839 missense mutations with P (F ) ≥ 0.5 in thegenes with P (D) ≥ 0.8, and also overlapped the set of missense mutations analyzed [167].) Ofthese mutations, 620 were unique when the same amino acid residue replacement in differentpatients was considered. We performed two analyses where the same amino acid substitutionmissense mutations (from different patients) were considered as separate events or collapsed intothe same event (unique). The Pan-Cancer dataset included 241,700 (236,367 unique) missensemutations. Among them, 16,840 (16,074 unique) mutations were pSNVs. Of the high prob-ability SNVs, 232/839 of them were pSNVs (134/620 unique). The high probability missensemutations were highly enriched in pSNVs in both analyses (p-value < 0.001). These results pro-vided additional data to support functional activity of the xseq predictions specifically relatedto impact on phosphorylation.743.3. ResultsFigure 3.17: KPNA2 amplifications in breast cancer correlated with a set of gene up-regulations.(a) These up-regulated genes play roles in DNA-regulation, mitotic signalling, transcriptionfactor networks. (b) The tumours harbouring high-probability KPNA2 amplifications weremostly Lumina B breast cancer. KPNA2 expression was highly correlated with the expressionof the up-regulated genes. (c) Compared to KPNA2, expression of genes in the same loci asKPNA2 (e.g, ERBB2, BPTF, NOL11 ) had lower correlation with the genes up-regulated withKPNA2 amplification.Expression dysregulation across tumour types Certain genes are frequently mutated inmultiple tumour types [98]. We asked whether these mutations across tumour types correlatedwith the dysregulation of the same set of genes. We focused on those genes whose mutations werepredicted to influence gene expression in multiple tumour types. For each gene connected to themutated gene g in a tumour type, we counted how many times this gene was dysregulated (P (G =‘up-regulation’) ≥ 0.5 or P (G = ‘down-regulation’) ≥ 0.5) in the presence of high probabilitymutations (P (F ) ≥ 0.5). We analyzed down-regulation and up-regulation independently usinga binomial exact test to test the significance of this correlation (high probability mutations753.3. Resultsand gene dysregulation). The binomial distribution parameters were obtained by maximumlikelihood estimation from all count data. From this analysis we found 17/20 recurrent geneshad at least one gene up-regulated or down-regulated in two tumour types. Mutations in RB1correlated with the same group of gene dysregulations across several tumour types. In particular,we observed that RB1 mutations correlated with E2F family gene up-regulations (e.g., E2F1 )in BLCA, BRCA, GBM, LUSC, OV and UCEC (Table 3.3), as well as genes encoding mini-chromosome maintenance proteins, e.g., MCM5 in BRCA, GBM, LUAD, LUSC and OV. Toconfirm these correlations, for each gene connected to RB1 in the original full influence graph(Methods), in each tumour type, we compared the expression of this gene in the patients withRB1 mutations to the patients without RB1 mutations using the Limma package [190]. Wethen aggregated all the obtained p-values from the genes connected to RB1 across tumour types,and computed the FDRs [16]. We found that E2F1 was up-regulated in BLCA, BRCA, GBM,LUSC, and UCEC (FDR < 0.1). We performed a similar analysis for MCM5 and found it wasup-regulated in BRCA, GBM, LUSC, OV and UCEC (FDR < 0.1).In addition, aberrations (mutations and amplifications) in the transcription factor NFE2L2in six different tumour types (BLCA, HNSC, KIRC, LUAD, LUSC, and UCEC) exhibited trans-effects on gene expression (P (D) ≥ 0.8). Two genes, MAFG and FECH (Figs 3.18-3.19) weresignificantly up-regulated in five and four tumour types, respectively, in the presence of NFE2L2aberrations (FDR < 0.1). As MAFG, FECH and NFE2L2 reside on chromosome 17, 18, andtwo, respectively, the correlations are not likely cased by gene dosage effects. Several other geneswere also up-regulated in the presence of NFE2L2 aberrations, e.g., NQO1, TXNRD1, PRDX1,GSR, GPX2, GCLM, FTL, AKR1C1, TXN, SQSTM1, GSTA1, KEAP1, GSTA4, ABCC1, andGCLC were up-regulated in six to three tumour types.Stratifying patients harbouring the same gene mutations We investigated whetherxseq probabilities P (F ) could stratify patients harbouring mutations in the same cancer drivergene. We analyzed each of the 127 genes from Kandoth et al [98] in each tumour type for thepresence of bimodal xseq P (F ) distributions over patients harbouring mutations in the genes ofinterest. Twenty-two commonly mutated genes exhibited bimodal distributions in at least onetumour type. This was particularly evident for CTNNB1 mutations in UCEC (Fig. 3.20(a));53/72 patients harboured high probability CTNNB1 mutations (P (F ) ≥ 0.5), with all 53 pa-tients harbouring CTNNB1 hotspot mutations (mutations hitting codons between 31 and 45).763.3. ResultsFigure 3.18: NFE2L2 mutations and FECH up-regulation. NFE2L2 mutations were predictedto correlate with FECH expression up-regulation in five types of cancer: BLCA, HNSC, LUAD,LUSC, and UCEC. Each dot in the scatterplots represents the expression of NFE2L2 and FECHin a patient. A blue cross ‘×’ means the patient does not have NFE2L2 mutations. Other typesof symbols represent different kinds of mutations (lamp, copy number amplifications). Thefilled colours encode the estimated mutation probability P (F ) from trans-analysis (both FECHexpression and the expression of other NFE2L2 interaction partners determine P (F )).By contrast, only 9/19 patients without high xseq probability CTNNB1 mutations harbouredhotspot mutations (Fig. 3.20(c)). In addition, 9/19 patients harboured POLE mutations, orwere annotated as “ultramutated” (tumours with more mutations than Q3+IQR ∗ 4.5, whereQ3 is the third quartile of mutation counts across a corresponding tumour type, and IQR is theinterquartile range, as defined in syn1729383), suggesting that the CTNNB1 mutations wereinconsequential passenger mutations. Moreover, patients in the P (F ) ≥ 0.5 group were signifi-cantly younger than patients with P (F ) < 0.5 (mean age 57.5 versus 65.7 years old, one-sidedt-test p-value < 0.01).Similar results for RB1 mutations in UCEC are shown in Fig. 3.20(d). All 11 loss-of-function773.3. Results−1012mutant wildtypeFDR=1.591e−08BLCA−1.00.01.0mutant wildtypeFDR=0.793BRCA−1.00.01.0mutant wildtypeCOAD−0.50.00.5wildtypeGBM−1.00.01.0mutant wildtypeFDR=2.559e−13HNSC−1.00.01.0mutant wildtypeFDR=0.9406KIRC−1.5−0.50.51.5wildtypeLAML−1.00.01.0mutant wildtypeFDR=0.003063LUAD−2−1012mutant wildtypeFDR=4.337e−08LUSC−1.00.01.0mutant wildtypeFDR=0.8917OV −0.50.00.51.0mutant wildtypeREAD−1012mutant wildtypeFDR=0.00472UCECFECH mRNA expressionNFE2L2 mutationhlampmissensenonsensesynonymousframeshiftsplicecomplexinframeFigure 3.19: Boxplots showing NFE2L2 mutations and FECH up-regulation. In BLCA, HNSC,LUAD, LUSC, and UCEC, FECH was up-regulated in the patients with NFE2L2 mutations oramplifications (Limma FDR < 0.1).mutations (in eight patients) were predicted to have high probabilities (P (F ) ≥ 0.5), however,only 2/13 patients that did not harbour loss-of-function mutations were predicted to accumulatehigh probability mutations (P (F ) ≥ 0.5, Fig. 3.20(d)). Taken together, although genes such asCTNNB1 and RB1 frequently harbour driver mutations, they still likely accumulate passengermutations without impact on gene expression. As such, patients’ tumours with these ‘inert’mutations do not exhibit expected pathway dysregulation. xseq is therefore capable of sub-stratifying patients into meaningful phenotypic groups, separating patients with mutations anddysregulated pathways from those patients with mutations, but normal pathway activities.TP53 mutations in UCEC also showed bimodal distributions. TP53 frequently accumulatesboth loss-of-function mutations and missense mutations. The variation in P (F ) cannot beexplained by the types and positions of the mutations. However, patients with P (F ) >= 0.5were more likely to harbour copy number hemizygous deletions. (36 patients harboured co-783.3. ResultsFigure 3.20: Patients harbouring the same gene mutations but with variations in trans-associatedgene expression. (a) In UCEC, CTNNB1 mutations correlated with the up-regulation of a setof genes, and down-regulation of another set of genes. The most extreme up-regulated genesincluded BMP4 (in the TGF-β signalling pathway), NKD1, AXIN2, DKK4, and KREMEN1 (inthe Wnt signalling pathway). The down-regulated genes included Wnt signalling pathway geneFZD5. Here red colour in the heatmap represents gene up-regulation and blue colour representsgene down-regulation. (MSI, microsatellite instability; MSS microsatellite stable; MSI-H, MSI-high; MSI-L, MSI-low). (b) The smallest unimodality dip-test p-values of P (F ) of the 127significantly mutated genes across tumour types. (c) The mutation sites, mutation types, andP (F ) (filled colours) of CTNNB1 mutations (d) and RB1 mutations in UCEC.occurring hemizygous deletions, compared to 8 patients with P (F ) < 0.5; only 9 patients lackedcopy number alterations in the group with P (F ) ≥ 0.5 compared to 12 in the group withP (F ) < 0.5, p-value < 0.005.)793.4. Discussion3.4 DiscussionWe developed a statistical model, xseq to quantitatively assess the association of mutations withdysregulated gene expression in 12 tumour types. Computational benchmarking and assessmentof independent datasets have demonstrated the robustness of xseq. Our results have implicationsfor the interpretation of somatic mutations in retrospective, discovery-based studies.Systematic analysis of mutation and expression landscapes from more than 2,700 tumoursuncovered several novel patterns. We revealed 30 novel tumour suppressor candidate genes bycis-effect loss of expression analysis. These genes showed the hallmarks of tumour suppressorgenes including a distribution of loss-of-function mutations, and biallelic inactivation throughloss-of-function mutations and heterozygous deletions. In addition, we assessed the landscapeof mutations impacting gene expression in-trans across the 12 tumour types. These results im-plicated 89 novel genes with mutations impacting gene expression. In total, 33% of these geneshad functional relationships with cancer genes in core tumourigenic processes. These genes werenot nominated by mutation analysis alone, suggesting that integrated analysis of mutationsand gene expression is a complementary approach towards comprehensive identification of func-tional mutations. Recent synthesis of mutation rates and discovery ‘saturation’ in genome-widesequencing studies has indicated that current standard of study design has under-sampled im-portant mutations, and that for some 50 tumour types, sequencing of >2000 cases are needed toreach comprehensive sampling [116]. The combined cis- and trans-analyses led to the elucida-tion of >100 novel candidate cancer genes predicted to impact expression. Integration of geneexpression data directly into analysis of mutations will therefore help to bridge the discoverygap left by DNA mutation analysis alone.Results from xseq analysis identified two important characteristics for biological interpreta-tion of mutations. The trans-analysis revealed that the same mutated gene in different patientscan exhibit distinct expression impacts. In our analysis, constitutive activation of Wnt signallinggenes due to CTNNB1 mutation segregates almost exclusively with known hotspot mutations.However, several cases exhibited mutation in CTNNB1 without evidence of Wnt activation,resulting in low xseq probabilities. These cases were primarily phenotyped as hypermutatorsdue to mismatch repair deficiency and/or POLE mutations [201], and patients were statisticallyolder at diagnosis. Thus, a real phenotypic distinction associates with low and high xseq P (F )probabilities, providing evidence for integrative analysis of mutations and expression as a route803.4. Discussionto stratifying phenotypically distinct tumours in the context of the same mutations.xseq analysis identified several genes that had consistent expression impact across tumourtypes. Despite distinct histologies and cell contexts of source tumours, RB1 loss-of-functionmutations and NFE2L2 mutations/amplifications exhibited similar expression patterns. RB1binds and inhibits the E2F transcription factor family. Accordingly, we observed that RB1 mu-tations correlated with E2F family gene up-regulation across tumour types. NFE2L2 binds toits regulator KEAP1 and regulates the expression of antioxidant-related genes to protect againstoxidative damage. We observed NFE2L2 mutations correlated with up-regulation of KEAP1,as well as of oxidative stress genes (e.g., GCLM, GCLC, TXNRD1, GPX2, and NQO1 ). Whiletherapeutic responses to targeted inhibitors administered against the same mutation can havevariable effects due to intrinsic gene expression context (e.g., BRAF inhibition in melanoma andcolorectal cancer [162]), the mutations we outlined (such as RB1 and NFE2L2 mutations) ex-hibit stable profiles and represent important targets for future development of broadly applicabletherapeutics. An intriguing evolutionary implication arises from these mutations: phenotypicimpact is selected for in multiple heterogeneous tumour micro-environments, indicating inde-pendent convergence of phenotype transcending cell context.In conclusion, this work provides a route towards closing the cancer gene discovery gap in thefield of cancer genome sequencing. Direct, model-based integration of mutations and co-acquiredgene expression measurements from tumour samples enhances interpretation capacity of discov-ered mutations leading to optimal selectivity of targets for functional studies and developmentof novel therapeutics.3.4.1 Limitationsxseq is not able to distinguish different mutations of a gene within a specific patient –a limitation,as these mutations may result in different functional impacts. Although genes are rarely mutatedmultiple times within a single patient, some large tumour suppressor genes (such as ARID1A)accumulate multiple mutations, as a result of their long coding sequences. Similarly, in glioblas-toma and lung cancers, EGFR is frequently mutated multiple times in single patients, often dueto the emergence of clonal populations following the administration of EGFR inhibitors [104].Examining the expression impact properties of such mutations in clonal populations would likelyrequire advanced single cell methods [157].Inference for the xseq-simple model is efficient (time and space linear in the number of813.4. Discussionnodes in the model) since it is a tree. For the xseq model given the gene regulation variables H,the inference is also efficient as it is a polytree with tree width of two. The most computationalresources gain in calculating and storing the doubled in size tables in Equation 3.20 and inestimating the parameter in Equation 3.24. However, when the number of patients is large,the Gaussian process regression used to remove the cis-effects of copy number alteration ongene expression could be slow. As we process each gene independently, this step can easily beparallelized.Each input (mutations, gene expression, and the influence graph) influence the xseq pre-dictions as shown in the permutation analyses. Although permuting the influence graph doesnot completely remove the signals because we select subnetworks from the influence graph asinputs to the xseq model. The mixture of Student’s t-distribution modelling of gene expressiondistribution is quite robust to outliers and the number of patients with measured gene expressionvalues. However, for some genes (e.g, genes not expressed in the studied tumour type), theirexpression may not follow a mixture of three Student’s t-distribution. The expression of thesegenes is better removed from further analyses.82Chapter 4densityCut: an efficient and versatiletopological approach for automaticclustering of biological data“Nothing truly valuable can be achieved except by the unselfish cooperation of manyindividuals.”– Albert Einstein, 19404.1 IntroductionIn the previous chapters, we have investigated the problems of predicting mutations from high-throughput DNA sequencing data and estimating the impacts of these mutations on gene expres-sion. Interestingly, we found that some patients, although their mutations were quite different,showed high similarity in gene expression patterns. In addition, recent studies have demon-strated that for some cancer, the treatments based on phenotype (gene expression) producebetter outcomes than the treatments based on genotype (mutations). Because of the complexityof the working mechanisms of human cancer, it is necessary to integrate all the measurements,e.g., mutation, gene expression, DNA methylations and micro-RNA expression to provide acomprehensive view of a specific tumour. Furthermore, to deliver the best treatment optionsto a patient, it is advantageous to divide (cluster) patients into distinct subgroups of clinicalrelevance, and apply similar treatment options to a group of patients.Clustering analysis (unsupervised machine learning), which organizes data points into sen-sible and meaningful groups, has been increasingly used in the analysis of high-throughputbiological datasets. For example, The Cancer Genome Atlas project has generated multipleomics data for individual patients. One can cluster the omics data of individuals into subgroupsof potential clinical relevance. To study clonal evolution in individual cancer patients, we cancluster variant allele frequencies of somatic mutations [185], such that mutations in the samecluster are accumulated during a specific stage of clonal expansion. Emerging technologies such834.1. Introductionas single-cell sequencing have made it possible to cluster single-cell gene expression data todetect rare cell populations, or to reveal lineage relationships [161, 232]. One can cluster single-cell mass cytometry data to study intratumor heterogeneity [120]. As measurement technologyadvances have drastically enhanced our abilities to generate various high-throughput datasets,there is a great need to develop efficient and robust clustering algorithms to analyze large N (thenumber of data points), large D (the dimensionality of data points) datasets, with the abilityto detect arbitrary shape clusters and automatically determine the number of clusters.The difficulties of clustering analysis lie in part with the definition of a cluster. Of thenumerous proposed clustering algorithms, the density-based clustering algorithms [63, 69] areappealing because of the probabilistic interpretation of a cluster generated by these algorithms.Let D = {xi}Ni=1,xi ∈ RD be drawn from an unknown density function f(x),x ∈ X ⊂ RD.For model-based approaches such as Gaussian mixture models f(x) =∑Cc=1 picN (x | µc,Σc),a cluster is considered as the points generated from a mixture component, and the clusteringproblem is to estimate the parameters of the density function from D [63]. To analyze datasetsconsisting of complex shape clusters, nonparametric methods such as kernel density estimationcan be used to estimate fˆ(x) =∑Ni=1Kh(x,xi), where Kh(·) is the kernel function with band-width h. Here a cluster is defined as the data points associated with a ‘mode’ of the densityfunction f(x) [227]. The widely used ‘mean-shift’ algorithm [34, 38, 69] belongs to this category,and it locates the modes of the kernel density function fˆ(x) by iteratively moving a point alongthe density gradient until convergence. This algorithm, however, is computationally expensive,having time complexity O(N2T ), where T is the number of iterations, typically dozens of iter-ations are sufficient for most cases. A more efficient, non-iterative graph-based approach [108]constructs trees such that each data point xi represents a node of a tree, the parent of node xiis a point xj which is in the direction closest to the gradient direction ∇fˆ(xi), and the root ofa tree corresponds to a mode of fˆ(x). Then each tree constitutes a cluster. This algorithm hasbeen used to reduce the time complexity of the mean-shift algorithm to O(N2) [211], and hasbeen extended in several ways, e.g., constructing trees after filtering out noisy modes [171].Nonparametric clustering methods have been generalized to produce a hierarchical clustertree [86]. Consider the λ level set of a density function f(x):L(λ; f(x)) = {x | f(x) ≥ λ}.844.1. IntroductionThe ‘high level clusters’ at level λ are the connected components of L(λ; f(x)) (in the topologicalsense, the maximal connected subsets of L(λ; f(x))). As λ goes from 0 to max f(x), the highlevel clusters at all levels constitute the level set tree, where the leaves of the tree correspondto the modes of f(x) [197]. The widely used DBSCAN algorithm [57] extracts the high levelclusters at just one given level λ. Many original approaches for level set tree construction instatistics [141] take the straightforward ‘plug-in’ approach to estimating the level set tree fromfˆ(x) by partitioning the feature space, i.e., X . Therefore, they are computationally demand-ing, especially for high-dimensional data. Recently, efficient algorithms have been proposed topartition the samples D directly [33, 109]. Recovering the level set tree from a finite datasetis more difficult than partitioning the dataset into separate clusters. Correspondingly, theoret-ical analyses show that for these algorithms to identify salient clusters from finite samples, thenumber of data points N needs to grow exponentially in the dimension D [33, 109]. Moreover,although the level set tree provides a more informative description of the structure of the data,many applications still need the cluster membership of each data point, which is not availabledirectly from the level set tree.The spectral clustering algorithm [153, 188] works on an N by N pairwise data similaritymatrix S, where each element Si,j measures the similarity between xi and xj . The similaritymatrix can be considered as the adjacency matrix of a weighted graph G = (V,E), wherevertex vi represents xi and the edge weight Ei,j = Si,j . Given the number of clusters C, thespectral clustering algorithm partitions the graph G into C disjoint, approximately equal sizeclusters, such that the points in the same cluster are ‘similar’, while points in different clustersare ‘dissimilar’. In contrast to density-based methods, the spectral clustering algorithm doesnot make assumptions on the probabilistic model which generates data D [215]. Therefore,selecting the number of clusters is a challenging problem for spectral clustering algorithms,especially in the presence of outliers or when the number of clusters is large. In addition, thespectral clustering algorithm is time-consuming because it needs to compute the eigenvaluesand eigenvectors of the row-normalized similarity matrix S, requiring Θ(N3) time. Insteadof using single value decomposition to calculate the eigenvalues and eigenvectors, the poweriteration clustering algorithm (PIC) [125] iteratively smoothes a random initial vector by therow-normalized similarity matrix, such that the points in the same cluster will be similar invalue. Then the k-means algorithm is used to partition the smoothed vector into C clusters.Although PIC has a time complexity of O(N2T ), where T is the number of iterations, PIC854.2. Methodsmay encounter many difficulties in practice. First, the points from two quite distinct clustersmay have very similar ‘smoothed’ densities, and therefore they may not be distinguishable byk-means. Second, the points in a non-convex shape cluster can break into several clusters. Asthe number of clusters increases, these problems become more severe [125].In this chapter, we introduce a simple and efficient clustering algorithm, densityCut, whichshares some advantages of both density-based clustering algorithms and spectral clustering al-gorithms. As for spectral clustering algorithms, densityCut works on a similarity matrix; thusit is computationally efficient, even for high-dimensional data. Using a sparse K-nearest neigh-bour graph further reduces the time complexity. Besides, we can use a random walk on theK-nearest neighbour graph to estimate densities at each point. As for many density-based clus-tering algorithms, densityCut is simple, efficient, and there is no need to specify the numberof clusters as an input. Moreover, densityCut inherits both methods’ advantage of detectingarbitrarily shaped clusters. Finally, densityCut offers a novel way to build a hierarchical clus-ter tree and to select the most stable clustering. We first benchmark densityCut against tenwidely used simulation datasets and two microarray datasets to demonstrate its robustness. Wethen use densityCut to cluster variant allele frequencies of somatic mutations to infer clonalarchitectures in tumours, to cluster single-cell gene expression data to uncover cell populationcompositions, and to cluster single-cell mass cytometry data to detect communities of cells ofthe same functional states or types.4.2 MethodsThe densityCut method consists of four major steps (Algorithm 1): (1) density estimation:given a set of data points D = {xi}Ni=1,xi ∈ RD, form a directed unweighted K-nearest neigh-bour graph and estimate the densities of data points by using the K -nearest neighbour densityestimator; (2) density refinement: refine the initial densities via a random walk on the un-weighted K-nearest neighbour graph; (3) local-maxima based clustering: detect local maximaof the estimated densities, and assign the remaining points to the local maxima; (4) hierarchicalstable clustering: refine the initial clustering by merging neighbour clusters. This cluster merg-ing step produces a hierarchical clustering tree, and the final clustering is obtained by choosingthe most stable clustering as the threshold in merging clusters varies. Fig. 4.1 demonstrates howdensityCut works on a toy example [68]. Below we discuss each step in detail.864.2. Methodsijka b3.5e−042.4e−034.5e−036.7e−038.7e−03c4.5e−052.7e−035.4e−038.2e−031.1e−02d0 0.2 0.4 0.6 0.8 1saliency indexe0 0.2 0.4 0.6 0.8 1saliency indexf0.00.10.20.30.40.51 2 3 4# clustersFrequencyg hFigure 4.1: Major steps of the densityCut algorithm. (a) Data points in D. (This dataset wasintroduced by Fu et al [68].) The dotted black circles represent the balls containing K = 8points from D centred at three example points, i, j and k, whose densities to be estimated.Each point connects to its K = 8 nearest-vertices by orange arrows. Other points connect toi, j and k by green arrows if i, j and k are among the K in-vertices of these points. Noticethe asymmetry of in-vertices and out-vertices of a vertex in a Knn graph, e.g., vertex vi hasone in-vertex but K = 8 out-vertices. (b) Colour coded Knn estimated densities at points fromD. The modes of densities are represented by triangles ‘O’. (c) The refined densities based ona random walk. (d) Initial clustering by assigning data points to modes. (e) The tree createdby merging clusters without adjusting valley heights. (f) The tree created by merging clustersbased on the saliency index using the adjusted valley heights. (g) The cluster number frequencyplot. (h) The final clustering results.4.2.1 Density estimationWe adopt the K -nearest neighbour density estimator to estimate the density at x ∈ RD(Fig. 4.1(a)):fK(x) =(K − 1)/NVK(x)(4.1)where VK(x) = VD×(rK(x))D is the volume of the smallest ball centred at x containing K pointsfrom D. VD is the volume of the unit ball in the D-dimensional space, and rK(x) is the distancefrom x to its Kth nearest neighbour. Compared to the widely used kernel density estimator,K-nearest neighbour density estimates are easier to compute, and also the parameter K is moreintuitive to set than the kernel bandwidths for kernel density estimators. Here for simplicity,874.2. MethodsAlgorithm 1 The densityCut algorithm. Unless otherwise specified, all results in this paperare obtained with K = log2(N) and α = 0.9.Input• A set of data points D = {xi}Ni=1,xi ∈ RD• The number of nearest neighbours K• The damping factor α1. Density estimationf0i =(K − 1)/NVK(xi)Wi,j ={1 xj ∈ Knn(xi)0 otherwise2. Density refinementPi,j =Wi,j∑jWi,jIterate until ||f t+1 − f t||≤  (default value:  = 10−6)f t+1 = αf tP + (1− α)f03. Local-maxima based clusteringDetect modes, i.e., local maxima, of the underling density function from{vj | ∀Pi,j > 0, fi < fj}Build trees of points rooted at the modes, usingParent(vi) = arg minvj∈Ni(∣∣dj − di∣∣ | fi < fj)Build one cluster per tree, containing all points in that tree4. Hierarchical stable clusteringCalculate heights of trees and valleys(Optional) adjust valley heights based on Equation 4.10Compute the saliency index ν for a pair of adjacent trees (Equation 4.9)Merge clusters to generate a hierarchical tree by varying ν,ν ∈ {0.0, 0.05, 0.10, . . . , 0.95, 1.0}Select the most stable clusteringwe only calculate the densities at data points from D, represented by f0 = (f01 , . . . , f0N )T , wheref0i = fK(xi) and the superscript ‘0’ indicates that this is the initial Knn density estimate sincewe will refine this density in the next section. Fig. 4.1(b) shows the estimated densities at eachdata point.When computing the Knn densities, we can get the K-nearest neighbour graph G as a884.2. Methodsbyproduct with the following adjacency matrix:Wi,j =1 xj ∈ Knn(xi)0 otherwise(4.2)As xi maps to node vi in the Knn graph G, we next use ‘points’ and ‘nodes’ interchangeably.The Knn graph G is a directed unweighted graph. While the sets of in-vertices and out-vertices of many nodes may overlap significantly, some outliers may have few, if any in-vertices.In addition, points at the boundary of density changes may also have quite different sets of in-vertices and out-vertices because their in-vertices commonly consist of points from low-densityregions while out-vertices are usually from high-density regions.4.2.2 A random walk based density refinementAs K nn density estimates are based on order statistics and tend to be noisy, we next introducea way to refine the initial Knn density vector f0 = (f01 , . . . , f0N )T . Our refinement is based onthe intuition that 1) a high-density vertex belongs to one of the K-nearest neighbours of manyvertices, and 2) a vertex tends to have high density if its in-vertices also have high densities.For example, point k in Fig. 4.1(a) has nine in-vertices, and these vertices are in high-densityregions, and indeed, the density of k is 0.0087 which is higher than average for this example.Based on the above assumptions, we get the following recursive definitions of densities:f t+1j = α∑iPi,jfti + (1− α)f0j (4.3)where α ∈ [0, 1] specifies the relative importance of information from vj ’s in-vertices and theinitial density estimate f0j , which is the K-nearest neighbour density estimate. If each row ofP sums to one, P is a Markov transition matrix, which contains the transition probability froma vertex to its K out-vertices. Therefore, Equation 4.3 defines a random walk with restart.The refined density vector is the stationary distribution of the random walk on the Knn graphwith adjacency matrix P. We can row-normalize matrix W to get P = D−1W, where D =diag(∑j W1,j , . . . ,∑j WN,j). Compared to the original Knn densities in Fig. 4.1(b), the refineddensities in Fig. 4.1(c) have fewer ‘local maxima’ (shown as triangle points, see next section fordetails). Similar methods have been used in information retrieval and semi-supervised learningapplications [155].894.2. MethodsEquation 4.3 can be solved exactly in the limit by f = (1−α)f0(I−αP)−1 if α < 1. Therefore,the above iterations guarantee convergence. When α = 1, f t converges to a left eigenvector ofP with the maximum eigenvalue of 1. In our computational experiments, when α < 1, e.g.,α = 0.90, Equation 4.3 typically converges within a few dozen iterations, and can be muchfaster than the case when α = 1. This rough density estimation process is faster than methodswhich attempt to solve density estimation to a high degree of precision since density estimationis well known to be a difficult problem. Moreover, our method can be applied to data that arepresented in the form of a graph, rather than as data points over the reals.4.2.3 Local-maxima based clusteringAfter obtaining densities for points, we estimate the ‘modes’ of the underlying probability densityfunction. The modes are the ‘local maxima’ of the density function with zero gradients. For finitesamples, the modes are rarely located exactly at points xi ∈ D, so we use the points close to themodes instead. Mathematically, modes can be approximated by points {xi | max|xj−xi|< fj ≤fi}, where  is a small distance threshold. The distance  is dataset dependent and difficult tochoose in practice. Instead, we can define a mode as a vertex whose density is the largest amongall of its in-vertices:{vj | ∀Pi,j > 0, fi < fj} (4.4)Here we use in-vertices instead of out-vertices in order to be able to detect small cluster with lessthan K data points. As the vertices of a small cluster with less than K vertices can form a clique(or are highly connected to each other) in the Knn graph, we can detect this small cluster basedon the definition of local maxima above if these points are not among the K-nearest neighboursof points outside this cluster. A potential problem is that some outliers with very few in-verticescould be detected as local maxima. We simply remove those local maxima whose numbers ofin-vertices are less than K/2. In other words, we treat a cluster less than K/2 in size as an‘outlier’ cluster, and densityCut is unlikely to detect this small cluster.Data points that fall into the basin of attraction of each mode constitute a cluster. Thisprocess can be done by moving each point along its gradient direction to reach a mode. Wemodify the efficient graph-based hill-climbing algorithm [108] to build a unique forest (a set of904.2. Methodstrees) for a given dataset. The parent of vertex vi is defined asParent(vi) = arg minvj∈Ni(∣∣dj − di∣∣ | fi < fj) (4.5)where Ni is the set of in-vertices of vi. In other words, the parent of vi is the vertex which isclosest to vi among all of vi’s in-vertices that have higher densities than vi. From the constructionof the trees, we can see that each vertex is associated with just one tree. Therefore, each treecan be considered as a cluster. Fig. 4.1(d) shows the clusters by assigning data points to theseven local maxima.4.2.4 Hierarchical stable clusteringWe then generate a hierarchical cluster tree and select the most stable clustering. First thedensity of the root of a tree T generated above is called the height of this tree, denoted by hT ,which has the largest density among all the vertices in T . Then we define the boundary pointsbetween trees T1 and T2:B(T1, T2) = {v ∈ T1 | ∃u ∈ Nv ∩ T2, fv < fu} (4.6)Sets B(T1, T2) and B(T2, T1) are not the same: B(T1, T2) ⊂ T1 and B(T2, T1) ⊂ T2. The valleyseparating two trees is:Valley(T1, T2) = B(T1, T2) ∪B(T2, T1), (4.7)The height of the valley separating T1 and T2 is defined ashValley(T1,T2) = maxv∈Valley(T1,T2)fv (4.8)The saliency index ν of a valley represents the relative height of the valley compared to theshorter tree:ν(T1, T2) =hValley(T1,T2)min(hT1 , hT2)(4.9)Fig. 4.2(a) shows the height of the valley (the length of the grey arrow) separating two adjacenttrees, and the saliency index is the ratio between the length of the grey arrow and the black914.2. Methods−2 0 2 4 60.00.10.20.30.40.5 ModesHeight of valleysHeight of treesalllll12 14 16 18 20−20246# data points (log2)CPU time (seconds, log2)bFigure 4.2: (a) Merging the first two trees (clusters) based on the relative height of the valleyseparating the two trees. (b) The time (in seconds, log2 transformed) increases almost linearlywith the number of data points (log2 transformed).arrow.The saliency index defined in Equation 4.9 has several properties. First, 0 ≤ ν ≤ 1, and ν isinvariant under scaling of the densities by a positive constant factor. This ‘scale-free’ property isvery useful for us to select a threshold for merging trees. Second, it automatically scales to thelocal densities of trees, e.g., to get the same saliency index, the height of the valley separatingtwo trees in low-density regions is shorter than that separating two trees in high-density regions.We can merge two clusters if the saliency index between them is above a threshold ν. Whenthe saliency index ν = 1, no clusters get merged, and when ν = 0, all the connected clusters getmerged to form a single cluster. Therefore, by gradually decreasing the saliency index threshold,we can get a hierarchical clustering tree, which is useful for us to interpret the structure ofdata, especially for high-dimensional data. Fig. 4.1(e) shows the cluster tree from mergingneighbouring clusters in Fig. 4.1(d).For a dataset consisting of clusters whose densities are considerably different, a single K fordensity estimation may be insufficient. This is because the points from a high-density clusterneed a larger K to estimate their densities, compared to the points from a low-density cluster.The high-density cluster could ‘break’ into several small clusters when K is small. Instead ofpicking the parameter K on a data point basis, we can adjust the height of a valley by:hˆValley(T1,T2) =(1 +∑ifi < hValley(T1,T2)N)hValley(T1,T2) (4.10)The intuition behind this density adjustment step is that a high-density valley could be an924.2. Methodsartefact caused by splitting a high-density cluster. Fig. 4.1(f) shows the cluster tree using thesaliency indexes based on the adjusted valley heights. A potential problem of this step is thatit also increases the valley height between two genuine clusters and increases their likelihoodto merge. For example, as can be seen from Fig. 4.1(e-f), the two clusters merge at saliencyindex around 0.30 before valley adjustment, but merge at saliency index around 0.35 aftervalley adjustment. We will further investigate the influence of this density adjustment step onclustering in later sections.We finally introduce a method to determine the number of clusters to produce the finalclustering from the hierarchical cluster tree. The basic idea is that by gradually decreasingthe saliency index, clusters will get merged. Noisy, non-salient clusters will get merged quickly,and true clusters will exist for a long period of time. We therefore can calculate the length ofsaliency index change for producing a fixed number of clusters, and select the cluster numberwhich spans the longest saliency index changes. In our current implementation of densityCut,we decrease the saliency index evenly and therefore we can interpret the saliency index changeinterval as ‘Frequency’. Fig. 4.1(g) shows the cluster number frequency plot, and Fig. 4.1(h)shows the final clustering by merging the initial clustering to produce two clusters as selectedby the cluster number frequency plot.4.2.5 Complexity analysis and implementationdensityCut has been implemented in the statistical computation language R. densityCut hasa worst-case time complexity of O(NK + C2) and a space complexity of O(NK + C2), whereN is the number of data points, K is the number of neighbours and C is the number of clusters(local maxima). In practice, as a majority of clusters are only adjacent to few clusters, thetime and space complexity is typically of O(NK + C). We did not consider the time usedto compute the Knn graph in densityCut as numerous algorithms have been developed forefficient Knn search with different complexities, and typically it takes less time to computethe Knn graph compared to cluster the data. To build the Knn graph given a data matrix,efficient algorithms such as kd-trees can be used in low-dimensional spaces (D ≤ 20) with timecomplexity O(N log(N)) [149]. To build the Knn graph in high-dimensional spaces (D < 1000),efficient software libraries based on random projection exist to repeatedly partition the datato build a tree (https://github.com/spotify/annoy). This algorithm can run in O(NDT )time , where T is the number of trees, and typically dozens of trees are enough to preserve the934.2. Methodsaccuracy of Knn search.To demonstrate the scalability of densityCut, we tested densityCut on a Mac desktopcomputer running OS X Version 10.9.5. The computer has 32 GB of RAM and a 3.5GHz four-core Intel i7 processor with 8MB cache. We carried out all the experiments presented in thepaper on this computer.We sampled {212, 214, 216, 218, 220} data points from a mixture of 64 two-dimensional Gaus-sian distributions. As can be seen from Figure 4.2(b), the running time increased almost linearlyin the number of data points. It took about 127 CPU seconds to cluster a million data points(N = 220).4.2.6 Parameter settingOur algorithm has two parameters: the number of nearest neighbours K, and the dampingfactor α in density refinement. K should be small enough to detect local maxima, e.g., smallerthan the number of data points in a cluster. However, very small K can result in poor densityestimates and produce large numbers of clusters, thus ‘overfitting’ the data, and there may notexist a ‘gap’ in the cluster number frequency plot for us to select the number of clusters. Onthe contrary, for large K, densityCut may fail to detect detailed structures thus ‘underfitting’the data.Theoretical analysis for spectral clustering shows that K should be Ω(log(N)) to producea connected graph [215], and limit results are obtained under conditions K/log(N) → ∞ andK/N → 0. K is also dependent on the dimensionality D. For the density estimate at x(f(x) is Lipchitz smooth in a neighbour of x) from its K-nearest neighbours, under conditionsk/N2/(2+D) → 0 and k →∞, we can get |fˆ(x)− f(x)|. f(x)/√k [41].In practice, K should be dataset dependent. For example, if the Euclidean distance is used,K should be sufficiently small such that the Euclidean distance is a good measure of the distancebetween two close data points even the data lie in a manifold. If the number of clusters is small,K should increase to prevent generating too many local maxima. We therefore conducted anempirical study of the influence of K and α on clustering the data in Fig. 4.1, for which N = 240(Fig. 4.3-4.4). First, when K = log2 (N) = 8, densityCut correctly detected the two clustersgiven different values for α (Fig. 4.3). Small K = log2(N) = 4 produced ‘spiky’ density estimatesand resulted in many local maxima (Fig. 4.3). On the contrary, large K produced flat densityestimates, and the two true clusters tended to merge because of no deep valley between them944.2. Methods(Fig. 4.3). We therefore used a default value of K = log2(N). In addition, when α = 0.9 or0.99, densityCut correctly detected the two clusters given different values for K. Increasing αproduced better clustering results but it took much longer for the density refinement step toconverge, e.g., median 176 iterations when α = 0.99 compared to 41 iterations when α = 0.90.We set the default value for α = 0.90 as it made a good compromise between accuracy andexecution time.The valley height adjustment step plays a role of ‘smoothing’ the density estimates. Thisfunctionality is especially useful for small K. For example, even when K = 0.5 log2(N) = 4,densityCut correctly detected the two clusters after adjusting the heights of valleys (Fig. 4.4).For all the results presented in the paper, we used the default parameter setting (K = log2(N)and α = 0.9) with the valley height adjustment step.4.2.7 Comparing clusteringThree kinds of measures have been developed in the literature to assess the similarity betweentwo clusterings [140, 217]. The first type of measure is based on set overlaps, i.e., to match twoclusterings such that the absolute or relative overlap is maximized. The second type of measureoriginates in information theory and is based on mutual information, i.e., our knowledge aboutone clustering increasing when we are told the other clustering. The third type of measure isbased on counting pairs, i.e., to consider(N2)pair of decisions of assigning a point from clusteringone and a point from clustering two to separate clusters or the same cluster. In this study, weused three measures, one from each type to compare clusterings. In addition, we can calculatethese measures between a clustering and the ground truth (if available) to assess the accuracyof the clustering.We first introduce the notations used in defining different measures. Let a clustering C ={C1, . . . , CL} is a partition of the dataset D = {xi}Ni=1 into L mutually disjoint subsets Ci, i ∈1, . . . , L. Let C′ = {C ′1, . . . , C ′R} denotes a second clustering of D. Let element Mi,j = |Ci ∩ C ′j |denotes the (i, j)th entry of the confusion matrix ML×R between C and C′. In other words, Mi,jdenotes the number of points that are common in cluster Ci and C′j .The maximum-matching measure (MMM) [140, 217] is based on set overlaps, and it is ageneralization of the accuracy in classification applications. It defines a mapping between C andC′, such that the sum of the number of common points between C and C′ (M¯) is maximized,under the constraint that only one entry in C can match one entry in C′. Then the MMM954.2. Methods0 0.2 0.6 1K = 40.000.251 3 6 11 160 0.2 0.6 1K = 80.00.20.41 2 3 5 7 110 0.2 0.6 1K = 160.00.31 2 5 60 0.2 0.6 1K = 240.00.31 20 0.2 0.6 1K = 320.00.31 20 0.2 0.6 10.000.251 3 5 10 160 0.2 0.6 10.00.20.41 2 3 5 6 110 0.2 0.6 10.00.31 2 3 60 0.2 0.6 10.00.31 20 0.2 0.6 10.00.31 20 0.2 0.6 10.000.251 2 3 5 6 160 0.2 0.6 10.00.31 2 3 5 70 0.2 0.6 10.00.31 2 3 40 0.2 0.6 10.00.30.61 20 0.2 0.6 10.00.31 20 0.2 0.6 10.00.31 3 6 80 0.2 0.6 10.00.31 2 4 50 0.2 0.6 10.00.30.61 2 30 0.2 0.6 10.00.30.61 20 0.2 0.6 10.00.31 2alpha = 0.01(Median iterations = 2)alpha = 0.1(Median iterations = 4)alpha = 0.9(Median iterations = 41)alpha = 0.99(Median iterations = 176)Figure 4.3: The influence of densityCut parameter K and α on the final clustering results. WhenK = log2 (N) = 8, densityCut correctly detected the two clusters given different values for α.Small K = log2(N) = 4 produced ‘spiky’ density estimates and resulted in many local maxima.Large K produced flat density estimates, and the two true clusters tended to merge because ofno deep valley between them. In addition, when α = 0.9 or 0.99, densityCut correctly detectedthe two clusters given different values for K. Increasing α produced better clustering resultsbut it took much longer for the density refinement step to converge, e.g., median 176 iterationswhen α = 0.99 compared to 41 iterations when α = 0.90.964.2. Methods0 0.2 0.6 1K = 40.00.20.41 2 4 70 0.2 0.6 1K = 80.00.20.41 2 4 5 80 0.2 0.6 1K = 160.00.41 2 5 60 0.2 0.6 1K = 240.00.40.81 20 0.2 0.6 1K = 320.00.31 20 0.2 0.6 10.00.20.41 2 3 5 80 0.2 0.6 10.000.251 2 3 5 6 90 0.2 0.6 10.00.41 2 3 60 0.2 0.6 10.00.40.81 20 0.2 0.6 10.00.31 20 0.2 0.6 10.000.201 2 3 4 5 60 0.2 0.6 10.00.31 2 3 40 0.2 0.6 10.00.31 2 30 0.2 0.6 10.00.31 20 0.2 0.6 10.00.41 20 0.2 0.6 10.00.31 2 3 50 0.2 0.6 10.00.31 2 3 40 0.2 0.6 10.00.31 2 30 0.2 0.6 10.00.31 20 0.2 0.6 10.00.30.61 2alpha = 0.01(Median iterations = 2)alpha = 0.1(Median iterations = 4)alpha = 0.9(Median iterations = 41)alpha = 0.99(Median iterations = 176)Figure 4.4: The influence of densityCut parameter K and α on the final clustering results (withthe valley height adjustment step). The valley height adjustment step plays a role of smoothingthe density estimates. This functionality is especially useful for small K. For example, whenK = 0.5 log 2(N) = 4, densityCut correctly detected the two clusters after adjusting the heightsof valleys.between C and C′ is defined as:MMM(C, C′) = M¯N(4.11)974.2. MethodsFor perfect match MMM(C, C′) = 1. However, unlike accuracy for classification, the maximum-matching measure between two random clusterings is not zero. In fact, the minimum maximum-matching measure is 1/N under the extrem condition that L = N and R = 1.The normalized mutual information (NMI) between C and C′ is defined as follows [66]:NMI(C, C′) = 2I(C, C′)H(C) +H(C′) (4.12)where H(C) = −∑Li=1 |Ci|N log2( |Ci|N ) is the entropy associated with clustering C. The mutualinformation between clustering C and C′ is defined as I(C, C′) = ∑i∑j |Ci∩Cj |N log2( |Ci∩Cj |/N|Ci|/N∗|Cj |/N ).The normalized mutual information is a number between 0 and 1. For perfect match, NMI(C, C′) =1, and NMI(C, C′) = 0 if the joint distribution Pi,j = |Ci∩Cj |N is independent. Because of thestrong independent requirement, the NMI between two random clusterings is typically a smallnumber but not zero.The adjusted Rand index (ARI) compares pair of assignments form C and C′, and is definedas:ARI(C, C′) =∑Li=1∑Rj=1(Mi,j2)− t3(t1 + t2)/2− t3 (4.13)where t1 =∑ki=1(|Ci|2), t2 =∑lj=1(|C′j |2), and t3 = t1t2/(N2). Compared to MMM and NMI,ARI has been corrected for chance, i.e., ARI(C, C′) = 0 when the elements of the confusionmatrix M follow a generalized geometric distribution (the two clusterings C and C′ are pickedat random, subjected to having the original number of elements in each cluster [93]. In otherwords, the marginal distributions of the confusion matrix M are the same as the originals.) Anundesired property of the adjusted Rand index is that negative values can occur. For perfectmatch, ARI(C, C′) = 1.4.2.8 Comparing algorithmsWe compared densityCut with three best algorithms reported in [229], i.e., the hierarchical clus-tering algorithm (HC, from the R stats package) with average linkage, the partitioning aroundmedoids (PAM, from the R cluster package) algorithm, and the density-based clustering al-gorithm OPTICS [7] (from the R dbscan package). Notice that in [229], two density-basedalgorithms (DBSCAN [57] and clusterdp [171]) were tested and showed good performance. Cur-984.3. Resultsrently the clusterdp algorithm needs some human interactions to select the cluster centers, andunfortunately there is no agreed way to automatically set this parameter. Similarly, DBSCANis very sensitive to the parameter epsilon, which is the radius used to define the neighbours foreach data point. We therefore used the OPTICS algorithm, which is similar to DBSCAN, butis more robust because essentially there is no need to set the epsilon parameters. We extractedclusters from OPTICS outputs based on the methods of Sander et al, 2003 [178]. The pointsconsidered as outliers by OPTICS were assigned to other clusters by a K-nearest neighbourclassifier (where K is the same as the MinPts parameter of OPTICS).We did not compare densityCut to one of the best clustering tools reported in [229], transi-tivity clustering [228], because we could not find an easy to use software package for clusteringlarge datasets represented as matrices. We also compared densityCut with the Gaussian mix-ture model (GMM, implemented in the R mclust package [63]) based clustering algorithm andthe normalized cut (NCut, implemented in the kernlab package [239]) spectral clustering algo-rithm. These algorithms generally represent broad classes of methods for clustering analysis(i.e., hierarchical, partition, density-based, model-based, and graph-based) [6]4.3 Results4.3.1 Benchmarking against state-of-the-art algorithmsSynthetic datasetsWe used ten synthetic datasets in our study (downloaded from http://cs.joensuu.fi/sipu/datasets/). The fourth column figures in Fig. 4.5 shows these synthetic datasets: Aggrega-tion [76], Compound [238], Flame [68], Spiral [31], Jain [95], Pathbased [31], R15 [212], D31 [212],S3 [65], and S4 [65]. The number of data points in each dataset is 788, 399, 240, 312, 373, 300,600, 3100, 5000, and 5000, respectively. Seven out of the ten datasets have been used in [229]to compare various clustering algorithms (except for Jain, D31, and S4).As shown in Fig. 4.5, densityCut performed the best in terms of the above evaluation mea-sures (with mean MMM, NMI, and ARI of 0.911, 0.897, and 0.854). Overall, the clusteringresults on these synthetic benchmark datasets demonstrated that densityCut can produce ex-cellent results if the high-density clusters were separated by low-density valleys. For the datasetsin Fig. 4.5(a,c-e, g-j), densityCut detected the right number of clusters and revealed the struc-tures of these datasets. The red colour cluster in Fig. 4.5(b) was considered as two separated994.3. Results0 0.2 0.6 1a0.00.20.45 6 7 8 12 0.00.40.8densityCutOPTICSPAMHCNCutGMM0.9970.9940.778 0.9960.9940.787 0.9930.9820.8890.9900.9820.9050.9940.9880.772 0.9930.9870.7860 0.2 0.6 1b0.00.20.44 5 6 7 10 0.00.40.8 0.8720.8670.647 0.8620.6120.5690.9150.8880.7420.8370.7690.698 0.8530.8440.5140.8030.5390.4900 0.2 0.6 1c0.00.20.41 2 3 4 0.00.40.80.9920.9830.8500.8330.6460.717 0.9360.8740.4420.4830.0480.3250.9670.9340.4880.4420.013 0.1800 0.2 0.6 1d0.00.40.83 4 0.00.40.81.0000.5380.3490.359 0.5960.3491.0000.1900.0010.0030.3870.0011.0000.133−0.006−0.002 0.275−0.0050 0.2 0.6 1e0.000.100.202 4 8 10 13 0.00.40.81.0000.8420.756 0.9461.0000.5681.0000.3570.3350.6981.0000.1941.0000.3760.2610.779 1.000−0.0180 0.2 0.6 1f0.00.20.41 2 3 4 5 6 0.00.40.80.6730.6270.7400.7300.7800.7100.6700.4680.5450.5220.5810.5230.5020.3840.4580.4440.5050.4330 0.2 0.6 1g0.00.40.811 14 15 0.00.40.80.9930.9850.9970.9950.818 0.9970.9890.9770.9940.9920.9210.9940.9860.9680.9930.9890.811 0.9930 0.2 0.6 1h0.000.100.205 19 30 33 39 0.00.40.80.9700.9470.9770.9370.714 0.9030.9590.9300.9680.9520.8860.9460.9390.8940.9540.9070.714 0.8660 0.2 0.6 1i0.000.101 16 20 27 52 0.00.40.8 0.8320.8530.8580.6910.7650.605 0.7840.7950.7970.7520.7610.7060.6890.7200.7300.5960.6360.5190 0.2 0.6 1saliency indexj0.000.100.201 15 20 29 50# clustersFrequency0.00.40.8MMM NMI ARI0.7830.7670.7990.6160.6890.6420.7190.7090.7210.6690.6870.6560.6120.6060.6360.4930.5500.486Figure 4.5: Results on the synthetic benchmark datasets consisting of irregular, non-convex,or overlapped clusters. The first-column figures show the clustering trees, the second-columnfigures show the cluster number frequency plots, the third-column figures show the final clus-tering results, and the fourth-column figures show the maximum-matching measure (MMM),the normalized mutual information (NMI), and the adjusted Rand index (ARI) from comparingclustering results of each algorithm to the ground truth.1004.3. Resultsclusters originally [238]. However, the sparse background points and the centre high-densitypoints could be considered as from the same cluster for density-based clustering. For the datasetin Fig. 4.5(f), densityCut failed to detect the three clusters because there were no ‘deep’ valleysbetween the outer arc cluster and the two Gaussian clusters. Therefore, the outer arc cluster gotmerged to the right Gaussian cluster. Without the valley height adjustment step, densityCutgenerated the same clustering. The density-based clustering algorithm OPTICS ranked secondwith mean MMM, NMI, and ARI of 0.840, 0.717, and 0.685, respectively (Fig. 4.5, last column).PAM and GMM performed poorly on the datasets consisting of irregular shape clusters(Fig. 4.5 (a-f), last column). PAM results on these datasets had mean MMM, NMI, and ARI of0.687, 0.492, and 0.414, respectively, and GMM results on these datasets had mean MMM, NMI,and ARI of 0.617, 0.441, and 0.311, respectively. However, PAM did very well on the datasetswhere the points in each cluster were sampled from a two-dimensional Gaussian distributionwith mean MMM, NMI, and ARI of 0.908, 0.870, and 0.828 (Fig. 4.5 (g-j), last column). Incontrast, GMM performed inferior to PAM on these datasets with mean MMM, NMI, and ARIof 0.787, 0.826, and 0.716. For example, for ‘D31’ in Fig. 4.5(h), cluster two consisted of only asingle data point, and cluster nine consisted of two data points. Cluster three and 18 consistedof points from multiple Gaussian distributions. One major reason for the failure of GMM wasthat the relatively large number of clusters (31) compared to the limited number of data points(3100) and the overlapped clusters resulted in many local maxima in its objective log-likelihoodfunction, while the Expectation-Maximization algorithm for fitting GMM only searched for alocal maximum of the objective function. By contrast, densityCut directly located the high-density peaks and selected the most stable clustering, and thus it was less likely influenced byspurious density peaks.Both HC and NCut can cluster datasets consisting of arbitrary shape clusters. On thedatasets consisting of non-convex shape clusters, HC (with mean MMM, NMI, and ARI of0.788, 0.589, and 0.577) and NCut (with mean MMM, NMI, and ARI of 0.771, 0.628, and0.55) performed slightly better than PAM and GMM (Fig. 4.5 (a-f), last column). On thedatasets consisting of convex shape clusters, HC (with mean MMM, NMI, and ARI of 0.810,0.841, and 0.746) and NCut (with mean MMM, NMI, and ARI of 0.746, 0.814, and 0.678)performed slightly worse than PAM and GMM (Fig. 4.5 (g-j), last column). Compared toHC and NCut, densityCut performed better in both the datasets consisting of arbitrary shapeclusters (with mean MMM, NMI, and ARI of 0.922, 0.919, and 0.886) and the datasets consisting1014.3. Resultsof convex shape clusters (with mean MMM, NMI, and ARI of 0.895, 0.863, and 0.807). Moreover,densityCut had low complexity and was scalable to large datasets.Microarray gene expression dataGene expression data have been used to stratify cancer patients into biologically orclinically meaningful subtypes, e.g., different subtypes of patients have distinct progno-sis. Here we tested densityCut on the two microarray gene expression datasets asin [11]. The first microarray gene expression dataset consists of the expression of 1543genes from four types of lung cancer tissues (186 snap-frozen tumours) and 17 normallung tissues [19]. These lung tumours include 139 adenocarcinomas, 21 squamous celllung carcinomas, 20 pulmonary carcinoids, 6 small cell lung cancer. This dataset wasdownloaded from http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/Affymetrix/bhattacharjee-2001/bhattacharjee-2001_database.txt.The second microarray gene expression dataset consists of the expression of 182 genes fromthe mixture of breast cancer tissues and colon cancer tissues [36]. The breast tumours consist of32 pairs of snap-frozen tumours and the corresponding preserved tumours, and the colon tumoursconsist of 20 pairs of snap-frozen tumours and the corresponding preserved tumours. Thisdataset was downloaded from http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/Affymetrix/chowdary-2006/chowdary-2006_database.txt.The results in Fig. 4.6 show that densityCut performs the best on these two datasets (withMMM, NMI, and ARI of 0.926, 0.780, and 0.853 on dataset one, and 0.981, 0.860, and 0.924 ondataset two) compared with PAM, HC, OPTICS, NCut, and GMM. Although OPTICS producedgood results on the previous two dimensional synthetic datasets, it performed poorly on thesehigh-dimensional gene expression datasets (ten dimensions are considered as high dimensions fordensity-based clustering in [110]). OPTICS produced just one cluster for each dataset. Althoughthe absolute distances between data points are not discriminative in high-dimensional spaces (thecurse of dimensionality, the distances between any two points are approximately the same), therelative distances (the order of closeness) could still be meaningful, and could be captured bythe Knn graph. densityCut explores the topology of the Knn graph thus performed better onhigh-dimensional spaces than OPTICS. GMM (with MMM, NMI, and ARI of 0.626, 0.621, and0.408 on dataset one, and 0.962, 0.765, and 0.850 on dataset two) and PAM (with MMM, NMI,and ARI of 0.714, 0.583, and 0.411 on dataset one, and 0.952, 0.727, and 0.815 on dataset two)1024.3. Results0.00.20.40.60.81.0 0.9260.6850.7140.8330.5860.6260.7800.0000.5830.5730.540 0.6210.8530.0000.4110.5620.326 0.408MMM NMI ARIadensityCutOPTICSPAMHCNCutGMM0.00.20.40.60.81.00.9810.5960.9520.6540.6540.9620.8600.0000.7270.1420.1420.7650.9240.0000.8150.0660.0660.850MMM NMI ARIbFigure 4.6: Clustering microarray gene expression data. Clustering results on (a) the lung cancerdataset, and (b) the mixture of breast cancer and colon cancer dataset.performed relatively well on these datasets. The results were consistent with the results fromthe previous study that GMM and PAM (more precisely, the k-means algorithm) performed wellon clustering gene expression data [43].4.3.2 Inferring clonal architectures of individual tumoursCancer cells are heterogeneous, and a subpopulation of cancer cells of the same patient couldharbour different sets of mutations [144]. Moreover, cancer cells frequently accumulate addi-tional mutations after treatment or in metastasis. Understanding the clonal architecture of eachtumour provides insights into tumour evolution and treatment responses. We used densityCutto cluster the somatic variant allele frequencies (VAF) measured from DNA sequencing of multi-ple tumour biopsies. The mutations in each cluster were accumulated during a specific stage ofclonal expansion. The clustering results provide valuable information of the clonal architecturesof tumours.We first tested densityCut on the mutation data from a primary myelofibrosis (PMF) pa-tient [56]. This patient was first diagnosed with PMF, and seven years later, this patient’stumour transformed to acute myeloid leukaemia (AML). After chemotherapy treatment, the1034.3. Resultspatient underwent complete remission. However, 1.5 years later, the patient redeveloped PMFbut no evidence of AML relapse. A total of 649 single nucleotide variants detected in wholegenome sequencing of either PMF, AML, or relapse PMF genomes were validated by targetedhigh-coverage sequencing. We used densityCut to jointly cluster the targeted sequencing VAFsfrom PMF, AML, and relapse PMF tissues. Figure 4.7(a) shows that densityCut grouped themutations into four clusters.Overall, densityCut clustering results matched those presented in the original study. How-ever, to produce the results, the authors [56] used different algorithms and several pre-processingsteps. For example, the authors used DBSCAN [57] to detect outliers (the mutations with circles‘◦’ in Fig. 4.7(a)), and then used Mclust [63] for model selection and final clustering analysis. Themaximum number of clusters was limited to four, and each cluster had to contain at least sevenmutations [56]. In contrast, we directly used densityCut to cluster the VAFs and producedexactly the same results (MMM=1, the outliers were not considered in calculating MMM.) Wealso changed the parameter K from the default log2(N) = 10 to 2 log2(N) until 10 log2(N). Onlyafter K was set to 8 log2(N), the red colour cluster and the violet colour cluster got merged, ascan be seen from Fig. 4.7(b). For K < 8 log2(N), densityCut produced the same four clusters.OPTICS, PAM, HC, NCut, and GMM produced the same results as densityCut results (tobe exact, only PAM assigned one point to different clusters, Fig. 4.8(c)). However, except forOPTICS, these algorithms either need the number of clusters as input parameters or cut thedendrogram to produce the desired number of clusters (HC).Next, we tested densityCut on the acute myeloid leukaemia sample AML28 [53]. We jointlyclustered the VAFs from sequencing both the primary tumour and the relapse tumour after 26months of chemotherapy [53]. Figure 4.7(c) shows that densityCut grouped the 804 detectedsomatic mutations into five clusters. The results matched those predicted by sciClone [144], avariational Bayesian mixture model based clustering algorithm. Only one mutation (with circle‘◦’ in Fig. 4.7(c)) was assigned to the red cluster by densityCut, but originally assigned to thecyan cluster by sciClone (MMM=0.999). densityCut is much more efficient than sciClone as canbe seen from Fig. 4.7(d). sciClone took a median of 48.12 seconds to run on the AML28 datasetwhile densityCut took a median of 0.074 second to run. Because it took less than a CPU secondto run densityCut, we ran both algorithms ten times to get a more accurate estimation of thetime used. For competing algorithms, PAM split the large cluster into two clusters because ittends to generate equal-size clusters (with MMM, NMI, and ARI of 0.619, 0.509, and 0.767;1044.3. Results 0 10 20 30 40 50 01020304050 0102030405060MYBJAK2ASXL1RUNX1U2AF1HCFC1PMF VAF (%)AML VAF (%)Relapsed PMF VAF (%)a MMM: 1, NMI: 1, ARI: 120 40 60 80 100K# clusters234b0 10 20 30 40 5001020304050AML28 Tumour VAF (%)AML28 relapse VAF (%)clIDH2IDH2DDX41DCLK1MMM:  0.999, NMI:  0.992, ARI:  0.996densityCut SciClone01020304050Time (Second)d0 20 40 60 80020406080100Lung metastasis VAF (%)Pancreas Metastasis VAF(%)MMM:  0.848, NMI:  0.831, ARI:  0.828eFigure 4.7: Clustering variant allele frequencies (VAF) of somatic mutations. (a-b) Clusteringmulti-time sample data from initial primary myelofibrosis (PMF), later acute myeloid leukaemia(AML), and after treatment relapsed PMF using densityCut. (c-d) Clustering the somaticmutations from sequencing a primary/relapse pair of an AML patient. (e) Clustering the somaticmutations from sequencing a lung/pancreas metastasis pair of a melanoma patient. The possible‘driver’ mutations in each cluster are labeled with a black plus sign ‘+’. The clustering validationindices (MMM, NMI, and ARI) were from comparing densityCut results with sciClone resultsor the results reported in the original studies. (a) Three-dimensional VAF plot. The mutationsin each cluster were assigned a unique colour. The mutations with a circle ‘◦’ were consideredas outliers in the original publication [56] before clustering analysis. (b) The number of clustersproduced by densityCut as we gradually increased K from log2(N) to 10 log2(N). (c) Themutation assigned to the violet colour cluster by sciClone but assigned to the red colour clusterby densityCut was labeled with a circle ‘◦’. (d) densityCut and sciClone execution time basedon repeated ten runs.1054.3. Results 0 10 20 30 40 50 01020304050 01020304050MMM: 1, NMI: 1, ARI: 1a 0 10 20 30 40 50 01020304050 01020304050MMM: 1, NMI: 1, ARI: 1b 0 10 20 30 40 50 01020304050 01020304050MMM: 0.998, NMI: 0.992, ARI: 0.995c 0 10 20 30 40 50 01020304050 01020304050MMM: 1, NMI: 1, ARI: 1d 0 10 20 30 40 50 01020304050 01020304050MMM: 1, NMI: 1, ARI: 1e0 10 20 30 40 5001020304050 MMM:  1NMI:  1ARI:  1f0 10 20 30 40 5001020304050 MMM:  0.619NMI:  0.767ARI:  0.509g0 10 20 30 40 5001020304050 MMM:  0.939NMI:  0.929ARI:  0.964h0 10 20 30 40 5001020304050 MMM:  0.999NMI:  0.992ARI:  0.996i0 10 20 30 40 5001020304050 MMM:  0.925NMI:  0.891ARI:  0.865j0 20 40 60 80020406080100Average silhouette width:  0.539k0 20 40 60 80020406080100Average silhouette width:  0.464l0 20 40 60 80020406080100Average silhouette width:  0.548m0 20 40 60 80020406080100Average silhouette width:  0.454n0 20 40 60 80020406080100Average silhouette width:  0.422oVAF (%)VAF (%)Figure 4.8: Clustering variant allele frequencies of somatic mutations using competing algo-rithms. (a-e) Clustering multi-time sample data from initial primary myelofibrosis (PMF),acute myeloid leukaemia (AML), and after treatment relapsed PMF. (f-j) Clustering the so-matic mutations from sequencing a primary/relapse pair of an AML patient. (k-o) Clusteringthe somatic mutations from sequencing a lung/pancreas metastasis pair of an melanoma patient.First column figures show the OPTICS clustering results, the second column figures show thePAM clustering results, the third column figures show the HC clustering results, the fourthcolumn figures show the NCut clustering results, and the fifth column figures show the GMMclustering results,.Supplementary Fig. 4.8(g)). HC assigned some ‘outliers’ to a distinct clusters, and merged thepoints from two clusters (with MMM, NMI, and ARI of 0.939, 0.964, and 0.929; Fig. 4.8(h)).Similarly, GMM modelled the outliers using a Gaussian component (with MMM, NMI, and ARIof 0.925, 0.865, and 0.891; Fig. 4.8(j)). OPTICS results differed from densityCut results bythe assignment of only one data point (Fig. 4.8(f)), and NCut produced the same results asdensityCut results (Fig. 4.8(i)).Finally, we used densityCut to cluster the somatic mutations from whole genome sequencingof the lung/pancreatic metastasis pair from the same melanoma patient MEL5 [52]. Compared toblood cancer genomes, melanoma genomes are much more complex, frequency harbouring copynumber alterations. The combinations of copy number alterations, homozygous mutations, andheterozygous mutations make it a challenging task to develop a model to uncover the clonal1064.3. Results0 20 40 60 80020406080100Lung metastasis VAF (%)Pancreas Metastasis VAF(%)MMM:  0.848, NMI:  0.831, ARI:  0.828Figure 4.9: Clustering the somatic mutations from sequencing a lung/pancreas metastasis pairof a melanoma patient using sciClone (without considering copy number alterations). Theyellow colour cluster may be meaningful in terms of clustering because it models the ‘outliers’.However, it may not have biological meaning because the mutations in this cluster could fromdifferent clones. The MMM, NMI, and ARI were computed from comparing sciClone results(ten clusters) with the densityCut results (12 clusters).structure of these cancer genomes [175]. The densityCut clustering results in Fig. 4.7(e) showthat the mutations in MEL5 could be grouped into 12 clusters, providing the starting point fordetailed inspection of the clonal structure of this cancer genome. Additional information suchas copy number alterations would be required to fully interpret the clonal architectures. We alsoran sciClone, which produced ten clusters (Fig. 4.9). Both algorithms agreed in clustering 84.8%of the mutations (MMM: 0.848, Fig. 4.7(e)). densityCut clustering had an average silhouettewidth of 0.58 (Fig. 4.10), which was higher than sciClone clustering average silhouette width of0.55. Other competing algorithms performed inferrer to densityCut with PAM and OPTICSperformed second and third with average silhouette widths of 0.548 and 0.539, respectively(Fig. 4.8(m, k)).4.3.3 Clustering single-cell gene expression datasetsWe used densityCut to cluster two single-cell mRNA gene expression datasets. The first datasetconsists of the low-coverage mRNA expression of 23,730 genes in 301 cells from 11 popula-tions [161]. The second dataset consists of the single-cell mRNA expression of 43,309 genesin 223 stem cells from the subventricular zone of eight-week-old male mice [126]. We did sev-eral pre-processing steps to only select a subset of genes [161] for clustering analysis because1074.3. ResultsSilhouette width si−0.4 −0.2 0.0 0.2 0.4 0.6 0.8 1.0Average silhouette width :  0.58n = 1139 12  clusters  Cjj :  nj | avei∈Cj  si1 :   240  |  0.625 :   338  |  0.6711 :   128  |  0.6313 :   161  |  0.3816 :   53  |  0.4718 :   42  |  0.5219 :   60  |  0.5020 :   41  |  0.3721 :   33  |  0.6122 :   25  |  0.5323 :   8  |  0.6224 :   10  |  0.85Figure 4.10: Plot of silhouette values from clustering the variant allele frequencies of somaticmutations from sequencing a melanoma lung/pancreas metastasis pair by dencityCut. Thehigh average silhouette width of 0.58 suggests that this dataset could contain 12 clusters.single-cell gene expression data have high technical noises (e.g., loss of cDNA in reverse tran-scription and bias in cDNA amplification) and Knn search in high dimensional spaces is stilltime-consuming. Specifically, we only kept the genes expressed in more than five cells because itis difficult to detect clusters less than five in size given the relatively large number of cells. Herea gene was considered to be expressed in a cell if its reads per kilobase per million (RPKM) value(or fragments per kilobase per million (FPKM) value for dataset two [126]) was greater than orequal to one in the cell. We then further normalized the RPKM values by log transformation:1084.3. Resultslog2(x+ 1). Here x was the original RPKM value of a gene in a cell. A small value of one wasadded to prevent taking the log of zero or generating very small numbers.Figure 4.11(a) shows that densityCut produced nine clusters for dataset one (MMM: 0.917,NMI: 0.953, and ARI: 0.918). densityCut cannot distinguish the cells from GW16, GW21,and GW21.2 based on the 1,000 genes. These cells were quite similar as they were all fromthe human cortex (GW16 cells were from the germinal zone of human cortex at gestationalweek 16, GW21 cells were from GW21, and GW21.2 cells were cultured cells of a subset of theGW21 cells [161]). These cells could possibly be separated by selecting a better set of featuresfor clustering analysis. In addition, one GW21 cell was misclassified as a neural progenitorcell (NPC), and one NPC was in the human-induced pluripotent stem (iPS) cell cluster. Forthe other seven types of cells, densityCut perfectly put them into separate clusters. Otherclustering algorithms such as OPTICS, PAM, HC, NCut, and GMM had inferior performancecompared to densityCut results with PAM ranked second with MMM, NMI, and ARI of 0.877,0.916, and 0.854, respectively (Fig. 4.12(a)).densityCut grouped the 223 stem cells of dataset two into four clusters (Fig. 4.11(b)). Glu-tamate aspartate transporter+/Prominin1+ (GP) cells and polysialylated-neural cell adhesionmolecule+ (PSA) cells were in separate clusters (except for one PSA cell). The GP cells weresubdivided into three clusters, consistent with the original finding that the GP cells consisted ofat least three subtypes of stem cells. We next used t-SNE [205] to project the 1000-dimensionalsingle-cell gene expression data to a two-dimensional space (Fig. 4.13). The results also showfour very distinct clusters. Compared with the original analysis using hierarchical clusteringcoupled with principle component analysis feature section [126], densityCut can be used in amore unbiased way to cluster single-cell gene expression data and produce the same results. Onthis dataset, densityCut, PAM, HC, and GMM results had average silhouette widths of 0.190,0.190, 0.191, and 0.189, respectively (Fig. 4.12(b); silhouette widths have less discriminativepower in high dimensional spaces).4.3.4 Clustering single-cell mass cytometry datasetsFinally, we used densityCut to cluster two benchmark single-cell mass cytometry (aka CyTOF)datasets [120]. The first dataset consists of CyTOF data of bone marrow mononuclear cells froma healthy individual. Manually gating (labelling) assigned 81,747 cells to 24 cell types based on13 measured surface protein markers [15]. Dataset two contains CyTOF data from two healthy1094.3. ResultsCell typeClusterCell type23382339BJGW16GW21GW21.2HL60iPSK562KeraNPCMMM:  0.917NMI:  0.953ARI:  0.918aCell typeClusterCell typeGP PSAbFigure 4.11: Clustering single-cell gene expression data. Each row is a gene and each columnis a cell. The cell type and cluster membership of each cell are colour coded. Heatmaps showclustering (a) 301 cells from 11 populations, and (b) 223 stem cells from the subventricular zoneof eight-week-old male mice.adult donors H1 and H2. For H1, manual gating assigned 72,463 cells to 14 cell types based on32 measured surface protein markers. Manual gating assigned 31,721 cells to the same 14 cellpopulations from H2 based on the 32 surface protein markers. These manually identified cellpopulations were used as ground truth to test densityCut.1104.4. Conclusions and discussion0.00.20.40.60.81.0 0.9170.1790.8770.4580.7540.821 0.9530.0000.9160.6520.8880.9160.9180.0000.8540.2920.745 0.843MMM NMI ARIadensityCutOPTICSPAMHCNCutGMM0.00.20.40.60.81.00.1900.0000.1900.1910.110 0.189Average silhouette widthbFigure 4.12: Performance measures on clustering single-cell gene expression data. (a) Theclustering measures from clustering the gene expression data of 301 cells. (b) Average silhouettewidths from clustering the expression data of 223 stem cells from the sub ventricular zone ofeight-week-old mice.We compared densityCut to the recently proposed algorithm, the PhenoGraph algorithm [120],in clustering the benchmark single-cell CyTOF datasets. As both densityCut and PhenoGraphfirst build a Knn graph, we used the same K = log2(N) for both algorithms. As can be seenfrom Fig. 4.14, densityCut detected 12 distinct cell types (clusters) in dataset one, 9 cell typesin H1, and 12 cell types in H2. The PhenoGraph algorithm detected 18, 24, and 20 clusters indataset one, H1, and H2, respectively. Based on MMM, NMI, and ARI, densityCut performedslightly worse on dataset one (MMM: 0.879 vs. 0.883, NMI: 0.878 vs. 0.900, and ARI: 0.857vs. 0.893), but performed better on H1 (MMM: 0.941 vs. 0.682, AMI: 0.935 vs. 0.833, andARI: 0.96 vs. 0.669) and H2 (MMM: 0.953 vs. 0.67, NMI: 0.945 vs. 0.829, and ARI: 0.977 vs.0.638). As for efficiency, densityCut was around two times faster than PhenoGraph based onthe current implementations (Fig. 4.15). Other clustering algorithms such as PAM, HC, NCut,and GMM are not scalable to these relatively large datasets. For example, we can only runOPTICS on the first dataset (OPTICS took about 17 minutes while densityCut took only 24seconds).4.4 Conclusions and discussionWe developed densityCut, a simple and efficient clustering algorithm. densityCut effectivelyclustered irregular shape synthetic benchmark datasets. We have successfully used densityCutto cluster variant allele frequencies of somatic mutations, single-cell gene expression data, andsingle-cell CyTOF data. densityCut is based on density estimation on graphs. It could beconsidered as a variation of the spectral clustering algorithms but is much more time- and1114.4. Conclusions and discussionllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllll llllllllllllllllllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll llllll−10 −5 0 5 10−10−50510t−SNE coordinate 1t−SNE coordinate 2Figure 4.13: Visualizing the mouse brain stem cell expression data by t-Distributed StochasticNeighbor Embedding (t-SNE). We can see four distinct clusters.space-efficient. Moreover, it automatically selects the number of clusters and works for thedatasets with a large number of clusters. In summary, densityCut does not make assumptionsabout the shape, size, and the number of clusters, and can be broadly applicable for exploratorydata analysis.A recent study has shown that current strategies for whole genome sequencing studies missedmany somatic mutations [80]. By increasing the sequencing depths from 30x in their originalstudy [53] to 300x and using a consensus of somatic single nucleotide variant (SNV) callers,the number of identified SNVs increased from 118 to 1343. Based on the 1343 SNVs, theyidentified two extra sub-clones [80]. Moreover, an additional 2500 SNVs were highly likely tobe genuine somatic SNVs but still without enough evidence even at 300x coverage. For morecomplex genomes such as melanoma and breast cancer genomes, the number of SNVs could bemuch larger. Therefore, efficient algorithms such as densityCut are necessary to infer the clonalstructures in individual tumours as more genomes are sequenced at higher coverage in the nearfuture.In recent years, single-cell techniques have empowered scientists to investigate cellular het-1124.4. Conclusions and discussioniMMM:  0.67NMI:  0.829ARI:  0.638MMM:  0.953NMI:  0.945ARI:  0.977fcMMM:  0.682NMI:  0.833ARI:  0.669hMMM:  0.941NMI:  0.935ARI:  0.96ebMMM:  0.883NMI:  0.9ARI:  0.893gMMM:  0.879NMI:  0.878ARI:  0.857dat−SNE twot−SNE oneFigure 4.14: Comparing densityCut and PhenoGraph in clustering single-cell mass cytometrydata. The original high-dimensional mass cytometry data were projected onto two dimensionalspaces by t-SNE just for visualization purpose. Cell types and clustering memberships of datapoints were colour coded. (a-c) The ground truth. (d-f) densityCut clustering results. (g-i)PhenoGraph clustering results.erogeneity. Computational tools are necessary to analyze these single-cell measurements withhigh dimensionality and large numbers of cells. Efficient algorithms such as densityCut whosecomputational complexities are independent of the dimensionality of data, and can cluster mil-lions of points in a few minutes can be valuable tools to process these datasets to distill singlecell biology.Current technology advances have made it possible to simultaneously measure a tumourfrom different angles, e.g., mutations, gene expression, and DNA methylation. Each biological1134.4. Conclusions and discussion050100150200250300Dataset one H1 H2CPU time (seconds)PhenoGraphdensityCut115 24300103 36 11Figure 4.15: Comparing the time used by PhenoGraph and densityCut in clustering the bench-mark CyTOF datasets. Although directly comparison is difficult since PhenoGraph is imple-mented in Python and densityCut is implemented in R, densityCut is around two times fasterthan PhenoGraph based on the current implementations.measurement provides a different view of the tumour. For complex diseases such as cancer, it isnecessary to combine multi-view datasets to provide a comprehensive view of these diseases. Afuture extension of densityCut is to integrate multi-view data for cancer patient stratification,i.e., putting patients into different groups such that the patients in the same group are similarto each other in molecular features. As densityCut works on a K-nearest neighbour graph, weneed to effectively construct a combined Knn graph by integrating different datasets.Currently densityCut simply clusters data points without probability information of a pointbelonging to a cluster. It is possible to a do a ‘soft’ assignment of a data point to clusters at theexpense of efficiency. For example, for a point if its neighbours in a graph have been assignedto different clusters, this point could be assigned to different clusters as well. On the contrary,if all the neighbouring points have the same label, the point could be assigned to the same labelwith high probability.114Chapter 5Conclusions“It is a far, far better thing that I do, than I have ever done;it is a far, far better rest that I go to than I have ever known.”– Charles Dickens, A Tale of Two Cities, 18595.1 Summary of contributionsThis dissertation contributes three computational methods in the analyses of cancer systemsbiology data.Extendable feature based discriminative classifiers to predict somatic SNVs fromtumour/normal DNA sequencing data To predict somatic SNVs from DNA sequencingdata is a challenging task because of the various errors introduced during sample preparation,library preparation, or the alignment of short reads to reference genome as well as platform-specific errors. As it is difficult to explicitly model the combinations of all sources of errors, wetherefore developed versatile and extendable feature based discriminative classifiers trained onvalidated somatic SNVs to predict somatic SNVs given paired tumour/normal genome sequenc-ing data. We further categorized the discriminative features in separating somatic SNVs fromnon-somatic SNVs. We finally discussed some systematic errors in high-throughput sequencingdata.A generative model-based approach to predicting the somatic mutations that impactgene expression To identify the small set of critical mutations that transform a normal cellto a cancer cell is crucial for revealing the biology behind the transformation and designingtherapies. We used a hierarchical Bayes approach to model the impacts of a mutation on geneexpression in a specific patient and the impacts of all the mutations in a gene across patientson gene expression. By analyzing the TCGA pan-cancer datasets from 12 types of cancer,we identified loss-of-function mutations in 65 genes correlated with expression down-regulation.1155.2. Conclusions and future workFurthermore, mutations in 150 genes correlated with pathway dysregulations. These genes werecandidate cancer driver genes for further biological functional studies.An efficient and versatile algorithm for clustering high-throughput systems biologydata In addition to sequencing the whole genomes, current technology advances have madeit possible to rapidly and accurately monitor cellular changes, even within a single-cell. Theresulting data could possibly be used to stratify patients into clinically similar subgroups, e.g.,with similar prognostic values. We developed an efficient and versatile clustering algorithm toput similar objects into the same group. We applied it to uncover the clonal structures in indi-vidual patients by clustering mutation variant allele frequencies and to reveal the cell populationstructures by clustering single-cell gene expression data and single-cell mass cytometry data.5.2 Conclusions and future workAdvances in high-throughput DNA sequencing technologies have revolutionized cancer genomestudies to detect somatic mutations in individual genomes. Additional computational algorithmshave proven to be useful to detect somatic mutations and interpret mutation functions [54].Cancer genome sequencing data typically have specific bias originating from tumour/normalcontamination, intratumour heterogeneity, and copy number alterations [82]. All these factorscould result in low variant allele frequencies, which are challenging for SNV prediction. There-fore, ideally tumours should be sequenced at a high coverage (e.g, 300x) to detect somaticSNVs [80]. Specific classifiers could be trained to detect low variant allele frequency SNVs.One major bias causing non-uniform coverage in high-throughput sequencing data is the PCRamplification bias, which could be removed by PCR-free sequencing [151]. As new generationsof DNA sequencing machines to produce longer reads uniformly covering the whole genome withhigher depth, our ability to detect SNVs will be greatly improved.Although our understanding of the cancer signalling pathways is still fragmented, the path-way knowledge has already been translated to clinical care of cancer patients. By revealingthe central role of the phosphoinositide 3-kinase (PI3K)/AKT/mTOR intracellular signallingpathway in regulating cell growth, many inhibitors have been developed to shut down this path-way in cancer. For example, clinical trials have shown the success of a PI3K inhibitor (whichinhibits the delta isoform of the PI3K protein - PI3Kδ) to treat B cell cancer [67]. Notice thatthe inhibitor targets wildtype PI3Kδ because it is not mutated but essential for B cell survive.1165.2. Conclusions and future workThe RTK/RAS/MAP-kinase signalling pathway receives extracellular signals from the cell sur-face and conveys the signals to the nucleus. Cancer cells frequently manipulate this pathwayto generate endogenous mitogenic signals. For example, melanoma patients frequently harboura hotspot missense mutation that changes the 600th amino acid valine (V600) of the proteinencoded by the BRAF gene. In clinical trials, the majority of these patients had partial orcomplete responses to BRAF inhibitors [60]. However, colon cancer patients with exactly thesame BRAF hotspot mutation did not respond to the same inhibitor. An RNAi knockdownscreen pinpoints a feedback activating the cell surface growth factor receptor EGFR by inhibit-ing BRAF specificly in colon cancer where EGFR is highly expressed [162]. The gain in EGFRexpression is also one of the reasons why some melanoma patients become resistance to BRAFinhibitors [198]. These results demonstrate that the molecular context plays an important rolefor the success of targeted therapies. Therefore, direct interpretation of the molecular con-text through systematic incorporation of gene expression data could be valuable for deliveringeffective targeted treatments to cancer patients.In contrast, glioblastoma and especially low-grade glioma patients frequently have a hotspotIDH1 R132 missense mutation [26, 156]. Since IDH1 mutation in diffuse glioma is an early eventin tumorigenesis and shared by all cancer cells [221], it could be an effective drug target as BCR-ABL in chronic myeloid leukemia. Since the discovery of IDH1 hotspot mutations in 2008 [156],we have only partially revealed how the mutant protein contributes to tumorigenesis [61, 200].IDH1 encodes an enzyme to catalyze a metabolic process of oxidative decarboxylation of isoci-trate to 2-oxoglutarate. IDH1 mutation results in the massive production of 2-hydroxyglutarate(2-HG), which inhibits DNA demethylation through the TET family enzymes [233]. Eight yearsafter detecting IDH1 mutations in glioblastoma, Flavahan et al provides a mechanistic explana-tion about how IDH1 mutations alter metabolism to manipulate DNA methylation and finallyto promote cell proliferation through PDGFRA up-regulation in glioblastoma: IDH1 mutantelevates methyl groups which prevent the isolator protein CTCF binding to DNA thus alters theorganization of DNA [61, 81]. Altered DNA structure could change gene expression, e.g., up-regulation of PDGFRA from gene enhancer-promoter binding (enhancer of gene FIP1L1 bindsto the promoter of PDGFRA). Interestedly, xseq found the correlation between IDH1 mutationsand PDGFRA, TET2 dysregulation from the STRING functional protein interaction network(Fig. 5.1). These results suggest that although detailed pathway information is not available,integrated analysis of multiple ‘omics’ datasets such as mutation, expression, and protein inter-1175.2. Conclusions and future workFigure 5.1: Genes connected to IDH1 and dysregualted in IDH1 mutated glioblastoma patients.The functional protein interaction network was from the STRING database.action network data could help reveal part of the cancer pathways, and thus generate hypothesesfor designing validation experiments.Current targeted treatments (various protein kinase inhibitors and immunotherapies) donot produce durable responses in all patients. These results reflect our partial or fragmentedunderstanding of the cancer cell signalling pathways. On the one hand, even for very aggressivemetastasized tumours such as melanoma (with the BRAF hotspot mutations), we already canstop cancer in its tracks (at least transiently), sometimes even by an agent to target a singlegene [213]. On the other hand, drug resistance [77] is almost inevitable for monotherapy. We arestill at the beginning of uncovering the details of how cancer cells metastasize to other parts ofthe human body to build secondary colonies [91, 222]. A cancer cell, no matter how aggressiveit is, is still a living cell and must have a set of proteins to drive basic cellular functions suchas metabolism, transcription, translation, and DNA repair. Therefore, a human cancer cell stillneeds to express around 2000 ‘essential’ genes to survive [24, 85, 218]. Now a problem is to findthe set of tumour specific ‘essential’ genes for targeted treatment. We think that this kind of‘Achilles’ heel’ exists for cancer as demonstrated by the success (at least transiently) of targetedtherapies, the universally detected mutated oncogenes or tumour suppressor genes in individualpatients, and the much large number of synthetic lethal interactions in cells. Two challengesremain: 1) Cancer cells can have a different set of tumour specific essential genes because ofintra-tumour heterogeneity [8]; therefore we need to find the essential ‘trunk essential genes’ andapply multiple drugs; 2) Cancer cell could evolve to resist targeted treatments thus we need to1185.2. Conclusions and future workmonitor cancer progression and use immunotherapies.In this dissertation, I focused on computational predictions without biological experimen-tal validations. The CRISPR-Cas (clustered regularly interspaced short palindromic repeats- CRISPR-associated) systems enable rapidly and precisely knockout of genes or introducingmutations [112, 177]. These systems coupled with other measurements such as gene expressionenable rapidly quantification of genetic perturbations on molecular phenotypes such as geneexpression, and therefore can be used for validating xseq predictions. The xseq model can beextended to other applications, e.g., analyzing combinations of events that impact expressionand synthetic lethalities.Clustering analysis has been widely used to analyze various ‘omics’ data such as mRNAgene expression data, miRNA gene expression data, protein expression data, and DNA methy-lation data. A good framework for clustering analysis is density-based clustering. Interestingly,biological systems seem to adopt strategies similar to non-parametric density estimation tocommunicate with each other. For example, a haploid a yeast cell produces chemicals calledpheromones [165]. A haploid α yeast cell senses the concentration of the pheromones and growstoward the a yeast cell for mating to produce a diploid yeast cell [12]. In other words, the αyeast cell moves (grows) along the pheromone gradient direction to reach the pheromone densitypeak. Therefore, grouping points to the basin of attraction of each density peak represents anatural choice for clustering analysis. Our densityCut algorithm has been proved to be effec-tive in different settings. However, it does have limitations. For example, for some datasets, theclustering results could be sensitive to the parameter setting when two clusters are ‘similar’ toeach other. This is a problem for many other clustering algorithms as well. An extension of thedensityCut algorithm is for integrated analysis of multiple ‘omic’ datasets.Oncologists and scientists have observed some patients with exceptional responses to treat-ments. For example, patients with inactivating mutations in TSC1 or TSC2 resulted in mTORpathway activation, and are sensitive to mTOR inhibitors [94, 129, 216]. Tumours with DNArepair deficiency tend to accumulate many more somatic mutations than DNA repair proficiencytumours, and some of the mutant proteins could be recognized by the human immune systemsfor destruction. Therefore, patients with DNA repair deficiency tumour tend to respond tocheckpoint blockade immunotherapies [118, 169, 192]. Since PI3Kα and pan-PI3K inhibitorsinduce DNA damage, triple negative breast cancer and high-grade serous ovary cancer patientswith nonfunctional TP53 (some also with nonfunctional BRCA1 ) tend to have long response1195.2. Conclusions and future workto the combinations of PARP inhibitors and PIK3 inhibitors. We hope that the current excep-tional ‘outlier’ responders with become common in the coming years of cancer patient care. Toachieve this goal, we need to develop efficient computation algorithms to detect all the genomicabnormalities, interpret these abnormalities in the context of other molecular features such asgene expression, DNA methylations, and gene interaction networks as well as tumour micro-environment, detect the dysregulated pathways, and cluster these patients into subgroups fortreatment.Biology is always an inspiration for developing computational algorithms. In Chapter 2, weformalized the SNV prediction problem as a supervised classification problem, and this problemcould act as a test case for developing efficient supervised machine learning algorithms for bigdatasets. (A whole human genome sequenced at 30x depth produces about 300Gb of compresseddata.) The xseq model presented in Chapter 3 is designed for a specific application. However,a probabilistic graphical model represents a family of distributions and is very flexible to beextended to analyze new problems by introducing new variables or changing the structure ofthe model. The densityCut algorithms can be readily applied for other clustering analysisapplications.In the field of cancer systems biology, we can increasingly access massive biological datasets.We need to analyze and interpret these data to discover biology and collaborate with wet-labscientists to validate the discoveries. Of particular importance is to develop statistical models andthe corresponding efficient inference algorithms for the integrative analyses of biological datasets,to understand the mechanisms of how the genetic and epigenetic alterations transform a normalcell into a cancer cell. This probabilistic modelling approach could explore the dependenciesbetween different measurements (e.g., mutations and gene expression) and provide informativedescriptions of the data. A second approach is to transform each dataset into a similarity matrixand combines these similarity matrices for further analyses, e.g., patient stratifications.The biology we can learn from the massive datasets largely depends on the efficiency andeffectiveness of our computational methods. With the newly learned biology, we will keepinventing new technologies for much easier and more accurately measuring biological signalsin large scale, as well as measuring new biological signals. Our computational models willkeep improving as new data are added and new biology is learned. Computational power willalso increase to alleviate the challenges to process these data. As more informative data areextracted and collected for individual patients, e.g., time course measurement of driver gene1205.2. Conclusions and future workmutations and protein biomarker expression, we can better monitor patient disease progressionand provide more effective treatments. Eventually, I hope one day, with the biology we learned,and the effective signals we measured and processed for individual patients, we can developmodels to recommend drugs to cancer patients the same way as Google recommends webpagesand Amazon recommends books to users.121Bibliography[1] Thomas Abeel, Thibault Helleputte, Yves Van de Peer, Pierre Dupont, and Yvan Saeys. Robust biomarkeridentification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26(3):392–398,2010. 28[2] Uri David Akavia, Oren Litvin, Jessica Kim, Felix Sanchez-Garcia, Dylan Kotliar, Helen C Causton, PanisaPochanard, Eyal Mozes, Levi A Garraway, and Dana Pe’er. An integrated approach to uncover drivers ofcancer. Cell, 143:1005–1017, 2010. 41, 69[3] Bruce Alberts, Alexander Johnson, Julian Lewis, David Morgan, Martin Raff, Keith Roberts, and PeterWalter. Molecular Biology of the Cell. Garland Science, 6 edition, 2014. 1, 5, 6[4] Ash A Alizadeh, Victoria Aranda, Alberto Bardelli, Cedric Blanpain, Christoph Bock, Christine Borowski,Carlos Caldas, Andrea Califano, Michael Doherty, Markus Elsner, et al. Toward understanding and ex-ploiting tumor heterogeneity. Nat. Med., 21(8):846–853, 2015. 2[5] Andre Altmann, Peter Weber, Carina Quast, Monika Rex-Haffner, Elisabeth B Binder, and Bertram Mu¨ller-Myhsok. vipR: variant identification in pooled DNA using R. Bioinformatics, 27(13):i77–i84, 2011. 18[6] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms:finding a match for a biomedical application. Brief. Bioinform., 10(3):297–314, 2009. 99[7] Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jo¨rg Sander. OPTICS: ordering points toidentify the clustering structure. In SIGMOD Rec., volume 28, pages 49–60. ACM, 1999. 98[8] Samuel Aparicio and Carlos Caldas. The implications of clonal genome evolution for cancer medicine. N.Engl. J. Med., 368(9):842–851, 2013. 1, 118[9] Peter Armitage and Richard Doll. The age distribution of cancer and a multi-stage theory of carcinogenesis.Br. J. Cancer, 8(1):1–12, 1954. 1[10] Miriam Ragle Aure, Israel Steinfeld, Lars Oliver Baumbusch, Knut Liestøl, Doron Lipson, Sandra Nyberg,Bjørn Naume, Kristine Kleivi Sahlberg, Vessela N Kristensen, Anne-Lise Børresen-Dale, et al. Identifyingin-trans process associated genes in breast cancer by integrated analysis of copy number and expressiondata. PloS one, 8(1):e53014, 2013. doi: 10.1371/journal.pone.0053014. 58[11] Jangsun Baek and Geoffrey J McLachlan. Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics, 27(9):1269–1276, 2011. 102[12] Lee Bardwell. A walk-through of the yeast mating pheromone response pathway. Peptides, 25(9):1465–1476,2004. 119[13] Ali Bashashati, Gavin Ha, Alicia Tone, Jiarui Ding, Leah M Prentice, Andrew Roth, Jamie Rosner, KareyShumansky, Steve Kalloger, Janine Senz, et al. Distinct evolutionary trajectories of primary high-gradeserous ovarian cancers revealed through spatial mutational profiling. J. Pathol., 231(1):21–34, 2013. 15, 16[14] Ali Bashashati, Gholamreza Haffari, Jiarui Ding, Gavin Ha, Kenneth Lui, Jamie Rosner, David G Hunts-man, Carlos Caldas, Samuel A Aparicio, and Sohrab P Shah. DriverNet: uncovering the impact of somaticdriver muta- tions on transcriptional networks in cancer. Genome Biol., 12(4):R41, 2012. doi: 10.1186/gb-2012-13-12-r124. 13, 15, 41[15] Sean C Bendall, Erin F Simonds, Peng Qiu, D Amir El-ad, Peter O Krutzik, Rachel Finck, Robert VBruggner, Rachel Melamed, Angelica Trejo, Olga I Ornatsky, et al. Single-cell mass cytometry of differentialimmune and drug responses across a human hematopoietic continuum. Science, 332(6030):687–696, 2011.5, 109[16] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approachto multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol., 57(1):289–300, 1995. 76[17] Yuval Benjamini and Terence P Speed. Summarizing and correcting the GC content bias in high-throughputsequencing. Nucleic Acids Res., 40(10):e72, 2012. doi: 10.1093/nar/gks001. 13122Bibliography[18] Bonnie Berger, Jian Peng, and Mona Singh. Computational solutions for omics data. Nat. Rev. Genet.,14(5):333–346, 2013. 6[19] Arindam Bhattacharjee, William G Richards, Jane Staunton, Cheng Li, Stefano Monti, Priya Vasa, Chris-tine Ladd, Javad Beheshti, Raphael Bueno, Michael Gillette, et al. Classification of human lung carcinomasby mrna expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA,98(24):13790–13795, 2001. 102[20] Christopher M. Bishop. Pattern recognition and machine learning. Springer, New York, 2006. 7, 11[21] Christopher M Bishop. Model-based machine learning. Phil. Trans. R. Soc. A, 371(1984), 2013. doi:10.1098/rsta.2012.0222. 7[22] David M Blei. Build, compute, critique, repeat: Data analysis with latent variable models. Annu. Rev.Stat. Appl., 1:203–232, 2014. 7[23] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. arXivpreprint arXiv:1601.00670, 2016. 11[24] Vincent A Blomen, Peter Ma´jek, Lucas T Jae, Johannes W Bigenzahn, Joppe Nieuwenhuis, JacquelineStaring, Roberto Sacco, Ferdy R van Diemen, Nadine Olk, Alexey Stukalov, et al. Gene essentiality andsynthetic lethality in haploid human cells. Science, 350(6264):1092–1096, 2015. 118[25] Ivana Bozic, Johannes G Reiter, Benjamin Allen, Tibor Antal, Krishnendu Chatterjee, Preya Shah, Yo SupMoon, Amin Yaqubie, Nicole Kelly, Dung T Le, et al. Evolutionary dynamics of cancer in response totargeted combination therapy. Elife, 2:e00747, 2013. doi: 10.7554/eLife.00747. 14[26] Daniel J Brat, RG Verhaak, Kenneth D Aldape, WK Yung, Sofie R Salama, LA Cooper, Esther Rheinbay,C Ryan Miller, Mark Vitucci, Olena Morozova, et al. Comprehensive, integrative genomic analysis of diffuselower-grade gliomas. N. Engl. J. Med., 372(26):2481–2498, 2015. 117[27] Leo Breiman. Statistical modeling: The two cultures. Stat. Sci., 16(3):199–231, 2001. 12[28] Andrea Califano. Cancer systems biology: The future. In John Mendelsohn, Peter M Howley, Mark AIsrael, Joe W Gray, and Craig B Thompson, editors, The Molecular Basis of Cancer, chapter 20, pages297–314. Elsevier, 4 edition, 2015. 4, 6[29] Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E Velculescu, Kenneth W Kinzler,Bert Vogelstein, and Rachel Karchin. Cancer-specific high-throughput annotation of somatic mutations:computational prediction of driver missense mutations. Cancer Res., 69(16):6660–6667, 2009. 13[30] Fong Chun Chan, Adele Telenius, Shannon Healy, Susana Ben-Neriah, Anja Mottok, Raymond Lim, MarieDrake, Sandy Hu, Jiarui Ding, Gavin Ha, et al. An RCOR1 loss–associated gene expression signatureidentifies a prognostically significant DLBCL subgroup. Blood, 125(6):959–966, 2015. 15[31] Hong Chang et al. Robust path-based spectral clustering. Pattern Recog., 41(1):191–203, 2008. 99[32] Michael A Chapman, Michael S Lawrence, Jonathan J Keats, Kristian Cibulskis, Carrie Sougnez, Anna CSchinzel, Christina L Harview, Jean-Philippe Brunet, Gregory J Ahmann, Mazhar Adli, et al. Initial genomesequencing and analysis of multiple myeloma. Nature, 471(7339):467–472, 2011. 18, 19[33] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In NIPS, pages343–351, 2010. 85[34] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell.,17(8):790–799, 1995. 84[35] Hugh A Chipman, Edward I George, and Robert E McCulloch. BART: Bayesian additive regression trees.Ann. Appl. Stat., 4(1):266–298, 2010. 23[36] Dondapati Chowdary, Jessica Lathrop, Joanne Skelton, Kathleen Curtin, Thomas Briggs, Yi Zhang, JackYu, Yixin Wang, and Abhijit Mazumder. Prognostic gene expression signatures can be measured in tissuescollected in rnalater preservative. J. Mol. Diagn., 8(1):31–39, 2006. 102[37] Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko, David Jaffe, Carrie Sougnez,Stacey Gabriel, Matthew Meyerson, Eric S Lander, and Gad Getz. Sensitive detection of somatic pointmutations in impure and heterogeneous cancer samples. Nat. Biotechnol., 31(3):213–219, 2013. 39[38] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEETrans. Pattern Anal. Mach. Intell., 24(5):603–619, 2002. 84[39] Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda, Mark J Dunning,Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, et al. The genomic and transcriptomicarchitecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012. 41, 68[40] Matteo D’Antonio and Francesca D Ciccarelli. Integrated analysis of recurrent properties of cancer genesto identify novel drivers. Genome Biol., 14(5):R52, 2013. doi: 10.1186/gb-2013-14-5-r52. 56123Bibliography[41] Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-nn density and mode estimation. In NIPS,pages 2555–2563, 2014. 94[42] Teresa Davoli, Andrew Wei Xu, Kristen E Mengwasser, Laura M Sack, John C Yoon, Peter J Park, andStephen J Elledge. Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shapethe cancer genome. Cell, 155(4):948–962, 2013. 71[43] Marcilio CP de Souto, Ivan G Costa, Daniel SA de Araujo, Teresa B Ludermir, and Alexander Schliep.Clustering cancer gene expression data: a comparative study. BMC bioinform., 9:497. doi: 10.1186/1471-2105-9-497. 103[44] Nathan D Dees, Qunyuan Zhang, Cyriac Kandoth, Michael C Wendl, William Schierding, Daniel C Koboldt,Thomas B Mooney, Matthew B Callaway, David Dooling, Elaine R Mardis, et al. MuSiC: identifyingmutational significance in cancer genomes. Genome Res., 22(8):1589–1598, 2012. 13, 71[45] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data viathe EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol., 39(1):1–38, 1977. 45, 49[46] Jiarui Ding, Ali Bashashati, Andrew Roth, Arusha Oloumi, Kane Tse, Thomas Zeng, Gholamreza Haffari,Martin Hirst, Marco A Marra, Anne Condon, et al. Feature-based classifiers for somatic mutation detectionin tumour–normal paired sequencing data. Bioinformatics, 28(2):167–175, 2012. iii, 13, 15[47] Jiarui Ding, Melissa K McConechy, Hugo M Horlings, Gavin Ha, Fong Chun Chan, Tyler Funnell, Sarah CMullaly, Ju¨ri Reimand, Ali Bashashati, Gary D Bader, et al. Systematic analysis of somatic mutationsimpacting gene expression in 12 tumour types. Nat. Commun., 6:8554, 2015. doi: 10.1038/ncomms9554.iii, 15, 69[48] Jiarui Ding and Sohrab Shah. A robust hidden semi–Markov model with application to aCGH data pro-cessing. Int. J. Data Min. Bioinform., 8(4):427–442, 2013. 15[49] Jiarui Ding, Sohrab Shah, and Anne Condon. densitycut: an efficient and versatile topological approachfor automatic clustering of biological data. Bioinformatics, 2016. doi: 10.1093/bioinformatics/btw227. iii,15[50] Jiarui Ding and Sohrab P Shah. Robust hidden semi-Markov modeling of array CGH data. In IEEE BIBM,pages 603–608. IEEE, 2010. 15[51] Li Ding, Matthew J Ellis, Shunqiang Li, David E Larson, Ken Chen, John W Wallis, Christopher C Harris,Michael D McLellan, Robert S Fulton, Lucinda L Fulton, et al. Genome remodelling in a basal-like breastcancer metastasis and xenograft. Nature, 464(7291):999–1005, 2010. 18[52] Li Ding, Minjung Kim, Krishna L Kanchi, Nathan D Dees, Charles Lu, Malachi Griffith, David Fensterma-cher, Hyeran Sung, Christopher A Miller, Brian Goetz, et al. Clonal architectures and driver mutations inmetastatic melanomas. PLoS ONE, 9(11):e111153, 2014. doi: 10.1371/journal.pone.0111153. 106[53] Li Ding, Timothy J Ley, David E Larson, Christopher A Miller, Daniel C Koboldt, John S Welch, Julie KRitchey, Margaret A Young, Tamara Lamprecht, Michael D McLellan, et al. Clonal evolution in relapsedacute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–510, 2012. 13, 104,112[54] Li Ding, Michael C Wendl, Joshua F McMichael, and Benjamin J Raphael. Expanding the computationaltoolbox for mining cancer genomes. Nat. Rev. Genet., 15(8):556–570, 2014. 2, 6, 116[55] Brian J Druker, Franc¸ois Guilhot, Stephen G O’Brien, Insa Gathmann, Hagop Kantarjian, Norbert Gat-termann, Michael WN Deininger, Richard T Silver, John M Goldman, Richard M Stone, et al. Five-yearfollow-up of patients receiving imatinib for chronic myeloid leukemia. N. Engl. J. Med., 355(23):2408–2417,2006. 3[56] EK Engle, DAC Fisher, CA Miller, MD McLellan, RS Fulton, DM Moore, RK Wilson, TJ Ley, and ST Oh.Clonal evolution revealed by whole genome sequencing in a case of primary myelofibrosis transformed tosecondary acute myeloid leukemia. Leukemia, 29(4):869–876, 2015. 103, 104, 105[57] Martin Ester, Hans-Peter Kriegel, Jo¨rg Sander, and Xiaowei Xu. A density-based algorithm for discoveringclusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996. 85, 98, 104[58] Adam D Ewing, Kathleen E Houlahan, Yin Hu, Kyle Ellrott, Cristian Caloian, Takafumi N Yamaguchi,J Christopher Bare, Christine P’ng, Daryl Waggott, Veronica Y Sabelnykova, et al. Combining tumorgenome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Naturemethods, 12(7):623–630, 2015. 39[59] H Christina Fan, Glenn K Fu, and Stephen PA Fodor. Combinatorial labeling of single cells for geneexpression cytometry. Science, 347(6222):628–637, 2015. 5124Bibliography[60] Keith T Flaherty, Igor Puzanov, Kevin B Kim, Antoni Ribas, Grant A McArthur, Jeffrey A Sosman, Peter JO’Dwyer, Richard J Lee, Joseph F Grippo, Keith Nolop, et al. Inhibition of mutated, activated BRAF inmetastatic melanoma. N. Engl. J. Med., 363(9):809–819, 2010. 117[61] William A Flavahan, Yotam Drier, Brian B Liau, Shawn M Gillespie, Andrew S Venteicher, Anat OStemmer-Rachamimov, Mario L Suva`, and Bradley E Bernstein. Insulator dysfunction and oncogene acti-vation in IDH mutant gliomas. Nature, 529(7584):110–114, 2016. 117[62] Simon A Forbes, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia,Rebecca Shepherd, Kenric Leung, Andrew Menzies, et al. COSMIC: mining complete cancer genomes inthe catalogue of somatic mutations in cancer. Nucleic Acids Res., 39(suppl 1):D945–D950, 2011. 56[63] Chris Fraley and Adrian E Raftery. Model-based methods of classification: using the mclust software inchemometrics. J. Stat. Softw., 18(6):1–13, 2007. 36, 84, 99, 104[64] Andrea Franceschini, Damian Szklarczyk, Sune Frankild, Michael Kuhn, Milan Simonovic, Alexander Roth,Jianyi Lin, Pablo Minguez, Peer Bork, Christian von Mering, et al. STRING v9.1: protein-protein inter-action networks, with increased coverage and integration. Nucleic Acids Res., 41(D1):D808–D815, 2013. 6,54[65] Pasi Fra¨nti and Olli Virmajoki. Iterative shrinking method for clustering problems. Pattern Recog.,39(5):761–775, 2006. 99[66] Ana L.N. Fred and Anil K Jain. Robust data clustering. In CVPR, volume 2, pages 128–133, 2003. 98[67] David A Fruman and Lewis C Cantley. Idelalisiba PI3Kδ inhibitor for B-cell cancers. N. Engl. J. Med.,370(11):1061–1062, 2014. 116[68] Limin Fu and Enzo Medico. FLAME, a novel fuzzy clustering method for the analysis of dna microarraydata. BMC Bioinform., 8:3, 2007. doi: 10.1186/1471-2105-8-3. 86, 87, 99[69] Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, withapplications in pattern recognition. IEEE Trans. Inf. Theory, 21(1):32–40, 1975. 84[70] P Andrew Futreal, Lachlan Coin, Mhairi Marshall, Thomas Down, Timothy Hubbard, Richard Wooster,Nazneen Rahman, and Michael R Stratton. A census of human cancer genes. Nat. Rev. Cancer, 4(3):177–183, 2004. 2, 54, 71, 73[71] Levi A Garraway and Eric S Lander. Lessons from the cancer genome. Cell, 153(1):17–37, 2013. 4[72] Mark B Gerstein, Anshul Kundaje, Manoj Hariharan, Stephen G Landt, Koon-Kiu Yan, Chao Cheng,Xinmeng Jasmine Mu, Ekta Khurana, Joel Rozowsky, Roger Alexander, et al. Architecture of the humanregulatory network derived from ENCODE data. Nature, 489(7414):91–100, 2012. 54[73] Moritz Gerstung, Christian Beisel, Markus Rechsteiner, Peter Wild, Peter Schraml, Holger Moch, and NikoBeerenwinkel. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat.Commun., 3:811, 2012. doi: 10.1038/ncomms1814. 39[74] Olivier Gevaert, Victor Villalobos, Branimir I Sikic, and Sylvia K Plevritis. Identification of ovarian cancerdriver genes by using module network integration of multi-omics data. Interface Focus, 3(4), 2013. doi:10.1098/rsfs.2013.0013. 58[75] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459,2015. 6[76] Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas. Clustering aggregation. ACM Trans. Knowl.Discov. Data, 1(1), 2007. 99[77] Michael S Glickman and Charles L Sawyers. Converting cancer therapies into cures: lessons from infectiousdiseases. Cell, 148(6):1089–1098, 2012. 4, 118[78] Rodrigo Goya, Mark GF Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimberley C Wiegand, JanineSenz, Anamaria Crisan, Marco A Marra, Martin Hirst, et al. SNVMix: predicting single nucleotide variantsfrom next-generation sequencing of tumors. Bioinformatics, 26(6):730–736, 2010. 18[79] Mel Greaves and Carlo C Maley. Clonal evolution in cancer. Nature, 481(7381):306–313, 2012. 1[80] Malachi Griffith, Christopher A Miller, Obi L Griffith, Kilannin Krysiak, Zachary L Skidmore, AvinashRamu, Jason R Walker, Ha X Dang, Lee Trani, David E Larson, et al. Optimizing cancer genome sequencingand analysis. Cell Systems, 1(3):210–223, 2015. 13, 39, 112, 116[81] Matthew R Grimmer and Joseph F Costello. Cancer: Oncogene brought into the loop. Nature, 529(7584):34–35, 2016. 117[82] Gavin Ha, Andrew Roth, Jaswinder Khattra, Julie Ho, Damian Yap, Leah M Prentice, Nataliya Melnyk,Andrew McPherson, Ali Bashashati, Emma Laks, et al. Titan: inference of copy number architectures inclonal cell populations from tumor whole-genome sequence data. Genome Res., 24(11):1881–1893, 2014.116125Bibliography[83] Gavin Ha, Andrew Roth, Daniel Lai, Ali Bashashati, Jiarui Ding, Rodrigo Goya, Ryan Giuliany, JamieRosner, Arusha Oloumi, Karey Shumansky, et al. Integrative analysis of genome-wide loss of heterozygosityand monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breastcancer. Genome Res., 22(10):1995–2007, 2012. 2[84] Douglas Hanahan and Robert A Weinberg. Hallmarks of cancer: the next generation. cell, 144(5):646–674,2011. 1[85] Traver Hart, Megha Chandrashekhar, Michael Aregger, Zachary Steinhart, Kevin R Brown, GrahamMacLeod, Monika Mis, Michal Zimmermann, Amelie Fradet-Turcotte, Song Sun, et al. High-resolutionCRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell, 163(6):1515–1526, 2015.118[86] John A Hartigan. Clustering algorithms. Wiley, New York, 1975. 84[87] PM Hartigan. Algorithm as 217: Computation of the dip statistic to test for unimodality. J. R. Stat. Soc.Ser. C Appl. Stat., 34(3):320–325, 1985. 36[88] Leland Hartwell, Leroy Hood, Michael Goldberg, Ann Reynolds, and Lee Silver. Genetics: from genes togenomes. McGraw-Hill Education, New York, 4 edition, 2011. 4[89] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining,inference, and prediction. Springer Verlag, 2 edition, 2009. 23[90] James R Heath, Antoni Ribas, and Paul S Mischel. Single-cell analysis tools for drug discovery and devel-opment. Nat. Rev. Drug Discov., 15(3):204–216, 2016. 5[91] Ayuko Hoshino, Bruno Costa-Silva, Tang-Long Shen, Goncalo Rodrigues, Ayako Hashimoto, Milica TesicMark, Henrik Molina, Shinji Kohsaka, Angela Di Giannatale, Sophia Ceder, et al. Tumour exosome integrinsdetermine organotropic metastasis. Nature, 527(7578):329–335, 2015. 118[92] Norman Huang, Parantu K Shah, and Cheng Li. Lessons from a decade of integrating cancer copy numberalterations with gene expression profiles. Brief. Bioinform., 13(3):305–316, 2012. 58[93] Lawrence Hubert and Phipps Arabie. Comparing partitions. J. classif., 2(1):193–218, 1985. 98[94] Gopa Iyer, Aphrothiti J Hanrahan, Matthew I Milowsky, Hikmat Al-Ahmadie, Sasinya N Scott, ManickamJanakiraman, Mono Pirun, Chris Sander, Nicholas D Socci, Irina Ostrovnaya, et al. Genome sequencingidentifies a basis for everolimus sensitivity. Science, 338(6104):221–221, 2012. 119[95] Anil K Jain et al. Data clustering: a users dilemma. In Pattern Recog. Mach. Intell., volume 3776 of LNCS,pages 1–10. Springer-Verlag, 2005. 99[96] C Jiang, Zhenyu Xuan, Fang Zhao, and Michael Q Zhang. TRED: a transcriptional regulatory elementdatabase, new entries and other development. Nucleic Acids Res., 35(suppl 1):D137–D140, 2007. 6[97] Rebecka Jo¨rnsten, Tobias Abenius, Teresia Kling, Linne´a Schmidt, Erik Johansson, Torbjo¨rn EM Nordling,Bodil Nordlander, Chris Sander, Peter Gennemark, Keiko Funa, et al. Network modeling of the tran-scriptional effects of copy number aberrations in glioblastoma. Mol. Syst. Biol., 7(1):486, 2011. doi:10.1038/msb.2011.17. 41[98] Cyriac Kandoth, Michael D McLellan, Fabio Vandin, Kai Ye, Beifang Niu, Charles Lu, Mingchao Xie, Qun-yuan Zhang, Joshua F McMichael, Matthew A Wyczalkowski, et al. Mutational landscape and significanceacross 12 major cancer types. Nature, 502(7471):333–339, 2013. 55, 71, 75, 76[99] Minoru Kanehisa and Susumu Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res.,28(1):27–30, 2000. 6, 54[100] Peter D Karp, Christos A Ouzounis, Caroline Moore-Kochlacs, Leon Goldovsky, Pallavi Kaipa, Dag Ahre´n,Sophia Tsoka, Nikos Darzentas, Victor Kunin, and Nu´ria Lo´pez-Bigas. Expansion of the BioCyc collectionof pathway/genome databases to 160 genomes. Nucleic Acids Res., 33(19):6083–6089, 2005. 6, 54[101] Jin H Kim and Judea Pearl. A computational model for causal and diagnostic reasoning in inferencesystems. In Proceedings of the Eighth international joint conference on Artificial intelligence-Volume 1,pages 190–193. Morgan Kaufmann Publishers Inc., 1983. 46[102] PDW Kirk, AC Babtie, and MPH Stumpf. Systems biology (un) certainties. Science, 350(6259):386–388,2015. 4[103] Allon M Klein, Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid Peshkin,David A Weitz, and Marc W Kirschner. Droplet barcoding for single-cell transcriptomics applied to em-bryonic stem cells. Cell, 161(5):1187–1201, 2015. 5[104] Susumu Kobayashi, Titus J Boggon, Tajhal Dayaram, Pasi A Ja¨nne, Olivier Kocher, Matthew Meyerson,Bruce E Johnson, Michael J Eck, Daniel G Tenen, and Bala´zs Halmos. EGFR mutation and resistance ofnon–small-cell lung cancer to gefitinib. N. Engl. J. Med., 352(8):786–792, 2005. 81126Bibliography[105] Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George MWeinstock, Richard K Wilson, and Li Ding. VarScan: variant detection in massively parallel sequencing ofindividual and pooled samples. Bioinformatics, 25(17):2283–2285, 2009. 18[106] Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D McLellan, Ling Lin, Christo-pher A Miller, Elaine R Mardis, Li Ding, and Richard K Wilson. VarScan 2: somatic mutation and copynumber alteration discovery in cancer by exome sequencing. Genome Res., 22(3):568–576, 2012. 39[107] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press,2009. 7[108] Warren LG Koontz, Patrenahalli M Narendra, and Keinosuke Fukunaga. A graph-theoretic approach tononparametric cluster analysis. IEEE Trans. Comput., 100(9):936–944, 1976. 84, 90[109] Samory Kpotufe and Ulrike V Luxburg. Pruning nearest neighbor cluster trees. In ICML, pages 225–232,2011. 85[110] Hans-Peter Kriegel, Peer Kro¨ger, Jo¨rg Sander, and Arthur Zimek. Density-based clustering. Wiley Inter-discip. Rev. Data Min. Knowl. Discov., 1(3):231–240, 2011. 102[111] Zhongwu Lai, Aleksandra Markovets, Miika Ahdesmaki, and Justin Johnson. Vardict: a novel and versatilevariant caller for next-generation sequencing in cancer research. Cancer Res., 75(15 Supplement):4864–4864,2015. 39[112] Eric S Lander. The heroes of CRISPR. Cell, 164(1):18–28, 2016. 119[113] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4):357–359, 2012. 2[114] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignmentof short DNA sequences to the human genome. Genome biol, 10(3):R25, 2009. doi: 10.1186/gb-2009-10-3-r25. 2[115] David E Larson, Christopher C Harris, Ken Chen, Daniel C Koboldt, Travis E Abbott, David J Dooling,Timothy J Ley, Elaine R Mardis, Richard K Wilson, and Li Ding. SomaticSniper: identification of somaticpoint mutations in whole genome sequencing data. Bioinformatics, 28(3):311–317, 2012. 39[116] Michael S Lawrence, Petar Stojanov, Craig H Mermel, James T Robinson, Levi A Garraway, Todd R Golub,Matthew Meyerson, Stacey B Gabriel, Eric S Lander, and Gad Getz. Discovery and saturation analysis ofcancer genes across 21 tumour types. Nature, 505:495–501, 2014. 4, 54, 55, 71, 80[117] Michael S Lawrence, Petar Stojanov, Paz Polak, Gregory V Kryukov, Kristian Cibulskis, AndreySivachenko, Scott L Carter, Chip Stewart, Craig H Mermel, Steven A Roberts, et al. Mutational het-erogeneity in cancer and the search for new cancer-associated genes. Nature, 499:214–218, 2013. 13, 40, 56,73[118] Dung T Le, Jennifer N Uram, Hao Wang, Bjarne R Bartlett, Holly Kemberling, Aleksandra D Eyring,Andrew D Skora, Brandon S Luber, Nilofer S Azad, Dan Laheru, et al. PD-1 blockade in tumors withmismatch-repair deficiency. N. Engl. J. Med., 372(26):2509–2520, 2015. 14, 119[119] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 12[120] Jacob H Levine et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlatewith prognosis. Cell, 162(1):184–197, 2015. 84, 109, 111[121] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform.Bioinformatics, 25(14):1754–1760, 2009. 2[122] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform.Bioinformatics, 26(5):589–595, 2010. 2[123] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, GoncaloAbecasis, Richard Durbin, et al. The sequence alignment/map format and SAMtools. Bioinformatics,25(16):2078–2079, 2009. 18[124] Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen, and Jun Wang.SNP detection for massively parallel whole-genome resequencing. Genome Res., 19(6):1124–1132, 2009. 18[125] Frank Lin and William W Cohen. Power iteration clustering. In ICML, pages 655–662, 2010. 85, 86[126] Enric Llorens-Bobadilla et al. Single-cell transcriptomics reveals a population of dormant neural stem cellsthat become activated upon brain injury. Cell Stem Cell, 17(3):329–340, 2015. 107, 108, 109[127] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press,2003. 11127Bibliography[128] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, ItayTirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide ex-pression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015. 5[129] Vivien Marx. Cancer: A most exceptional response. Nature, 520(7547):389–393, 2015. 119[130] David L Masica and Rachel Karchin. Correlation of somatic mutation and expression identifies genesimportant in human glioblastoma progression and survival. Cancer Res., 71(13):4550–4561, 2011. 13, 41[131] Anthony Mathelier, Calvin Lefebvre, Allen W Zhang, David J Arenillas, Jiarui Ding, Wyeth W Wasserman,and Sohrab P Shah. Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas.Genome Biol., 16:84, 2015. doi: 10.1186/s13059-015-0648-7. 15[132] Lisa Matthews, Gopal Gopinath, Marc Gillespie, Michael Caudy, David Croft, Bernard de Bono, PhaniGarapati, Jill Hemish, Henning Hermjakob, Bijay Jassal, et al. Reactome knowledgebase of human biologicalpathways and processes. Nucleic Acids Res., 37(suppl 1):D619–D622, 2009. 6[133] Melissa K McConechy, Michael S Anglesio, Steve E Kalloger, Winnie Yang, Janine Senz, Christine Chow,Alireza Heravi-Moussavi, Gregg B Morin, Anne-Marie Mes-Masson, Mark S Carey, et al. Subtype-specificmutation of PPP2R1A in endometrial and ovarian carcinomas. J. Pathol., 223(5):567–573, 2011. 18[134] Melissa K McConechy, Jiarui Ding, Maggie CU Cheang, Kimberly C Wiegand, Janine Senz, Alicia A Tone,Winnie Yang, Leah M Prentice, Kane Tse, Thomas Zeng, et al. Use of mutation profiles to refine theclassification of endometrial carcinomas. J. Pathol., 228(1):20–30, 2012. 15, 16[135] Melissa K McConechy, Jiarui Ding, Janine Senz, Winnie Yang, Nataliya Melnyk, Alicia A Tone, Leah MPrentice, Kimberly C Wiegand, Jessica N McAlpine, Sohrab P Shah, et al. Ovarian and endometrialendometrioid carcinomas have distinct CTNNB1 and PTEN mutation profiles. Mod. Pathol., 27(1):128–134, 2013. 15, 16[136] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky,Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The Genome Analysis Toolkit: AMapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20(9):1297–1303, 2010. 18[137] Roger McLendon, Allan Friedman, Darrell Bigner, Erwin G Van Meir, Daniel J Brat, Gena M Mastro-gianakis, Jeffrey J Olson, Tom Mikkelsen, Norman Lehman, Ken Aldape, et al. Comprehensive genomiccharacterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–1068, 2008.2[138] Donal P McLornan, Alan List, and Ghulam J Mufti. Applying synthetic lethality for the selective targetingof cancer. N. Engl. J. Med., 371(18):1725–1735, 2014. 14[139] Frazer Meacham, Dario Boffelli, Joseph Dhahbi, David IK Martin, Meromit Singer, and Lior Pachter.Identification and correction of systematic error in high-throughput sequence data. BMC Bioinform., 12:451,2011. doi: 10.1186/1471-2105-12-451. 23, 36[140] Marina Meila˘. Comparing clusterings–an information based distance. J. Multivar. Anal., 98(5):873–895,2007. 95[141] Giovanna Menardi and Adelchi Azzalini. An advancement in clustering via nonparametric density estima-tion. Stat. Comp., 24(5):753–767, 2013. 85[142] Craig H Mermel, Steven E Schumacher, Barbara Hill, Matthew L Meyerson, Rameen Beroukhim, Gad Getz,et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-numberalteration in human cancers. Genome Biol, 12(4):R41, 2011. doi: 10.1186/gb-2011-12-4-r41. 13, 60[143] Matthew Meyerson, Stacey Gabriel, and Gad Getz. Advances in understanding cancer genomes throughsecond-generation sequencing. Nat. Rev. Genet., 11(10):685–696, 2010. 2, 3[144] Christopher A Miller et al. SciClone: Inferring clonal architecture and tracking the spatial and temporalpatterns of tumor evolution. PLoS Comput. Biol., 10(8):e1003665, 2014. doi: 10.1371/journal.pcbi.1003665.103, 104[145] Qianxing Mo, Sijian Wang, Venkatraman E Seshan, Adam B Olshen, Nikolaus Schultz, Chris Sander,R Scott Powers, Marc Ladanyi, and Ronglai Shen. Pattern discovery and cancer gene identification inintegrated cancer genomic data. Proc. Natl Acad. Sci. USA, 110(11):4245–4250, 2013. 58[146] Ryan D Morin, Nathalie A Johnson, Tesa M Severson, Andrew J Mungall, Jianghong An, Rodrigo Goya,Jessica E Paul, Merrill Boyle, Bruce W Woolcock, Florian Kuchenbauer, et al. Somatic mutations alteringEZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nat. Genet.,42(2):181–185, 2010. 18128Bibliography[147] Ryan D Morin, Maria Mendez-Lago, Andrew J Mungall, Rodrigo Goya, Karen L Mungall, Richard DCorbett, Nathalie A Johnson, Tesa M Severson, Readman Chiu, Matthew Field, et al. Frequent mutationof histone-modifying genes in non-hodgkin lymphoma. Nature, 476(7360):298–303, 2011. 18[148] Ryan D Morin, Karen Mungall, Erin Pleasance, Andrew J Mungall, Rodrigo Goya, Ryan D Huff, David WScott, Jiarui Ding, Andrew Roth, Readman Chiu, et al. Mutational and structural analysis of diffuse largeB-cell lymphoma using whole-genome sequencing. Blood, 122(7):1256–1265, 2013. 15[149] David M Mount. ANN programming manual. Technical report, Dept. of Computer Science, U. of Maryland,1998. 93[150] Kevin P. Murphy. Machine learning: a probabilistic perspective. MIT press, Cambridge, 2012. 7, 11, 44[151] Masao Nagasaki, Jun Yasuda, Fumiki Katsuoka, Naoki Nariai, Kaname Kojima, Yosuke Kawai, YumiYamaguchi-Kabata, Junji Yokozawa, Inaho Danjoh, Sakae Saito, et al. Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nat. Commun., 6:8018, 2015. doi: 10.1038/ncomms9018.116[152] Eszter Nagy and Lynne E Maquat. A rule for termination-codon position within intron-containing genes:when nonsense affects rna abundance. Trends Biochem. Sci., 23(6):198–199, 1998. 41[153] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. NIPS,2:849–856, 2002. 85[154] Peter C Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, 1976. 1[155] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bring-ing order to the web. Technical report, Stanford University, 1999. 89[156] D Williams Parsons, Siaˆn Jones, Xiaosong Zhang, Jimmy Cheng-Ho Lin, Rebecca J Leary, Philipp An-genendt, Parminder Mankoo, Hannah Carter, I-Mei Siu, Gary L Gallia, et al. An integrated genomicanalysis of human glioblastoma multiforme. Science, 321(5897):1807–1812, 2008. 117[157] Anoop P Patel, Itay Tirosh, John J Trombetta, Alex K Shalek, Shawn M Gillespie, Hiroaki Wakimoto,Daniel P Cahill, Brian V Nahed, William T Curry, Robert L Martuza, et al. Single-cell RNA-seq highlightsintratumoral heterogeneity in primary glioblastoma. Science, 344(6190):1396–1401, 2014. 81[158] Judea Pearl. Reverend bayes on inference engines: A distributed hierarchical approach. In Proceedings ofthe second national conference on artificial intelligence, pages 133–136. AAAI Press, 1982. 45[159] Dana Pe’er and Nir Hacohen. Principles and strategies for developing network models in cancer. Cell,144(6):864–873, 2011. 4[160] Alexander R Pico, Thomas Kelder, Martijn P Van Iersel, Kristina Hanspers, Bruce R Conklin, and ChrisEvelo. WikiPathways: pathway editing for the people. PLoS Biol., 6(7):e184, 2008. doi: 10.1371/jour-nal.pbio.0060184. 54[161] Alex A Pollen, Tomasz J Nowakowski, Joe Shuga, Xiaohui Wang, Anne A Leyrat, Jan H Lui, NianzhenLi, Lukasz Szpankowski, Brian Fowler, Peilin Chen, et al. Low-coverage single-cell mrna sequencing revealscellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol.,32(10):1053–1058, 2014. 84, 107, 109[162] Anirudh Prahallad, Chong Sun, Sidong Huang, Federica Di Nicolantonio, Ramon Salazar, Davide Zecchin,Roderick L Beijersbergen, Alberto Bardelli, and Rene´ Bernards. Unresponsiveness of colon cancer to BRAF(V600E) inhibition through feedback activation of EGFR. Nature, 483(7388):100–103, 2012. 81, 117[163] Xose S Puente, Magda Pinyol, Vı´ctor Quesada, Laura Conde, Gonzalo R Ordo´n˜ez, Neus Villamor, GeorgiaEscaramis, Pedro Jares, S´ılvia Bea`, Marcos Gonza´lez-Dı´az, et al. Whole-genome sequencing identifiesrecurrent mutations in chronic lymphocytic leukaemia. Nature, 475(7354):101–105, 2011. 18[164] Benjamin J Raphael, Jason R Dobson, Layla Oesper, and Fabio Vandin. Identifying driver mutations insequenced cancer genomes: computational approaches to enable precision medicine. Genome Med., 6(1):5,2014. doi: 10.1186/gm524. 13, 41[165] Jane Reece, Lisa A Urry, Noel Meyers, Michael L Cain, Steven A Wasserman, Peter V Minorsky, Robert BJackson, and Bernard N Cooke. Campbell biology. Pearson Higher Education AU, 10 edition, 2013. 119[166] Ju¨ri Reimand and Gary D Bader. Systematic analysis of somatic mutations in phosphorylation signalingpredicts novel cancer drivers. Mol. Syst. Biol., 9(1):637, 2013. doi: 10.1038/msb.2012.68. 71, 74[167] Ju¨ri Reimand, Omar Wagih, and Gary D Bader. The mutational landscape of phosphorylation signaling incancer. Sci. Rep., 3:2651, 2013. doi: 10.1038/srep02651. 74[168] B. Reva, Y. Antipin, and C. Sander. Predicting the functional impact of protein mutations: application tocancer genomics. Nucleic Acids Res., 39(17):e118, 2011. doi: 10.1093/nar/gkr407. 13, 40129Bibliography[169] Naiyer A Rizvi, Matthew D Hellmann, Alexandra Snyder, Pia Kvistborg, Vladimir Makarov, Jonathan JHavel, William Lee, Jianda Yuan, Phillip Wong, Teresa S Ho, et al. Mutational landscape determinessensitivity to PD-1 blockade in non–small cell lung cancer. Science, 348(6230):124–128, 2015. 14, 119[170] Kimberly Robasky, Nathan E Lewis, and George M Church. The role of replicates for error mitigation innext-generation sequencing. Nat. Rev. Genet., 15(1):56–62, 2014. 13[171] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science,344(6191):1492–1496, 2014. 84, 98[172] Thomas Rolland, Murat Tas¸an, Benoit Charloteaux, Samuel J Pevzner, Quan Zhong, Nidhi Sahni, Song Yi,Irma Lemmens, Celia Fontanillo, Roberto Mosca, et al. A proteome-scale map of the human interactomenetwork. Cell, 159(5):1212–1226, 2014. 6, 74[173] Michael G Ross, Carsten Russ, Maura Costello, Andrew Hollinger, Niall J Lennon, Ryan Hegarty, ChadNusbaum, and David B Jaffe. Characterizing and measuring bias in sequence data. Genome Biol, 14:R51,2013. doi: 10.1186/gb-2013-14-5-r51. 13[174] Andrew Roth, Jiarui Ding, Ryan Morin, Anamaria Crisan, Gavin Ha, Ryan Giuliany, Ali Bashashati,Martin Hirst, Gulisa Turashvili, Arusha Oloumi, et al. JointSNVMix: a probabilistic model for accuratedetection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics,28(7):907–913, 2012. 39[175] Andrew Roth, Jaswinder Khattra, Damian Yap, Adrian Wan, Emma Laks, Justina Biele, Gavin Ha, SamuelAparicio, Alexandre Bouchard-Coˆte´, and Sohrab P Shah. Pyclone: statistical inference of clonal populationstructure in cancer. Nat. Methods, 11(4):396–398, 2014. 107[176] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Prentice Hall, 3 edition,2010. 6, 9[177] Francisco J Sa´nchez-Rivera and Tyler Jacks. Applications of the CRISPR-Cas9 system in cancer biology.Nature Reviews Cancer, 15:387–395, 2015. 119[178] Jo¨rg Sander, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky. Automatic extraction of clusters fromhierarchical clustering representations. In Advances in knowledge discovery and data mining, pages 75–87.Springer, 2003. 99[179] Christopher T Saunders, Wendy SW Wong, Sajani Swamy, Jennifer Becq, Lisa J Murray, and R KeiraCheetham. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs.Bioinformatics, 28(14):1811–1817, 2012. 39[180] Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey Buchoff, Matthew Day, Timo Hannay, and Kenneth HBuetow. PID: the pathway interaction database. Nucleic Acids Res., 37(suppl 1):D674–D679, 2009. 6[181] Ton N Schumacher and Robert D Schreiber. Neoantigens in cancer immunotherapy. Science, 348(6230):69–74, 2015. 14[182] Parag P Shah, William W Lockwood, Kumar Saurabh, Zimple Kurlawala, Sean P Shannon, Sabine Waigel,Wolfgang Zacharias, and Levi J Beverly. Ubiquilin1 represses migration and epithelial-to-mesenchymaltransition of human non-small cell lung cancer cells. Oncogene, 34(13):1709–1717, 2015. 73[183] Sohrab P Shah, Martin Ko¨bel, Janine Senz, Ryan D Morin, Blaise A Clarke, Kimberly C Wiegand, GillianLeung, Abdalnasser Zayed, Erika Mehl, Steve E Kalloger, et al. Mutation of FOXL2 in granulosa-celltumors of the ovary. N. Engl. J. Med., 360(26):2719–2729, 2009. 18[184] Sohrab P Shah, Ryan D Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, AllenDelaney, Karen Gelmon, Ryan Guliany, Janine Senz, et al. Mutational evolution in a lobular breast tumourprofiled at single nucleotide resolution. Nature, 461(7265):809–813, 2009. 18[185] Sohrab P Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha, Yongjun Zhao, Gulisa Turashvili,Jiarui Ding, Kane Tse, Gholamreza Haffari, et al. The clonal and mutational evolution spectrum of primarytriple-negative breast cancers. Nature, 486(7403):395–399, 2012. 15, 16, 27, 83[186] Padmanee Sharma and James P Allison. The future of immune checkpoint therapy. Science, 348(6230):56–61, 2015. 14[187] Alice T. Shaw, Manabu Soda, Yoshihiro Yamashita, Toshihide Ueno, Junpei Takashima, Takahiro Nakajima,Yasushi Yatabe, Kengo Takeuchi, Toru Hamada, Hidenori Haruta, et al. Resensitization to Crizotinib bythe lorlatinib ALK resistance mutation L1198F. N. Engl. J. Med., 363(18):1734–1739, 2015. 14[188] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal.Mach. Intell., 22(8):888–905, August 2000. 85[189] Jianxin Shi, Qingqi Han, Heng Zhao, Chenxi Zhong, and Feng Yao. Downregulation of MED23 promotedthe tumorigenecity of esophageal squamous cell carcinoma. Mol. Carcinog., 53(10):833–840, 2014. 73130Bibliography[190] Gordon K Smyth. Limma: linear models for microarray data. In Bioinformatics and computational biologysolutions using R and Bioconductor, pages 397–420. Springer, 2005. 76[191] Matija Snuderl, Ladan Fazlollahi, Long P Le, Mai Nitta, Boryana H Zhelyazkova, Christian J Davidson,Sara Akhavanfard, Daniel P Cahill, Kenneth D Aldape, Rebecca A Betensky, et al. Mosaic amplificationof multiple receptor tyrosine kinase genes in glioblastoma. Cancer cell, 20(6):810–817, 2011. 3[192] Alexandra Snyder, Vladimir Makarov, Taha Merghoub, Jianda Yuan, Jesse M Zaretsky, Alexis Desrichard,Logan A Walsh, Michael A Postow, Phillip Wong, Teresa S Ho, et al. Genetic basis for clinical response toCTLA-4 blockade in melanoma. N. Engl. J. Med., 371(23):2189–2199, 2014. 14, 119[193] Andrea Sottoriva, Haeyoun Kang, Zhicheng Ma, Trevor A Graham, Matthew P Salomon, Junsong Zhao,Paul Marjoram, Kimberly Siegmund, Michael F Press, Darryl Shibata, et al. A big bang model of humancolorectal tumor growth. Nat. Genet., 47(3):209–216, 2015. 2[194] Oliver Stegle, Sarah A Teichmann, and John C Marioni. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet., 16(3):133–145, 2015. 5[195] Michael R Stratton. Exploring the genomes of cancer cells: progress and promise. Science, 331(6024):1553–1558, 2011. 4[196] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer genome. Nature, 458(7239):719–724, 2009. 1, 4[197] Werner Stuetzle and Rebecca Nugent. A generalized single linkage method for estimating the cluster treeof a density. J. Comp. Graph. Stat., 19(2):397–418, 2010. 85[198] Chong Sun, Liqin Wang, Sidong Huang, Guus JJE Heynen, Anirudh Prahallad, Caroline Robert, JohnHaanen, Christian Blank, Jelle Wesseling, Stefan M Willems, et al. Reversible and adaptive resistance toBRAF (V600E) inhibition in melanoma. Nature, 508(7494):118–122, 2014. 117[199] Yun-Chi Tang and Angelika Amon. Gene copy-number alterations: A cost-benefit analysis. Cell, 152(3):394–405, 2013. 60[200] Kensuke Tateishi, Hiroaki Wakimoto, A John Iafrate, Shota Tanaka, Franziska Loebel, Nina Lelic, DmitriWiederschain, Olivier Bedel, Gejing Deng, Bailin Zhang, et al. Extreme vulnerability of IDH1 mutantcancers to NAD+ depletion. Cancer cell, 28(6):773–784, 2015. 117[201] The Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carci-noma. Nature, 497(7447):67–73, 2013. 80[202] Cristian Tomasetti and Bert Vogelstein. Variation in cancer risk among tissues can be explained by thenumber of stem cell divisions. Science, 347(6217):78–81, 2015. 2[203] Ali Torkamani and Nicholas J Schork. Prediction of cancer driver mutations in protein kinases. CancerRes., 68(6):1675–1682, 2008. 13[204] Sushil Tripathi, Karen R Christie, Rama Balakrishnan, Rachael Huntley, David P Hill, Liv Thommesen,Judith A Blake, Martin Kuiper, and Astrid Lægreid. Gene ontology annotation of sequence-specific DNAbinding transcription factors: setting the stage for a large-scale curation effort. Database, 2013:bat062,2013. doi: 10.1093/database/bat062. 71[205] Laurens Van der Maaten et al. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(85):2579–2605, 2008.109[206] Erwin L van Dijk, Yan Jaszczyszyn, and Claude Thermes. Library preparation methods for next-generationsequencing: tone down the bias. Exp. Cell. Res., 322(1):12–20, 2014. 13[207] Fabio Vandin, Eli Upfal, and Benjamin J Raphael. Algorithms for detecting significantly mutated pathwaysin cancer. J. Comp. Biol., 18(3):507–522, 2011. 13, 42[208] Fabio Vandin, Eli Upfal, and Benjamin J Raphael. De novo discovery of mutated driver pathways in cancer.Genome Res., 22(2):375–385, 2012. 13[209] Ignacio Varela, Patrick Tarpey, Keiran Raine, Dachuan Huang, Choon Kiat Ong, Philip Stephens, HelenDavies, David Jones, Meng-Lay Lin, Jon Teague, et al. Exome sequencing identifies frequent mutation ofthe SWI/SNF complex gene PBRM1 in renal carcinoma. Nature, 469(7331):539–542, 2011. 18[210] Charles J Vaske, Stephen C Benz, J Zachary Sanborn, Dent Earl, Christopher Szeto, Jingchun Zhu, DavidHaussler, and Joshua M Stuart. Inference of patient-specific pathway activities from multi-dimensionalcancer genomics data using PARADIGM. Bioinformatics, 26(12):i237–i245, 2010. 41[211] Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In ECCV, pages705–718. Springer, 2008. 84[212] Cor J. Veenman et al. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell.,24(9):1273–1280, 2002. 99131Bibliography[213] Bert Vogelstein, Nickolas Papadopoulos, Victor E Velculescu, Shibin Zhou, Luis A Diaz, and Kenneth WKinzler. Cancer genome landscapes. Science, 339(6127):1546–1558, 2013. 1, 2, 4, 13, 14, 40, 54, 55, 71, 73,118[214] Bert Vogelstein and Kenneth W. Kinzler. The path to cancer –three strikes and you’re out. N. Engl. J.Med., 373(20):1895–1898, 2015. 1, 2[215] Ulrike Von Luxburg. A tutorial on spectral clustering. Stat. Comp., 17(4):395–416, 2007. 85, 94[216] Nikhil Wagle, Brian C Grabiner, Eliezer M Van Allen, Ali Amin-Mansour, Amaro Taylor-Weiner, MaraRosenberg, Nathanael Gray, Justine A Barletta, Yanan Guo, Scott J Swanson, et al. Response and acquiredresistance to everolimus in anaplastic thyroid cancer. N. Engl. J. Med., 371(15):1426–1433, 2014. 119[217] Silke Wagner and Dorothea Wagner. Comparing clusterings – an overview. Technical Report 2006-04,Universita¨t Karlsruhe, 2007. 95[218] Tim Wang, Kıvanc¸ Birsoy, Nicholas W Hughes, Kevin M Krupczak, Yorick Post, Jenny J Wei, Eric SLander, and David M Sabatini. Identification and characterization of essential genes in the human genome.Science, 350(6264):1096–1101, 2015. 118[219] Lukas D Wartman. A case of me: clinical cancer sequencing and the future of precision medicine. MolecularCase Studies, 1(1):a000349, 2015. doi: 10.1101/mcs.a000349. 4[220] Michael P Washburn, Dirk Wolters, and John R Yates. Large-scale analysis of the yeast proteome bymultidimensional protein identification technology. Nat. Biotechnol., 19(3):242–247, 2001. 4[221] Takuya Watanabe, Sumihito Nobusawa, Paul Kleihues, and Hiroko Ohgaki. IDH1 mutations are earlyevents in the development of astrocytomas and oligodendrogliomas. Am. J. Pathol., 174(4):1149–1153,2009. 117[222] Robert Weinberg. The biology of cancer. Garland Science, 2 edition, 2013. 65, 118[223] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, KyleEllrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, Cancer Genome Atlas Research Network, et al.The cancer genome atlas pan-cancer analysis project. Nat. Genet., 45(10):1113–1120, 2013. 60[224] Henrica MJ Werner, Gordon B Mills, and Prahlad T Ram. Cancer systems biology: a peek into the futureof patient care? Nat. Rev. Clin. Oncol., 11(3):167–176, 2014. 4[225] Kimberly C Wiegand, Sohrab P Shah, Osama M Al-Agha, Yongjun Zhao, Kane Tse, Thomas Zeng, Ja-nine Senz, Melissa K McConechy, Michael S Anglesio, Steve E Kalloger, et al. ARID1A Mutations inEndometriosis-Associated Ovarian Carcinomas. N. Engl. J. Med., 363(16):203–211, 2010. 18[226] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail MSavitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-baseddraft of the human proteome. Nature, 509(7502):582–587, 2014. 4[227] David Wishart. Mode analysis: a generalization of nearest neighbor which reduces chaining effects. In A.J.Cole, editor, Numerical Taxonomy, pages 282–311. Academic Press, 1969. 84[228] Tobias Wittkop, Dorothea Emig, Sita Lange, Sven Rahmann, Mario Albrecht, John H Morris, SebastianBo¨cker, Jens Stoye, and Jan Baumbach. Partitioning biological data with transitivity clustering. Nat.Methods, 7(6):419–420, 2010. 99[229] Christian Wiwie, Jan Baumbach, and Richard Ro¨ttger. Comparing the performance of biomedical clusteringmethods. Nature methods, 12(11):1033–1038, 2015. 98, 99[230] Song Wu, Scott Powers, Wei Zhu, and Yusuf A. Hannun. Substantial contribution of extrinsic risk factorsto cancer development. Nature, 347(6217):78–81, 2015. 2[231] Murry W Wynes, Trista K Hinz, Dexiang Gao, Michael Martini, Lindsay A Marek, Kathryn E Ware,Michael G Edwards, Diana Bo¨hm, Sven Perner, Barbara A Helfrich, et al. FGFR1 mRNA and proteinexpression, not gene copy number, predict FGFR TKI sensitivity across all lung cancer histologies. Clin.Cancer Res., 20(12):3299–3309, 2014. 66[232] Chen Xu and Zhengchang Su. Identification of cell types from single-cell transcriptomes using a novelclustering method. Bioinformatics, 31(12):1974–1800, 2015. 84[233] Wei Xu, Hui Yang, Ying Liu, Ying Yang, Ping Wang, Se-Hee Kim, Shinsuke Ito, Chen Yang, Pu Wang,Meng-Tao Xiao, et al. Oncometabolite 2-hydroxyglutarate is a competitive inhibitor of α-ketoglutarate-dependent dioxygenases. Cancer cell, 19(1):17–30, 2011. 117[234] Hai Yan, D Williams Parsons, Genglin Jin, Roger McLendon, B Ahmed Rasheed, Weishi Yuan, Ivan Kos,Ines Batinic-Haberle, Siaˆn Jones, Gregory J Riggins, et al. IDH1 and IDH2 mutations in gliomas. N. Engl.J. Med., 360(8):765–773, 2009. 18132Bibliography[235] Ahrim Youn and Richard Simon. Identifying cancer driver genes in tumor genome sequencing studies.Bioinformatics, 27(2):175–181, 2011. 13[236] Dimas Yusuf, Stefanie L Butland, Magdalena I Swanson, Eugene Bolotin, Amy Ticoll, Warren A Cheung,Xiao Y Cindy Zhang, Christopher TD Dickman, Debra L Fulton, Jonathan S Lim, et al. The transcriptionfactor encyclopedia. Genome Biol., 13(3):R24, 2012. doi: 10.1186/gb-2012-13-3-r24. 54[237] Travis I Zack, Steven E Schumacher, Scott L Carter, Andrew D Cherniack, Gordon Saksena, Barbara Tabak,Michael S Lawrence, Cheng-Zhong Zhang, Jeremiah Wala, Craig H Mermel, et al. Pan-cancer patterns ofsomatic copy number alteration. Nat. Genet., 45(10):1134–1140, 2013. 60, 69[238] Charles T Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans.Comput., 100(1):68–86, 1971. 99, 101[239] Achim Zeileis, Kurt Hornik, Alex Smola, and Alexandros Karatzoglou. kernlab-an S4 package for kernelmethods in R. J. Stat Softw., 11(9):1–20, 2004. 99[240] Hufeng Zhou, Jingjing Jin, Haojun Zhang, Bo Yi, Michal Wozniak, and Limsoon Wong. IntPath–anintegrated pathway gene relationship database for model organisms and important pathogens. BMC Syst.Biol., 6(Suppl 2):S2, 2012. doi: 10.1186/1752-0509-6-S2-S2. 54133

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0303119/manifest

Comment

Related Items