UBC Theses and Dissertations

Cis-regulatory somatic mutations and gene-expression alteration in B cell lymphomas Lefebvre, Calvin 2015


Cis-Regulatory Somatic Mutations and Gene-Expression Alteration in B Cell Lymphomas

by

Calvin Lefebvre

Bachelor of Science, University of Toronto, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

The University of British Columbia
(Vancouver)

August 2015

© Calvin Lefebvre, 2015

Abstract

Substantial progress has been achieved in characterizing protein coding (PC) regions for cancer genomes, with large contributions coming from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). In order to obtain a complete mutational profile of cancer genomes, the whole genome must be analyzed for two reasons: a large proportion of somatic mutations are within the non-coding region and 80% of the human genome is estimated to have some biological functionality. The dramatic cost reduction afforded by next generation sequencing has now made it tractable to sequence entire cancer genomes, allowing mutational profiling of the functional loci in the non-coding regions, such as cis-regulatory elements. Recent cancer genomic studies observed that somatic mutations within cis-regulatory elements have the capacity to deregulate gene expression, but their impact remains underexplored.

Initial attempts to prioritize cis-regulatory mutations did not incorporate RNA-Seq. We used 84 B cell lymphoma samples to address this limitation by prioritizing disruptive cis-regulatory mutations based on their potential to be the cause of observable cascading expression changes throughout biological networks. BCL6, ROBO1, GNA13, HAS2 and MYC were dysregulated genes targeted with somatic mutations through different mechanisms. Mutations either targeted the genes directly (PC mutations), indirectly (cis-regulatory mutations) or both.

Our analyses demonstrate the importance of identifying genomically altered cis-regulatory elements, along with gene expression data, to interpret the mutational landscapes of cancers.

Preface

A version of this dissertation has been published in Genome Biology [45].

Translocation predictions for Cohort 1 were generated by Andrew McPherson. Jiarui Ding helped choose appropriate parameters for Xseq. Ryan Morin's lab generated the somatic SNV calls for Cohort 1.

Anthony Mathelier identified both cis-regulatory regions and frequently mutated regions, and predicted the binding affinities for TFBSs (Figure 2.1). Figures for frequently mutated regions (Figure 3.3, Figure 3.4 and Figure 3.5), enrichment analyses (Figure 3.6, Figure 3.7, Figure 3.8, Figure 3.11, Figure 3.12 and Figure 3.13) and TF binding affinity sections in Xseq examples (Figure 3.19 and Figure 3.21) were created by Anthony Mathelier.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Mutational Landscapes of Cancer Genomes
  1.2 Genomic Variants
    1.2.1 PC Mutations
    1.2.2 Cis-Regulatory Mutations
  1.3 Sequencing Technology
    1.3.1 WGS
    1.3.2 RNA-Seq
    1.3.3 ChIP-Seq
    1.3.4 Integration of WGS, RNA-Seq, and ChIP-Seq
  1.4 Resources and Datasets
    1.4.1 B Cell Lymphoma
  1.5 Experimental Aims
2 Experimental Approach
  2.1 Datasets
    2.1.1 B Cell Lymphoma
    2.1.2 ChIP-Seq and TF Binding Profiles
  2.2 Methodology
    2.2.1 Identification of Somatic Mutations
    2.2.2 Mutations Overlapping TFBS Regions
    2.2.3 Mutations Overlapping PC Regions
    2.2.4 Mutation Rate
    2.2.5 Xseq
    2.2.6 Frequently Targeted Promoter Regions
    2.2.7 Functional Enrichment Analysis
3 Results
  3.1 Mutational Landscape for B Cell Lymphomas
  3.2 Comparing Mutation Rate Between PC and Cis-Regulatory Regions
  3.3 Frequency of Mutations in Promoter Regions
  3.4 Impact of Cis-Regulatory Mutations on Gene Expression
4 Discussion
5 Conclusion
Bibliography
A Supporting Materials

List of Tables

Table 1.1 List of PC mutations and their definitions.
Table 1.2 Mapping between RNA codons and amino acids.
Table 1.3 Classes and functions of cis-regulatory elements.
Table 1.4 Signature translocations and mutated tumour suppressor genes for B cell lymphoma.
Table 2.1 B cell lymphoma datasets.
Table 2.2 Length to consider when re-evaluating the binding affinity for the altered TFBS.
Table 3.1 The total number of mutations for each cohort with statistical measures.
Table 3.2 The number of Xseq predicted genes that have at least one sample for each mutation composition.
Table A.1 PC mutations considered in Xseq.

List of Figures

Figure 1.1 Binding affinity for disrupted TFBSs
Figure 2.1 Workflow of the analyses performed to address our experimental aims
Figure 2.2 Workflow on prediction of somatic indels
Figure 2.3 Homopolymers or tandem repeats cause improper germline and somatic labeling for mutations
Figure 2.4 Generating PWMs
Figure 2.5 Predicting binding affinity scores for TFBS using PWM
Figure 3.1 Distribution of the number of mutations per sample in Cohort 1 and Cohort 2
Figure 3.2 Comparison of SNV mutation rates for cis-regulatory and PC regions
Figure 3.3 Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters for Cohort 1
Figure 3.4 Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters for Cohort 2
Figure 3.5 Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters when combining cohorts
Figure 3.6 Functional enrichment analysis of genes associated with frequently mutated promoters in Cohort 1
Figure 3.7 Functional enrichment analysis of genes associated with frequently mutated promoters in Cohort 2
Figure 3.8 Functional enrichment analysis of genes associated with frequently mutated promoters for combined cohorts
Figure 3.9 Cancer genes predicted by the Xseq tool from Cohort 1
Figure 3.10 Cancer genes predicted by the Xseq tool from Cohort 2
Figure 3.11 Functional enrichment analysis of disrupted pathways for Cohort 1
Figure 3.12 Functional enrichment analysis of disrupted pathways for Cohort 2
Figure 3.13 Functional enrichment analysis of disrupted pathways when combining cohorts
Figure 3.14 Xseq results and translocations for BCL6 and MYC genes
Figure 3.15 Impact of BCL6 expression alteration due to a PC mutation
Figure 3.16 Impact of BCL6 expression alteration due to a TFBS mutation
Figure 3.17 Impact of ROBO1 expression alteration due to a PC mutation
Figure 3.18 Impact of ROBO1 expression alteration due to a TFBS mutation
Figure 3.19 Predicted cis-regulatory mutation potentially impacting GNA13 gene expression.
Figure 3.20 Impact of GNA13 expression alteration due to a PC or TFBS mutation
Figure 3.21 Predicted cis-regulatory mutation potentially impacting HAS2 gene expression.
Figure 3.22 Impact of HAS2 expression alteration due to a PC mutation
Figure 3.23 Impact of HAS2 expression alteration due to a TFBS mutation
Figure A.1 Distributions of the Xseq probabilities when considering all samples (Pr(Dg))
Figure A.2 Comparison of the indel mutation rates for cis-regulatory and PC regions
Figure A.3 Impact of MYC expression alteration due to a PC mutation
Figure A.4 Impact of MYC expression alteration due to TFBS mutations.
Figure A.5 Impact of MYC expression alteration due to TFBS and PC mutations.

Glossary

TCGA      The Cancer Genome Atlas
ICGC      International Cancer Genome Consortium
PC        protein coding
RNA-Seq   transcriptome sequencing
WGS       whole genome sequencing
SNVs      single nucleotide variants
indels    small insertions and deletions
SVs       structural variants
TFBSs     transcription factor binding sites
TF        transcription factor
TSS       transcription start site
PFM       position frequency matrix
PWMs      position weight matrices
RPKM      reads per kilobase of exon model per million mapped reads
KDPs      kernel density plots
BCR       B cell receptor
Ig        immunoglobulin
GC        germinal centre
BL        Burkitt lymphoma
DLBCL     diffuse large B cell lymphoma
PMBCL     primary mediastinal large B cell lymphoma
FL        follicular lymphoma

Acknowledgments

For their funding, I would like to thank the CIHR Bioinformatics Training Program and my supervisor.

I would also like to thank my sister and mom for their continued support.

Chapter 1

Introduction

Live as if you were to die tomorrow. Learn as if you were to live forever.
- Mahatma Gandhi

1.1 Mutational Landscapes of Cancer Genomes

The majority of tumours accumulate mutations throughout their genome, some of which disrupt normal biological function and potentially promote tumour growth and proliferation.
Mutations are either inherited through germ cells (germline mutations) or arise through a variety of mechanisms during a cell's lifetime (somatic mutations). Somatic mutations originate from exposure to exogenous or endogenous mutagens or from inaccuracy of the DNA replication machinery. Cancer cells are able to accumulate mutations because of defective DNA repair mechanisms and the inability to halt the cell cycle and initiate apoptosis.

Cancer genome sequencing projects have primarily focused on exploring somatic mutational landscapes. The vast majority of these projects analyze only the ∼2% of the human genome that codes for proteins, allowing for the identification and interpretation of somatic mutations in the accessible protein coding (PC) regions that can cause proteins to be non-functional or change protein function. Working within the restricted PC region allows for a higher sequencing depth while being cost-effective [5]. As a result, somatic mutations can be predicted with higher confidence. The disadvantage is that the mutational landscape of the non-coding space is neglected, which potentially omits relevant mutations. Over the past decade of analyzing the mutational landscape of PC regions across a variety of cancer types, a small number of frequently mutated genes and a significantly larger number of infrequently mutated genes were identified [72]. For example, IDH1 (a gene involved in cell metabolism [17] and DNA methylation [55]) is repeatedly mutated in multiple cancer types, especially in lower grade glioma, where 75% of the samples have an altered IDH1 gene [10, 22]. Moreover, TP53 is one of the most frequently altered genes in human cancers; in ovarian serous cystadenocarcinoma, 94.9% of the samples have an altered TP53 gene [10, 22]. The Cancer Genome Atlas (TCGA) projects, a large scale consortium, have reported the mutation prevalence of previously known and unknown cancer genes across the major tumour types [30].
However, synthesis of the vast analyses of TCGA projects has revealed a discovery gap in the search for new cancer genes [40]. We assert this gap can be partially filled through analyses of the non-coding region.

When identifying recurrently altered known or unknown cancer genes, inclusion of the non-coding mutational landscape is logical for two reasons: a large proportion of somatic mutations are within the non-coding region and 80% of the human genome is estimated to have some biological functionality [15]. Studies observing the impact of non-coding mutations on human phenotypes have accumulated over the last few decades. Abnormal phenotypes in human diseases have been linked to specific disruptions of binding sites within cis-regulatory elements. In particular, hemophilia B Leyden occurs when the binding site for HNF4A, located upstream of the Factor IX gene, harbours an alteration [62]. Similarly, a mutated GATA binding site in a regulatory region located upstream of the platelet glycoprotein gene causes Bernard-Soulier syndrome [43].

The dramatic reduction in cost afforded by next generation sequencing has now made it tractable to sequence entire cancer genomes. This makes the functional loci in the non-coding genome, such as cis-regulatory elements, accessible to mutational profiling. Recently, recurrent somatic mutations in the TERT promoter were revealed to potentially impact gene expression in human melanoma studies [28]. Also, two recent publications made the first attempts to prioritize cis-regulatory mutations in cancer genomes, but neither incorporated transcriptome sequencing (RNA-Seq) data [35, 76]. Our contribution advances the field by prioritizing disruptive cis-regulatory mutations associated with gene expression changes in biological networks. Additionally, we identified genes recurrently mutated through different mechanisms: mutations that targeted genes either directly (PC mutations) or indirectly (cis-regulatory mutations).
Regardless of the recent advances in exploring the non-coding region of cancer genomes, somatic mutations in cis-regulatory regions remain poorly characterized due to the limited amount of publicly available whole genome sequencing (WGS) data and the current limitations in detection and annotation of non-coding variants. The combination of frequently mutated cis-regulatory and PC regions may lead to new biomarkers and therapeutic targets.

1.2 Genomic Variants

In this project we considered single nucleotide variants (SNVs), small insertions and deletions (indels) and structural variants (SVs). SVs are changes in the structure of the chromosome caused by double strand breaks; the type of SV is determined by how the breaks are reconnected.

The ability to predict somatic mutations depends on the quality and depth of the sequencing data. In studying sporadic cancers, only somatic mutations are considered, since interpreting germline variants requires large case-control studies. Driver and passenger mutations are two subtypes of somatic mutations. Driver mutations contribute to the tumour's progression, whereas passenger mutations have no functional consequences and confer no selective advantage. Interpreting somatic mutations through their impact on gene expression networks can illuminate functional mutations and potentially distinguish drivers from passengers [6, 19]. In functional and large-scale sequencing studies, PC mutations in driver genes were observed to alter the expression of the mutated driver gene and of its known interacting genes in functional protein association networks [6]. The same concept should apply to cis-regulatory elements; disruptive cis-regulatory mutations that cause expression change for driver genes should have cascading effects by disrupting expression of known interacting genes. In fact, given their regulatory function, mutations within cis-regulatory regions should have a higher impact on dysregulation.
PC mutations can down-regulate gene expression by destroying the start codons used to initiate translation; otherwise, they can potentially create non-functional proteins. In contrast, all cis-regulatory mutations have the potential to up- or down-regulate gene expression, depending on whether the mutation increases or decreases the binding affinity of the cis-regulatory element.

1.2.1 PC Mutations

The mutational landscape of PC regions in cancer genomes has been thoroughly explored because these regions are inexpensive to sequence and their mutations are easier to interpret than those in the rest of the genome. Recent advances in this area of research have focused on targeted sequencing of regions of high interest and on using whole genome or exome sequencing to profile recurrent mutations across hundreds or thousands of tumours of the same or different types. Examples include identifying clonal frequencies of predicted driver genes in 104 triple negative breast cancers [64], validating the classification of 7,500 breast cancers based on the integration of genomic and transcriptomic data [3], and the TCGA pan-cancer projects, which use 12 tumour types to analyze similarities and differences in the mutation prevalence of cancer genes [52].

There are approximately 22,000 PC genes in the human genome, and cancer studies typically focus heavily on only two functional classes: tumour suppressor genes and oncogenes. Oncogenes and tumour suppressor genes promote and inhibit tumour formation, respectively. In cancer cells, oncogenes are generally up-regulated and tumour suppressor genes are down-regulated. Adding to the complexity of the problem, a gene can be classified as both a tumour suppressor and an oncogene, depending on the cell's environment.

There are several different types of PC mutations, listed in Table 1.1. Depending on its type, a PC mutation can have a different level of impact, causing the protein to either gain or lose functionality.
Multiple codons can map to the same amino acid due to the degeneracy of the genetic code (Table 1.2). Consequently, a mutation can alter the codon but not change the amino acid, which is referred to as a synonymous substitution. If the amino acid is not changed, then the protein sequence is unaltered. The position of the mutation within the codon is significant; mutations that occur in the third position of the codon have a higher probability of not altering the amino acid (Table 1.2). Specifically, if we assume that substitutions occur with equal probability between the four nucleotide types, then a mutation in the first, second and third position of the codon has a 95.8%, 99.0% and 33.3% probability of causing a change of amino acid, respectively. Mutations that either create or destroy a start or stop codon have significant consequences, silencing the protein, truncating it or creating an alternative protein. For mutations that result in a change of amino acid (nonsynonymous substitutions), it is more difficult to determine whether the resulting protein is functional without performing functional validation. The change in amino acid may not affect the folding of the protein because the properties of the amino acid may not change; for example, the amino acid may remain hydrophilic or hydrophobic.

The identification and analysis of PC mutations is critical to understanding the mutations that drive tumour formation and growth, but the PC region is not the only important region.

1.2.2 Cis-Regulatory Mutations

For a cell to function properly, gene expression must be regulated through an elaborate process of control. This control can occur at several different stages; in this dissertation I focused on transcriptional regulation through cis-regulatory elements.
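The codon-position percentages quoted in Section 1.2.1 (95.8%, 99.0% and 33.3%) can be reproduced by enumerating every single-base substitution in the standard genetic code. The sketch below is illustrative only and is not taken from the thesis; it assumes equal substitution probabilities, weights all 64 codons equally, and treats the three stop codons as one class, so a stop-to-stop change counts as synonymous.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons ordered U, C, A, G at each position.
# '*' marks a stop codon (assumption: all stops are treated as one class).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def nonsynonymous_fraction(position):
    """Fraction of single-base substitutions at `position` (0, 1 or 2)
    that change the encoded amino acid, over all 64 codons."""
    changed = total = 0
    for codon, amino in CODE.items():
        for base in BASES:
            if base == codon[position]:
                continue  # not a substitution
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            changed += CODE[mutant] != amino
    return changed / total


fractions = [nonsynonymous_fraction(p) for p in range(3)]
for pos, frac in enumerate(fractions, start=1):
    print(f"codon position {pos}: {100 * frac:.1f}% of substitutions change the amino acid")
```

Under these assumptions the counts are 184/192, 190/192 and 64/192 for the first, second and third positions.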
Specifically, I examined how transcription factor binding sites (TFBSs) can be altered to potentially assist in the formation and progression of tumours. TFBSs are locations where transcription factor (TF) proteins bind to regulate the initiation and stability of transcription by recruiting other TFs or RNA polymerase II.

Promoters, enhancers, silencers and insulators are cis-regulatory elements that contain TFBSs (Table 1.3) [77]. Promoters are DNA regions encompassing the transcription start site (TSS); they contain several TFBSs, where TFs bind in order to recruit RNA polymerase II. To strengthen transcription, activator proteins, a type of TF, can bind to distal TFBSs within enhancer regions. Silencers have distal TFBSs where repressor proteins bind to suppress transcription by blocking RNA polymerase II from binding to the promoter. Insulators allow genes to be regulated by enhancers or silencers in one chromatin domain without influencing other genes in neighbouring domains.

Table 1.1: List of PC mutations and their definitions.

Mutation Type   Classification                   Definition
SNVs            synonymous                       mutation does not change the amino acid
SNVs            nonsynonymous                    mutation changes the amino acid
SNVs            splice site                      mutation altering splice site
SNVs            start codon lost or gained       start codon is lost or gained
SNVs            stop codon lost or gained        stop codon is lost or gained
Indels          frame shift                      one or two nucleotides deleted or inserted causing a codon change
Indels          inframe deletion                 one or more codons are deleted (multiple of 3 nucleotides)
Indels          codon deletion with frameshift   a codon is changed and one or more codons are deleted
Indels          inframe insertion                one or more codons are inserted (multiple of 3 nucleotides)
Indels          codon insertion with frameshift  a codon is changed and one or more codons are inserted

Cancer studies have recently highlighted that mutations within cis-regulatory elements dysregulate gene expression [1, 26, 28, 31, 35, 69, 71].
Research into the impact of cis-regulatory mutations on gene expression is ongoing [77], and the global relationship between nucleotide variants and the gain or loss of TFBSs remains largely unexplored [9]. Predictions of how mutations disrupt TFBSs have previously been made using TF binding profiles, phylogenetic footprinting, experimental data (DNase-seq, epigenetic, etc.) and population genetics [4, 11, 16, 32, 35, 70]. TF binding profiles are represented as position weight matrices (PWMs), which represent motifs by providing a tolerance score for each nucleotide at each position within the binding site. These PWMs are used to calculate the TF binding affinity scores for specific TFBSs, where a higher binding score corresponds to a higher probability that a TF will bind. To understand how to generate these PWMs and TF binding affinity scores, refer to Ch. 2.2.2 in Experimental Approach.

Table 1.2: Mapping between RNA codons and amino acids.

First Base   Second Base                                      Third Base
             U          C          A           G
U            UUU Phe    UCU Ser    UAU Tyr     UGU Cys        U
             UUC Phe    UCC Ser    UAC Tyr     UGC Cys        C
             UUA Leu    UCA Ser    UAA Stop    UGA Stop       A
             UUG Leu    UCG Ser    UAG Stop    UGG Trp        G
C            CUU Leu    CCU Pro    CAU His     CGU Arg        U
             CUC Leu    CCC Pro    CAC His     CGC Arg        C
             CUA Leu    CCA Pro    CAA Gln     CGA Arg        A
             CUG Leu    CCG Pro    CAG Gln     CGG Arg        G
A            AUU Ile    ACU Thr    AAU Asn     AGU Ser        U
             AUC Ile    ACC Thr    AAC Asn     AGC Ser        C
             AUA Ile    ACA Thr    AAA Lys     AGA Arg        A
             AUG Met    ACG Thr    AAG Lys     AGG Arg        G
G            GUU Val    GCU Ala    GAU Asp     GGU Gly        U
             GUC Val    GCC Ala    GAC Asp     GGC Gly        C
             GUA Val    GCA Ala    GAA Glu     GGA Gly        A
             GUG Val    GCG Ala    GAG Glu     GGG Gly        G

Table 1.3: Classes and functions of cis-regulatory elements.

Cis-regulatory Element   Function
Promoter                 encompasses the TSS and is where TFs bind in order to recruit the RNA polymerase II
Enhancer                 strengthens transcription by creating a stabilizing bridge with the promoter
Silencer                 suppresses transcription by blocking the RNA polymerase II from binding to the promoter
Insulator                isolates the influences from enhancers or silencers to a single chromatin domain

Binding affinity scores are useful when determining whether a TFBS mutation decreases or increases the binding affinity (Figure 1.1).
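To make the idea of scoring a binding site with a PWM concrete, here is a minimal sketch. It is not the scoring procedure of Ch. 2.2.2: the four-position count matrix `TOY_PFM`, the pseudocount of 0.5, and the uniform background of 0.25 are all assumptions chosen for illustration. Each PWM entry is a log2 odds ratio against the background, and the site score is the sum over positions.

```python
import math


def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert per-position nucleotide counts to log2-odds weights.
    The pseudocount and uniform background are illustrative assumptions."""
    pwm = []
    for counts in pfm:
        total = sum(counts.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, site):
    """Sum the position-specific weights for the bases in `site`."""
    return sum(column[base] for column, base in zip(pwm, site))


# Hypothetical counts from 10 aligned binding sites (not real data).
TOY_PFM = [
    {"A": 9, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},   # fully conserved position
    {"A": 3, "C": 2, "G": 3, "T": 2},    # weakly conserved position
    {"A": 0, "C": 0, "G": 0, "T": 10},
]

pwm = pfm_to_pwm(TOY_PFM)
reference = "AGAT"
mutant = "ACAT"  # SNV at the conserved second position
ref_score = score(pwm, reference)
mut_score = score(pwm, mutant)
print(f"reference {reference}: {ref_score:.2f}")
print(f"mutant    {mutant}: {mut_score:.2f}")
```

In this toy case the substitution at the conserved position lowers the score, which is the kind of decrease that the comparison between reference and alternative sites is meant to detect.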
When comparing the binding score for the reference binding site and the alternative, there is either an increase, a decrease, or nearly no change in the binding affinity. A threshold is applied to the computed binding scores to determine whether the TF will bind to the mutated TFBS. If a binding score is ≤ 80% of the maximum binding score, then the TFBS mutation is considered to be disruptive [4]. Comparing the binding affinity scores of all possible individual SNVs within the TFBS is informative, since it determines whether the observed mutated TFBS is among the most deleterious (Figure 1.1B). As expected, mutations within conserved positions of TF binding profiles are the most deleterious for TF-DNA binding [75].

TF binding motifs show an anti-correlation between variation among individuals and positional information content [70]. Information content is a PFM measurement that quantifies the level of nucleotide certainty at each position and is based on Shannon's entropy calculation [65]. This anti-correlation supports the link between population diversity and evolutionary conservation. Generally, TFBSs with altered nucleotides at highly conserved positions will not allow proper TF binding. Mutations within highly conserved positions can lead to cell death due to their functional value, whereas positions with low conservation are less impacted if mutated. However, mutations at positions with low information content can be deleterious, yet non-lethal [70].

Kheradpour et al. (2013) showed that (in vitro) the impact of mutations within TFBSs is consistent with predictions of TF-DNA binding affinity from the PWM models, as formerly shown [7]. This was accomplished with the use of the massively parallel reporter assay to interpret the regulatory motifs in 2000 predicted human enhancers [33].
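The positional information content mentioned above can be sketched as follows. This is a minimal illustration assuming the common definition for DNA, 2 bits minus the Shannon entropy of the observed nucleotide frequencies at a position, with no small-sample correction; the two example columns are hypothetical counts, not data from the thesis.

```python
import math


def information_content(column):
    """Information content (bits) of one PFM column: 2 - Shannon entropy.
    Zero counts contribute nothing (0 * log2(0) is taken as 0)."""
    total = sum(column.values())
    ic = 2.0
    for count in column.values():
        if count > 0:
            p = count / total
            ic += p * math.log2(p)
    return ic


# Hypothetical PFM columns.
conserved = {"A": 0, "C": 0, "G": 10, "T": 0}   # one nucleotide always observed
uniform = {"A": 5, "C": 5, "G": 5, "T": 5}      # no preference at all

print(information_content(conserved))
print(information_content(uniform))
```

A fully conserved position reaches the maximum of 2 bits and a position with no nucleotide preference has 0 bits, so high information content marks the positions where a substitution is expected to be most disruptive.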
This increased confidence in computationally predicting the binding affinities of mutated TFBSs from ChIP-Seq data without performing validation experiments.

Disruption of TFBSs is known to affect the expression of nearby proximal genes, distal genes or even genes on a different chromosome [42]. Unfortunately, information about which TFs regulate which genes is incomplete. This makes it challenging to determine the association between a TFBS and the gene(s) it regulates when predicting cis-regulatory mutations that impact gene expression. The simplest and most common assumption is that TFBSs regulate the closest gene. This assumption limits the analysis considerably, ignoring any other connections between TFBSs and proximal or distal genes. Sikora-Wohlfeld et al. (2013) used ChIP-Seq to predict genes regulated by TFs and found that the closest gene assumption performed with acceptable accuracy [68].

[Figure 1.1 panels: (A) Binding Affinity of TFBSs, showing (i) the reference TFBS GGATCAAAGTCAT with binding score 7.83 and (ii) the observed alternative TFBS GGATCAAGGTCAT with binding score 4.08; (B) Comparison of Alternative Binding Scores, plotting binding affinity scores for the reference, the observed alternative and all other alternatives by TFBS position, with the sequence logo.]

Figure 1.1: Binding affinity for disrupted TFBSs. (A) Represents the ability of a TF to bind to a reference TFBS (i), but when a mutation alters the binding site the TF is unable to bind (ii). (B) Plots the binding scores for the reference TFBS and all the possible altered TFBSs given an SNV anywhere within the TFBS. In this extreme case, the observed mutation is the one that causes the maximum difference between the binding scores of the reference and alternative TFBS. The maximum difference is also depicted in the sequence logo, where the mutation occurs at position 8, which is among the most conserved positions (Ch. 2.2.2).

Another challenging problem is to accurately identify mutations within the non-coding region. The non-coding region is vast, which causes computationally intensive mutation callers to have high runtimes. Additionally, the large number of homopolymers and tandem repeats within the non-coding region adds difficulty in sequencing and identifying mutations (Ch. 2.2.1).

Integrating somatic mutations identified in the PC and TFBS regions allowed us to observe genes that were mutated through different mechanisms. For a given gene, a mutated TFBS or an altered amino acid can both decrease the production of a functional protein. Our integration can identify recurrently mutated genes that would otherwise be considered insignificant when each mutation type is interpreted independently. This supports our overall goal of demonstrating the importance of including genomically altered cis-regulatory elements when interpreting mutational landscapes of cancers.

1.3 Sequencing Technology

With the emergence of WGS, RNA-Seq and ChIP-Seq, the community has moved beyond the gene-specific early examples to a broader, genome-wide approach. The integration of multiple types of data is essential to study complex diseases and provides a more complete analysis of changes that occur within the cells. In fact, multi-layered data is vital in order to investigate how disruptive mutations in cis-regulatory regions affect gene expression.

1.3.1 WGS

WGS has led to advances in using mutational profiles to identify the prevalence of cancer genes across the major tumour types [30]. WGS is a procedure to identify the complete DNA sequence of a genome, including genes, regulatory regions, intergenic regions, etc. Computational methods use allele frequencies, which are the proportions of reads that support each nucleotide at a locus, and other characteristics of the aligned reads to predict mutations.
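As an illustration of the allele frequency definition above, the following sketch computes the proportion of reads supporting each nucleotide at one locus. The function name and the toy pileup strings are hypothetical, and this is not the mutation caller used in this work; real callers also weigh base and mapping qualities, strand bias and the matched normal sample.

```python
from collections import Counter


def allele_frequencies(pileup_bases):
    """Proportion of reads supporting each nucleotide at a single locus.
    `pileup_bases` holds one character per read covering the locus;
    characters other than A, C, G, T (e.g. N) are ignored."""
    counts = Counter(b.upper() for b in pileup_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


# Toy example: 10 reads cover the locus in each sample.
tumour_reads = "AAAAAAGGGG"   # 4 of 10 reads support G
normal_reads = "AAAAAAAAAA"   # all reads support the reference A

print(allele_frequencies(tumour_reads))
print(allele_frequencies(normal_reads))
```

An allele seen in a substantial fraction of tumour reads but absent from the matched normal reads is the basic signal that suggests a somatic variant.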
Performing massively parallel sequencing on independent DNA fragments allows for a significant reduction in sequencing time. WGS encounters biases and errors during the amplification, sequencing and alignment stages. Examples include disproportionate amplification of one allele during PCR and incorporation of an incorrect nucleotide during sequencing.

A challenging problem is sequencing and aligning repetitive regions, such as homopolymers and tandem repeats. Homopolymers are runs of identical nucleotides, whereas tandem repeats are sequences of two or more nucleotides that are repeated serially. Non-coding regions of the human genome have a higher number of repetitive sequences. This makes it difficult to distinguish mutations from sequencing errors within repetitive regions.

1.3.2 RNA-Seq

RNA-Seq is a method for capturing mRNA sequences that is commonly used for quantifying gene expression. However, RNA-Seq is biased when samples are sequenced at different depths, since more deeply sequenced samples will have a higher number of mapped reads. Additionally, the length of the transcript is a source of bias, because longer transcripts have a higher probability of having reads mapped to them. Reads per kilobase of exon model per million mapped reads (RPKM), which normalizes the read count, is often used to account for these technical biases:

\[
\mathrm{RPKM} = \frac{10^{9}\, M_g}{M_e\, \ell} \tag{1.1}
\]

where $M_g$ is the number of reads mapped to a gene, $M_e$ is the total number of reads mapped in the experiment, and $\ell$ is the exon length of the gene of interest.

A common analysis that uses RPKM expression values is differential expression analysis; in cancer research this is ideally a comparison of gene expression levels between tumour and normal samples of the same tissue type. Finding genes that have higher or lower expression in the tumour sample than in its normal sample may identify expression changes that aid in tumour formation or progression.
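Equation 1.1 can be written directly as a small function. This is a minimal sketch; the function name, argument names and the read counts in the example are made up for illustration.

```python
def rpkm(gene_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon model per million mapped reads (Equation 1.1).

    gene_reads          -- reads mapped to the gene (M_g)
    total_mapped_reads  -- reads mapped in the whole experiment (M_e)
    exon_length_bp      -- total exon length of the gene in base pairs (l)
    """
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total_mapped_reads and exon_length_bp must be positive")
    return 1e9 * gene_reads / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads, 2 kb of exon, 20 million mapped reads.
shallow = rpkm(500, 20_000_000, 2_000)
# Same gene in a sample sequenced twice as deeply: twice the reads,
# twice the library size, so the normalized value is unchanged.
deep = rpkm(1_000, 40_000_000, 2_000)
print(shallow, deep)
```

The example shows the depth normalization at work: doubling both the gene's read count and the library size leaves the RPKM value the same.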
In scenarios where normal samples are unavailable, there are alternative analysis methods, such as comparing a tumour sample to all other tumour samples using kernel density plots (KDPs). In our case, a gene was predicted to be dysregulated using the mean and standard deviation from KDPs (Ch. 2.2.5).

As previously stated, the majority of observed mutations are infrequent; therefore, in theory one should be able to identify significant expression changes when comparing all tumour samples. On the other hand, if a frequently mutated gene experiences a similar expression change in all tumour samples, then the expression change would be classified as significant when comparing tumour and normal samples, but insignificant when comparing across all tumour samples. One limitation of RNA-Seq data is that gene expression is time and tissue specific.

1.3.3 ChIP-Seq

An essential technology to identify cis-regulatory regions is ChIP-Seq [29]. ChIP-Seq comprises two stages, ChIP and sequencing. ChIP is the procedure that uses antibodies to isolate fragments of DNA where the TF proteins of interest bind; these TF-DNA binding regions include enhancer, repressor, promoter, and insulator regions. In the sequencing stage, these isolated DNA fragments are sequenced. Once the sequenced DNA fragments are aligned, the regions that show an enrichment of reads, known as ChIP-Seq peaks, represent regions where the proteins interact with the DNA. To further narrow down the protein-DNA binding location, there are computational methods that scan these ChIP-Seq peaks for matches to known binding profiles (Ch. 2.2.2).

Different tissue types are known to show different expression patterns; in part this is accomplished by compacting some chromosome regions to suppress genes and opening others to allow transcription to occur. Similarly, in ChIP-Seq, depending on the tissue type used, certain chromosome regions will be open so that proteins can bind and others will be condensed, preventing proteins from binding. This results in the identification of protein binding sites specific to the tissue type, which do not represent all potential binding sites in the human genome. This is one of the largest limitations of the available ChIP-Seq data.

1.3.4 Integration of WGS, RNA-Seq, and ChIP-Seq

In order to predict how cis-regulatory mutations alter gene expression in B cell lymphoma, all three sequencing types were needed. The WGS data allowed the identification of somatic mutations, and ChIP-Seq data was used to identify the mutations from WGS that overlapped cis-regulatory regions. Finally, the RNA-Seq data was used to observe the expression changes associated with these cis-regulatory mutations.

1.4 Resources and Datasets

The ENCODE project is a rich resource for annotating cis-regulatory regions through its large number of ChIP-Seq datasets [15]. ENCODE's ChIP-Seq datasets do not represent the whole regulatory landscape, thus combining them with other ChIP-Seq datasets, such as those in PAZAR, was beneficial [59]. This provides us with a remarkable resource to make an initial exploration into the dysregulation of genes due to disrupted TFBSs in human cancers, where little information has been gathered.
By performing computational analysis to identify TF binding profiles within the experimental data, we can predict TFBS locations with a high level of confidence.

In this study we focused on B cell lymphoma, because it was the only cancer type with publicly available samples that had the required WGS and RNA-Seq data. Two B cell lymphoma cohorts were collected: Cohort 1 was obtained from Morin et al. (2013) [49] and Cohort 2 was downloaded from the ICGC data portal (https://dcc.icgc.org/projects/MALY-DE). The two cohorts together contain 84 trios of WGS for tumour and matched normal genomes, and RNA-Seq for tumour samples.

Table 1.4: Signature translocations and mutated tumour suppressor genes for B cell lymphoma.

Type of Lymphoma   Translocation        Frequency (%)   Mutated Tumour Suppressor Gene   Frequency (%)
BL                 MYC-IgH or MYC-IgL   100             TP53                             40
                                                        RB2                              20-80
DLBCL              MYC-IgH or MYC-IgL   15              CD95                             10-20
                   BCL2-IgH             15-30           ATM                              15
                                                        TP53                             25
PMBCL              -                    -               SOCS1                            40
FL                 BCL2-IgH             90              -                                -

1.4.1 B Cell Lymphoma

Lymphoma develops in the lymph nodes, which are enriched with B and T cells involved in the adaptive immune system. B lymphocytes are the origin of 95% of all lymphomas and are characterized by a B cell receptor (BCR) on the cell's outer surface [37]. The receptors are antibodies that bind to antigens and eliminate foreign or harmful molecules; BCRs are composed of immunoglobulin (Ig) polypeptides. Within the structure called the germinal centre (GC), an antigen-activated B cell goes through clonal expansion, during which the Ig genes acquire hypermutations. Only mutations that increase the binding affinity to antigens are kept. The majority of lymphomas are derived from GC or post-GC cells.

The categorization of B cell lymphomas is based on histology, expression signatures of gene markers, and translocations between an Ig gene and an oncogene. The oncogenes involved in these translocations are controlled by the regulation of the Ig gene, leading to their dysregulation.
We will focus on the following lymphomas: Burkitt lymphoma (BL), diffuse large B cell lymphoma (DLBCL), primary mediastinal large B cell lymphoma (PMBCL), and follicular lymphoma (FL).

The Ig translocations that characterize these lymphomas and the frequently mutated tumour suppressor genes are summarized in Table 1.4 [37]. For example, BL is associated with a translocation between IgH, one of the Ig loci, and the oncogene MYC. The Ig-MYC translocation is insufficient to drive BL by itself; therefore, Richter et al. (2012) identified additional recurrent mutations [63]. The recurrent inactivation of the ID3 gene suggested cooperation with the Ig-MYC translocation to drive BL [63].

1.5 Experimental Aims

We set out to demonstrate the importance of including genomically altered cis-regulatory elements when interpreting mutational landscapes of cancers. Four experimental aims were proposed that explored and interpreted the cis-regulatory domain:

1. Determine the mutational landscape based on the SNVs and indels identified from the B cell lymphoma genomes (Ch. 3.1).
2. Compare the mutation rates between the PC and cis-regulatory regions (Ch. 3.2).
3. Investigate whether SNVs and indels found within a larger dataset of B cell lymphoma genomes show a similar enrichment in the promoter regions (Ch. 3.3).
4. Identify disruptive cis-regulatory mutations associated with gene expression changes in biological networks. Determine whether combining PC and cis-regulatory mutations can yield new prevalently mutated cancer-related genes (Ch. 3.4).

Exploring these aims was only made possible by the combination of the reduced cost of WGS, newly available ChIP-Seq data from ENCODE, and more efficient mutation callers that can cover the whole genome.

The thesis is organized as follows. The Experimental Approach chapter (Ch. 2) explains the details of the two cohorts of B cell lymphoma samples and provides an in-depth description of the procedures that were performed, for reproducibility.
Next, the Results chapter (Ch. 3) covers the outcomes of the investigation of the four experimental aims proposed above. This is followed by a discussion of the results (Ch. 4), and the thesis ends with some concluding remarks (Ch. 5).

Chapter 2
Experimental Approach

The workflow (Figure 2.1) illustrates the procedures taken to answer our proposed experimental aims (Ch. 1.5) and details who contributed to each step. Four main intermediate analyses were performed: identification of somatic mutations, identification of mutations overlapping TFBS regions, identification of mutations overlapping PC regions, and Xseq. The following sections describe the datasets and methodology in detail so that the results and figures can be replicated.

2.1 Datasets

2.1.1 B Cell Lymphoma

Aligned sequence data for WGS and RNA-Seq were available from the Morin et al. (2013) publication for 40 DLBCL samples, which we refer to as Cohort 1 [49]. There were an additional 52 DLBCL samples with RNA-Seq. The gene expression levels were generated for Cohort 1 from the RNA-Seq raw data using the Bioconductor [23] packages Rsamtools and GenomicFeatures [39].

Cohort 2 was retrieved from the ICGC data portal (https://dcc.icgc.org/projects/MALY-DE) and consisted of 44 B cell lymphomas: 14 BL, 15 DLBCL, 14 FL and 1 PMBCL. The sequence data for WGS and RNA-Seq pertaining to these 44 B cell lymphoma samples were not available, but processed sets of mutations and normalized RNA-Seq expression levels were available.

[Figure 2.1: workflow diagram. Datasets: WGS (normal and tumour), RNA-Seq (tumour), ChIP-Seq. Intermediate analyses: identification of somatic mutations (SNVs, indels); mutations overlapping TFBS regions and whether the TFBS is disrupted; mutations overlapping PC regions; Xseq. Experimental aims: mutational landscape; comparison of mutation rate between PC and TFBS; frequently mutated promoters; dysregulated genes linked to nearby disrupted TFBS. Contribution legend: Calvin Lefebvre, Anthony Mathelier.]

Figure 2.1: Workflow of the analyses performed to address our experimental aims.
The workflow covers the initial data retrieval (outlined in blue) and the intermediate analyses (outlined in black) needed to answer our experimental aims (outlined in green). The fill colour of each box corresponds to the person who performed the task.

In order to have a uniform standard for the gene expression data between the two cohorts, additional post-processing steps were carried out, after which 20,446 genes remained in Cohort 1 and 19,816 in Cohort 2. Only genes with an official HGNC symbol [24] were considered, and genes with null expression over all the samples were removed.

The downloaded B cell lymphoma data is summarized in Table 2.1.

2.1.2 ChIP-Seq and TF Binding Profiles

As previously mentioned, we combined ChIP-Seq peak data from ENCODE and PAZAR for any type of human cell [15, 59], which resulted in a total of 477 human TF ChIP-Seq datasets. The TF binding profiles for 103 TFs were obtained from the JASPAR database [44].

Table 2.1: B cell lymphoma datasets. For Cohort 2, the RNA-Seq sample count and the downloaded file types apply to the cohort as a whole.

Cohort   B cell lymphoma subtype   WGS samples   RNA-Seq samples   WGS file type                  RNA-Seq file type
1        DLBCL                     40            92                bam files                      bam files
2        BL                        14            54                calculated somatic mutations   calculated gene expression levels
2        DLBCL                     15
2        FL                        14
2        PMBCL                     1

2.2 Methodology

2.2.1 Identification of Somatic Mutations

SNV Predictions

SNVs were identified in Cohort 1 samples using a modified version of the MutationSeq tool [18] (http://compbio.bccrc.ca). The modified version of MutationSeq used features to train a random forest to predict the probability of SNVs. SNVs with a probability < 0.9 were filtered out.

Indel Predictions

We used Dindel to call small indels for Cohort 1 samples, where the Bayesian model focused on indels of 1-50 nt in size [2]. This range was acceptable for small indels, since indels > 50 nt are more challenging to map during read alignment and the majority of small indels are within 1-10 nt [53]. Following the recommendations in Dindel's manual, we provided Dindel with the BAM file and the set of candidate indels obtained from Pindel for each sample [79].
Figure 2.2 shows the workflow for identifying the somatic indels from WGS data.

Dindel and Pindel use different approaches when predicting indels. Dindel clusters potential mutations into small regions, and classifies candidate haplotypes after local realignment of these small regions. Pindel uses a pattern growth approach, which is an algorithm for sequential pattern mining. Default parameters were used for Dindel. Pindel candidates were obtained using the default parameters, except for the insert size, which was provided by the CollectInsertSizeMetrics Picard subtool (http://broadinstitute.github.io/picard).

Identifying the specific location of an indel within a homopolymer or tandem repeat was challenging, which affected the ability to properly label an indel as somatic or germline (Figure 2.3). Therefore, we labeled indels as germline mutations if the distance of the repeating region from the start position of the indel in the tumour sample was longer than the distance to the closest indel in the normal sample. The repetitive sequences considered were any combination of base pairs between 1 and 6 bp in length. This was filter 3 in our indel workflow (Figure 2.2).

Four other filters were applied. Firstly, we removed indels that were identified as germline mutations (filter 1) or that were supported by fewer than 3 reads (filter 2). Additionally, we only considered indels with a Dindel quality score ≥ 10, which was equivalent to 90% confidence (filter 5). All variants reported in dbSNP and the 1000 Genomes Project were filtered out (filter 4) [14, 67].

[Figure 2.2: workflow diagram with the following boxes. WGS (tumour and normal); Pindel: predicted indels; Dindel Stage 1: generated indel candidate calls; combined into a super set of indel candidates; Dindel Stage 2-4: predicted indels; Formatting: only allowed 1 indel prediction per line; Filter 1: removed germline mutations; Filter 2: removed if the number of reads supporting the indel < 3; Filter 3: removed false somatic indels that resulted from repetitive regions; Filter 4: removed indels overlapping known variants (1000 Genomes and dbSNP); Filter 5: Dindel quality score >= 10.]

Figure 2.2: Workflow for the prediction of somatic indels. The workflow takes WGS data from both the normal and tumour samples and returns somatic indels by running Pindel and Dindel and applying filters.

[Figure 2.3: schematic of sequenced reads from WGS for the tumour and normal samples aligned to a reference sequence containing a repetitive region (GAT repeats), with a deletion and an SNV (altered nucleotide A) annotated.]

Figure 2.3: Homopolymers or tandem repeats cause improper germline and somatic labeling of mutations. The sequenced reads from WGS can be seen for the tumour sample (top row) and normal sample (middle row) with respect to the reference sequence (bottom row). The observed deletion is incorrectly classified as a somatic mutation, since the reads supporting the deletion are within a repetitive region and the exact location of the deletion is unclear.

2.2.2 Mutations Overlapping TFBS Regions

Generating PWMs

PWMs were generated by first using experimentally validated TFBSs (Figure 2.4 A) to calculate a position frequency matrix (PFM). The PFM represented the frequency of all nucleotides at each position (Figure 2.4 B). The PFM was converted to a PWM by converting the PFM frequency values to a normalized log scale (Figure 2.4 C). The log-likelihood ratio values, PWM_{n,i}, are calculated as

\[ \mathrm{PWM}_{n,i} = \log_2\!\left(\frac{p(n,i)}{p(n)}\right) \qquad (2.1) \]

where

\[ p(n,i) = \frac{\mathrm{PFM}_{n,i} + s(n)}{N + \sum_{n' \in \{A,C,G,T\}} s(n')} \qquad (2.2) \]

[75]. Here n, i and N represent the nucleotide, the ith position in the TFBS and the number of sites, respectively. The background probability of observing nucleotide n is denoted p(n). PFM_{n,i} are the values in the PFM, i.e. the number of sites where nucleotide n was observed at position i.
To reduce bias due to small sample sizes and to eliminate possible undefined values before the log transformation, the pseudocount values, s(n), were added. There are a variety of methods for calculating the pseudocount; we defined it as the square root of the number of sites (√N) [54, 75].

The PFM can be represented as a sequence logo that stacks the nucleotide symbols (A, C, G, T) at each position in the TFBS, where the symbol size indicates the information content and the stack size corresponds to the conservation (Figure 2.5 A). Finally, summing the PWM values at each position of a given sequence generates the TF binding affinity score (Figure 2.5 B). Formally, the binding affinity score for a given sequence is

\[ \mathrm{sequence} = n_1 n_2 n_3 \ldots n_m, \qquad \mathrm{bindingScore}_{n_1 n_2 n_3 \ldots n_m} = \sum_{1 \le j \le m} \mathrm{PWM}_{n_j, j} \qquad (2.3) \]

where m is the length of the sequence, j is the position index within the TFBS, and n_j ∈ {A, C, G, T} for 1 ≤ j ≤ m.

Predicting TFBS Regions

TFBS predictions were obtained by sliding the PWMs for the TFs across the whole length of the ChIP-Seq peaks from ENCODE and PAZAR in 1 bp increments to score each possible binding site [15, 59, 75]. Binding sites with a PWM score ≥ 85% were classified as predicted TFBSs, which was the same threshold used in previous studies [38, 41].

While scanning to calculate PWM scores, both strands were considered to identify optimal hits. The PWMs were generated from the TF binding profiles from JASPAR (Ch. 1) [44, 75]. The ChIP-Seq peaks were downloaded from ENCODE and PAZAR [15, 59]. Overall, 9,510,001 distinct TF/TFBS pairs were predicted, covering 76,160,599 bp.

Identifying Disrupted TFBSs

In order to prioritize cis-regulatory variants, we focused on SNVs and indels predicted to disrupt TF-DNA binding affinity. For all mutations overlapping predicted TFBSs, we computed the PWM scores throughout the affected region.
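The PWM construction and scoring of Equations 2.1 to 2.3 can be sketched in Python as follows, using the HNF4 counts from Figure 2.4 B. Several details are assumptions made for illustration rather than the thesis implementation: a uniform background p(n) = 0.25; a pseudocount distributed as s(n) = √N · p(n), which appears to approximately reproduce the values in Figure 2.4 C; a relative score defined as (score − min) / (max − min) for the 85% threshold; and forward-strand scanning only. All function names are hypothetical.

```python
import math

BASES = "ACGT"

# Counts per nucleotide at the 13 positions of the HNF4 example (Figure 2.4 B).
PFM = {
    "A": [4, 0, 2, 0, 0, 7, 10, 10, 0, 0, 0, 1, 6],
    "C": [4, 0, 0, 3, 7, 1, 0, 0, 0, 0, 4, 7, 1],
    "G": [2, 9, 6, 5, 1, 1, 0, 0, 10, 6, 1, 1, 2],
    "T": [0, 1, 2, 2, 2, 1, 0, 0, 0, 4, 5, 1, 1],
}


def pfm_to_pwm(pfm, background=None):
    """Equations 2.1 and 2.2: log2 ratio of smoothed frequency to background."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    width = len(pfm["A"])
    n_sites = sum(pfm[b][0] for b in BASES)  # N, the number of sites
    # Assumption: s(n) = sqrt(N) * p(n), so the pseudocounts sum to sqrt(N).
    pseudo = {b: math.sqrt(n_sites) * background[b] for b in BASES}
    total = n_sites + sum(pseudo.values())
    return {
        b: [math.log2(((pfm[b][i] + pseudo[b]) / total) / background[b])
            for i in range(width)]
        for b in BASES
    }


def binding_score(pwm, seq):
    """Equation 2.3: sum of the PWM values along the sequence."""
    return sum(pwm[base][j] for j, base in enumerate(seq))


def relative_score(pwm, seq):
    """Assumed normalization: 0 for the worst possible site, 1 for the best."""
    width = len(seq)
    best = sum(max(pwm[b][j] for b in BASES) for j in range(width))
    worst = sum(min(pwm[b][j] for b in BASES) for j in range(width))
    return (binding_score(pwm, seq) - worst) / (best - worst)


def scan(pwm, sequence, threshold=0.85):
    """Slide the PWM in 1 bp steps (forward strand only) and keep hits."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        rel = relative_score(pwm, window)
        if rel >= threshold:
            hits.append((start, window, rel))
    return hits


pwm = pfm_to_pwm(PFM)
print(round(binding_score(pwm, "GGATCAAAGTCAT"), 2))
print(scan(pwm, "TTAGGGCAAAGGTCATT"))
```

Under these assumptions, the score printed for GGATCAAAGTCAT should be close to the 7.83 shown in Figure 2.5 B; small differences are expected because the figure sums values rounded to two decimals. A full implementation would also score the reverse complement of each window, as described in the text.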
The highest PWM score was kept, representing the strongest TF-DNA binding in the mutated sequence. The start and end positions of the region that was re-evaluated for PWM scores depended on where the mutation was located in the TFBS and on the type of mutation. The lengths of the affected TFBS regions are defined in Table 2.2, where n is the length of the PWM of interest and i is the length of the insertion. If the highest PWM score for the affected region was ≤ 80% of the maximum binding score, then the TFBS mutation was considered to be deleterious [4].

A mutation has the ability to either create or destroy a TFBS, but we only focused on the destruction of TFBSs. Identifying and interpreting the creation of a TFBS is a challenging task. A mutation that forms a prominent TFBS could occur anywhere within the whole genome, and it is hard to determine whether it would have any effect on gene expression without performing validation experiments. In contrast, when analyzing the destruction of a TFBS we had a predetermined location of known TFBSs.

[Figure 2.4, panel A: Experimentally Validated HNF4 Binding Sites]

Site   TFBS positions 1-13
1      G G G T C A A A G G C A C
2      A G A C C A A A G T C C G
3      A G G T C C A A G G G C A
4      C T A G C A A A G G T T A
5      A G T G G T A A G G T C G
6      C G G G C A A A G T T C T
7      A G T C C A A A G T T C A
8      G G G C T G A A G T C C A
9      C G G G C A A A G G C C A
10     C G G G T A A A G G T G A

[Figure 2.4, panel B: PFM]

       1   2   3   4   5   6   7   8   9   10  11  12  13
A      4   0   2   0   0   7   10  10  0   0   0   1   6
C      4   0   0   3   7   1   0   0   0   0   4   7   1
G      2   9   6   5   1   1   0   0   10  6   1   1   2
T      0   1   2   2   2   1   0   0   0   4   5   1   1

[Figure 2.4, panel C: PWM]

       1      2      3      4      5      6      7      8      9      10     11     12     13
A      0.54   -2.04  -0.23  -2.04  -2.04  1.24   1.71   1.71   -2.04  -2.04  -2.04  -0.87  1.04
C      0.54   -2.04  -2.04  0.2    1.24   -0.87  -2.04  -2.04  -2.04  -2.04  0.54   1.24   -0.87
G      -0.23  1.57   1.04   0.81   -0.87  -0.87  -2.04  -2.04  1.71   1.04   -0.87  -0.87  -0.23
T      -2.04  -0.87  -0.23  -0.23  -0.23  -0.87  -2.04  -2.04  -2.04  0.54   0.81   -0.87  -0.87

Figure 2.4: Generating PWMs. (A) We represented the first step as a matrix of experimentally validated data gathered from the literature (ENCODE and PAZAR), using HNF4 as an example. The x-axis consisted of the different sites known to bind and the y-axis showed the position within the TFBS. (B) Using the matrix created in (A), we created a PFM that represents the count of each nucleotide type at each position in the TFBS. (C) The PFM was converted to a PWM using Equation 2.1 and Equation 2.2. PWM values are normalized, log-scaled PFM values.

[Figure 2.5, panel A: Sequence Logo (x-axis: TFBS positions)]

[Figure 2.5, panel B: Binding Affinity Scores, for the sequence GGATCAAAGTCAT]

\[ \text{Binding Score} = \sum_{1 \le i \le 13} \mathrm{PWM}_{n,i} = \mathrm{PWM}_{G,1} + \mathrm{PWM}_{G,2} + \mathrm{PWM}_{A,3} + \ldots + \mathrm{PWM}_{T,13}, \qquad n \in \{A,C,G,T\} \]
\[ = (-0.23) + 1.57 + (-0.23) + (-0.23) + 1.24 + 1.24 + 1.71 + 1.71 + 1.71 + 0.54 + 0.54 + (-0.87) + (-0.87) = 7.83 \]

Figure 2.5: Predicting binding affinity scores for TFBSs using a PWM. (A) A sequence logo is another way to represent the PFM (Figure 2.4 B) that stacks the nucleotide symbols (A, C, G, T) at each position in the TFBS, where the symbol size indicates the information content and the stack size corresponds to the conservation. (B) Using the PWM, we predicted the binding score for a given TFBS. This was accomplished by adding the PWM values at each position for the nucleotide of the TFBS (Equation 2.3).

Table 2.2: Length to consider when re-evaluating the binding affinity for the altered TFBS. n is defined as the length of the PWM of interest and i is the length of the insertion.

Mutation Type   Length of affected region
SNV             2n - 1
Insertion       2n - 2 + i
Deletion        2(n - 1)

2.2.3 Mutations Overlapping PC Regions

PC regions were defined by the exonic positions (transcript accessions starting with NM_) from the UCSC hg19 Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) by selecting the known gene table from the RefSeq gene track.
Overlapping mutations were identified using these defined PC regions.

2.2.4 Mutation Rate

Mutation rates within TFBSs were computed by dividing the number of SNVs or indels lying within TFBSs by the total number of nucleotides within TFBSs. The TFBS region consists of 76,160,599 bp and was predicted within ChIP-Seq peak regions as described above. A similar computation was performed for PC exons using the exonic start and end positions from RefSeq. The PC region covered 65,469,364 bp.

To construct genomes with randomly distributed mutations, we used the shuffle subcommand from bedtools to randomly shuffle mutations from Cohort 1 or Cohort 2 [60]. 1,000 genomes were generated for each cohort. For each genome, we calculated the mutation rates expected by chance in the TFBSs and exons using these randomly positioned mutations.

When computing local mutation rates surrounding TFBSs and exons, we obtained flanking regions using the flank and subtract subcommands from bedtools [60]. This extracted 1 kb upstream and 1 kb downstream of the TFBSs (resp. exons). We then filtered out sequences overlapping TFBSs (resp. exons).

2.2.5 Xseq

To assess the impact of a TFBS mutation on gene expression, we used Xseq [19]. The Xseq analyses were performed by applying the following steps:

1. Preprocessing - Identify mutated genes

(a) Mutations lying within TFBSs with impact on gene expression
Mutations lying within TFBSs were obtained using the method described in Ch. 2.2.2. The closest gene to each mutation was obtained using the set of TSSs of known RefSeq genes from UCSC (http://genome.ucsc.edu/cgi-bin/hgTables). When finding the closest gene, we considered the start and end positions of the mutation and the start positions of all PC TSSs. Only mutations lying within TFBSs with potential impact on gene expression were considered. Namely, we required the closest gene to the corresponding mutation to be either up- or down-regulated in the sample of interest. To determine whether a gene was dysregulated, we considered the distribution of its expression in the cancer samples and required the expression in the corresponding sample to be ≥ µ + 1σ or ≤ µ − 1σ, where µ and σ represent the mean and standard deviation of the distribution of expression values.

(b) Mutations lying within PC exons
All mutations lying within a PC exon were considered. Namely, SnpEff was used to extract the mutations overlapping PC regions and their predicted impact on the protein [13]. Table A.1 provides the list of mutation impacts that were considered in the analysis.

2. Xseq analysis
All genes obtained from the previous step were used as the input to the Xseq tool. Xseq is a probabilistic model which aims to encode the impact of somatic mutations on gene expression profiles. The model is a generative hierarchical Bayes approach, which uses three observed quantities as inputs: a patient-gene expression matrix, a patient-gene mutation matrix and a graph containing known interactions between genes (e.g. from pathway databases). The model has two key unobserved random variables which constitute the output: D_g is a Bernoulli random variable where D_g = 1 indicates that gene g influences expression when mutated; F_g^p | D_g is a Bernoulli random variable where F_g^p indicates that mutated gene g influences expression in patient p. As such, we model expression influence at two levels: over the patient population, and at the level of individual mutations in individual patients. Random variables are estimated using the belief propagation algorithm, with outputs consisting of two relevant probabilities: Pr(D_g) and Pr(F_g^p). Software implemented as an R package encoding Xseq is available from http://compbio.bccrc.ca. By considering the disruption likelihoods of each "mutated" gene and its neighbours in biological networks, Xseq computes the probability that a "mutated" gene is dysregulated and causes a cascading dysregulation effect on its neighbours.

3. Post-processing
Xseq provided the probability of each input gene being impactful in the specific samples where it was "mutated" (single-sample probability, Pr(F_g^p)), as well as the mutated gene's impact when considering all samples (all-samples probability, Pr(D_g)). Potential false positives from Xseq were produced when a gene was only mutated in a single sample, because there was minimal information to calculate Pr(D_g). By plotting a histogram of all the Pr(D_g) values (Figure A.1), a distinct peak was observed, formed by a large number of these false positives. These distinct peaks were used as thresholds; genes were required to have Pr(D_g) ≥ 0.5 in the Cohort 1 dataset and ≥ 0.8 in the Cohort 2 dataset to be considered in our analyses. Furthermore, we required the Pr(F_g^p) of a gene to be ≥ 0.5 and the gene to be predicted in at least two samples in order to be considered in Figure 3.9 and Figure 3.10.

2.2.6 Frequently Targeted Promoter Regions

Frequently targeted regions were defined for this project as 1 kb windows in the genome that had at least 3 mutations. These were identified by sliding a 1 kb window across the human genome in increments of 500 bp. False positives among frequently mutated regions could occur in large repetitive regions, where multiple sequencing errors can occur. The frequently mutated TFBS regions for the two cohorts were visualized as circos plots (Circos tool version 0.64 [36]), as seen in Figure 3.3, Figure 3.4, and Figure 3.5.

The closest TSS of a PC gene was identified for each mutation lying within a frequently targeted promoter region. Promoters were considered to be the 2 kb region extending from the TSS of PC genes; the genomic locations of TSSs were retrieved from RefSeq. Additionally, the genes that had a frequently mutated promoter region were displayed in the circos plots of Figure 3.3, Figure 3.4, and Figure 3.5.

2.2.7 Functional Enrichment Analysis

Functional enrichment analyses were performed with the Enrichr tool [12].
A threshold of ≤ 0.05 was applied to the adjusted p-values generated by Enrichr for each pathway. Visualizations of the networks of enriched pathways were constructed manually using Cytoscape 3.0.1 [66].

The functional enrichment analysis illustrated in Figure 3.7, Figure 3.7, and Figure 3.8 used the list of genes provided in Figure 3.3, Figure 3.4, and Figure 3.5. The functional enrichment analysis in Figure 3.11, Figure 3.12, and Figure 3.13 was computed using the genes predicted by Xseq (Figure 3.9 and Figure 3.10), along with their biological network neighbours with altered expression (i.e. predicted by Xseq to have a higher probability of being up- or down-regulated than of being neutral).

Chapter 3
Results

3.1 Mutational Landscape for B Cell Lymphomas

In aggregate, we observed 406,611 SNVs (from 146 to 31,874 per sample; mean = 10,165, median = 7,821, and standard deviation (sd) = 6,995) and 15,739 indels (from 65 to 4,810 per sample; mean = 393, median = 222, and sd = 735) in samples from Cohort 1. In comparison, there were 282,636 SNVs (from 1,242 to 37,987 per sample; mean = 6,424, median = 3,577, and sd = 7,165) and 8,080 indels (from 67 to 871 per sample; mean = 184, median = 136, and sd = 142) in samples from Cohort 2 (Table 3.1).

Table 3.1: The total number of mutations for each cohort with statistical measures.

Cohort     Mutation type   Total     Min     Max      Mean     Median   Standard deviation
Cohort 1   SNV             406,611   146     31,874   10,165   7,821    6,995
Cohort 1   Indel           15,739    65      4,810    393      222      735
Cohort 2   SNV             282,636   1,242   37,987   6,424    3,577    7,165
Cohort 2   Indel           8,080     67      871      184      136      142

The two cohorts have similar distribution patterns for the number of mutations per sample, as seen in Figure 3.1. In particular, the distributions of SNVs for DLBCL samples were very similar between the two cohorts: both had a maximum value > 30,000 SNVs and a mean of ∼10,000 (10,165 for Cohort 1 and 11,729 for Cohort 2).
B cell lymphoma subtypes clustered together within the distribution for Cohort 2, where BLs had the lowest, FLs an intermediate and DLBCLs the highest number of mutations per sample (Figure 3.1 B).

3.2 Comparing Mutation Rate Between PC and Cis-Regulatory Regions

As previously stated, there is a vast amount of information on the mutational landscape of PC regions, but minimal information related to TFBSs. PC mutation rates are known to differ between tumour types. Mutation rates indicate how important it is for a region to remain unaltered in order for a cell to function normally. Intergenic regions acquire approximately twice the number of mutations as the PC region, because altering non-functional positions has no consequence [76]. What is known about the mutation rate in PC regions can therefore be used to form hypotheses about the importance of preserving the integrity of TFBSs. A lower mutation rate for TFBS regions compared to PC regions may imply that mutations in TFBSs have more severe consequences. In contrast, a higher mutation rate for TFBSs implies that TFBSs are more tolerant of mutations.

The defined PC region covers 65,469,364 bp (2.2% of the human genome) and 76,160,599 bp were predicted to be within a TFBS (2.5% of the human genome) (Ch. 2.2.4). Both the TFBS and PC regions cover ∼2% of the human genome with only minor overlap, consisting of 4,189,498 bp (6.4% of the PC region and 5.5% of the TFBS region).

Of the 422,350 mutations predicted, 8,184 (2.0%) overlap predicted TFBS positions in Cohort 1. Similarly, in Cohort 2, 6,608 (2.3%) of the 290,716 total mutations overlapped TFBSs. With respect to the number of mutations overlapping PC exons, Cohort 1 had 4,990 (1.2%) and Cohort 2 had 5,098 (1.8%).

39 samples in Cohort 1 (97%) and 25 samples in Cohort 2 (57%) had higher TFBS mutation rates relative to PC regions (Figure 3.2).
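The mutation-rate definition of Ch. 2.2.4 can be illustrated with the cohort-level counts quoted above. This is only a sketch: it pools SNVs and indels over each whole cohort, whereas Figure 3.2 reports per-sample SNV rates, so the numbers it prints are not the values plotted there. The variable and function names are hypothetical.

```python
# Region sizes in bp, as reported in the text.
TFBS_BP = 76_160_599
PC_BP = 65_469_364

# Cohort-level mutation counts quoted in the text (SNVs and indels pooled).
COHORTS = {
    "Cohort 1": {"total": 422_350, "tfbs": 8_184, "pc": 4_990},
    "Cohort 2": {"total": 290_716, "tfbs": 6_608, "pc": 5_098},
}


def summarize(counts):
    """Return the share of mutations in each region and per-bp mutation rates."""
    tfbs_rate = counts["tfbs"] / TFBS_BP  # mutations per TFBS base pair
    pc_rate = counts["pc"] / PC_BP        # mutations per PC base pair
    return {
        "tfbs_pct": 100 * counts["tfbs"] / counts["total"],
        "pc_pct": 100 * counts["pc"] / counts["total"],
        "tfbs_rate": tfbs_rate,
        "pc_rate": pc_rate,
        "ratio": tfbs_rate / pc_rate,
    }


for name, counts in COHORTS.items():
    s = summarize(counts)
    print(f"{name}: TFBS {s['tfbs_pct']:.2f}% of mutations, "
          f"PC {s['pc_pct']:.2f}%, TFBS/PC rate ratio {s['ratio']:.2f}")
```

Dividing by the region size matters here because the TFBS region is larger than the PC region, so a higher raw count in TFBSs does not by itself imply a higher rate.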
Interestingly, the histological subtypes of B cell lymphoma in Cohort 2 displayed different patterns for the prevalence of mutations within TFBS and PC regions. Higher mutation rates for TFBS regions were observed for 10 out of 15 DLBCL samples and 9 out of 14 FL samples, whereas PC regions had a higher mutation rate for 9 out of 14 BL samples (Figure 3.2). The observed higher mutation rate in TFBSs than in exons was expected by chance; 540 (resp. 616) out of 1,000 genomes with randomly distributed mutations in Cohort 1 (resp. Cohort 2) presented the same characteristic. Interpreting the pattern for the indel mutation rate was more difficult due to its more random distribution (Figure A.2).

[Figure 3.1: bar plots of the number of mutations (y-axis, 0 to 30,000) per sample (x-axis) for Cohort 1 (panel A) and Cohort 2 (panel B). Legend: mutation type (indel, SNV); tumour subtype (BL, DLBCL, PMBCL, FL).]

Figure 3.1: Distribution of the number of mutations per sample in Cohort 1 (A) and Cohort 2 (B). The number of SNVs (red) and indels (blue) are plotted on the y-axis for the corresponding samples. The samples are ordered by increasing number of mutations (from left to right). The x-axis sample names are colour coded by tumour subtype.

To determine whether the regions surrounding TFBS or PC regions had similar mutation rates, we investigated the local mutation rate of flanking regions. A local change in mutation rate is another indication of regions that have different tolerances for mutations.
Overall, TFBSs were less mutated than their flanking regions in both cohorts. Specifically, 38 out of 40 samples in Cohort 1 and 37 out of 44 samples in Cohort 2 had lower SNV mutation rates in TFBSs compared to flanking regions. By comparison, 30 samples in Cohort 1 but only 3 in Cohort 2 had lower SNV mutation rates in PC regions than in their flanking regions. Indels were randomly distributed throughout the TFBS, PC and their flanking regions.

These results highlight that the predicted TFBSs generally have a higher mutation rate than PC regions, but a lower mutation rate than their surrounding regions.

3.3 Frequency of Mutations in Promoter Regions

The distribution of somatic mutations across the human genome is known to be non-uniform [25, 58]. Some genomic regions harbour more mutations than others; for instance, promoters are hypermutated in DLBCL samples, and hypermutation is characteristic of B cell lymphomas [34, 49]. As anticipated, a strong enrichment was observed for predicted TFBSs in the promoter regions of PC genes. Promoters overlapped TFBSs 5 times more than expected by chance (11,073,418 bp from TFBSs overlapped promoters, where promoters covered 85,2996,239 bp). The accumulation of mutations within promoters is the most plausible indication of gene expression dysregulation, since the general TFs and the RNA polymerase are recruited within the promoter regions. We further investigated the distribution of mutations overlapping cis-regulatory elements, specifically promoters, in our two cohorts of B cell lymphomas. To accomplish this, we identified frequently targeted TFBS regions within the human genome by sliding a 1 kb window across the whole genome. Windows that have at least 3 TFBS mutations were labeled as frequently mutated TFBS regions (Ch. 2.2.6).

Figure 3.2: Comparison of the SNV mutation rates for cis-regulatory and PC regions. Only SNVs from Cohort 1 (A) and Cohort 2 (B) were considered (for consideration of indels see Figure A.2). TFBS (y-axis) and PC (x-axis) mutation rates are plotted for all the samples. Each sample is represented by a triangle and is colour-coded depending on the tumour subtype (BL, DLBCL, PMBCL, FL). The identity function (dashed grey lines) and linear regressions with a 95% confidence region (blue lines, dark grey areas, and equation) were computed.

Our findings support the conclusion of previous publications: promoter regions of B cell lymphomas were significantly enriched with mutations (Figure 3.3 and Figure 3.4) [34, 49]. Specifically, Cohort 1 contained 135 mutations within frequently targeted TFBS regions that overlapped promoter regions, which represented 49% of the 273 somatic mutations found within frequently mutated TFBS windows throughout the whole genome (p-value = 1.16 × 10^-75, hypergeometric test with 680 mutations overlapping TFBSs in promoters out of 8,185 in TFBSs). In comparison, Cohort 2 had a larger number of somatic mutations within these frequently targeted promoters, with 348 TFBS mutations that represented 65% of the total 534 mutations within frequently mutated TFBS regions (p-value = 3.28 × 10^-156, hypergeometric test with 1,102 mutations overlapping TFBSs in promoters out of 6,608 in TFBSs).

A list was generated to further investigate which genes were affected by these frequently mutated promoters, as seen in Figure 3.3 and Figure 3.4. The frequently mutated promoter regions had 15 and 39 closest genes in Cohort 1 and Cohort 2, respectively. 12 of these genes were found in both cohorts, including 5 classified oncogenes from the Cancer Gene Census (BCL2, BCL6, BCL7A, CD74, and CIITA) [21].
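
The window labelling and the enrichment test described in this section can be sketched as follows. This is an illustration under assumptions, not the pipeline used in the thesis: the step size of the sliding window is not stated here, so each window is anchored at a mutation position, and the hypergeometric parameterization (population = all TFBS mutations, successes = TFBS mutations in promoters, draws = mutations in frequently mutated windows) is one reading of the test reported above. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def frequently_mutated_windows(positions, window=1000, min_mutations=3):
    """Return (start, end, count) for windows holding >= min_mutations.

    `positions` are TFBS mutation coordinates on one chromosome. Each window
    starts at a mutation and spans `window` bp (an assumption, see lead-in).
    """
    pos = sorted(positions)
    hits = []
    right = 0
    for left, start in enumerate(pos):
        # Advance `right` to the first mutation outside [start, start + window).
        while right < len(pos) and pos[right] < start + window:
            right += 1
        count = right - left
        if count >= min_mutations:
            hits.append((start, start + window, count))
    return hits


def hypergeom_upper_tail(k, population, successes, draws):
    """Exact P(X >= k) for a hypergeometric variable, using integer arithmetic."""
    upper = min(draws, successes)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return float(Fraction(numerator, comb(population, draws)))


if __name__ == "__main__":
    # Toy coordinates: only the first three fall within one 1 kb window.
    print(frequently_mutated_windows([100, 300, 900, 5000, 5200]))
    # Cohort 1 counts as read from the text above (parameterization assumed).
    print(hypergeom_upper_tail(135, 8185, 680, 273))
```

Exact integer arithmetic is used for the tail probability because the values involved are far too small for a naive floating-point sum of binomial coefficients.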
Combining the complete list of TFBS mutations from both cohorts to identify frequently mutated TFBS regions resulted in 13 new genes (ARID2, BCL2L11, BZRAP1, EPS15, HIST1H2BG, ID3, IGLL5, IL21R, IRF1, KIAA0226L, NEDD9, RARS, and ZNF860) not identified as recurrently mutated in previous lymphoma studies (Figure 3.5) [34]. Of the 13 genes, 6 (ARID2, BCL2L11, EPS15, IL21R, NEDD9, and ZNF860) were mutated exclusively within their promoter regions, with no mutations identified within their exons.

We sought to show that the accumulation of promoter mutations implicates pathways known to be disrupted in cancer development. To demonstrate this, lists of genes with frequently mutated promoters were submitted to EnrichR [12] for pathway enrichment analyses, highlighted in Figure 3.6, Figure 3.7, and Figure 3.8. Genes associated with the enriched apoptotic pathways were BCL2, BCL2L11, BCL6, BIRC3, BTG1, CD74, FOXO1, IRF1, MYC, and PIM1. Furthermore, B cell and oncogenic related pathways (IL2RB pathway, IL-7 pathway, P53 pathway, Gleevec pathway, small cell lung cancer, lymphoma, and leukemia) were enriched, indicating that they are frequently targeted by promoter mutations that potentially disrupt cell functionality (Figure 3.6 and Figure 3.7).

Figure 3.3: Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters for Cohort 1. The inner grey circles of the circos plots correspond to histograms of the frequently mutated TFBS region. The y-axis range for the histograms is [0, 40]. The outer circles contain the closest gene names to the mutations in the considered frequently mutated TFBS region, if the mutation is at most 2 kb away from the TSS of the gene. Names highlighted in red correspond to genes shared between both cohorts.

Figure 3.4: Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters for Cohort 2. The inner grey circles of the circos plots correspond to histograms of the frequently mutated TFBS region. The y-axis range for the histograms is [0, 40]. The outer circles contain the closest gene names to the mutations in the considered frequently mutated TFBS region, if the mutation is at most 2 kb away from the TSS of the gene. Names highlighted in red correspond to genes shared between both cohorts.

Figure 3.5: Regions frequently targeted by somatic mutations overlapping cis-regulatory elements in promoters when combining cohorts. The inner grey circles of the circos plots correspond to histograms of the frequently mutated TFBS region. The y-axis range for the histograms is [0, 80]. The outer circles contain the closest gene names to the mutations in the considered frequently mutated TFBS region, if the mutation is at most 2 kb away from the TSS of the gene. Names highlighted in red correspond to genes shared between both cohorts, while names in blue correspond to genes identified only when using combined mutations from both cohorts.

3.4 Impact of Cis-Regulatory Mutations on Gene Expression

A significant part of this thesis is the ability to predict the impact of the cis-regulatory mutations on the expression of nearby genes, which is possible due to the availability of both RNA-Seq and WGS data.
To accomplish this task, we used a pre-existing probabilistic model called Xseq [19] to relate specific mutations to expression disruption in pathways (Ch. 2.2.5).

Xseq uses the knowledge of known biological pathways to assess the likely association of mutation presence with observed deviations from neutral expression measurements taken from the same tumour. The method takes as input a patient-gene expression matrix and a binary patient-gene mutation matrix, and outputs the probability that a) a mutated gene (over the whole patient population) impacts gene expression and b) a patient-specific mutated gene impacts expression in the patient.

Xseq was originally developed to consider genes harbouring mutations within PC exons only, but the model can be extended to highlight cis-regulatory mutations that potentially dysregulate transcription. To achieve this, we modified the definition of a "mutated" gene in the patient-gene mutation matrix, originally restricted to genes with PC mutations, to include the closest gene to a cis-regulatory mutation that displays an up- or down-regulation in the mutated sample compared to other tumour samples (Ch. 2.2.5). With the applied criteria, a TFBS was associated to a single gene, but a gene could be associated to several TFBSs (Ch. 2.2.5).

A total of 371 (resp. 58) genes were predicted by Xseq to harbour mutations with significant consequences relating to altered gene expression in Cohort 1 (resp. Cohort 2) from a total of 3,153 (resp. 2,237) PC or TFBS mutated genes. Of the 371 (resp. 58) predicted Xseq genes, 42 (resp. 52) were recurrently mutated across different samples (Figure 3.9 and Figure 3.10); only 27% (resp. 18%) of the total mutated genes considered by Xseq were recurrent. The sets of genes that Xseq considered from Cohort 1 (3,153 genes) and Cohort 2 (2,237 genes) shared 680 genes, of which only 7 (TGFBR2, SGK1, ZFP36L1, CCNG1, PHIP, ASCC3 and SIN3A) were predicted by Xseq in both cohorts. The mutation type composition for Xseq's predicted genes can be seen in Table 3.2.

Figure 3.6: Functional enrichment analysis of genes associated with frequently mutated regions in Cohort 1. EnrichR [12] functional enrichment analyses were performed on the sets of genes listed in Figure 3.3. The enriched pathways with the 20 lowest Bonferroni corrected p-values (node colour) are shown for each category by only considering Bonferroni corrected p-value < 0.05. Each node of the graphs represents an enriched pathway where the size of the node is proportional to the number of targeted genes. An edge between two nodes indicates that the pathways are sharing genes. A larger edge width represents a larger number of shared genes.

Figure 3.7: Functional enrichment analysis of genes associated with frequently mutated regions in Cohort 2. EnrichR [12] functional enrichment analyses were performed on the sets of genes listed in Figure 3.4. The enriched pathways with the 20 lowest Bonferroni corrected p-values (node colour) are shown for each category by only considering Bonferroni corrected p-value < 0.05. Each node of the graphs represents an enriched pathway where the size of the node is proportional to the number of targeted genes. An edge between two nodes indicates that the pathways are sharing genes. A larger edge width represents a larger number of shared genes.

Figure 3.8: Functional enrichment analysis of genes associated with frequently mutated regions for combined cohorts. EnrichR [12] functional enrichment analyses were performed on the sets of genes listed in Figure 3.3 and Figure 3.4 when considering the combination of the two datasets (Figure 3.5). The enriched pathways with the 20 lowest Bonferroni corrected p-values (node colour) are shown for each category by only considering Bonferroni corrected p-value < 0.05. Each node of the graphs represents an enriched pathway where the size of the node is proportional to the number of targeted genes. An edge between two nodes indicates that the pathways are sharing genes. A larger edge width represents a larger number of shared genes.

Table 3.2: The number of Xseq predicted genes that have at least one sample for each mutation composition.

Mutation Composition                              Cohort 1   Cohort 2
PC mutation                                       27         35
nearby mutated TFBS                               36         48
nearby disrupted mutated TFBS                     9          15
PC mutation and a nearby mutated TFBS             1          6
PC mutation and a nearby disrupted mutated TFBS   0          3

Some Xseq predicted genes were associated with frequently targeted promoters: 4 genes in Cohort 1 (HIST1H1B, RHOH, SGK1, and ZFP36L1) and 7 genes in Cohort 2 (BCL6, DUSP2, ID3, FOXO1, MYC, PIM1, and SGK1). As observed in the functional enrichment analyses for frequently targeted promoters, the predicted Xseq genes along with their dysregulated neighbours, defined by known biological interacting genes, were enriched for pathways related to cancer and cancer development (Figure 3.11, Figure 3.12 and Figure 3.13).

Ranking the Xseq predicted genes by their recurrence of dysregulation highlighted the known cancer driver genes MYC, TP53, ID3, and BCL6 (Figure 3.10). The gene most recurrently mutated with altered expression was MYC in Cohort 2, where it was the greatest observed separator of BL from any other type of B cell lymphoma. MYC was predicted by Xseq in 11 samples, 10 of which were BL. In all of the 11 samples, MYC was observed to be up-regulated (Figure A.3, Figure A.4, and Figure A.5), which agrees with the known oncogenic function of MYC in cancers [57].
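
The binary patient-gene representation and the recurrence ranking described above can be illustrated with a small sketch. It assumes that predictions are available as (sample, gene) pairs; the pairs below are toy values invented for illustration, not results from either cohort, and the helper names are hypothetical rather than part of the Xseq package.

```python
from collections import defaultdict


def binary_matrix(calls):
    """Build a binary sample-by-gene matrix (nested dict) from (sample, gene) pairs."""
    samples = sorted({s for s, _ in calls})
    genes = sorted({g for _, g in calls})
    matrix = {s: {g: 0 for g in genes} for s in samples}
    for sample, gene in calls:
        matrix[sample][gene] = 1
    return matrix


def rank_by_recurrence(matrix):
    """Rank genes by the number of samples in which they are flagged."""
    totals = defaultdict(int)
    for row in matrix.values():
        for gene, flag in row.items():
            totals[gene] += flag
    # Sort by descending count, then by name for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy calls (hypothetical sample and gene identifiers).
toy_calls = [
    ("S1", "GENE_A"), ("S2", "GENE_A"), ("S3", "GENE_A"),
    ("S1", "GENE_B"), ("S3", "GENE_B"),
    ("S2", "GENE_C"),
]
ranking = rank_by_recurrence(binary_matrix(toy_calls))
```

In this toy input the gene flagged in the most samples is ranked first, which mirrors how a recurrently predicted gene would rise to the top of such a list.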
From these 10 BL samples where MYC was mutated, 7 have MYC-IgH translocations (Figure 3.14). Of special interest is the ID3 gene, since it has recently been proposed that MYC translocation and disrupted ID3 expression cooperate to drive BL. The expression alteration of ID3 predicted by Xseq was associated with a mutation in a TFBS for sample SA320932 and with mutations within exons for the other two samples (SA321012 and SA320818) (Figure 3.10).

Figure 3.9: Cancer genes predicted by the Xseq tool from Cohort 1. Each row corresponds to a predicted gene and each column to a cancer sample. Colour boxes are drawn when a gene was predicted in a specific sample. The type of mutation associated to the gene is indicated by the box colours (legend: PC, TFBS, disrupted TFBS, PC and TFBS, PC and disrupted TFBS). The histograms at the top sum the number of genes predicted by Xseq in samples (using the same box-colour coding).

Figure 3.10: Cancer genes predicted by the Xseq tool from Cohort 2. Each row corresponds to a predicted gene and each column to a cancer sample. Colour boxes are drawn when a gene was predicted in a specific sample. The type of mutation associated to the gene is indicated by the box colours (legend: PC, TFBS, disrupted TFBS, PC and TFBS, PC and disrupted TFBS). The histograms at the top sum the number of genes predicted by Xseq in samples (using the same box-colour coding). Sample names are colour-coded as defined in Figure 3.1.

Figure 3.11: Functional enrichment analysis of disrupted pathways for Cohort 1. Xseq predicted genes, along with their neighbours in biological pathways showing altered expression, were derived from the Xseq analysis (Ch. 2.2.5). A functional enrichment has been performed with EnrichR [12] using these genes. The enriched terms from KEGG (A) and WikiPathways (B) with the 20 lowest Bonferroni adjusted p-values (adjusted p-value < 0.05) are shown. Each node of the graphs represents an enriched pathway where the colour of a node represents its Bonferroni corrected p-value. An edge between two nodes indicates the pathways are sharing genes. The larger the width of the edge, the larger the number of shared genes.

Figure 3.12: Functional enrichment analysis of disrupted pathways for Cohort 2. Xseq predicted genes, along with their neighbours in biological pathways showing altered expression, were derived from the Xseq analysis (Ch. 2.2.5). A functional enrichment has been performed with EnrichR [12] using these genes. The enriched terms from KEGG (A) and WikiPathways (B) with the 20 lowest Bonferroni adjusted p-values (adjusted p-value < 0.05) are shown. Each node of the graphs represents an enriched pathway where the colour of a node represents its Bonferroni corrected p-value. An edge between two nodes indicates the pathways are sharing genes. The larger the width of the edge, the larger the number of shared genes.

Figure 3.13: Functional enrichment analysis of disrupted pathways when combining cohorts. Xseq predicted genes, along with their neighbours in biological pathways showing altered expression, were derived from the Xseq analysis. A functional enrichment has been performed with EnrichR [12] using the genes obtained from the intersection of the Cohort 1 and Cohort 2 datasets. Enriched terms from KEGG (A) and WikiPathways (B) with the 20 lowest Bonferroni adjusted p-values (adjusted p-value < 0.05) are shown. Each node of the graphs represents an enriched pathway where the size of the node is proportional to the number of targeted genes. An edge between two nodes indicates the pathways are sharing genes. A larger edge width represents a larger number of shared genes. The Bonferroni corrected p-value is represented by the node colour.

Figure 3.14: Xseq results and translocations for the BCL6 and MYC genes. Cohort 2 samples where BCL6 or MYC were predicted by Xseq or have a translocation with an Ig gene are shown. Each column corresponds to a cancer sample and each row to either an Xseq predicted gene or a gene with a predicted translocation. Colour boxes are drawn when a gene was predicted by Xseq or contains a translocation in a specific sample. The type of mutation associated to the gene is indicated by the box colours as indicated in the legend. t(8;14), t(2;8) and t(8;22) are translocations between the MYC gene on chromosome 8 and the IgH gene on chromosome 14, the IgK gene on chromosome 2 and the IgL gene on chromosome 22, respectively; t(3;14) is a translocation between the BCL6 gene on chromosome 3 and the IgH gene on chromosome 14. The sample names are colour-coded as defined in Figure 3.1.

For each dysregulated gene predicted by Xseq, we further investigated each associated mutation, breaking down each mutation type in Table 3.2. There were predicted genes with altered expression associated with mutations only lying in their exons, mutations only lying in nearby TFBSs, and mutations both within TFBSs and exons (Figure 3.9 and Figure 3.10). There was an additional class where the TFBS was predicted to be disrupted. Genes associated with mutations only lying in their exons were COL3A1, IRF8, and NRIP1 for Cohort 1 and TP53, RYR2, and SIN3A for Cohort 2.
The Xseq predicted genes with only mutations lying in nearby TFBSs were MUT, RPIA, TUBD1 and PIK3C3 for Cohort 1 and CRIM1, WWC1 and CCNG1 for Cohort 2. Interesting cases where genes were predicted in multiple samples with different mechanisms of alteration suggest that key genes in cancer cells can be altered through several different mechanisms. Genes mutated in multiple samples with PC or cis-regulatory mutations were ROBO1, HAS2 and ZFP36L1 in Cohort 1 and ID3, MYC and BCL6 in Cohort 2.

Examples of specific genes with mutations disrupting gene expression through the alteration of TFBSs

Xseq analyses highlighted specific mutations associated to gene expression dysregulation, along with cascading effects on interacting genes through functional protein association networks. In the following sections, we describe a few interesting mutated genes from both cohorts as examples.

BCL6
BCL6 was predicted to harbour significant mutations that caused altered expression in 4 samples (1 corresponding to a BL and 3 to DLBCLs) from the Cohort 2 dataset. Interestingly, we observed an up-regulation of BCL6 in samples SA320848 and SA320968, which were associated to mutations disrupting TFBSs. On the contrary, a down-regulation was observed in SA320962 associated to non-disrupting mutations in TFBSs, and no expression change was observed in SA320872.

There was no observed translocation event associated to BCL6 in SA320848, whereas a translocation between BCL6 and chromosome 14 was found in SA320968 (Figure 3.14). The up-regulation of BCL6 in the two samples was associated with several mutations in TFBSs (7 SNVs in SA320848 lie within 10 TFBSs, 3 of which were predicted to be deleterious; 5 SNVs in SA320968 lie within 8 TFBSs, one of which was predicted to be deleterious). 3 SNVs in SA320848 were predicted to disrupt binding for the TFs BRCA1, E2F1, USF1, and BHLHE40 (one mutation overlaps two TFBSs).
In SA320968, a SNV was predicted to disrupt a GATA3 TFBS. The presence of promoter mutations and absence of exonic mutations indicated that the alteration of cis-regulatory elements likely caused the dysregulation of BCL6.

We concentrated on the two disrupted TFBS examples to determine the downstream effects of the up-regulation of BCL6. Overexpression of BCL6, an oncogene, enables accelerated proliferation and makes the cell more tolerant to DNA damage [27]. SMAD3 or SMAD4, and TXNIP, which have biological interactions with BCL6, were down-regulated in both SA320848 and SA320968 (Figure 3.16 and Figure 3.15). TXNIP is down-regulated in many cancer types, including DLBCL, and was proposed to be a tumour suppressor in thyroid cancer [50]. Overexpression of BCL6 also inhibits the tumour suppressors SMAD3 and SMAD4, which are Smad transcription factors that mediate cell proliferation through the transforming growth factor-beta signaling pathway [20, 74]. The DLBCL sample SA320968 had a down-regulation of SMAD4, whereas the BL sample SA320848 displayed a down-regulation of SMAD3 (Figure 3.16 and Figure 3.15).

ROBO1
Our approach allowed us to extract genes that were predicted as significantly mutated but for which little is known regarding their involvement in cancer. In fact, a mutated ROBO1 was predicted to be the cause of expression changes in 6 samples from Cohort 1 (Figure 3.9). 5 of these samples highlighted mutations in TFBSs and one sample was associated to a mutation in the PC space. ROBO1 was previously characterized as a potential tumour suppressor in different types of cancer [8, 56, 78]. Here, we found that ROBO1 was down-regulated in the 5 samples associated to mutations in TFBSs, whereas it was slightly up-regulated in the sample harbouring a mutation in the PC region.
Even though the TFBSs were not predicted to be disrupted, mutations might create new TFBSs for competing TFs. We hypothesize that the tumour suppressor ROBO1 gene was down-regulated at the transcriptional level by mutations in cis-regulatory elements.

To identify the potential impact of the mutated tumour suppressor ROBO1, we inspected the expression of known interacting genes: SOS1, SOS2, and RAC1. 4 out of the 5 DLBCL samples with a TFBS mutation and a high probability of down-regulation for ROBO1 also had a high probability that either SOS1 or SOS2 was down-regulated (Figure 3.17 and Figure 3.18). SOS1 and SOS2 may regulate the RAS gene, and a decrease in SOS expression will suppress the apoptosis pathway (http://www.genome.jp/kegg-bin/show_pathway?hsa04014). RAC1 had a high probability of being up-regulated in 4 of the 5 samples where ROBO1 had a TFBS mutation; one sample had a missense PC mutation in RAC1 (Figure 3.17 and Figure 3.18). RAC1 is known to inhibit apoptosis, and down-regulation of ROBO1 results in an up-regulation of RAC1 in human lymphoma cells [56, 80].

Figure 3.15: Impact of BCL6 expression alteration due to a PC mutation. The sample SA320872 was predicted by Xseq to have a mutated BCL6 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. BCL6_872 for the BCL6 gene in sample SA320872).

Figure 3.16: Impact of BCL6 expression alteration due to a TFBS mutation. The samples SA320848, SA320962 and SA320968 were predicted by Xseq to have a mutated BCL6 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. BCL6_848 for the BCL6 gene in sample SA320848).

Figure 3.17: Impact of ROBO1 expression alteration due to a PC mutation. The sample RG077 was predicted by Xseq to have a mutated ROBO1 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. ROBO1_077 for the ROBO1 gene in sample RG077).

Figure 3.18: Impact of ROBO1 expression alteration due to a TFBS mutation. The samples RG111, RG067, RG026, RG086 and RG028 were predicted by Xseq to have a mutated ROBO1 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. ROBO1_111 for the ROBO1 gene in sample RG111).

GNA13
GNA13 has previously been linked to tumour progression and described as an important mediator of prostate cancer cell invasion [61]. Additionally, GNA13 has been previously identified as a recurrent target of mutations in DLBCLs [48]. In our Xseq analysis of Cohort 2, GNA13 was predicted to harbour mutations that impacted gene expression in two BL samples.
Sample SA321004 was associated with an SNV in the PC space, whereas sample SA320848 was associated with an SNV lying within 3 overlapping TFBSs (for the TFs GATA1, GATA3, and JUND). We predicted that the SNV disrupted the TFBS associated with GATA3 (Figure 3.19). Given the GATA3 TF binding profile, we observed that the mutation is one of the most severe mutations that can occur in the TFBS. In the corresponding sample, we observed an up-regulation of GNA13 transcription, consistent with its important role in cancer cell invasion. Notice that GNA13 was also found to be up-regulated in sample SA321004, where Xseq associated it with a mutation in an exon.

The BL sample SA320848 showed an interesting dysregulation of RHOB, RAC2, VAV1 and ECT2, which have known protein interactions with GNA13 (Figure 3.20). The oncogenes VAV1 and ECT2 are both up-regulated. Down-regulation of RHOB is associated with cell proliferation and an increase of DNA double strand breaks [46]. Up-regulation of RAC2 has the same effect as its analog RAC1. As previously stated, over-expression of RAC1 inhibits apoptosis.

HAS2

HAS2 is an essential membrane-embedded hyaluronan (HA) synthase necessary for the synthesis of HA. We predicted HAS2 expression to be disrupted due to harbouring a mutation in 4 samples from Cohort 1. Two samples (RG014 and RG027) were associated with mutations in the PC space, whereas the other two samples (RG067 and RG116) were associated with mutations in TFBSs. The RG116 sample harbours an SNV in a CEBPA TFBS that was predicted to be deleterious (Figure 3.21). We observed a down-regulation of HAS2 in the sample associated with the disruption of the CEBPA TFBS.
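The idea of an SNV disrupting a TFBS, as described above for the GATA3 and CEBPA sites, can be illustrated with a small position weight matrix (PWM) calculation. The following is only a minimal sketch under stated assumptions: the count matrix, the uniform background, the threshold value and all function names are hypothetical and are not the profiles or parameters used in this work.

```python
import math

# Hypothetical 4-position count matrix (one dict per position).
# The counts are invented for illustration and are NOT a real JASPAR profile.
COUNTS = [
    {"A": 1, "C": 1, "G": 17, "T": 1},   # position 1 prefers G
    {"A": 17, "C": 1, "G": 1, "T": 1},   # position 2 prefers A
    {"A": 1, "C": 1, "G": 1, "T": 17},   # position 3 prefers T
    {"A": 17, "C": 1, "G": 1, "T": 1},   # position 4 prefers A
]
BACKGROUND = 0.25  # assumed uniform nucleotide background


def log_odds_score(seq, counts=COUNTS, background=BACKGROUND):
    """Sum over positions of log2(observed frequency / background)."""
    if len(seq) != len(counts):
        raise ValueError("sequence length must match the matrix width")
    score = 0.0
    for base, column in zip(seq, counts):
        total = sum(column.values())
        score += math.log2((column[base] / total) / background)
    return score


def is_disrupted(ref, alt, threshold):
    """Flag a site whose reference scores at or above the threshold
    while the alternative sequence scores below it."""
    return log_odds_score(ref) >= threshold and log_odds_score(alt) < threshold


def all_single_base_differences(ref):
    """Score difference (alternative minus reference) for every possible SNV."""
    ref_score = log_odds_score(ref)
    diffs = {}
    for i, ref_base in enumerate(ref):
        for base in "ACGT":
            if base != ref_base:
                alt = ref[:i] + base + ref[i + 1:]
                diffs[(i, base)] = log_odds_score(alt) - ref_score
    return diffs


# Example: score a reference site and one mutated version of it.
reference_site = "GATA"
mutated_site = "GCTA"
disrupted = is_disrupted(reference_site, mutated_site, threshold=5.0)
```

Ranking the values returned by `all_single_base_differences` is one way to judge whether an observed SNV is among the most severe changes possible for a site, which is the kind of comparison shown in panel C of Figures 3.19 and 3.21.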
[Figure panels: A, TF binding profile (position vs. bits); B, reference and alternative sequences; C, score difference; D, expression in sample harbouring mutation; E, gene network]

Figure 3.19: Predicted cis-regulatory mutation potentially impacting GNA13 gene expression. In SA320848, a GATA3 (TF binding profile in A) TFBS was predicted to be disrupted (see reference and alternative sequences in B where the SNV was highlighted with the reference nucleotide in green and the alternative in purple). Score differences between the reference TFBS and all possible alternative TFBSs were plotted in C. The distribution of GNA13 expression from RNA-seq data was plotted in D with an arrow pointing to the expression value in sample SA320848. E represents the network of genes associated with GNA13, which are predicted to be either down- (blue) or up-regulated (red) in SA320848. The higher the opacity, the stronger the down- or up-regulation.

From these 4 samples, we observed that two show a down-regulation and the other two show an up-regulation of HAS2, as seen in Figure 3.22 and Figure 3.23.
Note that HAS2 has been described previously as either an oncogene or a tumour suppressor depending on HA length and concentration in the extracellular matrix.

[Network figure; node legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS]

Figure 3.20: Impact of GNA13 expression alteration due to a PC or TFBS mutation. The samples SA321004 and SA320848 were predicted by Xseq to have a mutated GNA13 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. GNA13_004 for the GNA13 gene in sample SA321004).

[Figure panels: A, TF binding profile (position vs. bits); B, reference and alternative sequences; C, score difference; D, expression in sample harbouring mutation; E, gene network]

Figure 3.21: Predicted cis-regulatory mutation potentially impacting HAS2 gene expression.
HAS2 was predicted by Xseq in sample RG116. A CEBPA (TF binding profile in A) TFBS was predicted to be disrupted (see reference and alternative sequences in B where the SNV was highlighted with the reference nucleotide in green and the alternative in purple). Score differences between the reference TFBS and all possible alternative TFBSs were plotted in C. The distribution of HAS2 expression from RNA-seq data was plotted in D with an arrow pointing to the expression value in sample RG116. E represents the network of genes associated with HAS2, which are predicted to be either down- (blue) or up-regulated (red) in RG116. The higher the opacity, the stronger the down- or up-regulation.

[Network figure; node legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS]

Figure 3.22: Impact of HAS2 expression alteration due to a PC mutation. The samples RG027 and RG014 were predicted by Xseq to have a mutated HAS2 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g.
HAS2_027 for the HAS2 gene in sample RG027).

[Network figure; node legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS]

Figure 3.23: Impact of HAS2 expression alteration due to a TFBS mutation. The samples RG116 and RG067 were predicted by Xseq to have a mutated HAS2 gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. HAS2_116 for the HAS2 gene in sample RG116).

Chapter 4

Discussion

Our results highlighted the importance of fully characterizing somatic mutations in cis-regulatory regions of B cell lymphoma genomes. We predicted the effects on gene expression of disruptive somatic mutations within TFBSs by incorporating WGS and RNA-Seq data. Overall, these results demonstrated a more complete interpretation of the mutational landscape of cancer genomes when including both PC and cis-regulatory mutations, which caused dysregulation of gene expression.

Initial investigation of the mutational landscape was important (Experimental Aim 1) because the mutation frequencies of each sample indicate the mutational pressure that the tumour genomes have been under. Too few or too many predicted mutations can indicate normal contamination in the tumour sample or improper alignment of reads.
There was a large range in the total number of predicted somatic mutations across the samples: Cohort 1 ranged from 200 to 32,000 and Cohort 2 from 1,300 to 38,000. The frequency of mutations throughout the human genome differs for each cancer type, but between 1,000 and 50,000 predicted somatic mutations was the general trend observed in the large TCGA datasets [76]. One sample from Cohort 1 and no samples from Cohort 2 had fewer than 1,000 total somatic mutations.

The number of predicted indels was considerably smaller than the number of SNVs. Indels are more challenging to identify due to the difficulty of correctly mapping reads that contain indels [47]. Also, an SNV has only three possible alternative alleles, whereas an indel can take many forms. Additionally, indels could be less frequent because they affect a larger stretch of sequence and therefore have a higher probability of causing major changes, which could result in cell death. In Cohort 1, the sample RG043 had 4,810 predicted indels, which was 5 times more than any other sample. This could be explained by a disruption of the repair mechanism that identifies and repairs indels, which has been observed in other cancers, such as breast cancers [51].

A higher mutation rate was observed in TFBS regions than in PC regions (Experimental Aim 2), where both regions had similar total sizes. This observation could occur by chance, but another explanation is that PC regions are well defined, whereas TFBS regions are binding sites predicted using PWMs, which may result in false positives. False positive TFBS predictions would increase the possibility of observing TFBS mutations. Another potential source of false positives is predicted TFBSs that are selectively active in contexts unrelated to the studied lymphoma, a consequence of the limited availability of tissue-specific ChIP-Seq data.
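Comparing mutation rates between two annotated spaces, as above for TFBS and PC regions, requires normalizing mutation counts by the number of bases each space covers. The following is a minimal sketch of that normalization, assuming a single chromosome, half-open `(start, end)` intervals and hypothetical function names; it is not the pipeline used in this work, which relied on genome-wide annotations.

```python
def merged_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals,
    counting overlapping stretches only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def count_hits(positions, intervals):
    """Number of mutation positions that fall inside at least one interval."""
    return sum(any(s <= p < e for s, e in intervals) for p in positions)


def mutations_per_megabase(positions, intervals):
    """Mutation count in the annotated space, normalized per million bases."""
    length = merged_length(intervals)
    if length == 0:
        raise ValueError("the annotated space covers no bases")
    return count_hits(positions, intervals) / length * 1e6


# Toy, invented coordinates: two annotated spaces and some mutation positions.
tfbs_space = [(0, 10), (5, 20), (30, 40)]
pc_space = [(100, 160)]
mutations = [1, 7, 25, 35, 40, 120]
tfbs_rate = mutations_per_megabase(mutations, tfbs_space)
pc_rate = mutations_per_megabase(mutations, pc_space)
```

Merging overlapping intervals before measuring the covered length matters for TFBSs in particular, because binding sites for different TFs frequently overlap and would otherwise inflate the denominator.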
TFBS regions were locally depleted for SNVs in both cohorts, which implies a lower tolerance of mutations in TFBSs compared to their surrounding regions.

We expect that the number of predicted somatic mutations overlapping functional non-coding elements was underestimated. We focused on TFBSs, one type of non-coding element, relying on high-quality annotations predicted from ChIP-Seq and manually curated TF binding profiles (Ch. 2.2.3). Although we are combining ChIP-Seq experiments from multiple cell types and conditions, the experimental ChIP-Seq information provides the best current opportunity to focus on the non-coding space. The predicted TFBS space covers ∼2% of the human genome, which is approximately the same as the PC region. However, the annotated cis-regulatory space is expected to expand over the next couple of years as decreasing costs increase the availability of antibodies and cell-type-specific ChIP-Seq datasets. An increase in the annotated cis-regulatory space will allow for a more comprehensive interpretation of biological pathways disrupted by mutations that alter TFBSs.

We observed that promoter regions were frequently mutated for genes related to apoptotic and oncogenic processes (Experimental Aim 3). In our analysis, we combined the samples from Cohort 1 that were previously used to explore hypermutated promoters with 44 unexplored samples from Cohort 2. As mentioned, Cohort 1 has been used in a publication that observed highly mutated promoters for Ig genes and for the oncogenes that Ig loci join through translocations [34]. In both cohorts, BCL2, BCL6 and MYC had frequently mutated promoters and are known to be involved in hypermutations and translocations with Ig loci; in B cells, hypermutation adds variation and increases the binding affinity of B cell receptors for different antigens [37]. This may explain why the promoters for BCL2, BCL6 and MYC were frequently mutated. Khodabakhshi et al.
reported results supporting concordance between genomic translocation and frequently mutated promoters [34]. Specifically, they observed that in the absence of Ig loci translocations there were still samples with recurrently mutated promoters for genes involved in translocations. They note that this supports a proposed model in which the Ig loci and the oncogene are first frequently mutated, causing genomic instability and leading to the known genomic rearrangements. A contrasting model suggests that the translocations occur first, resulting in genomic instability and hypermutations in nearby regions. Regardless of the correct model, there are other frequently mutated promoters, such as those of BIRC3, BTG1, CD74, FOXO1, IRF1 and PIM1, that are not involved in translocation and are associated with apoptotic and oncogenic processes.

Our data indicate for the first time that ID3 [63] can be targeted through mutations that overlap TFBSs in its promoter region. Thus, both exonic and promoter portions of the gene are recurrently mutated, suggesting different mechanisms for gene disruption. Assuming that at least some cis-regulatory mutations have an impact on gene expression regulation, we hypothesized that genes involved in carcinogenesis should be more frequently targeted by mutations in their promoters than other genes.

The most significant observation was the evidence that cancer-related genes were mutated through different mechanisms that alter gene expression (Experimental Aim 4). We focused on prioritizing mutations with the highest probability, as computed by Xseq, of having an impact on gene expression through the disruption of cis-regulatory elements. We showed that some genes, including BCL6, ROBO1, GNA13 and HAS2, were predicted in multiple samples with different mechanisms of dysregulation, through alterations within PC regions, TFBS regions or both.
Xseq’s ability to predict genes related to cancer growth was reinforced by the strong over-representation of cancer-related pathways in the functional enrichment analyses.

As previously mentioned, the mutations within the PC and TFBS regions may have been caused by the instability from translocations. Up-regulation of MYC could be associated with either an SNV or a MYC-Ig translocation. ID3 was of special interest since it has previously been proposed that a cooperation between the up-regulation of MYC and the disruption of ID3 could potentially drive BL [63]. Strikingly, a MYC translocation was also observed for every sample where ID3 had been mutated through some mechanism. These are promising results, suggesting that ID3 could be mutated in both PC and TFBS regions and that there could still be other undiscovered mechanisms by which ID3 is disrupted.

Xseq’s ability to accurately predict impactful mutated genes relies on three assumptions. Our first assumption was that we can accurately predict whether a mutation disrupts a TFBS. Over the years, various approaches have been explored to address this problem, but more accurate binding affinity predictors are needed [4, 35]. In this report, all mutations lying within TFBSs are of potential interest, but we highlighted those most likely to disrupt TFBSs, namely those for which the alternative score fell below a defined threshold. Since we consider this approach to be simple and conservative, future improvements that couple ChIP-Seq data to TFBS variant prediction are likely to increase sensitivity.

Secondly, we assumed that mutations within cis-regulatory elements were associated only with the closest gene. While this approach is relevant when analyzing promoter regions, more information about the association of distal regulatory elements to promoters is ultimately required.
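The closest-gene assumption can be made concrete with a short sketch. The example below assigns a mutation to the gene whose transcription start site (TSS) is nearest, assuming a single chromosome, a TSS-sorted gene list and hypothetical gene names and coordinates; it ignores strand and gene body, and it is not the annotation procedure used in this work.

```python
import bisect


def nearest_gene(position, genes):
    """Return the name of the gene whose TSS is closest to `position`.

    `genes` is a list of (tss, name) tuples sorted by TSS on one chromosome.
    Ties are resolved in favour of the upstream (smaller coordinate) gene.
    """
    if not genes:
        raise ValueError("no genes provided")
    tss_list = [tss for tss, _ in genes]
    i = bisect.bisect_left(tss_list, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = genes[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda gene: abs(gene[0] - position))[1]


# Invented coordinates and placeholder gene names.
toy_genes = [(100, "GENE_A"), (500, "GENE_B"), (900, "GENE_C")]
assigned = nearest_gene(250, toy_genes)
```

A distal regulatory element that loops to a non-adjacent promoter would be misassigned by this rule, which is the limitation discussed in the surrounding text.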
In the future, the availability of numerous chromatin conformation capture datasets for multiple cell types will empower analyses linking distal regulatory elements to their targets.

Lastly, we assumed that there was no competition between TFs with different specificities when predicting TFBSs within ChIP-Seq peaks. A violation of this assumption could explain the down-regulation of the BCL6 gene for only one of the four samples in Cohort 2, where a STAT3 TFBS was mutated. While STAT3 is a known activator, the down-regulation of BCL6 could have an alternative explanation. It has been hypothesized that a competition between STAT3 and STAT5 occurs at this binding site since they recognize similar motifs. Additionally, a previous study highlighted that STAT5 outcompetes STAT3 to repress the expression of BCL6 [73]. The mutation may provide an advantage for STAT5 binding at this location. Functional studies are required in order to decipher the mechanisms underlying the competition between TFs.

The Xseq model can predict recurrent mutations more accurately than infrequent mutations since it draws strength across samples. Infrequent mutations are challenging to predict; therefore, we chose a filter to remove potential false positives (Ch. 2.2.5). In cancer genomics, it is still a challenge to separate infrequent driver mutations from thousands of passenger mutations that are only observed in a single sample.

The analysis of cis-regulatory elements will be crucial for the interpretation of cancer genomes once more WGS data becomes available, along with large scale gene expression data.
Our findings support this claim: combining mutations found in both cis-regulatory and PC regions leads to a more complete interpretation of disrupted biological pathways related to tumour progression, whereas the majority of previous studies focus only on the PC regions.

Chapter 5

Conclusion

In order to provide an initial genome-scale analysis of cis-regulatory mutations impacting gene expression in cancer, a set of 700,000 somatic SNVs and indels in 84 B cell lymphoma samples was analyzed for this report. The mutational profiles, the first experimental aim, showed that the B cell lymphoma subtypes clustered together in the distribution of the number of somatic mutations per sample. All samples except one followed the general trend observed in the TCGA project, in which cancer genomes have between 1,000 and 50,000 predicted somatic SNVs.

Using these mutational profiles, we addressed our second aim: a higher mutation rate was observed in TFBS regions than in PC regions, where both regions have similar total sizes. Even though this could have occurred by chance, the local mutation analysis confirmed that TFBSs were depleted for SNVs compared to their surrounding regions. The third aim showed that cis-regulatory mutations are found more often in promoters than expected by chance. The set of genes whose promoters were targeted by mutations was enriched for apoptosis-related and carcinogenesis pathways.

Ultimately, mutation information was combined with gene expression to address our last objective, which led to our prediction that cancer-related genes with altered expression are associated with mutations found either in exons or in nearby TFBSs. This approach reveals samples in which genes are possibly dysregulated through the disruption of cis-regulatory elements, and it has led to new and more prevalent predictions of cancer-related genes.
Overall, this report highlights the importance of investigating the cis-regulatory genomic space to interpret the mutational landscapes of cancers.

Functional validations in the future will add invaluable strength to our claims: first, by proving that the predicted disruptive TFBS mutations truly disrupt the binding of TFs; secondly, and even more challenging, by proving that the disrupted TFBS is the true cause of the dysregulation of the nearby gene. As more WGS and RNA-Seq data emerge, they will enable the identification of more recurrent genomic aberrations within PC and TFBS regions for B cell lymphomas and different tumour types.

Bibliography

[1] S. Aerts and J. Cools. Cancer: Mutations close in on gene regulation. Nature, 499(7456):35–36, 2013.
[2] C. A. Albers, G. Lunter, D. G. MacArthur, G. McVean, W. H. Ouwehand, and R. Durbin. Dindel: Accurate indel calls from short-read data. Genome Research, 21:961–973, 2010.
[3] R. H. Ali, O. M. Rueda, S.-F. Chin, C. Curtis, M. J. Dunning, S. A. Aparicio, and C. Caldas. Genome-driven integrated classification of breast cancer validated in over 7,500 samples. Genome Biology, 15(8), 2014.
[4] M. C. Andersen, P. G. Engström, S. Lithwick, D. Arenillas, P. Eriksson, B. Lenhard, W. W. Wasserman, and J. Odeberg. In silico detection of sequence variations modifying transcriptional regulation. PLoS Computational Biology, 4(1):e5, 2008.
[5] C. E. Barbieri, C. H. Bangma, A. Bjartell, J. W. Catto, Z. Culig, H. Grönberg, J. Luo, T. Visakorpi, and M. A. Rubin. The mutational landscape of prostate cancer. European Urology, 64(4):567–576, 2013.
[6] A. Bashashati, G. Haffari, J. Ding, G. Ha, K. Lui, J. Rosner, D. Huntsman, C. Caldas, S. Aparicio, and S. Shah. Drivernet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biology, 13(12):R124, 2012.
[7] P. V. Benos, M. L. Bulyk, and G. D. Stormo.
Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Research, 30(20):4442–4451, 2002.
[8] A. E. Bonner, W. J. Lemon, and M. You. Gene expression signatures identify novel regulatory pathways during murine lung development: implications for lung tumorigenesis. Journal of Medical Genetics, 40(6):408–417, 2003.
[9] A. R. Borneman, T. A. Gianoulis, Z. D. Zhang, H. Yu, J. Rozowsky, M. R. Seringhaus, L. Y. Wang, M. Gerstein, and M. Snyder. Divergence of transcription factor binding sites across related yeast species. Science, 317(5839):815–819, 2007.
[10] E. Cerami, J. Gao, U. Dogrusoz, B. E. Gross, S. O. Sumer, B. A. Aksoy, A. Jacobsen, C. J. Byrne, M. L. Heuer, E. Larsson, Y. Antipin, B. Reva, A. P. Goldberg, C. Sander, and N. Schultz. The cbio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5):401–404, 2012.
[11] C.-C. Chen, S. Xiao, D. Xie, X. Cao, C.-X. Song, T. Wang, C. He, and S. Zhong. Understanding variation in transcription factor binding by modeling transcription factor genome-epigenome interactions. PLoS Computational Biology, 9(12), 2013.
[12] E. Chen, C. Tan, Y. Kou, Q. Duan, Z. Wang, G. Meirelles, N. Clark, and A. Ma’ayan. Enrichr: interactive and collaborative html5 gene list enrichment analysis tool. BMC Bioinformatics, 14(1):128, 2013.
[13] P. Cingolani, A. Platts, L. L. Wang, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2):80–92, 2012.
[14] The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, 2012.
[15] The ENCODE Project Consortium. An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57–74, 2012.
[16] D. A. Cusanovich, B. Pavlovic, J. K. Pritchard, and Y. Gilad. The functional consequences of variation in transcription factor binding. PLoS Genetics, 10(3):e1004226, 2014.
[17] L. Dang, D. W. White, S. Gross, B. D. Bennett, M. A. Bittinger, E. M. Driggers, V. R. Fantin, H. G. Jang, S. Jin, M. C. Keenan, K. M. Marks, R. M. Prins, P. S. Ward, K. E. Yen, L. M. Liau, J. D. Rabinowitz, L. C. Cantley, C. B. Thompson, M. G. Vander Heiden, and S. M. Su. Cancer-associated idh1 mutations produce 2-hydroxyglutarate. Nature, 462(7274):739–744, 2009.
[18] J. Ding, A. Bashashati, A. Roth, A. Oloumi, K. Tse, T. Zeng, G. Haffari, M. Hirst, M. A. Marra, A. Condon, S. Aparicio, and S. P. Shah. Feature based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 2011.
[19] J. Ding, M. K. McConechy, H. M. Horlings, G. Ha, C. C. Fong, T. Funnell, S. C. Mullaly, A. Bashashati, D. Huntsman, S. Aparicio, A. Condon, and S. P. Shah. Systematic analysis of somatic mutations impacting gene expression in twelve tumour types. submitted, 2015.
[20] N. I. Fleming, R. N. Jorissen, D. Mouradov, M. Christie, A. Sakthianandeswaren, M. Palmieri, F. Day, S. Li, C. Tsui, L. Lipton, J. Desai, I. T. Jones, S. McLaughlin, R. L. Ward, N. J. Hawkins, A. R. Ruszkiewicz, J. Moore, H.-J. Zhu, J. M. Mariadason, A. W. Burgess, D. Busam, Q. Zhao, R. L. Strausberg, P. Gibbs, and O. M. Sieber. Smad2, smad3 and smad4 mutations in colorectal cancer. Cancer Research, 73(2):725–735, 2013.
[21] S. A. Forbes, G. Tang, N. Bindal, S. Bamford, E. Dawson, C. Cole, C. Y. Kok, M. Jia, R. Ewing, A. Menzies, J. W. Teague, M. R. Stratton, and P. A. Futreal. COSMIC (the catalogue of somatic mutations in cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Research, 38:D652–D657, 2010.
[22] J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen, R.
Sinha, E. Larsson, E. Cerami, C. Sander, and N. Schultz. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Science Signaling, 6(269):pl1–pl1, 2013.
[23] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.
[24] K. A. Gray, L. C. Daugherty, S. M. Gordon, R. L. Seal, M. W. Wright, and E. A. Bruford. Genenames.org: the HGNC resources in 2013. Nucleic Acids Research, 41(Database issue):D545–552, 2013.
[25] A. Hodgkinson, Y. Chen, and A. Eyre-Walker. The large-scale distribution of somatic mutations in cancer genomes. Human Mutation, 33:136–143, 2012.
[26] S. Horn, A. Figl, P. S. Rachakonda, C. Fischer, A. Sucker, A. Gast, S. Kadel, I. Moll, E. Nagore, K. Hemminki, D. Schadendorf, and R. Kumar. TERT promoter mutations in familial and sporadic melanoma. Science, 339(6122):959–961, 2013.
[27] C. Huang, K. Hatzi, and A. Melnick. Lineage-specific functions of bcl-6 in immunity and inflammation are mediated by distinct biochemical mechanisms. Nature Immunology, 14(4):380–388, 2013.
[28] F. W. Huang, E. Hodis, M. J. Xu, G. V. Kryukov, L. Chin, and L. A. Garraway. Highly recurrent TERT promoter mutations in human melanoma. Science, 339(6122):957–959, 2013.
[29] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interactions. Science, 316(5830):1497–1502, 2007.
[30] C. Kandoth, M. D. McLellan, F. Vandin, K. Ye, B. Niu, C. Lu, M. Xie, Q. Zhang, J. F. McMichael, M. A. Wyczalkowski, M. D. M. Leiserson, C. A. Miller, J. S. Welch, M. J. Walter, M. C. Wendl, T. J. Ley, R. K.
Wilson, B. J. Raphael, and L. Ding. Mutational landscape and significance across 12 major cancer types. Nature, 502(7471):333–339, 2013.
[31] J. Kapeller, L. A. Houghton, H. Mönnikes, J. Walstab, D. Möller, H. Bönisch, B. Burwinkel, F. Autschbach, B. Funke, F. Lasitschka, N. Gassler, C. Fischer, P. J. Whorwell, W. Atkinson, C. Fell, K. J. Büchner, M. Schmidtmann, I. van der Voort, A.-S. Wisser, T. Berg, G. Rappold, and B. Niesler. First evidence for an association of a functional variant in the microRNA-510 target site of the serotonin receptor-type 3e gene with diarrhea predominant irritable bowel syndrome. Human Molecular Genetics, 17(19):2967–2977, 2008.
[32] M. Kasowski, F. Grubert, C. Heffelfinger, M. Hariharan, A. Asabere, S. M. Waszak, L. Habegger, J. Rozowsky, M. Shi, A. E. Urban, M.-Y. Hong, K. J. Karczewski, W. Huber, S. M. Weissman, M. B. Gerstein, J. O. Korbel, and M. Snyder. Variation in transcription factor binding among humans. Science, 328(5975):232–235, 2010.
[33] P. Kheradpour and M. Kellis. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, 2013.
[34] A. H. Khodabakhshi, R. D. Morin, A. P. Fejes, A. J. Mungall, K. L. Mungall, M. Bolger-Munro, N. A. Johnson, J. M. Connors, R. D. Gascoyne, and M. A. Marra. Recurrent targets of aberrant somatic hypermutation in lymphoma. Oncotarget, 3(11):1308–1319, 2012.
[35] E. Khurana, Y. Fu, V. Colonna, X. J. Mu, H. M. Kang, T. Lappalainen, A. Sboner, L. Lochovsky, J. Chen, A. Harmanci, J. Das, A. Abyzov, S. Balasubramanian, K. Beal, D. Chakravarty, D. Challis, Y. Chen, D. Clarke, L. Clarke, F. Cunningham, U. S. Evani, P. Flicek, R. Fragoza, E. Garrison, R. Gibbs, Z. H. Gümüş, J. Herrero, N. Kitabayashi, Y. Kong, K. Lage, V. Liluashvili, S. M. Lipkin, D. G. MacArthur, G. Marth, D. Muzny, T. H. Pers, G. R. S. Ritchie, J. A. Rosenfeld, C. Sisu, X. Wei, M. Wilson, Y. Xue, F. Yu, 1000 Genomes Project Consortium, E. T.
Dermitzakis, H. Yu, M. A. Rubin, C. Tyler-Smith, and M. Gerstein. Integrative annotation of variants from 1092 humans: Application to cancer genomics. Science, 342(6154):1235587–1235587, 2013.
[36] M. Krzywinski, J. Schein, İ. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra. Circos: An information aesthetic for comparative genomics. Genome Research, 19(9):1639–1645, 2009.
[37] R. Küppers. Mechanisms of b-cell lymphoma pathogenesis. Nature Reviews Cancer, 5(4):251–262, 2005.
[38] A. T. Kwon, D. J. Arenillas, R. W. Hunt, and W. W. Wasserman. opossum-3: Advanced analysis of regulatory motif over-representation across genes or chip-seq datasets. G3, 2(9):987–1002, 2012.
[39] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. Software for computing and annotating genomic ranges. PLoS Computational Biology, 9(8):e1003118, 2013.
[40] M. S. Lawrence, P. Stojanov, C. H. Mermel, J. T. Robinson, L. A. Garraway, T. R. Golub, M. Meyerson, S. B. Gabriel, E. S. Lander, and G. Getz. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature, 505:495–501, 2014.
[41] B. Lenhard, A. Sandelin, L. Mendoza, P. Engstrom, N. Jareborg, and W. Wasserman. Identification of conserved regulatory elements by comparative genome analysis. Journal of Biology, 2(2):13, 2003.
[42] S. Lomvardas, G. Barnea, D. J. Pisapia, M. Mendelsohn, J. Kirkland, and R. Axel. Interchromosomal interactions and olfactory receptor choice. Cell, 126(2):403–413, 2014.
[43] L. B. Ludlow, B. P. Schick, M. L. Budarf, D. A. Driscoll, E. H. Zackai, A. Cohen, and B. A. Konkle. Identification of a mutation in a GATA binding site of the platelet glycoprotein ib promoter resulting in the bernard-soulier syndrome. Journal of Biological Chemistry, 271(36):22076–22080, 1996.
[44] A. Mathelier, X. Zhao, A. W. Zhang, F. Parcy, R. Worsley-Hunt, D.
J. Arenillas, S. Buchman, C.-y. Chen, A. Chou, H. Ienasescu, J. Lim, C. Shyr, G. Tan, M. Zhou, B. Lenhard, A. Sandelin, and W. W. Wasserman. Jaspar 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Research, 42(D1):D142–D147, 2014.
[45] A. Mathelier, C. Lefebvre, A. Zhang, D. Arenillas, J. Ding, W. Wasserman, and S. Shah. Cis-regulatory somatic mutations and gene-expression alteration in b-cell lymphomas. Genome Biology, 16(84), 2015.
[46] N. Meyer, A. Peyret-Lacombe, B. Canguilhem, C. Médale-Giamarchi, K. Mamouni, A. Cristini, S. Monferran, L. Lamant, T. Filleron, A. Pradines, O. Sordet, and G. Favre. RhoB promotes cancer initiation by protecting keratinocytes from UVB-induced apoptosis but limits tumor aggressiveness. The Journal of Investigative Dermatology, 134:203–212, 2014.
[47] R. E. Mills, W. S. Pittard, J. M. Mullaney, U. Farooq, T. H. Creasy, A. A. Mahurkar, D. M. Kemeza, D. S. Strassler, C. P. Ponting, C. Webber, and S. E. Devine. Natural genetic variation caused by small insertions and deletions in the human genome. Genome Research, 21(6):830–839, 2011.
[48] R. D. Morin, M. Mendez-Lago, A. J. Mungall, R. Goya, K. L. Mungall, R. D. Corbett, N. A. Johnson, T. M. Severson, R. Chiu, M. Field, S. Jackman, M. Krzywinski, D. W. Scott, D. L. Trinh, J. Tamura-Wells, S. Li, M. R. Firme, S. Rogic, M. Griffith, S. Chan, O. Yakovenko, I. M. Meyer, E. Y. Zhao, D. Smailus, M. Moksa, S. Chittaranjan, L. Rimsza, A. Brooks-Wilson, J. J. Spinelli, S. Ben-Neriah, B. Meissner, B. Woolcock, M. Boyle, H. McDonald, A. Tam, Y. Zhao, A. Delaney, T. Zeng, K. Tse, Y. Butterfield, I. Birol, R. Holt, J. Schein, D. E. Horsman, R. Moore, S. J. M. Jones, J. M. Connors, M. Hirst, R. D. Gascoyne, and M. A. Marra. Frequent mutation of histone-modifying genes in non-hodgkin lymphoma. Nature, 476(7360):298–303, 2011.
[49] R. D. Morin, K. Mungall, E. Pleasance, A. J. Mungall, R. Goya, R. D.
Huff,D. W. Scott, J. Ding, A. Roth, R. Chiu, R. D. Corbett, F. C. Chan,M. Mendez-Lago, D. L. Trinh, M. Bolger-Munro, G. Taylor,A. Hadj Khodabakhshi, S. Ben-Neriah, J. Pon, B. Meissner, B. Woolcock,N. Farnoud, S. Rogic, E. L. Lim, N. A. Johnson, S. Shah, S. Jones, C. Steidl,R. Holt, I. Birol, R. Moore, J. M. Connors, R. D. Gascoyne, and M. A.Marra. Mutational and structural analysis of diffuse large b-cell lymphomausing whole-genome sequencing. Blood, 122(7):1256–1265, 2013. → pages13, 17, 33, 35[50] J. Morrison, L. Pike, S. Sams, V. Sharma, Q. Zhou, J. Severson, A.-C. Tan,W. Wood, and B. Haugen. Thioredoxin interacting protein (txnip) is a noveltumor suppressor in thyroid cancer. Molecular Cancer, 13(1):62, 2014. →pages 51[51] S. Negrini, V. G. Gorgoulis, and T. D. Halazonetis. Genomic instability - anevolving hallmark of cancer. Nature Reviews Molecular Cell Biology, 11:220–228, 2010. → pages 63[52] T. C. G. A. R. Network, J. N. Weinstein, E. A. Collisson, G. B. Mills,K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, andJ. M. Stuart. The cancer genome atlas pan-cancer analysis project. NatureGenetics, 45(10):1113–1120, 2013. → pages 4[53] J. A. Neuman, O. Isakov, and N. Shomron. Analysis of insertiondeletionfrom deep-sequencing data: software evaluation for optimal detection.Briefings in Bioinformatics, 2012. → pages 19[54] K. Nishida, M. C. Frith, and K. Nakai. Pseudocounts for transcription factorbinding sites. Nucleic Acids Research, 37(3):939–944, 2009. → pages 22[55] H. Noushmehr, D. J. Weisenberger, K. Diefes, H. S. Phillips, K. Pujara, B. P.Berman, F. Pan, C. E. Pelloski, E. P. Sulman, K. P. Bhat, R. G. Verhaak,K. A. Hoadley, D. N. Hayes, C. M. Perou, H. K. Schmidt, L. Ding, R. K.75Wilson, D. Van Den Berg, H. Shen, H. Bengtsson, P. Neuvial, L. M. Cope,J. Buckley, J. G. Herman, S. B. Baylin, P. W. Laird, and K. Aldape.Identification of a cpg island methylator phenotype that defines a distinctsubgroup of glioma. 
Cancer Cell, 17(5):510–522, 2014. → pages 2[56] A. Parray, H. R. Siddique, J. K. Kuriger, S. K. Mishra, J. S. Rhim, H. H.Nelson, H. Aburatani, B. R. Konety, S. Koochekpour, and M. Saleem.ROBO1, a tumor suppressor and critical molecular barrier for localizedtumor cells to acquire invasive phenotype: Study in african-american andcaucasian prostate cancer models. International Journal of Cancer. JournalInternational Du Cancer, 2014. → pages 51, 56[57] S. Pelengaris, M. Khan, and G. Evan. c-MYC: more than just a matter of lifeand death. Nature Reviews Cancer, 2(10):764–776, 2002. → pages 43[58] P. Polak, M. S. Lawrence, E. Haugen, N. Stoletzki, P. Stojanov, R. E.Thurman, L. A. Garraway, S. Mirkin, G. Getz, J. A. Stamatoyannopoulos,and S. R. Sunyaev. Reduced local mutation density in regulatory DNA ofcancer genomes is linked to DNA repair. Nature Biotechnology, 2013. →pages 33[59] E. Portales-Casamar, S. Kirov, J. Lim, S. Lithwick, M. I. Swanson, A. Ticoll,J. Snoddy, and W. W. Wasserman. PAZAR: a framework for collection anddissemination of cis-regulatory sequence annotation. Genome Biology, 8(10):R207, 2007. → pages 13, 18, 23[60] A. R. Quinlan and I. M. Hall. BEDTools: a flexible suite of utilities forcomparing genomic features. Bioinformatics (Oxford, England), 26(6):841–842, 2010. → pages 26[61] S. A. K. Rasheed, C. R. Teo, E. J. Beillard, P. M. Voorhoeve, and P. J. Casey.MicroRNA-182 and microRNA-200a control g-protein subunit α-13(GNA13) expression and cell invasion synergistically in prostate cancercells. The Journal of Biological Chemistry, 288(11):7986–7995, 2013. →pages 56[62] M. J. Reijnen, F. M. Sladek, R. M. Bertina, and P. H. Reitsma. Disruption ofa binding site for hepatocyte nuclear factor 4 results in hemophilia b leyden.Proceedings of the National Academy of Sciences, 89(14):6300–6303, 1992.→ pages 2[63] J. Richter, M. Schlesner, S. Hoffmann, M. Kreuz, E. Leich, B. Burkhardt,M. Rosolowski, O. Ammerpohl, R. Wagener, S. H. Bernhart, D. Lenze,76M. 
Szczepanowski, M. Paulsen, S. Lipinski, R. B. Russell, S. Adam-Klages,G. Apic, A. Claviez, D. Hasenclever, V. Hovestadt, N. Hornig, J. O. Korbel,D. Kube, D. Langenberger, C. Lawerenz, J. Lisfeld, K. Meyer, S. Picelli,J. Pischimarov, B. Radlwimmer, T. Rausch, M. Rohde, M. Schilhabel,R. Scholtysik, R. Spang, H. Trautmann, T. Zenz, A. Borkhardt, H. G.Drexler, P. Mller, R. A. F. MacLeod, C. Pott, S. Schreiber, L. Trmper,M. Loeffler, P. F. Stadler, P. Lichter, R. Eils, R. Kppers, M. Hummel,W. Klapper, P. Rosenstiel, A. Rosenwald, B. Brors, and R. t. I. M.-S. P.Siebert. Recurrent mutation of the id3 gene in burkitt lymphoma identifiedby integrated genome, exome and transcriptome sequencing. NatureGenetics, 44(12):1316–1320, 2012. → pages 15, 64, 65[64] S. P. Shah, A. Roth, R. Goya, A. Oloumi, G. Ha, Y. Zhao, G. Turashvili,J. Ding, K. Tse, G. Haffari, A. Bashashati, L. M. Prentice, J. Khattra,A. Burleigh, D. Yap, V. Bernard, A. McPherson, K. Shumansky, A. Crisan,R. Giuliany, A. Heravi-Moussavi, J. Rosner, D. Lai, I. Birol, R. Varhol,A. Tam, N. Dhalla, T. Zeng, K. Ma, S. K. Chan, M. Griffith, A. Moradian,S. Cheng, G. B. Morin, P. Watson, K. Gelmon, S. Chia, S. Chin, C. Curtis,O. M. Rueda, P. D. Pharoah, S. Damaraju, J. Mackey, K. Hoon, T. Harkins,V. Tadigotla, M. Sigaroudinia, P. Gascard, T. Tlsty, J. F. Costello, I. M.Meyer, C. J. Eaves, W. W. Wasserman, S. Jones, D. Huntsman, M. Hirst,C. Caldas, M. A. Marra, and S. Aparicio. The clonal and mutationalevolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395–399, 2012. → pages 4[65] C. E. Shannon. A mathematical theory of communication. The Bell SystemTechnical Journal, 27:379–423, 1948. → pages 8[66] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage,N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: A softwareenvironment for integrated models of biomolecular interaction networks.Genome Research, 13(11):2498–2504, 2003. → pages 29[67] S. T. Sherry, M.-H. Ward, M. Kholodov, J. 
Baker, L. Phan, E. M. Smigielski,and K. Sirotkin. dbsnp: the ncbi database of genetic variation. Nucleic AcidsResearch, 29(1):308–311, 2001. → pages 22[68] W. Sikora-Wohlfeld, M. Ackermann, E. G. Christodoulou, K. Singaravelu,and A. Beyer. Assessing computational methods for transcription factortarget gene identification based on ChIP-seq data. PLoS ComputationalBiology, 9(11):e1003342, 2013. → pages 1077[69] M. Spielmann, F. Brancati, P. M. Krawitz, P. Robinson, D. M. Ibrahim,M. Franke, J. Hecht, S. Lohan, K. Dathe, A. Nardone, P. Ferrari, A. Landi,L. Wittler, B. Timmermann, D. Chan, U. Mennen, E. Klopocki, andS. Mundlos. Homeotic arm-to-leg transformation associated with genomicrearrangements at the PITX1 locus. The American Journal of HumanGenetics, 91(4):629–635, 2012. → pages 6[70] M. Spivakov, J. Akhtar, P. Kheradpour, K. Beal, C. Girardot, G. Koscielny,J. Herrero, M. Kellis, E. E. Furlong, and E. Birney. Analysis of variation attranscription factor binding sites in drosophila and humans. Genome Biol,13:R49, 2012. → pages 6, 8[71] J. E. VanderMeer and N. Ahituv. cis-regulatory mutations are a geneticcause of human limb malformations. Developmental Dynamics, 240(5):920–930, 2011. → pages 6[72] B. Vogelstein, N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, andK. W. Kinzler. Cancer genome landscapes. Science, 339(6127):1546–1558,2013. → pages 2[73] S. R. Walker, E. A. Nelson, J. E. Yeh, L. Pinello, G.-C. Yuan, and D. A.Frank. Stat5 outcompetes stat3 to regulate the expression of the oncogenictranscriptional modulator bcl6. Molecular and Cellular Biology, 33(15):2879–2890, 2013. → pages 66[74] D. Wang, J. Long, F. Dai, M. Liang, X.-H. Feng, and X. Lin. Bcl6 repressessmad signaling in transforming growth factor-β resistance. CancerResearch, 68(3):783–789, 2008. → pages 51[75] W. Wasserman and A. Sandelin. Applied bioinformatics for theidentification of regulatory elements. Nature Reviews Genetics, 5(4):276–287, 2004. → pages 8, 22, 23[76] N. Weinhold, A. 
Appendix A

Supporting Materials

[Figure A.1: two histograms, panel A (Cohort 1) and panel B (Cohort 2); x-axis: Mutation probability; y-axis: Frequency.]

Figure A.1: Distributions of the Xseq probabilities when considering all samples (Pr(Dg)). The distributions for Cohort 1 (A) and Cohort 2 (B) are used to choose thresholds.

Table A.1: PC mutations considered in Xseq.

Mutation Type
non stop decay
nonsense mediated decay
non synonymous coding
non synonymous start
splice site acceptor
splice site donor
splice site region
start gained
start lost
stop gained
stop lost
synonymous coding
frame shift
codon deletion
codon change PLUS codon insertion
codon change plus codon deletion
codon insertion
initiator codon variant
missense
inframe deletion

[Figure A.2: two scatter plots, panel A (Cohort 1) and panel B (Cohort 2); x-axis: Protein Coding Mutation Rate; y-axis: TFBS Mutation Rate; legend (Tumour Subtype): BL, DLBCL, FL, PMBCL. Fitted lines: Cohort 1, y = 1.4e−08 + 1.1 · x, r² = 0.942; Cohort 2, y = 1.8e−08 + 0.8 · x, r² = 0.481.]

Figure A.2: Comparison of the indel mutation rates for cis-regulatory and PC regions. Only indels from Cohort 1 (A) and Cohort 2 (B) were considered. TFBS (y-axis) and PC (x-axis) mutation rates are plotted for all the samples. Each sample is represented by a triangle and is color-coded depending on the tumour subtype. The identity function (dashed grey lines) and linear regressions with a 95% confidence region (blue lines, dark grey areas, and equation) were computed.

[Figure A.3: network diagram of MYC and its gene interactors for samples 830, 998, 818, 860, and 992; nodes are labeled GENE_sample (e.g. MYC_860). Legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS, PC and Disrupted TFBS.]

Figure A.3: Impact of MYC expression alteration due to a PC mutation. The samples SA320860, SA320992, SA320830, SA320818, and SA320998 were predicted by Xseq to have a mutated MYC gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. MYC_860 for the MYC gene in sample SA320860).

[Figure A.4: network diagram of MYC and its gene interactors for samples 030, 932, and 012; nodes are labeled GENE_sample (e.g. MYC_012). Legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS, PC and Disrupted TFBS.]

Figure A.4: Impact of MYC expression alteration due to TFBS mutations. The samples SA321012, SA321030, and SA320932 were predicted by Xseq to have a mutated MYC gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. MYC_012 for the MYC gene in sample SA321012).

[Figure A.5: network diagram of MYC and its gene interactors for samples 842, 980, and 004; nodes are labeled GENE_sample (e.g. MYC_004). Legend: No Mutation, PC, TFBS, Disrupted TFBS, PC and TFBS, PC and Disrupted TFBS.]

Figure A.5: Impact of MYC expression alteration due to TFBS and PC mutations. The samples SA321004, SA320842, and SA320980 were predicted by Xseq to have a mutated MYC gene with altered expression. The gene interactors (from biological networks) are provided along with their prediction as up- (red) or down-regulated (blue). The more transparent colour corresponds to a lower probability of either being up- or down-regulated. Nodes represent genes and are labeled with the gene name and the sample number (e.g. MYC_004 for the MYC gene in sample SA321004).
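The regression lines reported for Figure A.2 relate each sample's TFBS indel mutation rate (y) to its PC indel mutation rate (x). The following is a minimal sketch of how such a line and its r² could be computed, assuming an ordinary least-squares fit; the thesis does not state which fitting routine was used, and the function name `fit_line` and the per-sample rates below are hypothetical illustrations, not thesis data. The 95% confidence region shown in the figure is not computed here.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r2), where r2 is the squared Pearson
    correlation between x and y.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r2


# Hypothetical per-sample indel mutation rates (NOT values from the thesis).
pc_rate = [1e-07, 3e-07, 5e-07, 7e-07, 9e-07]
tfbs_rate = [1.2e-07, 3.5e-07, 5.4e-07, 7.9e-07, 1.0e-06]

intercept, slope, r2 = fit_line(pc_rate, tfbs_rate)
print(f"y = {intercept:.2g} + {slope:.2g} * x, r2 = {r2:.3f}")
```

A slope near 1 with a small intercept, as reported for Cohort 1, would indicate that the TFBS and PC indel rates of a sample are roughly equal; a lower r², as reported for Cohort 2, indicates more scatter around the fitted line.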




