Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Bioinformatic approaches for identifying single nucleotide variants and profiling alternative expression… Goya, Rodrigo 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_february_goya_rodrigo.pdf [ 19.94MB ]
JSON: 24-1.0362107.json
JSON-LD: 24-1.0362107-ld.json
RDF/XML (Pretty): 24-1.0362107-rdf.xml
RDF/JSON: 24-1.0362107-rdf.json
Turtle: 24-1.0362107-turtle.txt
N-Triples: 24-1.0362107-rdf-ntriples.txt
Original Record: 24-1.0362107-source.json
Full Text

Full Text

Bioinformatic Approaches forIdentifying Single Nucleotide Variantsand Profiling Alternative Expressionin Cancer TranscriptomesbyRodrigo GoyaB.Sc. Tec de Monterrey, Mexico City, 2000A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Bioinformatics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2017c© Rodrigo Goya 2017AbstractOver the last decade, the advent of high-throughput sequencing (HTS) has given us the ability tostudy DNA and RNA sequences at nucleotide resolution at an unprecedented speed and at a relativelylow cost. This has been an invaluable tool in the study of cancer, allowing projects such as The CancerGenome Atlas and the International Cancer Genome Consortium to sequence thousands of tumoursfrom multiple cancer types. The ever-increasing amounts of data created by these projects demandednew analysis methods: in the first part of this thesis, I focus on method development for mutationcalling in genome and transcriptome data. I present SNVMix, a single nucleotide variant (SNV)caller based on a set of probabilistic models created to adapt to variations in allele representationin a tumour. Differential allele representation in DNA can occur when multiple clones are presentin the sequenced tumour, and in RNA can occur due to differences in gene expression or allele bias.These situations are nearly ubiquitously encountered in cancer sequencing studies, and thus needto be accounted for. I demonstrate that SNVMix was able to outperform another contemporarySNV caller that does not account for variations in allele representation. I also present BWA-R, anadaptation of the Burrows Wheeler Aligner, that can properly align RNA-Seq paired-end reads to agenome reference extended with exon-exon junction sequences formed through splicing. I show thatBWA-R provides better alignments for SNV calling in transcriptomes, resulting in an increase in theproportion of true positive calls obtained. In the second part of this thesis, I analyze RNA-Seq datafrom a triple negative breast cancer (TNBC) cohort and describe the alternative splicing profiles of thepreviously defined Basal and NonBasal subgroups. TNBC is characterized by the absence of estrogenand progesterone receptors and human epidermal growth factor receptor 2 (HER2), which precludesthe use of currently available targeted therapies. TNBC patients are thus treated with chemotherapy,and outcomes are generally poor. I identify alternatively expressed genes that may be relevant to thebiology of these two subgroups and that could provide clues for further studies or treatment options.iiLay SummaryThe ability to study DNA and RNA sequences at nucleotide resolution has been an invaluable tool inthe study of cancer. The ever-increasing amounts of data created by large cancer projects demandednew analysis methods. Mutation detection can help identify genes that are involved in cancer; however,variations found in cancer genomes can make this task difficult. I created statistical models and toolsto improve the detection of mutations in DNA and RNA of tumours. Another aspect of cancer studiesfocuses on the dysregulation of gene expression. Genes can be expressed in different ways throughmechanism called alternative splicing, which can create different versions of proteins that can alter agene’s function in a tumour. I explored the alternative expression profiles in a subgroup of breastcancer called triple negative breast cancer, and provide a summary of alternative expression changesthat can be used to guide future studies.iiiPrefaceAttribution of Work and Published MaterialsChapter 1Chapter 1 incorporates material published in a book chapter I co-authored with Dr. Marco Marraand Dr. Irmtraud Meyer. The book chapter was principally written by me with guidance from Dr.Marco Marra. The book chapter titled “Applications of High-Throughput Sequencing” appeared inthe “Bioinformatics for High Throughput Sequencing” (2012) book by Springer (DOI 10.1007.978-1-4614-0782-9).Chapter 2A version of Chapter 2 has been published as:Goya R., Sun M.G.F., Morin R.D., Leung G., Ha G., Wiegand K.C., Senz J, Crisan A., MarraM.A., Hirst M., Huntsman D., Murphy K.P., Aparicio S. and Shah S.P., 2010. SNVMix: predictingsingle nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26 (6): 730-736.I was the primary author of Goya et al. (2010) and was responsible for the development,implementation and evaluation of the SNVMix package, and all other work including analysis, tablesand figures, except as indicated below. The text was written in collaboration with Sohrab Shah. TheSNVMix1 model (Figure 2.4(a)) was initially proposed by Sohrab Shah with input from Ryan Morinand Samuel Aparicio; I ran the analysis using the model and implemented it in C for the distributedsoftware package. I developed SNVMix2 under the advice of Sohrab Shah. Kevin Murphy providedhelp with the expectation maximization updating equations. Marco Marra, Sam Aparicio and DavidHuntsman provided guidance regarding the requirements for mutation calling. Mark Sun contributedwith the theoretical behaviour of SNVMix1 and provided Figure 2.3. Gavin Ha and Gillian Leungprocessed SNP micro-arrays to obtain testing data. I collaborated with Ryan Morin with testingSNVMix in a production setting while setting up SNVMix for his SNP discovery pipeline. RyanMorin also provided text for the paper and, along with Mark Sun and Anamaria Crisan, proofreadthe manuscript and provided general advise with the text. Ovarian and breast cancer samples wereprovided by David Huntsman and Sam Aparicio respectively. High-throughput sequencing was done atthe Michael Smith’s Genome Sciences Centre by Martin Hirst’s group under the supervision of MarcoMarra. Validation of mutations using Sanger sequencing was done by Janine Senz and KimberleyWiegand.Chapter 3Chapter 3 was entirely written by me and presents my own work regarding the design and expansionof BWA-R, and SNVMix for variant calling using RNA-Seq data. However, I closely collaboratedwith Dr. Ryan Morin during the development of BWA-R, as he helped test my BWA-R code andcontinuously provided insights into the data he was analyzing. Similarly, Dr. Ryan Morin and Dr.Erin Pleasance helped in the definition of the read alignment features that SNVMix could extract toimprove SNV calling.ivPrefaceChapter 4Chapter 4 was entirely written by me, under the guidance of Dr. Marco Marra, Dr. Samuel Aparicioand Dr. Irmtraud Meyer. Dr. Samuel Aparicio contributed to the design of the required analysiswith insights into the biology of breast cancer. This chapter contains a figure and data that I helpedgenerate for:Shah S.P., Roth A., Goya R., et al. 2012. The clonal and mutational evolution spectrum of primarytriple-negative breast cancers. Nature 486, 395–399.Validations of INPP4B expression were done by Dr. Damian Yap and Dr. Samuel Aparicio.Validations of expression of MYO6 long insert were done by Dr. Grace Cheng, Dr. Annie Moradianand Dr. Gregg Morin. High-throughput sequencing was done at the Genome Sciences Centre underthe supervision of Dr. Marco Marra.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Advances in the study of genetic variation and gene expression in cancer . . . . . 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Cancer as a result of mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Variant calling in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.1 Genetic variation and mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3.2 Types of mutations in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.4 Gene expression analysis in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4.1 Gene expression profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4.2 Alternative splicing and expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.5 Overview of the advances in DNA sequencing technology . . . . . . . . . . . . . . . . . . . 101.5.1 Whole genome shotgun sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.5.2 Whole genome resequencing for variant discovery . . . . . . . . . . . . . . . . . . . 111.5.3 Exome sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.5.4 Analysis strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.5.5 Variant discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161.5.6 Variant discovery in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161.5.7 Genomic rearrangements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191.6 Overview of RNA-Seq for transcriptome analysis . . . . . . . . . . . . . . . . . . . . . . . . 201.6.1 RNA-Seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201.6.2 Analysis strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221.6.3 Expression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241.6.4 Variant discovery in RNA-Seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251.7 Chapter summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27viTable of Contents2 SNVMix: a mixture model framework for predicting single nucleotide variantsfrom next generation sequencing of tumours . . . . . . . . . . . . . . . . . . . . . . . . . . 292.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.1.1 Single nucleotide variants in cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.1.2 NGS data pre-processing for SNV detection . . . . . . . . . . . . . . . . . . . . . . . 302.1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.2.1 SNVMix model specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.2.2 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.2.3 Model fitting and parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . 372.2.4 Modelling base and mapping qualities . . . . . . . . . . . . . . . . . . . . . . . . . . . 382.2.5 Implementation, running time and SNV discovery pipeline . . . . . . . . . . . . . 392.2.6 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402.2.7 Accuracy metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412.2.8 Benchmarking experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.3.1 Depth heuristics reduce sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.3.2 Estimating parameters in transcriptome data by model fitting confers betteraccuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.3.3 Evaluation of models on a deeply sequenced breast cancer genome with groundtruth SNVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502.4.1 Dependence on alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502.4.2 Limitations, extensions and future work . . . . . . . . . . . . . . . . . . . . . . . . . 512.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523 BWA-R and SNVMix for improved variant calling in transcriptomes . . . . . . . . 533.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.1.1 Paired-end read alignment in BWA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563.1.2 Sequence and alignment features affecting single nucleotide variant calling . . . 583.1.3 Improving SNV calls using RNA-Seq data . . . . . . . . . . . . . . . . . . . . . . . . 593.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.2.2 Using BWA with junction-enhanced reference genome . . . . . . . . . . . . . . . . 603.2.3 Modifications for BWA read pairing introduced in BWA-R . . . . . . . . . . . . . 613.2.4 Alignment performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663.2.5 SNV calling performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683.2.6 SNVMix extended features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693.2.7 Using random forests to filter SNVs from RNA-Seq . . . . . . . . . . . . . . . . . . 703.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723.3.1 Using BWA with junction-enhanced reference genome . . . . . . . . . . . . . . . . 723.3.2 Modified BWA (BWA-R) with junction-enhanced reference genome improvespaired end alignments of RNA-Seq data . . . . . . . . . . . . . . . . . . . . . . . . . 723.3.3 BWA-R lowers the number of false positive SNV calls . . . . . . . . . . . . . . . . 783.3.4 Sequence and alignment features affecting SNV calling . . . . . . . . . . . . . . . . 853.3.5 Automating feature and threshold selection using random forests . . . . . . . . . 863.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100viiTable of Contents4 Transcriptome analysis of triple negative breast cancer subgroups . . . . . . . . . . 1034.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1034.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074.2.2 RNA-Seq alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084.2.3 Somatic mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1094.2.4 Gene expression measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1094.2.5 Comparing expression values between analysis pipelines . . . . . . . . . . . . . . . 1094.2.6 Comparing libraries of different read lengths. . . . . . . . . . . . . . . . . . . . . . . 1104.2.7 Clustering for group definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114.2.8 Analysis of exon usage differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114.2.9 Filtering genes with low expression across samples . . . . . . . . . . . . . . . . . . 1124.2.10 Subsampling for group size imbalance bias . . . . . . . . . . . . . . . . . . . . . . . . 1134.2.11 Event-specific splicing statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134.2.12 Accessing exon-exon junction information via JunctionDB . . . . . . . . . . . . . 1134.2.13 Annotation of alternative expression events at protein level . . . . . . . . . . . . . 1144.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1144.3.1 Unsupervised clustering of TNBC tumours creates two subgroups closely relatedto PAM50 labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1144.3.2 Differential gene expression between Basal and NonBasal subgroups highlightsmetabolic and morphology related genes . . . . . . . . . . . . . . . . . . . . . . . . . 1174.3.3 119 alternatively expressed genes related to dysregulated gene expression . . . 1204.3.4 Alternative expression profiling of breast myoepithelial and luminal epithelialcells highlights cell type specific splicing changes in TNBC subgroups . . . . . . 1254.3.5 PI3K-pathway regulator INPP4B potentially silenced by alternative splicing inBasal TNBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1324.3.6 MYO6 isoform linked to cell polarity is preferentially expressed in NonBasalTNBC and luminal epithelial cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1334.3.7 Clustering tumours according to MYO6 isoform expression creates groups withmore similar exon usage profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1374.3.8 Exon usage and exon-exon junction analysis suggest candidate genes withprotein changes in extracellular domains . . . . . . . . . . . . . . . . . . . . . . . . . 1424.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1454.4.1 Alternative phenotype-linked isoforms of MYO6 found in Basal and NonBasalTNBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1504.4.2 Alternative expression as a possible silencer of tumour suppressor INPP4B . . 1524.4.3 “Rare” isoforms of KRAS and PTK2 in TNBC subgroups . . . . . . . . . . . . . 1534.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1544.5.1 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1555 Conclusions and future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204Appendix A SNVMix supplementary material . . . . . . . . . . . . . . . . . . . . . . . . . . . 205Appendix B BWA-R supplementary material . . . . . . . . . . . . . . . . . . . . . . . . . . . 214viiiTable of ContentsAppendix C Transcriptome analysis of TNBC subgroups, supplementary material 218C.1 Comparison of hormone receptor expression between TNBC subgroups and ER+/PR+/HER2+BC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218C.2 Controlling for biases due to group size imbalances . . . . . . . . . . . . . . . . . . . . . . . 219Appendix D Transcriptome analysis of TNBC subgroups, supplementary tables . 233Appendix E JunctionDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292ixList of Tables2.1 Description of random variables in SNVMix1 and SNVMix2 . . . . . . . . . . . . . . . . . 362.2 Performance values for different runs of SNVMix2 and Maq. . . . . . . . . . . . . . . . . . 453.1 Contextual information presented by SNVMix for each SNV called. . . . . . . . . . . . . 703.2 Metrics for comparing BWA and BWA-R alignments . . . . . . . . . . . . . . . . . . . . . . 773.3 False discovery comparison for SNVs using BWA and BWA-R alignments. . . . . . . . . 823.4 False discovery comparison considering only read pairs mapping to the same gene. . . . 834.1 Genes with exon usage differences in both TNBC subgroup and breast myoepithelialversus luminal epithelial cell comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1294.2 Genes with alternative exon usage in extracellular domains. . . . . . . . . . . . . . . . . . . 144xList of Figures1.1 Hallmarks of cancer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Effect of introns in short read alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.1 Schematic diagram of input data to SNVMix . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.2 Visual representation of mapping and base qualities as input for SNVMix2. . . . . . . . 322.3 Theoretical behaviour of SNVMix at varying depths. . . . . . . . . . . . . . . . . . . . . . . 332.4 SNVMix probabilistic graphical model for SNV detection. . . . . . . . . . . . . . . . . . . . 352.5 Conditional probability distributions of the SNVMix model . . . . . . . . . . . . . . . . . . 372.6 Effect of base qualities on SNVMix’s performance. . . . . . . . . . . . . . . . . . . . . . . . . 402.7 ROC curves at varying depth thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432.8 Distribution of AUC and AUC-PR in cross-validation runs. . . . . . . . . . . . . . . . . . . 442.9 ROC scan of thresholding models using SNVMix2 on 497 positions . . . . . . . . . . . . . 472.10 ROC curves for MB scan on 497 ground truth positions. . . . . . . . . . . . . . . . . . . . . 482.11 Effect of thresholds on base and mapping qualities. . . . . . . . . . . . . . . . . . . . . . . . 493.1 Exon-exon junction enhanced reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573.2 Building exon-exon junction sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613.3 Strand support for reference and alternate alleles. . . . . . . . . . . . . . . . . . . . . . . . . 713.4 Mismatches caused by small overlap across EEJs. . . . . . . . . . . . . . . . . . . . . . . . . 713.5 Comparison of BWA and BWA-R on gene expression measurements. . . . . . . . . . . . . 753.6 Changes in read mapping quality between BWA and BWA-R. . . . . . . . . . . . . . . . . 793.7 Number of SNV calls obtained using BWA-R and BWA. . . . . . . . . . . . . . . . . . . . . 803.8 SNV call comparison between BWA and BWA-R. . . . . . . . . . . . . . . . . . . . . . . . . 813.9 Effect on SNV calling performance when only using paired reads that map to the samegene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843.10 Non reference read content in TP and FP groups. . . . . . . . . . . . . . . . . . . . . . . . . 873.11 Non-reference and position coverage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883.12 Mismatches at centre and edges of supporting reads in mQ20 dataset. . . . . . . . . . . . 903.13 Log fold change ratios of centre and edge mismatches in mQ20 dataset. . . . . . . . . . . 913.14 ROC curves for manual thresholding on mQ20 dataset. . . . . . . . . . . . . . . . . . . . . 923.15 ROC curves for manual thresholding on mQ30 dataset. . . . . . . . . . . . . . . . . . . . . 933.16 Strand bias metric and FP calls in mQ20 dataset. . . . . . . . . . . . . . . . . . . . . . . . . 953.17 ROC curves from nested cross validation of random forests. . . . . . . . . . . . . . . . . . . 984.1 Immunohistochemical staining of breast tissue. . . . . . . . . . . . . . . . . . . . . . . . . . . 1054.2 Sample sub-grouping and comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084.3 Definition of exon parts by DEXSeq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124.4 Recurrent mutations in Basal and NonBasal TNBC tumours. . . . . . . . . . . . . . . . . 1154.5 Unsupervised clustering of TNBC RNA-Seq libraries. . . . . . . . . . . . . . . . . . . . . . . 1164.6 Differential gene expression between NonBasal and Basal subgroups. . . . . . . . . . . . . 1194.7 ESR1 and AR expression versus FOXC1 /FOXA1 ratio. . . . . . . . . . . . . . . . . . . . . 1204.8 Gene ontology enrichment for TN subgroup-specific genes. . . . . . . . . . . . . . . . . . . 121xiList of Figures4.9 DEXSeq results overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1234.10 GO enrichment of alternatively expressed genes between Basal and NonBasal subgroups.1254.11 Overlap of GO enrichment for differential exon usage in BvsNB and non-tumour breastcells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1274.12 Expression of INPP4B short isoform in TNBC Basal and myoepithelial cells. . . . . . . 1334.13 INPP4B could not be detected in samples with the short INPP4B mRNA isoform. . . 1344.14 MYO6 is alternatively expressed in TNBC Basal and NonBasal comparison. . . . . . . 1354.15 MCF7 cells express MYO6 long insert as measured using selected reaction monitoring. 1364.16 Subgrouping tumours according to MYO6 long isoform. . . . . . . . . . . . . . . . . . . . . 1384.17 GO enrichment overlap between TNBC subgroup and MYO6 long insert comparisons. 1404.18 KRAS is alternatively expressed between TNBC Basal and NonBasal TNBC. . . . . . . 1414.19 Higher inclusion of brain-specific PTK2 exon in NonBasal TNBC and MYO6-LI. . . . 1434.20 Junction coverage supports expression of PTK2 small exon. . . . . . . . . . . . . . . . . . 1464.21 Changes in extracellular domains detected by JunctionDB. . . . . . . . . . . . . . . . . . . 148xiiList of AbbreviationsAE alternative expressionAML acute myeloid leukemiaAPI application program interfaceAS alternative splicingAUC area under the (ROC) curveAUC-PR area under the precision-recall curveBC breast cancerBCCA British Columbia Cancer AgencyBCCRC British Columbia Cancer Research CentreBWA Burrows-Wheeler AlignerBWA-R Burrows-Wheeler Aligner RNABWT Burrows-Wheeler TransformationcDNA complementary DNACDS coding sequenceCGH comparative genomic hybridizationCIGAR compact idiosyncratic gapped alignment reportCNV copy number variationCV cross validationDE differentially expressedDEG differentially expressed geneDLBCL diffuse large B-cell lymphomaDNA deoxyribonucleic acidEEJ exon-exon junctionEM expectation maximizationEMT epithelial-to-mesenchymal transitionxiiiList of AbbreviationsENCODE Encyclopedia of DNA ElementsER estrogen receptorEST expressed sequence tagsFDR false discovery rateFISH fluorescent in situ hybridizationFL follicular lymphomaFN false negativeFP false positiveFPR false positive rateGO Gene OntologyHER2 human epidermal growth factor receptor 2hnRNP heterogeneous nuclear riboproteinHTS high-throughput sequencingICGC International Cancer Genome ConsortiumIHC immunohistochemistrylincRNA long intergenic non coding RNALOO leave one outMAF minor allele frequencyMAP maximum a posteriorimiRNA micro RNAmQ mapping quality (Phred scale)mRNA messenger RNANCBI National Center for Biotechnology InformationncRNA non coding RNANGS next-generation sequencingNMD nonsense mediated decayNMF non-negative matrix factorizationPAM Prediction Analysis of MicroarrayPCR polymerase chain reactionxivList of AbbreviationsPR progesterone receptorPSC premature stop codonRF random forestRNA ribonucleic acidROC receiver operating characteristicRPKM reads per kilobase of gene model per million mapped readsRRM RNA recognition motifrRNA ribosomal RNART-PCR reverse transcription polymerase chain reactionSAGE serial analysis of gene expressionsiRNA short interfering RNAsnoRNA small nucleolar RNASNP single nucleotide polymorphismsnRNA small nuclear RNASNV single nucleotide variantsSRM selected reaction monitoringTCGA The Cancer Genome AtlasTN true negativeTNBC triple negative breast cancerTP true positiveTPR true positive ratetRNA transfer RNAUCSC University of California, Santa CruzUTR untranslated regionWGS whole genome sequencingWTSS whole transcriptome shotgun sequencingxvAcknowledgementsI would like to thank my supervisor Dr. Marco Marra for the opportunity to work in his lab atthe Genome Sciences Centre, as well as his never ending support and patience, his guidance andwisdom, his wit and knowledge. He provided an unparalleled role model to follow, both professionallyand personally. The opportunities made available to me to collaborate with other researchers at theGenome Sciences Centre and the BC Cancer Research Centre were priceless, and have helped meimmensely during my graduate studies. I would also like to thank Dr. Irmtraud Meyer for her supportduring her time at the University of British Columbia, for her insights in algorithms and bioinformatics,and providing an environment where I could share ideas with other students and researchers at theMichael Smith Labs at UBC. I thank my committee members Drs Anne Condon, Steven Jones andPaul Pavlidis, for their advice and guidance. I am grateful to the Government of Canada for awardingme the Vanier Canada Graduate Scholarship; the Canadian Institutes of Health Research, the MichaelSmith Foundation for Health Research and the Genome Sciences Centre for supporting me throughthe Bioinformatics Training Program at UBC; and the University of British Columbia for awarding mewith the Four Year Doctoral Award and International Student Award. I also deeply thank Drs SamuelAparicio, Sohrab Shah, and Gregg Morin, for supporting me with their knowledge and helping inshaping my research, in the biological and computer science aspects. I thank Drs Ryan Morin, MalachiGriffith and Maria Mendez-Lago, for involving me in their research, introducing me to a whole newarea of research; as well as other Marra Lab alumni Drs Olena Morozova, Sorana Morrisey, and EmiliaLim. I also want to thank Elizabeth Chun, Andrew Mungall, Alessia Gagliardi, Andrew McPherson,Kieran O’Neill and Davide Pellacani for their friendship and immense help in both personal andacademic issues. All my lab members at the Marra Lab, Diane Trinh, Veronique LeBlanc, RyanHuff, James Topham, Jungeun Song, Emma Titmuss, Lisa Wei, Stephen Dongsoo Lee, Annie Cavalla,Suganthi Chittaranjan, Susana Chan, Dan Jin, and at CHiBi/MSL Daniel Lai, Alborz Mazloomian,Shu Yang, Evan Gatev, and Alice Zhu. On a personal note, I would like to thank my parents, Carlosand Lilia, for their lifelong support and raising an amazing family, I hope to make you proud. To mybrother Adrian for his candidness, for sharing his view of life, and making me realize what it is allabout. To our brother Dario, whose joyful memory still drives me. And to Ce´lie, my partner in life,for coming with me in this long trip and for her support during these arduous years, for fighting tokeep me from falling, and for giving us our amazing kids Ilan and Luna, I love you.xviA Dario,te extran˜amos hermano.xviiChapter 1Advances in the study of geneticvariation and gene expression in cancer1.1 IntroductionSince the nineteenth century, cancer has been known to be a genetic disease. During the last decade,technological advances in nucleotide sequencing have enabled the study of cancer at nucleotide-levelresolution and at an ever-increasing rate, allowing researchers to sample the entire DNA and RNAcontent of cancer cells. These advances have been used in large-scale projects such as The CancerGenome Atlas (TCGA) [1] and the International Cancer Genome Consortium (ICGC) [2] to studythousands of tumours of different types, generating large amounts of biological information. Whetherto elucidate the biology behind different types of cancers, or provide a basis for better patient treatmentor prognosis, detection of mutations in cancer and identification of their effects on gene expression havebecome crucial to understand cancer. With the emergence of new technologies come new datatypes,and with new datatypes comes the need for new analytical methods and strategies. This dissertationfocuses on the development of bioinformatic methods and strategies for the study of single nucleotidevariants (SNV) and alternative expression (AE) analysis in cancer transcriptomes. In this chapter,I give a brief overview of cancer biology and attempt to convey the importance of mutations andgene dysregulation in this disease. I also present an overview of techniques used to study geneticvariation and gene expression in cancer studies. Finally, Section 1.7 introduces the specific goals andcontributions presented in each of the research chapters of this dissertation.1.2 Cancer as a result of mutationsCancer is composed of aberrant cells, differentiated from their normal counterparts through theincremental acquisition of mutations [3] and eventual dysregulation of cellular pathways that enable11. Advances in the study of genetic variation and gene expression in cancerFigure 1.1: Hallmarks of cancer. Proposed by Hanahan, D. and Weinberg, R. (2000,2011) [4, 5], theHallmarks of Cancer describe a set of properties describing cancer cells. Reprinted from Publication Cell,144/5, Douglas Hanahan, Robert A. Weinberg, Hallmarks of Cancer: The Next Generation, Pages No. 646-674,Copyright (2011), with permission from Elsevier.tumourigenesis [4, 5]. Genetic aberrations can confer cells with the required traits to become cancercells. These traits have been described as the hallmarks of cancer, which include the ability tosustain unrestricted growth, both by self-sufficient growth signalling and evasion of growth suppressors,an ability to generate new blood vessels (angiogenesis), an ability to avoid programmed cell-death(apoptosis), unrestricted cell replication, and an ability to invade other tissues [4]. More recently, twonew hallmarks have been included: the ability to evade the immune system and the remodelling ofmetabolic pathways to support tumour growth [5]. In their review, Hanahan and Weinberg (2011) [5]also describe two cancer-enabling characteristics: inflammation mechanisms that provide a tumourcell with necessary cancer-promoting biomolecules, and genome instability, enabling the tumour tosustain an accelerated rate of genomic alterations.The genetic basis of cancer was first encountered by David von Hansemann in 1890 when henoticed asymmetric distribution of chromosomes in carcinoma cells undergoing mitosis [6]. In theearly 1900s, Theodor Boveri used sea urchins to study the role of chromosomes in heredity [7], notingdifferences in cell behaviour accompanying aberrant chromosomal representation, he theorized thatthe changes in chromosomal content altered the number of copies of genes that either stimulated or21. Advances in the study of genetic variation and gene expression in cancerinhibited growth [8]. Boveri’s insights form the base of what we now know as oncogenes and tumoursuppressors [9]. Tumour suppressors are genes that directly or indirectly limit cell proliferation, andin so doing they can stop a cell from undergoing tumourigenesis. Loss-of-function mutations caninactivate tumour suppressors, freeing the cell from the gene’s regulatory functions. As one copy maybe enough for a tumour suppressor to exert its function, generally both copies have to be silenced andthus they are considered to be recessive genes. A different kind of cancer genes are known as oncogenes,which are mutated versions of normal genes (proto-oncogenes) usually with roles linked to cell growth,proliferation, and inhibition of cell death. When a proto-oncogene is affected by gain-of-functionmutations or over expression, its activity can be dysregulated, advancing the cell in its path towardstumourigenesis. At least one allele needs to be mutated to activate a proto-oncogene and so they aregenerally considered to be dominant alleles. A third class of cancer-related genes are those involved inDNA repair pathways, processes by which a cell can correct errors in its DNA. Mutation of these genescan affect a cell’s ability to repair its DNA which increases the rate at which mutations can be acquired.Multiple genes have to be altered in a cell before it can become cancerous [10], some mutations canbe inherited (germline) and others need to be acquired (somatic) [11]. Germline mutations can thusbe considered to confer a cancer predisposition [12], while sporadically acquired somatic mutationscomplete the mutational load required for tumourigenesis. Although tumours may arise from differentsets of somatic mutations, similar cellular pathways may be affected. Thus, by studying patterns ofsomatic mutation in cancer, we can elucidate the mechanisms by which different tumours are created,as well as defining new gene functions and interactions. The characteristics that help classify a gene aseither an oncogene or a tumour suppressor are not always easy to define. The widely known tumoursuppressor TP53, for example, was initially classified as an oncogene due to its high expression intumours [13]. Years later, a different study found that mutation of both alleles of TP53 could causetumourigenesis, establishing TP53 as a tumour suppressor [14]. As technology advances and ourability to analyze DNA increases, description of mutation patterns in tumour samples have helped inthe identification of new cancer genes. Additionally, characterization of tumours can also help findnew biomarkers for detection, develop prognostic methods and guide patient treatment.31. Advances in the study of genetic variation and gene expression in cancer1.3 Variant calling in cancer1.3.1 Genetic variation and mutationsGenetic variation refers to the differences in the DNA of individuals of a species. In this thesis I willbe working with human samples, and so I will consider the base on which genetic variation happensto be a diploid genome. One of the defining features of an organism’s genome is its ploidy, whichindicates the number of copies of each chromosome. Thus, diploid genomes like those of humanscontain 2 sets of chromosomes, one from each of the individual’s parents, and each providing a copyor allele of every gene. Through DNA recombination during meiosis, the genetic material betweenhomologous chromosomes is exchanged, creating variation in the organism’s offspring. This form ofgenetic variation, essentially through reshuffling, is limited to combinations of alleles already presentin a population. Another source of variation is known as gene flow, or gene migration, by whichalleles are exchanged between populations. This permits the exchange of genetic information betweenpopulations that may have been under different selective pressures and thus different sets of allelicvariants might have been enriched by natural selection. These two sources of genetic variation arelimited to combinations of a fixed set of alleles. Genetic mutations are new and permanent changesin DNA; as such, they are ultimately the driving force behind variation. Mutations driving geneticdiversity can occur due to DNA replication or recombination errors; however, in order for thesemutations to become part of the population they need to be created in germ cells, which can pass themutation to the next generation. Mutations created in an organism’s somatic cells are not passedto the next generation, but may become part of a cell lineage in the organism itself, which can thenacquire new mutations. In cancer, it is this sequential accumulation of mutations in a cell lineage thatcan eventually alter one or more cancer genes, pushing the cell’s descendants towards malignancy.1.3.2 Types of mutations in cancerThere are different types of mutations in cancer and different methodologies exist to detect them.At the chromosome level a tumour can present translocations, inversions or copy number variants,including gain or loss of chromosome arms or entire chromosomes. At the nucleotide level mutationscan take the form of single nucleotide variants (SNV ) or small insertions and deletions (indels). In thisthesis I will focus on mutations of the single nucleotide variant (SNV) type; however, for completenessI will now give a general overview of the different mutations types and methods for their detection.41. Advances in the study of genetic variation and gene expression in cancerChromosomal aberrations can present as chromosomal translocations, where large portions ofchromosomes are exchanged between two or more chromosomes. These changes can reposition acancer gene near the regulatory region of a highly expressed gene, potentially causing over-expressionand activation of the tumourigenic functions of the cancer gene [15]. Another potential effect oftranslocations is the creation of chimeric, or fusion, protein products, in which the coding sequencesof two or more genes are expressed together to make a new protein [16]. Copy number gains, in whichsections or entire chromosomes are duplicated, can result in the over-expression of oncogenes; similarly,chromosomal losses can silence tumour suppressors. Detection of large chromosomal aberrations canbe done by inspecting chromosome banding patterns [17]. This technique is based on the stainingof chromosomes which can reveal different chromatin patterns observable under a microscope. Inanother technique called fluorescent in situ hybridization (FISH) [18], different targeted regions of achromosome are tagged with a fluorescent probe and the difference in luminescence can be used toinfer copy number gains or losses. Chromosomes from different samples can also be analyzed togetherwith comparative genomic hybridization (CGH) [19]. In CGH, DNA from two sources are labelledwith different colours (e.g. red and green) and hybridized competitively to a set of DNA probes ona slide. Inspecting the resulting colour components of the probes can be used to infer the relativecontent of each chromosome section between the samples. This technique can be used to compare apatient’s normal and tumour DNA, allowing the identification of copy number changes present inthe tumour. Advances in microarray technology allowed the creation of array-CGH (aCGH) [20, 21]which increased the throughput of this technique. The increasing probe density of DNA microarrays,initially used to detect single nucleotide polymorphisms (SNPs), allowed their use for the analysis ofcopy number variants [22]. With this technique, DNA from samples to be compared (e.g. tumourvs normal) would first be hybridized to individual SNP arrays. Then, using the intensity of thefluorescence created by the scanning laser as an indicator of the amount of DNA binding to eachprobe set, differences in the original amount of DNA material between samples would be detectedas changes in the intensity of each SNP probe between sample arrays. Computational approachesto compare multiple arrays were later developed and used to study CNVs in larger cohorts [23, 24].High-throughput sequencing (HTS) data was similarly used, using changes in read coverage acrossthe genome to infer CNV variants. Additional information obtained from DNA sequencing, likethat provided by paired-end reads, could also be used to detect translocations and inversions (seeSection 1.5.7).51. Advances in the study of genetic variation and gene expression in cancerChanges affecting nucleotides at a smaller scale include SNVs and short insertions and deletions.Depending on the genomic location and type of change, these mutations can have little or no effect. Forexample, a single nucleotide change on the protein coding part of a gene may affect the resulting protein,while an SNV in intergenic space devoid of regulatory signals may have no effect. SNVs affectingthe protein coding region of a gene can be classified into synonymous, non-synonymous/missenseor nonsense mutations. Synonymous mutations do not change the encoded amino acid and thusare generally considered silent; however, organisms are known to have different preferences in codonusage (codon bias) [25–27], which can be used as a means to regulate translation efficiency [28]. Asynonymous mutation that modifies a common codon into a rare codon can therefore cause a dropin the efficiency with which an mRNA is translated, which could even affect protein function [29].Non-synonymous, or missense, mutations can modify the amino acid encoded by the target codon andcan potentially have an effect on the protein product of the gene. Nonsense mutations are those inwhich an early stop codon is inserted, truncating the protein product and possibly triggering mRNAdegradation. Missense mutations can modify the function of a gene, while nonsense mutations willlikely silence a gene. It is no surprise that in cancer studies, the presence of these mutations is oftenuse to infer whether a gene is an oncogene or tumour suppressor respectively [9]. Indels are mutationsthat remove or insert one or more nucleotides in a DNA sequence. Similarly to SNVs, the effect ofan indel strongly depends on where it happens. Additionally, the length of the indel modulates itseffect, partial deletion or insertion of a codon (3 nucleotides) would cause a frame-shift in the codingsequence of a gene, while indels of complete codons add or remove amino acids from the encodedprotein. The location of a partial indel within the gene may affect the amount of damage done by aframe-shift, the earlier it appears in the sequence the more codons it will affect. Frame-shift mutationsmay also create a premature termination codon (PTC), a type of codon that signals the ribosome tostop translation. PTCs present more than 50-55 base pairs upstream of the last intron in a gene willcause degradation of the mRNA through a mechanism called nonsense mediated decay (NMD) [30].Techniques for DNA and RNA sequencing data for the detection of these mutations can be found inSection Advances in the study of genetic variation and gene expression in cancer1.4 Gene expression analysis in cancerGene expression analysis through the measurement of messenger RNA (mRNA) levels of proteincoding genes does not necessarily reflect the amount of the corresponding protein in a cell. However,measuring mRNA levels or, rather, changes in the mRNA levels helps us determine whether therehas been a change in the transcriptional regulatory network affecting a specific gene. Even with thislimitation, gene expression analysis at the mRNA level has been a common experimental tool indifferent types of studies.1.4.1 Gene expression profilingGene expression is the process by which DNA is transformed into a functional element. As describedin the Central Dogma of Molecular Biology [31], information can be transferred from DNA into DNA,RNA or protein. Enzymes in the RNA polymerase family can transcribe DNA into different types ofRNA (e.g. messenger RNA, transfer RNA, micro RNA and ribosomal RNA). Protein coding genes aretranscribed into messenger RNA molecules that encode proteins. RNA polymerase transcribes DNAinto a precursor mRNA (pre-mRNA), which is co-transcriptionally modified by mechanisms such asfive-prime capping, splicing and polyadenylation to create a mature mRNA molecule. Five primecapping is the process by which a modified guanine nucleotide is attached to the 5’ end (start) of thepre-mRNA molecule. The 5’ cap protects a transcript from degradation and facilitates translation [32].Splicing is the process by which introns in pre-mRNA are eliminated and exons are joined together,allowing a gene’s sequence to encode different gene products [33, 34]. Through polyadenylation, a seriesof adenine bases (poly(A) tail) are added at the transcription termination site (3’ end). The poly(A)tail facilitates the export of the mRNA from the nucleus [35] and regulates mRNA degradation andtranslation [36]. The mature mRNA molecule can then be exported from the nucleus and translatedinto protein by the ribosome. Proteins can then undergo further modifications (e.g. phosphorylation,methylation, acetylation) that may modulate their function. In this thesis I worked with DNA andRNA data, unless specified otherwise, and when referring to gene expression levels I will refer tomRNA transcript abundance.Gene expression (transcriptional) profiling involves the quantification of RNA from a group ofgenes in one or more samples. Levels of mRNA can vary depending on different factors such asthe accessibility of gene regulatory regions (e.g. promoters, enhancers, silencers, and insulators)71. Advances in the study of genetic variation and gene expression in cancerand the presence of proteins that recognize these transcriptional signals (e.g. transcription factors,activators, and co-activators) [37]. These factors can be modified in a cell-type and temporal-specificmanner. For example, expression patterns can be defined by histone marks on DNA signals such asenhancer elements [38], and expression of different combinations of transcription factors can targetdifferent enhancers [39]. Being able to measure gene expression levels enables us to investigate thetranscriptional activity of specific samples, which in turn allows us to better understand the biologyof the cells and the genes themselves.Measuring levels of mRNA can be done using different techniques. In situ hybridization, forexample, can be used to identify specific RNA directly on a tissue sample using labelled complementaryRNA probes [40]. This technique measures the expression of a transcript and its localization amonga group of cells. Northern blotting uses gel electrophoresis to first separate RNA molecules by size,then the RNA is blotted on a membrane where a labelled sequence-specific probe is hybridized tothe immobilized RNA [41]. The radioactive labels on the membrane are then used to expose anX-ray film, where the marks on the film indicate the size of the mRNA. The existence of bandsof different sizes than expected could indicate the presence of alternative transcripts. Anothertechnique used to measure expression of a gene across samples is reverse transcription polymerasechain reaction (RT-PCR) [42]. First, complementary DNA (cDNA) is obtained from RNA throughreverse transcription and amplified using polymerase chain reaction (PCR) and specific primers. PCRproducts are then separated on a gel and measured. Through the exponential growth in the number ofPCR amplification products, this technique is able to detect differences in RNA content even at verylow initial concentrations. Both northern blotting and RT-PCR are limited in the number of genesthey can measure at a time. Serial analysis of gene expression (SAGE) was developed to measurethe expression levels of many genes simultaneously [43]. This method works by capturing mRNAmolecules by their 3’ end, synthesizing cDNA by reverse transcription and using restriction enzymesto produce short tag sequences (≈20bp) from each cDNA. The resulting sequences are concatenatedand sequenced using capillary (Sanger) sequencing, generating a list of tag sequences representative ofthe original transcripts. These tags are counted and the frequencies can be interpreted as a measureof the transcript’s expression level. DNA microarrays were later developed which could be used tomeasure expression by binding cDNA to probes attached to an array [44]. Comparative studies wereenabled by using two fluorophores to label samples and hybridizing them to the same microarray [45].Technological advances allowed an increase in probe density [46], which eventually allowed to expand81. Advances in the study of genetic variation and gene expression in cancerfrom gene-level probes to exon-level probes [47, 48] The introduction of RNA-Seq demonstrated itsuse for gene expression with comparable results to microarrays [49, 50].Gene expression profiling has many uses in the study of cancer. The molecular signature of atumour can be defined by its gene expression patterns [51]. Molecular signatures can help inferphenotypic characteristics of a sample and relate it to other samples (e.g. from the same tissuetype or disease type) [52]. By analyzing gene expression patterns across different cancer samples,classifiers can be built to differentiate between tumour types [53] or between subgroups within thesame tumour type [54]. Expression profiling of diffuse large B-cell lymphoma (DLBCL), for example,can separate the samples into two different groups, which can be linked to different cell-lineages andpatient survival rates [55]. Comparing molecular signatures between normal and malignant cellscan help identify dysregulated genes relevant to the disease [56]. Expression analysis can also helpidentify proto-oncogenes, as they are often amplified and overexpressed [57]. In breast cancer, forexample, over-expression of the HER2 gene was linked to poor prognosis [58]; however, in this case,overexpression of HER2 also made these tumours susceptible to targeted therapy [59]. Joint analysisof mutation status and expression profiling can highlight cancer-relevant genes [60]. For example,truncating mutations can cause a splicing change that skips the exon that contains the mutation and,depending on the reading frame, could produce an altered protein [61]. In contrast, the adverse effectof mutated proto-oncogenes can be increased when the gene is also over expressed. This has been seenwith KRAS mutations in lung adenocarcinoma, where the mutant allele is also amplified via copynumber gains [62]. Although nucleotide variants (e.g. SNVs) are usually detected in DNA sequences,RNA-Seq’s base pair resolution allows the detection of nucleotide variants in transcribed sequences [63].The importance of this has been highlighted in studies using RNA-Seq data to find links betweenallelic imbalanced expression with chromosomal deletion and amplification events [64, 65].1.4.2 Alternative splicing and expressionAlternative splicing is the process by which different mRNA molecules, potentially encoding differentgene products, can be obtained from the same gene [33, 34]. After the sequencing of the humangenome [66, 67] it became evident that the number of protein coding genes in the human genome wasnot much different from that of worms (22,375 in human and 20,362 in C. elegans, Ensembl v90 [68]).Indeed the larger variation in protein content previously observed in humans was in part generated bydifferent combinations of exons from a smaller than anticipated number of genes [69]. The splicing91. Advances in the study of genetic variation and gene expression in cancermachinery depends on expression of splicing factors and splicing signals across genes to determine thecombination of exons to be joined [70]; as with gene expression, alternative splicing can also happenin a tissue specific manner [71, 72]. Dysregulation of alternative splicing has been linked to variousdiseases [73, 74] and including cancer [75]. Although technologies like exon arrays, which could beused for gene expression analysis [47, 48], can also be used to study alternative splicing [76–78], theyare limited in their measurements by the predefined set of probes in the microarray. The abilityof RNA-Seq to sample all isoforms in the sample makes it an ideal platform to study alternativesplicing [79].1.5 Overview of the advances in DNA sequencing technologyFor the last thirty years, DNA sequencing has been central to the study of molecular biology, havingbecome a valuable tool in efforts to understand the basic building blocks of living organisms. Theavailability of genome sequences provides researchers with the data required to map the genomiclocation and structure of functional elements (e.g. protein coding genes) and to enable the study of theregulatory sequences that play roles in transcriptional regulation. Large international collaborationshave undertaken the decoding of genome sequences for a diversity of organisms, including (but notlimited to) the bacteria Haemophilus influenzae Rd, with a genome of 1.8 megabases (Fleischmannet al. 1995); the yeast Saccharomyces cereviseae, with a 12 megabase genome [80]; the nematode C.elegans, with a 97 megabase genome [81] and the human genome, with a 3 Gb genome [66, 67]. Suchprojects yielded data that have been used to develop molecular “parts lists” that reveal not onlygene content, but inform on the evolutionary relationships and pressures that have acted to shapegenomes [82]. The technology historically employed for such reference genome sequencing projectswas based on the Sanger chain termination sequencing method. Although still considered the goldstandard for sequence validation, the relatively low throughput and high cost of Sanger sequencingbecame limiting factors when designing large experiments where massively parallel data collection wasrequired. The high throughput capabilities of massively parallel sequencing took sequencing efforts innew directions not previously feasible, enabling both the analysis of new genomes and also facilitatinggenome comparisons across individuals from the same species, thereby identifying intraspecific variantsin a high resolution genome-wide fashion.101. Advances in the study of genetic variation and gene expression in cancer1.5.1 Whole genome shotgun sequencingWhole genome shotgun sequencing uses genomic DNA as the source material for preparation ofDNA sequencing libraries [83]. A library is a collection of DNA fragments, obtained from the sourcematerial and rendered suitable for sequence analysis through a process of library construction. Thisinvolves shearing of the DNA sample by enzymatic (e.g. restriction enzymes) or mechanical means(e.g. sonication, nebulization) [84]. The aim of fragmentation is to reduce the physical size of theDNA template molecules to the optimal fragment length for the assay type and the instrument systembeing used, while endeavouring to maintain an unbiased representation of the starting DNA material.The resulting fragments are then subjected to gel-based electrophoresis separation, and the desiredsize range of DNA fragments is then recovered from the gel matrix [83]. Sequencing fragments ofroughly the same size is especially useful when analyzing paired-end sequences, where sequencesare collected from both ends of linear template molecules. As will be explained later, paired-endinformation can enable certain types of bioinformatic analysis. A common goal of whole genomeshotgun sequencing is the resequencing of multiple individuals, for example to study intraspecificvariation and the association of such variation with healthy and disease states.1.5.2 Whole genome resequencing for variant discoveryThe term resequencing refers to the act of sequencing additional individuals from the same species,where a reference genome has been generated and is used to assist in the interpretation of the datacollected using sequencing approaches [85]. Resequencing of human genomes has been used to discoverboth mutations [86, 87] and polymorphisms [88]. The existence of the reference genome sequencesmade variant discovery the first application of various platforms such as Roche / 454 [89], Illumina/ Genome-Analyzer [90] and Applied Biosystems / SOLiD technologies [91]. Alongside the obviousscientific impetus for resequencing species of significance in medical research, an initial reason for theemergence of resequencing was largely technical – software for a whole genome assembly using shortreads did not exist. In the absence of a reference genome to aid alignment, HTS was capable of littlemore than producing large collections of short sequence reads. Thus, because the short read lengthsimpeded direct assembly of the sequenced data, reference genome sequences assembled from longerSanger contigs (e.g. human [66, 67], mouse [92], and rat [93]) provided positional data.An early challenge in resequencing was the production of reads of sufficient length to align (“map”)uniquely to the human genome. Using simulated data, it was estimated that reads of at least 23111. Advances in the study of genetic variation and gene expression in cancernucleotides in length would be needed to uniquely cover 80% of the human genome, and reads of atleast 47 base pairs would be required to cover 90% of the human genome [94]. With the exception ofthe Roche / 454 instrument [89], early achievement of such read lengths entailed both instrumentationand chemistry challenges. The Roche instrument was used to demonstrate the potential of nextgeneration resequencing when it was used to sequence Dr. James D. Watson’s genome [95]. Withina time span of two months, 24.5 gigabases of raw sequence data were generated, providing 7.4 foldaverage base pair coverage of Dr. Watson’s genome. The sequence data provided sufficient resolutionfor the detection of known polymorphisms and novel mutations including insertions, deletions, andeven copy number changes [95].Since that landmark study, whole genome re-sequencing has been done in various projects, includingthe 1000 Genomes Project (, which aimed to discover common sequencevariants in healthy human populations, and many cancer studies including those conducted under theauspices of the TCGA ( and ICGC (, 2010) consortia.Applications for whole genome resequencing continue to emerge, and the steady decrease in cost perbase and the increased throughput associated with the latest technology advances has made this modeof data collection as appealing financially as it is scientifically.1.5.3 Exome sequencingA solution to the costs associated with whole genome resequencing emerged in the form of capturetechnologies. These approaches targeted a portion of the genome, thereby reducing the number ofreads required to achieve useful levels of redundancy of sequencing coverage. The resulting reductionin sequencing costs allowed the analysis of larger sample cohorts than with whole genome resequencing.An early example of capturing specific regions for high-throughput resequencing was presented byThomas et al. (2006) [96], in which 5 exons of the EGFR gene were targeted through PCR amplificationof 11 regions of 100bp each, followed by sequence analysis using a 454 sequencing instrument. Byapplying this strategy to 22 lung cancer samples and generating between 8,000 and 12,000 readsfor each sample, the authors were able to correctly detect previously known mutations; additionally,HTS was able to detect mutations in two samples which had been classified as wild-type by Sangersequencing. These mutations were found with low representation (e.g. a deletion in 9% of 4,488 reads)and had appeared in Sanger traces at a level indistinguishable from noise. The depth of sequencingcoverage attained, along with the sensitivity of the 454 sequencing approach, resulted in one of the121. Advances in the study of genetic variation and gene expression in cancerfirst studies to demonstrate the detection of cancer mutations present in only a portion of the cells inheterogeneous tumour tissue.Solution-phase hybridization approaches, such as those pioneered by Gnirke and colleagues [97]are now commonly used for targeted re-sequencing. Foremost among these are exome reagentsthat are commercially available from several vendors. These reagents, such as those made availableby Nimblegen and Agilent, typically contained oligonucleotide probes designed to recover, throughhybridization in solution, 10s of megabases representing exons and other regions of high biologicalinterest. Hybridized fragments could then be recovered from solution and sequenced using nextgeneration sequencing approaches. This approach was used for projects from large consortia suchas TCGA (, and has been used to discover mutations implicated in manydiseases such as renal carcinoma [98], Miller syndrome [99] and Kabuki syndrome [99].As sequencing throughput continued to rise and costs continued to fall, the opportunity formultiplexing (simultaneously sequencing more than one sample in a single reaction) emerged. Oneapproach to multiplexing is to directly pool captured sequences and then sequence them simultaneously.This approach may be of utility in cases where the relationship between a sample and its genotypeis not of interest [100]. A different strategy based on tagging, or “barcoding”, templates beforesequencing permits the identification of the source sample for each template. Several sequencingtechnologies provide a step during library construction where a sample specific barcode or indexsequence is appended to each fragment of the samples to be sequenced. In this way, each library canbe indexed with a specific sequence, and libraries can be pooled and sequenced together in the samerun. During analysis, the barcodes can be used to associate the individual reads back to the originalsample, thereby preserving the relationship between the sequences and the starting material.1.5.4 Analysis strategiesBioinformatic analysis of sequencing data can be divided into several stages. The first step is technology-dependent and deals with processing the data provided by the sequencing instrument. Downstreamanalyses are then constructed according to the type of experiment being done. Resequencing projects,for example, align short reads against a reference sequence from the source organism. These alignmentsare then analyzed to detect events such as nucleotide variants (SNVs and indels), structural variantsand copy number changes.The first step of an analysis workflow starts during sequencing and involves signal analysis131. Advances in the study of genetic variation and gene expression in cancerto transform the sequencing instrument’s fluorescent measurements into a sequence of charactersrepresenting the nucleotide bases. As sequencers image surfaces densely packed with the DNAsequencing templates and sequencing products, image processing techniques are required for detectionof the nascent sequences and conversion of this detected signal into nucleotide bases. Most technologiesassign a base quality to each of the nucleotides, which is usually a value representing the confidenceof the called bases. Although each vendor has methods specific to their technology to evaluate basequality, most provide the user with a Phred [101]-like score value: a quality measurement based on alogarithmic scale encoding the probability of error in the corresponding base call.To achieve contiguous stretches of overlapping sequence (i.e. contigs) in de novo sequencingprojects, software that could detect sequence overlaps among large numbers of relatively shortsequence reads was required. The process of correctly ordering the sequence reads, called assembly, iscomplicated by short read lengths, the presence of sequencing errors, repeat structures that may residewithin the genome, and the sheer volume of data that must be manipulated to detect the sequenceoverlaps. Hybrid methods involving complimentary technologies were initially successful at addressingthese complications. For example, the genomes of several marine organisms were obtained by mixing200 base pair 454 sequence reads with Sanger sequences [102]. A different approach eliminated theneed for Sanger sequencing by mixing two distinct next generation sequencing technologies [103]. Bytaking advantage of 454’s longer reads (250bp) with short Illumina reads (36bp), Reinhardt et al(2009) were able to sequence de novo a 6.5Mb-long bacterial genome. These studies provided practicalexamples of how the strengths of complementary technologies can be used to alleviate their respectiveshortcomings.Homology with previously sequenced organisms can also be used when sequencing new genomes.This strategy was demonstrated during sequencing of the mouse genome [104]. By taking advantage ofthe conserved regions between mouse and human, Gregory et al. (2002) were able to build a physicalmap of mouse clones, establishing a framework for further sequencing. A similar approach can beused to produce better assemblies with next generation sequencing. For example, to sequence thegenome of a fungus, Sordaria macrospora, [105] short reads from 454 and Illumina instruments werefirst assembled using Velvet [106]. The resulting contigs were then compared to draft sequences ofrelated fungi (Neurospora crassa, N. discreta and N. tetrasperma). This process helped produce abetter assembly by reducing the number of contigs from 5,097 to 4,629, while increasing the N50 (thecontig length N for which 50% of the genome is contained in contigs of length N or larger) from 117kb141. Advances in the study of genetic variation and gene expression in cancerto 498kb.New algorithms were later developed which could assemble genomes using only short reads. Mostof these methods were based on de Bruijn graphs. Briefly, the logic involves decomposing short readsinto shorter fragments of length k (k-mers). The graph is then built by creating a node for each k-merand drawing a link, or “edge”, between two nodes when they overlap by k-1 base pairs. These edgesspecify a graph in which overlapping sequences are linked. Sequence features can increase the resultinggraph’s complexity. For example, the graph can contain loops due to high similarity between sequences(e.g. gene family members or repetitive regions), and so-called bubbles can be created around singlebase differences (e.g. SNV). A single base mismatch results in the creation of non-unique edges in thegraph, which yields two possible paths around the site of the SNV. Graph complexity and size increasefor large genomes and, given that the graph needs to be available in memory for efficient analysis, notall implementations can handle human size genomes. Some publicly available implementations, such asVelvet [106] and Euler-SR [107], have been successfully used to assemble bacterial genomes. Anotherimplementation, ABySS [108], makes use of parallel computing through the Message Passing Interface(MPI) to distribute the graph between many nodes in a computing cluster. In this way, ABySS canefficiently scale up for the assembly of human size genomes using a collection of inexpensive computers.Other assemblers like SOAP-denovo [109] and ALLPATHS-LG [110] are able to assemble human sizedgenomes using large memory multi-cpu servers, requiring 150Gb and 512Gb RAM respectively.For re-sequencing experiments, high-throughput aligners are required to map reads to a referencegenome sequence. Many applications have long been available for sequence alignments; however, theamount and size of the short reads created by next generation sequencing technologies required thedevelopment of more efficient algorithms. Some methods use hashing approaches (e.g. Maq [111]) inwhich the reads are reduced in complexity to unique identifiers (i.e. hashed keys). The hashed readscan then be used to scan a table made from a similarly hashed representation of the reference genometo identify putative read alignments to the reference. Other methods based on Burrows-Wheelertransform became popular for read alignment. These included BWA [112], Bowtie [113] and Soap [114].Although these algorithms were relatively fast compared to Maq, they were somewhat limited whensplitting reads to achieve gapped alignments, which can occasionally be required due to indels. TheMosaik aligner [115] attempted to approach this by using a Smith-Waterman algorithm [116] to alignthe short reads. Aligners such as GSNAP [117] included the ability to consider known SNPs in thealignments, making mismatches at known SNP sites count as normal mappings.151. Advances in the study of genetic variation and gene expression in cancer1.5.5 Variant discoveryIdentification of SNVs and indels are central to the study of DNA sequence variation. Such sequencevariants are heavily studied as some have been linked to specific diseases [63, 99, 118]. Before HTStechnologies were available, the detection of mutations in disease states did not typically involvesequence-intensive analysis on a genome-wide scale. Instead, positional cloning approaches andgenome wide association studies (GWAS) were frequently used. In contrast, HTS provides the meansfor mutation discovery at a single-base resolution over entire exomes, transcriptomes or genomes.An additional advantage of HTS is its enhanced sensitivity compared to typical Sanger approaches.Using Illumina sequencers, it was seen that HTS could be used to detect mutations that traditionalSanger sequencing could not detect due to low representation of the mutated allele [96]. In a differentstudy the high resolution of HTS was used to simultaneously detect SNPs and estimate the minorallele frequency (MAF) [100]. By sequencing three pooled samples of enzyme-digested DNA from66 individuals representing three cattle populations (Holstein lineage, Angus bulls and a group ofmixed beef breeds), Van Tassell et al (2008) [100] were able to identify 60,042 putative SNPs andby analyzing the ratio of non reference to reference reads across all 66 cattle they obtained MAFestimates. Whole genome resequencing can also be used for SNP discovery, as is illustrated by projectslike the 1000 Genomes Project ( which takes a HTS approach todetection of variants in human populations, instead of the SNP microarrays previously used in thelatter phases of the HapMap Project [119].1.5.6 Variant discovery in cancerAs cancer is a genetic disease [4] and mutations are known to drive cancers [120], HTS approachestowards mutation discovery in human cancers became popular. Initially, due to limitations insequencing technology, studies aimed at identifying somatic mutations in cancer were often composedof two phases: one where a small set of samples was comprehensively sequenced to detect candidatemutations and a second phase where mutated genes were sequenced in a larger “extension” cohortto determine their frequency in a larger population. Prior to the development of HTS machines,heroic efforts were often required to sample even a fraction of human genomes in the search for cancermutations. In one project, Sanger sequencing was used to sequence more than 3 million PCR products,which included 13,023 genes from 11 breast and 11 colorectal cancer samples [121]. The massiveamount of work represented in this study for generation of the PCR products is now eliminated161. Advances in the study of genetic variation and gene expression in cancerthrough the library construction procedures used in HTS.The large number of loci in which candidate mutations can occur makes high–throughput sequencingan ideal platform for relatively unbiased mutation discovery. There is an increasing number of examplesin which HTS approaches have been used to characterize cancers, and many more are expected overthe near to medium time frame. For example, sequencing bone marrow and skin samples from an acutemyeloid leukemia (AML) patient permitted the identification of somatic mutations, which in turn ledto the identification of genes that were recurrently mutated in other patients [86, 122]. In anotherstudy, sequencing a primary and a metastatic breast tumour from the same patient, sampled nine yearsapart, showed that mutations that are dominant at a later stage may be present at low representation,if at all, earlier during tumour progression [87]. This indicated that through mechanisms such asgenetic drift, or selective pressure imposed by treatment, sub-populations of cells may become moreprevalent over time. These results and others like them demonstrated the utility and impact thatHTS can have when analyzing the mutational landscape of cancer patients.A bioinformatic analysis for mutation detection using HTS re-sequencing starts with the alignmentof reads to reference genomes. Nucleotide mismatches in the alignments are expected to occurdue to sequencing errors, read mapping errors, and to the existence of variants. It has been anongoing challenge to distinguish among these alternative possibilities. One conceptual approach todistinguishing technical errors from bona fide variants takes advantage of the considerable redundancyof sequence coverage that HTS produces. For example, when multiple reads cover the same position inthe reference assembly, and these consistently indicate the presence of a sequence variant, confidence inthe robustness of the variant call is increased. To infer whether a mutation is present, the ratio betweenthe sequence coverage of the reference allele and the non reference allele needs to be evaluated. Thiscan be done with hard thresholds on the ratio of supporting reads or with statistical methods that allowfor some flexibility by assigning a score or probability of the mismatch being a mutation [111, 123, 124].Indel detection is also challenging, especially in cases where the pairwise alignment, meaning the baseby base matching between the query and the target, is ambiguous over the same mapped location.Methods for calling indels often include local realignment of the reads at the site of a candidatemutation [125]. Mutations can also be called using the output from assemblies. During creation of anassembly graph, SNVs cause so called “bubbles”, in which paths diverge and converge around theSNV. Support for the SNV can then be evaluated by analyzing the read coverage supporting eachpath in the bubble [108]. Whether through alignments or assemblies, by comparing next generation171. Advances in the study of genetic variation and gene expression in cancersequencing reads to a reference genome sequence, we can obtain a list of sites with putative mutations.Once discovered, sequence variants can be evaluated for novelty and for a predicted effect on thegene product. This can be readily achieved through comparison of variants to databases containingpreviously observed variation, such as dbSNP [126]. Similarly, there are databases that containdisease-related genes and mutations. Examples include the Online Mendelian Inheritance in Man(OMIM; database, and the Catalogue Of Somatic MutationsIn Cancer (COSMIC; and Cancer Census ( databases. If discovery of somatic mutations is thestudy aim, then a comparison of sequence data from matched tumour and normal samples from thesame individuals can be used to distinguish somatic changes from those resident in the germline [86,87, 118, 122].Additional bioinformatic approaches can be employed to assess the possible effect of the sequencevariant on the gene product [127]. For example, in the case of protein coding genes, variants andmutations may affect a codon such that a different amino acid is encoded (missense mutation) or anearly stop codon is created (nonsense mutation). Both types of variants can change the structure orthe function of the resulting protein, which may in turn be a driving factor in disease. Of similarinterest is the distinction of driver mutations (those that functionally promote disease progression)from passenger mutations (those that get propagated by association with driver mutations, but donot have a functional role of their own). Differentiating between these types of mutations has beenused as a way of identifying tumour-related genes. One method involves the comparison betweenfrequencies of synonymous and non-synonymous mutations in genes [128]. Briefly, assuming thatsilent mutations do not confer any biological advantage to a tumour, these can be used to model amutational profile under the hypothesis of neutral selection. Genes that harbour non-synonymousmutations at frequencies deviating from the neutral selection profile can then be inferred to be underselective pressure and thus may be directly involved in the disease [128].Although HTS is useful for the discovery of candidate variants including mutations, care must betaken when analyzing the data. An overly sensitive mutation caller may generate a large amount offalse positive mutations which, given the large number of bases covered, may be impossible to validateusing alternative methods (e.g. Sanger sequencing). On the opposite side of the scale, an overlyspecific algorithm may miss many important variants. It is important to try to find a balance betweenthe false positives (calling a mutation where there is none) and the false negatives (failing to call a181. Advances in the study of genetic variation and gene expression in cancerreal mutation). Different methods can be employed, such as requiring specific types of supportingevidence (e.g. minimum number of reads, coverage on both strands, SNV not near an indel) or varyingparameters in order to maximize concordance to known databases (e.g. dbSNP) or to approximate atransition/transversion ratio estimated for the organism in question.1.5.7 Genomic rearrangementsThe availability of paired-end sequencing on HTS platforms has improved the detection of genomerearrangements to single base resolution [129]. During paired-end sequencing library construction,sheared DNA fragments of a desired size are selected and prepared for sequencing. Paired-endsequencing involves sequencing both ends of the DNA fragments in the sequencing library. Theresulting reads are thus matched in pairs, and are expected to align to the reference genome a certaindistance apart, following the fragment size distribution specified during library construction, and in acertain orientation with respect to each other. The paired end read alignments can be assessed forinconsistencies in the orientation or distance between both ends. If these are detected, the presence ofa misalignment or a genome rearrangement is inferred. For example, if the insert size (the number ofnucleotides to be sequenced, placed between sequencing adapters) is 200 base pairs, but the pairedreads mapped to the reference genome are 50,000 base pairs apart, one can infer that approximately50,000 base pairs have been deleted in the source genome relative to the reference genome. Usinga similar logic, if reads map to different chromosomes, a translocation may have occurred. Giventhat the library fragments are sequenced to yield tail to tail read orientations (in the case of Illuminapaired-end sequences), one expects the reads to align to opposite strands in the reference. If the readorientation is not preserved in the alignments, an inversion may be inferred.There are several examples of the use of paired-end sequences to detect genome rearrangements.For example, Korbel et al. (2007) [130] used paired-end sequences to observe that structural variantsin the human genome were more widespread than initially thought. Campbell et al. (2008) [131]used paired-end sequences to identify somatic rearrangements in cancer, and subsequently comparedthe pattern of rearrangements in primary and metastatic tumours to infer the clonal evolution ofcancer [132]. Ding et al. (2010) [133] analyzed related samples from a primary tumour, brain metastasisand xenograft, and compared their mutational profiles. In both the metastasis and xenograft tumourthe mutational representation was found to be a subset of the primary tumour, indicating that cellsfrom different tumour sub-populations gave rise to the metastasis and the xenograft. The discovery191. Advances in the study of genetic variation and gene expression in cancerpotential provided by HTS was further exemplified when, while scanning 10 chronic lymphocyticleukemia patients for rearrangements, researchers identified a patient which presented 42 somaticrearrangements involving a single arm of chromosome 4 [134]. This massive chromosomal remodelling,involving 10s to 100s of rearrangements, was found in additional patients and estimated to happen in2% to 3% of all cancers. This phenomenon was named chromothripsis, denoting a process in which asingle catastrophic event shatters one or more chromosomes [134].Another approach to detect genome rearrangements involves the use of de novo assembly methodssuch as ABySS [108]. Here, the approach involves assembling contigs, and then aligning the contigs(as opposed to individual reads) back to the reference genome. The advantage to this approach over aread-based alignment approach is that by analyzing the reads in a de novo fashion, the assemblermay be able to create contigs that represent the exact breakpoint of rearrangement events. For largegenomes with many rearrangements (e.g. a human tumour sample), de novo assemblies may notbe able to entirely reconstruct the original genome. However, de novo assembly may still producesufficiently large contigs that can be compared to the reference genome to determine putative structuraldifferences.1.6 Overview of RNA-Seq for transcriptome analysisHTS can be applied to characterize various types of RNA transcripts, including mRNAs, smallRNAs (including micro-RNAs), non-coding RNAs and antisense RNAs, collectively known as thetranscriptome. Many advantages of HTS in genome sequencing also apply in transcriptome sequencing:base pair resolution enables mutation discovery, paired-end reads enable detection of fusion genes andcoverage is proportional to concentration in the source material allowing for quantitative analysis.Transcriptome sequencing allows researchers to look at expression profiles with unprecedented detail.1.6.1 RNA-SeqEarly efforts to explore transcriptomes used the expressed sequence tag (EST) sequencing method.This approach helped catalyze gene discovery in many species, including human and mouse [135–137].Sequencing of full-length cDNA clones eventually became feasible, fuelled by the clear advantages ofunderstanding transcript isoform structure at the sequence level [138–140]. Although ESTs providedrapid access to expressed genes, they were not optimal for gene expression profiling, largely due to201. Advances in the study of genetic variation and gene expression in cancertheir low throughput. Serial analysis of gene expression (SAGE) [43] addressed throughput issues bymaking it possible to detect expression of 30 or more transcripts in a single pass sequencing read. Theability to detect an increased number of transcripts and the use of tag counts to estimate expressionlevels made SAGE useful for gene expression profiling [141, 142].Shortly after HTS approaches became available, transcript analysis technologies were adapted foruse on ultra high throughput sequencers. For example, a version of SAGE called “DeepSAGE” wasdeveloped [143] on the 454 instrument, allowing a ≈6-fold enhancement in tag counts (from 50,000 to300,000). An important advantage of certain tag sequencing approaches is that they allow one todetermine whether the transcript originated from the forward or the reverse strand in the genome.Tag-Seq [144], derived from SAGE, can be used to measure expression values of genes with strandspecificity. Further developments enabled the construction of strand-specific transcriptome libraries(Levin et al. 2010) for sampling entire cDNAs. These and other strand-specific approaches enabledstudies of the relationship between sense and antisense gene expression [145].Expressed sequences, longer than short tags, have also been analyzed using next generationsequencing. By capturing poly(A)+ mRNA molecules and using a shotgun style approach akin tothat previously defined for the genome, the entire mRNA content of a sample can be sequenced. Thisapproach is known as whole transcriptome shotgun sequencing (WTSS) or RNA-Seq. In one studythe transcriptome of a human prostate cancer cell line was explored using pyrosequencing, where theanalysis of 181,279 reads of 102bp average length revealed the expression of 10,117 genes [146]. Thereproducibility of RNA-Seq experiments using Illumina sequencing was shown with the sequencingof mouse transcriptomes [49]. In this study, Mortazavi et al (2008) introduced the now commonlyused gene expression metric RPKM (reads per kilobase of gene model per million mapped reads) asa way of normalizing read counts into expression levels. RPKM values were shown to be consistentbetween technical replicates and were proposed as a way to compare expression levels of genes amongmultiple libraries and within a library. The study also showcased the capabilities of RNA-Seq inmeasuring alternative splicing and finding novel genes. Measuring alternative splicing with RNA-Seqwas further explored by sequencing ten different human tissues and five cell-lines, identifying ≈22,000tissue-specific alternative splicing events [147]. These initial studies highlighted RNA-Seq’s ability toquery the transcriptome de novo. Unlike gene, exon or junction expression microarrays, which arelimited to the measurement of pre-defined features, the discovery capabilities of RNA-seq made it anideal tool to study the trancsriptome [79].211. Advances in the study of genetic variation and gene expression in cancerFigure 1.2: Effect of introns in short read alignment. Introns are removed from transcribed pre-mRNAby the spliceosome to give rise to mature mRNA molecules before they are polyadenylated. Construction ofpoly(A)+ selected mRNA libraries will result in short reads obtained from exon-exon junction sequences. Thesereads, also known as split reads (coloured in red), represent sequences not existing in the reference genome.The diagram shows the manner in which these reads have to be aligned back to the reference genome. Thisfigure is my own work, which I have made publicly available through the RNA-Seq Wikipedia entry, writteninitially by me in collaboration with Vincent Montoya.1.6.2 Analysis strategiesData processing for mRNA sequencing resembles that employed for genome sequencing, in thatthere are alignment-based approaches and de novo assembly-based approaches. A confounding factorwhen working with alignment-based methods is the existence of RNA splicing, which removes intronsfrom transcripts during mRNA maturation. Splicing is a common feature of eukaryotic transcriptprocessing; it has been estimated that 95% of multi-exon genes are alternatively spliced [148] in atissue and developmental stage specific manner [149]. Thus, a significant proportion of sequence readsgenerated from mature mRNA molecules will represent “junction sequences”, which span exon-exonjunctions created during the splicing process (Figure 1.2). These junction sequences are not encoded aslinear strings in the genome, and therefore reads originating from these sequences will not be properlyaligned by genome aligners. Given the large quantity of short HTS reads and the relatively largesize of the introns, a gapped alignment strategy to detect junction read sequences is computationallyexpensive. New approaches were thus developed that address this issue.One approach to address complications arising from inefficiencies of read mapping due to mRNAsplicing is to construct a database containing all the sequences formed by the possible combinations221. Advances in the study of genetic variation and gene expression in cancerof exons. This database can then be used in conjunction with the reference genome, and aligners suchas Maq [111], BWA [112], Bowtie [113], Soap [114] or others, can be used. This approach is efficient,but is constrained to the knowledge of the existing transcripts, therefore it has usually been usedwhen the focus of the experiment is to detect variants [150]. Annotation databases that may serveas the sources for exons include Ensembl, UCSC Genes, RefSeq, CCDS, Vega, Havana, Encode, andAceView. As each of these source databases has individual pipelines and quality metrics, any oneof them may individually include or exclude some isoforms found in the other databases. Anotherappealing set of tools provides a means for discovery of novel exon junctions. These include toolssuch as TopHat [151], HMMSplicer [152], SpliceMap [153], MapSplice [154]. These tools attempt todiscover junction sequences using modified versions of aligners. For example, the Bowtie aligner isused at an initial stage to align the reads onto the reference genome. During this initial step, readsoriginating entirely from one exon (i.e. non-split reads) are thus aligned, and any reads unalignedafter this stage are used for splice junction discovery.Another challenging problem related to splicing is the determination of the sequences of completetranscripts. Two approaches based on alignments were initially proposed: Cufflinks [155] andScripture [156]. Both rely on splice site predictions and alignments done by TopHat and subsequentlyapply their own statistical and probabilistic methods to determine the combination of exons thatmost likely explain the reads observed in the sequencing data. Other methods like MISO [157] focuson singular splicing events, comparing the numbers of reads supporting and negating a splicing event,and using generative models to determine the ratio of transcripts most likely to result in such counts.Yet another type of analysis is represented by DEXSeq [158] in which expression levels at the gene andexon levels are compared, providing insights into differential exon usage between groups of samples.Assembly methods can also be used to attempt to reconstruct transcripts. ABySS [159], Trans-ABySS [160], Trinity [161] and Oases [162] applied de Bruijn graphs to assemble transcriptome readsde novo, thereby allowing transcript isoform reconstruction and also allowing the detection of genefusions and other novel transcript structures. Unlike splicing, the sequences that form gene fusionsmay not be adjacent in the reference genome. For example, when chromosomal rearrangements occur,such as translocations, two genes may be placed adjacent to each other, and a hybrid transcript,containing sequences from both genes, may be expressed from that locus [163–165].231. Advances in the study of genetic variation and gene expression in cancer1.6.3 Expression analysisOnce reads have been assigned to transcripts and to genes, the number of reads mapping to a genecan be used to approximate transcript abundance (i.e. gene expression). One of the first and mostpopular methods for transforming read counts to gene expression measurements is known as readsper kilobase of gene model per million mapped reads, or RPKM [49]. The idea behind RPKM is tonormalize the number of reads against two factors: 1) the size of the gene, to avoid biases resultingfrom the increased number of reads that map to larger genes, and 2) the total number of reads inthe library, so that measurements from libraries with deeper coverage do not get artificially inflatedduring interlibrary comparisons. Normalization methods originating in microarray analysis, suchas quantile normalization, have also been proposed [166]. Quantile normalization can be especiallyuseful in the presence of over-expressed genes that can cause the library size to be inflated and causebiases in values like RPKM. Other methods, such as the one implemented in Cufflinks [155], resort toprobabilistic models where each read is assigned to the isoform most likely to have spawned it. Oncegene expression values are obtained, comparisons of mRNA abundance between samples can be made.RNA-Seq data can be used to measure RNA abundance at the level of the entire gene, butalso at the more granular level of individual exons. In this way, it is possible to search for bothgenes and exons that are enriched or depleted in sample set comparisons. An open source platformcalled Alexa-Seq [167] automates the analysis of RNA-Seq libraries for alternative expression analysis.Although it relies on known annotations to speed up the analysis, it is able to detect some novel eventssuch as exon skipping, retained introns, alternative 5’ and 3’ splice sites, alternative polyadenylationsites as well as alternative transcription start sites.Discovery of SNVs using transcriptome sequence data is limited to the genes being transcribed;however, even with this limitation, mutation discovery in RNA-Seq has proven useful especially whenDNA sequencing data is not available. As described previously, this has been successfully done inthe analysis of HeLa S3 cell transcriptomes and in the identification of a recurrently mutated codonin EZH2 in DLBCL tumours of germinal-centre origin [118, 168], as well as in ovarian cancer inwhich a recurrent mutation in the FOXL2 gene specific to the granulosa-cell tumour subtype wasdiscovered [63]. Although mutation discovery approaches using transcriptome data are similar to thoseapplied in genome data, care should be taken as changes in the ratio of reference versus non-referencereads, unlike in the genomic case, is not directly dependent on the number of copies of the chromosomethat exist in the genome, but on the expression of the gene itself. Deviations from a 50/50 ratio could241. Advances in the study of genetic variation and gene expression in cancerbe caused by different levels of expression of each allele. Calling SNVs in RNA-Seq has a higherprobability of false negatives with respect to DNA mutations. This is mainly caused by genes notbeing transcribed and thus not represented in the sample, or genes presenting allele-specific expression,in which the mutated allele is silenced and expression is limited to the reference allele, causing DNAmutations to not be detectable.The ability to detect gene expression at single base resolution offers an opportunity to measureexpression from each allele individually. For example, human genomes are composed of two copiesof the genome inherited from parents, and these copies can differ substantially from each other intheir nucleotide sequence. Using RNA-Seq data, it is possible to determine which of the two alleles isexpressed more abundantly at all the expressed loci where sequence differences between the parentalalleles exist. Such measurements can be extended to disease states, where a somatic mutation witha potential adverse effect on a gene’s product is not expressed, making the mutation effectivelytranscriptionally silent. In cases when both alleles are being transcribed, allele-specific expression canbe measured by comparing the presence of each allele in the expressed mRNA [169]. Care should betaken when analyzing this type of data, as mapping can be biased towards the reference allele [170].Additional DNA genotyping may be needed to determine the true genotype at a specific locus. Havingmatching information at the genome level may also help detect RNA editing events, in which cellularmechanisms may directly alter the sequence of a transcript. In a study by Shah et al (2009) [87] theADAR enzyme, responsible for A-to-G RNA editing, was observed to be highly expressed in a breastcancer sample, which was reflected in a high frequency of RNA editing in the transcriptome.1.6.4 Variant discovery in RNA-SeqUsing RNA-Seq, SNVs and short indels can be discovered in expressed transcripts by analyzingrepeated coverage of non-reference alleles. This was shown with the sequencing of the HeLa S3 cellline transcriptome [168], where information on SNVs was obtained alongside gene and exon expressionvalues. Applications of RNA-Seq in the study of cancer mutations quickly followed. For example,through the analysis of 15 ovarian cancer transcriptome libraries, Shah et al (2009) [63] were able todetect a mutation in a transcription factor gene (FOXL2 ) present only in a specific ovarian cancersubtype known as granulosa-cell tumors (GCT). This mutation was then validated in tumour DNAand was later found to be present in a larger cohort of GCT samples (86 out of 89) and not present in149 epithelial ovarian tumours. Similarly, by sequencing the transcriptome of 31 diffuse large B-cell251. Advances in the study of genetic variation and gene expression in cancerlymphomas (DLBCL), Morin et al (2010) [118] detected a recurrently mutated codon in the EZH2gene, encoding a histone methyltransferase. Through DNA sequencing of the locus containing themutation in 251 follicular lymphoma samples (FL) and 320 DLBCL samples, it was determined thatthe codon was recurrently mutated mainly in DLBCL samples specific to the germinal-centre originsubtype, and to FL samples. In both these studies, mutations detected through RNA-Seq were furthervalidated using DNA sequencing data and explored in larger cohorts using targeted approaches. Thesetwo studies exemplify the utility of transcriptome sequencing in the study of nucleotide-level mutationsin cancer.Alternative splicing in protein coding genes allows each gene to produce different protein productsdepending on which exons are joined in a mature mRNA [69, 171, 172]. Nucleotide mutations havebeen seen to have an effect on splicing [173] and alterations in gene splicing can cause disease [73, 74].For example, the importance of maintaining splicing signals in DNA can be seen in spinal muscularatrophy. This disease is caused by the loss of a functional SMN1 protein. Although an almost identicalcopy of the gene exists (SMN2 ), a translationally silent nucleotide difference between SMN1 andSMN2 causes skipping of an exon in SMN2 that makes it unable to compensate for SMN1 loss [174].Another mechanism of altering splicing profiles is the dysregulation of splicing factors, which can havean effect on the correct splicing of the splicing factor’s target genes [175]. Upregulation of the splicingfactor SF2/ASF in tumours, for example, was seen to silence a tumour suppressor by modifying itssplicing pattern, making SF2/ASF a potential proto-oncogene [176]. Given the mutational load andgene dysregulation often found in cancer samples, dysregulation of splicing can also be expected [75].As such, pre-HTS technologies were successfully used to find several genes with cancer specific splicingforms (e.g. EGFR [177], CD44 [178], NER [179], and CDH11 [180]). As with other HTS applications,RNA-Seq’s ability to explore cancer-specific splicing alterations in a high-throughput mode has becomea useful tool [181–184].Genome rearrangements can also be detected using transcriptome sequencing. One of the effectsseen as a result of genome translocations is the joining of two genes into a fusion gene, expressed atthe transcriptome level as a “chimeric transcript” [131, 163]. Being transcribed, these fusion eventscan be detected using RNA-Seq. One strategy included the use of long reads (200-500bp) from a454 instrument to detect possible chimeric transcripts and short reads from an Illumina sequencerto provide coverage across the putative breakpoint [164]. Maher et al. (2009) [164] sequenced thetranscriptomes of tumour samples with known fusion genes and were able to successfully detect them261. Advances in the study of genetic variation and gene expression in cancerusing this approach. In another study, Steidl et al. (2011) [165] sequenced two lymphoma cell linesand used novel fusion discovery tools [185] to identify gene fusion events. Follow up studies werefocused on a specific fusion candidate containing the gene CIITA, which was further found to befrequently fused to a variety of genes. This gene was found to be rearranged in primary HodgkinLymphomas (8 out of 55 cases) and primary mediastinal B cell lymphoma (29 out of 77 cases) butwas rare in other lymphomas like DLBCL (4 out of 131 cases). Additionally, the presence of CIITAfusions was significantly correlated with shorter survival times. This study clearly illustrates howstate of the art technology and bioinformatic analysis can be effectively used to further understandingof cancer.1.7 Chapter summariesTechnological advances have resulted in an unprecedented ability to study the DNA and RNA ofcancers and other biological samples. These advances have simultaneously created a data deluge andnovel methods of analysis are required to efficiently process these data. With large quantities of datacome increasing sources of noise, which can hide the biological signal of interest. I developed methodsfor the detection of SNVs in cancer, applicable to genomes and transcriptomes. Additionally, I usedbioinformatic approaches to study the transcriptome of triple negative breast cancer subgroups, witha focus on alternative expression.Chapter 2 presents my work on the development of SNVMix, a probabilistic variant caller withdirect applications to cancer genome and transcriptome analyses. Ability to detect SNVs in cancergenome and transcriptome data is impacted by different factors. Tumour heterogeneity and alteredploidy can modify the allele frequencies of a sample. At the transcriptome level, the allelic contentcan be further modified by variance in gene expression levels. Both of these changes can alter thenumber of read counts representing the reference and alternative nucleotides, making detection of atrue variant a challenging problem. Additionally, HTS data is inherently noisy and the large numberof reads generated, along with systematic sources of error, can artificially increase the number of readssupporting a putative SNV. At the time of SNVMix’s development, existing methods for SNV callingoffered hard thresholds in read and mapping quality settings, as well as a binomial model in thedistribution of counts. Using machine learning techniques, SNVMix presented the first probabilisticmodel that could adapt to different allele representations found in a genome or transcriptome library.271. Advances in the study of genetic variation and gene expression in cancerSNVMix was proven to out-perform another SNV caller prevalent at the time. The work presentedin this chapter has been published [124], and at the time of this writing had been cited 183 timesaccording to Google Scholar.Chapter 3 presents further developments on increasing the performance of SNV calling usingtranscriptome data. I present BWA-R, a modified version of the BWA [112] aligner that can workdirectly with a database of exon-exon junction sequences for efficient alignment of RNA-Seq data. Idemonstrate that by adapting the read-pairing functionality in BWA-R to better utilize the exon-exon junction sequences, I increased the quality of the read mappings and decreased the numberof mismatches inserted by bad alignments. I also present an extension to the SNVMix programwhich extracts additional information from reads supporting each SNV being called. Additionally,I show how using these features to filter resulting SNV calls, both manually or through the use ofmachine learning classifiers, can be used to lower the number of false predictions. The work presentedin this chapter was used in three publications on which I am a co-author [150, 186, 187] and twoothers [188, 189].Chapter 4 presents an analysis of the transcriptome and alternative expression profiles in triplenegative breast cancer (TNBC). Part of this work was included in a publication regarding the clonalevolution of triple negative breast cancer [186], where it became clear that there were marked differencesbetween distinct subgroups of this type of breast cancer. In that publication TNBC samples weredivided into two groups using gene expression levels and their mutational landscapes were compared.During the development of that project we detected differences in the preferences in exon usagebetween the groups. In this chapter I extend the analysis of TNBC transcriptomes to determinewhether there are significant differences in the exon usage or alternative splicing profiles of the TNBCsubgroups. Additionally, given that alternative splicing can have tissue-specific profiles, I includedthe analysis of two normal breast cell types to explore similarities between cell types and TNBCsubgroups. I describe genes which appear to share alternative expression profiles between TNBCsubgroups and breast cancer cell types, and genes that show differences in extracellular domains thatcould provide targets for diagnostic or treatment of TNBC. I conclude the chapter by presenting anoverview of some events of interest which emanate from the bioinformatic analysis.Finally, Chapter 5 presents an overview of the conclusions and a description of future directionsemanating from the work presented in this dissertation.28Chapter 2SNVMix: a mixture model frameworkfor predicting single nucleotidevariants from next generationsequencing of tumours2.1 Introduction2.1.1 Single nucleotide variants in cancerCancers are driven by mutations. In particular, single nucleotide variants (SNVs), present as eithergermline or somatic point mutations, are drivers of tumourigenesis and cellular proliferation in manytypes of human cancer. An SNV is a specific base position in the genome that exhibits a nucleotidechange in at least one allele. A heterozygous mutation contains an alteration of one allele and ahomozygous mutation exhibits the change in both alleles. SNVs that induce a non-synonymousamino acid change in the sequence of a protein are of great interest due to their possible effect on theprotein’s functions. Millions of such mutations have been identified in human cancers in the past 25years [2, 120]. Significant early examples include germline SNVs detected in BRCA1/2 [190, 191] andsomatic alterations in TP53 [192], CTNNB1 [193], PTEN [194], KRAS [195, 196] and EGFR [197].Analysis of recurrently altered genes has contributed to the understanding of cancers, and has led tothe development of novel diagnostics and therapeutics [198]. However, these mutations were identifiedusing targeted studies on single genes, and did not represent the full genomic landscape of mutations.Sampling tumours for mutations in unbiased, genome-wide screens has helped reveal novel SNVs.Stratton et al (2009) [120] projected that complete sequencing of cancer genomes would contributeseveral orders of magnitude more mutations than were then represented in mutation databases. Efforts292. SNVMix: predicting SNVs from next generation sequencing of tumoursto sequence exons of more than 20,000 genes using Sanger sequencing indeed increased the mutationalspectrum and contributed novel insights into breast, colorectal, pancreatic and brain cancers [199–201].In the following years, a new wave of technology development provided a technical leap forwardin the field of sequencing. Next generation sequencing (NGS) technology emerged as a bona fide,high throughput and low cost sequencing solution, enabling the full and rapid interrogation of thegenomes and transcriptomes of individual tumours at nucleotide resolution (see Chapter 1). Thus,the availability of NGS offered an unprecedented opportunity for SNV discovery in cancer.At the time my work [124] was conceived, NGS had been used to sequence several cancer cell linesand tumour genomes and transcriptomes. Ley et al (2010) [122] had just sequenced the genome of acutemyeloid leukemia using NGS and reported eight novel somatic mutations. In addition, they reported14 novel germline non-synonymous SNVs. Shah et al (2009) [63] sequenced the transcriptomes ofovarian cancer using RNA-seq [49, 50] and discovered a novel pathognomonic mutation in the FOXL2gene (previously not implicated in cancer) in granulosa cell tumours of the ovary. In addition, Morinet al (2008) [168] reported the presence of SNVs in human HeLa cell transcriptomes using RNA-seq.While these early studies emerged as proof of principle that novel SNVs could indeed be discoveredusing NGS, the availability of computational methods for their discovery was under-representedin the bioinformatics literature. The first phase of large scale data generation endeavours such asthe ICGC [2] and TCGA ( were under way, using NGS to sequencethousands of tumours to fully catalogue mutations in multiple types of cancer. It was therefore criticalto have robust and accurate analytical tools designed for detection of SNVs from NGS data. In thischapter, I describe a statistical model I created for this purpose, and demonstrate how its novelfeatures outperformed methods then considered to be state-of-the-art.2.1.2 NGS data pre-processing for SNV detectionThe data produced by NGS consist of millions of short sequence reads ranging in length fromapproximately 36 to 400 nucleotides (although this limit steadily increases with ongoing technologydevelopment). Here the focus is explicitly on the problem of inferring SNVs once these reads havebeen aligned to the genome. Numerous methods have been developed for short read alignmentincluding Maq [111], Bowtie [113], Eland (Illumina’s proprietary aligner), SHRiMP [202], BWA [112],SOAP [114], Novoalign (Novocraft, 2010, and GSNAP [117]. Incontrast, there was a relative paucity of methods for inferring SNVs as a post-alignment inference302. SNVMix: predicting SNVs from next generation sequencing of tumoursaattcaggaccca-----------------------------aattcaggacccacacga------------------------aattcaggacccacacgacgggaagacaa--------------attcaggacaaacacgaagggaagacaagttcatgtacttt----caggacccacacgacgggtagacaagttcatgtacttt--------acccacacgacgggtagacaagttcatgtacttt--------acccacacgacgggtagacaagttcatgtacttt----------------gacgggaagacaagttcatgtacttt---------------------------------atgtacttt344455557761766677566636666665555666666666aattcaggaccaacacgacgggaagacaagttcatgtacttt000000000016000000100030000000000000000000Aligned readsReference seqAllelic counts abbb abFigure 2.1: Schematic diagram of input data to SNVMix. I show how allelic counts (bottom) arederived from aligned reads (top). The reference sequence is shown bolded in blue. The arrows at the topindicate positions containing a homozygous SNV (bb) and a heterozygous SNV (ab). The non-reference basesin the alignments are shown in red.step. There are two main ways of preprocessing aligned data for input to SNV detection algorithms.The first method is shown in Figure 2.1, describing an example of aligned data where two SNVs areidentified, where it is assumed that the reads are correctly aligned and the nucleotide base calls arecorrect. The reads are positioned according to their alignment in the genome, where each base of analigned read can either support there reference nucleotide or support an alternative (non-reference)nucleotide. The first step involves transforming the reads into allelic counts. At each position i inthe reference sequence, we can count the number of reads ai with a base that matches the referencegenome and the number of reads bi that do not match the reference genome. In case of rare ’third’alleles, these reads are assumed to be errors1. The total number of reads overlapping each position(called the depth) is given by Ni = ai + bi. In this context, given {ai, bi, Ni} (i.e. reference read counts,non-reference read counts and depth) for all i ∈ {1, 2, . . . , T} where T is the total number of positions inthe genome, the task is to infer which positions exhibit an SNV.The second method (Figure 2.2) relaxes the assumption that the base calls and the alignments arecorrect, and instead considers two types of uncertainty related to determining ai and Ni, namely theuncertainty encoded in the base call qij ∈ [0, 1] that represents the probability that the stated base is1Although mutations with three alleles would be biologically possible through subclones or gene duplication events,they were out of the scope of my work.312. SNVMix: predicting SNVs from next generation sequencing of tumoursFigure 2.2: Visual representation of mapping and base qualities as input for SNVMix2. Thisfigure attempts to visually convey what information the model can obtain from the data. Each row representsa read aligned to a reference. Black and gray backgrounds indicate reads with high and low mapping qualitiesrespectively. Red nucleotides represent matches to the reference and green ones mismatches, with the brightnessadjusted proportional to the base call quality. Good quality mismatches can be seen in green and are moretrustworthy, while low quality mismatches are lost in the background.correct for read j ∈ (1, . . . , Ni) at position i, and rij ∈ [0, 1] representing the probability that read j alignsto its stated position in the genome. Note that although mapping quality is derived in part from basequalities, considering these qualities as independent allows encoding the fact that base qualities areposition specific, while mapping qualities are constant for all bases in the read. The input for thismethod can be visualized as shown in Figure 2.2: high mapping quality is shown as dark backgroundand high base quality as bright foreground, high contrast positions indicate positions where the dataare more trustworthy. I show in Section 2.2.4 how to explicitly model these uncertainties to performsoft probabilistic weighting of the data rather than thresholding the uncertainties to deterministicallycalculate the allelic counts. The following section describes how various authors have approached thisproblem given {ai, bi, Ni} and optionally, {qi1:Ni , ri1:Ni}.2.1.3 Related workA simple way to detect SNV locations would be to compute the fraction fi =aiNi, and then to call SNVsat locations where fi is below some threshold. In the example in Figure 2.1, applying a threshold of 16would successfully discard all columns (including the two columns which have singleton non-referencereads, which may be due to base-calling errors), except the two containing the SNVs. A critical flawwith this approach is that it ignores the confidence in the estimate of fi. Intuitively, we can trust theestimate more at locations with greater depth (larger Ni). This idea had been applied by Morin et al(2008) [168], wherein read depth thresholds of Ni ≥ 6 and bi ≥ 2 reads supporting the variant allele were322. SNVMix: predicting SNVs from next generation sequencing of tumoursFigure 2.3: Theoretical behaviour of SNVMix at varying depths. For depths in N ={2, 3, 5, 10, 15, 20, 35, 50, 100}, the plots show how the distribution of marginal probabilities changes withthe number of reference alleles, given the model parameters fit to a 10x breast cancer genome dataset.applied, with an additional requirement that the non-reference allele had to be represented by at least33% of all reads at that site. This approach eliminated SNVs with weak supporting evidence, but italso categorized the data into two discrete classes - SNV or not, without explicitly providing confidenceestimates on the prediction. Moreover, in transcriptome data, the number of reads representing agiven transcript is expected to be highly variable across all genes and thus determining a minimumdepth can be difficult. I demonstrate (in Section 2.3) that applying depth-based thresholds reducedsensitivity to finding real SNVs.To overcome these limitations, we proposed a probabilistic approach based on a Binomial mixturemodel, called SNVMix1, which computed posterior probabilities, providing a measure of confidence onthe SNV predictions. The model inferred the underlying genotype at each location. This genotype wasassumed to be in one of three states: aa=homozygous for the reference allele, ab=heterozygous, andbb=homozygous for the non-reference allele; the latter two genotypes were considered to constitutean SNV. Figure 2.3 shows how the posterior probability of each of these three states increased withmore depth, and demonstrates the theoretical qualities of our approach. The figure shows the output332. SNVMix: predicting SNVs from next generation sequencing of tumoursof SNVMix at different depths and indicates that the more data available, the more confident theprediction. Note, however, that even with only Ni = 2 reads, one can still confidently infer thatthe genotype is aa if 2 reference counts are observed (ai = 2), or is bb if 0 reference counts areobserved (ai = 0); only in the ‘mixed’ case, where ai = 1, is one uncertain about the true genotype.Two other approaches developed at the same time (Maq [111] and SOAPSNP [203]) proposed usingBinomial distributions to model genotypes; however, both approaches were developed in the contextof sequencing normal genomes, not cancer genomes. Both set parameters for the model assumingexpected distributions for normal allelic ratios, and applied post-processing heuristics to reducefalse positives. This posed a problem when working with cancer genomes and transcriptomes, bothof which may not follow expected distributions due to tumour-normal admixtures in the sample,intra-tumour heterogeneity, copy number changes and other factors. SNVMix thus used the expectationmaximization (EM) algorithm to find a maximum a posteriori (MAP) estimate of the parametersgiven some training data, allowing the model to adapt to genomes and transcriptomes that maydeviate from the assumed distributions for normal genomes and thus model the data more accurately.Previous studies employed stringent thresholding for removing poor quality bases and/or reads.Ley et al (2008) [204] used Maq followed by stringent post-processing of the predictions based on depthand base quality (qi1:Ni) thresholds. SNVs in their analysis were predicted when they were supportedby 7 or more reads and where the SNV had a Maq quality score of more than 30 (correspondingto a probability of correct base call qij = 0.999), or if all of the following were true: i) the SNV hadread support with a maximum base quality greater than 26, ii) was supported by 3 or more uniquestart-site reads supporting the SNV, and iii) the reads supporting the SNV had an average basequality greater than 16. In a similar way, Morin et al (2008) [168] included reads with base qualitiesof > 20 (qij ≥ 0.99). We proposed that hard thresholds may exclude informative data and I extendedSNVMix1 to explicitly encode base and mapping qualities, using them to probabilistically weight thecontribution of each nucleotide to the posterior probability of an SNV call. In addition, I exploredhow to optimally combine thresholding and probabilistic weighting in order to obtain more accurateresults. I also showed (Section 2.3) how this extended model, which I called SNVMix2, conferred anincreased specificity in my predictions.The statistical models I proposed provided posterior probabilities on SNV predictions, removingthe need for depth thresholds; explicitly modelled base and mapping qualities, removing the need forquality thresholds; and used an EM learning algorithm to fit the model to data, removing the need to342. SNVMix: predicting SNVs from next generation sequencing of tumours! Gi! µk! "k! "k! "! "Position i Genotype k ! a ji! z ji! q jiRead j,  ! j " {1,...,Ni}! rji(a) (b)Figure 2.4: SNVMix model for detecting SNVs from NGS data shown as a probabilistic graphicalmodel. Random variables are shown as round circles. Rounded rectangles indicate user settable parameters(or hyperparameters). Observed quantities (or those known at the time of inference) are shaded. Unobservedquantities (or those we wish to infer) are unshaded. Arrows indicate a probabilistic dependency of the tailvariable on the head variable. The conditional probability distributions are shown in Figure 2.5. (a) showsthe SNVMix1 model, where aij , the binary random variable that indicates a match to the reference for readj and position i is observed. (b) shows the SNVMix2 model, where aij is not observed, but rather have soft,probabilistic evidence for it in the form of base qualities qij and mapping qualities rij . This avoids having tothreshold qij and rij .set model parameters by hand. I also showed that these attributes of the model result in increasedaccuracy compared to Maq’s SNV caller and depth threshold based methods. I evaluated the modelbased on real data derived from 16 ovarian cancer transcriptomes sequenced using NGS, and a lobularbreast cancer genome sequenced to >40X coverage [87]. For all cases, high density genotyping arraydata were obtained for orthogonal comparisons. Finally, I demonstrated results on 497 positions froma breast cancer genome dataset that were subjected to Sanger sequencing and thus constituted a’ground truth’ data set for benchmarking.2.2 Methods2.2.1 SNVMix model specificationSNVMix is shown as a probabilistic graphical model in Figure 2.4a. The conditional probabilitydistributions for the model are given in Figure 2.5 and the description of all random variables is listedin Table 2.1. The input is composed of allelic counts from aligned data and the output is the predictedgenotypes. Consider Gi = k, k ∈ {aa, ab, bb} to be a Multinomial random variable representing the352. SNVMix: predicting SNVs from next generation sequencing of tumoursTable 2.1: Description of random variables in SNVMix1 and SNVMix2Parameter Description Valueδ Dirichlet prior on pi (1000,100,100)pi Multinomial distribution over genotypes Estimated by EM (M-step)Gi Genotype at position i Estimated by EM (E-step)aijIndicates whether read j at position imatches the referenceObserved in SNVMix1,latent in SNVMix2zijIndicates whether read j aligns to itsstated positionLatentqij Probability that base call is correct Observed (SNVMix2 only)rij Probability that alignment is correct Observed (SNVMix2 only)µk Parameter of the Binomial for genotype k Estimated by EM (M-step)α Shape parameter of Beta prior on µ (1000,500,1)β Scale parameter of Beta prior on µ (1,500,1000)genotype at nucleotide position i, where aa is homozygous for the reference allele, ab is heterozygousand bb is homozygous for the non-reference allele. At each position, there are an observed number ofaligned reads Ni. Letting aij ∈ 0, 1 represent whether or not read j ∈ {1, . . . , Ni} matched the reference atposition i. And letting ai (no j index) be the total number of reads that matched the reference at i.The likelihood model for the data was then assumed to be:p(ai|Gi = k,Ni, µ1:3) ∼ Binom(ai|µk, Ni) (2.1)where µk was the parameter of a Binomial distribution for genotype k. µk modelled the expectationthat for a given genotype k, a randomly sampled allele would be the reference allele. Intuitively,one should expect µaa to be close to 1, µab to be close to 0.5 and µbb to be close to 0. Thus, forgenotype k = aa, the Binomial distribution defined by µaa should be highly skewed toward the referenceallele. Similarly, for k = bb the distribution defined by µbb would be skewed toward the non-referenceallele. For k = ab, the distribution would be much more uniform. I imposed a prior on the genotypes,Gi|pi ∼Multinomial(Gi|pi, 1) where pi(k) was the prior probability of genotype k, and set pi(aa) >> pi(ab), pi(bb)to encode the belief that most locations would be homozygous for the reference. Given knowledge ofall the parameters, θ = (µ1:3, pi), Bayes’ rule could then be used to infer the posterior over genotypes,γi(k) = p(Gi = k|ai, Ni, θ), where:γi(k) =pikBinom(ai|µk, Ni)∑Kj=1 pijBinom(ai|µj , Ni)(2.2)362. SNVMix: predicting SNVs from next generation sequencing of tumoursp(pi|δ) = Dir(pi|δ)p(Gi|pi) = Multinomial(Gi|pi, 1)p(aij |Gi = k, µk) = Bern(aij |µk)p(ai|Gi = k, µk, Ni) = Binom(ai|µk, Ni)p(µk|αk, βk) = Beta(µk|αk, βk)p(zij) = Bern(zij |0.5)p(qij |aij , zij) =qij if aij = 1, zij = 11− qij if aij = 0, zij = 10.5 if zij = 0p(rij |zij) = rij if zij = 11− rij if zij = 0Figure 2.5: Conditional probability distributions of the SNVMix modeland the complete data log-likelihood was given by:log p(a1:T |µ1:K , pi) =T∑i=1logK∑k=1pikBinom(ai|µk, Ni) (2.3)where T was the number of observed positions in the input.The approach to inference involved learning the parameters θ = (µ1:K , pi) by fitting the model totraining data using maximum a posteriori (MAP) expectation maximization (EM - see below). Idemonstrated that this produced better results than Maq, which used fixed parameters (Section 2.3).2.2.2 Prior distributionsThe model assumed that pi was distributed according to a Dirichlet distribution parametrized by δ,the so-called pseudocounts. I set δ to be skewed toward piaa under the assumption that most positionswould be homozygous for the reference allele. µk was conjugately distributed according to a Betadistribution: µk ∼ Beta(µk|αk, βk); where the hyper-parameters α and β were set to αaa=1000, βaa=1;αab =500, βab=500; and αbb=1, βbb=1000, assuming that µaa should be skewed toward 1, µab should beclose to 0.5 and µbb should be close to Model fitting and parameter estimationThe model is fitted using the expectation maximization (EM) algorithm. First, the parameters areinitialized to their prior means, µk is set toαkαk+βk, while pi(k) is initialized to kNδwhere Nδ =∑k δ(k). Then372. SNVMix: predicting SNVs from next generation sequencing of tumoursthe EM algorithm iterates between the E-step where the genotypes are assigned using Equation 2.2and the M-step where the model parameters are re-estimated. At each iteration, the complete data-logposterior is evaluated (Equation 2.3) and the algorithm terminates when this quantity no longerincreases. The M-step equations are standard conjugate updating equations:pinew(k) =∑Ti=1 I(Gi = k) + δ(k)∑j∑Ti=1 I(Gi = j) + δ(j)(2.4)where I(Gi = k) is an indicator function to signal that Gi is assigned to state k at position i, and:µnewk =∑Ti=1 aI(Gi=k)i + αk − 1∑Ti=1NI(Gi=k)i + αk + βk − 2(2.5)This specifies the Binomial mixture model behind SNVMix1 and establishes how to fit the modelto the data to estimate its parameters. Below I discuss how to extend this model to consider base andmapping qualities that encode uncertainty in ai.2.2.4 Modelling base and mapping qualitiesThe model shown in Figure 2.4a assumed that aij was observed (it is a shaded node in the graph), andthus assumed correct. However, each nucleotide’s contribution to the allelic counts has uncertaintyassociated with it in the form of base qualities and mapping qualities. I proposed a soft weightingscheme, which would down-weight the influence of low quality base and mapping calls, but notdiscard them altogether. To model this, I changed aij to be an unobserved quantity as shown inFigure 2.4b, and instead observed the soft evidence on them in the form of probabilities, which arethen represented by the observed base qualities qij ∈ [0, 1]. Similarly, two unobserved random variableswere introduced, zij ∈ {0, 1} representing whether read j was correctly aligned, and soft evidence in theform of probabilities which I represent by the observed mapping qualities rij ∈ [0, 1]. The conditionalprobability distributions for p(qij |aij , zij) and p(rij |zij) are given in Figure 2.5. Thus, the input data becameq1:T , r1:T and the corresponding likelihood for each location i could be obtained by marginalizing out382. SNVMix: predicting SNVs from next generation sequencing of tumoursa, z as follows:p(qi1:Ni, ri1:Ni|Gi = k, µk) =Ni∏j=1∑a∑zp(aij |Gi, µ)p(qij |aij , zij)p(rij |zij)p(zij)∝Ni∏j=10.5(1− rij) + rij [(1− qij)(1− µk) + qijµk]As before, given the model parameters µ, pi, the genotype of each position could be inferred by modifyingEquation 2.2 as follows:γi(k) =pikp(qi1:Ni, ri1:Ni|Gi = k, µk)∑Kh=1 pihp(qi1:Ni, ri1:Ni|Gi = h, µh)(2.6)The updating equations were left unchanged in the M-step of EM. The model fitting algorithmchanged only in the E-step by using Equation 2.6 instead of Equation 2.2. I have specified howto encode base and mapping uncertainty into the model, obviating the need for thresholding thesequantities. I call this version of the model SNVMix2.Figure 2.6 shows the theoretical behaviour of this model using simulated data with varying basequalities. The model performed equally well for data sets where the mean certainty of the base callswas ≈80% and higher. This suggested that thresholding base calls at Phred Q20 (99% certainty [168])or Q30 (99.9% certainty [204]) may have been overly stringent.2.2.5 Implementation, running time and SNV discovery pipelineI initially implemented the inference algorithms for SNVMix1 and SNVMix2 in C, supporting bothSAMtools [125] and Maq [111] pileup format. I then subsequently upgraded the software to read directlyfrom BAM files, allowing access to more information regarding the context of mutation calls (seeChapter 3). Running EM (SNVMix2) on 14,649 positions for a 40X breast cancer genome dataset took36 s on a regular desktop computer. Predicting genotypes for the whole 40X genome took 11 min and 38s. (The Maq step cns2snp took 19 min and 9 s. in the same dataset) A script to choose optimal base andmapping quality thresholds, given ground truth [or orthogonal single nucleotide polymorphism (SNP)array] data, is provided in the software package ( was also integrated into an NGS transcriptome and genome analysis pipeline that i)cross references all predicted SNVs against known germline SNPs; and ii) gives functional context tocoding SNVs. The pipeline filters all predicted SNVs against dbSNP v129, the ‘personal genomes’392. SNVMix: predicting SNVs from next generation sequencing of tumours0.5 0.67 0.75 0.80 0.83 0.86 0.89 0.91 0.98 0.99 0.9990.50.550.60.650.70.750.80.850.90.951Area under ROC curveBase quality0 0.2 0.4 0.6 0.8 (b)Figure 2.6: (a) AUC distributions from fitting SNVMix2 to synthetic data with increasing levels of certaintyin the base call. (b) corresponding ROC curves for (a) data. The mean base call quality is provided in thelegend.(39-42), and data from the 1000 genomes project ( Removing commonlyoccurring mutations maximizes the number of potential somatic mutations that might be relevant incancer. Coding SNVs are related to the gene, affected codon, amino acid change if non-synonymousand the Grantham distance [205]. This contextual information is provided to the end user in anaccessible tab-delimited text file that is viewable in any spreadsheet program. This analysis pipelinewas used and described in Morin et al (2010) [118].2.2.6 Data setsI used three data sets to evaluate the SNVMix1 and SNVMix2 models (available as SupplementaryMaterials online at The first (online Supple-mentary Dataset 1A and B) consisted of 16 ovarian cancer transcriptomes sequenced at Canada’sMichael Smith Genome Sciences Centre (Vancouver, BC) using the Illumina GA II platform, RNA-Seqpaired end protocol [206]. For each of these cases, Affymetrix SNP 6.0 high density genotyping arrayswere used to analyze the corresponding DNA. I considered coding positions in the transcriptomedata for which there was a corresponding high confidence (>0.99) genotyping call from the arraypredicted using the CRLMM algorithm [207]. This filter resulted in an average of approximately9,000 genomic positions per case for a total of 144,271 positions. These data were used in the cross402. SNVMix: predicting SNVs from next generation sequencing of tumoursvalidation experiment, described below. The second dataset (online Supplementary Dataset 2A-D)consisted of 497 genomic positions from a lobular breast tumour genome which were predicted to beSNVs using the SNVMix1 model from data generated at Canada’s Michael Smith Genome SciencesCentre (Vancouver, BC) using the Illumina GA II platform [87]. These positions had been predictedto be non-synonymous protein coding changes and were subsequently re-sequenced using Sangercapillary-based technology. Of these, 305 were confirmed as SNVs and 192 were confirmed as falsepositives. I considered these 497 positions as a ground truth dataset that I used for sensitivity andspecificity calculations. I included the 192 false positive calls in the evaluation set to determinewhether further changes in the algorithm or thresholds could help identify them as false. In addition,Affymetrix SNP 6.0 array data were also generated for this lobular breast tumour and I consideredonly 14,649 positions that matched the coding position and CRLMM prediction criteria outlinedabove (online Supplementary Dataset 3A-D). All NGS data were aligned to the human genomereference (NCBI build 36.1) using Maq’s map tool (v0.6.8). Thus, for all comparisons between Maqand SNVMix, I used the same sets of alignments.2.2.7 Accuracy metricsWhile comparing to the SNP array data, I defined a true positive (TP) SNV as an ab or bb CRLMMgenotype. A true negative (TN) SNV was defined as an aa genotype from the SNP array. For theSanger validated positions, a TP was an SNV that was confirmed by Sanger sequencing, whereas aTN was a position that was not confirmed. To evaluate my models against these data, I computedp(SNVi) = γi(ab) + γi(bb) and standard receiver operator characteristic (ROC) curves. The area under theROC curve (AUC) was computed as a single numeric metric of accuracy that effectively measures thetrade-off between sensitivity and specificity. Similarly, I calculated the area of the precision-recall curve(AUC-PR) for comparisons where the SNV negative class was greater than the positive class [208]. Asan additional measure, I computed the F-statistic: f = 2∗(precision∗recall)(precision+recall), where precision was measuredas the proportion of predictions that were true and recall was the proportion of true SNVs that werepredicted.2.2.8 Benchmarking experimentsTo evaluate the effect of estimating parameters, I designed a four-fold cross-validation study. Ipermuted the 144,271 positions with matched array-based genotype data from the ovarian cancer412. SNVMix: predicting SNVs from next generation sequencing of tumoursdata, and divided the positions into 4 equal parts. I fit the model to 3 parts (training data) usingEM and used the converged parameters to calculate p(SNVi) for each of the remaining positions (testdata). I repeated this 10 times and computed the AUC and AUC-PR for each of the 16 cases. I alsocomputed the AUC and AUC-PR from the results predicted using Maq v0.6.8 and compared theAUC distributions across the 16 cases to SNVMix1 and SNVMix2. These data also allowed me todetermine the range of converged parameter estimates across the folds and 10 replicates. I also testedthe effect of depth-based thresholding by running SNVMix1 on the 14,649 positions from the breastcancer genome. To simulate the thresholding, I set p(SNVi) = 0 at locations where Ni was below somethreshold, chosen from the set {0, 1, . . . , 7, 10}. I compared SNVMix1, SNVMix2 and Maq on these dataas well. Finally, I evaluated the true positive rate (TPR) and false positive rate (FPR) on the 497ground truth positions from the same breast cancer genome for SNVMix1, SNVMix2 and Maq.2.3 Results2.3.1 Depth heuristics reduce sensitivityI first determined the effect that depth thresholding had using the lobular breast cancer genomedataset. I calculated ROC curves and corresponding AUC values (Section 2.2) from the output ofSNVMix1 at different cutoff values. The most accurate results were obtained when no thresholdingwas applied (Figure 2.7). At a threshold of 0, the AUC was 0.988 (the highest) and at a threshold of 10reads, the AUC was 0.614 (the lowest). At an FPR of 0.01, the TPR decreased with increasing numberof reads required for the threshold without exception, suggesting that depth-thresholding under theSNVMix1 model reduces overall sensitivity without increasing specificity, and should therefore beavoided. AUC for thresholds of 1, 3, 5 and 7 reads were 0.971, 0.893, 0.782 and 0.707, respectively.2.3.2 Estimating parameters in transcriptome data by model fitting confersbetter accuracyFigure 2.8 shows the AUC distribution of over 16 ovarian cancer transcriptomes (Section 2.2) forthe best and worst cross-validation runs of SNVMix1 and SNVMix2 as well as the results fromthe Maq SNV caller with the two recommended settings of the r parameter (0.2 and 0.001). Bothruns of SNVMix1 were significantly better than the Maq runs [analysis of variance (ANOVA) test,P < 0.0001], with mean AUC of 0.9533±0.0104 and 0.9530±0.0107 (mean AUC-PR 0.9319±0.0132 and422. SNVMix: predicting SNVs from next generation sequencing of tumours0 0.2 0.4 0.6 0.8 = AB or BBFPRTPRthresh=0, AUC=0.988thresh=1, AUC=0.971thresh=2, AUC=0.939thresh=3, AUC=0.893thresh=4, AUC=0.838thresh=5, AUC=0.782thresh=6, AUC=0.739thresh=7, AUC=0.707thresh=10, AUC=0.614Figure 2.7: SNVMix1 presents most accurate results when no thresholds are applied. ROC curveswere obtained from the 14,649 training positions, where the predicted probability was set to 0 if there were lessthan a threshold number of reads at a given position. Thresholds of 0-7 and 10 are shown, with 10 resulting inthe worst performance and 0 resulting in the best performance.0.9283±0.0152); compared with 0.9337±0.0115 and 0.9066±0.0127 (mean AUR-PR 0.8984±0.0159 and0.8686±0.0162) for the Maq runs. Although the improvement of SNVMix2 over SNVMix1 is notstatistically significant, it is noteworthy that no thresholds of any kind were applied to the data andthus probabilistic weighting can eliminate the need for arbitrarily thresholding of the data (see below).2.3.3 Evaluation of models on a deeply sequenced breast cancer genome withground truth SNVsI evaluated performance of the models on a lobular breast cancer sample sequenced to > 40X haploidcoverage [87] . In addition, I compared results obtained from the same genome at 10X coverage. Ifirst trained the model using 14,649 protein coding positions for which we had generated matchingAffymetrix SNP6.0 calls. I then computed the AUC for SNVMix1, SNVMix2 and Maq. Table 2.2shows that the highest AUCs were obtained with SNVMix2 on the 40X genome, followed by SNVMix2on the 10X genome (AUCs of 0.9929 and 0.9905, respectively). Both of these were higher than resultsachieved for SNVMix1 (AUC of 0.9880) and Maq (0.9824 for 40X and 0.9115 for 10X - both for the r= 0.001 parameter setting). After fitting the model to the 14,649 positions, I evaluated performance432. SNVMix: predicting SNVs from next generation sequencing of tumours0.840.860.880.900.920.940.96AUC across different SNVMix models and MaqAreaunder the ROC curve0.840.860.880.900.920.940.96AUC−PR across different SNVMix models and MaqArea under the PR curveMb_Q30_bestmb_Q0_bestMb_Q30_worstmb_Q0_worstmb_Q0_bestmb_Q0_worstr0.2r0.001SNVMix2 SNVMix1 MaqMb_Q30_bestmb_Q0_bestMb_Q30_worstmb_Q0_worstmb_Q0_bestmb_Q0_worstr0.2r0.001SNVMix2 SNVMix1 MaqFigure 2.8: Distribution of AUC and AUC-PR in cross-validation runs showed SNVMix1 andSNVMix2 provided more accurate results than Maq. Boxplots showing the distribution of AUC for thebest and worst cross-validation run for SNVMix1, Maq with r=0.02, and Maq with r=0.001. Both the best andworst runs of SNVMix1 (mean AUC of 0.9533±0.0104 and 0.9530±0.0107, mean AUC-PR of 0.9319±0.0132and 0.9283±0.0152) were significantly more accurate that the Maq predictions (mean AUC 0.9337±0.0115 and0.9066±0.0127, mean AUR-PR 0.8984±0.0159 and 0.8686±0.0162).using 497 candidate mutations originally detected using SNVMix1 at 10X, which were validatedusing Sanger amplicon sequencing (Section 2.2). These consisted of 305 true SNVs (variants seen inthe Sanger traces) and 192 that could not be confirmed in the Sanger traces. Table 2.2 shows thesensitivity, precision and F-measure results of Maq, SNVMix2, SNVMix1 and SNVMix2 combinedwith base and mapping quality thresholding at both 10X and 40X coverage at a p(SNV ) (Section 2.2)threshold determined using a FPR ≤ 0.01. I present the Maq results for reference only, as thiscomparison is biased toward the SNVMix1 model that led to the identification of these mutations inthe first place. SNVMix2 and SNVMix1 showed similar F-measure at both 10X and 40X, reinforcingthat the probabilistic weighting confers equal accuracy without having to select arbitrary qualitythresholds. In addition, both SNVMix2 and SNVMix1 had higher accuracy at 40X and 10X (Table2.2). Interestingly, all the models had increased false negative rates in the 40X genome in comparisonwith the 10X genome. Upon further review of the SNVMix2 positions predicted at 10X, but not at40X, I examined that the majority (9 out of 13) were marginally below threshold and significantprobability mass was indeed on the P (ab) state (>0.99), and would have been predicted with even aslightly less stringent threshold. Three out of the remaining four appear to be the result of DNA copy442. SNVMix: predicting SNVs from next generation sequencing of tumoursTable 2.2: Performance values for different runs of SNVMix2 and MAQ.Model RunTrainAUCTP FP TN FN Sens Prec F-measMaq10x MAQ-r.001 0.9115 276 95 97 29 0.9049 0.7439 0.816510x MAQ-r.02 0.9016 262 111 81 43 0.8590 0.7024 0.772840x MAQ-r.001 0.9824 210 56 136 95 0.6885 0.7895 0.735540x MAQ-r.02 0.9820 206 54 138 99 0.6754 0.7923 0.7292SNVMix10x SNVMix1 0.9880 305 192 0 0 1.0000 0.6137 0.760640x SNVMix1 0.9924 293 107 85 12 0.9607 0.7325 0.8312SNVMix210x mbQ0 0.9905 299 162 30 6 0.9803 0.6486 0.780740x mbQ0 0.9929 290 107 85 15 0.9508 0.7305 0.8262SNVMix2 +thresholding10x MBmQ50 bQ20 0.9882 287 88 104 18 0.9410 0.7653 0.844140x MBmQ50 bQ15 0.9928 287 71 121 18 0.9410 0.8017 0.8658number amplifications that skew the allelic ratios (Section 2.4.2).Even though the SNVMix2 model eliminated the need for thresholding through probabilisticweighting, I explored the effect of applying thresholds to the SNVMix2 input to determine if furtherperformance increases could be achieved. I tested five different ways to use mapping and base qualitiesat several quality thresholds in each case. These models and their rationale are as follows:• b: Filter on base quality value (qij > threshb) and use only base qualities, assume all unambiguousalignments are of equal quality (rij = 1) and probabilistically weight base calls;• m: Filter on mapping quality (rij > threshm) and use the corresponding quality as a surrogate forbase quality (qij = rij). The mapping quality is then considered to be perfect (rij = 1);• mb: Filter according to the minimum value between base quality and mapping quality (min(rij , qij)),this is similar to the filtering done by MAQ’s SNP caller;• M: Filter based on map quality (rij > threshm) and use only the base quality of the calls, assumethat any aligned short read that passed the threshold is perfect (rij = 1);• Mb: Filter according to map quality (rij > threshm) and use both base and map qualities.• MB: Filter according to map quality (rij > threshm) and base quality (qij > threshb) and use bothbase and map qualities.I applied the first five of these models to the 497 breast cancer positions, setting the correspondingthreshold to Phred-like (quality) scores of Q0, Q5, Q10, Q20, Q30 and Q40, and computed ROCcurves (Figure 2.9) to evaluate accuracy. The largest improvements were obtained by stringently452. SNVMix: predicting SNVs from next generation sequencing of tumoursthresholding mapping qualities shown in models M, Mb and m (red box in Figure 2.9). The largestimprovement in accuracy over SNVMix1 (AUC=0.7811) was seen in Mb with a threshold of Q40(see black arrow). Much lower improvements were seen in the models that thresholded base qualities(models mb and b – blue box in Figure 2.9). Of these runs, model b with a threshold of Q10 producedthe best improvement with an AUC of 0.7328. These results indicate that removing the very lowmapping quality reads helps in SNV detection while removing low quality bases has less of an effect.To investigate this further, I examined the effect of thresholding base qualities concomitantlywith read mapping quality alignments (Q30 and higher). For this, I evaluated the MB model withthresholds on mapping quality (Q30, Q35, Q40, Q45, Q50) and base quality (Q0, Q5, Q10, Q15, Q20,Q25) by running SNVMix2 on the resulting data from both 10X and 40X genomes (Figure 2.10). Asshown in Table 2.2, thresholding the mapping qualities at Q50 and base qualities at Q20 (mQ50 bQ20)at 10X performed better than all other 10X runs (F-measure=0.8441). For the 40X data, thresholdingthe mapping qualities at Q50 and base qualities at Q15 (mQ50 bQ15) performed best over all runs(F-measure=0.8658, see Appendix A.1 for all results from runs in increments of Q1 base qualitythresholds). These results indicated that the MB model found the best compromise between sensitivityand precision, and that probabilistic weighting of the base qualities applied to high quality mappingswhile ignoring very poor base calls gave the optimal result.Figure 2.11 shows a more refined look at the AUC on the test set for models mQ30, mQ35, mQ40,mQ45 and mQ50 at increasing base quality thresholds. The dashed lines show the AUC for the 10Xgenome while the solid lines show the AUC for the 40X genome. Results in the 40X genome weremore accurate than in the 10X genome. Interestingly, the 40X curves indicate that the accuracy isrelatively unaffected by base quality thresholding until the threshold exceeds Q25. However, at 10Xthe choice of base quality threshold is relatively consistent only until Q10. This difference can largelybe explained by the fact that the 10X data generation runs used a different quality calibration methodresulting in overall lower quality scores. Nonetheless, of the 155 MB runs that I performed, the top65 runs by F-measure all used a base quality threshold < Q20 (see Appendix A.1). These resultsindicated that previously reported base quality thresholds may be too stringent. Furthermore, whenused with stringent mapping thresholds, the SNVMix2 model can effectively use the base qualities byprobabilistic weighting to confer higher accuracy. Finally, these results showed that treating mappingand base qualities separately as opposed to taking the minimum over the two (i.e. the Maq and mbmodels) had advantages.462. SNVMix: predicting SNVs from next generation sequencing of tumours0 0.5 40x M Q0FPRTPRAUC: 0.719010 0.5 40x M Q5FPRTPRAUC: 0.723480 0.5 40x M Q10FPRTPRAUC: 0.722780 0.5 40x M Q20FPRTPRAUC: 0.724390 0.5 40x M Q30FPRTPRAUC: 0.746020 0.5 40x M Q40FPRTPRAUC: 0.776620 0.5 40x Mb Q0FPRTPRAUC: 0.72830 0.5 40x Mb Q5FPRTPRAUC: 0.727050 0.5 40x Mb Q10FPRTPRAUC: 0.723190 0.5 40x Mb Q20FPRTPRAUC: 0.725070 0.5 40x Mb Q30FPRTPRAUC: 0.750180 0.5 40x Mb Q40FPRTPRAUC: 0.781060 0.5 40x b Q0FPRTPRAUC: 0.732340 0.5 40x b Q5FPRTPRAUC: 0.730040 0.5 40x b Q10FPRTPRAUC: 0.732840 0.5 40x b Q20FPRTPRAUC: 0.718290 0.5 40x b Q30FPRTPRAUC: 0.650240 0.5 40x b Q40FPRTPRAUC: 0.50 0.5 40x m Q0FPRTPRAUC: 0.720750 0.5 40x m Q5FPRTPRAUC: 0.720820 0.5 40x m Q10FPRTPRAUC: 0.722580 0.5 40x m Q20FPRTPRAUC: 0.723310 0.5 40x m Q30FPRTPRAUC: 0.752080 0.5 40x m Q40FPRTPRAUC: 0.775680 0.5 40x mb Q0FPRTPRAUC: 0.728260 0.5 40x mb Q5FPRTPRAUC: 0.73210 0.5 40x mb Q10FPRTPRAUC: 0.725050 0.5 40x mb Q20FPRTPRAUC: 0.723010 0.5 40x mb Q30FPRTPRAUC: 0.642160 0.5 40x mb Q40FPRTPRAUC: 0.5M Mb m mb bQ0Q5Q10Q20Q30Q40Figure 2.9: ROC scan of thresholding models. ROC curves obtained running SNVMix2 on the 497Sanger validated breast cancer positions (at 40X) for each of the quality thresholding models M, Mb, m, band mb at at quality values Q0, Q5, Q10, Q20, Q30 and Q40. The dotted line in the background shows theROC obtained from the original predictions using the SNVMix1 model. The red box denotes the models thatthreshold on mapping quality and the blue box denotes those thresholding on base quality.472. SNVMix: predicting SNVs from next generation sequencing of tumours0 0.5 40x MB mQ30−bQ0FPRTPRAUC 40X: 0.75447AUC 10X: 0.774860 0.5 40x MB mQ35−bQ0FPRTPRAUC 40X: 0.78111AUC 10X: 0.77910 0.5 40x MB mQ40−bQ0FPRTPRAUC 40X: 0.77859AUC 10X: 0.789530 0.5 40x MB mQ45−bQ0FPRTPRAUC 40X: 0.80138AUC 10X: 0.790030 0.5 40x MB mQ50−bQ0FPRTPRAUC 40X: 0.79013AUC 10X: 0.795540 0.5 40x MB mQ30−bQ5FPRTPRAUC 40X: 0.75992AUC 10X: 0.778360 0.5 40x MB mQ35−bQ5FPRTPRAUC 40X: 0.77553AUC 10X: 0.778770 0.5 40x MB mQ40−bQ5FPRTPRAUC 40X: 0.77918AUC 10X: 0.792730 0.5 40x MB mQ45−bQ5FPRTPRAUC 40X: 0.79911AUC 10X: 0.795650 0.5 40x MB mQ50−bQ5FPRTPRAUC 40X: 0.8005AUC 10X: 0.796810 0.5 40x MB mQ30−bQ10FPRTPRAUC 40X: 0.76665AUC 10X: 0.785690 0.5 40x MB mQ35−bQ10FPRTPRAUC 40X: 0.78782AUC 10X: 0.785710 0.5 40x MB mQ40−bQ10FPRTPRAUC 40X: 0.78987AUC 10X: 0.792110 0.5 40x MB mQ45−bQ10FPRTPRAUC 40X: 0.80546AUC 10X: 0.792140 0.5 40x MB mQ50−bQ10FPRTPRAUC 40X: 0.80494AUC 10X: 0.793840 0.5 40x MB mQ30−bQ15FPRTPRAUC 40X: 0.77092AUC 10X: 0.785570 0.5 40x MB mQ35−bQ15FPRTPRAUC 40X: 0.77457AUC 10X: 0.783860 0.5 40x MB mQ40−bQ15FPRTPRAUC 40X: 0.76866AUC 10X: 0.789120 0.5 40x MB mQ45−bQ15FPRTPRAUC 40X: 0.7971AUC 10X: 0.789750 0.5 40x MB mQ50−bQ15FPRTPRAUC 40X: 0.79276AUC 10X: 0.791210 0.5 40x MB mQ30−bQ20FPRTPRAUC 40X: 0.78887AUC 10X: 0.776420 0.5 40x MB mQ35−bQ20FPRTPRAUC 40X: 0.78663AUC 10X: 0.77580 0.5 40x MB mQ40−bQ20FPRTPRAUC 40X: 0.7869AUC 10X: 0.787930 0.5 40x MB mQ45−bQ20FPRTPRAUC 40X: 0.81062AUC 10X: 0.784390 0.5 40x MB mQ50−bQ20FPRTPRAUC 40X: 0.8055AUC 10X: 0.79160 0.5 40x MB mQ30−bQ25FPRTPRAUC 40X: 0.77092AUC 10X: 0.753430 0.5 40x MB mQ35−bQ25FPRTPRAUC 40X: 0.77855AUC 10X: 0.752820 0.5 40x MB mQ40−bQ25FPRTPRAUC 40X: 0.78175AUC 10X: 0.757360 0.5 40x MB mQ45−bQ25FPRTPRAUC 40X: 0.79469AUC 10X: 0.761370 0.5 40x MB mQ50−bQ25FPRTPRAUC 40X: 0.80029AUC 10X: 0.76706mQ30 mQ35 mQ40 mQ45 mQ50bQ0bQ5bQ10bQ15bQ20bQ25Figure 2.10: ROC curves for MB scan on 497 ground truth positions. ROC curves obtained callingSNVMix2 on the 497 Sanger validated breast cancer positions (at 40X and 10X) using the MB thresholdingmodel with mapping quality thresholds of Q30, Q35, Q40, Q45 and Q50 (columns), and base quality thresholdsof Q0, Q5, Q10, Q15, Q20, Q25 (rows). For each run, the SNVMix2 model was fit using the 14,649 trainingpositions and corresponding thresholds. The resulting model and threshold parameters were then used to callthe SNV status at the 497 test positions and generate the ROC curves. The solid lines represent the ROC curvesfor the 40X genome, and the dotted lines represent the ROC curves for the 10X genome. The correspondingAUCs are displayed in each box.482. SNVMix: predicting SNVs from next generation sequencing of tumours0 5 10 15 20 25 30 350.9880.9890.990.9910.9920.993bQAUCBehaviour of AUC with different mapping and base quality thresholds40XmQ3040XmQ3540XmQ4040XmQ4540XmQ50mQ30mQ35mQ40mQ45mQ50Figure 2.11: Effect of thresholds on base and mapping qualities. Performance, measured in trainingAUC, drops as the minimum base quality thresholds are increased, while modifications in mapping qualitythresholds seem to have little effect. The plot shows curves depicting the behaviour of the AUC from SNVMix2results when applying different combinations and values of thresholds. The solid lines represent the AUCsobtained on the 40x version of the lobular breast cancer genome, and the dashed lines correspond to the 10xversion. The colour of each line corresponds to a different mapping quality threshold (see legend). The X-axiscorresponds to the minimum base quality required for a base call to be considered.492. SNVMix: predicting SNVs from next generation sequencing of tumours2.4 DiscussionI have described two statistical models based on Binomial mixture models to infer SNVs using alignedNGS data obtained from tumours. I demonstrated that a probabilistic approach to modelling alleliccounts obviated the need for depth-based thresholding of the data, and how fitting the model toreal data to estimate parameters was superior to Maq, which used fixed parameter settings on theassumption that the data come from a normal human genome. In addition, I extended the basicBinomial mixture to model mapping and base qualities by using a probabilistic weighting technique.This eliminated the need to employ arbitrary thresholds on base and mapping qualities, and instead letthe model determine the strength of contribution of each read to the inference of the genotype. Finally,I showed that even further gains in accuracy could be obtained by combining moderate thresholdingand probabilistic weighting of the base and mapping qualities. Importantly, gains in accuracy by theSNVMix models were shown in both transcriptome and genome data.2.4.1 Dependence on alignmentsSNVMix results are highly dependent on the accuracy of alignments as well as the consistency andaccuracy of mapping qualities reported by the aligner. Results in Table 2.2 showed that the combinedapproach of stringent thresholding on mapping quality and modelling the uncertainty of the remainingreads gave the highest accuracy. Given that the most gain was obtained in precision, it suggests thatfalse positive SNV predictions may indeed be reduced with more accurate alignments. As read lengthsincrease with technology development and mapping algorithms improve, I expect that the input toSNVMix will be of higher quality, which should yield even more accurate results. Moreover, alignmentusing Maq presents a drawback with regards to SNVMix2’s model. When a short read can be alignedto more than one position in the genome with the same mapping quality, this read is excluded, it isassigned a mapping quality of zero and only one randomly selected alignment is reported. SNVMix2’sdesign could leverage the read’s quality among the distinct alignment coordinates and still use theinformation it conveys to predict SNVs.Additionally, when working with transcriptome sequencing, one of the most challenging problemsis the alignment of the short reads to the reference genome due to alternative splicing and, especiallywhen dealing with tumour samples, the possible existence of gene fusions and novel splicing events. Atthe time SNVMix was developed, aligning short reads to exon-exon junctions could be done using an502. SNVMix: predicting SNVs from next generation sequencing of tumoursalternative library of known junctions. With this paradigm, aligned reads have to present a sufficientlylarge overhang across both sides of the exon-exon junction to be accepted. However, this requirementcould cause low coverage over exon-exon junctions as well as false positives in special cases whenthere is an alternative exon overlapping the intronic space between other two exons. These types ofartifacts are discussed further in Chapter Limitations, extensions and future workAs stated earlier, a major objective in cancer genome sequencing is to discover somatic mutations. Ifsequence data from tumour and normal DNA from the same patient are available, candidate somaticmutations can be identified as positions for which p(SNV ) is high in the tumour and 1− p(SNV ) is highin the normal data. If only tumour data are available, the recommendation is to filter against dbSNPand perform targeted validation on the remaining positions in both tumour and normal DNA asdescribed previously [87]. Moreover, the models I have presented assume identically and independentlydistributed genotypes. As such, the common prior over genotypes pi can be indexed by position (i.e. pii)and thus could encode information about what variants are known for each position i in the genome.We noticed that some positions not called by SNVMix1 at 40X were in what we believe to beallele-specific copy number amplifications. Future work would involve incorporating copy numberdata directly into the model to consider such situations, where the resultant allelic bias is expectedto mask variants present in the unamplified allele. In a similar vein, due to regulatory mutations orepigenetic changes, transcriptomes can show preferential expression of one allele. In such cases ofextensively skewed allelic expression, our model will be insensitive. Further extensions of the modelto consider these factors could be explored.Finally, Shah et al (2009) [87] had recently demonstrated that intra-tumour heterogeneity canbe seen using ultra-deep targeted sequencing. The allelic frequencies of SNVs in rare clones in thetumour population will likely result in false negative predictions at conventional sequencing depths(i.e. between 20X and 40X), and confound the estimation of the false negative rates of prediction.Future investigation of all of these problems would be necessary to characterize all mutations presentin the heterogeneous mixtures of genomes that make up tumours.512. SNVMix: predicting SNVs from next generation sequencing of tumours2.5 ConclusionIn this Chapter, I presented the theoretical background behind the development of the SNVMixframework. Two probabilistic models were included: SNVMix1 which models the genotype as afunction of read counts covering the positions; and SNVMix2 which augments the model by includingthe uncertainties that a basecall in a read is correct and that the alignment of the read to the referenceis correct. Both methods were proven to outperform Maq’s SNV caller, which, at the time, was thestate of the art SNV caller for next-generation sequencing data. I also assessed the functionality ofSNVMix for cancer studies, using genome and transcriptome sequencing, where the proportion ofalleles in a cell may not follow that of a strictly diploid genome. The advantages of SNVMix in thisregard are obtained by the application of the expectation maximization (EM) algorithm, which canadapt the parameters according to the data at hand.SNVMix has demonstrated impact, having been cited 183 times (Google Scholar) by 88 PubMedCentral articles (PubMed) and has been used in studies both at Canada’s Michael Smith GenomeSciences Centre and the BC Cancer Research Centre. These studies have featured ovarian cancer [206,209], oligodendroglioma [210], follicular and diffuse large B-cell lymphoma [118, 211], non-Hodgkinlymphoma [150], and Hodgkin lymphoma [212]. The software has also been used at other centres, suchas the Broad Institute of MIT for a high-risk neuroblastoma study [213]; for detection of mutations incolorectal cancer, lung adenocarcinoma and non-small cell lung cancer at the Mayo Clinic [214–216];and resulted the discovery of SMO and BRAF mutations in ameloblastomas, and was used to studythe effects of DDX3 mutation in medulloblastoma at Stanford University [217, 218].The SNVMix models have also stimulated the development of other methods such as CoNAn-SNV [219], a study in which I collaborated with the adaptation of the model to consider regions of copynumber variation; JointSNVMix [220], in which the model was expanded to consider tumour-normalpairs to better identify somatic mutations; FetalQuant [221], in which the authors expanded ourSNVMix model to four states to detect maternal-fetal genotype combinations; and has also beenmodified to detect RNA editing events in D. melanogaster [222].52Chapter 3BWA-R and SNVMix for improvedvariant calling in transcriptomesDetection of single nucleotide variants (SNVs) is often done using DNA data. However, variantcalling in transcriptomes still is useful. Accurate detection of SNVs in RNA-Seq data can be used,for example, to determine whether mutations are being expressed or not, to detect allelic biases thatmay help study gene regulation mechanisms, and to detect genes affected by RNA editing. As afollow up to my work in SNVMix (Chapter 2 and Goya et al (2010) [124]), I modified an existingDNA aligner to improve the quality of SNV calling using RNA-Seq, the results contributed to severalpublications [150, 186–189]. In this chapter I present a description of the problem, the modificationsand evaluation of the modified aligner, and its effect on SNV calling performance.3.1 IntroductionMutations are a hallmark of cancer [120]. The increased ability to detect mutations offered bynext-generation sequencing (NGS), or high-throughput sequencing (HTS), technologies has greatlyadvanced the pace at which cancer genes can be studied [1, 2]. As described before (see Chapter 2),both DNA and RNA sequencing can be used to detect mutations. Regardless of the protocol used(e.g. DNA-Seq, RNA-Seq), a comparison to a genomic reference is required, and thus the relatedanalysis pipelines commonly include an alignment step in which the short reads are mapped onto areference sequence. Given the relatively short length and large number of reads obtained from HTS,common alignment algorithms like Smith-Waterman [116], Blast [223] and BLAT [224] are too slowto efficiently process HTS data. New alignment strategies have thus been developed. Some of theearlier HTS aligners focused on the use of hash tables, a data structure used to efficiently map keysto values. By using a hashing function that creates an index from the key, the desired value can beefficiently recovered from memory. Some aligners created hash tables indexing the short reads and533. BWA-R and SNVMix for improved variant calling in transcriptomesthen scanned the reference genome, sequentially using the same hashing function on the sequence tofind matches in the short-read hash table (e.g. Eland, MAQ [111], SHRiMP [202], mrfast [225]). Otheraligners focused on creating a hash table from the reference genome and then scanning the reads (e.g.Novoalign [Novocraft, 2010,], Mosaik [226], SOAP [114]). These typeof aligners were soon followed by others using more complex data structures like the Burrows-WheelerTransform [227], which increased the performance of the aligners and was incorporated into alignerssuch as BWA [112], Bowtie [113] and SOAP2 [123].Short read aligners such as Maq, BWA, SOAP and Bowtie were developed for alignments toDNA sequences. They consider a continuous alignment space on which to place the short reads,usually allowing for few mismatches and short indels. When using data from a paired-end libraryanalyzed using Illumina sequencers, in which both ends of a DNA fragment are sequenced, alignerscan take advantage of the fragment size to better place the reads. Properly aligned read-pairs willmost often be placed at a distance dictated by the library’s fragment size. Any deviation from theexpected distance between reads in a pair can be useful for the end user and the algorithm itself.During post-processing of alignments, differences in paired-end tag alignments can be used to detectchromosomal aberrations like deletions, insertions or translocations (see Section 1.5.7). Additionally,the aligner can take advantage of the fragment size to resolve cases in which multiple hits exist foreither or both reads in a pair (see Section 3.1.1).When working with DNA obtained from individuals from a species whose genomic DNA haspreviously been sequenced, selection of the target sequence is relatively straightforward: a fasta filecontaining the genome sequence is fed to the aligner, which in turn uses it as the target sequenceagainst which to align the short reads, or query sequences. This changes when analyzing mRNA datadue to presence of introns and mRNA splicing. This causes two general problems for aligners. On onehand, the presence of introns will cause reads in a pair aligning to different exons to be separatedby a distance larger than the one dictated by the fragment size selected for library construction,affecting read-pairing. Secondly, the union of exon sequences through the removal of introns createsnucleotide sequences that are not represented in the DNA reference. Short reads obtained from suchexon-exon junctions (EEJs) require special treatment as their alignments need to be split between thecorresponding exons (Figure 1.2). The large number of reads and their short length (initial commonread-lengths were 36, 42 or 50 bases) makes gapped alignment of split reads on the genome a challengingproblem. Although gapped aligners like BLAST [223] had been used (e.g. ALEXAseq [167]), they were543. BWA-R and SNVMix for improved variant calling in transcriptomescomputationally inefficient, and so strategies based on specialized short-read aligners were desirable.The main approaches to RNA-Seq alignment can be separated into annotation based and annotationfree. The former is based on the idea of using known gene annotations to enrich the alignment reference,either by aligning reads directly onto transcript sequences, or aligning to a genome and then to atranscriptome, or by in silico addition of EEJ sequences to the reference genome. Annotation freeapproaches were later developed using procedures such as aligning reads to the genome and usingunmapped reads to find the introns between islands of short-reads (Tophat [151]), or splitting reads inhalf and using one half’s alignment to extend the second half and find the splice site (SpliceMap [153]),or splitting the reads into multiple tags and reconstructing the correct alignment by concatenatingtag alignments (MapSplice [154]).Both DNA and RNA alignments can provide information for the detection of SNVs. However,differences in allelic representation in each type of data creates additional limitations and challenges.Calling SNVs in DNA requires, for example, that the SNVs be present in enough cells in the sample.This may not happen when analyzing samples from highly heterogeneous tumours or samples affectedby chromosomal changes. In RNA, SNV calling is additionally restricted by the genes that are beingactively transcribed, and whether those genes are being subjected to post-transcriptional modificationssuch as RNA editing and nonsense-mediated decay. SNVMix was created to better deal with variabilityin the allelic representation of the underlying sequences. This characteristic made SNVMix a goodoption in the study of cancer, where differences in tumour content and copy number changes canmodify the representation of certain DNA sequences. Similarly, it made SNVMix ideal for RNA-SeqSNV calling, where differences in gene expression levels and allele-specific expression are importantsources of variability in sequence representation.As noted in Chapter 2, the performance of SNVMix and other callers strongly depends heavilyon the quality of alignments provided. When selecting an aligner to be used for RNA-Seq SNVcalling I considered several factors. First, to take full advantage of the probabilistic models I createdfor SNVMix (see Chapter 2), I sought an aligner capable of providing a probabilistic interpretationof the mapping quality of each aligned read. Additionally, to annotate SNVs into synonymous,non-synonymous, and nonsense mutations we needed gene annotations, which meant that noveldetection of splice sites was not required. I thus decided on a strategy using a DNA aligner on agenome reference expanded with EEJ references. The aligner options at the time were Maq [111],BWA [112], SOAP [114] and Bowtie [113]. Of these options, only Maq and BWA provided a truly553. BWA-R and SNVMix for improved variant calling in transcriptomesprobabilistic value for mapping quality2. I thus selected BWA as it had already generally replacedMaq due to algorithmic changes that increased its overall speed and performance.3.1.1 Paired-end read alignment in BWAAlignment of paired-end reads in BWA, as in many short-read aligners, is done as a two step process:reads are first aligned as single ended reads to the reference sequence and then each pair is evaluatedfor proper mapping (read pairing). Paired end reads from Illumina sequencing machines are sequencedfrom either end of a nucleotide template; it is thus expected that each read in a pair will align toopposing strands at a distance consistent with the fragment size distribution from which the sequencinglibrary was made. Read pairs fulfilling these characteristics (opposing strands and expected distancebetween pairs) are flagged as a proper pair. After the single-end alignment step, BWA goes througheach pair of reads and evaluates their possible alignments. Each read in a pair can either be unmapped,uniquely mapped or ambiguously mapped (i.e. multiple alignments were found). Each combination hasto be dealt with differently, as outlined below:• Both reads have unique alignment locations. This case provides only one possible alignment forthe pair. However, whether the read-pair is flagged as a proper pair depends on an adequatealignment orientation (opposite strands) and a distance between the alignments within theexpected fragment size.• One read has multiple equally probable alignments and the other read has a unique one. Inthis case, every location found for the ambiguously mapped read is evaluated for proper pairing(see above), if one location is found to provide a better pair than the rest (opposing strandsand a distance closer to the expected fragment size than the other possible alignments), thatone is selected. The resulting mapping quality of the ambiguously mapped read would then beassigned a lower value than a uniquely mapped read.• Both reads have multiple equally probable alignments. This case is an extension of the previousone, in which both sets of multiple alignments are considered to find the best possible pair.• Either of the reads is unaligned. This case results in single-end mappings. The resulting mappingquality of the aligned read is lower than those in proper pairs.2Bowtie did not provide probabilistic mapping quality until its later instalment, Bowtie2 [228].563. BWA-R and SNVMix for improved variant calling in transcriptomesA B CAs Ae Bs Be Cs CeAe BsBe Cs--chr1chr2chr3chr4. . .chrYBe Cs-Ae Bs-Figure 3.1: Exon-exon junction enhanced reference. A putative gene model composed of three exonsA, B and C is depicted at the top. The coloured boxes at the top of each exon mark the nucleotide sequencesthat would be needed to generate the set of EEJs. The start and end coordinates of exons are indicated beloweach exon. To build the expanded reference sequence, the EEJ sequences are appended as individual synthetic”chromosomes” at the end of the file containing genome sequences.When aligning RNA-Seq data to an EEJ augmented reference sequence, the target space to whichreads can be aligned is extended to include those nucleotide sequences formed by mRNA splicing events(see Figure 1.2). These sequences can be added at the end of the main genomic chromosome sequencesin a fasta file either in one large extra fasta entry or split into one entry per EEJ (Figure 3.1). Thisprovides the nucleotide data that BWA can use to search for the best alignment for reads, whetherthey come entirely from exons (genomic chromosomes) or from an EEJ (extended reference). However,BWA is unaware of what these sequences represent and so they are treated as extra chromosomes,which causes problems when read pairing.Consider a read pair where the first read (R1) aligns to an exon in chromosome 1 and the secondread (R2) aligns to an EEJ of the same gene. The pairing algorithm will consider R1’s alignment tobe in chromosome 1; however, as R2’s alignment is to an extended chromosome representing the EEJ(Figure 3.1), the coordinates considered by BWA for pairing are not in the same chromosome and thusBWA cannot make a proper pair. Furthermore, if R2 has a suboptimal alignment (e.g. with moremismatches) nearer R1’s alignment, the pairing algorithm would prefer that location even though ithas more mismatches. This would artificially increase the number of mismatches at that location,which may then cause erroneous SNV calls (see Section 3.2.3 for more details).573. BWA-R and SNVMix for improved variant calling in transcriptomesAdditionally, any reported alignment to an EEJ will need to be post-processed to relocate (orreposition) the reads to their correct genomic coordinates. This itself is not complicated; however, asthe alignment has already failed pairing, any benefit from proper pairing would have been lost (e.g.alternate locations for itself or its mate for better alignment would be lost). Additionally, reads thatcannot properly be paired will have their mapping quality capped at the single-end alignment quality,limiting their use for downstream applications that rely on mapping qualities. This would affect allread pairs in which one or both reads align to an EEJ.3.1.2 Sequence and alignment features affecting single nucleotide variant callingThe principal source of information for SNV calls are the reads overlapping the position presenting thealternative allele. SNVMix (see Chapter 2) was able to improve SNV calling by focusing solely on themapping quality of the reads overlapping putative SNVs and the base qualities of the correspondingbase. SNVMix initially outperformed Maq’s [111] SNV caller, which used post-processing of reads,extracting additional information regarding the manner in which the reads supporting the putativeSNV were aligned (I will call this contextual information). Maq [111] used five sources of contextualinformation to filter putative SNVs: proximity to a potential indel, a minimum number of readscovering the SNV position, at least one high-quality read supporting the SNV (Phred 60), a maximumnumber of SNVs within a short window, and a minimum SNV Phred quality. As aligners andSNV callers evolved, it became evident that further filtering of SNVs to remove known sources oftechnical artifacts resulted in a reduction of false positive (FP) calls. Packages such as VarScan [229],SAMtools [125] and GATK [230] applied heuristic filters to SNV calls to filter out putative artifacts.Mutation-seq [231] later presented a method using machine learning classifiers to build a model thatcould better identify somatic mutations given matched normal and tumour DNA alignments. Thesetools, their models and filters, were designed for SNV calling in DNA; as such, they did not considerRNA-Seq specific artifact sources.SNV calling using RNA-Seq data can also be affected by a series of sequence alignment problemsmainly caused by the improper alignment of reads generated from EEJs. Improper alignment of readsoriginating from EEJs may generate false calls near splice sites due to read ”overhang”, in which theread does not provide enough overlap with both exons and its edge is aligned outside the exon causingmismatches (Figure 3.4). Artifacts of this kind were later recognized by other groups who chose toignore SNVs near splice sites altogether [232].583. BWA-R and SNVMix for improved variant calling in transcriptomes3.1.3 Improving SNV calls using RNA-Seq dataWhen looking at SNVs in DNA, one can use gene annotations to infer the effect an SNV can have ona gene’s function. For example, an SNV may alter a protein’s function by changing an amino acid(missense mutation) or by interfering with its translation by creating an early stop codon (nonsensemutation). In cases where the SNV is homozygous, either copy of the gene being expressed wouldpotentially be affected by it; however, the SNV could be silent if it is removed by splicing or editingat the RNA level. On the other hand, in cases where the SNV is heterozygous, the effect of the SNVwould additionally depend on the allele being expressed. If the variant allele is not expressed, thenthe SNV is transcriptionally silent. That is, a non-synonymous or nonsense SNV predicted to alterthe encoded protein would not have an effect in a cell that expresses the wild type allele, or expressesthe mutated allele but removes the SNV through splicing or RNA editing. Thus, using RNA-Seq datato detect SNVs can provide information on the transcriptional activity of the SNVs. Additionally,RNA editing can post-transcriptionally alter RNA by substituting or removing single nucleotidesfrom a gene’s transcript [233]. While novel approaches were being developed to study this effect [234],increasing the performance of SNV callers using RNA-Seq data was needed to better exploit this typeof data for RNA editing studies. Furthermore, in cases with limited budget or sequencing capacity,a researcher might choose to only generate RNA sequencing data. Being able to call SNVs fromRNA-Seq data would enable them to exploit their data to detect expressed SNVs.The goal of the work presented in this chapter was to improve SNV calling using RNA-Seq data.As previously described, I based my alignments on BWA, a DNA aligner that can be used to mapRNA-Seq data when provided with the underlying annotation of EEJs. However, the data structureused to represent the reference sequence in BWA creates additional problems during the alignment ofpaired-end reads when using a EEJ extended reference. An additional challenge is the introduction ofartifacts caused when aligning reads to EEJs (e.g. alignments close to the actual splice sites bleedinginto an intron or overlapping exon). Both of these sources of alignment errors can have a negativeeffect on SNV calling, mainly an increase in false SNV calls. My hypothesis was that with betterread-pairing and using contextual information for SNVs, I could reduce the number of FP SNV callsfrom an RNA-Seq experiment while maintaining or increasing the true positive (TP) SNV calls. Theresulting SNV calls would contain less RNA-Seq derived artifacts and overlap more with DNA-basedSNV calls. To address the alignment problems, I first modified the generic BWA aligner to nativelyand efficiently handle an EEJ enhanced reference. This improved BWA’s ability to select proper593. BWA-R and SNVMix for improved variant calling in transcriptomespaired-end alignments, which in turn lowered artifacts caused by forcing good pairs at the expense ofmismatches. I then investigated the effects these changes had with respect to the basic BWA alignerand found an overall improvement in both the new alignments and in the resulting SNV calls. Ialso extended the SNVMix package to extract more information regarding the short-read evidencesupporting each SNV call (contextual information). Finally, I explored the use of the contextualinformation provided by SNVMix using both manual filters and machine learning classifiers to lowerthe number of spurious SNV calls.3.2 Methods3.2.1 DatasetTo evaluate the performance of my modified version of BWA, which I will refer to as BWA-R3, as wellas the corresponding performance changes in SNVs calling on RNA-Seq, I used 5 cancer datasets forwhich we had tumour whole genome sequencing and tumour RNA-Seq with the same read lengthsavailable [150]. Tumour genome libraries were used to determine a ground truth SNV dataset toevaluate the performance of SNV calling using the tumour RNA-Seq libraries. Maintaining a constantread length across libraries is required to avoid biases due to differences in base calling error profilesor read mappability.3.2.2 Using BWA with junction-enhanced reference genomeThe BWA [112] aligner was used to align both genome and transcriptome sequence data. For thegenome libraries, I aligned the reads against the human genome reference build NCBI36/hg18. Fortranscriptome alignments, the genomic reference was augmented by the sequences obtained fromEEJs. As previously discussed, this augmentation provided BWA the alignment space necessary toalign reads originating from the union of exons during splicing (Figure 1.2). EEJs were obtainedfrom several annotation databases including Ensembl (v54) [235], UCSC [236] and AceView [237].To minimize the creation of duplicated reference sequences that cause ambiguous alignments, EEJsequences were created so only reads crossing the junction would be aligned to them. To achieve this,I required reads to cross at least 4 bases into the EEJ. Thus, for a short read of length rlen, each EEJsequence is composed of (rlen − 4) bases obtained from each adjacent exon (Figure 3.2). Finally, to3BWA-R was suggested by Dr. Ryan Morin as shorthand for BWA for RNA-Seq.603. BWA-R and SNVMix for improved variant calling in transcriptomesOverlap = Oseq:A-BRead length = rlenJunction span(A - rlen + O),(B + rlen - O)Figure 3.2: Building exon-exon junction sequences. The figure shows how the nucleotide sequence froman EEJ is obtained. The boxes represent two adjacent exons and the line joining them indicates the spaceskipped by the EEJ. Each EEJ is named using the chromosome name and the inner-most exonic coordinates ofeach adjacent exon (seq:A-B). The split blue bars at the top describe the extreme and centre positions in whicha short read may align to the exons and the EEJ. The unsplit blue line shows a nearby read that would notcross the EEJ and instead aligns entirely to the exon. This read would align to the genome reference space.The shaded area of the exons denotes the nucleotides that would be used to generate the EEJ in the augmentedreference.avoid alignments overlapping two adjacent junction sequences, I used a padding of 40 N bases toseparate each junction sequence. Transcriptome alignments from BWA against a junction-enhancedreference results in reads aligned to targets representing normal chromosomes (e.g. chr1, chr2, chr3 ),as well as EEJs (e.g. chr10:103329564-103329995 ). These alignments have to be post-processed torelocate reads to the proper genomic coordinates as described below.3.2.3 Modifications for BWA read pairing introduced in BWA-RAlignments to EEJs (e.g. to chr10:103329564-103329995 ) need to be post-processed outside of BWA;however, this causes two main problems. Firstly, any secondary alignment locations are no longeravailable for consideration and any effects they would have in read pairing or mapping qualities arelost. Secondly, given that EEJ sequences are treated by BWA as different chromosomes, BWA isunable to properly pair reads, which, as will be detailed below, may increase the number of alignmentswith mismatches causing FP SNV calls. To help with these problems I modified the BWA package asdescribed below.BWA’s alignment algorithm is based on the Burrows-Wheeler transform (BWT), originally designedas a method for data compression [227]. BWT allowed the representation of mammalian-sized genomesin RAM (Random Access Memory) in a searchable data structure [238]. Its use for efficient alignmentswas expanded with dynamic programming algorithms [239], which were adopted and expanded by613. BWA-R and SNVMix for improved variant calling in transcriptomesBWA [112]. To efficiently map short reads to a reference genome using BWT, BWA requires that allof the target sequence be represented in RAM. As such, a first step when building the BWT index isto concatenate all the sequences present in the genome reference. Alignments in BWA are done withrespect to this concatenated reference, only translating to chromosomal coordinates when creatingthe final output. I will now describe how read pairing is done in BWA, how it causes problemswhen aligning RNA-Seq data to an EEJ extended reference sequence, and what changes I included inBWA-R to generate superior alignments to improve SNV identification using RNA-Seq data.Let g =−1 denote a genomic sequence of length l where ai ∈ {A,C,G, T}, and let G =(g1, g2, g3, ..., gC) be a genome reference composed of C chromosomes. The genome representation of Gused by BWA is then defined as the string obtained by concatenating the sequences g ∈ G:XG = g1g2...gC= x0x1x2...xLg−1,where x ∈ {A,C,G, T} and Lg =∑Ci=1 li. All alignments in BWA will be done with respect to the referenceXG to positions in the [0, Lg − rlen] range, where rlen denotes the length of the reads being aligned.During read pairing, evaluation of the location of each read will be done with respect to XG, theinsert size (length of fragment between sequencing adapters) can thus be directly calculated asi = abs(pr1 − pr2 ) + rlen, where pr{1,2} denotes the position where reads 1 and 2 align in XG regardless ofthe original chromosome. The fragment size selected during sequencing library construction dictatesthe expected distance at which two reads in a pair will be aligned from each other. Thus, by defininga maximum fragment size imax we can mark any read pairs where i > imax as bad pairs. Due tomulti-megabase chromosome sizes, any read pairs aligning to different chromosomes will have i >> imaxand thus cannot be tagged as good pairs4. Other considerations for good pairs are related to strandand read orientation. Let sr{1,2} ∈ {+,−} denote the strand where each read in a pair aligns, and pr{+,−}represent the positions where the forward and reverse strand alignments are located; then only readpairs where sr1 6= sr2 and pr+ < pr− can be marked as good pairs.Read pairing may select a suboptimal alignment over a perfect one if doing so creates a goodpaired end mapping. A suboptimal alignment is one that contains mismatches, insertions or deletionswith respect to the reference sequence. When multiple alignments of r1 and/or r2 fulfill strand and4It is noteworthy that, by definition, BWA also tags read pairs originating from legitimate gene fusion events causedby chromosomal aberrations as bad pairs. Specialized packages for gene fusion detection have to evaluate these eventsseparately.623. BWA-R and SNVMix for improved variant calling in transcriptomesorientation requirements, then the pair which provides the best alignment score is chosen. Alignmentscores are calculated by BWA as:q = mm ∗mmp + go ∗ gop + ge ∗ gep,where mm, go and ge correspond to the number of mismatches, gap openings and gap extensions inan alignment; and mmp, gop and gep their respective penalties. Each mate m ∈ {1, 2} in a pair may bemappable to zero or more locations on the reference sequence, resulting in two sets of alignmentoptions, R1 and R2. Combinations of alignment options (R1 ×R2) fulfilling sr1 6= sr2 and pr+ < pr− canbe considered for pairing. Let qm,k denote the score of alignment k of mate m, and consider a read pairwhere r1 maps uniquely and r2 has two alignment options (r2,1 and r2,2) with scores q2,1 > 0 and q2,2 = 0.During pairing, if r2,1 provides an insert size compatible with that specified at library construction(i < imax) and r2,2 causes a large insert size (i > imax), then the alignment with the mismatches (r2,1)will be selected as the final alignment for r2. This illustrates how the pairing process can affect thenumber of base mismatches reported in the final alignments. Finally, the mapping qualities are thenset according to the number of optimal and suboptimal alignments which were considered for eachread (see Figure B.1).When working with transcriptome alignments the genome reference G is extended using EEJsequences. Let J = (j1, j2, j3, ..., jn) be a set of n EEJs ji = ”seqi” : ”Ai” − ”Bi”, where seqi ∈ G, and Aand B are the upstream and downstream positions being joined (e.g. ”chr1:2345-6789”, Figure 3.2).T = (t1, t2, t3, ..., tn) is the set of EEJ sequences ti obtained by concatenating the rlen − overlap adjacentbases from either exon defined in ji for ji ∈ J (see Section 3.2.2 and Figure 3.2). I then define theconcatenation of sequences t ∈ T as the string XT = of length LT = 2∗n∗(rlen−overlap). AppendingT to the genome reference G, BWA’s extended reference can then be represented as the stringXE = XGXT= x0x1x3...xLG−1xLGxLG+1xLG+2xLG+3...xLG+LT−1,to which reads are mapped in the range [0, (LG + LT − rlen)]. Thus, reads representing sequencesoriginating fully from genomic locations (in mRNA these would be exons and retained introns) wouldbe expected to align in the range [0, (LG − rlen)] and those originating from EEJs would be mapped633. BWA-R and SNVMix for improved variant calling in transcriptomesto the range [LG, (LG + LT − rlen)]. Consider then two 50 basepair reads r1 and r2, where r1 originatesfrom g1 : 50000 and r2 originates 75 bases downstream at g1 : 50075 but crosses the EEJ located atg1 : 50100− 51100, corresponding to t1. BWA’s alignment would be then represented in XT coordinates aspr1 = 50000 and pr2 = LG + 25, giving an insert size i = pr2 − pr1 + rlen = LG + 50025. For the human genomehg18, LG = 3, 080, 436, 051, making i >> imax for any reasonable value of imax, therefore (r1, r2) cannot bemapped as a good pair. Given the large value of LG, with the exception of alignments between theend of gC and t15, any pair in which a read aligns to a sequence ti will not be considered a good pair.If no other alignment exists then only the mapping quality is affected (see Figure B.1); however, asexplained above, if suboptimal alignments with mismatches exist, they might be preferred if theyprovide a location where i < imax, sr1 6= sr2 and pr+ < pr− . In this way, appending EEJ sequences to agenome reference can have a detrimental effect on read alignment. I will now describe changes I madeto BWA that enable it to properly pair these types of reads.BWA first aligns reads in single end mode, locating positions on the concatenated reference Xwhere each reads maps. These single end alignments are constrained by user defined parameters suchas the maximum number of mismatches, gap openings, gap extensions and alignment hits. Aftersingle end alignment is done, read pairing can start. Inspecting BWA’s source code one can extractthe pairing algorithm, which I described below in Algorithm 1.Algorithm 1 BWA’s pairing algorithm.1: for R← READS do2: R1 ← SingleEnd(R,m = 1)3: R2 ← SingleEnd(R,m = 2)4: S ← Sort(R1, R2)5: optimal1, optimal2, u1, u2 ← NULL6: for alignment in S do7: if alignment on Forward strand then8: m← alignment’s read index ({1, 2})9: um ← alignment10: else if alignment on Reverse strand then11: m← alignment’s mate read index12: if um is not NULL then13: optimal1, optimal2 ← EvaluatePair(um, alignment)14: end if15: end if16: end for17: return optimal1,optimal218: end for5Depending on what EEJ is represented in t1 with respect to gC , read-pairing could be an error.643. BWA-R and SNVMix for improved variant calling in transcriptomesLines 4 and 13 of Algorithm 1 have to be modified for proper evaluation of reads aligning to EEJsequences. The Sort() function in line 4 uses coordinates in BWT space X. However, to properlyevaluate EEJ mapped reads, the function needs to have access to the true genomic location of theEEJ alignments. Similarly, to properly evaluate the insert size, the EvaluatePair() function referencedin line 13 needs to have access to the end of the alignment of the forward mapping read in the pair(p+, the read with the lowest mapping position). That is, the insert size corresponds to the spacebetween the inner coordinates of the read pair alignment plus the length of both reads. Positionsare always reported on the forward strand, thus p+ and p− denote the positions of the reads in a pairthat align to the forward and reverse strands respectively. Then, for genomic alignments, the firstinner coordinate can be calculated as p+ + rlen + I − D, where I and D are insertions or deletions inthe alignment, and the second coordinate can be calculated as p−. When the read on the forwardstrand overlaps a junction, the end of the alignment is affected by the number of bases inserted bythe presence of an intron, making the resulting end of the alignment p+ + rlen+ I −D +N.To be able to efficiently calculate the corresponding genomic position of a read aligned to anEEJ, as well as the proper value of N, I modified BWA in the following ways. First, I defined aset of parameters with which the EEJ expanded reference has to be built: the read length (rlen),the overlap (joverlap) and the padding between EEJ sequences (jpad). I then used these parametersto add a data structure jmap where the necessary information to decode the genomic coordinatesof every EEJs in the BWT index is stored. After scanning all the sequences corresponding to thegenome reference G, each EEJ t ∈ T was parsed and a reference to it added to jmap. Each reference tot ∈ T in jmap contains values extracted from the name identifier of t, the offset in X corresponding tothe chromosome sequence where it originated (Goffset), as well as both ends of the EEJ in genomiccoordinates (A and B). Given that each EEJ has a fixed nucleotide length:EEJlen = 2 ∗ (rlen− joverlap) + jpadand all EEJs were included in jmap in the order in which they appear in the reference X, having analignment in X provides access to the genomic location in constant time as follows. For a read alignedat coordinate p where p > LG, the array index tidx in jmap for the corresponding EEJ t can be found by:tidx = floor((p− LG)/EEJlen)653. BWA-R and SNVMix for improved variant calling in transcriptomesand the position of the read within the EEJ sequence can be then found by:tpos = floor((p− LG)%EEJlen).With both values, the corresponding position of p in G can be obtained by:trel = jpad+ rlen− joverlap− tpospG = jmap[tidx][Goffset] + jmap[tidx][A]− trelwhere trel corresponds to the relative shift from the EEJ. In the same operation, the correspondingvalue of skipped nucleotides caused by the intron can be obtained by:N = jmap[tidx][B]− jmap[tidx][A]− 1The above calculations were adapted to the BWA code structure in a way that any read aligned tothe transcriptome space would be transformed adequately and seamlessly, permitting read pairingand further manipulations of the alignments to happen with the correct genomic location.3.2.4 Alignment performance metricsTo determine the improvement of alignments obtained through my modified version of BWA, I alignedRNA-Seq libraries to the same junction-augmented reference described in Section 3.2.2, using boththe the unmodified (BWA) and modified (BWA-R) aligners. I compared the resulting BAM files andcalculated values for the following metrics, which I designed to assess BWA-R’s performance:• Alignment change (reads): number of reads that had their alignment changed. Reads canhave their alignment changed by being mapped to a different genomic coordinate or by a changein their CIGAR string. A CIGAR (Compact Idiosyncratic Gapped Alignment Report) string isa way to represent the manner in which an alignment was made. It is composed of pairs valuesof lengths and operations; indicating the number of bases that had to be matched (M), inserted(I), deleted (D), snipped (S) or skipped (N). For example, an alignment to an EEJ of a 50bpread might yield the CIGAR string ”18M3456N32M”, indicating 18 bases were aligned, thenthe read was split over an gap of 3456 bases (the intron) and then the remaining 32 bases werealigned.663. BWA-R and SNVMix for improved variant calling in transcriptomes• Ambiguity loss/gain (reads): reads aligned to multiple locations with equal quality andwith no added evidence of their true location are marked as being ambiguous. These reads areusually ignored as little information can be recovered from them. During the pairing process aread with ambiguity in single-end mode can become mappable if its mate has good mappingquality and can thus be used as an anchor for its true position. Ambiguity loss refers to thosereads that are ambiguous when using the default BWA but mappable using BWA-R. Ambiguitygain is the opposite, when aligning with BWA-R causes reads to become ambiguous. A betterpairing algorithm would increase ambiguity loss by providing better alignment options, whilemaintaining ambiguity gain low.• Chromosome fixed/broken (pairs): reads in a pair can be aligned to the same chromosomeor to different chromosomes. When a read pair aligned in BWA is split between chromosomes, butis aligned to the same one on BWA-R, it is considered as a chromosome fixed pair; the oppositeis counted as chromosome broken. Although read pairs that map to different chromosomes canbe used to detect chromosomal rearrangements and gene fusions, these alignments should not beinfluenced by the changes in the way the junction sequences are managed in BWA-R. Thus, inthis comparison I consider that better read pairing would result in a larger count of chromosomefixed than chromosome broken.• Gene pair loss/gain (pairs): in paired-end RNA-Seq, both reads in a pair are obtained fromthe same cDNA fragment, which in turn originates from a single mRNA molecule. Unless theoriginal mRNA existed as a chimeric RNA (e.g. a gene fusion) or a technical artifact, we canexpect both reads in a pair to align back to the same gene. Akin to the previous chromosomefixed/broken measurement, read pairs aligned to different genes in BWA and the same gene inBWA-R are counted as gene pair gain and gene pair loss for the reverse. Better performance inread pairing would be seen as an increase of pairs aligning to the same gene (larger gene pairgain) while minimizing gene pair loss.• Mismatch gain/keep/loss (reads): central to the purpose of the BWA-R modifications isthe control of artifacts introduced in the form of mismatches. With single-end reads, alignmentswith fewer mismatches are preferred; however, when aligning paired-end reads, a worse alignmentfor a read (i.e. more mismatches) may be selected if a better pairing is possible. A limitedability to pair reads, such as that found when using BWA with a junction-enhanced reference673. BWA-R and SNVMix for improved variant calling in transcriptomes(see Section 3.1.1), may therefore artificially introduce mismatches. Alternatively, a better readpairing algorithm may also introduce mismatches as new alignment positions become availablefor pairing. When comparing the final alignment of a read obtained using BWA or BWA-R itcan be determined whether the alignment gained, lost or kept its number of mismatches. Thenumber of reads falling into each category is represented by mismatch gain, mismatch loss andmismatch keep respectively. Better performance in read pairing would maximize mismatch loss,while minimizing mismatch gain.• Mismatch gain/keep/loss ambiguity gain/loss/none/kept (reads): for more preciseSNV calling, one would want an overall reduction in the number of mismatches (mismatchloss >mismatch gain). However, changing the pairing space for reads may also affect a read’sambiguity status. For example, some reads that previously were ambiguously mapped may nowbe unambiguously mapped through pairing. This would affect the influence of the correspondingread in SNV calling. To gauge the effect that this has on SNV calli