UBC Theses and Dissertations


Algorithms and applications of next-generation DNA sequencing : ChIP-Seq, database of human variations,… Fejes, Anthony Peter 2012

Algorithms and Applications of Next-Generation DNA Sequencing
ChIP-Seq, database of human variations, and analysis of mammary ductal carcinomas

by

Anthony Peter Fejes

Bachelor of Science, Biochemistry (Hons. Co-op), University of Waterloo, 2000
Bachelor of Independent Studies, University of Waterloo, 2001
Master of Science, Microbiology & Immunology, The University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

The University Of British Columbia (Vancouver)

April 2012

© Anthony Peter Fejes, 2012

Abstract

Next Generation Sequencing (NGS) technologies enable Deoxyribonucleic Acid (DNA) or Ribonucleic Acid (RNA) sequencing to be done at volumes and speeds several orders of magnitude greater than Sanger (dideoxy termination) based methods, and have enabled the development of novel experiment types that would not have been practical before the advent of NGS-based machines. The dramatically increased throughput of these new protocols requires significant changes to the algorithms used to process and analyze the results. In this thesis, I present novel algorithms used for Chromatin Immunoprecipitation and Sequencing (ChIP-Seq), as well as the structures required and challenges faced when working with Single Nucleotide Variations (SNVs) across a large collection of samples. Finally, I present the results obtained from an NGS-based analysis of eight mammary ductal carcinoma cell lines and four matched normal cell lines.

Preface

The work described in this thesis is based entirely upon research done at Canada’s Michael Smith Genome Sciences Centre (BCGSC) in Dr. Steve J.M. Jones’ group by Anthony Fejes.
Two exceptions to this statement are the work on the DICER1 gene and the Motif Identification for ChIP-Seq Analysis (MICSA) software package, both of which involved collaborative work for which Anthony Fejes was granted co-authorship on subsequent publications. Contributions for each collaboration are detailed below.

Work on chapter 2 was done by Anthony Fejes, with code contributions by Timothee Cezard, and with the guidance of Drs. Gordon Robertson and Mikhail Bilenky. Code contributions consist of the Lander-Waterman algorithm (implemented by Dr. Bilenky and merged into the FindPeaks code repository by Timothee Cezard), as well as numerous bug fixes contributed by Timothee Cezard. The work in this chapter is, in part, published in an application note, written by Anthony Fejes:

A.P. Fejes et al. “FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology”. In: Bioinformatics 24 (Aug. 2008), pp. 1729–1730

Background information and a literature review on Chromatin Immunoprecipitation and Sequencing (ChIP-Seq) was published in the textbook:

A. P. Fejes and S. J. Jones. “Chip-Seq: Mapping of Protein-DNA Interactions”. In: Next-Generation Genome Sequencing: Towards Personalized Medicine. Ed. by Michal Janitz. Wiley, John & Sons, November 2008

Discussion of the MICSA software and extensions to the FindPeaks package in chapter 2 describes collaborative work directed by Valentina Boeva of the Institut Curie. Contributions to this publication included the completed FindPeaks 3.3 software, used as the basis for the MICSA algorithms, as well as support and consultations with the co-authors.

V. Boeva et al. “De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis”. In: Nucleic Acids Res. 38 (June 2010), e126

Work on chapter 3 was done by Anthony Fejes, with contributions to insertion/deletion processing by Alireza Hadj Khodabakhshi.
An He assisted by automating and importing the data sets into the database. The work in this chapter is, in part, published in an application note, written by Anthony Fejes with contributions by Alireza Hadj Khodabakhshi:

A. P. Fejes et al. “Human variation database: an open-source database template for genomic discovery”. In: Bioinformatics 27 (Apr. 2011), pp. 1155–1156

Details of the exact contributions of developers to the code base discussed in chapter 2 as well as chapter 3 can be obtained through the code repository at http://vancouvershortr.svn.sourceforge.net/viewvc/vancouvershortr.

Discussion of the work on the dicer 1, ribonuclease type III (DICER1) gene was performed in collaboration with Alireza Moussavi in the Huntsman lab. Contributions to this work included customized searches to assist in the identification and classification of recurrent variations and access to the software and database described in chapter 3. The work described in chapter 3 has been published:

A. Heravi-Moussavi et al. “Recurrent Somatic DICER1 Mutations in Nonepithelial Ovarian Cancers”. In: N Engl J Med (Dec. 2011)

Work on chapter 4 was done by Anthony Fejes, using data generated at the B.C. Genome Sciences Centre, with the assistance of Richard Varhol (Alignment), Nina Theissen (Single Nucleotide Polymorphism (SNP)-calling) and Karen Mungal and Readman Chu (Ribonucleic Acid (RNA) Assembly). This work has not been published.

Work on chapter 5 was done by Anthony Fejes (bioinformatics) and Steven Leach (wet lab work, including Sanger sequencing), with the exception of the screening of the panel of ductal carcinomas, which was organized by Sohrab Shah and Kane Tse, and analyzed by Anthony Fejes. This work has not been published.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
List of Acronyms
Acknowledgements

1 Introduction
  1.1 Research Presented
    1.1.1 Research not Included
    1.1.2 Outline
    1.1.3 Goals of this Research
  1.2 Vancouver Short Read Analysis Package
    1.2.1 Open Source Bioinformatics
    1.2.2 Why do Open Source Bioinformatics?
    1.2.3 Libraries
    1.2.4 Availability
  1.3 ChIP-Seq
    1.3.1 Background
    1.3.2 History of Chromatin Immunoprecipitation
    1.3.3 Medical Applications of ChIP-Seq
    1.3.4 Challenges
    1.3.5 Future Uses of ChIP-Seq
    1.3.6 FindPeaks
  1.4 Variation Database
    1.4.1 Variations
    1.4.2 Pipelines Producing SNP- and SNV-Calls
    1.4.3 Single Nucleotide Polymorphism Databases
  1.5 Relational Databases
    1.5.1 SQL Access
    1.5.2 Postgresql
    1.5.3 ODBC
    1.5.4 Keys and Indexing
    1.5.5 Hardware Performance
    1.5.6 Optimization
    1.5.7 Background on the Variation Database
  1.6 Breast Cancer
    1.6.1 Molecular Subtypes
    1.6.2 ATCC Mammary Ductal Carcinoma Cell lines
    1.6.3 Epstein-Barr/B-Cell Derived Matched Normals
    1.6.4 Research Done
    1.6.5 Recurrent Variations
    1.6.6 Purpose
  1.7 Notch Genes, Strawberry Notch and the Epidermal Growth Factor
    1.7.1 Pathways
    1.7.2 Notch Signalling

2 ChIP-Seq
  2.1 FindPeaks
  2.2 Paired End Tag versus Single End Tag ChIP-Seq
  2.3 Read Length Modelling
    2.3.1 Native Lengths - No Extension
    2.3.2 Hard Extension
    2.3.3 Triangle Distribution
    2.3.4 Read Shifting
  2.4 Peak Calling
    2.4.1 Trimming Peaks
    2.4.2 Peak Separation
  2.5 False Discovery Rates and ChIP-Seq Controls
    2.5.1 Sources of Error
    2.5.2 Simulated Control - Monte Carlo
    2.5.3 Simulated Control - Lander-Waterman
    2.5.4 Minimal Biological Control - Null Immunoprecipitation Control
    2.5.5 Biological Control
  2.6 Comparing ChIP-Seq Experiments
    2.6.1 Normalization of ChIP-Seq Results
    2.6.2 Normalization by Equivalent Peaks
    2.6.3 Limitation of Normalization by Equivalent Peaks
    2.6.4 Statistics
    2.6.5 Post-Normalization Processing
  2.7 Analysis of Normalized Samples
    2.7.1 Comparison by Ratio - Method of Perpendicular Lines
    2.7.2 Comparison by Equivalent Areas - “Method of Hyperbolic Sections”
  2.8 Example - Extending FindPeaks
    2.8.1 Method
    2.8.2 EWS-FLII
  2.9 Summary

3 Variation Database
  3.1 SNVs and INDELs
  3.2 Methods
    3.2.1 Novel Functions
    3.2.2 Data
    3.2.3 Graphic Output
    3.2.4 Input Formats
    3.2.5 Library Information
    3.2.6 Variation Annotations
  3.3 Design
    3.3.1 Design Philosophies
  3.4 Modularity
    3.4.1 Database Application Programming Interface
    3.4.2 File Iterators
    3.4.3 User Interface
  3.5 Common Use-Cases and User Interactions
    3.5.1 Querying
    3.5.2 ExperimentalRecord
    3.5.3 Concordance
    3.5.4 Modifying the Application Programming Interface (API) and the User Interface (UI)
    3.5.5 Ensembl
  3.6 Applications Using the Variation Database
    3.6.1 Filtering Polymorphisms
    3.6.2 Filtering Recurrent Variations
    3.6.3 Filtering to Identify Cancer Drivers
    3.6.4 Variations Only Found in Cancer
    3.6.5 Variations Never Found in Cancer
    3.6.6 RNA Editing
    3.6.7 Transition and Transversion Frequency
    3.6.8 Growth of the Database
  3.7 Future Plans for the Database
    3.7.1 Expansion of the Database

4 Mammary Ductal Carcinoma Cell Lines
  4.1 Methods
    4.1.1 Cell Lines
    4.1.2 Reads Aligned
    4.1.3 Bioinformatics
  4.2 Properties of Cell Lines
    4.2.1 Concordance with dbSNP
    4.2.2 Matched Normals
    4.2.3 Relationships Between Cell Lines
    4.2.4 Transitions, Transversions and RNA Editing
    4.2.5 Exome Capture vs. RNA
    4.2.6 BRCA1/2 Status
    4.2.7 Common Phenotypic Markers of Breast Cancer
    4.2.8 Single Nucleotide Variations
    4.2.9 Variant Frequency
    4.2.10 Recurrent Overlaps Between Cell Lines
    4.2.11 Verification
    4.2.12 Assembly

5 EGF and Notch Involvement in Ductal Carcinoma
  5.1 Methods
    5.1.1 Cell Lines
    5.1.2 Verification
    5.1.3 Expression Data
    5.1.4 Panel of Mammary Ductal Carcinomas
  5.2 Results
    5.2.1 Cell Line Characterization
    5.2.2 Inter-se Analysis
    5.2.3 Pathway
    5.2.4 Recurrence of Strawberry Notch Variations
    5.2.5 HCC1500 Exhibits Few Mutations
    5.2.6 Interleukin Expression

6 Conclusions
  6.1 Bioinformatics and Next Generation Sequencing
  6.2 FindPeaks
    6.2.1 Contributions
    6.2.2 Who is Using it?
    6.2.3 Strengths of the Software
    6.2.4 Limitations of the Software
    6.2.5 Potential Applications of the Software
    6.2.6 Future Directions
  6.3 Variation Database
    6.3.1 Contributions
    6.3.2 Usage
    6.3.3 Strengths of the Software
    6.3.4 Limitations of the Software
    6.3.5 Potential Applications of the Software
    6.3.6 Future Directions
  6.4 Mammary Ductal Carcinoma Cell Lines
    6.4.1 Contributions
    6.4.2 Who is Using the Results?
    6.4.3 Limitations of the Cell Lines Studied
    6.4.4 Future Directions
  6.5 SBNO1 and EGF
    6.5.1 Summary of Results Presented
    6.5.2 Contributions Made by Studying Breast Cancer Cell Lines
    6.5.3 Strengths of the Information Discussed
    6.5.4 Limitations of the Information Discussed
    6.5.5 Future Directions

References

Appendices
A Publications using FindPeaks to Generate Results
B Publications Discussing FindPeaks
C Publications Citing FindPeaks
D Events Detected in Ductal Carcinoma Cell-Lines
  D.1 Alternate Splicing Events
  D.2 SNV and INDEL events from Ductal Carcinoma Cell Lines
E Indels and NS-SNVs in Proposed EGF Pathway
F Genes with Non-Synonymous, Non-Polymorphic Variations in Ductal Carcinoma Cell Lines

List of Tables

Table 4.1  Tumour Type of Origin for Cancer-Derived Cell Lines
Table 4.2  Summary of Sequenced Cell Line Names and Reads
Table 4.3  List of Verification Primers
Table 4.4  Comparison of RNA-Derived Single Nucleotide Variations (SNVs) Observed in Cell Lines and Matched Normals
Table 4.5  Comparison of Deoxyribonucleic Acid (DNA)-Derived SNVs Observed in Cell Lines and Matched Normals
Table 4.6  Status of Common Breast Cancer Markers in Cell Lines Studied
Table 4.7  Filtering Non-Synonymous Mutations
Table 4.8  List of Genes with High Confidence Mutations
Table 5.1  Adaptor Oligonucleotide Sequences
Table 5.2  Expression of the SBNO1 Gene mRNA
Table 5.3  mRNA Expression of Epidermal Growth Factor and its Receptor Genes
Table 5.4  mRNA Expression of the Notch Genes
Table 5.5  mRNA Expression of Genes Regulating EGF
Table 5.6  Expression of Notch Pathway mRNA
Table D.1  List of Productive Fusion Events Observed
Table D.2  List of Non-Productive Fusion Events Observed
Table D.3  List of Alternative Splicing Events Observed in 3 or More Cell Lines
Table D.4  List of Alternative Splicing Events Observed in 3 or More Cell Lines (continued)
Table D.5  List of Alternative Splicing Events Observed in 3 or More Cell Lines (continued)
Table D.6  Merged List of SNVs from Alignment and Insertions and Deletions (INDELs) from Assembly
Table E.1  Non-Synonymous SNVs in Annotated Pathway
Table E.2  Indels in Annotated Pathway
Table F.1  All Genes with Non-Synonymous Non-Polymorphic Variations in Cell Lines

List of Figures

Figure 1.1  Chip-Seq Protocol
Figure 2.1  Read Model Distributions Available in FindPeaks
Figure 2.2  Triangle Distribution
Figure 2.3  Trim Algorithm Visualization
Figure 2.4  Subpeak Algorithm Visualization
Figure 2.5  Visualization Aid for Normalization by Total Reads
Figure 2.6  Symmetrical Regression in FindPeaks 4.0
Figure 2.7  Visualization of Method of Comparing Samples by Hyperbolic Sections
Figure 2.8  MICSA Algorithm Outline
Figure 2.9  Performance Comparison of MICSA and Peakfinders
Figure 3.1  Illustration of API and UI Layers for the Human Variation Database
Figure 3.2  HTML Summary of Experimental Record
Figure 3.3  Non-Synonymous Experimental Record Summary Report
Figure 3.4  Long Form of the Experimental Record
Figure 3.5  Plot of Variants Appearance in Cancer and Normal Data Sets
Figure 3.6  Plot of Known RNA Edits Appearance in Cancer and Normal Data Sets
Figure 3.7  Observed Transitions versus Transversions (unique variations)
Figure 3.8  Transitions versus Transversions (all observations)
Figure 3.9  Growth of the Variation Database from January 8, 2010
Figure 3.10  Proposed Arrangement for Expansion of the Variation Database
Figure 4.1  dbSNP Concordance for Each Cell Line Studied
Figure 4.2  Similarity of SNVs between Cell Lines Studied
Figure 4.3  Clustering of Cell Lines Studied
Figure 4.4  Frequency of Base Substitutions in Cell Lines Studied
Figure 4.5  Division of Intragenic Variants by Cell Line
Figure 4.6  Visualization of Variations across BRCA1 and BRCA2 Genes
Figure 4.7  ESR1 Gene Visualization
Figure 4.8  Variant Frequency
Figure 4.9  Venn Diagram Illustrating the Overlap of High Confidence Variations
Figure 4.10  Verification of High Confidence Variations
Figure 5.1  Exon Coverage of SBNO1
Figure 5.2  Proposed Interaction Pathways of Proteins Around SBNO1
Figure 5.3  EGF vs EGFR Expression
Figure 5.4  Expression Levels of Notch Pathway Genes
Figure 5.5  Distribution of SBNO1 F348V Mutation by Count
Figure 5.6  Distribution of SBNO1 F348V Mutation as Percentage
Figure 5.7  Expression of Interleukin mRNA
Figure 5.8  Expression of Interleukin Receptor mRNA

Glossary

dbSNP
  the database of human SNP. Found at http://www.ncbi.nlm.nih.gov/projects/SNP/, last accessed April 13th, 2012.

exome capture
  A method that employs tethered probes (e.g. a micro-array) to hybridize and retain fragments of DNA of interest that originate in known coding sections of genes. These fragments can be retained while all other fragments are washed away, allowing selective capture of targeted genomic regions.

fragment
  original piece of DNA or RNA collected from the sample, presumably via a size selection process.

integrated development environment
  A software application that integrates many individual development components (editor, debugger, testing, etc) into a single graphical interface.

polymorphism
  A DNA variation in the genome that occurs at some frequency in the human population. Also used to describe common variations that differ from the human reference genome, when the population frequency of the variation is unknown.

read
  The portion of the DNA or RNA fragment that has been sequenced.

transition
  A change in DNA nucleotide that replaces a purine with a purine, or a pyrimidine with a pyrimidine.

transversion
  A change in DNA nucleotide that replaces a purine with a pyrimidine or vice versa.
List of Acronyms

ATCC      American Type Culture Collection
AOE       Areas of Enrichment
API       Application Programming Interface
BAM       Binary alignment/map
BCGSC     Canada’s Michael Smith Genome Sciences Centre
BLAST     Basic Local Alignment Search Tool
BWA       Burrows-Wheeler Aligner
BWT       Burrows-Wheeler Transform
ChIP      Chromatin Immunoprecipitation
ChIP-Seq  Chromatin Immunoprecipitation and massively parallel Sequencing
CNV       Copy Number Variation
COSMIC    Catalogue of Somatic Mutations in Cancer
CPU       Central Processing Unit
DARNED    Database of RNA Editing
DCIS      Ductal Carcinoma in Situ
DNA       Deoxyribonucleic Acid
EBV       Epstein-Barr Virus
ELAND     Efficient Local Alignment of Nucleotide Data
ENSJ      Ensembl Java API
ER        Estrogen Receptor
FDR       False Discovery Rate
GPL       GNU Public License
GWAS      Genome Wide Association Studies
HTML      Hypertext Markup Language
IDC       Invasive Ductal Carcinoma
IGV       Integrative Genome Viewer
ILC       Invasive Lobular Carcinoma
INDEL     Insertion and Deletion
INDELs    Insertions and Deletions
Jar       Java Archive
JDBC      Java Database Connectivity
LIMS      Laboratory Information Management System
MACS      Model-based Analysis for ChIP-Seq
MAQ       Mapping and Assembly with Quality
MB        Megabyte (1,048,576 bytes)
MeDIP     Methylated DNA Immunoprecipitation
MEME      Multiple Em for Motif Identification
MICSA     Motif Identification for ChIP-Seq Analysis
mRNA      messenger Ribonucleic Acid
NGS       Next Generation Sequencing
NOD       NOTCH Protein Domain
ODBC      Open Database Connectivity
OMIM      Online Mendelian Inheritance in Man
OSI       Open Source Initiative
PCR       Polymerase Chain Reaction
PET       Paired End Tag
RAID      Redundant Array of Independent Disks
RAM       Random Access Memory
RNA       Ribonucleic Acid
RNA-Seq   RNA Isolation and Massively Parallel Sequencing
ROC       Receiver Operating Characteristic
SABE      Serial Analysis of Binding Elements
SAGE      Serial Analysis of Gene Expression
SET       Single End Tag
SNP       Single Nucleotide Polymorphism
SNV       Single Nucleotide Variation
SQL       Structured Query Language
UI        User Interface
UCSC      University of California, Santa Cruz
UTR       Untranslated Region
VCF       Variant Call Format
VDB       Variation Database
VSRAP     Vancouver Short Read Analysis Package

Acknowledgements

If I have seen farther it is by standing on the shoulders of Giants.
Sir Isaac Newton (1855)

First and foremost, I’d like to thank my wife Elaine for her patience and support, and my family for their encouragement throughout the course of my post-graduate education. I would also like to thank my committee: Steve Jones, Angela Brooks-Wilson, Paul Pavlidis and Marco Marra.

Several people have been willing sounding boards for difficult challenges: Timothee Cezard, who assisted with many of FindPeaks’ algorithm designs, Richard Varhol, Yvonne Li and Olena Morozova, who provided great advice on a wide variety of topics, and Simon Chan, with whom I’ve shared many excellent brainstorming sessions. I would also like to thank Jianghong An for his guidance through the work with the Ductal Carcinoma project and Stephen Leach for his collaboration on the project.

Several other people have made significant contributions to my projects over the years, including Matthew Bainbridge, who provided me with a template for developing on the Java platform, as well as the first version of FindPeaks, and An He, who tirelessly populated the Human Variation Database. There is also a list of people (far too long to include) who have used my software and provided suggestions for improvements and bug reports where improvements were necessary, and a list of anonymous people who provided many hours of help and advice on postgresql tuning and optimization through the freenode IRC #postgres channel.

I would also like to thank the Michael Smith Foundation for Health Research for their support for the duration of the Trainee Award, and my supervisor for his support both before and after the term of the award.
Finally, I would like to acknowledge the usually anonymous people who donated cells or genetic material used during the research and the many un-credited people who work tirelessly in the lab to generate the sequencing data used in the experiments.

Chapter 1
Introduction

This thesis is composed of four main sections, corresponding to chapters 2 through 5: the first two focus on software applications, while the last two focus on the biology of mammary ductal carcinomas. The first two sections discuss separate software applications that share a common code-base, while the latter two discuss separate components of a single research project performed on a common data set composed of ductal carcinoma cell lines. Despite the divide between the software of the first two sections and the biology of the last two, common themes tie the final three sections together: the second, third and fourth sections are all closely linked to the study of mammary ductal carcinomas. The second section describes the evolution of a tool designed for the analysis of the cell lines, as well as its use for other analyses; the third describes the results of the analysis; and the fourth focuses on one specific subset of the analysis, identifying a pathway of specific interest. Thus, this thesis attempts to address both computational and biological problems.

1.1  Research Presented

The research originally proposed for this thesis aimed to identify recurrent variations in breast cancer, using commonly used cell lines, that could in turn be used to identify novel drug targets. Over time, however, the research shifted towards the goal of more accurately reflecting the underlying biology of the cell lines investigated.
1.1.1  Research not Included

A significant amount of time was first spent studying five commonly used breast cancer cell lines from the American Type Culture Collection (ATCC), sequenced in 2008: MDA-MB-231 (adenocarcinoma), T-47D (ductal carcinoma), HS578T (ductal carcinoma), MCF7 (adenocarcinoma) and BT-549 (ductal carcinoma). These cell lines provided little or no insight into the molecular mechanisms of breast cancer cell lines, for three reasons: the high error rates of the early Illumina sequencers (which have since improved with upgrades to both the sequencing machines and the chemistry employed), the lack of recurrent mutations observed, and a poor experimental design that failed to include matched normals or other sufficient mechanisms for filtering passenger mutations. Additionally, the diverse origins of the cell lines, which do not pertain to a single type of breast cancer, complicated the analysis and further reduced the likelihood of obtaining insightful observations. Furthermore, with a sample of only five transcriptomes, the statistical power to identify valid hits among the high noise was poor. Sequencing and analysis of these five cell lines are not discussed in this thesis.

1.1.2  Outline

The research presented in this thesis represents a major effort to remedy the issues observed in the analysis of the first five cell lines, discussed above. Rather than investigating five breast cancer cell lines representing two different subtypes of cancer, it was decided to focus on mammary ductal carcinoma and to identify cell lines with B-cell derived matched normals for the purpose of identifying somatic mutations. The ATCC carried four cell lines matching this description, as well as four mammary ductal carcinomas that were available, but did not have matched normals.
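The role of the matched normals described above can be illustrated with a minimal sketch. It is not the pipeline used in this thesis: the `Variant` type, the `somatic_candidates` function and the coordinates are hypothetical, chosen only to show that a variation also present in the matched normal is treated as germline and removed from the candidate somatic set.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (illustration only)."""
    chrom: str
    pos: int   # 1-based position on the reference
    ref: str   # reference base
    alt: str   # observed base


def somatic_candidates(tumour: List[Variant], normal: List[Variant]) -> List[Variant]:
    """Keep variants called in the tumour line but absent from its matched normal."""
    normal_set = set(normal)
    return [v for v in tumour if v not in normal_set]


# Made-up calls: one variant is shared with the matched normal.
tumour_calls = [
    Variant("chr12", 1000, "T", "G"),
    Variant("chr17", 2000, "C", "T"),
]
normal_calls = [
    Variant("chr17", 2000, "C", "T"),  # also in the normal, so likely germline
]

candidates = somatic_candidates(tumour_calls, normal_calls)
print(candidates)  # only the chr12 variant remains
```

A real analysis would also need to confirm that the matched normal has adequate read coverage at each position, since a variation can be absent from the normal calls simply because the position was not sequenced deeply enough.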
As the cost of sequencing had dropped precipitously since the commencement of the project, it was possible to perform more sequencing for the second set of cell lines: both exome capture sequencing and transcriptome sequencing for the four cell lines with matched normals, as well as for the matched normals themselves, and transcriptome sequencing alone for the new cell lines without matched normals. To perform a more thorough examination of these cell lines, custom software used for the analysis of the original five cell lines was modified to be able to handle the increased volume of data and merged with an existing code base, described in chapter 2, that had already implemented many important utilities for processing and interpreting high-throughput sequencing data. Chapter 3 describes this modified software and discusses many elements important to processing the large amount of data that is now routinely generated using Next Generation Sequencing (NGS) platforms. Many additional details about the genomic behaviour of cancer have become clear since the original project was proposed. For instance, it has become abundantly clear that recurrent mutations during early oncogenesis are not as common as originally thought, and that distinct types of cancers share a diverse set of genomic variations.[6] Although some cancer types have clear causes, breast cancer has been shown to be a much more complex disease in which a variety of mechanisms and genes are involved, rather than a single cause or pattern.[7, 8] Our analysis, confirming these points, can be found in chapter 4. Finally, despite the diversity of the genes and mechanisms involved, certain pathways and mechanisms can be observed to be enriched for variations and differing expression signals, suggesting that there may be novel exploitable pathways in use by cancer cells.
By exploiting both the expression and variation data from transcriptome sequencing, as well as a knowledge based approach, it is possible to identify pathways that appear to play a significant role in the behaviour of cancer cells. This method is demonstrated in chapter 5, where a specific pathway has been identified from the sequencing data, and reinforced with relationships described in peer-reviewed publications. Thus, this thesis describes the work that was done subsequent to the initial analysis, as well as the software developed over the course of the research.

1.1.3  Goals of this Research

The goals of the research presented here can be summarized for each chapter:
Chapter 2: Demonstration of novel algorithms for Chromatin Immunoprecipitation and Sequencing (ChIP-Seq), and description of the FindPeaks application implementing the algorithms.
Chapter 3: Demonstration of novel concepts and methods for working with large volumes of NGS data.
Chapter 4: Analysis of recurrent events in mammary ductal carcinoma cell lines.
Chapter 5: In-depth investigation of a recurrent event with the purpose of identifying novel drug targets.

1.2  Vancouver Short Read Analysis Package

The Vancouver Short Read Analysis Package (VSRAP) collects together several different software applications designed and developed at Canada’s Michael Smith Genome Sciences Centre (BCGSC), of which the FindPeaks ChIP-Seq application (see chapter 2) and the Variation Database (see chapter 3) are the most prominent members. This project also includes a wide variety of code for use in analysis of NGS data files and experiments. Among other elements, the VSRAP includes applications for file conversion, file iterating, calculating gene coverage, read depth analysis, developing graphical visualizations of NGS data and rudimentary Single Nucleotide Polymorphism (SNP)-calling.
All of the code in the VSRAP is developed and written in Java, closely following the Sun Microsystems Code Conventions for the Java™ Programming Language (April 20, 1999 revision). Coding practices are enforced through the use of the Enerjy plug-in (Enerjy Software, Teamstudio, Inc., Beverly, MA) for the Eclipse integrated development environment (The Eclipse Foundation). This code base makes up the vast majority of code used in the preparation of this thesis. All code developed for the analysis of data reported in this thesis was developed under the open source GNU General Public License (GPL) (http://www.gnu.org/copyleft/gpl.html, last accessed April 13, 2012). The code is freely available along with the project documentation at http://vancouvershortr.sourceforge.net.

1.2.1  Open Source Bioinformatics

Open source is a broad term used to describe a set of licences that aim to give the users of a given piece of software access to the resources required to modify and re-use its programming instructions, in order to alter or customize the application. However, in more colloquial terms, it is used to describe any software with permissive licences that grant rights and privileges to the users of a software package above and beyond those required to run the binary code of the software. The Open Source Initiative (OSI) defines ten conditions required for a licence to be considered open source (http://www.opensource.org/docs/osd):
1. The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.
2. The program must include source code, and must allow distribution in source code as well as compiled form.
3. The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
4. The license must explicitly permit distribution of software built from modified source code.
5. The license must not discriminate against any person or group of persons.
6. The license must not restrict anyone from making use of the program in a specific field of endeavour.
7. The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.
8. The rights attached to the program must not depend on the program’s being part of a particular software distribution.
9. The license must not place restrictions on other software that is distributed along with the licensed software.
10. No provision of the license may be predicated on any individual technology or style of interface.

These conditions guarantee that the rights of the authors are respected, that each user is granted the rights to further modify and improve the code, and that the users of the modified code are provided with the same rights as those who do choose to modify and improve the code. This prevents the dilution of rights, regardless of the number of times new developers choose to modify each other’s work.

1.2.2  Why do Open Source Bioinformatics?

There are many reasons why open source bioinformatics should be attractive to developers, which provide some context for the open release of the code described in this document.
More eyes on the code: One of the most obvious reasons to do open source development is that it allows more people to look at the source code and understand how the software works. While not all users will have the inclination or expertise required to do so, even a small number of additional people looking at the code will enable more bugs to be found and, in some cases, lead users to suggest fixes to the application. This can dramatically accelerate the development cycle of the software, bringing better quality code to the users more quickly.
More users of the code: Open source code is also frequently provided at no cost to the user, which can encourage a wider set of consumers to deploy the software. The maxim that the more users there are for a given piece of software, the faster bugs will be identified, is longstanding in computer science. As the size of the user base increases, the quality of the code generally improves at an increasing rate, if there are enough developers to make use of the feedback.
Code Contributions: As alluded to in the previous two points, if there are more people looking at the code and using the code, it is inevitable that someone will begin to make suggestions and contributions to improve the code. This has the net effect of increasing the size of the development team, spurring on a faster growth cycle for the software.
Better Support: As communities of users and developers grow around a codebase, more people become familiar with the project and its workings. Thus, when questions arise, there is a larger network of people capable of providing answers and support to those who need it, requiring less time from the project’s core developers. This frees up the developers to focus on the project’s core needs, rather than providing support to users.
Continuity: Many software projects are started by individuals with a single issue they wish to solve, and often are simply abandoned once that issue ceases to be important to them. With open source software, other individuals can build upon that work, providing a continuity that would not be possible without access to the source code.
Transparency: The availability of the source code also enables users to understand the functions of the software, fostering trust in the developers, as well as the ability to understand the methodology employed.
This is particularly important in scientific software where “blackbox” deployments can be difficult to incorporate into academic processes.
Infrastructure: There are a lot of resources available at no cost to the developers of Open Source code, including web hosting, wikis, versioning systems, bug trackers and other useful tools. For instance, SourceForge provides a complete package of tools for any project that is licenced with an OSI approved license.
There are several informative articles on the subject of why software code for bioinformatics should be openly published, including Barnes [9] and Peng [10].

Open Source Bioinformatics Tools

There are a large number of academic bioinformatics tools that use the Open Source model and have prospered from the benefits outlined above. Some of the more commonly recognized tools include the Bfast and Burrows-Wheeler Aligner (BWA)/Mapping and Assembly with Quality (MAQ) aligners, the Model-based Analysis for ChIP-Seq (MACS) and FindPeaks applications for ChIP-Seq, the Abyss and Velvet assemblers, the SNVMix/SNVMix2 SNP callers and even entire suites of tools such as SOAP, SOAPsnp and SOAPdenovo.[1, 11–18] There has been much discussion within bioinformatics about encouraging the use of open source software to improve the transparency and repeatability of complex analyses that depend on sophisticated procedures.

1.2.3  Libraries

In addition to providing full source code, the VSRAP also provides a mechanism for integrating its functionality into other projects as a single binary file. This can be done by compiling the class files into a single Java Archive (Jar) file, which then provides a portable mechanism for incorporating VSRAP functions into other, possibly unrelated, coding projects.

1.2.4  Availability

The software described in this document is available for download through the Vancouver Short Read Analysis Package on SourceForge, http://vancouvershortr.sourceforge.net.
Instructions for use and deployment are provided on the accompanying wiki pages: http://vancouvershortr.wiki.sourceforge.net.

1.3  ChIP-Seq

The first project included in the VSRAP was the FindPeaks peak-finder application. It has become a popular software tool for the analysis of results generated using the ChIP-Seq protocol. In this section, we will cover the history of the ChIP-Seq technique, as well as the general method used, followed by discussions of alternative protocols available, and a discussion of some key works that have emerged in this field. Finally, we will explore some of the applications where ChIP-Seq is likely to be used.

1.3.1  Background

Prior to the sequencing of the human genome, it was commonly assumed that the genetic code of a species contained all of the information required for the development of a single organism.[19] On the surface, this assumption was reasonable: all cellular components are derived from the instructions contained within the genome. However, it has become increasingly clear that the control of the genome itself is managed through much more complex interactions, involving modifications to the Deoxyribonucleic Acid (DNA) and its associated proteins, and that protein-DNA interactions have huge impacts on the phenotype of an individual.[20–22] These modifications have the ability to regulate gene expression, turning genes on and off in ways that would be impossible to predict from the genome sequence alone. Compounding the complexity of the problem are the interactions of the machinery involved in gene expression. This includes transcription factors that selectively interact with specific promoter, repressor and enhancer DNA sequences to carry out their respective functions.
Both epigenetics and gene regulation are now being studied by similar techniques, as they have the same fundamental goals: understanding and characterizing the nature of protein-DNA interactions involved in the regulation and mechanism of gene expression.[23] One method, arguably the best suited for studying protein-DNA interactions, is the chromatin immunoprecipitation and massively parallel sequencing technique, known as ChIP-Seq.

1.3.2  History of Chromatin Immunoprecipitation

Modern applications of the Chromatin Immunoprecipitation (ChIP) technique generally use a cross-linking agent to hold together DNA and the proteins with which it is in close contact, allowing for the collection of specific proteins of interest by immunoprecipitation. The earliest targets of the ChIP protocol were not the DNA fragments pulled down by immunoprecipitation, but the histone proteins themselves.[24] These proteins are now known to combine in specific complexes and to undergo modifications, such as methylation, acetylation and phosphorylation, which enable histones to participate both in a structural role as well as in more dynamic processes such as chromosome condensation during mitosis.[25–27] While the original aim of the ChIP protocol was the study of histone structure and assembly, early ChIP users quickly discovered the persistent contact between histones and DNA when the two were found not to dissociate under salt treatment.[24] Although the technology for investigating the immunoprecipitated DNA sequences was not available to early ChIP users, later researchers were able to carry out work on cis-acting transcriptional elements that has begun to provide insight into the control of transcriptional elements, such as promoters, enhancers, and repressors, and the role of chromatin regulation in gene expression.[28, 29] The potential of ChIP to investigate DNA fragments was not overlooked.
As new methods became available to study the immunoprecipitated DNA, ChIP was paired with methods such as Southern blotting, Polymerase Chain Reaction (PCR), quantitative PCR, and finally more recent developments like gene arrays.[30, 31] As each new technique for investigating mixed populations of DNA became available, interest in chromatin immunoprecipitation increased. By the end of the 1990s, ChIP was again becoming a popular technique for the study of proteins and both protein-associated DNA and Ribonucleic Acid (RNA).[32] More recently, ChIP has been combined with massively-parallel sequencing techniques, under the name ChIP-Seq. This protocol combines the ability to target specific proteins and their specific interacting DNA fragments with the high-throughput tag sequencing abilities of the next-generation sequencing machines to create an efficient and cost-effective whole genome view of DNA binding sites for proteins of interest.

ChIP-Seq Method

Chromatin immunoprecipitation is a relatively simple five-step process with the ability to collect fragments of DNA known to be bound to a specific protein of interest from living cells.[32] The first step is to treat the target cells or tissue with a cross-linking agent, typically formaldehyde, to form covalent bonds between proteins and DNA molecules that are in contact. This selectively ties together compounds that are interacting in vivo, effectively freezing their interaction in time, which is ideal for capturing the short-lived transcription-factor binding interactions. This step is followed by sonication or vortexing to disrupt the cells and to physically shear DNA to a desired size. The third step is the application of an antibody with selective affinity for the protein of interest, to pull down the proteins of interest and any other cellular entities to which they are attached. The antibodies are then collected, allowing non-antibody-bound proteins and loose DNA to be washed away.
Finally, exposure to high concentrations of salt and heat can be used to reverse the formaldehyde crosslinking, simultaneously disrupting the antibody binding. This results in a highly enriched collection of desired proteins and DNA sequences to which they were bound in vivo. (See Figure 1.1.)

Figure 1.1: Illustration of the Steps in the Standard ChIP-Seq Protocol. Image created by John Chui, distributed under the CC-by-SA 3.0 licence.

When ChIP is paired with massively parallel sequencing, the DNA fraction is isolated and collected from the mixture, which can be done via a manufacturer’s kit (e.g. catalogue #IP-102-1001, Illumina Inc.) or a simple phenol-chloroform extraction. The short DNA fragments will have an average length dependent upon the duration and vigour of the sonication or vortexing used during the shearing of the cells. The desired range of fragments can then be selected by running the DNA on an agarose gel and excising the portion containing the desired sizes, which is often in the range of about 150-500 base pairs. An alternative method exists for the isolation of histones and their associated DNA, using a ChIP method that does not require a cross-linking step. In this protocol, described for post-mortem brain tissue, the sonication or vortexing is replaced with a micrococcal nuclease digest, which does not disturb the coiling of nucleic acids on the histones of interest.[33] This allows the DNA to remain tightly wound around the histones throughout the entire chromatin immunoprecipitation process. Following precipitation, the DNA can be separated from the histones by Proteinase K treatment and extracted with phenol-chloroform. While not applicable for transcription factor precipitations, this process can greatly simplify the collection of histones without disturbing their signaling modifications.
Investigations with this method also indicate that both the histone-DNA interactions and the histone modifications are relatively stable and can be processed and successfully interpreted up to 30 hours after the death of the donor.

Sanger Dideoxy Based Tag Sequencing

One of the main reasons for the slow adoption of early ChIP based methods for analyzing DNA or RNA has been the inability to characterize the large populations of nucleic acid fragments obtained.[32] Using Sanger dideoxy based capillary sequencing to quantitatively analyze the millions of DNA fragments from a single ChIP experiment would be a costly process, complicated by the diversity of fragments represented in the population. This is especially true for larger mammalian genomes. However, several methods have been developed to sequence and sample the fragments of interest efficiently. Although sequencing every fragment is rarely possible using dideoxy sequencing, it is certainly possible to sample them to gain an understanding of the composition of the collection. One example of an early pioneering method for achieving this is the investigation of yeast telomeric heterochromatin, which uses the precipitated DNA and a section of known yeast telomere as primer pairs to amplify and sequence regions near the ends of chromosomes.[34, 35] This creative work allowed the selective investigation of histone modifications near yeast chromosome telomeres, which would have been difficult or impossible using other techniques.
Unlike the above example, much of the work combining ChIP and Sanger dideoxy sequencing strategies is aimed at studying areas distant from the telomeric regions of chromosomes and thus many of the strategies that have been developed include the creation of libraries containing cloned DNA fragments.[36, 37] Libraries have the clear advantages of being able to separate and amplify the many fragments from the original mixture in which they were isolated, as well as having flanking sequences for each cloned fragment that can be used as primers for sequencing reactions. However, the main drawbacks for Sanger-based investigations of DNA fragment libraries have been the large number of sequencing reactions required to obtain a statistically representative set of sequences from the ChIP derived fragments and the significant time and effort required to create the library before sequencing can begin.[38] In order to avoid these limitations, more high-throughput methods for sequencing the library of fragments have also been developed. These methods forgo a portion of the sequencing length in order to gain an increase in sampling quantity. This is done by shortening each fragment to create tags, which can then be serially ligated before sequencing. In this way a single sequencing reaction can detect the presence of a number of fragments, by identifying the points of origin of each unique tag sequenced. 
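The idea of reading several tags from one sequencing reaction can be illustrated with a minimal sketch. It is not taken from any of the protocols cited here: it assumes tags of a fixed length joined without linker sequences, and it locates each tag by exact, forward-strand string matching against a toy reference. The function names split_tags and locate_tags are hypothetical.

```python
def split_tags(concatemer, tag_length):
    """Cut a concatenated read into consecutive fixed-length tags.

    Assumption: tags are joined head to tail with no linkers between them.
    Any trailing bases that do not fill a whole tag are discarded.
    """
    if tag_length <= 0:
        raise ValueError("tag_length must be positive")
    usable = len(concatemer) - len(concatemer) % tag_length
    return [concatemer[i:i + tag_length] for i in range(0, usable, tag_length)]


def locate_tags(concatemer, reference, tag_length):
    """Return (tag, position) pairs for each tag in the concatemer.

    The position is the first exact forward-strand match in the reference,
    or -1 when the tag is not found.
    """
    return [(tag, reference.find(tag))
            for tag in split_tags(concatemer, tag_length)]


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
    # One read carrying two 6-base tags from different parts of the reference.
    for tag, position in locate_tags("AACCGGGTTGCA", reference, 6):
        print(tag, position)
```

A single read therefore reports the origin of two fragments rather than one, which is the gain in sampling that the tag-based methods trade against sequence length.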
This method of serially ligated tags is similar to that of the well established Serial Analysis of Gene Expression (SAGE) technique.[39, 40] Three methods that combine ChIP and serially ligated tags include ChIP-SAGE, Serial Analysis of Binding Elements (SABE) and ChIP-Paired End Tag (PET).[41–44] All three techniques take advantage of the tag based concept of SAGE by including an extra cloning step in which short tags are extracted from each immunoprecipitated fragment through a type II restriction digest; the tags can then be ligated and sequenced using a single dideoxy sequencing reaction to obtain a greater sampling with fewer sequencing reactions. The first method, ChIP-SAGE (also known as GMAT), uses a straightforward method of collecting tags, but suffers from a loss in the resolving power that could be obtained from sequencing the full fragment: the site of the interaction can be narrowed down only to within 500-1000 base pairs, as the site of the protein-DNA interaction can be up to 1000 base pairs from the end of the fragment from which the tag was created.[45] Similarly, the SABE method uses an elegant SAGE-like method, but introduces a further enrichment step in which only the immunoprecipitated fragments are attached to one of a pair of linkers and are then subtractively hybridized against the non-enriched DNA fragments. Only pairs of fragments which are able to hybridize with complementary primers (i.e. both from the enriched, primer-treated fraction) will be amplified, eliminating any sequence that hybridizes with DNA sequences from the non-enriched pool of fragments. By including a restriction enzyme site in only one of the primers, the amplified fragments can be cleaved to create 18 base pair mono-tags that can then be spliced together, creating diTags, which can be further concatenated to create SAGE-like strings of diTags.
Typically, up to 30 diTags can be sequenced in a single concatamer, dramatically increasing the information available per sequencing reaction. Because these tags come from either end of the fragment in which the protein-DNA interaction occurred, it is possible to map the protein-DNA interaction to the genome more accurately by bracketing the interaction site between the observed tags. Similarly, the ChIP-PET method obtains information from both ends of the fragment, excising all but ∼20 base pairs on either end. Unlike SABE, this method is accomplished through the creation of a library of vectors containing the precipitated fragments. Whether using SABE or ChIP-PET, each set of paired end tags allows for improved mapping of the origin of the fragment back to the host genome, an automated task for which software exists.[46]

Hybridization Based ChIP Arrays

Until recently, the most practical alternative to Sanger based sequencing, or the less commonly used Maxam-Gilbert “chemical” sequencing,[47] was the use of DNA hybridization based techniques, which depend on the base-pairing or hybridization of a single stranded DNA (or RNA) molecule with a complementary (or nearly complementary) DNA sequence to form a helical structure. The number of mismatches tolerated is referred to as the stringency of the hybridization and can be manipulated to achieve the desired range of complementarity. This simple technique has been used for a wide variety of purposes, including sequencing applications in the kilobase range.[48] The simplest use of hybridization is the use of a single DNA molecule of known sequence, referred to as a probe, to detect, or to anchor, a complementary sequence from among a mixture of fragments.
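As a rough illustration of stringency as a tolerated number of mismatches, the sketch below compares a probe with an equal-length target by counting positions at which the target differs from the probe's reverse complement. This is a deliberate simplification under stated assumptions: in the laboratory, stringency is set by conditions such as temperature and salt concentration rather than by an explicit mismatch count, and gaps or partial overlaps are ignored. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def mismatches(probe, target):
    """Count mismatched positions when the probe anneals to the target.

    A perfectly complementary target equals the probe's reverse complement,
    so mismatches are counted against that expected sequence.
    """
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    expected = reverse_complement(probe)
    return sum(1 for a, b in zip(expected, target) if a != b)


def hybridizes(probe, target, max_mismatches):
    """True if the target is within the tolerated number of mismatches."""
    return mismatches(probe, target) <= max_mismatches


if __name__ == "__main__":
    probe = "ACGTAC"
    print(hybridizes(probe, "GTACGT", 0))  # perfectly complementary target
    print(hybridizes(probe, "GTACGA", 0))  # one mismatch, high stringency
    print(hybridizes(probe, "GTACGA", 1))  # one mismatch, relaxed stringency
```

Lowering max_mismatches corresponds to a more stringent hybridization in this simplified model.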
Early hybridization methods were frequently based upon the use of a single probe, and were often applied to identifying the presence or absence of a given sequence in a mixed sample.[49, 50] These techniques were not combined with ChIP, likely because of the limited utility of determining the sequence of individual DNA fragments before the completion of the reference human genome and the development of the faster, more comprehensive hybridization based gene arrays in the 1990s. The advent of array technologies led to experiments in which a limited number of genes or sequences of interest were probed simultaneously, within the limit of the available arrays.[51] While early gene arrays, also known as gene chips, were only able to provide probes for a small number of sequences, modern arrays include nearly 2 million different probes (Affymetrix Inc.). However, because of the combinatorial explosion in the number of possible sequences that arises as the length of the probes increases, it is infeasible to cover all possible sequences in one array. Additionally, because of the necessity of including probes that are sufficiently long to allow unique identification of their origin, each gene array must carefully select which probes are included. One example of the size to which arrays have grown is Affymetrix’s largest SNP array, the Genome-Wide Human SNP Array 6.0, which has 1.8 million probes, including 906,600 that target known SNPs. Even the use of arrays of this magnitude for the analysis of ChIP derived fragments leaves open two significant but related issues: it is only possible to test for the presence of sequences for which there are probes, and the limited number of probes that can be tested prevents truly fine grained sequence scans from being possible.
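The combinatorial explosion mentioned above follows from the four-letter DNA alphabet: there are 4^n distinct sequences of length n. The short sketch below, with hypothetical function names, works through the arithmetic and shows the longest probe length for which a given number of probes could cover every possible sequence. The 25-base length in the usage example is an arbitrary illustrative choice, not a property of any particular array.

```python
def possible_sequences(length):
    """Number of distinct DNA sequences of the given length (4 bases)."""
    return 4 ** length


def longest_exhaustive_probe_length(probe_count):
    """Largest length k for which probe_count probes can cover all 4**k sequences."""
    length = 0
    while possible_sequences(length + 1) <= probe_count:
        length += 1
    return length


if __name__ == "__main__":
    # 1.8 million probes, the size quoted above for a large SNP array.
    print(longest_exhaustive_probe_length(1_800_000))
    # Number of distinct sequences for an illustrative 25-base probe.
    print(possible_sequences(25))
```

Since 4^10 = 1,048,576 and 4^11 = 4,194,304, an array of 1.8 million probes could exhaustively cover sequences no longer than 10 bases, which is far too short to identify a unique origin in a mammalian genome.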
It is also important to note that the resolving power of hybridization techniques is related to the size of the DNA fragments that are being probed: the larger the fragment, the more likely it is to find a probe to which it will hybridize, but the poorer the resolution of data obtained. Finally, users of hybridization based techniques must be aware of difficulties experienced with the microarray platform itself. A common problem is the introduction of PCR-based biases, which can be difficult to quantify, caused by amplifying the DNA fragments before they are applied to the microarray.[52] Additional difficulties have been reported in the detection of low-affinity or low-abundance tags, and repeatability is often poor when processing the same sample with the same or different microarrays.[53–56] Despite these drawbacks, gene arrays have proven to be particularly useful when searching for known motifs or DNA sequences, such as in diagnostic applications or when working with small, fully sequenced genomes. In fact, DNA arrays have been used extensively with chromatin immunoprecipitation experiments performed in yeast and bacterial genomes, where it is possible to create DNA chips that contain all intergenic regions.[31, 57] This combination of chromatin immunoprecipitation and DNA chips is known as ChIP-chip or ChIP-on-chip. However, the use of DNA arrays has mainly been superseded by new sequencing methods based on the sequencing-by-synthesis approaches, which avoid many of the problems of hybridization approaches while achieving resolutions estimated to be equivalent to a gene array containing ∼1 billion probes.[58, 59] Sequencing-by-synthesis uses a single-stranded DNA molecule as a template and reads each new base added as a complementary strand is synthesized.
This method often makes use of blocked nucleotides, which prevent more than one base from being added at once, followed by a chemical step to un-block the nucleotide, allowing each base addition to be measured individually.

Application of Sequencing-by-synthesis

Sequencing-by-synthesis was first proposed and patented by Robert Melamede in 1985, but remained relatively unknown until the beginning of the millennium.[60] Since then, it has undergone rapid development and commercialization, such that it is now possible to purchase a variety of sequencing devices that use variations of this basic concept. Among the first companies offering technology for sequencing-by-synthesis were 454 Life Sciences, acquired by Roche in 2007, and Solexa, acquired by Illumina in 2007.[39, 61, 62] Since 2007, the technologies offered by these vendors have improved in both the quality and quantity of the data produced. Although these parameters vary by platform, it is not unusual to get hundreds of millions of base pairs sequenced on the 454 platforms, or well into the hundreds of billions of base pairs on the Illumina GA per run. Additionally, due to chemistry improvements, the read lengths have increased up to 150 base pairs for the latest Illumina machines and can be in excess of 300 base pairs for the 454 platform.[63] This ability to sequence a very large number of DNA fragments in a massively parallel process is one of the major advantages of the sequencing-by-synthesis method. In the context of a ChIP experiment, this enables researchers to obtain a saturating coverage of the immunoprecipitated DNA fragments. The low cost of performing massively parallel sequencing also makes ChIP-Seq an effective strategy compared to using either hybridization or dideoxy based sequencing methods to achieve similar levels of coverage.
It is also expected that sequencing-by-synthesis methods will continue to come down in price, making ChIP-Seq more accessible to researchers, while the increasing read lengths will improve the accuracy of the results obtained. One of the first experiments combining ChIP with sequencing-by-synthesis technology was based on a modification of the combination of the chromatin immunoprecipitation and paired end tag strategy, using 454-based sequencing to map p53 binding sites in HCT116 cells.[64] However, unlike standard ChIP-PET and SABE techniques where multiple tag sets are ligated to form long concatamers, only two PETs were joined, forming a diPET. This drastically reduced the number of tags that could be identified in each sequencing reaction; however, the massively parallel nature of the sequencing reactions available on the 454 GS20 machine used in this experiment (between 200,000 and 300,000) more than compensated for the reduction in PETs per reaction compared to earlier methods. As a result, it was possible to identify 22,687 unique fragments, of which 8,896 were uniquely mappable, identifying 57 clusters of sequenced reads that indicate likely p53 binding sites. More recently, ChIP has been combined with the massively parallel sequencing-by-synthesis method used by the Illumina 1G machine. Unlike the ChIP-PET strategy used above, the Illumina derived reads contain only a sequence from one end of each immunoprecipitated fragment. However, to offset this disadvantage, the Illumina 1G is able to generate much deeper coverage, sequencing over 10 million reads per lane. Thus, instead of bracketing the site of a DNA-protein interaction between the two sequenced ends of a fragment, as would be observed in a PET based sequencing approach, multiple fragments are instead found to overlap at the site of the interaction.
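The way overlapping single-end reads pile up over a binding site can be sketched as a per-base coverage count. The sketch below is illustrative only and is not the FindPeaks algorithm described in chapter 2: it assumes every read is extended to one fixed, user-supplied fragment length in the direction of its strand, and that each read is given as the coordinate of its 5' end within a small region. The function name fragment_coverage is hypothetical.

```python
def fragment_coverage(reads, fragment_length, region_length):
    """Count how many extended fragments cover each base of a region.

    reads: iterable of (five_prime_position, strand) with strand "+" or "-".
    A "+" read is extended to the right of its 5' end and a "-" read to the
    left, by an assumed fixed fragment_length. Positions are 0-based and
    fragments are clipped to the region [0, region_length).
    """
    depth = [0] * region_length
    for position, strand in reads:
        if strand == "+":
            begin, end = position, position + fragment_length
        elif strand == "-":
            begin, end = position - fragment_length + 1, position + 1
        else:
            raise ValueError("strand must be '+' or '-'")
        for base in range(max(begin, 0), min(end, region_length)):
            depth[base] += 1
    return depth


if __name__ == "__main__":
    # Two forward reads and one reverse read flanking the same site.
    reads = [(2, "+"), (4, "+"), (9, "-")]
    depth = fragment_coverage(reads, fragment_length=5, region_length=12)
    print(depth)
    print("highest coverage at position", depth.index(max(depth)))
```

Reads from both strands converge on the same bases, so the maximum of the coverage profile gives a simple estimate of where the interaction occurred.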
Because of the direct sequencing of single molecules in the enriched fragment using the Illumina 1G, it is also possible to show that sequencing of each DNA fragment is directly analogous to counting the frequency with which a binding event is observed. Thus, the greater the number of fragments observed, the greater the number of times the event occurred at the time of the experiment.

Barski et al. [45] published the first demonstration of the ChIP-Seq method, highlighting its speed and versatility in their May 2007 publication. They were able to provide comprehensive genome-wide coverage data for more than 20 epigenetic marks as well as the DNA-binding locations of the CCCTC-binding factor (CTCF) protein in human CD4+ T-cells. Mainly focusing on methylation of the tail segment of histones, they were able to demonstrate correlations between the presence of many of these marks and transcriptional activation or repression. This provides a clear indication of the role that histone modifications play in the regulation of gene activity. From their results, it is impossible to dismiss the relationship between histone methylation and the control of transcription, which marks a major step forward in understanding epigenetic signals in human cells.

A similar study was published by Johnson et al. [59] focusing on the neuron-restrictive silencer factor (NRSF) (also known as repressor element-1 silencing transcription factor (REST)). Using the ChIP-Seq protocol and searching for Areas of Enrichment (AOE) of the sequenced reads, they were able to map NRSF binding to 1946 locations in the Jurkat human T-lymphoblast cell line, with an estimated accuracy of +/- 50 base pairs. One example of the power of this method was demonstrated by their ability to identify a degenerate binding site for a gene thought to be regulated by NRSF.
Previous experiments were unable to identify any sequences that were likely to be NRSF-binding sites; however, an area of enrichment was observed upstream, showing that NRSF did indeed bind and regulate the gene as previously hypothesized. Indeed, 94% of predicted binding sites from the ChIP-Seq method were found within 50 base pairs of an NRSF binding motif and virtually all sites with 90% or greater match to an NRSF motif were found to be occupied.

Robertson et al. [65] also published the results of their ChIP-Seq experiment based on the signal transducer and activator of transcription 1 (STAT1) transcription factor. Their work on the STAT1 protein is similar to the work done with NRSF; however, STAT1 is regulated by phosphorylation, making the system more complex. Upon stimulation with interferon-gamma, STAT1 residing in the cytoplasm is phosphorylated, causing it to migrate to the nucleus, where it gains the ability to form homodimers, heterodimers and heterotrimers, which then bind DNA. However, STAT1’s phosphorylation is short-lived and it is rapidly dephosphorylated, whereupon it dissociates from the DNA and returns to the cytoplasm. This study was conducted on both interferon-gamma stimulated and unstimulated HeLa S3 cells, allowing a comparison between the two conditions. Sequenced reads found in the unstimulated sample were largely indistinguishable from noise, indicating that STAT1 binding is scarce without the interferon-gamma stimulation, whereas reads in the stimulated sample were observed to cluster into enriched areas, which passed false discovery rate thresholding. Thus, the ChIP-Seq method is able to differentiate the two binding conditions and capture the short-lived binding during phosphorylation of the STAT1 transcription factor.

In order to facilitate the interpretation of the ChIP-Seq data obtained in this experiment, Robertson et al.
[65] devised a method in which each short sequencing read was assumed to be indicative of a fragment with a mean fragment length of 174 base pairs, determined experimentally. By extending each sequence read to a constant length, representative of the original fragment size used for Illumina sequencing, they were able to align sequenced reads back to the genome and look for areas of overlaps, indicating enrichment. Using this process, binding sites appear as a Gaussian-like distribution, with median widths of about 40-50 base pairs and tails extending up to 1000bp on either side, comparing favourably to typical ChIP-chip results of a single featureless peak of 500-1000 base pairs. The Gaussian-like distributions observed in this method can be used to locate a peak maximum that was shown to coincide well with known locations of the STAT1 transcription factor. The STAT1 binding motif predictions were generally found within 100 base pairs of the peak maxima. STAT1 motif enriched sites were also observed near known transcriptional start sites, with the highest density at -100 base pairs upstream, as would be expected for a transcription factor. An interesting consequence of the cross-linking used in most ChIP-Seq methods, but not taken into account in the above experiments, is the detection of secondary protein-protein and protein-DNA interactions. Proteins in close proximity to the transcription factor of interest will become cross-linked during formaldehyde crosslinking and consequently will be pulled down along with the transcription factor of interest during the ChIP process. Any DNA to which they were also cross-linked will then be sequenced during the analysis of the immunoprecipitated DNA fragments. This creates multiple side-by-side points of enrichment of aligned reads along the genome, indicating the presence of the second DNA-interacting protein. 
Further analysis of these locations is likely to reveal that the binding motif of the targeted transcription factor is not present, indicating the DNA-protein interaction may not be direct.

In some cases, the secondary interaction may not be directly adjacent along the linear visualization of the genome. Because the chromosomes are able to adopt tertiary structures which bring seemingly distant regions into close contact, it is possible to obtain secondary protein-protein and protein-DNA interactions that appear unrelated to the originally targeted protein-DNA interaction. Mapping and identifying these long-range DNA interactions has the potential to contribute significantly to our understanding of the regions involved in gene expression and the mechanisms of transcriptional control. Using a ChIP-Seq based approach, Ruan and colleagues have demonstrated the ability to observe these interactions by ligating precipitated fragments before releasing them from their cross-linked protein.[66] This allows them to form a single DNA strand containing sequence information from both locations. The subsequent use of a paired-end sequencing strategy allows both fragments to be identified and the distance between them determined. This novel approach, entitled ChIA-PET, has been used to map the long-range DNA interactions associated with estrogen receptor binding with excellent results.[66]

1.3.3  Medical Applications of ChIP-Seq

Researchers are now able to observe genome-wide interactions between DNA and proteins in vivo, as well as changes in genetic regulation in response to various stimuli using ChIP-Seq. It has begun to open the door for the development of genome-wide maps indicating histone modifications and the locations of transcription factor, enhancer, repressor and promoter recruiting sequences.
Undoubtedly, this knowledge will have a broad impact on our understanding of genomics based medicine, leading towards the development of new medical treatments.

There are already indications that our ever improving understanding of the molecular mechanics of protein-DNA interactions is changing the medical field in diverse areas such as toxicology, ageing, general health and cancer.[67–72] An increased awareness of the consequences of altering gene regulation will play a major role in the future study of each of these fields. For example, in cancer, our understanding of the genetic basis of the disease will require the study of key events such as the effect of transcription factor binding site mutations on oncogene regulation, the effect of de novo binding sites arising through mutations, and the epigenetic controls involved in oncogenesis. Clearly the medical applications of the ChIP-Seq method are already having a transformative effect on how we perceive the study of human health.

1.3.4  Challenges

A significant hurdle to the use of short-read sequences occurs during interpretation of the data. Johnson et al. [59] devised a method in which a minimum threshold number of sequenced tags, set at 13 on the basis of Receiver Operating Characteristic (ROC) analysis, must be found within 100 base pairs.[73] A second method, pioneered by Robertson et al. [65], is the use of so-called “peaks”, in which observed sequences are assumed to extend to the mean fragment length of the precipitated DNA, and the number of times each base position is observed in any extended read is collected into a histogram, from which the peaks can be identified. Peaks that have heights greater than a false discovery rate threshold are retained. Both methods permit a quick genome-wide scan of the template genome for AOE, representing the observed binding locations.
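The two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by Johnson et al. or Robertson et al.: all function names are hypothetical, reads are reduced to a leftmost aligned coordinate and a strand, and the `min_height` argument merely stands in for a cutoff derived from a false discovery rate calculation, which is not performed here. The 174 base pair fragment length, the 13 tag threshold and the 100 base pair window are taken from the text.

```python
from bisect import bisect_left
from collections import Counter


def extend_read(start, strand, read_len, fragment_len=174):
    """Return the half-open interval [begin, end) of the inferred fragment.

    Forward-strand reads are extended downstream from their first base;
    reverse-strand reads are extended upstream from their last base.
    """
    if strand == "+":
        return start, start + fragment_len
    end = start + read_len  # one past the rightmost aligned base
    return max(0, end - fragment_len), end


def coverage_histogram(reads, read_len, fragment_len=174):
    """Count how many extended reads cover each base position."""
    coverage = Counter()
    for start, strand in reads:
        begin, end = extend_read(start, strand, read_len, fragment_len)
        for pos in range(begin, end):
            coverage[pos] += 1
    return coverage


def call_peaks(coverage, min_height):
    """Return (begin, end, height, summit) for each contiguous covered
    region whose maximum height reaches min_height."""
    regions, region = [], []
    for pos in sorted(coverage):
        if region and pos != region[-1] + 1:
            regions.append(region)
            region = []
        region.append(pos)
    if region:
        regions.append(region)
    called = []
    for region in regions:
        summit = max(region, key=lambda p: coverage[p])  # leftmost maximum
        if coverage[summit] >= min_height:
            called.append((region[0], region[-1] + 1, coverage[summit], summit))
    return called


def enriched_window_starts(read_starts, min_tags=13, window=100):
    """Return read start positions that open a window of `window` bases
    containing at least `min_tags` read starts (tag-count thresholding)."""
    ordered = sorted(read_starts)
    hits = []
    for i, start in enumerate(ordered):
        j = bisect_left(ordered, start + window)  # first start outside window
        if j - i >= min_tags:
            hits.append(start)
    return hits
```

The per-base histogram is written for clarity rather than speed; a genome-scale implementation would more plausibly record only the positions at which coverage changes.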
An advantage of these methods is the simplicity of expressing these peaks or sequenced reads graphically, making them relatively simple to interpret visually. However, where reads contributing to more than one nearby site overlap, or where secondary interactions exist, the peaks can become complex, making it difficult to locate the true binding site of the protein of interest. Furthermore, as the sequencing depth increases, fragments from non-specific binding will begin to accumulate on the shoulders of each peak, making it difficult to find the peak region’s boundaries.

A second major issue for ChIP-Seq is the requirement for aligning the reads back to a reference genome. The process of aligning the reads depends on the quality of the reads generated by the sequencing platform as well as the length of the sequences generated. As the length of the sequence generated increases, so does the cost and time to generate the data. Thus, the sequences generated for ChIP-Seq experiments are often kept short. This creates a tradeoff between the quality and length of the reads and the cost of the experiment, which affects the ability to accurately re-align the reads to the true genomic point of origin of the DNA fragment. Related to this is the ambiguity inherent in determining the origin of DNA fragments that are derived from repeat regions in the genome. Similarly, it is also possible to sequence entire regions that do not exist in the template genome, or for which the sequence obtained contains sufficient mutations that the read shares too little identity with the analogous region in the template genome. Each of these situations causes a loss of information through the inability to interpret some subset of the results obtained.

A further challenge in dealing with ChIP-Seq data has required the separation of signals from DNA-protein interactions into two separate classes.
The first generation of software was designed to deal with narrow, well defined peaks, such as those found with transcription factors and a small number of post-translationally modified histones. However, many histone modification signals and DNA-protein interactions generate broad regions of coverage, poorly suited to early peak finder tools. There are several tools designed to handle the broad regions of coverage and diffuse signals of nucleosome data.[74, 75] Thus, an important component in any ChIP-Seq experiment is managing the data through the use of appropriate software.

Moving forward, there are still further issues in ChIP-Seq that remain open for investigation, in which visualization will undoubtedly play a part.[76] Additionally, normalization of ChIP-Seq experiments and signal detection remain contentious issues with no clear consensus emerging from researchers using the ChIP-Seq protocol.

1.3.5  Future Uses of ChIP-Seq

One challenge worth noting with respect to the ChIP-Seq method relates to a further epigenetic change: the methylation of DNA itself. The methylation of cytosine residues is a well-known DNA modification that is thought to play a role in gene regulation. It is one of the best studied epigenetic modifications of DNA across all organisms.[77] Although current protocols for DNA methylation experiments do not involve Chromatin Immunoprecipitation directly to pull down methylated bases, the treatment of the data is highly similar to ChIP-Seq, particularly with respect to identifying regions of methylation enrichment.

One common technique used to study the methylation of DNA employs sodium bisulfite, which converts unmethylated cytosine residues to uracil while leaving methylated cytosines unchanged.[78] If the same segment of DNA were to be sequenced twice, for instance, once with bisulfite treatment and once without, it would be possible to compare the two sequences to identify which cytosines were converted to uracil and which were protected, and thus where the methylation existed along the DNA sequence.
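The comparison of a treated and an untreated copy of the same segment can be illustrated with a short sketch. It assumes the standard bisulfite chemistry, in which unmethylated cytosines are deaminated to uracil (read as T after amplification) while 5-methylcytosines are protected and still read as C. It also assumes that the two sequences are already aligned base for base on the same strand. The function names are hypothetical, and `bisulfite_convert` is only a simulation used to generate example input.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment followed by amplification.

    Unmethylated C is read as T; C at a methylated position is unchanged.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def call_methylation(untreated, treated):
    """Compare untreated and bisulfite-treated reads of the same segment.

    Returns (methylated, unmethylated): two lists of 0-based positions of
    cytosines in the untreated sequence, split by whether the treated
    sequence still shows C (protected) or shows T (converted).
    """
    if len(untreated) != len(treated):
        raise ValueError("sequences must be aligned to the same length")
    methylated, unmethylated = [], []
    for i, (before, after) in enumerate(zip(untreated, treated)):
        if before != "C":
            continue
        if after == "C":
            methylated.append(i)
        elif after == "T":
            unmethylated.append(i)
    return methylated, unmethylated
```

Because every unmethylated C becomes a T, the treated sequence carries less information than the original, which is the source of the alignment difficulty for short reads.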
In the case of short-read sequencing technology, bisulfite treated DNA is more difficult to interpret, as the bisulfite causes a loss of information corresponding with the decrease in cytosine residues, and it is impossible to pair reads as would be done for Sanger sequencing. Although it is difficult to map the slightly degenerate sequences back to the human genome, it is not impossible, through exhaustive searches, to identify likely sources of origin. However, the amount of sequence space being searched using this method can be reduced by performing a ChIP pull down and treating only one half of the sequences with bisulfite. The non-bisulfite treated sequences can then be used to identify the small regions of the genome where the DNA is likely to have originated, while the bisulfite treated DNA can then be aligned against the reduced genomic regions to identify areas of cytosine methylation. As read lengths increase, however, this issue may become more tractable due to the increased amount of information available in each read.

An alternative to bisulfite sequencing is Methylated DNA Immunoprecipitation (MeDIP), which uses an antibody raised against 5-methylcytosine (5mC) to enrich for methylated regions of the genome. The method is remarkably similar to a ChIP-Seq method and is only noticeably different because of the lack of a cross-linking step, which in ChIP-Seq is included to hold the antibody-bound protein in contact with the DNA. In this case, the antibody binds directly to the methylated DNA, eliminating the need for the cross-linking.
Of interest, the two methods, bisulfite sequencing and MeDIP, provide comparable results, albeit with different coverage, resolution, accuracy and cost.[79]

With the advent of sequencing techniques that allow comprehensive enumeration of the DNA fragments, it is now possible to begin asking detailed questions about allelic frequency and SNPs within protein-DNA interaction sites.[80] With personalized medicine on the horizon, there is little doubt that SNPs outside of coding regions will be recognized for their importance in determining gene expression, which can then be used to predict phenotypic differences between individuals.

An important application of the ChIP-Seq method will be in the study of cancer cells, where the regulation of genes plays a fundamental part in the disease condition.[71] While the regulation of genes through methylation has been shown to play an important part in the development of cancer cells, there is also the potential for cancerous cells to de-regulate gene expression through the loss of existing transcription factor binding sites or the development of de novo sites through random sequence errors.[70] It can be expected that ChIP-Seq and its descendants will make these studies accessible, as the number and quality of antibodies for transcription factors increases.

Among the many applications with which massively parallel sequencing can be combined, one of the more fascinating is the Chromosome Conformation Capture (3C) technology, used to study the folding of chromatin in vivo.[81] This technique applies formaldehyde cross-linking followed by a random digestion and a selective ligation step favouring cross-linked DNA fragments to capture sites of interacting DNA. The end result of this process is a series of ligated DNA fragment pairs representing interacting regions of the genome.
With a paired end protocol, as is expected to be developed for the Illumina machines, it will be possible to interrogate libraries of interacting DNA fragments to obtain a genome-wide snapshot of the tertiary structure of chromosomes.

1.3.6  FindPeaks

The software described in chapter 2 was one of the first applications developed exclusively for use in processing ChIP-Seq results. Due to its novel method of searching and identifying Areas of Enrichment (AOE) that appeared to accumulate into sharp signals, known as peak finding, the application was given the name “FindPeaks”. FindPeaks was begun by Matthew Bainbridge as a stand-alone Java class designed to simply sort genomic coordinates and scan for regions where sequenced reads had accumulated. It pioneered the use of a fixed length read extension (see section 2.3.2), and was developed in less than 1000 lines of code. At the time the project was undertaken, there were no publicly available applications for ChIP-Seq data processing and no pre-existing methods had yet been publicly described.

FindPeaks has now grown to include nearly 25,000 lines of code and provides many functions for analysis of AOE that were not available at the time. Furthermore, it has made progress in addressing some issues that are still unresolved for ChIP-Seq analysis, including normalization, noise-signal filtering and comparing multiple samples. FindPeaks makes an ideal platform for experimenting and developing novel solutions to pressing issues in the realm of ChIP-Seq, further elucidating the interactions between DNA and proteins in vivo.

1.4  Variation Database

The second project included in the VSRAP package was the Variation Database (VDB) schema, User Interface (UI) and Application Programming Interface (API). Although it has not yet been adopted widely outside of the BCGSC, it has pioneered several novel concepts and become a core piece in the pipelines used for the analysis of human variations at the BCGSC.
In this section, we will discuss common variation types, focusing on point mutations, as well as some of the tools used frequently for identifying variants and some basics of relational databases. Finally, we’ll explore some of the applications where the VDB is currently used.

1.4.1  Variations

There are several common types of variation that are encountered in genomic data.

Single Nucleotide Variation (SNV): A replacement of a single base with respect to the reference genome, detected in either DNA or RNA. These variations are typically not present in the germline of the organism, but appear due to biological events including replication errors, radiation events or chemically induced mutations from agents such as carcinogens or oxidizing radicals. Additionally, it is possible for other sources of error to provide incorrect base calls that appear similar to biological events in sequencing data sets, such as sequencing errors, base call errors or variant calling errors, all of which produce false positive variations that are difficult to distinguish from those of biological origin.

Single Nucleotide Polymorphism (SNP): A single nucleotide replacement detected in either DNA or RNA, indicating a variant that is found at some frequency (commonly 1% or greater) in the general human population. They are most frequently present in the germline of the organism in which they are found and are rarely the result of spontaneous biological changes. The human reference genome frequently incorporates minor polymorphic alleles into the reference sequence, as sampling for the human reference genome was insufficient to ensure that the most common allele is always used as the reference base at every position. Thus, SNVs and SNPs are often used to describe the same divergence from the reference when the true frequency of a polymorphism in the population is ambiguous.
Insertions and Deletions: The insertion or loss of a sequence of bases compared to the reference genome, composed of one or more bases being inserted or deleted. The number of bases that may be lost or gained in an insertion or deletion event can include the gain (duplication) or loss of full chromosomes, as is common in cell lines and cancer cells. From an informatics point of view, any gain or loss of information can be considered as an Insertion and Deletion (INDEL), although the manner of the gain or loss may allow for more descriptive terms to be applied. These are typically observed in DNA samples, but may be reflected in RNA as well. They do not include the normal splicing of messenger Ribonucleic Acid (mRNA) found in eukaryotes, which appears in the mRNA but not in the corresponding DNA.

Duplication: A subclass of insertions, duplication events can range in size from a single base pair all the way to the duplication of an entire chromosome. The mechanism by which an event occurs will determine the size of the duplication, such as slipping events, which can insert a duplicate base or bases during replication, or viral insertions (jumping genes), which can transfer kilobases of information to random points in the genome. Duplications are found only in DNA based samples, but may be reflected in the amount of corresponding mRNA produced by a gene found within a duplication event.

Fusion: An event in which genomic information is rearranged to cause the insertion of one gene inside another or to swap portions of one gene with another. These events can take place in a productive manner, in which the codon frame is consistent, producing elements of both proteins, or in a non-productive manner, in which the second gene’s codon frame is inconsistent with the first gene, usually resulting in a truncated version of the first protein.
These may be observed in both RNA and DNA based samples and can arise through translocation, duplication, or deletion events.

Inversion: The effect of “flipping” a segment of DNA and replacing it in the same location in the opposite orientation. While there is no net loss of genomic information, the points at which the breaks occur can create novel fusion products or disable important biological modules. These events may be observed in DNA and RNA, but would be more easily mapped, and thus identified, in DNA based samples.

Translocation: An exchange of genomic material between two chromosomes, which can result in a net loss of material (unequal translocation) and has the potential to create fusions or other rearrangements. These events cannot be detected in RNA based samples unless they also produce a fusion of two genes from different chromosomes, but can be detected by DNA fragments that span the translocation point.

Alternate Splicing: Alternative splicing is a change in the normal pattern of mRNA intron splicing and is thus not observed in genomic DNA. It can be the consequence of SNVs, Insertions and Deletions (INDELs) or other genomic changes, and is reflected in an altered pattern of exon usage.

Copy Number Variation (CNV): This term is used to describe the loss or gain of genomic material in which the number of copies of a biological unit (i.e. alleles) is reduced or augmented, which can occur through a deletion or duplication event.

All of the variations above have the potential to create disruptive events that can have consequences on the behaviour or functioning of a cell, and in the context of cancer research, each one can provide a mechanism by which cells can circumvent the checkpoints that enforce normal behaviour.
For instance, the causative variation for many childhood leukaemia patients appears to be a t(9;11)(p22;q23) translocation in or around the mixed-lineage leukemia (MLL) genes.[82]

The most common events observed when sequencing genomes and transcriptomes are SNVs and SNPs, of which approximately 3 million deviations from the reference genome are present in each individual, and which our sequencing platforms are most amenable to identifying reliably. (See section 1.4.2.) Thus, for the VDB, we have chosen to focus on SNVs and SNPs. The VDB has also been expanded to include INDELs by Alireza Khodabakhshi, but this work is not discussed in detail here, as INDELs can be considered a trivial expansion of point coordinates (chromosome, position) to linear coordinates (chromosome, start, end). The code used to perform INDEL queries is based in large part upon the code written for queries on SNVs.

1.4.2  Pipelines Producing SNP- and SNV-Calls

Identifying SNPs and SNVs is a multi-stage process involving two main components. The first component is the aligner, the software that attempts to identify the most likely point of origin for each read produced by the sequencer. The second component is the SNP-caller, which functions to interpret the reads aligned to a reference genome or transcriptome to identify positions where the sequenced material predicts a different base occupying a position than is anticipated from the reference.
Aligners

Sequence aligners have played a key role in the repertoire of bioinformatics algorithms, and perhaps none more so than the venerable Smith-Waterman alignment algorithm, which provides an exact optimal solution, and the Basic Local Alignment Search Tool (BLAST), which provides a rapid near optimal solution.[83, 84] Despite ongoing modifications to the BLAST algorithm, the underlying strategy employed is not well suited to rapidly identifying the genomic origin of short fragments in a large reference genome, nor to handling the volumes of sequences produced routinely by NGS platforms.[85, 86] Fortunately, a new generation of short-read alignment programs has been developed to accomplish this task.

One of the earliest alignment tools designed specifically for NGS data is the Efficient Local Alignment of Nucleotide Data (ELAND) application (Anthony J. Cox, Illumina Inc. 2007), based on the efficient use of hash tables, a common method of performing rapid retrieval of information from indexed data structures. This code was able to identify the point of origin of short sequences, provided each was matched to a unique location in the genome with no more than two base mismatches. When more than two mismatches were found, or two or more equally likely points of origin for the fragment existed, ELAND was unable to provide any mapping coordinates. For many data sets of mammalian DNA, this allowed between 50% and 65% of the fragments to be identified.[59, 65]

ELAND also suffered from a common issue that still plagues NGS aligners: the inability to perform gapped alignments. In fact, the removal of gapped alignments from early NGS aligners was one of the reasons for their increased speed, but at the cost of accuracy. The other major limitation, already alluded to, was the restriction on the number of mismatches allowed in each fragment mapped.
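The behaviour described for hash-based alignment (a unique placement with at most two mismatches, no gaps, and no result otherwise) can be sketched as follows. This is not ELAND's implementation, which is not described in the text; it is a generic seed-and-verify scheme chosen for illustration, and every name in it is hypothetical. It relies on the observation that if a read of length at least 3k contains at most two mismatches, at least one of its first three non-overlapping seeds of length k must match the reference exactly.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash table mapping every k-mer in the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_unique(read, reference, index, k, max_mismatches=2):
    """Return the single 0-based ungapped placement of `read`, or None.

    None is returned when no placement has at most `max_mismatches`
    mismatches, or when more than one such placement exists.
    Assumes len(read) >= (max_mismatches + 1) * k.
    """
    candidates = set()
    for seed_number in range(max_mismatches + 1):
        offset = seed_number * k
        seed = read[offset:offset + k]
        if len(seed) < k:
            break
        for pos in index.get(seed, ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = [
        start for start in candidates
        if hamming(read, reference[start:start + len(read)]) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

In this sketch, a read that falls in a repeat, or that differs from the reference by an insertion, a deletion or more than two substitutions, yields `None` and is lost to downstream analysis.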
These limitations severely restricted the accuracy of the early sequencing experiments, and typical early ChIP-Seq analyses often had mapping rates in the 50-60% range.[65]

In the years following the introduction of ELAND, a plethora of competing aligner packages were produced, using similar hash-table based approaches, each making incremental contributions to the problems outlined above. Some of the early aligners included the Mosaik assembler (http://bioinformatics.bc.edu/marthlab/Mosaik, last accessed April 13, 2012), Exonerate [87], SXOligoSearch (Synamatix, Kuala Lumpur, Malaysia) and Slim Search (SLIM Search Ltd., Auckland, New Zealand). One of the most influential was the Mapping and Assembly with Quality (MAQ) aligner, produced by Heng Li (http://sourceforge.net/projects/maq/, last accessed April 13, 2012), which was able to take advantage of individual base sequencing confidence scores to lessen the influence of poor sequencing quality bases in the overall alignment process. This use of base scores provided a significant boost in the accuracy of the alignment.

Another major milestone in the development of aligners came with the introduction of the Burrows-Wheeler Transform (BWT). First applied to short-read alignment by the Bowtie aligner, the use of the BWT quickly replaced the use of hash table based approaches and was rapidly picked up by other aligners such as SOAP (http://soap.genomics.org.cn/, last accessed April 13, 2012) and BWA, the successor to MAQ.[12, 88] This algorithm incorporates a simple alphabetical sort to reorder the characters and creates a more efficiently stored key for any given word. Because the key itself is easily compressed and the transform can be reversed, the algorithm can be made much more efficient than a standard hash-based approach, and the amount of information held while utilizing the compressed indices can be much smaller than with hash-based approaches, thus making the algorithm faster and less resource intensive.
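The transform itself is short enough to show directly. The sketch below is a textbook construction intended only to illustrate the alphabetical sort and the reversibility mentioned above; it is not how Bowtie, SOAP or BWA build their indices, since sorting full rotations is far too slow and memory hungry for a genome. The function names and the choice of `$` as an end-of-text marker are assumptions made for the example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform by sorting all rotations of the text.

    The terminator must not occur in the text and must sort before every
    other character, so that the transform can be inverted unambiguously.
    """
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]
```

For the classic example, `bwt("banana")` gives `"annb$aa"`: identical characters tend to end up adjacent, which is what makes the transformed text easy to compress.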
The increase in speed that comes along with applying the BWT to the alignment problem enabled algorithms to be developed that improved accuracy as well by hybridizing search strategies. Currently, one of the most used algorithms for PET sequencing is implemented by the BWA application, which uses the BWT to align and anchor one end of the paired read and then performs a gapped local search using the Smith-Waterman algorithm to align the second end of the paired read, if the BWT is unable to locate a good alignment for it. Such strategies have made alignments a reliable and routine part of the NGS pipeline.

SNP Callers

Once an alignment is completed, it is possible to search the alignments to identify individual base pairs or regions where the alignments predict divergent sequences from those that are annotated in the reference, which can take many forms (see section 1.4.1). When the divergent sequences are single base replacements or Single Nucleotide Variations (SNVs), the application identifying the variants is known as a “SNP-caller”.

At the most basic level, a SNP-caller can simply walk across the reference and identify any base read by the sequencers that does not match the reference. However, performing a naïve SNP-call in this manner provides poor results, failing to take into account many variables, ranging from the quality of the sequence produced by the NGS platform to the quality of the alignment produced by the aligner. These qualities can dramatically influence the ability to trust the data providing evidence for each variation from the reference. This is the basic information used by such applications as the varFilter scripts available in the SAMTools package.[89]

Most SNP-callers have advanced beyond these simple metrics in order to improve the accuracy of the SNVs they predict. varFilter has also moved to include filters that test the depth of the reads used to call a variant, as well as the number of variants in a small window.
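A naïve caller with these kinds of filters can be sketched as follows. This is an illustration only and does not reproduce varFilter or any other tool: the function names, the pileup representation (a mapping from 0-based position to a list of base and Phred quality pairs) and every threshold value are assumptions chosen for the example.

```python
from collections import Counter


def call_snvs(reference, pileup, min_base_quality=20, min_depth=3,
              min_variant_fraction=0.2):
    """Walk the pileup and report the best-supported non-reference base.

    Returns a list of (position, ref_base, alt_base, alt_count, depth),
    where depth counts only bases passing the quality filter.
    """
    calls = []
    for pos in sorted(pileup):
        bases = [b for b, q in pileup[pos] if q >= min_base_quality]
        depth = len(bases)
        if depth < min_depth:
            continue  # too little trustworthy evidence at this position
        ref_base = reference[pos]
        for alt, count in Counter(bases).most_common():
            if alt != ref_base and count / depth >= min_variant_fraction:
                calls.append((pos, ref_base, alt, count, depth))
                break
    return calls


def drop_clustered(calls, window=10, max_calls=2):
    """Discard calls that have too many neighbours within `window` bases,
    a pattern that often indicates a misaligned read rather than biology."""
    positions = [call[0] for call in calls]
    return [
        call for call in calls
        if sum(1 for p in positions if abs(p - call[0]) < window) <= max_calls
    ]
```

The first function applies base quality and depth thresholds; the second applies the variants-per-window test to its output.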
Both of these metrics can be used to filter out predicted divergences from the reference with poor support or caused by poor alignments.

An alternate method for performing SNP-calls is also available in the SNVMix/SNVMix2 tool set that builds upon the metrics outlined above. Using Bayesian statistics, it is able to predict the probability of any variant being heterozygous or homozygous, depending on the expectations for a given environment.[17] This allows the application to use information about the origin of the sample to make better informed decisions when evaluating each variant observed.

1.4.3  Single Nucleotide Polymorphism Databases

The collection of polymorphisms and variations in the human genome is surprisingly large compared to other organisms, due to the explosion in the human population experienced over the past few thousand years, resulting in a dramatic increase in the number of SNVs carried in the human population.[90] This growth has also had a significant effect on the ratio of SNVs to SNPs, as no bottleneck events have affected the entire human population to counter the diversity generated by the population explosion within the past 10,000 years, suggesting that few variations will have been able to stabilize as polymorphisms within the last 50 generations. Thus, whereas the study of genomic variability in other organisms can benefit from polymorphisms collected across a small number of organisms, humans will require much greater sample sizes to adequately characterize the diversity present in the population. Consequently, humans also require larger databases to hold this information and to make use of the data.

Current public databases storing variations for the human genome are based upon collaboratively pooling data into a single database with a single interface available to the public.
This gives collaborators little control over how the database can be mined and requires that they freely share their data with the owners of the repository. We aim to provide an alternative mechanism: providing the source code for a database, together with its UI and API, enabling researchers to set up local versions without investing heavily in the development of the resource and allowing confidential information to remain secure.

With the introduction of next generation sequencing technologies, there has been a rapid increase in the amount of genomic and transcriptomic sequence data generated. Consequently, an ecosystem of bioinformatics applications and resources has grown up around the ever increasing volume of data being produced, including aligners, assemblers, repositories and protocol specific applications mainly designed to investigate one genotype at a time. Other databases and tools are available for performing analyses of small groups or conducting Genome Wide Association Studies (GWAS); however, a significant keystone of this ecosystem has yet to be introduced: a simple way to collate and search across large numbers of sequenced libraries, particularly those for which the analysis cannot be supplied by tools external to the organizations collecting the data.[91]

Databases have long been a staple of bioinformatics research, with the most common organizational model based upon collaboratively pooling data into a single publicly available database. This allows the developers to maintain control of the hosting and formatting of the data, but also requires that collaborators freely share their data with the owners of the repository. In contrast, our project provides an alternative mechanism: providing the source code and API for the implementation of a local database, enabling researchers to set up private repositories without investing heavily in the development of the resource and allowing confidential information to remain secure.
This is an especially important model when the genomic data is of a medical nature and cannot be publicly shared. Furthermore, distributing the burden of development can provide a significant collective bootstrap to researchers involved in the collection and analysis of private genome or genome-derived data sets.

There are several advantages to collecting genomic information in the proposed database versus analysis of a set of independent DNA and RNA sequencing experiments. In contrast to publicly available databases, such as dbSNP, a local repository can be used to track more data about each observation, and more information can be stored about the library of origin.[92] Thus, common variations will quickly become easy to identify, as will those that cluster into broad categories. This makes it easy to conduct meta-genomics experiments on larger data sets and enables data mining for each new set of sequencing data, building on previously performed experiments. Finally, the database itself does not require any genomic annotation system, and annotations can be imposed at the time of data export. We provide a UI and an API that utilizes an Ensembl database (http://agd.vital-it.ch/info/software/java/index.html, last accessed April 13, 2012) to identify and analyze genes/exons, applicable for most analysis needs.

An alternate approach to the problem of confidential data is for the data to be held independently and for a collaborative interface to be made public. Tools such as the Galaxy web platform use this approach to provide a broad set of collaboratively built tools that enable biologists and other researchers to utilize next generation sequencing data to perform analysis without requiring the user of the web interface to be skilled at programming.[93] This approach, while avoiding the use of a single database, fulfils the same mandate of data confidentiality and provides an API to the user.
It cannot, however, be used to analyze the same quantity of data that can be handled by a single dedicated relational database.

Finally, it is also an important consideration that much of the data collected may come from medical research or other sources that are sensitive to privacy issues. In such cases, sharing of the data through systems such as web- or cloud-based interfaces is not an option, as they fail to guarantee data security. Because genomic sequencing data can be used in much the same way as a fingerprint, by identifying variations and combinations of variations unique to an individual, it is important that the data be held with confidentiality intact. In such a case, a local database which allows rapid querying and processing, but prevents the data from being shared inappropriately, is an important part of a medical sequencing workflow.

1.5  Relational Databases

As previously mentioned, the ability to maintain the scalable and flexible nature of the database relies on the mechanics of the database (the implementation) as well as the design. This encompasses the use of an API to handle access to the database, the ability to utilize Structured Query Language (SQL) access, and the indexing and optimization of both the hardware and software running the database. Specific features of optimization of relational databases are described below, as they provide a useful guide to those working with the VDB or on similar projects. The following topics are some of the tools necessary to develop and optimize a relational database that contains large tables (i.e. billions of individual rows), including tables that are too large to be cached into the available memory of a single computer.
1.5.1  SQL Access

SQL is an ISO-approved programming language developed by IBM, now used extensively by relational databases for the management and retrieval of data, as well as data access and other administrative tasks.[94] It provides a common language that allows queries to be composed and sent to SQL compliant databases without regard for the specific implementation of the database. For example, two relational databases with identical table structure (e.g. postgresql and mysql) would allow the same SQL compliant queries to be executed, given the presence of the same tables in either database. Thus, SQL forms a common language for database interaction, used at various levels of this project. If, at some future point, the database were to be migrated to any other SQL compliant implementation, it would allow the APIs developed to be ported to the new implementation with few changes necessary.

The most typical SQL query is the “Select” query. A typical invocation would be to select one or more fields from a single table. For example, selecting the content of the “city” field stored in the “cities” table would be done as:

    SELECT city FROM cities;

or, if multiple fields such as “city” and “population” were selected from the “cities” table:

    SELECT city, population FROM cities;

and finally, to be more selective:

    SELECT city, population FROM cities WHERE province = 'Ontario';

Select queries can be made much more complex, involving nested select statements, multiple tables and data grouping. These queries make up the majority of calls in the API for interacting with the VDB.

1.5.2  Postgresql

The database selected at the core of this project is the SQL compliant postgresql database.[95] This database was selected for its high performance capability, reliability, good community support and open source background, which is well suited for academic deployments.
The database includes an interactive SQL environment (psql) that enables realtime querying and database interaction. This allows developers to maintain the database, as well as to test SQL queries before including them in any programming interface. The SQL access provides tools that become an integral part of the development process. Although work has been done to optimize the behaviour of the postgresql database, the same table structures could be implemented in any SQL compliant relational database with the expectation that the accompanying API would continue to function. Optimizations used specifically for the VDB may not apply to other SQL databases.

1.5.3  ODBC

In addition to the psql command interface, the postgresql database also provides a second method for interacting with the stored information, via an Open Database Connectivity (ODBC) API. While SQL is a language for data manipulation, it does not handle any of the duties of forming a connection or managing data flow between a database and a client. Thus, the ODBC layer or driver provides the wrappers required to allow SQL queries to be executed outside of the database’s command shell. This, in essence, provides a mechanism for further abstraction, allowing SQL queries to be passed to the database as efficiently as possible and the resulting data to be retrieved for use by the application issuing the query. Using an ODBC layer effectively allows for database-independent coding, further abstracting the relationship between any application and the database with which it interacts. This has the added effect of making it possible to switch between database back ends by porting the data and replacing the ODBC layer. In the Java environment, the equivalent layer is Java Database Connectivity (JDBC), reflecting that the API it provides is suitable for the development of Java applications.
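The role of the connectivity layer can be illustrated with a minimal sketch. It is written in Python and uses the standard library's `sqlite3` driver purely as a runnable stand-in for an ODBC or JDBC driver attached to postgresql. The table contents, the population figures and the helper name `fetch_cities` are illustrative assumptions, not part of the VDB.

```python
import sqlite3


def fetch_cities(conn, province):
    """Send the Section 1.5.1 style query through the driver layer."""
    cur = conn.cursor()
    cur.execute(
        "SELECT city, population FROM cities WHERE province = ? ORDER BY city",
        (province,),
    )
    return cur.fetchall()


# The driver is responsible for forming the connection; SQL itself has no
# notion of connections. An in-memory SQLite database stands in for the server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, province TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Toronto", "Ontario", 2500000),            # placeholder populations
        ("Waterloo", "Ontario", 100000),
        ("Vancouver", "British Columbia", 600000),
    ],
)

rows = fetch_cities(conn, "Ontario")
print(rows)
```

Only the `connect` call names the back end, which is the sense in which the driver layer isolates the application from the database. In practice a port is rarely that clean, since details such as the placeholder syntax can differ between drivers.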
1.5.4  Keys and Indexing

One of the simplest and best methods for optimizing a relational database, with specific emphasis on postgresql and similar databases, is the use of effective indexes. An index provides an alternative method by which specific rows in a table can be identified and accessed without requiring the use of full table scans or primary keys. To understand how indices work, it is important to understand database keys.

Keys

A fully relational database utilizes a primary key in each table, which is a unique identifier for a given row. For instance, a table storing the names of cities in a province might use the city name as a key, indicating that a single city name would return only one row. However, if the table were to store city names for a whole country, a more effective key would be both the city name and province name, forming what is called a “composite key”, which utilizes more than one piece of information to look up a given city.

In the above example, and in the absence of an index, searching the table for a city/province combination would involve scanning the entire table to identify the rows in which both the city and province match the query. A much faster way to do this would be to provide an index based on the province field. This would effectively allow someone searching the table to first identify only those rows in the given province, and then search through that set for the city name. Slightly more efficiently, an index could also be created on the composite key itself, with both city and province, which would allow the database to search the index first for the province and then for the city, yielding only the address of the one row that matches the queried city/province composite key. Indices, thus, provide a mechanism for using a priori information to reduce the number of rows that need to be searched, which can significantly reduce the query time for large queries.
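The effect of a composite index on the city/province lookup can be sketched as follows. The sketch is in Python with the standard library's `sqlite3` module standing in for postgresql, so the planner output is SQLite's rather than what postgresql's own EXPLAIN would report. The index name `idx_cities_province_city` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key is declared, so the table starts with no index at all.
conn.execute("CREATE TABLE cities (city TEXT, province TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Toronto", "Ontario", 2500000),            # placeholder populations
        ("Waterloo", "Ontario", 100000),
        ("Vancouver", "British Columbia", 600000),
    ],
)

QUERY = "SELECT population FROM cities WHERE province = ? AND city = ?"


def query_plan(conn):
    """Return the planner's description of how QUERY would be executed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("Ontario", "Waterloo")
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)   # expected to describe a full table scan

# Composite index: province first, then city, as described in the text.
conn.execute("CREATE INDEX idx_cities_province_city ON cities (province, city)")
plan_with_index = query_plan(conn)      # expected to name the new index

print(plan_without_index)
print(plan_with_index)
```

Comparing the two printed plans shows the planner switching from reading every row to searching the index on both key fields.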
Indices also contain a reduced amount of information compared to a table, which may contain other fields, making them smaller and thus more likely to be loaded into memory for faster access (see Section 1.5.5). In optimizing a database, indices can be used to effectively improve access time to nearly all tables; however, each new index requires time and space to create and must be (automatically) updated each time a row is added or deleted. Thus, there is an overhead that becomes greater as each new index is added.

1.5.5  Hardware Performance

One of the most pressing issues for a database of any significant size is the performance that users are able to achieve. This is generally measured as a response time to a given query. It is important to note that not all queries can achieve the same performance level, and many of them are actively discouraged during the design of the database structure; thus, performance must be measured on those queries that are of most concern to the user. There are three hardware bottlenecks that a database is likely to experience:

Disk Access

The most critical bottleneck for a large database is most likely the speed at which data can be retrieved from storage devices, most commonly hard disk drives. The larger the volume of data stored, the more likely it will reside only on the hard disk drives, rather than in Random Access Memory (RAM). While there are certain strategies that can be undertaken (e.g. Redundant Array of Independent Disks (RAID) mirroring and striping) to improve disk access, the amount of data that can be accessed from disk is limited to the order of tens to hundreds of MB/second. While this speed is reasonably quick, concurrent disk access by several users or threads can cause individual performance rates to be rapidly pushed down to unacceptable levels, particularly when large sections of tables need to be searched to find the desired results.
Furthermore, disk access speeds depend on the number of physical heads available to read from the disk, the rate at which the disk platters spin, and the ordering of the data. Thus, searching for data encounters mechanical limits based upon the number of jumps the disk heads must make to read the data. For instance, one query searching through sequential records in a single table will make few, if any, jumps, while simultaneous reading from two separate (non-contiguous) tables will require a series of jumps between the two sections of data. All of these factors can play a part in determining the peak data access rate from a physical hard drive or RAID.

RAM

RAM plays a key role in the operation of databases, storing information in a rapidly accessible manner. Retrieval from RAM can be several orders of magnitude faster than retrieval from disk, owing both to the physical manner in which the information is stored and to the common hardware organization that places RAM in close proximity to the Central Processing Unit (CPU). Whereas hard drives are mechanical devices in which information is stored and read magnetically, RAM is an electronic solid-state device in which electrical gates are flipped on and off, and their status can be accessed efficiently either individually or in batches. Thus, the physical retrieval rate of RAM is typically measured in GB/second. Furthermore, the integration of RAM within a computer places it in a position where it is able to communicate more frequently with the CPU than a hard drive can, further widening the disparity in access speed between RAM and hard drives. Thus, for data operations, it becomes critically important which information is held in RAM and which is held on disk. Depending on the amount of RAM available to a database, it may cache indexes or entire tables in memory, making access to any given piece of information nearly instantaneous.
However, as tables grow larger, they are less frequently cached, and their indexes are less frequently cached as well. This has the effect of making the database appear to slow down dramatically. Unfortunately, there are no solutions for large databases other than to optimize queries to perform as few disk seeks as possible, to use small indices as often as possible or to increase the RAM as much as possible. Most relational databases, including postgresql, have the typical optimization that the last index used is held in memory; thus, it is most efficient to run similar or identical queries back-to-back, as this will have the effect of increasing the speed of subsequent runs, owing to the reduced need to load an index from disk. Unfortunately, if many simultaneous queries are being run that do not share common indices, this prevents any one index from being retained in memory, causing all queries to constantly require information retrieval from disk and failing to give any one query an increase in speed.

CPU

Although much less frequently a bottleneck than RAM or hard drive access speeds, the processing speed of the CPU can become an issue for larger queries, particularly those which use the “group by” SQL commands or those which join several tables. Because many queries do make significant demands on the processing speed of the server running the database, increasing the number of CPU cores, or the speed of the cores, can increase the performance of the database. SQL provides the ability to perform complex programming tasks as well, which can be heavily CPU dependent (e.g. http://www.developerfusion.com/article/84374/solving-sudoku-with-sql/, last accessed April 13, 2012). Although the CPU is much less frequently the major bottleneck for large databases, it will have an impact on execution times for complex SQL requests.
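As a rough illustration of the disk and RAM transfer rates quoted in this section, the following Python sketch estimates the time needed to read an entire table sequentially. The 500 GB table size and the 10 GB/second RAM rate are assumed figures chosen only for the arithmetic; seek time, caching and concurrent access are ignored.

```python
def scan_seconds(table_bytes, bytes_per_second):
    """Time to stream a table once at a fixed, sustained transfer rate."""
    return table_bytes / bytes_per_second


TABLE_BYTES = 500 * 10**9   # hypothetical 500 GB table
DISK_RATE = 100 * 10**6     # 100 MB/second, upper end of the disk range above
RAM_RATE = 10 * 10**9       # 10 GB/second, an assumed RAM transfer rate

disk_time = scan_seconds(TABLE_BYTES, DISK_RATE)
ram_time = scan_seconds(TABLE_BYTES, RAM_RATE)

print(f"disk: {disk_time:.0f} s, RAM: {ram_time:.0f} s, "
      f"ratio: {disk_time / ram_time:.0f}x")
```

Under these assumptions a full scan from disk takes well over an hour while the same scan from memory takes under a minute, which is why the choice of what is held in RAM dominates response times.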
1.5.6  Optimization

It is worth noting that performance optimization for a database can also be affected by specific database settings. Although this information is specific to a given database and version, it is a key component of maintaining an efficient database. Guides to performance tuning for postgresql can be found at http://wiki.postgresql.org/wiki/Performance_Optimization, last accessed April 13, 2012.

Prepared Statements

Prepared statements are a method of interacting with the database that reduces the amount of redundant computing performed by the database. When receiving SQL queries, the database must interpret the query, evaluate the request and then determine the most efficient way of performing the retrieval. The overhead required for these steps is frequently on the order of milliseconds, but can add up over thousands or millions of queries. To reduce the overhead, it is possible to use a “Prepared Statement” to indicate to the database that the client will be sending many queries of the same type. To return to the example in Section 1.5.1, instead of performing thirteen queries, one for each province, using the form:

    SELECT city, population FROM cities WHERE province = 'Ontario';

a more efficient approach would be to prepare the query, replacing the variable province value with a question mark:

    SELECT city, population FROM cities WHERE province = ?;

Subsequently, the client can then iterate using the form:

    // Assumes `connection` is an open java.sql.Connection and
    // `provinces` is a collection of province names.
    PreparedStatement ps = connection.prepareStatement(
        "SELECT city, population FROM cities WHERE province = ?");
    for (String province : provinces) {
        ps.setString(1, province);        // bind the province to the placeholder
        ResultSet r = ps.executeQuery();  // execute the already prepared query
        while (r.next()) {
            System.out.println(r.getString("city") + "\t" + r.getInt("population"));
        }
    }

This allows the evaluation step to be performed once and then re-used by each subsequent call to the prepared statement.
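A self-contained, runnable variant of the same pattern is sketched below in Python, using the standard library's `sqlite3` module as a stand-in for postgresql. The `sqlite3` module keeps an internal cache of compiled statements, so re-using one parameterised query string avoids re-parsing it on every iteration. Whether and when a given postgresql driver prepares the statement on the server is driver specific and is not shown here. The table contents are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, province TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Toronto", "Ontario", 2500000),            # placeholder populations
        ("Waterloo", "Ontario", 100000),
        ("Vancouver", "British Columbia", 600000),
    ],
)

# One query text with a placeholder, re-used for every province.
QUERY = "SELECT city, population FROM cities WHERE province = ? ORDER BY city"

provinces = ["Ontario", "British Columbia", "Quebec"]
results = {}
for province in provinces:
    # Only the bound value changes between iterations; the SQL text does not.
    results[province] = conn.execute(QUERY, (province,)).fetchall()

for province, rows in results.items():
    print(province, rows)
```

Binding values through placeholders also avoids assembling SQL text by string concatenation, which keeps quoting errors out of the queries.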
Batch Queries

While select queries make up the vast majority of database interactions for research purposes, there are other types of queries, such as update and insert queries, for which the user will not wish to return a result. These queries can be constructed as prepared statements, but may be executed all at once in a batch mode. This allows several prepared statement queries to be processed in a group.

    // Assumes `connection` is an open java.sql.Connection and
    // `dataSet` is a collection of city names.
    PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO cities (city) VALUES (?)");
    for (String newCity : dataSet) {
        ps.setString(1, newCity);   // bind the city name
        ps.addBatch();              // queue the insert instead of executing it
    }
    int[] rowsInserted = ps.executeBatch();  // one update count per queued insert

As with a single prepared statement, this allows the query to be evaluated once and that evaluation to be carried over to each subsequent re-use.

Data Ordering

As described in section 1.5.5, a major consideration for the access speed, and thus the response time, of a database is the ability to retrieve the rows of interest from the hard drive. Furthermore, the number of seeks the hard drive heads must make to retrieve that information directly affects the speed with which the information is made available to the user. For large tables and queries that must return many rows, the ordering of the data in the table can have a significant impact upon the retrieval time. Thus, the information in a table can be re-ordered to make blocks of commonly ordered information contiguous on the hard drive, producing more efficient retrieval strategies for the database. However, the re-ordering may not be based upon an arbitrary field or set of fields. Instead, the clustering must use an index, and the database maintainer should utilize indices that are known to be frequently used and likely to gain significantly from the expensive (time consuming) clustering process. The clustering command is simple, requiring only a table name and an index, providing the keys by which the data should be clustered.
If the index is not provided, postgresql re-uses the index on which the table was previously clustered.

    CLUSTER tablename [ USING indexname ]

This would provide significant increases in performance for queries that use the selected index, but would either not affect or, in some cases, slow down other queries that do not use the same data ordering.

Triggers

Triggers are not necessarily optimization devices, but have a broad range of applications that make them a valuable addition to the optimization toolkit. Triggers themselves are simply a tool that causes the execution of a secondary query in response to an interaction with a table. For instance, one can designate a trigger such that any time an insert is performed on a table (an insert query), a secondary insert of the same data is performed on a second table. This, in fact, can be used to maintain a summary table.

1.5.7  Background on the Variation Database

The variation database grew out of the need to perform comparisons of variations called between several cell lines and to perform filtering against sets of resources including dbSNP and whole genome sequencing that had been done with early versions of the NGS technology.[96–98] The original database grew out of both the necessity to perform the comparisons and a forecast need to develop a repeatable analysis pathway that would allow for storage, management and interrogation of NGS data. Several other bioinformatics tools were already available at the time, which had begun to address other needs for data management and interpretation.
The University of California, Santa Cruz (UCSC) genome browser provided an excellent interface for visualization of processed high throughput data, the Broad Institute’s Integrative Genomics Viewer (IGV) allowed aligned reads to be viewed, and several SNP-callers, such as the one provided with the MAQ aligner, were available for identifying and visualizing variations in the data.[99–101] These tools, however, failed to provide the ability to search through larger volumes of data to identify previously observed variations. Unlike the other methods available, the variation database had the specific need to be able to store data from a wide variety of data sets in a manner that would be capable of providing detailed information about each observation stored.

Several model systems were available for design comparisons. The dbSNP database provided a wealth of human variation, but was only available in a collapsed form, where individual samples were no longer distinguishable in the data. A flat-file database in which Binary Alignment/Map (BAM) files were queried as necessary was also a possibility, allowing access to every read in every data set ever sequenced, but would have been slow, as retrieval from BAM files requires indexing and significant disk access time. Thus, a compromise was reached: information is stored about each observation, but only at genomic positions that contain a variation in at least one sample. This database therefore provided a storage mechanism for intermediate level data, allowing users to return to the original raw data to fill in low level information, or to use the many annotation or visualization tools to explore high level data, such as gene annotations. The benefit of this system is rapid access to all of the variants in a wide range of data sets.

There are specific challenges faced by this form of database.
The two largest challenges are both caused by a lack of information stored in the database itself. The first lack of information is caused by the nature of the information stored, which consists only of called variants. By storing only called variations, a database of variations will lack any information about positions where variations do not occur. Thus, it becomes impossible to differentiate positions that were not tested for variations from those that were tested but were not found to have any deviation from the reference. This missing information is subtle, but it limits the types of information that can be derived from the resource.

The second lack of information relates to the ability to gather knowledge about specific information contained in the database. Despite the need for large scale databases such as the VDB, there will be a need for manual curation of data, which cannot be done for a genome wide resource without distributing the ability to manage and interact with the database. However, despite the VDB’s open source template, no one central repository for curating knowledge is likely to be achieved. Thus, this form of database will always require interaction with other forms of knowledge to perform meaningful analyses.[102]

Fortunately, there exist other databases with which the VDB can interact and which it can utilize to provide interpretation.
The VDB has been built with the ability to interact with the Ensembl Java API (ENSJ) to provide a basic level of annotation, as well as with dbSNP to provide a measure of filtering on common variations in the human genome.[92, 103] However, in the long run, resources like the VDB will provide their greatest contribution when it becomes possible to integrate information from well curated and clinically relevant databases such as Online Mendelian Inheritance in Man (OMIM), which are able to provide relevant information on the real world phenotypes with which variations are associated.[104]

1.6  Breast Cancer

Statistics on the frequency of breast cancer indicate that one in eight American women (American Cancer Society) and one in nine Canadian women (Canadian Cancer Society) will be diagnosed with breast cancer during their lifetime. Statistics for breast cancer mortality range between one in twenty-nine and one in thirty-five.[105] Among cancers, only lung cancer kills more women and only skin cancer is diagnosed more frequently; breast cancer accounts for one in three cancer diagnoses among American women.[105]

Breast cancer, however, is a heterogeneous disease, which can include several different types of cancers, generally classified by the cell type of the precursor cell or by other features. The most common forms are the lobular and ductal carcinomas, which arise in the lobules and the ducts of the breasts respectively. Some types of breast cancer are much rarer and compose a small fraction of tumours, such as phyllodes, arising in the connective tissue, or inflammatory breast cancer, which does not form a tumour mass. Lobular and ductal carcinomas account for up to 90% of breast cancers, with ductal carcinomas making up between 70% and 80% of tumours diagnosed.[106]

In addition to the tissue of origin, ductal carcinomas frequently exhibit different molecular properties than lobular tumours.
In an experiment comparing invasive ductal carcinomas with invasive lobular carcinomas using microarray technology, the two cancer types displayed differential expression changes in specific focussed pathways. Genes expressed preferentially in invasive ductal carcinoma correspond to genes promoting cell proliferation (e.g. human epidermal growth factor receptor 2 (HER2), Janus kinase 2 (JAK2), ankyrin repeat domain 32 transcription factor (ANKRD32) and calmodulin-binding neurogranin (NRGN)), in contrast to those overexpressed in invasive lobular carcinomas, which code for cell adhesion and for lipid and retinoic acid metabolism. Of further interest, Invasive Lobular Carcinoma (ILC) cells also appear to express genes whose corresponding protein functions are associated with cell differentiation rather than proliferation.[107]

Within these subtypes, the most common method of assessing the prognostic outcome of the tumour type is to use several common markers: the Estrogen Receptor (ER), HER2 and the Progesterone Receptor (PR). These three markers provide groupings that can be used to determine the most effective treatment: ER+ and PR+ expressing cancers respond well to drugs that block estrogen effects, such as tamoxifen, while cancers that overexpress HER2 respond well to trastuzumab. Unfortunately, some cancers do not express any of these three markers. This group of cancers is frequently called “triple negative” and has a dramatically poorer prognosis than any of the groups with positive histological markers.

1.6.1  Molecular Subtypes

In addition to the classification by tissue of origin, there exist methods of classifying breast cancers based on the differential expression of various genes. Breast cancers typically fall into five biologically distinct subtypes: luminal A, luminal B, human epidermal growth factor receptor-2 (HER2) overexpressing, basal-like, and normal-like.
The ratios with which these subtypes occur are dependent upon the age or racial group studied, with the basal subtype being the most frequent (~50%) and HER2 overexpressing being the least frequent (8-12%).[108, 109] These classifiers provide better prognostic value than the histological stains commonly used.[110] In large part, these molecular groupings map reasonably well to the histological ER/PR/HER2 markers: the HER2 overexpressing cancers are nearly identical with the HER2+ group, while the triple-negative group shows a marked overlap with the basal-like cancers.[111] In practice, however, the basal-like and HER2 overexpression designations are not the same, and likely both groups are composed of more diverse groups which have not yet been separately identified.[112] Because clustering analysis shows there are several subtypes of triple-negative or basal-like cancers, it is most likely that moving beyond array technology will assist in identifying not only the full set of markers that make up the subtypes, but also new targets in each triple-negative/basal-like subtype.

1.6.2  ATCC Mammary Ductal Carcinoma Cell lines

The mammary ductal carcinoma cell lines available through the ATCC are a diverse collection of cell lines made from grade three (high grade) primary tumours at stages I, IIA, IIB and IIIA, indicating a wide range of sizes and progression. Stage I tumours indicate a size up to two centimetres and no infiltration of the lymph nodes, while stage IIIA indicates a cancer that has spread to the lymph nodes and may be larger than five centimetres. As a group, they also display a broad range of cellular markers, including several triple-negative designations, as well as two cell lines that express two of the three markers. Discussions of the individual markers and traits of each cell line can be found in section 4.2.7.
As a group, there is no guarantee that cell lines created from Invasive Ductal Carcinoma (IDC) tumours will exhibit similar behaviours or share the molecular characteristics that underlie their oncogenic behaviours. The term ductal carcinoma refers only to the tissue of origin and encompasses a wide variety of different tumour types.

1.6.3  Epstein-Barr/B-Cell Derived Matched Normals

One issue that is commonly faced when using tumour derived cell lines is that they often accumulate a significant number of variations as they are passaged. Their inherent chromosomal instabilities make them prone to developing novel variations over time that are no longer representative of the original tumour. Additionally, they can be difficult to interpret without the context of a matched normal against which their behaviour can be compared. While the first problem is inherent, the second problem can be somewhat alleviated by the creation of a matched normal cell line, that is, a cell line with an origin in healthy tissues.

Healthy tissues do not naturally exhibit immortalization characteristics on their own, and thus creating an immortalized cell line from a non-cancerous tissue requires treatment with an external agent that mimics cancer. One such agent is the Epstein-Barr Virus (EBV), which is associated with Hodgkin’s lymphoma, Burkitt’s lymphoma and nasopharyngeal carcinomas.[113] EBV is capable of transforming various cell types into immortalized cells, making it well suited to the creation of cell lines. B-cells, which are easily obtained through blood donation, are readily transformed by EBV and are commonly used to provide a matched sample that carries the same genomic information as the patient’s healthy tissue.

There are some caveats that are necessary when dealing with EBV-transformed cell lines. EBV causes immortalization by modifying cell programming in a myriad of ways, some of which are well characterized, while others are still being studied.
It is well known that there are three viral genes that are required for cellular transformation, Ebna2, Ebna3C and the Lmp1/Lmp2 genes, as well as a set of about 20 miRNAs.[114] While the functions of the miRNAs produced are not yet clear, the genes are known to interact mainly with known oncogenes through different mechanisms. While the Ebna genes are able to localize to the nucleus to interact directly with target genes, the Lmp1 and Lmp2 genes can act as cell receptors, initiating signalling cascades by mimicking tumor necrosis factor (TNF) and interacting with many TNF associated factors.

EBNA genes: Although the full range of interactions of the EBV nuclear antigen (EBNA) proteins is not known, they have been shown to interact with the oncogenes recombination signal binding protein for immunoglobulin kappa J region (RBPJ) and v-myc myelocytomatosis viral oncogene homolog (MYC). The mechanism of the interaction between the EBNA proteins and RBPJ is of particular interest for cancer systems, however, as the EBNA proteins are able to mimic Notch signalling. This is achieved by binding the RBPJ transcription factor, out-competing the normal intracellular Notch domain (the cleaved signalling portion of the Notch genes) to constitutively activate RBPJ.

LMP1: The Lmp1 protein is known to activate the nuclear factor kappa-light-chain-enhancer of activated B cells (NFKB1) pathway and can cause the downstream activation of telomerase. It has also been shown to activate the mitogen activated protein kinase 1 (MAPK1) signalling pathway, which flows through the known oncogenes jun proto-oncogene (JUN) and mitogen-activated protein kinase 14 (MAPK14), and to increase the rate of chromosomal aberrations in a tumor protein p53 (TP53)-independent manner.

LMP2: Like Lmp1, Lmp2 shares a role in regulating telomerase activity and interacting with the MAPK1 pathway.
In addition, it has also been shown to interact with the v-src sarcoma (Schmidt-Ruppin A-2) viral oncogene homolog (SRC) and spleen tyrosine kinase (SYK) families of protein kinases, and can alter the function of the TP53 pathway. Thus, the behaviour of the matched normal cell lines, in this case EBV transformed B-cell derived cell lines, combines the normal genome of the patient with a cancer-like programming that enables the cell line to replicate indefinitely.

1.6.4  Research Done

Understanding the repertoire of protein coding changes that are acquired within a tumour represents a fundamental goal in our understanding of cancer. Therefore, we have aimed to leverage the completed human reference sequence to determine the mutations present within a cancer. Before the advent of next generation sequencing, the most comprehensive method of performing such a search was to systematically sequence genes across a wide variety of tumour samples to identify potentially causative variations. Of two such studies published in 2007, one reported the systematic sequencing of 518 protein kinase genes and the other used a PCR-based approach to sequence coding elements from 18,191 genes, both using the established Sanger-based sequencing technology.[115, 116] The latter study determined that approximately 80 amino-acid altering mutations accrue within a typical cancer. Significant efforts have also been made to curate the observed mutations in human cancer from the extensive literature of smaller independent studies.[117] The development of new sequencing-by-synthesis approaches has allowed for fundamental changes in the rate at which tumour derived DNA can be sequenced.
[118] Whilst routine sequencing of cancer genomes is certainly technically feasible, the overall cost has not yet dropped sufficiently far to make it an attractive approach.[8, 119, 120] Therefore, whole transcriptome shotgun and exome capture approaches represent potential methods to identify somatic protein coding mutations in cancer.[121, 122] These approaches also provide the ability to quantitatively determine the expression of the transcripts within the tumour and of their isoforms. The utility of these sequencing approaches has previously been shown in tumour derived material from both prostate and cell-line material as well as from normal tissue.[123–126] In this study, we have used Illumina sequencing to acquire data representing both whole transcriptome and exome capture DNA from eight breast cancer primary ductal carcinoma-derived cell-lines, as well as their B-cell derived matched normals for the four cell lines in which the normal was available. Cell-lines, including those discussed here, are used extensively worldwide for basic cancer research and pharmacological testing. A large investment of research capital is therefore taking place using cell-based reagents for which the mutational status of the genes and pathways is unknown. Determining the amino acid changes in these cell lines, both germ-line derived and somatically accrued, will not only provide salient information on the status of any drug target or pathway being studied, but will also provide an experimental cell-based resource of mutated proteins and protein variants.
It is expected that such cell-lines will have acquired numerous mutations during their passaging in vitro, which makes the acquisition of the mutation spectrum in each cell-line a valuable resource for researchers who work closely with them, as these deviations from the original tissue may provide challenges to the interpretation of the characteristics of the tumour-like behaviour, and in some cases, may cause behaviours that are unlike the tumour from which they are derived. With the plethora of false positive SNVs reported by next generation sequencing, we also introduce the concept of high confidence variants. Although it may be ideal to utilize multiple sequencing technology platforms to identify variants, it is also possible to derive high confidence cancer variants by requiring that the same variant be independently sequenced in both the RNA and the DNA from a single cancer sample and not in the RNA or DNA from a matched normal.[127] While sampling distributions will play a role in the identification of variants, any variant that passes this strict filter is likely to be a real, expressed somatic variant. In addition to SNV data, we have also included INDELs that were observed in the transcriptomes of the cancer cell lines. With up to 97% of expressed genes having alternate splicing, it can be a challenge to separate cancer driver INDELs from those that are part of the background.[128] Using assembled transcriptomes from both our cancer and immortalized B-cell derived matched normals, we are able to produce a list of genes observed to have been altered by SNVs or INDELs. This provides a list of genes that may be suitable targets, either with cancer specific SNVs, cancer specific splicing or both. Chang et al. [129] performed a similar analysis of eight ovarian cancer derived cell lines using 454 sequencing and Nimblegen capture arrays to demonstrate that such an analysis can be used to accurately gain insight into genomic SNVs and copy number variations.
We have extended this analysis using breast cancer cell lines, incorporating data from matched normals and RNA Isolation and Sequencing (RNA-Seq) on the Illumina platform, allowing the analysis of gene expression and alternate mRNA splicing in addition to the identification of somatic and germ line variants.

1.6.5  Recurrent Variations

The identification of disruptive variations across a population of cancer cells provides an opportunity to locate genes that are likely part of the genetic variations involved in the development of cancer. However, the effect of each variant observed can be difficult to assess, and thus some assumptions can be used to simplify the list of potentially disruptive variations. Considering only Single Nucleotide Variations (SNVs), the first assumption is that non-synonymous variations will be more disruptive than synonymous variations. Non-synonymous variations include both missense mutations, in which an amino acid of the protein product is altered, as well as truncating mutations (also known as nonsense mutations), in which premature stop codons are the result of the altered coding sequence, causing the protein to be terminated in an incomplete form. These forms of variations have the potential to alter the behaviour of any protein produced by changing the conformation of active sites, binding domains or other integral components of the protein’s structure, or by preventing the expression of components of the protein’s structure. In contrast, synonymous variations are able to influence the expression of proteins through codon bias effects, but do not alter the final behaviour of the protein.[130] A second assumption is that the variations of greatest interest are those that are likely to be somatic rather than germline. This assumption is based on the concept that cancers are heterogeneous in composition, and those cancer cells that are able to compete the best will be the most predominant in the population.
This suggests that positive selection for the most oncogenic traits is constantly present and that the variations that provide the best adaptations will be those that were spontaneously incurred in the cancer, but not present in the germ line. Of course, this assumption does not apply to many predisposing factors involved in cancer, such as the breast cancer 1, early onset (BRCA1) and breast cancer 2, early onset (BRCA2) genes, whose suppression of DNA repair through germline mutations can assist cells in acquiring novel somatic mutations and promoting cancer growth.[131] Searching for somatic mutations is likely to miss predisposing oncogenes, but will instead yield those that assist cancers in gaining more aggressive oncogenic properties. The final assumption in the search for recurrent variations is that recurrent variations are more likely to be cancer-related than non-recurrent variations. This model is based upon the idea that somatic mutations occurring frequently in cancer cells, but not present in healthy cell lines, are probably enriched for proteins that play key roles in the normal healthy functioning of a cell. This model will frequently yield targets such as TP53, which is a common target for disruption in a variety of cancers. However, TP53 is frequently disrupted because it plays a central role in controlling many elements of cellular growth and replication. For many other pathways, recurrent variation within a single gene is bound to be less common, as some cellular pathways involve multiple signalling partners, each of which can block or augment signals. Thus, recurrence of variations in a single gene may work well for some pathways, but is much more challenging to detect as the pathways become longer or more complex.

1.6.6  Purpose

This project was undertaken to identify variations occurring in breast cancer cell lines with the goal of utilizing the information to reposition drugs to identify novel treatments for cancers.
Chapter 4 describes the results of these investigations, highlighting novel and interesting findings.

1.7  1.7.1  Notch Genes, Strawberry Notch and the Epidermal Growth Factor Pathways

The growth in next-generation sequencing platforms has ushered in a significant expansion in the number of unbiased genome wide studies undertaken, which attempt to identify genomic variations that associate with specific phenotypes. While many studies have successfully identified SNVs that may be causative to diseases or traits, others have failed to identify any novel targets.[132–134] A broader view can also encompass a search for variations that affect a common gene across the population under investigation. Such events are common in cancer, where the events that give rise to the disabling of a gene (e.g. disabling TP53 or Rat sarcoma (RAS) family proteins) are spontaneous events, as opposed to pre-disposing factors in tumour suppressors (e.g. familial BRCA1 and BRCA2 mutations). However, a common issue faced in cancer research is the broad range of mechanisms available for a cell to disrupt any given pathway or specific cellular function. Signalling pathways, for instance, are particularly vulnerable to this problem, as they frequently contain cell-type specific components and the signal cascades often use large numbers of proteins to relay signals from the cell surface to a vast number of activation targets. Thus, it is often necessary to consider a number of other factors, such as expression levels and known interaction networks, in order to move beyond simple recurrence of variations in individual genes and find the bigger picture.[135, 136] To this end, we have selected one gene from the set of genes with recurrent high confidence non-synonymous mutations, described in chapter 4, that appeared somatically in two different cell lines with a matched normal, as well as in one additional cell line without a matched normal.
This gene, strawberry notch homologue 1, became the focus of a literature search for its interacting partners and pathways, such that a pathway enriched for gene expression changes and variations in nucleotide sequences could be identified. It is important to consider that SNVs are only one of many types of variations, and the effect of the variations can range from the imperceptible to dramatic alterations of phenotype. Each genomic deviation must be considered in the context of its effects, its ability to alter expression and the environment.[137] In this experiment, we attempt to provide context through the combination of gene expression and known published gene-gene and gene-pathway interactions to identify a possible mechanism by which cancers encourage their own rapid growth and out-of-control expansion.

1.7.2  Notch Signalling

Notch signalling is known to play a role in many different types of cancers and is a key upstream signal involved in a myriad of cellular processes. Most notably, Notch signalling has long been known to be involved in cell survival and proliferation.[138, 139] The Notch pathways are also highly conserved, including a high level of homology between humans and Drosophila, where the four similar Notch proteins were first characterized. The Notch proteins themselves have been the subject of much research, and their function at the start of signalling cascades is reasonably well understood.
Through their downstream interaction partners, it is also clear that the Notch genes play a significant role in cancers.[140] It has even been proposed that Notch-1 could play a significant role in the induction of epithelial-mesenchymal transitions (EMT), in a manner consistent with cancer stem cell phenotypes in pancreatic cancer cells.[141] In mice, both Notch-1 and Notch-4 have been implicated in the development of breast cancer via the murine mammary tumour virus, while Notch-2 and -3 have been implicated in pre-T-cell acute lymphoblastic leukemias (T-ALL).[142, 143] In humans, Notch-1 and Notch-3 have both been implicated in breast cancer.[144–146] Signalling through the notch 3 (NOTCH3) proteins has also been linked to ovarian cancer, where it is known to directly regulate the pre-B-cell leukemia homeobox 1 (PBX1) gene.[147] The same relationship was also demonstrated in the breast cancer cell line MCF7, and is thus not likely specific to ovarian cancer.[147]

Chapter 2 ChIP-Seq

Epigenetics: the study of heritable changes in gene function that do not involve changes in DNA sequence.
- Merriam-Webster Dictionary

2.1  FindPeaks

FindPeaks is a “peak finder” application, a piece of software designed to locate areas of increased mapping density where sequenced reads align to the reference genome in increased numbers relative to other locations. This class of application was originally employed as a means of interpreting sequencing data obtained from Chromatin Immunoprecipitation and Sequencing (ChIP-Seq) experiments, but has broad applications for other forms of Next Generation Sequencing (NGS) data. For instance, it can be used with Ribonucleic Acid (RNA) Isolation and Sequencing (RNA-Seq) data to identify expressed regions or with exome capture to evaluate coverage at probes.
This chapter deals with some of the key issues involved in designing a peak finder application for ChIP-Seq and describes aspects of the implementation of the FindPeaks application.

2.2  Paired End Tag versus Single End Tag ChIP-Seq

One ongoing discussion in ChIP-Seq studies is the value of Single End Tag (SET) versus Paired End Tag (PET) library creation. When performing SET sequencing, only one end of a Deoxyribonucleic Acid (DNA) fragment is sequenced, while PET sequencing includes a step where the fragments are reversed and sequenced from the opposite end, bracketing a centre portion of the fragment that has not been sequenced. With SET, the cost of reagents required for library production is approximately half that of PET, although only one end of each fragment will be directly observed. Since the goal of ChIP-Seq is to map each read back to the genome to identify the exact region in which a given protein was bound, not knowing where each fragment terminates would appear to be a major obstacle. As well, with the existence of PET protocols, it would seem that PET ChIP-Seq would be the ideal solution. In practice, however, the use of PET ChIP-Seq is less common: at nearly twice the cost, it actually adds very little new information. While a small percentage of un-mappable reads can be mapped when PET protocols are used, there are many ways in which SET ChIP-Seq can compensate for this difference in information. The foremost is the use of “extended reads”, in which the observed reads are extended to a length that represents the most likely size of the original fragment from which the sequence was obtained (excluding sequencing adaptors). For most ChIP-Seq experiments, the fragment sizes are selected during the gel extraction phase of the protocol. Depending on the width of the band extracted from the gel after electrophoresis, the distribution of fragments can form a range with a typical size distribution of 100-300 bp.
With PET reads, both ends of fragments can be aligned to the genome, giving the exact size; however, SET reads generate sequenced reads shorter than the original length.

2.3  Read Length Modelling

If the length of the fragment from which a sequence read was obtained is not known (e.g. SET sequencing was performed), it is possible to estimate the actual fragment length in order to better approximate the Areas of Enrichment (AOE) under investigation. There are many different ways in which this can be done (see figure 2.1), and the method chosen is important as it can dramatically change the results obtained. For ChIP-Seq done with PET reads, this step is not required.

[Figure 2.1 panels. Single End Tag (SET) fragment: dist_type 0 (fixed xset), coverage is a fixed, unweighted length for all reads; dist_type 1 (weighted), contribution of bases is weighted according to distance from start of fragment; dist_type 3 (native/PET), coverage is identical to length of sequence in fragment. Paired End Tag (PET) fragment: dist_type 3 (native/PET).]

Figure 2.1: Read modelling distributions available in the FindPeaks software package. These models describe how individual sequencing reads are weighted to reconstruct the most likely binding location in ChIP-Seq experiments. Some models (such as the native/PET model) may also be used for other purposes, such as plotting chromosome coverage for any read-based data set.

2.3.1  Native Lengths - No Extension

DNA fragments that have reads covering both ends (e.g. PET) can be collected for ChIP-Seq experiments in a process known as Chromatin Immunoprecipitation (ChIP)-PET.[148] These reads are easily treated by simply assuming full coverage between the fragment’s genomic start and end coordinates. As the read length of the NGS platforms has increased, the use of PET reads to more accurately identify binding sites has become a much less acute issue.
However, as the cost and sequencing time of platforms continue to decrease, there may be a tipping point at which ChIP-PET becomes more popular. The native lengths approach, however, is not limited to PET reads, and can be used for SET reads as well. Using this mode in FindPeaks gives rise to a coverage map that describes the depth at each base sequenced. This is not typically done for SET ChIP-Seq experiments as it gives rise to partial peaks that fail to adequately describe the binding zones of proteins of interest, but is commonly used for generating coverage maps for RNA-Seq, exome capture or other methods where coverage depth can be variable.

2.3.2  Hard Extension

A naïve but relatively effective mode for extending SET reads is to utilize the average of the original fragment length distribution. In such a case, basic knowledge of the estimated mean of the size selection used to generate the fragments provides a simple guideline for the extension length. By extending all reads to this length, the coverage map can generally indicate the original location of the binding event; however, if the distribution of the fragments was not tightly centred around the extension value used, it can lead to significant over- and underestimations of binding positions. If the distribution of fragments is not known (e.g. through an Agilent trace or other size profiling), then this method provides a reasonable approximation. If the distribution of sizes is known, that information can be used to improve the coverage maps generated. This method is the default for nearly all peak callers, and is nearly always the first method implemented when developing a basic method for modelling Areas of Enrichment of any type of NGS data.
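The hard extension described above can be illustrated with a minimal Python sketch. It is not taken from FindPeaks: the function names, the tuple layout of a read and the handling of reverse-strand reads (extending towards lower coordinates from the rightmost aligned base) are assumptions made purely for illustration.

```python
from collections import defaultdict


def extend_read(start, read_len, strand, fragment_len):
    """Return the half-open interval [begin, end) covered by one extended read.

    Assumption: forward-strand reads extend to the right of their start,
    reverse-strand reads extend to the left of their rightmost aligned base.
    """
    if strand == "+":
        return start, start + fragment_len
    end = start + read_len
    return max(0, end - fragment_len), end


def hard_extension_coverage(reads, fragment_len):
    """Build a per-base coverage map where every read counts as 1 over a fixed length."""
    coverage = defaultdict(int)
    for start, read_len, strand in reads:
        begin, end = extend_read(start, read_len, strand, fragment_len)
        for pos in range(begin, end):
            coverage[pos] += 1
    return coverage


# Two hypothetical 36 bp reads on opposite strands, extended to 200 bp.
reads = [(100, 36, "+"), (250, 36, "-")]
cov = hard_extension_coverage(reads, fragment_len=200)
print(max(cov.values()))
```

A dictionary keyed by position keeps the sketch short; a real peak finder working on whole chromosomes would more plausibly use arrays or process sorted reads in a streaming fashion.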
2.3.3  Triangle Distribution

The Triangle Distribution is a simple mechanism, first implemented by FindPeaks, employed to take advantage of empirical knowledge of the size distribution of fragments obtained during the size selection step of the ChIP-Seq protocol. If the minimum and maximum fragment sizes are known, then it is useful to set these as the boundary conditions on fragment extension. However, since it is not possible to know which fragments would require which extension, the best solution is to utilize this information to weight the read extensions. Unlike the two previous distributions, this method does not assume a binary (one or zero) coverage for bases that fall between the start and end of the covered region. Instead, it assumes full coverage (a value of 1) for each read between the start coordinate and the minimum fragment length, and then utilizes non-integer values from that position until the maximum fragment length is reached. This scale can be smooth (e.g. a constant slope in the decrease of values between the minimum and maximum fragment lengths) or, if a median value for the distribution is known that does not coincide with the midpoint of the minimum and maximum fragment length, it can be used to alter the weighting (and thus the slope of the decrease) to reflect the shift in the distribution (see figure 2.2).

[Figure 2.2 axes: weight (0 to 1) against read length, with the minimum, median and maximum read lengths marked.]

Figure 2.2: The triangle distribution weights the extensions according to the empirical distribution of the reads. The minimum read length should be set at the minimum length of the fragments sequenced, the maximum length should be set at the maximum length of the fragments sequenced, and the median should reflect the median fragment length of the reads sequenced.
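The weighting just described can be sketched as a small Python function. This is an illustrative reading of the text rather than the FindPeaks implementation: the function name and arguments are hypothetical, and the choice to pass through a weight of 0.5 at the median is an assumption, since the text only states that the median alters the slope.

```python
def triangle_weight(offset, min_len, max_len, median_len=None):
    """Weight contributed by a base `offset` bases from the start of the read.

    Bases closer than min_len get full weight, bases at or beyond max_len get
    none, and bases in between are weighted by a linear ramp.
    Assumption: when a median is given, the ramp passes through 0.5 there.
    """
    if offset < min_len:
        return 1.0
    if offset >= max_len:
        return 0.0
    if median_len is None:
        # Constant slope between the minimum and maximum fragment lengths.
        return (max_len - offset) / (max_len - min_len)
    if offset <= median_len:
        return 1.0 - 0.5 * (offset - min_len) / (median_len - min_len)
    return 0.5 * (max_len - offset) / (max_len - median_len)


# Hypothetical size selection of 100-300 bp.
print(triangle_weight(50, 100, 300))    # inside the minimum fragment length
print(triangle_weight(200, 100, 300))   # midpoint of the constant slope
print(triangle_weight(125, 100, 300, median_len=150))
```

Summing these weights over all reads, instead of adding 1 per covered base, gives the non-integer coverage map that the triangle distribution produces.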
2.3.4  Read Shifting

This method is not included in FindPeaks, but is a popular alternate model used by the Model-based Analysis for ChIP-Seq (MACS) and USeq software packages.[14, 149] Instead of attempting to accurately reconstruct the fragment’s coverage, a fixed extension is made to each read, similar to that described above, followed by a modification to the coordinates of the read or peak by a pre-determined number of base pairs. If the extension size is smaller than the one used by FindPeaks, and the shift is of a small number of base pairs, the net effect is equivalent to performing an extension of the correct length, and then trimming back the reads on both sides. This effectively assumes that the in vivo protein-DNA binding site occurred close to the midpoint of the original fragment at the time of the pull down, which simplifies the calculations performed, but is not necessarily supported by the biology behind the ChIP-Seq method.

2.4  Peak Calling

Peak calling is the process of utilizing a coverage map to identify AOE. This is most frequently done by generating a coverage map utilizing one of the read length extension methods outlined above, followed by either a binning-based algorithm (i.e. using less than single base resolution by dividing a chromosome into regions) or a scan for segments of the chromosome accruing higher coverage. FindPeaks uses the latter, as the coverage generated by the sequencing is at base-pair resolution and binning would cause a loss of resolution. FindPeaks’ basic peak calling begins by using pre-sorted reads (either sorted by the aligner (e.g. Burrows-Wheeler Aligner (BWA)), file processor (e.g. SAMTools) or external sort (e.g. SortReads.jar provided with FindPeaks and the Vancouver Short Read Analysis Package (VSRAP))). FindPeaks processes the reads in a chromosome-by-chromosome manner, utilizing the aligned coordinates of the reads to generate the coverage map.
As the reads are pre-sorted by ascending coordinates, each portion of the map can be independently processed as gaps in the coverage are identified. FindPeaks then stores the coverage map of the gap-delineated region, as well as summary information including start and end coordinates and the maximum depth of coverage for the region. The maps can then be used as the full list of potential peaks, and the summary information can be used to perform population based analysis across the full set of peaks, or on a by-chromosome basis.

2.4.1  Trimming Peaks

Peaks produced using the standard pipeline have three defining characteristics: a start coordinate, an end coordinate and the location of the peak’s maximum density. The start and end coordinates, by default, delineate a region of contiguous coverage. This can be useful in some situations, but is often far from ideal for identifying the core region of highest density. However, FindPeaks implements a peak trimming function that allows the user to determine what fraction of the peak is of most interest. Peak trimming is implemented in two parts: the first trims from the left side of the peak, the second trims from the right side of the peak. In both cases, the algorithm utilizes the maximum height of the peak, multiplied by the user-supplied trim value, to identify the threshold at which the peak should be trimmed. Once this value is determined, the algorithm can start at the location of each peak maximum, and walk outwards until the threshold value is greater than the coverage reported at that position. (See figure 2.3.)

[Figure 2.3 annotations: -trim values of 1.0, 0.6 and 0.2 are marked against the peak height. Trim begins at the maximum of the peak and walks outwards in both directions until the fraction of the maximum height indicated by the trim value is reached. The example shown is for -trim 0.40. Once the height is found, the peak is truncated, setting the height across the shoulder of the peak to zero.]

Figure 2.3: The trim algorithm can be applied to any peak maximum.
The algorithm begins at each maximum, walking outwards until the fraction of the peak set by the “-trim” parameter is reached. If the parameter is set at 0.40, then the algorithm will truncate the peak on either side at the first value that reaches 0.40 times the value of the maximum.

2.4.2  Peak Separation

Beyond peak trimming, a further complication can be encountered when two peaks infringe upon each other, creating one region of contiguous coverage with two separate peaks. To solve this problem, peak separation algorithms were implemented in FindPeaks.

[Figure 2.4 annotations: -subpeaks values of 1.0, 0.6 and 0.2 are marked against the heights of Peak A and Peak B. These two peaks would be considered separate at -subpeaks 0.6, but not for -subpeaks 0.2. Subpeaks uses the shorter of the two peaks in question to determine the fraction of peak height to which the valley must dip in order to determine whether they should be separated.]

Figure 2.4: Areas of Enrichment (AOE) that contain two or more peaks may be fragmented into smaller regions utilizing the Subpeak method of the FindPeaks package. This method identifies each peak across the AOE, then subsequently searches for valley regions between them that justify separating the two peaks into distinct regions. The minimum relative depth of the valley is a user-selectable parameter set on the command line.

The method used for peak separation or calculating “subpeaks” begins by collecting all sequence reads that overlap in an Area of Enrichment (AOE) and summing their weights at each position. All local maxima within a single region of contiguous coverage are identified and collected into an array in sequential order. The array containing the local maxima is then inspected in a local pair-wise manner in which each set of nearest neighbours is identified. The heights of each pair of maxima are then compared, and the lowest value is taken.
This value is then multiplied by the float provided with the -subpeaks flag to yield the minimum valley depth required to classify the two peaks as distinct peaks. The intervening AOE between the two local maxima is then searched for values that are lower than the minimum valley depth. If found, the two peaks are then separated, with a single base pair gap, corresponding to the deepest local minimum separating the two maxima. If a value lower than the minimum valley depth is not found, the lower of the two peaks is removed from the array of local maxima, and will not appear as a separate peak in the peaks file, and may not appear in the output file. For example, a subpeak value of 0.2 will separate only those two peaks with very deep valleys (a dip of 80% of the height of the lower of the two peaks). A subpeak value of 0.8 will catch shallow valleys (a >20% dip in height between two peaks). (See figure 2.4.)

2.5  False Discovery Rates and ChIP-Seq Controls

A significant issue in ChIP-Seq experiments is the quantification of false positives and the estimation of error rates in AOE. There are many different sources of error that can cause apparent AOE to occur, casting doubt on the results obtained. Thus, one of the most important factors for consideration is the use of an appropriate control to reduce false positive results.

2.5.1  Sources of Error

There are a great many potential sources of error in ChIP-Seq, from the biological to the computational.[150] At each step, it is possible to inject noise into the results, causing both false negatives and false positives.

High Quality Antibody: The success or failure of a ChIP-Seq experiment often rides upon the quality of the antibody being used in the assay. Antibodies with poor specificity for their target can bind to other proteins, causing those proteins to be pulled down along with the target and producing erroneous peaks, as DNA not specific to the target may be pulled along to contaminate the sequencing results.
Cross-linking: The first step in a ChIP-Seq experiment is the cross-linking step, in which DNA-associated proteins are cross-linked to the nucleic acids. This step is done without regard for tertiary interactions, which results in cross-linking of any proteins in physical contact with the DNA. Thus, cross-linking of cellular components occurs even for transient interactions and DNA in extended configurations. Additionally, loops with proteins that are not associated through binding domains or bound through intermediate proteins may also be captured. Many such biological factors may, in fact, cause secondary interaction driven peaks that would be indistinguishable from other primary interactions. Furthermore, it is also possible that the cross-linking steps may cause proteins to be inaccessible to antibody mediated pull downs by obscuring the target site of the antibody used.

Base calling: Sequencing and the immediate downstream processes of the sequencing procedure can be relatively error prone, depending on the sequencing platform used and the software used to process the results. Mistakes that occur during these steps will cause mis-alignments, non-alignments and reads that appear as noise. Biases in any of the preparation, sequencing or post processing steps will be reflected in false positive and false negative AOE.

Alignment: The most challenging step in the computation of ChIP-Seq experiments is the correct assignment of sequenced DNA fragments to the reference genome. This allows a great many errors to appear, as poorly aligned reads and incorrectly aligned reads will both introduce novel but incorrect AOE. Furthermore, reference genomes may not accurately reflect the genome of the subject being studied (e.g. cell-lines frequently show major departures from the reference genome of the organism sequenced). This can further increase the number of incorrectly assigned reads.
Finally, genomes enriched for duplications can often cause a significant loss of information, as aligners are generally unable to correctly map reads that would align to more than one location in the reference genome.[151]

Size selection: Although unlikely to be a direct source of false peaks, the size selection protocol can be used to select the appropriate DNA fragment sizes pulled down with the immunoprecipitation, an important parameter for reconstructing peaks. If the fragment sizes pulled down come from a wide distribution, it can be a greater challenge to determine the original binding sites of the proteins accurately. Thus, it is important to utilize the correct fragment size distribution and to have access to accurate details of the distribution for accurate peak construction.

Peak calling: Most peak callers are sufficiently well tested that they do not inject error at a noticeable rate into ChIP-Seq peak calls. However, the nature of the peaks often requires that parameters be well tuned in order to provide reliable peak calls. This can require peak trimming or sub-peaking to obtain an appropriate level of accuracy. Utilizing peak calling parameters incorrectly will likely result in subtle changes to the False Discovery Rate (FDR) that are not obvious and may not be noticed by the biologist using the software.[152]

2.5.2  Simulated Control - Monte Carlo

Monte Carlo algorithms are based upon the use of random (or quasi-random) events that, obeying a set of rules, give rise to distributions that reflect the rules imposed. A common example is a playing field covered with evenly spaced parallel lines and a needle whose length matches the spacing of the lines. By repeatedly dropping the needle onto the playing field and counting how often it lands crossing a line, one observes a crossing frequency that converges towards 2/π, from which π can be estimated.
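
The needle-dropping example can be made concrete with a short simulation. The following is only an illustrative sketch in Python, not code from FindPeaks; the function names, the fixed seed and the unit spacing are assumptions chosen for the illustration.

```python
import math
import random


def buffon_crossing_fraction(n_drops, spacing=1.0, needle=1.0, seed=0):
    """Return the fraction of simulated needle drops that cross a line.

    Hypothetical helper: lines are `spacing` apart and the needle has
    length `needle` (equal to the spacing, as in the text).
    """
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_drops):
        # Distance from the needle centre to the nearest line.
        centre = rng.uniform(0.0, spacing / 2.0)
        # Acute angle between the needle and the lines. Using math.pi here
        # is a convenience of the sketch, not part of the physical experiment.
        angle = rng.uniform(0.0, math.pi / 2.0)
        if centre <= (needle / 2.0) * math.sin(angle):
            crossings += 1
    return crossings / n_drops


def estimate_pi(n_drops, seed=0):
    """Estimate pi from the crossing fraction, which tends towards 2/pi."""
    return 2.0 / buffon_crossing_fraction(n_drops, seed=seed)
```

Calling `estimate_pi` with a few hundred thousand drops gives a value close to π. The precision improves only slowly with the number of drops, which is typical of Monte Carlo estimates.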
The same concept has been applied to ChIP-Seq experiments, where a given number of reads (matching the total number of reads in a sample distribution) are assigned to locations randomly across a genome of given size. By performing peak calling on this randomly assigned distribution of reads, one can approximate a background distribution of peak heights against which the sample peak heights can be compared. Unfortunately, the distribution of peak heights obtained using a Monte Carlo algorithm does not accurately reflect the distribution of peak heights obtained from a ChIP-Seq control. Most controls for ChIP-Seq more closely resemble a power-law distribution, further demonstrating that results generated by standard Monte Carlo methods are not ideal for use as a control.[153]

2.5.3  Simulated Control - Lander-Waterman

A further refinement one can perform over the basic Monte Carlo algorithm is to employ a Lander-Waterman correction. This method utilizes the distribution of peaks with heights of 1 and 2 (i.e. those peaks that are most numerous in the Monte Carlo distribution) to estimate the distribution of peaks with larger heights. While this has the ability to correct the distribution of peaks at the tail end, it still fails to compensate for many of the other sources of error and systematic bias. Also, the use of the Lander-Waterman algorithm only provides a correction for the distribution of the peaks of greater height, and does not guarantee that the distribution mimics that of a proper ChIP-Seq control.

2.5.4  Minimal Biological Control - Null Immunoprecipitation Control

A more appropriate control for many ChIP-Seq experiments is the use of a null immunoprecipitation control, a biological replicate in which the immunoprecipitation step is performed without the use of an antibody.
This provides an excellent minimal control for most circumstances, indicating the locations of AOE caused by systematic errors, alignment errors and biological effects. It can be used to identify peaks that are artefactual, as well as to provide a sample distribution for the background noise. It provides a better control than a randomly placed set of reads, both for the distribution of peak sizes and for specific false-positive peaks.

2.5.5  Biological Control

Many experiments require specific control samples, where the change between two biological conditions is the defining factor of the ChIP-Seq experiment. This can be a before and after set, where a starting and ending condition are being compared, or a series of conditions, where one of them is the “ground” condition. Assuming the same protocol was used on the samples treated under both conditions, platform biases and other non-random sources of error should be similar or identical in both sets of results collected, making them relatively simple to filter out.

2.6  Comparing ChIP-Seq Experiments

ChIP-Seq experiments are able to shed the most light on genomic protein-DNA binding events when they can be compared to a second state. Thus, a control experiment is often used, comprising a null-IP, a second time point or another biological condition. However, comparison between two samples can often be a difficult process with several confounding factors. For instance, the two samples, or the sample and the control, rarely yield the same number of reads, giving rise to the need for normalization. Furthermore, even once the normalization is completed, it can still be a challenge to determine which AOE are of importance. FindPeaks provides a method for performing these comparisons; however, the methods presented apply to only two samples and must be used in a pair-wise manner if more than two samples are being compared.

2.6.1  Normalization of ChIP-Seq Results

ChIP-Seq analysis can often be confounded by several sampling and enrichment issues, where the number of sequence reads obtained may vary dramatically between collected samples. This can be particularly challenging when comparing two samples in which the number of peaks is expected to differ widely, e.g. comparing binding of the signal transducer and activator of transcription 1 (STAT1) transcription factor in HeLa cells with and without interferon-γ stimulation.[65]

Normalization by Number of Tags

The most common method of normalizing ChIP-Seq data, used by MACS [14] and other ChIP-Seq software, is based on the total number of successfully aligned reads. This method provides a simple means of reducing bias caused by comparisons using grossly unequal numbers of reads, but assumes that the reads are aligned into roughly equivalent peaks between the two samples. This can be a valid assumption when comparing closely related systems, such as the binding of a transcription factor in two separate cell lines, or a biological replicate. However, returning to the example above of STAT1 in an interferon-γ stimulated and an unstimulated cell line, the number of peaks is significantly different, as is the strength of the signals between the two samples. Thus, normalization by read number would tend to overstate the peaks in the unstimulated cell line, where less binding occurs proportional to the number of unique captured reads, while understating the peaks in the stimulated sample, where enhanced binding is taking place, generating more unique reads. To illustrate this point, one could consider binding sites upstream of three genes, A, B and C.
In front of genes A and C, the binding site is constitutively active, producing a low, steady number of reads, for instance 10 each, while the binding site in front of gene B is only activated in the presence of some stimulus, providing zero reads in the unstimulated cells, for a total read count in the region of 20. In the presence of the stimulating agent, genes A and C are unaffected, but the binding site in front of gene B is activated, producing a further 20 reads, for a total of 40 reads in the stimulated sample. If normalizing by the number of reads, the binding sites at genes A and C would each account for 50% of the reads (10/20) in the unstimulated cell line, but only 25% (10/40) in the stimulated cell line. They would thus appear only half as active in the stimulated sample, despite being bound at the same level under both circumstances. (See figure 2.5.)

Figure 2.5: An example of two experiments to be normalized against each other. (Panels: coverage plotted against chromosome region for the sample, with peaks A, B and C, and for the control, with peaks A', C' and D'.) If the first experiment (sample) represents a collection of cells stimulated by some condition, causing the new event (B) to occur, and the second experiment (control) represents a similar collection of cells that have not been stimulated, then normalizing by the total number of reads aligned, (A+B+C)/(A'+C'+D') = normalization constant, will result in all peaks exhibiting skewed characteristics between the two samples. Because peaks observed in both the sample and the control (e.g. A, A' and C, C') remain unaltered between the two samples, they can be used as the basis for normalizing the two samples. This forms the basis of the normalization by equivalent peaks method and suggests that normalization by total number of tags is not appropriate in all conditions.

2.6.2  Normalization by Equivalent Peaks

An alternative method, implemented in FindPeaks 4.0, is based on equivalent peaks.
For a given chromosome, the two samples (A and B) are independently analyzed to identify the full set of peaks (P) for each sample, yielding two sets, $P_A$ and $P_B$. Using a symmetrical peak pairing technique, the intersection of the two sets is obtained ($P_{intersection} = P_A \cap P_B$), which is then used to collect an array of peak heights $H(P)$ for the subset of overlapping peaks. Thus,

\[ H_{AB}(P_{intersection}) = [(H(P_{A1}), H(P_{B1})), (H(P_{A2}), H(P_{B2})), (H(P_{A3}), H(P_{B3})), \ldots] \]

The matrix $H_{AB}(P_{intersection})$ can then be treated as a simple correlation problem, and a regression line best fitting the two samples can be obtained, the slope of which can be used as the normalization factor.

Regression of $H_{AB}(P_{intersection})$

One obstacle when normalizing on the $H_{AB}(P_{intersection})$ matrix is the need for symmetrical results. When comparing sample A to B, it is often desirable to obtain the same results as when comparing sample B to A. A normal linear regression does not have this property, as it minimizes the distance from the regression line along only one axis. Thus, the order of the samples can dramatically influence the results obtained from the standard linear regression algorithm. To reduce the effect of sample ordering, FindPeaks 4.0 implements a “Perpendicular Linear Regression” algorithm, in which the distances perpendicular to the regression line are minimized to obtain a best fit through the points in $H_{AB}(P_{intersection})$.
\[ \bar{x} = \frac{\sum_{i=1}^{n} H(P_{Ai})}{n} \tag{2.1} \]

\[ \bar{y} = \frac{\sum_{i=1}^{n} H(P_{Bi})}{n} \tag{2.2} \]

\[ \mathrm{slope} = -B \pm \sqrt{B^{2} + 1}, \qquad B = \frac{1}{2} \cdot \frac{\left[ \sum_{i=1}^{n} H(P_{Bi})^{2} - (n \cdot \bar{y}^{2}) \right] - \left[ \sum_{i=1}^{n} H(P_{Ai})^{2} - (n \cdot \bar{x}^{2}) \right]}{(n \cdot \bar{x} \cdot \bar{y}) - \sum_{i=1}^{n} \left( H(P_{Ai}) \cdot H(P_{Bi}) \right)} \tag{2.3} \]

\[ \mathrm{intercept} = \frac{\sum_{i=1}^{n} H(P_{Bi})}{n} - \frac{\sum_{i=1}^{n} H(P_{Ai})}{n} \cdot \mathrm{slope} \tag{2.4} \]

By using the best fit line through the paired heights of the peaks present in both sample A and B, only those reads present in both experiments are used to determine the normalization factor. This is less likely to cause erroneous normalization effects when new binding sites become activated under different environmental conditions.

2.6.3  Limitation of Normalization by Equivalent Peaks

The normalization by equivalent peak method is limited by the necessity of the two samples to share peaks in common. Thus, this method may not work well when null controls are used. However, depending on the protocol used for generating the null control, any artefactual peaks obtained in the control are likely to appear in the sample as well, which may allow this method to be extended to some null controls on a case-by-case basis. Further testing would be required to determine the limitations of null controls when using this approach.

2.6.4  Statistics

By utilizing the slope and intercept of the best fit line, the distance from the normalization line to any point $(H(P_{Ai}), H(P_{Bi}))$ can be determined.

\[ d(P_{Ai}, P_{Bi}) = \frac{(\mathrm{slope} \cdot H(P_{Ai})) - H(P_{Bi}) + \mathrm{intercept}}{\sqrt{\mathrm{slope}^{2} + 1}} \tag{2.5} \]

Using the distribution of the peaks around the normalization line, a standard deviation can be calculated:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} d(P_{Ai}, P_{Bi})^{2}}{n - 2}} \tag{2.6} \]

This can then be used for further sampling and evaluating the distribution of other points (peaks found in only one sample) relative to the peaks observed in both samples.
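
Equations 2.1 to 2.6 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the FindPeaks implementation (FindPeaks itself is written in Java): the function names are hypothetical, and the choice between the two ± roots of equation 2.3 is made here by keeping whichever root gives the smaller sum of squared perpendicular distances, which the text does not specify.

```python
import math


def perpendicular_distance(height_a, height_b, slope, intercept):
    """Signed perpendicular distance of a point from the line (eq. 2.5)."""
    return (slope * height_a - height_b + intercept) / math.sqrt(slope ** 2 + 1.0)


def perpendicular_regression(heights_a, heights_b):
    """Fit a line by minimizing perpendicular offsets (eqs. 2.1 to 2.4).

    Takes the paired heights of peaks found in both samples and
    returns (slope, intercept).
    """
    n = len(heights_a)
    if n != len(heights_b) or n < 3:
        raise ValueError("need at least three paired peak heights")
    x_bar = sum(heights_a) / n  # eq. 2.1
    y_bar = sum(heights_b) / n  # eq. 2.2
    s_xx = sum(x * x for x in heights_a) - n * x_bar ** 2
    s_yy = sum(y * y for y in heights_b) - n * y_bar ** 2
    cross = n * x_bar * y_bar - sum(x * y for x, y in zip(heights_a, heights_b))
    if cross == 0:
        raise ValueError("heights are uncorrelated; the slope is undefined")
    b_term = 0.5 * (s_yy - s_xx) / cross  # B in eq. 2.3
    root = math.sqrt(b_term ** 2 + 1.0)

    def squared_offsets(slope):
        intercept = y_bar - slope * x_bar  # eq. 2.4
        return sum(
            perpendicular_distance(x, y, slope, intercept) ** 2
            for x, y in zip(heights_a, heights_b)
        )

    # Assumption of this sketch: of the two +/- roots, keep the better fit.
    slope = min((-b_term + root, -b_term - root), key=squared_offsets)
    return slope, y_bar - slope * x_bar


def residual_sigma(heights_a, heights_b, slope, intercept):
    """Standard deviation of the points around the line (eq. 2.6)."""
    total = sum(
        perpendicular_distance(x, y, slope, intercept) ** 2
        for x, y in zip(heights_a, heights_b)
    )
    return math.sqrt(total / (len(heights_a) - 2))
```

For paired heights that lie exactly on a line, such as (1, 3), (2, 5), (3, 7) and (4, 9), the sketch recovers a slope of 2 and an intercept of 1 with a sigma of zero. Swapping the two samples yields the reciprocal slope, which is the symmetry property motivating the perpendicular fit.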

2.6.5  Post-Normalization Processing

Once the normalization is completed, the $P_{intersection}$ data set is no longer sufficient, and the union data set, $P_{union} = P_A \cup P_B$, is required instead. For a given peak $i$ in the $P_{intersection}$ data set, peak heights are available for both $H(P_{Ai})$ and $H(P_{Bi})$. However, a peak that is not in the $P_{intersection}$ set will have only one peak height. To correct for this, the peak boundaries of a peak identified in one sample can be mapped to the other sample, and the greatest read depth within that region can be used, giving a paired set of heights for each peak in $P_{union}$.

2.7  Analysis of Normalized Samples

There are two methods used in FindPeaks 4.0 to analyze matched peaks. The first is based on the distribution of the points relative to the normalization line, whereas the second is based upon the concept of equivalent areas.

2.7.1  Comparison by Ratio - Method of Perpendicular Lines

This method is best used for comparing two similar data sets, e.g. comparing RNA-Seq (WTSS) data sets that share much in common. It works best when there are many shared peaks between the two samples. Examples of data sets for which this control method is anticipated to work well:

• RNA-Seq from a sample with a drug treatment, and one without a drug treatment.
• Comparing RNA-Seq from matched normal and cancer cell lines.
• Comparing differential transcription factor binding with and without an activator.

Step 1: Peak Calling

Peak calling is done on both the control and the sample data sets, using the parameters supplied on the command line (e.g. subpeaks, trim, etc.).

Step 2: Peak Pairing

Peaks are matched between the sample and the control using the -window size parameter. Any peak that has a matching peak in both the sample and the control within the user selected window size will be paired. If there are two peaks that fit this description, the closer of the two will be chosen.
If the region contains many peaks (e.g. a complex region separated using subpeaks), then the peaks will be paired such that all of the closest peaks are paired in order of proximity. If no peak maximum is found in the compared sample (e.g. the sample has a peak, but the control does not), then the highest point in the matching sample will be selected as the corresponding value, and the peak pair index will use a -1 value.

Step 3: Symmetrical Regression

To normalize the two samples, a regression analysis based on minimizing the perpendicular distances to the regression line is performed. However, the normalization is performed only on the set of peaks that occurred in both samples. The slope of the line provides the normalization factor. The sample is weighted such that the intercept passes near to the origin, but may deviate from it, depending on the match of the sample and the control.

Step 4: Confidence Interval

The full set of data is then analyzed to calculate the distribution of data points from the regression line, providing a Poisson distribution. This distribution can then be analyzed to determine the sigma value, and thus a t-value may be calculated to determine the confidence interval of choice:

\[ (1 - \alpha) \times 100 = \mathrm{ci} \tag{2.7} \]

A 99% confidence interval corresponds to an α value of 0.01 (“-alpha 0.01”). The t-value for each data point is calculated, and values lower than the selected t-value are accepted.

Step 5: Thresholding

The threshold minimum can be calculated as the intercept of the lower bound confidence interval; however, this does not mean that all points with peak height values above this threshold will be accepted. Only points that are below the lower bound threshold in the control will be accepted. (See figure 2.6.)

Figure 2.6: FindPeaks 4.0 provides the ability to perform comparisons between ChIP-Seq experiments using a symmetrical-linear-regression-based method.
A symmetrical regression line is plotted using only the subset of matched peaks (peaks that are present in both samples), which forms the basis of the normalization ratio. All of the peaks that were found in only one sample can then be added to the graph, and confidence intervals or other methods can be used to identify peaks that are of interest.

2.7.2  Comparison by Equivalent Areas - “Method of Hyperbolic Sections”

The rationale for developing this method was to look for peaks that are outliers in one data set, using an easy to visualize method that exploits the natural shape of the plotted data to identify those peaks that appear distant from the main set of peaks common to both sample and control data sets. Thus, using tools previously developed, such as the peak caller, peak pairing and normalization tools, this method can be used to compare two data sets to identify peaks that warrant further investigation. This control method works well for data sets that share only a small subset of peaks in common, giving a distribution that is too wide for the Method of Perpendicular Lines. Examples of sequencing protocols in which this distribution is commonly found:

• ChIP-Seq with a background DNA control
• ChIP-Seq with null-IP control
• Comparing ChIP-Seq with different target proteins

Step 1: Peak Calling

Peak calling is done on both the control and the sample data sets, using the parameters supplied on the command line (e.g. subpeaks, trim, etc.).

Step 2: Peak Pairing

Peaks are matched between the sample and the control using the -window size parameter. Any peak that has a matching peak in both the sample and the control within the window size will be paired. If there are two peaks that fit this description, the closer of the two will be chosen. If the region contains many peaks (e.g. a complex region separated using subpeaks), then the peaks will be paired such that all of the closest peaks are paired in order of proximity.
If no peak maximum is found in the compared sample (e.g. the sample has a peak, but the control does not), then the highest point in the matching sample will be selected as the corresponding value, and the peak pair index will use a -1 value.

Step 3: Peak Normalization

Unlike most controls, this method does not normalize by the number of tags used in each data set. For “input DNA” controls, the number of tags is clearly not the appropriate normalization factor, as one would expect a different sampling of genomic sites. If the sample “signal” is a significant component of the sample sequences, but not of the control, then the control will be over-sampling control sites for the same number of tags. Thus, this method performs normalization by identifying all peak pairs, and then calculating the sum of all “non-zero” matches (i.e. signal identified in both sample and compare data sets at the same location; see -window size). The ratio of signal/control can then be used to identify the normalization factor (shown as a sloped line on the graphs). This method assumes that the intercept of the data set is 0, thus y = (normalization)x + 0.

Step 4: Hyperbolic Sections and Thresholding

Once the normalization line is determined, the control process overlays a series of hyperbolic sections across the data set, using the normalization line as the asymptote of the hyperbola. This is used to generate a graph indicating the number of peaks found in the control versus the sample, which can then be used to estimate the number of false peaks that would be expected. In this case, the “-alpha” parameter is used to determine the acceptable cutoff, and a float cutoff which best matches the value provided is returned. For instance, if an α of 0.01 is used, the hyperbolic section which best describes a likelihood of finding 1 false peak for every 100 peaks called will be identified.
A threshold that reflects the lowest possible peak height in the sample will be returned. (“Lowest possible peak height” refers to locations where there is no corresponding signal in the control; as the control signal rises, higher values for the sample peak are required, following the contour of the hyperbolic section.)

Figure 2.7: Visualization of the method of comparing samples by hyperbolic sections. (Panels, each plotting peak height in sample 1 against peak height in sample 2: 1. Identify the normalization line. 2. Use hyperbolic sections to determine the appropriate cutoff. 3. Determine the appropriate ratio of expected false positives to set the accepted curve.)

2.8  Example - Extending FindPeaks

FindPeaks is cited as a robust peak finding tool that has the ability to identify regions of interest in genome wide ChIP-Seq studies with high accuracy.[154] However, in some cases, prior knowledge about a particular data set can be utilized to greatly improve upon the identification of potential transcription factor binding sites, effectively filtering false positive peaks through the use of both controls and motif identification.

2.8.1  Method

The Motif Identification for ChIP-Seq Analysis (MICSA) tool is built in Java, incorporating software from both the FindPeaks and the Multiple EM for Motif Elicitation (MEME) software packages to do much of the basic work.[1, 155] It operates in several distinct stages. It begins by running FindPeaks to identify peaks, followed by MEME to identify motifs that are present in the best candidate peaks. Finally, it revisits the full set of peaks called by FindPeaks to locate other candidate peaks containing the most highly enriched motifs identified by MEME. This enables MICSA to pull out regions of interest that might otherwise have been indistinguishable from the background noise and would have been overlooked.

Figure 2.8: Flowchart showing the steps used by the MICSA algorithm to refine transcription factor binding sites through the use of motif enrichment. (Steps shown, in order: mapped DNA tags; identify candidate peaks in ChIP and control data; remove peaks occurring in satellite and/or centromeric regions; remove peaks identified in both ChIP and control data; get DNA sequences for peaks in ChIP-Seq; extract overrepresented motifs from the top area of several hundred high peaks; check motif presence in the remaining peaks and calculate motif p-values; run optimization to report the maximal number of peaks within a given number of false positives.) Image modified from Boeva et al. [3].

Motif Enrichment and Comparison of Peak Finders

To test the MICSA algorithm, three comparisons were performed using the neuron-restrictive silencer factor (NRSF) data set described in Kharchenko, Tolstorukov, and Park [156]. The first test uses the 3,000 best scoring motif instances in the human genome, as predicted from the canonical sequence binding motif of NRSF; the second test uses the 500 best scoring matches of the canonical NRSF binding sequence in the human genome; and the third test uses 83 qPCR validated binding sites for the NRSF transcription factor. To compare the sensitivity of the MICSA approach against other peak finders, the enrichment was plotted as the number of binding positions each peak caller was able to find as a function of the total number of positions called. (See figure 2.9.) It is clear from the plots in figure 2.9 that using the motif to enrich for peaks with a signal provides an improvement at all thresholds over algorithms that do not search for motif enrichment in putative binding sites. However, from the set of 3,000 canonical transcription factor binding sites identified in the human genome, only a maximum of 1,422 were identified by MICSA.
This is probably because many of the sites containing the motif are unlikely to be true functional binding sites for the NRSF transcription factor, being either occupied by other factors or located in regions which are inaccessible to the transcription factor. When using the set of the best 500 binding sites, drawn from the larger set of 3,000, MICSA was able to identify 447, compared to FindPeaks at 428 without using motif searches. Other peak finders were also tested, and uSeq, MACS, F-Seq, CisGenome, PeakSeq, wdt, SISSRs, ERANGE 3.1 and QuEST 2.0 each identified between 402 and 437 peaks from the set, short of the value obtained by MICSA. With the set of 83 qPCR validated binding sites from the human genome, a similar picture emerged: MICSA was able to identify 55 binding sites, whereas the rest of the peak finders located 50-52 sites, with FindPeaks in the middle at 51 peaks identified.

Figure 2.9: Performance comparison of MICSA with FindPeaks, PeakSeq, QuEST and uSeq. The positive sets of NRSF binding sites used were (A) the 3,000 best matches of the canonical NRSF matrix in the human genome, (B) the 500 best matches of the canonical NRSF matrix in the human genome, and (C) 83 qPCR verified NRSF-binding sites in the human genome. Peaks extracted by each algorithm were ranked according to scores or P-values provided by each tool using the default settings, as suggested by the developers of each application. The frequency of identified positive sites was plotted for each ranking of best peaks. Tool names suffixed with ^ indicate that the default parameters of the tool were modified to report more peaks. Image/text modified from Boeva et al. [3].

In addition to consistently outperforming the peak finders that do not use motif information, MICSA was also able to locate a motif for the NRSF data set that corresponded well to the known canonical NRSF binding motif.[3]

2.8.2  EWS-FLI1

To demonstrate that the MICSA approach can be used to enrich peaks for actively used binding sites, the software was applied to a ChIP-Seq data set generated for the oncogenic fusion of the Ewing sarcoma (EWS) and friend leukemia virus integration 1 transcription factor (FLI1) genes, which creates the EWS-FLI1 transcription factor.[157] This data set was previously analyzed with FindPeaks, yielding only 246 binding sites specific to the EWS-FLI1 fusion because of the low yield of DNA obtained.[158] However, upon re-analysis with the MICSA software, two distinct motifs were identified for the fusion gene and a total of 2,264 binding sites with an estimated FDR of 5% were discovered. The first motif discovered represents a (GGAA)≤6 microsatellite found in 496 peak regions, while the second corresponds to a consensus sequence of RCAGGARRY (R = A/G, Y = T/C), found in 1,768 peak regions.[158] The former microsatellite motif, although not characteristic of either the EWS or FLI1 proteins, was observed to show a tendency towards up-regulation (|fold change| > 2) of neighbouring genes, found 150-kb to 50-kb downstream of transcription start sites. The second motif discovered matched closely with the known canonical motif of the E-twenty six (ETS) family, to which the FLI1 transcription factor belongs, but no direct effects on the expression of downstream genes could be discovered for it. To test the ability of MICSA to differentiate between true signals and false signals among low peaks based upon the presence of motifs, ChIP-qPCR was performed on 16 low peaks with heights between 3.9 and 8, and on a set of seven peaks with heights between 4 and 8 that did not contain either motif.
For three genes with upstream motif-containing peaks, cyclin D1 (CCND1), glycogenin 2 (GYG2) and pregnancy-associated plasma protein A, pappalysin 1 (PAPPA), a clear positive signal was obtained. For two other genes with upstream motif-containing peaks, A kinase anchor protein 7 (AKAP7) and solute carrier organic anion transporter family, member 5A1 (SLCO5A1), a positive trend was observed. However, no positive signals were obtained for any of the seven peaks in the control set of low peaks devoid of the motifs.

2.9  Summary

Although it was one of the first ChIP-Seq applications developed and published, FindPeaks remains one of the cornerstone bioinformatics applications in its genre. Despite the limited resources available for its development and support, many of the current generation of ChIP-Seq packages are unable to improve upon its accuracy.[154] FindPeaks has demonstrably made contributions both to the bioinformatics of DNA-protein interactions and to the underlying biology. (See Appendices A to C.)

Chapter 3

Variation Database

“We are still working with an incomplete compass, The time is right to bring the full power of genomics to bear on the problem of cancer.”
Francis S. Collins, 2005, describing work commencing on The Cancer Genome Atlas (TCGA) project.

The Variation Database (VDB) was designed for the scalable storage of large volumes of Next Generation Sequencing (NGS) derived Single Nucleotide Variations (SNVs), Insertions and Deletions (INDELs). Unlike other SNV database projects (e.g. dbSNP [92], International HapMap project [159]), this was designed for the storage of data that cannot be shared outside of the institutions at which they were gathered, and thus has not been fitted with a public web front-end.
However, the problems of storing large volumes of data are not limited to data that may be freely distributed, and working with variations from a large number of individuals can be a confounding issue for the already complex problem of the large sets of variations associated with NGS. The VDB demonstrates a flexible solution to many of these problems by providing mechanisms for rapid retrieval of the data (i.e. rapid relative to non-database access or to access through other available databases), and by ensuring that the data are stored independently of external annotation systems. These innovations are fundamentally different from any other variation database currently available. A description of this work was published in Fejes et al. [4].

3.1  SNVs and INDELs

The VDB was originally designed to store SNVs alone, but has since been expanded to include INDELs. This expansion is not discussed, as it is a simple matter to extrapolate the code and design from using a single base position coordinate (chromosome, position) to using a multiple base coordinate (chromosome, start, end), requiring only simple modifications to extend the code. Thus, results discussed in this chapter refer specifically to SNVs, but may be extrapolated to INDELs as well.

3.2  Methods

3.2.1  Novel Functions

The database enables four novel functions that would otherwise be difficult to accomplish: 1) the ability to rapidly access genetic variation information across multiple data sets (e.g. to determine the frequency of a variation in the population); 2) storing information in an annotation-free manner, allowing the user to select the appropriate annotation set to use (e.g. selecting which version of Ensembl annotations to use); 3) rapid comparison of data from any sequencing platform, regardless of origin; and 4) performing aggregate analyses, for instance, providing an assessment of the likelihood of a variant being a true positive variant based upon its recurrence across a larger population of samples.
(i.e. identifying and eliminating sequencing artifacts and poor base calls.) These functions are accomplished by storing each variation observed in a manner indexed both by the library to which it belongs and by the location of origin in the genome to which the reads were aligned.

3.2.2  Data

The database currently stores SNVs and Single Nucleotide Polymorphisms (SNPs), as well as INDELs. We also import information from external databases such as dbSNP in the form of annotations.[92] These data can be used for concordance analysis and to assess the quality of each imported data set.

3.2.3  Graphic Output

Scripts are provided along with the Java Application Programming Interface (API) and User Interface (UI) for obtaining gene/exon coverage information from a variety of file formats. The application combines information obtained from the VDB and an Ensembl database with coverage information to provide an image in Scalable Vector Graphics (SVG) format that can be used for visualizing up to four groupings of data (e.g. Ribonucleic Acid (RNA), exome capture, whole genome and controls) at once, each of which can contain multiple distinct samples. The end result is ideal for rapid comparison of groups of sequencing experiments. (See figures 4.6 and 4.7, which show excerpts from the graphic interface.)

3.2.4  Input Formats

The database currently accepts a wide variety of formats including: GFF3 (http://www.sanger.ac.uk/resources/software/gff/, last accessed April 13, 2012), SAM[89], SNVmix[17], Variant Call Format (VCF) and several custom formats for SNVs. The application is also designed for quick and simple addition of new formats as new variant callers become available.

3.2.5  Library Information

In order to perform qualitative or quantitative analyses across the many data sets stored in the database, each data set is annotated with a minimal set of required information detailing the origin of the sample (e.g.
cancer versus normal, cell line versus tissue). This enables users of the database to quickly identify the propensity of any given variation to appear in a subset of the data and provides the ability to perform meta-analysis across the whole database rapidly to identify cancer associated variants.

3.2.6  Variation Annotations

The VDB also has a flexible system for holding annotations of variations in order to allow free-form or pre-formatted data entry relevant to a particular entry in the database. However, the annotations are not tied to a specific observation, but rather to the coordinate of the observation, making the process of creating or retrieving annotations a simple task. As long as a snp_id is available, either from the observations or the snp tables, the annotation can be accessed immediately via the snp_id key on the annotations table.

The annotation system enables flexible information storage with two text fields: a source field and a free-form notes field. The source field can be used to indicate the origin of the annotation, such as a version of dbSNP or the PubMed ID for a journal article in which it was found.[92] The notes field can be used to enter any free-form text required, or a key value such as the dbSNP ID key, which can be used to retrieve information about the variant from the dbSNP database.

A further example of the flexibility of the annotation system is the incorporation of data from the Database of RNA Editing (DARNED).[160] Utilizing this information, it is possible to clearly designate all of the variations known to be caused by RNA editing. This information is particularly useful when analyzing transcriptome information that does not come with matched normal genomic information, preventing known RNA edits from being mistaken for potential causative variations. Annotations are provided in the full report produced by the Java API/UI (see section 3.5.2).
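The coordinate-keyed annotation scheme described above can be sketched as follows. This is a minimal in-memory illustration only: the class and method names (AnnotationStore, add, forSnp, hasSource) are hypothetical and are not part of the VSRAP API, and the annotation values used in any example are made up. It assumes only what the text states, namely that each annotation carries a source field and a notes field and is looked up by snp_id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory stand-in for the annotations table (not the VDB schema). */
final class AnnotationStore {

    /** One annotation row: where it came from, plus free-form notes or a key value. */
    static final class Annotation {
        final String source;
        final String notes;

        Annotation(String source, String notes) {
            this.source = source;
            this.notes = notes;
        }
    }

    // Annotations hang off the snp_id (the coordinate), never off a single observation.
    private final Map<Long, List<Annotation>> bySnpId = new HashMap<Long, List<Annotation>>();

    /** Attach an annotation to a coordinate. */
    void add(long snpId, String source, String notes) {
        List<Annotation> rows = bySnpId.get(snpId);
        if (rows == null) {
            rows = new ArrayList<Annotation>();
            bySnpId.put(snpId, rows);
        }
        rows.add(new Annotation(source, notes));
    }

    /** All annotations for a coordinate; empty if none were recorded. */
    List<Annotation> forSnp(long snpId) {
        List<Annotation> rows = bySnpId.get(snpId);
        if (rows == null) {
            return Collections.<Annotation>emptyList();
        }
        return Collections.unmodifiableList(rows);
    }

    /** True if the coordinate has at least one annotation from the given source. */
    boolean hasSource(long snpId, String source) {
        for (Annotation a : forSnp(snpId)) {
            if (a.source.equals(source)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the key is the coordinate rather than the observation, a check such as hasSource(snpId, "DARNED") would apply equally to every library in which that variation was observed.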
The database also holds annotations from both external databases (e.g. dbSNP) and manual annotations. Java API and UI utilities are provided to facilitate insertions and deletions of annotation data.

Other public variation databases exist, such as the Ensembl Variation database.[161, 162] At the outset of the project, it was decided that the Ensembl database did not contain significantly more data than dbSNP, and that the API would provide a slower response than holding a local copy of dbSNP. Although the Ensembl Variation database was not used in this project, it would not require a significant investment of resources to utilize it, given the tight incorporation of the Java Ensembl API into the project code.

3.3  Design

Database design is the precursor to implementation, determining the organization of the database structures as well as the typical work flows. Once the design of a database is set, implementation is the mechanics of transforming the design into a working database. As such, the design of the database and the philosophies behind the design determine the characteristics and idiosyncrasies of the database. Unfortunately, no database is immune from design flaws, and no database can be created that handles every possible query with maximal speed and efficiency. Thus, design is the process by which those tradeoffs are weighed and evaluated, and the use cases are ranked such that those of the most importance are given priority.

3.3.1  Design Philosophies

The design philosophy of the VDB roughly focuses on two important points: the database holds information specific to a single reference, and it does not hold information about the annotation of that reference. This allows the database to be highly scalable and flexible, without requiring updates to the stored information as annotations change. It also has the side benefit of storing only the minimal amount of useful information, allowing annotations to be queried as data are exported.
The design of the information storage is also an important consideration. Database scaling is always improved by storing information with the least amount of repetition, which can be achieved with complete normalization of data. In the original published description of the VDB, information pertaining to the sample itself is somewhat over-normalized, owing to several layers of abstraction between the Laboratory Information Management System (LIMS) employed by Canada's Michael Smith Genome Sciences Centre (BCGSC), in which information relating to a single Deoxyribonucleic Acid (DNA) sample was entered, and the final importation into the VDB. All other database tables are fully normalized based upon the typical relations observed in the source data.

Reference Specific

The main cornerstone of the database is the use of a single set of coordinates for each witnessed event. In the case of SNVs or SNPs, the basic coordinate is the chromosome and position of the single nucleotide substitution, as well as the actual substitution. (e.g. An A → C mutation and an A → G mutation at the same location are fundamentally different events, so each is stored as a separate event.) However, the reference base for a given genome can be unrelated to the reference base in a different version of the same genome. Thus, it is inappropriate to mix variant calls against separate references into one implementation of the database. The original implementation of the database provided a mechanism for mixing different references by specifying the genome as a part of the coordinate system. However, the use of this particular device is discouraged and it has consequently been removed from the database structure. The net effect of mixing data from two separate reference systems is to increase the size of the tables, which leads to greater latencies for accessing the data from disk.
As the size of the data stored increases, the penalty for disk retrieval of large indexes or records that cannot be cached can be one or more orders of magnitude, depending on the relative speed difference between memory access and disk access. In contrast, separating the data for each reference into individual database implementations (which may rest within the same instance of PostgreSQL or other database environment on a single machine) improves performance in nearly all metrics by allowing the maintenance of separate files (providing shorter time to search data tables), separate indexes (faster entrance to records of importance) and potentially separate hardware, if the data tables become sufficiently large to warrant the investment. No use-cases have yet been proposed in which more than one reference data set is queried at a time, lending support to this division of information.

Species/Annotation Independent

A fundamental issue with most genome-oriented databases is the desire to store annotation related information in the database alongside the data. It is of paramount importance that this not be done. Annotation related information can be re-created "on-the-fly" at any point with minimal overhead, but anchoring it into a database of human variation will irretrievably tie an instance of the database to a given version of annotation. As no genome annotation databases have yet been finalized, and none are likely to be in the near future, it is important to make sure that flexible methods are employed that allow the most recent versions of any annotation to be utilized.

It is important to note that despite the separation of annotations from variation data and coordinates, the coordinate system used by the database is necessarily tied to a reference. While it is possible to remove annotations such as the location of genes and the consequences of a variation on coding regions, it is not possible to separate the data from the coordinate system (e.g. 
the reference genome) used to determine the presence of the variations. Thus, the data collected are specific to the reference scaffolding upon which they were aligned or to which an assembly was compared. This allows changes to the annotations used, but imposes the limitation that novel coordinate systems cannot be imposed.

Avoiding data duplication provides another reason for the division of annotations and coordinate/data sets. Storing annotation data in an instance of the VDB will inevitably duplicate information that is freely available in other databases, which have already been optimized for the tasks of storing and querying genomic annotation information. Thus, it will always be a superior solution to utilize the resources that are available and designed or dedicated for this use, rather than attempting to merge and re-deploy an annotation database for the purpose of convenience (i.e. having all of the data in one database).

Performance issues also play a role in this design decision, as accessing an external database of annotations will split the data over a greater number of file systems and machines. This has the added advantage of spreading out the hardware demand over more machines, which is not a detriment as long as the network bandwidth is sufficiently high. As long as either the annotation database or the VDB is not wholly operating in memory (which is only possible for small databases on machines with large Random Access Memory (RAM) capacity), disk access (I/O) issues will play a significant part in determining the speed of the retrievals, further reinforcing the need to split queries over as many physical hardware devices as possible.

As a note, despite the common name at the BCGSC for the database being the "Human Variation Database (VDB)", the implementation and design are not specific to human data. The database can be used for any species for which Ensembl annotations are available.
If a species annotation is available through sources other than Ensembl, it is also possible to modify the API to incorporate those sources of annotations; however, this would require some development work.

Required Data Set Associated Input

One of the key design requirements for the database is determining which fields are required for each set of variations stored in the database. The presence or absence of these fields determines how the data can be used and what the data can be used for. While not every field in the database is mandatory, there are a small number that must be explicitly stated. The first is the explicit "cancer" vs. "non-cancer" flag (boolean), indicating whether the sample was derived from a tumour. Without this information, it is impossible to know whether to expect the data to be similar to or highly divergent from the reference genome, and the flag underlies our interpretation of whether the variations observed are likely to be associated with a healthy prognosis for the patient.

Despite the simplicity of the boolean field described above, the biology of the samples may not always be a true or false condition. Many solid masses of cancer tissue contain healthy cells along the fringes of the tumour, while cancer samples taken from the blood can contain healthy cells, which contaminate the expected cancer genotype. Similarly, matching blood samples are often taken from individuals who have cancer, with the expectation that the blood will yield healthy cells, while the sample itself may contain cancerous cells that have been shed from the tumour or from a metastatic growth. In both of these cases, the intent of the sample will likely determine whether the predominant population of cells is healthy or cancerous, and recording this information will suffice, keeping in mind that future analyses may need to recall that both cancer and non-cancer samples from cancer patients are likely contaminated to some degree.
Other fields that must be entered are the total number of reads with the variation and the total depth of reads at the variation. In the absence of a platform-independent or even software-independent method for scoring or evaluating each observed deviation from the reference, the issue of variant scoring remains contentious and unsettled. Thus, the most basic information required for a user to make a best guess at the probability of the variant being real (and not a sequencing error) remains the number of independent reads observed as a function of the total coverage. This metric itself is ultimately context sensitive, as it is not obvious how tumour heterozygosity, copy number variations or sequencing bias might affect any prediction, but when comparing many separate samples, it becomes possible to separate sequencing errors (repeatedly low read counts at a single position across many samples) from likely candidate variations of interest (consistently high percentage of reads carrying the variation in samples of interest) and from polymorphisms (consistently high percentage of reads carrying the variation across both cancer and non-cancer sequencing datasets). Without these three fields, it would be impossible to perform most of the analyses used for identifying cancer associated changes or to use the database to filter out normal variation.

Normalization

A major organizational component of any database is the efficiency of the normalization. A fully normalized database will not duplicate storage of information, and all information will be stored in relationships that mirror the real one to one, one to many or many to many relationships present in the original data. Thus, normalization also provides a useful means of ensuring that rows of data are not duplicated, which in turn ensures that the database is efficiently storing information in a way that does not slow down retrieval.
A fully normalized database will always store information that has a 1-to-1 relationship in a single table, while information stored in a 1-to-many relationship will be split across two or more tables. The VDB attempts to fully normalize the data it stores by utilizing look up tables where possible, by carefully replicating the data relationships and by preventing duplication of data.

Data Ordering

For the VDB, periodic clustering or re-ordering (see section 1.5.6) of the observations table by SNP ID can provide significant (4-10x) access time improvements. Because many of the most frequently used queries access this table by snp_id, and periodic insertions and deletions tend to disrupt this ordering, it can be worth utilizing the cluster command to refresh the disk order of the data.

    CLUSTER observations USING observations_snp_id;

Triggers

The VDB makes use of triggers (see section 1.5.6) as follows. The snp table holds a complete list of each observed SNP, while the observations table keeps track of the number and type of observation. In order to provide a summary of the observations table, a third table, snpcancernormalcount, is utilized. However, the snpcancernormalcount table is prone to being out of date as new entries accumulate in the database. Thus, an efficient method of ensuring that there is always an entry for every SNP coordinate in the snp table is to use a set of triggers.
• One trigger on inserts to the snp table that causes the new snp_id to be inserted into the snpcancernormalcount table.
• One trigger on deletions from the snp table that causes the deleted snp_id to be removed from the snpcancernormalcount table.
• One trigger that updates the counts in the snpcancernormalcount table upon inserts to or deletes from the observations table.
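The bookkeeping performed by the three triggers listed above can be modelled in a few lines of Java. This is a sketch of the logic only, not the trigger code used by the VDB: the class SummaryTableModel and its methods are hypothetical names, and a HashMap stands in for the snpcancernormalcount table.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory model of how the summary table tracks the snp and observations tables. */
final class SummaryTableModel {

    /** One row of the summary table: cancer and non-cancer observation counts. */
    static final class Counts {
        int cancer;
        int normal;
    }

    // snp_id -> summary row; stands in for the snpcancernormalcount table.
    private final Map<Long, Counts> summary = new HashMap<Long, Counts>();

    /** First trigger: an insert into snp creates a zeroed summary row. */
    void onSnpInsert(long snpId) {
        if (!summary.containsKey(snpId)) {
            summary.put(snpId, new Counts());
        }
    }

    /** Second trigger: a delete from snp removes the matching summary row. */
    void onSnpDelete(long snpId) {
        summary.remove(snpId);
    }

    /** Third trigger: an observation insert (delta +1) or delete (delta -1) adjusts the counts. */
    void onObservationChange(long snpId, boolean cancerLibrary, int delta) {
        Counts row = summary.get(snpId);
        if (row == null) {
            throw new IllegalStateException("no summary row for snp_id " + snpId);
        }
        if (cancerLibrary) {
            row.cancer += delta;
        } else {
            row.normal += delta;
        }
    }

    boolean hasRow(long snpId) {
        return summary.containsKey(snpId);
    }

    int cancerCount(long snpId) {
        return summary.get(snpId).cancer;
    }

    int normalCount(long snpId) {
        return summary.get(snpId).normal;
    }
}
```

The first two handlers guarantee that every snp_id has exactly one summary row, which is why the third handler can treat a missing row as an error rather than creating one.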
This full set of triggers can be used to maintain a summary table, which can provide instant, up-to-date queries on the number of times a particular variation has been observed in cancer or non-cancer based samples. In practice, triggers can also slow down interactive inserting and deleting of records, as all Structured Query Language (SQL) included in a trigger must be processed and completed immediately at each transaction with the database before subsequent transactions can take place (i.e. triggers are "blocking" interactions). For the VDB, we have removed the third trigger and placed it in an individual, threaded piece of stand-alone code (UpdateCounts.java) such that updates can happen as necessary, upon completion of a batch of inserts and/or deletes.

3.4  Modularity

The code for the Vancouver Short Read Analysis Package (VSRAP) VDB can be separated into several main components, including an API for accessing the database, an interface for interaction with the user, and other components such as file handlers for accessing data files (See Figure 3.1). These separate layers are transparent to the user, who is only aware of the interface with which they are presented. However, dividing the code into these individual layers provides many benefits both to the user and to the developers.

Separating the code by function creates a basic relationship between classes (a basic unit of code in Java) and their functions. As classes become more complex, it becomes more difficult to decompose the tasks each class is expected to perform, making code comprehension an issue. Thus, the modularity of the code and division of each task into separate classes, where possible, is a major influence on the design of code in this project.
Enforcing modularity on the project design has the added benefit of making the code easier to debug, as database interactions are not unduly complicated by user interactions, as well as making it easier to write new code to incorporate new functionality. New developers interested in contributing to the project will find themselves able to separate interface design and function more easily. For instance, the UI does not become entangled with the Ensembl interface, while the database API does not independently have the ability to access command line arguments.

[Figure 3.1: diagram with the components VSRAP User Interface, VSRAP API, File Iterator, Ensembl API, MySQL JDBC, PSQL JDBC, Ensembl database, PostgreSQL database and Data Files.]

Figure 3.1: This illustration shows the relations between the JDBC layers, the API layers, the various databases and the VSRAP. The user only has access to the UI layer, which utilizes the VSRAP API to pass SQL commands to the JDBC layer that, in turn, communicates with the PostgreSQL or Ensembl databases. A similar mechanism is provided for reading files, although a connecting JDBC layer is not required.

3.4.1  Database Application Programming Interface

The database API is mainly hidden from the user, but provides developers with the functionality they require to take advantage of the VDB. Several layers are available within the API for developers to exploit.
• PSQLInterface: The most basic level, which provides wrappers to the JDBC commands such as running queries, performing vacuums and working with prepared statements. Advanced users will find the most common SQL interface functions in this library.
• PSQLutils: This library provides simple wrappers around ResultSet objects to make it easier to work with SQL generated results, as well as to handle common exceptions observed when working with queries.
• PSQLroutines: This library provides a high level set of functions for performing common but complex tasks specific to the VDB. 
These functions include inserting information into tables, searching for library information and retrieving SNP/SNV information by ID.

Each of the three levels in the database API utilizes the methods of the levels below it, and frequently allows pass-through queries, allowing developers to take advantage of the functions in the lower levels without requiring separate access to each class.

3.4.2  File Iterators

A second set of classes that are generally hidden from the users are the file iterators. This group of classes handles all of the VSRAP interactions with text files and all user input except for command line arguments. When importing to or exporting from the database, these interfaces, also called iterators because of their ability to iterate over a set of records, provide all of the functionality to translate from a variety of formats into a common set of properties used by the database. For instance, the VDB commonly employs iterators that translate from a variety of formats into a common "SNP" object, which contains information on the location of the variation event, the type of event observed, and any statistics that might be useful for future evaluation of the particular event.

Iterators provided by the VSRAP use these common object types to allow the UI to interrogate common genomic variation files independently of the actual file type provided. For instance, when importing SNVs, the user indicates on the command line which type of file is provided, and the correct file iterator is chosen. The code performing the request to import the SNVs is not aware of the actual file type, but receives a collection of generic "SNP" objects, which can be handled independently of the file type used by the importing iterator. This abstraction is handled through a generic SNP iterator class.

A selection of file types currently handled:
1. Complete Genomics: Formats from Complete Genomics, supports SNPs and INDELs
2. dbSNP: Supported formats include the distributed files for dbSNP126, dbSNP130 and dbSNP131
3. GFF3: A common format used for SOLiD generated data, supports SNPs and INDELs
4. MAF: Multiple Alignment Format for Structural Variations
5. MaqSnp: Supports the Mapping and Assembly with Quality (MAQ) format for SNVs
6. Pileup: Supports the Burrows-Wheeler Aligner (BWA) Pileup format for SNVs
7. SAMSnp: Supports the SAMtools format for SNVs
8. SNVMixSNP: Supports SNVs observed by the SNVMix2 application
9. VCF4: Supports the Variant call format 4.1 for SNVs
10. VCF: Supports the Variant call format 3.0 for SNVs and INDELs
11. VCFCustom: Supports an early variation of the Variant call format 3.0 for SNVs
12. WtssPipeLine: Supports SNVs and INDELs called by custom pipelines produced at the BCGSC
13. A variety of other custom SNP formats unique to the BCGSC.

3.4.3  User Interface

The User Interface (UI) component of the VSRAP's Java code is also a modular collection of wrappers around the database API and iterators. It allows the user to interact with the database through the command line, chaining together common functions to provide reports and limited information from the VDB.

Command Line Interface

The only component of the database or the Java code with which the user interacts is the User Interface (UI). This interface is accessed as any other Java command line component, allowing the interface to be used on any operating system in a cross-platform manner.

    java [options] [command] [arguments]

The Java command instructs the computer to launch the code through the Java interpreter, utilizing any optional parameters (options) that may be requested. Options often include altering the default behaviour for memory (e.g. -Xmx16G, which would allocate and restrain the executing code to 16 GB of RAM), or including other libraries in the class path (e.g. the Ensembl API can be called by including the jar file in the classpath: -cp lib/ensj-41.jar).
For the command parameter, Java allows either the use of compiled ".class" files, or the compiled and collected ".jar" files. In the case of the class files, the path and name of the class without the ".class" is used, whereas ".jar" files can be specified with the full path and name, but must be preceded by a "-jar" flag.

Optional arguments that follow the name of the command are a function of the class or jar files being used, and the VSRAP always requires arguments in the form of a dash, followed by the name of the argument, a space, and then a space separated list of values, where required. For instance, to indicate the start and end locations of a region to be searched for SNVs in the database, such as bases 120,000-130,000 of a chromosome, the start and end parameters would be used as "-start 120000 -end 130000". The list of arguments available for each interface module can always be requested by appending "-help" to the end of the command line. Full listings for each of the available commands can be found in the VSRAP manual at http://vancouvershortr.wiki.sourceforge.net/

ParseInput Module

To assist in presenting a common interface for interacting with the command line parameters across all of the code used in the VSRAP, a module was implemented to simplify and unify the interface parameters for each of the command line applications. This module, called the ParseInput class, provides the interface programmer with all of the tools required to interpret and use command line arguments supplied by the user with the minimal amount of coding required. Moreover, for most of the common arguments, it provides shortcuts for retrieving and checking the user input for common errors, helping to make the program more user friendly and descriptive, and aiding the user in correcting incorrectly provided arguments.
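The argument convention described above (a dash, the argument name, then a space separated list of values) can be illustrated with a small parser. This sketch is not the ParseInput class and does not reproduce its API; DashArgs, parse and requireInt are hypothetical names chosen for illustration, and the rule that a dash followed by a digit is a value rather than a name is an assumption made here so that negative numbers can be passed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal parser for "-name value value ..." style command lines (illustrative only). */
final class DashArgs {

    /** Groups each run of values under the argument name that precedes it. */
    static Map<String, List<String>> parse(String[] argv) {
        Map<String, List<String>> parsed = new LinkedHashMap<String, List<String>>();
        List<String> current = null;
        for (String token : argv) {
            if (isName(token)) {
                current = new ArrayList<String>();
                parsed.put(token.substring(1), current);
            } else if (current == null) {
                throw new IllegalArgumentException("value before any argument name: " + token);
            } else {
                current.add(token);
            }
        }
        return parsed;
    }

    // Assumption: a dash followed by a non-digit is a name, so "-5" is kept as a value.
    private static boolean isName(String token) {
        return token.length() > 1
                && token.charAt(0) == '-'
                && !Character.isDigit(token.charAt(1));
    }

    /** Retrieves a single integer value and reports a descriptive error otherwise. */
    static int requireInt(Map<String, List<String>> parsed, String name) {
        List<String> values = parsed.get(name);
        if (values == null || values.size() != 1) {
            throw new IllegalArgumentException("-" + name + " requires exactly one value");
        }
        try {
            return Integer.parseInt(values.get(0));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "-" + name + " must be an integer, got: " + values.get(0));
        }
    }
}
```

With this sketch, the example "-start 120000 -end 130000" yields two single-valued entries, and a trailing "-help" yields an entry with an empty value list.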
Furthermore, it also enforces that the argument parameters are consistent between applications by allowing the developer to select from a list of available arguments to present to the user and by creating a single point of entry for adding new arguments to the collection of common command line parameters. Not all portions of the UI use this ParseInput class, however. Some interface elements, such as the one used for importing SNVs, do not share any parameters in common with other aspects of the UI, making it less useful to apply this module universally.

3.5  Common Use-Cases and User Interactions

The database is commonly used for several major tasks, including filtering, verification, analysis, and discovery. Filtering can be done by utilizing annotations (e.g. dbSNP), matched pair data sets or data sets marked as non-cancer for separating polymorphisms from putative variants. Similar methods can also be used for identifying somatic mutations, driver mutations or other mutations of interest in non-cancer related samples (e.g. genetic diseases). API utilities exist for performing filtering on sample sizes ranging from single bases of interest to entire genome wide studies.

Limited verification of variants can also be performed by searching for support in other data sets, identifying the frequency of variants across cancer and normal samples. Further, the frequency of variants in both general and specific populations can be tested, lending credibility to commonly detected variants, while identifying those which are more likely to be sequencing errors (i.e. samples with low variant read depth in regions with high canonical read depth or variants only sparsely detected across many independent data sets).

Several API utilities also exist for performing complex analysis, such as identifying variants common to one or more data sets, but not found in a second set of sequencing experiments (e.g. 
this would be used for identifying variants found in data obtained from cancers, but not found in any of the matched normals.) Finally, the database is an excellent resource for performing discovery of variants commonly found in cancers but not in non-cancer samples, or for identifying common variations in genes of interest.

3.5.1  Querying

The most common method used by the API for querying the database is a simple retrieval based on a set of coordinates. Nearly all common methods of retrieving information from the database can be translated into this format, which utilizes a common Java object, the SNPsInRegion class. This one class holds a variety of methods for utilizing coordinates to obtain variations observed on a chromosome between a start position and an end position.

This method works particularly well for queries in named regions, such as genes, in which case methods of convenience are provided in the UI, such as the getSNPsByName class, in which the user provides a gene name in order to define the region of interest. The getSNPsByName class translates the gene name into coordinates (chromosome, gene start and gene end) using the Ensembl Java API (ENSJ). These coordinates can then be passed to the getSNPsInRegion class, retrieving the data requested by the user.

Alternative methods exist as well, enabling the user to do specific queries, such as queries by the database internal SNP ids (getSNPsById class), allowing the user to query specific information out of previously generated reports where variations are reported along with the internal unique snp_id. Other queries also exist for custom analysis, which include querying by the ratios of cancer to non-cancer observations (Get_all_cancer_SNPs_percent_diff class), for a single data set in which variations observed in non-cancer are not retained (getSNPsSubtractNormals class), or those that are found only in cancer data sets across the entire database (getAllCancerSNPs class).
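The coordinate-based retrieval described above can be sketched with a sorted map per chromosome. This is an in-memory illustration under stated assumptions, not the SNPsInRegion implementation, which queries the database: RegionIndex, Variant, inRegion and cancerOnlyInRegion are hypothetical names, and the cancer and non-cancer counts are assumed to be available on each variant, as they would be from the summary table. A gene-name query would first resolve the name to (chromosome, start, end) and then call the same region method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative region lookup: chromosome -> position -> variants at that position. */
final class RegionIndex {

    /** A variation event plus the number of cancer and non-cancer libraries carrying it. */
    static final class Variant {
        final long snpId;
        final String chromosome;
        final int position;
        final char canonical;
        final char observed;
        final int cancerCount;
        final int normalCount;

        Variant(long snpId, String chromosome, int position,
                char canonical, char observed, int cancerCount, int normalCount) {
            this.snpId = snpId;
            this.chromosome = chromosome;
            this.position = position;
            this.canonical = canonical;
            this.observed = observed;
            this.cancerCount = cancerCount;
            this.normalCount = normalCount;
        }
    }

    private final Map<String, TreeMap<Integer, List<Variant>>> byChromosome =
            new HashMap<String, TreeMap<Integer, List<Variant>>>();

    void add(Variant v) {
        TreeMap<Integer, List<Variant>> positions = byChromosome.get(v.chromosome);
        if (positions == null) {
            positions = new TreeMap<Integer, List<Variant>>();
            byChromosome.put(v.chromosome, positions);
        }
        List<Variant> here = positions.get(v.position);
        if (here == null) {
            here = new ArrayList<Variant>();
            positions.put(v.position, here);
        }
        here.add(v);
    }

    /** All variants on the chromosome with start <= position <= end, in position order. */
    List<Variant> inRegion(String chromosome, int start, int end) {
        List<Variant> hits = new ArrayList<Variant>();
        TreeMap<Integer, List<Variant>> positions = byChromosome.get(chromosome);
        if (positions == null || start > end) {
            return hits;
        }
        for (List<Variant> here : positions.subMap(start, true, end, true).values()) {
            hits.addAll(here);
        }
        return hits;
    }

    /** Region query restricted to variants seen in cancer libraries and never in normals. */
    List<Variant> cancerOnlyInRegion(String chromosome, int start, int end) {
        List<Variant> hits = new ArrayList<Variant>();
        for (Variant v : inRegion(chromosome, start, end)) {
            if (v.cancerCount > 0 && v.normalCount == 0) {
                hits.add(v);
            }
        }
        return hits;
    }
}
```

Keeping the filter as a thin layer over the region query mirrors the way the more specialised queries in the text reduce to a retrieval by coordinates.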
Reports can also be generated for a single data set in a standard format (ExperimentalRecord class) that includes an HTML report format for quick visualization and database summary, a brief summary report of non-synonymous variations and a full report in which the data can be compared fully against other data sets in the database.

3.5.2  ExperimentalRecord

A common task undertaken by users of the VDB is to search for variants present within a single sample, and to compare them to variants in other samples in the database. To simplify this process, we have produced a standard "ExperimentalRecord" query that provides a summary of the non-synonymous variants (indels and single nucleotide substitutions) and variations likely to interfere with splicing junctions. This report generating script makes use of the ENSJ to obtain annotations for genes, exons and transcripts, and to determine which variations in the database are likely to affect coding regions.

The command for using this report can be accessed in two ways, either through compiled class files or through a built jar file. In the case of the class files, the command will be:

    time java -cp .:../lib/postgresql-9.0-801.jdbc4.jar:../lib/mysql-connector-java-5.1.6-bin.jar:../lib/ensj-41.jar \
        src/projects/VariationDatabase/ExperimentalRecord -conf [../lib/constants.conf] \
        -dbname [SNP Database hg18] -psql [../lib/PSQL.conf] \
        -outpath [/projects/afejes/path/] -library [HS1897]

Each parameter in brackets should be substituted with the appropriate parameter for the specific case employed by the user. Alternatively, the report can be accessed through the jar files using the command:

    java -jar jars/variationdb/ExperimentalRecord.jar -conf [lib/constants.conf] \
        -psql [lib/PSQL.conf] -dbname [SNP Database hg18] \
        -outpath [/projects/afejes/path/] -library [HS1897]

Again, the parameters in brackets are to be modified by the user. The pre-compiled and assembled jar files have the inherent advantage of including the libraries listed in the class path option ("-cp") such that they are not required to be entered on the command line.

The successful execution of the code will provide the user with three separate reports:
• A summary report of all SNVs in Hypertext Markup Language (HTML) format, for efficient browsing
• A summary report of all non-synonymous SNVs in text format, for efficient parsing
• A complete report of all SNVs in text format, for investigating results of interest with full details.

HTML Report

The HTML report is the most intuitive report, designed to facilitate access to the information contained in the report for users who are unable to parse the text reports or who would like a simple overview of the information returned. Information is returned in rows, and the information for each variant is collapsed to be as visually compact as possible. The report also utilizes coloured features to make the data more accessible. For instance, the ratio of observations for a given SNV is coloured from red (cancer samples only) to blue (non-cancer samples only). In addition, when a variation has been observed in a cancer sample, information about the origin of the sample is indicated using a colour coded bar. To provide the user with enhanced feedback, the type of each cancer is shown in a "mouseover" box. This allows the user to understand the distribution of particular variants across the cancer types sampled in the database. (See Figure 3.2.)

Figure 3.2: Example lines from an HTML summary of the experimental record report.
Six variants from the DICER1 gene are shown, illustrating examples of variants found only in cancer (c-n column in red), variants found only in normals (c-n column in blue), and some variations found in a large number of cancers. The final three variations also have entries in the dbSNP column, indicating that, despite the number of cancers in which they are found, they are all known polymorphisms.

Non-Synonymous Summary Report

The non-synonymous summary report provides an easily parsed format, such that queries and filters can be applied automatically to find the variations of most interest. The columns used in the report are:
1. SNPID: id in the database for a given chromosome/location/variation.
2. chromosome
3. position of the variant
4. canonical base
5. observed base
6. times variant was observed (if the query relates to a specific sample, otherwise -1 is displayed.)
7. total coverage at that position. (if the query relates to a specific sample, otherwise -1 is displayed. Note: Some SNP callers report this incorrectly as the coverage of the 2 most common bases.)
8. heterozygous or homozygous. If not reported by the SNP caller, then "uncalled" is displayed.
9. Aligner and version used for SNP call. If not reported for a specific observation, "null" is displayed.
10. SNP caller and version used for SNP call. If not reported for a specific observation, and "null" is displayed for aligner, then this field is skipped.
11. always "non-synonymous", as synonymous variants are not reported in this file.
12. dbSNP status. "dbsnp" if previously reported in dbSNP, "nds" otherwise.
13. RNA edit status. "RNAedit" if reported in DARNED as an RNA editing event, "nre" otherwise; only applicable to transcriptome libraries.
14. Number of times this variant is observed in cancers
15. Number of times this variant is observed in normals
16. Amino Acid substitution (Native >>> new variant caused by base pair change.)
17. Native sequence at that location. Substituted Amino acid is indicated with a '*' on either side.
18. name of gene/transcripts affected. Gene names can be obtained by removing the last four characters of each transcript name.

14052838 14 94642167 T G -1 -1 uncalled null non-synonymous nds nre 5 6 984 N>>>T KFPSPEYETFAEYYKTKYNLDLTNL*N*QPLLDVDHTSSRLNLLTPRHLNQKGK DICER1-202:DICER1-201 A04207, A04205, A03295, A03297, HS1725
36758066 14 94642221 T C -1 -1 uncalled null non-synonymous nds nre 0 1 966 E>>>G QPHRFYVADVYTDLTPLSKFPSPEY*E*TFAEYYKTKYNLDLTNLNQPLLDVDH DICER1-202:DICER1-201
113280985 14 94642285 A G -1 -1 uncalled null non-synonymous nds nre 1 0 945 F>>>L FVFKLEDYQDAVIIPRYRNFDQPHR*F*YVADVYTDLTPLSKFPSPEYETFAEY DICER1-202:DICER1-201 HS2561
135966748 14 94643753 C A -1 -1 uncalled null non-synonymous nds nre 1 0 917 E>>>* IDFKFMEDIEKSEARIGIPSTKYTK*E*TPFVFKLEDYQDAVIIPRYRNFDQPH DICER1-202:DICER1-201 HS1106
113760525 14 94644005 G A -1 -1 uncalled null non-synonymous nds nre 1 0 872 A>>>V TRLHQYIFSHILRLEKPALEFKPTD*A*DSAYCVLPLNVVNDSSTLDIDFKFME DICER1-202:DICER1-201 HS2566
79364007 14 94644006 C T -1 -1 uncalled null non-synonymous nds nre 3 0 872 A>>>T TRLHQYIFSHILRLEKPALEFKPTD*A*DSAYCVLPLNVVNDSSTLDIDFKFME DICER1-202:DICER1-201 HS2874, A03549, A00130
187152335 14 94644062 A T -1 -1 uncalled null non-synonymous nds nre 1 0 853 I>>>K SIELKKSGFMLSLQMLELITRLHQY*I*FSHILRLEKPALEFKPTDADSAYCVL DICER1-202:DICER1-201 HS1749
113767467 14 94644106 C G -1 -1 uncalled null non-synonymous nds nre 1 0 838 L>>>F IPHFPVYTRSGEVTISIELKKSGFM*L*SLQMLELITRLHQYIFSHILRLEKPA DICER1-202:DICER1-201 HS2566
112322016 14 94644425 G A -1 -1 uncalled null non-synonymous nds nre 1 0 809 P>>>S ELNFRRRKLYPPEDTTRCFGILTAK*P*IPQIPHFPVYTRSGEVTISIELKKSG DICER1-202:DICER1-201 HS2556
135980525 14 94644428 T C -1 -1 uncalled null non-synonymous nds nre 1 0 808 K>>>E DELNFRRRKLYPPEDTTRCFGILTA*K*PIPQIPHFPVYTRSGEVTISIELKKS DICER1-202:DICER1-201 HS1108

Figure 3.3: This is the short form of the experimental records report, providing information about non-synonymous variations only. Example is shown for the DICER1 gene.

Full Report

The experimental record produces a full report, which is a complete record of all of the findings used to generate the two summary reports above. It includes details of the libraries in which the variant was observed, the total coverage at the variant position and the total number of reads used to make the variant call. The report for each new variant begins with a non-indented line:
1. SNPID: id in the database for a given chromosome/location/variation.
2. chromosome
3. position of the variant
4. location of variant (exon, intron, 5' Untranslated Region (UTR), 3' UTR, none) indicating if any annotations overlap this location. (annotations from Ensembl.)
5. Confidence Value. (Experimental values for determining the probability that this SNP is non-artefactual. Under development.)

Indented lines compose the bulk of the report, and come in several types:

syn: lines
These lines indicate a transcript in which this variant causes a synonymous change. They have the appearance of:

    syn: 5849094 ENST00000378161 NPHP4-004

Columns:
1. syn: type indicator
2. position of the variant
3. Ensembl transcript ID
4. Hugo Gene/Transcript name

ns: lines
These lines indicate a transcript in which this variant causes a non-synonymous change:

    ns: 1 3406309 C A ENST00000294599 MEGF6-001 811 R>>>L \
        CEAGYVGPRCEQQCPQGHFGPGCEQ*R*CQCQHGAACDHVSGACTCPAGWRGTF

Columns:
1. ns: type indicator
2. chromosome/contig
3. position of the variant
4. canonical base
5. observed base
6. Ensembl transcript ID
7. Hugo Gene/Transcript name
8. Amino Acid Number
9. Amino Acid substitution (Native >>> new variant caused by base pair change.)
10. Native sequence at that location.
Substituted Amino acid is indicated with a '*' on either side.

library report lines
These lines indicate a library in which the variant was found.

    11 13 HS2123 normal tissue bwa-0.5.7 SNVMix2-0.12alpha

Columns:
1. number of times variant was observed
2. total coverage at that position. (Some SNP callers report this incorrectly as the coverage of the 2 most common bases.)
3. name of library in which the variant was found
4. boolean value cancer or normal
5. boolean value tissue or cell-line
6. aligner and version
7. SNP-caller and version

an: lines
If there are annotations in the database, they will always appear at the end of a section.

    an: rs555164 dbsnp130 dbsnp annotation

Columns:
1. an: type indicator
2. annotation notes
3. annotation name
4. annotation description.

200542035 14 94627132 exon
    syn: 14 94627132 ENST00000393063 DICER1-202
    syn: 14 94627132 ENST00000343455 DICER1-201
    71 75 HS1868 cancer tissue unknown no library information bwa-0.5.7 sam-0.1.13 uncalled
    14 16 A01469 cancer tissue unknown no library information Bioscope-1.2 sam-0.1.8 uncalled
35037357 14 94627160 exon
    ns: 14 94627160 T A ENST00000393063 DICER1-202 1856 E>>>V WQVYYPMMRPLIEKFSANVPRSPVR*E*LLEMEPETAKFSPAERTYDGKVRVTV
    ns: 14 94627160 T A ENST00000343455 DICER1-201 1856 E>>>V WQVYYPMMRPLIEKFSANVPRSPVR*E*LLEMEPETAKFSPAERTYDGKVRVTV
    3 3 NA19404 normal tissue 1000Genomes no library information ssaha2-10 sam-0.7.1 uncalled
38756459 14 94627191 exon
    ns: 14 94627191 A G ENST00000393063 DICER1-202 1846 S>>>P MDSGMSLETVWQVYYPMMRPLIEKF*S*ANVPRSPVRELLEMEPETAKFSPAER
    ns: 14 94627191 A G ENST00000343455 DICER1-201 1846 S>>>P MDSGMSLETVWQVYYPMMRPLIEKF*S*ANVPRSPVRELLEMEPETAKFSPAER
    1 5 HS0445 50 cancer tissue unknown unknown Unknown Tissue HS-578T Total RNA maq-0.7.1 \
        VSRAP-$Revision: 1963 $ uncalled
199952415 14 94627267 intron
    45 100 HS2260 normal tissue unknown Unspecified Tissue Germline DNA sample (A07-325) bwa-0.5.7 sam-0.1.8 \
        uncalled
38541067 14 94627304 exon
    ns: 14 94627304 C T ENST00000393063 DICER1-202 1839 R>>>Q SLAGAIYMDSGMSLETVWQVYYPMM*R*PLIEKFSANVPRSPVRELLEMEPETA
    ns: 14 94627304 C T ENST00000343455 DICER1-201 1839 R>>>Q SLAGAIYMDSGMSLETVWQVYYPMM*R*PLIEKFSANVPRSPVRELLEMEPETA
    11 18 HS0296 cancer tissue unknown breast Breast Breast Pleural effusion from Lobular breast cancer patient \
        bwa-0.5.7 sam-0.1.8 uncalled
112422750 14 94627309 exon
    ns: 14 94627309 C A ENST00000393063 DICER1-202 1837 M>>>I FESLAGAIYMDSGMSLETVWQVYYP*M*MRPLIEKFSANVPRSPVRELLEMEPE
    ns: 14 94627309 C A ENST00000343455 DICER1-201 1837 M>>>I FESLAGAIYMDSGMSLETVWQVYYP*M*MRPLIEKFSANVPRSPVRELLEMEPE
    3 4 HS2558 cancer tissue SLX-Transcriptome Lite blood lymphoma Blood-Peripheral Injected into mouse. bwa-0.5.7 \
        SNVMix2-0.12alpha uncalled
207354620 14 94627322 exon
    ns: 14 94627322 A C ENST00000393063 DICER1-202 1833 V>>>G MGDIFESLAGAIYMDSGMSLETVWQ*V*YYPMMRPLIEKFSANVPRSPVRELLE
    ns: 14 94627322 A C ENST00000343455 DICER1-201 1833 V>>>G MGDIFESLAGAIYMDSGMSLETVWQ*V*YYPMMRPLIEKFSANVPRSPVRELLE
    34 117 A04208 cancer tissue unknown no library information bwa-0.5.7 SNVMix2-0.12alpha heterozygous
    54 217 A03294 normal tissue unknown no library information bwa-0.5.7 SNVMix2-0.12alpha heterozygous
199731680 14 94627328 exon
    ns: 14 94627328 C T ENST00000393063 DICER1-202 1831 W>>>* KAMGDIFESLAGAIYMDSGMSLETV*W*QVYYPMMRPLIEKFSANVPRSPVREL
    ns: 14 94627328 C T ENST00000343455 DICER1-201 1831 W>>>* KAMGDIFESLAGAIYMDSGMSLETV*W*QVYYPMMRPLIEKFSANVPRSPVREL
    35 77 HS2213 cancer tissue unknown Ovary Ovary Matching tumor RNA and DNA bwa-0.5.7 sam-0.1.8 uncalled

Figure 3.4: The long form of the experimental record.
\ is used to indicate an artificially wrapped line.

3.5.3  Concordance

Scripts are provided to compute concordance with SNPs and INDELs from dbSNP. In the absence of validated sequencing results, this provides an approximate method for assessing the quality of calls against an expected background for the overall population. For INDELs, the concordance API calls accept a window parameter in order to compensate for the difficulty of rigidly defining Insertion and Deletion (INDEL) boundaries.

3.5.4  Modifying the API and the UI

The API and UI were designed to be as modular as possible to make modifications simple. Code duplication has been kept to a minimum and code is reused as often as the nature of the project allows. Many of the common pieces of code are isolated into simple classes designed to keep clutter to a minimum, and all of the formatting and project components follow the Sun Microsystems Code Conventions for the Java™ Programming Language (April 20, 1999 revision).

3.5.5  Ensembl

To derive value from the VDB, it is necessary to superimpose an annotation layer upon the otherwise un-annotated stored entries. The selection of potential annotation layers, however, is relatively small once compatibility with the existing Java-based API is taken into account. The Ensembl system was selected for several reasons:
• The Genome Sciences Centre maintains a local mirror of the Ensembl database, providing excellent on-demand retrieval.
• Frequent versioned updates to the Ensembl database annotations.
• Wide variety of supported species in the Ensembl database.
• The Ensembl Database provides a Java API that can be used to interface with any Java code.
• Previous experience and deployment of the Java API.
• Availability of staff with previous experience with the Ensembl database.
• The reduced burden of coding infrastructure that would be required to interface with a flat-file based system.
Consequently, Ensembl provides a near ideal interface for retrieving and interrogating annotations for a variety of organisms, including humans, that can be used to provide support to the annotation-free database of variations.[103]

A selection of functions is supported in the VSRAP in the Ensembl.java class. This library includes the ability to read a text file of available configurations to make updates easier, the creation and management of connections to the Ensembl database, and a set of functions for retrieving exon, gene and transcript annotations. This provides a simple platform for VSRAP developers to interact with the Ensembl database without re-creating much of the code required to interact with the ENSJ API.

Maintenance of the Ensembl Java API

Unfortunately, Ensembl has chosen to discontinue its support of the Java API, preferring to maintain only the Perl version. This limits the ability of most Java developers to make use of the Ensembl database, short of writing their own custom SQL queries against the MySQL back end used by the Ensembl database. For those who are using the Java API, this discontinuation of service means a loss of support and of developer knowledge for future changes to the API. However, the code used to create the ENSJ API is open source and available under the GNU Public License (GPL), granting the rights to modify and re-distribute the code as long as certain conditions are met. In accordance with the terms of the licence, and to support future use and modification of the code, a repository has been set up in parallel with the VSRAP at SourceForge (http://sourceforge.net/projects/ensj/), storing the changes made to keep this code working as necessary.
To date, only a small number of changes have been required, but placing the code in a location that others can use or contribute to provides access to a community that may be capable of making valuable contributions to keeping the code working long past its loss of support from the developing institution.

3.6  Applications Using the Variation Database

The VDB is an ideal tool for both discovery and annotation of variations of interest. It pairs an exhaustive and detailed listing of the variations found in samples with information relevant to the clinical origin of each sample. This allows associations to be made between disease states and variations from the background reference sequence. As the amount of information stored in the VDB grows, its power to provide insight into disease states also increases. This includes both the ability to filter out more normal variations and the ability to identify new patterns of disease associations.

3.6.1  Filtering Polymorphisms

The most frequent use for the database is to provide filtering for polymorphisms, removing variations that are known to be common in the general population. As many sequencing experiments are done with the goal of identifying novel or uncommon variations, using a database of this type is often the most efficient method of reducing a list of reported variations down to a candidate list of manageable size. Without the database, filtering can only be done by using published lists of polymorphisms such as dbSNP, or by compiling lists from projects such as the 1000 Genomes project. Where the database provides a significant advantage is in the ability to filter out biases introduced by the sequencing and bioinformatics platforms. For instance, if there are common errors in alignments of a particular sequence, these will be identified and collected during the sequencing of normal tissues as well as in samples of interest.
Thus, common sequencing platform and alignment errors stand a good chance of being filtered out when the database is used. Additionally, the database can be used to hold annotations and variant calls from dbSNP and the 1000 Genomes project, such that these resources can be used simultaneously to provide filtering.[92, 163]

3.6.2  Filtering Recurrent Variations

Another use for the database is to filter out variations that are present in the general population. This method is frequently used when seeking to identify rare variations in a cohort of patients, for instance when identifying a causative mutation for a hereditary disease in a single family. In such a case, the most likely model for the causative mutation would be either a homozygous recessive mutation (e.g. both unaffected parents carry a heterozygous mutation and the child carries a homozygous mutation at the same location), or a compound heterozygous mutation (e.g. each unaffected parent carries a different mutation and the child suffers from a synergistic cumulative effect of both variations). In both cases, removing as many non-contributing variations as possible increases the likelihood of identifying potentially causative mutations.

To do this filtering, there are heuristics that can be applied. For instance, if the disease is rare, it is possible to remove any variation that is found in more than 5-10% of the samples in the database. With several thousand samples in the database, it is reasonable to assume that the frequency of a variant in the database is a reasonable reflection of its frequency in the human population. A second heuristic is to remove variations that are found in non-affected individuals, provided that a certain threshold of coverage is passed. If the coverage is reasonably deep and the variation in the non-affected individual appears to be homozygous, it is unlikely to be a causative agent for the disease of interest.
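The two heuristics above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the VDB implementation: the `VariantSummary` class, its fields, and the method and parameter names are all hypothetical, and the thresholds are left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-variant summary; the fields are illustrative, not the VDB schema. */
final class VariantSummary {
    final String id;
    final int samplesWithVariant;         // database samples in which the variant was seen
    final boolean homozygousInUnaffected; // called homozygous in a non-affected individual
    final int unaffectedCoverage;         // read depth at the position in that individual

    VariantSummary(String id, int samplesWithVariant,
                   boolean homozygousInUnaffected, int unaffectedCoverage) {
        this.id = id;
        this.samplesWithVariant = samplesWithVariant;
        this.homozygousInUnaffected = homozygousInUnaffected;
        this.unaffectedCoverage = unaffectedCoverage;
    }
}

/** Sketch of the two rare-disease filtering heuristics (thresholds are assumptions). */
final class RareDiseaseFilter {

    /** Heuristic 1: the variant is seen in more than maxFraction of all samples. */
    static boolean isTooCommon(VariantSummary v, int totalSamples, double maxFraction) {
        return (double) v.samplesWithVariant / totalSamples > maxFraction;
    }

    /** Heuristic 2: homozygous in an unaffected individual at adequate coverage. */
    static boolean seenInUnaffected(VariantSummary v, int minCoverage) {
        return v.homozygousInUnaffected && v.unaffectedCoverage >= minCoverage;
    }

    /** Keeps only the variants that survive both heuristics. */
    static List<VariantSummary> candidates(List<VariantSummary> variants, int totalSamples,
                                           double maxFraction, int minCoverage) {
        List<VariantSummary> kept = new ArrayList<>();
        for (VariantSummary v : variants) {
            if (!isTooCommon(v, totalSamples, maxFraction)
                    && !seenInUnaffected(v, minCoverage)) {
                kept.add(v);
            }
        }
        return kept;
    }
}
```

With a hypothetical database of 2000 samples, `RareDiseaseFilter.candidates(variants, 2000, 0.05, 20)` would drop any variant seen in more than 100 samples, as well as any variant called homozygous in an unaffected individual at a depth of 20 reads or more. The coverage threshold of 20 is an arbitrary placeholder.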
It is, of course, important that this filter be applied only when the un-affected sample in the database is not cancer-derived, as it would be possible to obtain a random variation in a tumour that does not cause the carrier to show the disease phenotype.

3.6.3  Filtering to Identify Cancer Drivers

One of the most frequent uses to which the database is applied is filtering variants in a sample to identify cancer drivers. Cancer drivers are usually a small fraction of the variations in a given sample, given that cancers are genomically unstable and tend to accumulate large numbers of passenger mutations, making it difficult to identify those of specific interest. Separating out those variations that are not common in the general population and are likely to cause non-synonymous changes is usually the first step. The VDB is designed with these specific functions in mind.

The first step, removing polymorphisms, can be a challenge without a resource like the VDB. Lists of known polymorphisms, such as dbSNP, are a good first pass, but often need to be augmented by polymorphisms from other sources. These resources are also often inconsistent in what is considered a polymorphism. For instance, dbSNP 129 considered all variations in tumor protein p53 (TP53), a well known tumour suppressor gene, to be polymorphisms, whether they are oncogenic drivers or not. A better method for filtering polymorphisms, including those that are not currently part of an annotated collection of polymorphisms, is to collect a wide variety of cancer and non-cancer genomes and transcriptomes to filter against. Common polymorphisms will appear in both the cancer and non-cancer populations with equal frequency, and thus the ratio of cancer to non-cancer samples containing a given variation can be used to determine if a particular variation should be considered a candidate or not.
This, however, supposes that a large sampling of genomes and transcriptomes will be available, as low frequency polymorphisms will again be difficult to separate from the low frequency variants.

The second step for identifying cancer drivers is to infer whether a given variation is likely disruptive to the functioning of a normal cell. A reasonable proxy for this can simply be the reporting of the variation's ability to cause an amino acid change (missense mutation) or to insert a premature stop codon (nonsense mutation). Collectively, these changes are known as non-synonymous variations and are the most likely SNVs to be involved in altering the behaviour of a cell. Unfortunately, our current ability to predict the biological relevance of non-synonymous variations is relatively limited, despite years of bioinformatics applications attempting to solve this problem.[164] This step requires the use of an annotation system that provides the definitions for each gene annotation, and is heavily dependent upon an external source of information. Because the gene annotations vary between sources, the results generated during this step will necessarily be a function of the annotation system used.

A third step that can be undertaken is to use the results of the first two stages to identify which variations are likely to be high quality. This can use many different sources of input, and the criteria for quality can change depending on the project. This may involve using a matched normal tissue sample to filter out germ line variants, using exome capture to isolate RNA-edits or evaluating the coverage and depth of reads for a variation to determine whether it should be trusted. An important caveat, however, is the common misconception that it is possible to identify cancer drivers by searching for variations that have not been observed in any normal, as discussed below.
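The first two steps can be sketched as a simple screen that combines the cancer to non-cancer ratio with the non-synonymous flag. The sketch below is an illustration only: the class and method names, the pseudocount and the enrichment threshold are assumptions, and normalizing by the number of cancer and non-cancer data sets is an added choice so that the ratio is not skewed when the two groups differ in size.

```java
/** Sketch of a first-pass driver screen; names and thresholds are hypothetical. */
final class DriverCandidateScreen {

    /**
     * Ratio of the per-sample frequency in cancers to that in non-cancers.
     * The 0.5 pseudocount (an assumption) avoids division by zero when a
     * variant has never been observed in a normal sample.
     */
    static double enrichment(int cancerObs, int cancerTotal,
                             int normalObs, int normalTotal) {
        double cancerFreq = (cancerObs + 0.5) / (cancerTotal + 1.0);
        double normalFreq = (normalObs + 0.5) / (normalTotal + 1.0);
        return cancerFreq / normalFreq;
    }

    /** A variant is kept only if it is non-synonymous and enriched in cancers. */
    static boolean isCandidate(boolean nonSynonymous, int cancerObs, int cancerTotal,
                               int normalObs, int normalTotal, double minEnrichment) {
        return nonSynonymous
                && enrichment(cancerObs, cancerTotal, normalObs, normalTotal) >= minEnrichment;
    }
}
```

A polymorphism seen equally often in both groups gives an enrichment of 1 and is rejected by any threshold above 1. As noted above, the ratio is unreliable for variants observed only a handful of times, so a screen of this kind can only be a first pass.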
3.6.4  Variations Only Found in Cancer

One of the most fascinating uses for the database has been in identifying variations that have never been observed in non-cancer samples (i.e. variations that are limited to cancers). Despite the inherent issues with this approach, it is possible to identify variations that do not appear in normal tissues.

Figure 3.5: Each variation in the database is plotted by the number of cancer samples in which it is found (x-axis) versus the number of normal samples in which it is found (y-axis). The dark band along the line x = y most likely represents variations that are independent of the cancer/normal status of a tissue, i.e. the background of common human polymorphisms.

There are several significant hurdles that must be considered with this approach. The first is that many of the variations that are known to be involved in cancer are found in the healthy population. For instance, heterozygous variations that disable one copy of a tumour suppressor gene may appear in individuals who are healthy at the time of the sample, but who may be pre-disposed to cancer later in life. As a clear example, hereditary variations of the breast cancer 1, early onset (BRCA1) and breast cancer 2, early onset (BRCA2) genes are widely distributed in the population and are clearly oncogenic, but will not manifest until later in life. Samples from some phenotypically healthy young women will thus contain the clearly oncogenic variation, skewing the ratio of healthy versus cancer genomes carrying the variation. It is also important to factor in other metrics, such as the gender of the samples. To continue the example using the BRCA1 and BRCA2 genes, men can also carry oncogenic variations in these genes, but have a dramatically lower risk of developing breast cancer compared to female carriers of the mutations.
Thus, variations may also display variable penetrance depending upon other factors, which can be difficult to sort out using a simple "cancer/non-cancer" designation.

This effect is apparent in Figure 3.5. As the number of cancer samples in which a variation is observed increases, so too does the number of normal samples in which it is observed (e.g. there are no variations found in 600 or more cancer samples that are found in zero normals). There is a clear and distinct trend for highly prevalent variations that are cancer associated to be present proportionally in non-cancer samples. This ratio may be fertile ground for further investigations into cancer susceptibility genes. Furthermore, there are likely more subtle forms of cancer susceptibility in which several different genomic modifications or variations may work synergistically. Searching only for those variations that are unique to cancer samples will likely yield only the clearest non-hereditary oncogenic variations. Identifying hereditary variations will require other methods, such as family studies.

Noise

This issue is further complicated by the very nature of the cancers themselves. One of the defining characteristics of cancers is their genomic instability, often resulting in significant chromosomal rearrangements as well as an ability to rapidly replicate and evolve to overcome challenges, such as those presented by chemotherapy. This rapid evolution takes a toll on the genomic sequences of the cancer cells and is often reflected in their higher than average number of variations (data not shown) compared to non-cancer cells. It is no surprise, then, that cancers carry many passenger mutations: variations that are not oncogenic by themselves and are not present in the germ line of the carrier of the tumour.

This, again, emphasizes a limitation of the filtering system. Because a single gene can be disabled in many different ways (e.g. silencing by modification of upstream sequences, modifications to catalytic residues, disruption of important structures affecting the function of the protein, etc.), there is little reason to suspect that identical variations in a single gene will be recurrent in the vast majority of cases. BRCA1 and BRCA2 again provide good examples of this and have been well studied since their discovery. Because BRCA1 and BRCA2 mutations are often hereditary, it is possible to trace the prevalence of some variations across various distinct populations. Indeed, various groups have been shown to predominantly pass along distinct mutations. For instance, Hungarians carry three (300T>G, 5382insC and 185delAG), while Norwegians carry an entirely separate set of four variations (816delGT, 1135insA, 1675delA and 3347delAG). There is no reason to expect that variations in other cancer genes will be constrained to a small set of positions. Thus, finding a single causative variation in any given cancer gene is unlikely, and a much broader perspective (e.g. enrichment for variations and changes in expression level for genes or pathways) must be taken to identify novel targets for all but the most exceptional cases.

Example - DICER1 in Ovarian Cancer

There are clear examples of the use of the database to identify cancer specific driver mutations. One example showcasing the discovery of a link between a gene and a specific cancer type is research that has identified that variations in the dicer 1, ribonuclease type III (DICER1) gene have a role in ovarian cancer. The VDB contains 107 non-synonymous DICER1 variations that are not found in dbSNP, the majority of which are only observed in single cancer samples. This suggests a lack of enrichment in cancer samples; from this perspective, it is difficult to come to any conclusions about the passenger or driver status of the variations reported.
However, the same data were approached from a second direction, starting with several ovarian cancer sequencing samples in the database. A search for recurrent variations showed that a significant fraction of the ovarian cancer samples carried mutations in the DICER1 gene across a wide range of codons. Thus, while it was difficult to identify an enrichment of variations in the DICER1 gene across the population of cancers available in the database, it was clear, when studying a subpopulation based on ovarian cancer samples, that there is a specific enrichment present. This work provided supporting evidence for the study described in Heravi-Moussavi et al. [5].

3.6.5  Variations Never Found in Cancer

An equally interesting question is the identification of variations that are never observed in cancer samples. To date, this avenue of enquiry has not been pursued, but it is clear from Figure 3.5 that there is a small population of variations that have not yet been observed in tumours, or are only rarely observed in cancer samples. While many of them are likely due to sampling effects (e.g. the database contains mostly genomic information from normals, but transcriptomic information from cancer-derived tumours), the possibility exists that some of these variations may protect against the development of cancer.

One possible mechanism by which this could arise would be through coverage effects in transcriptomic data. If cancer transcriptomes do not express genes that suppress the development of cancer, then variants in those genes will only be identified in healthy tissues, thus identifying variations in genes that suppress cancer. This may provide clues towards the identification of genes that have a strong preventative influence on the cancer susceptibility of individuals carrying specific genomic variations. Alternately, it is entirely possible that some variations themselves provide protective mechanisms that suppress cancer on their own.
Identifying genes in this category would provide a useful insight into new mechanisms for preventative health care.

3.6.6  RNA Editing

Using the DARNED database, which collects a list of known A → G editing events, it is possible to annotate those events using the VDB.[160] This collection of annotated RNA edits can then be investigated independently, and evaluated for their frequency in cancer versus non-cancer samples.

Surprisingly, very few, if any, RNA edits appear more frequently in non-cancer samples (i.e. on the upper-left hand side of the line of slope = 1), while many of them appear enriched in cancer samples. This suggests that RNA editing may be a mechanism by which cancers create additional diversity and flexibility.

Figure 3.6: A graphical representation of the frequency of observations of variants in the cancer and normal populations, similar to Figure 3.5, using only variations known to be RNA edits from the DARNED database of RNA edits. Two distinct populations are observed: those that are at similar frequencies in both samples, along the diagonal axis, and those that appear to be enriched in cancers, where normal counts are close to zero. Similar numbers of data sets were compared: 1187 cancer data sets, 1052 normal data sets.

3.6.7  Transition and Transversion Frequency

An interesting example of the simple analyses that can be performed on the large collection of genomic and transcriptomic data gathered in the VDB is the analysis of transition and transversion frequencies. A simple consideration of the unique variations collected (i.e. not weighted by the number of times each variation has been observed) demonstrates that the ratio of transitions to transversions is skewed dramatically for all possible substitutions. The probability of substituting a pyrimidine base for a purine base, or vice versa, is much lower than that of a substitution that replaces a purine with a purine or a pyrimidine with a pyrimidine.
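The distinction between the two classes can be made concrete with a short Java sketch. The class and method names are hypothetical and are not part of the VDB API; the only facts used are that A and G are purines and that C and T are pyrimidines.

```java
/** Sketch: classify a single-base substitution as a transition or a transversion. */
final class SubstitutionClass {

    static boolean isPurine(char base) {
        return base == 'A' || base == 'G';
    }

    static boolean isPyrimidine(char base) {
        return base == 'C' || base == 'T';
    }

    /** True for purine-to-purine or pyrimidine-to-pyrimidine changes. */
    static boolean isTransition(char ref, char alt) {
        requireBase(ref);
        requireBase(alt);
        if (ref == alt) {
            throw new IllegalArgumentException("not a substitution: " + ref + "->" + alt);
        }
        return isPurine(ref) == isPurine(alt);
    }

    /** True for any purine-to-pyrimidine change, or vice versa. */
    static boolean isTransversion(char ref, char alt) {
        return !isTransition(ref, alt);
    }

    /** Rejects anything other than upper-case A, C, G or T. */
    private static void requireBase(char base) {
        if (!isPurine(base) && !isPyrimidine(base)) {
            throw new IllegalArgumentException("unexpected base: " + base);
        }
    }
}
```

Of the twelve possible single-base substitutions, only four are transitions and eight are transversions, so if every substitution were equally likely, transversions would outnumber transitions two to one.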
This ratio has long been observed in coding regions and has been used in models of DNA evolution over time.[165]

Figure 3.7 provides a visualization of the transition to transversion ratios. The distance from the centre indicates the number of variations of each type that have been mapped in the hg18 version of the database. Substitutions along the X and Y axes are transitions, and each is roughly 2.2- to 3.7-fold more common than any one transversion.

Also visible in Figure 3.7 are some slight biases in the direction of various substitutions. For instance, A → C and T → G variations have been observed at only 3.0 and 2.9 million positions in the human genome, while C → A and G → T variations have been observed at 4.4 and 4.5 million positions respectively. This is somewhat at odds with a naïve comparison to the GC content of humans, as only 41% of the human genome is made up of G/C pairs to begin with.[19] Thus, despite the genome having lower GC content, there is a proportionally larger bias in C → A and G → T variations.

A slightly different picture emerges when the total number of times each variation is observed is used (Figure 3.8). The relationship becomes slightly more even between the variations, although transitions still occur roughly 3.9- to 5.0-fold more often than transversions. Once observations are taken into account, it is clear that reciprocal substitutions (e.g. C → A and A → C) occur at roughly the same rate for all pairs. There is, however, a distinct trend for T → A and A → T transversions to occur with lower frequency than any other transversion. This observation has been made before with respect to pseudogenes, but has not been reported for sample sets of this size, spanning both inter- and intragenic regions.[166]

Figure 3.7: All observed transversions and transitions in the human genome, irrespective of the number of times each event has been observed. Simply charting all collected substitutions in the (hg18) human genome demonstrates the propensity for transitions (X and Y axes) to occur more frequently than transversions (diagonal axes). Counts shown in the figure:
    Transitions: A->G 9,034,194; G->A 9,280,554; T->C 9,103,641; C->T 9,364,899
    Transversions: A->C 2,998,288; A->T 3,641,362; C->A 4,377,334; C->G 2,549,791; G->C 2,592,931; G->T 4,532,147; T->A 3,468,334; T->G 2,932,776

Figure 3.8: Transversions and transitions occur at different frequencies. Comparing the actual frequency of SNP calls of all observations collected shows an even clearer pattern, in which transitions (A → G, G → A, T → C, C → T) dominate the variation landscape. The A → T and T → A transversions, however, show an approximately 16% under representation compared to the other six transversions. Counts shown in the figure:
    Transitions: A->G 225,386,052; G->A 228,102,780; T->C 225,519,072; C->T 227,830,078
    Transversions: A->C 55,317,002; A->T 46,647,713; C->A 57,592,735; C->G 58,507,399; G->C 58,641,197; G->T 58,002,617; T->A 46,396,606; T->G 55,156,181

3.6.8  Growth of the Database

The data content of the VDB has grown in two phases. The first was an initial growth as a small data set was imported, including dbSNP, several breast cancer cell lines and a sampling of the initial 1000 Genomes SNP calls. This is visible in Figure 3.9 as the initial rise in observations from zero to ∼200 million. This period included several small data sets for breast cancer cell lines, dbSNP and non-cancer genomes in the public domain such as the Watson (James Watson), Yoruba (1000 Genomes trio) and Venter (Craig Venter) data sets.

Figure 3.9: The hg18 version of the VDB has expanded significantly since implementation, experiencing several significant growth periods. A drop is observed in the number of observations close to the 100 day mark, as several early data sets of poor quality were removed; this is also reflected in a subsequent drop in the number of SNPs in the database past the 500 day mark.
The difference in dates for the removal is a consequence of the time stamps used for calculating the data plotted.

After several months, a second phase is visible, as data were imported from the full 1000 Genomes project and variant calls from the SNP-calling pipeline of the BCGSC were included. In this second phase, an average of approximately 3 million novel variants per day is being added, roughly equivalent to one genome's worth of variation per day. This is, on average, one billion variations per year, providing a rough estimate of the volume of data to be stored as the database continues to be used, assuming a constant sequencing throughput.

Several notable features are present on the chart. The first is the drop in variations near day 100. This reflects the discovery that some data sets were of insufficient quality and were providing more noise than signal. Deleting them from the database reduced the overall number of variations present. The second is that the number of unique variations (i.e. coordinate sets of chromosome, position and observed base) does not increase proportionally to the number of observations overall. Thus, it can be inferred that most of the variation we observe is not new (i.e. is composed of polymorphisms), although the number of unique variations clearly does rise, which indicates that there is a small but steady amount of variation unique to each individual or each cancer imported.

3.7  Future Plans for the Database

The VDB has begun to provide insights both into population genetics and in the application of personalized medicine, by making it possible to focus on variations that are most likely to be of interest to researchers. However, the database is also a useful template upon which further expansion and discovery can be conducted. The database has exhibited tremendous growth in terms of the information it contains and has continued to scale well to provide rapid responses to queries despite that growth.
This makes it a good model for the storage of other forms of genomic information. Although not discussed extensively here, the database has already been extended to include the storage and processing of INDELs, and discussions are underway to utilize similar or identical schema to store and retrieve other forms of genomic variations.

3.7.1  Expansion of the Database

The VDB has proved to be a successful model for a scalable method of collecting large amounts of biological data. However, Single Nucleotide Polymorphisms (SNPs) are not the only useful biological data that can be culled from Next Generation Sequencing (NGS) data sets. INDELs, Copy Number Variations (CNVs), expression data from RNA Isolation and Sequencing (RNA-Seq) and a host of other structural variation types can all be deduced from information obtained during sequencing. Thus, planning has begun to expand the VDB into a series of related databases, capable of sharing core sample information and retaining information in a similar manner to the original VDB. The general overview for the databases is shown in figure 3.10.

The first part of the process is to separate the VDB into two components: those that relate specifically to the storage and annotation of SNVs and INDELs, and those that pertain to information relative to sequencing data sets (e.g. the library and sample tables). The SNV and INDEL data will be stored as in the original model, with no changes required, and will still be referred to as the VDB. The library information will be moved to a separate database, called the sample database after its storage of information about data samples, and will be query-able in parallel to the SNV and INDEL database. Once this separation is complete, other databases can be implemented that share the same sample database, but store records for other forms of variations and related information. The main challenge faced by this move will be the separation of the annotations from the variation calls.
This was one of the key design decisions of the VDB (see section 3.3.1), ensuring scalability as well as avoidance of annotation-based obsolescence. Unfortunately, most of the pipelines currently available terminate with annotation information being tightly integrated into the results returned, which will require either a significant loss of information upon import, or changes to the pipelines to remove unnecessary annotation steps. These challenges are similar to the ones faced with the design and implementation of the SNVs and INDELs database, but will require more software modifications to existing pipelines.

[Figure 3.10 diagram components: Sample DB, SNV / Indel DB, LOH / CNV DB, Structural Var. DB, Annotations]

Figure 3.10: In order to expand the model of the Variation Database (VDB) to new types of information gathered from assembly and alignment data, several new databases will be created to work in concert with the VDB. This involves splitting the VDB into two components, a “sample” database, containing information on the data being sequenced, and an “SNV” database, containing the SNVs and INDELs as currently recorded. This will provide a scalable model for future expansion to new data types with the least maintenance and duplication of information.

One of the planned databases will include information from the Trans-ABySS pipeline, an RNA-Seq assembly and analysis tool.[167] While Trans-ABySS is able to use known annotations, it is also capable of discovering novel genes, exons or other features. Thus, storing information from Trans-ABySS necessitates the storage of novel annotations. Fortunately, this also allows for the development of a novel database that can be used to supplement the Ensembl annotations currently used by the structural VDB (see fig 3.1).
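The planned separation of a shared sample database from the per-variation-type databases can be illustrated with a small sketch. The following is an illustration under stated assumptions rather than the VDB schema: it uses Python's built-in sqlite3 module in place of the PostgreSQL server on which the VDB runs, and every table name, column name and identifier in it (library, snv_observation, LIB_A and so on) is hypothetical.

```python
import sqlite3


def build_demo_databases() -> sqlite3.Connection:
    """Create two in-memory databases: a sample store and an SNV store."""
    # The main connection stands in for the SNV / INDEL database.
    conn = sqlite3.connect(":memory:")
    # A second, attached database stands in for the shared sample database.
    conn.execute("ATTACH DATABASE ':memory:' AS sample_db")

    # Hypothetical sample-side table: one row per sequenced library.
    conn.execute(
        "CREATE TABLE sample_db.library ("
        " library_id TEXT PRIMARY KEY,"
        " sample_name TEXT NOT NULL,"
        " is_cancer INTEGER NOT NULL)"
    )
    # Hypothetical variation-side table: one row per variant seen in a library.
    # No annotation columns are stored here, mirroring the separation of
    # annotations from variation calls described in the text.
    conn.execute(
        "CREATE TABLE snv_observation ("
        " library_id TEXT NOT NULL,"
        " chromosome TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " observed_base TEXT NOT NULL)"
    )

    # Toy rows, invented purely for illustration.
    conn.executemany(
        "INSERT INTO sample_db.library VALUES (?, ?, ?)",
        [("LIB_A", "tumour_A", 1), ("LIB_B", "normal_A", 0)],
    )
    conn.executemany(
        "INSERT INTO snv_observation VALUES (?, ?, ?, ?)",
        [
            ("LIB_A", "chr1", 1000, "G"),
            ("LIB_A", "chr2", 2000, "T"),
            ("LIB_B", "chr1", 1000, "G"),
        ],
    )
    return conn


def observations_per_sample(conn: sqlite3.Connection) -> list:
    """Join the variation store against the shared sample store."""
    query = (
        "SELECT l.sample_name, COUNT(*)"
        " FROM snv_observation AS o"
        " JOIN sample_db.library AS l ON l.library_id = o.library_id"
        " GROUP BY l.sample_name"
        " ORDER BY l.sample_name"
    )
    return conn.execute(query).fetchall()
```

In this arrangement a further database (for example, one holding structural variations) could be attached in the same way and joined against the same library table, which is the property the planned design relies on.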
Chapter 4
Mammary Ductal Carcinoma Cell Lines

“The majority of breast cancer research is conducted using established breast cancer cell lines as in vitro models.”
- Burdall et al., 2003, “Breast cancer cell lines: friend or foe?”

This chapter deals with the analysis of eight ductal carcinoma cell lines and four B-cell derived matched normal cell lines, identifying specific mutations of interest as well as performing an inter-se comparison in order to locate common variations and features. Much of the bioinformatics work done utilizes the Variation Database described in chapter 3.

The analysis of the eight cancer-derived cell lines and four B-cell derived normals is done in two parts. The first part utilizes the genome or transcriptome wide properties, such as concordance with SNP databases, confirmation that B-cell derived samples are appropriately paired with their corresponding cancer-derived samples and the comparison of the inherent properties of each cell line to identify common themes and properties. These metrics provide confidence in the treatment of the samples and can be used to isolate errors in sample handling. The second component of the analysis covers the search for recurrent variations, common to more than one cell line. These variations can provide insight into common pathways that are subject to selection for oncogenic properties.

The major challenge in the analysis of these cell lines is the diversity of the samples. All eight cancer-derived cell lines come from grade three primary Invasive Ductal Carcinoma (IDC) tumours. However, the cell lines share little in terms of the histological markers or characteristics of the tumour from which they were taken.
(See sections 4.1.1 and 4.2.7)

4.1  Methods

4.1.1  Cell Lines

Twelve cell lines were sequenced for this project: four primary ductal carcinoma cell lines (HCC38, HCC1187, HCC1395 and HCC2218) with respective Epstein-Barr Virus (EBV) immortalized B-cell derived matched normals (HCC38BL, HCC1187BL, HCC1395BL and HCC2218BL) and four primary cell lines without matched normals (HCC70, HCC202, HCC1500 and UACC-893). Cell lines were all obtained from the American Type Culture Collection (ATCC).

Cell Line    Stage    Grade
HCC38        IIB      3
HCC1187      IIA      3
HCC1395      I        3
HCC2218      IIIA     3
HCC70        IIIA     3
HCC202       IIIA     3
HCC1500      IIB      3
UACC-893     II       3

Table 4.1: The tumours from which the cancer-derived cell lines originated spanned several stages, although each one is derived from a primary tumour removed from the breast, rather than a metastatic site or pleural effusion.

Transcriptome shotgun sequencing was carried out as previously described in Jones et al. [119] to generate paired end 50bp reads. Exome capture was carried out using the Agilent SureSelect (version 1) Kit as per the manufacturer’s protocol to generate paired end 50bp reads.

4.1.2  Reads Aligned

The numbers of reads obtained for each of the cell lines are listed in Table 4.2. Ribonucleic Acid (RNA) Isolation and Sequencing (RNA-Seq) was performed with either two or three lanes for all twelve cell lines. One lane of exome capture Deoxyribonucleic Acid (DNA) sequencing was performed for the four cell lines where matched normal DNA was available, as well as the DNA from the matched normal cell lines themselves. The local identifier provides the library names used internally at the BCGSC, and is the ID by which they can be located in the human variation database.
Cell Line    Local Identifier    Protocol       Lanes    Reads
HCC38        HS1623              RNA-Seq        3        66906088
HCC38        HS2306              DNA (exome)    1        34380599
HCC38BL      HS1850              RNA-Seq        2        80807943
HCC38BL      HS2173              DNA (exome)    1        36104265
HCC1187      HS1745              RNA-Seq        3        46089683
HCC1187      HS2307              DNA (exome)    1        27906177
HCC1187BL    HS2004              RNA-Seq        2        49860152
HCC1187BL    HS2174              DNA (exome)    1        34358035
HCC1395      HS1804              RNA-Seq        3        66893070
HCC1395      HS2315              DNA (exome)    1        37448046
HCC1395BL    HS2005              RNA-Seq        2        55490282
HCC1395BL    HS2175              DNA (exome)    1        33792015
HCC2218      HS1849              RNA-Seq        2        55848971
HCC2218      HS2172              DNA (exome)    1        33531451
HCC2218BL    HS2006              RNA-Seq        2        57975905
HCC2218BL    HS2176              DNA (exome)    1        37093727
HCC70        HS1634              RNA-Seq        3        66268245
HCC202       HS1744              RNA-Seq        3        58764709
HCC1500      HS1805              RNA-Seq        3        60682351
UACC-893     HS2116              RNA-Seq        2        76120642

Table 4.2: Summary of Sequenced Cell Line names and reads

All of the sequencing for this project was done within a short time period, with the exception of the HCC38BL sample, which was shipped from the ATCC several weeks after the other cell lines and whose sequencing coincided with chemistry and hardware upgrades to the Illumina sequencing platform used. This is primarily responsible for the significant increase in reads, despite the total of only 2 lanes of sequencing, in Table 4.2. HCC38BL also displays several oddities in terms of read depth and variants observed, which appear in several of the other figures. These oddities are most likely caused by changes in sensitivity and biases of the machine performing the sequencing, rather than a substitution or mislabelling of the sample.

4.1.3  Bioinformatics

Sequences were aligned to the human reference genome (hg18) using bwa 0.5.7, and Single Nucleotide Variations (SNVs) were called using SNVMix2 0.12alpha.[13, 17]

Assembly

Transcript assembly was performed using Trans-ABySS 1.2.0.[167] Assembled transcripts were compared to the University of California, Santa Cruz (UCSC) KnownGene annotations [100], downloaded Sept 2009, using custom scripts.
Analysis

SNP calls were imported into the PostgreSQL variation database using tools in the Vancouver Short Read Analysis Package, and first level reports were generated using the ExperimentalRecord tool [169]. Gene annotations were obtained from Ensembl version 54 (36p). Clustering analysis was done using the open source statistical package R.

Verification

Bioinformatic verification was performed by comparing best candidate non-synonymous variants to the Catalogue of Somatic Mutations in Cancer (COSMIC) database [117], which contains lists of variants previously found in cell lines, particularly those from ATCC origins, which have been partially sequenced. Any variant observed in both the COSMIC database and our own results, with the same coordinates and non-synonymous change confirmed for the same cell line, was considered independently validated, as it reflects the use of two entirely separate technologies (see Figure 4.8): the Single Nucleotide Polymorphisms (SNPs) stored in the COSMIC database were identified by SNP6.0 arrays from Wood et al. [116], in contrast to Illumina sequencing at Canada’s Michael Smith Genome Sciences Centre (BCGSC). Best candidate non-synonymous variants observed in two different cell lines that were not confirmed this way were validated using Sanger sequencing. More recently, SNVs from Stephens et al. [170], using the Illumina GA II sequencing platform with 65,000,000 paired end reads from each cancer, have been added to COSMIC, providing further validation for the reported variants.

Primers for the studies were designed using the “Primer 3” Web tool [171]. Forward primers had the –21M13F sequence (TGTAAAACGACGGCCAGT) added to them and the reverse primers had the M13R sequence (CAGGAAACAGCTATGAC) added for sequencing of the Polymerase Chain Reaction (PCR) products. All primers were ordered at 50nmol scale from Invitrogen Life Technologies. The primer sequences for the amplicons are listed in Table 4.3.
Gene name    Primers
TRPM7        FWD 5’-AAAATCCAGGCTCCAGTTGT-3’
             REV 5’-CACCAAACCTGAAGTCATTCTG-3’
MACF1A       FWD 5’-AAGGGGATCAGAGACTCCTCA-3’
             REV 5’-ACCCGGTCAGTTCCTCATAA-3’
MACF1B       FWD 5’-CTTGGCAACAGAATTCCAGA-3’
             REV 5’-GCAAAAACCTGAGGGAACAA-3’
TUBGCP6A     FWD 5’-CAAACCCCGAGAGAGATGAA-3’
             REV 5’-ACCCTTGTGGTGACAGGTTC-3’
TUBGCP6B     FWD 5’-GAGCCACGTCCGACAC-3’
             REV 5’-ACGGACACGTATCCAATG-3’

Table 4.3: Primers used for the verification of high confidence variations.

A 3 µl aliquot of each PCR reaction was run on a 2% agarose gel (BioWhittaker Molecular Applications) to confirm amplification and quality of the product. The remaining 17 µl of PCR product was purified with Ampure magnetic beads (Agencourt Bioscience Corporation) and eluted in a volume of 30 µl of TE (Tris-EDTA pH 8.0). Purified PCR products were then cycle sequenced using Big Dye Terminator Mix V.3.1 (Applied Biosystems) in both forward (-21M13F primer) and reverse (M13R primer) directions. Completed cycle sequencing reactions were then loaded on ABI-3730XL capillary sequencers to collect sequence data.

4.2  Properties of Cell Lines

Several different metrics were used to ensure that the sequencing of the cell lines provided good quality data and to ensure that cancer-derived cell lines are correctly matched with their corresponding B-cell derived “normal”, when one is available. These metrics include concordance analyses for called variations with a known polymorphism database, concordance between matched pairs and a global comparison of similarities in genes with non-synonymous variations. We also describe a comparison of the population of transitions and transversions for samples and a large database of variations, providing some context for the variations observed. Additionally, it is possible to identify trends between the DNA- and RNA-based sequencing performed, which are also described.
Finally, we describe variations observed in the breast cancer 1, early onset (BRCA1) and breast cancer 2, early onset (BRCA2) genes, which are common risk factors for developing mammary cancers, and some of the phenotypic markers of each cell line, as reported in literature. Transcriptome sequencing, which measures transcript abundance, and thus is a metric for relative transcription of genes, is also shown to provide supplementary confirmation of these properties and can be used to provide details about cellular markers where the information is not otherwise available.

4.2.1  Concordance with dbSNP

Concordance, the overlap or agreement between two data sets, is commonly used to estimate the fraction of a sample that conforms to a known data pattern. One of the most frequently used concordances for variant calling is a comparison with dbSNP, as described in section 3.5.3. The use of a known set of polymorphisms for comparison provides a rough estimate of the quality of a given sample, as random variation will typically not occur at polymorphic sites. Thus, for a sequenced human genome, exome or transcriptome, the concordance should remain high unless a lot of novel variation (e.g. cancer with genomic instability) or noise (e.g. sequencing or aligning errors) is present. However, a high concordance with a known set of polymorphisms does not reflect the confidence in any individual variant reported, and it does not necessarily indicate a high quality library. In contrast, a poor concordance with a known set of polymorphisms is a good indication of a failure at some point in the sequencing or bioinformatics pipeline. It is unusual for good quality libraries, sequenced after 2009, to have concordance rates with dbSNP below 60-65%. For standard concordance calculations, dbSNP is used to provide the known set of polymorphisms, from which quality estimations can be inferred.
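The concordance described above can be written down as a short calculation. The following is a minimal sketch, assuming that a variant is identified by its chromosome, position and observed base, and that concordance is the fraction of called variants also present in the known set; the function name and the toy coordinates are hypothetical, and the exact matching rule used for dbSNP (section 3.5.3) may differ.

```python
from typing import Iterable, Set, Tuple

# A variant is keyed here by (chromosome, position, observed base).
# This key is an assumption made for the purposes of the sketch.
Variant = Tuple[str, int, str]


def concordance(called: Iterable[Variant], known: Set[Variant]) -> float:
    """Return the fraction of distinct called variants found in the known set."""
    called_set = set(called)
    if not called_set:
        raise ValueError("no variants were called; concordance is undefined")
    return len(called_set & known) / len(called_set)


# Toy example: three of the four called variants are known polymorphisms.
known_polymorphisms = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 75, "G")}
sample_calls = [
    ("chr1", 100, "A"),
    ("chr1", 250, "T"),
    ("chr2", 75, "G"),
    ("chr3", 10, "C"),  # novel variant, or noise
]
example_rate = concordance(sample_calls, known_polymorphisms)
```

As the text notes, a single rate of this kind summarizes the library as a whole and says nothing about the reliability of any individual call.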
DNA-based libraries typically provide the best concordance rates (> 90% concordant) with dbSNP v.130, which is intuitive: the alignment process is direct and both sequenced material and reference templates should match closely for most samples. Furthermore, the DNA-based sequencing done in this study reflects the use of exome capture technology. This provides a bias in favour of exons, which are less likely to contain difficult to sequence regions, and are likely to be better annotated in dbSNP, the metric used for determining concordance. It is also possible for DNA sequences containing variations to be captured less efficiently by the exome capture probe sets, depending upon the hybridization conditions used. In figure 4.1, all of the DNA-based libraries show a concordance rate between 90-95%, indicating high confidence in the quality of the data obtained. The concordance rates for the cancer-derived samples are typically 1-2% lower than those of the B-cell derived samples, reflecting that the cancers typically undergo greater instability and are believed to have more non-polymorphic variants, including both passenger and driver mutations.

In contrast to DNA-based samples, RNA sequencing typically provides concordance rates between 70-95%, which reflects the existence of post-transcriptional processes such as RNA editing and splicing that further complicate the re-alignment process. In figure 4.1, RNA-based libraries show a wide range of concordance rates, from 75-85%.

Figure 4.1: Concordance with dbSNP v.130 for each cell line, showing RNA and exome capture, where collected.

4.2.2  Matched Normals

For cell lines with matched normals, it is always important to ask whether the matched normal is correctly paired with its diseased sample.
To confirm that the pairing is done correctly, one can compare the number of variants observed in each combination of cell lines, and the two that come from the same individual should always provide the highest rate of common variations.

For the RNA-derived variants of the four ductal carcinoma cell lines with matched normals, these relationships are shown in table 4.4. The low concordance in this table reflects that the variations reported in RNA-based sequencing are a reflection of the genes expressed in the cell lines of origin. Because the matched normal is derived from immortalized B-cells, while the tumour originated in the mammary gland, it is to be expected that a significant number of the genes expressed will be unique to each lineage, resulting in a low overlap.

Cell Line      SNV called    HCC38            HCC1187          HCC1395          HCC2218
(SNV called)                 40928            66043            69408            67919
HCC38BL        148836        24370 (59.5%)    22985 (34.8%)    23379 (33.7%)    25485 (37.5%)
HCC1187BL      59163         14316 (35.0%)    23304 (35.3%)    17215 (25.8%)    17232 (25.4%)
HCC1395BL      63400         14860 (36.3%)    17146 (26.0%)    24104 (34.7%)    17862 (26.3%)
HCC2218BL      72753         14860 (36.3%)    18357 (27.8%)    18728 (27.0%)    26585 (39.14%)

Table 4.4: Matched Normals: This table illustrates the increased overlap of each cell line with its matched normal, relative to the other normals, using RNA-derived SNVs, where the variant was observed a minimum of 5 times. HCC38BL was sequenced with the greatest number of reads, which is reflected in the number of variants observed, somewhat obscuring the relationships between the normals and the samples to which they match.

In contrast, variations derived from DNA, shown in table 4.5, are not influenced by the expression patterns of specific cell types, and thus the relationships between the tumour-derived samples and their B-cell derived matched normals are more apparent and serve as the best metric to demonstrate that the cell lines have all been paired correctly.
Unfortunately, there is no baseline by which this can be assessed, and comparing two random samples together will yield a wide range of overlaps between samples. Furthermore, without information about the ancestry of the individuals from whom the cell lines were obtained, it is impossible to know what the overlap should be, meaning that any metric of this type can only serve to identify when two cell lines are incorrectly paired, and not to confirm that the pairing is correct.

Cell Line      SNV called    HCC38              HCC1187            HCC1395            HCC2218
(SNV called)                 51,023             55,046             60,679             57,759
HCC38BL        64,491        44,893 (87.99%)    32,583 (59.19%)    43,494 (71.68%)    33,985 (58.84%)
HCC1187BL      61,496        30,072 (58.94%)    45,218 (82.15%)    35,422 (58.38%)    33,055 (57.23%)
HCC1395BL      64,645        30,676 (60.12%)    32,302 (58.68%)    44,233 (72.90%)    34,081 (59.07%)
HCC2218BL      62,709        29,949 (58.68%)    32,092 (58.30%)    35,369 (58.29%)    49,117 (85.04%)

Table 4.5: Matched Normals: This table illustrates the increased overlap of each cell line with its matched normal, relative to the other normals, using DNA-derived SNVs, where the variant was observed a minimum of 5 times.

It is unexpected that HCC1395 exhibits a high concordance with HCC38BL (the B-cell derived normal for HCC38), while HCC38 does not reciprocate this relationship. However, despite the high concordance, the HCC38BL cell line provides the highest match for HCC38 among all of the cancer-derived cell lines, and vice versa, indicating that the pairing is most likely correct, and the overlap may be indicative of other common elements, such as the racial origin of the samples.

To provide additional confirmation that the cell lines obtained were correctly labelled, non-synonymous variations from the cancer cell lines compiled in COSMIC were compared against the non-synonymous variations called on the sequenced data obtained in this experiment. In each case, the list of known non-synonymous variations in COSMIC for each cancer cell line was found in the matching sample, as anticipated.
(See Figure 4.9.) COSMIC does not compile the same lists of variations for cell lines that are of B-cell lineage.

4.2.3  Relationships Between Cell Lines

To show the pairwise relationship, figure 4.2 utilizes the thickness and opacity of the connecting lines to represent the number of genes with non-synonymous variations, not reported as polymorphisms, in common between each pair of cell lines. This method makes it apparent that HCC38 and HCC1395, both reported as triple negative ductal carcinomas, share the most variations in common. However, HCC1187, despite also being a triple negative ductal carcinoma, displays a much weaker relationship with HCC38, while having some commonality with HCC1395. Thus, the triple negative status of the cell lines does not necessarily indicate that the variations observed map to a common set of genes.

Figure 4.2: Relations between the sequenced cell lines. Thickness and opacity of the lines represent the number of genes with non-polymorphic, non-synonymous variations in common.

A second method of visualizing the same data is to perform a hierarchical cluster analysis based on the dissimilarities of the cell lines, as in figure 4.3. Although the number of samples is too small to provide great statistical significance for the trends observed, some useful information can be gleaned from the visualization. HCC38 and HCC1395, two of the triple negative cell lines, share much in common with each other and very little with the other cell lines, and appear at the end of their own node. The only two Her2+, ER-, PR- cell lines, UACC-893 and HCC202, also cluster together. HCC1500, which shares the least in common phenotypically with the other cell lines, appears to cluster along with the Her2+, ER-, PR- cell lines, despite being Her2-, ER+ and PR+.
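The pairwise comparison underlying both visualizations can be sketched as follows. This is a minimal illustration, assuming each cell line is summarized as a set of gene names carrying non-polymorphic, non-synonymous variations. The Jaccard-style dissimilarity is an assumption chosen for the example, since the exact dissimilarity measure supplied to the clustering is not specified here, and the line and gene names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_gene_counts(genes_by_line: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count genes with qualifying variations shared by each pair of cell lines."""
    return {
        (a, b): len(genes_by_line[a] & genes_by_line[b])
        for a, b in combinations(sorted(genes_by_line), 2)
    }


def jaccard_dissimilarity(genes_a: Set[str], genes_b: Set[str]) -> float:
    """Return 1 - |A and B| / |A or B|; 0.0 means identical gene sets."""
    union = genes_a | genes_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(genes_a & genes_b) / len(union)


# Hypothetical input: gene sets for three unnamed cell lines.
toy_gene_sets = {
    "line_1": {"GENE_A", "GENE_B", "GENE_C"},
    "line_2": {"GENE_B", "GENE_C", "GENE_D"},
    "line_3": {"GENE_E"},
}
toy_counts = shared_gene_counts(toy_gene_sets)
toy_distance = jaccard_dissimilarity(toy_gene_sets["line_1"], toy_gene_sets["line_2"])
```

The shared counts correspond to the line weights of a network drawing, while a full matrix of pairwise dissimilarities is the kind of input a hierarchical clustering routine would take.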
[Figure 4.3 image: Cluster Dendrogram; leaves: HCC202, HCC1500, UAC.893, HCC2218, HCC38, HCC1395, HCC1187, HCC70; vertical axis: Height, approximately 1.04 to 1.14; hclust (*, "complete")]

Figure 4.3: Clustering of the cell lines using R’s hclust function. The height axis is an arbitrary unit measuring the dissimilarity of the samples based upon the number of genes with non-synonymous variations shared between each pair of samples.

4.2.4  Transitions, Transversions and RNA Editing

As discussed in section 3.6.7, transversions and transitions exhibit noticeably different frequency characteristics, which can be observed both for large collections of data as well as for individual data sets. For each of the 4 cell lines with matched normals, the transversion and transition ratios are shown for both RNA and DNA for both the cancer-derived cell line and the B-cell derived matched normal in figure 4.4. In addition to the cell lines shown, the average for the database is also shown, reflecting the results indicated in figure 3.7 and figure 3.8. Several key points are apparent:

• The overall transition and transversion ratios are similar to those observed for the whole database.
• The RNA based samples show a distinct gain in A → G and T → C variations.
• The RNA based samples show a distinct depletion in C → T and G → A variations.
• A → T and T → A transversions are consistently depleted.

The percentage transition and transversion ratios match closely with those of the whole database, which likely reflects the underlying thermodynamics of the process of substitution. Replacing a base with another base of the same size (transition) is always more favourable than replacing it with a base of a different size (transversion).
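The distinction between the two substitution classes, and the share of the total contributed by each substitution, can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the counts are the whole-database observation totals reported in figure 3.8.

```python
from typing import Dict, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}->{alt}")
    # Purine<->purine or pyrimidine<->pyrimidine keeps the base size (transition).
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Observation counts per substitution, as reported in figure 3.8.
FIGURE_3_8_COUNTS = {
    "A->G": 225_386_052, "G->A": 228_102_780,
    "C->T": 227_830_078, "T->C": 225_519_072,
    "A->C": 55_317_002, "C->A": 57_592_735,
    "A->T": 46_647_713, "T->A": 46_396_606,
    "G->C": 58_641_197, "C->G": 58_507_399,
    "G->T": 58_002_617, "T->G": 55_156_181,
}


def summarize(counts: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return class totals and each substitution's fraction of all observations."""
    totals = {"transition": 0, "transversion": 0}
    for substitution, n in counts.items():
        ref, alt = substitution.split("->")
        totals[classify(ref, alt)] += n
    grand_total = sum(counts.values())
    fractions = {sub: n / grand_total for sub, n in counts.items()}
    return totals, fractions


class_totals, fractions = summarize(FIGURE_3_8_COUNTS)
ti_tv_ratio = class_totals["transition"] / class_totals["transversion"]
```

With these counts, each of the four transitions accounts for a little under 17% of all observations, and the overall transition to transversion ratio is close to 2.1.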
While there is much discussion of the transition and transversion bias in the literature, as well as description of the variable ratio of transitions and transversions across species, there appears to be an assumption that the ratio of each transition will be a constant 1/4 of the total number of transitions, while the ratio of each transversion will be a constant 1/8 of the total number of transversions.[172–174] In the DNA-based samples, the transition ratios do appear to remain roughly constant, with each transition providing about 17% of the total substitutions occurring, about the same as the average for the whole database. In the RNA-based samples, however, the transitions exhibit a significant skew from the 17% of the DNA, showing about 20% A → G and T → C each, while C → T and G → A drop to ∼15%.

These changes in the RNA are likely due to canonical RNA editing, which replaces Adenosine bases with Inosine (I), and Cytosine bases with Uracil.[175, 176] These bases are read by the sequencers as G’s and T’s respectively. Because the sequences are also not strand specific, the A → G changes will also be observed as T → C changes when mapped on the opposite strand, while C → T changes will also be observed as G → A. The increase in A → G and T → C variations is thus easily explained by the canonical A → I (G) RNA edit; however, this does not necessarily account for the drop in C → T and G → A variations. If the less common Cytosine to Uracil RNA edit were observed, these variations would also show an increase, with corresponding decreases in the C transversions. However, that is not observed. Thus, it appears that the A → G variations come at the expense of a depletion in G → A (and similarly for the complementary transitions).

Figure 4.4: Frequency of base substitutions for each cell line with a matched normal, showing variants called in both RNA and exome capture sequencing.
Depletion of the C → T and G → A transitions and increases of A → G and T → C transitions, caused by RNA editing, are clearly visible in the RNA samples.

4.2.5  Exome Capture vs. RNA

In addition to RNA Editing, the DNA from the exome capture and the RNA from the RNA-Seq experiments also provide slightly different coverage of genes. Figure 4.5 illustrates these differences, using the variants observed in each cell line and by each technology.

Figure 4.5: Division of intragenic variants by cell line. This figure indicates the relative population of UTR, intron, synonymous and non-synonymous single nucleotide variants for both exome capture (indicated by the label DNA) and RNA-Seq (RNA) of each cell line and its matched normal.

The first trend that becomes clear is that exome capture is able to identify more variations in exons because it is able to see the exons that are not expressed, while RNA-Seq can only read variations in expressed genes. Furthermore, because RNA-Seq also reads exonic sequences proportionally to the expression level of the exon, exons that are not expressed highly can be lost or inadequately sampled, leading to fewer SNV calls that pass quality assurance thresholds. However, RNA-Seq also shows fewer intronic SNVs, due to splicing removing most introns from the messenger Ribonucleic Acid (mRNA) used to create the RNA-Seq library.

The second is that RNA-based libraries identify more variations in UTR regions, likely due to the lack of probes covering these regions for the exome capture libraries. 5’ and 3’ UTR regions are retained in the sequenced RNA-Seq libraries, and thus more variations are discovered in these regions.

Finally, exome capture identifies both more synonymous and more non-synonymous variations in the coding regions, when compared to the RNA-based methods.
This is likely due to the improved ability of DNA-based sequencing to align to the genomic template, although it is also possible that the use of probes may be identifying more genomic regions with variation that are not expressed in the samples. These may represent silent changes that do not affect genes of interest.

4.2.6  BRCA1/2 Status

One interesting facet of the sequencing done for this study has been the ability to look at known breast cancer associated genes. While it does not provide a complete picture of breast cancer, the status of the BRCA1 and BRCA2 genes can play a significant role in a patient’s risk of developing breast and ovarian cancers. In this case, we investigated the two genes for the eight patients whose tumours and B-cells were used to develop the cell lines. Visualizations are provided in figure 4.6, illustrating the status of each cell line for both RNA and DNA.

In the BRCA1 gene, the HCC1395 and HCC1395BL cell lines show a truncating R1751 (BRCA-001) mutation in both the RNA and DNA, suggesting that this is in fact a germ-line BRCA1 variation. What is most interesting about this mutation is that it appears heterozygous in the DNA of both the cancer-derived and B-cell-derived cell lines, and appears in only ∼50% of the RNA reads covering that location of the B-cell derived cell line (represented by triangles in figure 4.6), but makes up 100% of the reads at that location in the RNA of the cancer-derived cell line (represented by a square in figure 4.6). This suggests that there may be allele specific expression, loss of heterozygosity or other mechanisms operating on this gene in the context of the cell line.
Furthermore, the cancer-derived HCC1395 cell line also carries a heterozygous truncating E1593 BRCA2 mutation, which is not observed in the B-cell-derived HCC1395BL cell line, indicating that this is either a somatic mutation or a change that took place after the establishment of the cell line. Of interest, the germline truncating R1751 mutation is a common founder mutation among Greek Europeans, hinting at a likely ethnicity for the donor of the cells used to create the HCC1395 and HCC1395BL cell lines.[177] This fits well with the published “white, caucasian” description of the cell line. None of the other cell lines carry truncating BRCA1 or BRCA2 mutations; however, other non-synonymous mutations are observed. In each case, the variation database shows that these other variations are commonly observed in non-cancer sequencing, indicating that they are common polymorphisms (see figure 4.6).

[Figure 4.6 image: panels for transcripts BRCA1-210 and BRCA2-001, each showing exons with RNA tracks (HCC38, HCC1187, HCC1395, HCC2218, HCC70, HCC202, HCC1500, UACC-893, HCC38BL, HCC1187BL, HCC1395BL, HCC2218BL) and DNA tracks (HCC38, HCC1187, HCC1395, HCC2218, HCC38BL, HCC1187BL, HCC1395BL, HCC2218BL). Legend entries: germline non-synonymous mutation, germline truncating mutation, truncating somatic mutation, predisposing non-synonymous mutation.]

Figure 4.6: Summary of variations across the BRCA1 and BRCA2 genes. Triangles represent variations in which 80% or fewer reads contain the deviation from the reference, while squares indicate variations with greater than 80% of reads indicating a variation from the reference base. Variations are colour coded: yellow, green and red represent synonymous, non-synonymous and truncating variations respectively.
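The marker shapes used in Figure 4.6 follow a simple threshold on the fraction of reads supporting a variant. A minimal sketch of that rule is given below; the function name and arguments are illustrative assumptions and are not taken from the software used in this work, and the read counts in the usage lines are invented for illustration.

```python
def marker_shape(variant_reads: int, total_reads: int, threshold: float = 0.80) -> str:
    """Return the Figure 4.6 marker for one variant position.

    'square'   -> more than `threshold` of the reads carry the variant
    'triangle' -> `threshold` or less of the reads carry the variant
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    fraction = variant_reads / total_reads
    return "square" if fraction > threshold else "triangle"


# Invented read counts: about half of the reads carry the variant in one
# sample (heterozygous-like), all of them in the other.
print(marker_shape(48, 100))   # triangle
print(marker_shape(100, 100))  # square
```

Under this rule, a heterozygous germline variant expressed from both alleles would appear as a triangle, while a variant present in essentially all reads would appear as a square.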
4.2.7  Common Phenotypic Markers of Breast Cancer

Breast cancer samples are often classified on the basis of their expression of several common receptors: the Estrogen Receptor (ER), the human epidermal growth factor receptor 2 (HER2) and the Progesterone Receptor (PR). This information is available for all the cell lines used, with three notable exceptions: the ER status for HCC1187 and HCC70, and the PR status for HCC70. In order to determine whether these markers are likely to be present and to complete the classification, we investigated the mRNA expression of the genes responsible for generating the markers. Using the aligned fragments from the RNA-Seq experiments, we can obtain a simple determinant of the RNA expression level for a given gene. In this case, we have used this feature to examine the Estrogen Receptor mRNA levels (gene ESR1). Although RNA expression would not necessarily indicate the presence of the expressed protein for the marker, the absence of RNA fragments for a gene is a strong indication of the lack of a receptor. We can determine from the RNA-Seq data that neither HCC1187 nor HCC70 expresses mRNA for the estrogen receptor gene at detectable levels, making them both, most likely, ER-. (See Table 4.6.)

Her2 ER PR TP53  HCC38 +  With Matched Normals HCC1187 HCC1395 HCC2218 + -† + + + +  HCC70 -† ? +  Without Matched Normals HCC202 HCC1500 UACC-893 + + + + + ?

Table 4.6: Status of common markers of breast cancer tumours: the Estrogen Receptor (ER), the Human Epidermal growth factor Receptor 2 (Her2), the Progesterone Receptor (PR) and the TP53 gene. Reported presence of the marker is indicated by a + symbol, negative status by a - symbol, and unknown status by a ? symbol. The ER status of two cell lines was not previously reported; however, transcripts of the estrogen receptor gene (ESR1) are absent in those cell lines, suggesting a negative type, indicated by the † symbol. (See also Figure 4.7.)
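The reasoning used to fill in the missing ER entries of Table 4.6 is asymmetric: an absence of reads argues for a negative status, while the presence of reads does not establish that the protein is present. A minimal sketch of that logic follows; the function name and the returned labels are illustrative assumptions, not part of the analysis code.

```python
def infer_marker_status(mrna_read_count: int) -> str:
    """Infer a marker status from the number of RNA-Seq reads mapped to its gene.

    No reads at all is treated as strong evidence that the receptor is absent.
    Any reads leave the protein status open, because mRNA expression does not
    guarantee that the protein is produced.
    """
    if mrna_read_count < 0:
        raise ValueError("mrna_read_count cannot be negative")
    if mrna_read_count == 0:
        return "likely negative"
    return "indeterminate from mRNA alone"


# Invented read counts for illustration only.
print(infer_marker_status(0))    # likely negative
print(infer_marker_status(250))  # indeterminate from mRNA alone
```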
In figure 4.7, each sequenced cell line shows a separate line for the depth of coverage for the ESR1 gene. The first eight lines (cancer, RNA) show coverage as green bands; however, only one strong band is present, for the HCC1500 cell line, which is known to be ER+, and bands are completely missing for six of the other cancer cell lines. A faint band is present for UACC-893, but it fails to indicate any coverage for most exons, suggesting incomplete transcription, which would also likely explain why UACC-893 is phenotypically ER-. Below the green bands, a set of four purple bands would be visible for the RNA of the four matched normals; however, no reads mapping to the ESR1 gene were observed in any of the B-cell derived cell lines. This finding is consistent with the B-cell origin of the cell lines, which do not normally produce the estrogen receptor. The yellow and blue bands below indicate the coverage of the cell lines from the exome capture experiment. The large exons on each end encompass the 3’ and 5’ UTR, and are thus not included in the probe sets used in the exome capture experiments, explaining the lack of coverage for these regions.

[Figure 4.7 image: transcript ESR1-201, showing exons with expression/coverage tracks (RNA cancer cell lines, RNA normals, DNA cancer cell lines, DNA normals) and variation markers (synonymous SNP, non-synonymous SNP, truncating variant, UTR variant). RNA tracks: HCC38, HCC1187, HCC1395, HCC2218, HCC70, HCC202, HCC1500, UACC-893, HCC38BL, HCC1187BL, HCC1395BL, HCC2218BL. DNA tracks: HCC38, HCC1187, HCC1395, HCC2218, HCC38BL, HCC1187BL, HCC1395BL, HCC2218BL.]

Figure 4.7: Summary of sequencing of the ESR1 gene. Darker colours indicate greater depth, while the width of the line indicates the fraction of the bases covered by reads. The dark green band shows a strong signal for the production of ESR1 mRNA in HCC1500, the only cell line in the set which is confirmed to be positive for the estrogen receptor. All other cell lines appear to be ER-, including the B-cell derived matched normals. (See also Table 4.6.)
Triangles represent variations in which 80% or fewer reads contain the deviation from the reference, while squares indicate variations with greater than 80% of reads indicating a variation from the reference base. Variations are colour coded: yellow, green and red represent synonymous, non-synonymous and truncating variations respectively.

4.2.8  Single Nucleotide Variations

The Variation Database (VDB) can be used to perform multiple levels of filtering and annotation. Although the database performs the filtering in a single step, the same information can be broken down into separate components. (See Table 4.7.) In the first instance, it can be used to remove variants also found in the matched normal (if available) for a given cell line. For the ductal carcinoma cell lines with B-cell derived matched normals, this removes 65-85% of candidate variants. Filtering against dbSNP removes a further 6-20% of variants, while filtering against a large collection of normal genome SNP calls, including those found in the 1000 genome project, brings the total number of candidate driver variants down to between 3.7% and 5.9% of the total number of putative non-synonymous variations. For the ductal carcinoma cell lines where a matched normal is not available, a similar process was able to reduce the candidate non-synonymous variants to between 5.8% and 10.4% of the total number.

                   HCC38  HCC70  HCC202  HCC1187  HCC1395  HCC1500  HCC2218  UACC-893
Non-Synonymous      8989   5089    4702     8766    11145     5287     9186      4673
- matched normal    1735      -       -     1602     3834        -     1416         -
- dbsnp             1181   1206     971      938     1550     1103      750       841
- all normal         534    530     404      322      559      404      385       271
genes                463    491     353      297      335      336      361       245

Table 4.7: Non-synonymous mutations before and after filtering for each cell line. Non-synonymous SNVs are collected from both transcriptome and exome when available. Cell lines without matched normals show a single “-” to indicate this filtering was not performed.
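The percentages quoted in this section can be recomputed directly from the counts in Table 4.7. The sketch below does so; the dictionary layout and function names are illustrative assumptions, and `None` stands in for the “-” entries where no matched normal was available.

```python
# Counts from Table 4.7:
# (non-synonymous, after matched normal, after dbSNP, after all normals)
TABLE_4_7 = {
    "HCC38":    (8989, 1735, 1181, 534),
    "HCC70":    (5089, None, 1206, 530),
    "HCC202":   (4702, None, 971, 404),
    "HCC1187":  (8766, 1602, 938, 322),
    "HCC1395":  (11145, 3834, 1550, 559),
    "HCC1500":  (5287, None, 1103, 404),
    "HCC2218":  (9186, 1416, 750, 385),
    "UACC-893": (4673, None, 841, 271),
}


def percent_removed_by_normal(cell_line: str):
    """Percentage of variants removed by the matched-normal filter, or None."""
    total, after_normal, _, _ = TABLE_4_7[cell_line]
    if after_normal is None:
        return None
    return 100.0 * (total - after_normal) / total


def percent_remaining(cell_line: str) -> float:
    """Percentage of the starting variants left after all filters."""
    total, _, _, final = TABLE_4_7[cell_line]
    return 100.0 * final / total


for name in TABLE_4_7:
    removed = percent_removed_by_normal(name)
    removed_text = "n/a" if removed is None else f"{removed:.1f}%"
    print(f"{name:9s} removed by normal: {removed_text:>6s}  "
          f"remaining after all filters: {percent_remaining(name):.1f}%")
```

For the four cell lines with matched normals, the remaining fraction falls between roughly 3.7% and 5.9%, and for the four without, between roughly 5.8% and 10.4%, in agreement with the ranges stated above.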
Across all eight ductal carcinoma cell lines, a total of 3409 non-synonymous variations were observed, affecting a total of 2570 candidate genes.

4.2.9  Variant Frequency

Only a small percentage of variations in each cell line are unique, with the vast majority of differences from the reference genome being either known polymorphisms or common to other samples. Figure 4.8 illustrates this point, with HCC1500 showing the largest number of unique variations. Not surprisingly, the RNA shows a greater number of rare variations, which is often a byproduct of poorly aligned junctions or alternate splicing events that cannot be re-aligned without errors. Of interest, HCC38BL, the B-cell derived matched normal, shows a much greater number of rare variations than the HCC38 tumour-derived cell line. It is somewhat unexpected that HCC38BL has a 3.6-fold greater number of variations than the supposedly more highly mutated cancer-derived HCC38 cell line. However, the HCC38BL cell line has 80 million reads, collected over two lanes, while HCC38 has only 67 million reads over three lanes. It is likely that the higher number of variants is simply a product of a higher base calling error rate caused by the higher read density on the sequencing platform. The same relationship is not observed in the exome capture data for HCC38 and HCC38BL, supporting this conclusion. It is important to recognize that RNA samples identify more variations because alignment and mapping are much more difficult processes, requiring a mapping step that attempts to identify all possible exon boundaries. Because our knowledge of exon boundaries is incomplete and splicing of RNA is a malleable process, RNA-based samples will always identify more variants than DNA-based samples because of the post-processing steps imposed upon the data collected.
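The three frequency classes plotted in Figure 4.8 (unique, rare and frequent) can be expressed as a rule on the number of other libraries in which a variant has been observed. The following is a minimal sketch of that rule, using the cutoff of 10 libraries given in the figure caption; the function name and the example counts are illustrative assumptions.

```python
from collections import Counter


def frequency_class(other_libraries: int, rare_cutoff: int = 10) -> str:
    """Classify a variant by how many other libraries also contain it.

    unique   -> seen in no other library
    rare     -> seen in fewer than `rare_cutoff` other libraries
    frequent -> seen in `rare_cutoff` or more other libraries
    """
    if other_libraries < 0:
        raise ValueError("other_libraries cannot be negative")
    if other_libraries == 0:
        return "unique"
    if other_libraries < rare_cutoff:
        return "rare"
    return "frequent"


# Invented observation counts for six variants in one sample.
observations = [0, 3, 9, 10, 42, 0]
print(Counter(frequency_class(n) for n in observations))
# Counter({'unique': 2, 'rare': 2, 'frequent': 2})
```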
Figure 4.8: A graph representing three subsections of variants in each sample: those that are unique to the sample (not found in any other data set entered), those that are rare (found in fewer than 10 other libraries) and those that are found frequently in other data sets (observed in 10 or more other data sets).

4.2.10  Recurrent Overlaps Between Cell Lines

In the four cell lines with matched normals, a total of 1485 genes were observed to contain non-synonymous mutations, of which 201 genes contain high quality non-synonymous somatic variants, that is, variations observed in both the cancer RNA and exome capture reads but not found in the matched normals. On this set of genes, an inter-se analysis was performed to identify genes that were recurrent among the four ductal carcinomas with matched normals. See Figure 4.9.

Figure 4.9: Venn diagram of overlap of genes with high confidence non-synonymous variations (non-synonymous variations found in both the DNA and RNA of the cancer sample, but not present in the matched normal), illustrating the number of genes with variations found uniquely in each sample as well as those found in two different samples.

The four cell lines used to study the high confidence non-synonymous somatic variations all share a lack of expression of the estrogen receptor and are TP53 positive. Of interest, however, is that HCC38, HCC1187 and HCC1395 are all so-called “triple negative” cell lines, lacking the estrogen receptor, the progesterone receptor (PR) and the human epidermal growth factor receptor 2 (Her2). In contrast, HCC2218 is Her2 and PR positive, making it clearly a distinct subtype. Unexpectedly, it is also the one cell line that shares the most genes with non-synonymous variations in common with the other cell lines, as six of the eight genes containing disruptive variations are shared with HCC2218, despite it having fewer total non-synonymous somatic variations than HCC1395.
Although no single gene with high quality variants was observed in three or more cell lines, there were eight genes with non-synonymous variations that were not unique to a single cell line. The eight genes identified span 27 individual variations, of which ten are described in the COSMIC database. Thus, 37% of the variations identified by this screen are known somatic cancer mutations. See table 4.8.

Protein: TRPM7 MACF1 SBNO1 CSNK1D HUWE1 TUBGCP6 PLCG1 ADNP2
Cell Line #1 (Variation): HCC38 (T720S) HCC38 (L306R, Q1901L, E5006/4448/4374/4504Q) HCC38 (S996C) HCC38 (S101N) HCC1187 (R481K) HCC1395 (H220R) HCC1395 (S582C) HCC1395 (P380L, R939I)
Cell Line #2 (Variation): HCC1395 (T720S & K153T) HCC1395 (E3544/3470/3600/4102Q, E5006/4448/4374/4504Q †) HCC2218 (E888K) HCC2218 (S97C) HCC2218 (R1083S) HCC2218 (W1104*) HCC2218 (A995P) HCC2218 (V355I)
Other Cell Lines (Variation): HCC2218 (E49Q ‡§), HCC202 (R4949/5023/581/60/5079C ‡) HCC1395 (S996C †) HCC70 (S17R ‡)

Table 4.8: List of genes with high confidence mutations (found in both RNA and DNA of the cancer, but not in RNA or DNA of matched normals) in more than one cell line, as well as variations found in the other four cell lines. Identical mutations that give rise to different coordinates in different transcripts are annotated by “coord1/coord2”. Variations found in DNA only are marked with a †, variations found in RNA only are marked with a ‡. The § symbol is used to indicate that a different transcript is affected by this mutation. Variations in boldface font were not found in COSMIC.

Many of the genes identified as containing recurrent variations have previously been associated with breast cancer in the literature:
TRPM7: May be involved in breast cancer cell proliferation: Guilbert et al. [178]
MACF1: May be a predictor of recurrence in gastric cancer: Motoori et al.
[179]
SBNO1: Potential oncogene, likely involved in expression of EGF: (see chapter 5)
CSNK1D: May be involved in breast cancer resistance to aromatase inhibitors: Aguilar et al. [180]
HUWE1: A ubiquitin ligase that plays a role in brain development and is an interacting partner of Myc; HUWE1 has been suggested to be involved in several different types of cancer: Choi et al. [181]
TUBGCP6: Identified as up-regulated in breast cancers: Kibriya et al. [182]
PLCG1: Plays a role in EGF-Expression in breast cancers, controlled by MIR-200: [183]
ADNP2: May be involved in brain development and schizophrenia: [184]

4.2.11  Verification

The majority of high confidence non-synonymous variants were confirmed by comparison with COSMIC’s Cancer Cell Line Project, where variants were identified using the SNP6.0 array (see figure 4.10).

Figure 4.10: Verification of high confidence variations. All high confidence variations were confirmed in the cell lines by comparison with the COSMIC reference when available, or by the use of PCR where not available. The chart above shows the number of variations confirmed using COSMIC.

Eight genes were found to contain high confidence non-synonymous variations in more than one cell line, for a total of 17 SNVs. 12 variants in 8 genes had previously been observed and were annotated in COSMIC - however, a further 5, scattered across 4 genes, have not previously been reported. Thus, with the addition of these variations, 4 previously overlooked genes can be considered reasonable targets for further inspection. In addition to the 12 previously reported variations, all 5 of the novel variations were observed to be present as heterozygotes by Sanger sequencing. (Data not shown).
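The recurrence counts behind the eight genes discussed above can be reproduced with a simple tally over per-cell-line gene sets. The sketch below uses only the gene and cell line pairings listed in the first two cell line columns of Table 4.8; the variable names are illustrative assumptions, and the full per-cell-line gene lists (which also contain non-recurrent genes) are not reproduced here.

```python
from collections import Counter

# Genes with high confidence non-synonymous somatic variants, restricted to
# the recurrent genes of Table 4.8 (columns "Cell Line #1" and "Cell Line #2").
high_confidence_genes = {
    "HCC38":   {"TRPM7", "MACF1", "SBNO1", "CSNK1D"},
    "HCC1187": {"HUWE1"},
    "HCC1395": {"TRPM7", "MACF1", "TUBGCP6", "PLCG1", "ADNP2"},
    "HCC2218": {"SBNO1", "CSNK1D", "HUWE1", "TUBGCP6", "PLCG1", "ADNP2"},
}

# Count in how many cell lines each gene appears.
gene_counts = Counter(
    gene for genes in high_confidence_genes.values() for gene in genes
)

# Genes observed in two or more cell lines are treated as recurrent.
recurrent = sorted(gene for gene, n in gene_counts.items() if n >= 2)
shared_with_hcc2218 = sorted(set(recurrent) & high_confidence_genes["HCC2218"])

print(len(recurrent), "recurrent genes:", recurrent)
print(len(shared_with_hcc2218), "shared with HCC2218:", shared_with_hcc2218)
print("largest number of cell lines sharing one gene:", max(gene_counts.values()))
```

The tally gives eight recurrent genes, six of which involve HCC2218, and no gene present in three or more cell lines, matching the description in section 4.2.10.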
4.2.12  Assembly

In order to determine whether any novel gene fusions or other disruptive events could be identified from the already collected transcriptomes, full de novo assembly was undertaken on the four cell lines that had matched normals, from which alternate splicing and fusion events could be identified.

Fusion Genes

A small number of correct orientation gene fusion events (7-16) were observed in the cancer cell lines using the Trans-ABySS software. Fusion events found in the cancer cell lines, but not in the respective matched normals, were investigated for recurring events. Using the matched normals as a control removes events that may be associated with the immortalization process imposed on the matched normal cell lines, and is thus more likely to find cancer associated mutations. Among the 45 proper orientation gene fusion events observed that were not also detected in the matched normals, only one (chromosome 15 open reading frame 57 (C15orf57)-chromobox homolog 1 (CBX)) was detected in more than one cell line (see table D.1). This event was detected by 14/9 and 9/2 spanning pairs and spanning reads respectively, across the HS1623 and HS1804 cell lines respectively, and results in a frame-shifted product in one case and an intron to UTR fusion product in the second. A smaller set of fusion events was detected (2-11 per cell line, 24 total) in which the genes are placed in opposite orientation (see table D.2). These events may disable the production of functional proteins and thus could be expected to function in much the same fashion as a frame shift or truncating mutation. No recurring events were observed. Despite the number of observed events, this likely represents only a small proportion of the structural rearrangements present in the cell lines. Only rearrangements that fuse together two identifiable proteins and result in an expressed mRNA would have been detected.
This method produced a long list of possible fusions for both the matched normals and the cancer cell lines (data not shown), of which the majority appeared to be present with identical break points in both the cancer and the normal. Verifications were not attempted for gene fusions, as only one cancer specific fusion event was detected, as indicated above, and the supporting coverage for the single detected event was poor. None of the fusion events detected here match those described by Stephens et al. [170], as reported in the COSMIC database.

Alternate Splicing & INDELs from Assembly

A significant number of alternate splicing events and Insertions and Deletions (INDELs) were detected in the assembly data for both the cancer and normal derived cell lines; however, those events observed in the normals are less likely to be cancer driver mutations and thus can be filtered from those observed in the cancer derived cell lines. Of interest, 25 genes are observed to have alternate splicing events predicted in three or more cell lines (tables D.3, D.4 and D.5). While these results have not been validated independently using another sequencing technique, they can instead be compared with results generated by aligning reads. In many cases, the two data sets provide widely disparate answers. For instance, the assembly predicts a skipped exon in chromodomain helicase DNA binding protein 3 (CHD3), exon 10. When compared to the aligned data, exon 10 in cell line HCC1187 is found to have each base covered, with an average of 82-fold coverage, while the HCC1395 cell line has each base covered with an average of 204-fold coverage, providing no supporting evidence for a skipped exon. For each skipped exon investigated, the same trend was observed. Similarly, when validated, INDELs also performed poorly. A subset of INDELs was investigated by Sanger sequencing (see table E.2) and only 15% of the INDELs were confirmed.
Thus, while the results of alternate splicing and Insertion and Deletion (INDEL) calls from assembly are included in the appendix, they are unlikely to accurately reflect actual biological events that would be expected to have a major impact on the behaviour of the cell lines investigated. Several interpretations are possible. Potentially, the software used to make the alternate splicing and INDEL calls is performing poorly, yielding a high false positive rate, or perhaps the events detected are occurring at low frequency and are either undetectable or make up too small a fraction of the events to be observed when checked with other methods. It is also possible that some other event (e.g. gene duplication and partial deletion) is occurring, generating reads that interfere with the assembly and making it more difficult to interpret the results.

Chapter 5

EGF and Notch Involvement in Ductal Carcinoma

“Several lines of evidence suggest that Strawberry Notch participates together with Notch in many common pathways. A number of Strawberry Notch mutant phenotypes are similar to those of Notch mutants and can be rescued by an extra copy of wild-type Notch.”
- Coyle-Thompson and Banerjee, 1993, discussing the role of Strawberry Notch in Drosophila.

During the analysis discussed in chapter 4, eight genes with high confidence recurrent mutations were observed. One of these genes, strawberry notch homologue 1 (SBNO1), was selected for more in-depth analysis due to its known association with the oncogenic Notch genes, the presence of a recurrent somatic variation in two cell lines, and the presence of a variation in the RNA of a third cell line. The three cell lines, in fact, shared only two different mutations: an S996C variant and an E888K variant, neither of which had previously been observed in any sample sequenced at Canada’s Michael Smith Genome Sciences Centre (BCGSC).
The recurrence of one variant as well as the novelty of both variants suggested that SBNO1 may play a previously unsuspected role in cancer or in the immortalization process of the cell lines. This chapter describes the methods and results of the investigation into the role of the SBNO1 gene in the context of the ductal carcinoma cell lines, guided by differential expression of the messenger Ribonucleic Acid (mRNA) detected in the Ribonucleic Acid (RNA) Isolation and Sequencing (RNA-Seq) data and by literature reported interactions between genes.

5.1  Methods

5.1.1  Cell Lines

Refer to chapter 4 for methods used in the growth and harvesting of material from the ductal carcinoma cell lines and the B-cell derived cell lines.

5.1.2  Verification

Bioinformatic verification was performed by comparing candidate non-synonymous variants to the Catalogue of Somatic Mutations in Cancer (COSMIC) database, which contains lists of variants previously found in cell lines, particularly those from American Type Culture Collection (ATCC) origins, for which variants have been determined through use of the SNP6.0 array.[117] Any variant observed in both the COSMIC database and our own results, with the same coordinates and non-synonymous change confirmed for the same cell line, was considered independently validated. The best candidate non-synonymous variants observed in two different cell lines that were not confirmed this way were validated using Sanger sequencing. Primers for the studies were designed using the “Primer 3” Web tool.[171] Forward primers had the –21M13F sequence (TGTAAAACGACGGCCAGT) added to them and the reverse primers had the M13R sequence (CAGGAAACAGCTATGAC) added, for sequencing of the Polymerase Chain Reaction (PCR) products. All primers were ordered at 50nmol scale from Invitrogen Life Technologies.
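The primer tailing described above amounts to prefixing each target-specific primer with the corresponding universal sequencing tail. A minimal sketch is shown below, assuming that the tails are added at the 5’ end of each primer (the usual convention for M13-tailed primers, not stated explicitly in the text); the function name and the target-specific sequences are invented for illustration and are not primers used in this study.

```python
# Universal tails quoted in the text.
M13F_TAIL = "TGTAAAACGACGGCCAGT"   # -21M13F
M13R_TAIL = "CAGGAAACAGCTATGAC"    # M13R


def add_m13_tails(forward: str, reverse: str) -> tuple:
    """Prefix target-specific primers with the M13 sequencing tails.

    Assumes 5'-tailing: tail first, then the target-specific sequence.
    """
    for primer in (forward, reverse):
        if not primer or set(primer.upper()) - set("ACGT"):
            raise ValueError(f"not a plain DNA sequence: {primer!r}")
    return M13F_TAIL + forward.upper(), M13R_TAIL + reverse.upper()


# Hypothetical target-specific primers (placeholders only).
tailed_forward, tailed_reverse = add_m13_tails(
    "ACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGG"
)
print(tailed_forward)
print(tailed_reverse)
```

Tailing every amplicon in this way allows all PCR products to be cycle sequenced in both directions with the same pair of universal primers.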
PCR reactions were carried out in a volume of 20µl containing 10ng genomic Deoxyribonucleic Acid (DNA), 1mM MgSO4, each primer at 0.5µM, 2mM dNTPs, the supplied 1x Pfx Amplification Buffer and 0.25 units Platinum Pfx DNA polymerase (Invitrogen). A programmable thermal cycler (MJResearch DNA Engine 2 Tetrad) was used for the PCR reactions, which had a total of 30 cycles (30 seconds at 94 °C, 30 seconds at 50-65 °C and 1 minute at 68 °C). A 3µl aliquot of each PCR reaction was run on a 2% agarose gel (BioWhittaker Molecular Applications) to confirm amplification and quality of the product. The remaining 17µl of PCR product was purified with Ampure magnetic beads (Agencourt Bioscience Corporation) and eluted in a volume of 30µl of TE (Tris-EDTA pH 8.0). Purified DNA samples were then cycle sequenced using Big Dye Terminator Mix V.3.1 (Applied Biosystems) in both forward (-21M13F primer) and reverse (M13R primer) directions. Completed cycle sequencing reactions were then loaded on ABI-3730XL capillary sequencers to collect sequence data.

5.1.3  Expression Data

Information on the expression of mRNA was obtained using custom scripts from the Vancouver Short Read Analysis Package (VSRAP). Repositioned mRNA reads, aligned as described in chapter 4, were used to determine the number of times fragments of each exon were observed independently, and the sum of the exons for each transcript was used to describe the count for the whole gene isoform. All exon and transcript boundaries were taken from Ensembl 54, corresponding to the final Ensembl annotation set for the hg18 human reference.

5.1.4  Panel of Mammary Ductal Carcinomas

Recurrence of the SBNO1 mutation was tested using a panel of 190 Breast Cancer cell lines, made up of 170 invasive ductal carcinomas, 9 invasive lobular cancers and 9 other breast cancer types (two tubular, two Phyllodes, three Mucinous and two Ductal Carcinoma in Situ (DCIS)).
Each cancer sample on the panel also included a paired normal sample, allowing somatic variations to be identified. Using custom probes for the SBNO1 gene, each sample and normal on the panel could be searched for variations. To perform this search, all of the sequences obtained were aligned using the Burrows-Wheeler Aligner (BWA) tool, and Single Nucleotide Variations (SNVs) were called against the hg18 human reference genome, using the standard BCGSC Single Nucleotide Polymorphism (SNP)-calling pipeline, composed of BWA 0.5.7 and SNVMix2 0.12α.[13, 17] SNVs were loaded into the BCGSC’s Variation Database (VDB) for further analysis. Pileup files were generated using samtools 0.1.13 for the SBNO1 region of the genome (Chromosome 12:122346408-122400941).[89]

The multiplexed sequencing of the SBNO1 coding region from patient samples was carried out following a targeted hybridization enrichment of SBNO1 genomic DNA fragments from a pool of prepared Illumina libraries. The SBNO1 genomic fragments were enriched using a biotinylated RNA probe constructed by the following method. A 1µg pool of cDNA clones for specific target enrichment, including an SBNO1 cDNA corresponding to Refseq ID = NM_018183.2 (Open Biosystems Products, Huntsville, AL), was sheared using a Covaris S2 focused ultra-sonicator (Covaris Inc., Woburn, Mass.) with the following settings: 10% Duty cycle, 5% Intensity, and 200 Cycles per burst for 360 sec. The resulting products were size selected on an 8% Novex TBE gel (Invitrogen Canada Inc.) and the 75 to 125 bp fraction was excised and eluted into 300 µl of elution buffer containing 5:1 (vol/vol) LoTe (3mM Tris-HCl, pH7.5, 0.2mM EDTA)/7.5 M ammonium acetate. The eluates were purified from gel slurries by centrifugation through Spin-X centrifuge tube filters (Fisher Scientific Ltd., Nepean, Ont.), and EtOH precipitated.
The purified DNA fraction was quantified using an Agilent DNA 1000 Series II assay (Agilent Technologies Canada Inc., Mississauga, Ont.). 100ng of the resulting DNA was end-repaired, polyadenylated, and ligated to custom adaptors (Supplemental Table 2) containing T7 and T3 promoter sequences as described.[65] Adapted products were enriched by PCR in reactions containing 0.5 Units Phusion DNA Polymerase (New England Biolabs, Pickering, Ont.), 0.25 mM dNTPs, 3% DMSO, 0.4 µM of T3 and T7 sense strand-specific primers and 5 pmol template using an MJR Peltier Thermocycler (model PTC-225) with the following cycling conditions: 1 min. at 98 °C; 8X (10 sec. at 98 °C, 30 sec. at 60 °C, 30 sec. at 72 °C); 5 min. at 72 °C. The amplified products were separated from excess adaptor on an 8% Novex TBE gel (Invitrogen Canada Inc.), purified, and quantified using the Qubit Quant-iT™ assay and Qubit Fluorometer (Invitrogen Canada Inc.). An in vitro transcription reaction was carried out using 100 ng of purified adapter-ligated DNA as per the manufacturer’s specifications (Ampliscribe™ T7-Flash™ Biotin-RNA Transcription Kit; Intersciences Inc., Markham, Ont.). The reaction mixture was incubated at 37 °C for 60 minutes, DNase-I treated for 15 minutes at 37 °C, and then incubated at 70 °C for 5 minutes to inactivate DNase-I. Transcription products were precipitated with 1 volume of 5M NH4Ac, and size fractionated on a 10% Novex TBE-Urea gel (Invitrogen Canada Inc.). The 100 to 150 bp fraction was isolated from the gel, eluted into 0.3M NaCl, and EtOH-precipitated after extraction of the eluate from the gel slurry by centrifugation through a Spin-X centrifuge tube filter (Fisher Scientific Ltd.). The biotinylated RNA was re-suspended in 20 µl nuclease-free water and quantified by Agilent RNA Nano assay (Agilent Technologies Canada Inc.).
Indexed libraries of patient genomic DNA were pooled from 96 well plates in groups of 92 libraries per pool.[185] A 275-425 bp size fraction from each pool was size-selected by gel purification from an 8% Novex TBE gel as above (Invitrogen Canada Inc.). The protocol described by Blumenstiel et al. [186] was followed for the hybridization reaction and subsequent washes. The incubation of the library fragments with the RNA probe pool was carried out for 24 hours at 65 °C, followed by binding to M-280 Streptavidin Dynabeads (Invitrogen Canada Inc.), washes, and elution of the captured library fragments. The eluted fragments were amplified by PCR using primers that anneal upstream of the adapter index sites (5’ IdxCap AATGATACGGCGACCACCG, 3’ IdxCap CAAGCAGAAGACGGCATACGAG) and subjected to cluster generation and sequencing on Illumina HiSeq2000 machines.

Adaptor Name  Direction          Sequence
T3 Adaptor    Sense Strand       TGCGAATTAACCCTCACTAAAGGGAGA*T
T3 Adaptor    Anti-sense Strand  5’ [PHOS]-TCTCCCTTTAGTGAGGGTTAATTCGCA-3’
T7 Adaptor    Sense Strand       TGGTGCTAATACGACTCACTATAGGGAGA*T
T7 Adaptor    Anti-sense Strand  5’ [PHOS]-TCTCCCTATAGTGAGTCGTATTAGCACCA-3’

Table 5.1: Adaptor oligonucleotide sequences used for the purification of SBNO1 fragments for the construction of the custom capture experiment. The asterisk indicates the presence of a phosphorothioate bond, used to decrease degradation by both endo- and exonucleases.

5.2  Results

5.2.1  Cell Line Characterization

The cell lines for this study come from eight separate tumours, along with four B-cell derived matched normals. The cell lines themselves are relatively diverse in terms of the four main markers commonly used for breast cancer classification: human epidermal growth factor receptor 2 (HER2), the Estrogen Receptor (ER), the Progesterone Receptor (PR) and the key cancer protein tumor protein p53 (TP53). (See Table 4.6.)
For some cell lines, the status of all four indicators is not available in the literature, and thus the status is indeterminate. In some cases, it is possible to identify the status of the indicator by testing for mRNA expression of the gene that gives rise to the indicator protein. (See section 4.2.7.) Although this is an indirect test of the status, a lack of expression of the mRNA would be a strong indicator of the absence of the final product of the mRNA. For cell lines HCC1187 and HCC70, the ER protein expression status was not previously recorded. In each case, we were able to conclusively identify that the estrogen receptor 1 (ESR1) gene was not being transcribed, and thus the ER protein is unlikely to be present. As a control, the only cell line in which ER was detected was HCC1500, a known ER+ cell line. The HCC70 cell line’s PR status and the UACC-893 cell line’s TP53 status were also previously unrecorded, but no conclusions could be reached from mRNA levels about the final absence or presence of the respective indicator proteins.

Of the cell lines with matched normal samples, three lack HER2, ER and PR indicators, placing them in a category known as “Triple Negative” breast cancers. The remaining cell lines each display a different combination of protein markers, although if HCC70’s indeterminate PR status were found to be negative, it would also appear as a triple negative breast cancer, and if UACC-893’s TP53 status were found to be negative, it would display the same histological markers as HCC202.

5.2.2  Inter-se Analysis

The number of non-synonymous single nucleotide variations found in each cell line was between 4,702 and 11,145, with those cell lines with matched normals exhibiting a greater number of variations than those without. This does not appear to be an artifact of the depth of reads (data not shown), and no other reason for this distribution is apparent.
(See Table 4.7.) After filtering with the matched normals (where available), dbSNP and an in-house database of previously sampled human variation (see chapter 3), the trend described above disappears and the cell lines all show a set of 275 to 560 non-synonymous variations scattered across 245 to 491 genes. Using only those cancer cell lines with a matched normal, a list of candidate genes of greatest interest was obtained by identifying only those mutations found in both the RNA-Seq and exome capture data, but not in the corresponding normal. This provided a short list of 29 to 86 genes of interest per cell line. By intersecting these sets, only 8 genes were observed to have mutations in two cell lines that were observed in both the DNA and the RNA sampled. (See Figure 4.9.)

A Note on Normalization

In the following sections, tag values are given in raw form, calculated as the number of tags that have been mapped overlapping exons of the gene of interest using code available in the VSRAP. Normalized values are shown for table 5.4. It should be clear from this table that normalization does not make a qualitative difference to the values observed, and the raw counts provide a clear indication of the number of times a sequence mapping to the gene is observed. In most cases described, several orders of magnitude difference are present between the cancer-derived samples and the B-cell derived controls. Thus, for clarity, normalization has not been performed on the raw counts in each table.

Strawberry Notch

The SBNO1 gene was observed among the list of eight genes found with high quality non-synonymous variations in two of the four cell lines and was selected for further investigation. mRNA for SBNO1 is observed in all eight cell lines and in all four B-cell derived matched normals, with all exons exhibiting complete coverage. (See Table 5.2.) The expression of the gene varies across the cell lines, from 3,211 reads recorded up to 12,248.
However, these differences in expression are mirrored in the associated matched normal cell lines, where data is available. An interesting observation, however, is that the reads observed in the exome captures are relatively consistent, between 3,553 and 4,401 across those cell lines with exome capture data, suggesting that copy number variation is not likely a factor in the different expression levels of SBNO1 mRNA.

Cell Line   Transcriptome   Exome     Cell Line    Transcriptome   Exome
HCC38       12248           3553      HCC38BL      12464           4185
HCC70       3211            -
HCC202      6261            -
HCC1187     4992            3553      HCC1187BL    6616            3817
HCC1395     4216            3358      HCC1395BL    7048            3466
HCC1500     6789            -
HCC2218     8874            3026      HCC2218BL    8988            4401
UACC-893    15395           -

Table 5.2: Expression of SBNO1 gene, and depth of sequencing using exome capture, expressed in reads aligned to the gene. There do not appear to be significant changes in depth to account for the differential expression of the gene.

Exon Usage

Although the coverage of each annotated exon in SBNO1 is complete, there exists some variation in the expression levels. Over the eight cell lines and four matched normals, the exon usage is remarkably constant with the exception of a single cell line, HCC202. In this cell line, exons 1-3, 12-13 and 18 are disproportionately expressed, suggesting that there may be some alternate splicing occurring that is unique to this cell line. (See Figure 5.1.)

Figure 5.1: Exon coverage (number of tags per exon) normalized by the total number of tags observed in the SBNO1 exons and normalized by the coverage of exon 27 (lowest standard deviation of all exons before normalization). It is evident that some cell lines notably use some exons more than others, suggesting alternate splicing may play a part in SBNO1 expression.

Paralogue - Strawberry Notch Homologue 2

The strawberry notch homologue 1 (SBNO1) gene has a paralogous gene named strawberry notch homologue 2 (SBNO2).
Despite their homology, the two share very little at the protein level, with only 38.3% similarity (BLOSUM62 matrix; http://imed.med.ucm.es/Tools/sias.html, last accessed April 13th, 2012). Like SBNO1, the SBNO2 gene appears to encode a transcription factor, but it is believed to contribute to the downstream anti-inflammatory effects of interleukin 10 (IL10).[187] With distant sequences and clearly divergent functions, the SBNO2 gene is unlikely to be a source of mis-aligned reads or to interfere with the SBNO1 protein’s function. mRNA from the SBNO2 gene is present in all eight cancer-derived and all four B-cell derived cell lines, with up to four-fold higher expression in the cancer cell lines. (Data not shown.) DNA from SBNO2 was not captured by the exome capture experiment.

5.2.3  Pathway

Despite its name, SBNO1 is not a part of the Notch class of proteins. It does not contain the epidermal growth factor (EGF), NOTCH Protein Domain (NOD) or Ankyrin domains typical of the other Notch genes, but rather is found to contain helicase domains instead. Its homology to the other Notch genes is practically non-existent, suggesting it has other functions and origins. Indeed, it is only related by name, a consequence of the phenotypic characteristics caused by the loss of gene function in model organisms. This, however, does suggest a likely involvement in the same pathways. A pathway can be assembled from literature validated associations and experiments, sampling those genes with observable traits differing between the cancer and normal cell lines, comprised of either non-synonymous variations, Insertions and Deletions (INDELs), fusions or marked differences in protein expression (see figure 5.2), indicating the context in which SBNO1 is expressed. This pathway is enriched for known oncogenes, suggesting through guilt by association that SBNO1 itself may be an oncogene as well.
The illustrated pathway, however, does not contain all known protein interactions, and is instead directed by known association and binding partners described in peer-reviewed literature. The pathway is remarkable for several reasons: it ties together several important cancer pathways through a small number of genes; it provides mechanisms by which the pathways are influenced by external signals (Notch proteins and jagged2 (JAG2) signal through recombination signal binding protein for immunoglobulin kappa J region (RBPJ) and pre-B-cell leukemia homeobox 1 (PBX1)); it includes avenues for metastasis and cell cycle control; and it terminates with EGF, a well studied growth factor for which breast cancer cells are known to over-express the receptor. This effectively describes a closed loop that provides a possible explanation for how cells utilize autocrine signalling to enhance their own growth.

Figure 5.2: Proposed pathways of interacting proteins around SBNO1. (Legend: SNV and indel; unverified fusion; indel; SNV; SNV observed in other normal samples. Pathway groups: breast cancer specific pathways; cell type dependent pathways; metastasis genes; RAS family; MAP kinases; methylation and cell cycle control; B-cell specific genes (IRF4, SPI1); other pathways.)

Strawberry Notch Pathway

It has been suggested that lethal gene 765 (LET-765) protein, the C.
elegans homologue to the strawberry notch homologue 1 protein, promotes the expression of many diverse targets by interacting with transcriptional activator or repressor complexes.[188] Some of the downstream targets of LET-765 have been well studied and can provide an indication of the putative function in human pathways. Recent work has suggested that LET-765 is found in the nucleus and positively regulates the Rat sarcoma (RAS) pathway via epidermal growth factor receptor (EGFR).[189] Furthermore, it is required for producing wild-type levels of lin-3 mRNA, the C. elegans homologue to the human EGF. Thus, we have focused on pathways related to EGF expression, in which SBNO1 likely participates. Also of interest, let-765 (SBNO1) mRNA is detectable at all stages of development in C. elegans and is enriched in the germ line.[189]

Pathways Upstream of Strawberry Notch

The earliest Drosophila research on the Strawberry Notch gene implied that it was involved in the Notch pathway, interacting in an unknown manner with the canonical Notch genes.[190] To tie together this family of genes, one can draw on several relationships:

• The relationship between the Notch, Mastermind proteins and v-myc myelocytomatosis viral oncogene homolog (MYC) [144, 191]
• The relationship between MYC and chromosome 14 open reading frame 169 (C14orf169)/Myc-associated protein with JmjC domain (MAPJD)[192]
• The relationship between SBNO1 and C14orf169/MAPJD[192]
• Strawberry Notch’s known connection to EGF[188]

Further, although relatively poorly studied, the C14orf169/MAPJD protein is also known to regulate several other important genes including RIO kinase 1 (RIOK1), transforming growth factor, beta receptor associated protein 1 (TGFBRAP1) and RasGEF domain family, member 1A (RASGEF1A).[192] The RIOK1 protein binds to protein arginine methyltransferase 5 (PRMT5), an upstream methylator of TP53, required for cell-cycle progression and tumour suppressor function.[193, 194] TGFBRAP1
plays a role in signalling through the transforming growth factor beta (TGFB) family of proteins, controlling proliferation and differentiation. RASGEF1A is a guanine nucleotide exchange factor operating on RAP2A, member of RAS oncogene family (RAP2), a member of the RAS super-family.[195] MAPJD itself is a member of the MYC transcriptional complex and is commonly activated in lung cancers.[192]

Other Pathways Controlling EGF

Several other proteins are known to control or regulate the expression of EGF. Growth factor independent 1 transcription repressor (GFI1), the human homologue of the Drosophila senseless gene, is known to produce a protein that directly represses EGF expression, while rhomboid 5 homolog 1 (RHBDF1), the human homologue of the Drosophila Rhomboid gene, produces a protein that promotes EGF expression.[196] However, EGF expression is governed by several complex relationships; the GFI1 repressor protein competes directly with a complex that involves homeobox (HOX) genes for binding with the enhancer sequence upstream of the EGF gene, while human atonal homolog 8 (ATOH8) promotes GFI1 binding, which is in turn suppressed by BARX homeobox 2 (BARX2).[197, 198] This complex set of relationships and similar pathways are likely also used to regulate the expression of other genes in addition to EGF.

Each of the proteins above is also interesting in its own right. BARX2, for instance, has been shown to influence processes controlling cell adhesion and is suspected of participating in cancer progression (Refseq). The complex that competes with and thus regulates GFI1 in Drosophila is made up of proteins from the Exd, Hth and Abd-A genes, and has human homologues in PBX1, Meis homeobox 1 (MEIS1) and an unspecified human HOX gene.[197] Both PBX1 and MEIS1 are known proto-oncogenes.
Although guilt by association is not a firm foundation for implicating the strawberry notch homologue 1 protein, it provides a compelling indication that this pathway may be of some interest in future cancer studies.

It has long been known that HOX proteins bind DNA to regulate gene expression.[199] Thus, the binding of HOX proteins to the complex involving PBX1 and MEIS1 likely confers upon the complex the ability to recognize or bind specific DNA sequences, and thus provides the ability to compete with GFI1 proteins upstream of repressed genes. However, while the Drosophila Abd-A gene product is known to work in this capacity, it is unclear which human HOX protein is utilized in the EGF expression pathway. Homeobox C (HOXC) and homeobox D (HOXD) proteins are likely to have a role in cell proliferation, which would make them candidates for oncogenic pathways.[200]

SynMuv Genes

In C. elegans, Egf is also controlled by a family of genes known as Synthetic Multivulva (SynMuv) genes, which also appear to be responsible for controlling other oncogenic genes such as Egfr/Ras/Mapk1.[201, 202] Human homologues to the SynMuv genes include the human lin-9 homolog (LIN9) gene and the known oncogenes retinoblastoma 1 (RB1) and E2F transcription factor (E2F), and may also be involved in the nucleosome remodelling and deacetylase (NuRD) complex, which makes this particular pathway interesting for cancer research.[203–205] Further evidence also suggests that LIN9 is required in C. elegans for the entry of cells into mitosis.[206] It has also been suggested that SynMuv proteins function in repression of transcription.[189]

Epidermal Growth Factor

EGF expression appears to be a key component at the end of the pathway annotated from the sequencing information obtained. SBNO1 is directly upstream and is known to control the expression of this gene, likely via its role in DNA binding, possibly acting as a transcription factor.
It is known to be necessary for EGF expression in other organisms; however, it does not appear to be sufficient for EGF expression, as it is present in all cell lines, including the matched normals, although EGF itself is only expressed in the cancer-derived cell lines (see Table 5.5). Control of EGF is a complex relationship involving multiple transcription factors, repressors and gene complexes, so other genes are also likely to play a role in determining the expression behaviour of the EGF protein.

Cell Line    EGF-001   EGFR-001   ERBB2-201   ERBB3-201   ERBB4-201
HCC38        756       7636       11275       8483        109
HCC70        1038      14746      10615       14262       0
HCC202       393       4208       680609      44342       409
HCC1187      36        13295      26846       10285       12
HCC1395      174       12791      5103        111         0
HCC1500      78        0          12508       19764       578
HCC2218      139       3864       340272      25777       93
UACC-893     1286      20430      640834      13180       164
HCC38BL      2         1          838         9           2
HCC1187BL    0         0          603         10          0
HCC1395BL    0         5          699         13          45
HCC2218BL    0         971        601         3           1145

Table 5.3: Expression of EGF and its receptor genes EGFR, HER2 (ERBB2), HER3 (ERBB3) and HER4 (ERBB4). EGFR is the only one of the family of proteins that is known to bind EGF, although the family of proteins is known to form dimers, and thus interactions are possible between the various members of the EGFR family.

With the exception of cell lines HCC1395 and HCC1187, a correlation between EGF and EGFR mRNA expression (R² = 0.95) is observed. (Figure 5.3.) However, this correlation decreases (R² = 0.57) with the inclusion of the two cell lines which express high levels of EGFR without strong EGF. Thus, it appears that high levels of EGF expression promote high levels of EGFR expression, but high levels of EGFR expression do not require high levels of EGF expression.

Figure 5.3: Count of EGF and EGFR tags for each cell line. A distinct linear relation is present, with two outliers. Without the two outliers, a linear correlation value of R² = 0.95 is observed.
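The effect of the two outliers on the fit can be checked directly from the raw tag counts in Table 5.3. The following is a minimal sketch rather than the code used in this work: it assumes that the reported R² is the squared Pearson correlation taken over all twelve cell lines (cancer-derived and matched normal), and the function name `r_squared` is purely illustrative.

```python
from math import fsum

# Raw tag counts (EGF-001, EGFR-001) copied from Table 5.3.
COUNTS = {
    "HCC38": (756, 7636),
    "HCC70": (1038, 14746),
    "HCC202": (393, 4208),
    "HCC1187": (36, 13295),
    "HCC1395": (174, 12791),
    "HCC1500": (78, 0),
    "HCC2218": (139, 3864),
    "UACC-893": (1286, 20430),
    "HCC38BL": (2, 1),
    "HCC1187BL": (0, 0),
    "HCC1395BL": (0, 5),
    "HCC2218BL": (0, 971),
}

# The two cell lines with high EGFR but weak EGF expression.
OUTLIERS = {"HCC1187", "HCC1395"}


def r_squared(pairs):
    """Squared Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = fsum(x for x, _ in pairs) / n
    mean_y = fsum(y for _, y in pairs) / n
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = fsum((x - mean_x) ** 2 for x, _ in pairs)
    syy = fsum((y - mean_y) ** 2 for _, y in pairs)
    return (sxy * sxy) / (sxx * syy)


r2_all = r_squared(list(COUNTS.values()))
r2_without_outliers = r_squared(
    [pair for name, pair in COUNTS.items() if name not in OUTLIERS]
)

print(f"R^2, all cell lines:      {r2_all:.2f}")
print(f"R^2, outliers excluded:   {r2_without_outliers:.2f}")
```

Under these assumptions, the two values should land close to the 0.57 and 0.95 quoted above; a different choice of samples (for example, cancer-derived lines only) would give different numbers.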
Expression of Notch Genes

Notch proteins function as receptors and have been repeatedly implicated in cancers. While it can be difficult to sort out the downstream proteins through which the Notch proteins are sending signals, the expression of the Notch genes themselves can provide some clues. For the notch 1 (NOTCH1) and notch 4 (NOTCH4) genes, mRNA levels do not change significantly between the cancer-derived and matched-normal-derived cell lines. However, for notch 3 (NOTCH3) and notch 2 (NOTCH2), some changes are visible. NOTCH3 is highly over-expressed in many cancers, and this is clearly mirrored in the mRNA of all eight of the cancer-derived cell lines, while no expression (<5 tags per cell line) is observed in the normal samples. (See Table 5.4.) In contrast, NOTCH2 mRNA does show expression in the normal-derived cell lines, but is dramatically up-regulated in all eight cancer-derived cell lines. This suggests a change in signalling patterns of cancer cells. The Notch proteins play a significant role in this pathway, as the putative signalling receptors for driving EGF production.

Cell Line               NOTCH1-001        NOTCH1-201        NOTCH2-001        NOTCH2-201      NOTCH3-201          NOTCH4-001
HCC38                   4366 (65.25)      2611 (39.02)      33624 (502.56)    319 (4.77)      8238 (123.13)       75 (1.12)
HCC70                   12294 (185.52)    8856 (133.64)     29848 (450.41)    295 (4.45)      64636 (975.37)      25 (0.38)
HCC202                  13204 (224.69)    11848 (201.62)    32370 (550.84)    1918 (32.64)    55822 (949.92)      18 (0.31)
HCC1187                 29451 (638.99)    23806 (516.51)    31308 (679.28)    813 (17.64)     112109 (2432.41)    415 (9.00)
HCC1395                 7972 (119.18)     6835 (102.18)     59484 (889.24)    1739 (26.00)    31646 (473.08)      165 (2.47)
HCC1500                 5049 (83.20)      4188 (69.02)      21473 (353.86)    670 (11.04)     20292 (334.40)      29 (0.48)
HCC2218                 8904 (159.43)     5790 (103.67)     21465 (384.34)    754 (13.50)     13906 (248.99)      0 (0.00)
UACC-893                6925 (90.97)      5710 (75.01)      23781 (312.41)    656 (8.62)      9042 (118.79)       36 (0.47)
Cancer cell lines       11020.63          8705.50           31669.13          895.50          39461.38            95.38
Stdev – cancer          8076.07           6723.58           12257.56          607.69          36176.10            139.07
HCC38BL                 8799 (108.89)     7177 (88.82)      4197 (51.94)      153 (1.89)      0 (0.00)            94 (1.16)
HCC1187BL               9927 (199.10)     8398 (168.43)     325 (6.52)        14 (0.28)       1 (0.02)            96 (1.93)
HCC1395BL               23476 (423.07)    20125 (362.68)    8054 (145.14)     276 (4.97)      5 (0.09)            122 (2.20)
HCC2218BL               14845 (256.05)    12480 (215.26)    777 (13.40)       32 (0.36)       3 (0.05)            154 (2.66)
Non-cancer cell lines   14261.75          12045.00          3338.25           118.75          2.25                116.50
Stdev – non-cancer      6680.18           5844.47           3587.73           121.65          2.22                28.07

Table 5.4: mRNA expression of the Notch Genes. Normalized values are given in reads per million in parentheses.

Unlike the apparent correlation between EGF and EGFR mRNA expression, there does not appear to be any correlation between the mRNA expression of any of the Notch genes and EGF expression. This is likely because there are several other factors controlling EGF expression in addition to the Notch genes (e.g. the BARX2, ATOH8 and GFI1 proteins that interact to control EGF mRNA expression). However, no clear correlations are observed between the expression of any of the genes upstream of EGF and the expression of EGF mRNA. (Data not shown.)
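The normalized values in Table 5.4 are reads per million. A minimal sketch of that conversion follows. The function names are illustrative, and back-calculating the library size from one raw count and its normalized value is an assumption made here for demonstration only; it is not the VSRAP implementation.

```python
def reads_per_million(raw_count, total_mapped_reads):
    """Scale a raw tag count to reads per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return raw_count * 1_000_000 / total_mapped_reads


def implied_library_size(raw_count, rpm):
    """Back-calculate the library size implied by a raw count and its RPM value."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return raw_count * 1_000_000 / rpm


# NOTCH1-001 in HCC38 (Table 5.4): 4366 raw tags, reported as 65.25 RPM.
hcc38_library = implied_library_size(4366, 65.25)

# Apply the same library size to NOTCH3-201 in HCC38 (8238 raw tags).
notch3_rpm = reads_per_million(8238, hcc38_library)

print(f"Implied HCC38 library size: {hcc38_library:,.0f} reads")
print(f"NOTCH3-201 in HCC38: {notch3_rpm:.2f} reads per million")
```

Because every count within a sample is divided by the same library size, the ranking of genes within a cell line is unchanged by this normalization, which is consistent with the observation that it makes no qualitative difference to the values.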
Genes Directly Controlling EGF Expression

Cell Line    LIN9-001   RHBDF1   GFI1-001   ATOH8-001   BARX2-201   PBX1-001   MEIS1-001   MEIS3P1-201
HCC38        8907       3985     124        736         268         1487       412         607
HCC70        2103       6560     1140       31          2973        1712       1224        139
HCC202       965        7544     27         215         355         11719      537         584
HCC1187      1568       4312     24         115         26062       811        62          121
HCC1395      1670       5050     24         987         5           1138       453         78
HCC1500      1112       3937     14         53          34          5730       10          184
HCC2218      3886       1760     11         53          517         6685       848         145
UACC-893     1417       3909     25         241         422         6861       588         132
HCC38BL      1353       31       2506       2           0           0          2           0
HCC1187BL    701        62       1103       2           8           0          1           0
HCC1395BL    925        124      759        3           3           7          13          0
HCC2218BL    1031       32       26         8           0           0          3           0

Table 5.5: In C. elegans, Sbno1 is required for expression of Lin-3/Egf; however, this information combined with the information in table 5.2 appears to indicate that SBNO1 is not sufficient for EGF mRNA expression in humans. LIN9, the human SynMuv homologue, is also necessary for EGF mRNA expression, but clearly also not sufficient. RHBDF1 is a transcription factor believed to control EGF expression, while GFI1 is a repressor, also putatively in control of EGF mRNA expression.

One mechanism that is well understood in the control of EGF expression is the GFI1 repressor. This protein normally binds upstream of the EGF gene, preventing transcription of EGF mRNA.[197] The ATOH8 protein also plays a role in regulating the GFI1 repressor protein, enabling the GFI1 protein to perform its normal function. Upstream of ATOH8 in the pathway, however, is BARX2, which functions to suppress the activity of ATOH8, leading to the repression of the GFI1 repressor, presumably with the effect of enabling EGF transcription. In seven of the eight cancer-derived cell lines, the BARX2 gene is being actively transcribed (30-26,000 tags), while fewer than 10 tags are observed in each of the non-cancer matched normals. This suggests that the cancer cell lines are actively transcribing BARX2 in order to deregulate the expression of EGF.
Furthermore, a second pathway interferes with the GFI1 repression. The known oncogene NOTCH3, frequently over-expressed in breast and ovarian cancers, is able to signal through a complex involving a HOX gene, PBX1 and MEIS1, which is known to displace GFI1, causing it to dissociate from the DNA upstream of the EGF gene.[147, 207] This provides a second mechanism by which EGF expression can be deregulated. All eight cancer cell lines show significant expression of NOTCH3 mRNA (see Table 5.4), and of PBX1, MEIS1 and Meis homeobox 3 pseudogene 1 (MEIS3P1) (see Table 5.5), while expression is limited or not observed for any of those genes in the non-cancer derived cell lines.

RHBDF1 & LIN9

Two other proteins are known to play a role in the expression of the EGF gene: RHBDF1 and LIN9. RHBDF1 is a transcription factor known to promote EGF expression, while LIN9 is believed to play a role upstream in EGF expression as well as to control entry of cells into mitosis, based on C. elegans model systems.[196, 201–203, 208] While LIN9 mRNA does not show any significant signs of expression differences in our ductal carcinoma cell lines, RHBDF1 mRNA does show some signs of being up-regulated, perhaps underscoring the significance of EGF in the development of these cells.

Expression of Notch Pathway Genes

Figure 5.4: Expression levels of Notch Pathway Genes, represented graphically. The error bars are for the standard deviation over all eight breast cancer cell lines or for the four B-cell derived cell lines. Not all members of this pathway are statistically significant, as evidenced by the error bars. However, the elements represented in this figure do not represent a single pathway, but rather several branches of the same pathway, indicating that some branches clearly are less important than others. See table 5.6 for read count values.
The JAG2 protein is a known trigger for the Notch genes in lung adenocarcinomas,[209] and its production by the cell lines investigated here may indicate an interesting mechanism for stimulating the Notch signalling cascade in breast cancer. mRNA from JAG2 is expressed in all eight of the cancer-derived cell lines, but none of the normals, while jagged1 (JAG1) mRNA expression is completely lost in UACC-893. The JAG2 proteins are normally expressed on the cell surface where they interact with adjacent cells, making this an interesting, but currently unexplored, extracellular target for disruption. The two JAG2 transcripts also show a significant change in expression, both of which appear to be up-regulated in the cancers.

Cell Line               SBNO1-201   JAG1-001   JAG2-001   DLL1-001   DVL3-201   DVL3-202   DVL4-201   NUMB-201
HCC38                   12248       10315      2380       194        55         52         10         9285
HCC70                   3211        7439       4778       4          2          2          11         12711
HCC202                  6261        416        521        0          12         11         38         5159
HCC1187                 4992        1825       5537       7          19         16         49         7741
HCC1395                 4216        35450      1466       150        225        184        54         5774
HCC1500                 6789        1011       5281       14         8          6          17         3695
HCC2218                 8874        5255       1855       3          6          6          6          3965
UACC-893                15395       83         673        47         0          0          101        4381
Cancer cell lines       7748.25     7724.25    2811.38    52.38      40.88      34.63      35.75      6588.88
Stdev – cancer          4205.10     11791.50   2074.50    76.22      76.43      62.59      32.26      3137.77
HCC38BL                 12464       1789       12         127        5          5          0          13920
HCC1187BL               6616        2639       30         176        74         67         3          8785
HCC1395BL               7048        2308       51         323        37         35         224        9556
HCC2218BL               8988        578        61         154        205        199        86         5694
Non-cancer cell lines   8779.00     1828.50    38.50      195.00     80.25      76.50      78.25      9488.75
Stdev – non-cancer      2664.45     904.09     21.89      87.65      87.82      85.50      105.02     3392.93

Table 5.6: Expression of genes in the Notch Pathway. JAG2-002 is not shown, as it provides numbers virtually identical to JAG2-001. Only NUMB-201 is shown, but it is representative of the other three isoforms. DVL1-201 is not shown, as it provides numbers identical to DVL1-001.
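The summary rows of Table 5.6 (and the error bars of Figure 5.4) can be reproduced from the raw counts. The sketch below uses the SBNO1-201 column as an example; it assumes the sample standard deviation (n - 1 denominator), and the `summarize` helper is a hypothetical name introduced for illustration.

```python
from statistics import mean, stdev

# SBNO1-201 raw tag counts from Table 5.6.
SBNO1_CANCER = [12248, 3211, 6261, 4992, 4216, 6789, 8874, 15395]
SBNO1_NORMAL = [12464, 6616, 7048, 8988]


def summarize(counts):
    """Return (mean, sample standard deviation) for a list of tag counts."""
    return mean(counts), stdev(counts)


cancer_mean, cancer_sd = summarize(SBNO1_CANCER)
normal_mean, normal_sd = summarize(SBNO1_NORMAL)

print(f"Cancer cell lines:     mean {cancer_mean:.2f}, stdev {cancer_sd:.2f}")
print(f"Non-cancer cell lines: mean {normal_mean:.2f}, stdev {normal_sd:.2f}")
```

With only four matched normals, the standard deviation of the non-cancer group is estimated from very few points, so overlapping error bars should be read with that limitation in mind.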
Polycomb and Methylation

Methylation is thought to play a significant role in the development of cancer, and cancer cells have long been observed to have methylation patterns different from those of healthy cells.[70, 71] Thus, the observation that methylation genes may play a role in this pathway, if tangentially, is perhaps not a surprise. Methylation is believed to play a role in cell cycle control and a host of other processes which cancer cells need to alter to encourage rapid and uncontrolled growth. Two proteins that appear to play a role in this process are signal transducer and activator of transcription 5 (STAT5) and B-cell CLL/lymphoma 6 (BCL6), which both function upstream of the Polycomb complex.[37, 210] In the cell lines studied here, STAT5 is shown to be significantly down-regulated in cancers compared to the B-cell derived non-cancer cell lines. This makes sense in the context that STAT5 itself negatively regulates BCL6 (a known oncogene, frequently over-expressed in cancers), which itself shows an up-regulation in the cancer-derived cell lines. BCL6 is known to affect both the Polycomb complex and the expression of B-cell specific proteins.[210, 211] In the case of the cancers, the B-cell specific proteins such as interferon regulatory factor 4 (IRF4) and spleen focus forming virus (SFFV) proviral integration oncogene spi1 (SPI1) are not turned on, while they do appear prominently in the B-cell derived cell lines. However, the over-expression of BCL6 in the cancer cells is possibly functioning to influence the Polycomb complex. In turn, the Polycomb complex is known to affect mixed-lineage leukemia (MLL) genes.[212] Surprisingly, these genes appear to contain cancer specific mutations in all but two cell lines, with emphasis on variations in myeloid/lymphoid or mixed-lineage leukemia 3 (MLL3).
(Myeloid/lymphoid or mixed-lineage leukemia 2 (MLL2) and myeloid/lymphoid or mixed-lineage leukemia 5 (MLL5) also show signs of disruption.) As the MLL genes are specifically involved in methylation, this may have genome wide implications for cell cycle control and other core cellular processes.

HOX Genes and the SynMuv Homologues

The Polycomb complex also functions through a second complex of genes known in C. elegans research as the Synthetic Multivulva (SynMuv) complex. In humans, the SynMuv complex is made up of PBX1, a Meis homeobox (MEIS) protein and a HOX protein, and is known to work to regulate transcription of various other genes throughout the genome, including the EGF gene, where the complex competes directly with the GFI1 protein, a repressor that resides upstream of EGF.[201] In the cancer cell lines studied, this particular complex is characterized by significant up-regulation of both the PBX1 and MEIS1 genes, which suggests that they may play a role in the over-expression of the EGF mRNA. The MEIS gene, however, is also the subject of variations in three of the eight cancer-derived cell lines, suggesting that it may have a role in oncogenesis. The third partner in the complex is a HOX protein; however, which specific human HOX protein is involved in regulation of EGF is not known. Nevertheless, several HOX mRNAs are up-regulated in all of the cancer-derived cell lines, including seven HOXC mRNAs, as well as a number of homeobox A (HOXA) and homeobox B (HOXB) mRNAs. Two of the HOXC genes, homeobox C10 (HOXC10) and homeobox C11 (HOXC11), also show potentially disruptive variations in the cancer-derived cell lines sampled.
Finally, this complex is also known to be the direct target of the NOTCH3 protein that, as discussed above, is cancer specific and, in this case, highly over-expressed in all eight cancer-derived cell lines.[147]

Notch Signalling

Notch signalling plays a significant role in many cancers, and is thus likely the target for genomic disruptions that could hijack the pathways for oncogenic purposes. In this case, several different disruptions can be observed. NOTCH1 and NOTCH2 are the target of SNVs in these cell lines and are known to interact with DVL1, dishevelled, dsh homolog 3 (DVL3) (repressors) and JAG2 (activator) among other surface and extracellular proteins.[209] In the case of the cancer-derived cell lines, both DVL1 and DVL3 show disruptive variations, while JAG2 shows both disruptive variations in four of the eight cancer-derived cell lines and a significant increase in expression in all of the cancer-derived cell lines that is not observed in the matched normal cell lines. Downstream in the signalling pathway, Notch proteins also interact with the Mastermind proteins (mastermind-like 1 (MAML1), mastermind-like 2 (MAML2) and mastermind-like 3 (MAML3)), which translate the Notch signals into cell type dependent pathways.[144, 191] These proteins are thought to provide the diversity that enables the Notch proteins to create a varied repertoire of cell type specific signals. One of the proteins with which the mastermind (MAML) proteins communicate is the RBPJ protein, a regulator of the MYC proteins.[213] RBPJ itself does not appear to show any signs of variation in any of the cell lines investigated here; however, its binding partner, CREBBP, does show signs of variations in three of the eight cancer-derived cell lines. This may play a role in its regulation of the MYC genes, which are well studied oncogenes.

MAPJD

One of the more fascinating targets to have appeared in this study is the MAPJD protein.
This protein was previously known as C14orf169 and is a downstream target of the MYC proteins, as well as being a known oncogene.[192] In the studied cell lines, the MAPJD gene appears to contain disruptive variations in only one cell line. However, this gene’s role in cell regulation appears to be relatively significant, as it is known to control several proteins that may be important in oncogenesis. It is also known to regulate the SBNO1 protein, which is necessary for the expression of EGF and RIOK1 mRNA. The RIOK1 protein, which operates with PRMT5 to regulate TP53 and other map kinases, appears to play a part in regulating the expression of transforming growth factor, beta-induced (TGFBI) and the RAS family of proteins.[193] Most surprising, while the MAPJD gene has not accumulated many variations itself, the proteins with which it is affiliated are frequent targets of mutations. Both SBNO1 and RIOK1 exhibit disruptive variations in four of eight cancer-derived cell lines, as does the gene of downstream target TGFBI. It is not inconceivable that MAPJD’s activity could be moderated by mechanisms other than variations, such as phosphorylation or changes to an unknown interaction partner that were not detected in the RNA-Seq experiments.

Jagged2 Dependent Pathway

Another interesting aspect of this pathway is the inclusion of JAG2, upstream of the Notch proteins. The JAG2 pathway is known to affect the Notch pathway in an RBPJ dependent manner to effect the signalling of GATA binding protein (GATA) genes.[209] In fact, GATA binding protein 3 (GATA3) is known to play a central role in breast cancer and appears to encourage metastasis in lung cancer. GATA binding protein 2 (GATA2) appears to control mir-200, which is a known trigger of metastasis through the snail homolog 1 (SNAI1) and snail homolog 2 (SNAI2) transcription factors.
Three of the GATA genes, GATA2, GATA3 and GATA binding protein 5 (GATA5), are all highly expressed compared to the B-cell derived cell lines, as is mRNA from the SNAI1 and SNAI2 genes. While one normal does express mRNA from the GATA3 gene, expression of GATA genes does appear to be generally cancer specific. GATA binding protein 1 (GATA1) is not expressed in any sample.

Other Genes of Interest

Several other genes related to this pathway show alterations in expression patterns that might potentially be of interest. The mRNA for v-myc myelocytomatosis viral related oncogene, neuroblastoma derived (MYCN), a core oncogene, is found to be turned on in five of the eight cancer-derived cell lines, but none of the matched normals, while v-myc myelocytomatosis viral oncogene homolog 1, lung carcinoma derived (MYCL1) mRNA is up-regulated in seven of the eight cancer-derived cell lines. In contrast, MYCBP associated protein (MYCBPAP) expression is turned off in seven of the eight cancer-derived cell lines, but expressed in all four normal-derived cell lines.

NUMB, another protein believed to interact with the Notch genes, appears to show signs of alternate splicing, with poor expression of exon three in normal-derived cell lines.

A Note on Pathway Analysis

The pathway outlined above was the product of manual literature searches, in combination with the use of variation and expression data, rather than automated pathway tools. In early efforts to use Ingenuity Pathway Analysis tools (www.ingenuity.com), no significant enrichment of pathways involving variations was identified. This may be because many of the relationships between genes or proteins were discovered after the initial attempts to use the software, or because many of the relationships were discovered in model organisms such as fruit flies or worms.
In any case, the enrichment for pathways was originally undertaken through the use of variations alone, rather than the combination of variations and cancer-cell-line specific expression. Thus, it is possible that combining expression and variation information might have improved the results obtained.

5.2.4  Recurrence of Strawberry Notch Variations

Recurrence of SBNO1 variations in ductal carcinomas was investigated through the use of a panel of 170 mammary ductal carcinoma samples described in section 5.1.4. All variations called by SNVMix2α were loaded into the BCGSC's VDB, which was used to identify possible recurrent non-synonymous variations. Four non-synonymous variants were observed in cancer samples in the panel: S264C, F348V, N773S and P1350L. Upon closer investigation, however, none of these provides particularly strong evidence for recurrent SBNO1 variations.

S264C

The S264C variation appears as a somatic, heterozygous variation (9 reads out of a total of 22 at chr12:122384668) in a single sample (SA401), which is an invasive ductal carcinoma, as well as appearing with a single read out of 50-73 in four other samples (SA159N, SA160N, SA214N, SA397). Support from a single read is insufficient for a variant call to be made, indicating that the observation in these four samples is likely a base mis-call.

F348V

The F348V variation appears in the database in four normals (A07321, A07304, A0284, A07244) and one cancer (A07331), with frequencies between 21.75% and 40.74%, with the cancer sample showing eight variant reads out of twenty-eight with the substituted A → C mutation. However, when searching through the raw data, the same variation can be observed in 161 out of the 190 matched cancer/normal sample pairs at various depths. Although this variation has not previously been observed in any other data set, the vast majority of the variant bases observed in the raw data are of poor quality, explaining why they were not called by SNVMix2α and imported into the VDB.
With such a large number of samples it is possible to ask whether this variation may be frequently recurring in ductal carcinomas, or a predisposing germ-line variant, as it has never before been observed in other samples. Both hypotheses are unlikely, as this variant is found in 16 of the 20 non-ductal carcinomas present in the panel (80%), as well as 151 of the 170 ductal carcinomas (88.8%).

[Figure 5.5 plot: "Observations of C->A substitutions at F348V in SBNO1"; x-axis: count of C->A reads at F348V (1 to 18); y-axis: number of samples; series: Cancer, Normal.]

Figure 5.5: Histogram of the number of times the F348V A → C mutation is observed in each sample and matched normal. There is a slight shift towards the right that could indicate an increase in the prevalence in cancer samples; however, this is not supported when observed as a percentage of the total number of bases observed at that position (see figure 5.6).

The number of samples observed to contain this variation and the number of variations reported at this position in the VDB are surprisingly divergent, reflecting the lack of trust placed in the quality of the bases observed at this position by the SNP-caller, SNVMix2. The vast majority of the non-canonical bases reported show poor alignment qualities, which may reflect some form of alignment bias for this region. However, searches with the region around the F348V variation, with and without the altered base, did not identify any regions of similarity that might be causing a systematic bias.
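The panel counts quoted above for F348V (151 of 170 ductal carcinomas and 16 of 20 non-ductal carcinomas) can be compared directly. The following is a minimal sketch, not part of the original analysis: it recomputes the two percentages and applies a two-sided Fisher exact test to the 2x2 table. The choice of test and the function name `fisher_exact_two_sided` are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is as likely or less likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Counts taken from the text: samples carrying the variant / samples in the group.
ductal_with, ductal_total = 151, 170
other_with, other_total = 16, 20

ductal_pct = 100 * ductal_with / ductal_total
other_pct = 100 * other_with / other_total
p_value = fisher_exact_two_sided(
    ductal_with, ductal_total - ductal_with,
    other_with, other_total - other_with,
)

print(f"ductal: {ductal_pct:.1f}%  non-ductal: {other_pct:.1f}%  p = {p_value:.3f}")
```

A large p-value here would be consistent with the argument in the text that the variant is not specific to ductal carcinomas. The test says nothing about whether the observations are artifacts.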
[Figure 5.6 plot: "Observations of C->A substitutions at F348V in SBNO1 by percent"; x-axis: percent of C->A reads at F348V (1 to 42); y-axis: number of samples; series: Cancer, Normal.]

Figure 5.6: Histogram of the reads yielding the F348V A → C mutation observed in each sample and matched normal, shown as a percentage of the total reads. In this view, the shift to the right observed in figure 5.5 is no longer visible, and the groupings of cancer and normal appear to behave similarly. Samples with a high percentage of variant reads in the cancer are generally due to low total coverage.

Overall, there are a few possibilities that could explain this variant's presence in the data. It could be a polymorphism that rarely passes detection by SNP-callers, which is unlikely as it was called five times in the ductal carcinoma panel alone. It could be a ductal carcinoma specific variation, which is again unlikely for the reasons outlined above. It could be a novel breast cancer specific predisposing variation, although it has not been detected in other samples to date. Finally, it could be an artifact of the specific capture experiment or alignment, which appears to be the most likely explanation, although the mechanism by which this artifact arises is unknown.

N773S

The N773S variation is similar to the S264C variation in that it is present in several samples at low levels, and was observed in the database because of its presence in a single sample. However, the sample in which it was found, SA196, also shows its presence in the matching normal. While the cancer shows a slightly elevated frequency over the normal (54% and 44% respectively), this is unlikely to represent a biologically significant change.
However, among the other samples, the T → C variation that gives rise to the N773S substitution is found in three separate normals and cancers, none of which make up a matched pair. With one exception, SA195, the variation is observed in only a single read making up less than 5% of the reads at that position, and in SA195, the variation is only observed in three out of thirty reads aligned to this position. Thus, while this may be a germline variation in SA196, it is most likely only appearing as a sequencing error in the other samples.

P1350L

A very similar story to the N773S variation is observed with the P1350L variation: a single matched cancer/normal pair in which the variant accounts for approximately 50% of the total reads, followed by a small number of unmatched cancers and normals (four and eight, respectively) in which the variant base is observed only once. Again, this variation may be a germline variation in the one sample and matched normal, SA407, but most likely represents only sequencing errors in the other samples, in which low-level base mis-calls are present.

S996C

The S996C variation was observed as a putative recurrent variation in the ductal carcinoma cell lines (see figure 4.8). In the panel of ductal carcinomas, however, it is observed with only one read three times, and with two reads once, at about the same level as the error rate for the sequencing platform and, like the noise observed for the other variations, distributed evenly across cancers and normals. There is little support for recurrence of this variation in the ductal carcinoma samples tested.

E888K

Like the S996C variation, the E888K variant was observed in a ductal carcinoma cell line, reported in chapter 4.
This variation, however, shows even less support as a putative recurrent variation, as it is observed with a single supporting read in five normal samples, again consistent with the error rate of the platform, and is unlikely to be present at any level in any of the cancers or normals in the sample.

Overall, SBNO1 does not appear to harbour any recurrent variations that are indicative of a gene being repeatedly disrupted in ductal carcinomas. However, the lack of recurrent SNVs does not rule out the possibility that this gene may play a substantial role in cancer cells. It is possible that fusions or INDELs, which were not ruled out by this particular test, also disrupt this gene.

Summary of Recurrence

Despite the identification and confirmation of the presence of SBNO1 somatic mutations in the cell lines investigated, we were unable to find evidence of similar somatic mutations in the panel of breast cancers. It is worthy of note, however, that while both the tumours and the cell lines investigated were taken from primary tumours, only the cell lines were taken from tumours that had metastasized. Thus, it is possible that the SBNO1 somatic mutations observed may be a characteristic variation accrued during the process of metastasis. It would therefore be useful to further investigate metastatic ductal carcinoma tumours to identify whether SBNO1 mutation events are associated with metastasis or if SBNO1 mutations simply promote growth during the immortalization process.

5.2.5  HCC1500 Exhibits Few Mutations

One interesting observation is that very few mutations in this pathway are observed in HCC1500. While HCC1500 is observed to have mutations in the MLL3 gene as well as SBNO1 and pre-B-cell leukemia homeobox interacting protein 1 (PBXIP1), the variations in the latter two genes have also been observed in other non-cancer cell lines (Figure 5.2).
This is significantly fewer variations than in the other cell lines, suggesting that this pathway may not be a significant target for variations in HCC1500. This, in fact, fits well with the observation that HCC1500 does not express mRNA for the epidermal growth factor receptor (EGFR) (Table 5.5), and is the only clearly ER+ cell line among the group (Table 4.6). This may suggest that ER+ tumours depend less on the EGF pathway to promote cell growth, or alternately, that cells that become ER- may compensate by using the EGF pathway to promote cell growth. A slightly confounding factor to the theory is that HCC1500 is the only cell line to have a variation in the MAPJD gene (validated deletion, see table E.2), which lies upstream of several pathways strongly correlated with oncogenesis. While it is difficult to know the effect of the variations, it is possible that the deletion observed in MAPJD in HCC1500 is activating (see figure 5.2), and that this single variation can provide a similar effect to a combination of other variations found in the other cell lines.

5.2.6  Interleukin Expression

Notch receptors are not the only signalling molecules to be involved in cancers, and using expected levels of interleukins can be useful as a positive control. For instance, interleukin 8 (IL8) and interleukin 12 (IL12) are suspected to be related to breast cancer metastasis.[214] In the ductal carcinomas used in this experiment, mRNA from IL8 was observed to be generally up-regulated in the cancer-derived cell lines, although neither interleukin 12 alpha (IL12A) nor interleukin 12 beta (IL12B) mRNA was observed to be up-regulated (see Figure 5.7). In part, this is because IL12 is one of a handful of B-cell produced interleukins, and the matched normals used here were created from Epstein-Barr Virus (EBV) infected B-cells. Thus, mRNA from interleukin 1 (IL1), IL10 and IL12 is observed to be highly expressed in the matched normals.
Interleukin 6 (IL6) would also be expected to be produced by B-cells, but its expression level in the matched normals is lower than that of the cancer-derived cell lines. Also of interest, cells infected with viruses tend to express high levels of interleukin 15 (IL15), for which mRNA is also observed in the matched normals. Other mRNA signals for which expression shifts significantly include IL6, IL8, interleukin 18 (IL18) and interleukin 20 (IL20), which are up-regulated in the cancers, while mRNA from interleukin 7 (IL7), interleukin 16 (IL16), interleukin 23 alpha (IL23A), interleukin 24 (IL24) and interleukin 32 (IL32) are all expressed at higher levels in the B-cell derived normals.

Figure 5.7: Expression of Interleukins. The error bars show the standard deviation over all eight breast cancer cell lines or over the four B-cell derived cell lines. Not all interleukin expression values are statistically significant, as evidenced by the error bars. However, the graph shows the expression of all interleukins and indicates that some of them appear to be specifically of interest in the cancer cell lines, suggesting avenues of further investigation for cell growth or survival.

Interleukin receptors also exhibit significant shifts in expression between the cancer and matched normals. mRNA from interleukin 2 receptor (IL2R), interleukin 4 receptor (IL4R), interleukin 6 receptor (IL6R), interleukin 10 receptor (IL10R), interleukin 12 receptor (IL12R), interleukin 18 receptor (IL18R) and interleukin 21 receptor (IL21R) are all more highly expressed in the B-cell derived matched normals, while mRNA from interleukin 13 receptor (IL13R) and interleukin 17 receptor (IL17R) show highly up-regulated expression in the cancer-derived cell lines (see Figure 5.8).

Figure 5.8: Expression of Interleukin Receptors.
Chapter 6

Conclusions

“I am turned into a sort of machine for observing facts and grinding out conclusions.” - Charles Robert Darwin

6.1  Bioinformatics and Next Generation Sequencing

In each of the preceding chapters, technological and biological advances have been presented that demonstrate the value of Next Generation Sequencing (NGS) in the quest for personalized medicine and improved health care. The ability to gain insight into the mechanics of DNA-protein interactions provides the promise of improved understanding of the mechanics of genomic diseases. The ability to search large numbers of variations across a sample of individual patients and tissue donors provides new opportunities to perform rapid medical assessments while learning about the underlying variations in the human genome. The opportunity to shed new light on the underlying mechanisms of mammary ductal carcinomas yields insights into the diversity of the disease and the opportunity to identify common features that are shared by multiple cancers. The insight into the behaviour of malfunctions in the Notch pathways provides new opportunities and targets for potential treatments and disruption of oncogenic behaviours of cancer cells. Each of these insights was fundamentally provided through the use of NGS platforms, which have transformed the way high-throughput biology is performed in the less than five years since the introduction of the first pyrosequencing platforms. In each case, the common element is that the application of bioinformatics has allowed the research to go beyond simple sequencing applications into designing experiments that yield biologically relevant answers.

6.2  FindPeaks

FindPeaks, presented in chapter 2, is a well-known software application that has filled a niche in the Chromatin Immunoprecipitation and Sequencing (ChIP-Seq) community since its publication in 2008.
It was among the first applications available for the protocol and has been a pioneer in the field of epigenetics, providing several advances. Its design made it an ideal platform for experimentation and development of novel methods, it demonstrated a scalability that was second to none, and it was incorporated into pipelines and experimental procedures around the world. Although there are no developers actively contributing to this project at this point, FindPeaks demonstrated its aggressive vision by implementing solutions to problems that continue to plague other peak-finding applications, such as normalization and sample comparisons, and many of the ideas first implemented in FindPeaks have since been incorporated into other applications. By any metric, it has been a successful project in the fast-moving world of bioinformatics application development, outlasting many of its contemporaries.

6.2.1  Contributions

Among the many contributions made by the FindPeaks application are several enduring innovations: the use of read extension models, various False Discovery Rate (FDR) methods and novel methods of ChIP-Seq normalization and comparisons between samples. Additionally, the FindPeaks suite of tools for conversion and manipulation of data files continues to be downloaded and used by scientists around the world.

6.2.2  Who is Using it?

FindPeaks has been broadly used in publications to analyze ChIP-Seq experiments. (See appendices A, B and C.)
Its open source policy has even encouraged other software developers to build on it and develop hybrid methods that extend the use of the software.[3] Pre-bundled versions of FindPeaks 3.3 and 4.0 have been downloaded 4,345 times, suggesting a broad user base, as this figure does not include downloads of the source code for compiling (the preferred method of installing the application) or of earlier versions, including the published version of FindPeaks.[1] However, because of the diminishing support available for the software, the use of FindPeaks has declined since the end of 2010.

6.2.3  Strengths of the Software

There are several reasons for FindPeaks' popularity. The code is relatively bug free for most core functions and bugs have traditionally been fixed quickly. At times, when support was available from other developers, FindPeaks was able to innovate rapidly and the turnaround time for bug fixes was frequently in the range of hours to days, a key metric for any software project. Additionally, the software performed well, producing results quickly in our hands, with an average analysis taking 15-60 minutes on a server with 16 Gigabytes of Random Access Memory (RAM), and was able to process individual peaks containing in excess of 17 million reads. The modular code base also made it possible to make development changes quickly and for new developers to learn the code quickly and reuse it efficiently, accelerating the development cycle.

6.2.4  Limitations of the Software

Despite its success, FindPeaks suffered from several limitations. The small development team meant that there were often insufficient resources to capitalize on all of the opportunities presented. Documentation for the code often lagged behind the development, making the code difficult to use at times, and support requests could frequently take days to get a response when the developer(s) were unavailable.
This also made it impossible to invest the time to implement proper unit testing or to arrange sufficient levels of user testing during development. However, these problems are all common to small open source projects as well as academic coding projects.

6.2.5  Potential Applications of the Software

In addition to its use in ChIP-Seq and RNA-Seq applications, FindPeaks can also serve several other purposes. First, its ability to generate files that support visualization of genomic information makes it a useful tool for non-peak-finding purposes, and it has been employed in that capacity to generate “wig” tracks for visualization with external viewers, such as the University of California, Santa Cruz (UCSC) Genome Browser.[100] Second, it can also be used as a building block for other tools, as a starting point for other investigators interested in building ChIP-Seq tools. Finally, despite its shortcomings, FindPeaks is also a great model for developers at the BCGSC or other academic institutions to use for the development of new open source software projects. Many things done during the FindPeaks project were done well and could provide valuable insight for developers aiming to replicate its successes, and those elements that did not work well could provide valuable lessons as well.

6.2.6  Future Directions

Although no future development is planned on this code base, there are still several possible modifications that future developers could attempt. Included among them are different metrics for ranking peaks. While FindPeaks has always utilized peak height as the metric for the significance of a peak, it would also be possible to use peak areas or widths, or any combination of the above. There are several possibilities available that might yield interesting insights into the relationships between the DNA and proteins bound to it.
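To make the alternative ranking metrics mentioned above concrete, the following is a minimal sketch that computes height, area and width from a per-base coverage profile and ranks peaks by any one of them. It is not FindPeaks code: the function names, the `threshold` parameter and the example coverage profiles are all hypothetical and chosen only for illustration.

```python
def peak_metrics(coverage, threshold=1):
    """Return height, area and width for one peak's per-base coverage profile."""
    if not coverage:
        raise ValueError("coverage profile must not be empty")
    height = max(coverage)                                      # tallest point of the peak
    area = sum(coverage)                                        # total coverage under the peak
    width = sum(1 for depth in coverage if depth >= threshold)  # bases at or above threshold
    return {"height": height, "area": area, "width": width}


def rank_peaks(peaks, metric="height"):
    """Sort peak names from highest to lowest value of the chosen metric."""
    return sorted(
        peaks,
        key=lambda name: peak_metrics(peaks[name])[metric],
        reverse=True,
    )


# Hypothetical coverage profiles: one narrow, tall peak and one broad, low peak.
peaks = {
    "narrow_tall": [0, 2, 9, 12, 9, 2, 0],
    "broad_low": [3, 4, 5, 5, 5, 5, 5, 5, 4, 3],
}

print(rank_peaks(peaks, "height"))
print(rank_peaks(peaks, "area"))
```

With these example profiles the two metrics disagree on which peak ranks first, which is the kind of difference a combined metric would need to resolve.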
Additionally, FindPeaks has never been modified for use with broad, continuous peak sets, such as polymerase binding signatures. This represents a further area that could be explored, although there are already several other applications available that perform these functions.[74, 75] Finally, there remains much exploratory work to be done in the area of peak comparisons between different samples, particularly with respect to comparing three or more samples simultaneously. This represents a difficult area of statistics, however, and little headway has been made in the ChIP-Seq field to date.

6.3  Variation Database

The VDB is a project that is just finding its stride in being able to contribute successfully to relevant research and insight into human genomics and personalized medicine. It is currently experiencing tremendous growth in both the volume of data stored and the number of users utilizing the data that it generates. It is undoubtedly a key piece of the analysis pipeline at the BCGSC and is likely to play an important role in future research exploring the human genome.

6.3.1  Contributions

The VDB brings together several distinct conceptual ideas that underpin its success. The first is the separation of genomic data from the annotations that are imposed upon it, creating a database that tolerates updates to the annotations without updates to the stored data. The second is the use of a relational database to separate observations from coordinates, creating an environment that is flexible and able to handle a wide variety of query types efficiently. Next, the use of information detailing the origin of each observed variation, in addition to some basic raw information about the conditions of each observation, makes it possible to understand the relationships between the samples.
Finally, the database also departs radically from other academic databases by making the case that there will be a need to collect and store the vast amount of human variation data that cannot be shared publicly. This reinforces the important suggestion that human variation information collected academically should be analyzed, even if the information cannot be publicly distributed due to ownership issues. Each of these contributions is a significant conceptual change not observed in other published variation databases. Additionally, the database is well poised to make significant biological contributions to the interpretation of human variations. Although the exact nature of the contributions in each case will be determined by the type of data collected, it is likely that it will help in the identification of novel cancer driver mutations, new pathways and genes not previously known to be enriched for variations in cancer tissues. Additionally, it will provide an invaluable resource for confirming the distribution of variations in the population for any future project looking to identify novel variations. In addition to the above, the lasting contribution of the database is to grant researchers access to all of the information on SNVs called at a single sequencing centre through a single query, a task that was virtually impossible and inconceivable just a short time ago. This demonstration has been persuasive enough that other parallel pipelines are being implemented for other forms of genomic and transcriptomic data using the innovations detailed above. This represents a paradigm shift in the use of information generated by NGS platforms.

6.3.2  Usage

The number of people using information generated by the database is much greater than the number of people who have direct access to the database itself. It has become the focus of many collaborations and projects and is now used routinely for scanning data sets and performing filtering and analysis.
These collaborators have accessed information from the database via a small number of users, and the BCGSC has begun developing a broad group of bioinformaticians who have the knowledge to utilize the information generated by the VDB Application Programming Interface (API) in more complex analyses. Downstream, the VDB has taken a key role in filtering SNP and SNV data, and the process has begun towards using the VDB to perform the same functions with INDELs. In the near future, other forms of structural variations will have parallel databases based upon the VDB schema. It is likely that future BCGSC researchers and collaborators will routinely use results that have passed through the VDB API and its sister databases.

6.3.3  Strengths of the Software

The database and associated API/User Interface (UI) have exhibited several strengths, of which the most obvious has been scalability. With the databases exceeding 1.4 and 1.7 billion SNP/SNV observations for the hg18 and hg19 reference genomes respectively, the size of the database is dramatically larger than most traditional transactional databases. However, the database remains usable and continues to perform the functions it was designed for in a timely manner. Additionally, the ability to access the information in many different ways through the API continues to be useful, enabling many different hypotheses to be tested on the data, which would otherwise be a major challenge with a less well designed schema. These strengths continue to make the database a valuable resource. In addition, the database's API has benefited from four years of work, resulting in functions that are well tested and able to handle a variety of tasks robustly. Due to the challenges of working with genomic annotations, which often contain a lot of noise and errors, the ability to produce genome-wide results incorporating annotations on the fly is a major contribution.
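The design ideas summarized in section 6.3.1 (observations kept separate from coordinates, and annotations kept separate from both) can be illustrated with a small relational sketch. This is not the VDB schema: the table and column names are hypothetical, SQLite stands in for postgres so that the example is self-contained, and the second library's read counts are placeholder values. Only the SA401 observation (9 of 22 reads at chr12:122384668, annotated as S264C in SBNO1) is taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: coordinates, observations and annotations are stored apart.
    CREATE TABLE library (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE position (
        id         INTEGER PRIMARY KEY,
        chromosome TEXT NOT NULL,
        coordinate INTEGER NOT NULL,
        UNIQUE (chromosome, coordinate)
    );
    CREATE TABLE observation (
        position_id   INTEGER NOT NULL REFERENCES position(id),
        library_id    INTEGER NOT NULL REFERENCES library(id),
        variant_reads INTEGER NOT NULL,
        total_reads   INTEGER NOT NULL
    );
    CREATE TABLE annotation (
        position_id    INTEGER NOT NULL REFERENCES position(id),
        gene           TEXT NOT NULL,
        protein_change TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO library (id, name) VALUES (1, 'SA401')")
conn.execute("INSERT INTO library (id, name) VALUES (2, 'hypothetical_library')")
conn.execute(
    "INSERT INTO position (id, chromosome, coordinate) VALUES (1, 'chr12', 122384668)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        (1, 1, 9, 22),  # SA401: 9 variant reads of 22, as reported in the text
        (1, 2, 1, 50),  # placeholder values for a single-read observation
    ],
)
conn.execute("INSERT INTO annotation VALUES (1, 'SBNO1', 'S264C')")


def supported_variants(connection, min_fraction):
    """Count libraries per annotated position whose variant read fraction passes a cutoff."""
    return connection.execute(
        """
        SELECT p.chromosome, p.coordinate, a.gene, a.protein_change, COUNT(*)
        FROM observation AS o
        JOIN position AS p ON p.id = o.position_id
        JOIN annotation AS a ON a.position_id = p.id
        WHERE CAST(o.variant_reads AS REAL) / o.total_reads >= ?
        GROUP BY p.id, a.gene, a.protein_change
        ORDER BY p.chromosome, p.coordinate
        """,
        (min_fraction,),
    ).fetchall()


before = supported_variants(conn, 0.2)

# Replace the annotation set without touching any stored observation.
conn.execute("DELETE FROM annotation")
conn.execute("INSERT INTO annotation VALUES (1, 'SBNO1', 'S264C (re-annotated)')")
after = supported_variants(conn, 0.2)
observation_count = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]

print(before)
print(after)
print(observation_count)
```

Because annotations are joined at query time, swapping the annotation table changes the labels in the result while the observation rows, and the counts derived from them, stay as they were.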
6.3.4  Limitations of the Software

There are several limitations to the variation database, of which the most important is that it is not infinitely scalable. The ability to scan billions of records is an impressive demonstration of the scalability of the postgres infrastructure upon which it is built; however, there are limits to the extent to which the database may be extended. Most of those limitations are dependent upon the hardware hosting the database, and so the ability to find more powerful machines may alleviate them. The simple question will be which technology develops more rapidly: the sequencing technology used to generate the results, or the computing technology used to process and store them. To date, the sequencing technology has dramatically outstripped Moore's Law, which has governed the pace of computer processing power for the past 40 years. It may one day be beneficial to move away from a relational database to one that better supports the use of large data sets, or perhaps is able to divide the data more efficiently across a network of computers to distribute the load across more hardware. A second important limitation of the VDB is the amount of knowledge and experience required to make use of the data efficiently. A significant amount of query optimization is required to be able to use the database efficiently, and that experience has been utilized to create the repertoire of queries available in the API. Unfortunately, extending the API to include novel functionality or to investigate new ideas to exploit the data will likely present issues to developers unfamiliar with relational databases and the challenges of dealing with large data sets. Finally, despite the open source schema, UI and API, which have been published, the adoption of the schema by users outside of the BCGSC has been limited.
At the current time, there are a few users of the database running external to the BCGSC, but no other developers have stepped forward to participate in the project.

6.3.5  Potential Applications of the Software

As described in chapter 3, the VDB has the ability to perform a diverse range of queries, granting researchers instant access to a plethora of information about genomic variations. This information can be as granular as a single base pair, a single gene or a single chromosome, or as coarse as a single sequencing experiment or patient. This ability to investigate at different scales encourages the users of the database to formulate a wide range of questions and to explore the data from a wide range of perspectives. As the volume of data stored increases, so too will the ability of the database to answer questions about the variations in the wider population. Equally important, the growing size of the database will increase the ability to distinguish between cancer drivers and passengers, as well as to separate cancer-associated variations from polymorphisms. In each case, the database will continue to provide increasingly comprehensive data that will enable scientists to gain insight into the biology underlying the human genome. The most important element that will determine the impact of the database will be the ability to ask intelligent questions of it. This was shown clearly in section 3.6.4, where asking a focused question enabled the researchers to identify a useful pattern in the data that could not be discerned from studying the variations in the gene alone. The ability to engage in fruitful collaborations with biologists and researchers with biological questions will determine the future uses of the database.

6.3.6  Future Directions

The VDB is still evolving, expanding to meet the requirements of the BCGSC for storage and analysis of variations. Some of the future developments have been described in section 3.7.
However, there are other opportunities to enhance the usability of the database. The easiest way to improve the database would be to replace its UI. It currently presents a simple command line interface to the user, requiring familiarity with Java commands. This could be augmented with a graphical user interface to make user interaction easier. On the back end, the database is currently benefiting from new hardware and the implementation of postgres replication to provide separate instances for adding data to the database and for retrieving data from it. This scheme separates operations that lock tables from those that do not. However, it may be worth investigating alternative arrangements that would provide more efficient access to the data, including a round-robin setup where a single machine becomes the front end of the database, balancing the queries across multiple machines. Identifying more efficient ways to direct and control access to the database may be a fruitful place to search for added performance as the volume of data stored continues to increase.

6.4  Mammary Ductal Carcinoma Cell Lines

The results presented in chapter 4 cover a wide range of attributes of the mammary ductal carcinoma cell lines and their matched normals. We have demonstrated that the quality of the libraries is appropriate using the dbSNP concordance scores and were able to compare the transition and transversion ratios of the cell lines, indicating the presence of RNA-editing, as well as to contrast the ability of RNA-Seq and exome capture to report on exon, intron and Untranslated Region (UTR) SNV ratios. Additionally, we were able to remove polymorphisms and germline variants where possible using the VDB to focus on a small number of candidate driver mutations.
We used the number of similarities and differences between the genes containing these variations to compare the relationships between the cell lines, both of which provided the same results, with triple negative cell lines HCC38 and HCC1395 sharing the most common variations, but sharing the fewest other variations with the other cell lines. HCC70, whose PR status is indeterminate, shared the least in common with other cell lines. From the filtered list of variants we were also able to report a number of genes with high confidence non-synonymous recurrent variations. Those variations could be compared to the full lists of known mutations for the same cell lines available in COSMIC. Although the origin of the variations reported was not clear, there were significant overlaps between those called via our pipeline and those in the COSMIC database. For three of the four cell lines with matched normals, our high confidence variations had a 68-83% correspondence rate, while for HCC38, only 23% of the high confidence variations agreed with those in the COSMIC database. However, at the time of the assay, the only variations available in the COSMIC database were from a SNP array (SNP6.0) and as such, it is highly likely that the missing variations were simply not included on the array. Using the list of high confidence non-synonymous variations, we were able to ask which genes are recurrent across the four cell lines. Not surprisingly, given the vast array of mechanisms by which genes can be activated or deactivated, the list of genes with recurrent variations contained only 8 genes, among which was SBNO1, discussed in chapter 5. Each of the high confidence variations observed was validated either by confirmation with sequencing results uploaded to COSMIC, or through Sanger sequencing done at the BCGSC.
Seven of the eight genes for which the high confidence variations were observed were also found to be associated with cancer genes or known to be involved in cancers in one capacity or another. Each one represents an interesting angle for future research. We also specifically investigated the breast cancer 1, early onset (BRCA1) and breast cancer 2, early onset (BRCA2) genes, identifying that the donor patient for the HCC1395 and HCC1395BL cell lines had a germline BRCA1 truncating mutation at residue R1751, a common founder mutation in Greek populations. This conclusion fits well with the known ancestry of the cell line, providing an interesting insight into the ability to infer details of a genome's origin using only the SNPs observed.

6.4.1  Contributions

The contributions of this segment of the research are mainly in the development of the bioinformatics tools described in chapter 3 and the identification of recurrent variations from which the SBNO1 gene was identified, providing the starting point for the data presented in chapter 5. The methods pioneered in chapter 4 for the investigation of the mammary ductal carcinoma cell lines are now being applied to other samples, including both research and clinical samples in pilot projects. However, the data itself is not without value. Researchers who work with mammary ductal carcinoma cell lines will benefit from additional knowledge about the genomic and transcriptomic data being released. As the majority of breast cancer research is performed on cell lines, releasing this information can only serve to improve the understanding of the tools upon which researchers are basing their work.[168]

6.4.2  Who is Using the Results?

Although the data generated from sequencing these cell lines has not yet been released, there may be interest from resources like COSMIC that currently store variant lists for these cell lines. 
However, several of the cell lines sequenced for these experiments were published in Stephens et al. [170] using the Illumina GA II sequencing platform with 65,000,000 paired end reads from each cancer cell line and have already been incorporated into the COSMIC database. The data here may provide further verification and a complementary analysis to the published data set already available.

6.4.3  Limitations of the Cell Lines Studied

There are several major limitations inherent in the data presented in chapter 4. The most important are the limitations imposed by the sequencing methods. The use of exome capture biases the genomic variations towards those in known regions included in the probe sets employed, while the use of RNA-Seq biases against genes that are expressed weakly or not at all. Thus, comparing the RNA-derived and DNA-derived variations is often somewhat misleading. In this case, we used the DNA-derived samples solely to filter out germline variations, which avoids many of these biases, while employing the VDB to supplement the data set used for filtering common variants that were not available in the exome capture data. The other major limitation of the study was the limited number of samples and matched normal pairs. Unfortunately, this trade-off was necessary as we exchanged breadth for depth of sequencing. In the end, although it was difficult to identify recurrent variations, the depth of sequencing allowed for other facets of the data to be explored. However, the lack of matched normal cell lines for four of the mammary ductal carcinoma cell lines proved to be a detriment, as it was impossible to identify somatic variations. Unfortunately, matched normals were simply not created for these cell lines when they were originally obtained and no further ductal carcinoma cell lines are available at the ATCC. 
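The filtering strategy described in this section, and the reason a missing matched normal prevents somatic variations from being identified, can be illustrated with a short sketch. This is a simplified illustration under stated assumptions: variants are represented as hypothetical (chromosome, position, alternate base) tuples, the function name is invented for this example, and a plain set of common variants stands in for a lookup against a resource such as the VDB.

```python
def candidate_variants(tumour, matched_normal=None, common_variants=frozenset()):
    """Filter tumour variants against common variants and, if present, a matched normal.

    Returns (candidates, germline_filtered). Without a matched normal,
    germline variants private to the donor cannot be removed, so the
    remaining candidates cannot be labelled somatic.
    """
    candidates = set(tumour) - set(common_variants)
    if matched_normal is None:
        return candidates, False
    return candidates - set(matched_normal), True


# Hypothetical toy data: (chromosome, position, alternate base).
tumour = {("chr1", 100, "T"), ("chr2", 200, "G"), ("chr3", 300, "C")}
normal = {("chr2", 200, "G")}   # germline variant private to the donor
common = {("chr3", 300, "C")}   # polymorphism seen in other samples

with_normal = candidate_variants(tumour, normal, common)
without_normal = candidate_variants(tumour, None, common)
```

In the second call the donor-specific germline variant survives the filter, which is the situation faced for the four cell lines that lack matched normals.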
Furthermore, the “non-cancer” matched normal cell lines used in this assay are derived from B-cells transformed with the Epstein-Barr Virus (EBV), a process that causes immortalization through various interactions, some of which are not well understood. Another poorly understood property of the normals arises from their B-cell heritage; B-cell derived cells are resistant to transfection with siRNA vectors.[215] This can complicate attempts to perform knockdowns comparing the cancer-derived cell lines with the non-cancer-derived cell lines. Despite acting as a control, these cells are not without their own deviations from the germline of the individual from whom they were taken, as EBV promotes genetic instability.[216] Furthermore, the mechanism by which EBV interacts with its host cell disrupts the normal operation of the cell and alters the regulation of key genes involved in cell death, replication and core signalling pathways.[217] It is also expected that disruptions in regular RNA expression patterns are occurring, both through the interaction of the viral transcription factors and through the expression of microRNA-affected pathways.[218] Thus, matched normal/diseased cell lines are good model systems, but results taken from their comparison may not accurately reflect the physiology of the cancer or normal cells in their natural environment. Moreover, it is also important to recall that the cancer cell and normal cell have unique and different lineages, and in the case of the sets used here, are from breast tumour and B-cell tissues respectively, further complicating the analysis of genes which are likely to be expressed in different patterns in the two cell types. Another complicating issue for using cell lines of matched normal and cancer tissues is clonality. 
Although this issue is faced in many other experiments, cancer cell lines are already prone to rapid mutation and chromosomal instability, which has been demonstrated for several of the cell lines used here.[219]

6.4.4  Future Directions

There are many facets of research on the mammary ductal carcinoma cell lines studied here that remain to be investigated. The evaluation of INDELs for the cell lines failed to discover any significant rearrangements that would potentially act as driver mutations, and of those that were passed into the verification phase, only a small number were confirmed. Thus, even at the basic level, there is a significant amount of work that could be done to identify and make sense of INDELs. Furthermore, only a small number of targets and pathways of interest could be investigated thoroughly. While one such lead was explored in chapter 5, there remain seven other recurrent variations upon which little work has been done. Unfortunately, due to the amount of time and effort required to thoroughly investigate a single gene in the literature and perform the requisite verification on the interaction partners in any pathway discovered, these leads will be passed on to other investigators to follow up. Several aspects of the data were only investigated superficially, including fusion and alternate splicing events, reported in appendix D. While no significant recurrent events were found, it is entirely possible that other events observed could play a major role in the transition of the cells into their cancer state. Unfortunately, chromosomal rearrangements outside of gene fusions could not be studied, as the genome of the cell lines was not sequenced in its entirety, and exome capture data provides insufficient information to make this possible. 
However, as discussed earlier, many of the cell lines included in this study have been investigated for rearrangements elsewhere.[170] Although the field of bioinformatics has long been grappling with the transition from studying genes to studying pathways, understanding how the cells operate still requires that we understand pathways and information flow (signalling) in the cell. Without this understanding, it will be impossible to make fully informed decisions about which drugs to use, as we cannot determine which pathway a given gene target disrupts. Thus, the transition from single-gene targets to network biology will be a necessary step towards the arrival and the success of personalized medicine.

6.5  SBNO1 and EGF

Integration of multiple types of data, including SNVs, INDELs, fusions and expression data obtainable from next-generation sequencing, provides opportunities to identify novel pathways employed by cancers for circumventing constraints on cellular behaviour. It is of clear clinical importance to move beyond simply identifying commonly mutated genes or common fusions in order to identify the bottlenecks in cellular signalling. By employing such a strategy, we have been able to identify several genes, such as SBNO1, that could play a role in the oncogenic process. Genomic instability is one of the most prominent molecular traits common to cancers, fuelling the ability of the cancer cells to adapt and evade therapies to which they are subjected. The better we understand the pathways involved in cancer growth and survival, the better able we will be to identify the best targeted drugs for each patient, whether they are selected based upon individual markers or whole genome sequencing. In this project, we have identified SBNO1 as a good target for future drug therapies, based upon the central role it plays in the expression of EGF mRNA. 
In seven of the eight cell lines selected, the cancer has made several alterations to gene expression and, presumably, protein behaviour in order to activate this particular pathway. Examples include the expression of NOTCH3, JAG2, BARX2, MEIS1 and PBX1, all of which feed into the same EGF mRNA expression pathway and associate with oncogenes or are oncogenes themselves. While the usefulness of this pathway for therapeutic intervention has yet to be demonstrated, the underlying relationships in this pathway are all independently well understood and verified.

6.5.1  Summary of Results Presented

Building on the inter-se analysis of the eight mammary ductal carcinomas investigated in chapter 4, we chose to focus on the SBNO1 gene, which was observed to have cancer specific mutations in two of the four cell lines with matched normals, as well as one of the cell lines that does not have a matched normal. This protein led us to investigate its interaction partners and their expression patterns. As the investigation proceeded, several clear relationships became apparent: SBNO1 plays a role in the expression of EGF mRNA, which can be used to trigger the growth of EGFR+ cell lines. SBNO1 is under control of the MAPJD protein, which is a key regulator of several well known oncogenic pathways. Notch signalling is used to regulate the MAPJD protein through a RBPJ/MYC dependent pathway. JAG2 uses a similar Notch/RBPJ pathway to regulate the GATA genes, which are in turn used to control metastasis and other oncogenic pathways. Finally, NOTCH3 specifically regulates the EGF mRNA expression pathway through a PBX1/MEIS1/HOX protein complex that interacts with the polycomb complex. With each connection of the pathway diagram independently confirmed, this pathway ties together many of the common functions required for oncogenic behaviour, including metastasis, cell growth and a diverse set of oncogenes. 
Although it does not attempt to explain the relationship between all oncogenes, it is a surprisingly concise chart that links many diverse oncogenes into a single pathway. Furthermore, the pathway is built around genes that exhibit validated SNVs or INDELs or show cancer specific expression or enrichment. This has the effect of enriching the pathway for genes that may be ideal targets for future therapeutic intervention and of narrowing down those elements of the pathway that are likely most relevant for the mammary ductal carcinoma cell lines studied. In addition to the relationships above, the pathway also includes several important features that are specific to, or well characterized properties of, breast cancer, including the overexpression of the NOTCH3 gene in breast cancer (as well as the related ovarian cancer[207]), the overexpression of the JAG2 controlled GATA genes and the existence of TP53 gene disruptions. This provides some measure of confirmation that the pathways identified as important are likely common to other breast cancers as well, and may not be limited to the cell lines studied. Not all of the cell lines studied exhibited the same properties. As described in section 5.2.5, the HCC1500 cell line stands apart from the rest with its limited number of variations in this pathway, suggesting that while it may utilize the same pathway (as indicated by the up-regulation of the same genes as the other mammary ductal carcinoma cell lines), it likely uses a different mechanism to achieve the disruption of regular gene patterns indicated here. This would fit well with the observation that HCC1500 is both uniquely ER+ and EGFR- among the cell lines studied. It is also worth noting that the pathway itself is suggestive of a set of recurrent mechanisms by which mammary ductal carcinomas operate. 
The shared patterns of overexpression and gene disruption are strong indications of core pathways that are common across all of the mammary ductal carcinoma cell lines studied. Although much larger samples would be required to determine if this pattern is accurate, the complete agreement among the eight cancer derived cell lines and the supporting lack of expression observed in all four of the B-cell derived normals is a strong indication that the effects observed are not coincidental.

6.5.2  Contributions Made by Studying Breast Cancer Cell Lines

The results presented here provide an opportunity to gain insight into how the investigated cell lines are functioning through the use of expression data, as well as INDELs and SNVs. One of the most important components is the assembly of a putative pathway that incorporates and supports the information gathered through the sequencing of the mammary ductal carcinoma cell lines. This pathway also hints at several genes that play central roles and might become ideal targets for future drug target studies, such as JAG2, SBNO1 and RBPJ. Each of these genes appears to play a crucial role in regulating downstream genes that appear to promote oncogenic behaviours. Whether these genes prove to be useful weaknesses in mammary ductal carcinomas, however, is not certain.

6.5.3  Strengths of the Information Discussed

The pathway presented is based upon three separate elements: a series of verified SNVs present in the cell lines investigated, clear signals indicating genes that are turned on or off in the cancer-derived cell lines but not the normal cell lines (i.e. differential expression) and peer-reviewed literature supporting the relations and interactions between the proteins cited. All three of these elements can be used to gain insight into the putative mechanisms by which the cancer-derived cell lines differ from their normal-derived matched samples. 
Independently, each gene turned on specifically in cancer cells has the potential to become an interesting target for the development of novel therapeutic approaches. In this thesis, a more integrative approach has been used, combining a series of interesting putative targets to create an explanation for the flow of information and the inclusion of central pathways.

6.5.4  Limitations of the Information Discussed

The limitations on the information presented in chapter 5 are similar to those presented for chapter 4, including the limitations of sample size, lack of matched normals for four of the cell lines and the inherent difficulty of comparing RNA-Seq data with exome capture data. In addition, several other challenges exist. One of the most vexing is that many of the relationships discussed between gene products were discovered and characterized in model organisms. Although not all of the protein-protein interactions are limited to other animals, much of the research done was performed using systems that may not accurately reflect the in vivo behaviour of the proteins in human tissue. Thus, the model presented must be considered a hypothesis until more work can be done to demonstrate that it is, in fact, accurate. Finally, as discussed in section 6.4.3, the normal cell lines used in this experiment are still EBV-transformed B-cells, making them somewhat deviant from a true normal for each patient in terms of gene expression and genomic instability. However, as the mechanisms of EBV are relatively well understood, it appears that the pathways identified do not overlap with the ones controlled by EBV-expressed proteins.

6.5.5  Future Directions

Several important experiments lie ahead, building on the work described here. The first is to demonstrate that the pathways and variations observed here are common in mammary ductal carcinoma tissues and are not constrained to cell lines. 
To validate the expression data, it may be possible to identify existing array data that could be explored to confirm the expression patterns suggested by the pathway. However, to validate the variations on a larger sample of mammary ductal carcinomas would require a concerted effort to collect and sequence a larger sample of tissues, as such a data set is unlikely to already exist in any other publicly available repository. Such a project may be grounds for future collaborations with labs that study mammary ductal carcinomas and have already amassed a large tissue repository that could be screened for recurrent variations using custom SNP arrays, bar-coded multiplexing for NGS or other common screening processes. If the results are consistent across ductal carcinoma tissues, it could provide a classifier for distinguishing the specific subtype from other mammary tumours at the molecular level, with the potential to become a diagnostic tool. Furthermore, this research suggests several novel drug targets that are worthy of investigation, including SBNO1, EGF and JAG2, which appear to be highly expressed in the cancer-derived cells investigated. With both NOTCH3 and JAG2 being expressed on the cell surface, novel classes of bivalent antibodies that target cells expressing a combination of proteins may be a viable path for future drug development for the treatment of mammary ductal carcinomas. Finally, this project demonstrates the successful merger of a diverse collection of samples, sequenced using RNA-Seq and exome capture technologies, to generate insight into a specific cancer subtype. The methods employed here could provide a model for future research into cancers, incorporating a broad range of information types derived from NGS platforms.

Bibliography

[1] A. P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge, and S. J. Jones. “FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology”. 
In: Bioinformatics 24 (Aug. 2008), pp. 1729–1730 (cit. on pp. iii, 8, 83, 199). [2] A. P. Fejes and S. J. Jones. “Chip-Seq: Mapping of Protein-DNA Interactions”. In: Next-Generation Genome Sequencing: Towards Personalized Medicine. Ed. by Michal Janitz. Wiley, John & Sons, November 2008 (cit. on p. iii). [3] V. Boeva, D. Surdez, N. Guillon, F. Tirode, A. P. Fejes, O. Delattre, and E. Barillot. “De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis”. In: Nucleic Acids Res. 38 (June 2010), e126 (cit. on pp. iv, 84, 85, 199, 252). [4] A. P. Fejes, A. H. Khodabakhshi, I. Birol, and S. J. Jones. “Human variation database: an open-source database template for genomic discovery”. In: Bioinformatics 27 (Apr. 2011), pp. 1155–1156 (cit. on pp. iv, 89). [5] A. Heravi-Moussavi, M. S. Anglesio, S. W. Cheng, J. Senz, W. Yang, L. Prentice, A. P. Fejes, C. Chow, A. Tone, S. E. Kalloger, N. Hamel, A. Roth, G. Ha, A. N. Wan, S. Maines-Bandiera, C. Salamanca, B. Pasini, B. A. Clarke, A. F. Lee, C. H. Lee, C. Zhao, R. H. Young, S. A. Aparicio, P. H. Sorensen, M. M. Woo, N. Boyd, S. J. Jones, M. Hirst, M. A. Marra, B. Gilks, S. P. Shah, W. D. Foulkes, G. B. Morin, and D. G. Huntsman. “Recurrent Somatic DICER1 Mutations in Nonepithelial Ovarian Cancers”. In: N Engl J Med (Dec. 2011) (cit. on pp. iv, 122). [6] C. Greenman, P. Stephens, R. Smith, G. L. Dalgliesh, C. Hunter, G. Bignell, H. Davies, J. Teague, A. Butler, C. Stevens, S. Edkins, S. O’Meara, I. Vastrik, E. E. Schmidt, T. Avis, S. Barthorpe, G. Bhamra, 216  G. Buck, B. Choudhury, J. Clements, J. Cole, E. Dicks, S. Forbes, K. Gray, K. Halliday, R. Harrison, K. Hills, J. Hinton, A. Jenkinson, D. Jones, A. Menzies, T. Mironenko, J. Perry, K. Raine, D. Richardson, R. Shepherd, A. Small, C. Tofts, J. Varian, T. Webb, S. West, S. Widaa, A. Yates, D. P. Cahill, D. N. Louis, P. Goldstraw, A. G. Nicholson, F. Brasseur, L. Looijenga, B. L. Weber, Y. E. Chiew, A. 
DeFazio, M. F. Greaves, A. R. Green, P. Campbell, E. Birney, D. F. Easton, G. Chenevix-Trench, M. H. Tan, S. K. Khoo, B. T. Teh, S. T. Yuen, S. Y. Leung, R. Wooster, P. A. Futreal, and M. R. Stratton. “Patterns of somatic mutation in human cancer genomes”. In: Nature 446 (Mar. 2007), pp. 153–158 (cit. on p. 3). [7] A. M. Dunning, C. S. Healey, P. D. Pharoah, M. D. Teare, B. A. Ponder, and D. F. Easton. “A systematic review of genetic polymorphisms and breast cancer risk”. In: Cancer Epidemiol. Biomarkers Prev. 8 (Oct. 1999), pp. 843–854 (cit. on p. 3). [8] S. P. Shah, R. D. Morin, J. Khattra, L. Prentice, T. Pugh, A. Burleigh, A. Delaney, K. Gelmon, R. Guliany, J. Senz, C. Steidl, R. A. Holt, S. Jones, M. Sun, G. Leung, R. Moore, T. Severson, G. A. Taylor, A. E. Teschendorff, K. Tse, G. Turashvili, R. Varhol, R. L. Warren, P. Watson, Y. Zhao, C. Caldas, D. Huntsman, M. Hirst, M. A. Marra, and S. Aparicio. “Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution”. In: Nature 461 (Oct. 2009), pp. 809–813 (cit. on pp. 3, 52). [9] N. Barnes. “Publish your computer code: it is good enough”. In: Nature 467.7317 (Oct. 2010), p. 753 (cit. on p. 8). [10] R. D. Peng. “Reproducible research in computational science”. In: Science 334.6060 (Dec. 2011), pp. 1226–1227 (cit. on p. 8). [11] N. Homer, B. Merriman, and S. F. Nelson. “BFAST: an alignment tool for large scale genome resequencing”. In: PLoS ONE 4 (2009), e7767 (cit. on p. 8). [12] H. Li and R. Durbin. “Fast and accurate short read alignment with Burrows-Wheeler transform”. In: Bioinformatics 25 (July 2009), pp. 1754–1760 (cit. on pp. 8, 33). [13] H. Li and R. Durbin. “Fast and accurate long-read alignment with Burrows-Wheeler transform”. In: Bioinformatics 26 (Mar. 2010), pp. 589–595 (cit. on pp. 8, 134, 165). 217  [14] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. 
“Model-based analysis of ChIP-Seq (MACS)”. In: Genome Biol. 9 (2008), R137 (cit. on pp. 8, 64, 73). [15] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. “ABySS: a parallel assembler for short read sequence data”. In: Genome Res. 19 (June 2009), pp. 1117–1123 (cit. on p. 8). [16] D. R. Zerbino and E. Birney. “Velvet: algorithms for de novo short read assembly using de Bruijn graphs”. In: Genome Res. 18 (May 2008), pp. 821–829 (cit. on p. 8). [17] R. Goya, M. G. Sun, R. D. Morin, G. Leung, G. Ha, K. C. Wiegand, J. Senz, A. Crisan, M. A. Marra, M. Hirst, D. Huntsman, K. P. Murphy, S. Aparicio, and S. P. Shah. “SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors”. In: Bioinformatics 26 (Mar. 2010), pp. 730–736 (cit. on pp. 8, 34, 90, 134, 165). [18] R. Li, Y. Li, K. Kristiansen, and J. Wang. “SOAP: short oligonucleotide alignment program”. In: Bioinformatics 24 (Mar. 2008), pp. 713–714 (cit. on p. 8). [19] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, et al. “The sequence of the human genome”. In: Science 291 (Feb. 2001), pp. 1304–1351 (cit. on pp. 9, 125). [20] M. 
Ptashne and A. Gann. “Transcriptional activation by recruitment”. In: Nature 386 (Apr. 1997), pp. 569–577 (cit. on p. 9).  218  [21] W. Reik, I. Romer, S. C. Barton, M. A. Surani, S. K. Howlett, and J. Klose. “Adult phenotype in the mouse can be affected by epigenetic events in the early embryo”. In: Development 119 (Nov. 1993), pp. 933–942 (cit. on p. 9). [22] P. Cheung and P. Lau. “Epigenetic regulation by histone methylation and histone variants”. In: Mol. Endocrinol. 19 (Mar. 2005), pp. 563–573 (cit. on p. 9). [23] E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy, M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, et al. “Identification and analysis of functional elements in 1 percent of the human genome by the ENCODE pilot project”. In: Nature 447 (June 2007), pp. 799–816 (cit. on p. 10). [24] D. Brutlag, C. Schlehuber, and J. Bonner. “Properties of formaldehyde-treated nucleohistone”. In: Biochemistry 8 (Aug. 1969), pp. 3214–3218 (cit. on p. 10). [25] B. D. Strahl, R. Ohba, R. G. Cook, and C. D. Allis. “Methylation of histone H3 at lysine 4 is highly conserved and correlates with transcriptionally active nuclei in Tetrahymena”. In: Proc. Natl. Acad. Sci. U.S.A. 96 (Dec. 1999), pp. 14967–14972 (cit. on p. 10). [26] M. Vignali, A. H. Hassan, K. E. Neely, and J. L. Workman. “ATP-dependent chromatin-remodeling complexes”. In: Mol. Cell. Biol. 20 (Mar. 2000), pp. 1899–1910 (cit. on p. 10). [27] L. Verdone, E. Agricola, M. 
Caserta, and E. Di Mauro. “Histone acetylation in gene regulation”. In: Brief Funct Genomic Proteomic 5 (Sept. 2006), pp. 209–221 (cit. on p. 10). [28] L. Elnitski, V. X. Jin, P. J. Farnham, and S. J. Jones. “Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques”. In: Genome Res. 16 (Dec. 2006), pp. 1455–1464 (cit. on p. 10). 219  [29] A. P. Wolffe. “Transcriptional regulation in the context of chromatin structure”. In: Essays Biochem. 37 (2001), pp. 45–57 (cit. on p. 10). [30] V. Orlando. “Mapping chromosomal proteins in vivo by formaldehyde-crosslinked-chromatin immunoprecipitation”. In: Trends Biochem. Sci. 25 (Mar. 2000), pp. 99–104 (cit. on p. 11). [31] B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young. “Genome-wide location and function of DNA binding proteins”. In: Science 290 (Dec. 2000), pp. 2306–2309 (cit. on pp. 11, 17). [32] M. H. Kuo and C. D. Allis. “In vivo cross-linking and immunoprecipitation for studying dynamic Protein:DNA associations in a chromatin environment”. In: Methods 19 (Nov. 1999), pp. 425–433 (cit. on pp. 11, 13). [33] H. S. Huang, A. Matevossian, Y. Jiang, and S. Akbarian. “Chromatin immunoprecipitation in postmortem brain”. In: J. Neurosci. Methods 156 (Sept. 2006), pp. 284–292 (cit. on p. 13). [34] S. Strahl-Bolsinger, A. Hecht, K. Luo, and M. Grunstein. “SIR2 and SIR4 interactions differ in core and extended telomeric heterochromatin in yeast”. In: Genes Dev. 11 (Jan. 1997), pp. 83–93 (cit. on p. 14). [35] A. Hecht, S. Strahl-Bolsinger, and M. Grunstein. “Spreading of transcriptional repressor SIR3 from telomeric heterochromatin”. In: Nature 383 (Sept. 1996), pp. 92–96 (cit. on p. 14). [36] A. S. Weinmann, S. M. Bartley, T. Zhang, M. Q. Zhang, and P. J. Farnham. “Use of chromatin immunoprecipitation to clone novel E2F target promoters”. In: Mol. 
Cell. Biol. 21 (Oct. 2001), pp. 6820–6832 (cit. on p. 14). [37] M. J. LeBaron, J. Xie, and H. Rui. “Evaluation of genome-wide chromatin library of Stat5 binding sites in human breast cancer”. In: Mol. Cancer 4 (Feb. 2005), p. 6 (cit. on pp. 14, 184). [38] A. Ahmadian, M. Ehn, and S. Hober. “Pyrosequencing: history, biochemistry and future”. In: Clin. Chim. Acta 363 (Jan. 2006), pp. 83–94 (cit. on p. 14).  220  [39] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. “Serial analysis of gene expression”. In: Science 270 (Oct. 1995), pp. 484–487 (cit. on pp. 14, 18). [40] S. Saha, A. B. Sparks, C. Rago, V. Akmaev, C. J. Wang, B. Vogelstein, K. W. Kinzler, and V. E. Velculescu. “Using the transcriptome to annotate the genome”. In: Nat. Biotechnol. 20 (May 2002), pp. 508–512 (cit. on p. 14). [41] T. Y. Roh, W. C. Ngau, K. Cui, D. Landsman, and K. Zhao. “High-resolution genome-wide mapping of histone modifications”. In: Nat. Biotechnol. 22 (Aug. 2004), pp. 1013–1016 (cit. on p. 14). [42] J. Chen and I. Sadowski. “Identification of the mismatch repair genes PMS2 and MLH1 as p53 target genes by using serial analysis of binding elements”. In: Proc. Natl. Acad. Sci. U.S.A. 102 (Mar. 2005), pp. 4813–4818 (cit. on p. 14). [43] P. Ng, C. L. Wei, W. K. Sung, K. P. Chiu, L. Lipovich, C. C. Ang, S. Gupta, A. Shahab, A. Ridwan, C. H. Wong, E. T. Liu, and Y. Ruan. “Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation”. In: Nat. Methods 2 (Feb. 2005), pp. 105–111 (cit. on p. 14). [44] C. Y. Lin, V. B. Vega, J. S. Thomsen, T. Zhang, S. L. Kong, M. Xie, K. P. Chiu, L. Lipovich, D. H. Barnett, F. Stossi, A. Yeo, J. George, V. A. Kuznetsov, Y. K. Lee, T. H. Charn, N. Palanisamy, L. D. Miller, E. Cheung, B. S. Katzenellenbogen, Y. Ruan, G. Bourque, C. L. Wei, and E. T. Liu. “Whole-genome cartography of estrogen receptor alpha binding sites”. In: PLoS Genet. 3 (June 2007), e87 (cit. on p. 14). [45] A. Barski, S. Cuddapah, K. 
Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao. “High-resolution profiling of histone methylations in the human genome”. In: Cell 129 (May 2007), pp. 823–837 (cit. on pp. 15, 19). [46] K. P. Chiu, C. H. Wong, Q. Chen, P. Ariyaratne, H. S. Ooi, C. L. Wei, W. K. Sung, and Y. Ruan. “PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data”. In: BMC Bioinformatics 7 (2006), p. 390 (cit. on p. 15). [47] A. M. Maxam and W. Gilbert. “A new method for sequencing DNA”. In: Proc. Natl. Acad. Sci. U.S.A. 74 (Feb. 1977), pp. 560–564 (cit. on p. 16). 221  [48] R. Drmanac, S. Drmanac, G. Chui, R. Diaz, A. Hou, H. Jin, P. Jin, S. Kwon, S. Lacy, B. Moeur, J. Shafto, D. Swanson, T. Ukrainczyk, C. Xu, and D. Little. “Sequencing by hybridization (SBH): advantages, achievements, and opportunities”. In: Adv. Biochem. Eng. Biotechnol. 77 (2002), pp. 75–101 (cit. on p. 16). [49] E. M. Southern. “Detection of specific sequences among DNA fragments separated by gel electrophoresis”. In: J. Mol. Biol. 98 (Nov. 1975), pp. 503–517 (cit. on p. 16). [50] E. Southern. “Southern blotting”. In: Nat Protoc 1 (2006), pp. 518–525 (cit. on p. 16). [51] B. Ren, H. Cam, Y. Takahashi, T. Volkert, J. Terragni, R. A. Young, and B. D. Dynlacht. “E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints”. In: Genes Dev. 16 (Jan. 2002), pp. 245–256 (cit. on p. 16). [52] B. E. Bernstein, A. Meissner, and E. S. Lander. “The mammalian epigenome”. In: Cell 128 (Feb. 2007), pp. 669–681 (cit. on p. 17). [53] M. J. Buck and J. D. Lieb. “ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments”. In: Genomics 83 (Mar. 2004), pp. 349–360 (cit. on p. 17). [54] S. Draghici, P. Khatri, A. C. Eklund, and Z. Szallasi. “Reliability and reproducibility issues in DNA microarray measurements”. In: Trends Genet. 22 (Feb. 2006), pp. 101–109 (cit. on p. 
17). [55] W. P. Kuo, T. K. Jenssen, A. J. Butte, L. Ohno-Machado, and I. S. Kohane. “Analysis of matched mRNA measurements from two different microarray technologies”. In: Bioinformatics 18 (Mar. 2002), pp. 405–412 (cit. on p. 17). [56] T. K. Jenssen, M. Langaas, W. P. Kuo, B. Smith-Sørensen, O. Myklebost, and E. Hovig. “Analysis of repeatability in spotted cDNA microarrays”. In: Nucleic Acids Res. 30 (July 2002), pp. 3235–3244 (cit. on p. 17). [57] V. R. Iyer, C. E. Horak, C. S. Scafe, D. Botstein, M. Snyder, and P. O. Brown. “Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF”. In: Nature 409 (Jan. 2001), pp. 533–538 (cit. on p. 17). [58] S. Fields. “Molecular biology. Site-seeing by sequencing”. In: Science 316 (June 2007), pp. 1441–1442 (cit. on p. 17). [59] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. “Genome-wide mapping of in vivo protein-DNA interactions”. In: Science 316 (June 2007), pp. 1497–1502 (cit. on pp. 17, 20, 23, 32). [60] R. J. Melamede. Automatable process for sequencing nucleotide. US. Patent number 4,863,849. U. S. Patent Office, Sept. 1989 (cit. on p. 18). [61] J. H. Leamon, W. L. Lee, K. R. Tartaro, J. R. Lanza, G. J. Sarkis, A. D. deWinter, J. Berka, M. Weiner, J. M. Rothberg, and K. L. Lohman. “A massively parallel PicoTiterPlate based platform for discrete picoliter-scale polymerase chain reactions”. In: Electrophoresis 24 (Nov. 2003), pp. 3769–3777 (cit. on p. 18). [62] S. Bennett. “Solexa Ltd”. In: Pharmacogenomics 5 (June 2004), pp. 433–438 (cit. on p. 18). [63] S. Suzuki, N. Ono, C. Furusawa, B. W. Ying, and T. Yomo. “Comparison of sequence reads obtained from three next-generation sequencing platforms”. In: PLoS ONE 6 (2011), e19534 (cit. on p. 18). [64] P. Ng, J. J. Tan, H. S. Ooi, Y. L. Lee, K. P. Chiu, M. J. Fullwood, K. G. Srinivasan, C. Perbost, L. Du, W. K. Sung, C. L. Wei, and Y. Ruan.
“Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes”. In: Nucleic Acids Res. 34 (2006), e84 (cit. on p. 19). [65] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B. Bernier, R. Varhol, A. Delaney, N. Thiessen, O. L. Griffith, A. He, M. Marra, M. Snyder, and S. Jones. “Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing”. In: Nat. Methods 4 (Aug. 2007), pp. 651–657 (cit. on pp. 20, 21, 23, 32, 73, 165). [66] Melissa J. Fullwood and Yijun Ruan. “ChIP-based methods for the identification of long-range chromatin interactions”. In: Journal of Cellular Biochemistry 107.1 (2009), pp. 30–39. ISSN: 1097-4644. DOI: 10.1002/jcb.22116. URL: http://dx.doi.org/10.1002/jcb.22116 (cit. on p. 22). [67] S. M. Reamon-Buettner and J. Borlak. “A new paradigm in toxicology and teratology: altering gene activity in the absence of DNA sequence variation”. In: Reprod. Toxicol. 24 (July 2007), pp. 20–30 (cit. on p. 22). [68] B. J. Morris. “A forkhead in the road to longevity: the molecular basis of lifespan becomes clearer”. In: J. Hypertens. 23 (July 2005), pp. 1285–1309 (cit. on p. 22). [69] A. P. Feinberg. “Phenotypic plasticity and the epigenetics of human disease”. In: Nature 447 (May 2007), pp. 433–440 (cit. on p. 22). [70] J. G. Herman and S. B. Baylin. “Promoter-region hypermethylation and gene silencing in human cancer”. In: Curr. Top. Microbiol. Immunol. 249 (2000), pp. 35–54 (cit. on pp. 22, 26, 184). [71] P. A. Jones and S. B. Baylin. “The epigenomics of cancer”. In: Cell 128 (Feb. 2007), pp. 683–692 (cit. on pp. 22, 26, 184). [72] R. D. Morin, M. Mendez-Lago, A. J. Mungall, R. Goya, K. L. Mungall, R. D. Corbett, N. A. Johnson, T. M. Severson, R. Chiu, M. Field, S. Jackman, M. Krzywinski, D. W. Scott, D. L. Trinh, J. Tamura-Wells, S. Li, M. R. Firme, S. Rogic, M. Griffith, S. Chan, O.
Yakovenko, I. M. Meyer, E. Y. Zhao, D. Smailus, M. Moksa, S. Chittaranjan, L. Rimsza, A. Brooks-Wilson, J. J. Spinelli, S. Ben-Neriah, B. Meissner, B. Woolcock, M. Boyle, H. McDonald, A. Tam, Y. Zhao, A. Delaney, T. Zeng, K. Tse, Y. Butterfield, I. Birol, R. Holt, J. Schein, D. E. Horsman, R. Moore, S. J. Jones, J. M. Connors, M. Hirst, R. D. Gascoyne, and M. A. Marra. “Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma”. In: Nature 476 (Aug. 2011), pp. 298–303 (cit. on p. 22). [73] T. A. Lasko, J. G. Bhagwat, K. H. Zou, and L. Ohno-Machado. “The use of receiver operating characteristic curves in biomedical informatics”. In: J Biomed Inform 38 (Oct. 2005), pp. 404–415 (cit. on p. 23). [74] C. Zang, D. E. Schones, C. Zeng, K. Cui, K. Zhao, and W. Peng. “A clustering approach for identification of enriched domains from histone modification ChIP-Seq data”. In: Bioinformatics 25 (Aug. 2009), pp. 1952–1958 (cit. on pp. 24, 201, 254). [75] Y. Zhang, H. Shin, J. S. Song, Y. Lei, and X. S. Liu. “Identifying positioned nucleosomes with epigenetic marks in human from ChIP-Seq”. In: BMC Genomics 9 (2008), p. 537 (cit. on pp. 24, 201). [76] C. B. Nielsen, M. Cantor, I. Dubchak, D. Gordon, and T. Wang. “Visualizing genomes: techniques and challenges”. In: Nat. Methods 7 (Mar. 2010), S5–S15 (cit. on p. 24). [77] W. Reik, W. Dean, and J. Walter. “Epigenetic reprogramming in mammalian development”. In: Science 293 (Aug. 2001), pp. 1089–1093 (cit. on p. 25). [78] S. J. Clark, J. Harrison, C. L. Paul, and M. Frommer. “High sensitivity mapping of methylated cytosines”. In: Nucleic Acids Res. 22 (Aug. 1994), pp. 2990–2997 (cit. on p. 25). [79] R. A. Harris, T. Wang, C. Coarfa, R. P. Nagarajan, C. Hong, S. L. Downey, B. E. Johnson, S. D. Fouse, A. Delaney, Y. Zhao, A. Olshen, T. Ballinger, X. Zhou, K. J. Forsberg, J. Gu, L. Echipare, H. O’Geen, R. Lister, M. Pelizzola, Y. Xi, C. B. Epstein, B. E. Bernstein, R. D. Hawkins, B. Ren, W. Y. Chung, H. Gu, C. Bock, A.
Gnirke, M. Q. Zhang, D. Haussler, J. R. Ecker, W. Li, P. J. Farnham, R. A. Waterland, A. Meissner, M. A. Marra, M. Hirst, A. Milosavljevic, and J. F. Costello. “Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications”. In: Nat. Biotechnol. 28 (Oct. 2010), pp. 1097–1105 (cit. on p. 26). [80] T. S. Mikkelsen, M. Ku, D. B. Jaffe, B. Issac, E. Lieberman, G. Giannoukos, P. Alvarez, W. Brockman, T. K. Kim, R. P. Koche, W. Lee, E. Mendenhall, A. O’Donovan, A. Presser, C. Russ, X. Xie, A. Meissner, M. Wernig, R. Jaenisch, C. Nusbaum, E. S. Lander, and B. E. Bernstein. “Genome-wide maps of chromatin state in pluripotent and lineage-committed cells”. In: Nature 448 (Aug. 2007), pp. 553–560 (cit. on p. 26). [81] M. Simonis, J. Kooren, and W. de Laat. “An evaluation of 3C-based methods to capture DNA interactions”. In: Nat. Methods 4 (Nov. 2007), pp. 895–901 (cit. on p. 26). [82] P. J. Browett, H. M. Cooke, L. M. Secker-Walker, and J. D. Norton. “Chromosome 22 breakpoints in variant Philadelphia translocations and Philadelphia-negative chronic myeloid leukemia”. In: Cancer Genet. Cytogenet. 37 (Feb. 1989), pp. 169–177 (cit. on p. 30). [83] T. F. Smith and M. S. Waterman. “Identification of common molecular subsequences”. In: J. Mol. Biol. 147 (Mar. 1981), pp. 195–197 (cit. on p. 31). [84] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. “Basic local alignment search tool”. In: J. Mol. Biol. 215 (Oct. 1990), pp. 403–410 (cit. on p. 31). [85] A. A. Schaffer, L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul. “Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements”. In: Nucleic Acids Res. 29 (July 2001), pp. 2994–3005 (cit. on p. 31). [86] E. M. Gertz, Y. K. Yu, R. Agarwala, A. A. Schaffer, and S. F. Altschul.
“Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST”. In: BMC Biol. 4 (2006), p. 41 (cit. on p. 31). [87] G.