UBC Theses and Dissertations

Mutation discovery and characterization in lymphoid neoplasms using massively parallel RNA and DNA sequencing
Morin, Ryan David
2012


MUTATION DISCOVERY AND CHARACTERIZATION IN LYMPHOID NEOPLASMS USING MASSIVELY PARALLEL RNA AND DNA SEQUENCING

by

Ryan David Morin

B.Sc., Simon Fraser University, 2003
M.Sc., The University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

January 2012

© Ryan David Morin, 2012

Abstract

New massively parallel sequencing technologies offer opportunities to profile genomes and transcriptomes for copy number variations, polymorphisms, somatic point mutations and chromosomal rearrangements, and can capture gene expression and splicing information. A suite of methods was developed to analyze both RNA-seq and whole genome/exome sequence data from malignant cells for the purpose of identifying somatic point mutations and fusion transcripts. This work reports the application of these and other tools to gain insights into the somatic mutations involved in two common classes of lymphoid malignancies, namely non-Hodgkin lymphoma and acute lymphoblastic leukemia. Analysis of multiple cases by a combination of RNA-seq, genome and exome sequencing revealed genes significantly mutated in non-Hodgkin lymphoma, including many not previously known to be mutated in these or any other cancers. These included multiple genes involved in altering the methylation or acetylation state of histones such as EZH2, MLL2, CREBBP and MEF2B, suggesting a previously unappreciated role of deregulated or altered epigenetic gene regulation in lymphomagenesis. Some of the mutated genes, such as MLL2, had clear patterns of inactivating mutations, indicating they act as tumour suppressors in NHL. Others had mutation hot spots that can be indicative of an oncogenic gain of function, and this was proven to be the case for the mutation hot spot identified in EZH2.
Analysis of acute lymphoblastic leukemia revealed both novel point mutations and fusion transcripts. The latter included fusions that potentially deregulate known oncogenes such as JAK2 and ABL1. These data may indicate new treatment options for patients with ALL and NHL and lend new insights into the molecular nature of these diseases.

Preface

I wrote Chapter 1 and Chapter 5 in their entirety and prepared all figures. Chapters 2-4 were the result of multi-author collaborations. Dr. Marra and I conceived the research described in Chapters 2 and 3 with conceptual contributions by Randy Gascoyne, Joseph Connors, Martin Hirst and Steven Jones. I developed the analytical pipeline utilized for the SNV analysis described in these chapters and in Chapter 4, with contributions from Rodrigo Goya, Sohrab Shah (alignment and SNV identification), Richard Corbett, Martin Krzywinski (visualization), Karen Mungall, Shaun Jackman, Readman Chiu, Sa Li (fusion transcripts/rearrangements), Allen Delaney, Eric Zhao and Malachi Griffith (primer design and software conceptualization), Jianghong An, and Alex Yakovenko (protein modeling). I performed all alignments and ran the pipeline on every library described in these chapters to identify the mutations that are described therein. I designed and implemented the software that produced an early list of fusions described in Chapter 4 and ultimately ran the deFuse software on all libraries to refine this list and to produce the final fusion results presented in Table 4-1 and Appendix C. Nathalie Johnson, Merrill Boyle, David Scott, Barbara Meissner and Bruce Woolcock prepared the tumour samples sequenced in Chapters 2 and 3. Members of the TARGET initiative including Charles Mullighan, Steven Hunger and Daniela Gerhard conceived of the study described in Chapter 4 and provided the samples. The libraries were all prepared and sequenced by Yongjun Zhao, Richard Moore, Martin Hirst and members of their team.
For Chapter 2, Tesa Severson and Trevor Pugh performed the resequencing experiments and manually reviewed the Sanger data. The in vitro methylation experiments described in Chapter 2 were conceived by Sam Aparicio and performed by Damian Yap and by employees of BPS Biosciences. For Chapter 3, Maria Mendez-Lago performed the bulk of the SNV validation with substantial help from Andy Mungall, Diane Trinh, Jessica Tamura-Wells, Marlo Firme, Helen McDonald, and Suganthi Chittaranjan. I prepared all figures in Chapters 2 and 3 with the exception of Figure 2-3 (prepared by Damian Yap), Figure 3-2 (prepared in conjunction with Richard Corbett), Figure 3-3 (prepared by Alex Yakovenko), Figure 3-5 (prepared in conjunction with Diane Trinh), Figure 3-6 (panels B and C prepared by Andy Mungall and Maria Mendez-Lago) and Figure 3-8 (prepared in conjunction with Martin Krzywinski). I prepared the tables and supplementary tables in Chapters 2 and 3 with the following exceptions: the p-values in Table 3-2 were computed with assistance from Rodrigo Goya and Irmtraud Meyer, and the data in Table 3-3 were prepared by Maria Mendez-Lago. For Chapter 4, Kathryn Roberts performed the validation of the fusion transcripts described, and Martin Hirst and his group performed validation experiments for the SNVs reported. Kathryn also prepared some of the figures and tables presented in this chapter, whereas I prepared Figure 4-2, Tables 4-2, 4-3 and 4-4 and Appendix C and contributed to Table 4-1. Richard Harvey and Charles Mullighan and members of their teams performed the ROSE and PAM methods, respectively, for identifying Ph-like samples and selected the cases for profiling by RNA-seq and whole genome sequencing (WGS). Chapter 1 contains sections derived from the book chapter, which I wrote, entitled “Transcriptomics in the age of ultra high-throughput sequencing” by Ryan Morin and Marco Marra, to be published in the forthcoming book “Genomic and Personalized Medicine, second edition”.
Chapter 2 was based on “Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin” published in Nature Genetics (2010) by Ryan Morin, Nathalie Johnson, Tesa Severson, Andrew Mungall, Jianghong An, Rodrigo Goya, Jessica Paul, Merrill Boyle, Bruce Woolcock, Florian Kuchenbauer, Damian Yap, R Keith Humphries, Obi Griffith, Sohrab Shah, Henry Zhu, Michelle Kimbara, Pavel Shashkin, Jean Charlot, Marianna Tcherpakov, Richard Corbett, Angela Tam, Richard Varhol, Duane Smailus, Michelle Moksa, Yongjun Zhao, Allen Delaney, Hong Qian, Inanc Birol, Jacqueline Schein, Richard Moore, Robert Holt, Doug Horsman, Joseph Connors, Steven Jones, Samuel Aparicio, Martin Hirst, Randy Gascoyne, and Marco A Marra with additions from “Somatic mutations at EZH2 Y641 act dominantly through a mechanism of selectively altered PRC2 catalytic activity, to increase H3K27 trimethylation” published in Blood by Damian Yap, Justin Chu, Tobias Berg, Matthieu Schapira, S-W Grace Cheng, Annie Moradian, Ryan Morin, Andrew Mungall, Barbara Meissner, Merrill Boyle, Victor Marquez, Marco Marra, Randy Gascoyne, R Keith Humphries, Cheryl Arrowsmith, Gregg Morin, and Samuel Aparicio.
Chapter 3 was based on “Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma” published in Nature (2011) by Ryan Morin, Maria Mendez-Lago, Andrew Mungall, Rodrigo Goya, Karen Mungall, Richard Corbett, Nathalie Johnson, Tesa Severson, Readman Chiu, Matthew Field, Shaun Jackman, Martin Krzywinski, David Scott, Diane Trinh, Jessica Tamura-Wells, Sa Li, Marlo Firme, Sanja Rogic, Malachi Griffith, Susanna Chan, Oleksandr Yakovenko, Irmtraud Meyer, Eric Zhao, Duane Smailus, Michelle Moksa, Suganthi Chittaranjan, Lisa Rimsza, Angela Brooks-Wilson, John Spinelli, Susana Ben-Neriah, Barbara Meissner, Bruce Woolcock, Merrill Boyle, Helen McDonald, Angela Tam, Yongjun Zhao, Allen Delaney, Thomas Zeng, Kane Tse, Yaron Butterfield, Inanc Birol, Rob Holt, Jacqueline Schein, Douglas Horsman, Richard Moore, Steven Jones, Joseph Connors, Martin Hirst, Randy Gascoyne, and Marco Marra.
Chapter 4 was based on a manuscript entitled “Novel genetic alterations activating kinase and cytokine receptor signaling in high risk acute lymphoblastic leukemia” (currently under review) by Ryan Morin, Kathryn Roberts, Jinghui Zhang, Martin Hirst, Yongjun Zhao, Xiaoping Su, Debbie Payne-Turner, Xiang Chen, Richard Harvey, Corynn Kasap, Chunhua Yan, Michelle Churchman, Shann-Ching Chen, Jared Becksfort, Richard Finney, David Teachey, Shannon Maude, Kane Tse, Richard Moore, Steven Jones, Karen Mungall, Inanc Birol, Michael Edmonson, Ying Hu, Kenneth E. Buetow, I-Ming Chen, William L. Carroll, Lei Wei, Jing Ma, Eric Larsen, Neil Shah, Meenakshi Devidas, Gregory Reaman, Malcolm Smith, William Evans, Stephen Paugh, Stephan Grupp, Ching-Hon Pui, Daniela Gerhard, James Downing, Cheryl Willman, Mignon Loh, Stephen Hunger, Marco Marra, Charles Mullighan and the Children's Oncology Group.
This research was approved by the UBC Research Ethics Board (REB number H05-60103).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
  1.1 Aims
  1.2 Overview of next-generation sequencing technologies
    1.2.1 454/Roche sequencing
    1.2.2 Illumina sequencing
    1.2.3 Sequencing by ligation
    1.2.4 Low-throughput rapid-turnaround sequencers
    1.2.5 Single molecule sequencing technologies
    1.2.6 Perspective on DNA sequencing technology
  1.3 Overview of the applications of Illumina sequencing
    1.3.1 RNA sequencing, goals and motivation
    1.3.2 Gene expression analysis techniques
    1.3.3 RNA-seq for gene expression profiling
    1.3.4 Targeted resequencing applications
    1.3.5 Whole genome sequencing in humans
  1.4 Computational problems and algorithmic solutions in next-generation sequencing
    1.4.1 The raw data
    1.4.2 Read mapping
    1.4.3 Read mapping for RNA-seq
    1.4.4 Perspectives on alignment of Illumina sequence data
    1.4.5 Alignment data universality
    1.4.6 Sequence assembly
    1.4.7 Variant identification
  1.5 Mutation identification in cancer
    1.5.1 Methods for identifying gross chromosomal abnormalities
    1.5.2 The search for somatic point mutations
    1.5.3 From candidate genes to global mutational surveys
    1.5.4 Cancer genome sequencing
    1.5.5 Approaches for identifying driver mutations and cancer genes
    1.5.6 Conclusions
  1.6 Lymphoid neoplasms: leukemias and lymphomas
    1.6.1 The molecular nature of non Hodgkin lymphomas
    1.6.2 Acute lymphoblastic leukemia
  1.7 Thesis objectives and chapter overview
Chapter 2: Somatic mutations altering EZH2 (Y641) in follicular and diffuse large B-cell lymphomas of germinal-center origin
  2.1 Introduction
  2.2 Results
    2.2.1 Discovery of the Y641 mutation with genome sequencing
    2.2.2 Determining recurrence of the mutation with RNA-seq
    2.2.3 Biochemical analysis of Y641 mutation in vitro
    2.2.4 Further characterization of Y641 mutation in vivo and in vitro
    2.2.5 Substrate specificity of SET domain proteins and the Phe/Tyr switch
    2.2.6 Measuring altered substrate specificity of Y641 mutant EZH2 protein
  2.3 Discussion
  2.4 Methods
    2.4.1 Sample acquisition
    2.4.2 Preparation and sequencing of Illumina libraries
    2.4.3 Targeted ultra-deep resequencing using read indexing
    2.4.4 SNV analysis of tumour DNA and RNA sequence
    2.4.5 Amplicon sequencing for SNV identification and Sanger sequence validation
    2.4.6 Computational modeling of EZH2 wild type and mutant SET domain
    2.4.7 In vitro EZH2 H3K27 tri-methylation assay
    2.4.8 Revised tri-methylation assay
    2.4.9 Cell lines
    2.4.10 Cell-of-origin (COO) determination
Chapter 3: Frequent mutation of histone-modifying genes in NHL
  3.1 Introduction
  3.2 Results
    3.2.1 Identification of genes recurrently mutated in B-cell NHL
    3.2.2 Identification of potential driver mutations
    3.2.3 Evidence for selection of inactivating changes identifies novel mutated genes
    3.2.4 Inactivating MLL2 mutations
    3.2.5 Recurrent point mutations affecting the MADS box and MEF2 domains of MEF2B
  3.3 Discussion
  3.4 Conclusions
  3.5 Methods
    3.5.1 Sample acquisition
    3.5.2 Cell lines
    3.5.3 Preparation and sequencing of RNA-seq, genome and exome Illumina libraries
    3.5.4 Alignment-based analysis of tumour DNA/RNA sequence for point mutations
    3.5.5 Validation of candidate somatic mutations using Illumina sequencing
    3.5.6 Validation of cSNVs by Sanger sequencing
    3.5.7 Detection of enrichment of functional gene classes within frequently mutated genes
    3.5.8 Detection of mutations with imbalanced/skewed expression
    3.5.9 Calculation of selective pressure
    3.5.10 Identifying genes with mutation hot spots
    3.5.11 Analysis of aligned genomic DNA sequence for copy number alterations and LOH
    3.5.12 Assembly-based analysis of tumour DNA and RNA sequence
    3.5.13 Cell of origin subtype assignment using RNA-seq expression values
    3.5.14 Targeted MEF2B resequencing using biotinylated RNA capture probes
    3.5.15 Targeted MLL2 resequencing using long-range PCR and sample indexing
    3.5.16 Re-confirmation of MLL2 mutations in patient samples and DLBCL cell lines
    3.5.17 Targeted resequencing of MEF2B coding exons 1 and 2
    3.5.18 Identification of structural aberrations involving BCL2 and BCL6
    3.5.19 Analysis of impact of COO and mutation status on outcome in DLBCL
Chapter 4: Novel chromosomal rearrangements in high-risk ALL
  4.1 Introduction
  4.2 Results
    4.2.1 Novel chromosomal rearrangements in Ph-like ALL
    4.2.2 Recurrence evaluation
    4.2.3 Deletions and sequence mutations in Ph-like ALL
  4.3 Discussion
  4.4 Conclusions
  4.5 Methods
    4.5.1 Patients and samples
    4.5.2 Prediction analysis of microarrays (PAM)
    4.5.3 RNA-seq library preparation and sequencing
    4.5.4 Whole genome shotgun library preparation and sequencing
    4.5.5 Alignment-based analysis of tumour DNA and RNA sequence for somatic point mutations and fusion transcripts
    4.5.6 Validation of candidate somatic mutations identified in genomes and exomes
Chapter 5: Conclusions
  5.1 Tools for routine analysis of DNA sequence from cancers
  5.2 Towards affordable genome sequencing
  5.3 Pharmacogenomics and personalized cancer treatment
  5.4 Epigenetics in cancer: Overview and future work
  5.5 Summary and future directions
References
Appendices
  Appendix A Supplementary data from the analysis of FL Sample A and EZH2
    A.1 Additional novel SNVs identified in FL Sample A by RNA-seq
    A.2 All EZH2 mutants detected by Sanger sequencing in FL and DLBCL
    A.3 Analysis of samples for EZH2 mutations using ultra-deep targeted resequencing
  Appendix B Supplementary data from the analysis of 127 NHL cases by RNA-seq and genome/exome sequencing
    B.1 Details of cases analyzed and their libraries
    B.2 109 recurrently mutated genes identified in the RNA-seq cohort
    B.3 Somatic mutations showing allelic imbalance
    B.4 All MEF2B mutations identified
  Appendix C Supplementary data from the sequencing of 11 Ph-like ALL cases
    C.1 Fusions detected with DeFuse or MOSAIK or previously identified
  Appendix D Supplementary methods
    D.1 Galaxy workflow for analysis of paired tumour/normal sequence data
    D.2 Schema and description of database

List of Tables

Table 1-1: Comparing throughput, read length and error rate of sequencing platforms
Table 1-2: Summary of cancer genomes sequenced and non-silent somatic mutations
Table 2-1: Summary of sequence coverage in FL patient A
Table 2-2: Location and effect of mutations in EZH2 in NHL identified by RNA-seq
Table 2-3: Frequency of EZH2 Y641 mutations in lymphoma and benign samples
Table 3-1: Hot spot mutations identified directly from the RNA-seq data
Table 3-2: Overview of cSNVs and somatic mutations in most frequently mutated genes
Table 3-3: Summary of the types of MLL2 somatic mutations
Table 4-1: Overview of rearrangements detected by RNA-seq and confirmed
Table 4-2: Somatic SNVs identified in 11 ALL cases by RNA-seq
Table 4-3: Somatic indels detected in 11 ALL cases by RNA-seq
Table 4-4: Validation results for somatic indels/SNVs detected in two ALL genomes
List of Figures
Figure 1-1: Overview of raw sequence data from Illumina sequencer
Figure 1-2: Flow of data in a typical analysis paradigm
Figure 1-3: Detecting single nucleotide variants from aligned sequence data
Figure 2-1: Recurrent mutations of Y641 in EZH2
Figure 2-2: In vitro assembly and functional analysis of PRC2 with mutant and wild type EZH2.
Figure 2-3: In vivo demonstration of H3K27 methylation levels in EZH2 wild type and mutant cells.
Figure 3-1: Overview of analyses performed
Figure 3-2: Genome-wide visualization of somatic mutation targets in NHL
Figure 3-3: Recurrent mutations affecting the CREBBP and EP300 HATs
Figure 3-4: Overview of mutations and potential cooperative interactions in NHL
Figure 3-5: Determining the effects of mutations in GNA13 at the protein level
Figure 3-6: Summary and effect of somatic mutations affecting MLL2
Figure 3-7: Overview of MEF2B mutations
Figure 3-8: Potential impact of recurrently mutated genes on BCR signalling and downstream messengers
Figure 4-1: Novel rearrangements in Ph-like ALL.
Figure 4-2: Insertion of EPOR gene into IGH locus
Figure 4-3: Recurrence screening and structure of the EBF1-PDGFRB rearrangement
List of Abbreviations
ABC – activated B cell
ALL – acute lymphoblastic leukemia
BAC – bacterial artificial chromosome
BAM – binary/compressed counterpart to the Sequence Alignment/Map format
bp – base pair
BCR – B-cell receptor
BWA – the Burrows-Wheeler Aligner software
cDNA – complementary DNA
ChIP – chromatin immunoprecipitation
CNVs – DNA copy number variations
COO – cell of origin
cSNV – coding (non-silent) single nucleotide variant
DLBCL – diffuse large B-cell lymphoma
DNA – deoxyribonucleic acid
dNTP – deoxynucleoside triphosphate
FL – follicular lymphoma
Gb – gigabase (one billion nucleotides/base pairs of DNA sequence)
GCB – germinal centre B cell
GEP – gene expression profiling
H3K27 – lysine 27 on histone H3
H3K4 – lysine 4 on histone H3
HAT – histone acetyltransferase
HDAC – histone deacetylase
hg18 – human reference genome sequence, build 18
HMM – hidden Markov model
IHC – immunohistochemistry
Indel – insertion or deletion
IGV – the Integrative Genomics Viewer software
MAQ – the Mapping and Assembly with Qualities alignment software
mRNA – messenger ribonucleic acid
NHL – non Hodgkin lymphoma
nt – nucleotide
PRC2 – polycomb repressive complex 2
RAM – random access memory
RNA – ribonucleic acid
RNA-seq – messenger RNA sequencing
rRNA – ribosomal RNA
RPKM – reads per kilobase of gene model per million reads
SAM – sequence alignment/map format
SNP – single nucleotide polymorphism
SNV – single nucleotide variant
tRNA – transfer RNA
WGS – whole genome sequencing

Acknowledgements

I am infinitely grateful to my supervisor and mentor, Dr. Marra, for his ongoing support and guidance throughout my entire scientific career. His advice has never steered me wrong and the manuscripts included in this thesis are a direct reflection of his propensity for excellence and meticulousness in all aspects of research. Without Dr. Marra’s support and direction, I would likely have never entered post-graduate studies and for that I am thankful.
I thank my committee members Drs. Joe Connors, Steven Jones and Paul Pavlidis for voluntarily advising and overseeing my research. I would like to also thank the other examiners who have donated their time to read this thesis and participate in my defense. I owe many of my technical abilities to lab members and fellow bioinformaticians more talented than myself including (but not limited to) Rodrigo Goya, Dr. Malachi Griffith, Dr. Sohrab Shah, Richard Corbett, Anthony Fejes, Trevor Pugh, Martin Krzywinski, and Karen Mungall. I am indebted to Drs. Randy Gascoyne, Nathalie Johnson, Christian Steidl and David Scott for patiently teaching me about lymphoma over these past years and Drs. Sam Aparicio and Charles Mullighan for being superb and generous collaborators. This work would also not have been possible if it weren’t for the motivated and seemingly inexhaustible laboratory expertise of Dr. Maria Mendez-Lago, Dr. Andy Mungall, Tesa Severson, Dr. Yongjun Zhao, Diane Trinh and Dr. Martin Hirst as well as the entire GSC sequencing and library construction groups and the Gascoyne lab. I am also grateful for the past and ongoing support of the project management group including Karen Novik, Robyn Roscoe and Cecilia Suragh. I am very appreciative of the financial support provided by both the Canadian Institutes for Health Research and the Michael Smith Foundation for Health Research in the forms of a Vanier Scholarship and a Senior Graduate Studentship. The work described herein was funded in part by the National Cancer Institute Office of Cancer Genomics (Contract No.
HHSN261200800001E), the Terry Fox Foundation (grant #019001, Biology of Cancer: Insights from Genomic Analyses of Lymphoid Neoplasms), Genome Canada/Genome BC Grant Competition III (Project Title: High Resolution Analysis of Follicular Lymphoma Genomes) and NIH grants P50CA130805-01 “SPORE in Lymphoma, Tissue Resource Core (PI Fisher)” and 1U01CA114778 “Molecular Signatures to Improve Diagnosis and Outcome in Lymphoma (PI Chan)”. Much of the laboratory work was undertaken at the Genome Sciences Centre, British Columbia Cancer Research Centre and the Centre for Translational and Applied Genomics, a program of the Provincial Health Services Authority Laboratories.

I sincerely thank my wife, Sarah, and daughters, Natalie and Zoe, for their patience and support during the course of my studies. Their humour, distraction, affirmation and love have enabled me to endure the low points and celebrate the high points of these past few years. I extend these thanks to the remainder of my family (parents, step-parents, brothers, sister, grandparents, aunts, uncles and cousins) who share in my successes even when you find them incomprehensible. Without you, I would not have had the motivation to strive for scientific excellence.

Dedication

I dedicate this thesis to the cancer patients who have donated samples to research in selfless hope of aiding the treatment of others who befall the same disease.

Chapter 1: Introduction

1.1 Aims

The goal of this research was to apply new sequencing technologies to reveal the somatic mutations that drive progression of lymphoid neoplasms. I set out to find and develop tools that could accurately and efficiently identify somatic point mutations and fusion transcripts from Illumina sequencing data. I hypothesized that, if such mutations could be found and proven to be somatic, some of these would represent mutations driving these cancers.
Further, if these mutations or genes had not previously been associated with these cancers, they may present new therapeutic targets, the potential for new biomarkers, or guide us to better understand the molecular nature of these diseases.

1.2 Overview of next-generation sequencing technologies

Numerous sequencing platforms have appeared over the past few years and additional technologies show promise of emerging in the near future. These have collectively been referred to as “next generation sequencing” technologies, though they vary broadly in their underlying chemistry, approaches and, as a result, their throughputs. The currently available methodologies can be broadly divided into “sequencing by synthesis” and “sequencing by ligation”. Sequencing by synthesis approaches can further be divided into those using clonally amplified DNA as templates and the more recently released single-molecule platforms, which read the sequence of individual DNA molecules. Each of these platforms has drawbacks and potential benefits, which will be briefly discussed in this section.

1.2.1 454/Roche sequencing

The first of the next-generation technologies, released by 454 Technologies (now Roche), relies on sequencing by synthesis. Briefly, sequencing is performed in picolitre-volume wells that are populated with beads coated with clonally-amplified DNA molecules (Margulies et al. 2005). The sequencing reaction is based upon pyrosequencing, which involves the emission of photons in a luciferase-based reaction that is triggered by the pyrophosphate that is released during nucleotide incorporation. The sequencing proceeds with separate introduction of the four dNTPs and detection of light emitted using a charge-coupled device (CCD). When a given dNTP is introduced, the template DNA strands on each bead undergo zero, one, or multiple extension reactions, and the amount of light produced is linearly proportional to the number of incorporated nucleotides.
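The flow-by-flow chemistry described above can be illustrated with a minimal sketch. It assumes an idealized, noise-free signal and a repeating T, A, C, G flow order; the function names and the example sequences are hypothetical and are not taken from any 454 software.

```python
def simulate_flowgram(synthesized, flow_order="TACG", n_cycles=4):
    """Return idealized (base, signal) pairs, one per nucleotide flow.

    `synthesized` is the strand being built, in the order bases are
    incorporated. Each flow extends through the whole homopolymer run
    matching the flowed base, so the signal equals the run length.
    """
    signals = []
    pos = 0
    for _ in range(n_cycles):
        for base in flow_order:
            run = 0
            # Incorporate every consecutive copy of the flowed base.
            while pos < len(synthesized) and synthesized[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals


def decode_flowgram(signals):
    """Convert (base, signal) pairs back to sequence by rounding signals."""
    return "".join(base * round(value) for base, value in signals)


if __name__ == "__main__":
    flows = simulate_flowgram("TTACGGG", n_cycles=1)
    print(flows)
    print(decode_flowgram(flows))
```

With real data the signals are non-integer, so the rounding step becomes ambiguous as homopolymer runs grow longer; this is one way to picture why calls within homopolymers are the least reliable.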
In practice, this linearity is maintained in homopolymers up to 8 nt but sequencing accuracy drops with increasing homopolymer length and, as a result, the single base indels detected within homopolymers are error-prone (Wheeler et al. 2008). Sequencing progresses with repeated cycles of each of the four dNTPs for some pre-defined number of cycles, yielding a collection of reads (one from each bead) of varying length. In its first published application, this technology was capable of producing, in a single run, 238 thousand reads of varying length averaging 110 nucleotides (nt). Successive advancements over the past six years have enabled average read lengths of 400 nt and throughputs of 1 million reads (Table 1-1).

1.2.2 Illumina sequencing

The first widely-adopted sequencing technology commonly referred to as ‘massively parallel’ was released by Solexa (now Illumina) and was also based on sequencing by synthesis, albeit with a distinct chemistry and amplification methodology (Barski et al. 2007). The Illumina approach relies on reversible termination chemistry, in which the nucleotides added in each extension reaction are fluorescently labelled, which allows them to be photographed under a microscope. Because these fluorescent labels prevent the addition of another nucleotide until they are chemically cleaved, they are referred to as reversible terminators. One key benefit of reversible terminator nucleotides is the ability to separately label each of the four bases. Thus, rather than relying on temporal separation of individual nucleotides as in 454 sequencing, all four separately ‘coloured’ dNTPs can be added together without concern for multiple incorporations. Another benefit owing to these terminators is that the Illumina approach reads homopolymeric sequences more accurately than un-terminated chemistries such as the 454/Roche approach. When it was introduced, a key limitation of Illumina sequencing was the read length.
Early iterations of these sequencers (1G Genome Analyzers) produced reads of 27 to 36 nt. Applications suitable for this read length were ChIP-seq (Robertson et al. 2007) and microRNA sequencing (miRNA-seq) (Stark et al. 2007; Morin et al. 2008a). Substantial improvements to virtually all aspects of the technology over the past four years have resulted in immense throughput increases. Current throughputs available from the HiSeq 2000 platform are shown in Table 1-1. When the original platform was introduced, the short read length, high error rate, and vast increase in raw sequence output compared to the 454 platform made the application of standard algorithms to the alignment and assembly of sequence reads intractable. As a result, a multitude of new software tools for performing tasks such as alignment and assembly were created (to be reviewed in a subsequent section) and even now are continually evolving to better handle the ever-changing landscape of sequencing data.

1.2.3 Sequencing by ligation

A distinct method of generating ultra-short reads was introduced by George Church and involved sequencing by ligation (Shendure et al. 2005). Rather than detecting the identity of individual bases after (or during) incorporation, this technique leverages the inherent ability of DNA ligase to recognize complementary base pairing to read the sequence of DNA. Rather than providing a template for 5´-3´ extension via polymerase, the ‘anchor’ primer, which is annealed to universal sequence, provides a 5´ or 3´ free end for ligation by DNA ligase. The sequence of each template is queried by successive additions of degenerate n-mers containing a single non-degenerate nucleotide at a specific position. The identity of this ‘query’ base is encoded using one of four fluorescent dyes. A successful ligation reaction can only occur if the query base, and the nearby degenerate bases, are complementary (Drmanac et al. 2009; Shendure et al. 2005).
Thus, after successful ligation, the identity of the nucleotide at the query position corresponds to the complement of the query base. With the original approach, successive ligation reactions could only read up to seven nucleotides from the ligation junction, so various solutions have been devised to allow longer reads to be generated using this basic chemistry (Drmanac et al. 2009; McKernan et al. 2009). One platform utilizing this approach forms the basis of the sequence-as-a-service company Complete Genomics. The second is the SOLiD platform, produced by Life Technologies. Each company introduced improvements to base-calling accuracy, read length and overall throughput. To address accuracy issues, SOLiD sequencing included so-called two-base encoding, which allowed pairs of nucleotides to be read concurrently using di-base probes. SOLiD throughput and read lengths were, until recently, generally comparable to those produced by Illumina GAIIx instruments (Table 1-1). However, owing to the two-base encoding scheme used in the SOLiD platform, these data must be handled differently because the raw data are in “colour space” rather than corresponding to the actual DNA template sequence. This rather unique data type has spurred yet further development of algorithms to accomplish alignment and assembly and to perform variant identification that leverages the benefits of two-base encoding. However, because this thesis focuses on results obtained using Illumina sequencing, these algorithms will not be discussed in detail.

1.2.4 Low-throughput rapid-turnaround sequencers

A key goal of many studies discussed in this chapter is to collect the entire genomic sequence from some number of individuals in a population. Each of the sequencing technologies has heretofore catered to a constant demand for greater throughput, both in read length and, more importantly, total number of reads.
However, as initial findings from early sequence-based genomic analysis begin to reveal potentially important discoveries, there will be an equally increasing demand for accelerated (and simplified) sequencing at the expense of throughput. Multiple platforms have become available, or will soon be available, to fill this niche. Both the Roche 454 GS Junior and the Illumina MiSeq are examples of scaled-down instruments that use the same sequencing technology as their namesakes (Table 1-1). A third system that offers accelerated sequence runs is the Ion Torrent technology (Life Technologies). Ion Torrent sequencing is reminiscent of 454 sequencing but eliminates the time-consuming requirement for imaging, thereby offering reduced reagent cost and run time (Rothberg et al. 2011). Because it is a relatively young technology, whether these promises will materialize remains to be seen. Chapter 4 of this thesis includes preliminary data utilizing this platform for validating SNVs and indels. Another sequencing technology (“SMRT sequencing”), which also benefits from very short run-times, will be described below. With steadily diminishing costs, the application of DNA sequencing in clinical diagnostics is becoming routine practice and, as it does, such rapid-turnaround platforms will likely be key to providing results within a time frame in which they can aid diagnosis and ultimately guide treatment.

1.2.5 Single molecule sequencing technologies

A commonality between each of the aforementioned approaches is the clonal amplification of input DNA (either by emulsion PCR or bridge amplification) to increase signal. As such, these technologies are collectively referred to as “ensemble” to distinguish them from single-molecule sequencing, wherein each individual sequence read derives directly from one template DNA (or RNA) molecule.
To date, two companies have produced commercially available platforms that perform single-molecule sequencing and obviate the need for DNA amplification, thus eliminating the potential biases this may introduce. The first, known as true single-molecule sequencing or tSMS, was developed by Helicos Biosciences (Pushkarev et al. 2009). This approach is akin to the Illumina sequencing strategy (i.e. including reversible terminators) with a few key distinctions. The major distinction is that the same fluorophore is used for each of the four nucleotides and thus four cycles (rather than one) are required to interrogate a given position in a molecule. Secondly, the ability to detect the addition of a nucleotide is completely reliant on the fidelity of a single fluorescent molecule. A drawback of this is that so-called “dark bases” result from the incorporation of dNTPs that lack an active fluorophore molecule.

The second single-molecule approach was developed by Pacific Biosciences and is termed Single Molecule Real Time (SMRT) sequencing. This method employs a unique approach to sequencing that involves real-time visualization (rather than static imaging) of the DNA sequencing process. This relies on DNA polymerases that have been engineered (by mutation) to perform the DNA synthesis reaction at a reduced rate. A single polymerase is physically affixed within each zero-mode waveguide (ZMW), of which there are thousands on a single surface termed a SMRT chip. The ZMW allows visualization of fluorescence from individual nucleotides when they are in close proximity with the polymerase and essentially filters the noise produced by the diffusion of labelled nucleotides in solution. The four dNTPs are labeled with different fluorophores that are cleaved when added to the growing DNA strand. Because this cleavage is catalyzed by the polymerase, subsequent nucleotide additions are limited only by polymerase fidelity and diffusion.
SMRT sequencing exhibits a very high error rate when compared to ensemble sequencing approaches but the read length achievable with this approach is only limited by the processivity of each polymerase molecule and can yield read lengths averaging over 1 kb (Table 1-1). One mechanism of ameliorating the significant error inherent in SMRT sequencing is termed “consensus sequencing”, a process by which, aided by the extended read length, a circularized DNA molecule can be repeatedly sequenced by the same polymerase to produce a consensus read with enhanced accuracy. The unique properties of single molecule sequencing including, in the case of SMRT sequencing, long read lengths and high error rate, have resulted in further computational challenges. Improvements to the technology and alignment algorithms will likely continue, but such technologies will probably be applied mainly where these features are well suited and the drawbacks are acceptable.

1.2.6 Perspective on DNA sequencing technology

Owing to the large differences in throughput, accuracy and read lengths, each of these sequencing platforms is suited to individual niches and none has prevailed as the gold standard for all applications. The bulk of the sequencing data presented in this thesis was generated on Illumina instruments, which have, over the duration of my research, evolved from the relatively lower throughput GAII to the HiSeq 2000 platform. Most of the genomes presented in this work were sequenced on HiSeq 2000 instruments, often achieving close to 30x coverage of a genome with a single run of the instrument. Alternate technologies may have been suitable for answering the questions discussed herein, but at the outset of this project, the Illumina platform was the only one capable of producing long paired-end reads, which can provide added value, especially in the analysis of RNA-seq data.
Also, Illumina sequencing yielded promising results with applications to microRNA sequencing (Morin et al. 2008b; Kuchenbauer et al. 2008) and messenger RNA sequencing (RNA-seq) (Morin et al. 2008a) and as such, was an appealing choice at the outset of this large study.

1.3 Overview of the applications of Illumina sequencing

The published studies that have utilized Illumina sequencing technology can be generally classified into three broad applications, namely, RNA sequencing, DNA sequencing and positional profiling of DNA-binding proteins or epigenetic modifications to the DNA or histones (which also involves DNA sequencing), collectively known as ChIP-seq. Although these latter applications are powerful tools that promise to further reveal the complexity of gene regulation, they were not utilized in the research described in this thesis and will thus not be discussed in detail. RNA sequencing can be further divided into small RNA sequencing (miRNA-seq, where miRNAs are the typical query molecule (Stark et al. 2007)) and long RNA sequencing, broadly referred to as RNA-seq (Mortazavi et al. 2008) (in some cases referred to as messenger RNA sequencing, abbreviated mRNA-seq, or whole transcriptome shotgun sequencing, abbreviated WTSS (Morin et al. 2008a)). DNA sequencing, when applied to species with previously un-sequenced genomes, is typically referred to as de novo sequencing and analysis of this type of data generally involves producing an assembly that (ideally) represents the entire genome sequence, often as a set of scaffolded contigs (Gnerre et al. 2011). Owing to the existence of a nearly complete human reference genome sequence and various genome sequences of individuals, massively parallel sequencing of human DNA is typically referred to as resequencing.
Resequencing, either of entire genomes or regions of the genomes that have been enriched by various methods, is a common application of this sequencing technology, particularly as it has proven useful for the study of the genetic features of heritable diseases and human cancers. This thesis focuses on the application of human resequencing and RNA-seq to detect differences between the DNA sequence of test and reference samples, where the reference may be either the reference genome itself (i.e. hg18) together with a human genome variation database (e.g. dbSNP) or a ‘matched’ non-malignant tissue from a patient with cancer. This research involved the development of approaches for aligning Illumina RNA-seq and DNA sequence data and post-processing the data for a multitude of applications. The core set of tools was designed to robustly quantify gene expression, identify single nucleotide variants (SNVs) from the alignments and annotate these variants at a variety of levels to identify those most likely to be relevant to cancer.

1.3.1 RNA sequencing, goals and motivation

The abundance and activity of the functional units of the cell (proteins and some RNAs) result from the transcription of a small fraction of the genomic sequence into complementary RNA molecules that are further processed into what is collectively referred to as the transcriptome. The central dogma of molecular biology states that RNA is the intermediate molecule between the stable genetic components of the cell (chromosomal and mitochondrial DNA) and the functional proteins that perform cellular functions. The RNA species that is subsequently translated into proteins, namely messenger RNA (mRNA), is the most common molecule type assayed by RNA sequencing (both in this thesis and elsewhere). It is important to note that other functional RNA molecules can also be sequenced and quantitated using variants of RNA sequencing.
The most famous class of non-coding RNA (besides rRNA and tRNA) is probably the microRNAs (miRNAs). A more recent discovery is the large number of potentially functional large RNA molecules referred to as long intergenic non-coding RNAs (lincRNAs) (Guttman et al. 2009). Owing to the relative lack of understanding of the function of the individual non-coding RNA genes, this work focuses on the application of RNA-seq in human cancers and, more specifically, only the known protein-coding transcripts in the human genome. In many diseases, the cells comprising diseased tissues or organs have one or more defects reflected in some element of their transcriptome, which can range in scale from a single nucleotide replacement; gain, loss or alteration of an exon; to the altered dosage of an entire gene or deregulation of any number of genes. Some of these alterations may result from changes to the DNA sequence and others may not. For example, polymorphisms and mutations that affect regulatory regions can alter the transcriptional level of genes and can also impact their splicing patterns and act in cis. Alternatively, epigenetic changes such as DNA methylation or histone protein modification, two processes that can be deregulated in cancer as well as a multiplicity of other diseases, can also result in perturbations to the transcriptome. These may manifest either as the inappropriate silencing or deregulated expression of genes and can also be restricted to a single allele (for a review, see Tazi et al. 2009). Though many features of the transcriptome have historically remained unexplored due to cost and throughput limitations, modern transcriptomics, enabled by second-generation sequencing, is now armed with the potential to detect many types of transcriptomic perturbations, affecting both protein-coding mRNAs and non-coding RNAs including miRNAs. RNA-seq offers the possibility to extend our view of the transcriptome from the simple measurement of dosage (or expression).
It enables a more detailed view of the variation in splicing, RNA sequence, and variable expression between the individual alleles, each of which will be demonstrated in this thesis.

1.3.2 Gene expression analysis techniques

Measuring the level of RNA abundance for each transcribed gene has been one of the major applications of genomics in humans and other organisms. Until recently, high-throughput options for quantifying transcript abundance from individual samples were mainly restricted to either microarrays or digital profiling approaches such as serial analysis of gene expression (SAGE). The array-based (“analog”) strategies, which relied on the measurement of signal from fluorescently labeled cDNA hybridized to a fixed surface, were typically preferred over sequencing-based approaches due to their lower cost, more rapid turnaround time and reduced requirement for highly specialized facilities. Microarrays have proven extremely useful in cancer research, having enabled the identification of prognostic biomarkers (Lossos et al. 2004) and new subtypes of various cancers, including diffuse large B-cell lymphoma (Alizadeh et al. 2000). A major drawback of array-based technologies, besides their limited dynamic range owing to signal saturation, was that the array design process inherently limited the set of genes that could be detected. The original array designs were also unable to distinguish expression from the individual alternative isoforms of genes produced by alternative splicing or other forms of alternative expression (AE). Although subsequent generations of expression arrays were designed to also interrogate AE, there were still inherent limitations such as the ability to design unique probes for individual exons and exon junctions (Griffith et al. 2008).

1.3.3 RNA-seq for gene expression profiling

The introduction of ultra high-throughput, cost-effective DNA sequencing has provided a viable alternative to array-based transcriptome profiling.
Most massively parallel sequencing platforms offer protocols that allow the measurement of transcript abundance without any of the a priori assumptions of the structures/sequences previously imposed by microarrays. These techniques involve sequencing the entire transcriptome (RNA-seq) (Mortazavi et al. 2008) or short portions (‘tags’) from polyadenylated transcripts involving a procedure reminiscent of SAGE known as Tag-Seq (Morrissy et al. 2009). Collectively, these technologies have made the digital profiling of entire transcriptomes both comprehensive and cost-effective. Since their appearance, numerous studies have compared the utility of these approaches to first-generation expression platforms and these have typically demonstrated improved sensitivity and dynamic range in measuring steady-state mRNA abundance over their previous array- or sequence-based predecessors (Sultan et al. 2008; Marioni et al. 2008; Asmann et al. 2009). I also demonstrate in Chapter 3 that gene expression data from RNA-seq can be used to identify the subtype of DLBCL samples. The improvements to the quality and reproducibility of measurement yielded by RNA-seq are a modest gain when compared to the additional possibilities it offers. As might be expected, because they rely on sequence reads to measure expression, these approaches enable numerous added benefits beyond only measuring mRNA abundance such as more accurate detection and differentiation of alternative transcript isoforms, detection of novel mRNAs resulting from fusion of two genes, and small-scale sequence variation such as polymorphisms and mutations. Various analytical strategies have been implemented to address each of these features of the data and will be reviewed in detail in subsequent sections.

1.3.4 Targeted resequencing applications

Though sequencing costs have been dropping rapidly, sequencing the entire genome is still costly.
It is difficult to justify such a cost when one considers that much of the variation that can be interpreted lies within the ~1% of the genome that encodes protein, often referred to as the “exome”. A common approach to limit sequencing to certain genomic regions involves an enrichment step, which amplifies (or captures) regions of interest from a DNA sample prior to sequencing. For a small number of such regions, a set of primers can readily be designed that enable amplification of the entire region (or the exons within the region) by polymerase chain reaction (PCR). Studies seeking the underlying causal mutation in human genetic disease have used such strategies. For example, linkage analysis can reveal a region of the genome that segregates with a disease in a family, but the resolution is limited and often the linked region contains many genes. Long-range PCR amplification of all genes (or their exons) in the linked region followed by deep sequencing of those amplicons has proven able to identify the deleterious mutation responsible for human genetic syndromes (McLarren et al. 2010). More recently, approaches that utilize hybridization rather than PCR have become commonplace for such applications. In general, each of these approaches is able to “capture” the fragments from a DNA sample (or library) that correspond to pre-defined regions of interest, leaving behind the remainder of the genome. The capture probes, or “baits”, are either physically anchored to a surface (usually the surface on which they were synthesized) (Albert et al. 2007) or consist of biotin-labeled RNA oligonucleotides that are free in solution (Gnirke et al. 2009). The two cited methods have been further developed into commercially available applications that can be purchased as kits designed to capture the entire exome or as custom designs that can be made to capture subsets of the exome or arbitrary regions of interest.
There are numerous published examples where the application of these strategies followed by next-generation sequencing enabled the discovery of novel and disease-relevant variations in the human genome both in genetic disease (S. B. Ng et al. 2010) and cancer (Wei et al. 2011; Varela et al. 2011).

1.3.5 Whole genome sequencing in humans

Some early applications of next-generation sequencing to resequence the genomes of individuals revealed not only large numbers of single nucleotide variants but also a glimpse of the structural variation that exists in the population (Wheeler et al. 2008; Bentley et al. 2008; McKernan et al. 2009). An extensive survey of the genetic variation in the genomes of 179 individuals and exomes of 697 individuals (the 1000 Genomes Project) was recently published, which revealed roughly 15 million novel SNVs and another 20,000 novel large structural variations in the human population (The 1000 Genomes Project Consortium 2010). As variants need not alter the protein coding sequence to affect gene expression or splicing, any of these variants might contribute to human diseases. Hence, to maximize the potential for discovering those variants relevant to disease, in theory, one must sequence the entire genome. There are already examples of whole genome sequencing revealing causal mutations in rare genetic diseases (Lupski et al. 2010) and enabling the diagnosis of atypical presentation of an autosomal recessive disorder (Rios et al. 2010). In each case, these studies sequenced the proband and subtracted known polymorphisms. In the cited studies, the causal variants were found to be protein-altering mutations and thus may have been identified by exome sequencing. I predict that future studies utilizing whole genome sequencing (WGS) may reveal non-coding mutations that are causal in other genetic diseases.
1.4 Computational problems and algorithmic solutions in next-generation sequencing

The introduction of massively parallel sequencing technologies presented multiple informatic challenges ranging from the difficulty in storing and effectively accessing the sheer volume of data to the ineffectiveness and/or inappropriateness of extant algorithms for routine tasks such as sequence alignment and mutation identification.

1.4.1 The raw data

Until recently, DNA sequence data could be stored in the raw formats (“reads” or “traces”) as the sequencing machines provided them. The raw data from Illumina sequencing includes large image files, which collectively can be analyzed to produce the individual reads of each DNA cluster on the sequencing surface (“flow cell”)(Figure 1-1). The extraction of reads from these images produces strings of sequence in a five-letter alphabet {A, C, G, T, .} with un-called bases represented by the “period” character. Each position in the read is accompanied by a “Phred-like” quality score, which reflects the confidence of the called base. This “Q” score is derived from the estimated probability (e) that a base is called incorrectly (Q = −10 log10(e)), similar to the quality scores originally introduced to describe the quality of base calls from a Sanger capillary sequencer (Ewing et al. 1998; Ewing & Green 1998). For example, a base with an estimated error probability of 0.001 receives a score of Q = 30. The formats for the reporting of these data from a sequencing run have undergone numerous iterations and the most common ‘raw’ sequence format utilized by most software is the fastq format, which is reminiscent of the standard fasta format with additional fields for quality data (Figure 1-1). However, the raw sequence data is typically merely an intermediate that is passed into other specific tools such as an aligner or assembler (Figure 1-2).

1.4.2 Read mapping

Sequence alignment is the process by which regions of similarity between two DNA, RNA or protein sequences are identified.
In resequencing, alignment is often used to describe the more specific task of read mapping, whereby reads are each aligned to a large reference and their best hit (or hits) are identified. The purpose of this process is to determine the most likely region of the human genome from which each read derives. Nearly every analytical task in resequencing relies on mapped reads and a reference genome rather than the raw sequence data. Many next-generation aligners have been described in the past few years (for a comprehensive review, see (Li & Homer 2010)). Though many of these aligners aim to achieve the same goal, optimal mapping of short reads, there are many differences in indexing strategies and alignment algorithms. Also, some aligners attempt to estimate their confidence in each alignment, reflected in a “mapping quality”. These differences between aligners often result in trade-offs between speed and sensitivity, with faster aligners generally (but not always) less sensitive due to heuristics or lack of gapped alignments (Li & Homer 2010). Also, because paired end reads and colour-space sequence (from the SOLiD platform) emerged later, not all aligners handle these types of data. For those aligners that do consider pairing information, enhanced sensitivity (and specificity) is achieved by selecting alignments for un-mappable or poorly mapped reads that situate them in close proximity to their well-mapped mate. The BWA software accomplishes this by empirically determining the distribution of fragment lengths based on batches of aligned reads and attempting to align unmapped mates so that they adhere to this distribution when possible (Li & Durbin 2009).

1.4.3 Read mapping for RNA-seq

When sequencing mRNA molecules with a short read technology, there are additional issues to consider during alignment.
The key difference in the data owes to the fact that many of the reads from a cDNA molecule derive from two or more exons, which were joined together by splicing, where the intervening sequence (the intron) is removed. This presents problems in two stages of the alignment. First, identifying matching sequences in the genome is greatly hampered for so-called “junction reads” as they contain sequences from disparate locations of the genome. For alignment algorithms that index the reads, the individual seed sequences that correspond to disparate halves of a spliced sequence are thus unlikely to yield correct matches in the genome index. Second, due to the presence of introns in the genome and their absence in mRNA, read pairing does not follow the expected trend based on the fragment length distribution and rather resembles the intron length distribution. This significantly affects the ability of BWA and other aligners to use the pairing information of reads advantageously. A modified version of this software, termed BWA-R (for RNA-seq), has been produced in-house to compensate for this, but alignment still relies on a genome that has been supplemented with all known exon-exon junction sequences and hence cannot reveal novel (un-annotated) splicing events.

A variety of aligners have been published that can map junction reads to the genome without a set of exon junctions defined a priori, with a commonly cited example being TopHat (Trapnell et al. 2009). The TopHat approach to identify junction reads involves first aligning as many of the reads as possible with a standard aligner (Bowtie). The aligned reads are then used to define clusters of reads (termed islands), which are considered rough estimates of exonic regions. In the next phase, the unaligned reads are split into pieces and attempts are made to align portions of those reads to islands and then to align the remaining portion of that read within the other proximal islands.
Some problems with this approach include the reduced ability to align reads deriving from poorly covered genes (i.e. those with low expression) and insensitivity to non-canonical splice sites. A more recent publication describes HMMSplicer, which uses a probabilistic approach (a Hidden Markov Model or HMM) to identify exon junction reads (Dimon et al. 2010). This approach demonstrated improved sensitivity and accuracy when compared to TopHat but notably, neither of these algorithms leverages paired-end read information.

1.4.4 Perspectives on alignment of Illumina sequence data

Ultimately, the aligner for a given task is selected based on a combination of the features (and volume) of the data being analyzed and more practical considerations. For example, one might consider the desire for sensitive alignments (important for longer reads, where multiple errors in a read are commonplace), the presence/absence of paired reads, and the availability of computational resources. The aligner utilized for much of the research described in this thesis is BWA (Li & Durbin 2009), which handles paired end reads, performs gapped alignments, and uses a memory-efficient indexing strategy based on the Burrows-Wheeler transform to enable fast (and multi-threaded) alignment. For RNA-seq data, I utilized BWA-R in conjunction with a set of known exon-exon junction sequences for alignment. In RNA-seq analysis, one might prefer an aligner capable of identifying novel exon junctions. However, because the focus of this work was on single nucleotide variants and gene expression, BWA was a suitable choice.

1.4.5 Alignment data universality

Early next-generation aligners each produced plain text tab-delimited alignment formats, which attempted to retain all relevant read information and report succinct details of the alignment. These text-based formats were cumbersome and, as they did not follow any standard formatting, made programming downstream analytical tools difficult.
One widely adopted read mapper for Illumina data was MAQ (Li et al. 2008), which included a tab-delimited format that could be byte-encoded and compressed for efficiency without impacting accessibility of the data. Other tools that accompanied this mapper, such as single nucleotide variant (SNV) and indel callers, were able to read this format directly and MAQ also included a rudimentary alignment visualization tool. Still, in order to implement a variant caller that utilized these alignments, one would first have to implement a parser of this format or of the more human-readable (and less space-efficient) composite known as the “pileup” format. For the efficient development of any useful and deployable set of analytical tools that could leverage alignments from different algorithms, there was a clear need for a unified alignment format.

The most widely adopted format for aligned sequence data is the Sequence Alignment/Map (SAM) format. The SAM format (and its accompanying binary counterpart, termed BAM) is conceptually similar to the compact MAQ alignments but is more versatile in its ability to handle longer reads and more complex alignments. Besides their ability to store all necessary information about an alignment, BAM files (and the associated set of software utilities known as SAMtools) enable efficient random access of reads mapped within certain genomic regions. This is accomplished by sorting the alignments on both the chromosome and position followed by an indexing process (Li et al. 2009). SAM also introduced a more versatile pileup format, which encapsulated enough information to allow reconstruction of the individual reads (and qualities) from that format. Pileup acts as a human-readable and yet readily parsable text format that allows for utilities such as SNV callers to be written in languages other than C or Java.
My early implementations of variant callers used in this thesis relied first upon the MAQ and later the SAM pileup formats for SNV identification and were implemented in Perl. Many modern variant callers and other post-processing algorithms work from the native SAM or BAM files (Goya et al. 2010; Edmonson et al. 2011), allowing these to reside at the core of many diverging analysis paths and abrogating the need for data duplication in numerous formats for accomplishing analysis tasks (Figure 1-2). Also, multiple robust tools for visualizing aligned sequence data such as the Integrative Genomics Viewer (IGV)(Robinson et al. 2011) enable the visualization of reads aligned to an entire genome, an often-necessary task that was previously difficult to accomplish.

1.4.6 Sequence assembly

An alternative entry point into short read analysis is de novo assembly, whereby overlapping reads within a data set are identified and a consensus sequence is determined from sets of overlapping reads yielding numerous “contigs”. Many assemblers have been designed to handle next-generation sequencing data (Warren et al. 2007; Zerbino & Birney 2008; Butler et al. 2008), but these are often applied to de novo sequencing projects rather than resequencing applications. One exception where assembly may be superior to alignment in such applications is the identification and resolution of large structural aberrations such as large insertions, deletions, translocations and inversions and the altered transcripts that can result from such events. One major issue is that application of these de novo assemblers to large datasets is often limited by the amount of RAM this process requires, making assembly tractable mainly for small genomes such as those of bacteria or yeasts. Some advances have been made to ameliorate this issue, for example making use of large shared memory servers (Gnerre et al.
2011) or by efficiently distributing the memory usage across many nodes in a compute cluster (Simpson et al. 2009). Recently, the distributed assembler ABySS was adapted to handle RNA-seq data and demonstrated promise as another potential tool for identifying un-annotated alternative splicing without the necessity of splicing-aware short read aligners such as TopHat (Robertson et al. 2010). Still, the computational time and space required to perform a de novo assembly for mammalian genomes and transcriptomes is orders of magnitude greater than that required to map an equal number of reads to a reference and it is as yet unclear whether (or in which situations) the former would be preferable.

Another strategy that leverages the power of assembly but bypasses the necessity for large RAM by utilizing a priori knowledge of the genome is targeted assembly. An example of an algorithm that utilizes this approach is CREST (J. Wang et al. 2011a), which was designed to identify structural aberrations in cancer. This algorithm first identifies likely chromosomal break points by identifying clusters of soft-clipped reads. These are alignments in which including one (or both) ends of the read sequence did not improve alignment quality, so that portion of the sequence was ‘clipped’ to produce a superior alignment. The full sequences of these soft-clipped reads are next assembled using the CAP3 de novo assembler. The resultant contigs are aligned to the reference genome to identify the likely source of the sequence contributing to the soft-clipped portion of these reads. If any of the hits by this search contain soft-clipped reads, these are next assembled to confirm whether they support the same event. The CREST publication was relatively recent and it is likely that hybrid strategies for more fine-grained analysis of DNA sequence will become more common.
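The first step described above for targeted assembly, locating clusters of soft-clipped reads, can be illustrated with a short sketch. This is not the CREST implementation; it only assumes that each alignment is available as a (chromosome, 1-based leftmost position, CIGAR string) tuple as stored in SAM records. The function names and the `min_support` threshold are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_breakpoints(pos, cigar):
    """Return reference coordinates at which this alignment is soft-clipped.

    pos is the 1-based position of the first aligned base, as in SAM.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips carry no sequence
    points = []
    if ops and ops[0][1] == "S":
        # Clipped sequence lies immediately left of the first aligned base.
        points.append(pos)
    # Operations M, D, N, = and X consume reference bases.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    if len(ops) > 1 and ops[-1][1] == "S":
        # Clipped sequence lies immediately right of the last aligned base.
        points.append(pos + ref_len - 1)
    return points


def candidate_breakpoints(alignments, min_support=3):
    """Report sites where at least min_support reads are clipped (assumed threshold)."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        for point in clip_breakpoints(pos, cigar):
            counts[(chrom, point)] += 1
    return sorted(site for site, n in counts.items() if n >= min_support)


reads = [
    ("chr9", 100, "30M20S"),
    ("chr9", 110, "20M30S"),
    ("chr9", 120, "10M40S"),
    ("chr9", 130, "50M"),
    ("chr22", 500, "15S35M"),
]
print(candidate_breakpoints(reads))
```

In this toy input, three reads are clipped at the same reference coordinate on chr9, so that site is reported, whereas the single clipped read on chr22 falls below the assumed support threshold. In a complete workflow, the clipped reads at each reported site would then be passed to an assembler.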
1.4.7 Variant identification

A core motivation in DNA sequencing is to capture and accurately decode the genetic variation of individuals and much of this variation is thought to be single nucleotide differences that are either polymorphic (i.e. single nucleotide polymorphisms or SNPs) or are private to an individual (i.e. mutations or SNVs). Not surprisingly, there is no consensus on how best to accomplish this given aligned sequence data, as exemplified by the numerous tools designed to accomplish essentially the same task (Li et al. 2008; Goya et al. 2010; Edmonson et al. 2011). This multiplicity of options owes, in part, to technical differences in sequencing technologies and alignment approaches, as previously discussed. Some examples are differences in base qualities and in the accuracy of estimates of alignment confidence (mapping qualities). Some of the disparity may also stem from imperfect comparisons made between variant callers and the lack of true gold standard data sets on which each algorithm has been tested. The goal of most variant callers is to utilize aligned reads to identify variant sites and assign a genotype call as either homozygous reference, homozygous non-reference, or heterozygous (Li et al. 2008). This motivation is well suited to experiments searching for variants in normal (diploid) cells. In this scenario, reads aligning across a heterozygous SNV should represent each of the two alleles in roughly equal proportions.

It has been argued that the problem of identifying somatic point mutations in often heterogeneous cancer samples is not equivalent to genotyping polymorphisms in a population of homogeneous diploid cells (Shah et al. 2009b). Some attempts have been made to implement variant callers more suited to identifying mutations in cancer. One example is SNVmix, which I assisted in conceptualizing and developing during the course of this research (Goya et al. 2010).
The simplest version of SNVmix performs thresholding on the base and mapping qualities at a given site, then, using empirically derived priors, assigns posterior probabilities for each of the three genotypes at that position (Figure 1-3). Our comparisons showed that SNVmix performed favourably against other variant callers that were not designed specifically for mutation identification in cancer, such as the commonly used method included in the SAMtools package (Li et al. 2008).

Despite improvements made in variant identification, many of the novel variants identified by this and other algorithms (without any post hoc filtering to remove systematic artefacts) are still subsequently found to be false (Ajay et al. 2011). Thus, there is still a need for improved filtering mechanisms that better enable automated differentiation of false positives from true variants and this is an open problem in the field. Also, in cancer sequencing, it would be preferable if variant callers could consider paired tumour/normal data in aggregate rather than independently. A single approach to accomplish this has thus far been published (Larson et al. 2011) and both MuTect (currently unavailable) and JointSNVMix (http://code.google.com/p/joint-snv-mix/; Roth et al. unpubl) also present methods for accomplishing this task. Other approaches with the same motivation appear to be pending publication (Ding et al. 2010). Identification of somatic point mutations is only a first step towards determining the genetic alterations that are important in a cancer. The subset of these mutations that are important, as well as the other types of alterations that can drive tumour progression, will be discussed in the next section.
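The idea of assigning posterior probabilities to the three genotypes, mentioned earlier in this section, can be sketched with a simple binomial model of the reference-allele read count. This is an illustration only and not the SNVmix code: the per-genotype reference-allele fractions (`mu`) and the priors are placeholder values assumed for the example rather than the empirically derived parameters, and reads are assumed to have already passed base and mapping quality thresholds.

```python
from math import comb

# Homozygous reference, heterozygous, homozygous non-reference.
GENOTYPES = ("aa", "ab", "bb")


def genotype_posteriors(ref_count, depth,
                        mu=(0.99, 0.5, 0.01),
                        prior=(0.98, 0.01, 0.01)):
    """Posterior probability of each genotype given reference-allele read counts.

    mu[i] is the assumed probability that a read shows the reference allele
    under genotype i; prior[i] is the assumed prior for that genotype.
    """
    likelihoods = [
        comb(depth, ref_count) * m ** ref_count * (1 - m) ** (depth - ref_count)
        for m in mu
    ]
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    return {g: j / total for g, j in zip(GENOTYPES, joint)}


# A site covered by 20 reads, 10 of which match the reference.
posteriors = genotype_posteriors(ref_count=10, depth=20)
print(max(posteriors, key=posteriors.get))
```

Because the priors strongly favour the homozygous reference genotype, a variant is only called when the read evidence outweighs them. Changing `mu` for the heterozygous state away from 0.5 is one way such a model could accommodate the skewed allelic ratios expected in heterogeneous tumour samples.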
1.5 Mutation identification in cancer

The process of neoplasia, in which a population of normal cells progresses to a cancer, is thought to be a stepwise process involving a series of genetic changes, each of which increases fitness and thus provides a selective advantage to the cells (Hanahan & Weinberg 2000). The mutations that promote malignant transformation are collectively referred to as “drivers” and the genes they affect are referred to as “cancer genes” (Futreal et al. 2004). Many of the mutations in a tumour genome may have a neutral (or even deleterious) impact on fitness and these are referred to as “passengers”. These mutations are retained in the tumour cell population only because they arise in clones containing driver mutations (Stratton et al. 2009). An early (and ongoing) task in the study of cancer is to identify the key cancer genes and their respective driver mutations.

1.5.1 Methods for identifying gross chromosomal abnormalities

The earliest driver mutations were discovered long before the invention of DNA sequencing by directly visualizing metaphase chromosomes under a light microscope. This technique allowed the Philadelphia chromosome (Ph1) to be identified in the malignant cells from the bone marrow of chronic myelogenous leukemia (CML) patients (Nowell 1962). A decade later, once karyotyping had been improved by using various stains that revealed chromosomal banding patterns, the underlying chromosomal aberration resulting in this chromosome was discovered, namely, the translocation between chromosomes 9 and 22 termed t(9;22)(q34;q11)(Rowley 1973). This driver mutation was ultimately found to produce a fusion between the genes BCR and ABL1, producing the oncogenic BCR-ABL1 fusion gene and resulting in deregulated activity of ABL1 kinase (Bartram & Grosveld 1985).
This oncoprotein is the target of the drug imatinib, which is one of the first successful targeted therapies in cancer and exemplifies a desired paradigm whereby knowledge of driver mutations can ultimately be translated to treatments or other methods of impacting the clinical outcomes of cancer patients.

Karyotype analysis using G-banding became a routine method of discovering and cataloguing the chromosomal aberrations in certain cancers. Hematologic malignancies such as leukemias and lymphomas were particularly amenable to this type of analysis as relatively pure tumour populations can be obtained and cultured from these diseases. As such, many of the first cancer-associated mutations were identified from recurrent genetic abnormalities in leukaemias and lymphomas (Futreal et al. 2004). However, the use of karyotypes is limited by one’s ability to culture cells (which is not always possible) and further, its resolution is limited to translocations affecting large regions as well as very large deletions or inversions. Improved methods eventually emerged that utilized arrays printed with DNA from sets of individual bacterial artificial chromosomes (BACs) in a technique referred to as array comparative genomic hybridization or aCGH (Snijders et al. 2003). A successor to this method involves hybridization of fluorescently labeled genomic DNA to genotyping arrays (SNP arrays) and either comparing the signal against that from matched normal cells from the same individual or, when a matched normal is unavailable, against that of an averaged signal from many un-matched samples (Nannya 2005). These approaches allow the detection of copy number alterations in cancer resulting from duplications, amplifications, copy-neutral loss of heterozygosity and unbalanced translocations but are unable to detect inversions or other copy-neutral events.
Studies applying array technology to cancer have revealed new oncogenes, which are highly amplified in certain tumour types (Ota et al. 2004), and tumour suppressors, which are commonly deleted, often with a second mutation (deletion or point mutation) affecting the other allele (Mestre-Escorihuela et al. 2007; Bea et al. 2009). A common problem with these studies is that copy number alterations often involve large regions, commonly involving gains/losses of entire chromosomes or chromosomal arms (Beroukhim et al. 2010). Identifying the important genes in these rearrangements using array-based technologies requires many samples, which can allow the delineation of minimal common regions (MCRs), that is, the minimal bounds of a genomic region gained or lost in many tumours. Small, focal regions of gain and loss are more readily identified and various algorithms have been devised to define these, with GISTIC being a commonly cited example (Beroukhim et al. 2007). One large-scale study employing this approach profiled over 3000 tumours of various cancer types and identified focal gains/losses affecting genes in some of the same pathways across multiple cancer types such as NF-κB signalling and apoptosis (or inhibition thereof) by BCL2-family proteins (Beroukhim et al. 2010). As might be expected, many lineage-restricted alterations were also identified, reflecting the distinct genetic pathways utilized in different cancer types. A similar study profiled almost 2500 cancer cell lines and focused on identifying genes affected by homozygous deletions (HD) in these samples (i.e. likely tumour suppressors or recessive cancer genes)(Bignell et al. 2010). This study showed that many regions of HD were associated with either known (or suspected) recessive cancer genes or fragile sites, that is, regions known to be more susceptible to genomic rearrangements.
The use of integrative analyses involving copy number and gene expression arrays has enabled the identification of the genes that are deregulated or silenced in these large regions commonly altered in cancer, also revealing new oncogenes and tumour suppressors (Lenz et al. 2008c). Still, without the ability to directly identify situations where inactivation of a second allele occurs by point mutation rather than deletion, more integrative approaches relying on sequencing are required. Such approaches, utilizing sequencing to identify secondary inactivating mutations in minimal common regions, have identified new tumour suppressors in various cancers (Sanchez-Cespedes et al. 2002; Cheung et al. 2010).

1.5.2 The search for somatic point mutations

It is clear that cancer genes can be both activated and inactivated by various mechanisms including not only copy number gains/losses and translocations but also somatic point mutations, which can only be discovered by sequencing. Early attempts to catalogue driver point mutations were aimed at candidate genes because sequencing the entire genome or even the exome was not yet in reach. These genes were selected because of a priori knowledge of their role in processes or presence in a pathway known to be altered in cancer (e.g. apoptosis), because they were known to be commonly amplified or involved in other chromosomal defects in cancer cells (Davies et al. 2002), or because they had viral orthologues capable of transformation (e.g. Src (Martin 2001)). Also, classes of genes for which nonspecific drugs were available were often searched for driver mutations, in hopes that oncogenic versions of these genes could be readily targeted if found.

One example is the “kinome”, which refers to the set of exons encoding all known protein kinases. Constitutively activated kinases, resulting either from specific recurrent mutations (Davies et al. 2005) or chromosomal aberrations such as the aforementioned BCR-ABL fusion (Lugo et al.
1990), are known to drive many neoplasms. Kinases can often be targeted with ATP-competitive inhibitors, making them appealing targets for novel therapeutics (Schwartz & Murray 2011). As a result, numerous specific and broad-spectrum small molecule inhibitors of kinase activity are already available and approved for use in various cancers (Schwartz & Murray 2011). Examples of some early studies performing kinome sequencing revealed genes affected by somatic point mutations in lung (Davies et al. 2005) and colorectal cancers (Bardelli et al. 2003), suggesting that such strategies could be successful at least in the cancers driven by deregulated kinase signalling. A broader study of lung cancer focused on not only kinases but also known oncogenes, tumour suppressors and genes in regions of recurrent genomic alterations (Ding et al. 2008). This study revealed an unappreciated role of known tumour suppressors such as NF1, ATM and APC in lung adenocarcinoma. They also reported mutations in ERBB4, suggesting that it may act as an oncogene in lung cancer as do the related receptor tyrosine kinases EGFR and ERBB2. Even with the relatively small number of genes sampled, this study was also able to identify enrichment for mutations within certain pathways, such as MAPK and p53 signalling. Thus, despite its limitations, targeted resequencing of subsets of genes led to novel insights into the molecular mechanisms underlying certain cancers and also revealed some new potential therapeutic targets.

1.5.3 From candidate genes to global mutational surveys

As all the pathways relevant to cancer may not yet be known, a preferred approach to identify genes mutated in cancer would survey DNA sequence in an unbiased manner. Early efforts at performing such surveys proceeded by PCR-based amplification of massive sets of amplicons.
For example, the exons from between ~13,000 and ~20,000 genes were amplified from the tumours and constitutional DNA of multiple breast, colorectal and glioblastoma multiforme (GBM) patients (Sjöblom et al. 2006; Parsons et al. 2008). The latter study identified a gain-of-function hot spot mutation in the IDH1 gene present in 12% of GBMs. IDH1 encodes isocitrate dehydrogenase, a key enzyme in the citric acid cycle. As IDH1 is unlikely to have been present in any candidate gene lists, this exemplifies the need for unbiased and global surveys for mutations in cancer.

With the introduction of next-generation sequencing, it quickly became possible to assess mutational status of larger components of a cancer genome. Because sequences from RNA-seq experiments represent the sequence of the corresponding DNA (with the exception of RNA edits)(Shah et al. 2009b), they have the theoretical potential to reveal the SNVs in the exons of all genes transcribed in that sample. Though it is not without bias, some groups tested RNA-seq as a tool to identify recurrent driver mutations that were present in actively expressed genes. It was quickly shown that modest amounts of sequencing could reveal biologically relevant somatic mutations affecting highly expressed genes (Shah et al. 2009a), potentially reducing the cost required to identify expressed ‘driver’ mutations in cancer. The cited study revealed a recurrent point mutation affecting a single amino acid residue in FOXL2 in certain ovarian cancers. Shortly thereafter, utilizing a similar approach, my analysis revealed a hot spot mutation in the EZH2 oncogene in lymphoma and this study forms the basis of Chapter 2.
Any study relying solely on RNA-seq for discovering somatic mutations must consider that the approach is not well suited for capturing those alleles that demonstrate extreme allelic imbalance (AI) in favour of the wild type allele and may miss those mutations that result in abrogation of transcription or enhanced mRNA turnover via the miRNA or nonsense-mediated decay (NMD) pathways. Nonetheless, the added benefits over tumour genome sequencing, including the ability to monitor AI and alternative splicing in the context of sequence variations, make it an attractive option and a very powerful complement to genome or exome sequencing. Further, knowledge that a mutated allele is expressed in a diseased cell offers possible avenues for future treatments. For example, if a mutated protein is constitutively active or has an otherwise enhanced function, small molecules that directly target/inhibit the mutant protein may be sought. Even if the mutant protein is not oncogenic, recurrent point mutations such as those in FOXL2 and protein-altering RNA edits specific to cancer cells could be used as biomarkers. In the case of FOXL2, the mutation is restricted to a single type of ovarian cancer and is as such considered “pathognomonic” of adult-type granulosa-cell tumours (Schrader et al. 2009). Genes whose transcription is silenced or mRNA degraded due to mutation (i.e. tumour suppressor genes) do not directly offer new treatment options but may also prove useful as biomarkers. Hence, RNA-seq is not expected to capture all the mutations in a tumour but those mutations revealed by this approach may be more appealing from a clinical perspective.

The mutation hot spots in genes such as EZH2 and FOXL2 (and others, to be described in Chapter 3) were readily identified due to their clear patterns of recurrence and high incidence.
To robustly allow differentiation of germ line from somatic variants, a necessary paradigm that is typically followed in the sequencing of cancer genomes is the paired sequencing of a matched constitutional DNA sample, which derives from a non-malignant tissue from the same patient (Meyerson et al. 2010). The same approach is often followed when one applies targeted resequencing to cancer samples. However, in performing RNA-seq, a matched control sample is typically not sequenced in parallel; even if it were, the optimal matched normal tissue may not always be available from a patient. For example, a pure population of non-malignant B cells may not be obtainable from a patient with lymphoma. Owing to the lack of matched controls, mutations that are not focally recurrent are difficult to differentiate from the hundreds of novel germ line variants and RNA edits in each RNA-seq library. For this reason, many cancer-sequencing studies have instead favoured the use of either exome or whole genome sequencing for mutation discovery, with RNA-seq often included as a complementary tool.

1.5.4 Cancer genome sequencing

At the outset of this research project, there were no published studies in which an entire cancer genome was sequenced and compared to its germ line sequence and only three studies had thus far applied global next-generation sequencing to identify mutations in cancer cell lines (Bainbridge et al. 2006; Campbell et al. 2008; Morin et al. 2008a). Over the subsequent three years, multiple studies emerged, which often sequenced a single primary tumour (or cell line) and therein identified somatic point mutations, copy number alterations and chromosomal rearrangements (Ley et al. 2008; Mardis et al. 2009; Shah et al. 2009b). One interesting trend revealed by these and later studies was that the number of protein-altering somatic mutations ranged from as few as four (S. J. Jones et al. 2010) to more than 300 (Lee et al. 2010).
This was roughly in concordance with estimates made based on the previous targeted surveys of breast and colorectal cancers, where it was estimated that an average tumour contained 93 mutated genes (Sjöblom et al. 2006).

Another notable finding revealed in the Sjöblom study and further exemplified by subsequent publications is the drastic difference in the frequency of different mutation types between distinct tumours. The two papers by Pleasance et al. describing the genomes of a small-cell lung cancer (2010b) and melanoma (2010a) cell line investigated this phenomenon in detail. In the case of lung cancer, the Pleasance group reported an enrichment of G>T/C>A mutations, which were found enriched in the CpG context and considered likely to result from deamination of methylated cytosines. This study also reported discrepant mutation rates in expressed genes and between the transcribed and un-transcribed strands, which they interpreted as evidence for DNA repair mechanisms that are coupled to gene expression (Pleasance et al. 2010b). The second study found a predominance of C>T/G>A mutations, which are known to result from exposure to ultraviolet radiation, and also noted evidence for expression-coupled repair (Pleasance et al. 2010a). With the emergence of additional genomes from other cancer types, new patterns of point mutations and surprising modes of chromosomal rearrangement have emerged, such as chromothripsis seen in various cancer types (Stephens et al. 2011) and complex chaining rearrangements observed in prostate cancer (Berger et al. 2011). It is becoming clear that diverse mutational mechanisms drive each tumour type and that these are directly reflected in the landscape and frequency of mutations.

1.5.5 Approaches for identifying driver mutations and cancer genes

One central goal in the genomic study of cancer is to identify the driver mutations and the cancer genes in which they arise.
Hence, whether using transcriptome, exome or genome sequencing (or some combination thereof), one is eventually tasked with determining which of the thousands of mutations are relevant and which are merely passengers. This thesis focuses on the utility of sequencing to discover somatic point mutations and, to a lesser extent, fusion transcripts. As shown in Table 1-2, the number of somatic mutations identified from a paired tumour/normal genome can be large and the types of mutations are not uniform between cancer types due to differences in mutational mechanisms.

Numerous methods have been proposed that typically aim to answer one of two questions when confronted with lists of somatic mutations, namely: 1) which individual mutations are likely to damage or alter protein function or 2) which genes appear to have significant enrichment for mutations and/or appear to be under selection. A common approach to answer the first question is to assess the evolutionary constraint at each mutated position based on multiple alignments of homologous proteins across many species. Examples of utilities that accomplish this are SIFT (P. C. Ng & Henikoff 2003), ProPhylER (Binkley et al. 2010), CanPredict (Kaminker et al. 2007) and CHASM (Carter et al. 2009; Wong et al. 2011), with the latter two specifically designed to identify driver mutations in cancer. The second question is addressed in multiple ways, often by testing for enrichment of mutations (or certain types of mutations) above that expected by chance, and sometimes by considering the background mutation rate and/or the distribution of mutation types in that tumour. These approaches often employ methods borrowed from molecular evolutionary biology, namely the comparison of synonymous and non-synonymous mutation rates to identify selective pressure.
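As a rough illustration of the comparison of synonymous and non-synonymous mutation rates mentioned above, the following Python sketch computes a per-site rate ratio for a single gene. It is a deliberately simplified stand-in and not the method of any tool or study cited here; the function name `nonsyn_syn_ratio`, its parameters and the example counts are all hypothetical.

```python
def nonsyn_syn_ratio(n_nonsyn, n_syn, sites_nonsyn, sites_syn):
    """Per-site non-synonymous mutation rate divided by the per-site synonymous rate.

    n_nonsyn, n_syn: observed non-synonymous and synonymous mutation counts.
    sites_nonsyn, sites_syn: numbers of sites at which each class of mutation
    could occur in the gene (assumed to be supplied by the caller).
    """
    if sites_nonsyn <= 0 or sites_syn <= 0:
        raise ValueError("site counts must be positive")
    if n_syn == 0:
        raise ValueError("ratio is undefined when no synonymous mutations are observed")
    return (n_nonsyn / sites_nonsyn) / (n_syn / sites_syn)


# Hypothetical gene: 12 non-synonymous mutations over 3000 non-synonymous sites
# and 2 synonymous mutations over 1000 synonymous sites.
ratio = nonsyn_syn_ratio(n_nonsyn=12, n_syn=2, sites_nonsyn=3000, sites_syn=1000)
print(ratio)
```

A ratio near 1 would be what one expects in the absence of selection, whereas a ratio well above 1 would be compatible with positive selection. Published methods additionally account for mutation-type-specific rates and attach a significance estimate, neither of which this sketch attempts.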
Multiple groups have proposed methods for identifying genes that appear to be enriched for mutations or otherwise display evidence of positive Darwinian selection in cancer. Various studies have utilized the MutSig approach to identify what are considered “significantly mutated” genes (Cancer Genome Atlas Research Network 2011; Chapman et al. 2011); however, this algorithm is not publicly available. Another example was described by Greenman et al. (2006) and has been applied in multiple studies (Dalgliesh et al. 2010; Greenman et al. 2007). In essence, the Greenman approach (1) estimates the probability of the observed set of mutations in a given gene distributing, as observed, between synonymous, non-synonymous and truncating sites; and (2) based on the observed mutations, provides independent estimates of the selective pressure acting on that gene to acquire non-synonymous and truncating mutations. If the p-value from (1) is significant, one can conclude that the gene has evidence for Darwinian selection, evidenced by significant deviation from the expected distribution of mutations at synonymous, non-synonymous and nonsense sites. If the selective pressure estimate from (2) for either non-synonymous or nonsense mutations is greater than 1, this is interpreted as evidence for positive selection. This method was utilized in Chapter 3 of this thesis to identify new lymphoma-related genes and differentiate them into potential oncogenes and tumour suppressors.

1.5.6 Conclusions

The identification of the driver mutations and the cancer genes that harbour them is a central aim in the genetic study of cancer. The tools that can detect these events have evolved from low-resolution techniques that can detect gross abnormalities to mid-resolution methods that can identify large copy number alterations, and each of these has led to important insights into many cancers.
Sequence-based surveys for somatic point mutations have now expanded from small sets of candidate genes up to entire exomes, transcriptomes and genomes. RNA-seq data can also enable the identification of both known and novel fusion transcripts in cancer. These techniques can result in expansive lists of mutated genes that require validation and further statistical analysis to reveal the genes likely to harbour driver mutations.

1.6 Lymphoid neoplasms: leukemias and lymphomas

The lymphoid neoplasms are a collection of cancers that originate from B, T or natural killer lymphocytes at varying stages of differentiation and development. They are often divided into leukemias, which populate the blood with malignant cells, and lymphomas, which form localised tumours in secondary lymphoid organs such as lymph nodes and sometimes other tissues. This broad distinction is somewhat artificial, as some leukemias can have solid phase manifestations and likewise some lymphomas can include disseminated (circulating) tumour cells (Harris 2010). Another common distinction within the lymphomas is the separation of classical Hodgkin’s disease from all other lymphomas (i.e. non Hodgkin lymphomas or NHLs) and the plasma cell neoplasms such as multiple myeloma (Harris 2010). Collectively, many of the somatic mutations that occur in these cancers are thought to arise due to inappropriate activity of AID and RAG, enzymes that are responsible for driving two key processes in the normal development of immune cells: somatic hypermutation and rearrangement of the V, D, and J segments of the genes that encode components of the T cell receptor or the B cell receptor (and antibodies) (Casellas et al. 2009). A common feature of lymphoid neoplasms is one or more large-scale rearrangements (often translocations) that result in constitutive expression of a proto-oncogene (Dalla-Favera & Pasqualucci 2010) but can also lead to the interruption or deletion of tumour suppressor loci.
These chromosomal rearrangements often involve loci normally rearranged in the cell of origin of these cancers, such as the T cell receptor loci in T-lineage ALLs (Harrison 2009) and immunoglobulin loci in B-lineage leukemias (Harrison 2009) and lymphomas (Dalla-Favera & Pasqualucci 2010). The molecular nature of the individual affected genes differs across (and within) these diverse malignancies and the list of individual genes whose mutation or deregulation is involved in driving these cancers is continually evolving.

1.6.1 The molecular nature of non Hodgkin lymphomas

There are over 50 malignancies currently classified as NHLs by the World Health Organization (WHO) and many of these diseases are thought to be unrelated at the molecular and genetic level (Harris 2010). Based on incidence, the compendium of NHLs comprises the fifth most commonly diagnosed cancer. The two most common types, namely follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL), together comprise 60% of new B-cell NHL diagnoses each year in North America (Anderson et al. 1998). FL is an indolent and typically incurable disease characterized by clinical and genetic heterogeneity. DLBCL is aggressive and likewise heterogeneous, comprising at least two distinct subtypes that respond differently to standard treatments. The two commonly cited molecular subtypes of DLBCL are termed germinal centre B cell-like (GCB) and activated B cell-like (ABC), and these groupings were identified based on gene expression signatures that are thought to reflect a distinct cell of origin (COO) for these two groups (Alizadeh et al. 2000). Both FL and the GCB subtype of DLBCL derive from germinal centre B cells whereas the ABC variety, which exhibits a more aggressive clinical course, is thought to originate from B cells that have exited, or are poised to exit, the germinal centre (Lenz & Staudt 2010).
Knowledge of the specific genetic events leading to DLBCL and FL was, until recently, limited to the presence of a few recurrent genetic abnormalities resulting from either chromosomal rearrangements or point mutations (Lenz & Staudt 2010). For example, alterations resulting in deregulated expression of the MYC oncogene are observed in between 7 and 14% of cases of DLBCL (Slack & Gascoyne 2011). This and other genetic events have been identified as enriched in only one subtype of DLBCL (Lenz et al. 2008c). A second example is t(14;18)(q32;q21), which results in deregulated expression of the BCL2 oncoprotein and is found more commonly in FL (85-90% of cases) and to a lesser extent in GCB DLBCL (30-40% of cases) (Horsman et al. 2003; Iqbal et al. 2004). Other genetic abnormalities unique to GCB cases include amplification of the c-REL gene and of the miR-17-92 microRNA cluster (Lenz et al. 2008c). In contrast, 24% of ABC DLBCLs harbour structural alterations or inactivating mutations affecting PRDM1, which is involved in differentiation of GCB cells into antibody-secreting plasma cells (Pasqualucci et al. 2006). ABC-specific mutations also affect genes regulating NF-κB signalling (Kato et al. 2009; Compagno et al. 2009; Davis et al. 2010), with TNFAIP3 (A20) and MYD88 (Ngo et al. 2011) reportedly the most abundantly mutated in 24% and 39% of cases respectively. Though pathways such as enhanced NF-κB signalling and suppression of pro-apoptotic signals are known to be involved in lymphoma, the breadth of mutations and genes altering these pathways and the extent to which other pathways are involved is unclear.

1.6.2 Acute lymphoblastic leukemia

The leukemias are the most common class of cancer to affect children and adolescents (Smith et al. 1999) and acute lymphoblastic leukemia (ALL) is the most common leukemia diagnosed in that population.
Although it is curable in up to 80% of cases, relapsed ALL is the most common cause of cancer-related deaths in young adults (Pui et al. 2008). ALL can derive from lymphoid progenitor cells committed to either the T or B lineage. B-lineage ALL is characterized by recurrent chromosomal abnormalities including aneuploidy, chromosomal rearrangements resulting in fusion transcripts such as ETV6-RUNX1, BCR-ABL1, TCF3-PBX1, and various rearrangements involving MLL and CRLF2 (Pui et al. 2008; Mullighan et al. 2009a; Harvey et al. 2010; Russell et al. 2009a; Yoda et al. 2010). Genome-wide analyses have identified mutations targeting transcription regulators of lymphoid development (PAX5, EBF1 and IKZF1) in over 60% of B-lineage ALL patients (Mullighan et al. 2007; Kuiper et al. 2007). Furthermore, IKZF1 inactivation is a hallmark of BCR-ABL1 ALL (Mullighan et al. 2008; Iacobucci et al. 2009), and is also associated with poor outcome in BCR-ABL1 negative ALL (Mullighan et al. 2009b; Boer et al. 2009). Notably, IKZF1-mutated BCR-ABL1 negative cases commonly exhibit a gene expression profile similar to that of BCR-ABL1 ALL (termed Philadelphia-like or Ph-like ALL), which suggests the presence of unidentified mutations activating kinase signalling pathways (Mullighan et al. 2009b; Boer et al. 2009). Additional genomic profiling has identified JAK2 mutations in approximately 30% of Ph-like cases (Mullighan et al. 2009c) and rearrangement of CRLF2 in up to 50% (Mullighan et al. 2009a; Harvey et al. 2010; Russell et al. 2009a). However, approximately 50% of Ph-like cases lack known chromosomal abnormalities or mutations known to drive kinase signalling, suggesting that additional mutations responsible for the Ph-like phenotype have yet to be discovered.
1.7 Thesis objectives and chapter overview

Improved understanding of the molecular drivers of a cancer can enable the discovery of important prognosticators and is essential to facilitate the future development of targeted therapeutics. Many molecular and prognostic features of NHL and ALL have already been elucidated in the form of gene expression signatures, recurrent translocations and point mutations affecting various genes. The general aim of this thesis was to develop analytical approaches that would leverage data from emerging sequencing technologies, enabling us to extend current understanding of the mutations that drive these malignancies. I hypothesized that proper analysis of data from sequence-based global profiling experiments including RNA-seq, exome sequencing and WGS would reveal additional targets of somatic point mutation and new fusion transcripts in these cancers. I also hypothesized that if new sets of mutated genes were identified, they might offer insights into the molecular underpinnings of these cancers and potentially suggest novel therapeutic options.

I first analyzed a small amount of WGS and RNA-seq data to identify somatic point mutations in FLs and DLBCLs and Chapter 2 describes the emergence of EZH2 as a clear novel target of somatic mutation from these cases. In a subsequent publication, we demonstrated the recurrent EZH2 mutation to confer the enzyme with a gain-of-function and this discovery is also described in that chapter. Chapter 3 describes the subsequent sequencing of the genomes, exomes and transcriptomes from additional DLBCL patients, which revealed hundreds of novel targets of somatic mutation. Using the candidate mutations I identified from the larger set of NHLs profiled by RNA-seq, I determined the actively expressed genes that were commonly mutated, ultimately yielding a set of 109 recurrently mutated genes.
I demonstrate that 26 of these genes show significant evidence for selection, indicating they are (in many cases novel) cancer genes. I also report in this chapter that some of these genes were mutated more commonly in certain molecular subtypes of DLBCL. The chapter focuses on a striking trend in commonly mutated genes being those involved in histone modification, one mechanism of epigenetic gene regulation, and highlights the discovery that mutations in MLL2 are possibly as common as the t(14;18) translocation in FL. Chapter 4 describes the application of similar methodologies to identify novel driver mutations in paediatric ALL. As oncogenic fusion proteins are known to commonly drive ALL and other leukemias, the focus of this study was the discovery of fusion transcripts. I describe novel fusion transcripts that are predicted to result in deregulated activity of the kinases ABL1 or JAK2 and other fusions that may also deregulate kinase or cytokine signalling in ALL. As discussed in that chapter, these transcripts indicate potential targeted treatment options in these patients who would otherwise be expected to exhibit an aggressive clinical course.

Figure 1-1: Overview of raw sequence data from Illumina sequencer
(a) The raw sequence data from an Illumina sequencer consists of sets of four images, one representing each of the four nucleotides. After laser excitation, a photograph is taken to capture the excitation of a single fluorophore. This is repeated for each of the fluorophore wavelengths. The images below are examples from a single cycle and are pseudo-coloured to represent the four distinct fluorophores used to label the four nucleotides. Each dot represents a cluster of identical DNA templates. Dark areas are positions on the flow cell surface where that nucleotide was not incorporated and light areas are the clusters in which incorporation occurred.
(b) After the run is complete, the images from each cycle are analyzed and base calling is achieved by identifying the strongest of the four signals from each cluster. A Q value represents the confidence of the base call. In the output shown below (in fastq format) each read is accompanied by a character string of equal length in which these Q values are encoded as ASCII characters. In the standard (“Sanger”) fastq format, the score each character corresponds to is its decimal value in the ASCII table less 33 (to account for unprintable characters). For example, converting “:” to a decimal value can be done with the following Perl code (which prints 25, the base quality): print(ord(":")-33)

Figure 1-2: Flow of data in a typical analysis paradigm
Two example paths of data flow from a sequence (top left) to either an alignment- or assembly-based pipeline. In an alignment-based pipeline, the BAM file acts as a central format that can be input into various downstream tools (here shown as the input for a variant caller and a visualization tool). For assemblies, post-processing is required, which typically involves mapping the contigs to a genome or transcriptome. Further post-processing can then be used to identify discrepancies such as indels, SNVs or larger aberrations.

Figure 1-3: Detecting single nucleotide variants from aligned sequence data
An example alignment of nine reads is shown to indicate how the inputs for SNVMix1 and SNVMix2 differ. (a) For SNVMix1, the input is the allelic counts. These counts are the two most common base calls at a given position in the genome. Here, four positions have more than one allele due to non-reference base calls. Two of these are likely base calling errors (the cases where a single “a” is the only non-reference base), and the other two are SNVs. For the two SNVs, one has allele counts {1,6} and the other {3,3}.
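The allele counting described in (a) can be sketched in Python as follows. This is only an illustration under stated assumptions: the reads and reference are transcribed from this figure, the (offset, sequence) representation and the function names `allele_counts` and `phred_scores` are hypothetical, and neither quality filtering nor any SNVMix modelling is performed. The helper `phred_scores` mirrors the Sanger fastq decoding described for Figure 1-1.

```python
from collections import Counter

# Reference sequence and reads transcribed from the figure; each read is stored
# as (0-based start offset on the reference, read sequence).
REFERENCE = "aattcaggaccaacacgacgggaagacaagttcatgtacttt"
READS = [
    (0, "aattcaggaccca"),
    (0, "aattcaggacccacacga"),
    (0, "aattcaggacccacacgacgggaagacaa"),
    (1, "attcaggacaaacacgaagggaagacaagttcatgtacttt"),
    (4, "caggacccacacgacgggtagacaagttcatgtacttt"),
    (8, "acccacacgacgggtagacaagttcatgtacttt"),
    (8, "acccacacgacgggtagacaagttcatgtacttt"),
    (16, "gacgggaagacaagttcatgtacttt"),
    (33, "atgtacttt"),
]


def phred_scores(quality_string, offset=33):
    """Decode a Sanger-encoded fastq quality string into integer Q values."""
    return [ord(ch) - offset for ch in quality_string]


def allele_counts(reference, reads):
    """Return one (reference_count, top_non_reference_count) pair per position."""
    piles = [Counter() for _ in reference]
    for start, seq in reads:
        for i, base in enumerate(seq):
            piles[start + i][base] += 1
    counts = []
    for ref_base, pile in zip(reference, piles):
        non_ref = [n for base, n in pile.items() if base != ref_base]
        counts.append((pile[ref_base], max(non_ref, default=0)))
    return counts


counts = allele_counts(REFERENCE, READS)
# 1-based positions at which at least one non-reference base call was observed
variant_sites = [(pos + 1, pair) for pos, pair in enumerate(counts) if pair[1] > 0]
print(variant_sites)
print(phred_scores(":"))
```

With these inputs the sketch reports four positions carrying non-reference calls, two supported by a single read and two matching the {1,6} and {3,3} counts given in the caption. A real pipeline would first discard low-quality bases and alignments, which is where decoded Q values would be used.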
Typically, hard thresholds on Q values and alignment qualities are used to remove low-quality base calls and alignments before calculating the allele counts. The SNVMix1 model uses these values and a Bayesian mixture model to assign probabilities for each of the three possible states at each position, namely {aa,ab,bb}. (b) SNVMix2 utilizes a more complicated model that considers all base qualities at a given position and the mapping qualities of all the aligned reads. Here, mapping qualities are encoded in grey-scale and base qualities from dark (low) to light green (high). A variant of this figure appeared in the study by Goya et al. (2010).

Aligned reads:
aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt
Allelic counts (A):
344455557761766677566636666665555666666666
Reference seq:
aattcaggaccaacacgacgggaagacaagttcatgtacttt
Allelic counts (B):
000000000016000000100030000000000000000000
Genotype labels shown in the figure: ab, bb, ab

Table 1-1: Comparing throughput, read length and error rate of sequencing platforms
Company | Platform | Read length (typical-max) | Raw base accuracy % (average) | Total yield per run (typical number of reads) | Approx. total sequence yield per run | Run time | Paired reads?
Illumina | GAIIx | 75-150 | 99.96 | 6.4x10^8 | 9.6x10^10 | 7d | yes
Illumina | HiSeq 2000 | 100-150 | 99.95 | 3x10^9 | 6.0x10^11 | 12d | yes
Illumina | MiSeq* | 35-150 | 99.9 | 6.8x10^6 | 1.36x10^9 | 4-27h | yes
Life Technologies | SOLiD 3* | 50 | 99.94 | 4.5x10^8 | 4.5x10^10 | 12-14d | yes
Life Technologies | SOLiD 4* | 50 | 99.94 | 7x10^8 | 7x10^10 | 12-16d | yes
Life Technologies | Ion Torrent | 92 | 97.3 | 2-3x10^5 | 2.76x10^7 | 2h | no
Pacific Biosciences | RS* | 1500** | 85 | 2.5x10^4 | 3.75x10^7 | <1h | no
454 | FLX* | 400 | 99 | 1x10^6 | 4x10^8 | 10h | no
454 | Junior* | 400 | 99 | 1x10^5 | 4x10^7 | 10h | no
*Rows indicated are based on specifications provided by the company. Otherwise, data was computed from recent sequencing runs performed at the BC Genome Sciences Centre.
**Average read length of a run for sequencers that produce reads of non-uniform lengths.

Table 1-2: Summary of cancer genomes sequenced and non-silent somatic mutations
Reference | Cancer source/type | Samples | Non-silent mutations (total or mean)
Ley et al. 2008 | AML | primary | 8
Mardis et al. 2009 | AML | primary | 7
Shah et al. 2009 | breast | primary, relapse | 5, 32
Pleasance et al. 2010 | lung/cell line | cell line | 98
Pleasance et al. 2010 | skin/cell line | cell line | 187
Jones et al. 2010 | tongue | primary, relapse | 4, 12
Ding et al. 2010 | breast | primary, xenograft | 29
Lee et al. 2010 | lung | primary | 302
Totoki et al. 2011 | liver | primary | 63
Berger et al. 2011 | prostate | primary, 7 samples | 10
Chapman et al. 2011 | multiple myeloma | untreated and treated, 23 samples | 33.4
Morin et al. unpubl. (Chapter 4) | ALL | relapse, 2 samples | 14
Morin et al. 2011 | DLBCL | primary, 11 samples | 51.5

Chapter 2: Somatic mutations altering EZH2 (Y641) in follicular and diffuse large B-cell lymphomas of germinal-center origin

2.1 Introduction

Follicular lymphoma (FL) and the GCB subtype of diffuse large B-cell lymphoma (DLBCL) derive from germinal centre B cells.
Targeted resequencing studies have revealed mutations in various genes encoding proteins in the NF-κB pathway that contribute to the activated B-cell (ABC) DLBCL subtype, but thus far few GCB-specific mutations have been identified. Here we report recurrent somatic mutations affecting the polycomb-group oncogene EZH2, which encodes a histone methyltransferase responsible for trimethylating lysine 27 of histone H3 (H3K27). After the recent discovery of mutations in KDM6A (UTX), which encodes the histone H3K27me3 demethylase UTX, in several cancer types (van Haaften et al. 2009), EZH2 is the second histone methyltransferase gene found to be mutated in cancer. These mutations, which result in the replacement of a single tyrosine in the SET domain of the EZH2 protein (Y641), occur in 21.7% of GCB DLBCLs and 7.2% of FLs and are absent from ABC DLBCLs. Our data are consistent with the notion that EZH2 proteins with mutant Y641 have reduced enzymatic activity on un-methylated H3K27 but enhanced activity on mono- and di-methylated H3K27 substrate. We show that in a heterozygous state, the mutated enzyme cooperates with the wild type to produce an overall gain-of-function phenotype both in vitro and in vivo.

2.2 Results

2.2.1 Discovery of the Y641 mutation with genome sequencing

Advances in DNA sequencing technology have recently enabled the characterization of genomes and transcriptomes at sufficient resolution for identification of somatic point mutations (Morin et al. 2008a; Shah et al. 2009a). To develop new insight into previously unidentified mutations potentially contributing to B-cell non-Hodgkin lymphomas (NHLs), we used Illumina technology to sequence genomic DNA and RNA purified from a malignant lymph node biopsy from “FL patient A”. This patient was shown by immunohistochemistry to have a grade 1 FL that co-expressed CD10, BCL2 and BCL6.
This sample was chosen for sequence analysis because it had an unusually simple karyotype, lacking the translocation t(14;18)(q32;q21) or other large-scale alterations. We analyzed the exon sequences of this tumour for mutations in both the genome (whole-genome sequencing, WGS) and the transcriptome (RNA-seq) (Methods). Matched constitutional DNA from the patient was sequenced to reveal ‘germ line’ sequence variants (Methods). We produced 25.6 aligned gigabases (Gb) from the tumour genomic library, yielding 9.47-fold redundant base coverage on average, and an additional 2.2 Gb of aligned sequence from the RNA-seq library, yielding 18.86-fold redundant base coverage on average within exons (Table 2-1; Methods). We focused our analysis on novel changes predicted to affect protein-coding sequence (Methods; Appendix A). We confirmed a subset of these to be somatic by resequencing both tumour and matched constitutional DNA. Among these somatic variants we found a mutation affecting exon 15 of the EZH2 gene, which encodes a portion of the EZH2 SET domain. EZH2 is the catalytic component of the PRC2 complex, which is responsible for adding methyl groups to H3K27 (Kirmizis et al. 2004), thereby repressing transcription at loci associated with histones bearing this mark. This mutation is predicted to result in the replacement of Y641 (amino acid 641 in Q15910 and 646 in NP_004447) with a histidine. We also confirmed that the mutation was heterozygous in FL sample A.

2.2.2 Determining recurrence of the mutation with RNA-seq

To determine recurrence of mutations in B-cell lymphomas, we first used RNA-seq to sequence the transcriptomes of 31 samples from individuals with DLBCL and seven DLBCL-derived cell lines. On the basis of cell-of-origin (COO) expression classification (Wright et al. 2003), the primary lymphoma samples were classified as belonging to either the ABC (n = 12), GCB (n = 15) or unclassifiable subtypes (n = 2) (Methods).
We identified mutations resulting in Y641 substitutions in four of these 31 samples and in five of the cell lines (Table 2-2). No other mutations in EZH2 were detected and coverage of EZH2 in the RNA-seq libraries was consistently high, ranging in these 31 samples from 5.3-fold to 187-fold redundant coverage (median 48.7-fold), suggesting we had sufficient power to discover all mutations if present. The striking recurrence of these mutations in EZH2 and paucity of mutations elsewhere in the gene suggested that codon 641 is a mutation hot spot in DLBCL and its mutation is a common feature of this disease. Notably, despite a median base coverage depth of 11.4-fold in the KDM6A (UTX) locus, we found no evidence for mutations within KDM6A in these libraries.

We next determined the prevalence of Y641-affecting mutations in both FL and DLBCL tumours by Sanger sequencing the exon containing codon 641 in 251 FL samples, of which 30 had matched DLBCL samples taken at histological transformation, and 320 primary DLBCL samples (including the original 31 samples from affected individuals) (Appendix A). This revealed a total of 18 FL and 35 DLBCL samples with heterozygous mutations affecting Y641 (Table 2-3). Of note, all such mutations detected by RNA-seq showed clear evidence for expression of both alleles. To search for additional mutated sites in this gene, we also sequenced all exons of EZH2 in tumour DNA from 24 FL samples in addition to FL sample A and found only one example of an EZH2 mutation not affecting Y641 (Figure 2-1). This mutation, affecting N635, was found in conjunction with a Y641 mutation, and we confirmed that the two mutations were in a cis orientation. We were able to confirm that these mutations were somatic in the seven individuals with FL (including ‘patient A’) and the two with DLBCL from whom constitutional DNA was available.
To exclude the possibility that such mutations can also occur in non-malignant germinal-center B cells or in other types of lymphoma, we sequenced this region of exon 15 in eight CD77+-enriched centroblast samples from reactive tonsils and 23 reactive lymph nodes (a source of normal B cells) and 80 samples of other lymphoma types using both Sanger sequencing and targeted ultra-deep Illumina resequencing (Methods; Appendix A). We also sequenced RNA-seq libraries generated from two additional normal centroblast samples. Consistent with our hypothesis that these mutations are unique to malignant B cells, none of these samples showed evidence for mutations affecting Y641 or elsewhere within the sequenced region (Table 2-3). Notably, all of the DLBCL samples for which COO was known and that were also positive for EZH2 mutations were of the GCB subtype and not the ABC subtype. This revealed a significant enrichment of Y641-altering mutations among the GCB subtype of DLBCLs (Table 2-3; n = 18/83 GCB versus 0/42 ABC; P = 0.00168, two-tailed Fisher’s exact test).

We next assessed the effect various Y641 mutations would have on the structure, and potentially the function, of the EZH2 SET domain by generating a computational model using the crystal structure of the highly conserved MLL1 SET domain (Southall et al. 2009) as the structural template (Methods). Our model indicated that Y641 likely interacts with the lysine 27 side chain of the H3 histone tail, as has been suggested in other SET domain proteins (Dillon et al. 2005). Though no EZH2 SET-domain mutations have been reported in humans, detailed mutant phenotypes have been described in Drosophila melanogaster. A mutation altering the tyrosine orthologous to EZH2 Y641 has been characterized in the Drosophila orthologue E(z) in an allele known as ‘E(z)1’. Interestingly, the phenotype of this mutation is distinct from loss-of-function mutations in this gene (R.
Jones & Gelbart 1990) or loss-of-function mutations in other members of the polycomb complex such as Su(z)12 (Persson 1976). In a recent study, polycomb complexes containing the mutant Drosophila E(z)1 protein were found to be incapable of trimethylating H3K27 in vitro (Joshi et al. 2008).

2.2.3 Biochemical analysis of Y641 mutation in vitro

We sought to directly determine whether EZH2 with mutant Y641 affects the catalytic activity of PRC2 in a cell-free methylation assay. Individual clones, each containing one of the four most frequently detected mutations (Figure 2-1), were first expressed along with the other components of PRC2. PRC2 complexes were purified and tested in vitro for H3K27 methylation activity using ELISA and an antibody specific for H3K27me3 (Methods). The results (Figure 2-2) indicated that, compared to wild type EZH2, all four Y641 mutants consistently demonstrated a marked reduction (~7-fold) in their ability to trimethylate the un-methylated H3K27 peptide. This biochemical result suggested that the four predominant Y641 variants observed in our sequencing study could confer reduced ability of PRC2 complexes to trimethylate H3K27 in vivo. Of note, neither this experiment nor the aforementioned assay of Drosophila mutant protein accounted for the possible role of the wild type enzyme. Given our observation that the mutations were always heterozygous and both alleles were expressed, we considered it possible that the wild type protein may be important in the activity of the Y641 mutation.

2.2.4 Further characterization of Y641 mutation in vivo and in vitro

To determine whether altered H3K27 methylation levels could be detected in vivo, we compared the steady-state levels of each H3K27 methylation state between Y641-mutant and wild type DLBCL-derived cell lines.
Interestingly, in apparent contrast to the biochemical result, this experiment revealed a consistent increase in H3K27me3 in each of the mutant-bearing cell lines (Figure 3-3). Similar results were obtained when comparing the patient samples known to contain Y641 mutations to those wild type for the enzyme (Figure 3-3). Further, transfection of two separate Y641 mutants into HEK293T cells revealed consistent increases in H3K27me3 relative to those transfected with wild type enzyme. Taken together, these results suggested to us that the Y641 mutation acts as a gain-of-function, likely in cooperation with the wild type enzyme or another protein, both of which were absent from the first set of biochemical experiments.

2.2.5 Substrate specificity of SET domain proteins and the Phe/Tyr switch

Previous studies have demonstrated that single amino acid differences in SET domains can alter the substrate specificity of methyltransferases. For example, alignment of all SET-domain proteins reveals a single site (position 1205 in human G9a protein), which is nearly always represented by either a phenylalanine or a tyrosine (Collins et al. 2005). Wild type G9a protein efficiently catalyzes the mono- and dimethylation of H3K9 and is relatively inefficient at catalyzing the third methylation step to produce H3K9me3. Collins et al. hypothesized that this residue is a key determinant of substrate specificity and tested this by mutating the phenylalanine in G9a to tyrosine. Strikingly, the mutant G9a enzyme was unaffected in its ability to monomethylate H3K9 but no longer able to catalyze the dimethylation or trimethylation reactions. Zhang et al. (2003) had previously reported a very similar observation by mutating the homologous phenylalanine of DIM-5, another H3K9-specific methyltransferase. This observation was thought to be generalizable to all histone methyltransferases and was termed the “Phe/Tyr Switch” (Collins et al. 2005).
2.2.6 Measuring altered substrate specificity of Y641 mutant EZH2 protein

The Phe/Tyr switch site is not homologous to Y641 in EZH2, so it was not immediately clear to us whether the mutation reported here could alter substrate specificity in a similar way. A separate group has shown a similar phenomenon in the H3K4 methyltransferase Set7/9, which is able to catalyze only the monomethylation reaction (Xiao et al. 2003). In this study, Y245 (homologous to Y641 in EZH2) was mutated to alanine and tested for activity in vitro. Similar to the Phe/Tyr switch phenomenon, mutation of this tyrosine, which is invariant across all SET domains, produced an enzyme with altered substrate specificity. In this case, the enzyme became less efficient at catalyzing the monomethylation reaction but, unlike wild type enzyme, was able to perform di- and trimethylation of monomethylated substrate. This observation led us to hypothesize that Y641 mutations may alter the substrate specificity of EZH2 and, if this altered specificity mirrored that described in the cited study, would be consistent with our observation of reduced activity on un-methylated substrate.

To test the hypothesis that EZH2 Y641 mutations have a reduced ability to methylate H3K27me0/1 substrates but increased ability to methylate H3K27me2 substrates, we next studied the function of the enzyme in vitro with an altered experimental design. Because the relative catalytic efficiency of the EZH2/PRC2 complex for each methyl addition step is poorly understood, we again carried out a 3H methyl incorporation assay using recombinant PRC2 complexes containing EED, SUZ12, AEBP2, RbAp48, and wild type or EZH2 Y641 mutants, as previously described. Rather than providing only a single substrate, we separately assayed the activity of the variant proteins by measuring 3H-methyl incorporation, using unmodified, chemically synthesized monomethylated and dimethylated K27 substrate peptides (Methods).
As previously observed, EZH2 Y641 mutants showed strongly reduced activity on un-methylated or monomethylated peptides (Yap et al. 2011). Strikingly, these mutants showed markedly enhanced activity on dimethylated peptides compared with PRC2 complexes containing wild type EZH2 (Yap et al. 2011). We estimated the specific activity of the protein complexes and observed that the EZH2 Y641F/N-containing PRC2 complexes are approximately twice as active on dimethylated peptides compared with wild type. These data indicated that EZH2 Y641F/N mutations alter the substrate specificity in favour of dimethylated peptides compared with PRC2 complexes containing wild type EZH2.

2.3 Discussion

EZH2 is typically considered an oncogene, but to date it has only been shown to drive cancer by increased mRNA/protein abundance rather than mutation. Notably, these diseases in which increased EZH2 mRNA correlates with cancer progression derive from tissues in which EZH2 expression is normally low or undetectable, such as breast and prostate (Kleer et al. 2003; Varambally et al. 2002). However, EZH2 mRNA is known to be abundant in normal germinal centre B cells (Raaphorst et al. 2000), and a conditional knockout of the mouse EZH2 orthologue indicated that EZH2 function is required for early B-cell development, including rearrangement of the immunoglobulin heavy chain (IGH) locus (Su et al. 2003). Given the apparent requirement for EZH2 in germinal centre B cells, it is perhaps not surprising that the modality by which it contributes to lymphomagenesis is distinct from the apparently straightforward increases in EZH2 mRNA abundance observed in breast and prostate cancers.
Expression of both EZH2 and BMI1 (the latter encoding the catalytic component of PRC1) has been linked to the degree of malignancy of B-cell NHLs, and perturbations in the balance of the quantities of these two proteins (and more likely, the epigenetic marks they produce) have been suggested as an early event in lymphomagenesis (van Kemenade et al. 2001).

Our findings suggest that Y641-altering mutations of EZH2, and possibly an enhancement of H3K27 trimethylation, are involved in the pathogenesis of GCB lymphomas. The well-studied phenylalanine-tyrosine switch (Couture et al. 2008) site is known to regulate the number of methyl groups that a SET domain–containing protein can add without compromising its overall catalytic activity. Although the Y641 residue is distinct from the phenylalanine-tyrosine switch site described by Couture and others, this mutation appears to have an analogous effect on EZH2 activity. Specifically, our experiments support that these mutations may alter the product specificity of EZH2, in essence acting as an oncogenic gain-of-function mutation. Importantly, a separate group also demonstrated that EZH2 Y641 mutants catalyze the di- and trimethylation reactions with increased efficiency using a complementary approach (Sneeringer et al. 2010), leading to the same model of cooperative action between the mutant and wild type enzymes.

Our finding is particularly timely in light of recent studies demonstrating enhanced DNA methylation at PRC2 targets in lymphoma as compared to normal B cells (O'Riain et al. 2009; Martin-Subero et al. 2009). H3K27 trimethylation via PRC2 can be a precursor to DNA methylation and, in some cases, DNA methyltransferase may be physically coupled with PRC2 (Viré et al. 2006). Hence, Y641-altering mutations may contribute to the differential DNA methylation that has been observed at polycomb targets in FL (O'Riain et al. 2009) and DLBCL (Martin-Subero et al. 2009).
A variety of potential targets of EZH2-mediated silencing have been suggested, including two tumour suppressor genes known to be commonly deleted in DLBCL, namely CDKN2A and CDKN2B (Velichutina et al. 2010). More generally, in embryonic stem cells, EZH2 is known to silence the expression of many genes involved in differentiation. Thus, it could be that a gain-of-function of this gene further silences some of the same pro-differentiation genes in germinal centre B-cells, blocking further stages of differentiation.

In conclusion, we have identified recurrent somatic mutations affecting a single tyrosine in the EZH2 SET domain and have associated these with FL and DLBCL cases of only the GCB subtype. At the time of publication, these mutations were among the most frequent genetic events observed in GCB malignancies after the t(14;18)(q32;q21) translocation. The altered tyrosine corresponds to a key residue in the active site of the EZH2 protein and, consistent with a separate study, these mutations result in altered substrate specificity and cooperative gain in overall efficiency in producing H3K27me3. This, along with the fact that all lymphomas with mutations in EZH2 seem to have a mutation affecting Y641, sets EZH2 apart from the pattern of mutational inactivation seen in the case of KDM6A (UTX), which seems to behave as a tumour suppressor gene (van Haaften et al. 2009). Aside from the recurrence of inactivating mutants in UTX, EZH2 is the only protein affecting H3K27 methylation status to be identified as a target of somatic mutation in cancer, and it is the first in which recurrent mutations of the SET domain appear to be restricted to a specific lymphoma subtype.

2.4 Methods

2.4.1 Sample acquisition

The initial patient (FL patient A) had two FL samples utilized in this study. Both had ~70% tumour content based on the co-expression of CD19 and lambda by flow cytometry and were “fresh” frozen at source.
The first was taken at the time of diagnosis and was used for RNA-seq and WGS. The second was acquired at the time of progression and was flow sorted to >95% purity. It was analyzed by karyotype and fluorescence in situ hybridization (FISH) for the presence of a translocation t(14;18) using the dual color, dual fusion probe. In addition, it was analyzed for copy number alterations by array comparative genomic hybridization (aCGH) and fingerprint profiling (Krzywinski et al. 2007) and for loss of heterozygosity (LOH) by Affymetrix 500K array. For DLBCL samples, only fresh frozen biopsies having >50% tumour content by flow cytometry were used for RNA-seq. All other specimens used in this study were obtained at the time of diagnosis and were derived from archived fresh frozen tissue or frozen tumour cell suspensions. Germ line DNA was obtained from peripheral blood in live patients and from CD19-negative sorted tumour cell suspensions using Miltenyi magnetic beads (Miltenyi Biotec, Bergisch Gladbach, Germany) for deceased patients. All lymphoma samples were diagnosed according to the World Health Organization criteria of 2008 by an expert hematopathologist (R.D.G.). Benign specimens included reactive pediatric tonsils or purified CD77-positive centroblasts sorted from reactive tonsils using Miltenyi beads (Miltenyi Biotec, Bergisch Gladbach, Germany). The tumour specimens analyzed in this study were collected as part of a research project approved by the University of British Columbia-British Columbia Cancer Agency Research Ethics Board (BCCA REB) and are in accordance with the Declaration of Helsinki. Our protocols stipulate that genome-scale data will not be released into the public domain but can be made available via a tiered-access mechanism to named investigators of institutions agreeing by a materials transfer agreement to honour the same ethical and privacy principles required by the BCCA REB.
2.4.2 Preparation and sequencing of Illumina libraries

RNA was extracted from a total lymph node section using AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA) and DNaseI treated. For RNA-seq analysis, we used a modified method similar to the protocol we have previously described (Morin et al. 2008a). Briefly, PolyA+ RNA was purified using the MACS mRNA isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany), from 5-10 µg of DNaseI-treated total RNA as per the manufacturer’s instructions. Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina, Hayward, CA, USA). Genomic DNA for construction of whole genome sequencing (WGS) libraries was prepared from the same biopsy material (FL patient A) using the Qiagen AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA). DNA quality was assessed by spectrophotometry (260/280 and 260/230) and gel electrophoresis before library construction. Depending on the availability of DNA, between 2 and 10 µg were used in WGS library construction. Briefly, DNA was sheared for 10 min using a Sonic Dismembrator 550 with a power setting of “7” in pulses of 30 s interspersed with 30 s of cooling (Cup Horn, Fisher Scientific, Ottawa, Ontario, Canada), and analyzed on 8% PAGE gels. The 200-300 bp DNA fraction was excised and eluted from the gel slice overnight at 4°C in 300 µl of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate), and was purified using a Spin-X Filter Tube (Fisher Scientific), and by ethanol precipitation. WGS libraries were prepared using a modified paired-end protocol supplied by Illumina Inc. (Illumina, Hayward, USA).
This involved DNA end-repair and formation of 3’ adenosine overhangs using Klenow fragment (3’ to 5’ exo minus) and ligation to Illumina PE adapters (with 5’ overhangs). Adapter-ligated products were purified on QIAquick spin columns (Qiagen, Valencia, CA, USA) and PCR-amplified using Phusion DNA polymerase (NEB, Ipswich, MA, USA) and 10 cycles with the PE primer 1.0 and 2.0 (Illumina). PCR products of the desired size range were purified from adapter ligation artifacts using 8% PAGE gels. DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay (Agilent, Santa Clara CA, USA) and Nanodrop 7500 spectrophotometer (Nanodrop, Wilmington, DE, USA) and DNA was subsequently diluted to 10 nM. The final concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). For sequencing, clusters were generated on the Illumina cluster stations using v1 cluster reagents. Paired-end reads were generated using v3 sequencing reagents on the Illumina GAii platform following the manufacturer’s instructions. Image analysis, base-calling and error calibration were performed using v1.0 of Illumina’s Genome analysis pipeline. Paired-end RNA-seq and WGS libraries were sequenced to 36, 50 or 76 cycles. The WGS library comprised a mixture of 13 flow cell lanes of 36 nt reads, 16 lanes of 50 nt reads and 6 lanes of 76 nt reads.

2.4.3 Targeted ultra-deep resequencing using read indexing

This procedure describes the individual PCR amplification of EZH2 exon 15, indexing of individual amplicons, and subsequent pooling and sequencing. Individual indexes allow the de-convolution of reads deriving from individual samples in multiplexed libraries such that many samples can be concurrently sequenced in the same library.
Genomic DNA from individual samples was normalized to 5 ng/µL and 5 ng of each sample was PCR amplified using Phusion DNA polymerase (New England Biolabs, Ipswich, MA, USA) in 96-well format using gene specific primers (Primer EZH2_015R3 sequence: 5’-TCTCAGCAGCTTTCACGTTG-3’, Primer EZH2_015F sequence: 5’-CAGGTTATCAGTGCCTTACCTCTCC-3’) to produce ~300 bp amplicons. Hot Start PCR conditions: 98°C, 60 s, then 36 cycles (98°C, 10 s; 60°C, 15 s; 72°C, 30 s), final extension 72°C, 5 min. Amplicons were cleaned using AMPure beads (Beckman Coulter, CA, USA) on a Biomek FX (Beckman Coulter, Fullerton, CA, USA) and eluted with 40 µL elution buffer EB (QIAGEN, USA). Cleaned amplicons were QC tested on a 1.2% SeaKem LE Agarose gel (Cambrex, East Rutherford, NJ, USA) using 1X TAE buffer. Bands were quantified by the Qubit™ Fluorometer (Invitrogen, Carlsbad, CA, USA) high sensitivity assay. Approximately 500 ng of each amplicon DNA sample was then phosphorylated and end-repaired in 50 µL reactions at room temperature for 30 min (T4 DNA Pol 5U, DNA Pol I (Klenow) 1U, T4 PNK 100U, dNTP mix 0.4 mM (Invitrogen)). End-repair reactions were cleaned using AMPure beads and dATP was added to the 3’-ends using Klenow (exo-) 5U and 0.2 mM dATP in 1X Klenow Buffer (Invitrogen) with 30-min incubation at 37°C in a Tetrad thermal cycler (MJ Research, USA). DNA was again cleaned on AMPure beads using a Biomek FX. Adapter ligation (10:1 ratio) was completed with 0.03 µM Adapter (Multiplexing Adapter 1: 5’-GATCGGAAGAGCACACGTCT-3’, Multiplexing Adapter 2: 5’-ACACTCTTTCCCTACACGCTCTTCCGATCT-3’), 100 ng DNA, T4 DNA Ligase 5U, 0.2 mM ATP, 1X T4 DNA Ligase Buffer (Invitrogen) for 30 min at room temperature. Adapter-ligated DNA was cleaned using AMPure beads on a Biomek FX. A selection of DNA samples was quantified on a Qubit™ (Invitrogen).
Phusion DNA polymerase, 15-cycle indexing enrichment PCR was performed using Primer 1.0 sequence: 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’ and Primer 2.0 sequence: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3’ (IDT, USA) and 96 custom indexing primers. The PCR program was as follows: 98°C for 60 s followed by 15 cycles of 98°C, 10 s, 65°C, 15 s, 72°C, 30 s. The PCR products were cleaned using AMPure beads and eluted in 40 µL elution buffer EB (QIAGEN, USA). Quality of product was assessed by QC gels: 1.75% SeaKem LE agarose 1X TAE, (0.2 µL of every amplicon) and on Bioanalyzer-1000 (Agilent Technologies, Santa Clara, CA, USA). All 96 ~400 bp amplicons from each plate were then pooled (15 µL of each well) into a separate 1.5 mL microfuge tube. Hence, one tube represents a plate of 96 pooled and indexed PCR products from 96 distinct DNA templates. The 400 bp DNA size fraction was purified using 8% PAGE gels (1X TAE) and eluted from the gel slice overnight at 4°C in 400 µl of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate). Gel pieces were filtered using a Spin-X Filter Tube (Fisher Scientific, Pittsburgh, PA, USA) and DNA was precipitated using ethanol. DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay (Agilent Technologies, Santa Clara CA, USA) and DNA was subsequently diluted to 10 nM. The final concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit™ fluorometer (Invitrogen). An individual library was constructed from each indexed sample (comprising amplicons from up to 96 distinct template DNAs). Each of these libraries was sequenced on a single flowcell lane.

2.4.4 SNV analysis of tumour DNA and RNA sequence

All reads were aligned to the human reference genome (hg18) or (for RNA-seq) to a genome file that has been augmented with a set of all exon-exon junction sequences using the MAQ aligner v0.7.1 (Li et al. 2008).
Candidate single nucleotide variants (SNV) were identified in the aligned genomic sequence reads and the transcriptome (RNA-seq) reads using an approach similar to one we have previously described (Shah et al. 2009a). One key difference in our variant calling in this study is the application of a Bayesian SNV identification algorithm (“SNVmix”; version 0.11.7; http://compbio.bccrc.ca/?page_id=204) (Shah et al. 2009b). This approach is able to identify SNVs with a minimum coverage of 2 high quality (Q20) bases. All sites assessed as being polymorphisms (SNPs) were disregarded, including variants matching a position in dbSNP or the personal genomes of Venter (Levy et al. 2007), Watson (Wheeler et al. 2008), the anonymous Asian (J. Wang et al. 2008) and Yoruban (Bentley et al. 2008; McKernan et al. 2009) individuals. Additionally, all candidate mutations also found in the genomic sequence from this patient’s germ line DNA were ignored. For the targeted re-sequencing experiment, coverage was generally greater than 1000x read depth at codon 641. Hence, we used all unambiguously mapped reads spanning this site to determine the percentage of reads with a high quality mismatch (Illumina base quality >20).

2.4.5 Amplicon sequencing for SNV identification and Sanger sequence validation

Exon 15 of EZH2 was PCR amplified from genomic DNA using the following primers: 5’-TGTAAAACGACGGCCAGTCTGGGACTACAAGTATGCACCACC-3’ and 5’-CAGGAAACAGCTATGACACCAACACCACCAAAAGGTTTTCT-3’. These primers contain priming sites for (M13 Forward -21, M13 reverse) on their 5’ ends to allow direct Sanger sequencing of amplicons. Unless otherwise stated, amplicons were produced from genomic DNA from both tumour and matched normal patient DNA. All capillary traces were analyzed using Mutation Surveyor and all variants were visually inspected to confirm their presence or absence (for germ line traces).
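
The two calculations behind the targeted re-sequencing analysis (Section 2.4.4) and the interpretation of codon 641 can be illustrated with a minimal sketch. It is not the pipeline used in this work (SNVmix and MAQ are not reproduced here). The function names are hypothetical, and the choice of all reads spanning the site as the denominator is an assumption, since the text does not state it explicitly. The first function reports the percentage of reads carrying a mismatch with base quality >20; the second lists the amino acid encoded by every single-base change of a codon, using the standard genetic code.

```python
from typing import Dict, Sequence

BASES = "ACGT"

# Standard genetic code, laid out in the conventional T, C, A, G order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate("TCAG")
    for j, b in enumerate("TCAG")
    for k, c in enumerate("TCAG")
}


def mismatch_percentage(bases: Sequence[str], quals: Sequence[int],
                        ref_base: str, min_quality: int = 20) -> float:
    """Percentage of reads at one site with a high-quality mismatch.

    Assumption: the denominator is every read spanning the site, and a
    mismatch counts only if its base quality exceeds min_quality.
    """
    if len(bases) == 0:
        return 0.0
    high_quality_mismatches = sum(
        1 for base, qual in zip(bases, quals)
        if base != ref_base and qual > min_quality
    )
    return 100.0 * high_quality_mismatches / len(bases)


def single_base_variants(codon: str) -> Dict[str, str]:
    """Map each of the nine single-base variants of a codon to its amino acid
    ('*' denotes a stop codon)."""
    variants: Dict[str, str] = {}
    for pos in range(3):
        for alt in BASES:
            if alt != codon[pos]:
                variant = codon[:pos] + alt + codon[pos + 1:]
                variants[variant] = CODON_TABLE[variant]
    return variants


if __name__ == "__main__":
    # Codon 641 of EZH2 is TAC (tyrosine) on the coding strand.
    for variant, amino_acid in sorted(single_base_variants("TAC").items()):
        print(variant, amino_acid)
```

Applied to TAC, the second function yields eight non-synonymous changes (N, H, D, F, S, C and two stop codons) and one synonymous change (TAT), matching the variant list given in the legend of Figure 2-1.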
2.4.6 Computational modeling of EZH2 wild type and mutant SET domain

The EZH2 SET domain sequence was used to search for the structural template for homology modeling in the Protein Data Bank. The available crystal structure of the MLL1 SET domain (PDB ID 2w5z) was identified as the best template (with sequence identity of 39% for the SET domain and no gaps in the sequence alignment). A three dimensional model of the EZH2 SET domain was constructed via the protein modeling server SWISS-MODEL (Kiefer et al. 2009). Because MLL1 is a H3K4 binding protein, there was some concern that the target lysine residue of EZH2 (K27) may not reside in the same conformation. Another concern is that the MLL1 crystal structure is in an open conformation and this conformation has reduced methyltransferase activity compared to the closed ones. The conformation change may shift the position of Y641. To address these concerns, we built alternative models using other structures as templates. We used the H3K9 binding proteins EHMT1 (PDB ID 2RFI), DIM-5 (1PEG), SUV39H2 (2R3A) and G9a (2O8J), as well as the H3K36 binding protein SETD2 (3H6L). The striking overlap of the conserved tyrosine residue corresponding to Y641 confirms that the position of Y641 remains unchanged in all proteins regardless of an open or closed conformation. The co-crystallized H3 peptides in 1PEG and 2RFI helped us confirm that the conformations of K4 and K9 are quite similar in those models. Therefore, we assume that the K27 in EZH2 will adopt a conformation close to what is shown in the model.

2.4.7 In vitro EZH2 H3K27 tri-methylation assay

Mutant constructs were generated using site-directed mutagenesis of the Refseq EZH2 (NM_004456) with an N-terminal His tag. Wild type EZH2 and each of the four Y641 mutant constructs were co-expressed along with wild type AEBP2, EED, SUZ12 and RbAp48 in SF9 cells using a baculovirus expression system (pVL1392, cloned using BamHI and EcoRI).
Together, these five proteins associate to form an enzymatically active PRC2 complex in vitro. Expression of EZH2 protein from each of the four mutant constructs was confirmed by Western blot and detected using anti-EZH2. Assay plates were coated with biotinylated histone H3 (21-44) peptide. Purified PRC2 was added to the plate along with S-adenosylmethionine (in the assay buffer) to detect enzyme activity. Methylated histone H3 was measured using a highly specific mouse-derived monoclonal antibody, which recognizes only the tri-methylated K27 residue of histone H3 (Active Motif, Catalog Number 39535). The secondary antibody, which is labeled with Europium, was detected using time-resolved fluorescence (620 nm). PRC2 methyltransferase activity of each mutant (and wild type EZH2) was tested at varying purified PRC2 amounts (between 0 and 200 ng).

2.4.8 Revised tri-methylation assay

Batches of recombinant reconstituted active PRC2 complexes similar to those used above, containing wild type EZH2 (51 004), EZH2 Y641N, or EZH2 Y641F, were purchased from BPS Biosciences. Methyltransferase assays were performed using a kit (#17-330, Millipore) as per the manufacturer’s instructions, except that biotinylated peptides (of either me0, me1 or me2 methylation states) were used as substrates; 250 ng of active PRC2 complex was incubated with 0.67 µM 3H-SAM (PerkinElmer Life and Analytical Sciences), 1 µM biotinylated peptide in 50 mM Tris-HCl, pH 9.0, and 0.5 mM dithiothreitol for 30 min at 30°C in a 10 µL volume. A total of 5 µL was spotted on a P81 square paper (Millipore), washed (twice with 10% trichloroacetic acid and once with 100% ethanol) to remove unincorporated SAM, air-dried, and placed in a glass scintillation vial with 5 mL of scintillation fluid (ScintiSafe Econo1, Fisher Chemical) and counted on a 1900TR Liquid Scintillation Analyzer (PerkinElmer Life and Analytical Sciences).
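
As a worked check on the ELISA-based assay of Section 2.4.7, the specific activities quoted in the legend of Figure 2-2 can be combined to recover the reported fold difference between wild type and mutant complexes. The snippet below is only an arithmetic sketch; the variable names are illustrative, and the values are those stated in the figure legend (pmol/min/µg).

```python
from statistics import mean

# Specific activities (pmol/min/µg) as reported in the Figure 2-2 legend.
mutant_specific_activity = {
    "Y641H": 0.001,
    "Y641N": 0.0012,
    "Y641S": 0.0011,
    "Y641F": 0.0009,
}
wild_type_specific_activity = 0.0071

# Mean activity across the four mutants and fold reduction relative to wild type.
mutant_mean = mean(list(mutant_specific_activity.values()))
fold_reduction = wild_type_specific_activity / mutant_mean

print(f"mean mutant specific activity: {mutant_mean:.5f}")
print(f"wild type / mutant mean: {fold_reduction:.1f}-fold")
```

The ratio rounds to 6.8, in line with the "~7-fold" reduction described in Section 2.2.3.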
2.4.9 Cell lines

DB, KARPAS 422, SU-DHL-6 and WSU-DLCL2 are cell lines obtained from DSMZ and all “OCI-LY” lines were obtained from Dr. L Staudt, NIH.

2.4.10 Cell-of-origin (COO) determination

Total RNA was reverse transcribed (one cycle) and hybridized to U133-2 Plus arrays according to the manufacturer’s protocol (Affymetrix). CEL files were normalized using robust multi-chip analysis (RMA). Cell of origin (COO) was calculated using model scores for activated B-cell type (ABC) and germinal centre B-cell type (GCB) derived from the 185-gene model described by Lenz et al (2008b) and the Bayesian formula described by Wright et al (2003).

Figure 2-1: Recurrent mutations of Y641 in EZH2
(A) Genomic organization of the EZH2 locus, alternative exons and protein domain structure. The location of the mutation affecting Y641 in exon 15 of the EZH2 gene and protein is indicated with a red asterisk. (B) Illustration of sequencing results. Three of the five distinct mutations and amino acid replacements in codon 641 from different lymphoma samples as detected by capillary sequencing (left) or Illumina RNA-seq (right). (C) A multiple alignment of EZH2, EZH1 (its paralogue), the Drosophila orthologue E(Z) and six other human SET domain proteins demonstrates the intra and inter-species sequence conservation of SET domains. Conservation codes reported by ClustalX are shown above (Thompson et al. 2002). The predominant mutation in EZH2 affects a key tyrosine in the catalytic site of the SET domain (orange) conserved in the Drosophila orthologue E(Z). With one exception, all EZH2 mutations in FL and DLBCL alter this amino acid. The exception was a double mutant (FL), with a second somatic mutation affecting N635 (blue). All mutants comprise 5 of the 8 possible non-synonymous variants of this codon (lower right, in red). Notably, the five observed amino acid changes were not found at equal frequencies.
We detected a slight enrichment for Y641F (49%) followed by Y641S (21%), Y641N (15%) and Y641H (13%) and only a single example of Y641C (2%) (Appendix A). Of the unobserved variants (D, blue), two would result in a truncated protein and the third would introduce an aspartate residue. The pattern and nature of these changes (A->G, A->T, T->G, T->A) indicated to us that these mutations do not likely arise from AID-induced somatic hypermutation at this locus (Pasqualucci et al. 2001).

This figure was previously published (Morin et al. 2010) and I produced it in its entirety.

[Figure 2-1 image. Recoverable labels: EZH2 exons; EZH2 protein domains (Homeodomain-like, SANT DNA-bd, SET); EZH2 exon 15 sequence traces (FL tumor DNA, Y641F; FL tumor DNA, Y641H; DLBCL HS0639 tumor RNA, Y641H; DLBCL HS0640 tumor RNA, Y641S); SET domain alignment of EZH2, EZH1, E(Z), SUV92, DIM-5, SETD8, MLL, MLL2 and SETD7 with the catalytic site, AdoMet, N635 and Y641C marked; Y641 (TAC) variants: AAC N, CAC H, TTC F, TCC S, TGC C, GAC D, TAT synonymous, TAA truncation, TAG truncation.]

Figure 2-2: In vitro assembly and functional analysis of PRC2 with mutant and wild type EZH2.
(A) Wild type EZH2 and each of the four Y641 mutants were co-expressed along with wild type AEBP2, EED, SUZ12 and RbAp48 in SF9 cells using a baculovirus expression system (Methods). Together, these five proteins associate to form an enzymatically active PRC2 complex in vitro. The purified complex from the SF9 cells showed strong expression of each of these proteins and confirmed their association and assembly into PRC2. (B) Expression of EZH2 protein from each of the four mutant constructs was confirmed by Western blot. (C) The purified complex was then assayed using biotinylated histone H3 (21-44) peptide along with S-adenosylmethionine (in the assay buffer) to detect enzyme activity. Methylated histone H3 was measured using a highly specific antibody, which recognizes only the tri-methylated K27 residue of histone H3 (Methods). The secondary antibody, which is labeled with Europium, was detected using time-resolved fluorescence (620 nm). PRC2 methyltransferase activity of each mutant (and wild type EZH2) was tested at varying purified PRC2 amounts (between 0 and 200 ng).
The specific activity for the four mutants was calculated to be 0.001, 0.0012, 0.0011 and 0.0009 pmol/min/µg for the H, N, S and F mutants, respectively (mean = 0.00105). The wild type enzyme (blue) showed a specific activity of 0.0071 pmol/min/µg (~6.8-fold greater). Error bars reflect the standard deviation of triplicate measurements.

The blots and gels shown in this figure (Panels A and B) were produced by an employee of BPS BioScience, as were the data used in Panel C. I prepared the graph and assembled the figure, which was previously published (Morin et al. 2010).

[Figure 2-2 image. Panel A: Expression of PRC2 components (EZH2/SUZ12, AEPB2, RbAp48, EED; lanes Y641F, Y641S, Y641N, Y641H; markers at 175, 80, 58, 46 and 30 kDa). Panel B: EZH2 protein confirmation (lanes Y641F, Y641S, Y641N, Y641H, WT; markers at 175, 80, 58, 46 and 30 kDa). Panel C: H3K27 trimethylation for EZH2 Y641 mutants; x-axis: Enzyme (ng), 0 to 200; y-axis: Fluorescence, 0 to 7500; series: WT, H641, N641, S641, F641.]

Figure 2-3: In vivo demonstration of H3K27 methylation levels in EZH2 wild type and mutant cells.
(A) Western blot analysis of ten DLBCL cell lines with known EZH2 Y641 mutation status provided evidence for enhanced steady-state levels of H3K27me3. Two separate H3K27me3-specific antibodies showed a consistent increase in histones bearing this modification in the four Y641-mutant cell lines relative to six cell lines lacking the mutation. (B) Whole cell lysates from frozen tumour sections from 10 patients with either wild type EZH2 (IDs 396, 839, 315, 085, or 900) or heterozygous for EZH2 Y641 mutations (IDs 694 (+/Y641F), 178 (+/Y641H), 883 (+/Y641N), 940 (+/Y641F), or 353 (+/Y641S)) were probed with the respective antibodies. This is a composite figure assembled to reflect similar levels of H3 from the two blots.
(C) Nuclear lysates from HEK293T cells stably expressing GFP-tagged EZH2 and Y641 mutants were probed with anti-EZH2 to assess ectopically expressed (top band) or endogenous (lower band) EZH2 levels and anti-H3K27me3. Anti-H3 was used to assess histones as a loading control. (D) Respective plasmids encoding GFP, EZH2, or mutants (as indicated) were transfected in HEK293T cells; the lysates were probed with the antibodies anti-FLAG M2, anti-GFP, anti-EZH2 (top band shows GFP-tagged EZH2, lower band shows endogenous EZH2 and FLAG-tagged EZH2), monomethyl, dimethyl, or trimethyl specific H3K27. These antibodies are specific for the respective methylation states.

Experiments were performed by Damian Yap and Tobias Berg and this figure was prepared by Damian Yap. A variant of this figure was previously published (Yap et al. 2011).

Table 2-1: Summary of sequence coverage in FL patient A

Library description                           Raw reads (pairs)   Mapped reads   Total sequence coverage (bp)   Mean coverage depth of exons*
FL sample A, matched germ line genomic DNA    93,473,829          163,216,278    7,986,844,356                  2.80
FL sample A, tumour RNA-seq                   51,729,560          63,262,348     2,277,444,528                  18.9
FL sample A, tumour genomic DNA               351,666,782         563,762,488    27,024,661,976                 9.47

*Because of the uneven coverage produced by genome sequencing and RNA-seq, coverage for this table (and the relevant regions of this chapter) was calculated by counting the number of reads unambiguously aligned within exonic regions of the genome.
70 Table 2-2: Location and effect of mutations in EZH2 in NHL identified by RNA-seq Sample ID Sample type or cell line name AGE SEX t(14;18) Genomic Position Mutation* Effect HS0804 FL (Sample A) 44 F No chr7:148139661 T->C Y->H HS0639 DLBCL 60 M Yes chr7:148139661 T->C Y->H HS0648 DLBCL 92 F No chr7:148139661 T->A Y->N HS0640 DLBCL 68 F Yes chr7:148139660 A->C Y->S HS0942 DLBCL 73 M Yes chr7:148139661 T->A Y->N HS0798 DB 45 M Yes chr7:148139661 T->A Y->N HS0841 KARPAS 422 73 F Yes chr7:148139661 T->A Y->N HS0900 SU-DHL-6 43 M Yes chr7:148139661 T->A Y->N HS0901 WSU-DLCL2 41 M Yes chr7:148139660 A->T Y->F HS1163 OCI-LY1 NA NA Yes chr7:148139661 T->A Y->N *All observed mutations are heterozygous and mutation is reported on the negative strand  71 Table 2-3: Frequency of EZH2 Y641 mutations in lymphoma and benign samples Sample type # of samples lacking EZH2 mutation # of samples with EZH2 Y641 mutation Total samples Prevalence of Y641 mutation FL Grade 1       Grade 2       Grade 3 125 58 26 10 4 2 225 7.1% FL/DLBCL pairs*       FL       DLBCL 24 22 2 4 52 11.5% DLBCL  GCB                 PMBCL                 GCB-cl¥                  ABC                 U                 Non-GCB                 ABC-cl¥¥                 NA§ 62 23 2  41 24 22 2 115 18 1 5  0 0 0 0 11 326 22.5% 4.2% 71.4%  0% 0% 0% 0% 9% MCL 25 0 25 0% SLL 30 0 30 0% PTCL 25 0 25 0% Benign Reactive lymph node Purified CD77+ centroblasts¶  23 8  0 31 0% Total  Primary            Cell lines            Benign  622 4 31  52 5 0  714    Abbreviations: PMBCL, primary mediastinal B cell lymphoma; Non-GCB, non-Germinal centre type of DLBCL defined by immunohistochemistry using the Hans criteria (Hans et al. 
2004); MCL, mantle cell lymphoma; SLL, small lymphocytic lymphoma; PTCL, peripheral T cell lymphoma not otherwise specified;
§Affymetrix array analysis was not performed; hence, COO information is unavailable
*FL and DLBCL pairs were samples derived from the same patient pre (FL) and post transformation (DLBCL).
¥GCB-cl (cell lines): mutated EZH2: DB, KARPAS 422, SU-DHL-6, WSU-DLCL2 and OCI-LY1; wild type EZH2: OCI-LY7 and OCI-LY19.
¥¥ABC cell lines: wild type EZH2: OCI-LY3 and OCI-LY10.
¶CD77+ centroblasts were purified based on CD77 selection from reactive tonsils.

Chapter 3: Frequent mutation of histone-modifying genes in NHL

3.1 Introduction

Recent application of next-generation sequencing to discover mutations in NHL, by our group and others, has revealed recurrently mutated genes with no previously known role in lymphomagenesis such as EZH2 (Morin et al. 2010) and MYD88 (Ngo et al. 2011). Our initial sequencing of the genome from a single FL patient revealed additional genes harbouring candidate mutations (Appendix A), some of which demonstrated evidence for recurrent mutation in NHL based on our analysis of additional cases by RNA-seq. To enhance our understanding of the genetic architecture of B-cell NHL, we undertook a study to (1) survey genomic DNA for somatic mutations in a larger group of patient samples and (2) determine the prevalence, expression and focal recurrence of these mutations in expanded cohorts of both FL and DLBCL cases. Using strategies and techniques applied to cancer genome and transcriptome characterization by ourselves and others (Mardis et al. 2009; Shah et al. 2009b; Morin et al. 2010), we sequenced tumour DNA from 14 NHL cases and the mRNA from a total of 117 tumour samples and 10 cell lines (Appendix B).
We report the utility of this approach in identifying hundreds of mutated genes of which 109 show evidence for recurrent mutation in these diseases and explore the potential relevance of some of the more significantly mutated genes.

3.2 Results

3.2.1 Identification of genes recurrently mutated in B-cell NHL

We sequenced the genomes or exomes of 14 NHL cases, all with matched constitutional DNA sequenced to comparable depths (Appendix B). After screening for single nucleotide variants followed by subtraction of known polymorphisms and visual inspection of the sequence read alignments, we identified 717 non-synonymous candidate somatic mutations (coding single nucleotide variants; cSNVs) affecting 651 genes (Figure 3-1; Methods). The number of cSNVs identified in these genomes ranged from 20 to 135. Only 25 of the 651 genes were represented in the cancer gene census (December, 2010 release) (Futreal et al. 2004). We performed RNA sequencing (RNA-seq) on these 14 NHL cases and an expanded set of 113 samples comprising 83 DLBCL, 12 FL and 8 B-cell NHL cases with other histologies and 10 DLBCL-derived cell lines (Appendix B). We analysed these data to identify novel fusion transcripts and cSNVs (Figure 3-2). We identified 240 genes with at least one cSNV in a genome/exome or an RNA-seq "mutation hot spot" (below), and with cSNVs in at least three cases in total. We selected cSNVs from each of these 240 genes for re-sequencing to confirm their somatic status. We did not re-sequence genes with previously documented mutations in lymphoma (e.g. CD79B, BCL2). We confirmed the somatic status of 543 cSNVs in 317 genes, with 109 genes having at least two confirmed somatic mutations (Appendix B). Of the successfully re-sequenced cSNVs predicted from the genomes, 171 (94.5%) were confirmed somatic, 7 were false calls and 3 were present in the germ line. These 109 recurrently mutated genes were significantly enriched for genes involved in lymphocyte activation (P = 8.3 × 10⁻⁴; e.g.
STAT6, BCL10), lymphocyte differentiation (P = 3.5 × 10⁻³; e.g. CARD11), and regulation of apoptosis (P = 1.9 × 10⁻³; e.g. BTG1, BTG2). Also significantly enriched were genes linked to transcriptional regulation (P = 5.4 × 10⁻⁴; e.g. TP53) and genes involved in methylation (P = 2.2 × 10⁻⁴) and acetylation (P = 1.2 × 10⁻²), including histone methyltransferase (HMT) and acetyltransferase (HAT) enzymes. The former (i.e. EZH2) was described in the previous chapter and mutations in the latter (i.e. CREBBP) were reported by another group while this manuscript was being prepared (Pasqualucci et al. 2011). Mutation hot spots can result from mutations at sites under strong selective pressure and we have previously identified such sites using RNA-seq data (Morin et al. 2010). We searched these RNA-seq data for genes with mutation hot spots, and identified 10 genes that were not identified as mutated in the 14 genomes analyzed (PIM1, FOXO1, CCND3, TP53, IRF4, BTG2, CD79B, BCL7A, IKZF3 and B2M), of which five (FOXO1, CCND3, BTG2, IKZF3 and B2M) were not previously known targets of point mutation in NHL (Table 3-1; Methods). Each of FOXO1, BCL7A and B2M exhibited hot spots affecting their start codons. The effect of a FOXO1 start codon mutation, which was observed in three cases, was further studied using a cell line in which the initiating ATG was mutated to TTG. Western blots probed with a FOXO1-specific antibody revealed a band with a molecular weight indicative of an N-terminal truncation of the FOXO1 protein (not shown) consistent with utilization of the next in-frame ATG for translation initiation. We detected a second hot spot in FOXO1 at threonine residue 24, which was mutated in two cases. This residue is reportedly phosphorylated by AKT subsequent to B-cell receptor (BCR) stimulation (Yusuf et al. 2004) inducing nuclear export of FOXO1.
We suspect that both mutations affecting threonine 24 and those resulting in removal of this portion of the protein would impact the ability of FOXO1 protein to be phosphorylated and thus impact AKT-mediated nuclear export.

3.2.2 Identification of potential driver mutations

We next used the RNA-seq data to determine whether any of the somatic mutations in the 109 recurrently mutated genes showed evidence for allelic imbalance with expression favouring one allele. Of 380 mutant alleles with detectable expression, we observed expression in favour of the mutation for 16.8% of alleles (64/380) and expression in favour of the wild type for 27.8% of alleles (106/380; Appendix B). Seven genes displayed evidence for significant allelic imbalance favouring the mutant allele in at least two cases: namely BCL2, CARD11, CD79B, EZH2, IRF4, MEF2B and TP53 (P<0.05, Methods). In 27 of 43 cases with cSNVs in BCL2, expression favoured the mutant allele, consistent with the previously-described hypothesis that the translocated (and hence, transcriptionally deregulated) allele of BCL2 is subsequently targeted by somatic hypermutation (Saito et al. 2009). Examples of mutations at known oncogenic hot spot sites such as F123I in CARD11 (Lenz et al. 2008a) exhibited allelic imbalance favouring the mutant allele in some cases. Similarly, we noted expression favouring two novel hot spot mutations in MEF2B (Y69 and D83) and two sites in EZH2 not previously reported as mutated in lymphoma (A682G and A692V). We sought to distinguish new cancer-related mutations from passenger mutations using the statistical approach proposed by Greenman et al. (2006). We reasoned that this method would reveal genes with strong signatures of selection, and mutations in such genes would be good candidates for cancer drivers.
We thus identified 26 genes with significant evidence for positive selection (FDR 0.03; Methods), showing selective pressure for acquiring either non-synonymous point mutations or truncating/nonsense mutations (Table 1-1). These genes included known lymphoma oncogenes such as BCL2, CD79B (Davis et al. 2010), CARD11 (Lenz et al. 2008a), MYD88 (Ngo et al. 2011) and EZH2 (Morin et al. 2010), all of which exhibited signatures indicative of selection for non-synonymous variants.

3.2.3 Evidence for selection of inactivating changes identifies novel mutated genes

In contrast to oncogenes, we expected tumour suppressor genes to exhibit strong selection for the acquisition of nonsense mutations. In our analysis, the eight most significant genes included seven with strong selective pressure for nonsense mutations and these included the known tumour suppressor genes TP53 and TNFRSF14, the latter of which is known to be mutated in FL (Cheung et al. 2010) (Table 1-1). CREBBP also showed evidence at reduced significance for acquisition of nonsense mutations and cSNVs (Figure 3-3). We also observed enrichment for nonsense mutations in BCL10, a positive regulator of NF-κB, in which oncogenic truncated products have been described in various lymphomas (Du et al. 2000). The remaining strongly significant genes had no previously known role in lymphoma and included BTG1, GNA13, SGK1 and MLL2. GNA13 was affected by mutations in 22 cases and five of these were either nonsense mutations or affected key residues (Figure 3-4). GNA13 encodes the alpha subunit of a heterotrimeric G protein that couples to G protein-coupled receptors and modulates RhoA activity (Kreutz et al. 2007), and some of the mutated residues are reported to negatively impact its function (Bhattacharyya & Wedegaertner 2000; Manganello et al. 2003), including a T203A mutation, which also exhibited allelic imbalance favouring the mutant allele (Figure 3-5).
GNA13 protein was reduced or absent on Western blots in cell lines harbouring either a nonsense mutation, deletion of the stop codon, a frame shifting deletion, or changes affecting splice sites (Methods; Figure 3-4). SGK1 encodes a PI3K-regulated kinase with numerous functions including regulation of FOXO transcription factors (Brunet et al. 2001), regulation of NF-κB by phosphorylating IκB kinase (Tai et al. 2009), and negative regulation of NOTCH signalling (Mo et al. 2011). Notably, SGK1 also resides within a region of chromosome 6 commonly deleted in DLBCL (Figure 3-1) (Lenz et al. 2008c). The mechanism by which inactivation of SGK1 and GNA13 may contribute to lymphoma is unclear but the strong degree of apparent selection towards their inactivation and their overall high mutation frequency (each mutated in 18 of 106 DLBCL cases) suggests that their loss contributes to B-cell NHL. Certain genes are known to be mutated more commonly in GCB DLBCLs such as TP53 (Young et al. 2008) and EZH2 (Morin et al. 2010). Strikingly, both SGK1 and GNA13 mutations were found only in GCB cases (P = 1.93 × 10⁻³ and 2.28 × 10⁻⁴, Fisher exact test; n=15 and 18, respectively) (Figure 3-4). Two additional genes (MEF2B and TNFRSF14) with no previously described role in DLBCL showed a similar restriction to GCB cases (Figure 3-4).

3.2.4 Inactivating MLL2 mutations

The gene with the most significant evidence for selection and the largest number of nonsense SNVs was MLL2. Our RNA-seq analysis indicated that 26.0% (33/127) of cases carried at least one MLL2 cSNV. To address the possibility that variable RNA-seq coverage of MLL2 failed to capture some mutations, we PCR amplified the entire MLL2 locus (~36kb) in 89 cases (35 primary FLs, 17 DLBCL cell lines, and 37 DLBCLs). 58 of these cases were among the RNA-seq cohort.
Illumina amplicon resequencing (Methods) revealed 78 mutations, confirming the RNA-seq identified mutations in the overlapping cases and identifying 33 additional mutations. We confirmed the somatic status of 46 variants using Sanger sequencing (Appendix B), and showed that 20 of the 33 additional mutations were insertions or deletions (indels). Three SNVs at splice sites were also detected, as were 10 new cSNVs that had not been detected by RNA-seq due to low coverage. The somatic mutations were distributed across MLL2 (Figure 3-5A). 37% (n=29/78) of these were nonsense mutations, 46% (n=36/78) were indels that altered the reading frame, 8% (n=6/78) were point mutations at splice sites and 9% (n=7/78) were non-synonymous amino acid substitutions (Table 3-3). Four of the somatic mutations affecting splice sites had effects on MLL2 transcript length and structure. For example, two heterozygous splice site mutations resulted in the use of a novel splice donor site (Figure 3-5B) and an intron retention event (Figure 3-5C). We observed that approximately half of the NHL cases we sequenced had two MLL2 mutations (not shown). We used BAC clone sequencing in eight FL cases to show that in all eight cases the mutations were in trans, and therefore affected both MLL2 alleles. This observation is consistent with the notion that there is a complete, or near-complete, loss of MLL2 in the tumour cells of such patients. With the exception of two primary FL cases and two DLBCL cell lines (Pfeiffer and SU-DHL-9), the majority of MLL2 mutations appeared to be heterozygous. Analysis of Affymetrix 500k SNP array data from the two FL cases with apparent homozygous mutations revealed that both tumours exhibited copy number neutral loss of heterozygosity (LOH) for the region of chromosome 12 containing MLL2 (Methods). This suggests that, in addition to bi-allelic mutation, LOH is a second, albeit less common mechanism by which MLL2 function may be lost.
MLL2 was the most frequently mutated gene in FL, and among the most frequently mutated genes in DLBCL (Figure 3-6). We confirmed MLL2 mutations in 31 of the 35 FL patients (89%), in 12 of the 37 DLBCL patients (32%), in 10 of the 17 DLBCL cell lines (59%) and in none of the eight normal centroblast samples we sequenced. Our analysis predicted that the majority of the somatic mutations observed in MLL2 were inactivating (91% disrupted the reading frame or were truncating point mutations), suggesting to us that MLL2 is a tumour suppressor of broad significance in NHL.

3.2.5 Recurrent point mutations affecting the MADS box and MEF2 domains of MEF2B

Our selective pressure analysis also revealed genes with stronger pressure for acquisition of amino acid substitutions than for nonsense mutations. One such gene was MEF2B, which has not previously been linked to lymphoma. 20 (15.7%) cases had cSNVs in MEF2B and 4 (3.1%) cases had cSNVs in the related gene MEF2C. All cSNVs detected by RNA-seq affected either the MADS box or MEF2 domains. To determine the frequency and scope of mutations in MEF2B, we Sanger-sequenced exons 2 and 3 in 261 primary FL samples; 259 DLBCL primary tumours; 17 cell lines; 35 cases of assorted NHL (IBL, composite FL and PMBCL); and eight non-malignant centroblast samples. We also used a capture strategy (Methods) to sequence the entire coding region of MEF2B in the 261 FL samples, revealing six additional variants outside exons 2 and 3. We thus identified 69 cases (34 DLBCL; 12.67% and 35 FL; 15.33%) with cSNVs or indels in MEF2B and noted an absence of novel variants from the other NHL and non-malignant samples. Of the detected variants, 55 (80%) affected residues within the MADS box and MEF2 domains encoded by exons 2 and 3 (Appendix B; Figure 3-7). In contrast to the mutational pattern observed in MLL2, each patient generally had a single variant and we observed relatively few (8 total, 10.7%) truncation-inducing SNVs or indels in MEF2B.
Non-synonymous SNVs were by far the most common type of change observed, with 59.4% of detected variants affecting K4, Y69, N81 or D83. In 12 cases MEF2B mutations were shown to be somatic by sequencing matched constitutional DNA, including representative mutations at each of these four hot spots (Table 3-1). Our analysis failed to detect mutations in ABC cases, indicating that somatic mutations in MEF2B play a role unique to the development of GCB DLBCL and FL (Figure 3-4).

3.3 Discussion

In our study of genome, transcriptome and exome sequences from 127 B-cell NHL cases, we identified 109 genes with clear evidence of somatic mutation in multiple individuals. Our analysis suggests that significant selection acts on at least 26 of these to acquire either nonsense or missense mutations. To the best of our knowledge, the majority of these genes had not previously been associated with any cancer type. We observed an enrichment of somatic mutations affecting genes involved in transcriptional regulation and, more specifically, chromatin modification. MLL2 emerged from our analysis as a candidate major tumour suppressor locus in NHL. It is one of six human H3K4-specific methyltransferases in the MLL family, all of which share homology with the Drosophila trithorax gene (Shilatifard 2008). Trimethylated H3K4 (H3K4me3) is an epigenetic mark associated with the promoters of actively transcribed genes. By laying down this mark, MLLs are responsible for the transcriptional regulation of developmental genes including the homeobox (Hox) gene family (Milne et al. 2002) which collectively control segment specificity and cell fate in the developing embryo (Krumlauf 1994; Canaani et al. 2004). Each MLL family member is thought to target different subsets of Hox genes (P. Wang et al. 2009) and, in particular, MLL2 is known to also regulate the transcription of a diverse set of genes (Issaeva et al. 2007).
Recently, mutations in MLL2 were reported in a small-cell lung cancer cell line (Pleasance et al. 2010b) and in renal carcinoma (Dalgliesh et al. 2010) but the frequency of nonsense mutations affecting MLL2 in these cancers was not established in these reports. Parsons and colleagues recently reported inactivating mutations in MLL2 or MLL3 in 16% of medulloblastoma patients (Parsons et al. 2011), further implicating MLL2 as a cancer gene. Here, our data link MLL2 somatic mutations to B-cell NHL. The reported mutations are likely to be inactivating and in eight of the cases with multiple mutations, we confirmed that both alleles were affected, presumably resulting in complete loss of MLL2 function. The high prevalence of MLL2 mutations in FL (89%) equals the frequency of the t(14;18)(q32;q21) translocation, which is considered the most prevalent genetic abnormality in FL (Horsman et al. 2003). In DLBCL tumour samples and cell lines, MLL2 mutation frequencies were 32% and 59% respectively, also exceeding the prevalence of the most frequent cytogenetic abnormalities, such as the various translocations involving 3q27, which occur in 25-30% of DLBCLs and are enriched in the ABC cases (Iqbal et al. 2007). Importantly, we found mutations in MLL2 in both DLBCL subtypes (Figure 3-4). Taken together, our analyses indicate that MLL2 acts as a central and pan-subtype tumour suppressor in both FL and DLBCL. The MEF2 gene family encodes four related transcription factors that recruit histone-modifying enzymes including histone deacetylases (HDACs) and HATs in a calcium-regulated manner. Although truncating variants were detected in our analysis of MEF2 gene family members, our analysis suggests that, in contrast to MLL2, MEF2 gene family members tend to selectively acquire non-synonymous amino acid substitutions.
In the case of MEF2B, 59.4% of all the cSNVs were found at four sites within the protein (K4, Y69, N81 and D83), and all four of these sites were confirmed to be targets of somatic mutation. 39% of the MEF2B alterations affect D83, resulting in replacement of the charged aspartate with any of alanine, glycine or valine. Although we cannot yet predict the consequences of these substitutions on protein function, it seems likely that they affect the ability of MEF2B to facilitate gene expression and play a role in promoting the malignant transformation of germinal centre B cells to lymphoma.

3.4 Conclusions

We can construct speculative links in which we relate MEF2B mutations to CREBBP and EP300 mutations, and to recurrent Y641 mutations in EZH2 (Morin et al. 2010). One target of CREBBP/EP300 HAT activity is H3K27, which is methylated by EZH2 to repress transcription. There is evidence that the action of EZH2 antagonizes that of CREBBP/EP300 (Pasini et al. 2010). One function of MEF2 is to recruit either HDACs or CREBBP/EP300 to target genes (Giordano & Avantaggiati 1999), and it has been suggested that HDACs compete with CREBBP/EP300 for the same binding site on MEF2 (Han et al. 2005). Under normal Ca2+ levels, MEF2 is bound by type IIa HDACs, which maintain the tails of histone proteins in a deacetylated repressive chromatin state (Youn & J. Liu 2000). Increased cytoplasmic Ca2+ levels induce the nuclear export of HDACs, enabling the recruitment of HATs such as CREBBP/EP300, facilitating transcription at MEF2 target genes. Mutation of CREBBP, EP300 or MEF2B may impact expression of MEF2 target genes owing to reduced acetylation of nucleosomes near these genes (Figure 3-8). In light of the recent finding that heterozygous EZH2 Y641 mutations enhance overall H3K27 trimethylation activity of PRC2 (Yap et al. 2011; Sneeringer et al.
2010), it is possible that mutation of both MLL2 and EZH2 could cooperate in reducing the expression of some of the same target genes. Our data imply that (1) post-translational modification of histones is of key importance in germinal centre B cells and (2) deregulated histone modification due to these mutations likely results in reduced acetylation and enhanced methylation and acts as a core driver event in the development of these cells into FL and GCB DLBCL (Figure 3-7).

3.5 Methods

3.5.1 Sample acquisition

Lymphoma samples were classified by an expert haematopathologist (R.D.G.) according to the World Health Organization criteria of 2008. Benign specimens included reactive pediatric tonsils or purified CD77-positive centroblasts sorted from reactive tonsils using Miltenyi magnetic beads (Miltenyi Biotec, CA). The tumour specimens were collected as part of a research project approved by the University of British Columbia-British Columbia Cancer Agency Research Ethics Board (BCCA REB) and are in accordance with the Declaration of Helsinki. Informed consent was obtained from all individuals whose samples were profiled using RNA-seq or genome/exome sequencing. Our protocols stipulate that these data will not be released into the public domain but can be made available via a tiered-access mechanism to named investigators of institutions agreeing by a materials transfer agreement that they will honour the same ethical and privacy principles required by the BCCA REB. For all DLBCL samples profiled by RNA-seq, genome or exome sequencing in this study, tumour content was greater than 50% as assessed by: a) immunophenotyping using flow cytometry to detect the level of co-expression of CD19 and light chain restriction; or b) a pathologist review of an H&E-stained frozen section taken adjacent to the tissue that was cut and used for nucleic acid extraction.
All other specimens used in this study were obtained at the time of diagnosis and were derived from archived fresh-frozen tissue or frozen tumour cell suspensions. Constitutional DNA was obtained from peripheral blood or from B cell-negative sorted tumour cell suspensions (fraction eluted from cells captured by B Cell Isolation Kit II or CD19 Micro Beads (Miltenyi Biotec, CA)).

3.5.2 Cell lines

DB (Beckwith et al. 1990), DOHH-2 (Kluin-Nelemans et al. 1991), Karpas422 (Dyer et al. 1990), NU-DHL-1 (Winter et al. 1984), NU-DUL-1 (Epstein et al. 1985), SU-DHL-6 and WSU-DLCL2 (Al-Katib et al. 1998) are cell lines obtained from DSMZ. Pfeiffer and Toledo were obtained from ATCC and all OCI-Ly (Mehra et al. 2002) lines (1, 3, 7, 10 and 19) were obtained from Louis Staudt (US National Institutes of Health). The cell lines MD903, SU-DHL-9 and RIVA were obtained from Martin Dyer (University of Leicester, UK).

3.5.3 Preparation and sequencing of RNA-seq, genome and exome Illumina libraries

Genomic DNA for construction of genome and exome libraries was prepared from biopsy materials using the Qiagen AllPrep DNA/RNA Mini Kit (Qiagen). DNA quality was assessed by spectrophotometry (260 nm/280 nm and 260 nm/230 nm absorption ratios) and gel electrophoresis before library construction. DNA was sheared for 10 min using a Sonic Dismembrator 550 with a power setting of “7” in pulses of 30 s interspersed with 30 s of cooling (Cup Horn, Fisher Scientific) and then analysed on 8% PAGE gels. The 200 to 300 bp DNA size fraction was excised and eluted from the gel slice overnight at 4 °C in 300 µL of elution buffer (5:1 (vol/vol) LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)/7.5 M ammonium acetate) and was purified using a Spin-X Filter Tube (Fisher Scientific) and ethanol precipitation. Genome libraries were prepared using a modified paired-end protocol supplied by Illumina Inc.
This involved DNA end-repair and formation of 3′ adenosine overhangs using the Klenow fragment of DNA polymerase I (3′–5′ exonuclease minus) and ligation to Illumina PE adapters (with 5′ overhangs). Adapter-ligated products were purified on QIAquick spin columns (Qiagen) and PCR-amplified for ten cycles using Phusion DNA polymerase (NEB) with PE primers 1.0 and 2.0 (Illumina). PCR products of the desired size range were purified from adapter ligation artifacts using 8% PAGE gels. DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay (Agilent) and Nanodrop 7500 spectrophotometer (Nanodrop), and DNA was subsequently diluted to 10 nM. The final concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit fluorometer (Invitrogen). For genomic DNA sequencing, clusters were generated on the Illumina cluster stations using v1 cluster reagents. Paired-end reads were generated using v3 sequencing reagents on the Illumina GAIIx platform following the manufacturer's instructions. Image analysis, base-calling and error calibration were performed using v1.0 of Illumina's Genome analysis pipeline. The DLBCL genomes were sequenced with 100 nucleotide paired-end reads using the HiSeq 2000 platform. For RNA-seq analysis, we used a modified method similar to the protocol we have previously described (Morin et al. 2010). Briefly, RNA was extracted from 15 x 20 µm sections cut from fresh-frozen lymph node biopsies, and poly(A)+ RNA was purified from 5–10 µg of DNase I–treated total RNA using the MACS mRNA isolation kit (Miltenyi Biotec) as per the manufacturer's instructions. Double-stranded cDNA was synthesized from the purified poly(A)+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina).
For exome sequencing, genomic DNA was extracted following the protocol supplied in the Qiagen AllPrep DNA/RNA Mini Kit (Cat#80204), and quantified using a Quant-iT dsDNA HS assay kit and a Qubit fluorometer (Invitrogen). Approximately 500 ng DNA was sheared for 75 s at duty cycle “20%” and intensity of “5” using a Covaris E210, and run on an 8% PAGE gel. A 200 to 250 bp DNA size fraction was excised and eluted from the gel slice, and was ligated to Illumina paired-end adapters following a standard protocol. The adapter ligated DNA was amplified for 10 cycles using the PE primer set (Illumina) and purified as a pre-exome capture library. The DNA was assessed using an Agilent DNA 1000 Series II assay, and 500 ng DNA was hybridized to the 38 Mb Human exon probe using the All Exon Kit (Cat#G3362) following the Agilent SureSelect Paired-End Target Enrichment System Protocol (Version 1.0, September 2009). The captured DNA was purified using a Qiagen MinElute column, and amplified for 12 cycles using PE primer set. The PCR products were separated on an 8% PAGE gel, the desired size range (320 to 370 bp) was excised and purified, and was then assessed using an Agilent DNA 1000 series II assay and diluted to 10 nM. The final library DNA concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit fluorometer. Clusters were generated on the Illumina cluster station and paired-end reads generated using an Illumina Genome Analyzer (GAIIx) following the manufacturer’s instructions.

3.5.4 Alignment-based analysis of tumour DNA/RNA sequence for point mutations

All reads were aligned to the human reference genome (hg18) or (for RNA-seq) to a genome file that was augmented with a set of all exon-exon junction sequences using BWA version 0.5.4 (Li & Durbin 2009).
RNA-seq libraries were aligned with an in-house modified version of BWA that is aware of exon junction reads and considers them when determining pairing distance in the “sampe” (read pairing) phase of alignment. Candidate single-nucleotide variants (SNVs) were identified in the aligned genomic sequence reads and the transcriptome (RNA-seq) reads using an approach similar to one we previously described (Morin et al. 2010). One key difference in our variant calling in this study was the application of a Bayesian SNV identification algorithm ('SNVmix') (Goya et al. 2010). This approach is able to identify SNVs with a minimum coverage of two high-quality (Q20) bases. SNVs were retained if they had a SNVmix probability of at least 0.99 and had support from reads mapping to both genomic strands. SNVs near gapped alignments or exactly overlapping sites assessed as being polymorphisms (SNPs) were disregarded, including variants matching a position in dbSNP or the sequenced personal genomes of Venter (Levy et al. 2007), Watson (Wheeler et al. 2008) or the anonymous Asian (J. Wang et al. 2008) and Yoruban (Bentley et al. 2008) individuals. For paired samples with matched constitutional DNA sequence, all variants with evidence (a SNVmix probability of at least 0.99 and 2 or more high quality base calls matching the SNV) in the constitutional DNA were considered germ line variants and were no longer considered cSNVs. Mutations were annotated on genes using the Ensembl transcripts (version 54), except in the cases of MEF2B and MLL2, for which the Ensembl annotations were deemed inferior to the Refseq. Because we observed situations where exons were represented in Ensembl transcripts that were not also represented in a Refseq, we only report candidate mutations in exons shared by both annotations (e.g. the tables in Appendix B). Candidate mutations were subsequently reviewed visually in the integrative genomics viewer (IGV) (Robinson et al.
2011) and those appearing to be artefacts or with some evidence (2 or more reads) visible in the constitutional DNA sequence were removed.

3.5.5 Validation of candidate somatic mutations using Illumina sequencing

Validation was accomplished by designing primers to amplify a 200 to 300 bp region around the targeted variant with one primer within reach of a single read (i.e. maintaining the sum of the primer length and distance to variant less than 100 bp, depending on read length used). Amplicons were generated for both tumour and normal DNA. Two pools of amplicons were generated, one for tumour and one for normal DNA, with equal volumes from each PCR reaction (or increased volume for amplicons that resulted in faint bands in an agarose gel) and an Illumina paired-end sequencing library was constructed from the pool. For variants common to more than one patient, a 6 nt index, which was added to the 5’ end of each primer, was assigned for each patient. These index sequences were trimmed from sequence reads prior to alignment and subsequently used to associate the data with individual patients. Reads were aligned using BWA and variants were visually confirmed for validity and somatic status (absence from constitutional DNA) in IGV (Robinson et al. 2011). Variants with primer design or PCR failures were scored as ‘un-validated’.

3.5.6 Validation of cSNVs by Sanger sequencing

The majority of candidate cSNVs were validated by Sanger sequencing of the region surrounding each mutation. These included all cSNVs identified in the two DLBCL exomes and the FL genome/exome (i.e. DLBCL-PatientA, DLBCL-PatientB and FL-PatientA). For the additional DLBCL genomes, cSNVs were selected for validation only if there were three or more cSNVs in that gene in the entire cohort. To do so, primers were designed to amplify 350-1200 bp regions by PCR (most amplicons were ~400 bp). Forward and reverse primers were tailed with T7 and M13Reverse 5’ priming sites, respectively.
PCR conditions used were 94ºC for 2 min, 30 cycles of 94ºC for 30 s, 60ºC for 30 s and 72ºC for 1 min, and a final extension at 72ºC for 8 min. To determine the somatic or germ line origin of the mutations, mutations were re-sequenced in both tumour and constitutional DNA, the latter obtained from peripheral blood or negative-sort cells (see section entitled Sample Acquisition). The sequencing reactions consisted of 50 cycles of 96ºC for 10 s, 43ºC (for M13Reverse) or 48ºC (T7) for 5 s and 60ºC for 4 min and were analysed using an AB 3730XL. All capillary traces were analysed using the Staden Package (Staden 1996) and all somatic variants were visually inspected to confirm their presence in tumour and absence from germ line traces. Some regions that failed to amplify in the first attempt were re-attempted with the addition of 5% DMSO and 5% betaine to the sequencing reactions, with the PCR conditions otherwise unchanged. SNVs in certain genes, such as BCL7A and HDAC7, repeatedly failed to amplify and, for these, it was not possible to determine whether the mutations were somatically acquired or present in the germ line. Validation was not performed for variants in BCL2 or CD79B as their somatic mutation status in DLBCL is well established.

3.5.7 Detection of enrichment of functional gene classes within frequently mutated genes

Significant functional classes represented in the cSNV list were identified using the DAVID Functional Annotation tool (http://david.abcc.ncifcrf.gov/). Reported P values were corrected for multiple testing using the Benjamini method.

3.5.8 Detection of mutations with imbalanced/skewed expression

The analysis of imbalanced expression was restricted to (1) confirmed somatic non-synonymous point mutations along with (2) previously published hot spot mutations. In total, there were 381 such mutations in 99 of the 109 genes represented in the RNA-seq data.
For each mutated gene, the number of aligned reads supporting the reference and mutant allele was determined. For genes with multiple mutations in the same patient (e.g. BCL2), the sum of all reads supporting each of the non-reference alleles in that patient was used instead (assuming that all mutations were restricted to the same allele). Significant imbalance/skew was computed using the binomial exact test and P values were corrected using the Bonferroni method. Genes with corrected P values less than 0.05 and skew in favour of the mutant allele are reported in Appendix B.

3.5.9 Calculation of selective pressure

To determine if mutational patterns were indicative of selective pressure, we considered both synonymous and non-synonymous cSNVs across our patient cohort (excluding those found to be present in the germ line or false positives after validation). Selection can be inferred when the type of mutations in a gene differs from those expected by chance given a specific mutation profile. To analyse the significance of this deviation we applied the methods described by Greenman et al (2006) to identify genes with signatures of selection. We performed this analysis on the 101 (of 109 total) genes that had, in addition to 2 or more confirmed somatic mutations, more than 2 cSNVs in total. The coding sequence of each gene (using the longest Refseq annotation for that gene) was scanned for all possible silent and non-silent mutations (missense and truncating) matching six types of sequence changes (C>A, C>G, C>T, T>A, T>C, T>G). The separation of mutations into different strata allows the model to consider the overall effect that cancer specific mutation mechanisms may have on the mutation profile. A null-selection mutation profile is estimated via the synonymous mutations, under the assumption that they do not confer an advantage to the tumour.
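The scan of a coding sequence for possible mutations can be illustrated with a short sketch. It is not the implementation used in this work and does not reproduce the Greenman et al model: it only counts the possible single-base changes per stratum and effect. The function names are hypothetical, and the sketch assumes the standard genetic code, an in-frame sequence containing only A, C, G and T, and that changes at purine reference bases are collapsed onto the complementary pyrimidine (e.g. G>A is counted as C>T).

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in T, C, A, G order (assumption: no alternative code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def stratum(ref: str, alt: str) -> str:
    """Collapse a substitution onto its pyrimidine reference base (G>A becomes C>T)."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def count_possible_mutations(cds: str) -> Counter:
    """Count every possible single-base change in a coding sequence.

    Keys are (stratum, effect) pairs, where effect is 'silent', 'missense'
    or 'truncating'. Trailing bases that do not form a full codon are ignored.
    """
    cds = cds.upper()
    counts: Counter = Counter()
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        ref_aa = CODON_TABLE[codon]
        for offset, ref in enumerate(codon):
            for alt in BASES:
                if alt == ref:
                    continue
                mutant = codon[:offset] + alt + codon[offset + 1:]
                alt_aa = CODON_TABLE[mutant]
                if alt_aa == ref_aa:
                    effect = "silent"
                elif alt_aa == "*":
                    effect = "truncating"
                else:
                    effect = "missense"
                counts[(stratum(ref, alt), effect)] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence (Met-Leu-Trp), used purely for illustration.
    for key, n in sorted(count_possible_mutations("ATGCTGTGG").items()):
        print(key, n)
```

Counts of this kind give the per-stratum opportunities for silent and non-silent change against which observed mutations can be compared.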
A score statistic describing the selective pressure was then calculated by comparing the expected mutations of each type to the observed ones. Statistical significance was then determined by constructing an empirical distribution of scores from 100,000 Monte Carlo simulations under the null hypothesis of no selection. The number of Monte Carlo iterations was increased to a maximum of 14,600,000 for genes that did not obtain a p-value at the default 100,000 simulations. Using the models described by Greenman et al, we also estimated the type and strength of the selective pressure the genes were under. This is represented by a quantitative value of less than, equal to, or larger than 1 for negative, null, or positive selection respectively (Table 3-2). Several genes in our list have previously been identified as targets of somatic hypermutation (SHM), which is mediated by the enzyme AICDA (also known as AID) and targets a limited number of genes in DLBCL (Pasqualucci et al. 2004; Pasqualucci et al. 2001). In an attempt to avoid biasing the selective pressure model with the distinct mutational signature caused by somatic hypermutation, we split the genes into two sets. The hypermutation set contained genes previously reported to be targets of SHM (BCL2 (Saito et al. 2009), BCL6, IRF4, PIM1, and CIITA) and the non-hypermutation set contained the remaining 95 genes. The effect of the different mutational profiles of both sets can be appreciated by considering the BCL2 case. When inserted into the model with the rest of the genes BCL2 presented the highest selective pressure of all genes (65.65); however, when the selective pressure model was applied to the hypermutated genes separately, BCL2 selective pressure was estimated at 3.78.

3.5.10 Identifying genes with mutation hot spots

Hot spots were identified by searching for clustered mutations in the cSNVs identified by RNA-seq.
Owing to the lack of constitutional DNA sequence from some patient samples, we could not necessarily discern if the variants detected only by RNA-seq were present in the germ line. We specifically sought cases in which codons were recurrently mutated. To find hot spots in the RNA-seq data, we searched for sets of distinct variants producing non-synonymous changes affecting the same codon in different tumours. The genes that met this criterion (Table 3-1) included known targets of recurrent mutation (EZH2, CARD11 (Lenz et al. 2008a) and CD79B (Davis et al. 2010)) and three hot spots in MEF2B. Also among these genes were known targets of aberrant somatic hypermutation in DLBCL, including BCL2, IRF4 (Pasqualucci et al. 2004), PIM1 (Pasqualucci et al. 2001), BCL6 (Pasqualucci et al. 2003), and BCL7A (Pasqualucci et al. 2004).

3.5.11 Analysis of aligned genomic DNA sequence for copy number alterations and LOH

For the identification of copy number variations (CNVs), sequence quality filtering was used to remove all reads of low mapping quality (Q ≤ 10). Due to the varying numbers of sequence reads from each sample, aligned reference reads were first used to define genomic bins of equal reference coverage to which depths of alignments of sequence from each of the tumour samples were compared. This resulted in a measurement of the relative number of aligned reads from the tumours and reference in bins of variable length along the genome, where bin width is inversely proportional to the number of mapped reference reads. After an estimate of differential GC bias was used to reduce noise, an HMM was used to classify and segment continuous regions of copy number loss, neutrality, or gain using methodology outlined previously (S. J. Jones et al. 2010). Loss of heterozygosity was determined for each sample using the lists of genomic SNPs that were identified through the BWA / SNVMix pipeline.
This analysis allows for classification of each SNP as either heterozygous or homozygous based on the reported SNP probabilities. For each sample, genomic bins of consistent SNP coverage were used by an HMM to identify genomic regions of consistent rates of heterozygosity. The HMM partitioned each tumour genome into three states: normal heterozygosity, increased homozygosity (low), and total homozygosity (high). We infer that a region in the low state represents either a case where only a portion of the cellular population had lost a copy of a chromosomal region or one where the signal was confounded by contaminating normal cells in the tumour. Both states of increased homozygosity are displayed in blue in Figure 3-2, generated by Circos (Krzywinski et al. 2009).

3.5.12 Assembly-based analysis of tumour DNA and RNA sequence

Reads from the individual RNA-seq libraries were assembled using ABySS as previously described (Birol et al. 2009) using multiple values of k. Iterative pairwise alignments of the contigs from the individual k-mer assemblies resulted in a merged contig set that was aligned against the human reference genome (hg18) using BLAT as described (Robertson et al. 2010). Putative fusions were identified from contigs that had alignments to two distinct genomic locations. The putative events were filtered using evidence from alignment of reads to contigs using Bowtie and alignments of reads to the genome using BWA. Those events with at least four read pairs from the reads-to-genome alignment and two supporting reads from the reads-to-contig alignment (i.e. across the fusion breakpoint) were manually curated to produce a final list of putative fusions. The genomic breakpoints for the transcriptome-predicted events were identified manually from the alignments of the reads to the genome using IGV.
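The read-support thresholds applied to putative fusions before manual curation can be sketched as a simple filter. Only the two thresholds come from the text above; the record layout, field names and example values are hypothetical and do not correspond to the output format of any of the tools mentioned.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds stated in the text.
MIN_GENOME_READ_PAIRS = 4   # read pairs from the reads-to-genome alignment
MIN_SPANNING_READS = 2      # reads crossing the breakpoint in the reads-to-contig alignment


@dataclass(frozen=True)
class FusionCandidate:
    """Hypothetical summary record for one contig aligned to two genomic locations."""
    contig_id: str
    gene_a: str
    gene_b: str
    genome_read_pairs: int
    spanning_reads: int


def has_read_support(candidate: FusionCandidate) -> bool:
    """Return True when both read-support thresholds are met."""
    return (
        candidate.genome_read_pairs >= MIN_GENOME_READ_PAIRS
        and candidate.spanning_reads >= MIN_SPANNING_READS
    )


def filter_fusions(candidates: Iterable[FusionCandidate]) -> List[FusionCandidate]:
    """Keep candidates with enough support to be passed on for manual curation."""
    return [c for c in candidates if has_read_support(c)]


if __name__ == "__main__":
    # Invented example records, for illustration only.
    examples = [
        FusionCandidate("contig_1", "GENE_A", "GENE_B", genome_read_pairs=12, spanning_reads=5),
        FusionCandidate("contig_2", "GENE_C", "GENE_D", genome_read_pairs=3, spanning_reads=6),
        FusionCandidate("contig_3", "GENE_E", "GENE_F", genome_read_pairs=9, spanning_reads=1),
    ]
    for kept in filter_fusions(examples):
        print(kept.contig_id)
```

Requiring both kinds of evidence means that a candidate supported only by the assembly, or only by the genome alignment, is not carried forward.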
Putative indels were identified from alignment of the contigs to hg18 using BLAT when contiguous unmatched base(s) were found in either the contig (insertion) or reference (deletion) sequences. Events were filtered for read support, with three or more reads required for inclusion in the filtered set. The filtered set was then screened against dbSNP130 to find putative novel events. The resulting set was manually inspected using read alignments (against both the genome and contigs) to visually confirm candidates. The splicing alterations in MLL2 (Figure 3-6, panels B and C) and GNA13 (Figure 3-5) were identified from pairwise alignments of the contigs to hg18 using BLAT. The contig alignments were then matched against the four known gene models to identify novel splice junctions. Putative novel splice junctions were retained only if two or more reads spanned the novel junction. Manual inspection using read alignments (against both the genome and contigs) was performed to visually confirm candidates.

3.5.13 Cell of origin subtype assignment using RNA-seq expression values

Global gene expression signatures measured with microarrays are the standard method for classifying DLBCL samples into the two molecular subtypes (GCB and ABC). We adapted the Bayesian method described by Wright et al (2003) to allow classification to be accomplished with the expression values obtained from RNA-seq data. To accomplish this, expression values for each Ensembl gene model (version 54) were computed as FPKM (fragments per kilobase gene model per million, rather than RPKM to account for the use of paired-end reads) and log-transformed. The current standard approach for routinely classifying samples using Affymetrix U133 arrays employs 186 probesets (George Wright, personal communication). The 165 Ensembl genes that correspond to these probesets were used for classification by RNA-seq.
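The FPKM calculation described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function names are hypothetical, and the base-2 logarithm and pseudocount of 1 are assumptions, since the text states only that the values were log-transformed.

```python
import math


def fpkm(fragments: float, gene_model_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of gene model per million mapped fragments.

    Each read pair counts as a single fragment, which is what distinguishes
    FPKM from RPKM for paired-end data.
    """
    if gene_model_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene model length and library size must be positive")
    return fragments * 1e9 / (gene_model_length_bp * total_mapped_fragments)


def log_fpkm(value: float, pseudocount: float = 1.0) -> float:
    """Log-transform an FPKM value.

    Assumption: log2 with a pseudocount, so that genes with zero fragments
    remain defined. The transform actually used is not specified in the text.
    """
    return math.log2(value + pseudocount)


if __name__ == "__main__":
    # Invented numbers: 500 fragments on a 2 kb gene model in a library of
    # 25 million mapped fragments.
    expression = fpkm(500, 2_000, 25_000_000)
    print(expression, log_fpkm(expression))
```

Normalizing by both gene model length and library size is what allows values to be compared across genes and across libraries of different depth before they are passed to a classifier.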
The classifier was trained using the 43 cases previously classified as GCB and 21 classified as ABC using Affymetrix data. The FPKM values for these genes were compared between the samples with known subtypes using the t-test and those producing a P value < 0.01 were used for the classifier. The robustness of this approach was tested using leave-one-out cross-validation, which resulted in no mis-classifications. Similarly, no samples were mis-classified when all cases with known COO (based on Affymetrix data) were used to produce the classifier. However, there were some cases that were defined as unclassifiable (U) by one method and given a subtype assignment by the other method. In such cases, the subtype assignment (rather than U) was used and the discrepancies are noted in Appendix B.

3.5.14 Targeted MEF2B resequencing using biotinylated RNA capture probes

The following strategy was used to sequence the entire MEF2B locus in multiple patient samples in multiplex. Four exonic regions of the MEF2B gene were amplified by PCR from a template consisting of a pool of DNAs from three bacterial artificial chromosomes (BACs) containing the MEF2B locus (M. Nefedov, P. J. de Jong and U Surtiby, unpublished). PCR reactions consisting of 0.5 Units Phusion DNA Polymerase (New England Biolabs, Pickering, Ont.), 0.25 mM dNTPs, 3% DMSO, 0.4 µM of the forward and reverse primer and 5 pmol template were cycled on an MJR Peltier Thermocycler (model PTC-225) for 30 s at 98ºC; 25 X {10 s at 98ºC, 30 s at 65ºC, 30 s at 72ºC}; 5 min at 72ºC. The resulting PCR amplicons, ranging in size from 342 to 474 bp, were size selected on an 8% Novex TBE gel (Invitrogen Canada Inc., Burlington, Ont.), excised and eluted into 300 µL of elution buffer containing 5:1 (vol/vol) LoTE (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)/7.5 M ammonium acetate.
The eluates were purified from gel slurries by centrifugation through Spin-X centrifuge tube filters (Fisher Scientific Ltd., Nepean, Ont.), and EtOH precipitated. Purified amplicon DNAs were quantified using an Agilent DNA 1000 Series II assay (Agilent Technologies Canada Inc., Mississauga, Ont.). Individual amplicons were pooled (equimolar) and sheared using the Covaris S2 focused ultra-sonicator (Covaris Inc., Woburn, Mass.) with the following settings: 10% Duty cycle, 5% Intensity, and 200 Cycles per burst for 180 s. The resulting products were size fractionated on an 8% Novex TBE gel (Invitrogen Canada Inc.) and the 75 to 125 bp fraction isolated, purified and quantified as above. 30 ng of resulting DNA was end-repaired, 3-prime modified with Adenosine overhangs, and ligated to custom adapters containing T7 and T3 promoter sequences as described (Robertson et al. 2007). Adapter-ligated products were enriched by PCR as above using T3 and T7 sense strand-specific primers and the following cycling conditions: 1 min at 98ºC; 8 X {10 s at 98ºC, 30 s at 60ºC, 30 s at 72ºC}; 5 min at 72ºC. The amplified products were separated from excess adapter on an 8% Novex TBE gel (Invitrogen Canada Inc.), purified, and quantified using the Qubit Quant-iT™ assay and Qubit Fluorometer (Invitrogen Canada Inc.). An in vitro transcription reaction was carried out using 100 ng of purified adapter-ligated DNA as per the manufacturer’s specifications (AmpliScribe™ T7-Flash™ Biotin-RNA Transcription Kit; Intersciences Inc., Markham, Ont.). The reaction mixture was incubated at 37ºC for 60 min, DNase-I treated for 15 min at 37ºC, and then incubated at 70ºC for 5 min to inactivate DNase-I. Transcription products were precipitated with 1 volume of 5 M NH4Ac, and size fractionated on a 10% Novex TBE-Urea gel (Invitrogen Canada Inc.).
The 100 to 150 bp fraction was isolated from the gel, eluted into 0.3 M NaCl, and EtOH-precipitated after extraction of the eluate from the gel slurry by centrifugation through a Spin-X centrifuge tube filter (Fisher Scientific Ltd.). The biotinylated RNA was re-suspended in 20 µL nuclease-free water and quantified using an Agilent RNA Nano assay (Agilent Technologies Canada Inc.). Indexed libraries of patient genomic DNA were pooled from 96 well plates in groups ranging from 36 to 47 libraries per pool (K. C. Wiegand et al. 2010). A 250 to 350 bp size fraction from each pool was size-selected by gel purification from an 8% Novex TBE gel as above (Invitrogen Canada Inc.). The protocol described by Gnirke and colleagues (Gnirke et al. 2009) was followed for the hybridization reaction and subsequent washes, with an additional oligonucleotide block consisting of standard Illumina PCR primers PE1 and PE2 included in the hybridization reaction mixture to prevent cross-hybridization between library fragments. The incubation of the library fragments with the RNA probe pool was carried out for 24 hours at 65ºC, followed by binding to M-280 Streptavidin Dynabeads (Invitrogen Canada Inc.), washes, and elution of the captured library fragments. The eluted fragments were amplified by PCR using primers that anneal upstream of the adapter index sites and subjected to cluster generation and sequencing as described above.

3.5.15 Targeted MLL2 resequencing using long-range PCR and sample indexing

Due to the presence of inactivating mutations in different positions within the MLL2 gene, we sequenced the entire MLL2 locus (chr12:47,699,025-47,735,374; hg18) in a cohort of 35 FL and 37 DLBCL primary tumours, in 17 DLBCL derived cell lines and, as a control, in 8 centroblast samples. Genomic DNA from individual samples was normalized to 5 ng/µL, and 12.5 ng of each sample was PCR amplified using LA Taq DNA polymerase (TaKaRa).
Twelve long amplicons, of sizes ranging from 6600 bp to 7800 bp, were obtained under the following PCR conditions: 94ºC for 5 min, 35 cycles of 98ºC for 10 s and 68ºC for 8 min, and a final extension at 72ºC for 10 min. Amplicons were cleaned using AMPure beads (Beckman Coulter) and eluted with 20 µL of TE. All 12 amplicons per sample were normalized and pooled together. An individual indexed library was constructed from each sample (comprising the pool of the 12 long amplicons from MLL2). Approximately 500 ng of each pooled DNA sample was sheared for 10 min using a Sonic Dismembrator 550 with a power setting of “7” in pulses of 30 s interspersed with 30 s of cooling (Cup Horn, Fisher Scientific) and then analysed on 8% PAGE gels. The 200 to 300 bp DNA fraction was excised and eluted from the gel slice overnight at 4°C in 300 µL of elution buffer (5:1 (vol/vol) LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)/7.5 M ammonium acetate) and was purified using a Spin-X Filter Tube (Fisher Scientific) and by ethanol precipitation. Indexed libraries were prepared using a modified paired-end protocol. This involved DNA end-repair reactions at room temperature (20–25 °C) for 30 min (5 U T4 DNA polymerase, 1 U Klenow DNA polymerase (exonuclease minus), 100 U T4 polynucleotide kinase and 0.4 mM dNTP mix (Invitrogen)). End-repair reactions were purified using AMPure beads, and dATP was added to the 3′ ends using 5 U Klenow DNA polymerase (exonuclease minus) and 0.2 mM dATP in 1× Klenow Buffer (Invitrogen) with 30-min incubation at 37 °C in a Tetrad thermal cycler (MJ Research). DNA was again purified on AMPure beads using a Biomek FX. Adapter ligation (10:1 ratio) was completed with 0.03 µM adapter (multiplexing adapters 1 and 2), 100 ng DNA, 5 U T4 DNA ligase, 0.2 mM ATP and 1× T4 DNA Ligase Buffer (Invitrogen) for 30 min at room temperature. Adapter-ligated DNA was again purified using AMPure beads on a Biomek FX.
A selection of DNA samples was quantified on a Qubit (Invitrogen). Indexing enrichment PCR (15 cycles) was performed using Phusion DNA polymerase and Primers 1.0 and 2.0 (IDT) and 96 custom indexing primers. PCR cycles were: 98°C for 60 s, followed by 15 cycles of 98°C for 10 s, 65°C for 15 s and 72°C for 30 s. The PCR products were purified using AMPure beads and eluted in 40 µL elution buffer EB (Qiagen). Product quality was assessed by quality-control gels with 1.75% SeaKem LE agarose in 1× TAE (0.2 µL of every amplicon) and on a 2100 Bioanalyzer (Agilent Technologies). Indexed libraries were pooled together and sequenced on two lanes of a flow cell using an Illumina GAii platform. Individual indexes allowed the de-convolution of reads deriving from individual samples in multiplexed libraries such that many cases were concurrently sequenced in the same flow cell lane. The reads were matched to patient samples using the index read and were aligned with BWA to the human reference genome (hg18). Point mutations were identified using SNVMix with stringent post-filtration including a requirement for dual-strand coverage and requiring at least 10% of the aligned reads at a candidate variant to be non-reference. Insertions and deletions were identified using the SAMtools indel calling algorithm with similar filters. Only insertions and deletions supported by at least 2 reads on each strand were considered valid. The reported average coverage for each sample was calculated as the average depth of aligned reads across each of the coding (CDS) positions in the MLL2 locus.

3.5.16 Re-confirmation of MLL2 mutations in patient samples and DLBCL cell lines

MLL2 mutations found by targeted sequencing of MLL2 in lymphoma samples were validated by Sanger sequencing of the region surrounding each mutation, except in 15 cases. To do so, primers were designed to amplify 400-600 bp regions by PCR.
Validating forward and reverse primers carried T7 and M13Reverse 5’ tails, respectively. PCR conditions used were 94ºC for 2 min, 30 cycles of 94ºC for 30 s, 60ºC for 30 s and 72ºC for 1 min, and a final extension at 72ºC for 8 min. To determine the somatic or germ line origin of the mutations, mutations were re-sequenced in both tumour and constitutional DNA, the latter obtained from peripheral blood or negative-sort cells. The sequencing reactions consisted of 50 cycles of 96ºC for 10 s, 43ºC (for M13Reverse) or 48ºC (T7) for 5 s and 60ºC for 4 min and were analysed using an AB 3730XL. Variants were visually inspected to confirm their presence in tumour and absence from germ line traces. In 8 of the patient samples that carried 2 mutations in MLL2, to establish whether one allele contained both mutations or each allele contained one, we sequenced both candidate mutations using DNA from BAC clones from FL patient libraries. The primers and PCR conditions were the same as those used for the validation of each of those mutations.

3.5.17 Targeted resequencing of MEF2B coding exons 1 and 2

Coding exons 1 and 2 of MEF2B were PCR amplified using MEF2B_1F/R and MEF2B_2F/R primers under the same conditions as for MLL2 (previous section). Priming sites for T7 and M13Reverse were added to their 5′ ends to allow direct Sanger sequencing of amplicons. Amplicons were produced from whole genome amplified tumour genomic DNA from lymphoma patients and DLBCL cell lines. Whole genome amplification was performed using Repli-g Screening kit reagents (Qiagen), following the manufacturer’s instructions. All capillary traces were visually inspected.
3.5.18 Identification of structural aberrations involving BCL2 and BCL6

The presence of translocations involving MYC, BCL2 and BCL6 was determined for 49 of the DLBCL cases (Appendix B; Figure 3-4) using commercial dual color “break-apart” probes from Abbott Molecular (Abbott Park, IL) on formalin fixed paraffin embedded tissue in tissue microarrays using the described method (Chin et al. 2003). Additional fusion transcripts involving BCL2 or BCL6 were detected in these and the remaining libraries directly from the RNA-seq data using both Trans-ABySS (Robertson et al. 2010) and deFuse (McPherson et al. 2011).

3.5.19 Analysis of impact of COO and mutation status on outcome in DLBCL

The analysis included only patients treated with curative intent who received at least one cycle of R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone), the current standard multi-agent chemotherapy program used internationally for treating DLBCL. Overall survival (OS) was calculated as the time from date of diagnosis until death from any cause. Patients were censored at the time they were last known to be alive. OS was assessed using the Kaplan-Meier method and the log rank test was used for comparison between groups. Data were analysed using SPSS software (SPSS version 14.0 for Windows; SPSS Inc, Chicago, IL).

Figure 3-1: Overview of analyses performed

The samples analysed, experiments performed and an overview of the results. 717 non-synonymous candidate somatic mutations in 651 genes were identified from the genomes/exomes of 14 NHL patient samples with matched constitutional DNA. Ten genes with hot spots were identified directly from the RNA-seq and nine of these genes were not among the 651 (661 total genes). We determined that 230 of these genes contained 3 or more cSNVs when genome and RNA-seq data were combined. We also attempted validation of all cSNVs identified in the FL genome and two DLBCL exomes.
We validated at least one somatic cSNV in 317 genes (Appendix B) and identified two or more somatic mutations in 109 genes using PCR amplification and Sanger sequencing in tumour and normal DNA. Of the successfully re-sequenced cSNVs predicted from the genomes, 171 (94.5%) were confirmed somatic, 7 were false calls and 3 were present in the germ line. The entire MLL2 locus was screened for additional mutations in 37 DLBCL and 35 FL samples. MEF2B was similarly sequenced in an extension cohort of 292 DLBCLs and 261 FLs.

I prepared this figure with assistance from Maria Mendez-Lago and a version of it was previously published (Morin et al. 2011).

Figure 3-2: Genome-wide visualization of somatic mutation targets in NHL

Overview of structural rearrangements and copy number variations (CNVs) in the 11 DLBCL genomes and cSNVs in the 109 recurrently mutated genes identified in our analysis. Inner arcs represent somatic fusion transcripts identified in one of the 11 genomes. The CNVs and LOH detected in each of the 11 DLBCL tumour/normal pairs are displayed on the concentric sets of rings. The inner 11 rings show regions of enhanced homozygosity plotted with blue (interpreted as LOH). The outer 11 rings show somatic CNVs. Purple circles indicate the position of genes with at least two confirmed somatic mutations with circle diameter proportional to the number of cases with cSNVs detected in that gene. Circles representing the genes with significant evidence for positive selection are labelled. Coincidence between recurrently mutated genes and regions of gain/loss are colour-coded in the labels (green=loss, red=gain). For example B2M, which encodes beta-2-microglobulin, is recurrently mutated and is deleted in two cases.

This figure was prepared by Richard Corbett and myself and appeared in Morin et al (2011).

Figure 3-3: Recurrent mutations affecting the CREBBP and EP300 HATs

Mutations affecting both CREBBP and EP300 were detected.
(A) Alignment shows the histone acetyltransferase (HAT) domain of both CREBBP and EP300 proteins. Residues with cSNVs in either CREBBP or EP300 are coloured pink if affected in a single sample or green if affected in multiple samples (i.e. hot spots). Yellow indicates that mutation of the highlighted residue has been shown to reduce HAT activity in vitro and orange indicates that the residue interacts with the substrate analogue utilized in solving the crystal structure (Lys-CoA) (X. Liu et al. 2008). (B) The three dimensional structure of the HAT domain of CREBBP and EP300 based on the solved structure of EP300 (X. Liu et al. 2008). Mutated residues in the HAT domain of both CREBBP and EP300 are highlighted with the same colouring as (A) with positions numbered relative to EP300. Many of the mutated residues are in close proximity to the substrate (shown here is the substrate analogue and inhibitor Lys-CoA, blue-caged molecule).

This figure was prepared by myself (panel A) and Alexander Yakovenko (panel B) and was published by Morin et al (2011).

Figure 3-4: Overview of mutations and potential cooperative interactions in NHL

This heat map displays possible trends towards co-occurrence (red) and mutual exclusion (blue) of somatic mutations and structural rearrangements. Colours were assigned by taking the minimum value of a left- and right-tailed Fisher exact test. To capture trends a P-value threshold of 0.3 was used, with the darkest shade of the colour indicating those meeting statistical significance (P ≤ 0.05). The relative frequency of mutations in ABC (blue), GCB (red), unclassifiable (black) DLBCLs and FL (yellow) cases is shown on the left. Genes were arranged with those having significant (P < 0.05, Fisher exact test) enrichment for mutations in ABC cases (blue triangle) towards the top (and left) and those with significant enrichment for mutations in GCB cases (red triangle) towards the bottom (and right).
The total number of cases in which each gene contained either cSNVs or confirmed somatic mutations is shown at the top. The cluster of blue squares (upper-right) results from the mutual exclusion of the ABC-enriched mutations (e.g. MYD88, CD79B) from the GCB-enriched mutations (e.g. EZH2, GNA13). Presence of structural rearrangements involving the two oncogenes BCL6 and BCL2 (indicated as BCL6s and BCL2s) was determined with FISH techniques utilizing break-apart probes (Methods).

This figure was prepared by myself and appeared in Morin et al (2011).

[Figure 3-4 heat map labels. Markers: ABC-enrichment, GCB-enrichment. Bar chart axis: Cases (10 to 40), by case type ABC, GCB, U and FL. P-value shading: <0.05, 0.05-0.1, 0.1-0.3. Genes in axis order: MYD88, CD79B, BCL6s, TNFAIP3, CARD11, FAS, TMEM30A, CD58, CD70, STAT3, ETS1, HIST1H1C, CCND3, KLHL6, BTG1, BTG2, IRF8, B2M, EP300, CREBBP, MLL2, FOXO1, TNFRSF14, MEF2B, TP53, BCL2, SGK1, GNA13, EZH2, BCL2s.]

Figure 3-5: Determining the effects of mutations in GNA13 at the protein level

(A) A Western blot revealed the expected lack of GNA13 protein in DOHH2, the cell line with a truncating point mutation detected in the RNA-seq data. The lack of protein in Karpas422, SU-DHL-6 and WSU-DLCL2 was surprising, as we had not detected protein-truncating mutations in these cells. (B) Further analysis of the aligned sequence from these three cell lines and additional analysis utilizing a de novo transcript assembly approach (Trans-ABySS; Methods), revealed multiple aberrations that may explain the lack of protein. Firstly, in Karpas422 reads were observed to map to the first intron, suggesting that the intron is retained in a significant proportion of GNA13 transcripts (compare Karpas422 on the left to WSU-DLCL2 on the right).
Inspection of sequence contigs from this case revealed the likely cause of intron reads to be a deletion of 87 nt that removes the canonical splicing donor from this exon (Panel C, top). Splicing still appears to occur to a lesser extent using a non-GT donor. Assembled reads from SU-DHL-6 revealed a 2 nt deletion and a large 1028 nt deletion. The former would affect the reading frame and the latter removes the terminal stop codon. Finally, in WSU-DLCL2, the splicing donor after the third exon was apparently mutated, converting the GT donor to a GC sequence (not shown). As in the Karpas422 case, there was clear evidence for retention of this intron in GNA13 transcripts in WSU-DLCL2. Intron retention has previously been linked to nonsense-mediated transcript degradation (Lewis et al. 2003) and if that is the case here, could explain the lack of GNA13 protein in these cells.

This figure was prepared by Diane Trinh (Panel A) and myself (Panels B and C) and was published by Morin et al (2011).

Figure 3-6: Summary and effect of somatic mutations affecting MLL2

(A) Re-sequencing the MLL2 locus in 89 samples revealed mainly nonsense (red circles) and frameshift-inducing indel mutations (orange triangles). A smaller number of non-synonymous somatic mutations (green circles) and point mutations or deletions affecting splice sites (yellow stars) were also observed. All of the non-synonymous point mutations affected a residue within either the catalytic SET domain, the FYRC domain (“FY-rich C-terminal domain”) or PHD zinc finger domains. (B & C) Shown are two examples of the effect of somatic splice site mutations in the MLL2 gene (reverse strand) determined from RNA-seq data and verified by Sanger sequencing. (B) Point mutation affecting a splice donor site (GT>GA) at chr12:47,714,115 (yellow star). RT-PCR primers for exons 38 and 39 (blue rectangles) are indicated by black arrows.
Sequence from the upper gel band represents the reference allele, whereas sequence from the lower band represents the alternative transcript, which uses an alternative donor site within exon 38. (C) Point mutation that disrupts a splice acceptor site (AG>CG) at chr12:47,733,692 (yellow star). Intron retention is inferred due to enrichment of intronic sequence reads (top, grey bars). Black arrows depict PCR primers. Primers for exons 5 and 6 (lane 2 of gel on right) produced an amplicon of 267 bp and an amplicon of 602 bp. The difference in these sizes is the size of the intron (335 bp). Primers for exon 5 and the intron (lane 1) produced a single amplicon of 318 bp. Amplicon sequencing confirmed the presence of the reference splicing pattern and the intron retention event. This figure was prepared by myself (Panel A), Maria Mendez-Lago and Andy Mungall (Panels B and C) and variants of these panels were published by Morin et al (2011).

Figure 3-7: Overview of MEF2B mutations
(A) The cSNVs and confirmed somatic mutations in all FL and DLBCL cases are shown with the same symbols used in Figure 3A. Only amino acids with a cSNV in at least two patients are labelled. cSNVs were most prevalent in the first two protein coding exons of MEF2B (exons 2 and 3). (B) The crystal structure of MEF2B bound to one of its co-repressors (Cabin1) is shown with the mutated residues indicated. Some of the mutations affect regions of the protein that interact with co-activators and co-repressors (e.g. Y69, L67) and regions that are likely to interact with DNA (e.g. K4). The crystal structure of MEF2A bound to EP300 has also recently been solved and supports that L67 and Y69 are also important in that interaction (He et al. 2011). I prepared this figure and these panels were published as separate figures by Morin et al (2011).

Figure 3-8: Potential impact of recurrently mutated genes on BCR signalling and downstream messengers
(A) Autocrine and paracrine stimulation of IL-21R induces the dimerization and activation of STAT3, a positive regulator of PRDM1 expression (Diehl et al. 2008). Mutations affecting the DNA binding domain of STAT3 are known to act as dominant negatives, which would predict the inability to induce PRDM1 expression following IL-21 stimulation. (B) Multiple mutations are predicted to directly alter BCR signalling or alter the normal events subsequent to BCR-induced influx of the secondary messenger Ca2+. Cross-linking of CD58 has been shown to result in the phosphorylation of BLNK, Syk and PLC-gamma and lead to Akt activation (Ariel et al. 2009). Various mutations are expected to alter the ability of B cells to induce the expression of MEF2 target genes in response to the Ca2+ influx. The role of MEF2 gene family members in mediating epigenetic alterations downstream of the BCR has been inferred from a knockout study in which MEF2C was shown to be required for mediating calcium-dependent response to BCR signalling (Wilker et al. 2008) and the involvement of CREBBP/EP300 in this process has been inferred from MEF2-mediated transcriptional regulation in other cell types including T cells (Youn et al. 1999). This model predicts that influx of Ca2+ after BCR stimulation would result in the displacement of HDACs by activated Calmodulin-dependent protein kinase (CAMK), allowing HAT activity via CREBBP/EP300 and thus enabling transcription at MEF2 target loci. In this model, mutation of any of these three genes and potentially the S155F mutation in HDAC7 would diminish this effect and suppress the induction of MEF2 target loci after BCR stimulation. (C) Multiple mutations may affect the regulation of the activity of FOXO proteins following BCR stimulation. FOXO1 is a downstream target of the kinase AKT, which is activated during BCR signalling.
SGK, a related kinase (commonly mutated in this study), is known to phosphorylate FOXO3a in a similar way (Brunet et al. 2001) and we predict it may also phosphorylate FOXO1. Thus, mutations affecting the FOXO1 phosphorylation site or SGK1 could affect the regulation of FOXO1 nuclear localization and hence, its transactivation activity. The shortened FOXO1 protein produced by mutation of the initial codon would not contain this phosphorylation site and hence those mutations may also result in altered subcellular localization. Various mutations affecting NF-κB activity, which have been previously described, were also observed here (Lenz et al. 2008a; Davis et al. 2010; Ngo et al. 2011; Du et al. 2000). (D) Many of the recurrently mutated genes in B-NHL are involved in histone modification or themselves encode histone proteins (i.e. HIST1H1C, one of multiple genes that encode histone protein H1). CREBBP/EP300 and MLL2 each produce activating chromatin marks (H3K27Ac and H3K4me3, respectively). HDAC (e.g. HDAC7) and EZH2 produce inactivating marks by removing acetyl groups and trimethylating H3K27, respectively. Because heterozygous EZH2 Y641 mutations are known to effectively enhance PRC2 activity (Yap et al. 2011), each of the individual mutations may result in suppression of gene expression. Importantly, it is not known whether EZH2 and MLL2 regulate the expression of the same genes as MEF2B/CREBBP/EP300. I prepared a detailed draft of this figure and the polished version, which was published in Morin et al (2011), was produced by Martin Krzywinski.

Table 3-1: Hot spot mutations identified directly from the RNA-seq data
§This mutation was proven to be somatic in at least one case; that is, present in tumour DNA but absent in matched constitutional DNA. ‡Not mutated in any of the fourteen genomes or exomes sequenced. *Additional hot spots in BCL2 were excluded to simplify the table.
Genes indicated in bold are previously described targets of somatic mutation in lymphoma. Although known to be mutated, hot spots have not, to our knowledge, been described in BCL7A. Note that Tyr641 as previously described (Morin et al. 2010) is based on the Uniprot sequence Q15910, whereas this site corresponds to residue 602 and 646 in the Refseq annotations.

Codon        Number of samples   Distinct mutations   Gene name
602;646      30                  4                    EZH2
83§          9                   2                    MEF2B
69§          4                   2                    MEF2B
81§          2                   2                    MEF2B
1482§        3                   2                    CREBBP
1499§        2                   2                    CREBBP
1467§        2                   2                    EP300
287§         2                   1                    HLA-C
1            8                   5                    BCL7A‡
206§         4                   1                    MYD88‡
230§         2                   1                    MYD88‡
252§         6                   1                    MYD88‡
59           7                   3                    BCL2*
92;196;197   5                   4                    CD79B‡
73;160§      4                   2                    IKZF3‡
164;255§     3                   2                    PIM1‡
97;188       3                   2                    PIM1‡
18§          3                   2                    IRF4‡
587§         3                   2                    BCL6
45§          3                   2                    BTG2‡
141;234      3                   2                    TP53‡
24§          2                   2                    FOXO1‡
1§           3                   3                    FOXO1‡
12§          2                   1                    TNFRSF14
226§         2                   2                    CCND3‡
233§         2                   2                    CCND3‡
1§           3                   3                    B2M‡

Table 3-2: Overview of cSNVs and somatic mutations in most frequently mutated genes
Columns: Gene; NS, S, T (cases); NS, S, T (total mutations); Somatic cSNVs (RNA-seq cohort)*; P (raw); q; NS SP; T SP; Skew (M, WT, both)***

Gene          NS   S    T    NS   S    T    Somatic   P (raw)     q           NS SP   T SP   Skew
MLL2†         16   8    17   17   8    18   10        6.85x10^-8  8.50x10^-7  0.834   14.4   WT
TNFRSF14^G†   7    1    7    8    1    7    11        6.85x10^-8  8.50x10^-7  7.52    118    both
SGK1^G†       18   6    6    37   10   6    9         6.85x10^-8  8.50x10^-7  19.5    61.7   -
BCL10†        2    0    4    3    0    4    4         6.85x10^-8  8.50x10^-7  3.62    112    WT
GNA13^G†      21   1    2    33   1    2    5         6.85x10^-8  8.50x10^-7  24.1    25.7   both
TP53^G†       20   2    1    23   3    1    22        6.85x10^-8  8.50x10^-7  15.6    14.1   both
EZH2^G†       33   0    0    33   0    0    33        6.85x10^-8  8.50x10^-7  11.4    0.00   both
BTG2†         12   6    1    14   6    1    2         6.85x10^-8  8.50x10^-7  23.9    35.1   -
BCL2^G†       42   45   0    96   105  0    43        9.35x10^-8  8.50x10^-7  3.78    0.00   M
BCL6†**       11   2    0    12   2    0    2         9.35x10^-8  8.50x10^-7  0.175   0.00   M
CIITA†**      5    3    0    6    3    0    2         9.35x10^-8  8.50x10^-7  0.086   0.00
FAS†          2    0    4    3    0    4    2         1.52x10^-7  1.17x10^-6  2.54    66.5   WT
BTG1†         11   6    2    11   7    2    10        1.52x10^-7  1.17x10^-6  17.5    52.5   both
MEF2B^G†      20   2    0    20   2    0    10        2.05x10^-7  1.47x10^-6  14.2    0.00   M
IRF8†         11   5    3    14   5    3    3         4.55x10^-7  3.03x10^-6  8.82    28.2   WT
TMEM30A†      1    0    4    1    0    4    4         6.06x10^-7  3.79x10^-6  0.785   65.0   WT
CD58†         2    0    3    2    0    3    2         2.42x10^-6  1.43x10^-5  2.29    69.2   -
KLHL6†        10   2    2    12   2    2    4         1.00x10^-5  5.26x10^-5  5.42    16.4   -
MYD88^A†      13   2    0    14   2    0    9         1.00x10^-5  5.26x10^-5  12.4    0.00   WT
CD70†         5    0    1    5    0    2    3         1.70x10^-5  8.48x10^-5  7.08    44.0   -
CD79B^A†      7    2    1    9    2    1    5         2.00x10^-5  9.52x10^-5  10.9    18.3   M
CCND3†        7    1    2    7    1    2    6         2.80x10^-5  1.27x10^-4  6.55    36.3   WT
CREBBP†       20   7    4    24   7    4    9         1.00x10^-4  4.35x10^-4  2.72    6.04   both
HIST1H1C†     9    0    0    10   0    0    6         1.80x10^-4  7.50x10^-4  11.9    0.00   both
B2M†          7    0    0    7    0    0    4         3.90x10^-4  1.56x10^-3  16.6    0.00   WT
ETS1†         10   1    0    10   1    0    4         4.10x10^-4  1.58x10^-3  5.76    0.00   WT
CARD11†       14   3    0    14   3    0    3         1.90x10^-3  7.04x10^-3  3.37    0.00   both
FAT2†**       2    1    0    2    1    0    2         6.30x10^-3  2.25x10^-2  0.128   0.00   -
IRF4†**       9    4    0    26   5    0    5         7.00x10^-3  2.41x10^-2  0.569   0.00   both
FOXO1†        8    4    0    10   4    0    4         7.60x10^-3  2.53x10^-2  4.02    0.00   -
STAT3         9    0    0    9    0    0    4         2.19x10^-2  6.08x10^-2  -       -      both
RAPGEF1       8    3    0    10   3    0    3         2.98x10^-2  7.45x10^-2  -       -      WT
ABCA7         12   3    0    15   3    0    2         7.76x10^-2  1.67x10^-1  -       -      WT
RNF213        10   8    0    10   8    0    2         7.87x10^-2  1.67x10^-1  -       -      -
MUC16         17   12   0    39   25   0    2         8.32x10^-2  1.73x10^-1  -       -      -
HDAC7         8    4    0    8    4    0    2         8.94x10^-2  1.82x10^-1  -       -      WT
PRKDC         7    3    0    7    4    0    2         1.06x10^-1  2.05x10^-1  -       -      -
SAMD9         9    2    0    9    2    0    2         1.79x10^-1  3.01x10^-1  -       -      -
TAF1          10   0    0    10   0    0    2         3.03x10^-1  4.74x10^-1  -       -      -
PIM1          20   19   0    33   34   0    11        3.40x10^-1  5.23x10^-1  -       -      WT
COL4A2        8    2    0    8    2    0    2         7.64x10^-1  8.99x10^-1  -       -      -
EP300         8    7    1    8    7    1    3         9.54x10^-1  1.00        -       -      WT

Individual cases with non-synonymous (NS), synonymous (S) and truncating (T) mutations and total number of mutations of each class is shown separately as some genes contained multiple mutations in the same case.
The P values indicated in bold are the upper limit on the P value for that gene determined with the approach described by Greenman et al (2006) (see Methods), q is the Benjamini-corrected q value, and NS SP and T SP refer to selective pressure estimates from this model for the acquisition of non-synonymous or truncating mutations, respectively. †Genes significant at an FDR of 0.03. SNVs in BCL2 and previously confirmed hot spot mutations in EZH2 and CD79B are likely somatic in these samples based on published observations of others. *Additional somatic mutations identified in larger cohorts and insertion/deletion mutations are not included in this total. **Selective pressure estimates are both <1, indicating purifying selection rather than positive selection acting on this gene. ***"both" indicates we observed separate cases in which skewed expression was seen but where this skew was not consistent for the mutant or wild type allele. Genes with a superscript of either A or G were found to have mutations significantly enriched in ABC or GCB cases, respectively (P < 0.05, Fisher Exact test).

Table 3-3: Summary of the types of MLL2 somatic mutations

Sample type                      FL        DLBCL     DLBCL cell-line   Centroblast
Truncation                       18        4         7                 0
Indel with frameshift            22        8         6                 0
Splice site                      4         2         0                 0
SNV                              3         2         2                 0
Any mutation (number of cases)   31 / 35   12 / 37   10 / 17           0 / 8
Percentage                       89%       32%       59%               0%

Chapter 4: Novel chromosomal rearrangements in high-risk ALL

4.1 Introduction

Relapsed ALL is the most common cause of cancer-related death in young adults (Pui et al. 2008). The addition of imatinib to intensive chemotherapy in children and adolescents with BCR-ABL1-positive ALL dramatically improves event-free survival rates (Schultz et al. 2009), suggesting that the outcome of high-risk ALL can be improved by the use of small molecule inhibitors directed against rational therapeutic targets.
Although cases with the BCR-ABL1 fusion transcript may benefit from such treatment, many relapsed patients lack any such known recurrent chromosomal alterations or mutations. Thus identifying the full repertoire of genetic alterations in high-risk ALL is essential to design novel targeted therapies.

Here, we present a list of novel somatic point mutations and fusion transcripts identified in each of eleven cases using a combination of RNA-seq and WGS. In summary, our profiling of Ph-like ALL cases by next generation sequencing has revealed rearrangements or mutations deregulating either cytokine receptor or kinase signalling in 10 of 11 cases studied. These alterations include rearrangements activating the PDGFRB, JAK2 and ABL1 kinases, a rearrangement deregulating the erythropoietin receptor (EPOR), insertion mutations of IL7R and deletion of the JAK2 regulator SH2B3 (LNK), in addition to known CRLF2 rearrangements.

4.2 Results

4.2.1 Novel chromosomal rearrangements in Ph-like ALL

We performed RNA-seq on 11 Ph-like pediatric ALL cases. Three cases harboured known IGH@-CRLF2 translocations, one with a concomitant JAK2 mutation, and were included to identify additional lesions in this subtype of ALL. An average of 2.1 putative rearrangements per case was identified, ranging between 0 and 5 (Appendix C). No additional rearrangements targeting kinase signalling were identified in the three cases with known IGH@-CRLF2 translocations. Strikingly, we detected novel rearrangements deregulating tyrosine kinase and cytokine receptor signalling in six of the eight remaining cases, including two cases with the NUP214-ABL1 fusion previously identified in T-lineage ALL (Graux et al. 2004), one case each with the in-frame fusions EBF1-PDGFRB, BCR-JAK2 and STRN3-JAK2, and one case with an IGH@-EPOR rearrangement (Table 4-1 and Figure 4-1). In each case multiple paired reads mapped to the partner genes, and split reads mapping across the fusion were identified (Table 4-1).
Though additional putative rearrangements were identified in each case (Appendix C), these commonly did not result in expression of in-frame chimeric fusions, and showed a low level of read support, suggesting that they are not leukemogenic alterations. All putative rearrangements were validated by RT-PCR and direct sequencing (Figure 4-1). Fluorescence in situ hybridization confirmed that the EBF1-PDGFRB rearrangement was present in the predominant leukemic clone (not shown). The IGH@-EPOR rearrangement may arise due to a reciprocal t(14;19) translocation (Russell et al. 2009b), but fluorescence in situ hybridization for this rearrangement was negative (data not shown). Further analysis of RNA-seq data and genomic mapping with CREST (J. Wang et al. 2011a) revealed the fusion was the result of a 7.5 kb insertion of EPOR into the IGH@ locus. Moreover, digital gene expression profiling demonstrated marked overexpression of EPOR in this case, but not in the other Ph-like cases sequenced, as illustrated by the RPKM values for EPOR in each of the cases (Figure 4-2).

4.2.2 Recurrence evaluation

To determine the frequency of each chimeric fusion, we performed RT-PCR on 231 cases from the COG AALL0232 cohort, 40 (17%) of which were identified as Ph-like. The EBF1-PDGFRB rearrangement was detected in three additional patients, with exon 15 (N=2) or exon 14 (N=1) of EBF1 juxtaposed to exon 11 of PDGFRB, thus preserving the transmembrane and kinase domains of PDGFRB (Figure 4-3). These patients were also Ph-like and exhibited increased PDGFRB expression on microarray-based gene expression profiling (not shown). RNA-seq coverage analysis for one EBF1-PDGFRB case, PAKKCA, showed a sharp increase in read depth at PDGFRB intron 10 that corresponds to the genomic breakpoint (Figure 4-3B). EBF1 and PDGFRB both reside on chromosome 5q, and three of four EBF1-PDGFRB cases had an interstitial deletion between the partner gene breakpoints (Figure 4-3D).
Genomic PCR identified the breakpoint 0.5 kb downstream of EBF1 exon 15 and 2.3 kb upstream of PDGFRB exon 11 in the index case (Figure 4-3E). The NUP214-ABL1 rearrangement was accompanied by low-level amplification of the region between the two partner genes at 9q34 in both cases. Analysis of SNP array data from 211 additional P9906 cases identified two cases (one Ph-like) with identical genomic amplification, both also harbouring the fusion (not shown). No additional NUP214-ABL1, BCR-JAK2 or STRN3-JAK2 rearrangements were detected in the AALL0232 recurrence cohort. Furthermore, we screened for the common PDGFRB rearrangement, ETV6-PDGFRB (Golub et al. 1994), in Ph-like samples with high PDGFRB expression and found no additional cases.

4.2.3 Deletions and sequence mutations in Ph-like ALL

Two of the three cases with known CRLF2 rearrangements lacked known JAK mutations. However, analysis of RNA-seq data from case PAMDRM identified a complex JAK2 mutation (GPinsI682) at a residue commonly mutated in CRLF2 rearranged ALL (Mullighan et al. 2009c), which was subsequently confirmed by Sanger sequencing. Two Ph-like cases lacked a rearrangement based on RNA-seq data analysis. Whole genome sequencing of tumour DNA identified an in-frame insertion in the transmembrane domain of IL7R (L243>PGVCL) in case PALJDL, and analysis of RNA-seq data demonstrated that this mutation was expressed. This case also harboured a focal homozygous deletion of the first two exons of SH2B3 that was not evident by SNP array analysis. Accordingly, RNA-seq analysis showed reduced expression of SH2B3 in this case. SH2B3 is a negative regulator of JAK2 signalling, and inactivating mutations have been previously described in cases of myeloproliferative neoplasm (MPN) lacking JAK2 V617F mutations (Baran-Marszak et al. 2010; Oh et al. 2010). The remaining case (PALETF) harboured, among other somatic point mutations, a missense mutation within IL7 (G123E; Table 4-4).
No additional rearrangements predicted to result in fusion transcripts were identified on analysis of whole genome sequence data.

Screening of both the P9906 (N=188) and AALL0232 (N=248) cohorts identified recurrence of complex sequence mutations in IL7R (N=9). Furthermore, an additional Ph-like case from the AALL0232 cohort (PANKMB) harboured a focal homozygous deletion removing exons 1-2 of SH2B3 (data not shown). The other novel SNVs and indels have not been tested for recurrence in additional cases.

4.3 Discussion

Using RNA-seq, we have identified novel rearrangements activating tyrosine kinase and cytokine signalling in 10 of 11 Ph-like ALL cases. Interestingly, all patients with kinase-activating rearrangements in this study also harbour lesions affecting lymphoid transcription factors (commonly the dominant negative IKZF1 IK6 mutation) (Mullighan et al. 2009b) (Table 1-1). This suggests that kinase-activating lesions and mutations disrupting lymphoid development cooperate to induce B-lineage ALL, drive the Ph-like gene expression profile and confer resistance to standard therapies.

Chromosomal translocations resulting in activated tyrosine kinase signalling are recognized as driver lesions in a number of hematopoietic malignancies, the prototype being BCR-ABL1 in chronic myeloid leukemia (Perrotti & Harb 2011). Rearrangements involving the PDGFRB receptor (e.g. ETV6-PDGFRB) are present at low frequency in BCR-ABL1-negative chronic myeloproliferative neoplasms (CMPN) (Golub et al. 1994), whilst fusions activating the ABL1 kinase (e.g. NUP214-ABL1) are observed in approximately 5% of T-lineage ALL patients (Graux et al. 2004). Rearrangements involving JAK2, such as BCR-JAK2 and ETV6-JAK2, have been identified as rare events in myeloid neoplasms (Lacronique et al. 1997; Hagman & Lukin 2006). With the exception of IGH@-EPOR, we report these rearrangements for the first time in B-lineage ALL, and identify EBF1 as a novel 5’ fusion partner for PDGFRB.
The IGH@-EPOR rearrangement arose from insertion of EPOR into the IGH@ locus, and could not be detected by cytogenetic analysis. Thus, EPOR rearrangement may be more common in ALL than previously suspected, but if this type of mutation is the common mode by which EPOR is deregulated in ALL, its detection would require high-resolution techniques such as WGS or targeted resequencing.

EBF1 encodes a transcription factor that plays a major role in regulating B-cell differentiation (Hagman & Lukin 2006; Hagman et al. 1995), and deletions that abolish normal EBF1 function have been reported in B-lineage ALL (Mullighan et al. 2007). The fusion of EBF1 to PDGFRB is likely to impair the normal function of EBF1. Also, it is predicted that the EBF1 helix-loop-helix domain mediates homo-dimerization (Hagman et al. 1995) and facilitates autophosphorylation and activation of PDGFRB. In addition, PDGFRB is not normally expressed in leukemic blasts, and the juxtaposition to EBF1 represents a novel mechanism resulting in PDGFRB overexpression.

The identification of a PDGFRB fusion is of major clinical relevance, as CMPN patients with activating PDGFRB rearrangements show complete hematologic and molecular responses to imatinib treatment (Apperley et al. 2002). Whether the PDGFRB fusion can be targeted by current therapies is unknown. Importantly, however, the NUP214-ABL1 fusion is known to show sensitivity to tyrosine kinase inhibitors (Deenik et al. 2009), supporting the use of targeted therapies in ALL patients harbouring NUP214-ABL1. Activating mutations disrupting the pseudokinase domain of JAK2 are observed in myeloproliferative neoplasms (MPN) (Levine & Gilliland 2008), and cooperate with CRLF2 rearrangement in B-lineage ALL (Mullighan et al. 2009a; Harvey et al. 2010; Mullighan et al. 2009c), highlighting the therapeutic potential of JAK2 inhibition.
Our RNA-seq analysis identified two translocations also disrupting the pseudokinase domain of JAK2 (BCR-JAK2 and STRN3-JAK2). However, no additional JAK2 fusions were observed upon recurrence screening, which together with previous findings (Griesinger et al. 2005; Lacronique et al. 1997), suggests that each distinct JAK2 rearrangement is uncommon in ALL. As these translocations likely deregulate JAK signalling similar to the activating point mutations, JAK inhibitors may be a rational therapeutic approach in these patients as well.

The identified IL7R mutations result in in-frame insertions in the transmembrane domain. Similar mutations at the same site have recently been described in pediatric B and T-lineage ALL, and confer factor-independence in vitro (Shochat et al. 2011). This case and an additional case from the recurrence cohort harboured focal deletions of SH2B3, which encodes the protein LNK. Inactivating mutations within exon 2 of SH2B3 have been identified in JAK2 V617F-negative MPN, and removal of LNK-mediated negative regulation leads to activated JAK/STAT signalling (Oh et al. 2010). In our case, the deletion of exons 1-2 within SH2B3 is predicted to cause loss-of-function, and most likely cooperates with the IL7R mutation to induce constitutive activation of JAK/STAT oncogenic pathways, thus providing a novel mechanism for transformation.

4.4 Conclusions

In summary, using complementary genomic approaches we have identified rearrangements, sequence mutations and DNA copy number alterations that result in activation of kinase and cytokine receptor signalling in Ph-like ALL patients, the majority of which may be amenable to therapy with currently available tyrosine kinase inhibitors. These data support the screening of patients at diagnosis to identify those with Ph-like ALL, characterize the underlying genomic lesions driving this phenotype, and identify patients that may benefit from treatment with targeted therapeutic approaches.
4.5 Methods

4.5.1 Patients and samples

Patients for RNA-seq were selected from the COG P9906 trial of high-risk B-progenitor ALL, and were treated with an augmented intensive regimen of post-induction chemotherapy. All patients were classified as high-risk based on the presence of central nervous system or testicular disease, MLL rearrangement, or based on age, sex, and leukocyte count at diagnosis. BCR-ABL1 and hypodiploid ALL patients, in addition to those who experienced primary induction failure, were excluded. High-hyperdiploid (as defined by trisomy of chromosomes 4 and 10 on cytogenetic analysis) and ETV6-RUNX1 cases were excluded unless central nervous system or testicular involvement was present at diagnosis. A total of 207 enrolled cases had suitable material for SNP array analysis (Mullighan et al. 2009b). Ten index cases lacking a known chromosomal abnormality were chosen for RNA-seq analysis based on ROSE R8 and a BCR-ABL1-like gene expression profile predicted by PAM. RNA-seq was also performed on an additional high-risk B-ALL case enrolled on the St. Jude Children’s Research Hospital Total XV.

For recurrence screening of EBF1-PDGFRB, NUP214-ABL1 and BCR-JAK2 translocations, the COG AALL0232 cohort (ClinicalTrials.gov Identifier NCT00075725) was used. A total of 283 cases were genotyped by SNP 6.0 array, of which 231 BCR-ABL1-negative patients had suitable RNA for RT-PCR. All newly diagnosed B-ALL patients were determined high-risk for treatment failure based on WBC count at presentation, age, prior steroid therapy, or the presence of testicular disease. The average age at diagnosis was 10.0 ± 5.8 years. Twenty-eight cases (10.4%) were hyperdiploid, 20 (7.4%) were ETV6-RUNX1-positive, 17 (6.3%) were TCF3-PBX1-positive, 14 (4.9%) were BCR/ABL1-positive, 5 (1.9%) harboured MLL rearrangements, and 199 cases (70.3%) lacked a known chromosomal abnormality.
All samples were obtained with informed consent provided by the patient and/or a parent/guardian under protocols approved by the Institutional Review Board (IRB) at each of the Children’s Oncology Group institutions. The clinical study was approved by the National Cancer Institute and appropriate IRBs.

4.5.2 Prediction analysis of microarrays (PAM)

Gene expression profiling was performed using U133A microarrays for 203 P9906 cases, and U133A Plus 2.0 arrays for 608 AALL0232 cases (Affymetrix). Probe intensities were generated using the MAS 5.0 algorithm, probe sets called absent in all samples in each cohort were excluded, and expression data log-transformed (Mullighan et al. 2009b). To identify BCR-ABL1 and BCR-ABL1-like cases, we trained PAM using the second cohort of AALL0232 cases to detect BCR-ABL1 and BCR-ABL1-like cases in the first half of the AALL0232 cohort, setting a threshold of 2.2 to minimize the rate of false negative predictions. A gene-expression signature comprising 257 probe sets correctly identified 12 of 14 BCR-ABL1 cases and classified 40/269 (15%) as BCR-ABL1-like, determined by a PAM coefficient greater than 0.5. The same training conditions were applied to the P9906 cohort, with 42/207 cases (20%) classified as BCR-ABL1-like.

4.5.3 RNA-seq library preparation and sequencing

For library preparation, total RNA was extracted from fresh-frozen bone marrow biopsies or peripheral blood leukocytes using Trizol (Invitrogen). Poly(A)+ RNA was enriched from 5-10 µg of DNase I-treated total RNA using the MACS mRNA isolation kit (Miltenyi Biotec). Double-stranded cDNA was synthesized from the purified poly(A)+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina).
For RNA-seq library sequencing, clusters were generated on the Illumina cluster station and paired-end sequence reads were generated using v3-v5 sequencing reagents on the Illumina GAii and the Illumina GAiix platforms following the manufacturer's instructions. Image analysis, base-calling and error calibration were performed using v1.0, v1.3.2, v1.5.0 and v1.6.0 of Illumina's Genome analysis pipeline.

4.5.4 Whole genome shotgun library preparation and sequencing

Illumina paired-end whole genome shotgun libraries were prepared from 1 mg of genomic DNA from COG P9906 cases PALJDL and PALETF as described (ref. 6). The resulting libraries were sequenced on an Illumina GAiix platform using v5 paired-end 100 base sequence chemistry following the manufacturer's instructions. Image analysis, base-calling and error calibration were performed using v1.4.0, v1.5.0 and v1.8.0 of Illumina's Genome analysis pipeline.

4.5.5 Alignment-based analysis of tumour DNA and RNA sequence for somatic point mutations and fusion transcripts

All reads were aligned to the human reference genome (hg18) or (for RNA-seq) to a genome file that was augmented with a set of all exon-exon junction sequences using BWA version 0.5.4 (Li & Durbin 2009). RNA-seq libraries were aligned with an in-house modified version of BWA (BWA-R) that is aware of exon junction reads and considers them when determining pairing distance in the “sampe” (read pairing) phase of alignment. Candidate single-nucleotide variants (SNV) were identified in the aligned genomic sequence reads and the transcriptome (RNA-seq) reads using SNVmix and filtered as described in Chapter 3. For paired samples with matched constitutional DNA sequence, all variants with evidence (a SNVmix probability of at least 0.99 and 2 or more high quality base calls matching the SNV) in the constitutional DNA were considered germ line. Matched RNA-seq data was used to identify the expression level of mutations.
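The germ-line rule stated above can be illustrated with a short sketch. This is not the pipeline used in this work: the `NormalEvidence` record, its field names and the helper functions are hypothetical, and the variant identifiers are placeholders. Only the two thresholds (a SNVmix probability of at least 0.99 and two or more high-quality matching base calls in the constitutional DNA, both required) are taken from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text for evidence in the constitutional (normal) DNA.
MIN_SNVMIX_PROBABILITY = 0.99
MIN_SUPPORTING_BASE_CALLS = 2


@dataclass(frozen=True)
class NormalEvidence:
    """Hypothetical per-variant summary of the matched constitutional DNA."""
    snvmix_probability: float   # SNVmix probability that the variant is present
    supporting_base_calls: int  # high-quality base calls matching the SNV


def is_germline(evidence: NormalEvidence) -> bool:
    """Return True when the normal sample meets both evidence criteria."""
    return (evidence.snvmix_probability >= MIN_SNVMIX_PROBABILITY
            and evidence.supporting_base_calls >= MIN_SUPPORTING_BASE_CALLS)


def partition_candidates(candidates):
    """Split {variant_id: NormalEvidence} into (somatic_ids, germline_ids)."""
    somatic, germline = [], []
    for variant_id, evidence in candidates.items():
        (germline if is_germline(evidence) else somatic).append(variant_id)
    return somatic, germline


# Placeholder candidates; the values are invented for illustration only.
example = {
    "var1": NormalEvidence(snvmix_probability=0.02, supporting_base_calls=0),
    "var2": NormalEvidence(snvmix_probability=0.999, supporting_base_calls=14),
    "var3": NormalEvidence(snvmix_probability=0.995, supporting_base_calls=1),
}
somatic_ids, germline_ids = partition_candidates(example)
```

In this sketch a candidate with a high SNVmix probability but only one supporting base call in the normal sample is retained as a somatic candidate, because the text requires both criteria to be met.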
Mutations were annotated on genes using the Ensembl transcripts (version 54). Because we observed situations where exons were represented in Ensembl transcripts that were not also represented in a Refseq, we only retained candidate mutations in exons shared by both annotations. Candidate mutations were subsequently reviewed visually in IGV (Robinson et al. 2011) and those appearing to be artefacts were removed. The deFuse software (version 0.2.0; http://compbio.bccrc.ca/?page_id=275) (McPherson et al. 2011) was utilized for the identification of putative fusion transcripts using the hg18 reference genome. Predicted fusion sequences were subsequently aligned using BLAT and those with numerous high-confidence alignments were removed. Predicted fusions were further filtered to remove those with fewer than 2 ‘split reads’ (those that cross the fusion point) and predicted fusions involving adjacent (or nearby) gene pairs were also removed.

4.5.6 Validation of candidate somatic mutations identified in genomes and exomes

Validation was attempted for each of the candidate point mutations identified in any of the eleven samples analyzed by RNA-seq. This was accomplished by designing primers to amplify a 200 to 300 bp region around the targeted variant with one primer within reach of a single read (<=75 bases). Primers were synthesized by Integrated DNA Technologies at a 25 nmol scale with standard desalting (IDT, Coralville, IA). Polymerase cycling reactions were set up in 96-well plates and comprised 0.5 µM forward primer, 0.5 µM reverse primer, 1-3 ng of gDNA template, 5X Phusion HF Buffer, 0.2 µM dNTPs, 3% DMSO, and 0.4 units of Phusion DNA polymerase (NEB, Ipswich, MA, USA). Reaction plates were cycled on a MJR Peltier Thermocycler (model PTC-225) with cycling conditions of a denaturation step at 98°C for 30 s, followed by 35 cycles of {98°C for 10 s, 69°C for 15 s, 72°C for 15 s} and a final extension step at 72°C for 10 min.
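The amplicon design constraint described in this section (a 200 to 300 bp amplicon with one primer within reach of a single 75-base read of the variant) can be expressed as a simple check. The sketch below is an illustration under stated assumptions rather than the primer design procedure actually used: the function name is hypothetical, coordinates are assumed to be 1-based and inclusive, and a read is assumed to begin at the first base of the amplicon on either end.

```python
READ_LENGTH = 75     # bases covered by a single read, per the text
MIN_AMPLICON = 200   # lower bound on amplicon length (bp), per the text
MAX_AMPLICON = 300   # upper bound on amplicon length (bp), per the text


def amplicon_is_suitable(amplicon_start, amplicon_end, variant_pos,
                         read_length=READ_LENGTH):
    """Check the amplicon length and that one read end can reach the variant.

    Coordinates are assumed to be 1-based and inclusive (an assumption made
    for this sketch, not a detail given in the text).
    """
    length = amplicon_end - amplicon_start + 1
    if not MIN_AMPLICON <= length <= MAX_AMPLICON:
        return False
    if not amplicon_start <= variant_pos <= amplicon_end:
        return False
    # Number of bases a read must span from each amplicon end to cover the site.
    from_left = variant_pos - amplicon_start + 1
    from_right = amplicon_end - variant_pos + 1
    return min(from_left, from_right) <= read_length


# Hypothetical coordinates for illustration only.
near_primer = amplicon_is_suitable(1000, 1249, 1050)  # 250 bp, 51 bases from left
mid_amplicon = amplicon_is_suitable(1000, 1249, 1125)  # 125+ bases from both ends
too_short = amplicon_is_suitable(1000, 1149, 1050)     # 150 bp amplicon
```

A variant placed in the middle of a 250 bp amplicon fails the check because neither 75-base read would cover it, which is the situation the design rule in the text is meant to avoid.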
PCR reactions were visualized by SybrGreen (Life Technologies, Carlsbad, CA, USA) in 1.2% agarose (SeaKem LE, Cambrex, NJ, USA) gels electrophoresed for 90 min at 170 V to assess PCR success. The resulting amplicons were pooled by patient and template, one for tumour and one for normal DNA, with equal volumes from each PCR reaction and an indexed Illumina paired-end sequencing library was constructed from each pool as described in Chapter 3. The resulting library was sequenced using v5 paired-end sequencing reagents on the Illumina GAiix platform following the manufacturer's instructions. Between the paired 75 nt reads a third 7 nt read was performed using the following custom sequencing primer to sequence the hexamer barcode [5’-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG]. Image analysis, base calling and error calibration were performed using v1.8.0/RTA 1.8.70.0 of Illumina's Genome analysis pipeline. Reads were aligned using BWA and variants were visually confirmed for validity and somatic status in IGV (absence from constitutional DNA).

For genomic variants, validation was accomplished by PCR amplification and sequencing by two separate approaches. The first approach utilized direct amplicon sequencing by the Sanger method in two directions. The variant site was then visually confirmed in the traces. The second approach involved PCR followed by Ion Torrent library construction and sequencing. The percentage of high-quality (Q17) Ion Torrent reads supporting the non-reference allele is shown (Table 4-4).

Figure 4-1: Novel rearrangements in Ph-like ALL.
(A) Structure of the chimeric fusions identified by RNA-seq in four Ph-like ALL cases. (B) Confirmation of predicted fusions by RT-PCR. E/P refers to the EBF1-PDGFRB fusion et cetera. (C) Sanger sequencing of RT-PCR products confirming in-frame fusion. This figure was prepared by Kathryn Roberts.

Figure 4-2: Insertion of EPOR gene into IGH locus. (A) Illustration of the relocation of the EPOR gene, which is located at 19p13.2, into the IGH@ locus on 14q32.33 in a single B-ALL case. The regions boxed in red show the normal location of EPOR and IGH@ and the red arrow indicates the insertion of the EPOR gene into IGH@. (B) Approximate breakpoints are indicated in red and are supported by multiple read pairs in which one read maps to chromosome 19 (orange) and the mate maps to chromosome 14. (C) Sequence validation of one of the insertion breakpoints. (D) Expression levels of EPOR were compared across all eleven cases using RPKM as a measure of EPOR mRNA level. This figure was prepared by myself (Panels A, B and D) and Kathryn Roberts (Panel C).
[Figure 4-2 image not reproduced. Recoverable labels: bar chart titled "EPOR expression in 11 ALL cases" (y-axis: Expression (RPKM), 0 to 100; x-axis: libraries HS0825, HS0894, HS0897, HS1533, HS1534, HS1535, HS1536, HS1537, HS1576, HS1584, HS2163); breakpoint junction labelled EPOR exon 8 / IGHV4, chr19:11349946|chr14:105549154.]

Figure 4-3: Recurrence screening and structure of the EBF1-PDGFRB rearrangement. (A) Confirmation of EBF1-PDGFRB fusion in four HR B-ALL cases. (B) RNA-seq read depth at the EBF1 and PDGFRB loci in case PAKKCA, showing expression across the EBF1 locus, but increased expression of PDGFRB downstream of the genomic breakpoint. (C) Deletions between the EBF1 and PDGFRB breakpoints are shown for two cases with EBF1-PDGFRB rearrangement (at right), and four non-rearranged cases with focal EBF1 deletions (at left). (D) Genomic mapping of the EBF1-PDGFRB rearrangement breakpoint by PCR (top), and subsequent sequencing (bottom) showing juxtaposition of EBF1 intron 15 to PDGFRB intron 10, with the addition of non-consensus nucleotides between the breakpoints. This figure was prepared by Jinghui Zhang and Kathryn Roberts.
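Figure 4-2D reports EPOR expression as RPKM (reads per kilobase of exon model per million mapped reads). As a reminder of how that unit normalizes a raw read count, here is a minimal sketch of the standard formula. The function name and the example numbers are illustrative only and are not values from this study.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    read_count          reads mapped to the gene's exons
    gene_length_bp      summed exon length of the gene, in bases
    total_mapped_reads  all mapped reads in the library
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a 2 kb gene in a 10 M read library.
example = rpkm(500, 2_000, 10_000_000)
```

Because the value is scaled by both gene length and library size, it allows the same gene to be compared across libraries of different depth, as is done across the eleven cases in the figure.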
Table 4-1: Overview of rearrangements detected by RNA-seq and confirmed
Chromosomal rearrangements affecting kinase and cytokine receptor signalling identified by RNA-seq in 10 COG P9906 cases and 1 case from St. Jude Children’s Research Hospital**. Genetic lesions associated with the B-cell pathway (IKZF1, EBF1, and PAX5) and kinase activation (JAK2) are also described.

Sample ID | Fusion | Key lesions
PAKTAL | STRN3-JAK2 | IKZF1 deletion and (L117fs)
PAKKCA | EBF1-PDGFRB | IKZF1 (IK6); EBF1 deletion; CDKN2A/B deletion
PAKVKK | NUP214-ABL1 | IKZF1 (S402fs); PAX5 deletion; CDKN2A/B deletion
PALIBN | IGH@-EPOR | IKZF1 (e1-5) deletion; CDKN2A/B deletion
PAKYEP | BCR-JAK2 | IKZF1 (IK6); EBF1 deletion; PAX5 deletion; CDKN2A/B deletion
PAMDRM | IGH@-CRLF2* | JAK2ins683#; IKZF1 (e1-6) deletion; EBF1 deletion; PAX5 V319fs; CDKN2A/B deletion
PAKKXB | IGH@-CRLF2* | IKZF1 (IK6); CDKN2A/B deletion
PALETF |  | EBF1 deletion; IL7 (G123E)#
PAKHZT | IGH@-CRLF2* | JAK2 (R867Q); CDKN2A/B deletion
PALJDL |  | PAX5 deletion; CDKN2A/B deletion; IL7R 242IFPGVC
SJ | NUP214-ABL1** | IKZF1 (Ik6)

*Previously known; #identified by RNA-seq analysis

Table 4-2: Somatic SNVs identified in 11 ALL cases by RNA-seq

Sample ID | Gene | Chr | Position | Sequence change | Amino acid change
PAKHZT | JMJD3 | chr17 | 7691628 | T>C | S433P
PAKHZT | RAB11B | chr19 | 8373013 | G>A | A94T
PAKHZT | JAK2 | chr9 | 5079702 | G>A | R867Q
PAKHZT | OFD1 | chrX | 13696806 | T>C | S531P;S853P;S993P
PAKVKK | ILF3 | chr19 | 10655385 | G>A | R642Q;R646Q
PAKVKK | NSMAF | chr8 | 59682919 | C>T | R241H
PAKVKK | PAX5 | chr9 | 37010775 | C>G | G24R
PAKKXB | TP53 | chr17 | 7514718 | T>A | N345I
PAKKXB | ATRX | chrX | 76805676 | G>T | S1256*;S1286*;S131*;S1324*;S1357*
PALIBN | ZNF12 | chr7 | 6697396 | G>A | H494Y;H529Y;H530Y;H568Y
PALIBN | PHF8 | chrX | 54065429 | C>G | R94T;R130T
PALJDL | VCL | chr10 | 75544616 | G>A | A1003T;A1071T
PALJDL | WDR45L | chr17 | 78167098 | C>T | D283N;D341N
PALJDL | MOV10 | chr1 | 113042873 | G>A | R785H;R841H
PALJDL | HECA | chr6 | 139529155 | C>T | P105S
PAMDRM | LGALS8 | chr1 | 234768983 | G>A | V105M;V106M
PALETF | LYSMD4 | chr15 | 98087051 | C>A | V231F;V232F
PALETF | RAD51C | chr17 | 54129161 | C>A | D171E
PALETF | DMPK | chr19 | 50975027 | G>A | P44L
PALETF | ADNP | chr20 | 48942253 | G>A | S802F
PALETF | TSHZ2 | chr20 | 51304885 | C>T | P494L
PALETF | APC | chr5 | 112204169 | T>C | L1660P

Table 4-3: Somatic indels detected in 11 ALL cases by RNA-seq

Sample ID | Gene | Chr | Position | Sequence change | Amino acid change
PAKVKK | IKZF1 | chr7 | 50435466 | +CCTCCCC | N403fs
PAKTAL | IKZF1 | chr7 | 50411913 | -G | L117fs
PAKKXB | USP9X | chrX | 40940849 | -TTAATTACGC | L1383fs
PALJDL | IL7R | chr5 | 35910329 | +CGGGGGTCTGCC | L243>PGVCL
PAKYEP | VEZT | chr12 | 94175099 | +T | W70fs
PALETF | FLT3 | chr13 | 27506253 | +ATTTTGGAAAGTAACATAAGAGATCATATTCATATTGTCTGAAATCAATGTAGAAGTACTCCCAATTTT | KWEFPRENEYFYVDFREYEYDLinsL604
PAMDRM | JAK2 | chr9 | 5068357 | +GGACCC | GPinsR683
PAMDRM | MKI67 | chr10 | 129793424 | -GAG | S2223_P2224>S
PAMDRM | HDAC7 | chr12 | 46477155 | +T | K205fs
SJ05-2623 | IKZF1 | chr7 | 50411801 | +GCTCAGG | A79fs

Table 4-4: Validation results for somatic indels/SNVs detected in two ALL genomes

Gene | Chr | Position | Class | Sequence change | Amino acid change | Validation result (Sanger) | Ion Torrent result*: % non-reference reads (total Q17 depth in tumour)
IL7R | chr5 | 35910329 | insertion | +CGGGGGTCTGCC | L243>PGVCL | Somatic | 46.7% (1430)
ZNF468 | chr19 | 58036999 | frameshift | +GGG,-A | A67fs | Inconclusive (PCR failure) | 27.5% (1416)
CGNL1 | chr15 | 55608184 | missense | C>T | A1027V | Somatic | 40.2% (1568)
COL4A1 | chr13 | 109628259 | missense | C>T | V883I | Somatic | 60.6% (2646)
HECA | chr6 | 139529155 | missense | C>T | P105S | Somatic | 52.3% (7315)
ITGAL | chr16 | 30435856 | missense | T>G | V891G; V932G; V975G | False | Fail/low coverage
MOV10 | chr1 | 113042873 | missense | G>A | R841H | Somatic | 47.2% (2706)
OR8K1 | chr11 | 55870928 | missense | G>C | A280P | Somatic | 48.5% (1631)
PDZD2 | chr5 | 32019253 | missense | G>A | A238T | Somatic | 44.7% (9731)
PPFIA2 | chr12 | 80212905 | missense | C>T | R922Q | Somatic | Fail/low coverage
PRAMEF1 | chr1 | 12776775 | missense | T>C | C138R | Inconclusive (primer design failure) | Fail/low coverage
ROBO1 | chr3 | 78767718 | missense | C>A | G1051C | Somatic | 74.6% (1978)
VCL | chr10 | 75544616 | missense | G>A | A1071T | Somatic | 40.7% (2779)
WDR45L | chr17 | 78167098 | missense | C>T | D341N | Inconclusive (sequencing failure) | 34.8% (589)
ADNP | chr20 | 48942253 | missense | G>A | S802F | Somatic | 55.5% (854)
ALPK2 | chr18 | 54335283 | missense | C>T | G1926E | Somatic | 44.4% (8527)
APC | chr5 | 112204169 | missense | T>C | L1660P | Somatic | 26.6% (2286)
ARSI | chr5 | 149657296 | missense | C>T | V462M | Somatic | 2.6% (153)
ATP7B | chr13 | 51437128 | missense | A>G | S584P | Somatic | 30.2% (6566)
DMPK | chr19 | 50975027 | missense | G>A | P44L | Somatic | 45.4% (10052)
IL7 | chr8 | 79811310 | missense | C>T | G123E | Somatic | 45.6% (936)
KRT1 | chr12 | 51356339 | missense | C>T | G488R | Somatic | 41.8% (1958)
LOC643677 | chr13 | 102188387 | missense | C>A | A4221S | False | Fail/low coverage
LOC643677 | chr13 | 102190128 | missense | T>A | K3640N | Inconclusive (PCR or sequencing failure) | 0.37% (807), False
LYSMD4 | chr15 | 98087051 | missense | C>A | V232F | Somatic | 35.6% (1142)
MAGI1 | chr3 | 65322046 | missense | C>T | D1196N | Somatic | 53.3% (608)
MYOF | chr10 | 95147017 | missense | T>G | K437N | Somatic | Fail/low coverage
MYOF | chr10 | 95147027 | missense | A>T | V434E | Somatic | Fail/low coverage
PAPPA2 | chr1 | 174830357 | missense | G>A | G332R | Somatic | 42.7% (2901)
PDE4B | chr1 | 66156917 | missense | C>T | S31F | Somatic | 44.9% (2615)
RAD51C | chr17 | 54129161 | missense | C>A | D171E | Somatic | 51.0% (1572)
RHAG | chr6 | 49691341 | missense | C>T | E199K | Somatic | 43.7% (2160)
SPO11 | chr20 | 55342171 | missense | C>T | S81F; S119F | Somatic | 44.7% (877)
TSHZ2 | chr20 | 51304885 | missense | C>T | P494L | Somatic | 40.7% (1303)
ZADH2 | chr18 | 71042423 | missense | T>A | Y357F | False | 0% (2907), False
ZNF280A | chr22 | 21199339 | missense | T>G | T206P | Somatic | 46.4% (1656)

*Variant was found to be somatic by Ion Torrent sequencing unless indicated otherwise.
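The Ion Torrent column of Table 4-4 gives, for each site, the percentage of high-quality (Q17) reads supporting the non-reference allele together with the total Q17 depth. A minimal sketch of that calculation follows. It assumes that "Q17" is applied as a per-base quality cutoff at the variant position and that any base other than the reference counts as non-reference; the function name and the input layout (a list of base and quality pairs) are hypothetical, and the example observations are made up for illustration.

```python
from typing import List, Optional, Tuple


def percent_non_reference(
    observations: List[Tuple[str, int]],
    ref_base: str,
    min_quality: int = 17,  # Q17 cutoff, assumed to be per base
) -> Tuple[Optional[float], int]:
    """Return (% non-reference reads, depth) among bases passing the cutoff.

    observations holds one (base, quality) pair per read covering the site.
    The percentage is None when no read passes the quality filter.
    """
    usable = [base for base, quality in observations if quality >= min_quality]
    depth = len(usable)
    if depth == 0:
        return None, 0
    non_ref = sum(1 for base in usable if base != ref_base)
    return round(100.0 * non_ref / depth, 1), depth


# Made-up pileup at a C>T site: two low-quality bases are ignored.
pileup = [("C", 30), ("T", 30), ("T", 20), ("C", 10), ("T", 5)]
result = percent_non_reference(pileup, ref_base="C")
```

Reporting the depth next to the percentage, as the table does, makes low-confidence values such as 2.6% at a depth of 153 easy to spot.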
Chapter 5: Conclusions

5.1 Tools for routine analysis of DNA sequence from cancers

The routine discovery and annotation of somatic mutations from individual tumours is, for reasons described below, likely to quickly become a staple in clinical research, in particular in the study of the genetic nature of cancers. The methods described in this thesis offer an efficient and relatively straightforward approach to identify the protein-altering single nucleotide mutations, which are the simplest class of somatic mutation to interpret. I have developed pipelines for analyzing RNA-seq data, paired tumour/normal genomes/exomes and targeted deep resequencing data. These pipelines rely on the popular aligner BWA or the version adapted for RNA-seq applications (BWA-R) and most steps are performed directly from the alignment files in BAM format. These tools have enabled the discovery of more than 100 genes that are recurrently mutated in NHL and have resulted in important advancements to our understanding of both NHL and B-ALL. In hopes of benefiting others, the tools I developed during this research are being provided to the scientific community in the Galaxy framework (Blankenberg et al. 2010) as a set of Galaxy tools and workflows (Appendix D; http://toolshed.g2.bx.psu.edu). I have also developed a MySQL database that enables storage, retrieval and mining of the mutation data to produce a lasting resource that will facilitate future research leveraging this large data set (Appendix E).

5.2 Towards affordable genome sequencing

The throughput of existing next generation sequencing technologies has increased immensely owing to improvements in sequencing chemistry, optics, cluster density and base calling. Meanwhile, novel “third generation” sequencing technologies continue to emerge, such as single molecule sequencing, offered by Pacific Biosciences, and the novel optics-free detection method utilized by the Ion Torrent platform (Rothberg et al. 2011).
The two major players in the arena of WGS have thus far been the SOLiD and Illumina platforms. Each of these sequencers is able to “completely” sequence a human genome to approximately 30x coverage in under two weeks for a cost of less than $10,000. A recent improvement made to the Illumina sequencing reagents has enabled us to sequence a genome to 30x with as few as 3 lanes (on a HiSeq 2000 instrument), roughly halving the current sequencing cost. It is difficult to predict the extent to which further technical improvements are possible for each of these platforms. In any case, there is an expectation that in the near future, we will be capable of sequencing a genome for $1000 or possibly less (the so-called $1000 genome) (Mardis 2006).

5.3 Pharmacogenomics and personalized cancer treatment

The ramifications of affordable genome sequencing will be widespread and the potential positive impact on health care is expected to be enormous. With our increasing understanding of the genetic nature of disease and response (both positive and negative) to certain drugs, genomics can offer improvements to both diagnosis and treatment. The emerging field of pharmacogenomics aims to utilize genomic information to study the effect of both inherited and acquired variation on response to treatments (L. Wang et al. 2011b). The ultimate goal of the field is, for a given patient, to predict whether a given drug will produce the desired therapeutic effect, determine the optimal dosage, and uncover any potential for undesired side effects.

Pharmacogenomics is becoming a key component in the treatment of certain cancers but we are likely only beginning to appreciate the full scope of potentially important mutations in human cancer. Targeted anti-cancer therapeutics such as trastuzumab (Herceptin) are typically only used when certain mutations are present (L. Wang et al. 2011b).
Trastuzumab inhibits the activity of Her-2, encoded by ERBB2, a gene that is amplified in some breast cancers. More recently, this gene has been found to be commonly amplified in other cancers such as those affecting the oesophagus and stomach (Langer et al. 2011; Yk et al. 2011). This observation led to the suggestion that patients with these cancers containing amplified ERBB2 may also benefit from treatment with trastuzumab (Gravalos & Jimeno 2008) and trials testing the efficacy of trastuzumab treatment for such cancers are being conducted (Grávalos et al. 2011). Similarly, a hot spot mutation in the BRAF oncogene (V600E) is present in the majority of melanoma cases and is found to a lesser extent in other cancers (Davies et al. 2002). A small molecule inhibitor that specifically targets the mutated protein (PLX4032) has proven successful in treating metastatic melanoma in clinical trials (Flaherty et al. 2010). A recent report demonstrated the presence of the BRAF V600E mutation in all cases of a rare hematologic cancer known as hairy cell leukemia (Tiacci et al. 2011). Such reports suggest the possibility that drugs such as PLX4032 may become broadly applicable in the treatment of diverse cancer types in cases where the targeted mutation can be detected.

For many cancers a “druggable” and highly recurrent molecular alteration has not yet been identified. As a result, these diseases have no targeted standard of care treatment but might benefit from drugs used to treat other cancers if “actionable” mutations, those with targeted therapies available, could be identified. We recently described the genomic profiling of a patient with a rare adenocarcinoma of the tongue and the resultant identification of RET amplification and enriched RET mRNA abundance in the tumour cells. This suggested the RET tyrosine kinase as a potential actionable target for treatment of this patient (S. J. Jones et al. 2010).
Based on this observation, the patient was treated with sunitinib (a RET inhibitor), which resulted in a temporary response. This study exemplifies the benefit of applying unbiased and complementary methods such as genome sequencing and RNA-seq to identify actionable targets. As the list of anti-cancer agents with known targets increases, one aim of cancer sequencing is that actionable mutations will be routinely detected in rare tumours or those with no known treatment, allowing approved drugs to be successfully applied to a larger patient population. Methods for routinely and rapidly detecting such mutations are emerging but are suited to specific types of mutations (namely hot spots) (Thomas et al. 2007; Macconaill et al. 2009). As the cost of sequencing continues to drop and our ability to accurately identify mutations from these data improves to an acceptable level (i.e. very close to 100% accuracy), it has been predicted that sequencing could supplant such approaches (Macconaill & Garraway 2010). Emerging rapid-turnaround sequencers such as those described in section 1.2.4, along with appropriate custom capture or PCR designs to amplify known regions of hot spot mutations (or entire genes), may also facilitate such assays in the future.

Actionable therapeutic targets are not necessarily restricted to known molecular alterations. In the case of the patients profiled in Chapter 4, I identified novel lesions that may indicate that additional B-ALL patients should be considered candidates for treatments not previously considered a viable option. Specifically, in this study we revealed in-frame fusions involving JAK2 and ABL1 in multiple patients. Typically, therapy with tyrosine kinase inhibitors such as dasatinib is only given to a patient if their malignant cells harbour the translocation t(9;22)(q34;q11) (Gruber et al. 2009), which is the most common mechanism by which the ABL1 kinase is deregulated in ALL.
Further, specific inhibitors of constitutively active JAK2 mutant proteins are being pursued by multiple groups for the treatment of patients with the JAK2 V617F mutation (Quintás-Cardama et al. 2011; Hart et al. 2011). If the fusions such as BCR-JAK2 and NUP214-ABL1 described in that chapter are ultimately demonstrated to be inhibited by these compounds, then this study has effectively increased the population of patients who could be candidates for these targeted treatments. PCR-based tests for identifying these additional rearrangements could be designed to routinely screen for such events, enabling the rapid and routine identification of malignant cells harbouring these fusions. Considering that novel fusions were identified in five of the 11 cases profiled, it seems reasonable to predict that we have still not identified the full extent of fusion transcripts responsible for driving this disease. Hence, genomic profiling using RNA-seq, or another unbiased sequencing-based method combined with appropriate downstream analysis, is arguably a preferable method for screening for such mutations.

The new hot spot mutations described in Chapters 2 and 3 are indicative of dominant-acting mutations, which can result in the activation of oncogenes (for example, BRAF V600E) or dominant negative mutations observed in some tumour suppressors (Hassan et al. 2008). The mutation affecting EZH2 in particular appears to confer a gain of function phenotype, suggesting it is an oncogenic driver of lymphoma. Similarly, while our manuscript was under review, a separate group reported the hot spot mutations we observed in MYD88 and demonstrated these to be oncogenic (Ngo et al. 2011). Whether any of the additional genes with similar mutation patterns in Chapter 3 such as MEF2B, FOXO1, and CCND3 produce constitutively active (or neomorphic) proteins requires further experimentation. These proteins may ultimately be investigated as new targets for small molecule inhibitors.
Owing to the apparent oncogenic activity of EZH2 in other cancer types such as breast and prostate, inhibition of EZH2 activity has already been explored as a novel therapy (Tan et al. 2007). Specific inhibition of mutant EZH2 is being pursued as a novel treatment option for GCB DLBCL and FL in our lab and by other groups (Sneeringer et al. 2010; Copeland et al. 2010). However, until a crystal structure of wild type EZH2 is solved, rational design of small molecules to selectively inhibit this enzyme will be challenging. Treatment of DLBCL may also ultimately be improved using some of the molecular lesions identified in this research, particularly the data described in Chapter 3, as biomarkers. The two subtypes of DLBCL, namely ABC and GCB, are known to respond differently to the current standard of care (R-CHOP). It is believed that the ABC subtype, which exhibits an inferior clinical prognosis, is driven by mutations or other alterations that result in deregulated expression of NF-κB target genes (Lenz & Staudt 2010). Indeed, our data includes some of the previously described hot spot mutations known to result in deregulated NF-κB such as those affecting MYD88 and CD79B and these mutations were significantly more common in our ABC cases. Based on the notion that ABC DLBCL is driven by NF-κB signalling, drugs that inhibit NF-κB activity are being pursued (Dunleavy et al. 2009; Hernandez-Ilizaliturri et al. 2011). If therapies specifically targeting NF-κB in ABC DLBCL prove successful, then routine tests for classifying samples by molecular subtype will become an asset. The current methods utilized for subtyping DLBCL include gene expression profiling (Wright et al. 2003) and immunohistochemistry (Hans et al. 2004). The former is more accurate but requires fresh frozen tissue, which is not always available. The latter is less accurate, and only formally identifies GCB cases (calling the remainder of samples “non-GCB”). 
In Chapter 3, we report seven genes (GNA13, MEF2B, EZH2, TP53, BCL2, TNFRSF14, SGK1) with mutations significantly enriched in the GCB subtype and another two genes (MYD88 and CD79B) more commonly mutated in the ABC subtype. It is possible that, if a suitable number of genes with subtype-restricted mutations such as these could be identified, subtype assignment might eventually be accomplished (or augmented) by sequencing these genes rather than by utilizing expression arrays or IHC.

5.4 Epigenetics in cancer: Overview and future work

The data presented in this thesis represent a substantial gain in our understanding of the genetic mutations driving certain types of NHL. Prior to this work, the role of histone modifying enzymes in NHL was not appreciated and only a limited number of studies had suggested a role of epigenetic changes in lymphomagenesis (van Kemenade et al. 2001; O'Riain et al. 2009). The initial finding of recurrent mutations affecting EZH2 Y641 was subsequently confirmed by multiple groups in FL and DLBCL, and the mutation was found in (or excluded from) other NHLs (Bodor et al. 2011; Park et al. 2011; Pellissery et al. 2010; Salido et al. 2011). Based on these findings and our additional data subsequently published (see Chapter 3), it is likely that this mutation is unique to FL and GCB DLBCL. Interestingly, however, shortly after the publication of our initial finding, inactivating mutations in EZH2 were reported in myeloid neoplasms by multiple groups (Nikoloski et al. 2010; Makishima et al. 2010). EZH2 missense mutations were also subsequently reported in carcinomas of the head and neck (Stransky et al. 2011) but these affected residues outside the SET domain and were also not clearly inactivating. These findings suggest that in other cancers, EZH2 mutation may have a mode of action distinct from the gain-of-function phenotype we observed in DLBCL and FL.
This is consistent with the notion that EZH2 is responsible for regulating distinct sets of genes in the different cell types from which these cancers originate. Despite this distinction, further investigation into the mechanism by which EZH2 mutations contribute to transformation of germinal centre B cells and other cell types is clearly warranted by these studies.

In Chapter 3, I extended this discovery to reveal that perturbation of normal epigenetic gene regulation by histone modification may be a general feature of lymphomagenesis. This is exemplified by recurrence of mutations in each of MLL2, CREBBP, EP300, MEF2B and HDAC7. I proposed a model describing how each of these mutations could result in similar global effects on gene expression (Figure 3-8) but this remains speculative and requires further exploration. This model predicts that any of the observed mutations may shift the balance of epigenetic modifications in favour of HDACs, resulting in suppression of gene expression (potentially at tumour suppressor loci). If proven accurate, this model suggests that inhibitors of HDAC activity may be useful in patients whose tumours contain these mutations. Some HDAC inhibitors have already been pursued as novel treatments for NHL (Kirschbaum et al. 2011) and other cancers (Mund & Lyko 2010) with variable clinical success. As the mutation status of the histone modifying enzymes in these patients was unknown, it is advisable that future trials testing the efficacy of such compounds in NHL should test for mutations in these genes.

Mutations affecting histone modifiers are not restricted to lymphoma but rather appear to be emerging as a common theme in many unrelated cancers. The first indication of this was the mutation of UTX in various cancer types (van Haaften et al. 2009). As described in Chapter 3, MLL2 appears to be the most commonly mutated gene in DLBCL and FL.
This gene has very recently been reported to act as a potential tumour suppressor in renal carcinoma (Dalgliesh et al. 2010), multiple myeloma (Chapman et al. 2011), medulloblastoma (Parsons et al. 2011) and head and neck carcinoma (Stransky et al. 2011). Based on their role in modulating gene expression, I expect that mutations of histone modifying enzymes have an indirect effect on lymphomagenesis by altering the expression of oncogenes or tumour suppressors or genes involved in differentiation. It is unknown whether the individual mutated genes such as MLL2, EZH2, or CREBBP are responsible for regulating a shared set of genes in germinal centre B cells. This could potentially be determined by comparing the gene expression profiles of tumours with (and without) mutations in each of these genes. This may also be determined using chromatin immunoprecipitation and sequencing (ChIP-seq), which is a method that can identify in vivo protein-DNA interactions. ChIP-seq experiments enable semi-quantitative detection of the steady-state levels of histones with various modifications in a given population of cells (Neff & Armstrong 2009). Such experiments may only be informative if the genetic background of the samples being compared is uniform, such as those utilizing siRNA or shRNA, mouse knock-outs (or knock-ins) or isogenic cell lines.

5.5 Summary and future directions

The utility of next generation sequencing for detecting somatic mutations in cancer was reviewed in Chapter 1 and my contribution to this area was described in Chapters 2-4. This thesis has demonstrated that SNVs and fusion transcripts can be robustly identified from these data types and that the gene expression information derived from RNA-seq data can be used, similar to expression data from microarrays, to identify the molecular subtype of DLBCLs. My survey for mutations in NHL (Chapters 2 and 3) heavily relied on RNA-seq data to quantify mutation recurrence.
As such, this study was powered only to detect mutations in genes that are actively transcribed in cancer cells. To compensate for this shortcoming, we have begun sequencing the genomes of more DLBCL tumour/normal pairs and, as expected, this has revealed additional genes that are mutated but not necessarily expressed. More detailed characterization of these genomes, including detection of SNVs, copy number alterations and large structural alterations should provide a more complete view of the genetic landscape of this disease.

There are still many improvements that can be made to strengthen the types of analysis presented in this thesis. For example, identification of indel mutations from aligned data is possible but problematic, resulting in a much lower validation rate than SNVs (see Chapter 4). Even SNV detection from RNA-seq and from tumour/normal pairs cannot yet be performed with 100% accuracy. From the eleven DLBCL genomes sequenced on the HiSeq 2000, we observed a 94.5% validation rate. The false positive variant calls likely result from a combination of inaccuracy in the alignment procedure, random and systematic errors in the sequencing data, and incomplete representation of pseudogenes and other repeated sequences in the reference genome. Future improvements to base calling, alignment and variant identification/filtering may improve our true positive rate and increase our confidence in the raw data (minimizing the need for validation) moving forward.

References

Ajay, S.S., Parker, S.C.J., Abaan, H.O., Fajardo, K.V.F. & Margulies, E.H., 2011. Accurate and comprehensive sequencing of personal genomes. Genome Res, 21(9), pp.1498–1505.
Al-Katib, A.M., Smith, M.R., Kamanda, W.S., Pettit, G.R., Hamdan, M., Mohamed, A.N., Chelladurai, B. & Mohammad, R.M., 1998. Bryostatin 1 down-regulates mdr1 and potentiates vincristine cytotoxicity in diffuse large cell lymphoma xenografts. Clin Cancer Res, 4(5), pp.1305–1314.
Albert, T.J., Molla, M.N., Muzny, D.M., Nazareth, L., Wheeler, D., Song, X., Richmond, T.A., et al., 2007. Direct selection of human genomic loci by microarray hybridization. Nat Meth, 4(11), pp.903–905.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), pp.503–511.
Anderson, J.R., Armitage, J.O. & Weisenburger, D.D., 1998. Epidemiology of the non-Hodgkin's lymphomas: distributions of the major subtypes differ by geographic locations. Non-Hodgkin's Lymphoma Classification Project. Ann Oncol, 9(7), pp.717–720.
Apperley, J.F., Gardembas, M., Melo, J.V., Russell-Jones, R., Bain, B.J., Baxter, E.J., Chase, A., et al., 2002. Response to imatinib mesylate in patients with chronic myeloproliferative diseases with rearrangements of the platelet-derived growth factor receptor beta. N Engl J Med, 347(7), pp.481–487.
Ariel, O., Levi, Y. & Hollander, N., 2009. Signal transduction by CD58: The transmembrane isoform transmits signals outside lipid rafts independently of the GPI-anchored isoform. Cell Signal, 21(7), pp.1100–1108.
Asmann, Y., Klee, E., Thompson, E., Perez, E., Middha, S., Oberg, A., Therneau, T., et al., 2009. 3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics, 10, pp.531–531.
Bainbridge, M.N., Warren, R.L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., et al., 2006. Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics, 7, p.246.
Baran-Marszak, F., Magdoud, H., Desterke, C., Alvarado, A., Roger, C., Harel, S., Mazoyer, E., et al., 2010. Expression level and differential JAK2-V617F-binding of the adaptor protein Lnk regulates JAK2-mediated signals in myeloproliferative neoplasms. Blood, 116(26), pp.5961–5971.
Bardelli, A., Parsons, D.W., Silliman, N., Ptak, J., Szabo, S., Saha, S., Markowitz, S., et al., 2003. Mutational analysis of the tyrosine kinome in colorectal cancers. Science, 300(5621), p.949.
Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D.E., Wang, Z., Wei, Gang, Chepelev, I. & Zhao, Keji, 2007. High-Resolution Profiling of Histone Methylations in the Human Genome. Cell, 129(4), pp.823–837.
Bartram, C.R. & Grosveld, G., 1985. [Philadelphia translocation and the human c-abl oncogene--relations in the light of molecular genetics]. Klinische Pädiatrie, 197(3), pp.196–202.
Bea, S., Salaverria, I., Armengol, L., Pinyol, M., Fernandez, V., Hartmann, E., Jares, P., et al., 2009. Uniparental disomies, homozygous deletions, amplifications, and target genes in mantle cell lymphoma revealed by integrative high-resolution whole-genome profiling. Blood, 113(13), pp.3059–3069.
Beckwith, M., Longo, D.L., O'Connell, C.D., Moratz, C.M. & Urba, W.J., 1990. Phorbol ester-induced, cell-cycle-specific, growth inhibition of human B-lymphoma cell lines. J Natl Cancer I, 82(6), pp.501–509.
Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., et al., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), pp.53–59.
Berger, M.F., Lawrence, M.S., Demichelis, F., Drier, Y., Cibulskis, K., Sivachenko, A.Y., Sboner, A., et al., 2011. The genomic complexity of primary human prostate cancer. Nature, 470(7333), pp.214–220.
Beroukhim, R., Getz, G., Nghiemphu, L., Barretina, J., Hsueh, T., Linhart, D., Vivanco, I., et al., 2007. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci USA, 104(50), pp.20007–20012.
Beroukhim, R., Mermel, C.H., Porter, D., Wei, Guo, Raychaudhuri, S., Donovan, J., Barretina, J., et al., 2010. The landscape of somatic copy-number alteration across human cancers.
Nature, 463(7283), pp.899–905.
Bhattacharyya, R. & Wedegaertner, P., 2000. Galpha 13 requires palmitoylation for plasma membrane localization, Rho-dependent signaling, and promotion of p115-RhoGEF membrane binding. J Biol Chem, 275(20), pp.14992–14999.
Bignell, G.R., Greenman, C.D., Davies, H., Butler, A.P., Edkins, S., Andrews, J.M., Buck, G., et al., 2010. Signatures of mutation and selection in the cancer genome. Nature, 463(7283), pp.893–898.
Binkley, J., Karra, K., Kirby, A., Hosobuchi, M., Stone, E.A. & Sidow, A., 2010. ProPhylER: A curated online resource for protein function and structure based on evolutionary constraint analyses. Genome Res, 20(1), pp.142–154.
Birol, I., Jackman, S.D., Nielsen, C.B., Qian, J.Q., Varhol, R., Stazyk, G., Morin, R.D., et al., 2009. De novo transcriptome assembly with ABySS. Bioinformatics, 25(21), pp.2872–2877.
Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A. & Taylor, J., 2010. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol, Chapter 19, pp.Unit 19.10.1–21.
Bodor, C., O'Riain, C., Wrench, D., Matthews, J., Iyengar, S., Tayyib, H., Calaminici, M., et al., 2011. EZH2 Y641 mutations in follicular lymphoma. Leukemia, 25(4), pp.726–729.
den Boer, M.L., van Slegtenhorst, M., de Menezes, R.X., Cheok, M.H., Buijs-Gladdines, J.G.C.A.M., Peters, S.T.C.J.M., van Zutven, L.J.C.M., et al., 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol, 10(2), pp.125–134.
Brunet, A., Park, J., Tran, H., Hu, L., Hemmings, B. & Greenberg, M., 2001. Protein Kinase SGK Mediates Survival Signals by Phosphorylating the Forkhead Transcription Factor FKHRL1 (FOXO3a). Mol Cell Biol, 21(3), pp.952–965.
Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C. & Jaffe, D.B., 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads.
Genome Res, 18(5), pp.810–820. Campbell, P.J., Stephens, P.J., Pleasance, E.D., O'Meara, S., Li, H., Santarius, T., Stebbings, L.A., et al., 2008. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet, 40(6), pp.722–729. Canaani, E., Nakamura, T., Rozovskaia, T., Smith, S., Mori, T., Croce, C. & Mazo, A., 2004. ALL-1/MLL1, a homologue of Drosophila TRITHORAX, modifies chromatin and is directly involved in infant acute leukaemia. Br J Cancer, 90(4), pp.756–760. Cancer Genome Atlas Research Network, 2011. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353), pp.609–615. Carter, H., Chen, S., Isik, L., Tyekucheva, S., Velculescu, V., Kinzler, K., Vogelstein, B. & Karchin, R., 2009. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res, 69(16), pp.6660–6667. Casellas, R., Yamane, A., Kovalchuk, A.L. & Potter, M., 2009. Restricting activation-induced cytidine deaminase tumorigenic activity in B lymphocytes. Immunology, 126(3), pp.316–328. Chapman, M.A., Lawrence, M.S., Keats, J.J., Cibulskis, K., Sougnez, C., Schinzel, A.C., Harview, C.L., et al., 2011. Initial genome sequencing and analysis of multiple myeloma. Nature, 471(7339), pp.467–472. Cheung, K.-J.J., Johnson, N.A., Affleck, J.G., Severson, T.M., Steidl, C., Ben-Neriah, S., Schein, J., et al., 2010. Acquired TNFRSF14 mutations in follicular lymphoma are associated with worse prognosis. Cancer Res, 70(22), pp.9166–9174. Chin, S.-F., Daigo, Y., Huang, H.-E., Iyer, N.G., Callagy, G., Kranjac, T., Gonzalez, M., et al., 2003. A simple and reliable pretreatment protocol facilitates fluorescent in situ hybridisation on tissue microarrays of paraffin wax embedded tumour samples. Mol Path, 56(5), pp.275–279. Collins, R.E., Tachibana, M., Tamaru, H., Smith, K.M., Jia, D., Zhang, Xing, Selker, E.U., Shinkai, Y. & Cheng, X., 2005.
In vitro and in vivo analyses of a Phe/Tyr switch controlling product specificity of histone lysine methyltransferases. J Biol Chem, 280(7), pp.5563–5570. Compagno, M., Lim, W.K., Grunn, A., Nandula, S.V., Brahmachary, M., Shen, Q., Bertoni, F., et al., 2009. Mutations of multiple genes cause deregulation of NF-kappaB in diffuse large B-cell lymphoma. Nature, 459(7247), pp.717–721. The 1000 Genomes Project Consortium, 2010. A map of human genome variation from population-scale sequencing. Nature, 467(7319), pp.1061–1073. Copeland, R.A., Olhava, E.J. & Scott, M.P., 2010. Targeting epigenetic enzymes for drug discovery. Curr Opin Chem Biol, 14(4), pp.505–510. Couture, J.-F., Dirk, L.M.A., Brunzelle, J.S., Houtz, R.L. & Trievel, R.C., 2008. Structural origins for the product specificity of SET domain protein methyltransferases. Proc Natl Acad Sci USA, 105(52), pp.20659–20664. Dalgliesh, G.L., Furge, K., Greenman, C., Chen, Lina, Bignell, G., Butler, A., Davies, H., et al., 2010. Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes. Nature, 463(7279), pp.360–363. Dalla-Favera, R. & Pasqualucci, L., 2010. Molecular Genetics of Lymphoma. In J. O. Armitage, P. M. Mauch, N. L. Harris, B. Coiffier, & R. Dalla-Favera, eds. Non-Hodgkin Lymphomas. Philadelphia, PA: Lippincott Williams & Wilkins, pp. 115–130. Davies, H., Bignell, G.R., Cox, C., Stephens, P., Edkins, S., Clegg, S., Teague, J., et al., 2002. Mutations of the BRAF gene in human cancer. Nature, 417(6892), pp.949–954. Davies, H., Hunter, Chris, Smith, R., Stephens, P., Greenman, C., Bignell, G., Teague, J., et al., 2005. Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res, 65(17), pp.7591–7595. Davis, R.E., Ngo, V.N., Lenz, G., Tolar, P., Young, R.M., Romesser, P.B., Kohlhammer, H., et al., 2010. Chronic active B-cell-receptor signalling in diffuse large B-cell lymphoma. Nature, 463(7277), pp.88–92.
Deenik, W., Beverloo, H.B., van der Poel-van de Luytgaarde, S.C.P.A.M., Wattel, M.M., van Esser, J.W.J., Valk, P.J.M. & Cornelissen, J.J., 2009. Rapid complete cytogenetic remission after upfront dasatinib monotherapy in a patient with a NUP214-ABL1-positive T-cell acute lymphoblastic leukemia. Leukemia, 23(3), pp.627–629. Diehl, S., Schmidlin, H., Nagasawa, M., van Haren, S., Kwakkenbos, M., Yasuda, E., Beaumont, T., Scheeren, F. & Spits, H., 2008. STAT3-mediated up-regulation of BLIMP1 Is coordinated with BCL6 down-regulation to control human plasma cell differentiation. J Immunol, 180(7), pp.4805–4815. Dillon, S.C., Zhang, Xing, Trievel, R.C. & Cheng, X., 2005. The SET-domain protein superfamily: protein lysine methyltransferases. Genome Biol, 6(8), p.227. Dimon, M.T., Sorber, K. & DeRisi, J.L., 2010. HMMSplicer: a tool for efficient and sensitive discovery of known and novel splice junctions in RNA-Seq data. PloS one, 5(11), p.e13875. Ding, L., Getz, G., Wheeler, D.A., Mardis, E.R., McLellan, M.D., Cibulskis, K., Sougnez, C., et al., 2008. Somatic mutations affect key pathways in lung adenocarcinoma. Nature, 455(7216), pp.1069–1075. Ding, L., Wendl, M.C., Koboldt, D.C. & Mardis, E.R., 2010. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet, 19(R2), pp.R188–R196. Drmanac, R., Sparks, A.B., Callow, M.J., Halpern, A.L., Burns, N.L., Kermani, B.G., Carnevali, P., et al., 2009. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science, 327(5961), pp.78–81. Du, M.Q., Peng, H., Liu, H., Hamoudi, R.A., Diss, T.C., Willis, T.G., Ye, H., et al., 2000. BCL10 gene mutation in lymphoma. Blood, 95(12), pp.3885–3890. Dunleavy, K., Pittaluga, S., Czuczman, M.S., Dave, S.S., Wright, G., Grant, N., Shovlin, M., et al., 2009. Differential efficacy of bortezomib plus chemotherapy within molecular subtypes of diffuse large B-cell lymphoma. Blood, 113(24), pp.6069–6076.
Dyer, M.J., Fischer, P., Nacheva, E., Labastide, W. & Karpas, A., 1990. A new human B-cell non-Hodgkin's lymphoma cell line (Karpas 422) exhibiting both t(14;18) and t(4;11) chromosomal translocations. Blood, 75(3), pp.709–714. Edmonson, M.N., Zhang, J., Yan, C., Finney, R.P., Meerzaman, D.M. & Buetow, K.H., 2011. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics, 27(6), pp.865–866. Epstein, A., Variakojis, D., Berger, C. & Hecht, B., 1985. Use of novel chemical supplements in the establishment of three human malignant lymphoma cell lines (NU-DHL-1, NU-DUL-1, and NU-AMB-1) with chromosome 14 translocations. Int J Cancer, 35(5), pp.619–627. Ewing, B. & Green, P., 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res, 8(3), pp.186–194. Ewing, B., Hillier, L., Wendl, M.C. & Green, P., 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res, 8(3), pp.175–185. Flaherty, K.T., Puzanov, I., Kim, K.B., Ribas, A., McArthur, G.A., Sosman, J.A., O'Dwyer, P.J., et al., 2010. Inhibition of mutated, activated BRAF in metastatic melanoma. N Engl J Med, 363(9), pp.809–819. Futreal, P.A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N. & Stratton, M.R., 2004. A census of human cancer genes. Nat Rev Cancer, 4(3), pp.177–183. Giordano, A. & Avantaggiati, M., 1999. p300 and CBP: partners for life and death. J Cell Physiol, 181(2), pp.218–230. Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., et al., 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA, 108(4), pp.1513–1518. Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., Leproust, E.M., Brockman, W., Fennell, Timothy, et al., 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing.
Nat Biotechnol, 27(2), pp.182–189. Golub, T.R., Barker, G.F., Lovett, M. & Gilliland, D.G., 1994. Fusion of PDGF receptor beta to a novel ets-like gene, tel, in chronic myelomonocytic leukemia with t(5;12) chromosomal translocation. Cell, 77(2), pp.307–316. Goya, R., Sun, M.G.F., Morin, R.D., Leung, G., Ha, G., Wiegand, Kimberley C, Senz, J., et al., 2010. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics, 26(6), pp.730–736. Graux, C., Cools, J., Melotte, C., Quentmeier, H., Ferrando, A., Levine, R., Vermeesch, J.R., et al., 2004. Fusion of NUP214 to ABL1 on amplified episomes in T-cell acute lymphoblastic leukemia. Nat Genet, 36(10), pp.1084–1089. Gravalos, C. & Jimeno, A., 2008. HER2 in gastric cancer: a new prognostic factor and a novel therapeutic target. Ann Oncol, 19(9), pp.1523–1529. Grávalos, C., Gómez-Martín, C., Rivera, F., Alés, I., Queralt, B., Márquez, A., Jiménez, U., et al., 2011. Phase II study of trastuzumab and cisplatin as first-line therapy in patients with HER2-positive advanced gastric or gastroesophageal junction cancer. Clin Transl Oncol, 13(3), pp.179–184. Greenman, C., Stephens, P., Smith, R., Dalgliesh, G.L., Hunter, Christopher, Bignell, G., Davies, H., et al., 2007. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132), pp.153–158. Greenman, C., Wooster, R., Futreal, P.A., Stratton, M.R. & Easton, D.F., 2006. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics, 173(4), pp.2187–2198. Griesinger, F., Hennig, H., Hillmer, F., Podleschny, M., Steffens, R., Pies, A., Wörmann, B., Haase, D. & Bohlander, S.K., 2005. A BCR-JAK2 fusion gene as the result of a t(9;22)(p24;q11.2) translocation in a patient with a clinically typical chronic myeloid leukemia. Genes Chromosome Canc, 44(3), pp.329–333. Griffith, M., Tang, M.J., Griffith, O.L., Morin, R.D., Chan, S.Y., Asano, J.K., Zeng, T., et al., 2008.
ALEXA: a microarray design platform for alternative expression analysis. Nat Meth, 5(2), p.118. Gruber, F., Mustjoki, S. & Porkka, K., 2009. Impact of tyrosine kinase inhibitors on patient outcomes in Philadelphia chromosome-positive acute lymphoblastic leukaemia. Br J Haematol, 145(5), pp.581–597. Guttman, M., Amit, I., Garber, M., French, C., Lin, M., Feldser, D., Huarte, M., et al., 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235), pp.223–227. Hagman, J. & Lukin, K., 2006. Transcription factors drive B cell development. Curr Opin Immunol, 18(2), pp.127–134. Hagman, J., Gutch, M.J., Lin, H. & Grosschedl, R., 1995. EBF contains a novel zinc coordination motif and multiple dimerization and transcriptional activation domains. The EMBO Journal, 14(12), pp.2907–2916. Han, A., He, J., Wu, Y., Liu, J.O. & Chen, Lina, 2005. Mechanism of recruitment of class II histone deacetylases by myocyte enhancer factor-2. J Mol Biol, 345(1), pp.91–102. Hanahan, D. & Weinberg, R.A., 2000. The hallmarks of cancer. Cell, 100(1), pp.57–70. Hans, C.P., Weisenburger, D.D., Greiner, T.C., Gascoyne, R.D., Delabie, J., Ott, G., Muller-Hermelink, H.K., et al., 2004. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood, 103(1), pp.275–282. Harris, N.L., 2010. Introduction. In J. O. Armitage, P. M. Mauch, N. L. Harris, B. Coiffier, & R. Dalla-Favera, eds. Non-Hodgkin Lymphomas. Philadelphia, PA: Lippincott Williams & Wilkins, pp. xv–xxix. Harrison, C.J., 2009. Cytogenetics of paediatric and adolescent acute lymphoblastic leukaemia. Br J Haematol, 144(2), pp.147–156. Hart, S., Goh, K.C., Novotny-Diermayr, V., Hu, C.Y., Hentze, H., Tan, Y.C., Madan, B., et al., 2011. SB1518, a novel macrocyclic pyrimidine-based JAK2 inhibitor for the treatment of myeloid and lymphoid malignancies. Leukemia.
Harvey, R.C., Mullighan, C.G., Chen, I.-M., Wharton, W., Mikhail, F.M., Carroll, A.J., Kang, H., et al., 2010. Rearrangement of CRLF2 is associated with mutation of JAK kinases, alteration of IKZF1, Hispanic/Latino ethnicity, and a poor outcome in pediatric B-progenitor acute lymphoblastic leukemia. Blood, 115(26), pp.5312–5321. Hassan, N.M.M., Tada, M., Hamada, J.-I., Kashiwazaki, H., Kameyama, T., Akhter, R., Yamazaki, Y., et al., 2008. Presence of dominant negative mutation of TP53 is a risk of early recurrence in oral cancer. Cancer Lett, 270(1), pp.108–119. He, J., Ye, J., Cai, Y., Riquelme, C., Liu, J.O., Liu, X., Han, A. & Chen, Lin, 2011. Structure of p300 bound to MEF2 on DNA reveals a mechanism of enhanceosome assembly. Nucleic Acids Res, 39(10), pp.4464–4474. Hernandez-Ilizaliturri, F.J., Deeb, G., Zinzani, P.L., Pileri, S.A., Malik, F., Macon, W.R., Goy, A., Witzig, T.E. & Czuczman, M.S., 2011. Higher response to lenalidomide in relapsed/refractory diffuse large B-cell lymphoma in nongerminal center B-cell-like than in germinal center B-cell-like phenotype. Cancer, 117(22), pp.5058–5066. Horsman, D.E., Okamoto, I., Ludkovski, O., Le, N., Harder, L., Gesk, S., Siebert, R., et al., 2003. Follicular lymphoma lacking the t(14;18)(q32;q21): identification of two disease subtypes. Br J Haematol, 120(3), pp.424–433. Iacobucci, I., Storlazzi, C.T., Cilloni, D., Lonetti, A., Ottaviani, E., Soverini, S., Astolfi, A., et al., 2009. Identification and molecular characterization of recurrent genomic deletions on 7p12 in the IKZF1 gene in a large cohort of BCR-ABL1-positive acute lymphoblastic leukemia patients: on behalf of Gruppo Italiano Malattie Ematologiche dell'Adulto Acute Leukemia Working Party (GIMEMA AL WP). Blood, 114(10), pp.2159–2167. Iqbal, J., Greiner, T.C., Patel, K., Dave, B.J., Smith, L., Ji, J., Wright, G., et al., 2007.
Distinctive patterns of BCL6 molecular alterations and their functional consequences in different subgroups of diffuse large B-cell lymphoma. Leukemia, 21(11), pp.2332–2343. Iqbal, J., Sanger, W.G., Horsman, D.E., Rosenwald, A., Pickering, D.L., Dave, B., Dave, S., et al., 2004. BCL2 translocation defines a unique tumor subset within the germinal center B-cell-like diffuse large B-cell lymphoma. Am J Pathol, 165(1), pp.159–166. Issaeva, I., Zonis, Y., Rozovskaia, T., Orlovsky, K., Croce, C.M., Nakamura, T., Mazo, A., Eisenbach, L. & Canaani, E., 2007. Knockdown of ALR (MLL2) Reveals ALR Target Genes and Leads to Alterations in Cell Adhesion and Growth. Mol Cell Biol, 27(5), pp.1889–1903. Jones, R. & Gelbart, W., 1990. Genetic analysis of the enhancer of zeste locus and its role in gene regulation in Drosophila melanogaster. Genetics, 126(1), pp.185–199. Jones, S.J., Laskin, J., Li, Y.Y., Griffith, O.L., An, J., Bilenky, M., Butterfield, Y.S., et al., 2010. Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome Biol, 11(8), p.R82. Joshi, P., Carrington, E.A., Wang, Liangjun, Ketel, C.S., Miller, E.L., Jones, R.S. & Simon, J.A., 2008. Dominant alleles identify SET domain residues required for histone methyltransferase of Polycomb repressive complex 2. J Biol Chem, 283(41), pp.27757–27766. Kaminker, J.S., Zhang, Yan, Watanabe, C. & Zhang, Z., 2007. CanPredict: a computational tool for predicting cancer-associated missense mutations. Nucleic Acids Res, 35(Web Server issue), pp.W595–8. Kato, M., Sanada, M., Kato, I., Sato, Y., Takita, J., Takeuchi, K., Niwa, A., et al., 2009. Frequent inactivation of A20 in B-cell lymphomas. Nature, 459(7247), pp.712–716. Kiefer, F., Arnold, K., Kunzli, M., Bordoli, L. & Schwede, T., 2009. The SWISS-MODEL Repository and associated resources. Nucleic Acids Res, 37(Database issue), pp.D387–92. Kirmizis, A., Bartley, S.M., Kuzmichev, A., Margueron, R., Reinberg, D., Green, R.
& Farnham, P.J., 2004. Silencing of human polycomb target genes is associated with methylation of histone H3 Lys 27. Genes Dev, 18(13), pp.1592–1605. Kirschbaum, M.H., Goldman, B.H., Zain, J.M., Cook, J.R., Rimsza, L.M., Forman, S.J. & Fisher, R.I., 2011. A phase 2 study of vorinostat for treatment of relapsed or refractory Hodgkin lymphoma: Southwest Oncology Group Study S0517. Leuk lymphoma. Kleer, C.G., Cao, Q., Varambally, S., Shen, R., Ota, I., Tomlins, S.A., Ghosh, D., et al., 2003. EZH2 is a marker of aggressive breast cancer and promotes neoplastic transformation of breast epithelial cells. Proc Natl Acad Sci USA, 100(20), pp.11606–11611. Kluin-Nelemans, H.C., Limpens, J., Meerabux, J., Beverstock, G.C., Jansen, J.H., de Jong, D. & Kluin, P.M., 1991. A new non-Hodgkin's B-cell line (DoHH2) with a chromosomal translocation t(14;18)(q32;q21). Leukemia, 5(3), pp.221–224. Kreutz, B., Hajicek, N., Yau, D.M., Nakamura, S. & Kozasa, T., 2007. Distinct regions of Galpha13 participate in its regulatory interactions with RGS homology domain-containing RhoGEFs. Cell Signal, 19(8), pp.1681–1689. Krumlauf, R., 1994. Hox genes in vertebrate development. Cell, 78(2), pp.191–201. Krzywinski, M., Bosdet, I., Mathewson, C., Wye, N., Brebner, J., Chiu, R., Corbett, R., et al., 2007. A BAC clone fingerprinting approach to the detection of human genome rearrangements. Genome Biol, 8(10), p.R224. Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J. & Marra, M.A., 2009. Circos: an information aesthetic for comparative genomics. Genome Res, 19(9), pp.1639–1645. Kuchenbauer, F., Morin, R.D., Argiropoulos, B., Petriv, O.I., Griffith, M., Heuser, M., Yung, E., et al., 2008. In-depth characterization of the microRNA transcriptome in a leukemia progression model. Genome Res, 18(11), pp.1787–1797. Kuiper, R.P., Schoenmakers, E.F.P.M., van Reijmersdal, S.V., Hehir-Kwa, J.Y., van Kessel, A.G., van Leeuwen, F.N. & Hoogerbrugge, P.M., 2007.
High-resolution genomic profiling of childhood ALL reveals novel recurrent genetic lesions affecting pathways involved in lymphocyte differentiation and cell cycle progression. Leukemia, 21(6), pp.1258–1266. Lacronique, V., Boureux, A., Valle, V.D., Poirel, H., Quang, C.T., Mauchauffé, M., Berthou, C., et al., 1997. A TEL-JAK2 fusion protein with constitutive kinase activity in human leukemia. Science, 278(5341), pp.1309–1312. Langer, R., Rauser, S., Feith, M., Nährig, J.M., Feuchtinger, A., Friess, H., Höfler, H. & Walch, A., 2011. Assessment of ErbB2 (Her2) in oesophageal adenocarcinomas: summary of a revised immunohistochemical evaluation system, bright field double in situ hybridisation and fluorescence in situ hybridisation. Mod Pathol, 24(7), pp.908–916. Larson, D.E., Harris, C.C., Chen, K., Koboldt, D.C., Abbott, T.E., Dooling, D.J., Ley, T.J., et al., 2011. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics. Lee, W., Jiang, Z., Liu, J., Haverty, P.M., Guan, Y., Stinson, J., Yue, P., et al., 2010. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature, 465(7297), pp.473–477. Lenz, G. & Staudt, L.M., 2010. Aggressive lymphomas. N Engl J Med, 362(15), pp.1417–1429. Lenz, G., Davis, R.E., Ngo, V.N., Lam, L., George, T.C., Wright, G.W., Dave, S.S., et al., 2008a. Oncogenic CARD11 mutations in human diffuse large B cell lymphoma. Science, 319(5870), pp.1676–1679. Lenz, G., Wright, G., Dave, S.S., Xiao, W., Powell, J., Zhao, H., Xu, W., et al., 2008b. Stromal gene signatures in large-B-cell lymphomas. N Engl J Med, 359(22), pp.2313–2323. Lenz, G., Wright, G.W., Emre, N.C.T., Kohlhammer, H., Dave, S.S., Davis, R.E., Carty, S., et al., 2008c. Molecular subtypes of diffuse large B-cell lymphoma arise by distinct genetic pathways. Proc Natl Acad Sci USA, 105(36), pp.13520–13525. Levine, R.L. & Gilliland, D.G., 2008. Myeloproliferative disorders. Blood, 112(6), pp.2190–2198.
Levy, S., Sutton, G., Ng, P., Feuk, L., Halpern, A., Walenz, B., Axelrod, N., et al., 2007. The diploid genome sequence of an individual human. PLoS Biol, 5(10), p.e254. Lewis, B.P., Green, R.E. & Brenner, S.E., 2003. Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc Natl Acad Sci USA, 100(1), pp.189–192. Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLellan, M.D., Chen, K., Dooling, D., et al., 2008. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature, 456(7218), pp.66–72. Li, H. & Durbin, R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), pp.1754–1760. Li, H. & Homer, N., 2010. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11(5), pp.473–483. Li, H., Handsaker, B., Wysoker, A., Fennell, Tim, Ruan, J., Homer, N., Marth, G., et al., 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), pp.2078–2079. Li, H., Ruan, J. & Durbin, R., 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res, 18(11), pp.1851–1858. Liu, X., Wang, Ling, Zhao, Kehao, Thompson, P.R., Hwang, Y., Marmorstein, R. & Cole, P.A., 2008. The structural basis of protein acetylation by the p300/CBP transcriptional coactivator. Nature, 451(7180), pp.846–850. Lossos, I.S., Czerwinski, D.K., Alizadeh, A.A., Wechser, M.A., Tibshirani, R., Botstein, D. & Levy, R., 2004. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med, 350(18), pp.1828–1837. Lugo, T.G., Pendergast, A.M., Muller, A.J. & Witte, O.N., 1990. Tyrosine kinase activity and transformation potency of bcr-abl oncogene products. Science, 247(4946), pp.1079–1082. Lupski, J.R., Reid, J.G., Gonzaga-Jauregui, C., Rio Deiros, D., Chen, D.C.Y., Nazareth, L., Bainbridge, M., et al., 2010.
Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med, 362(13), pp.1181–1191. Macconaill, L.E. & Garraway, L.A., 2010. Clinical implications of the cancer genome. J Clin Oncol, 28(35), pp.5219–5228. Macconaill, L.E., Campbell, C.D., Kehoe, S.M., Bass, A.J., Hatton, C., Niu, L., Davis, M., et al., 2009. Profiling critical cancer gene mutations in clinical tumor samples. PloS one, 4(11), p.e7887. Makishima, H., Jankowska, A.M., Tiu, R.V., Szpurka, H., Sugimoto, Y., Hu, Z., Saunthararajah, Y., et al., 2010. Novel homo- and hemizygous mutations in EZH2 in myeloid malignancies. Leukemia, 24(10), pp.1799–1804. Manganello, J.M., Huang, J.-S., Kozasa, T., Voyno-Yasenetskaya, T.A. & Le Breton, G.C., 2003. Protein kinase A-mediated phosphorylation of the Galpha13 switch I region alters the Galphabetagamma13-G protein-coupled receptor complex and inhibits Rho activation. J Biol Chem, 278(1), pp.124–130. Mardis, E.R., 2006. Anticipating the 1,000 dollar genome. Genome Biol, 7(7), p.112. Mardis, E.R., Ding, L., Dooling, D.J., Larson, D.E., McLellan, M.D., Chen, K., Koboldt, D.C., et al., 2009. Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med, 361(11), pp.1058–1066. Margulies, M., Egholm, M., Altman, W., Attiya, S., Bader, J., Bemben, L., Berka, J., et al., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057), pp.376–380. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 18(9), pp.1509–1517. Martin, G.S., 2001. The hunting of the Src. Nat Rev Mol Cell Biol, 2(6), pp.467–475. Martin-Subero, J.I., Kreuz, M., Bibikova, M., Bentink, S., Ammerpohl, O., Wickham-Garcia, E., Rosolowski, M., et al., 2009.
New insights into the biology and origin of mature aggressive B-cell lymphomas by combined epigenomic, genomic, and transcriptional profiling. Blood, 113(11), pp.2488–2497. McKernan, K., Peckham, H., Costa, G., McLaughlin, S., Fu, Y., Tsung, E., Clouser, C., et al., 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res, 19(9), pp.1527–1541. McLarren, K.W., Severson, T.M., Souich, du, C., Stockton, D.W., Kratz, L.E., Cunningham, D., Hendson, G., et al., 2010. Hypomorphic temperature-sensitive alleles of NSDHL cause CK syndrome. Am J Hum Genet, 87(6), pp.905–914. Mcpherson, A., Hormozdiari, F., Zayed, A., Giuliany, R., Ha, G., Sun, M.G.F., Griffith, M., et al., 2011. deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data. S. Markel, ed. PLoS Comput Biol, 7(5), p.e1001138. Mehra, S., Messner, H., Minden, M. & Chaganti, R.S.K., 2002. Molecular cytogenetic characterization of non-Hodgkin lymphoma cell lines. Genes Chromosome Canc, 33(3), pp.225–234. Mestre-Escorihuela, C., Rubio-Moscardo, F., Richter, J.A., Siebert, R., Climent, J., Fresquet, V., Beltran, E., et al., 2007. Homozygous deletions localize novel tumor suppressor genes in B-cell lymphomas. Blood, 109(1), pp.271–280. Meyerson, M., Gabriel, S. & Getz, G., 2010. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet, 11(10), pp.685–696. Milne, T., Briggs, S., Brock, H., Martin, M., Gibbs, D., Allis, C. & Hess, J., 2002. MLL Targets SET Domain Methyltransferase Activity to Hox Gene Promoters. Mol Cell, 10(5), pp.1107–1117. Mo, J.-S., Ann, E.-J., Yoon, J.-H., Jung, J., Choi, Y.-H., Kim, H.-Y., Ahn, J.-S., et al., 2011. Serum- and glucocorticoid-inducible kinase 1 (SGK1) controls Notch1 signaling by downregulation of protein stability through Fbw7 ubiquitin ligase. J Cell Sci, 124(Pt 1), pp.100–112.
Morin, R., Bainbridge, M., Fejes, A., Hirst, M., Krzywinski, M., Pugh, T., Mcdonald, H., et al., 2008a. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques, 45(1), pp.81–94. Morin, R.D., Johnson, N.A., Severson, T.M., Mungall, A.J., An, J., Goya, R., Paul, J.E., et al., 2010. Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nat Genet, 42(2), pp.181–185. Morin, R.D., Mendez-Lago, M., Mungall, A.J., Goya, R., Mungall, K.L., Corbett, R.D., Johnson, N.A., et al., 2011. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature, 476(7360), pp.298–303. Morin, R.D., O'Connor, M.D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A.-L., Zhao, Y., et al., 2008b. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res, 18(4), pp.610–621. Morrissy, A.S., Morin, R.D., Delaney, A., Zeng, T., McDonald, H., Jones, S., Zhao, Y., Hirst, M. & Marra, M.A., 2009. Next-generation tag sequencing for cancer gene expression profiling. Genome Res, 19(10), pp.1825–1835. Mortazavi, A., Williams, B.A., Mccue, K., Schaeffer, L. & Wold, B., 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5(7), pp.621–628. Mullighan, C.G., Collins-Underwood, J.R., Phillips, L.A.A., Loudin, M.G., Liu, W., Zhang, J., Ma, J., et al., 2009a. Rearrangement of CRLF2 in B-progenitor- and Down syndrome-associated acute lymphoblastic leukemia. Nat Genet, 41(11), pp.1243–1246. Mullighan, C.G., Goorha, S., Radtke, I., Miller, C.B., Coustan-Smith, E., Dalton, J.D., Girtman, K., et al., 2007. Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature, 446(7137), pp.758–764. Mullighan, C.G., Miller, C.B., Radtke, I., Phillips, L.A., Dalton, J., Ma, J., White, D., et al., 2008.
BCR-ABL1 lymphoblastic leukaemia is characterized by the deletion of Ikaros. Nature, 453(7191), pp.110–114. Mullighan, C.G., Su, X., Zhang, J., Radtke, I., Phillips, L.A.A., Miller, C.B., Ma, J., et al., 2009b. Deletion of IKZF1 and prognosis in acute lymphoblastic leukemia. N Engl J Med, 360(5), pp.470–480. Mullighan, C.G., Zhang, J., Harvey, R.C., Collins-Underwood, J.R., Schulman, B.A., Phillips, L.A., Tasian, S.K., et al., 2009c. JAK mutations in high-risk childhood acute lymphoblastic leukemia. Proc Natl Acad Sci USA, 106(23), pp.9414–9418. Mund, C. & Lyko, F., 2010. Epigenetic cancer therapy: Proof of concept and remaining challenges. BioEssays, 32(11), pp.949–957. Nannya, Y., 2005. A Robust Algorithm for Copy Number Detection Using High-Density Oligonucleotide Single Nucleotide Polymorphism Genotyping Arrays. Cancer Res, 65(14), pp.6071–6079. Neff, T. & Armstrong, S.A., 2009. Chromatin maps, histone modifications and leukemia. Leukemia, 23(7), pp.1243–1251. Ng, P.C. & Henikoff, S., 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 31(13), pp.3812–3814. Ng, S.B., Bigham, A.W., Buckingham, K.J., Hannibal, M.C., McMillin, M.J., Gildersleeve, H.I., Beck, A.E., et al., 2010. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet, 42(9), pp.790–793. Ngo, V.N., Young, R.M., Schmitz, R., Jhavar, S., Xiao, W., Lim, K.-H., Kohlhammer, H., et al., 2011. Oncogenically active MYD88 mutations in human lymphoma. Nature, 470(7332), pp.115–119. Nikoloski, G., Langemeijer, S.M.C., Kuiper, R.P., Knops, R., Massop, M., Tönnissen, E.R.L.T.M., van der Heijden, A., et al., 2010. Somatic mutations of the histone methyltransferase gene EZH2 in myelodysplastic syndromes. Nat Genet, 42(8), pp.665–667. Nowell, P.C., 1962. The minute chromosome (Phl) in chronic granulocytic leukemia. Blut, 8, pp.65–66.
O'Riain, C., O'Shea, D.M., Yang, Y., Dieu, R.L., Gribben, J.G., Summers, K., Yeboah-Afari, J., et al., 2009. Array-based DNA methylation profiling in follicular lymphoma. Leukemia, 23(10), pp.1858–1866. Oh, S.T., Simonds, E.F., Jones, C., Hale, M.B., Goltsev, Y., Gibbs, K.D., Merker, J.D., et al., 2010. Novel mutations in the inhibitory adaptor protein LNK drive JAK-STAT signaling in patients with myeloproliferative neoplasms. Blood, 116(6), pp.988–992. Ota, A., Tagawa, H., Karnan, S., Tsuzuki, S., Karpas, A., Kira, S., Yoshida, Y. & Seto, M., 2004. Identification and characterization of a novel gene, C13orf25, as a target for 13q31-q32 amplification in malignant lymphoma. Cancer Res, 64(9), pp.3087–3095. Park, S.W., Chung, N.G., Eom, H.S., Yoo, N.J. & Lee, S.H., 2011. Mutational analysis of EZH2 codon 641 in non-Hodgkin lymphomas and leukemias. Leuk Res, 35(1), pp.e6–7. Parsons, D.W., Jones, Siân, Zhang, Xiaosong, Lin, J.C.-H., Leary, R.J., Angenendt, P., Mankoo, P., et al., 2008. An integrated genomic analysis of human glioblastoma multiforme. Science, 321(5897), pp.1807–1812. Parsons, D.W., Li, M., Zhang, X, Jones, S., Leary, R.J., Lin, J.C.H., Boca, S.M., et al., 2011. The Genetic Landscape of the Childhood Cancer Medulloblastoma. Science, 331(6016), pp.435–439. Pasini, D., Malatesta, M., Jung, H.R., Walfridsson, J., Willer, A., Olsson, L., Skotte, J., et al., 2010. Characterization of an antagonistic switch between histone H3 lysine 27 methylation and acetylation in the transcriptional regulation of Polycomb group target genes. Nucleic Acids Res, 38(15), pp.4958–4969. Pasqualucci, L., Compagno, M., Houldsworth, J., Monti, S., Grunn, A., Nandula, S., Aster, J., et al., 2006. Inactivation of the PRDM1/BLIMP1 gene in diffuse large B cell lymphoma. J Exp Med, 203(2), pp.311–317. Pasqualucci, L., Dominguez-Sola, D., Chiarenza, A., Fabbri, G., Grunn, A., Trifonov, V., Kasper, L.H., et al., 2011.
Inactivating mutations of acetyltransferase genes in B-cell lymphoma. Nature, 471(7337), pp.189–195. Pasqualucci, L., Guglielmino, R., Malek, S.N., Novak, U., Compagno, M., Nanjangud, G. & Dalla-Favera, R., 2004. Aberrant Somatic Hypermutation Targets an Extensive Set of Genes in Diffuse Large B-Cell Lymphoma. ASH Annual Meeting Abstracts, 104(11), pp.1528–1528. Pasqualucci, L., Migliazza, A., Basso, K., Houldsworth, J., Chaganti, R.S.K. & Dalla-Favera, R., 2003. Mutations of the BCL6 proto-oncogene disrupt its negative autoregulation in diffuse large B-cell lymphoma. Blood, 101(8), pp.2914–2923. Pasqualucci, L., Neumeister, P., Goossens, T., Nanjangud, G., Chaganti, R., Kuppers, R. & Dalla-Favera, R., 2001. Hypermutation of multiple proto-oncogenes in B-cell diffuse large-cell lymphomas. Nature, 412(6844), pp.341–346. Pellissery, S., Richter, J., Haake, A., Montesinos-Rongen, M., Deckert, M. & Siebert, R., 2010. Somatic mutations altering Tyr641 of EZH2 are rare in primary central nervous system lymphoma. Leuk lymphoma, 51(11), pp.2135–2136. Perrotti, D. & Harb, J.G., 2011. BCR-ABL1 kinase-dependent alteration of mRNA metabolism: potential alternatives for therapeutic intervention. Leuk lymphoma, 52 Suppl 1, pp.30–44. Persson, K., 1976. Modification of the eye colour mutant zeste by suppressor, enhancer and minute genes in Drosophila melanogaster. Hereditas, 82(1), pp.111–119. Pleasance, E.D., Cheetham, R.K., Stephens, P.J., McBride, D.J., Humphray, S.J., Greenman, C.D., Varela, I., et al., 2010a. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 463(7278), pp.191–196. Pleasance, E.D., Stephens, P.J., O'Meara, S., McBride, D.J., Meynert, A., Jones, D., Lin, M.-L., et al., 2010b. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature, 463(7278), pp.184–190. Pui, C.-H., Robison, L.L. & Look, A.T., 2008. Acute lymphoblastic leukaemia. Lancet, 371(9617), pp.1030–1043. Pushkarev, D., Neff, N.F.
& Quake, S.R., 2009. Single-molecule sequencing of an individual human genome. Nat Biotechnol, 27(9), pp.847–850. Quintás-Cardama, A., Manshouri, T., Estrov, Z., Harris, D., Zhang, Ying, Gaikwad, A., Kantarjian, H.M. & Verstovsek, S., 2011. Preclinical characterization of atiprimod, a novel JAK2 AND JAK3 inhibitor. Invest New Drugs, 29(5), pp.818–826. Raaphorst, F., van Kemenade, F., Fieret, E., Hamer, K., Satijn, D., Otte, A. & Meijer, C., 2000. Cutting edge: polycomb gene expression patterns reflect distinct B cell differentiation stages in human germinal centers. J Immunol, 164(1), pp.1–4. Rios, J., Stein, E., Shendure, J., Hobbs, H.H. & Cohen, J.C., 2010. Identification by whole- genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum Mol Genet, 19(22), pp.4313–4318. Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., et al., 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Meth, 4(8), pp.651–657. Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S.D., Mungall, K., et al., 2010. De novo assembly and analysis of RNA-seq data. Nat Meth, 7(11), pp.909–912. Robinson, J.T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G. & Mesirov, J.P., 2011. Integrative genomics viewer. Nat Biotechnol, 29(1), pp.24–26. Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, J., Mileski, W., Davey, M., Leamon, J.H., et al., 2011. An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356), pp.348–352. Rowley, J.D., 1973. Identificaton of a translocation with quinacrine fluorescence in a patient with acute leukemia. Annales de génétique, 16(2), pp.109–112. Russell, L.J., Capasso, M., Vater, I., Akasaka, T., Bernard, O.A., Calasanz, M.J., Chandrasekaran, T., et al., 2009a. 
Deregulated expression of cytokine receptor gene, CRLF2, is involved in lymphoid transformation in B-cell precursor acute lymphoblastic leukemia. Blood, 114(13), pp.2688–2698. Russell, L.J., De Castro, D.G., Griffiths, M., Telford, N., Bernard, O., Panzer-Grümayer, R., Heidenreich, O., Moorman, A.V. & Harrison, C.J., 2009b. A novel translocation, t(14;19)(q32;p13), involving IGH@ and the cytokine receptor for erythropoietin. Leukemia, 23(3), pp.614–617.  162 Saito, M., Novak, U., Piovan, E., Basso, K., Sumazin, P., Schneider, C., Crespo, M., et al., 2009. BCL6 suppression of BCL2 via Miz1 and its disruption in diffuse large B cell lymphoma. Proc Natl Acad Sci USA, 106(27), pp.11294–11299. Salido, M., Martinez-Avilés, L., Ademà, V., Ferrer, A., Espinet, B., Garcia, M., Salar, A., et al., 2011. Absence of mutations of the histone methyltransferase gene EZH2 in splenic b- cell marginal zone lymphoma. Leuk Res, 35(3), pp.e23–4. Sanchez-Cespedes, M., Parrella, P., Esteller, M., Nomoto, S., Trink, B., Engles, J.M., Westra, W.H., Herman, J.G. & Sidransky, D., 2002. Inactivation of LKB1/STK11 is a common event in adenocarcinomas of the lung. Cancer Res, 62(13), pp.3659–3662. Schrader, K.A., Gorbatcheva, B., Senz, J., Heravi-Moussavi, A., Melnyk, N., Salamanca, C., Maines-Bandiera, S., et al., 2009. The Specificity of the FOXL2 c.402C>G Somatic Mutation: A Survey of Solid Tumors A. Brandstaetter, ed. PloS one, 4(11), p.e7988. Schultz, K.R., Bowman, W.P., Aledo, A., Slayton, W.B., Sather, H., Devidas, M., Wang, C., et al., 2009. Improved early event-free survival with imatinib in Philadelphia chromosome-positive acute lymphoblastic leukemia: a children's oncology group study. J Clin Oncol, 27(31), pp.5175–5181. Schwartz, P.A. & Murray, B.W., 2011. Protein kinase biochemistry and drug discovery. Bioorg Chem, 39(5-6), pp.192–210. Shah, S.P., Köbel, M., Senz, J., Morin, R.D., Clarke, B.A., Wiegand, Kimberly C, Leung, G., et al., 2009a. 
Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med, 360(26), pp.2719–2729. Shah, S.P., Morin, R.D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., Delaney, A., et al., 2009b. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265), pp.809–813. Shendure, J., Porreca, G., Reppas, N., Lin, X., McCutcheon, J., Rosenbaum, A., Wang, M., et al., 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 309(5741), pp.1728–1732. Shilatifard, A., 2008. Molecular implementation and physiological roles for histone H3 lysine 4 (H3K4) methylation. Curr Opin Cell Biol, 20(3), pp.341–348. Shochat, C., Tal, N., Bandapalli, O.R., Palmi, C., Ganmore, I., Kronnie, Te, G., Cario, G., et al., 2011. Gain-of-function mutations in interleukin-7 receptor-α (IL7R) in childhood acute lymphoblastic leukemias. J Exp Med, 208(5), pp.901–908. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M. & Birol, I., 2009. ABySS: A parallel assembler for short read sequence data. Genome Res, 19(6), pp.1117– 1123. Sjöblom, T., Jones, Siân, Wood, L.D., Parsons, D.W., Lin, J., Barber, T.D., Mandelker, D., et  163 al., 2006. The consensus coding sequences of human breast and colorectal cancers. Science, 314(5797), pp.268–274. Slack, G.W. & Gascoyne, R.D., 2011. MYC and Aggressive B-cell Lymphomas. Adv Anat Pathol, 18(3), pp.219–228. Smith, M.A., Gleckler, L.A., Gurney, J.G. & Ross, J.A., 1999. Leukemia. In L. Ries, M. Smith, J. Gurney, M. Linet, T. Tamra, J. Young, & G. Bunin, eds. Cancer Incidence and Survival among Children and Adolescents: United States SEER Program 1975-1995. National Cancer Institute, SEER Program, pp. 17–34. Sneeringer, C.J., Scott, M.P., Kuntz, K.W., Knutson, S.K., Pollock, R.M., Richon, V.M. & Copeland, R.A., 2010. Coordinated activities of wild-type plus mutant EZH2 drive tumor-associated hypertrimethylation of lysine 27 on histone H3 (H3K27) in human B- cell lymphomas. 
Proc Natl Acad Sci USA, 107(49), pp.20980–20985. Snijders, A.M., Pinkel, D. & Albertson, D.G., 2003. Current status and future prospects of array-based comparative genomic hybridisation. Brief Funct Genomic Proteomic, 2(1), pp.37–45. Southall, S.M., Wong, P.-S., Odho, Z., Roe, S.M. & Wilson, Jon R, 2009. Structural basis for the requirement of additional factors for MLL1 SET domain activity and recognition of epigenetic marks. Mol Cell, 33(2), pp.181–191. Staden, R., 1996. The Staden sequence analysis package. Molecular biotechnology, 5(3), pp.233–241. Stark, A., Kheradpour, P., Parts, L., Brennecke, J., Hodges, E., Hannon, G.J. & Kellis, M., 2007. Systematic discovery and characterization of fly microRNAs using 12 Drosophila genomes. Genome Res, 17(12), pp.1865–1879. Stephens, P.J., Greenman, C.D., Fu, B., Yang, F., Bignell, G.R., Mudie, L.J., Pleasance, E.D., et al., 2011. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1), pp.27–40. Stransky, N., Egloff, A.M., Tward, A.D., Kostic, A.D., Cibulskis, K., Sivachenko, A., Kryukov, G.V., et al., 2011. The mutational landscape of head and neck squamous cell carcinoma. Science, 333(6046), pp.1157–1160. Stratton, M.R., Campbell, P.J. & Futreal, P.A., 2009. The cancer genome. Nature, 458(7239), pp.719–724. Su, I.-H., Basavaraj, A., Krutchinsky, A.N., Hobert, O., Ullrich, A., Chait, B.T. & Tarakhovsky, A., 2003. Ezh2 controls B cell development through histone H3 methylation and Igh rearrangement. Nat Immunol, 4(2), pp.124–131. Sultan, M., Schulz, M.H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., et al., 2008. A global view of gene activity and alternative splicing by deep sequencing  164 of the human transcriptome. Science, 321(5891), pp.956–960. Tai, D.J.C., Su, C.-C., Ma, Y.-L. & Lee, E.H.Y., 2009. 
SGK1 phosphorylation of IkappaB Kinase alpha and p300 Up-regulates NF-kappaB activity and increases N-Methyl-D- aspartate receptor NR2A and NR2B expression. J Biol Chem, 284(7), pp.4073–4089. Tan, J., Yang, X., Zhuang, L., Jiang, X., Chen, W., Lee, P.L., Karuturi, R.K.M., et al., 2007. Pharmacologic disruption of Polycomb-repressive complex 2-mediated gene repression selectively induces apoptosis in cancer cells. Genes Dev, 21(9), pp.1050–1063. Tazi, J., Bakkour, N. & Stamm, S., 2009. Alternative splicing and disease. Biochim Biophys Acta, 1792(1), pp.14–26. Thomas, R.K., Baker, A.C., Debiasi, R.M., Winckler, W., Laframboise, T., Lin, W.M., Wang, M., et al., 2007. High-throughput oncogene mutation profiling in human cancer. Nat Genet, 39(3), pp.347–351. Thompson, J., Gibson, T. & Higgins, D., 2002. Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics, Chapter 2, pp.Unit 2 3–Unit 2 3. Tiacci, E., Trifonov, V., Schiavoni, G., Holmes, A., Kern, W., Martelli, M.P., Pucciarini, A., et al., 2011. BRAF mutations in hairy-cell leukemia. N Engl J Med, 364(24), pp.2305– 2315. Trapnell, C., Pachter, L. & Salzberg, S.L., 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9), pp.1105–1111. van Haaften, G., Dalgliesh, G.L., Davies, H., Chen, Lina, Bignell, G., Greenman, C., Edkins, S., et al., 2009. Somatic mutations of the histone H3K27 demethylase gene UTX in human cancer. Nat Genet, 41(5), pp.521–523. van Kemenade, F.J., Raaphorst, F.M., Blokzijl, T., Fieret, E., Hamer, K.M., Satijn, D.P., Otte, A.P. & Meijer, C.J., 2001. Coexpression of BMI-1 and EZH2 polycomb-group proteins is associated with cycling cells and degree of malignancy in B-cell non-Hodgkin lymphoma. Blood, 97(12), pp.3896–3901. Varambally, S., Dhanasekaran, S.M., Zhou, M., Barrette, T.R., Kumar-Sinha, C., Sanda, M.G., Ghosh, D., et al., 2002. The polycomb group protein EZH2 is involved in progression of prostate cancer. Nature, 419(6907), pp.624–629. 
Varela, I., Tarpey, P., Raine, K., Huang, D., Ong, C.K., Stephens, P., Davies, H., et al., 2011. Exome sequencing identifies frequent mutation of the SWI/SNF complex gene PBRM1 in renal carcinoma. Nature, 469(7331), pp.539–542. Velichutina, I., Shaknovich, R., Geng, H., Johnson, N.A., Gascoyne, R.D., Melnick, A.M. & Elemento, O., 2010. EZH2-mediated epigenetic silencing in germinal center B cells contributes to proliferation and lymphomagenesis. Blood, 116(24), pp.5247–5255.  165 Viré, E., Brenner, C., Deplus, R., Blanchon, L., Fraga, M., Didelot, C., Morey, L., et al., 2006. The Polycomb group protein EZH2 directly controls DNA methylation. Nature, 439(7078), pp.871–874. Wang, J., Mullighan, C.G., Easton, J., Roberts, S., Heatley, S.L., Ma, J., Rusch, M.C., et al., 2011a. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Meth, 8(8), pp.652–654. Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., Fan, W., et al., 2008. The diploid genome sequence of an Asian individual. Nature, 456(7218), pp.60–65. Wang, L., McLeod, H.L. & Weinshilboum, R.M., 2011b. Genomics and drug response. N Engl J Med, 364(12), pp.1144–1153. Wang, P., Lin, C., Smith, E.R., Guo, H., Sanderson, B.W., Wu, M., Gogol, M., et al., 2009. Global Analysis of H3K4 Methylation Defines MLL Family Member Targets and Points to a Role for MLL1-Mediated H3K4 Methylation in the Regulation of Transcriptional Initiation by RNA Polymerase II. Mol Cell Biol, 29(22), pp.6074–6085. Warren, R.L., Sutton, G.G., Jones, S.J.M. & Holt, R.A., 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23(4), pp.500–501. Wei, X., Walia, V., Lin, J.C., Teer, J.K., Prickett, T.D., Gartner, J., Davis, S., et al., 2011. Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat Genet, 43(5), pp.442–446. Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, Lei, McGuire, A., He, W., et al., 2008. 
The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189), pp.872–876. Wiegand, Kimberly C, Shah, S.P., Al-Agha, O.M., Zhao, Y., Tse, K., Zeng, T., Senz, J., et al., 2010. ARID1A mutations in endometriosis-associated ovarian carcinomas. N Engl J Med, 363(16), pp.1532–1543. Wilker, P., Kohyama, M., Sandau, M., Albring, J., Nakagawa, O., Schwarz, J. & Murphy, K., 2008. Transcription factor Mef2c is required for B cell proliferation and survival after antigen receptor stimulation. Nat Immunol, 9(6), pp.603–612. Winter, J.N., Variakojis, D. & Epstein, A.L., 1984. Phenotypic analysis of established diffuse histiocytic lymphoma cell lines utilizing monoclonal antibodies and cytochemical techniques. Blood, 63(1), pp.140–146. Wong, W.C., Kim, D., Carter, H., Diekhans, M., Ryan, M.C. & Karchin, R., 2011. CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics, 27(15), pp.2147–2148. Wright, G., Tan, B., Rosenwald, A., Hurt, E.H., Wiestner, A. & Staudt, L.M., 2003. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell  166 lymphoma. Proc Natl Acad Sci USA, 100(17), pp.9991–9996. Xiao, B., Jing, C., Wilson, Jonathan R, Walker, P.A., Vasisht, N., Kelly, G., Howell, S., et al., 2003. Structure and catalytic mechanism of the human histone methyltransferase SET7/9. Nature, 421(6923), pp.652–656. Yap, D.B., Chu, J., Berg, T., Schapira, M., Cheng, S.-W.G., Moradian, A., Morin, R.D., et al., 2011. Somatic mutations at EZH2 Y641 act dominantly through a mechanism of selectively altered PRC2 catalytic activity, to increase H3K27 trimethylation. Blood, 117(8), pp.2451–2459. Yk, W., Cf, G., T, Y., Z, C., Xw, Z., Xx, L., Nl, M. & Wz, Z., 2011. Assessment of ERBB2 and EGFR gene amplification and protein expression in gastric carcinoma by immunohistochemistry and fluorescence in situ hybridization. Mol Cytogenet, 4(1), p.14. 
Yoda, A., Yoda, Y., Chiaretti, S., Bar-Natan, M., Mani, K., Rodig, S.J., West, N., et al., 2010. Functional screening identifies CRLF2 in precursor B-cell acute lymphoblastic leukemia. Proc Natl Acad Sci USA, 107(1), pp.252–257. Youn, H. & Liu, J., 2000. Cabin1 represses MEF2-dependent Nur77 expression and T cell apoptosis by controlling association of histone deacetylases and acetylases with MEF2. Immunity, 13(1), pp.85–94. Youn, H., Sun, L., Prywes, R. & Liu, J., 1999. Apoptosis of T cells mediated by Ca2+- induced release of the transcription factor MEF2. Science, 286(5440), pp.790–793. Young, K.H., Leroy, K., Møller, M.B., Colleoni, G.W.B., Sánchez-Beato, M., Kerbauy, F.R., Haioun, C., et al., 2008. Structural profiles of TP53 gene mutations predict clinical outcome in diffuse large B-cell lymphoma: an international collaborative study. Blood, 112(8), pp.3088–3098. Yusuf, I., Zhu, X., Kharas, M.G., Chen, J. & Fruman, D.A., 2004. Optimal B-cell proliferation requires phosphoinositide 3-kinase-dependent inactivation of FOXO transcription factors. Blood, 104(3), pp.784–787. Zerbino, D.R. & Birney, E., 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5), pp.821–829. Zhang, Xing, Yang, Z., Khan, S.I., Horton, J.R., Tamaru, H., Selker, E.U. & Cheng, X., 2003. Structural basis for the product specificity of histone lysine methyltransferases. Mol Cell, 12(1), pp.177–185.     
Appendices

Appendix A  Supplementary data from the analysis of FL Sample A and EZH2

A.1 Additional novel SNVs identified in FL Sample A by RNA-seq

| Position | Gene | Ref Base | NR Base | RNA-seq Reads (R) | RNA-seq Reads (NR) | Ref AA | NR AA |
|---|---|---|---|---|---|---|---|
| chr1:1131748 | TNFRSF18 | T | C | 2 | 3 | S | G |
| chr1:10381723 | PGD | C | T | 30 | 22 | R | C |
| chr1:20315465 | PLA2G2D | C | T | 70 | 37 | G | S |
| chr1:21762347 | ALPL | G | A | 11 | 12 | R | H |
| chr1:43670083 | KIAA0467 | G | A | 1 | 5 | R | H |
| chr1:46541451 | LRRC41 | G | C | 2 | 4 | P | R |
| chr1:89296499 | GBP1 | C | T | 14 | 12 | S | N |
| chr1:148994106 | CTSS§ | A | C | 217 | 48 | Y | D |
| chr1:149503078 | PSMD4 | G | T | 0 | 2 | V | F |
| chr1:156331454 | KIRREL | A | C | 5 | 3 | T | P |
| chr1:158759594 | SLAMF6 | A | C | 20 | 7 | F | V |
| chr1:172106402 | ZBTB37 | G | T | 4 | 2 | S | I |
| chr2:11401593 | ROCK2 | T | C | 3 | 5 | I | V |
| chr2:27135770 | AGBL5 | G | A | 0 | 2 | V | I |
| chr2:96237694 | STARD7 | G | A | 18 | 32 | P | L |
| chr2:108714301 | RANBP2 | A | G | 18 | 8 | K | R |
| chr2:201845626 | CASP8 | A | G | 17 | 12 | I | M |
| chr2:203470334 | WDR12 | C | T | 4 | 4 | G | R |
| chr2:203819783 | CYP20A1 | G | T | 14 | 5 | L | F |
| chr2:231371767 | CAB39 | A | G | 62 | 58 | I | V |
| chr3:9686176 | MTMR14 | G | A | 44 | 13 | R | Q |
| chr3:69313282 | FRMD4B | T | G | 7 | 4 | N | T |
| chr3:109885838 | DZIP3 | C | T | 5 | 9 | P | L |
| chr3:122899503 | GOLGB1 | G | A | 2 | 8 | L | F |
| chr3:126435054 | SLC12A8 | T | G | 11 | 3 | K | N |
| chr3:185553595 | CLCN2 | C | T | 0 | 2 | R | Q |
| chr3:197872542 | LRRC33 | C | T | 6 | 9 | S | L |
| chr4:357199 | ZNF141 | A | T | 2 | 2 | T | S |
| chr4:1813073 | LETM1 | G | T | 1 | 4 | N | K |
| chr4:8284650 | SH3TC1 | A | G | 6 | 3 | M | V |
| chr4:10054305 | ZNF518B | T | C | 1 | 4 | I | V |
| chr4:20228276 | SLIT2 | C | T | 4 | 3 | A | V |
| chr4:71131547 | C4orf7 | T | G | 76 | 13 | V | G |
| chr4:78090084 | SEPT11 | G | C | 4 | 5 | A | P |
| chr4:164726502 | MARCH1 | C | T | 53 | 62 | S | N |
| chr5:102460210 | GIN1§ | C | A | 2 | 2 | V | F |
| chr6:30787176 | MDC1 | T | C | 5 | 5 | Q | R |
| chr6:32597800 | HLA-DRB5 | C | T | 1 | 4 | R | Q |
| chr7:5319475 | TNRC18 | G | C | 10 | 5 | L | V |
| chr7:5393918 | TNRC18 | A | C | 11 | 3 | V | G |
| chr7:24725289 | DFNA5§ | G | A | 10 | 5 | Q | * |
| chr7:66120549 | TYW1 | A | G | 4 | 6 | H | R |
| chr7:101885258 | ALKBH4 | A | C | 2 | 4 | I | S |
| chr7:101900231 | LRWD1 | C | T | 9 | 8 | P | L |
| chr7:107942613 | PNPLA8 | G | A | 8 | 12 | R | C |
| chr7:124279208 | POT1 | C | A | 4 | 8 | Q | H |
| chr7:130778829 | MKLN1 | T | G | 14 | 7 | I | S |
| chr7:139825570 | MKRN1 | C | A | 6 | 4 | G | V |
| chr7:148139661 | EZH2§ | A | G | 11 | 4 | Y | H |
| chr8:19732261 | INTS10 | A | C | 32 | 10 | E | A |
| chr8:145592397 | CPSF1 | G | A | 8 | 9 | R | C |
| chr9:116826116 | TNC | C | T | 69 | 59 | R | H |
| chr9:133324433 | KIAA0515 | G | A | 23 | 11 | A | T |
| chr9:134207924 | SETX | A | C | 14 | 12 | L | V |
| chr9:138396112 | SNAPC4 | C | T | 6 | 6 | V | I |
| chr9:139281277 | COBRA1 | A | G | 20 | 29 | T | A |
| chr9:139569822 | WDR85 | C | T | 9 | 3 | R | Q |
| chr10:45565563 | FAM21D | T | C | 3 | 4 | M | T |
| chr10:71575478 | TYSND1 | C | A | 6 | 9 | V | L |
| chr10:75205265 | FUT11 | G | C | 0 | 2 | A | P |
| chr10:79465368 | RPS24§ | A | C | 67 | 116 | I | L |
| chr10:89464598 | PAPSS2 | C | G | 13 | 11 | Q | E |
| chr10:90764005 | FAS§ | C | T | 46 | 9 | Q | * |
| chr10:104668340 | CNNM2 | G | A | 1 | 3 | R | Q |
| chr11:46787573 | CKAP5 | C | T | 13 | 7 | A | T |
| chr11:58736424 | MPEG1 | G | A | 28 | 36 | T | I |
| chr11:61320311 | FEN1 | C | G | 11 | 6 | P | R |
| chr11:65113802 | EHBP1L1 | G | A | 8 | 6 | G | D |
| chr11:65142149 | PCNXL3 | A | G | 18 | 15 | Q | R |
| chr11:67021989 | PITPNM1 | C | T | 28 | 25 | R | Q |
| chr11:73446107 | C2CD3 | C | A | 7 | 3 | Q | H |
| chr11:114880310 | CADM1 | C | G | 0 | 22 | V | L |
| chr11:118270511 | BLR1§ | C | T | 115 | 54 | Q | * |
| chr12:2668007 | CACNA1C | G | A | 1 | 2 | R | Q |
| chr12:29556302 | TMTC1 | G | C | 1 | 2 | Q | E |
| chr12:42428601 | PUS7L | G | A | 4 | 5 | S | L |
| chr12:46429582 | RAPGEF3 | A | G | 0 | 3 | L | P |
| chr12:47717214 | MLL2 | T | C | 10 | 6 | M | V |
| chr12:55785346 | STAT6 | G | T | 77 | 45 | Q | K |
| chr12:93138029 | PLXNC1 | G | A | 19 | 19 | R | Q |
| chr12:107707010 | SSH1 | G | A | 8 | 7 | T | I |
| chr13:72267613 | C13orf24 | C | T | 6 | 6 | R | C |
| chr14:20012815 | NP§ | C | T | 35 | 19 | T | I |
| chr14:72823631 | NUMB | T | G | 30 | 6 | T | P |
| chr14:73109889 | ACOT2 | G | A | 0 | 2 | G | S |
| chr14:87486760 | GALC | G | A | 12 | 8 | S | F |
| chr15:38856063 | DNAJC17 | T | G | 3 | 2 | E | A |
| chr15:41604904 | MAP1A | A | G | 2 | 2 | E | G |
| chr16:2203837 | PGP | C | T | 6 | 9 | G | R |
| chr16:3726768 | CREBBP§ | A | G | 33 | 22 | Y | H |
| chr16:30640059 | SRCAP | C | A | 16 | 19 | T | K |
| chr16:68336417 | NOB1 | T | C | 5 | 12 | T | A |
| chr17:4391586 | MYBBP1A | G | A | 10 | 13 | A | V |
| chr17:8082622 | C17orf68 | C | G | 22 | 20 | S | T |
| chr17:25730809 | CPD | C | G | 4 | 2 | P | A |
| chr17:34125400 | MLLT6 | T | A | 48 | 25 | S | T |
| chr17:37167297 | JUP | C | T | 11 | 92 | V | I |
| chr17:71568013 | SRP68 | T | C | 19 | 25 | T | A |
| chr18:19738538 | LAMA3 | A | G | 3 | 2 | D | G |
| chr18:31960233 | SLC39A6 | C | T | 16 | 24 | E | K |
| chr19:5874888 | RANBP3 | T | C | 11 | 11 | N | S |
| chr19:19121087 | MEF2B§ | T | C | 56 | 45 | Y | C |
| chr19:42827422 | ZFP30 | C | A | 5 | 5 | C | F |
| chr19:50267501 | ZNF342 | A | C | 8 | 4 | V | G |
| chr19:55084855 | IL4I1 | C | T | 5 | 5 | V | M |
| chr20:23013733 | CD93 | C | T | 5 | 9 | R | H |
| chr20:60338313 | LAMA5 | A | C | 2 | 2 | V | G |
| chr21:14675411 | STCH | C | A | 12 | 6 | G | V |
| chr21:41733614 | MX1 | T | G | 20 | 5 | V | G |
| chr21:46406377 | C21orf56 | C | T | 4 | 2 | A | T |
| chr22:31585244 | TIMP3 | C | T | 64 | 68 | R | W |
| chrX:18846780 | PHKA2 | T | C | 8 | 6 | I | V |
| chrX:47803931 | ZNF630 | G | C | 0 | 2 | Q | E |
| chrX:48647000 | SLC35A2 | C | A | 4 | 5 | R | L |
| chrX:53669571 | HUWE1 | G | T | 0 | 2 | P | T |
| chrX:122694531 | THOC2 | A | C | 9 | 3 | V | G |
| chrX:148500835 | TMEM185A | T | C | 1 | 2 | K | E |
| chrX:148500838 | TMEM185A | A | C | 2 | 2 | W | G |
| chrX:150324177 | VMA21 | A | G | 7 | 12 | K | E |
| chrX:152873134 | HCFC1 | T | G | 27 | 12 | T | P |

§SNV was proven to be somatic (absent in matched constitutional DNA).
Abbreviations: Ref = reference; NR = non-reference; AA = amino acid; * refers to a truncation

A.2 All EZH2 mutants detected by Sanger sequencing in FL and DLBCL

| ID | Diagnosis | EZH2 mutation | Position (chr7) | Nucleotide change | Amino Acid change |
|---|---|---|---|---|---|
| 00-13940 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 00-15694 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 00-19845 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 00-22287 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 01-19969 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 02-26353 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 02-30647 | DLBCL (GCB) | Y641 MUT | 148139661 | T->C | Y->H |
| 04-11156 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 04-28216 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 04-39242 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 05-11328 | DLBCL (GCB) | Y641 MUT | 148139661 | T->C | Y->H |
| 05-23110 | DLBCL (GCB) | Y641 MUT | 148139661 | T->C | Y->H |
| 05-25439 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 05-32947 | DLBCL (GCB) | Y641 MUT | 148139661 | T->C | Y->H |
| 06-24718 | DLBCL (GCB) | Y641 MUT | 148139660 | A->C | Y->S |
| 07-30109 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 07-30628 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 98-14032 | DLBCL (GCB) | Y641 MUT | 148139660 | A->T | Y->F |
| 01-15178 | DLBCL (NA) | Y641 MUT | 148139661 | T->C | Y->H |
| 01-24152 | DLBCL (NA) | Y641 MUT | 148139660 | A->C | Y->S |
| 01-28389 | DLBCL (NA) | Y641 MUT | 148139660 | A->T | Y->F |
| 02-24981 | DLBCL (NA) | Y641 MUT | 148139661 | T->A | Y->N |
| 03-11110 | DLBCL (NA) | Y641 MUT | 148139660 | A->T | Y->F |
| 03-28045 | DLBCL (NA) | Y641 MUT | 148139660 | A->C | Y->S |
| 05-12131 | DLBCL (NA) | Y641 MUT | 148139660 | A->T | Y->F |
| 05-26898 | DLBCL (NA) | Y641 MUT | 148139661 | T->C | Y->H |
| 06-27034 | DLBCL (NA) | Y641 MUT | 148139660 | A->T | Y->F |
| 96-20883 | DLBCL (NA) | Y641 MUT | 148139661 | T->A | Y->N |
| 99-22226 | DLBCL (NA) | Y641 MUT | 148139660 | A->T | Y->F |
| 99-29859 | DLBCL (PMBCL) | Y641 MUT | 148139660 | A->T | Y->F |
| 01-11023 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 02-23246 | FL Grade 1 | Y641 MUT | 148139661 | T->A | Y->N |
| 05-12472 | FL Grade 1 | Y641 MUT | 148139661 | T->A | Y->N |
| 06-12968 | FL Grade 1 | Y641 MUT | 148139661 | T->C | Y->H |
| 06-19522 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 06-19817 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 89-33903 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 89-37479 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 90-34286 | FL Grade 1 | Y641 MUT | 148139660 | A->T | Y->F |
| 95-13715 | FL Grade 1 | Y641 MUT | 148139660 | A->C | Y->S |
| 99-17919*+ | FL Grade 1, DLBCL | WT in FL, Y641 MUT in DLBCL | 148139661 | T->A | Y->N |
| 99-30068* | FL Grade 1, DLBCL | WT in FL, Y641 MUT in DLBCL | 148139661 | T->A | Y->N |
| 06-16058 | FL Grade 2 | Y641 MUT | 148139660 | A->T | Y->F |
| 06-23851 | FL Grade 2 | Y641 MUT | 148139660 | A->T | Y->F |
| 06-30133 | FL Grade 2 | Y641 MUT | 148139660 | A->T | Y->F |
| 96-26853 | FL Grade 2 | Y641 MUT | 148139661 | T->A | Y->N |
| 96-26853 | FL Grade 2 | SET MUT | 148139677 | T->A | N->K |
| 00-27081* | FL Grade 2, DLBCL | WT in FL, Y641 MUT in DLBCL | 148139660 | A->G | Y->C |
| 03-11874* | FL Grade 2, DLBCL | Y641 MUT in FL, WT in DLBCL | 148139660 | A->T | Y->F |
| 95-24059* | FL Grade 2, DLBCL | Y641 MUT in FL, WT in DLBCL | 148139660 | A->T | Y->F |
| 94-12812 | FL Grade 3 | Y641 MUT | 148139660 | A->T | Y->F |
| 04-40070 | FL Grade 3A | Y641 MUT | 148139660 | A->C | Y->S |
| 02-18484*+ | FL Grade 3A, DLBCL | WT in FL, Y641 MUT in DLBCL | 148139660 | A->C | Y->S |

*FL patients with paired samples taken pre- and post-transformation to DLBCL
+Visual assessment of Sanger trace files revealed evidence for the same Y641 mutation in the matched sample originally deemed wild type by automated analysis of Sanger sequence data (peak height was below threshold)

A.3 Analysis of samples for EZH2 mutations using ultra-deep targeted resequencing

| Index* | Sample ID | Type | Maximum high-quality matches at either NS site (codon 641) | Maximum high-quality mismatches at either NS site (codon 641) | Maximum % high-quality mismatches at either NS site (codon 641) | Amplicon average % high-quality mismatch |
|---|---|---|---|---|---|---|
| ACGATA | V00196 | FL | 2430 | 207 | 7.850 | 0.145 |
| GTAGAG | 4299 | FL | 2749 | 194 | 6.592 | 0.144 |
| TGCTGG | 9425 | FL | 1583 | 263 | 14.24 | 0.079 |
| ATCACG | V00180 | Tonsil | 1276 | 0 | 0.000 | 0.084 |
| GATCAG | V00522 | Tonsil | 2073 | 2 | 0.112 | 0.082 |
| AACCCC | V00523 | Tonsil | 2697 | 4 | 0.158 | 0.105 |
| ACCCAG | V00524 | Tonsil | 3049 | 0 | 0.000 | 0.090 |
| AGCGCT | V00525 | Tonsil | 1501 | 0 | 0.000 | 0.076 |
| CAAAAG | V00526 | Tonsil | 5585 | 3 | 0.061 | 0.117 |
| CCAACA | V00527 | Tonsil | 3514 | 6 | 0.182 | 0.235 |
| CTAGCT | V00528 | Tonsil | 1508 | 1 | 0.066 | 0.088 |
| GATGCT | V00530 | Tonsil | 1182 | 2 | 0.189 | 0.100 |
| TAATCG | V00531 | Tonsil | 1764 | 0 | 0.000 | 0.092 |
| TGAATG | V00537 | Tonsil | 2610 | 2 | 0.084 | 0.112 |
| AGTTCC | V00538 | Tonsil | 1331 | 1 | 0.075 | 0.074 |
| CGATGT | V00539 | Tonsil | 1326 | 1 | 0.085 | 0.071 |
| TAGCTT | V00540 | Tonsil | 1906 | 1 | 0.060 | 0.076 |
| AACTTG | V00541 | Tonsil | 8556 | 7 | 0.101 | 0.264 |
| ACCGGC | V00542 | Tonsil | 1657 | 2 | 0.139 | 0.086 |
| AGGCCG | V00543 | Tonsil | 5545 | 3 | 0.061 | 0.386 |
| CAACTA | V00545 | Tonsil | 2101 | 2 | 0.095 | 0.236 |
| CCACGC | V00546 | Tonsil | 1625 | 0 | 0.000 | 0.075 |
| CTATAC | V00548 | Tonsil | 4836 | 4 | 0.083 | 0.252 |
| GCAAGG | V00550 | Tonsil | 1706 | 2 | 0.117 | 0.087 |
| TACAGC | V00553 | Tonsil | 2258 | 1 | 0.049 | 0.134 |
| TGCCAT | V00557 | Tonsil | 1558 | 1 | 0.071 | 0.082 |
| ATGTCA | V00563 | SLL | 1327 | 0 | 0.000 | 0.078 |
| TTAGGC | V00573 | SLL | 1705 | 3 | 0.194 | 0.098 |
| GGCTAC | V00595 | SLL | 2222 | 2 | 0.104 | 0.074 |
| AAGACT | V00600 | SLL | 10225 | 4 | 0.044 | 0.104 |
| ACGATA | V00609 | MCLN | 1444 | 1 | 0.077 | 0.200 |
| ATAATT | V00611 | MCLD | 18089 | 11 | 0.070 | 0.346 |
| CACCGG | V00625 | MCLN | 5010 | 3 | 0.069 | 0.352 |
| CCCATG | V00641 | SLL | 2961 | 2 | 0.072 | 0.110 |
| CTCAGA | V00646 | SLL | 18903 | 8 | 0.046 | 0.096 |
| GCACTT | V00649 | SLL | 1753 | 1 | 0.064 | 0.231 |
| TATAAT | V00654 | MCLN | 105394 | 45 | 0.046 | 0.304 |
| TGCTGG | V00655 | MCLN | 2805 | 2 | 0.079 | 0.101 |
| CCGTCC | V00660 | MCLD | 1614 | 2 | 0.124 | 0.094 |
| TGACCA | V00663 | SLL | 1988 | 1 | 0.057 | 0.160 |
| CTTGTA | V00668 | SLL | 7808 | 8 | 0.113 | 0.432 |
| AAGCGA | V00673 | MCLD | 3535 | 3 | 0.094 | 0.100 |
| ACTCTC | V00677 | SLL | 1769 | 1 | 0.056 | 0.094 |
| ATACGG | V00695 | MCLD | 3276 | 3 | 0.099 | 0.090 |
| CACGAT | V00701 | SLL | 1673 | 1 | 0.065 | 0.108 |
| CCCCCT | V00727 | MCLD | 133041 | 73 | 0.059 | 0.152 |
| CTGCTG | V00732 | MCLD | 30803 | 20 | 0.075 | 0.133 |
| GCCGCG | V00734 | SLL | 2329 | 1 | 0.048 | 0.081 |
| TCATTC | V00735 | MCLD | 2055 | 1 | 0.055 | 0.291 |
| TGGCGC | V00737 | MCLN | 8485 | 4 | 0.051 | 0.135 |
| GTAGAG | V00738 | SLL | 1305 | 1 | 0.085 | 0.157 |
| ACAGTG | V00739 | MCLD | 1305 | 1 | 0.077 | 0.070 |
| AAACAT | V00741 | MCLD | 1298 | 1 | 0.083 | 0.079 |
| AAGGAC | V00744 | MCLN | 9418 | 3 | 0.040 | 0.178 |
| ACTGAT | V00750 | MCLN | 1717 | 2 | 0.131 | 0.248 |
| ATCCTA | V00752 | MCLD | 2267 | 1 | 0.049 | 0.128 |
| CACTCA | V00762 | MCLD | 29623 | 18 | 0.067 | 0.334 |
| CCGCAA | V00764 | MCLD | 1197 | 1 | 0.094 | 0.161 |
| GAAACC | V00769 | SLL | 14238 | 8 | 0.059 | 0.298 |
| GCCTTA | V00770 | SLL | 3248 | 1 | 0.033 | 0.104 |
| TCCCGA | V00772 | SLL | 3872 | 4 | 0.115 | 0.109 |
| TTCGAA | V00774 | MCLD | 23240 | 10 | 0.046 | 0.124 |
| GTCCGC | V00775 | SLL | 1330 | 0 | 0.000 | 0.075 |
| GCCAAT | V00779 | SLL | 2163 | 1 | 0.051 | 0.093 |
| AAAGCA | V00781 | SLL | 1571 | 2 | 0.141 | 0.235 |
| AATAGG | V00786 | SLL | 15048 | 9 | 0.062 | 0.073 |
| AGAAGA | V00789 | MCLD | 40993 | 20 | 0.050 | 0.201 |
| ATCTAT | V00790 | SLL | 37055 | 15 | 0.042 | 0.216 |
| CAGGCG | V00808 | MCLD | 12481 | 6 | 0.052 | 0.075 |
| CCTTAG | V00809 | SLL | 88603 | 49 | 0.061 | 0.934 |
| GAATAA | V00810 | SLL | 61609 | 23 | 0.039 | 0.265 |
| GCTCCA | V00814 | SLL | 75635 | 52 | 0.069 | 0.262 |
| TCGAAG | V00819 | SLL | 69900 | 21 | 0.033 | 0.200 |
| TTCTCC | V00826 | SLL | 19348 | 7 | 0.043 | 0.111 |
| GTGAAA | V00830 | SLL | 4003 | 4 | 0.108 | 0.117 |
| CAGATC | V00831 | SLL | 1484 | 2 | 0.152 | 0.086 |
| AAATGC | V00832 | MCLN | 1662 | 0 | 0.000 | 0.080 |
| ACAAAC | V00837 | SLL | 24 | 0 | 0.000 | 0.041 |
| AGATAG | V00838 | SLL | 1583 | 1 | 0.063 | 0.094 |
| ATGAGC | V00929 | MCL | 10327 | 15 | 0.155 | 0.459 |
| CATGGC | V00930 | MCL | 3064 | 1 | 0.036 | 0.104 |
| CGAGAA | V00946 | PTCL | 13428 | 8 | 0.064 | 0.073 |
| GACGGA | V00951 | PTCL | 1288 | 2 | 0.169 | 0.087 |
| GGCACA | V00956 | PTCL | 2043 | 2 | 0.098 | 0.292 |
| TCGGCA | V00957 | PTCL | 2694 | 3 | 0.121 | 0.105 |
| AGGTTT | V00959 | PTCL | 4243 | 3 | 0.078 | 0.138 |
| GTGGCC | V00960 | PTCL | 1586 | 0 | 0.000 | 0.095 |
| ACTTGA | V00961 | PTCL | 4625 | 2 | 0.047 | 0.113 |
| AACAAA | V00963 | PTCL | 3098 | 2 | 0.065 | 0.097 |
| ACATCT | V00964 | PTCL | 428478 | 220 | 0.051 | 1.028 |
| AGCATC | V00965 | PTCL | 111 | 1 | 0.952 | 0.115 |
| ATTCCT | V00967 | PTCL | 17 | 0 | 0.000 | 0.059 |
| CATTTT | V00970 | PTCL | 74 | 0 | 0.000 | 0.125 |
| CGGAAT | V00971 | PTCL | 116 | 0 | 0.000 | 0.104 |
| GATATA | V00975 | PTCL | 22105 | 8 | 0.038 | 0.142 |
| GGCCTG | V00977 | PTCL | 11 | 0 | 0.000 | 0.000 |
| TCTACC | V00980 | PTCL | 10 | 0 | 0.000 | 0.207 |
| AGTCAA | V00984 | PTCL | 219 | 0 | 0.000 | 0.102 |
| ATCACG | V00989 | PTCL | 2453 | 1 | 0.042 | 0.073 |
| GATCAG | V00991 | PTCL | 2667 | 1 | 0.039 | 0.058 |
| AACCCC | V00993 | PTCL | 1511 | 1 | 0.068 | 0.052 |
| ACCCAG | V00996 | PTCL | 1953 | 1 | 0.051 | 0.058 |
| AGCGCT | V00997 | PTCL | 2268 | 2 | 0.088 | 0.225 |
| CAAAAG | V00999 | PTCL | 3729 | 2 | 0.054 | 0.061 |
| CCAACA | V01000 | PTCL | 1592 | 1 | 0.063 | 0.112 |
| CTAGCT | V01002 | PTCL | 29436 | 12 | 0.042 | 0.246 |
| GATGCT | 21148-CB | CD77+ CB | 2179 | 1 | 0.047 | 0.153 |
| TAATCG | 2407-CB | CD77+ CB | 3563 | 0 | 0.000 | 0.071 |
| TGAATG | 2424-CB | CD77+ CB | 4644 | 2 | 0.044 | 0.067 |
| AGTTCC | 12307-CB | CD77+ CB | 8728 | 2 | 0.023 | 0.073 |
| CGATGT | 3412-CB | CD77+ CB | 1583 | 0 | 0.000 | 0.052 |
| TAGCTT | 4915-CB | CD77+ CB | 2143 | 0 | 0.000 | 0.073 |
| AACTTG | 4920-CB | CD77+ CB | 1982 | 1 | 0.052 | 0.051 |
| ACCGGC | 120808-CB | CD77+ CB | 1642 | 2 | 0.122 | 0.191 |

Abbreviations: NS, non-synonymous; CB, centroblast; MCLD, mantle cell lymphoma diffuse; MCLN, mantle cell lymphoma nodular; SLL, small lymphocytic lymphoma; PTCL, peripheral T cell lymphoma not otherwise specified;

Appendix B  Supplementary data from the analysis of 127 NHL cases by RNA-seq and genome/exome sequencing

B.1 Details of cases analyzed and their libraries

| Library | Analysis performed | Patient ID & alternate name (or cell line name) | Type | Non-duplicate mapped bases (RNA-seq), coverage (genome) |
|---|---|---|---|---|
| HS0637, A01437, A01413 | RNA-seq, genome-normal, genome-tumour | 05-25439, DLBCL-PatientL | GCB | 9115713150, 28.39, 51.74 |
| HS0639 | RNA-seq | 02-30647 | GCB | 1681071792 |
| HS0640, HS1786, HS1787 | RNA-seq, exome-normal, exome-tumour | 04-11156, DLBCL-PatientA | GCB | 3219848180, 43.10, 57.13 |
| HS0641 | RNA-seq | 03-31713 | GCB | 9997488900 |
| HS0644 | RNA-seq | 03-33266 | GCB | 4309116342 |
| HS0645 | RNA-seq | 02-30519 | U | 2327511160 |
| HS0646 | RNA-seq | 04-23426 | ABC | 3394891950 |
| HS0647, HS2703, HS2702 | RNA-seq, genome-normal, genome-tumour | 98-22532, DLBCL-PatientE | GCB | 1828711740, 34.57, 29.76 |
| HS0648 | RNA-seq | 05-32947 | GCB | 2096385876 |
| HS0649 | RNA-seq | 04-39108 | ABC | 2446692684 |
| HS0650 | RNA-seq | 05-26084 | GCB | 2445307150 |
| HS0651 | RNA-seq | 06-25470 | GCB | 2057593536 |
| HS0652 | RNA-seq | 06-27347 | ABC | 2133559872 |
| HS0653 | RNA-seq | 06-30025 | GCB | 1587728124 |
| HS0654 | RNA-seq | 06-31353 | GCB | 3242738650 |
| HS0656 | RNA-seq | 07-35482 | GCB | 3067507142 |
| HS0685 | RNA-seq | DOHH-2 | GCB-cl | 224132412 |
| HS0747 | RNA-seq | 07-31833 | GCB | 11889487734 |
| HS0748 | RNA-seq | 05-20543 | ABC | 7228685061 |
| HS0749 | RNA-seq | 05-24666 | GCB | 2645899472 |
| HS0750 | RNA-seq | 99-25549 | ABC | 1183617072 |
| HS0751 | RNA-seq | 05-19287 | ABC | 1758036422 |
| HS0798 | RNA-seq | DB | GCB-cl | 429399720 |
| HS0804, HS0821, HS0733, HS1846, HS1843 | RNA-seq, genome-tumour, genome-normal, exome-normal, exome-tumour | FL-PatientA, 06-12968 | FL | 2277444528, 9.47, 2.80, 26.29, 27.73 |
| HS0841 | RNA-seq | Karpas422 | GCB-cl | 496065700 |
| HS0842 | RNA-seq | NU-DHL-1 | GCB-cl | 299256950 |
| HS0900 | RNA-seq | SU-DHL-6 | GCB-cl | 393940750 |
| HS0901 | RNA-seq | WSU-DLCL2 | GCB-cl | 390550000 |
| HS0926, A01439, A01415 | RNA-seq, genome-normal, genome-tumour | 06-11535, DLBCL-PatientJ | U | 2734734350, 34.64, 28.26 |
| HS0927 | RNA-seq | 06-16316 | U | 3787295600 |
| HS0928, A01452, A01416 | RNA-seq, genome-normal, genome-tumour | 06-22057, DLBCL-PatientK | ABC | 5181143850, 29.1637, 24.90 |
| HS0929 | RNA-seq | 06-23792 | GCB | 4096791250 |
| HS0930 | RNA-seq | 95-32814 | GCB | 5100167200 |
| HS0931 | RNA-seq | 01-26405 | ABC | 5196066450 |
| HS0932 | RNA-seq | 03-10363 | ABC | 3076977050 |
| HS0933 | RNA-seq | 03-13123 | GCB | 4786772800 |
| HS0934 | RNA-seq | 05-24395 | U | 5014573200 |
| HS0935 | RNA-seq | 08-21175 | ABC | 2737381650 |
HS0936, A01447, A01418 | RNA-seq | 06-19919, DLBCL-PatientF | GCB | 5369008300, 31.62, 31.91
HS0937 | RNA-seq | 05-24561 | GCB | 3711758000
HS0938 | RNA-seq | 06-15922 | GCB | 3175879400
HS0939 | RNA-seq | 06-24881 | GCB | 6851078650
HS0940 | RNA-seq | 04-10134 | ABC | 3089257700
HS0941 | RNA-seq | 00-26427 | ABC | 6453544025
HS0942 | RNA-seq | 96-20883 | GCB | 825099550
HS0943 | RNA-seq | 02-22991 | ABC | 5630404000
HS0944 | RNA-seq | 05-17793 | GCB | 4437085550
HS1131 | RNA-seq | 03-30438 | GCB | 4296558950
HS1132 | RNA-seq | 94-26795 | GCB | 8115757300
HS1133, HS2707, HS2706 | RNA-seq | 05-23110, DLBCL-PatientD | GCB | 5081588150, 38.74, 40.49
HS1134 | RNA-seq | 06-24915 | ABC | 10212852500
HS1135 | RNA-seq | 07-37968 | GCB | 2275960800
HS1136, HS1788, HS1789 | RNA-seq, exome-normal, exome-tumour | 08-15460, DLBCL-PatientB | U | 1493224900, 30.10, 24.98
HS1137 | RNA-seq | 01-19969 | GCB | 3789815900
HS1138 | RNA-seq | 01-26579 | GCB | 2445673300
HS1163 | RNA-seq | OCI-Ly1 | GCB-cl | 559481000
HS1164 | RNA-seq | 01-18667 | ABC | 4720239050
HS1181 | RNA-seq | 05-24401 | GCB | 3996953650
HS1182 | RNA-seq | OCI-Ly7 | GCB-cl | 380839550
HS1183 | RNA-seq | OCI-Ly19 | GCB-cl | 510352150
HS1184 | RNA-seq | FL001 | FL | 751935450
HS1185 | RNA-seq | FL002, 05-14720 | FL | 2479201700
HS1186 | RNA-seq | FL003, 05-14545 | FL | 2412850100
HS1199 | RNA-seq | FL004, 05-19843 | FL | 2758254750
HS1200 | RNA-seq | FL005, 03-10481 | FL | 1855621450
HS1201 | RNA-seq | FL006, 04-28117 | FL | 2571816950
HS1202 | RNA-seq | FL007, 07-21038 | FL | 5464267200
HS1203 | RNA-seq | FL008, 08-10448 | FL | 2494514950
HS1204 | RNA-seq | FL009, 06-28477 | FL | 4413118200
HS1205 | RNA-seq | FL010, 01-16433 | FL | 1795709350
HS1350 | RNA-seq | 06-24718 | OTHER | 5262606150
HS1352 | RNA-seq | 06-10398 | ABC | 7324336375
HS1356 | RNA-seq | 02-22023 | GCB | 3637235375
HS1358 | RNA-seq | 05-24904 | GCB | 2793424250
HS1360 | RNA-seq | FL011, 92-33015 | FL | 2865953100
HS1361 | RNA-seq | FL012, 03-28399 | FL | 3458186450
HS1452 | RNA-seq | 92-56188 | ABC | 2596812000
HS1454 | RNA-seq | 03-30549 | GCB | 4419307700
HS1456 | RNA-seq | 03-31974 | GCB | 7001874575
HS1458 | RNA-seq | 81-52884 | ABC | 8382757225
HS1460 | RNA-seq | 05-24006 | GCB | 3185187950
HS1462, A01445, A01424 | RNA-seq, genome-normal, genome-tumour | 05-25674, DLBCL-PatientG | GCB | 6288448875, 31.37, 31.81
HS1555 | RNA-seq | 05-11328 | GCB | 4495826925
HS1556 | RNA-seq | 06-30145 | GCB | 4193592275
HS1557 | RNA-seq | 02-13818 | GCB | 3859061625
HS1558 | RNA-seq | 01-25197 | GCB | 5026022500
HS1559 | RNA-seq | 02-20170 | ABC | 5073011300
HS1974 | RNA-seq | 97-14402 | ABC | 12616303950
HS1975 | RNA-seq | 06-16716 | ABC | 11275656675
HS1976 | RNA-seq | 06-23057 | ABC | 11506266000
HS1977 | RNA-seq | 02-24725 | ABC | 13097930625
HS1978 | RNA-seq | 09-12737 | GCB | 13738185525
HS1979 | RNA-seq | 01-12047 | OTHER | 17562473550
HS1980 | RNA-seq | 01-21689 | OTHER | 14317904700
HS1981 | RNA-seq | 08-15393 | OTHER | 12181898100
HS1982 | RNA-seq | 08-15577 | OTHER | 12111208800
HS1983 | RNA-seq | 03-26817 | GCB | 12690182925
HS1984 | RNA-seq | 85-63855 | U | 13814264250
HS2047 | RNA-seq | NU-DUL-1 | ABC-cl | 2800196000
HS2048, A01426, A01453 | RNA-seq, genome-normal, genome-tumour | 05-12939, DLBCL-PatientI | ABC | 12120724350, 33.09, 26.42
HS2049 | RNA-seq | 07-32561 | OTHER | 9309776325
HS2050 | RNA-seq | 99-27137 | GCB | 7363844100
HS2051, HS2975, HS2974 | RNA-seq, genome-normal, genome-tumour | 09-33003, DLBCL-PatientM | U | 10768700700, 29.79, 28.53
HS2053 | RNA-seq | 82-57570 | U | 10023690000
HS2054 | RNA-seq | 05-12224 | GCB | 13373002275
HS2055 | RNA-seq | 07-30109 | OTHER | 9876708825
HS2056 | RNA-seq | SPEC1120 | ABC | 9445155825
HS2058 | RNA-seq | SPEC1185 | U | 9867511050
HS2059 | RNA-seq | SPEC1187 | GCB | 8820944475
HS2060 | RNA-seq | SPEC1203 | ABC | 8481193425
HS2248 | RNA-seq | 09-41082 | GCB | 10935488925
HS2249 | RNA-seq | 00-15694 | GCB | 8400794475
HS2250 | RNA-seq | 07-25012 | GCB | 9647441925
HS2251 | RNA-seq | 06-33777 | GCB | 10647233100
HS2252 | RNA-seq | 08-25894 | GCB | 13617165825
HS2604 | RNA-seq | 05-15797 | OTHER | 7319727750
HS2605, A01436, A01434 | RNA-seq | 06-15256, DLBCL-PatientH | GCB | 10830100200, 30.94, 32.82
HS2606 | RNA-seq | 08-11596 | ABC | 10853474700
HS2607 | RNA-seq | 06-25674 | ABC | 11439274125
HS2937 | RNA-seq | 06-34043 | ABC | 10613573325
HS2938 | RNA-seq | 04-20644 | U | 10444842225
HS2939 | RNA-seq | 04-36422 | ABC | 10587259200
HS2940 | RNA-seq | 06-18547 | GCB | 11794516050
HS3014 | RNA-seq | 07-17613 | ABC | 9491902250
HS3105, A03290, A03291 | RNA-seq | 06-14634, DLBCL-PatientC | GCB | 9127296000, 30.31, 29.14
HS3120 | RNA-seq | 04-29264 | GCB | 9994753125
HS3129 | RNA-seq | 06-23907 | ABC | 8936977875
COO subtype is shown from Affymetrix expression values where available (Wright et al. 2003) and indicated in bold if it disagrees with the pGCB value based on RNA-seq-derived expression values.

B.2 109 recurrently mutated genes identified in the RNA-seq cohort

Gene Symbol | NS | S | T | Total cases with cSNVs | Confirmed somatic NS or T in RNA-seq cohort
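BCL2 | 42 | 45 | 0 | 42 | 43
EZH2 | 33 | 0 | 0 | 33 | 33
TP53 | 20 | 2 | 1 | 21 | 22
PIM1 | 20 | 19 | 0 | 20 | 11
TNFRSF14 | 7 | 1 | 7 | 14 | 11
MLL2 | 16 | 8 | 17 | 33 | 10
MEF2B | 20 | 2 | 0 | 20 | 10
BTG1 | 11 | 6 | 2 | 13 | 10
SGK1 | 18 | 6 | 6 | 24 | 9
CREBBP | 20 | 7 | 4 | 24 | 9
MYD88 | 13 | 2 | 0 | 13 | 9
CCND3 | 7 | 1 | 2 | 9 | 6
HIST1H1C | 9 | 0 | 0 | 9 | 6
GNA13 | 21 | 1 | 2 | 23 | 5
IRF4 | 9 | 4 | 0 | 9 | 5
CD79B | 7 | 2 | 1 | 8 | 5
KLHL6 | 10 | 2 | 2 | 12 | 4
ETS1 | 10 | 1 | 0 | 10 | 4
STAT3 | 9 | 0 | 0 | 9 | 4
FOXO1 | 8 | 4 | 0 | 8 | 4
B2M | 7 | 0 | 0 | 7 | 4
BCL10 | 2 | 0 | 4 | 6 | 4
TMEM30A | 1 | 0 | 4 | 5 | 4
IRF8 | 11 | 5 | 3 | 14 | 3
CARD11 | 14 | 3 | 0 | 14 | 3
EP300 | 8 | 7 | 1 | 9 | 3
RAPGEF1 | 8 | 3 | 0 | 8 | 3
TTC27 | 6 | 0 | 0 | 6 | 3
CD70 | 5 | 0 | 1 | 6 | 3
PASK | 5 | 5 | 0 | 5 | 3
HIST1H1D | 4 | 1 | 0 | 4 | 3
PDS5B | 4 | 1 | 0 | 4 | 3
CMYA5 | 4 | 0 | 0 | 4 | 3
MPDZ | 4 | 0 | 0 | 4 | 3
HIST1H2AG | 4 | 0 | 0 | 4 | 3
HIST1H2AC | 4 | 5 | 0 | 4 | 3
PACS1 | 3 | 0 | 0 | 3 | 3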
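MALT1 | 3 | 0 | 0 | 3 | 3
WDFY3 | 3 | 4 | 0 | 3 | 3
NLRP11 | 3 | 2 | 0 | 3 | 3
HIST1H2BO | 3 | 0 | 0 | 3 | 3
MUC16 | 17 | 12 | 0 | 17 | 2
BTG2 | 12 | 6 | 1 | 13 | 2
ABCA7 | 12 | 3 | 0 | 12 | 2
BCL6 | 11 | 2 | 0 | 11 | 2
TAF1 | 10 | 0 | 0 | 10 | 2
RNF213 | 10 | 8 | 0 | 10 | 2
SAMD9 | 9 | 2 | 0 | 9 | 2
COL4A2 | 8 | 2 | 0 | 8 | 2
HDAC7 | 8 | 4 | 0 | 8 | 2
PRKDC | 7 | 3 | 0 | 7 | 2
TET2 | 6 | 5 | 0 | 6 | 2
MAP4K1 | 6 | 21 | 0 | 6 | 2
FAS | 2 | 0 | 4 | 6 | 2
ZFHX3 | 6 | 6 | 0 | 6 | 2
RFTN1 | 6 | 0 | 0 | 6 | 2
HUWE1 | 6 | 1 | 0 | 6 | 2
STAT6 | 6 | 0 | 0 | 6 | 2
CIITA | 5 | 3 | 0 | 5 | 2
DNM2 | 5 | 2 | 0 | 5 | 2
CNOT1 | 5 | 1 | 0 | 5 | 2
KIF20B | 4 | 2 | 1 | 5 | 2
ITPR2 | 5 | 5 | 0 | 5 | 2
CD58 | 2 | 0 | 3 | 5 | 2
DMD | 5 | 1 | 0 | 5 | 2
LRRK1 | 5 | 3 | 0 | 5 | 2
SENP6 | 4 | 1 | 0 | 4 | 2
POSTN | 4 | 0 | 0 | 4 | 2
CTCF | 4 | 0 | 0 | 4 | 2
FAM40B | 4 | 0 | 0 | 4 | 2
IQGAP1 | 4 | 5 | 0 | 4 | 2
HIST1H2BC | 4 | 1 | 0 | 4 | 2
JMJD2C | 3 | 2 | 0 | 3 | 2
MCTP2 | 3 | 2 | 0 | 3 | 2
P2RY8 | 3 | 0 | 0 | 3 | 2
DNAH8 | 3 | 0 | 0 | 3 | 2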
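CXCR5 | 2 | 2 | 1 | 3 | 2
NR3C1 | 3 | 1 | 0 | 3 | 2
SMG7 | 3 | 0 | 0 | 3 | 2
KIAA0100 | 3 | 2 | 0 | 3 | 2
HIST1H2BD | 3 | 1 | 0 | 3 | 2
UBAP2 | 3 | 1 | 0 | 3 | 2
RNF135 | 2 | 1 | 1 | 3 | 2
RBL2 | 3 | 1 | 0 | 3 | 2
GPR153 | 3 | 0 | 0 | 3 | 2
NMD3 | 3 | 2 | 0 | 3 | 2
MTR | 3 | 1 | 0 | 3 | 2
KIF27 | 2 | 0 | 0 | 2 | 2
SPEF2 | 2 | 1 | 0 | 2 | 2
FAT2 | 2 | 1 | 0 | 2 | 2
ITSN1 | 2 | 1 | 0 | 2 | 2
SCLY | 2 | 1 | 0 | 2 | 2
SPTBN5 | 2 | 0 | 0 | 2 | 2
ADAMTS12 | 2 | 0 | 0 | 2 | 2
ZNF134 | 1 | 1 | 1 | 2 | 2
HCK | 2 | 1 | 0 | 2 | 2
WDR76 | 2 | 0 | 0 | 2 | 2
MTERF | 2 | 1 | 0 | 2 | 2
BBX | 2 | 2 | 0 | 2 | 2
HIST1H3D | 2 | 2 | 0 | 2 | 2
PXDN | 2 | 2 | 0 | 2 | 2
ALG10 | 2 | 0 | 0 | 2 | 2
PAXIP1 | 2 | 1 | 0 | 2 | 2
TLE4 | 2 | 2 | 0 | 2 | 2
PLCE1 | 2 | 0 | 0 | 2 | 2
LPCAT3 | 2 | 0 | 0 | 2 | 2
CNTNAP5 | 2 | 0 | 0 | 2 | 2
MYO5B | 2 | 0 | 0 | 2 | 2
BCAS3 | 2 | 3 | 0 | 2 | 2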
These numbers reflect variants detected only in the RNA-seq cohort (n=127) and do not include indels or mutations detected in the expanded cohorts by other means (namely those identified in MLL2 and MEF2B by targeted resequencing).

B.3 Somatic mutations showing allelic imbalance

Gene | Library | Reference allele (total reads) | Mutant allele (total reads) | Ref-Mut ratio | Mutations | P (raw) | P (Bonferroni)
ABCA7 | HS2058 | 78 | 34 | 2.29 | S268L | 3.91E-05 | 6.65E-03
ALG10 | HS2605 | 70 | 24 | 2.92 | V446I | 2.20E-06 | 3.74E-04
ALG10 | HS2252 | 42 | 12 | 3.50 | L299F | 5.21E-05 | 8.86E-03
B2M | HS1133 | 2403 | 575 | 4.18 | L12Q | 6.42E-264 | 1.09E-261
B2M | HS2250 | 2748 | 510 | 5.39 | M1R | 0 | 0.00E+00
B2M | HS0639 | 1699 | 300 | 5.66 | M1L | 1.30E-236 | 2.21E-234
B2M | HS0653 | 1609 | 273 | 5.89 | Y86N | 4.71E-230 | 8.01E-228
BBX | HS2051 | 126 | 34 | 3.71 | A485V | 1.25E-13 | 2.13E-11
BCAS3 | HS1204 | 47 | 12 | 3.92 | I601F | 5.13E-06 | 8.71E-04
BCL10 | HS1984 | 322 | 202 | 1.59 | L225F, S136* | 1.78E-07 | 3.03E-05
BCL10 | HS0936 | 274 | 86 | 3.19 | L225*, S227L, T229S | 5.53E-24 | 9.40E-22
BCL2 | HS1163 | 0 | 197 | 0.00 | A131V, F49L, T187I | 9.96E-60 | 1.69E-57
BCL2 | HS2249 | 18 | 1187 | 0.02 | I48V, P71S, R6I | 0.00E+00 | 0.00E+00
BCL2 | HS1181 | 3 | 133 | 0.02 | A131V, E135D | 9.63E-36 | 1.64E-33
BCL2 | HS0749 | 2 | 75 | 0.03 | V156A, Y28F | 3.98E-20 | 6.76E-18
BCL2 | HS2250 | 119 | 3506 | 0.03 | L86F, M16K, N172S, P59A, Q52R, T7K | 0.00E+00 | 0.00E+00
BCL2 | HS1185 | 11 | 266 | 0.04 | A60T | 1.30E-64 | 2.20E-62
BCL2 | HS0841 | 1 | 21 | 0.05 | G47D, P59L | 1.10E-05 | 1.86E-03
BCL2 | HS1138 | 2 | 38 | 0.05 | G5V | 1.49E-09 | 2.54E-07
BCL2 | HS0656 | 7 | 85 | 0.08 | D34G, S51Y, T132S | 3.85E-18 | 6.54E-16
BCL2 | HS1462 | 6 | 71 | 0.08 | P59L | 3.42E-15 | 5.81E-13
BCL2 | HS2054 | 38 | 403 | 0.09 | F104L, S167A | 4.44E-78 | 7.54E-76
BCL2 | HS1555 | 10 | 96 | 0.10 | H20Q, S50P | 8.74E-19 | 1.49E-16
BCL2 | HS0930 | 13 | 112 | 0.12 | A60V, P59S | 8.12E-21 | 1.38E-18
BCL2 | HS0637 | 19 | 150 | 0.13 | K17N | 1.88E-26 | 3.19E-24
BCL2 | HS1456 | 18 | 140 | 0.13 | N11Y | 1.35E-24 | 2.29E-22
BCL2 | HS1131 | 9 | 58 | 0.16 | H20Q, M16R | 6.81E-10 | 1.16E-07
BCL2 | HS1199 | 91 | 585 | 0.16 | A60V, F104L, M16V, P59S, S117R, Y21H | 3.28E-89 | 5.57E-87
BCL2 | HS0933 | 12 | 74 | 0.16 | A60D, P57A | 4.69E-12 | 7.97E-10
BCL2 | HS1558 | 12 | 71 | 0.17 | A76T, S87N, V35M, Y108S | 2.39E-11 | 4.07E-09
BCL2 | HS2251 | 23 | 110 | 0.21 | N143S | 8.36E-15 | 1.42E-12
BCL2 | HS2055 | 61 | 289 | 0.21 | F49S, P46A, T7R | 1.29E-36 | 2.19E-34
BCL2 | HS1203 | 28 | 129 | 0.22 | A45S, H20Q, P57S, T69A | 1.08E-16 | 1.83E-14
BCL2 | HS0640 | 66 | 202 | 0.33 | A131G, A2T, H3D | 3.27E-17 | 5.56E-15
BCL2 | HS1186 | 70 | 214 | 0.33 | F49L, P53S, R129H | 3.93E-18 | 6.67E-16
BCL2 | HS1454 | 20 | 61 | 0.33 | A80V, T74I | 5.66E-06 | 9.62E-04
BCL2 | HS1204 | 132 | 368 | 0.36 | D31N, P59S, V66I, Y9F | 8.34E-27 | 1.42E-24
BCL2 | HS0644 | 434 | 944 | 0.46 | R146K, S51P, V159A, Y21C | 8.98E-44 | 1.53E-41
BCL2 | HS0939 | 2071 | 1176 | 1.76 | A60V, G27D, N11K | 3.61E-56 | 6.14E-54
BCL2 | HS1164 | 114 | 49 | 2.33 | E165K | 3.81E-07 | 6.47E-05
BCL2 | HS1979 | 84 | 22 | 3.82 | T125S | 1.02E-09 | 1.73E-07
BCL6 | HS2252 | 25 | 2041 | 0.01 | A587T | 0 | 0.00E+00
BLR1 | HS0804 | 303 | 185 | 1.64 | Q350*, Y222S | 1.03E-07 | 1.75E-05
BTG1 | HS1557 | 26 | 85 | 0.31 | R27H | 1.70E-08 | 2.89E-06
BTG1 | HS1974 | 1007 | 697 | 1.44 | Q36H | 6.10E-14 | 1.04E-11
BTG1 | HS2054 | 1356 | 764 | 1.77 | E46D | 3.44E-38 | 5.85E-36
BTG1 | HS1137 | 229 | 121 | 1.89 | Q38E | 8.21E-09 | 1.40E-06
BTG1 | HS3105 | 1010 | 476 | 2.12 | L94V | 2.02E-44 | 3.44E-42
BTG1 | HS1458 | 248 | 74 | 3.35 | Q36H | 4.23E-23 | 7.19E-21
CARD11 | HS1458 | 39 | 797 | 0.05 | F123I | 8.46E-185 | 1.44E-182
CARD11 | HS2252 | 314 | 939 | 0.33 | F123C | 1.16E-72 | 1.98E-70
CARD11 | HS0934 | 298 | 519 | 0.57 | Q364R | 9.65E-15 | 1.64E-12
CARD11 | HS2251 | 270 | 106 | 2.55 | D394N | 1.28E-17 | 2.17E-15
CARD11 | HS2050 | 37 | 9 | 4.11 | G116D | 4.06E-05 | 6.90E-03
CARD11 | HS1360 | 83 | 14 | 5.93 | N230K | 4.22E-13 | 7.18E-11
CCND3 | HS1556 | 177 | 104 | 1.70 | I240R | 1.57E-05 | 2.68E-03
CCND3 | HS2251 | 287 | 53 | 5.42 | V237D | 5.22E-40 | 8.87E-38
CD70 | HS0640 | 8 | 214 | 0.04 | L60R | 3.97E-53 | 6.75E-51
CD79B | HS0940 | 4 | 2406 | 0.00 | Y92S | 0 | 0.00E+00
CD79B | HS0927 | 3 | 1436 | 0.00 | Y92F | 0 | 0.00E+00
CD79B | HS1352 | 984 | 2118 | 0.46 | Y92H | 4.15E-94 | 7.06E-92
CIITA | HS0640 | 208 | 128 | 1.63 | D748V | 1.50E-05 | 2.55E-03
CNOT1 | HS3105 | 600 | 310 | 1.94 | D110Y | 4.33E-22 | 7.36E-20
CNOT1 | HS0939 | 445 | 220 | 2.02 | D1972G | 1.77E-18 | 3.01E-16
CREBBP | HS0641 | 58 | 169 | 0.34 | L88Q | 9.01E-14 | 1.53E-11
CREBBP | HS2606 | 476 | 186 | 2.56 | R35C | 3.22E-30 | 5.47E-28
DNM2 | HS1133 | 576 | 308 | 1.87 | E453K | 1.43E-19 | 2.43E-17
EP300 | HS0939 | 150 | 72 | 2.08 | A1498T | 1.79E-07 | 3.04E-05
ETS1 | HS2248 | 417 | 203 | 2.05 | E22D | 5.57E-18 | 9.47E-16
ETS1 | HS3120 | 618 | 156 | 3.96 | L23F | 8.12E-66 | 1.38E-63
ETS1 | HS2060 | 154 | 30 | 5.13 | L23F | 2.74E-21 | 4.66E-19
EZH2 | HS1462 | 20 | 582 | 0.03 | A682G | 1.46E-144 | 2.48E-142
EZH2 | HS2604 | 30 | 147 | 0.20 | A692V | 9.99E-20 | 1.70E-17
EZH2 | HS2249 | 26 | 70 | 0.37 | Y646F | 8.05E-06 | 1.37E-03
EZH2 | HS1133 | 362 | 264 | 1.37 | Y646H | 0.000102666 | 1.75E-02
EZH2 | HS0637 | 220 | 138 | 1.59 | Y646F | 1.72E-05 | 2.92E-03
EZH2 | HS1454 | 99 | 47 | 2.11 | Y646F | 2.02E-05 | 3.43E-03
EZH2 | HS1138 | 77 | 35 | 2.20 | Y646N | 8.98E-05 | 1.53E-02
EZH2 | HS2054 | 238 | 71 | 3.35 | Y646F | 3.15E-22 | 5.36E-20
EZH2 | HS1131 | 96 | 22 | 4.36 | Y646F | 3.26E-12 | 5.54E-10
EZH2 | HS2050 | 214 | 43 | 4.98 | Y646H | 1.82E-28 | 3.10E-26
FAS | HS0804 | 38 | 10 | 3.80 | Q255* | 6.17E-05 | 1.05E-02
GNA13 | HS2252 | 80 | 1330 | 0.06 | T203A | 9.25E-293 | 1.57E-290
GNA13 | HS1982 | 836 | 250 | 3.34 | H345P | 2.86E-74 | 4.86E-72
GNA13 | HS2251 | 437 | 125 | 3.50 | *378R | 1.67E-41 | 2.83E-39
HDAC7A | HS1974 | 193 | 115 | 1.68 | A786T | 1.04E-05 | 1.76E-03
HIST1H1C | HS2607 | 40 | 257 | 0.16 | P118S | 5.91E-40 | 1.00E-37
HIST1H1C | HS0936 | 596 | 368 | 1.62 | A180P | 2.06E-13 | 3.50E-11
HIST1H1C | HS3105 | 146 | 82 | 1.78 | A185G | 2.69E-05 | 4.58E-03
HIST1H2AC | HS2051 | 394 | 980 | 0.40 | K120R | 7.10E-58 | 1.21E-55
IQGAP1 | HS3105 | 182 | 64 | 2.84 | L937R | 2.73E-14 | 4.64E-12
IRF4 | HS1976 | 1 | 294 | 0.00 | Q60K | 9.30E-87 | 1.58E-84
IRF4 | HS0934 | 4 | 748 | 0.01 | I32V, L40V, S18T | 1.12E-216 | 1.91E-214
IRF4 | HS1134 | 99 | 1056 | 0.09 | Q60H, S18R, S48R | 1.00E-202 | 1.71E-200
IRF4 | HS2050 | 226 | 76 | 2.97 | S48R | 1.90E-18 | 3.24E-16
IRF8 | HS1202 | 888 | 678 | 1.31 | *427L | 1.23E-07 | 2.09E-05
ITPR2 | HS2605 | 628 | 280 | 2.24 | E2470Q | 1.98E-31 | 3.36E-29
ITPR2 | HS1462 | 84 | 26 | 3.23 | R1200Q | 2.61E-08 | 4.44E-06
JMJD2C | HS3105 | 56 | 14 | 4.00 | T794A | 4.30E-07 | 7.31E-05
JMJD2C | HS1978 | 136 | 21 | 6.48 | T1042A | 8.10E-22 | 1.38E-19
KLHL6 | HS0647 | 234 | 96 | 2.44 | S83T | 1.93E-14 | 3.28E-12
KLHL6 | HS1975 | 73 | 23 | 3.17 | T53S | 3.11E-07 | 5.29E-05
KLHL6 | HS0943 | 407 | 120 | 3.39 | L45*, T53I | 1.63E-37 | 2.78E-35
LRRK1 | HS0936 | 190 | 92 | 2.07 | L995H | 5.41E-09 | 9.20E-07
MALT1 | HS3105 | 78 | 30 | 2.60 | V370F | 4.31E-06 | 7.34E-04
MAP4K1 | HS1136 | 48 | 14 | 3.43 | T228I | 1.74E-05 | 2.96E-03
MBOAT5 | HS1458 | 113 | 46 | 2.46 | T298M | 1.09E-07 | 1.85E-05
MBOAT5 | HS1133 | 312 | 100 | 3.12 | I339F | 1.84E-26 | 3.13E-24
MEF2B | HS0938 | 4 | 187 | 0.02 | D83A | 3.50E-50 | 5.95E-48
MEF2B | HS0639 | 4 | 69 | 0.06 | D83V | 2.44E-16 | 4.15E-14
MEF2B | HS2249 | 1671 | 2774 | 0.60 | Y69H | 5.41E-62 | 9.19E-60
MLL2 | HS0943 | 41 | 9 | 4.56 | R4634C | 5.61E-06 | 9.54E-04
MYD88 | HS2606 | 800 | 589 | 1.36 | L252P | 1.65E-08 | 2.81E-06
MYD88 | HS1559 | 186 | 116 | 1.60 | L252P | 6.69E-05 | 1.14E-02
MYD88 | HS2056 | 653 | 329 | 1.98 | L252P | 2.40E-25 | 4.08E-23
MYD88 | HS0935 | 93 | 46 | 2.02 | L252P | 8.27E-05 | 1.41E-02
MYD88 | HS0748 | 549 | 189 | 2.90 | S206C | 1.58E-41 | 2.69E-39
MYD88 | HS0936 | 252 | 71 | 3.55 | S206C | 6.59E-25 | 1.12E-22
MYD88 | HS0649 | 87 | 21 | 4.14 | L252P | 9.90E-11 | 1.68E-08
NLRP11 | HS1462 | 128 | 250 | 0.51 | R867K | 3.44E-10 | 5.85E-08
NMD3 | HS0936 | 342 | 62 | 5.52 | S297N | 5.30E-48 | 9.01E-46
NR3C1 | HS0640 | 55 | 20 | 2.75 | Y433H | 6.49E-05 | 1.10E-02
PACS1 | HS2604 | 226 | 45 | 5.02 | T471M | 3.48E-30 | 5.92E-28
PASK | HS3105 | 146 | 274 | 0.53 | A689S | 4.25E-10 | 7.23E-08
PDS5B | HS1134 | 13 | 45 | 0.29 | K303E | 3.01E-05 | 5.12E-03
PIM1 | HS0929 | 215 | 137 | 1.57 | L2V, S146R | 3.79E-05 | 6.45E-03
PIM1 | HS2053 | 164 | 100 | 1.64 | S146R | 9.82E-05 | 1.67E-02
PIM1 | HS2252 | 532 | 285 | 1.87 | E79D, Q37H | 4.22E-18 | 7.17E-16
PIM1 | HS1204 | 80 | 25 | 3.20 | L164F | 6.88E-08 | 1.17E-05
PIM1 | HS1199 | 131 | 24 | 5.46 | L255V, L25V | 4.87E-19 | 8.27E-17
RAPGEF1 | HS3120 | 277 | 160 | 1.73 | V16L | 2.39E-08 | 4.06E-06
RAPGEF1 | HS2051 | 383 | 177 | 2.16 | Y265N | 1.90E-18 | 3.23E-16
RAPGEF1 | HS2054 | 479 | 170 | 2.82 | S53N | 5.79E-35 | 9.84E-33
RFTN1 | HS0747 | 863 | 526 | 1.64 | P205S | 1.31E-19 | 2.22E-17
RNF135 | HS1462 | 86 | 22 | 3.91 | T189S | 3.99E-10 | 6.78E-08
SENP6 | HS1978 | 131 | 64 | 2.05 | E890D | 1.83E-06 | 3.11E-04
SGK | HS2050 | 192 | 321 | 0.60 | K19M, R22M | 1.34E-08 | 2.28E-06
SGK | HS0944 | 496 | 277 | 1.79 | F113L | 2.94E-15 | 4.99E-13
SGK | HS1136 | 108 | 48 | 2.25 | A105P | 1.75E-06 | 2.97E-04
SGK | HS2053 | 8033 | 2985 | 2.69 | A115E, E136*, T229N | 0.00E+00 | 0.00E+00
SGK | HS0650 | 212 | 78 | 2.72 | A193G, Q30*, S242T | 1.77E-15 | 3.02E-13
SGK | HS1557 | 545 | 141 | 3.87 | P65S, R211T | 6.95E-57 | 1.18E-54
SGK | HS2604 | 2089 | 408 | 5.12 | H153Q, N34K, P67S, T5I | 8.15E-271 | 1.38E-268
STAT3 | HS0653 | 12 | 43 | 0.28 | N567K | 3.31E-05 | 5.62E-03
STAT3 | HS0933 | 326 | 144 | 2.26 | E616K | 2.76E-17 | 4.68E-15
STAT3 | HS0944 | 467 | 98 | 4.77 | D566N | 1.46E-58 | 2.48E-56
STAT6 | HS2059 | 665 | 189 | 3.52 | D419G | 8.32E-63 | 1.41E-60
TMEM30A | HS0934 | 43 | 11 | 3.91 | R254* | 1.40E-05 | 2.38E-03
TMEM30A | HS0928 | 102 | 22 | 4.64 | D155E, Y157* | 1.66E-13 | 2.82E-11
TNFRSF14 | HS1202 | 12 | 130 | 0.09 | C57* | 3.43E-26 | 5.84E-24
TNFRSF14 | HS2249 | 75 | 26 | 2.88 | W201* | 1.12E-06 | 1.90E-04
TNFRSF14 | HS0637 | 230 | 77 | 2.99 | G60D | 7.47E-19 | 1.27E-16
TNFRSF14 | HS1185 | 118 | 35 | 3.37 | Y47* | 1.02E-11 | 1.74E-09
TNFRSF14 | HS1460 | 125 | 26 | 4.81 | Q95* | 1.00E-16 | 1.70E-14
TNFRSF14 | HS1137 | 102 | 20 | 5.10 | C53R | 1.96E-14 | 3.34E-12
TNFRSF14 | HS1350 | 170 | 32 | 5.31 | S112C | 6.41E-24 | 1.09E-21
TNFRSF14 | HS1981 | 113 | 19 | 5.95 | W12* | 1.81E-17 | 3.08E-15
TNFRSF14 | HS2059 | 221 | 34 | 6.50 | W12* | 9.16E-35 | 1.56E-32
TP53 | HS1182 | 0 | 41 | 0.00 | G152D | 9.09E-13 | 1.55E-10
TP53 | HS1181 | 12 | 339 | 0.04 | S122R | 2.73E-84 | 4.64E-82
TP53 | HS0648 | 6 | 42 | 0.14 | Y33D | 1.01E-07 | 1.71E-05
TP53 | HS2049 | 58 | 344 | 0.17 | C83F | 1.46E-50 | 2.47E-48
TP53 | HS2053 | 94 | 554 | 0.17 | R155W | 3.05E-80 | 5.18E-78
TP53 | HS1980 | 124 | 708 | 0.18 | I255F | 4.48E-100 | 7.61E-98
TP53 | HS2250 | 94 | 411 | 0.23 | Y107D | 2.82E-48 | 4.80E-46
TP53 | HS0641 | 124 | 498 | 0.25 | Y141N | 5.24E-54 | 8.91E-52
TP53 | HS0944 | 197 | 116 | 1.70 | P278L | 5.47E-06 | 9.30E-04
TP53 | HS1462 | 736 | 173 | 4.25 | R155W | 2.75E-83 | 4.67E-81
TP53 | HS1460 | 135 | 26 | 5.19 | M144L | 5.90E-19 | 1.00E-16
TTC27 | HS0936 | 66 | 27 | 2.44 | R528H | 6.47E-05 | 1.10E-02
    191 B.4 All MEF2B mutations identified Case (res_id) Position (chromosome) Change (DNA) Change (protein) Diagnosis and subtype (subtyping method) 03-31934 chr19:19122543 A > T M1K FL 02-17440 chr19:19122535 T > C K4E GCB DLBCL (GEP) 98-17403 chr19:19122535 T > C K4E DLBCL 06-20044 chr19:19122535§ T > C K4E FL 06-23741 chr19:19122535§ T > C K4E FL 07-14540 chr19:19122535 T > C K4E FL 98-14740 chr19:19122535 T > C K4E FL 05-15463 chr19:19122532 T > C K5E FL 03-28045 chr19:19122523 T > C I8V DLBCL 92-59893 chr19:19122502 T > C R15G DLBCL 02-28712 chr19:19122493 G > A Q18* DLBCL 05-22052 chr19:19121225 T > C K23R DLBCL 07-10201 chr19:19121222 C > T R24Q FL SPEC1187 chr19:19121217 A > C F26V GCB DLBCL (GEP) 06-20952 chr19:19121195 T > G Y33S FL 03-18669 chr19:19121153 A > G I47T DLBCL 03-33888 chr19:19121135 C > T R53H DLBCL 01-16433 chr19:19121093§ A > C L67R FL 00-15694 chr19:19121088§ A > G Y69H GCB DLBCL (GEP) 05-11328 chr19:19121088 A > G Y69H GCB DLBCL (GEP) 06-12968 chr19:19121087§ T > C Y69C FL 06-18193 chr19:19121087 T > C Y69C FL 08-10448 chr19:19121087 T > C Y69C FL 99-30068 chr19:19121087 T > C Y69C FL 05-11369 chr19:19121066 -GGGGCT E74-P75- H76 > D FL 06-23851 chr19:19121066 T > C H76R FL 07-21828 chr19:19121064 C > T E77K DLBCL 07-30109 chr19:19121063 T > C E77G Composite FL 06-30145 chr19:19121052§ T > A N81Y GCB DLBCL (GEP) 05-23110 chr19:19121050§ G > T N81K GCB DLBCL (GEP) 00-13940 chr19:19121045 T > G D83A GCB DLBCL (IHC) 06-15922 chr19:19121045§ T > G D83A GCB DLBCL (GEP) 07-23804 chr19:19121045 T > G D83A GCB DLBCL (GEP) 00-22287 chr19:19121045 T > A D83V GCB DLBCL (IHC) 01-18672 chr19:19121045 T > A D83V GCB DLBCL (IHC) 02-30647 chr19:19121045§ T > A D83V GCB DLBCL (GEP) 03-11110 chr19:19121045 T > A D83V DLBCL 03-26817 chr19:19121045 T > A D83V GCB DLBCL (GEP) 03-30438 chr19:19121045 T > A D83V GCB DLBCL (GEP) 05-24666 chr19:19121045 T > A D83V GCB DLBCL (GEP) 06-30025 chr19:19121045§ T > A D83V GCB DLBCL (GEP)  192 Case (res_id) Position 
(chromosome) | Change (DNA) | Change (protein) | Diagnosis and subtype (subtyping method)
06-33777 | chr19:19121045§ | T > A | D83V | GCB DLBCL (GEP)
78-60284 | chr19:19121045 | T > A | D83V | GCB DLBCL (IHC)
95-32814 | chr19:19121045§ | T > A | D83V | GCB DLBCL (GEP)
97-10270 | chr19:19121045 | T > A | D83V | DLBCL
DB | chr19:19121045 | T > A | D83V | GCB DLBCL (GEP)
06-11109 | chr19:19121045 | T > G | D83A | FL
07-20462 | chr19:19121045 | T > G | D83A | FL
91-34915 | chr19:19121045 | T > G | D83A | FL
03-16286 | chr19:19121045 | T > C | D83G | FL
05-12024 | chr19:19121045 | T > A | D83V | FL
06-22766 | chr19:19121045 | T > A | D83V | FL
06-33903 | chr19:19121045 | T > A | D83V | FL
89-30159 | chr19:19121045 | T > A | D83V | FL
91-53679 | chr19:19121045 | T > A | D83V | FL
97-23234 | chr19:19121045 | T > A | D83V | FL
99-21548 | chr19:19121045 | T > A | D83V | FL
01-24821 | chr19:19119600 | +A | L100fs | FL
85-31959 | chr19:19119578 | C > A | E108* | FL
06-16716 | chr19:19119559‡ | C > T | R114Q | ABC DLBCL (GEP)
02-18484 | chr19:19119539 | 10 bp del | G121fs | FL
91-53679 | chr19:19118877 | -GGAA | F170fs | FL
08-15460 | chr19:19118875 | -AAGG | P169fs | DLBCL
06-10398 | chr19:19118406 | +GG | G242fs | ABC DLBCL (GEP)
06-30389 | chr19:19118365 | -C | P256fs | FL
07-18609 | chr19:19117831 | A > C | S294R† | FL
05-20543 | chr19:19117794 | G > T | R307S† | ABC DLBCL (GEP)
05-14545 | chr19:19117608 | A > G | *369G† | FL
06-23851 | chr19:19117608 | A > C | *369E† | FL
06-12557 | chr19:19117606 | C > G | *369Y† | FL

GEP: Subtype determined using gene expression profiling (Affymetrix); IHC: Subtype determined using immunohistochemistry

Appendix C  Supplementary data from the sequencing of 11 Ph-like ALL cases

C.1 Fusions detected with deFuse or MOSAIK or previously identified

Sample ID / Library | Read pairs | Gene 1 | Gene 1 chr | Gene 1 location | Gene 2 | Gene 2 chr | Gene 2 location | Fusion product
PAKTAL / HS0897 | 29 | STRN3 | 14 | exon 9 | JAK2 | 9 | exon 17 | STRN3-JAK2*,#
PAKTAL / HS0897 | 22 | C12orf35 | 12 | intron 4 | AMN1 | 12 | intron 6 | C12orf35-AMN1*
PAKTAL / HS0897 | 5 | FAM23A | 10 | exon 3 | MRC1 | 10 | exon 2 | FAM23A-MRC1*
PAKKCA / HS1584 | 66 | EBF1 | 5 | exon 15 | PDGFRB | 5 | exon 11 | EBF1-PDGFRB*,#
PAKKCA / HS1584 | 6 | SEMA6A | 5 | exon 1 | FEM1C | 5 | exon 2 | SEMA6A-FEM1C#
PAKKCA / HS1584 | 5 | DOCK8 | 9 | exon 3-4 | CBWD2 | 2 | intron 10 | DOCK8-CBWD2*
PAKVKK / HS0894 | 15 | NUP214 | 9 | exon 34 | ABL1 | 9 | exon 3 | NUP214-ABL1*,#
PAKVKK / HS0894 | 6 | SEMA6A | 5 | exon 1 | FEM1C | 5 | exon 2 | SEMA6A-FEM1C#
PAKVKK / HS0894 | 6 | TPM4 | 19 | exon 2 | KLF2 | 19 | exon 3 | TPM4-KLF2#
PAKVKK / HS0894 | 17 | TSHZ2 | 20 | intron 2 | SLC35A1 | 6 | intron 5 | TSHZ2-SLC35A1*
PALIBN / HS1535 | 53 | IGH@ | 14 | NA | EPOR | 19 | | IGH@-EPOR*
PAKYEP / HS1534 | 33 | BCR | 22 | exon 1 | JAK2 | 9 | exon 15 | BCR-JAK2*,#
PAMDRM / HS1576 | | IGH@ | | NA | CRLF2 | | | IGH@-CRLF2**
PAMDRM / HS1576 | 8 | OAZ1 | 19 | exon 1 | KLF2 | 19 | exon 3 | OAZ1-KLF2#
PAMDRM / HS1576 | 12 | TPM4 | 19 | exon 2 | KLF2 | 19 | exon 3 | TPM4-KLF2#
PAMDRM / HS1576 | 8 | SEMA6A | | exon 1 | FEM1C | 5 | exon 2 | SEMA6A-FEM1C#
PAMDRM / HS1576 | 10 | SLC2A5 | 1 | intron 1 | BTBD7 | 14 | intron 10 | SLC2A5-BTBD7*
PAKKXB / HS1533 | | IGH@ | | | CRLF2 | | | IGH@-CRLF2**
PAKKXB / HS1533 | 13 | BTBD7 | 14 | intron 10 | SLC2A5 | 1 | intron 1 | BTBD7-SLC2A5*
PALETF / HS1537 | | | | | | | | None
PAKHZT / HS0825 | | IGH@ | | | CRLF2 | | | IGH@-CRLF2**
PALJDL / HS1536 | 5 | ZNF292 | 6 | exon 1 | SLC2A5 | 6 | exon 2 | ZNF292-SLC2A5*

* detected by deFuse; # detected by MOSAIK; ** previously identified

Appendix D  Supplementary methods

D.1 Galaxy workflow for analysis of paired tumour/normal sequence data

The analysis pipeline used to identify variants in the genomes and exomes described in Chapter 3 was converted into a Galaxy (Blankenberg et al. 2010) workflow. The pipeline takes four inputs (boxed in red): two BAM files, one from the tumour and one from the matched normal sample; a codon-lookup table for annotation; and a polymorphism table for removal of known SNPs. The output (boxed in blue) is a list of somatic calls identified with SNVMix. An SNV is considered somatic when the corresponding region in the matched normal shows no significant evidence of an SNV. Somatic SNVs affecting codons are annotated as synonymous, missense or nonsense, along with the specific amino acid altered.
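The somatic filtering and codon annotation steps described in D.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Galaxy workflow itself: the function names, the tuple representation of an SNV, and the use of exact site matching in place of SNVMix's probabilistic evidence are all hypothetical choices made for the example.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def somatic_snvs(tumour_calls, normal_calls, known_polymorphisms):
    """Keep tumour SNVs with no SNV at the same site in the matched normal
    and not listed in the polymorphism table.

    Each SNV is a hypothetical (chromosome, position, ref_base, alt_base) tuple.
    Assumption: "no significant evidence" is approximated by the absence of
    any call at that site in the normal sample.
    """
    normal_sites = {(chrom, pos) for chrom, pos, _ref, _alt in normal_calls}
    return [
        snv
        for snv in tumour_calls
        if (snv[0], snv[1]) not in normal_sites and snv not in known_polymorphisms
    ]


def classify_codon_change(ref_codon, pos_in_codon, alt_base):
    """Annotate a single-base change within a codon (coding-strand bases).

    Returns (effect, reference amino acid, altered amino acid), where effect
    is "synonymous", "missense" or "nonsense".
    """
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return effect, ref_aa, alt_aa


if __name__ == "__main__":
    # Toy data with made-up coordinates, for illustration only.
    tumour = [("chr1", 100, "T", "A"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")]
    normal = [("chr1", 200, "C", "T")]
    known = {("chr1", 300, "G", "A")}
    print(somatic_snvs(tumour, normal, known))
    print(classify_codon_change("GAC", 1, "T"))
```

In this sketch a change that converts a stop codon into an amino acid would be reported as missense; a real pipeline would likely treat such changes as a separate class.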
[Figure: Galaxy workflow. Inputs: codon lookup table, polymorphism table, BAM file (tumour), BAM file (normal). Output: annotated somatic SNVs.]

D.2 Schema and description of database

This schema shows the tables in “lymphoma_db”, a simple MySQL database designed for storage, routine retrieval and tracking of mutations derived from RNA-seq, genome and exome libraries. The Mutation and Mutation_validation tables track mutation information. The former stores details of each mutation, such as the genomic position, gene affected, amino acid change and nucleotide change. The latter tracks the design and result of validation experiments. The database can handle multiple samples from the same patient (e.g. tumour and normal) and multiple library types from the same sample (e.g. RNA-seq and WGS). Clinical characteristics of the individual patients and descriptions of the samples are tracked in the Patient and Sample tables. The Gene table stores general information about each gene, together with the pre-computed values used by the statistical approach for identifying genes under selection (Greenman et al. 2006).

[Figure: schema of lymphoma_db. Tables and their fields:]
Mutation: id, library_id, type, chromosome, position, gene, status, protein_altering, base_change, annotation, external_gene_id, region, manual_review_result, user, validation_outcome, validation_method
Library: id, library_name, library_type, bam_location, sample_id, mapped_bases, average_coverage, average_exonic_coverage, average_read_length, somatic_snvs_loaded, sequencing_complete
Gene: id, ensembl_id, gene_symbol, biotype, gene_name, refseq, has_somatic_mutation, M, L, N, M_ca, M_ct, M_cg, M_ta, M_tc, L_ca, L_ct, L_cg, L_ta, L_tc, L_tg, N_ca, N_ct, N_cg, N_ta, N_tc, N_tg
Mutation_validation: id, mutation_id, mutation_key, primer_1, primer_2, primer_index, status, comment, plate, well, flanking_sequence, user
Sample: id, patient_id, sample_type, type, disease
Patient: id, res_id, alternative_name, pfs_time, pfs_code, os_time, os_code, subtype, sample_id, treatment, p_gcb
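The relationships among the Patient, Sample, Library and Mutation tables described in D.2 can be sketched as follows. This is a hedged illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, includes only a subset of the listed columns, and the column types, foreign keys, helper function names and example rows are assumptions that are not taken from the thesis.

```python
import sqlite3

# Subset of the lymphoma_db tables. Column names follow the schema listing;
# types and foreign-key constraints are assumed for this sketch.
SCHEMA = """
CREATE TABLE Patient (id INTEGER PRIMARY KEY, res_id TEXT);
CREATE TABLE Sample (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES Patient(id),
    sample_type TEXT
);
CREATE TABLE Library (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER REFERENCES Sample(id),
    library_name TEXT,
    library_type TEXT
);
CREATE TABLE Mutation (
    id INTEGER PRIMARY KEY,
    library_id INTEGER REFERENCES Library(id),
    chromosome TEXT,
    position INTEGER,
    gene TEXT,
    base_change TEXT,
    annotation TEXT
);
"""


def build_demo_db():
    """Create an in-memory database populated with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Patient VALUES (1, 'P1')")
    # One patient with a tumour and a normal sample.
    conn.executemany(
        "INSERT INTO Sample VALUES (?, ?, ?)",
        [(1, 1, "tumour"), (2, 1, "normal")],
    )
    # Two library types prepared from the same tumour sample.
    conn.executemany(
        "INSERT INTO Library VALUES (?, ?, ?, ?)",
        [(1, 1, "LIB_A", "RNA-seq"), (2, 1, "LIB_B", "WGS")],
    )
    # The same hypothetical mutation observed in both libraries.
    conn.executemany(
        "INSERT INTO Mutation VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "chr1", 1000, "GENE_X", "T>A", "D10V"),
            (2, 2, "chr1", 1000, "GENE_X", "T>A", "D10V"),
        ],
    )
    return conn


def library_types_supporting(conn, gene):
    """Return the sorted library types in which a mutation in `gene` was seen."""
    rows = conn.execute(
        """
        SELECT DISTINCT l.library_type
        FROM Mutation m
        JOIN Library l ON m.library_id = l.id
        JOIN Sample s ON l.sample_id = s.id
        WHERE m.gene = ?
        ORDER BY l.library_type
        """,
        (gene,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print(library_types_supporting(demo, "GENE_X"))
```

Linking each mutation to a library, each library to a sample and each sample to a patient is one way to support the stated requirement that a patient can have several samples and a sample several library types.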