UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Evolutionarily conserved regulatory programs Kwon, Tae-Jun Andrew 2011

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2011_fall_kwon_taejun.pdf [ 11.17MB ]
Metadata
JSON: 24-1.0072183.json
JSON-LD: 24-1.0072183-ld.json
RDF/XML (Pretty): 24-1.0072183-rdf.xml
RDF/JSON: 24-1.0072183-rdf.json
Turtle: 24-1.0072183-turtle.txt
N-Triples: 24-1.0072183-rdf-ntriples.txt
Original Record: 24-1.0072183-source.json
Full Text
24-1.0072183-fulltext.txt
Citation
24-1.0072183.ris

Full Text

EVOLUTIONARILY	
  CONSERVED	
  REGULATORY	
  PROGRAMS	
   	
   by	
   	
   Tae-­‐Jun	
  Andrew	
  Kwon	
   	
   M.Sc.,	
  The	
  University	
  of	
  British	
  Columbia,	
  2003	
   B.Sc.,	
  The	
  University	
  of	
  British	
  Columbia,	
  2000	
   	
   	
   	
   	
   	
   	
   A	
  THESIS	
  SUBMITTED	
  IN	
  PARTIAL	
  FULFILLMENT	
  OF	
   THE	
  REQUIREMENTS	
  FOR	
  THE	
  DEGREE	
  OF	
   	
   	
   DOCTOR	
  OF	
  PHILOSOPHY	
   	
   in	
   	
   THE	
  FACULTY	
  OF	
  GRADUATE	
  STUDIES	
   	
   (Genetics)	
   	
   	
   	
   	
   	
   	
   	
   THE	
  UNIVERSITY	
  OF	
  BRITISH	
  COLUMBIA	
   (Vancouver)	
   	
   	
   	
   	
   	
   August	
  2011	
   	
   ©	
  Tae-­‐Jun	
  Andrew	
  Kwon,	
  2011	
    Abstract	
    Despite	
  the	
  diversity	
  of	
  metazoans,	
  common	
  biochemical	
  systems	
  and	
  structures	
  can	
  be	
   found	
  in	
  distinct	
  taxonomic	
  groups.	
  The	
  development	
  and	
  formation	
  of	
  metazoan	
   tissues	
  and	
  structures	
  has	
  been	
  well	
  researched,	
  but	
  their	
  regulatory	
  mechanisms	
  are	
   not	
  understood	
  well.	
  To	
  this	
  end,	
  we	
  implemented	
  bioinformatics	
  tools	
  regulatory	
   mechanism	
  analysis	
  and	
  applied	
  them	
  to	
  study	
  regulatory	
  program	
  conservation	
  with	
   an	
  emphasis	
  on	
  muscle	
  development.	
   We	
  first	
  performed	
  a	
  genome-­‐wide	
  scan	
  for	
  muscle-­‐specific	
  cis-­‐regulatory	
  modules	
   (CRMs)	
  using	
  three	
  computational	
  prediction	
  programs.	
  	
  Based	
  on	
  the	
  predictions,	
  339	
   candidate	
  CRMs	
  were	
  tested	
  in	
  cell	
  culture	
  with	
  NIH3T3	
  fibroblasts	
  and	
  C2C12	
   myoblasts	
  for	
  capacity	
  to	
  direct	
  selective	
  reporter	
  gene	
  expression	
  to	
  differentiated	
   myotubes.	
  A	
  subset	
  of	
  19	
  CRMs	
  validated	
  as	
  functional	
  in	
  the	
  assay.	
  	
  The	
  rate	
  of	
   predictive	
  success	
  reveals	
  striking	
  limitations	
  of	
  computational	
  CRM	
  discovery.	
  Motif-­‐ based	
  methods	
  performed	
  no	
  better	
  than	
  predictions	
  based	
  only	
  on	
  sequence	
   conservation.	
  Analysis	
  of	
  the	
  properties	
  of	
  the	
  functional	
  sequences	
  relative	
  to	
  inactive	
   sequences	
  identifies	
  nucleotide	
  sequence	
  composition	
  and	
  ChIP-­‐Seq	
  evidence	
  as	
   important	
  characteristics	
  to	
  incorporate	
  in	
  future	
  methods	
  for	
  improved	
  predictive	
   specificity.	
   In	
  studying	
  the	
  transcriptional	
  regulation,	
  motif	
  enrichment	
  analysis	
  of	
  co-­‐expressed	
   genes	
  is	
  often	
  employed	
  to	
  determine	
  mediating	
  transcription	
  factors.	
  We	
  built	
   oPOSSUM-­‐3,	
  a	
  web-­‐based	
  software	
  system	
  for	
  identification	
  of	
  enriched	
  transcription	
   factor	
  binding	
  sites	
  (TFBS)	
  and	
  TFBS	
  families	
  in	
  DNA	
  sequences	
  of	
  co-­‐expressed	
  genes	
   and	
  sequences	
  generated	
  from	
  high-­‐throughput	
  methods,	
  such	
  as	
  ChIP-­‐Seq.	
  Validation	
   of	
  the	
  system	
  with	
  known	
  sets	
  of	
  published	
  data	
  demonstrates	
  the	
  capacity	
  for	
   oPOSSUM-­‐3	
  to	
  identify	
  mediating	
  TFs	
  for	
  co-­‐regulated	
  genes.	
  	
    	
    ii  	
   Studies	
  have	
  shown	
  that	
  TF	
  binding	
  profiles	
  tend	
  to	
  be	
  highly	
  conserved	
  over	
  long	
   evolutionary	
  distances.	
  	
  In	
  large-­‐scale	
  public	
  genome	
  annotation	
  projects,	
  such	
  as	
   modENCODE,	
  transcriptional	
  regulation	
  data	
  is	
  compiled	
  for	
  comparative	
  genomics	
   research.	
  Using	
  the	
  oPOSSUM-­‐3	
  system	
  and	
  published	
  data,	
  we	
  performed	
  comparative	
   analyses	
  of	
  the	
  regulatory	
  programs	
  across	
  evolutionarily	
  divergent	
  species,	
  including	
   human,	
  fruit	
  fly,	
  and	
  nematode,	
  and	
  examined	
  the	
  extent	
  of	
  conservation	
  in	
  major	
   regulatory	
  programs.	
  	
   The	
  thesis	
  research	
  provides	
  new	
  approaches	
  to	
  computational	
  analysis	
  of	
  DNA	
   sequences	
  and	
  insights	
  into	
  the	
  analysis	
  of	
  transcription	
  regulation	
  across	
  the	
   phylogenetic	
  spectrum.	
   	
    	
    	
    iii  Preface	
    Chapter	
  2	
  is	
  based	
  on	
  work	
  conducted	
  in	
  collaboration	
  with	
  Alice	
  Chou.	
  I	
  collected	
  the	
   pre-­‐annotated	
  regulatory	
  regions	
  from	
  literature,	
  and	
  developed	
  the	
  software	
  and	
  the	
   protocols	
  for	
  the	
  genome-­‐wide	
  computational	
  predictions.	
  Alice	
  Chou	
  designed	
  the	
   experimental	
  protocols	
  for	
  the	
  validation	
  of	
  the	
  computational	
  predictions,	
  and	
  we	
   jointly	
  performed	
  the	
  validation	
  experiments.	
  David	
  J.	
  Arenillas	
  designed	
  the	
   oligonucleotides	
  used	
  for	
  PCR	
  in	
  the	
  experiments.	
  	
  I	
  performed	
  the	
  computational	
   analysis	
  of	
  the	
  results	
  and	
  wrote	
  the	
  whole	
  manuscript.	
  Dr.	
  Wasserman	
  supervised	
  the	
   work	
  and	
  revised	
  the	
  manuscript,	
  as	
  well	
  as	
  contributing	
  to	
  the	
  development	
  of	
  the	
   ideas.	
  A	
  version	
  of	
  chapter	
  2	
  has	
  been	
  submitted	
  for	
  publication	
  and	
  is	
  currently	
   undergoing	
  requested	
  revisions.	
  	
   Chapter	
  3	
  is	
  based	
  on	
  work	
  conducted	
  in	
  collaboration	
  with	
  David	
  J.	
  Arenillas	
  and	
   Rebecca	
  Worsley	
  Hunt.	
  I	
  was	
  responsible	
  for	
  designing	
  the	
  TFBS	
  clustering	
  algorithm	
   and	
  preparing	
  the	
  TFBS	
  clusters	
  used	
  by	
  the	
  oPOSSUM	
  system,	
  as	
  well	
  as	
  designing	
   routines	
  for	
  nematode	
  operon	
  analysis.	
  David	
  J.	
  Arenillas	
  and	
  I	
  jointly	
  participated	
  in	
   the	
  software	
  development.	
  Rebecca	
  Worsley	
  Hunt	
  was	
  responsible	
  for	
  studying	
  the	
   effects	
  of	
  sequence	
  composition	
  on	
  the	
  results	
  produced	
  by	
  the	
  oPOSSUM	
  system.	
  I	
   wrote	
  the	
  manuscript,	
  with	
  contributions	
  from	
  Rebecca	
  Worsley	
  Hunt	
  for	
  the	
  sections	
   pertaining	
  to	
  her	
  work.	
  David	
  J.	
  Arenillas	
  produced	
  two	
  figures	
  that	
  were	
  used	
  in	
  the	
   manuscript.	
  Dr.	
  Wasserman	
  supervised	
  the	
  work	
  and	
  revised	
  the	
  manuscript,	
  as	
  well	
  as	
   contributing	
  to	
  the	
  development	
  of	
  the	
  ideas.	
  A	
  version	
  of	
  chapter	
  3	
  has	
  been	
  submitted	
   for	
  publication	
  and	
  is	
  currently	
  under	
  review.	
   The	
  research	
  described	
  in	
  Chapter	
  4	
  was	
  the	
  product	
  of	
  my	
  efforts.	
  	
  Dr.	
  Wasserman	
   supervised	
  the	
  work	
  and	
  revised	
  the	
  manuscript,	
  as	
  well	
  as	
  contributing	
  to	
  the	
   development	
  of	
  the	
  ideas.	
   	
  	
   	
   	
    	
   iv  Table	
  of	
  Contents	
    Abstract	
  .............................................................................................................................................	
  ii	
   Preface	
  ..............................................................................................................................................	
  iv	
   List	
  of	
  Tables	
  ..................................................................................................................................	
  ix	
   List	
  of	
  Figures	
  .................................................................................................................................	
  xi	
   List	
  of	
  Abbreviations	
  .................................................................................................................	
  xiv	
   Acknowledgements	
  ....................................................................................................................	
  xvi	
   1	
   Introduction	
  .............................................................................................................................	
  1	
   1.1	
   Metazoan	
  Tissue	
  Development	
  .................................................................................................	
  3	
   1.1.1	
   Tissue	
  and	
  Structure	
  Conservation	
  among	
  Metazoans	
  ...........................................................	
  4	
   1.2	
   Gene	
  Regulation	
  .............................................................................................................................	
  5	
   1.2.1	
   Promoter	
  and	
  Regulatory	
  Regions	
  ...................................................................................................	
  6	
   1.2.2	
   Transcription	
  Factors	
  .............................................................................................................................	
  7	
   1.2.3	
   MicroRNA	
  ....................................................................................................................................................	
  8	
   1.2.4	
   Chromatin	
  and	
  Epigenetics	
  ..................................................................................................................	
  9	
   1.3	
   Analysis	
  of	
  Gene	
  Expression	
  ...................................................................................................	
  10	
   1.3.1	
   DNA	
  Microarrays	
  ...................................................................................................................................	
  11	
   1.3.2	
   Serial	
  Analysis	
  of	
  Gene	
  Expression	
  (SAGE)	
  ................................................................................	
  12	
   1.3.3	
   RNA-­‐Seq	
  ....................................................................................................................................................	
  12	
   1.3.4	
   Differential	
  Gene	
  Expression	
  Analysis	
  .........................................................................................	
  13	
   1.4	
   Experimental	
  Methods	
  for	
  Identification	
  of	
  Regulatory	
  Regions	
  ..............................	
  14	
   1.4.1	
   Earlier	
  Methods	
  .....................................................................................................................................	
  14	
   1.4.2	
   High	
  Throughput	
  Methods	
  ................................................................................................................	
  15	
   1.5	
   Comparative	
  Genomics	
  .............................................................................................................	
  18	
   1.5.1	
   Homology:	
  Orthologs	
  and	
  Paralogs	
  ...............................................................................................	
  19	
   1.5.2	
   Sequence	
  Alignment	
  ............................................................................................................................	
  20	
   1.5.3	
   Evolutionary	
  Constraint	
  Measures	
  ................................................................................................	
  20	
   1.6	
   Public	
  Repositories	
  of	
  Genomic	
  Data	
  ...................................................................................	
  21	
   1.6.1	
   Public	
  Genome	
  Databases	
  .................................................................................................................	
  21	
    	
    v  1.6.2	
   Ortholog	
  Databases	
  ..............................................................................................................................	
  22	
   1.6.3	
   Transcription	
  Factor	
  and	
  Regulatory	
  Region	
  Databases	
  .....................................................	
  24	
   1.7	
   Bioinformatics	
  Approaches	
  for	
  Regulatory	
  Region	
  Analysis	
  ......................................	
  25	
   1.7.1	
   TFBS	
  Representation	
  ...........................................................................................................................	
  26	
   1.7.2	
   Motif	
  Discovery	
  ......................................................................................................................................	
  29	
   1.7.3	
   Phylogenetic	
  Footprinting	
  .................................................................................................................	
  30	
   1.7.4	
   cis-­‐Regulatory	
  Module	
  Detection	
  Methods	
  ................................................................................	
  32	
   1.7.5	
   Regulome	
  ..................................................................................................................................................	
  32	
   1.8	
   Expansion	
  of	
  Regulatory	
  Region	
  Data	
  .................................................................................	
  33	
   1.8.1	
   ENCODE	
  Data	
  ..........................................................................................................................................	
  34	
   1.8.2	
   modENCODE	
  Data	
  .................................................................................................................................	
  34	
   1.9	
   Thesis	
  Overview	
  ..........................................................................................................................	
  35	
    2	
   Validation	
  of	
  skeletal	
  muscle	
  cis-­‐regulatory	
  module	
  predictions	
  reveals	
   nucleotide	
  composition	
  bias	
  in	
  functional	
  enhancers	
  ....................................................	
  38	
   2.1	
   Introduction	
  .................................................................................................................................	
  38	
   2.2	
   Methods	
  .........................................................................................................................................	
  43	
   2.2.1	
   Candidate	
  CRM	
  Region	
  Selection	
  ...................................................................................................	
  43	
   2.2.2	
   Experimental	
  Validation	
  ....................................................................................................................	
  44	
   2.2.3	
   Validated	
  Region	
  Selection	
  ...............................................................................................................	
  49	
   2.2.4	
   Data	
  Analysis	
  ...........................................................................................................................................	
  50	
   2.3	
   Results	
  ............................................................................................................................................	
  52	
   2.3.1	
   Region	
  Selection	
  ....................................................................................................................................	
  52	
   2.3.2	
   Validation	
  of	
  CRM	
  Activity	
  in	
  Cell	
  Culture	
  ..................................................................................	
  55	
   2.3.3	
   Properties	
  of	
  the	
  Validated	
  Positive	
  Set	
  ......................................................................................	
  61	
   2.4	
   Discussion	
  .....................................................................................................................................	
  85	
    3	
   oPOSSUM-­‐3:	
  Advanced	
  Analysis	
  of	
  Regulatory	
  Motif	
  Over-­‐representation	
   across	
  Genes	
  or	
  ChIP-­‐Seq	
  Datasets	
  ........................................................................................	
  88	
   3.1	
   Introduction	
  .................................................................................................................................	
  88	
   3.2	
   Methods	
  .........................................................................................................................................	
  93	
   3.2.1	
   Nomenclature	
  .........................................................................................................................................	
  93	
   3.2.2	
   Data	
  Sources	
  ............................................................................................................................................	
  93	
   3.2.3	
   Calculation	
  of	
  TFBS	
  Motif	
  Over-­‐representation	
  .......................................................................	
  94	
    	
    vi  3.2.4	
   Single	
  Site	
  Analysis	
  (SSA)	
  ..................................................................................................................	
  96	
   3.2.5	
   Anchored	
  Combination	
  Site	
  Analysis	
  (aCSA)	
  ............................................................................	
  97	
   3.2.6	
   Gene-­‐based	
  vs.	
  Sequence-­‐based	
  .....................................................................................................	
  98	
   3.2.7	
   TFBS	
  Clustering	
  .....................................................................................................................................	
  98	
   3.2.8	
   Species-­‐Specific	
  Implementation	
  Details	
  .................................................................................	
  102	
   3.2.9	
   Build	
  Process	
  for	
  oPOSSUM-­‐3	
  .......................................................................................................	
  104	
   3.2.10	
   Data	
  Sources	
  for	
  ChIP-­‐Seq	
  Based	
  Analyses	
  ..........................................................................	
  105	
   3.2.11	
   ChIP-­‐Seq	
  Related	
  Analysis	
  ...........................................................................................................	
  106	
   3.2.12	
   Analysis	
  Parameters	
  ......................................................................................................................	
  108	
   3.3	
   System	
  Walk-­‐Through	
  ............................................................................................................	
  109	
   3.3.1	
   Input	
  for	
  Gene-­‐Based	
  Analysis	
  .....................................................................................................	
  111	
   3.3.2	
   Input	
  for	
  Sequence-­‐Based	
  Analysis	
  ............................................................................................	
  111	
   3.3.3	
   Results	
  Output	
  .....................................................................................................................................	
  112	
   3.4	
   Results	
  ..........................................................................................................................................	
  113	
   3.4.1	
   Application	
  to	
  Reference	
  Collections	
  .........................................................................................	
  113	
   3.4.2	
   Skeletal	
  Muscle	
  Reference	
  Gene	
  Set	
  ...........................................................................................	
  113	
   3.4.3	
   Cilia	
  Gene	
  Set	
  for	
  Nematodes	
  ........................................................................................................	
  117	
   3.4.4	
   Results	
  for	
  ChIP-­‐Seq	
  Reference	
  Data	
  Sets	
  ...............................................................................	
  119	
   3.4.5	
   GC	
  Composition	
  Effect	
  on	
  Over-­‐Representation	
  Results	
  ..................................................	
  125	
   3.4.6	
   Z-­‐Sscore	
  vs.	
  Fisher	
  Score	
  for	
  ChIP-­‐Seq	
  Based	
  Analyses	
  ....................................................	
  127	
   3.4.7	
   Assessing	
  Impact	
  of	
  Data	
  Set	
  Size	
  on	
  ChIP-­‐Seq	
  Based	
  Analyses	
  ....................................	
  129	
   3.5	
   Discussion	
  ...................................................................................................................................	
  130	
    4	
   Leveraging	
  Evolution	
  for	
  the	
  Identification	
  of	
  Regulatory	
  Programs	
  Directing	
   Gene	
  Transcription	
  ..................................................................................................................	
  132	
   4.1	
   Introduction	
  ...............................................................................................................................	
  132	
   4.1.1	
   Muscle	
  Regulatory	
  Mechanisms	
  ..................................................................................................	
  132	
   4.1.2	
   TFs	
  Involved	
  in	
  Vertebrate	
  Muscle	
  Development	
  ................................................................	
  133	
   4.1.3	
   TFs	
  Involved	
  in	
  Invertebrate	
  Muscle	
  Development	
  ............................................................	
  134	
   4.1.4	
   Differences	
  between	
  Muscle	
  Regulatory	
  TFs	
  of	
  Vertebrates	
  and	
  Invertebrates	
  ....	
  135	
   4.1.5	
   Ciliary	
  Gene	
  Regulation	
  ...................................................................................................................	
  135	
   4.1.6	
   Motif	
  Enrichment	
  Analysis	
  .............................................................................................................	
  136	
   4.1.7	
   Regulog	
  Analysis	
  ................................................................................................................................	
  136	
   4.1.8	
   Motivation	
  and	
  Experimental	
  Approach	
  ..................................................................................	
  137	
    	
    vii  4.2	
   Methods	
  .......................................................................................................................................	
  139	
   4.2.1	
   Input	
  Gene	
  Collections	
  .....................................................................................................................	
  139	
   4.2.2	
   List	
  of	
  Known	
  TFs	
  Involved	
  in	
  Each	
  Regulatory	
  Program	
  ................................................	
  142	
   4.2.3	
   TFBS	
  Over-­‐Representation	
  Analysis	
  Using	
  oPOSSUM-­‐3	
  ...................................................	
  143	
   4.2.4	
   Analysis	
  Tools	
  ......................................................................................................................................	
  144	
   4.3	
   Results	
  ..........................................................................................................................................	
  145	
   4.3.1	
   Cilia	
  Genes	
  .............................................................................................................................................	
  147	
   4.3.2	
   Nfe2l2	
  /	
  Nrf2	
  Target	
  Genes	
  ...........................................................................................................	
  149	
   4.3.3	
   Muscle	
  Genes	
  ........................................................................................................................................	
  155	
   4.4	
   Discussion	
  ...................................................................................................................................	
  187	
    5	
   Discussion	
  and	
  Conclusions	
  ...........................................................................................	
  190	
   5.1	
   Identification	
  of	
  Tissue-­‐Specific	
  cis-­‐Regulatory	
  Modules	
  Through	
  Computational	
   Prediction	
  Programs	
  ..........................................................................................................................	
  191	
   5.2	
   TFBS	
  Over-­‐Representation	
  Analysis	
  for	
  the	
  Identification	
  of	
  Main	
  Regulatory	
  TFs	
   in	
  Co-­‐expressed	
  Genes	
  .......................................................................................................................	
  193	
   5.3	
   Utility	
  of	
  Evolution	
  in	
  Identifying	
  Conserved	
  Regulation	
  Programs	
  .......................	
  194	
   5.4	
   Future	
  Directions	
  .....................................................................................................................	
  196	
    References	
  ..................................................................................................................................	
  200	
   	
   	
    	
    	
    viii  List	
  of	
  Tables	
    Table 2-1. List of some of the published CRM prediction programs. ................................... 40	
   Table 2-2. Number of candidate regions................................................................................ 55	
   Table 2-3. List of genomic regions validated as driving muscle-specific expression. ........... 60	
   Table 2-4. Overrepresented TFBS in the validated regions vs. non-responding regions ranked by Fisher p-values. .................................................................................................................. 62	
   Table 2-5. Sequence composition characteristics. .................................................................. 64	
   Table 2-6. GC and AT skews of the responding regions vs. non-responding regions. .......... 64	
   Table 2-7. The distribution of the regions in the muscle set according to the evidence source for muscle expression. ............................................................................................................ 68	
   Table 2-8. Sequence conservation based on phastCons scores (28-way Placental Mammals). ................................................................................................................................................. 69	
   Table 2-9. Comparison of the five CRM prediction programs. .............................................. 71	
   Table 2-10. Distances to the nearest annotated transcription start sites (Ensembl v61). ........ 77	
   Table 2-11. Regions associated with CpG islands. ................................................................. 77	
   Table 2-12. Comparison of the mean phyloP scores in the three region sets for profiles with at least 2-fold increase in phyloP scores for predicted TFBS positions vs. non-TFBS positions. ................................................................................................................................. 81	
   Table 2-13. Regions overlapping MyoD ChIP-Seq peaks in C2C12 cells. ............................ 82	
    	
    ix  Table 3-1. Search region level distances for human/mouse, fly and nematode in oPOSSUM3............................................................................................................................................. 108	
   Table 3-2. oPOSSUM-3 results for the muscle reference gene set (human). ....................... 114	
   Table 3-3. oPOSSUM-3 results for the cilia gene set in nematodes. .................................... 118	
   Table 3-4. oPOSSUM-3 results for Nrf2/Nfe2L2 ChIP-Seq data set, using JASPAR CORE vertebrate profiles. ................................................................................................................ 120	
   Table 3-5. oPOSSUM-3 results for FoxA2 ChIP-Seq data set, using JASPAR CORE vertebrate profiles. ................................................................................................................ 123	
   Table 4-1. Target genes of Nfe2l2 and its orthologs cnc and skn-1. .................................... 140	
   Table 4-2. Muscle gene sets analyzed................................................................................... 141	
   Table 4-3. List of TFs available in JASPAR 2010 used in this study. ................................. 143	
   Table 4-4. Gene ontology analysis of target genes for Nfe2l2 and its orthologs using DAVID. ............................................................................................................................................... 154	
   Table 4-5. Gene ontology analysis on Meissner/Stormo-muscle gene sets. ......................... 184	
   Table 4-6. Gene ontology analysis on Schnorrer-larval gene sets. ....................................... 185	
   Table 4-7. Gene ontology analysis on Schnorrer-adult genes. ............................................. 185	
   	
   	
    	
    	
    x  List	
  of	
  Figures	
    Figure 1-1. Differentiation of human tissues. ........................................................................... 4	
   Figure 1-2. Overview of transcriptional regulation. ................................................................. 6	
   Figure 1-3. cis-Regulatory module examples. .......................................................................... 8	
   Figure 1-4. Transcription factor binding site (TFBS) and position-specific scoring matrix (PSSM).................................................................................................................................... 28	
   Figure 2-1. Biases in the location of wells with successful reporter assays. .......................... 48	
   Figure 2-2. The overview of the region selection pipeline. .................................................... 53	
   Figure 2-3. Reporter expression for known muscle-specific and non-muscle specific enhancer regions in myotubes vs. myoblasts. ........................................................................................ 56	
   Figure 2-4. Selection of clones for differential expression analysis. ...................................... 57	
   Figure 2-5. Comparison of the reporter expression levels between background regions and non-background regions in the validated positive set. ............................................................ 58	
   Figure 2-6. Logos of the 5 muscle TFBSs (SRF, MEF2A, Myf, TEAD, and SP1) used for CRM prediction and 2 additional TFBSs (RREB1, NHLH1) found to be overrepresented in the validated set....................................................................................................................... 63	
   Figure 2-7. Dinucleotide frequencies in responding regions vs. non-responding regions...... 66	
   Figure 2-8. Examples of Positive Regions.............................................................................. 70	
   Figure 2-9. ROC Analysis of the conservation filter and the 4 CRM Prediction Programs. .. 73	
    	
    xi  Figure 2-10. Density plots for phyloP (46-way All) scores of the predicted binding site positions for the three region sets. .......................................................................................... 79	
   Figure 2-11. Phylogenetic depth analysis of TFBSs in responding and non-responding regions using phyloP (46-ay, hg19). ....................................................................................... 80	
   Figure 2-12. Histone modifications in the responding and non-responding regions. ............. 84	
   Figure 3-1. Overview of the main analysis types available in oPOSSUM-3.......................... 92	
   Figure 3-2. oPOSSUM-specific JASPAR PENDING collection. .......................................... 94	
   Figure 3-3. Defining conserved regions.................................................................................. 97	
   Figure 3-4. TF structural families and TFBS clusters............................................................. 99	
   Figure 3-5. TFBS cluster analysis......................................................................................... 101	
   Figure 3-6. oPOSSUM system provisions for species with operon structures. .................... 103	
   Figure 3-7. The build process for oPOSSUM 3 gene-based analysis. ................................. 105	
   Figure 3-8. The range of enrichment scores increases as the number of foreground sequences increases. ............................................................................................................................... 107	
   Figure 3-9. oPOSSUM analysis pipeline. ............................................................................. 109	
   Figure 3-10. Web interface. .................................................................................................. 110	
   Figure 3-11. Relationship between TF profile GC content and enrichment statistics. ......... 126	
   Figure 3-12. Fisher scores vs. Z-scores from oPOSSUM analysis on sequence-based data. 128	
   Figure 4-1. Overview of the conserved regulatory program analysis. .................................. 138	
   	
    xii  Figure 4-2. Fisher vs. Z score plot for oPOSSUM TCA results on human muscle reference and validated regions from Chapter 2. .................................................................................. 146	
   Figure 4-3. Score distributions for Rfx profiles from oPOSSUM TCA performed on cilia genes collected from the Ciliome DB and Geremek et al. ................................................... 148	
   Figure 4-4. oPOSSUM TCA results on Nfe2l2, cnc and skn-1 target genes. ....................... 151	
   Figure 4-5. Logos of Nfe2l2 orthologs in flies and nematodes ............................................ 153	
   Figure 4-6. Enrichment of Mef2A motif in each gene set, as measured by the distance between the combined score of Mef2A motif and the median score of all the other motifs. 161	
   Figure 4-7. Enrichment of Myf/Myf6 motifs in each gene set, as measured by the distance between the combined score of Myf/Myf6 and the median score of all the other motifs. ... 166	
   Figure 4-8. Enrichment of Srf motifs in each gene set, as measured by the distance between the combined score of Srf and the median score of all the other motifs. For control, results for cilia genes, Nfe2l2/Nrf2 genes, and randomly selected genes are also included. ................ 171	
   Figure 4-9. Enrichment of Egr1 motifs in each gene set, as measured by the distance between the combined score of Egr1 and the median score of all the other motifs. For control, results for cilia genes, Nfe2l2/Nrf2 genes, and randomly selected genes are also included. ........... 176	
   Figure 4-10. oPOSSUM TCA results on Mef2 target genes. ............................................... 180	
   Figure 4-11 oPOSSUM TCA results on Biniou target genes. .............................................. 182	
   	
    	
    xiii  List	
  of	
  Abbreviations	
    	
    •  3C:	
  Chromosome	
  Conformation	
  Capture	
    •  aCSA:	
  anchored	
  Combination	
  Site	
  Analysis	
    •  aCTCA:	
  anchored	
  Combination	
  TFBS	
  Cluster	
  Analysis	
    •  API:	
  Application	
  Programming	
  Interface	
    •  AUC:	
  Area	
  Under	
  Curve	
    •  CAGE:	
  Cap	
  Analysis	
  Gene	
  Expression	
  	
    •  ChIP:	
  Chromatin	
  Immunoprecipitation	
    •  COG:	
  Cluster	
  of	
  Orthologous	
  Group	
    •  CRM:	
  cis-­‐Regulatory	
  Modules	
    •  DPE:	
  Downstream	
  Promoter	
  Element	
    •  EMSA:	
  Electrophorectic	
  Mobility	
  Shift	
  Assay	
    •  ENCODE:	
  ENCyclopedia	
  Of	
  DNA	
  Elements	
    •  FDR:	
  False	
  Discovery	
  Rate	
    •  GFP:	
  Green	
  Fluorescent	
  Protein	
    •  GO:	
  Gene	
  Ontology	
    •  HMM:	
  Hidden	
  Markov	
  Model	
    •  IC:	
  Information	
  Content	
    •  Inr:	
  Initiator	
  Region	
    •  IUPAC:	
  International	
  Union	
  of	
  Pure	
  and	
  Applied	
  Chemistry	
    •  KOG:	
  euKaryotic	
  Orthologous	
  Group	
    •  LCR:	
  Locus	
  Control	
  Region	
    •  LRA:	
  Logistic	
  Regression	
  Analysis	
    •  MEME:	
  Multiple	
  EM	
  for	
  Motif	
  Elicitation	
    •  modENCODE:	
  model	
  organism	
  ENCODE	
    •  MSA:	
  Multiple	
  Sequence	
  Alignment	
    •  PBM:	
  Protein	
  Binding	
  Microarray	
    •  PCR:	
  Polymerase	
  Chain	
  Reaction	
   xiv  •  PFM:	
  Position	
  Frequency	
  Matrix	
    •  PSSM:	
  Position-­‐Specific	
  Scoring	
  Matrix	
    •  PWM:	
  Position	
  Weight	
  Matrix	
    •  QT-­‐PCR:	
  Quantitative	
  –	
  Polymerase	
  Chain	
  Reaction	
    •  ROC:	
  Receiver	
  Operating	
  Characteristic	
    •  RT-­‐PCR:	
  Reverse	
  Transcriptase	
  –	
  Polymerase	
  Chain	
  Reaction	
    •  SAGE:	
  Serial	
  Analysis	
  of	
  Gene	
  Expression	
    •  SAM:	
  Significance	
  Analysis	
  of	
  Microarrays	
    •  SELEX:	
  Systematic	
  Evolution	
  of	
  Ligands	
  by	
  Exponential	
  Enrichment	
    •  SSA:	
  Single	
  Site	
  Analysis	
    •  TCA:	
  TFBS	
  Cluster	
  Analysis	
    •  TF:	
  Transcription	
  Factor	
    •  TFBS:	
  Transcription	
  Factor	
  Binding	
  Site	
    •  TSS:	
  Transcription	
  Start	
  Site	
    •  UTR:	
  UnTranslated	
  Region	
    •  VSN:	
  Variance	
  Stabilization	
  Normalization	
    •  WTSS:	
  Whole	
  Transcriptome	
  Shotgun	
  Sequencing	
    	
   	
    	
    	
    xv  Acknowledgements	
    First	
  and	
  foremost,	
  I	
  would	
  like	
  to	
  express	
  my	
  deepest	
  gratitude	
  to	
  my	
  supervisor,	
  Dr.	
   Wyeth	
  W.	
  Wasserman,	
  who	
  has	
  provided	
  me	
  with	
  years	
  of	
  patient	
  and	
  careful	
  guidance	
   in	
  my	
  endeavours.	
  I	
  would	
  also	
  like	
  to	
  thank	
  my	
  thesis	
  supervisory	
  committee	
   members,	
  Dr.	
  Holger	
  Hoos,	
  Dr.	
  Don	
  Moerman	
  and	
  Dr.	
  Francis	
  Ouellett,	
  for	
  their	
   mentorship	
  and	
  support.	
   I	
  am	
  also	
  grateful	
  to	
  both	
  the	
  current	
  and	
  past	
  members	
  of	
  the	
  Wasserman	
  group,	
   whom	
  I	
  have	
  shared	
  much	
  stimulating	
  interactions	
  with	
  and	
  received	
  immeasurable	
   support	
  from.	
  Many	
  thanks	
  to	
  Dr.	
  Shannan	
  Ho	
  Sui,	
  Dr.	
  Debra	
  Fulton,	
  Dr.	
  Jochen	
  Brumm,	
   Alice	
  Chou,	
  Dr.	
  Elodie	
  Portales-­‐Casamar,	
  Dr.	
  David	
  Martin,	
  David	
  Arenillas,	
  Rebecca	
   Worsley	
  Hunt,	
  Warren	
  Cheung	
  and	
  Jonathan	
  Lim.	
  In	
  particular,	
  I	
  gratefully	
  recognize	
   Drs.	
  Ho	
  Sui	
  and	
  Dr.	
  Fulton	
  as	
  the	
  key	
  contributors	
  to	
  past	
  versions	
  of	
  oPOSSUM,	
  as	
  the	
   discussions	
  of	
  the	
  earlier	
  implementations	
  informed	
  a	
  significant	
  part	
  of	
  my	
  research.	
   Also,	
  I	
  couldn’t	
  have	
  made	
  it	
  this	
  far	
  without	
  the	
  administrative	
  assistance	
  from	
  Dora	
   Pak.	
   I	
  would	
  like	
  to	
  acknowledge	
  the	
  National	
  Sciences	
  and	
  Engineering	
  Research	
  Council	
  for	
   funding	
  my	
  research.	
  I	
  would	
  like	
  to	
  thank	
  Dr.	
  Rob	
  Holt	
  and	
  Andrea	
  McLeod	
  for	
   assistance	
  with	
  the	
  plasmid	
  sequencing,	
  Dr.	
  Catherine	
  Pallen,	
  Dr.	
  Elizabeth	
  M.	
  Simpson	
   and	
  Dr.	
  Blair	
  Leavitt	
  for	
  the	
  use	
  of	
  their	
  laboratories.	
  “The	
  work	
  was	
  supported	
  by	
   grants	
  from	
  the	
  National	
  Institutes	
  of	
  Health	
  (USA)	
  1R01GM084875,	
  the	
  Canadian	
   Institutes	
  for	
  Health	
  Research,	
  the	
  National	
  Science	
  and	
  Engineering	
  Research	
  Council	
   (NSERC)	
  and	
  Genome	
  Canada	
  (Pleiades	
  Promoter	
  Project).”	
   Finally,	
  I	
  would	
  like	
  to	
  give	
  special	
  thanks	
  to	
  my	
  parents,	
  Chung-­‐Kun	
  and	
  Sook-­‐Ja	
  Kwon,	
   and	
  my	
  sister	
  Jung-­‐Eun	
  Kwon	
  and	
  my	
  brother	
  Peter	
  Kwon,	
  whose	
  unwavering	
  support	
   and	
  encouragement	
  have	
  allowed	
  me	
  to	
  complete	
  this	
  long	
  journey.	
   	
    	
    	
    xvi  1 Introduction	
   An	
  astounding	
  diversity	
  of	
  life	
  exists	
  in	
  nature,	
  with	
  each	
  organism	
  featuring	
  complex	
   structures	
  ranging	
  from	
  sub-­‐cellular	
  organelles	
  to	
  multi-­‐cellular	
  organs	
  that	
  enable	
   them	
  to	
  function	
  within	
  an	
  environmental	
  niche.	
  Despite	
  the	
  diversity,	
  common	
   structures	
  and	
  mechanisms	
  can	
  be	
  found	
  in	
  species	
  from	
  widely	
  separated	
  branches	
  of	
   the	
  evolutionary	
  tree.	
  	
  	
  The	
  study	
  of	
  evolutionarily	
  conserved	
  mechanisms	
  has	
  been	
  a	
   fundamental	
  tool	
  to	
  advance	
  our	
  understanding	
  of	
  biological	
  systems.	
  	
   Within	
  the	
  metazoan	
  branch	
  of	
  the	
  tree	
  of	
  life,	
  the	
  study	
  of	
  development	
  can	
  reveal	
   biomedically	
  relevant	
  findings.	
  A	
  metazoan	
  develops	
  from	
  a	
  single-­‐celled	
  zygote	
  to	
  a	
   multicellular	
  adult	
  consisting	
  of	
  specialized	
  organs	
  and	
  tissues.	
  At	
  the	
  cellular	
  level,	
  the	
   developmental	
  program	
  can	
  be	
  considered	
  a	
  series	
  of	
  transitions	
  from	
  pluripotent	
  stem	
   cells	
  to	
  progressively	
  more	
  restricted	
  tissue-­‐specific	
  progenitors	
  cells,	
  eventually	
   producing	
  fully	
  differentiated	
  cells	
  that	
  characterize	
  tissues	
  and	
  organs.	
  The	
  genes	
   controlling	
  the	
  process	
  have	
  been	
  repeatedly	
  shown	
  to	
  be	
  responsible	
  for	
  human	
   disease	
  and	
  disability.	
  	
  With	
  the	
  potential	
  of	
  stem	
  cell	
  therapy	
  looming	
  on	
  the	
  horizon,	
  a	
   fuller	
  understanding	
  of	
  these	
  master	
  control	
  genes	
  stands	
  out	
  as	
  a	
  high	
  priority	
  for	
   biological	
  research.	
  As	
  the	
  master	
  control	
  genes	
  are	
  fundamental	
  components	
  of	
   development,	
  they	
  often	
  exhibit	
  strong	
  patterns	
  of	
  conservation.	
  A	
  well-­‐studied	
   example	
  of	
  master	
  control	
  genes	
  is	
  the	
  Hox	
  gene	
  family,	
  which	
  is	
  vital	
  for	
  correct	
  body	
   plan	
  development	
  and	
  organogenesis	
  of	
  metazoans	
  and	
  is	
  highly	
  conserved	
  (Carroll	
   1995).	
   While	
  the	
  developmental	
  stages	
  of	
  tissue	
  development	
  in	
  metazoans	
  have	
  been	
  well	
   established,	
  the	
  underlying	
  regulatory	
  controls	
  behind	
  the	
  specialization	
  process	
  are	
   not	
  well	
  understood.	
  The	
  gene	
  regulatory	
  network	
  is	
  the	
  fundamental	
  mechanism	
  by	
   which	
  a	
  cell	
  controls	
  the	
  activity	
  of	
  its	
  genes.	
  	
  For	
  multicellular	
  metazoans,	
  such	
  as	
   worms,	
  flies	
  and	
  humans,	
  maintaining	
  precise	
  spatial	
  and	
  temporal	
  control	
  of	
   transcription	
  is	
  vital	
  for	
  correct	
  tissue	
  development	
  and	
  specialization	
  (Look	
  1997;	
   	
    1  Gellon	
  and	
  McGinnis	
  1998;	
  Lee	
  et	
  al.	
  2004).	
  The	
  regulatory	
  network	
  governing	
  gene	
   transcription	
  is	
  a	
  complex	
  input-­‐output	
  system	
  involving	
  a	
  number	
  of	
  players,	
  including	
   (but	
  not	
  limited	
  to)	
  the	
  cis-­‐regulatory	
  regions	
  of	
  genes	
  and	
  the	
  corresponding	
  trans-­‐ acting	
  transcription	
  factors	
  (TFs).	
  Depending	
  on	
  the	
  type	
  and	
  the	
  number	
  of	
  TFs	
  that	
   bind	
  to	
  the	
  target	
  regulatory	
  regions,	
  the	
  expression	
  pattern	
  of	
  a	
  gene	
  can	
  be	
  controlled	
   in	
  a	
  specific	
  manner.	
  Additional	
  complexities	
  arise	
  from	
  the	
  widely	
  dispersed	
   regulatory	
  region	
  locations	
  and	
  presence	
  of	
  multiple	
  alternative	
  promoters.	
  	
  Thus,	
   deciphering	
  just	
  one	
  component	
  of	
  this	
  network,	
  transcriptional	
  regulation,	
  has	
  been	
  a	
   strong	
  focus	
  of	
  decades	
  of	
  research	
  and	
  yet	
  remains	
  a	
  major	
  challenge.	
   Understanding	
  how	
  this	
  regulation	
  takes	
  place	
  can	
  be	
  a	
  key	
  to	
  understanding	
  how	
  it	
   may	
  contribute	
  to	
  human	
  genetic	
  diseases.	
  Known	
  diseases	
  that	
  are	
  caused	
  by	
  faulty	
   regulatory	
  controls	
  are	
  growing	
  in	
  number.	
  The	
  earliest	
  discoveries	
  of	
  regulation-­‐ related	
  diseases	
  include	
  certain	
  forms	
  of	
  thalassemias,	
  blood	
  disorders	
  caused	
  by	
   incorrect	
  levels	
  of	
  α-­‐	
  and	
  β-­‐globin	
  chains	
  in	
  hemoglobin	
  (Kioussis	
  et	
  al.	
  1983;	
  Driscoll	
  et	
   al.	
  1989).	
  In	
  some	
  patients,	
  genomic	
  translocations	
  result	
  in	
  the	
  removal	
  of	
  the	
  cis-­‐ acting	
  locus	
  control	
  region	
  (LCR)	
  for	
  the	
  β-­‐globin	
  locus.	
  Such	
  links	
  between	
  regulatory	
   disruptions	
  and	
  phenotypes	
  have	
  been	
  found	
  for	
  many	
  tissues	
  and	
  organs.	
  	
  In	
  patients	
   with	
  Van	
  Buchem	
  disease,	
  abnormal	
  bone	
  density	
  results	
  in	
  enlargement	
  of	
  skull	
  and	
   mandible	
  leading	
  to	
  other	
  serious	
  complications	
  (Wergedal	
  2003).	
  The	
  disease	
  can	
  be	
   caused	
  by	
  the	
  removal	
  of	
  the	
  cis-­‐acting	
  ECR5	
  element	
  from	
  the	
  locus	
  of	
  sclerotin	
  (SOST),	
   a	
  gene	
  encoding	
  a	
  negative	
  regulator	
  of	
  bone	
  formation	
  (Loots	
  et	
  al.	
  2005).	
  The	
  genetic	
   eye	
  disease	
  aniridia	
  is	
  caused	
  by	
  disruption	
  of	
  the	
  trans-­‐acting	
  PAX6	
  master	
  regulator	
   of	
  transcription	
  (Hanson	
  et	
  al.	
  1993).	
  The	
  onset	
  and	
  rate	
  of	
  progression	
  of	
  acquired	
   immunodeficiency	
  syndrome	
  (AIDS)	
  has	
  been	
  tied	
  to	
  the	
  presence	
  of	
  certain	
  alleles	
  of	
   common	
  polymorphisms	
  in	
  the	
  regulatory	
  region	
  of	
  the	
  CCR5	
  gene.	
  Examples	
  of	
   regulatory	
  region-­‐related	
  diseases	
  are	
  plentiful,	
  and	
  have	
  been	
  the	
  subject	
  of	
  recent	
   reviews	
  (Kleinjan	
  and	
  Lettice	
  2008).	
   	
   	
    2  There	
  is	
  clear	
  biomedical	
  importance	
  to	
  developing	
  a	
  full	
  understanding	
  of	
  the	
  human	
   gene	
  regulatory	
  network.	
  	
  Due	
  to	
  its	
  complexity,	
  however,	
  much	
  of	
  the	
  progress	
  has	
   been	
  largely	
  made	
  through	
  the	
  study	
  of	
  model	
  organisms.	
  	
  The	
  popular	
  metazoan	
  model	
   organisms	
  offer	
  different	
  advantages	
  and	
  technical	
  convenience.	
  The	
  laboratory	
  mouse	
   is	
  expensive	
  to	
  study,	
  but	
  can	
  provide	
  exquisite	
  relevance	
  to	
  the	
  study	
  of	
  human	
  gene	
   regulation.	
  	
  Within	
  the	
  more	
  distant	
  metazoans,	
  both	
  flies	
  and	
  worms	
  have	
  been	
   extensively	
  used	
  to	
  study	
  gene	
  regulation,	
  but	
  the	
  capacity	
  to	
  interpret	
  the	
  specific	
   properties	
  of	
  human	
  genes	
  based	
  on	
  the	
  research	
  findings	
  has	
  been	
  lacking.	
  	
  The	
   considerable	
  evolutionary	
  distances	
  between	
  vertebrates,	
  nematodes	
  and	
  insects	
   makes	
  interpretation	
  challenging,	
  but	
  key	
  findings	
  in	
  both	
  worms	
  and	
  flies	
  have	
   revealed	
  critical	
  insights	
  into	
  diseases	
  and	
  conditions	
  ranging	
  from	
  diabetes	
  (Zhang	
  et	
   al.	
  2009)	
  to	
  limb	
  malformations	
  (Goodman	
  2002).	
  Improving	
  the	
  capacity	
  to	
  use	
   regulatory	
  insights	
  from	
  insects	
  and	
  nematodes	
  could	
  accelerate	
  our	
  understanding	
  of	
   the	
  human	
  transcription	
  regulatory	
  network.	
  The	
  main	
  goal	
  of	
  the	
  research	
  outlined	
  in	
   this	
  thesis	
  is	
  to	
  develop	
  bioinformatics	
  methods	
  for	
  the	
  identification	
  of	
  cis-­‐regulatory	
   regions	
  governing	
  gene	
  transcription,	
  and	
  identify	
  the	
  important	
  regulatory	
  modules	
   that	
  are	
  conserved	
  and	
  shared	
  across	
  evolutionarily	
  distant	
  species.	
    1.1 Metazoan	
  Tissue	
  Development	
   Metazoans	
  with	
  bilateral	
  symmetry,	
  including	
  worms,	
  flies,	
  and	
  humans,	
  form	
  three	
   primary	
  tissue	
  layers:	
  ectoderm,	
  mesoderm,	
  and	
  endoderm	
  (Nance	
  et	
  al.	
  2005;	
  Solnica-­‐ Krezel	
  2005).	
  	
  After	
  the	
  zygote	
  formation	
  and	
  subsequent	
  cell	
  divisions,	
  early	
  embryos	
   in	
  general	
  go	
  through	
  the	
  gastrulation	
  process	
  to	
  produce	
  the	
  three	
  layers	
  (Figure	
  1-­‐1).	
   Ectoderm	
  is	
  the	
  source	
  of	
  epidermis	
  and	
  neurons.	
  Endoderm	
  develops	
  into	
  the	
   digestive	
  system,	
  while	
  mesoderm	
  turns	
  into	
  muscle,	
  blood,	
  connective	
  tissue	
  etc.	
  At	
   each	
  developmental	
  stage,	
  specific	
  sets	
  of	
  regulatory	
  controls	
  are	
  exerted	
  to	
  ensure	
  that	
   cells	
  differentiate	
  correctly.	
  The	
  most	
  well	
  known	
  regulatory	
  system	
  involved	
  in	
  the	
   development	
  process	
  involves	
  the	
  Hox	
  family	
  of	
  transcriptional	
  regulators.	
  Hox	
  genes	
   are	
  homeodomain-­‐containing	
  transcription	
  factors	
  that	
  specify	
  the	
  anterior-­‐posterior	
   	
    3  axis	
  and	
  segment	
  identity	
  in	
  the	
  early	
  embryo	
  stages	
  of	
  metazoans	
  (Carroll	
  1995).	
   These	
  genes	
  are	
  responsible	
  for	
  the	
  development	
  of	
  the	
  correct	
  body	
  plan	
  and	
   organogenesis,	
  and	
  are	
  highly	
  conserved	
  in	
  most	
  metazoan	
  species.	
  	
   Figure 1-1. Differentiation of human tissues. Gastrulation leads to formation of ectoderm, mesoderm, and endoderm. (Source: NCBI. http://www.ncbi.nih.gov/About/primer/images/layoutdiff8.gif)  	
    1.1.1 Tissue	
  and	
  Structure	
  Conservation	
  among	
  Metazoans	
   The	
  metazoan	
  evolutionary	
  tree	
  is	
  well	
  established,	
  serving	
  as	
  a	
  useful	
  guide	
  to	
   important	
  organogenesis	
  events.	
  By	
  looking	
  at	
  the	
  ‘oldest’	
  (farthest	
  from	
  humans	
  in	
   branch	
  length)	
  species	
  with	
  a	
  tissue	
  type	
  of	
  interest,	
  one	
  can	
  analyze	
  the	
  commonalities	
   in	
  the	
  tissue	
  between	
  the	
  two	
  species	
  and	
  determine	
  the	
  extent	
  of	
  conservation.	
   Intermediate	
  species	
  can	
  provide	
  insight	
  into	
  the	
  additional	
  complexities	
  in	
  terms	
  of	
   	
    4  both	
  the	
  tissue	
  structure	
  and	
  the	
  regulatory	
  controls	
  behind	
  the	
  structure.	
  A	
  good	
   example	
  is	
  vertebrate	
  striated	
  muscle	
  myogenesis,	
  a	
  highly	
  structured	
  process	
  in	
  which	
   mononucleate	
  myoblasts	
  fuse	
  together	
  to	
  form	
  multinucleate	
  myotubes,	
  which	
  then	
   develop	
  into	
  various	
  classes	
  of	
  myofibres	
  (Valdez	
  et	
  al.	
  2000).	
  This	
  differentiation	
   process	
  requires	
  complex	
  transcriptional	
  regulation	
  controls,	
  which	
  in	
  vertebrate	
   skeletal	
  muscle	
  involves	
  two	
  major	
  TF	
  families,	
  MyoD	
  and	
  MEF2	
  (Braun	
  et	
  al.	
  1994;	
   Rudnicki	
  and	
  Jaenisch	
  1995;	
  Naya	
  and	
  Olson	
  1999).	
  Muscle	
  is	
  present	
  in	
  most	
  branches	
   of	
  the	
  metazoans,	
  from	
  jellyfish	
  to	
  vertebrates.	
  When	
  one	
  examines	
  nematode	
  muscle	
   structure	
  and	
  the	
  regulatory	
  controls	
  involved,	
  one	
  finds	
  both	
  commonalities	
  and	
   pronounced	
  differences	
  with	
  vertebrates	
  (Dichoso	
  et	
  al.	
  2000;	
  Castanon	
  and	
  Baylies	
   2002;	
  Fukushige	
  et	
  al.	
  2006).	
  	
   Some	
  cells	
  exhibit	
  key	
  structures	
  critical	
  for	
  their	
  role	
  within	
  an	
  organism.	
  	
  Such	
   structures	
  can	
  appear	
  in	
  different	
  tissues,	
  but	
  contain	
  similar	
  protein	
  components.	
  	
   Cilium	
  is	
  a	
  cellular	
  structure	
  that	
  is	
  used	
  in	
  a	
  variety	
  of	
  roles,	
  from	
  cell	
  motility	
  to	
   sensory	
  detection.	
  This	
  structure	
  has	
  been	
  adapted	
  for	
  different	
  purposes	
  over	
   evolution,	
  but	
  the	
  general	
  structure	
  and	
  its	
  components	
  are	
  highly	
  conserved,	
  not	
  only	
   between	
  the	
  different	
  cell	
  types	
  but	
  also	
  between	
  multiple	
  species	
  (Li	
  et	
  al.	
  2004;	
   Blacque	
  et	
  al.	
  2005).	
  As	
  in	
  muscle,	
  the	
  structural	
  conservation	
  is	
  paralleled	
  by	
  retention	
   of	
  regulatory	
  controls.	
  	
  The	
  RFX	
  transcription	
  factors	
  perform	
  regulatory	
  roles	
  for	
  cilia	
   gene	
  expression	
  across	
  vertebrates	
  and	
  nematodes	
  (Emery	
  et	
  al.	
  1996;	
  Efimenko	
  et	
  al.	
   2005).	
  	
    1.2 Gene	
  Regulation	
   In	
  the	
  simplest	
  model	
  of	
  transcriptional	
  regulation,	
  as	
  proposed	
  for	
  lactose-­‐regulated	
   transcription	
  in	
  bacteria	
  (Reznikoff	
  1992),	
  the	
  regulatory	
  organization	
  of	
  a	
  gene	
   consists	
  of	
  a	
  basal	
  promoter,	
  a	
  coding	
  region,	
  and	
  the	
  flanking	
  sequence	
  containing	
  TF	
   binding	
  sites	
  (TFBSs).	
  When	
  these	
  sites	
  are	
  bound	
  by	
  the	
  appropriate	
  sequence-­‐specific	
   TFs,	
  a	
  complex	
  forms	
  to	
  promote	
  the	
  recruitment	
  and	
  positioning	
  of	
  RNA	
  polymerase	
  at	
   	
    5  the	
  promoter	
  region	
  to	
  initiate	
  transcription	
  (Hahn	
  2004).	
  However,	
  there	
  is	
  added	
   complexity	
  at	
  many	
  levels	
  that	
  is	
  not	
  represented	
  by	
  this	
  simple	
  model	
  (Figure	
  1-­‐2).	
   Figure 1-2. Overview of transcriptional regulation. Groups of transcription factors (TFs) bind to sets of transcription factor binding sites (TFBSs) within or adjoining genes to activate or repress gene expression. Within the figure are examples of numerous regulatory regions, including regions proximal to the promoter, distal regions both 5’ and 3’ of the transcription start site (TSS). (reproduced with permission;(Wasserman and Krivan 2003))  	
    1.2.1 Promoter	
  and	
  Regulatory	
  Regions	
   The	
  core	
  promoter,	
  which	
  extends	
  ~34	
  bp	
  upstream	
  and	
  downstream	
  of	
  a	
  transcription	
   start	
  site	
  (TSS),	
  can	
  contain	
  all	
  or	
  a	
  subset	
  of	
  a	
  TATA-­‐box,	
  initiator	
  region	
  (Inr)	
  and	
   downstream	
  promoter	
  element	
  (DPE).	
  The	
  general	
  transcription	
  factors	
  bind	
  to	
  the	
   core	
  promoter	
  to	
  form	
  the	
  pre-­‐initiation	
  complex,	
  which	
  positions	
  the	
  RNA	
  polymerase	
   II	
  for	
  proper	
  transcription	
  initiation.	
  The	
  proximal	
  promoter	
  region	
  located	
  to	
  either	
   side	
  of	
  the	
  core	
  promoter	
  usually	
  contains	
  some	
  transcription	
  factor	
  binding	
  sites	
   (TFBS)	
  that	
  participate	
  in	
  the	
  regulation	
  process.	
  Regulatory	
  regions	
  can	
  be	
  found	
  not	
   	
    6  only	
  at	
  the	
  5’	
  upstream	
  sequences	
  but	
  also	
  within	
  the	
  introns	
  and	
  3’	
  downstream	
   sequences.	
  The	
  regulatory	
  regions	
  are	
  spread	
  over	
  much	
  larger	
  distances	
  as	
  well,	
   possibly	
  hundreds	
  of	
  kilobases	
  away	
  from	
  the	
  exons	
  of	
  regulated	
  genes	
  in	
  vertebrates	
   (Loots	
  et	
  al.	
  2000).	
  Some	
  genes	
  have	
  been	
  found	
  to	
  be	
  under	
  the	
  control	
  of	
  regulatory	
   regions	
  that	
  are	
  located	
  in	
  introns	
  of	
  an	
  adjacent	
  gene	
  (McBride	
  and	
  Kleinjan	
  2004)	
  or	
   even	
  on	
  other	
  chromosomes	
  (Morris	
  et	
  al.	
  1999).	
  The	
  mediator	
  complex	
  acts	
  as	
  the	
   intermediary	
  between	
  the	
  basal	
  transcription	
  machinery	
  and	
  the	
  TFs	
  bound	
  to	
  the	
   regulatory	
  regions.	
  	
  Evidence	
  continues	
  to	
  build	
  indicating	
  that	
  most	
  genes	
  have	
   multiple	
  alternative	
  promoters	
  that	
  can	
  be	
  used	
  in	
  a	
  selective	
  manner.	
  	
  Depending	
  on	
   promoter	
  organization,	
  the	
  alternative	
  promoters	
  may	
  produce	
  transcripts	
  encoding	
   identical	
  or	
  different	
  proteins.	
  	
  Nematodes	
  have	
  a	
  distinguishing	
  feature	
  from	
  many	
   other	
  metazoans	
  in	
  that	
  they	
  contain	
  operons	
  in	
  their	
  genome,	
  where	
  multiple	
  genes	
   are	
  transcribed	
  as	
  a	
  single	
  unit	
  before	
  being	
  processed	
  into	
  separate	
  mRNAs	
   (Blumenthal	
  and	
  Gleason	
  2003).	
    1.2.2 Transcription	
  Factors	
   There	
  are	
  estimated	
  to	
  be	
  1,500	
  TFs	
  encoded	
  by	
  human	
  genes	
  and	
  over	
  900	
  TFs	
  by	
   nematode	
  genes	
  (Reece-­‐Hoyes	
  et	
  al.	
  2005;	
  Fulton	
  et	
  al.	
  2009).	
  TFs	
  can	
  be	
  classified	
   according	
  to	
  the	
  structural	
  class	
  of	
  their	
  DNA	
  binding	
  domains,	
  which	
  tend	
  to	
  be	
  highly	
   conserved.	
  For	
  certain	
  DNA	
  binding	
  domains,	
  the	
  DNA	
  sequences	
  bound	
  exhibit	
  class-­‐ specific	
  DNA	
  sequence.	
  TFs	
  act	
  in	
  a	
  multitude	
  of	
  combinations,	
  some	
  binding	
  as	
   monomers,	
  some	
  as	
  homodimers	
  or	
  multimers,	
  or	
  through	
  forming	
  hetero-­‐complexes	
   with	
  different	
  TFs.	
  The	
  interactions	
  between	
  TFs	
  allows	
  for	
  a	
  portion	
  of	
  observed	
  cell-­‐ type	
  specific	
  expression,	
  for	
  instance	
  through	
  combinations	
  of	
  different	
  TFs	
  binding	
  to	
   clusters	
  of	
  cis-­‐regulatory	
  elements	
  cooperatively	
  (Arnone	
  and	
  Davidson	
  1997).	
  These	
   clusters	
  are	
  often	
  referred	
  to	
  as	
  cis-­‐regulatory	
  modules	
  (CRMs).	
  For	
  each	
  gene,	
  specific	
   combinations	
  of	
  TFBSs	
  form	
  units	
  of	
  regulatory	
  control.	
  This	
  model	
  allows	
  for	
  a	
   combinatorial	
  control	
  of	
  the	
  gene,	
  where	
  only	
  a	
  limited	
  number	
  of	
  TFs	
  can	
  create	
  an	
    	
    7  exponential	
  number	
  of	
  unique	
  combinations,	
  each	
  conferring	
  a	
  specific	
  regulation	
  to	
  the	
   gene	
  (Figure	
  1-­‐3).	
  	
   Figure 1-3. cis-Regulatory module examples. CYP1A2 has a CRM in a distal enhancer located at -2,352 to -2,094 relative to the TSS. IGF1 has a CRM in its promoter with more than 1 factor binding at the first two sites, while ALDOB has a CRM located in the first intron. AP-1  HNF-1  CYP1A2 -2352  -2094  HNF-1 AP-1  HNF-1 C/EBP  HNF-3  HNF-3  IGF1 -420  -100  HNF-4  HNF-1  HNF-1  ALDOB +2125  +2329  	
    	
    1.2.3 MicroRNA	
   Discoveries	
  of	
  new	
  RNA	
  classes	
  and	
  their	
  roles	
  in	
  gene	
  regulation	
  provided	
  further	
   insight	
  into	
  gene	
  regulatory	
  mechanisms.	
  One	
  prominent	
  discovery	
  is	
  the	
  role	
  that	
   microRNA	
  plays	
  in	
  gene	
  regulation.	
  MicroRNAs	
  are	
  short	
  noncoding	
  RNAs	
  that	
  act	
  as	
   posttranscriptional	
  regulators	
  (Brennecke	
  and	
  Cohen	
  2003;	
  Ambros	
  2004).	
  Processed	
   microRNAs	
  have	
  an	
  average	
  length	
  of	
  22	
  nucleotides,	
  and	
  most	
  hybridize	
  to	
  target	
   sequences	
  within	
  the	
  3’	
  UTR	
  of	
  target	
  transcripts.	
  The	
  interaction	
  of	
  a	
  microRNA	
  with	
  a	
   target	
  sequence	
  in	
  an	
  mRNA	
  promotes	
  the	
  degradation	
  of	
  the	
  mRNA.	
  Cohen	
  and	
   colleagues	
  suggested	
  that	
  animal	
  microRNAs	
  confer	
  robustness	
  to	
  developmental	
  gene	
   expression	
  programs	
  to	
  ensure	
  tissue	
  specificity	
  (Stark	
  et	
  al.	
  2005).	
  Lim	
  et	
  al.	
  have	
   	
    8  shown	
  that	
  transfecting	
  human	
  cells	
  with	
  microRNAs	
  that	
  are	
  preferentially	
  expressed	
   in	
  a	
  particular	
  tissue	
  type	
  shifts	
  the	
  expression	
  profile	
  of	
  the	
  cells	
  to	
  that	
  of	
  the	
  tissue,	
   and	
  that	
  a	
  significant	
  portion	
  of	
  those	
  genes	
  with	
  shifted	
  (reduced)	
  expression	
  have	
  the	
   transfected	
  microRNA	
  target	
  sequence	
  in	
  their	
  3’	
  UTRs	
  (Lim	
  et	
  al.	
  2005).	
  Genes	
  with	
   microRNA	
  target	
  sequences	
  tend	
  to	
  be	
  highly	
  expressed	
  at	
  the	
  developmental	
  stage	
   prior	
  to	
  the	
  microRNA	
  expression,	
  while	
  those	
  genes	
  that	
  show	
  the	
  same	
  spatio-­‐ temporal	
  expression	
  pattern	
  as	
  a	
  particular	
  microRNA	
  lack	
  the	
  target	
  sequences	
  for	
  the	
   microRNA,	
  potentially	
  reflecting	
  selective	
  avoidance	
  (Farh	
  et	
  al.	
  2005).	
    1.2.4 Chromatin	
  and	
  Epigenetics	
   The	
  study	
  of	
  epigenetics—the	
  heritable	
  changes	
  in	
  gene	
  function	
  and	
  expression	
   without	
  any	
  change	
  to	
  the	
  gene	
  sequence	
  itself—has	
  provided	
  researchers	
  with	
  better	
   insight	
  into	
  additional	
  layers	
  of	
  regulation	
  (Olins	
  and	
  Olins	
  2003).	
  Two	
  important	
   means	
  of	
  epigenetic	
  control	
  are	
  histone	
  modification	
  and	
  nucleosome	
  remodeling,	
   which	
  in	
  turn	
  lead	
  to	
  changes	
  in	
  chromatin	
  conformation	
  (Fischle	
  et	
  al.	
  2003).	
   Acetylation	
  of	
  histone	
  tail	
  residues	
  tends	
  to	
  be	
  associated	
  with	
  open	
  chromatin	
  and	
   widely	
  spaced	
  nucleosomes.	
  On	
  the	
  other	
  hand,	
  methylation	
  of	
  histone	
  tail	
  residues	
   have	
  been	
  associated	
  with	
  both	
  open	
  and	
  closed	
  chromatin	
  with	
  tightly	
  packed	
   nucleosomes,	
  depending	
  on	
  which	
  specific	
  residues	
  undergo	
  modification.	
  With	
  open	
   chromatin,	
  TFs	
  have	
  better	
  access	
  to	
  the	
  TFBSs	
  on	
  DNA	
  for	
  binding;	
  conversely	
  with	
   closed	
  chromatin,	
  the	
  compactness	
  hinders	
  TFs	
  from	
  binding	
  to	
  the	
  TFBS.	
  Acetylated	
   histones	
  facilitate	
  better	
  access	
  and	
  stability	
  for	
  the	
  TFs	
  binding	
  to	
  CRMs	
  by	
  opening	
  up	
   the	
  chromatin,	
  while	
  methylated	
  histones	
  that	
  lead	
  to	
  closed	
  chromatin	
  act	
  to	
  close	
  the	
   conformation	
  reducing	
  access.	
  Thus,	
  actively	
  expressed	
  genes	
  tend	
  to	
  be	
  associated	
   with	
  open	
  chromatin	
  structure	
  and	
  silenced	
  genes	
  tend	
  to	
  be	
  associated	
  with	
   heterochromatin.	
  It	
  is	
  important	
  to	
  note	
  that	
  the	
  chromatin	
  conformation	
  changes	
   result	
  in	
  spatial	
  chromatin	
  re-­‐organization,	
  which	
  means	
  that	
  one	
  needs	
  to	
  examine	
  the	
   chromatin	
  in	
  three-­‐dimensional	
  space	
  in	
  order	
  to	
  fully	
  capture	
  the	
  details	
  of	
  the	
   conformation	
  effects.	
  The	
  recent	
  development	
  of	
  the	
  chromosome	
  conformation	
   	
    9  capture	
  (3C)	
  technology	
  and	
  its	
  higher-­‐throughput	
  derivatives	
  has	
  provided	
   researchers	
  with	
  a	
  much-­‐needed	
  method	
  of	
  determining	
  the	
  distribution	
  of	
  enhancers	
   and	
  promoters	
  in	
  3D	
  space	
  (McBride	
  and	
  Kleinjan	
  2004).	
   Another	
  important	
  regulatory	
  mechanism	
  coupled	
  with	
  chromatin	
  conformation	
  is	
  the	
   control	
  provided	
  by	
  insulators	
  -­‐	
  DNA	
  sequence	
  elements	
  that	
  block	
  the	
  influence	
  of	
   regulatory	
  events	
  from	
  adjacent	
  chromatin	
  domains.	
  There	
  are	
  two	
  different	
  types	
  of	
   insulators:	
  1)	
  those	
  that	
  prevent	
  enhancers	
  from	
  exerting	
  their	
  control	
  on	
   inappropriate	
  genes,	
  and	
  2)	
  those	
  that	
  act	
  as	
  barriers	
  against	
  spreading	
  of	
  chromatin	
   condensation	
  to	
  allow	
  the	
  adjacent	
  genes	
  to	
  continue	
  with	
  their	
  expression.	
  Insulator	
   elements	
  have	
  been	
  identified	
  in	
  yeast,	
  fruit	
  flies	
  and	
  vertebrates	
  (Golovnin	
  et	
  al.	
  1999;	
   Bell	
  et	
  al.	
  2001;	
  Zhan	
  et	
  al.	
  2001;	
  West	
  and	
  Fraser	
  2005).	
  In	
  vertebrates,	
  CTCF	
  has	
  been	
   identified	
  as	
  the	
  main	
  enhancer-­‐blocking	
  protein	
  that	
  binds	
  to	
  the	
  insulator	
  elements	
   (Brasset	
  and	
  Vaury	
  2005).	
    1.3 Analysis	
  of	
  Gene	
  Expression	
   Gene	
  expression	
  can	
  roughly	
  be	
  evaluated	
  in	
  two	
  parts:	
  RNA	
  and	
  protein.	
  	
  RNA	
   abundance	
  is	
  determined	
  by	
  a	
  series	
  of	
  processes,	
  including	
  transcription	
  initiation,	
   processing	
  and	
  stability.	
  	
  In	
  addition	
  to	
  the	
  rate	
  of	
  translation	
  by	
  ribosomes,	
  protein	
   abundance	
  reflects	
  protein	
  processing	
  and	
  stability.	
  	
  	
  Because	
  of	
  the	
  relative	
  ease	
  with	
   which	
  cellular	
  mRNA	
  levels	
  can	
  be	
  measured	
  compared	
  to	
  protein	
  levels,	
  mRNA	
   expression	
  profiles	
  are	
  often	
  used	
  as	
  the	
  primary	
  measurement	
  of	
  gene	
  expression.	
  	
   This	
  common	
  practice	
  is	
  performed	
  in	
  spite	
  of	
  the	
  fact	
  that	
  mRNA	
  levels	
  and	
  protein	
   levels	
  often	
  can	
  correlate	
  poorly,	
  due	
  to	
  the	
  actions	
  of	
  post-­‐translational	
  regulatory	
   mechanisms	
  (Gygi	
  et	
  al.	
  1999).	
  For	
  study	
  of	
  the	
  transcription	
  regulatory	
  network,	
  RNA	
   measurements	
  are	
  more	
  directly	
  relevant	
  than	
  protein	
  measurements.	
  In	
  the	
  remainder	
   of	
  this	
  chapter,	
  I	
  will	
  often	
  refer	
  to	
  the	
  term	
  ‘gene’	
  as	
  being	
  inclusive	
  of	
  both	
  the	
   transcribed	
  RNA	
  and	
  the	
  translated	
  protein.	
  	
    	
    10  For	
  this	
  thesis,	
  I	
  evaluated	
  and	
  incorporated	
  into	
  my	
  research	
  several	
  gene	
  expression	
   measurement	
  data	
  sets	
  that	
  were	
  generated	
  with	
  the	
  methodologies	
  outlined	
  in	
  the	
   following	
  sections.	
  	
    1.3.1 DNA	
  Microarrays	
   DNA	
  microarray	
  technology	
  was	
  developed	
  as	
  a	
  way	
  of	
  parallelizing	
  traditional	
   northern	
  blotting	
  experiments	
  for	
  detecting	
  cellular	
  mRNA	
  levels.	
  Microarrays	
  are	
  glass	
   or	
  silicon	
  chips	
  that	
  contain	
  arrays	
  of	
  spots,	
  to	
  each	
  of	
  which	
  a	
  unique	
  DNA	
  probe	
  is	
   affixed.	
  In	
  a	
  microarray	
  experiment,	
  RNA	
  extracted	
  from	
  the	
  biological	
  sample	
  to	
  be	
   tested	
  is	
  converted	
  to	
  fluorescently	
  labeled	
  cDNA,	
  which	
  is	
  then	
  hybridized	
  to	
  the	
   microarray.	
  Spots	
  that	
  bind	
  the	
  labeled	
  cDNA	
  can	
  be	
  detected	
  using	
  a	
  laser	
  scanner,	
  and	
   the	
  strength	
  of	
  fluorescence	
  is	
  used	
  to	
  quantify	
  the	
  gene	
  expression	
  level.	
   There	
  are	
  two	
  common	
  types	
  of	
  microarray	
  technologies:	
  1)	
  cDNA	
  microarrays,	
  which	
   requires	
  two	
  samples	
  with	
  two	
  different	
  fluorescent	
  dyes,	
  and	
  2)	
  oligonucleotide	
   microarrays,	
  which	
  requires	
  only	
  one	
  sample	
  as	
  they	
  provide	
  absolute	
  measurement	
   levels	
  for	
  each	
  one	
  (Bowtell	
  1999).	
  It	
  is	
  also	
  possible	
  to	
  classify	
  the	
  microarrays	
  based	
   on	
  the	
  types	
  of	
  probes	
  used.	
  In	
  traditional	
  microarrays,	
  probes	
  are	
  designed	
  to	
  identify	
   the	
  known	
  or	
  predicted	
  genes	
  only.	
  On	
  the	
  other	
  hand,	
  in	
  tiling	
  arrays,	
  short	
  fragments	
   of	
  DNA	
  designed	
  to	
  cover	
  the	
  contiguous	
  regions	
  of	
  a	
  genome	
  (which	
  can	
  be	
  extended	
   to	
  cover	
  the	
  entire	
  genome)	
  are	
  used	
  as	
  probes,	
  regardless	
  of	
  gene	
  annotation.	
  The	
   resolution	
  of	
  tiling	
  arrays	
  is	
  determined	
  by	
  the	
  probe	
  lengths	
  and	
  spacing.	
  This	
  allows	
   an	
  unbiased	
  reporting	
  of	
  transcription,	
  allowing	
  transcriptome	
  profiling	
  in	
  the	
  given	
   genomic	
  region.	
  	
   Microarray	
  measurements	
  can	
  be	
  affected	
  by	
  multiple	
  sources	
  of	
  variation	
  over	
  the	
   course	
  of	
  experiment,	
  including	
  dye	
  integration,	
  sample	
  preparation,	
  hybridization	
  and	
   image	
  artifacts.	
  To	
  account	
  for	
  these	
  variations,	
  measurement	
  values	
  must	
  be	
   normalized	
  statistically	
  based	
  on	
  both	
  technical	
  and	
  biological	
  replicates	
  before	
  they	
   can	
  be	
  analyzed	
  for	
  biological	
  interpretation.	
  Numerous	
  normalization	
  methods	
  have	
   	
    11  been	
  developed,	
  including	
  Robust	
  Multi-­‐Chip	
  Analysis	
  (RMA)	
  and	
  MAS5	
  (Lim	
  et	
  al.	
   2007).	
  	
    1.3.2 Serial	
  Analysis	
  of	
  Gene	
  Expression	
  (SAGE)	
   Serial	
  analysis	
  of	
  gene	
  expression	
  (SAGE)	
  is	
  another	
  method	
  to	
  quantitatively	
  measure	
   mRNA	
  abundance	
  in	
  a	
  given	
  biological	
  sample	
  (Velculescu	
  et	
  al.	
  1995).	
  The	
  central	
  idea	
   behind	
  SAGE	
  is	
  that	
  a	
  short	
  (10-­‐25bp)	
  sequence	
  tag	
  from	
  a	
  transcript	
  would	
  be	
   obtained	
  and	
  used	
  to	
  uniquely	
  identify	
  the	
  gene.	
  In	
  a	
  SAGE	
  experiment,	
  mRNA	
  is	
   isolated	
  from	
  a	
  sample	
  and	
  short	
  fragments	
  are	
  obtained	
  from	
  the	
  3’	
  end	
  of	
  transcripts	
   using	
  pre-­‐defined	
  restriction	
  enzymes,	
  and	
  these	
  fragments	
  are	
  concatenated	
  using	
   DNA	
  ligase.	
  The	
  concatemers	
  are	
  sequenced,	
  and	
  the	
  obtained	
  sequences	
  are	
   computationally	
  analyzed	
  to	
  identify	
  each	
  constituent	
  tag,	
  which	
  are	
  then	
  mapped	
  back	
   to	
  the	
  genome	
  to	
  identify	
  the	
  genes	
  involved.	
  By	
  counting	
  the	
  number	
  of	
  observations	
   for	
  each	
  unique	
  tag,	
  quantitative	
  measures	
  of	
  mRNA	
  levels	
  are	
  obtained.	
  Improved	
   versions	
  of	
  SAGE	
  have	
  since	
  been	
  developed,	
  including	
  Long-­‐SAGE	
  and	
  SuperSAGE,	
   which	
  employ	
  longer	
  tags	
  to	
  improve	
  the	
  confidence	
  of	
  the	
  gene	
  identification	
  (Saha	
  et	
   al.	
  2002;	
  Matsumura	
  et	
  al.	
  2005).	
  	
  	
  The	
  SAGE	
  and	
  aforementioned	
  microarray	
  methods	
   are	
  rapidly	
  being	
  supplanted	
  by	
  RNA-­‐Seq.	
    1.3.3 RNA-­‐Seq	
   RNA-­‐Seq,	
  also	
  known	
  as	
  whole	
  transcriptome	
  shotgun	
  sequencing	
  (WTSS),	
  is	
  a	
   transcriptome	
  profiling	
  technology	
  based	
  on	
  high-­‐throughput	
  sequencing	
  (Wang	
  et	
  al.	
   2009).	
  RNA-­‐Seq	
  measures	
  the	
  cellular	
  RNA	
  content	
  by	
  sequencing	
  reverse	
  transcribed	
   cDNAs.	
  As	
  with	
  ChIP-­‐Seq	
  (explained	
  in	
  section	
  1.4.2.1),	
  the	
  deep	
  coverage	
  provided	
  by	
   the	
  next	
  generation	
  sequencing	
  methods	
  allows	
  researchers	
  to	
  examine	
  the	
   transcriptome	
  with	
  greater	
  resolution	
  than	
  was	
  possible	
  with	
  previous	
  methods.	
  The	
   counts	
  obtained	
  are	
  highly	
  quantitative,	
  and	
  provide	
  the	
  added	
  benefit	
  of	
  information	
    	
    12  about	
  alternative	
  splice	
  forms	
  of	
  RNA,	
  allelic	
  differences	
  in	
  RNA	
  production	
  and	
   mutations.	
    1.3.4 Differential	
  Gene	
  Expression	
  Analysis	
   Gene	
  expression	
  profiling	
  measurements	
  are	
  often	
  used	
  to	
  detect	
  genes	
  that	
  exhibit	
   differential	
  expression	
  across	
  samples,	
  such	
  as	
  drug-­‐treated	
  and	
  untreated	
  cells	
  or	
   multiple	
  tissue	
  types	
  (Lockhart	
  et	
  al.	
  1996).	
  The	
  identification	
  of	
  genes	
  that	
  display	
  a	
   selective	
  pattern	
  of	
  expression	
  in	
  such	
  studies	
  is	
  a	
  fundamental	
  challenge	
  in	
  applied	
   genomics	
  and	
  has	
  therefore	
  been	
  the	
  topic	
  of	
  research	
  by	
  hundreds	
  of	
  research	
  groups.	
   Thorough	
  reviews	
  and	
  useful	
  textbooks	
  have	
  been	
  published	
  about	
  the	
  analysis	
   methods	
  (Dudoit	
  et	
  al.	
  2000;	
  Butte	
  2002;	
  Cui	
  and	
  Churchill	
  2003).	
  	
  I	
  will	
  focus	
  on	
  a	
  few	
   key	
  points	
  here.	
  Statistical	
  tests	
  such	
  as	
  t-­‐tests	
  can	
  be	
  used	
  to	
  perform	
  the	
  comparisons,	
   but	
  such	
  analyses	
  are	
  often	
  complicated	
  by	
  the	
  fact	
  that	
  there	
  are	
  many	
  genes	
  to	
  be	
   analyzed	
  but	
  only	
  a	
  few	
  observations	
  available,	
  leading	
  to	
  multiple	
  testing	
  and	
  variance	
   estimation	
  challenges.	
  Furthermore,	
  no	
  assumption	
  of	
  normal	
  distribution	
  of	
   expression	
  measurements	
  can	
  be	
  made,	
  which	
  is	
  a	
  requirement	
  for	
  t-­‐tests.	
  Significance	
   analysis	
  of	
  microarrays	
  (SAM)	
  is	
  a	
  widely	
  used	
  method	
  that	
  makes	
  no	
  normal	
   distribution	
  assumption	
  and	
  instead	
  calculates	
  the	
  p-­‐values	
  based	
  on	
  permutations	
  of	
   data	
  (Tusher	
  et	
  al.	
  2001).	
  SAM	
  estimates	
  the	
  false	
  discovery	
  rate	
  (FDR),	
  which	
  is	
  the	
   proportion	
  of	
  data	
  likely	
  to	
  have	
  been	
  incorrectly	
  identified	
  as	
  being	
  significant.	
  SAM	
   can	
  be	
  used	
  for	
  differential	
  expression	
  analysis	
  on	
  both	
  pairwise	
  and	
  multiple	
  biological	
   samples.	
  Another	
  complication	
  in	
  expression	
  analysis	
  is	
  the	
  dependence	
  of	
  significance	
   of	
  differential	
  expression	
  on	
  signal	
  intensities.	
  The	
  same	
  fold	
  change	
  in	
  expression	
   between	
  two	
  samples	
  observed	
  at	
  high	
  intensities	
  is	
  more	
  likely	
  to	
  be	
  significant	
  than	
   the	
  one	
  observed	
  at	
  low	
  intensities.	
  Variance	
  stabilization	
  normalization	
  (VSN)	
  was	
   developed	
  to	
  compensate	
  for	
  this	
  effect,	
  by	
  transforming	
  the	
  data	
  so	
  that	
  the	
  variance	
  is	
   approximately	
  uniform	
  at	
  all	
  intensity	
  values	
  (Huber	
  et	
  al.	
  2002).	
  	
   	
   	
    13  1.4 Experimental	
  Methods	
  for	
  Identification	
  of	
  Regulatory	
  Regions	
   In	
  order	
  to	
  understand	
  how	
  a	
  given	
  gene	
  is	
  regulated,	
  it	
  is	
  important	
  to	
  be	
  able	
  to	
  first	
   locate	
  the	
  controlling	
  regulatory	
  regions	
  and	
  identify	
  the	
  responsible	
  TFBSs.	
  Numerous	
   experimental	
  methods	
  have	
  been	
  utilized	
  for	
  this	
  purpose.	
  Researchers	
  traditionally	
   have	
  relied	
  on	
  these	
  methods	
  to	
  1)	
  identify	
  the	
  TF-­‐DNA	
  binding	
  and	
  interactions	
  and	
  2)	
   validate	
  the	
  effects	
  of	
  cis-­‐regulatory	
  regions	
  on	
  gene	
  expression.	
  While	
  earlier	
  methods	
   involved	
  individually	
  isolating	
  genomic	
  regions	
  of	
  interest	
  or	
  tests	
  using	
  synthetically	
   generated	
  oligonucleotides,	
  advances	
  in	
  high	
  throughput	
  methods	
  now	
  allow	
   researchers	
  to	
  search	
  for	
  regulatory	
  elements	
  on	
  a	
  genomic	
  scale.	
    1.4.1 Earlier	
  Methods	
   Some	
  of	
  the	
  data	
  utilized	
  in	
  this	
  thesis	
  derives	
  from	
  older	
  publications	
  that	
  identified	
   TFBS	
  and	
  cis-­‐regulatory	
  regions	
  using	
  arduous	
  laboratory	
  methods.	
  	
  Protein-­‐DNA	
   binding	
  can	
  be	
  detected	
  through	
  the	
  use	
  of	
  DNA	
  binding	
  assays,	
  of	
  which	
  there	
  are	
   several	
  variations	
  relevant	
  to	
  the	
  thesis.	
  Except	
  for	
  chromatin	
  immunoprecipitation,	
  the	
   following	
  techniques	
  are	
  most	
  commonly	
  performed	
  in	
  vitro	
  using	
  cells	
  or	
  cell	
  extracts	
   from	
  tissue	
  culture.	
   •  Electrophoretic	
  mobility	
  shift	
  assays	
  (EMSA)	
  detect	
  protein-­‐DNA	
  complexes	
  by	
   measuring	
  the	
  retardation	
  of	
  protein-­‐bound	
  DNA	
  in	
  gel	
  electrophoresis	
   compared	
  to	
  unbound	
  DNA	
  (Garner	
  and	
  Revzin	
  1981).	
  An	
  antibody	
  that	
  targets	
   the	
  protein	
  of	
  interest	
  can	
  be	
  added	
  to	
  create	
  a	
  larger	
  complex	
  for	
  better	
   separation,	
  resulting	
  in	
  ‘supershifting’	
  of	
  the	
  complex.	
  	
    •  DNase	
  I	
  footprinting	
  is	
  used	
  to	
  elucidate	
  the	
  actual	
  protein-­‐bound	
  DNA	
   sequences	
  given	
  a	
  larger	
  DNA	
  region	
  of	
  interest	
  (Brenowitz	
  et	
  al.	
  1986).	
  	
  A	
   dsDNA	
  fragment	
  is	
  labeled	
  at	
  one	
  end.	
  With	
  protein	
  bound	
  to	
  the	
  labeled	
  DNA,	
   appropriate	
  conditions	
  are	
  used	
  for	
  DNase	
  I	
  digestion	
  to	
  obtain	
  on	
  average	
  one	
   cut	
  per	
  DNA	
  molecule.	
  	
  The	
  resulting	
  product	
  is	
  separated	
  using	
  gel	
    	
    14  electrophoresis,	
  revealing	
  a	
  banding	
  pattern	
  of	
  the	
  DNA.	
  	
  At	
  the	
  site	
  of	
  protein	
   binding,	
  the	
  banding	
  pattern	
  is	
  disrupted	
  compared	
  to	
  DNA	
  digested	
  in	
  the	
   absence	
  of	
  protein.	
   •  Systematic	
  evolution	
  of	
  ligands	
  by	
  exponential	
  enrichment	
  (SELEX)	
  is	
  used	
  to	
   determine	
  the	
  sequence	
  of	
  DNA	
  bound	
  by	
  a	
  protein	
  of	
  interest	
  (Tuerk	
  and	
  Gold	
   1990).	
  Starting	
  with	
  a	
  pool	
  of	
  dsDNA	
  of	
  random	
  sequences,	
  those	
  molecules	
   bound	
  by	
  the	
  protein	
  are	
  purified	
  (using	
  one	
  of	
  a	
  variety	
  of	
  techniques)	
  and	
   amplified	
  using	
  PCR.	
  This	
  process	
  is	
  repeated	
  until	
  the	
  stringency	
  conditions	
  are	
   satisfied.	
  	
  The	
  resulting	
  DNA	
  is	
  sequenced.	
    •  Chromatin	
  immunoprecipitation	
  (ChIP)	
  is	
  a	
  method	
  of	
  determining	
  genomic	
   regions	
  bound	
  by	
  a	
  protein	
  of	
  interest	
  in	
  vivo	
  (Collas	
  2010).	
  After	
  cross-­‐linking	
   proteins	
  to	
  the	
  DNA,	
  the	
  chromatin	
  is	
  sheared.	
  	
  The	
  protein-­‐bound	
  sheared	
  DNA	
   fragments	
  are	
  immunoprecipitated	
  and	
  the	
  associated	
  DNAs	
  are	
  determined	
  (by	
   sequencing	
  or	
  PCR).	
    •  Reporter	
  gene	
  assays	
  are	
  employed	
  when	
  one	
  needs	
  to	
  determine	
  if	
  a	
  potential	
   cis-­‐regulatory	
  region	
  of	
  interest	
  can	
  drive	
  gene	
  expression.	
  Reporter	
  genes	
  are	
   chosen	
  based	
  on	
  their	
  ability	
  to	
  produce	
  measurable	
  phenotypes,	
  including	
   green	
  fluorescent	
  protein	
  (GFP),	
  which	
  fluoresces	
  under	
  ultraviolet	
  light,	
  and	
   luciferase,	
  which	
  gives	
  off	
  light	
  when	
  it	
  reacts	
  with	
  luciferin.	
  A	
  reporter	
  construct	
   is	
  produced	
  by	
  fusing	
  the	
  putative	
  regulatory	
  region,	
  a	
  minimal	
  promoter	
  if	
  not	
   included	
  in	
  the	
  former,	
  and	
  the	
  reporter	
  gene.	
  	
  The	
  construct	
  is	
  introduced	
  to	
  a	
   cell	
  or	
  transgenic	
  organism	
  and	
  the	
  resulting	
  reporter	
  gene	
  expression	
   measured.	
    1.4.2 High	
  Throughput	
  Methods	
   The	
  development	
  of	
  DNA	
  microarray	
  technology	
  along	
  with	
  the	
  availability	
  of	
  genome	
   sequences	
  has	
  brought	
  significant	
  advances	
  in	
  the	
  field	
  of	
  gene	
  expression	
  analysis.	
  This	
   technology	
  has	
  been	
  applied	
  to	
  DNA-­‐protein	
  binding	
  analysis	
  as	
  well,	
  bringing	
   considerable	
  improvements	
  in	
  regulatory	
  region	
  identification.	
  It	
  became	
  possible	
  to	
   	
    15  identify	
  functional	
  noncoding	
  elements	
  on	
  a	
  genomic	
  scale	
  in	
  the	
  lab.	
  The	
  debut	
  of	
  high-­‐ throughput	
  sequencing	
  technologies	
  has	
  brought	
  new	
  advances	
  in	
  the	
  analysis	
   methods,	
  already	
  starting	
  to	
  supplant	
  the	
  microarray-­‐based	
  methods.	
    1.4.2.1 ChIP-­‐chip	
  and	
  ChIP-­‐seq	
   By	
  combining	
  chromatin	
  immunoprecipitation	
  (ChIP)	
  with	
  tiled	
  DNA	
  microarrays	
   (chip),	
  ChIP-­‐chip	
  analysis	
  can	
  identify	
  in	
  vivo	
  DNA-­‐protein	
  interactions	
  and	
  chromatin	
   modifications	
  on	
  a	
  genomic	
  scale	
  (Aparicio	
  et	
  al.	
  2004).	
  In	
  this	
  method,	
  chromatin	
   immunoprecipitation	
  (as	
  explained	
  in	
  section	
  1.4.1)	
  is	
  followed	
  by	
  tiling	
  DNA	
   microarray	
  analysis	
  of	
  the	
  purified	
  protein-­‐bound	
  DNA	
  fragments.	
  ChIP-­‐chip	
   experiments	
  have	
  been	
  used	
  to	
  identify	
  various	
  types	
  of	
  regulatory	
  regions,	
  including	
   enhancers,	
  silencing	
  elements,	
  and	
  insulators	
  (Buck	
  and	
  Lieb	
  2004).	
  They	
  have	
  also	
   been	
  used	
  to	
  examine	
  the	
  distribution	
  of	
  histone	
  modifications	
  and	
  their	
  effects	
  on	
  gene	
   regulation	
  (Huebert	
  et	
  al.	
  2006).	
  The	
  development	
  of	
  this	
  technology	
  has	
  allowed	
   researchers	
  to	
  embark	
  on	
  efforts	
  to	
  catalogue	
  and	
  identify	
  how	
  the	
  distribution	
  of	
  these	
   functional	
  elements	
  and	
  epigenetic	
  marks	
  vary	
  according	
  to	
  cell	
  types	
  and	
  physiological	
   conditions.	
   However,	
  ChIP-­‐chip	
  technology	
  suffers	
  from	
  a	
  number	
  of	
  limitations.	
  The	
  availability	
  of	
   high-­‐quality	
  antibodies	
  used	
  for	
  the	
  immunoprecipitation	
  is	
  a	
  limiting	
  factor.	
   Furthermore,	
  there	
  is	
  a	
  high	
  cost	
  for	
  the	
  experiments	
  due	
  to	
  the	
  expense	
  of	
  tiled	
  DNA	
   microarrays.	
  Because	
  ChIP-­‐chip	
  uses	
  tiling	
  arrays,	
  it	
  requires	
  large	
  sets	
  of	
  arrays	
  with	
   narrow	
  probe	
  spacing	
  to	
  achieve	
  high	
  resolution	
  and	
  precision,	
  but	
  the	
  costs	
  of	
  such	
   arrays	
  are	
  limiting.	
  Due	
  to	
  the	
  challenges	
  of	
  the	
  hybridization	
  procedures,	
  the	
   microarray	
  experiments	
  require	
  multiple	
  replications	
  that	
  add	
  to	
  the	
  expense.	
  	
  The	
  DNA	
   fragments	
  obtained	
  from	
  the	
  chromatin	
  immunoprecipitation	
  step	
  are	
  hundreds	
  of	
  base	
   pairs	
  long,	
  as	
  the	
  sonication	
  process	
  for	
  breaking	
  up	
  DNA	
  can	
  only	
  achieve	
  a	
  minimum	
   size	
  of	
  ~200	
  bp.	
  Because	
  tiling	
  arrays	
  are	
  used,	
  the	
  resolution	
  of	
  experiments	
  depends	
   on	
  the	
  coverage	
  provided	
  by	
  the	
  probes.	
  As	
  with	
  microarray	
  experiments	
  for	
  gene	
   expression,	
  the	
  statistical	
  analysis	
  of	
  the	
  data	
  generated	
  remains	
  a	
  challenge.	
  	
   	
    16  ChIP-­‐Seq	
  utilizes	
  a	
  different	
  detection	
  method	
  than	
  ChIP-­‐chip.	
  	
  Microarrays	
  are	
   supplanted	
  by	
  high-­‐throughput	
  sequencing	
  of	
  the	
  purified	
  DNA	
  fragments	
  (Johnson	
  et	
   al.	
  2007).	
  ChIP-­‐Seq	
  offers	
  better	
  resolution	
  and	
  precision	
  than	
  ChIP-­‐chip.	
  ChIP-­‐Seq	
  can	
   produce	
  highly	
  precise	
  results,	
  specifying	
  TFBSs	
  within	
  tens	
  of	
  bases	
  from	
  the	
  actual	
   protein-­‐binding	
  site.	
  With	
  ChIP-­‐Seq,	
  protein-­‐DNA	
  binding	
  affinity	
  can	
  be	
  quantified	
  with	
   the	
  tag	
  densities	
  at	
  the	
  binding	
  sites,	
  making	
  it	
  easier	
  to	
  compare	
  the	
  relative	
  affinities	
   at	
  different	
  binding	
  sites.	
  One	
  can	
  increase	
  the	
  sensitivity	
  of	
  ChIP-­‐Seq	
  as	
  required	
  by	
   increasing	
  the	
  sequencing	
  depth,	
  which	
  is	
  not	
  possible	
  with	
  ChIP-­‐chip.	
  Since	
  the	
   introduction	
  of	
  this	
  technology,	
  there	
  has	
  been	
  a	
  sharp	
  increase	
  in	
  the	
  number	
  of	
   regulatory	
  region	
  data	
  sets	
  generated	
  (Jothi	
  et	
  al.	
  2008;	
  Cao	
  et	
  al.	
  2010).	
    1.4.2.2 Cap	
  Analysis	
  Gene	
  Expression	
  (CAGE)	
   Cap	
  analysis	
  gene	
  expression	
  makes	
  use	
  of	
  the	
  principles	
  underlying	
  SAGE	
  for	
   identification	
  of	
  transcriptional	
  starting	
  points	
  (TSPs)	
  (Shiraki	
  et	
  al.	
  2003).	
  SAGE	
  in	
  its	
   original	
  form	
  cannot	
  be	
  used	
  to	
  identify	
  the	
  5’	
  TSPs,	
  as	
  the	
  tags	
  used	
  are	
  taken	
  from	
  the	
   3’	
  ends	
  of	
  transcripts.	
  For	
  CAGE,	
  full-­‐length	
  cDNAs	
  are	
  selected	
  using	
  “biotinylated	
  cap-­‐ trapper”	
  technology	
  (Carninci	
  et	
  al.	
  1996).	
  	
  Linker	
  DNA	
  oligonucleotides	
  containing	
   restriction	
  enzyme	
  sites	
  and	
  biotin	
  at	
  the	
  5’	
  ends	
  are	
  attached	
  at	
  the	
  5’	
  ends	
  of	
  cDNAs.	
   The	
  modified	
  linker-­‐cDNAs	
  are	
  separated	
  using	
  magnetic	
  beads,	
  purified	
  and	
  amplified	
   with	
  PCR.	
  The	
  linker-­‐cDNAs	
  are	
  then	
  cleaved	
  with	
  restriction	
  enzymes,	
  resulting	
  in	
  32	
   bp	
  5’	
  fragments	
  (20	
  bp	
  without	
  the	
  linker).	
  The	
  fragments	
  are	
  again	
  separated	
  using	
   magnetic	
  beads,	
  and	
  are	
  subsequently	
  cut	
  from	
  biotin,	
  concatenated	
  and	
  sequenced	
  as	
   for	
  the	
  SAGE	
  method.	
  CAGE	
  has	
  been	
  used	
  by	
  the	
  FANTOM	
  2	
  and	
  FANTOM	
  3	
   Consortiums	
  for	
  transcript	
  analysis	
  in	
  mammalian	
  genomes	
  (Okazaki	
  et	
  al.	
  2002;	
   Carninci	
  et	
  al.	
  2005).	
  An	
  extension	
  of	
  this	
  technique	
  called	
  deepCAGE,	
  which	
  combines	
   CAGE	
  with	
  deep	
  sequencing,	
  is	
  being	
  used	
  by	
  the	
  FANTOM4	
  Consortium	
  to	
  study	
  the	
   dynamics	
  of	
  TSPs	
  during	
  cellular	
  differentiation	
  process	
  (de	
  Hoon	
  and	
  Hayashizaki	
   2008).	
    	
    17  1.4.2.3 Protein	
  Binding	
  Microarray	
  (PBM)	
   Protein	
  binding	
  microarrays	
  (PBMs)	
  are	
  a	
  rapid,	
  high-­‐throughput	
  method	
  based	
  on	
   DNA	
  microarrays	
  for	
  DNA-­‐protein	
  binding	
  identification	
  in	
  vitro	
  (Berger	
  et	
  al.	
  2006).	
  In	
   this	
  method,	
  purified	
  epitope-­‐tagged	
  proteins	
  are	
  added	
  to	
  microarrays	
  spotted	
  with	
   double-­‐stranded	
  DNAs,	
  and	
  the	
  protein-­‐bound	
  DNA	
  probes	
  are	
  identified	
  by	
   fluorescently	
  labeled	
  antibodies	
  to	
  the	
  epitope.	
  Alternatively,	
  proteins	
  directly	
  labeled	
   with	
  fluorophore	
  can	
  be	
  used,	
  removing	
  the	
  antibody	
  treatment	
  step	
  (Bulyk	
  2007).	
  The	
   dsDNA	
  probes	
  can	
  be	
  constructed	
  from	
  primer	
  extension,	
  self-­‐hairpinning,	
  or	
  PCR	
  of	
   genomic	
  regions.	
  After	
  the	
  labeling	
  step,	
  the	
  microarrays	
  are	
  scanned	
  for	
  fluorescence,	
   and	
  the	
  normalized	
  values	
  are	
  used	
  to	
  identify	
  the	
  DNA	
  sequences	
  to	
  which	
  the	
  protein	
   binds.	
  	
  As	
  a	
  control,	
  another	
  set	
  of	
  microarrays	
  is	
  stained	
  with	
  SybrGreen	
  I,	
  which	
  is	
   specific	
  for	
  dsDNA.	
  Since	
  PBM	
  is	
  an	
  in	
  vitro	
  method,	
  no	
  prior	
  knowledge	
  of	
  cellular	
   conditions	
  for	
  in	
  vivo	
  TF	
  binding	
  is	
  required.	
  PBMs	
  can	
  also	
  provide	
  comprehensive	
   data	
  on	
  the	
  effects	
  of	
  DNA	
  sequence	
  variants	
  on	
  protein	
  binding	
  preference.	
  Berger	
  et	
   al.	
  determined	
  168	
  mouse	
  homeodomain	
  TF	
  binding	
  profiles	
  (Berger	
  et	
  al.	
  2008)	
  and	
   Noyes	
  et	
  al.	
  evaluated	
  the	
  84	
  Drosophila	
  homeodomain	
  TF	
  binding	
  profiles	
  using	
  PBMs	
   (Noyes	
  et	
  al.	
  2008).	
    1.5 Comparative	
  Genomics	
   The	
  biological	
  knowledge	
  gained	
  from	
  studying	
  model	
  organisms	
  brings	
  meaningful	
   benefits	
  to	
  understanding	
  human	
  physiology.	
  Comparative	
  genomics	
  facilitates	
  this	
   transfer,	
  in	
  part,	
  through	
  the	
  analysis	
  of	
  genomic	
  elements	
  that	
  are	
  under	
  evolutionary	
   constraint	
  across	
  diverged	
  species.	
  The	
  identification	
  of	
  orthologous	
  genes	
  is	
  a	
  vital	
   part	
  of	
  this	
  process,	
  as	
  it	
  provides	
  a	
  framework	
  to	
  infer	
  functionality	
  of	
  experimentally	
   undetermined	
  genes.	
  The	
  availability	
  of	
  multiple	
  genomes	
  of	
  organisms	
  at	
  diverse	
   branches	
  of	
  the	
  phylogenetic	
  tree	
  offers	
  an	
  unprecedented	
  opportunity	
  to	
  reconstruct	
   the	
  shared	
  history	
  of	
  evolution	
  of	
  functional	
  genomic	
  elements.	
  As	
  the	
  goal	
  of	
  this	
  thesis	
    	
    18  is	
  to	
  compare	
  and	
  identify	
  the	
  conserved	
  regulatory	
  programs,	
  comparative	
  genomics	
  is	
   the	
  central	
  strategy	
  behind	
  the	
  bioinformatics	
  framework	
  that	
  I	
  implement.	
  	
    1.5.1 Homology:	
  Orthologs	
  and	
  Paralogs	
   Gene	
  evolution	
  can	
  take	
  place	
  via:	
  1)	
  speciation	
  with	
  modification,	
  2)	
  gene	
  duplication	
   followed	
  by	
  speciation	
  with	
  modification,	
  3)	
  gene	
  loss,	
  4)	
  horizontal	
  gene	
  transfer,	
  and	
   5)	
  fusion,	
  fission	
  and	
  other	
  gene	
  rearrangements	
  (Koonin	
  2005).	
  Genes	
  that	
  share	
   common	
  descent	
  are	
  referred	
  to	
  as	
  homologs,	
  which	
  can	
  be	
  divided	
  into	
  orthologs	
  and	
   paralogs.	
  Orthologs	
  are	
  genes	
  separated	
  only	
  by	
  speciation	
  events,	
  whereas	
  the	
   separation	
  of	
  paralogs	
  includes	
  duplication	
  events.	
  However,	
  homologs	
  cannot	
  always	
   be	
  classified	
  into	
  these	
  classes,	
  due	
  to	
  complex	
  mixtures	
  of	
  speciation,	
  gene	
  duplication,	
   gene	
  rearrangements,	
  gene	
  loss	
  and	
  horizontal	
  gene	
  transfers.	
  	
   On	
  a	
  formal	
  level,	
  orthologs	
  are	
  more	
  consistently	
  defined	
  than	
  paralogs.	
  	
  Orthologs	
  are	
   “genes	
  derived	
  from	
  a	
  single	
  ancestral	
  gene	
  in	
  the	
  last	
  common	
  ancestor	
  of	
  the	
   compared	
  species”	
  (Koonin	
  2005).	
  It	
  is	
  important	
  to	
  distinguish	
  between	
  the	
  orthologs	
   and	
  paralogs,	
  because	
  a	
  common,	
  but	
  not	
  required,	
  property	
  of	
  orthologs	
  is	
  that	
  they	
   perform	
  equivalent	
  functions	
  in	
  the	
  compared	
  species,	
  whereas	
  paralogs	
  are	
  more	
   likely	
  to	
  perform	
  related	
  but	
  distinct	
  roles.	
  This	
  does	
  not	
  mean	
  that	
  all	
  orthologs	
  from	
   the	
  compared	
  species	
  have	
  one-­‐to-­‐one	
  relationships,	
  as	
  a	
  lineage-­‐specific	
  gene	
   duplication	
  event	
  can	
  take	
  place	
  subsequent	
  to	
  a	
  speciation	
  event.	
  Inparalogs	
  result	
   from	
  a	
  duplication	
  event	
  following	
  a	
  speciation	
  event,	
  and	
  outparalogs	
  result	
  from	
  a	
   duplication	
  event	
  preceding	
  a	
  speciation	
  event.	
  The	
  inparalogs	
  derived	
  from	
  a	
  single	
   ancestral	
  gene	
  are	
  still	
  orthologs	
  of	
  that	
  gene.	
  Moreover,	
  gene	
  rearrangements	
  can	
   bring	
  more	
  complications	
  to	
  the	
  orthology	
  classification,	
  as	
  different	
  parts	
  of	
  a	
  gene	
  in	
   one	
  species	
  may	
  be	
  orthologous	
  to	
  multiple	
  genes	
  in	
  another	
  species.	
  	
   	
    	
    19  1.5.2 Sequence	
  Alignment	
   An	
  early	
  method	
  for	
  the	
  identification	
  of	
  homology	
  is	
  pair-­‐wise	
  sequence	
  alignment	
  of	
   protein	
  or	
  nucleic	
  acid	
  sequences	
  between	
  two	
  species.	
  Two	
  main	
  alignment	
  classes	
   were	
  developed:	
  (i)	
  local	
  alignment	
  methods,	
  which	
  identify	
  conserved	
  segments	
  within	
   longer	
  sequences	
  (Smith	
  and	
  Waterman	
  1981)	
  and	
  (ii)	
  global	
  alignment	
  methods,	
   which	
  align	
  two	
  sequences	
  across	
  the	
  entire	
  length	
  (Needleman	
  and	
  Wunsch	
  1970).	
  	
   Heuristic	
  derivatives	
  of	
  the	
  classic	
  methods,	
  such	
  as	
  BLAST,	
  have	
  dominated	
   bioinformatics	
  in	
  the	
  genome	
  era	
  (Altschul	
  et	
  al.	
  1990).	
  Recently	
  developed	
  pair-­‐wise	
   alignment	
  programs	
  include	
  LAGAN	
  (Brudno	
  et	
  al.	
  2003a)	
  and	
  ORCA	
  (developed	
  in	
   Wasserman	
  lab).	
  	
   Multiple	
  sequence	
  alignment	
  methods	
  have	
  been	
  available	
  for	
  decades	
  (Chenna	
  et	
  al.	
   2003).	
  	
  With	
  the	
  release	
  of	
  genomes	
  for	
  multiple	
  species,	
  much	
  effort	
  has	
  been	
  made	
  to	
   improve	
  the	
  performance	
  of	
  the	
  multiple	
  sequence	
  alignment	
  (MSA)	
  algorithms.	
  	
  The	
   most	
  widely	
  utilized	
  MSA	
  algorithms	
  are	
  based	
  on	
  the	
  “glocal”	
  alignment	
  strategy,	
   which	
  is	
  a	
  hybrid	
  of	
  local	
  and	
  global	
  alignment	
  methods,	
  combined	
  with	
  a	
  progressive	
   alignment	
  approach	
  (Brudno	
  et	
  al.	
  2003b).	
  With	
  the	
  glocal	
  strategy,	
  sequences	
  are	
  first	
   processed	
  with	
  a	
  local	
  aligner,	
  and	
  the	
  identified	
  local	
  alignment	
  blocks	
  are	
   concatenated	
  to	
  form	
  longer	
  alignments.	
  Starting	
  with	
  the	
  sequences	
  from	
  the	
  two	
  most	
   closely	
  related	
  species	
  according	
  to	
  a	
  given	
  phylogenetic	
  tree,	
  pairwise	
  alignment	
  is	
   performed,	
  after	
  which	
  the	
  sequence	
  from	
  the	
  next	
  species	
  in	
  the	
  order	
  of	
  the	
  tree	
  is	
   added,	
  resulting	
  in	
  a	
  progressive	
  construction	
  of	
  multiple	
  alignment.	
  A	
  review	
  of	
  the	
   current	
  alignment	
  strategies	
  is	
  given	
  in	
  (Kumar	
  and	
  Filipski	
  2007).	
    1.5.3 Evolutionary	
  Constraint	
  Measures	
   With	
  multiple	
  sequence	
  alignments	
  and	
  a	
  species	
  distance	
  tree	
  available,	
  one	
  can	
   quantify	
  the	
  observed	
  evolutionary	
  constraint	
  on	
  either	
  the	
  individual	
  or	
  fixed-­‐size	
   blocks	
  of	
  nucleotides.	
  Examples	
  of	
  widely	
  used	
  evolutionary	
  constraint	
  measures	
   include	
  phastCons	
  and	
  PhyloP	
  from	
  the	
  PHAST	
  package	
  (Hubisz	
  et	
  al.	
  2011).	
  PhastCons	
   	
   20  models	
  the	
  evolutionary	
  process	
  based	
  on	
  the	
  phylogenetic	
  hidden	
  Markov	
  model	
   (phylo-­‐HMM)	
  (Fan	
  et	
  al.	
  2007).	
  With	
  this	
  statistical	
  model,	
  phastCons	
  can	
  account	
  for	
   multiple	
  substitutions	
  at	
  each	
  nucleotide	
  site	
  and	
  unequal	
  substitution	
  rates	
  between	
   different	
  pairs	
  of	
  bases	
  when	
  estimating	
  the	
  probability	
  that	
  each	
  nucleotide	
  is	
  under	
   evolutionary	
  constraint.	
  PhyloP,	
  which	
  computes	
  the	
  p-­‐values	
  of	
  conservation	
  based	
  on	
   a	
  neutral	
  model	
  of	
  evolution,	
  differs	
  from	
  phastCons	
  in	
  that	
  it	
  considers	
  each	
  nucleotide	
   individually,	
  ignoring	
  the	
  effects	
  from	
  the	
  neighbouring	
  bases.	
  PhastCons	
  is	
  more	
   suitable	
  for	
  detecting	
  conserved	
  elements,	
  whereas	
  PhyloP	
  is	
  better	
  suited	
  for	
   evaluating	
  signatures	
  at	
  particular	
  nucleotides	
  or	
  classes	
  of	
  nucleotides.	
  Scores	
  from	
   both	
  methods	
  are	
  available	
  from	
  the	
  UCSC	
  Genome	
  Browser	
  for	
  multiple	
  species,	
  and	
   are	
  used	
  in	
  several	
  stages	
  of	
  the	
  thesis	
  research	
  that	
  follows.	
    1.6 Public	
  Repositories	
  of	
  Genomic	
  Data	
   The	
  development	
  of	
  technologies	
  that	
  have	
  enabled	
  researchers	
  to	
  sequence	
  entire	
   genomes	
  of	
  organisms	
  has	
  brought	
  revolutionary	
  changes	
  to	
  the	
  field	
  of	
  biology.	
  With	
   the	
  sequence	
  information	
  stored	
  in	
  public	
  databases,	
  researchers	
  can	
  now	
  retrieve	
  the	
   sequences	
  of	
  genomic	
  regions	
  of	
  their	
  interest	
  within	
  seconds.	
  The	
  availability	
  of	
   multiple	
  genomes	
  has	
  enabled	
  detailed	
  analysis	
  of	
  homology	
  among	
  different	
  genes.	
   The	
  increasing	
  amount	
  of	
  data	
  accumulated	
  by	
  researchers	
  around	
  the	
  world	
  is	
  added	
   back	
  to	
  the	
  public	
  repositories	
  as	
  genomic	
  annotations,	
  further	
  promoting	
  the	
  sharing	
   of	
  knowledge	
  and	
  accelerating	
  research.	
  As	
  with	
  many	
  other	
  bioinformatics	
  projects,	
   the	
  work	
  performed	
  in	
  this	
  thesis	
  heavily	
  relies	
  on	
  the	
  public	
  genomic	
  data.	
    1.6.1 Public	
  Genome	
  Databases	
   The	
  main	
  public	
  repositories	
  such	
  as	
  NCBI	
  and	
  Ensembl	
  offer	
  the	
  latest	
  versions	
  of	
  the	
   genomes	
  of	
  species	
  that	
  span	
  diverse	
  taxonomic	
  groups,	
  along	
  with	
  the	
  annotations	
  for	
   functional	
  genomic	
  elements	
  and	
  comparative	
  genomics	
  data.	
  As	
  of	
  version	
  59,	
  Ensembl	
   contains	
  genomic	
  annotations	
  for	
  56	
  species	
  from	
  various	
  taxonomic	
  groups	
  (Flicek	
  et	
   	
    21  al.	
  2010).	
  Other	
  public	
  genome	
  analysis	
  systems	
  such	
  as	
  the	
  UCSC	
  Genome	
   Bioinformatics	
  Site	
  provide	
  rich	
  computational	
  and	
  experimental	
  annotations	
  (Kent	
  et	
   al.	
  2002).	
  There	
  are	
  species-­‐specific	
  public	
  databases	
  as	
  well,	
  including	
  Saccharomyces	
   Genome	
  Database	
  (SGD),	
  WormBase,	
  FlyBase,	
  and	
  Mouse	
  Genome	
  Informatics	
  (MGI)	
   (Cherry	
  et	
  al.	
  1998;	
  Tweedie	
  et	
  al.	
  2009;	
  Harris	
  et	
  al.	
  2010).	
  These	
  databases	
  contain	
   detailed	
  annotations	
  that	
  are	
  specific	
  to	
  the	
  respective	
  species	
  that	
  might	
  otherwise	
  be	
   missing	
  from	
  more	
  general	
  repositories.	
  Each	
  public	
  repository	
  adds	
  a	
  unique	
  set	
  of	
   features	
  and	
  tools,	
  offering	
  the	
  users	
  with	
  a	
  wealth	
  of	
  choice	
  in	
  their	
  research.	
   Available	
  data	
  can	
  be	
  readily	
  obtained	
  from	
  these	
  genome	
  annotation	
  systems,	
  and	
   open-­‐source	
  software	
  facilitates	
  data	
  mining	
  of	
  the	
  information	
  for	
  computational	
   biologists.	
  BioMart	
  offers	
  a	
  convenient,	
  centralized	
  gateway	
  to	
  multiple	
  repositories,	
   providing	
  the	
  users	
  with	
  powerful	
  data	
  mining	
  tools	
  in	
  its	
  interface	
  (Haider	
  et	
  al.	
  2009).	
   Another	
  benefit	
  provided	
  by	
  Ensembl	
  is	
  the	
  comprehensive	
  application	
  programming	
   interface	
  (API)	
  based	
  on	
  BioPerl,	
  an	
  open	
  source	
  library	
  of	
  bioinformatics	
  analysis	
   modules	
  for	
  the	
  Perl	
  programming	
  language	
  (Stajich	
  et	
  al.	
  2002).	
  This	
  greatly	
  facilitates	
   computational	
  analysis	
  of	
  the	
  data	
  available	
  at	
  Ensembl,	
  as	
  it	
  offers	
  a	
  well	
  defined	
  and	
   very	
  flexible	
  set	
  of	
  tools	
  for	
  easy	
  manipulation	
  of	
  the	
  data.	
  For	
  regulatory	
  region	
   analysis,	
  the	
  TFBS	
  Perl	
  module	
  library	
  allows	
  for	
  easy	
  manipulation	
  of	
  TFBS	
  data	
   (Lenhard	
  and	
  Wasserman	
  2002).	
  The	
  UCSC	
  Genome	
  Browser	
  allows	
  users	
  direct	
  access	
   to	
  the	
  data	
  contained	
  in	
  their	
  databases,	
  and	
  its	
  Galaxy	
  suite	
  of	
  bioinformatics	
  tool	
  set	
  is	
   emerging	
  as	
  another	
  popular	
  alternative	
  for	
  genome	
  data	
  mining	
  (Goecks	
  et	
  al.	
  2010).	
  	
    1.6.2 Ortholog	
  Databases	
   The	
  identification	
  of	
  orthologs	
  across	
  species	
  is	
  essential	
  for	
  comparative	
  genomics.	
   There	
  have	
  been	
  numerous	
  large-­‐scale	
  efforts	
  at	
  building	
  a	
  public	
  repository	
  for	
  multi-­‐ species	
  ortholog	
  annotation,	
  which	
  started	
  with	
  single-­‐cell	
  organisms	
  and	
  then	
   expanded	
  to	
  multiple	
  vertebrate	
  species.	
    	
    22  The	
  earliest	
  effort	
  at	
  building	
  an	
  orthology	
  database	
  was	
  the	
  Clusters	
  of	
  Orthologous	
   Groups	
  (COGs)	
  (Tatusov	
  et	
  al.	
  2001).	
  The	
  central	
  clustering	
  algorithm	
  behind	
  the	
   system	
  is	
  based	
  on	
  best	
  reciprocal	
  BLAST	
  matches	
  among	
  three	
  species.	
  Briefly,	
  the	
   protein	
  sequence	
  collections	
  (proteomes)	
  to	
  be	
  clustered	
  are	
  filtered	
  to	
  remove	
   ‘promiscuous’	
  domains,	
  which	
  are	
  widespread	
  and	
  generally	
  repetitive	
  in	
  nature.	
  The	
   filtered	
  proteomes	
  are	
  then	
  searched	
  for	
  consistent	
  best	
  reciprocal	
  hits	
  among	
  three	
   species	
  using	
  gapped	
  BLAST,	
  which	
  are	
  classified	
  as	
  seed	
  clusters.	
  The	
  seed	
  clusters	
  are	
   analyzed	
  case	
  by	
  case	
  for	
  false-­‐positives,	
  only	
  retaining	
  those	
  that	
  share	
  conserved	
   protein	
  domains	
  (“core”	
  domains).	
  The	
  sequences	
  containing	
  promiscuous	
  domains	
  are	
   returned	
  to	
  the	
  set	
  and	
  assessed	
  for	
  the	
  presence	
  of	
  the	
  core	
  domains.	
  Initially,	
  COGs	
   was	
  built	
  for	
  unicellular,	
  mainly	
  prokaryotic	
  organisms,	
  but	
  the	
  system	
  was	
  extended	
  to	
   include	
  eukaryotic	
  species	
  (Tatusov	
  et	
  al.	
  2003).	
  This	
  eukaryotic	
  version	
  has	
  been	
   named	
  KOGs,	
  for	
  Eukaryotic	
  Orthologous	
  Groups.	
  	
   The	
  InParanoid	
  database	
  was	
  constructed	
  to	
  identify	
  true	
  orthologs	
  and	
  to	
  separate	
   paralogs	
  into	
  inparalogs	
  and	
  outparalogs	
  (O’Brien	
  et	
  al.	
  2005).	
  By	
  focusing	
  on	
  clustering	
   of	
  the	
  inparalogs,	
  one-­‐to-­‐one	
  and	
  one-­‐to-­‐many	
  orthology	
  relationships	
  could	
  be	
   captured.	
  Starting	
  with	
  the	
  proteomes	
  from	
  Ensembl	
  and	
  UniProt,	
  pair-­‐wise	
  similarity	
   scores	
  are	
  calculated	
  using	
  pair-­‐wise	
  sequence	
  alignments	
  from	
  BLAST.	
  Ortholog	
   clusters	
  of	
  non-­‐overlapping	
  sequences	
  are	
  constructed	
  by	
  first	
  establishing	
  the	
   sequences	
  from	
  two	
  species	
  with	
  the	
  best	
  reciprocal	
  scores.	
  	
  For	
  these	
  ortholog	
  pairs,	
   additional	
  inparalogs	
  are	
  added	
  separately	
  for	
  each	
  species.	
  Inparalogs	
  are	
  determined	
   based	
  on	
  the	
  assumption	
  that	
  sequences	
  from	
  the	
  same	
  species	
  that	
  are	
  more	
  similar	
  to	
   the	
  main	
  ortholog	
  than	
  to	
  any	
  sequences	
  in	
  the	
  other	
  species	
  would	
  belong	
  to	
  the	
  same	
   ortholog	
  cluster.	
  The	
  Inparanoid	
  database	
  first	
  started	
  with	
  a	
  handful	
  of	
  organisms	
   only,	
  but	
  has	
  gone	
  through	
  regular	
  expansions.	
  The	
  latest	
  version	
  7.0	
  (updated	
  June	
   2009)	
  includes	
  100	
  species	
  with	
  a	
  total	
  of	
  1,687,023	
  sequences.	
  	
  	
   	
    	
    	
    23  The	
  HomoloGene	
  system	
  from	
  the	
  National	
  Center	
  for	
  Biotechnology	
  Information	
   (NCBI)	
  differs	
  from	
  the	
  previous	
  two	
  databases	
  in	
  that	
  it	
  makes	
  use	
  of	
  a	
  taxonomic	
  tree	
   based	
  on	
  sequence	
  similarity	
  when	
  assigning	
  protein	
  sequences	
  to	
  orthologous	
  groups	
   (Wheeler	
  et	
  al.	
  2006).	
  The	
  sequences	
  from	
  closer	
  organisms	
  are	
  compared	
  first	
  using	
   BLASTP,	
  with	
  the	
  order	
  of	
  organism	
  comparison	
  following	
  the	
  order	
  of	
  the	
  tree	
   traversal	
  to	
  its	
  root.	
  The	
  system	
  also	
  takes	
  chromosomal	
  synteny	
  into	
  account	
  when	
   assigning	
  ortholog	
  groups.	
  Species-­‐specific	
  paralogs	
  are	
  determined	
  by	
  identifying	
   sequences	
  that	
  are	
  closer	
  within	
  a	
  given	
  species	
  than	
  other	
  species.	
  As	
  of	
  HomoloGene	
   Release	
  64,	
  there	
  are	
  20	
  species	
  with	
  ortholog	
  annotation.	
  In	
  human,	
  there	
  are	
  18,876	
   HomoloGene	
  groups	
  identified.	
  	
   Ensembl	
  Compara	
  is	
  built	
  based	
  on	
  best	
  reciprocal	
  hits	
  from	
  pairwise	
  species	
   comparisons	
  using	
  BLASTP,	
  but	
  the	
  results	
  are	
  processed	
  with	
  a	
  unique	
  collection	
  of	
   inference	
  tools	
  (Vilella	
  et	
  al.	
  2009).	
  Multiple	
  alignment	
  of	
  sequences	
  clustered	
  together	
   based	
  on	
  BLASTP	
  results	
  is	
  used	
  to	
  create	
  gene	
  trees,	
  and	
  together	
  with	
  a	
  species	
   taxonomy	
  tree,	
  the	
  clustered	
  sequences	
  are	
  re-­‐processed	
  to	
  create	
  higher	
  quality	
   groupings.	
  Orthologs	
  and	
  paralogs	
  are	
  classified	
  based	
  on	
  the	
  constructed	
  species	
  and	
   gene	
  trees.	
  As	
  part	
  of	
  the	
  Ensembl	
  database	
  collection,	
  Compara	
  is	
  updated	
  with	
  every	
   Ensembl	
  release.	
    1.6.3 Transcription	
  Factor	
  and	
  Regulatory	
  Region	
  Databases	
   Computational	
  analysis	
  of	
  regulatory	
  regions	
  depends	
  heavily	
  on	
  the	
  availability	
  of	
   reliable	
  TFBS	
  profiles	
  (details	
  on	
  TFBS	
  profile	
  computation	
  are	
  given	
  in	
  section	
  1.7.1).	
   Because	
  experimental	
  data	
  relevant	
  to	
  regulatory	
  region	
  analysis	
  including	
  validated	
   TFBS	
  sequences	
  have	
  been	
  produced	
  by	
  many	
  different	
  research	
  groups,	
  they	
  are	
   dispersed	
  in	
  a	
  disorganized	
  manner.	
  To	
  alleviate	
  this	
  problem,	
  a	
  number	
  of	
  public	
   repositories	
  of	
  TFBS	
  profile	
  data	
  have	
  been	
  set	
  up.	
  The	
  two	
  most	
  widely	
  used	
  databases	
   are	
  JASPAR	
  and	
  TRANSFAC	
  (Sandelin	
  et	
  al.	
  2004;	
  Matys	
  et	
  al.	
  2006).	
  JASPAR,	
  which	
   recently	
  has	
  gone	
  through	
  a	
  significant	
  expansion	
  (Portales-­‐Casamar	
  et	
  al.	
  2010),	
   	
    24  contains	
  high	
  quality	
  motif	
  models	
  for	
  vertebrates,	
  insects,	
  nematodes,	
  plants	
  and	
  yeast.	
   The	
  TFBS	
  matrix	
  models	
  in	
  JASPAR	
  are	
  annotated	
  with	
  the	
  target	
  TF,	
  species,	
  taxonomy	
   group,	
  TF	
  structural	
  classification,	
  and	
  links	
  to	
  the	
  original	
  experiments	
  in	
  which	
  the	
   sequence	
  data	
  was	
  obtained.	
  While	
  TRANSFAC	
  contains	
  a	
  higher	
  number	
  of	
  TFBS	
   profiles	
  than	
  JASPAR,	
  the	
  database	
  is	
  redundant	
  and	
  users	
  are	
  constrained	
  by	
  the	
   licensing	
  rules	
  of	
  the	
  commercial	
  provider.	
  	
  Therefore	
  within	
  this	
  thesis	
  the	
  JASPAR	
   collection	
  will	
  be	
  used	
  exclusively.	
  	
   There	
  are	
  additional	
  types	
  of	
  regulatory	
  region	
  repositories.	
  The	
  Open	
  REGulatory	
   ANNOtation	
  (ORegAnno)	
  system	
  is	
  a	
  database	
  for	
  the	
  curation	
  of	
  known	
  regulatory	
   elements	
  based	
  on	
  open	
  access	
  principles,	
  where	
  annotation	
  collected	
  from	
  literature	
  is	
   first	
  cross-­‐referenced	
  with	
  various	
  genome	
  databases	
  and	
  PubMED	
  and	
  then	
  recorded	
   (Montgomery	
  et	
  al.	
  2006).	
  PAZAR,	
  another	
  open	
  access	
  database	
  for	
  cataloguing	
   regulatory	
  regions,	
  allows	
  users	
  to	
  maintain	
  individual	
  regulatory	
  region	
  collections	
   and	
  distribute	
  them	
  within	
  a	
  larger	
  ‘information	
  mall’	
  format	
  (Portales-­‐Casamar	
  et	
  al.	
   2009).	
  DBD	
  is	
  a	
  database	
  of	
  TF	
  protein	
  sequences	
  organized	
  by	
  DNA-­‐binding	
  domains	
   (Wilson	
  et	
  al.	
  2008).	
  The	
  TF	
  structural	
  classification	
  used	
  by	
  JASPAR	
  is	
  in	
  part	
  based	
  on	
   the	
  categories	
  distinguished	
  within	
  DBD.	
  Drosophila	
  DNase	
  I	
  Footprint	
  Database	
  and	
   REDfly	
  are	
  important	
  sources	
  of	
  experimental	
  data	
  for	
  TFBSs	
  in	
  fruit	
  flies	
  (Bergman	
  et	
   al.	
  2005;	
  Gallo	
  et	
  al.	
  2006).	
  	
    1.7 Bioinformatics	
  Approaches	
  for	
  Regulatory	
  Region	
  Analysis	
   Because	
  experimental	
  methods	
  of	
  regulatory	
  region	
  identification	
  are	
  expensive	
  in	
   terms	
  of	
  both	
  time	
  and	
  cost,	
  much	
  effort	
  has	
  been	
  made	
  by	
  the	
  bioinformatics	
  research	
   community	
  to	
  develop	
  reliable	
  computational	
  prediction	
  method	
  for	
  cis-­‐regulatory	
   regions.	
  Unlike	
  the	
  success	
  seen	
  with	
  prediction	
  methods	
  for	
  protein	
  coding	
  regions,	
   searching	
  the	
  genome	
  sequence	
  for	
  TFBSs	
  has	
  been,	
  in	
  large	
  part,	
  frustrating	
  because	
   the	
  short	
  lengths	
  of	
  TFBSs	
  lead	
  to	
  unacceptably	
  high	
  false	
  positive	
  rates.	
  However,	
  with	
   the	
  growing	
  availability	
  of	
  high-­‐throughput	
  data	
  from	
  ChIP-­‐chip	
  and	
  ChIP-­‐Seq	
   	
    25  experiments	
  for	
  TF	
  identification,	
  combined	
  with	
  other	
  sources	
  of	
  data,	
  it	
  is	
  becoming	
   more	
  feasible	
  to	
  create	
  useful	
  computational	
  prediction	
  methods	
  for	
  regulatory	
  regions,	
   The	
  data	
  generated	
  from	
  these	
  experiments	
  can	
  provide	
  some	
  of	
  the	
  necessary	
   information	
  for	
  delineating	
  functional	
  regulatory	
  regions,	
  which	
  cannot	
  be	
  deciphered	
   from	
  genomic	
  sequences	
  alone.	
  	
    1.7.1 TFBS	
  Representation	
   The	
  de	
  facto	
  method	
  for	
  modeling	
  a	
  TFBS	
  is	
  based	
  upon	
  the	
  consensus	
  of	
  the	
  known	
   binding	
  sites	
  for	
  the	
  given	
  TF.	
  Binding	
  sites	
  are	
  aligned	
  and	
  the	
  frequency	
  of	
  nucleotides	
   in	
  each	
  column	
  of	
  the	
  alignment	
  is	
  used	
  to	
  specify	
  the	
  key	
  bases	
  that	
  the	
  TF	
  recognizes.	
   The	
  simplest	
  model	
  used	
  to	
  represent	
  a	
  TFBS	
  is	
  the	
  consensus	
  sequence,	
  where	
  each	
   nucleotide	
  position	
  is	
  recorded	
  by	
  a	
  letter	
  in	
  the	
  International	
  Union	
  of	
  Pure	
  and	
   Applied	
  Chemistry	
  (IUPAC)	
  ambiguity	
  codes	
  (Cornish-­‐Bowden	
  1985).	
  While	
  this	
   depiction	
  is	
  simple	
  and	
  compact,	
  it	
  cannot	
  quantitatively	
  represent	
  the	
  distribution	
  of	
   the	
  nucleotides	
  in	
  each	
  column	
  of	
  the	
  alignment.	
  As	
  TFs	
  can	
  generally	
  tolerate	
  some	
   variability	
  from	
  consensus	
  in	
  most	
  positions	
  of	
  the	
  binding	
  site,	
  the	
  loss	
  of	
  this	
   information	
  is	
  a	
  significant	
  disadvantage.	
  	
    1.7.1.1 Matrix	
  Model	
   The	
  most	
  common	
  method	
  of	
  modeling	
  a	
  TFBS	
  is	
  the	
  position	
  specific	
  scoring	
  matrix	
   (PSSM),	
  frequently	
  called	
  the	
  position	
  weight	
  matrix	
  (PWM).	
  The	
  matrix	
  model	
  reflects	
   the	
  likelihood	
  of	
  finding	
  a	
  particular	
  base	
  at	
  each	
  position	
  of	
  the	
  binding	
  region	
  (Stormo	
   2000).	
  A	
  well-­‐defined	
  matrix	
  produces	
  scores	
  that	
  directly	
  correlate	
  with	
  the	
  binding	
   energies	
  between	
  the	
  TF	
  and	
  its	
  binding	
  site	
  (Stormo	
  and	
  Fields	
  1998).	
  The	
  process	
  of	
   constructing	
  a	
  PSSM	
  given	
  a	
  set	
  of	
  TFBSs	
  is	
  given	
  in	
  Figure	
  1-­‐4.	
  The	
  first	
  step	
  is	
  to	
  create	
   a	
  position	
  frequency	
  matrix	
  (PFM)	
  from	
  the	
  alignment	
  of	
  binding	
  sites,	
  where	
  the	
   number	
  of	
  each	
  nucleotide	
  is	
  enumerated	
  at	
  each	
  position.	
  A	
  PSSM	
  is	
  calculated	
  from	
  a	
    	
    26  PFM	
  by	
  dividing	
  the	
  nucleotide	
  empirical	
  probabilities	
  by	
  the	
  expected	
  background	
   probabilities	
  and	
  converting	
  the	
  quotient	
  to	
  log	
  scale.	
  	
   The	
  PSSM	
  value	
  of	
  base	
  b	
  at	
  column	
  i,	
  Wb,i,,	
  is	
  calculated	
  by:	
    Wb ,i = log 2  p (b, i ) 	
   p (b)  	
    (1)	
    where	
  p(b,i)	
  is	
  the	
  corrected	
  probability	
  of	
  base	
  b	
  found	
  at	
  column	
  i,	
  and	
  p(b)	
  is	
  the	
   background	
  probability	
  of	
  base	
  b.	
  The	
  corrected	
  probability	
  p(b,i)	
  is	
  computed	
  by:	
    p (b, i ) =  f b ,i + s (b) N+  ∑ s(b)  	
    (2)	
    b∈{ A,C ,G ,T }  where	
  fb,i	
  is	
  the	
  frequency	
  of	
  base	
  b	
  at	
  column	
  i	
  and	
  N	
  is	
  the	
  sum	
  of	
  all	
  binding	
  site	
   sequences	
  in	
  the	
  alignment,	
  and	
  s(b)	
  is	
  a	
  pseudocount	
  correction	
  function	
  used	
  to	
   correct	
  for	
  small	
  sample	
  sizes.	
  There	
  are	
  a	
  number	
  of	
  pseudocount	
  correction	
  functions	
   available,	
  ranging	
  from	
  simply	
  adding	
  a	
  small	
  non-­‐zero	
  constant	
  to	
  fitting	
  a	
  Dirichlet	
   prior	
  to	
  all	
  known	
  PFMs	
  (Nishida	
  et	
  al.	
  2009).	
   A	
  PFM	
  is	
  often	
  visualized	
  as	
  a	
  sequence	
  logo,	
  where	
  the	
  frequency	
  of	
  nucleotides	
   observed	
  in	
  each	
  column	
  is	
  used	
  to	
  calculate	
  information	
  content	
  (in	
  bits).	
  Each	
   nucleotide	
  occurrence	
  at	
  each	
  position	
  is	
  scaled	
  according	
  to	
  the	
  total	
  information	
   content	
  in	
  that	
  column.	
  The	
  probability	
  values	
  in	
  (2)	
  can	
  be	
  used	
  to	
  calculate	
  the	
   information	
  content	
  at	
  each	
  column	
  by:	
    Di =2+    pb,i log 2 pb,i 	
    (3)	
    b∈ A,C,G,T  where	
  Di	
  is	
  the	
  information	
  content	
  at	
  position	
  i.	
    	
    27  Figure 1-4. Transcription factor binding site (TFBS) and position-specific scoring matrix (PSSM). A PSSM is constructed by counting the base frequencies at each binding site position and converting them to log odds ratios. The PSSM is used to scan the sequence segment containing the binding site in question, which will score higher than the rest of the sequence. The sequence ‘CTACGTTTAT’ receives the score of 14.0 in this example.  	
   To	
  search	
  for	
  TFBSs,	
  sequences	
  are	
  scanned	
  with	
  the	
  PSSM	
  for	
  the	
  TFBS	
  of	
  interest,	
  and	
   the	
  score	
  at	
  each	
  sequence	
  position	
  is	
  calculated	
  by	
  summing	
  the	
  PSSM	
  values.	
   However,	
  because	
  these	
  binding	
  sites	
  are	
  so	
  short	
  (7	
  -­‐	
  10	
  nucleotides	
  long	
  on	
  average),	
   the	
  likelihood	
  of	
  finding	
  potential	
  sites	
  on	
  any	
  given	
  sequence	
  is	
  high,	
  resulting	
  in	
  many	
   spurious	
  hits.	
  For	
  example,	
  Fickett	
  has	
  shown	
  that	
  the	
  binding	
  model	
  for	
  MEF2	
  results	
   in	
  one	
  MEF2	
  binding	
  site	
  prediction	
  per	
  5000	
  base	
  pairs	
  in	
  the	
  human	
  genome	
  (Fickett	
   1996).	
  This	
  has	
  led	
  to	
  the	
  TFBS	
  Prediction	
  Futility	
  Conjuncture,	
  which	
  states	
  that	
   individual	
  TFBS	
  predictions	
  by	
  themselves	
  generally	
  provide	
  little	
  information	
  on	
  the	
   identification	
  of	
  binding	
  sites	
  actually	
  functioning	
  in	
  vivo	
  (Wasserman	
  and	
  Sandelin	
   2004).	
  Approaches	
  for	
  addressing	
  this	
  problem	
  are	
  discussed	
  in	
  sections	
  1.7.3	
  and	
   1.7.4.	
   	
    	
    	
    28  1.7.1.2 Hidden	
  Markov	
  Models	
   Although	
  widely	
  used,	
  the	
  matrix	
  model	
  suffers	
  from	
  a	
  number	
  of	
  limitations.	
  Because	
   the	
  model	
  relies	
  on	
  a	
  fixed	
  length	
  matrix,	
  it	
  cannot	
  easily	
  represent	
  a	
  TFBS	
  that	
  contains	
   a	
  variable	
  spacer	
  within	
  the	
  site.	
  Also,	
  because	
  the	
  matrix	
  model	
  treats	
  each	
  column	
   independently,	
  it	
  cannot	
  represent	
  any	
  inter-­‐dependence	
  among	
  different	
  positions	
  of	
  a	
   binding	
  site.	
  Hidden	
  Markov	
  models	
  (HMMs)	
  have	
  been	
  employed	
  by	
  a	
  number	
  of	
   researchers	
  to	
  create	
  a	
  modeling	
  framework	
  that	
  can	
  overcome	
  these	
  limitations	
   (Tokovenko	
  et	
  al.	
  2009).	
  A	
  HMM	
  can	
  be	
  depicted	
  as	
  a	
  directed	
  network	
  of	
  states.	
  A	
  state	
   can	
  emit	
  symbols	
  (such	
  as	
  nucleotides)	
  with	
  a	
  pre-­‐defined	
  emission	
  probability,	
  and	
  the	
   transition	
  between	
  any	
  two	
  states	
  has	
  an	
  associated	
  transition	
  probability.	
  A	
  specific	
   path	
  through	
  the	
  states	
  produces	
  a	
  sequence	
  of	
  symbols,	
  and	
  this	
  path	
  would	
  be	
  taken	
   with	
  a	
  probability	
  defined	
  as	
  the	
  product	
  of	
  all	
  emission	
  probabilities	
  for	
  the	
  symbols	
   and	
  the	
  transition	
  probabilities	
  between	
  the	
  states.	
  While	
  using	
  HMMs	
  can	
  overcome	
   the	
  limitations	
  imposed	
  by	
  the	
  matrix	
  model,	
  PSSMs	
  still	
  remain	
  the	
  predominant	
  form	
   of	
  TFBS	
  representation	
  due	
  to	
  their	
  simplicity,	
  the	
  ease	
  of	
  computation	
  and	
  the	
  fact	
  that	
   relatively	
  few	
  TF	
  classes	
  tolerate	
  the	
  variability	
  of	
  spacing	
  and	
  site	
  organization	
  that	
   HMMs	
  address.	
  Within	
  the	
  thesis	
  only	
  PSSMs	
  are	
  used,	
  but	
  in	
  the	
  long-­‐term	
  HMM-­‐based	
   methods	
  may	
  replace	
  PSSMs.	
    1.7.2 Motif	
  Discovery	
   The	
  generation	
  of	
  PSSMs	
  requires	
  a	
  set	
  of	
  binding	
  site	
  sequences	
  for	
  a	
  TF.	
  In	
  many	
  cases	
   however,	
  one	
  is	
  faced	
  with	
  a	
  situation	
  where	
  the	
  putative	
  target	
  sequences	
  are	
  too	
  long	
   for	
  accurate	
  alignment	
  of	
  the	
  binding	
  sites.	
  As	
  many	
  experimental	
  techniques	
  for	
  TFBS	
   identification	
  cannot	
  provide	
  the	
  resolution	
  necessary	
  for	
  direct	
  alignment	
  of	
  the	
   binding	
  site	
  sequences,	
  one	
  is	
  faced	
  with	
  the	
  need	
  to	
  discern	
  the	
  binding	
  sites	
  from	
   sequences	
  much	
  longer	
  than	
  the	
  sites	
  themselves.	
  In	
  these	
  situations,	
  it	
  is	
  necessary	
  to	
   perform	
  motif	
  discovery	
  using	
  pattern	
  discovery	
  algorithms.	
  There	
  are	
  two	
  major	
   categories	
  of	
  pattern	
  discovery	
  algorithms,	
  word-­‐based	
  and	
  probabilistic	
  (Das	
  and	
  Dai	
   	
    29  2007).	
  Word-­‐based	
  algorithms	
  count	
  the	
  occurrences	
  of	
  a	
  nucleotide	
  sequence	
  (‘word’)	
   with	
  a	
  given	
  length	
  in	
  the	
  input	
  sequence	
  set	
  and	
  compare	
  it	
  against	
  its	
  frequency	
  in	
  the	
   background	
  set,	
  whereas	
  probabilistic	
  methods	
  search	
  for	
  the	
  most	
  over-­‐represented	
   patterns	
  using	
  some	
  form	
  of	
  directed	
  or	
  random	
  selection.	
  As	
  finding	
  the	
  optimal	
   pattern	
  is	
  a	
  NP-­‐hard	
  problem,	
  heuristic	
  methods	
  such	
  as	
  Gibbs	
  sampling	
  are	
  employed	
   (Lawrence	
  et	
  al.	
  1993).	
  Many	
  software	
  packages	
  have	
  been	
  implemented	
  based	
  on	
   variations	
  of	
  these	
  approaches,	
  such	
  as	
  MEME,	
  AlignAce	
  and	
  Weeder;	
  Tompa	
  et	
  al.	
   performed	
  a	
  comprehensive	
  benchmark	
  of	
  13	
  motif	
  discovery	
  tools	
  and	
  discussed	
  the	
   issues	
  involved	
  in	
  the	
  selection	
  of	
  the	
  correct	
  approaches	
  for	
  the	
  evaluation	
  of	
  their	
   performances	
  (Tompa	
  et	
  al.	
  2005).	
  Within	
  the	
  thesis,	
  the	
  MEME	
  motif	
  discovery	
   package	
  is	
  used.	
  MEME	
  (Multiple	
  EM	
  for	
  Motif	
  Elicitation),	
  one	
  of	
  the	
  most	
  widely	
  used	
   motif	
  discovery	
  tools,	
  identifies	
  ungapped	
  motifs	
  that	
  occur	
  repeatedly	
  in	
  the	
  input	
   sequence	
  set	
  through	
  expectation	
  maximization	
  and	
  maximum	
  likelihood	
  methods	
   (Bailey	
  et	
  al.	
  2006).	
    1.7.3 Phylogenetic	
  Footprinting	
   The	
  search	
  for	
  functional	
  regulatory	
  elements	
  in	
  the	
  metazoan	
  genomes	
  is	
  severely	
   hampered	
  by	
  the	
  fact	
  that	
  one	
  needs	
  to	
  detect	
  short	
  motifs	
  of	
  several	
  base	
  pairs	
  within	
   large	
  genomic	
  regions	
  that	
  can	
  range	
  in	
  kilobases.	
  Efforts	
  into	
  reducing	
  the	
  search	
  space	
   to	
  enhance	
  the	
  specificity	
  of	
  regulatory	
  region	
  predictions	
  have	
  gained	
  a	
  powerful	
  tool	
   in	
  the	
  form	
  of	
  phylogenetic	
  footprinting	
  (Wasserman	
  et	
  al.	
  2000).	
  With	
  this	
  strategy,	
   one	
  restricts	
  TFBS	
  search	
  space	
  to	
  conserved	
  orthologous	
  genomic	
  regions	
  in	
  different	
   species	
  only.	
  Functional	
  regulatory	
  regions	
  are	
  often	
  under	
  evolutionary	
  selective	
   pressure	
  compared	
  to	
  the	
  background	
  sequence.	
  If	
  the	
  expression	
  pattern	
  of	
  a	
  gene	
  is	
   conserved	
  among	
  a	
  group	
  of	
  species,	
  the	
  CRM	
  that	
  confers	
  the	
  pattern	
  is	
  likely	
  to	
  be	
   retained	
  as	
  well.	
  Mutations	
  leading	
  to	
  significant	
  changes	
  in	
  the	
  TFBSs	
  of	
  a	
  CRM	
  can	
   destroy	
  its	
  functionality,	
  as	
  the	
  correct	
  binding	
  of	
  a	
  TF	
  is	
  dependent	
  on	
  the	
  TFBS	
   sequence.	
  By	
  employing	
  phylogenetic	
  footprinting	
  with	
  closely	
  related	
  species,	
  one	
  can	
    	
    30  determine	
  which	
  non-­‐coding	
  regions	
  of	
  the	
  genome	
  are	
  conserved,	
  and	
  limit	
  one’s	
   search	
  for	
  the	
  regulatory	
  regions	
  to	
  these	
  conserved	
  segments.	
  	
  	
  	
   While	
  orthologous	
  coding	
  regions	
  are	
  relatively	
  well	
  conserved	
  over	
  long	
  evolutionary	
   distances,	
  the	
  same	
  is	
  not	
  usually	
  true	
  of	
  non-­‐coding	
  regulatory	
  regions.	
  Still,	
   phylogenetic	
  footprinting	
  using	
  human	
  and	
  mouse	
  alignments	
  has	
  been	
  shown	
  to	
   reduce	
  the	
  search	
  space	
  by	
  up	
  to	
  80	
  %,	
  and	
  up	
  to	
  70	
  %	
  of	
  functional	
  binding	
  sites	
  have	
   been	
  located	
  in	
  the	
  conserved	
  regions	
  (Dermitzakis	
  and	
  Clark	
  2002;	
  Lenhard	
  et	
  al.	
   2003).	
  While	
  there	
  have	
  been	
  genome-­‐wide	
  ChIP-­‐chip	
  studies	
  for	
  transcription	
  factors	
   that	
  have	
  questioned	
  the	
  extent	
  of	
  functional	
  regulatory	
  element	
  conservation,	
  it	
  is	
   important	
  to	
  note	
  that	
  not	
  all	
  binding	
  sites	
  detected	
  through	
  ChIP	
  experiments	
  function	
   as	
  cis-­‐regulatory	
  elements	
  themselves	
  (Odom	
  et	
  al.	
  2007).	
  	
   However,	
  the	
  presence	
  of	
  functional	
  regulatory	
  motifs	
  in	
  non-­‐conserved	
  genomic	
   regions	
  is	
  a	
  valid	
  concern;	
  because	
  the	
  TFBS	
  are	
  short	
  and	
  TFs	
  have	
  degeneracy	
  in	
  their	
   binding	
  requirement,	
  the	
  likelihood	
  of	
  a	
  TFBS	
  being	
  lost	
  or	
  a	
  new	
  TFBS	
  arising	
  by	
   chance	
  as	
  species	
  diverge	
  is	
  relatively	
  high	
  (Dermitzakis	
  and	
  Clark	
  2002;	
  Dermitzakis	
  et	
   al.	
  2003).	
  Although	
  the	
  functional	
  motifs	
  are	
  under	
  evolutionary	
  constraint,	
  it	
  has	
  been	
   shown	
  that	
  substantial	
  TFBS	
  turnover	
  occurs	
  among	
  related	
  species.	
  The	
  short	
  lengths	
   of	
  motifs	
  mean	
  that	
  it	
  is	
  necessary	
  to	
  distinguish	
  conserved	
  motifs	
  due	
  to	
  evolutionary	
   constraints	
  from	
  those	
  that	
  are	
  conserved	
  by	
  chance	
  alone.	
  Distinguishing	
  between	
  the	
   two	
  classes	
  of	
  conserved	
  motifs	
  requires	
  numerous	
  species	
  with	
  varying	
  levels	
  of	
   evolutionary	
  divergence.	
  With	
  the	
  increasing	
  availability	
  of	
  multiple	
  genomes,	
  large	
   scale	
  multiple	
  alignment-­‐based	
  conservation	
  scores	
  such	
  as	
  phastCons	
  and	
  phyloP	
  are	
   now	
  available	
  for	
  major	
  taxonomic	
  groups.	
  While	
  applying	
  phylogenetic	
  footprinting	
   can	
  improve	
  the	
  specificity	
  of	
  regulatory	
  element	
  detection	
  drastically,	
  sensitivity	
  does	
   suffer	
  in	
  the	
  process.	
  	
  	
  Determining	
  the	
  appropriate	
  application	
  of	
  phylogenetic	
   footprinting	
  is	
  an	
  ill-­‐defined	
  process,	
  dependent	
  upon	
  the	
  conservation	
  scoring	
   methods	
  and	
  species	
  being	
  studied.	
    	
    31  1.7.4 cis-­‐Regulatory	
  Module	
  Detection	
  Methods	
   One	
  way	
  of	
  improving	
  TFBS	
  prediction	
  is	
  to	
  develop	
  models	
  that	
  incorporate	
  the	
   interactions	
  between	
  elements	
  required	
  for	
  functional	
  CRMs.	
  Instead	
  of	
  scanning	
  the	
   genome	
  for	
  single	
  TFBS,	
  CRM	
  detection	
  methods	
  look	
  for	
  the	
  presence	
  of	
  clusters	
  of	
   TFBSs	
  for	
  groups	
  of	
  TFs	
  known	
  to	
  work	
  in	
  concert	
  (Wasserman	
  and	
  Krivan	
  2003).	
  By	
   limiting	
  predictions	
  to	
  such	
  TFBS	
  clusters	
  within	
  a	
  short	
  (often	
  arbitrarily	
  determined	
   based	
  on	
  past	
  empirical	
  observations)	
  distance,	
  these	
  algorithms	
  have	
  been	
  shown	
  to	
   reduce	
  the	
  false	
  positive	
  rate	
  of	
  TFBS	
  predictions.	
  There	
  has	
  been	
  a	
  steady	
  development	
   of	
  various	
  CRM	
  detection	
  programs	
  over	
  the	
  last	
  few	
  years	
  (Table	
  2-­‐1).	
  One	
  of	
  the	
   earliest	
  CRM	
  detection	
  programs	
  used	
  logistic	
  regression	
  analysis	
  (LRA)	
  to	
  distinguish	
   between	
  CRM	
  and	
  non-­‐CRM	
  sequences	
  (Wasserman	
  and	
  Fickett	
  1998).	
  Some	
  methods	
   are	
  based	
  on	
  hidden	
  Markov	
  models	
  (Cister,	
  ClusterBuster),	
  while	
  others	
  are	
  based	
  on	
   computation	
  of	
  the	
  statistical	
  significance	
  of	
  sets	
  of	
  non-­‐overlapping	
  potential	
  TFBSs	
   (MSCAN)	
  (Frith	
  et	
  al.	
  2003;	
  Alkema	
  et	
  al.	
  2004a).	
  The	
  latest	
  methods	
  incorporate	
   phylogenetic	
  footprinting	
  to	
  improve	
  their	
  specificity	
  (Warner	
  et	
  al.	
  2008;	
  He	
  et	
  al.	
   2009).	
  The	
  reliability	
  of	
  a	
  subset	
  of	
  these	
  methods	
  will	
  be	
  explored	
  within	
  the	
  thesis.	
    1.7.5 Regulome	
   As	
  model	
  organisms	
  such	
  as	
  worms	
  and	
  flies	
  are	
  distant	
  from	
  humans	
  in	
  the	
   evolutionary	
  tree,	
  it	
  is	
  difficult	
  or	
  impossible	
  to	
  apply	
  phylogenetic	
  footprinting	
  directly	
   for	
  TFBS	
  studies.	
  To	
  overcome	
  this	
  problem,	
  a	
  strategy	
  known	
  as	
  Regulog	
  analysis	
  has	
   been	
  introduced	
  (Alkema	
  et	
  al.	
  2004b).	
  Algorithms	
  following	
  this	
  strategy	
  search	
  for	
  TF	
   models	
  common	
  to	
  a	
  list	
  of	
  genes	
  in	
  one	
  species	
  with	
  similar	
  expression	
  patterns	
  and	
   compare	
  them	
  to	
  the	
  orthologous	
  genes	
  in	
  another	
  species.	
  Given	
  a	
  candidate	
  TF	
   binding	
  model	
  and	
  a	
  set	
  of	
  orthologous	
  gene	
  sequences,	
  one	
  penalizes	
  the	
  candidate	
  TF	
   if	
  putative	
  binding	
  sites	
  are	
  not	
  conserved	
  between	
  closely	
  related	
  species	
  and	
  rewards	
   the	
  candidate	
  if	
  the	
  TFBSs	
  are	
  conserved	
  over	
  long	
  evolutionary	
  distances.	
    	
    32  There	
  has	
  been	
  some	
  previous	
  work	
  that	
  employs	
  Regulog-­‐like	
  analyses	
  to	
  relate	
  gene	
   regulation	
  between	
  distant	
  model	
  organisms	
  and	
  humans.	
  GuhaThakurta	
  et	
  al.	
  analyzed	
   the	
  worm	
  genome	
  to	
  identify	
  the	
  motifs	
  involved	
  in	
  muscle	
  gene	
  regulation,	
  and	
   transferred	
  the	
  predicted	
  motif	
  models	
  over	
  to	
  vertebrates	
  and	
  evaluated	
  TFBS	
   overrepresentation	
  in	
  the	
  vertebrate	
  muscle	
  genes	
  (GuhaThakurta	
  et	
  al.	
  2004).	
  In	
  order	
   to	
  find	
  those	
  genes	
  that	
  are	
  involved	
  in	
  ciliary	
  and	
  basal	
  body	
  biogenesis	
  and	
  function,	
   Li	
  et	
  al.	
  have	
  compared	
  the	
  genomes	
  of	
  Chlamydomonas,	
  which	
  are	
  ciliated,	
  and	
   Arabidopsis,	
  which	
  is	
  not	
  flagellated	
  (Li	
  et	
  al.	
  2004;	
  Blacque	
  et	
  al.	
  2005).	
  By	
  selecting	
   genes	
  that	
  are	
  unique	
  to	
  Chlamydomonas	
  and	
  looking	
  for	
  orthologous	
  genes	
  in	
  humans	
   with	
  similar	
  regulatory	
  mechanisms,	
  they	
  were	
  able	
  to	
  identify	
  a	
  flagellar	
  and	
  basal	
   body	
  proteome	
  in	
  human,	
  including	
  a	
  gene	
  that	
  is	
  responsible	
  for	
  Bardet-­‐Biedl	
   syndrome.	
  Zhang	
  et	
  al.	
  has	
  applied	
  this	
  idea	
  to	
  study	
  the	
  transcription	
  factor	
  FOXO	
   target	
  genes	
  in	
  worms	
  (Xuan	
  and	
  Zhang	
  2005).	
  The	
  FOXO	
  family	
  of	
  TFs	
  belongs	
  to	
  a	
   larger	
  family	
  known	
  as	
  Forkhead,	
  which	
  is	
  well	
  conserved,	
  even	
  with	
  vertebrates.	
   Zhang	
  et	
  al.	
  perform	
  a	
  Regulog-­‐style	
  analysis	
  to	
  find	
  human	
  genes	
  that	
  are	
  orthologous	
   to	
  the	
  worm	
  genes	
  and	
  are	
  also	
  under	
  the	
  regulatory	
  control	
  of	
  FOXO.	
  	
  Thus	
  far,	
  the	
   implementation	
  of	
  Regulog	
  methods	
  for	
  metazoan	
  species	
  has	
  been	
  limited	
  to	
  specific	
   biological	
  processes	
  using	
  carefully	
  selected	
  species.	
    1.8 Expansion	
  of	
  Regulatory	
  Region	
  Data	
   Until	
  recently,	
  studies	
  of	
  functional	
  elements	
  in	
  noncoding	
  regions	
  had	
  been	
  hampered	
   by	
  a	
  lack	
  of	
  high	
  quality	
  large-­‐scale	
  data.	
  Data	
  were	
  obtained	
  from	
  low-­‐throughput	
   experiments	
  such	
  as	
  gene	
  reporter	
  assays.	
  Often,	
  the	
  only	
  regulatory	
  region	
  data	
  for	
  a	
   given	
  species	
  were	
  limited	
  to	
  a	
  few	
  TFs	
  and	
  a	
  small	
  number	
  of	
  genes.	
  	
  Advances	
  in	
  high-­‐ throughput	
  methods	
  have	
  brought	
  renewed	
  focus	
  on	
  the	
  genome-­‐wide	
  search	
  for	
   functional	
  noncoding	
  regions	
  such	
  as	
  CRMs.	
    	
    33  1.8.1 ENCODE	
  Data	
   The	
  ENCODE	
  (ENCyclopedia	
  Of	
  DNA	
  Elements)	
  Consortium	
  was	
  established	
  with	
  the	
   purpose	
  of	
  identifying	
  and	
  annotating	
  all	
  functional	
  elements	
  in	
  the	
  human	
  genome	
   (Birney	
  et	
  al.	
  2007).	
  It	
  was	
  launched	
  as	
  a	
  pilot	
  project	
  in	
  2003,	
  with	
  the	
  goal	
  of	
   identifying	
  the	
  functional	
  elements	
  in	
  1	
  %	
  of	
  the	
  human	
  genome.	
  In	
  the	
  pilot	
  phase,	
   there	
  was	
  an	
  emphasis	
  on	
  the	
  development	
  of	
  high-­‐throughput	
  methods	
  to	
  identify	
  and	
   catalogue	
  functional	
  elements.	
  In	
  this	
  phase,	
  experimental	
  techniques	
  such	
  as	
  tiling	
   arrays,	
  tag	
  sequencing,	
  and	
  QT-­‐PCR	
  were	
  employed,	
  along	
  with	
  the	
  associated	
   computational	
  methods	
  for	
  analysis	
  of	
  the	
  generated	
  data.	
  In	
  2007,	
  following	
  the	
  initial	
   trial	
  with	
  the	
  pilot	
  phase,	
  the	
  project	
  was	
  expanded	
  to	
  cover	
  the	
  entire	
  human	
  genome.	
   The	
  data	
  generated	
  by	
  the	
  consortium	
  are	
  continuously	
  being	
  released	
  to	
  the	
  public.	
    1.8.2 modENCODE	
  Data	
   Two	
  of	
  the	
  most	
  widely	
  used	
  model	
  organisms	
  are	
  Drosophila	
  melanogaster	
  (fruit	
  fly)	
   and	
  Caenorhabditis	
  elegans	
  (nematode).	
  These	
  species	
  have	
  served	
  as	
  an	
  invaluable	
   platform	
  for	
  genetic	
  experiments,	
  where	
  their	
  simpler	
  structures,	
  ease	
  of	
  genetic	
   manipulation	
  and	
  shorter	
  life	
  cycles	
  allowed	
  researchers	
  to	
  perform	
  studies	
  that	
  are	
   intractable	
  in	
  vertebrates.	
  Despite	
  their	
  importance	
  as	
  experimental	
  platforms,	
  our	
   understanding	
  of	
  their	
  genomes	
  is	
  still	
  rudimentary.	
  While	
  fruit	
  flies	
  and	
  nematodes	
   contain	
  comparable	
  numbers	
  of	
  TFs	
  as	
  vertebrates	
  (900	
  to	
  1000	
  vs.	
  ~1,500	
  TFs	
  in	
   humans),	
  there	
  had	
  not	
  been	
  systematic	
  studies	
  of	
  the	
  gene	
  regulatory	
  networks	
  and	
   functional	
  noncoding	
  elements	
  in	
  these	
  species	
  (Vaquerizas	
  et	
  al.	
  2009).	
  In	
  part	
  to	
   overcome	
  these	
  limitations,	
  the	
  model	
  organism	
  ENCODE	
  (modENCODE)	
  consortium	
   was	
  launched	
  in	
  2007	
  as	
  an	
  offshoot	
  of	
  the	
  ENCODE	
  project,	
  with	
  the	
  goal	
  of	
   comprehensive	
  annotation	
  of	
  functional	
  elements	
  in	
  fruit	
  fly	
  and	
  nematode	
  genomes	
   (Celniker	
  et	
  al.	
  2009).	
   	
   	
    	
    34  The	
  modENCODE	
  consortium	
  is	
  currently	
  working	
  on	
  the	
  identification	
  of	
  the	
   transcripts,	
  TFBSs,	
  chromatin	
  signatures,	
  and	
  DNA	
  replication	
  mechanisms	
  of	
  fruit	
  flies	
   and	
  nematodes.	
  Various	
  experimental	
  techniques	
  are	
  being	
  employed,	
  including	
  (but	
   not	
  limited	
  to)	
  ChIP-­‐chip,	
  ChIP-­‐Seq,	
  RNA-­‐Seq,	
  RT-­‐PCR	
  and	
  mass	
  spectrometry.	
  By	
   generating	
  cell-­‐	
  and	
  tissue-­‐specific	
  data	
  for	
  transcripts,	
  the	
  associated	
  TFBSs,	
  small	
  RNA	
   binding	
  sites	
  and	
  structures,	
  DNA	
  replication	
  origins	
  and	
  other	
  chromatin	
  marks	
  and	
   interaction	
  sites,	
  the	
  consortium	
  hopes	
  to	
  understand	
  the	
  details	
  of	
  the	
  functional	
   genomic	
  landscapes	
  of	
  the	
  model	
  organisms.	
  The	
  data	
  generated	
  is	
  being	
  rapidly	
   released	
  in	
  coordination	
  with	
  major	
  public	
  databases	
  such	
  as	
  UCSC,	
  FlyBase	
  and	
   WormBase,	
  facilitating	
  the	
  use	
  of	
  the	
  data	
  for	
  further	
  research	
  and	
  understanding	
  the	
   inner	
  workings	
  of	
  the	
  model	
  organisms.	
  The	
  modENCODE	
  data	
  is	
  used	
  in	
  the	
  later	
   portions	
  of	
  the	
  thesis.	
    1.9 Thesis	
  Overview	
   In	
  this	
  thesis,	
  I	
  develop	
  and	
  evaluate	
  bioinformatics	
  methods	
  for	
  the	
  identification	
  of	
   cis-­‐regulatory	
  regions	
  governing	
  gene	
  transcription.	
  	
  	
  The	
  work	
  initiates	
  with	
  the	
  study	
   of	
  skeletal	
  muscle	
  regulatory	
  sequences	
  in	
  mammals,	
  including	
  both	
  computational	
   prediction	
  and	
  laboratory	
  testing	
  to	
  evaluate	
  the	
  performance	
  of	
  the	
  methods.	
  	
  The	
   work	
  progresses	
  to	
  the	
  engineering	
  of	
  software	
  for	
  the	
  identification	
  of	
  enriched	
  TFBS	
   patterns	
  across	
  sets	
  of	
  genes;	
  software	
  designed	
  to	
  work	
  across	
  the	
  popular	
  animal	
   model	
  organisms.	
  	
  	
  The	
  software	
  serves	
  as	
  the	
  framework	
  for	
  the	
  identification	
  of	
   important	
  regulatory	
  modules	
  in	
  co-­‐expressed	
  gene	
  sets	
  that	
  are	
  conserved	
  and	
  shared	
   across	
  evolutionarily	
  distant	
  species.	
  The	
  work	
  delivers	
  useful	
  methods	
  to	
  the	
  scientific	
   community	
  for	
  the	
  characterization	
  of	
  regulatory	
  programs	
  for	
  experimentally	
  defined	
   gene	
  sets,	
  while	
  offering	
  insight	
  into	
  severe	
  limitations	
  of	
  sequence	
  analysis	
  for	
   regulatory	
  sequence	
  annotation.	
   	
    	
    	
    35  In	
  Chapter	
  2,	
  I	
  describe	
  the	
  computational	
  prediction	
  and	
  experimental	
  validation	
  of	
   skeletal	
  muscle	
  cis-­‐regulatory	
  modules.	
  Predictions	
  are	
  made	
  by	
  scanning	
  the	
  human	
   genome	
  for	
  skeletal	
  muscle-­‐specific	
  CRMs	
  using	
  three	
  computational	
  prediction	
  tools.	
  	
   A	
  subset	
  of	
  predictions	
  was	
  validated	
  in	
  a	
  C2C12	
  murine	
  cell	
  culture	
  model	
  using	
  a	
  dual	
   reporter	
  expression	
  assay.	
  Of	
  the	
  339	
  candidate	
  CRMs	
  tested,	
  only	
  19	
  were	
  able	
  to	
  drive	
   muscle-­‐specific	
  expression.	
  The	
  low	
  success	
  rate	
  illustrates	
  the	
  limitations	
  of	
  the	
   existing	
  computational	
  CRM	
  discovery	
  methods.	
  Further	
  analysis	
  of	
  the	
  functional	
   sequences	
  indicates	
  a	
  bias	
  in	
  mononucleotide	
  sequence	
  composition	
  for	
  functional	
   sequences,	
  suggesting	
  avenues	
  for	
  future	
  improvement	
  of	
  the	
  CRM	
  prediction	
  methods.	
   In	
  Chapter	
  3,	
  I	
  describe	
  the	
  oPOSSUM-­‐3	
  system,	
  a	
  major	
  redesign	
  of	
  a	
  popular	
   regulatory	
  motif	
  over-­‐representation	
  analysis	
  software.	
  Given	
  a	
  set	
  of	
  genes	
  thought	
  to	
   be	
  regulated	
  by	
  a	
  common	
  set	
  of	
  TFs,	
  the	
  system	
  determines	
  which	
  motifs	
  are	
  over-­‐ represented	
  in	
  the	
  gene	
  sequences	
  relative	
  to	
  background.	
  The	
  new	
  system	
  provides	
   significant	
  improvements	
  and	
  new	
  features	
  that	
  make	
  it	
  more	
  suitable	
  for	
  genome-­‐scale	
   data	
  analysis.	
  The	
  system	
  incorporates	
  phylogenetic	
  footprinting	
  based	
  on	
  phastCons	
   conservation	
  scores	
  calculated	
  upon	
  multiple	
  sequence	
  alignments.	
  	
  The	
  adoption	
  of	
  the	
   JASPAR	
  2010	
  TFBS	
  profile	
  database	
  results	
  in	
  a	
  5-­‐fold	
  increase	
  in	
  the	
  number	
  of	
   profiles	
  incorporated,	
  rendering	
  the	
  system	
  for	
  the	
  study	
  of	
  TF	
  combinations	
   implemented	
  for	
  the	
  previous	
  version	
  of	
  oPOSSUM	
  unsuitable	
  for	
  practical	
  usage	
  due	
  to	
   the	
  heavy	
  computational	
  costs.	
  	
  A	
  new	
  system	
  was	
  implemented	
  to	
  identify	
  sets	
  of	
   motifs	
  that	
  occur	
  in	
  combination.	
  	
  Recognizing	
  the	
  increasing	
  use	
  of	
  ChIP-­‐Seq	
   procedures,	
  the	
  new	
  system	
  was	
  designed	
  to	
  support	
  sequence-­‐based	
  analysis,	
   complementing	
  the	
  classic	
  gene-­‐based	
  analysis	
  methods.	
  	
  The	
  performance	
  of	
  the	
   system	
  is	
  evaluated	
  on	
  reference	
  gene	
  collections,	
  revealing	
  the	
  influence	
  of	
  sequence	
   composition	
  on	
  the	
  results.	
  Procedures	
  to	
  mitigate	
  these	
  effects	
  are	
  developed	
  and	
   shown	
  to	
  improve	
  performance.	
  	
   	
    	
    	
    36  In	
  Chapter	
  4,	
  I	
  describe	
  the	
  comparative	
  analysis	
  of	
  the	
  regulatory	
  programs	
  among	
   evolutionarily	
  divergent	
  species	
  and	
  report	
  on	
  the	
  extent	
  of	
  conservation	
  in	
  the	
  major	
   regulatory	
  programs.	
  Multiple	
  genes	
  known	
  to	
  be	
  under	
  common	
  regulatory	
  controls	
   (namely,	
  muscle,	
  cilia,	
  and	
  Nfe2l2-­‐responding	
  genes)	
  and	
  their	
  orthologs	
  in	
  human,	
  fruit	
   flies	
  and	
  nematodes	
  were	
  compared	
  for	
  regulatory	
  signatures	
  using	
  oPOSSUM-­‐3,	
  and	
   both	
  the	
  commonalities	
  and	
  differences	
  found	
  among	
  the	
  three	
  species	
  are	
  discussed.	
   While	
  a	
  number	
  of	
  significant	
  differences	
  do	
  exist	
  among	
  the	
  three	
  species,	
  orthologs	
  of	
   key	
  TFs	
  that	
  function	
  as	
  main	
  regulatory	
  switches	
  in	
  each	
  gene	
  set	
  could	
  be	
  identified	
  in	
   each	
  species	
  despite	
  their	
  evolutionary	
  distances.	
  	
  	
   	
  In	
  Chapter	
  5,	
  I	
  present	
  a	
  summary	
  of	
  the	
  results	
  from	
  the	
  previous	
  chapters	
  and	
  a	
   discussion	
  of	
  what	
  future	
  research	
  directions	
  should	
  be	
  taken	
  to	
  improve	
  upon	
  the	
   findings	
  provided	
  by	
  this	
  thesis,	
  in	
  light	
  of	
  the	
  current	
  developments	
  and	
  advances	
   being	
  made	
  in	
  the	
  genomics	
  field.	
   	
    	
    	
    37  2 Validation	
  of	
  skeletal	
  muscle	
  cis-­‐regulatory	
  module	
  predictions	
   reveals	
  nucleotide	
  composition	
  bias	
  in	
  functional	
  enhancers	
   2.1 Introduction	
   A	
  regulatory	
  network	
  represents	
  the	
  complex	
  interplay	
  between	
  regulatory	
  proteins	
   and	
  biochemical	
  processes	
  that	
  govern	
  when	
  and	
  where	
  genes	
  are	
  expressed.	
  Two	
   important	
  components	
  of	
  a	
  regulatory	
  network	
  are	
  cis-­‐regulatory	
  modules	
  (CRM),	
   composed	
  of	
  functionally	
  interacting	
  clusters	
  of	
  transcription	
  factor	
  binding	
  sites	
   (TFBS)	
  sufficient	
  to	
  confer	
  a	
  pattern	
  of	
  expression	
  upon	
  a	
  promoter,	
  and	
  the	
   corresponding	
  trans-­‐acting	
  transcription	
  factors	
  (TFs)	
  that	
  bind	
  to	
  a	
  CRM	
  to	
  regulate	
   transcription	
  initiation.	
  The	
  multiple	
  TFBS	
  that	
  constitute	
  a	
  CRM	
  allow	
  for	
   combinatorial	
  control	
  of	
  expression;	
  a	
  limited	
  number	
  of	
  TFs	
  can	
  participate	
  in	
  an	
   exponential	
  number	
  of	
  combinations	
  with	
  each	
  potentially	
  conferring	
  specific	
  patterns	
   of	
  gene	
  activity	
  (Arnone	
  and	
  Davidson	
  1997).	
   CRMs	
  can	
  be	
  situated	
  almost	
  anywhere	
  relative	
  to	
  the	
  structure	
  of	
  a	
  gene:	
  both	
  near	
  and	
   far	
  (even	
  exceptionally	
  far)	
  from	
  the	
  promoter	
  region(s)	
  at	
  which	
  transcription	
   initiates.	
  While	
  there	
  are	
  indications	
  of	
  quantitative	
  orientation	
  effects	
  in	
  some	
  cases	
   (Shimizu	
  et	
  al.	
  2002),	
  CRMs	
  are	
  generally	
  thought	
  to	
  be	
  active	
  in	
  either	
  direction	
   relative	
  to	
  a	
  gene	
  promoter.	
  Linear	
  distance	
  in	
  primary	
  sequence	
  is	
  no	
  indication	
  of	
  the	
   three	
  dimensional	
  distance	
  (or	
  orientation)	
  within	
  the	
  nucleus.	
  Regulatory	
  regions	
  can	
   be	
  located	
  in	
  introns	
  of	
  an	
  adjacent	
  gene	
  (McBride	
  and	
  Kleinjan	
  2004;	
  Ilnytska	
  et	
  al.	
   2009),	
  can	
  skip	
  over	
  intervening	
  genes	
  (Ling	
  et	
  al.	
  2004)	
  and	
  there	
  are	
  suggestions	
  that	
   CRMs	
  can	
  act	
  on	
  genes	
  located	
  on	
  different	
  chromosomes	
  (McBride	
  and	
  Kleinjan	
  2004;	
   Miele	
  and	
  Dekker	
  2008).	
  Reflecting	
  these	
  properties,	
  the	
  discovery	
  of	
  CRMs	
  stands	
  out	
   as	
  a	
  significant	
  challenge	
  for	
  both	
  computational	
  and	
  experimental	
  research.	
   	
    	
    38  In	
  multicellular	
  organisms,	
  maintaining	
  precise	
  spatial	
  and	
  temporal	
  control	
  of	
   transcription	
  in	
  various	
  cell	
  types	
  is	
  vital	
  for	
  correct	
  tissue	
  development	
  and	
   specialization	
  (Frech	
  et	
  al.	
  1997;	
  Look	
  1997;	
  Lee	
  et	
  al.	
  2004).	
  One	
  of	
  the	
  most	
  widely	
   studied	
  “programs”	
  of	
  tissue	
  development	
  is	
  the	
  regulation	
  of	
  skeletal	
  muscle	
   differentiation.	
  Myogenesis	
  is	
  a	
  structured	
  process,	
  in	
  which	
  mononucleate	
  myoblasts	
   fuse	
  together	
  to	
  form	
  multinucleate	
  myotubes,	
  which	
  then	
  develop	
  into	
  classes	
  of	
   myofibres	
  (Arnold	
  and	
  Winter	
  1998).	
  C2C12	
  cells	
  provide	
  a	
  popular	
  model	
  for	
  this	
   process,	
  with	
  an	
  easily	
  triggered	
  switch	
  between	
  the	
  growth	
  and	
  differentiation	
  phases	
   (Gramolini	
  and	
  Jasmin	
  1999).	
  Any	
  tissue	
  differentiation	
  process	
  requires	
  complex	
   transcriptional	
  regulation	
  controls.	
  For	
  skeletal	
  muscle,	
  differentiated	
  cell	
  gene	
   expression	
  involves	
  at	
  least	
  two	
  major	
  TF	
  families,	
  the	
  myogenin	
  family	
  and	
  the	
  MADS	
   family	
  (Braun	
  et	
  al.	
  1994;	
  Rudnicki	
  and	
  Jaenisch	
  1995;	
  Naya	
  and	
  Olson	
  1999).	
  In	
  many	
   differentiation	
  processes,	
  multiple	
  proteins	
  within	
  a	
  homology-­‐based	
  family	
  can	
   participate	
  in	
  the	
  regulatory	
  control	
  of	
  gene	
  expression	
  at	
  overlapping	
  temporal	
  stages	
   of	
  the	
  process.	
  Skeletal	
  muscle	
  differentiation	
  follows	
  this	
  model;	
  thus	
  the	
  myogenin	
   family	
  may	
  equally	
  refer	
  to	
  Myogenin,	
  MyoD,	
  and	
  Myf-­‐2	
  while	
  the	
  MADS	
  set	
   encompasses	
  both	
  Srf	
  and	
  multiple	
  members	
  of	
  the	
  Mef2	
  gene	
  family.	
  Dozens	
  of	
   muscle-­‐specific	
  CRMs	
  have	
  been	
  identified	
  (Wasserman	
  and	
  Fickett	
  1998;	
  Angus	
  et	
  al.	
   2001;	
  Zheng	
  et	
  al.	
  2002),	
  usually	
  based	
  on	
  reporter	
  gene	
  assays	
  in	
  the	
  C2C12	
  cell	
   culture	
  myogenesis	
  model.	
  	
   Aided	
  by	
  the	
  relatively	
  plentiful	
  set	
  of	
  skeletal	
  muscle	
  CRMs,	
  much	
  effort	
  has	
  been	
  made	
   by	
  the	
  bioinformatics	
  research	
  community	
  to	
  develop	
  predictive	
  algorithms	
  for	
  CRM	
   discrimination.	
  Multiple	
  CRM	
  detection	
  programs	
  have	
  been	
  developed,	
  which	
  look	
  for	
   clusters	
  of	
  TFBS	
  specific	
  to	
  the	
  TFs	
  known	
  to	
  be	
  involved	
  in	
  the	
  cell	
  type	
  of	
  interest.	
  	
  An	
   original	
  discriminative	
  method	
  to	
  distinguish	
  between	
  CRM	
  and	
  non-­‐CRM	
  sequences	
   based	
  on	
  a	
  logistic	
  regression	
  analysis	
  (LRA)	
  procedure	
  has	
  been	
  followed	
  by	
  a	
  plethora	
   of	
  more	
  advanced	
  approaches	
  (Table	
  2-­‐1)	
  (Wasserman	
  and	
  Fickett	
  1998).	
  For	
  example,	
   MSCAN	
  makes	
  use	
  of	
  motif-­‐specific	
  p-­‐values	
  to	
  compute	
  the	
  statistical	
  significance	
  of	
   sets	
  of	
  non-­‐overlapping	
  potential	
  TFBSs	
  (Alkema	
  et	
  al.	
  2004),	
  while	
  ClusterBuster	
  is	
   based	
  on	
  a	
  hidden	
  Markov	
  model	
  that	
  incorporates	
  heuristics	
  to	
  improve	
  predictive	
   	
    39  performance	
  (Frith	
  et	
  al.	
  2003).	
  None	
  of	
  the	
  methods	
  are	
  sufficiently	
  reliable	
  for	
  direct	
   genome	
  annotation;	
  the	
  specificity	
  of	
  predictions	
  is	
  sufficiently	
  low	
  that	
  laboratory	
   validation	
  is	
  essential	
  to	
  distinguish	
  functional	
  CRMs.	
  The	
  overall	
  performance	
  of	
  the	
   methods	
  and	
  the	
  properties	
  that	
  differentiate	
  the	
  functional	
  CRMs	
  from	
  the	
  false	
   candidates	
  remain	
  to	
  be	
  determined.	
  	
   Table 2-1. List of some of the published CRM prediction programs. Name	
   LRA	
   MSCAN	
   MCAST	
   Cister	
   COMET	
   Cluster-­‐Buster	
    Method	
   Logistic	
  regression	
  analysis	
   Motif-­‐specific	
  p-­‐values	
   Hidden	
  Markov	
  model	
   Hidden	
  Markov	
  model	
   Hidden	
  Markov	
  model	
   Hidden	
  Markov	
  model	
    Author	
   Wasserman	
  and	
  Fickett	
   Johansson	
  et	
  al.	
   Bailey	
  and	
  Noble	
   Frith	
  et	
  al.	
   Frith	
  et	
  al.	
   Frith	
  et	
  al.	
    	
   In	
  some	
  cases,	
  the	
  prediction	
  of	
  CRMs	
  has	
  been	
  coupled	
  with	
  phylogenetic	
  footprinting	
   under	
  the	
  premise	
  that	
  sequence	
  conservation	
  of	
  known	
  CRMs	
  and	
  TFBS	
  is	
  indicative	
  of	
   function	
  and	
  therefore	
  a	
  conservation	
  filter	
  will	
  improve	
  the	
  positive	
  predictive	
  value	
  of	
   the	
  CRM	
  prediction	
  methods	
  (Wasserman	
  and	
  Fickett	
  1998;	
  Blanchette	
  and	
  Tompa	
   2002;	
  Sinha	
  et	
  al.	
  2006;	
  Warner	
  et	
  al.	
  2008).	
  It	
  is	
  often	
  the	
  case	
  that	
  the	
  regulatory	
   sequences	
  display	
  evidence	
  of	
  evolutionary	
  selective	
  pressure	
  compared	
  to	
  the	
   background	
  rates	
  of	
  sequence	
  change	
  in	
  non-­‐functional	
  sequence	
  (Gumucio	
  et	
  al.	
  1992;	
   Shelton	
  et	
  al.	
  1997).	
  If	
  the	
  expression	
  pattern	
  of	
  a	
  gene	
  is	
  conserved	
  between	
  two	
   species	
  in	
  the	
  same	
  taxonomy	
  class,	
  the	
  CRM	
  that	
  confers	
  the	
  pattern	
  is	
  likely	
  to	
  be	
   retained	
  as	
  well	
  (although	
  the	
  individual	
  TFBS	
  within	
  the	
  CRM	
  may	
  be	
  altered).	
  By	
   applying	
  phylogenetic	
  footprinting	
  to	
  the	
  analysis	
  of	
  closely	
  related	
  species	
  (i.e.	
  50-­‐100	
   million	
  years	
  of	
  separation	
  for	
  vertebrates),	
  it	
  becomes	
  possible	
  to	
  concentrate	
   predictions	
  within	
  a	
  subset	
  of	
  regions	
  in	
  the	
  conserved	
  segments	
  of	
  genes.	
  Improved	
   specificity	
  is	
  balanced	
  against	
  the	
  reduced	
  sensitivity	
  imposed	
  by	
  any	
  filter.	
  	
   	
    	
    	
    40  Once	
  predictions	
  of	
  regulatory	
  sequences	
  have	
  been	
  made,	
  laboratory	
  validation	
  is	
   required	
  to	
  confirm	
  regulatory	
  function.	
  One	
  of	
  the	
  most	
  widely	
  used	
  methods	
  for	
   validating	
  computational	
  predictions	
  of	
  regulatory	
  regions	
  are	
  reporter	
  gene	
  assays	
  in	
  a	
   cell	
  culture	
  model	
  system	
  (Chalfie	
  et	
  al.	
  1994).	
  A	
  fusion	
  construct	
  of	
  the	
  predicted	
   regulatory	
  sequence	
  and	
  a	
  reporter	
  gene	
  with	
  a	
  basal	
  promoter	
  in	
  a	
  plasmid	
  is	
   transiently	
  transfected	
  into	
  cells,	
  and	
  the	
  reporter	
  gene	
  activity	
  is	
  measured	
  to	
   determine	
  the	
  regulatory	
  impact	
  that	
  the	
  tested	
  sequence	
  exerts.	
  It	
  is	
  feasible	
  to	
   conduct	
  larger-­‐scale	
  experiments	
  to	
  investigate	
  functional	
  properties	
  of	
  panels	
  of	
   candidate	
  CRMs	
  and	
  promoters	
  within	
  cells.	
  Cooper	
  et	
  al	
  performed	
  a	
  large	
  screen	
  of	
   promoter	
  activity	
  in	
  16	
  cell	
  lines	
  on	
  all	
  predicted	
  promoters	
  in	
  the	
  1%	
  of	
  the	
  human	
   genome	
  targeted	
  for	
  in	
  depth	
  annotation	
  by	
  the	
  ENCODE	
  Project	
  (Cooper	
  et	
  al.	
  2006).	
   Similarly,	
  relatively	
  large-­‐scale	
  in	
  vivo	
  enhancer	
  studies	
  have	
  been	
  performed	
  using	
   highly	
  conserved	
  (human	
  to	
  fish)	
  sequences	
  driving	
  reporter	
  gene	
  expression	
  in	
   transgenic	
  mouse	
  embryos,	
  leading	
  to	
  the	
  identification	
  of	
  75	
  forebrain-­‐specific	
   enhancers	
  (Pennacchio	
  et	
  al.	
  2006).	
  Kappen	
  et	
  al.	
  analyzed	
  the	
  regulatory	
  controls	
  for	
   lsl,	
  a	
  LIM/homeodomain	
  transcription	
  factor,	
  by	
  inserting	
  randomly	
  sheared	
  8-­‐10	
  kb	
   fragments	
  from	
  the	
  lsl	
  genomic	
  locus	
  into	
  reporter	
  constructs	
  and	
  testing	
  for	
   expression	
  both	
  in	
  vitro	
  and	
  in	
  vivo	
  (Kappen	
  and	
  Salbaum	
  2009).	
  Using	
  a	
  single	
  copy	
   insertion	
  mouse	
  transgenesis	
  procedure,	
  the	
  Pleiades	
  Promoter	
  Project	
  evaluated	
  over	
   100	
  candidate	
  regulatory	
  sequences	
  for	
  the	
  capacity	
  to	
  direct	
  selective	
  patterns	
  of	
   reporter	
  gene	
  expression	
  in	
  the	
  developed	
  brain	
  (Portales-­‐Casamar	
  et	
  al.	
  2010).	
  The	
   development	
  of	
  higher-­‐throughput	
  approaches	
  to	
  verify	
  enhancer	
  and	
  promoter	
   function	
  has	
  been	
  a	
  focus	
  of	
  recent	
  efforts	
  to	
  annotate	
  vertebrate	
  genomes.	
  	
   The	
  properties	
  of	
  skeletal	
  muscle	
  CRMs	
  have	
  been	
  widely	
  studied,	
  but	
  relatively	
  few	
   novel	
  functional	
  CRMs	
  have	
  been	
  described	
  since	
  CRM	
  prediction	
  methods	
  have	
   emerged.	
  To	
  quantify	
  the	
  performance	
  of	
  CRM	
  prediction	
  methods	
  requires	
  a	
  new	
  body	
   of	
  reference	
  data.	
  We	
  generated	
  predictions	
  of	
  CRMs	
  with	
  three	
  published	
  methods	
  and	
   assessed	
  the	
  predictive	
  benefit	
  of	
  sequence	
  conservation	
  and	
  annotation	
  of	
  the	
   expression	
  patterns	
  of	
  proximal	
  genes.	
  We	
  employed	
  LRA,	
  MSCAN,	
  and	
  ClusterBuster	
  to	
   scan	
  the	
  human	
  genome	
  for	
  putative	
  skeletal	
  muscle	
  regulatory	
  regions,	
  and	
  tested	
  a	
   	
    41  subset	
  for	
  the	
  capacity	
  to	
  drive	
  reporter	
  gene	
  expression	
  in	
  a	
  selective	
  manner	
  in	
  the	
   C2C12	
  cell	
  skeletal	
  muscle	
  differentiation	
  assay.	
  We	
  compare	
  the	
  reporter	
  gene	
   expression	
  in	
  immature	
  myoblasts	
  against	
  expression	
  in	
  mature	
  myotubes,	
  as	
  well	
  as	
  in	
   a	
  fibroblast	
  cell	
  line.	
  Based	
  on	
  the	
  outcomes	
  of	
  the	
  analysis,	
  we	
  define	
  additional	
   properties	
  of	
  sequence	
  composition	
  that	
  are	
  predictive	
  of	
  function	
  and	
  establish	
  a	
  new	
   reference	
  collection	
  for	
  the	
  continuing	
  development	
  of	
  predictive	
  methods.	
  	
   	
    	
    42  2.2 Methods	
   2.2.1 Candidate	
  CRM	
  Region	
  Selection	
   2.2.1.1 Human	
  Genome	
  Search	
  Regions	
  	
   Promoter	
  regions	
  are	
  identified	
  following	
  the	
  procedure	
  described	
  for	
  the	
  oPOSSUM	
   database	
  (Ho	
  Sui	
  et	
  al.	
  2005).	
  The	
  oPOSSUM	
  database	
  contains	
  the	
  set	
  of	
  genes	
   identified	
  as	
  being	
  in	
  one-­‐to-­‐one	
  human	
  and	
  mouse	
  ortholog	
  pairs	
  based	
  on	
   annotations	
  in	
  EnsEMBL	
  v.	
  41	
  and	
  UCSC	
  hg18/mm8	
  whole	
  genome	
  alignments.	
  	
  For	
   each	
  ortholog	
  pair,	
  10	
  kb	
  upstream	
  and	
  downstream	
  of	
  a	
  TSS	
  is	
  searched	
  for	
  CRMs.	
  All	
   noncoding	
  regions	
  are	
  included	
  in	
  the	
  search,	
  including	
  intergenic	
  regions,	
  introns,	
  and	
   untranslated	
  regions	
  (UTR)	
  of	
  exons;	
  protein	
  coding	
  portions	
  of	
  exons	
  are	
  excluded.	
   Any	
  noncoding	
  region	
  that	
  constitutes	
  a	
  portion	
  of	
  a	
  coding	
  exon	
  in	
  an	
  alternative	
   transcript	
  is	
  removed	
  from	
  the	
  selection	
  process.	
  All	
  alternative	
  transcription	
  start	
  sites	
   (TSS)	
  supported	
  by	
  either	
  human	
  or	
  mouse	
  Fantoms3	
  CAGE	
  evidence	
  were	
  identified	
   and	
  50	
  bp	
  on	
  either	
  side	
  of	
  each	
  TSS	
  was	
  excluded	
  in	
  order	
  to	
  avoid	
  core	
  promoter	
   regions	
  (Kawaji	
  et	
  al.	
  2006).	
  	
   2.2.1.2 Muscle	
  cis-­‐Regulatory	
  Module	
  Prediction	
   CRM	
  prediction	
  tools	
  were	
  used	
  to	
  search	
  for	
  muscle-­‐specific	
  regulatory	
  modules	
   within	
  the	
  specified	
  genome	
  sequences.	
  Logistic	
  Regression	
  Analysis	
  (LRA),	
  MSCAN,	
   and	
  ClusterBuster	
  were	
  applied	
  to	
  the	
  human	
  genomic	
  sequence	
  regions	
  specified	
   above	
  (Wasserman	
  and	
  Fickett	
  1998;	
  Frith	
  et	
  al.	
  2003;	
  Alkema	
  et	
  al.	
  2004).	
  The	
  input	
   TFBS	
  motif	
  models	
  were	
  taken	
  from	
  JASPAR,	
  a	
  database	
  of	
  transcription	
  factor	
  binding	
   site	
  profiles	
  (Sandelin	
  et	
  al.	
  2004).	
  The	
  models	
  used	
  were	
  MEF2A	
  (MA0052),	
  SRF	
   (MA0083),	
  MYF	
  (MA0055),	
  TEAD	
  (MA0090),	
  and	
  SP1	
  (MA0079);	
  TFs	
  with	
  described	
   key	
  roles	
  in	
  muscle-­‐specific	
  gene	
  expression.	
  Predicted	
  CRMs	
  composed	
  entirely	
  of	
  SP1	
   TFBS	
  were	
  excluded.	
  	
    	
    43  2.2.1.3 Conservation	
  Analyses	
   The	
  candidate	
  regions	
  were	
  analyzed	
  for	
  conservation	
  based	
  on	
  phastCons	
  scores	
   (generated	
  with	
  28	
  placental	
  mammal	
  genomes)	
  obtained	
  from	
  the	
  UCSC	
  Genome	
   Annotation	
  system	
  (Karolchik	
  et	
  al.	
  2008).	
  For	
  a	
  region	
  to	
  be	
  classified	
  as	
  conserved,	
   the	
  presence	
  of	
  at	
  least	
  one	
  sub-­‐region	
  with	
  phastCons	
  scores	
  of	
  0.7	
  or	
  greater	
  over	
  20	
   bp	
  is	
  required.	
  For	
  each	
  region,	
  both	
  the	
  mean	
  and	
  the	
  maximum	
  phastCons	
  scores	
   were	
  calculated	
  and	
  sub-­‐regions	
  with	
  phastCons	
  scores	
  over	
  0.7	
  were	
  extracted	
  and	
  the	
   ratio	
  of	
  the	
  length	
  of	
  these	
  sub-­‐regions	
  over	
  the	
  total	
  length	
  of	
  the	
  region	
  calculated.	
  For	
   phylogenetic	
  depth	
  evaluation,	
  three	
  sets	
  of	
  human	
  phyloP	
  scores	
  (generated	
  with	
  46	
   vertebrates,	
  46	
  placental	
  mammals	
  and	
  46	
  primates;	
  database	
  version	
  hg19)	
  were	
   obtained	
  from	
  the	
  UCSC	
  Genome	
  Annotation	
  system.	
   2.2.2 Experimental	
  Validation	
   2.2.2.1 Cell	
  Culture	
   Mouse	
  C2C12	
  myoblasts	
  (ATCC	
  CRL-­‐1772;	
  American	
  Type	
  Culture	
  Collection;	
   Manassas,	
  VA,	
  USA)	
  and	
  mouse	
  NIH-­‐3T3	
  fibroblasts	
  (ATCC	
  CRL-­‐1658;	
  American	
  Type	
   Culture	
  Collection;	
  Manassas,	
  VA,	
  USA)	
  were	
  maintained	
  in	
  Dulbecco’s	
  modified	
  Eagle’s	
   medium,	
  supplemented	
  with	
  10%	
  (v/v)	
  heat	
  inactivated	
  fetal	
  bovine	
  serum,	
  100U/ml	
   penicillin,	
  and	
  100µg/ml	
  streptomycin.	
  The	
  cultures	
  were	
  grown	
  at	
  37oC	
  and	
  5%	
  CO2.	
   Differentiation	
  of	
  myoblasts	
  into	
  myotubes	
  was	
  induced	
  by	
  transferring	
  C2C12	
  cells	
  to	
   differentiating	
  media	
  consisting	
  of	
  2%	
  (v/v)	
  horse	
  serum,	
  100U/ml	
  penicillin,	
  and	
   100µg/ml	
  streptomycin.	
  The	
  media	
  and	
  reagents	
  for	
  cell	
  culture	
  were	
  obtained	
  from	
   Gibco-­‐Invitrogen	
  (GIBCO-­‐Invitrogen	
  Canada,	
  Canadian	
  Life	
  Technologies,	
  Burlington,	
   ON,	
  Canada).	
   	
    	
    	
    44  2.2.2.2 Plasmids	
  and	
  Cloning	
   Primer3	
  was	
  used	
  to	
  design	
  the	
  flanking	
  primers	
  for	
  each	
  predicted	
  CRM	
  for	
  PCR	
   (Rozen	
  and	
  Skaletsky	
  2000).	
  After	
  performing	
  PCR	
  with	
  the	
  designed	
  primers	
   (synthesized	
  by	
  Invitrogen	
  Coporation	
  (Carlsbad,	
  CA,	
  USA)),	
  20	
  ng	
  of	
  each	
  PCR	
  product	
   was	
  pooled,	
  which	
  were	
  then	
  purified	
  using	
  the	
  PCR	
  purification	
  kit	
  (NEB,	
  Mississauga,	
   ON,	
  Canada)	
  and	
  subcloned	
  into	
  the	
  pGL-­‐3	
  promoter	
  luciferase	
  vector	
  (Promega;	
  Fisher	
   Scientific,	
  Nepean,	
  ON,	
  Canada)	
  via	
  Kpn	
  I	
  and	
  Bgl	
  II	
  restriction	
  enzymes	
  sites.	
   Restriction	
  digest	
  was	
  performed	
  overnight	
  at	
  37	
  ºC.	
  Post-­‐digestion,	
  the	
  vector	
  was	
   dephosphorylated	
  with	
  calf	
  intestinal	
  alkaline	
  phosphatase	
  (NEB,	
  Mississauga,	
  ON,	
   Canada).	
  The	
  restriction	
  enzyme-­‐digested	
  PCR	
  products	
  and	
  the	
  vector	
  were	
  gel-­‐ purified	
  using	
  QIAquick	
  gel	
  extraction	
  kit	
  (Qiagen	
  Inc.	
  Mississauga,	
  ON,	
  Canada)	
  and	
   ligated	
  using	
  T4	
  DNA	
  ligase	
  (NEB,	
  Mississauga,	
  ON,	
  Canada).	
  	
   A	
  set	
  of	
  control	
  clones	
  and	
  a	
  sample	
  of	
  the	
  library	
  were	
  prepared.	
  Constructs	
  were	
   transformed	
  into	
  sub-­‐cloning	
  efficient	
  DH5α	
  chemically	
  competent	
  E	
  .coli	
  cells	
  (GIBCO	
   Invitrogen	
  Canada,	
  Canadian	
  Life	
  Technologies,	
  Burlington,	
  ON,	
  Canada)	
  via	
  heatshock	
   at	
  42	
  ºC	
  and	
  plated	
  on	
  LB	
  agar	
  plates	
  containing	
  100μg/ml	
  of	
  Ampicillin	
  for	
  preliminary	
   bacterial	
  colony	
  screening.	
  Colonies	
  were	
  picked	
  and	
  inoculated	
  overnight	
  in	
  3ml	
  LB	
   broth	
  with	
  ampicillin.	
  Plasmids	
  were	
  prepared	
  using	
  QIAprep	
  Spin	
  Miniprep	
  Kit	
   (Qiagen	
  Inc.	
  Mississauga,	
  ON,	
  Canada).	
  Sequence	
  confirmation	
  was	
  performed	
  by	
  the	
   CMMT/CFRI	
  DNA	
  Sequencing	
  Core	
  Facility.	
   2.2.2.3 High-­‐throughput	
  Screening	
  of	
  Clone	
  Libraries	
   Large-­‐scale	
  transformation,	
  colony	
  picking,	
  miniprep,	
  and	
  sequencing	
  reactions	
  with	
   the	
  constructs	
  were	
  performed	
  (Genome	
  Science	
  Centre,	
  Vancouver,	
  BC,	
  Canada).	
  1	
  µl	
  of	
   ligation	
  mix	
  was	
  transformed	
  by	
  electroporation	
  into	
  E.	
  coli	
  DH10B	
  T1	
  resistant	
  cells	
   (Invitrogen).	
  Transformed	
  cells	
  were	
  recovered	
  using	
  1	
  ml	
  of	
  SOC	
  medium	
  and	
  plated	
   onto	
  22	
  cm	
  ×	
  22	
  cm	
  agar	
  plates	
  (Genetix)	
  containing	
  100	
  ug/ul	
  ampicillin.	
  Bacterial	
   colonies	
  were	
  picked	
  from	
  the	
  agar	
  plates	
  and	
  arrayed	
  into	
  384-­‐well	
  microtiter	
  plates	
   (Genetix)	
  using	
  a	
  QPIX	
  automated	
  colony	
  15	
  picker	
  (Genetix).	
  Plasmid	
  preparations	
   	
    45  were	
  performed	
  via	
  an	
  alkaline	
  lysis	
  protocol.	
  DNA	
  sequencing	
  reactions	
  were	
  prepared	
   using	
  a	
  Biomek	
  FX	
  workstation	
  (Beckman-­‐Coulter)	
  and	
  performed	
  using	
  BigDye	
  3.1	
   (Applied	
  Biosystems).	
  Analysis	
  of	
  the	
  resulting	
  sequences	
  to	
  the	
  target	
  DNA	
  regions	
   was	
  performed	
  with	
  AlignX	
  from	
  the	
  Vector	
  NTI	
  software	
  (Invitrogen).	
   2.2.2.4 DNA	
  Concentration	
  Measurement	
  and	
  Normalization	
   Concentration	
  of	
  the	
  plasmid	
  products	
  was	
  quantified	
  using	
  Picogreen	
  assays	
  (GIBCO-­‐ Invitrogen	
  Canada,	
  Canadian	
  Life	
  Technologies,	
  Burlington,	
  ON,	
  Canada)	
  via	
   fluorescence	
  measurement	
  with	
  a	
  POLARstar	
  Omega	
  microplate	
  reader	
  (BMG	
  Labtech;	
   Fisher	
  Scientific,	
  Nepean,	
  ON,	
  Canada).	
  All	
  DNA	
  samples	
  were	
  normalized	
  to	
  100ng/μl	
   per	
  well.	
  	
   2.2.2.5 Transfection	
  and	
  Reporter	
  Gene	
  Assays	
   Two	
  sets	
  of	
  C2C12	
  myoblasts	
  and	
  one	
  set	
  of	
  NIH-­‐3T3	
  fibroblasts	
  were	
  seeded	
  in	
  96-­‐well	
   plates	
  at	
  a	
  density	
  of	
  6000	
  cells	
  per	
  well.	
  The	
  myoblasts	
  were	
  divided	
  into	
  two	
  sets	
  so	
   that	
  one	
  set	
  could	
  be	
  harvested	
  as	
  myoblasts,	
  while	
  the	
  other	
  set	
  could	
  be	
  differentiated	
   into	
  myotubes	
  prior	
  to	
  harvest.	
  After	
  24	
  h	
  (at	
  70	
  %	
  confluency)	
  in	
  growth	
  media,	
  the	
   cells	
  were	
  transfected	
  with	
  200ng	
  of	
  a	
  pGL3-­‐promoter	
  firefly	
  luciferase	
  plasmid	
   construct	
  and	
  20ng	
  renilla	
  phRL-­‐TK	
  internal	
  control	
  luciferase	
  plasmid	
  (Promega,	
   Madison,	
  WI)	
  using	
  Lipofectamine	
  2000	
  according	
  to	
  the	
  manufacturer’s	
  protocol	
   (GIBCO-­‐Invitrogen	
  Canada,	
  Canadian	
  Life	
  Technologies,	
  Burlington,	
  ON,	
  Canada).	
  At	
  24	
   h	
  post-­‐transfection,	
  the	
  myoblast	
  C2C12	
  set	
  and	
  the	
  NIH-­‐3T3	
  fibroblasts	
  were	
   harvested	
  and	
  luciferase	
  activity	
  measured	
  using	
  the	
  Dual-­‐Luciferase	
  Reporter	
  Assay	
   System	
  (Promega,	
  Madison,	
  WI)	
  and	
  a	
  POLARstar	
  Omega	
  microplate	
  reader	
  (BMG	
   Labtech;	
  Fisher	
  Scientific,	
  Nepean,	
  ON,	
  Canada).	
  The	
  final	
  set	
  of	
  C2C12	
  myoblasts	
  was	
   switched	
  to	
  differentiating	
  media	
  24	
  h	
  after	
  transfection,	
  and	
  incubated	
  for	
  96	
  h	
  for	
   differentiation	
  into	
  myotubes.	
  For	
  each	
  clone,	
  duplicate	
  transfections	
  (technical	
   replicates)	
  were	
  performed.	
  The	
  reporter	
  gene	
  activity	
  assays	
  were	
  carried	
  out	
  in	
  two	
   phases.	
  In	
  phase	
  1,	
  all	
  plasmid	
  constructs	
  were	
  tested	
  in	
  the	
  three	
  cell	
  types.	
  In	
  the	
   second	
  phase,	
  only	
  myotube	
  and	
  myoblast	
  activities	
  were	
  assessed.	
  	
   	
    46  2.2.2.6 Technical	
  Challenges	
   Compared	
  to	
  the	
  initial	
  expectations	
  of	
  the	
  authors,	
  the	
  number	
  of	
  successfully	
   validated	
  muscle-­‐specific	
  CRMs	
  was	
  quite	
  low.	
  	
  One	
  significant	
  contribution	
  to	
  the	
  low	
   success	
  rate	
  was	
  technical	
  failures	
  in	
  the	
  transfection	
  assays.	
  It	
  has	
  been	
  reported	
  that	
   the	
  outer	
  columns	
  and	
  rows	
  in	
  96-­‐well	
  plates	
  can	
  exhibit	
  lower	
  expression	
  values	
   compared	
  to	
  the	
  inner	
  wells	
  (Malo	
  et	
  al.	
  2006).	
  	
  For	
  this	
  study,	
  no	
  special	
  precautions	
   were	
  taken	
  against	
  such	
  effects	
  because	
  it	
  was	
  expected	
  that	
  any	
  depressed	
  expression	
   would	
  be	
  normalized	
  across	
  the	
  experiments,	
  as	
  each	
  well	
  location	
  was	
  treated	
   independently	
  for	
  analysis.	
  	
  The	
  ultimate	
  measure	
  was	
  the	
  relative	
  expression	
  observed	
   at	
  a	
  position	
  between	
  the	
  myoblasts	
  and	
  myotubes.	
  However,	
  a	
  large	
  fraction	
  of	
   undetectable	
  expression	
  was	
  observed	
  in	
  the	
  outer	
  wells	
  (Figure	
  2-­‐1).	
  	
   The	
  data	
  analysis	
  was	
  made	
  more	
  complex	
  by	
  the	
  stochastic	
  nature	
  of	
  the	
  laboratory	
   procedure.	
  Because	
  the	
  plasmids	
  from	
  the	
  clones	
  are	
  picked	
  and	
  transferred	
  randomly	
   to	
  the	
  96	
  well	
  plates,	
  the	
  number	
  of	
  replicates	
  for	
  each	
  insert	
  sequence	
  of	
  interest	
  is	
   variable.	
  	
  In	
  this	
  study	
  the	
  replicates	
  range	
  from	
  1	
  to	
  12,	
  but	
  the	
  median	
  was	
  2.	
   	
    	
    	
    47  Figure 2-1. Biases in the location of wells with successful reporter assays. a) Distribution of validated regions across rows and columns. The number of validated regions in the middle rows and columns tend to be higher compared to others.  	
    	
    b) Distribution of failed assays across the 96-well plate. The vertical axis represents the count of wells that did not show any significant reporter expression. The 4 corner wells clearly have the highest number of such inactivity.  	
   	
    48  2.2.3 Validated	
  Region	
  Selection	
  	
   The	
  following	
  terminology	
  will	
  be	
  used	
  when	
  discussing	
  the	
  experimental	
  data:	
   •  Clone:	
  a	
  single	
  clone	
  bacterial	
  colony	
  with	
  a	
  homogeneous	
  insert	
  sequence	
  	
    •  Plasmid	
  prep:	
  plasmid	
  extraction	
  from	
  a	
  single	
  bacterial	
  culture	
  	
    •  Insert	
  sequence:	
  the	
  block	
  of	
  genomic	
  region	
  introduced	
  into	
  a	
  plasmid	
    •  Insert	
  plasmid:	
  the	
  vector	
  plasmid	
  containing	
  a	
  sequence	
  of	
  interest	
    •  Clone	
  replicates:	
  replicated	
  experiments	
  using	
  the	
  plasmids	
  from	
  the	
  same	
  clone	
  but	
   from	
  different	
  plasmid	
  preps	
  (i.e.	
  independent	
  DNA	
  preparation)	
    •  Insert	
  replicates:	
  experiments	
  using	
  plasmids	
  recovered	
  from	
  different	
  bacterial	
   clones	
  but	
  sharing	
  the	
  same	
  insert	
  sequence	
  	
    •  Technical	
  replicates:	
  replicated	
  experiments	
  using	
  plasmids	
  from	
  the	
  same	
  DNA	
   preparation	
  	
    All	
  statistical	
  analyses	
  were	
  done	
  using	
  the	
  R	
  software	
  (R	
  Development	
  Core	
  Team	
   2008).	
  The	
  ratios	
  of	
  firefly	
  luciferase	
  expression	
  values	
  over	
  the	
  renilla	
  luciferase	
   expression	
  values	
  were	
  calculated	
  to	
  measure	
  the	
  relative	
  increase	
  of	
  the	
  firefly	
   luciferase	
  activity	
  over	
  the	
  renilla	
  luciferase	
  activity	
  (the	
  internal	
  control	
  for	
   transfection	
  efficiency).	
  Clones	
  that	
  did	
  not	
  produce	
  both	
  firefly	
  and	
  renilla	
  luciferase	
   expression	
  values	
  above	
  the	
  minimum	
  threshold	
  of	
  1000	
  luminescence	
  relative	
  light	
   units	
  (LRUs)	
  were	
  marked	
  as	
  failed	
  transfections	
  and	
  removed	
  from	
  subsequent	
   analyses.	
  This	
  heuristic	
  filter	
  was	
  applied	
  to	
  exclude	
  spurious	
  expression	
  ratio	
   measurements,	
  as	
  the	
  ratio	
  of	
  two	
  small	
  values	
  can	
  result	
  in	
  a	
  disproportionately	
  high	
   value,	
  and	
  the	
  VSN	
  procedure	
  intended	
  to	
  mitigate	
  this	
  effect	
  was	
  not	
  sufficient	
  (Huber	
   et	
  al.	
  2002).	
  For	
  those	
  clones	
  where	
  only	
  the	
  firefly	
  luciferase	
  values	
  were	
  above	
  this	
   threshold,	
  the	
  renilla	
  luciferase	
  value	
  was	
  set	
  to	
  the	
  threshold	
  level.	
  This	
  step	
  was	
   designed	
  to	
  minimize	
  the	
  occurrence	
  of	
  large	
  ratios	
  even	
  when	
  the	
  firefly	
  luciferase	
   expression	
  values	
  are	
  near	
  the	
  threshold.	
  The	
  threshold	
  of	
  1000	
  LRUs	
  is	
  higher	
  than	
  the	
   median	
  machine	
  background	
  level,	
  which	
  was	
  found	
  to	
  be	
  below	
  250	
  LRUs.	
  	
  While	
  this	
   conservative	
  heuristic	
  filter	
  may	
  result	
  in	
  a	
  decrease	
  in	
  sensitivity,	
  the	
  trade-­‐off	
  was	
   	
    49  deemed	
  acceptable	
  in	
  order	
  to	
  avoid	
  situations	
  where	
  spurious	
  measurements	
  are	
   accepted	
  as	
  false	
  positive	
  results.	
  The	
  expression	
  ratios	
  from	
  the	
  two	
  technical	
   replicates	
  for	
  each	
  clone	
  were	
  averaged,	
  excepting	
  the	
  cases	
  where	
  a	
  replicate	
   transfection	
  failed	
  the	
  minimum	
  expression	
  threshold	
  filter	
  (in	
  such	
  cases	
  the	
  single	
   replicate	
  value	
  was	
  used).	
  The	
  expression	
  ratios	
  obtained	
  for	
  each	
  cell	
  type	
  were	
   normalized	
  using	
  the	
  VSN	
  package.	
  Each	
  clone	
  was	
  treated	
  as	
  an	
  independent	
  sample	
   even	
  though	
  there	
  were	
  in	
  some	
  cases	
  insert	
  replicates.	
  The	
  stochastic	
  variation	
  in	
  the	
   number	
  of	
  insert	
  replicates	
  would	
  otherwise	
  have	
  complicated	
  the	
  analysis.	
  Differential	
   expression	
  between	
  1)	
  fibroblasts	
  and	
  myotubes,	
  2)	
  myoblasts	
  and	
  myotubes,	
  and	
  3)	
   fibroblasts	
  and	
  myotubes	
  groups	
  were	
  determined	
  using	
  the	
  SAM	
  package	
  (Tusher	
  et	
  al.	
   2001),	
  applying	
  a