Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Improving the detection of transcription factor binding regions Hunt, Christine Rebecca 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2015_february_hunt_christine.pdf [ 22.1MB ]
Metadata
JSON: 24-1.0166083.json
JSON-LD: 24-1.0166083-ld.json
RDF/XML (Pretty): 24-1.0166083-rdf.xml
RDF/JSON: 24-1.0166083-rdf.json
Turtle: 24-1.0166083-turtle.txt
N-Triples: 24-1.0166083-rdf-ntriples.txt
Original Record: 24-1.0166083-source.json
Full Text
24-1.0166083-fulltext.txt
Citation
24-1.0166083.ris

Full Text

IMPROVING THE DETECTION OF TRANSCRIPTION FACTOR BINDING REGIONS by Christine Rebecca Hunt   A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF  DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)  THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)   December 2014  © Christine Rebecca Hunt, 2014   ii Abstract The identification of non-coding regulatory elements in the genome has been the focus of much experimental and computational effort. However, both experimental data, such as ChIP-seq, and computational methods of transcription factor (TF) binding predictions suffer from a degree of non-specificity. ChIP-seq experiments report regions that don’t contain the expected canonical motif for the ChIPped TF, which may arise from indirect binding or a non-TF-specific mechanism. Computational predictions based on sequence-level information alone are plagued by false positives. This thesis explores computational approaches to improve both the interpretation of large-scale TF binding data, and the detection of TF binding regions.  In Chapters 2 and 3 we observe that experimentally defined regulatory regions of the human genome are a mixture of sub-groups reflecting distinct properties. On average a third of a ChIP-seq dataset does not contain the targeted TF’s motif, and within this subset up to 45% of the ChIP-seq peaks are unexpectedly enriched for a small class of non-targeted TFs’ motifs. Many of these regions are not specific to a TF but are ChIPped by multiple diverse TFs across multiple cell types. These recurring regions tend to be the lower scoring peaks of a dataset, are less likely to reproduce between experimental replicates, and tend to associate with cohesin and polycomb protein occupied positions in the genome. The regulatory regions with a greater specificity for a TF do not share these properties. Based on these observations we suggest a TF ‘loading-zone’ model to account for the presence of the aforementioned recurrent regions in ChIP-seq data.   In Chapter 4 we further explore the regulatory region subgroups with a biophysical simulator of TF occupancy (tfOS). Within tfOS we have incorporated TF-DNA interaction energies, TF search mechanics, cooperative TF interactions, and sequence accessibility data into the model.    iii Simulations with tfOS across sequences reveal distinct features associated with recurrent and non-recurrent regions described in Chapter 3.   The research presented has improved our understanding and interpretation of large-scale TF binding data and advanced our understanding of TF regulatory regions, leading to improved annotation and interpretation of the human genome.   iv Preface The content of this dissertation is original, unpublished work by the author, Christine Rebecca (Worsley) Hunt.  Chapter 2. A version of this material has been published as Worsley Hunt R, Mathelier A, del Peso L, and Wasserman WW. Improving analysis of transcription factor binding sites within ChIP-seq data based on topological motif enrichment. BMC Genomics. 2014 Jun 13;15:472.  I performed all analyses and implemented the visualization methods. Together with Dr. Anthony Mathelier and Dr. Luis del Peso we designed, implemented and tested the BiasAway tool, while the final BiasAway package was written by Dr. Anthony Mathelier. Dr. Wyeth Wasserman and I designed the study and wrote the manuscript; the Chapter 2 Methods for BiasAway were written by Dr. Anthony Mathelier and myself.   Chapter 3. A version of this material has been published as Worsley Hunt R and Wasserman WW. Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biology. 2014 Jul 29;15(7):412.  I performed all analyses. Dr. Wyeth Wasserman and I designed the study and wrote the manuscript.   Chapter 4. This material is original work in preparation for submission. The original tfOS simulation code was implemented in Perl, and developed by Diane Wang, Peter Sudmant and Dr. Wyeth Wasserman. David Arenillas re-implemented the original simulator code in C++ and the Perl code that scores a sequence for binding energies. Subsequent expansion of the tfOS C++ implementation and the Perl code has been written by myself. This expansion has added probabilistic decisions for how a protein moves in the simulator, protein binding interactions, a binding threshold, and sequence accessibility data. The design of the study has been extended   v by Dr. Wyeth Wasserman and myself, I have performed all analyses reported, and wrote this chapter.   vi Table of Contents Abstract ....................................................................................................................................... ii	  Preface ........................................................................................................................................ iv	  Table of Contents ...................................................................................................................... vi	  List of Tables ............................................................................................................................ xiii	  List of Figures .......................................................................................................................... xiv	  List of Abbreviations ................................................................................................................ xv	  Acknowledgements ................................................................................................................ xvii	  Dedication ................................................................................................................................ xix	  Chapter 1: Introduction .............................................................................................................. 1	  1.1	   Background ...................................................................................................................... 1	  1.2	   Experimental detection of TFBSs .................................................................................... 5	  1.3	   Computational detection of TFBSs .................................................................................. 9	  1.3.1	   Classical models of TFBS prediction ........................................................................ 9	  1.3.2	   Advanced models of TFBS prediction ..................................................................... 11	  1.3.3	   TFBS prediction through ChIP-seq data ................................................................. 12	  1.3.4	   Biophysical models for predicting regulatory regions .............................................. 14	  1.4	   Thesis research ............................................................................................................. 18	  Chapter 2: Improving analysis of transcription factor binding sites within ChIP-seq data based on topological motif enrichment .................................................................................. 23	  2.1	   Background .................................................................................................................... 23	  2.2	   Results ........................................................................................................................... 25	  2.2.1	   Composition studies reveal the influence of non-random properties of the metazoan genome on the interpretation of ChIP-seq data .................................................................. 25	    vii 2.2.2	   Consequences of biased composition for TFBS motif over-representation analyses     ................................................................................................................................ 26	  2.2.2.1	   Visualizing composition bias in TFBS over-representation analyses ............... 26	  2.2.2.2	   The impact of background composition selection on TFBS over-representation results  ......................................................................................................................... 28	  2.2.3	   Assessing TFBS predictions within ChIP-seq regions ............................................ 31	  2.2.3.1	   TFBS-landscape view ...................................................................................... 31	  2.2.3.2	   Using the HADB algorithm to distinguish direct binding and for the selection of PWM motif score thresholds ........................................................................................... 34	  2.2.3.3	   Regions predicted by HADB to directly bind the ChIPped TF are more likely to replicate between experiments ....................................................................................... 37	  2.2.3.4	   Regions predicted by HADB to directly bind the ChIPped TF are enriched for gene ontology terms related to the TF’s key biological process ..................................... 37	  2.2.4	   Complementary visualization methods for spatial ChIP-seq patterns: TFBS bi-motif view and Dinucleotide-environment view ............................................................................ 38	  2.2.4.1	   TFBS-bi-motif view ........................................................................................... 38	  2.2.4.2	   Dinucleotide-environment view ........................................................................ 41	  2.2.5	   An applied case study of ChIP-seq analyses .......................................................... 43	  2.3	   Discussion ...................................................................................................................... 46	  2.4	   Methods ......................................................................................................................... 51	  2.4.1	   Datasets .................................................................................................................. 51	  2.4.2	   Motif prediction ........................................................................................................ 53	  2.4.3	   Nucleotide composition ........................................................................................... 53	  2.4.4	   Motif over-representation analyses ......................................................................... 53	  2.4.5	   Naïve backgrounds ................................................................................................. 54	    viii 2.4.6	   3rd order Markov model background ....................................................................... 54	  2.4.7	   HOMER 2 GC background ..................................................................................... 55	  2.4.8	   BiasAway background generating tool .................................................................... 55	  2.4.9	   Measures to evaluate over-representation analysis results .................................... 56	  2.4.10	   Enrichment visualization plots and HADB boundaries of enrichment ................... 57	  2.4.10.1	   Composition-bias plots ................................................................................... 57	  2.4.10.2	   TFBS-landscape view .................................................................................... 57	  2.4.10.3	   HADB motif enrichment threshold .................................................................. 58	  2.4.10.4	   HADB motif score lower threshold ................................................................. 58	  2.4.10.5	   TFBS-bi-motif view ......................................................................................... 60	  2.4.10.6	   Dinucleotide-environment view ...................................................................... 60	  2.4.11	   Analysis of broadPeak replicate experiments ....................................................... 61	  2.4.12	   Gene ontology term enrichment analysis .............................................................. 62	  2.4.13	   TBP case study – repeat elements analysis ......................................................... 62	  Chapter 3: Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets .............................................................................................................................. 63	  3.1	   Background .................................................................................................................... 63	  3.2	   Results ........................................................................................................................... 65	  3.2.1	   Zingers are TF binding motifs enriched across multiple TF ChIP-seq datasets ..... 65	  3.2.2	   Ab initio motif discovery of zinger profiles ............................................................... 68	  3.2.3	   Zinger motif enrichment observed within open chromatin and genomic datasets .. 69	  3.2.4	   Visualizing the pattern of motif enrichment ............................................................. 69	  3.2.5	   Defining a set of zinger motif containing peaks ...................................................... 71	  3.2.6	   No strong dependencies detected for zinger motif occurrence ............................... 73	    ix 3.2.7	   Peaks containing a zinger motif but lacking the ChIPped TF motif have low scores      ................................................................................................................................ 74	  3.2.8	   Peaks with a zinger motif may be bona fide targets of the zinger TF ..................... 74	  3.2.9	   Zinger motif peak regions recur across multiple TF datasets ................................. 76	  3.2.10	   Zinger neighbourhoods tend to occur close to regions occupied by cohesin ....... 78	  3.3	   Discussion ...................................................................................................................... 80	  3.4	   Methods ......................................................................................................................... 84	  3.4.1	   Datasets .................................................................................................................. 84	  3.4.2	   Motif over-representation analysis .......................................................................... 86	  3.4.3	   Motif over-representation analysis with shuffled matrices ...................................... 87	  3.4.4	   Motif prediction ........................................................................................................ 88	  3.4.5	   MEME suite tools .................................................................................................... 89	  3.4.6	   Repeat-masking ...................................................................................................... 89	  3.4.7	   Data processing and statistical analyses ................................................................ 89	  3.4.8	   TFBS-landscape visualization plots ........................................................................ 89	  3.4.9	   Heuristic boundaries of enrichment ........................................................................ 90	  3.4.10	   Background expectation of motif predictions ........................................................ 91	  3.4.11	   Heatmaps and correlation between zinger motifs ................................................. 92	  3.4.12	   ChIP-seq controls ................................................................................................. 92	  3.4.13	   Evaluating proximity of zinger motif peaks to genomic features ........................... 93	  3.4.14	   Comparing zinger regions from non-zinger ChIP-seq datasets to peaks ChIPped by the zinger TF .................................................................................................................. 93	  3.4.15	   Generation of ChIP-seq peak neighbourhoods ..................................................... 94	  3.4.16	   Neighbourhood proximity to cohesin and polycomb repressive complex ............. 95	    x Chapter 4: Analysis of regulatory sequence TF occupancy using a biophysical  simulation .................................................................................................................................. 96	  4.1	   Background .................................................................................................................... 96	  4.2	   The transcription factor occupancy simulator (tfOS) model ........................................... 99	  4.3	   Results ......................................................................................................................... 106	  4.3.1	   A single sequence example using FOXA2 occupancy predictions demonstrates specificity improvements with H3K4me1 data or HNF4a co-regulation ............................ 106	  4.3.2	   Open chromatin data improves the specificity of occupancy predictions for a subset of regulatory regions ......................................................................................................... 109	  4.3.3	   Occupancy predictions of non-recurring ChIP-seq regions are varied in response to histone modification data or open chromatin data ............................................................ 112	  4.3.4	   Co-regulation driven stabilization of binding improves simulations of FOXA2 occupancy but not POU5F1 occupancy ............................................................................ 114	  4.4	   Discussion .................................................................................................................... 116	  4.5	   Methods ....................................................................................................................... 120	  4.5.1	   Pre-processing of data for simulation ................................................................... 120	  4.5.2	   Model framework of the TF occupancy simulator ................................................. 122	  4.5.2.1	   TF binding decision ........................................................................................ 123	  4.5.2.2	   Competition .................................................................................................... 125	  4.5.2.3	   3D translocation ............................................................................................. 125	  4.5.2.4	   1D translocation ............................................................................................. 125	  4.5.3	   Data sets ............................................................................................................... 126	  4.5.4	   Recurrent and non-recurrent regions .................................................................... 127	  4.5.5	   Simulations ........................................................................................................... 127	  4.5.6	   Processing tfOS results ........................................................................................ 128	    xi 4.5.7	   Classical PWM analyses ....................................................................................... 128	  4.5.8	   Software accessibility ............................................................................................ 129	  Chapter 5: Conclusion ........................................................................................................... 130	  5.1	   Interpretation of ChIP-seq data and TFBSs ................................................................. 131	  5.2	   The TF view of the genome ......................................................................................... 134	  5.3	   Future work toward understanding TF regulation of gene expression ......................... 135	  Bibliography ............................................................................................................................ 141	  Appendices ............................................................................................................................. 151	  Appendix A ........................................................................................................................... 151	  Appendix B ........................................................................................................................... 152	  B.1	   Sequence composition impact on TFBS prediction ................................................. 152	  B.2	   TFBS motif over-representation analyses ............................................................... 152	  B.3	   TFBS-landscape plot biological interpretation ......................................................... 156	  B.4	   Regions predicted by HADB to directly bind the ChIPped TF are enriched for peaks that co-occur between replicate experiments ................................................................... 158	  B.5	   GO term analyses on HADB identified ChIP-seq peaks .......................................... 158	  Appendix C ........................................................................................................................... 160	  Appendix D ........................................................................................................................... 163	  Appendix E ........................................................................................................................... 179	  Appendix F ............................................................................................................................ 182	  F.1	   Experimentally determined PWM score thresholds ................................................. 182	  F.2	   Position frequency matrices ..................................................................................... 182	  Appendix G ........................................................................................................................... 183	  Appendix H ........................................................................................................................... 196	  H.1	   Open chromatin data and non-TF ChIP-seq data have zinger enrichment ............. 196	    xii H.2	   Extreme cases of datasets lacking the ChIPped TF’s motif .................................... 197	  H.3	   Average background expectation of motif predictions ............................................. 197	  H.4	   Zinger motif attributed peaks’ proximity to genomic features .................................. 198	  H.5	   Zinger motif peaks may be bona fide targets of the zinger TF ................................ 199	  Appendix I ............................................................................................................................. 201	  Appendix J ............................................................................................................................ 202	  Appendix K ........................................................................................................................... 203	  Appendix L ............................................................................................................................ 210	     xiii List of Tables Table 4.1 tfOS parameters ....................................................................................................... 101	  Table 4.2 Recurrent regions analyses’ mean AUC values for 100 regions per TF ................... 110	  Table 4.3 Non-recurrent regions analyses’ mean AUC values for 100 regions per TF ............ 113	  Table 4.4 The mean AUC values from simulations of cooperative interactions on 100 sequences per TF ....................................................................................................................................... 116	     xiv List of Figures Figure 1.1 Regulatory regions. ..................................................................................................... 3	  Figure 1.2 ChIP-seq protocol overview. ........................................................................................ 7	  Figure 2.1 Composition bias plots for bias identification. ............................................................ 27	  Figure 2.2 Over-representation bias from multiple background models. .................................... 30	  Figure 2.3 TFBS-landscape views. ............................................................................................. 33	  Figure 2.4 Defining the TFBS zone of enrichment around the peakMax. ................................... 35	  Figure 2.5 TFBS-bi-motif view for visualization of motif spatial arrangements. .......................... 39	  Figure 2.6 Dinucleotide-environment view. ................................................................................. 42	  Figure 2.7 Case study of TBP in the mouse MEL cell-line. ........................................................ 44	  Figure 2.8 Case study of ZNF143 DNA binding preferences. .................................................... 46	  Figure 3.1 Zinger binding motifs are enriched across multiple human ChIP-seq datasets. ........ 67	  Figure 3.2 Zinger motifs are enriched at the peakMax of non-zinger ChIP-seq datasets. .......... 70	  Figure 3.3 ChIP-seq datasets contain multiple classes of peaks. .............................................. 72	  Figure 3.4 Zinger motif peaks overlap ChIP-seq data for the given zinger TF. .......................... 75	  Figure 3.5 Zinger motif peaks recur across datasets for multiple TFs. ....................................... 77	  Figure 3.6 A model to account for zinger motif enrichment across ChIP-seq datasets. ............. 82	  Figure 4.1 A graphical depiction of a tfOS simulation. .............................................................. 102	  Figure 4.2 A non-specific energy is evident after mutating the MAX consensus binding site. .. 103	  Figure 4.3 tfOS search mechanics. .......................................................................................... 105	  Figure 4.4  FOXA2 occupancy counts on a single sequence. .................................................. 109	  Figure 4.5 Recurrent regions analyses’ AUC values for FOXA2 and JUND ............................ 111	  Figure 4.6 Non-recurrent regions analyses’ AUC values for FOXA2 and JUND ...................... 114	     xv List of Abbreviations  1D, one dimensional 3D, three dimensional 4C, circularized chromosome conformation capture Ab, antibody API, application programming interface AUC, area under the curve bp, base pair CAGE, cap analysis of gene expression ChIA-PET, Chromatin Interaction Analysis by Paired-End Tag Sequencing ChIP-chip, chromatin immunoprecipitation-on-chip ChIP-seq, chromatin immunoprecipitation and high-throughput sequencing ChIP-exo, chromatin immunoprecipitation and high-throughput sequencing with endonuclease digestion CB, Composition-bias ENCODE, Encyclopedia of DNA Elements FANTOM5, Functional Annotation of the Mammalian Genome, phase 5 GfEM, Gibbs free energy matrix GO, Gene Ontology HADB, heuristic algorithm for direct binding HOT, High Occupancy of Transcription-related factors peakMax, peak local maximum  PFM, position frequency matrix  pp, percentage point    xvi PWM, position weight matrix ROC, receiver operating characteristic SINE, short interspersed elements TF, transcription factor TFBS, transcription factor binding site tfOS, transcription factor occupancy simulator TSS, transcription start site   xvii Acknowledgements No person stands alone in life, and that has certainly been true for my walk in science and in particular the progress of my research contributing to this thesis. Dr. Wyeth Wasserman in particular as my thesis supervisor has provided years of patient guidance, and encouragement; believing in me despite my disbelief. Thank-you, Wyeth, for our scientific discussions, your insights, experience, and your constant support and kindness. I also owe a great deal of thanks to my supervisory committee: Dr. Don Moerman, Dr. Anne Condon, and Dr. Paul Pavlidis, for their years of advice and mentorship throughout my studies. Their encouragement to start a doctoral program of studies, and their following support and positive feedback has been invaluable. I will always remember their efforts with the greatest of gratitudes.   I would also like to thank Dr. Tamara Munzner who was a mentor to me in the UBC computer science department, and who continued in that role while I was doing my thesis research. She provided her valuable time and advice at critical points, and I am indebted to her. Thanks are also due to both past and present members of the Wasserman lab, and the three wonderful scientists who joined the Wasserman lab for a year of sabbatical (Dr. Luis del Peso, Dr. Francois Parcy, and Dr. Charles Lecellier), for group and personal discussions, insightful comments on presentations and manuscripts, on-going encouragement in the sinusoidal cycle of doctoral studies, and their laughter and friendship. The Wasserman lab clearly reflects the principal investigator, and has been a positive and warm environment to work in. In particular, I gratefully acknowledge the lab’s software developer, David Arenillas, for his help during my research work. He has been a constant fount of knowledge with regards to the lab’s software, and everything else computational. The Wasserman lab has also been blessed by the administrative support of Dora Pak, and the IT support of Miroslav Hatas and Jonathan Chang; thank-you so much for making everything smoother and simpler.  I’d also like to thank the   xviii Bioinformatics Program coordinator, Sharon Ruschowski, who has also made things smoother and simpler; always responding swiftly to queries for information or help.  I would like to acknowledge the generosity of the Genome Science Centre (GSC) bioinformatics group who permitted me a part-time seat at the GSC for several years. Thank-you to Dr. Sorana Morrissy, Dr. Gordon Robertson, Dr. Misha Bilenky, and Dr. Anthony Fejes for their willingness to answer questions, and to make suggestions of research avenues to explore. I also owe a debt of gratitude to Dr. Dima Brodsky, for reviewing some of my early C++ code and suggesting improvements.  In all scientific research, funding is a vital and central component. I am grateful to the University of British Columbia, the Bioinformatics Graduate Program, the Child and Family Research Institute, the Micheal Smith Foundation for Health Research, and the Canadian Institute of Health Research for salary, research, equipment, and travel funding. The financial assistance has helped alleviate many stresses.    xix Dedication The following dedication is an acknowledgement of a debt of gratitude. The support of friends and family who believe in you is of inestimable value.   First and foremost I would like to thank my parents, Brian and Miranda Hunt, who have weathered every step of the way with me. Thank-you for listening, caring and being patient even when you heard me say the same thing for the hundredth time, and for being there anytime I needed it. Another person who has been there every step of the way is my most wonderful friend Sorana Morrissy (née Petrescu), with whom I can share both the dismay and joys of research, escape with into the mountains to rejuvenate, and who is open-hearted enough to tolerate listening to my self-doubts repeatedly. Dima Brodsky, another dear friend, has also been a constant presence with a shoulder to lean on and the voice of experience for the process of doctoral research, and like Sorana has invited me to play outside and listened to my doubts. Frequently. To you both, thank-you from the depths of my heart for your friendship and for sharing your research experiences. The Rogers family has provided me a second home to retreat to and given me many hours of just being ‘part of the family’, which has been a wonderful support and escape throughout the PhD thesis research process.  I would also like to thank Don Moerman who has been both mentor and friend, and along with his wife, Jan, has been a caring and supportive presence both before and during my PhD.  I cannot express how much it means to have you all a part of my life.  Last and not least, there is my childhood and life’s friend, Carolyn Jones, and Himself, they know what they have meant to me through the years.    1 Chapter 1: Introduction  1.1 Background Transcriptional regulation is central to the response of a cell or organism to physiological or environmental conditions. The patterns of gene expression in multi-cellular organisms determines the developmental fate of cells at the appropriate location and time to produce a mature organism from a single cell, and the subsequent maintenance of healthy tissues and organs. Not surprisingly, damage to the circuitry that regulates gene expression is associated with diverse diseases, such as the following examples in human: hemophilia B [1], diabetes [2], aniridia [3, 4], α-thalassemia [5], inflammatory bowel disease [6], pancreatic agenesis [7], otofaciocervical syndrome [8], and cancers (e.g. leukemias; myelomas; breast cancer) [9, 10].  Understanding the mechanisms governing the selective expression of genes in a biological context is a key step towards future therapies in which damaged genomes are repaired, and tissues are regenerated at the cellular level.  Gene expression is regulated at the DNA, RNA and protein levels, spanning such diverse events as chromatin compaction – which affects the accessibility of DNA to regulatory proteins, transcription initiation – where proteins are recruited to the DNA and DNA is transcribed into RNA, translational – where RNA is modified and translated into protein, and both RNA and protein localization. This thesis focuses on a subset of regulatory proteins and their interaction with specific DNA sequences to modulate the rate of transcription initiation.  The most basic level of DNA expression regulation is through structural mechanics, the protein packaging of DNA into chromatin.  At a coarse level, chromatin can be considered as the mixture of DNA wrapped around histone proteins, forming structures called nucleosomes.  The   2 histone-DNA structures can be compacted or released to control the activity of genes.  Gene-depleted DNA often contributes to an inactive, highly compacted form of chromatin called constitutive heterochromatin, which can have a structural function (e.g. centromeres).  Gene-rich DNA can be part of either facultative heterochromatin, or euchromatin. Facultative heterochromatin is reversibly silenced DNA; its coverage can range from most of a chromosome to smaller regions such as gene promoters.  Euchromatin is the active, and most open, state of chromatin, in which genes are poised for activation through the binding of various trans-acting factors (factors that can be excluded from more compact chromatin) to such regions as promoters.  Genes that are highly transcribed have relatively large and stable nucleosome depleted regions near transcription start sites [11], presumably to facilitate uninterrupted access by regulating factors.  Key to the transition of inactive regions to active transcription are proteins that covalently modify the “tails” of histone proteins. This epigenetic layer of modification acts as interaction instructions for trans-acting proteins such as chromatin remodeling proteins and gene expression regulating proteins.  Examples of proteins that commonly modify the histone tails are acetyltransferases/deacetylases, methyltransferases/demethylases, and kinases/phosphatases, and the resulting modifications are associated with different transcriptional states. For instance, histone acetyltransferase proteins act on amino acid lysines in a histone tail and acetylation of the K14 and K9 lysines of the H3 histone tail relate to transcriptional activation; histone methyltransferases act on both lysines and arginines where methylation of K9 lysine of the histone-3 (H3) tail is associated with constitutive heterochromatin, while tri-methylation of K4 lysine on H3 or methylation of R3 arginine on H4 is associated with transcriptional activation. The epigenetic modifications by these and other proteins work combinatorially to guide such regulatory events as activation or repression of genes [12, 13].    3  Beyond the chromatin and epigenetic layers of gene regulation, are the proteins that bind DNA to initiate or repress the recruitment and activity of polymerase proteins at transcription start sites. Sequence specific transcription factors (TFs) are a subset of DNA-binding proteins that regulate when, where, and with what intensity a gene is transcribed. These regulatory proteins bind to specific patterns in the DNA that are found locally within regulatory regions such as promoters or more distal regulatory modules such as enhancers, silencers, or insulators (Figure 1.1).    Figure 1.1 Regulatory regions. Regulatory regions are both local and distal to transcription start regions and can be found both upstream or downstream of a gene, or within a gene body. Promoter regions are generally specific to the gene(s) in which they are situated, while the distal regions cannot be reliably associated with a gene through a simple distance metric. TFBSs (small coloured rectangles) distal to the promoter act as enhancers, silencers, or insulators, and often cluster in groups.    Promoters can be considered as two elements: a proximal promoter, contributing to core promoter activity, that consists of up to 350bp upstream of a gene’s transcription start sites [14, 15], and the distal promoter, which is upstream of the proximal promoter and varies in length dependent on the organism (generally 1,000–2,000bp in human or mouse). Enhancers are   4 regions associated with increased transcription and are bound by TFs that are transcriptional activators. Enhancers may overlap with silencers, which are regions bound by transcriptional repressors to decrease transcription. Insulators are bound by insulator TFs to act as boundary markers between active (euchromatin) and inactive (heterochromatin) chromatin regions.  Likely reflecting the secondary structure properties of looped DNA within a nucleus, there are confirmed cases of regulatory regions up to ~106 bp distant from the transcription initiating promoter of a gene [16].  Most annotated TF binding sites (TFBSs), however, are found in the vicinity of transcription start sites (e.g. on the order of ~104 bp for human genes).    The combination of TFs that bind these regulatory regions at a particular location and time, act to control events such as cellular differentiation, body patterning, and organ development, as well as providing a means for dynamic cellular response to changing environmental conditions. As key regulatory elements of gene expression, mutations in TF binding sites, or the creation of new binding sites, can disrupt essential protein-DNA interactions required for the appropriate patterning or magnitude of gene expression, and as such, identification of these binding sites and the TFs that bind them, is an important area of research [17]. While much emphasis has been placed on protein-altering variation, it is reasonable to anticipate that regulatory sequence disruptions will become a key focus for medical genetics research as whole genome sequencing of patients becomes a feasible research tool.  The location of TF binding sites (TFBSs) and understanding the interactions between TF and DNA, while not yet a focus of the medical field, is a focus of gene regulation research. However, the detection of functional TFBSs, the identification of specific TFs that bind the TFBSs, and the TFs gene targets has proven difficult as: i) TFs bind short (6-20bp), degenerate patterns i.e. sequences that are similar but not identical, that may arise by chance in the genome, ii) unique   5 TFs may share similar DNA binding domains, and thus recognize similar sequence patterns, and iii) TFBSs are not always linearly proximal along the DNA relative to their gene targets but instead may be located remotely and yet brought into proximity through 3D chromatin organization, such as chromatin looping mechanisms. Numerous approaches have been taken to investigate the means of TF-based regulation of gene expression, both at the experimental and computational levels. A particular topic of discussion has been how low copy numbers of a TF [18] (mean ~5500 copies per cell for 10 TFs across three time points, but with high variance or potentially distinct high and low expression classes) manage to efficiently locate to all of their target genes, as this copy number is far less than the open and accessible size of the genome (>42 million base pairs per DNaseI-seq data).  1.2 Experimental detection of TFBSs Experimental methods, such as promoter deletion analyses or gel shift assays, have been successful in revealing regulatory regions, individual TFBSs and the general binding site patterns targeted by a TF.  In the past the experimental methods were applied for the laborious analysis of a single gene’s upstream regulatory region, or a single TFBS; however with the development of high-throughput technologies (in vitro and in vivo), researchers are studying multiple promoters, binding sequences, and TFs within a single experiment. Even the study of the three-dimensional space of the nucleus and the interactions across that space between regulatory regions has become accessible through variants of these technologies [19].  The partnering of the chromatin immunoprecipitation (ChIP) technique with either genome chip arrays (ChIP-chip) or more recently with high-throughput sequencing (ChIP-seq) in particular have advanced the genome-wide mapping of regulatory sequences, as delineated by either TFs or histone derivatives. The ChIP-seq technique is currently the most commonly used for the   6 study TFs’ genomic binding locations and patterns (Figure 1.2). The ChIP-seq technique, in brief, uses an antibody (frequently polyclonal) against a target protein to pull down the protein of interest and the sheared fragment of DNA to which the protein was bound (or was indirectly bound through another protein) at the time of a protein-DNA cross-linking process. The DNA is size selected through gel electrophoresis (~100-150bp long) and the ends of the fragments are sequenced and mapped back to the reference genome [20, 21]. The locations in the genome where the protein was recurrently engaged should be detected as a frequency of reads recovered from a location that is greater than a background measure i.e. control data. The control data is cross-linked and sheared DNA that is then only size-selected (Input control), or it is both size-selected and immunoprecipitated with a non-specific antibody (mock-IP control).   The characteristics of ChIP-seq data are shaped by both biological and technical influences [22-24].  As with every high-throughput technology, the community learns progressively more about the nuances of the data as it accumulates. Much effort has focused on the development of peak finding methods, which allow for the quantitative determination of TF-bound regions from the sequences recovered in a ChIP-seq experiment. Methods vary on how the DNA fragments ChIPped by the experiment are estimated from the short sequence reads (~30bp), however, the end goal of all methods is to quantify the frequency of the reads and the overlap of the extrapolated fragments. Current methods take into account the background rate of sequence recovery from the control data and use this background to evaluate the significance of an observed number of mapped reads or overlapping fragments from the ChIP experiment [25-27].     7  Figure 1.2 ChIP-seq protocol overview.  DNA and protein (coloured ovals and trapezoid) are first cross-linked and then sonicated to fragment the DNA. An antibody (green Y-shape) specific to the TF of interest is used to enrich for the regions of DNA that interact with the TF. The crosslink is then reversed, the DNA fragments are sequenced, and the reads (short black lines, piled up) are mapped onto the genome. ChIP-seq peak calling software is then used to identify regions with a concentration of reads that is greater than the genomic background. These regions are termed ‘peaks’. They have a start (5’) and end (3’) coordinate, and often a peak local maximum coordinate to indicate the location of maximum read count.     8 The comparison of the ChIPped sequences against a background by ChIP-seq “peak” finding software is the basis for specifying the TF-bound regions (peaks), usually delineated with a start, end and local maximum read density position (i.e. ‘peakMax’). Depending on the peak calling software, these regions may be the width of a single base pair (i.e. only the peak maximum position), or up to 5000bp or more wide. The most common range of widths is 100-400bp. Newly emerging techniques, such as ChIP-exonuclease (ChIP-exo) [28], which enzymatically digests the nucleotides flanking the DNA cross-linked to a protein prior to sequencing, promise to significantly improve the resolution of the ChIP-based technologies.  The release of the large scale ENCODE (ENCyclopedia Of DNA Elements) project findings of functional non-protein coding elements supports the long-standing view that a greater portion of the human genome contributes to the regulation of gene activity than the ~2% encoding proteins [29].  The ENCODE project has expanded the known regulatory regions with hundreds of ChIP-seq experiments for over 100 TFs and many other regulatory proteins such as co-activators (proteins that interact with activating TFs to increase gene expression) and histone modified nucleosomes. These experiments, across multiple cell lines, have successfully highlighted genomic regions that contain regulatory elements, and helped reduce the computational search space for TFBSs many fold. Not withstanding the data resource it has generated, the ENCODE project has also provided standardized protocols for quality ChIP-seq data generation [30].   Another large-scale effort that has contributed to the identification of regulatory regions and further understanding of gene regulation is the recently published FANTOM5 project [31]. The project was established to identify transcription start sites, using the capped analysis of gene expression (CAGE) protocols, across the majority of human tissues and cell types, and to   9 assess the activity of gene expression across time course experiments. Invaluable data arising from this project has been an atlas of the short, bi-directional transcripts that arise at active enhancers [32]. By using these marker transcripts, it is possible to determine those enhancers active in a particular cell-line and context, and from the abundance of the transcripts obtain a measure of enhancer activity. The activity of an enhancer can in turn be used to associate the enhancer with TSSs that have a similar time course and level of transcriptional events. This is a step forward in terms of associating enhancers with the genes that they regulate, which until now has been predominantly accomplished by distance measures, which are poor.  1.3 Computational detection of TFBSs Computational methods for the study of transcriptional regulation unite experimental data, experimentally derived principles, and pattern recognition methods to specify the locations of TFBSs and cis-regulatory regions. In the long term, it is hoped that insight into the mechanisms of transcription regulation will lead to computational predictions of regulatory sequences consistent with experimental results. However, while computer-based methods are more time efficient and less costly than laboratory methods, they are currently constrained by lack of specificity (false predictions).    1.3.1 Classical models of TFBS prediction Using experimentally derived knowledge, computational models have been developed to discriminate between the 6-20 bp patterns of TFBSs relative to the non-binding sites within the genome. As experimental ChIP-seq data generally consists of a set of sequences longer than the typical 6-20bp wide binding pattern of a TF, delineating the precise coordinates at each TFBS is a computational problem (computational methods do exist to first detect the enrichment of a degenerate pattern common to the sequences (de novo motif discovery)) [33-36], prior to   10 providing a model for specifying the location of the pattern. The most widely used computational form of the predictive models representing the binding specificity of a TF is the position weight matrix (PWM) [37].  A TF’s PWM is derived from a frequency matrix, which summarizes how often each base (A, G, C, T) appears at each position of an aligned set of known TFBSs for the TF. The values of the PWM are log-odds scores for the weighted frequency of each base at each position, relative to the expected frequency of that base in the genomic background (see Appendix A, Figure A.1 for details).  The classical assumption required for generation and use of a PWM is that of nucleotide independence, e.g. the base at position 1 does not influence which base occurs at position 2, which reduces binding site score determination to the sum of the contribution from each position. The assigned scores are used to measure how closely a potential binding sequence in the genome matches the known binding of the TF. While the assumption of independence is not valid in all cases, it appears to provide a good approximation in most cases of the affinity of a TF for a sequence [38]. Alternative models, such as those based on multiple regression, Bayesian networks, or hidden Markov models [39-41], do not assume nucleotide independence, but instead attempt to capture dependencies between nucleotides. The binding site profiles for a number of TFs, usually in the format of a frequency matrix, can be found in databases such as JASPAR [42], HoCoMoCo [43], or the commercial TRANSFAC [44].  The ultimate measure of success of a model is the specificity and sensitivity of the method.  The PWMs, perhaps the least complex model to generate and understand, have largely proven to have strong capacity to predict in vitro binding interactions (good sensitivity), but like all TFBS prediction models they have a poor capacity to distinguish sequences with in vivo function from those without function (poor specificity) [45]. This poor specificity presumably arises from a variety of factors. In vivo activity is more complex than in vitro, as the nucleus is crowded,   11 molecules are in constant exchange with each other, concentrations of regulatory molecules are in flux, there are cooperative and competitive interactions, accessibility to protein-occupied DNA varies, and experimental data tends to be tissue and environment specific.  Experimental data is incomplete and noisy, which constrains computational success.  On the computational side, there are a variety of constraints on predictive success. These constraints range from assumptions made due to lack of experimental data or to make a model more computationally efficient, or naively accounting for local sequence composition by assuming a nucleotide background distribution of 25% for each A,G,C,T nucleotide. Another constraint is the need to specify a heuristically chosen threshold on motif similarity scores; this threshold is intended to limit the increasing number of false positives at lower motif scores but generally does not have a biological basis and is often used at the cost of failing to detect lower scoring functional binding sites. Perhaps more pertinently, bioinformatics approaches to TFBS prediction are static measures of a dynamic process, predicting TF occupancy through a statistical count without accounting for such influences as protein concentration, effective genome size, interaction between proteins, or protein-DNA energy constraints.  Although, interestingly, there is some support for the energy of protein-DNA interactions being reflected in the information content of a TFs binding profile matrix [46, 47].   1.3.2 Advanced models of TFBS prediction Biologically grounded approaches have been developed to reduce the specificity issues of TFBSs prediction, using experimentally derived knowledge to constrain the computational search space to regions more likely to be cis-regulatory regions. One such fundamental approach is to assess a set of sequences believed bound by the same TF(s), such as the promoters of co-expressed genes, for over-representation of known TF motifs relative to a set of control sequences [48, 49]. The locations of motifs so identified would then be subject to further   12 experimental testing. Another successful method for detecting regulatory elements is the filtering of TFBSs by phylogenetic footprinting; this method reports TFBSs greater sequence conservation than the background rate in the genetic region analyzed.  The laboratory testing of such conserved regions indicates sequence conservation can be a powerful predictor of regulatory region locations [50]. However, the conservation of TFBSs across species is limited [51], thus this approach excludes those regions with evolutionarily recent changes. A third approach has been to leverage the combinatorial activity of TFs that results in binding sites occurring in clusters [52]; thus methods have been developed to search not for single binding sites but for clusters of multiple TFs binding sites (cis-regulatory modules) from which single TFBSs can later be analyzed. More recently, the computationally assisted studies of chromatin modifications have shown that combinations of epigenetic modifications highlight the regulatory potential of regions in a genome for a given cell-type under the given cellular conditions. For instance, higher levels of H3K4me3 (i.e. 3 methyl groups on lysine 4, of histone 3), and reduced levels of H3K4me1 at transcription start regions is indicative of active promoter regions, while the reverse configuration is indicative of enhancers, and raised levels of H3K27me3 are indicative of repressed genes [12, 13]. Combining epigenetic information with TFBS prediction has been shown to improve specificity and to improve agreement with ChIP-seq particularly when the epigenetic data is from the same tissue as the ChIP-seq data [53].  1.3.3 TFBS prediction through ChIP-seq data Now, with the compilation of large and diverse ChIP-seq data collections available for over 100 TFs, the computational search space for the binding sites of those particular TFs, under the given set of cellular conditions, has been significantly reduced. Reduction of the search space in turn has greatly increased the specificity of TFBS prediction.     13 There are however caveats to ChIP-seq data that impact computational prediction of TFBSs. First, the antibody (Ab) used to ChIP the target TF is often polyclonal, meaning that the Ab recognizes multiple epitopes and thus is likely to interact with more than the target protein. In addition, not all Abs available are high quality. Second, open chromatin, particularly that at promoters in highly expressed regions, is highly shearable, thus there is a tendency to enrich for sequences from these open chromatin regions across diverse ChIP-seq studies [22, 23]. Third, the computational analysis of a ChIP-seq experiment is dependent on the control data and the sequencing depth normalization between the control and ChIP-seq regions. While good progress has been made with regards to normalization this is an area still being developed, and peaks may be misidentified. The necessary depth of a control, and whether input DNA or mock-IP serves best as control, are still being investigated. Fourth, the resolution of the data, once peaks are called, tends to be many fold longer than the ~6-20bp length of a TFBS, thus combined with polyclonal Abs and shearing caveats there remains potential for specificity issues. Lastly, not specific to ChIP-seq, imposition of thresholds often results in data miscalled; thus some regions specifically bound by a TF may not be present in the final data as the collection of mapped reads at a region do not pass the threshold of enrichment, while non-specific regions with a high concentration of sequence reads due to experimental bias may be present. Therefore methods have been developed to further evaluate the peaks that derive from ChIP-seq experiments and peak calling software.   Computational post-ChIP-seq analysis methods exist to assess a ChIP-seq dataset or specific peaks for evidence of direct binding via the ChIPped TFs’ binding pattern [54-57]. In addition there are methods that evaluate enrichment of motifs for non-ChIPped TFs, attempting to infer indirect interactions [58, 59]. However, there appears to be a general view that ChIP-seq data, once called by established peak calling software, has been processed stringently enough that   14 sufficient enrichment of the expected motif in the dataset indicates that the majority of the ChIP-seq dataset is specific to the ChIPped TF, whether bound directly or indirectly. In some cases, even when the motif is infrequently present, the entire ChIP-seq data is used for downstream analyses [60]. Thus the subset of peaks lacking the TF’s canonical motif is treated as equivalent to the subset with motifs. Yet it is possible that some of these non-motif peaks may be derived by mechanisms not specific to the ChIPped TF, as is suggested by HOT regions (High Occupancy of Transcription-related factors) [61]. HOT regions are locations present as a peak across multiple ChIP-seq datasets (for general TFs, sequence specific TFs, and chromatin-related factors) within a given cell line, or less frequently within multiple cell lines. Other possibilities, already alluded to, for peaks without the expected TF’s motif are that of indirect binding of the ChIPped TF through another protein, or inadequate understanding of the TF’s binding specificity.  While it would seem to be of interest to investigate both populations of ChIP-seq peaks, those with and those without the ChIPped TFs motif, there is a dearth of attention given to the two sub-populations. Thus in the analysis of specific TF bound regulatory regions and TFBSs from ChIP-seq defined peak regions there is room for refinement, and such improvement will consequently inform and improve our utilization of ChIP-seq data across a spectrum of analyses. The issue of direct binding in ChIP-seq data is explored in Chapters 2 and 3 of the thesis.  1.3.4 Biophysical models for predicting regulatory regions Despite the increasing availability of experimentally identified regulatory regions, computational methods for the genome-wide detection of TFBSs remain desirable and necessary as many TFs are not yet ‘ChIP-able’, nor do we have cell lines for all cell types. As mentioned above (section 1.3.2), advanced models of motif detection have attempted to reduce false positives through biologically inspired constraints on the genomic search space (e.g. motif over-representation,   15 filtering on phylogenetic conservation, and assessing for cis-regulatory modules). All of these models have earned a measure of success but improve the situation only marginally and may be subject to over-fitting; certainly the predictions are not in good agreement with experimental results such as those derived from ChIP-chip or ChIP-seq. Admittedly, regulatory regions derived from ChIP-based methods are tissue and condition specific, but this is likely not sufficient to explain the discrepancy between predictions and experimental data. Thus, it seems evident from the above that TF regulation is not isolated to primary sequence recognition, but that the engagement of a TF to a site and consequent control of gene expression is an involved process, which needs to be both better understood and represented within computational predictive frameworks. This has inspired the development of physically principled models grounded in work by Berg, von Hippel and Winter who with the E. coli lac repressor, and later with cyclic AMP receptor protein, highlighted how a TF might explore the genome to locate its binding site, and the thermodynamics of the TF::DNA interaction [62, 63]. As the biophysical principles of regulation are mainly working hypotheses, they will be reviewed in more detail than that of the previous sections.  Prior to a TF exerting influence on the transcription of a target gene, it must locate to the appropriate regulatory region, within a reasonable amount of time, and be bound at a suitable binding site in response to cellular demands; demands which may require a swift response. Based on TF copy number and the frequency of binding site motifs in the genome, it appears that for many TFs present in the nucleus there is a low ratio of proteins relative to the number of available specific binding sites and the effective genome size, therefore the governance of how a TF “searches” the nuclear and genomic space for its binding sites is widely assumed to be assisted rather than random 3D diffusion. Berg, von Hippel and Winter proposed that a TF finds its way to a binding site through a two component process called facilitated diffusion: 1) diffusion   16 through nuclear space (3D translocation), followed by 2) non-specific exploration of the local DNA (1D translocation). A graphical animation movie hosted at http://www.stromastudios.com/portfolio/tfs.html, presents the process of facilitated diffusion. The mechanism of 3D diffusion followed by non-specific 1D translocation on the DNA is proposed to produce a more efficient search of the DNA than simple diffusion, and is supported by the observation that DNA-binding proteins diffuse through the nucleus at a much slower rate than proteins with their DNA-binding domain removed [64]. Additional in vitro work by Gowers and Halford [65] with a restriction enzyme (BbvCI) at in vivo salt concentration, and in vivo work by Elf et al. [66] and Hammar et al. [67] on the lac repressor, have observed a process resembling facilitated diffusion. Thus, while some TFs may be dynamically compartmentalized in the nucleus with the genes they regulate [68-70], providing an effective increase in local TF concentration, there is also evidence for movement of proteins both across the nucleus and non-specifically along the DNA.  Molecular crowding (i.e. the physical congestion of many molecules) in the nuclear space promotes encounters between macromolecules, thus encouraging relatively short translocations through 3D space [71].  If during the 1D translocation, the TF encounters a sequence of nucleotides whose bases are a sympathetic match to the TF’s binding domain, the TF will transition from a non-specific interaction with the DNA phosphate backbone to a specific interaction with the bases of the DNA at that location. This specific interaction will likely result in a conformational shift in either the DNA and/or the protein to obtain the most favourable binding energy. Berg and von Hippel proposed that the free energy of the interaction can be estimated from the specificity of the sequence to the TF’s DNA binding domain [72, 73]. Stormo and Fields [46] later related the energy of binding to the information content of a binding matrix, and the   17 relationship has been further supported by microfluidic experiments with the MAX and PHO4 TFs [47].  Combinatorial interaction (homogeneous or heterogeneous) among proteins is a significant component of the regulatory controls governing gene transcription. In complement to the facilitated search method, it has been proposed that the presence of weak binding sites for a TF in a region may act to stabilize a TFs interaction within the region. The presence of weak sites might cause a TF to occupy the region for a longer time due to the presence of multiple sequences to directly interact with or to increase the chance that the TF encounter the region’s cognate binding site. Weak sites might also provide opportunity for two TFs to be present and energetically stabilize their interaction in a region by a conformational change in either the DNA or the proteins. For instance, Gertz et al. [74] have demonstrated for the TF Mig-1, that equivalent regulation of gene expression can be mediated by a pair of TFBSs whether they are both strong or a mixture of weak and strong. Rowan et al. [75] demonstrated that two weak PREP1 binding sites are necessary for Pax6 regulation, and that regulation is disrupted by substituting in strong binding sites. Zhang et al. [76] have shown that some TFBSs are flanked by conserved weak copies.  Zhang et al. suggest that the weak sites coordinate to make the cognate site more accessible, as depletion of the weak sites around a strong REST/NRSF target site resulted in partial repression of a target gene. This suggests that it is worth considering a TF’s tendency to occupy a region rather than evaluating a single binding site.  The biophysical principles outlined by Berg and von Hippel have motivated computational methods of TFBS detection and gene regulation, such as QPMEME [77], GOMER [78], MatrixREDUCE [79], and TRAP [80, 81]. These methods incorporated the proposed thermodynamic principles of TF::DNA interaction that address TF concentration and the   18 assumed equilibrium between free and bound TFs. The free energy of binding is estimated either through a Gibbs free energy conversion of a TF binding profile matrix, as outlined by Stormo and Fields [46], or from ChIP-chip or PBM signal. The interaction energies are used to evaluate the probability of TF occupancy of a region rather than that of a single binding site (GOMER, MatrixREDUCE, and TRAP); this effectively models the 1D translocation of a TF within a region of DNA, and while it was not the specific intent, it also addresses the proposed contribution of weak binding sites. One method also extended to incorporate interactions with other TFs (GOMER). While the formulations of these methods, particularly for QPMEME, may look imposing, they result in a performance minimal improvement over the common TFs PWMs. An extension of these biophysical models is the focus of Chapter 4.  1.4 Thesis research Significant developments have been made with regard to identifying regulatory regions, both experimentally and computationally, which advances us toward a day when we understand gene regulation sufficiently to engage in medical repair at the cellular level or regenerate complex tissues from stem cells. At present, much of the focus in the study of gene regulation is placed on TF ChIP-seq experiments, which reveal DNA regions bound by TFs. While much has been learnt from ChIP-seq data, there are significant challenges for the analysis and interpretation of the data for which we need new computational methods.  In Chapters 2 and 3, new methods are introduced to probe the complex TF ChIP-seq data and important observations emerge about the properties of the data that suggest two distinct populations of regions are being recovered.  In order to gain a fuller understanding of the mechanisms through which TFs access their target sites and probe the two sub-classes, a computational model is introduced that incorporates our existing understanding of the physical process in a flexible framework for exploring hypotheses about TF-mediated transcriptional regulation (Chapter 4).   19  Chapter 2 introduces a set of methods to improve the interpretation of ChIP-seq data, including the inference of mediating TFs based on TFBS motif over-representation analysis and the subsequent study of spatial distribution of TFBSs. The central focus of the ChIP-seq technology is to identify the genomic regions bound by the ChIPped TF. The first and most obvious form of identification for direct binding of the ChIPped TF within a dataset is the over-representation of the TF’s canonical motif in the ChIPped sequences relative to a background set of sequences. Secondary to that, we hope to identify over-representation of other TFs motifs to provide investigative leads to proteins that may interact and/or co-regulate the same genes as the ChIPped TF. For this purpose we extended the oPOSSUM software, developed by the Wasserman Lab, to handle sequence-based data such as ChIP-seq (oPOSSUM 3.0) [55]. We noted during this work, and from a survey of over-representation software for an oPOSSUM 3.0 protocols document (Worsley Hunt et al. unpublished), that first there is often a bias towards high GC TFs in the over-representation scores, second over-representation software generally does not advise a user on the selection of a suitable sequence-based background, and third there is no format to alert a user to bias in their over-representation results.  We therefore created a visualization method to alert a user to over-representation score bias. This was accompanied by an analysis of multiple background types, a recommendation for the background that most effectively reduces over-representation score bias, and development of a background generating tool called BiasAway.    Further to methods for the analysis of ChIP-seq data, we also developed a heuristic procedure, based on topological motif enrichment relative to the ChIP-seq peaks’ local maximums, to highlight peaks likely to be directly bound by a TF of interest. The results suggest that on average almost two-thirds of a ChIP-seq dataset’s peaks are bound by the ChIPped TF; the   20 origin of the remaining peaks is explored in Chapter 3. The heuristic method also identified a PWM-specific motif scoring threshold based on motif enrichment which will potentially be useful in the analysis of cis-regulatory variations that disrupt or create binding sites, a process we reviewed in [17]. Visualization methods are a powerful means for researchers to assess whether their data matches their expectations or statistical evaluations, and to reveal patterns that are worth further investigation. In complement to the heuristic method, a visualization method was created to present both the enrichment of a TF’s top scoring motif in each ChIP-seq peak and the enrichment of the motif scores. Additional visualization methods were generated to allow for the study of both inter-TFBS spatial relationships and motif-flanking sequence properties. While we generally applied these methods to the ChIPped TF, they are equally well suited to studying the enrichment of non-ChIPped TFs. We demonstrate the benefits of each visualization method with either known or novel biological relationships. The approaches outlined in Chapter 2 [82] offer researchers the means to improve the interpretation of their ChIP-seq data and will provide researchers with new leads into the investigation of TF binding and gene regulation.   With the identification of two populations of ChIP-seq peaks by the heuristic method in Chapter 2, those peaks likely to be directly bound and those whose source is undetermined, we asked if there was evidence for why the latter population was ChIPped. From analyses across hundreds of TF ChIP-seq data sets we found a small set of TF binding profiles for which the predicted TF motifs were repeatedly observed to be significantly over-represented not only in the datasets but at the peak local maximums of those datasets. When the related binding profiles of these TFs were grouped, the enriched profiles included: CTCF-like, ETS-like, JUN-like and THAP11 profiles. We termed these frequently enriched profiles as “zingers” to highlight their unanticipated enrichment in datasets for which they were not the targeted TF, and their potential impact on the interpretation and analysis of TF ChIP-seq data. Peaks with zinger motifs and   21 lacking the ChIPped TF’s motif were observed to compose up to 45% of a ChIP-seq dataset (mean 12%; median 9%). There was substantial overlap of the regions containing zinger motifs between diverse TF datasets, and across multiple cell lines, suggesting a mechanism that is not TF-specific for the recovery of these regions. We found that these recurring regions tend to be significantly proximal to cohesin binding sites, and we proposed a ‘loading zone’ model to suggest a mechanism for the recovery of these regions. Chapter 3 was published at Genome Biology [83].  The computational prediction of TFBSs has traditionally focused on the use of PWMs, as used in Chapters 2 and 3, although alternative sequence-based models have been developed. As evaluation of only the primary sequence for to identify TFBSs is given to a high false positive rate, advanced models have incorporated biological layers of information into the prediction of TFBSs. Chapter 4 presents work on a TF occupancy simulator that combines the dynamics of time with 1) the process through which individual TFs search the genome for binding sites, 2) the chromatin state of the sequence being searched, and 3) cooperative/competitive interplay between TFs. Parameters for TF searches and interactions have been selected from the literature where possible. Using the regions outlined in Chapter 3, we simulated on 10kb sequences containing either a region from a ChIP-seq dataset that is recurrently ChIPped by more than 4 unique TFs, or a region that is not recurrently ChIPped. We found that despite 10kb being ~60-fold the size of the recurrent or non-recurrent region (~160bp) that there were identifiable differences in properties between the two populations. Thus the TF occupancy simulator supports the conclusion of Chapter 3 that regions recurrently ChIPped across multiple datasets have different properties than do non-recurrent regions. The simulator concept is well suited for overlaying diverse experimental data, and for exploring hypotheses about the binding of proteins and regulatory regions. To enable incorporation of additional biological information   22 that will improve the modeling of a TFs environment, such as chromatin looping or local DNA configuration e.g. major/minor groove, the simulator code is modular to permit easy future expansion of the program. This work is in preparation.  Together, the three components of this thesis address the discovery of TFBSs through both experimental data and computational modeling.  We focus on the interpretation of ChIP-seq data, providing insights into the sequence composition of ChIP-seq datasets, and proposing specific mechanisms to account for the classes of peaks that we detect. We develop a computational model that employs experimental data and a mechanism based on current hypotheses for TFBS prediction and to provide insights into our understanding of how TF-based regulation works.    23 Chapter 2: Improving analysis of transcription factor binding sites within ChIP-seq data based on topological motif enrichment  2.1  Background Delineating the specific cis-regulatory elements governing gene transcription has been at the forefront of genome research. The landmark release of the ENCODE project findings supports the long-standing view that a greater portion of the human genome contributes to the regulation of gene activity than the ~2% encoding proteins. High-throughput chromatin immunoprecipitation (ChIP) analysis of protein-DNA interactions has been transformative, highlighting the genomic regions that contain regulatory elements, and thus reducing the computational search space for transcription factor binding sites (TFBSs) many fold. The coupling of ChIP to high-throughput sequencing (ChIP-seq) is empowering researchers to investigate DNA binding transcription factors (TFs) and their regulation of the genome. We present here a set of convenient, practical visualization and bioinformatics approaches, which facilitate interpretation of ChIP-seq TF binding data.  The bioinformatics foundations of generating ChIP-seq data have been well explored. Given mapped DNA sequence reads from a ChIP experiment, “peak calling” software is applied to quantitatively delineate regions within which a greater frequency of mapped reads are observed than expected. The peak regions are reported as a pair of coordinates, ranging from 1 bp to >5000 bp wide, often accompanied by a score and the position at which the read frequency local maximum is observed. Within the peak regions, the coordinates of specific TF-DNA interactions can be computationally inferred, using a TFBS profile for the indicated TF. Such profiles, most commonly in the form of Position Frequency Matrices (PFMs), are available for a subset of TFs from databases like JASPAR [42] or HoCoMoCo [43], or can be computationally   24 identified from the ChIP-seq data using de novo motif discovery tools [33-36]. Analysis with a PFM converted to a weighted TFBS profile (Position Weight Matrix – PWM) yields a score that reflects the similarity of the sequence of interest to the modeled binding sites. Although ChIP-seq data reduces the acknowledged specificity problem of detecting short (6-15 bp), degenerate motifs bound by a TF in the genome, the problem of TFBS prediction is not perfectly resolved as the ChIP-seq peaks are often 20-fold or greater in length than the TFBS being searched for. As they become more widely used, higher resolution methods, such as ChIP-exo [28], are expected to reduce the difficulty.  A proportion of ChIP-seq peaks may not contain a canonical TFBS for the ChIPped TF above background expectation; a confounding property of the data that presumably arises from a combination of biological, experimental, and computational influences. While these regions may result from indirect interactions between the TF and the DNA, the multi-epitope specificity of polyclonal antibodies and the tendency for chromatin to shear at promoter regions [22, 23] may give rise to peaks not specific to the ChIPped TF. The subset of peaks lacking the TF’s canonical motif is commonly treated as equivalent to the subset with motifs. The segregation of a ChIP-seq dataset into the two classes could lead to insights into individual TFs mechanisms of regulation and reveal common properties of regions lacking TFBS. The analysis of specific TF bound regulatory regions and TFBSs from ChIP-seq defined peak regions can be refined, and such improvement will consequently inform and improve our utilization of ChIP-seq data across a spectrum of analyses.  In this report, we introduce a set of visualization methods and bioinformatics approaches to improve the study of TFBSs within ChIP-seq regions, and demonstrate the application of these methods for the generation of new insights into regulatory sequences. We focus on three key   25 challenges: known motif over-representation analysis, spatial visualization of TFBS positions, and determination of parameters for TFBS analysis. For over-representation analysis, we introduce the BiasAway tool to account for the non-random properties of regulatory sequences; such accounting has strongly informed the design of de novo motif discovery methods, but has been inadequately addressed for over-representation studies. We introduce a set of visualization approaches that reveal topological patterns of motif positions within ChIP-seq data, helping to delineate the subset of peaks likely to be directly bound by the ChIPped TF. The visualization methods directly inform the selection of parameters for motif prediction, a long-standing challenge in regulatory sequence analysis. Application of the procedure reveals that on average 61% of ChIP-seq peak regions contain the canonical motif for the ChIPped TF. The methods are applied to two cases related to TBP and ZNF143/THAP11. Access to the new methods and visualization approaches will provide the research community with improved capacity to analyze and interpret TF ChIP-seq data.  2.2 Results 2.2.1  Composition studies reveal the influence of non-random properties of the metazoan genome on the interpretation of ChIP-seq data The non-random properties of genome nucleotide sequence composition have been the subject of extensive investigation over decades. Key to the study of regulatory sequences, are observations that promoters and other open chromatin tend to have higher GC content, and that the observed composition has a wide distribution. We evaluated the relationship between mononucleotide composition and TF ChIP-seq data (see Appendix B.1). We found that each TF exhibits a range of nucleotide composition environments in which it binds. The GC content distributions differ between TFs profiled in the same cells and between different cell types profiled for the same TF (Appendix D, Figure D.1). There is a clear relationship between the GC   26 content of the peak regions and the multiplicity of TFBSs (Appendix B.1 and Appendix D, Figure D.2).  2.2.2 Consequences of biased composition for TFBS motif over-representation analyses It is a central practice in the analysis of regulatory regions to evaluate the frequency of predicted TFBSs in the regions of interest. This type of analysis depends on comparing predicted motif frequencies against a reference background. Such analysis of TFBS over-representation could be influenced by the compositional properties of the peaks and the background sequences. While accounting for background composition properties has been explored for de novo pattern discovery, the best approaches for and impact of composition correction on over-representation analysis of known motifs have not been resolved.  2.2.2.1 Visualizing composition bias in TFBS over-representation analyses To assess composition corrective measures for TFBS over-representation analysis, we generated a visualization method, which we term Composition-bias plots (CB-plots). The CB-plots are a method for alerting a researcher to the impact of composition on the reported TFBS over-representation results, and are generally more informative than a ranked list of TFs and motif over-representation scores. A CB-plot presents the average GC content of a TF’s binding profile (the PFM) on the x-axis and the over-representation score of the TF’s predicted binding sites on the y-axis. If an unsuitable background has been used in the over-representation analysis, the CB-plot distribution of motif over-representation scores will reveal a bias toward whichever end of the nucleotide composition spectrum (GC-rich or AT-rich) that the target sequences dominate relative to the background. Figure 2.1 illustrates the CB-plots and provides examples of over-representation results from an E2F1 ChIP-seq dataset. Figure 2.1A presents   27   Figure 2.1 Composition bias plots for bias identification. Composition-bias plots reveal systematic TF PFM nucleotide content bias in motif over-representation analysis. Foreground data was obtained from an E2F1 ChIP-seq study and processed using the oPOSSUM TFBS over-representation analysis software. Each plot presents a motif over-representation score (y-axis) relative to the GC content of the PFMs (x-axis). The over-representation scores reflect the difference between the frequency of motifs in the foreground compared to a background (the background differs between panels). The names of the 5 top ranked PWMs are displayed on the plot. The dotted line at over-representation score 100 is an arbitrarily placed visual point of reference. The sequence logos represent the binding models for E2F1 and E2F4 respectively. (A) Background composed of randomly selected mappable genome sequences. (B) Background generated, using BiasAway, with a GC composition matched to the ChIP-seq sequences and drawn from the set of mappable sequences.     28 results that need to be corrected for over-representation score bias, as outlined in the next section, while Figure 2.1B presents the end results after correction.  2.2.2.2 The impact of background composition selection on TFBS over-representation results For many datasets the calculation of meaningful over-representation scores requires correction of bias. We assessed approaches to create and/or retrieve a background matched to a set of target sequences’ compositional properties in order to resolve the enrichment scoring bias engendered by compositional extremes of some ChIP-seq datasets. For evaluation purposes we retained the naïve background model of random genomic sequences in our assessment. A second background was generated by a 3rd order Markov model (RSAT package [84]). We implemented a background sequence generator, BiasAway, to generate 6 additional background models; these backgrounds derived from mono- and di- nucleotide shuffled sequences, and genomic sequences matched to the GC content of the target data (see Appendix B.2 and Appendix C for details). We included one last set of backgrounds generated by “known motif” over-representation analysis feature of HOMER 2 [56]. HOMER 2 is the only software of which we are aware that uses GC composition matched backgrounds for TFBS over-representation analysis. These backgrounds were then evaluated against 43 human TF ChIP-seq datasets, with 166 PWMs (JASPAR 4.0_alpha development database [42]) via the oPOSSUM 3.0 TFBS over-representation software [55]. Four backgrounds were re-evaluated for platform-independence of both bias and bias correction with the ASAP tool (see Appendix B.2 and Appendix D, Figure D.3).  We established several measures, based on the ideal expectation of four properties pertaining to the results of TFBS over-representation analyses. In essence, we expect to see few outlying   29 over-representation scores for TF binding models, one of which should be for the ChIPped TF’s binding model, with the majority of binding models scoring close to 0. We summarize the results for each of the alternative background models using these measures in Figure 2.2 and Appendix D, Figure D.4.  We summarize the background evaluations on 201 bp sequences here, with further details available in Appendix B.2 for both 201 bp and 401 bp sequences. Appendix E, Table E.1, lists the rank of the ChIPped TF for each analysis, and Appendix D, Figure D.5, provides the CB-plots of E2F1 for six background analyses. A naïve background of randomly selected sequences resulted in a systematic bias, or skew, of over-representation scores towards GC-rich binding sites for some datasets. The mononucleotide backgrounds were not as favourable as the dinucleotide shuffled backgrounds, with or without the sliding window. While neither suffered from bias towards GC-rich binding sites, the dinucleotide shuffled backgrounds produced results with over-representation scores closer to zero and predicted the ChIPped TF’s motif in the top 5 results of more than 60% of the datasets. The 3rd order Markov model performed as well as the dinucleotide background for all measures except that of capturing the ChIPped TF’s motif in the top 5 results. The best performing background was genomic sequences selected to match the GC nucleotide distribution of the target sequences. The BiasAway GC nucleotide background and HOMER generated backgrounds performed comparably for most measures, with the exception that the BiasAway background resulted in an 11 percentage point improvement for the inclusion of the ChIPped TF binding model as one of the top 5 over-represented.   30  Figure 2.2 Over-representation bias from multiple background models. Background sequence selection impacts motif over-representation analyses. (A) For each background, the fraction of the 43 analyses that reported the ChIPped TF in the top 5 enriched PWMs from a particular background (x-axis) is plotted against the average skew of the over-representation results for each background’s 43 analyses. Skew is the negative slope of the line fitted to the over-representation scores versus PFM GC content (i.e. values as displayed in Figure 2.1). The ideal is to have a large x-axis value (vertical dashed line) and an average skew of zero (horizontal dashed line). (B) and (C) summarize the standard deviation (y-axis) and mean (x-axis) of the ‘non-outlier’ oPOSSUM over-representation scores for 10 backgrounds against each of 43 ChIP-seq datasets, where panel (B) displays the average value for each background across the 43 datasets and panel (C) displays the individual value of 430 analyses. The ideal result would be situated at the origin (the intersection of the dashed lines). For all panels each of the 10 backgrounds tested is denoted as a single colour: Light green circle – randomly chosen background from the dataset of mappable sequences, dark green cross – randomly chosen background from the dataset of DNase accessible sequences, orange circle – mononucleotide shuffled background, brown cross – mononucleotide shuffled background within a sliding window, black circle – dinucleotide shuffled background, gray cross – dinucleotide shuffled background within a sliding window, magenta triangle – 3rd order Markov model generated background sequences, blue circle – background selected from the mappable sequences dataset to match the GC composition of the target sequences, light blue cross – background selected from the mappable sequences dataset to match the distribution of GC   31 composition in sliding windows of the target sequences, and red triangle – GC background from HOMER 2.   The CB-plots have been incorporated into the oPOSSUM over-representation analysis software, and BiasAway is available as open-source code posted on GitHub: https://github.com/wassermanlab/BiasAway/archive/noRPY.zip, and is being incorporated into the oPOSSUM 3.0 web interface.  2.2.3 Assessing TFBS predictions within ChIP-seq regions Subsequent to motif over-representation analysis, attention turns to investigating the candidate TFBS binding profiles. We outline here a complementary set of enrichment assessment methods specifically oriented to providing such insights. We implement a heuristic algorithm for direct binding (HADB) to determine the subset of ChIP-seq peaks with high confidence binding sites, based on TFBS enrichment proximal to the peakMax, and to derive a PWM-specific scoring threshold. As with the previous section, visualization is a key method for discerning the properties of the data. Within the R environment [85] we implemented a set of plotting methods (TFBS-landscape view, TFBS-bi-motif view, and Dinucleotide-environment view) for displaying the properties of the ChIP-seq regions relative to the locations of predicted TFBS. The code and user-guide for the visualization methods and calculating the HADB thresholds are posted on GitHub: https://github.com/wassermanlab/TFBS_Visualization/archive/master.zip.  2.2.3.1 TFBS-landscape view The TFBS-landscape view for ChIP-seq data facilitates qualitative assessment of the non-random relationship between predicted motif scores (for the top scoring motif in each sequence) and the peakMax position. The information in the TFBS-landscape view is translatable into a   32 quantitative assessment of TFBSs, as presented by the HADB algorithm below. Motif proximity to the peakMax has been demonstrated on 14 ChIP-seq datasets by [57] and for 3 datasets across 11 peak calling algorithms [86]. The global tendency for ChIP-seq data to yield motifs for the ChIPped TF proximal to the peakMax is systematically confirmed here with a study of ~340 ChIP-seq datasets for ~95 TFs (human and mouse).  In a TFBS-landscape view, the left plot displays the distance of the maximum scoring TFBS to the peakMax for each sequence on the x-axis (the peakMax is x = 0), and the PWM predicted TFBS score on the y-axis (Figure 2.3). This plot conveys both the quality of the motifs observed, and allows users to detect enrichment in a threshold-independent manner. The right plot presents the density of top-scoring predicted motifs (y-axis) versus the distance from the peakMax (x-axis), similar to plots presented by CentriMo (MEME suite [57]) and RSAT::peak-motifs tools [36]. The right side plots include two lines: a 2 bp resolution density of all peaks’ motifs, and a 5 bp resolution density of the subset of peaks containing a motif with score equal to or greater than 85 to capture cases with low enrichment of strong scoring motifs. As shown in Figure 2.3A for the TF C/EBPB, a ChIP-seq experiment can show a strong enrichment for the ChIPped TFs binding motifs. Plots representing the wide variety of enrichment characteristics are displayed in Figures 2.3A–L. The unique shapes displayed in Figure 2.3 derive from a combination of properties for the presented TF’s binding motifs such as width of the binding site, and the presence of other motifs enriched at the peakMax (see Appendix B.3 for further detail and discussion).   33  Figure 2.3 TFBS-landscape views. TFBS-landscape views inform of motif enrichment and are diverse in shape and density. A TFBS-landscape view consists of a plot (left) showing the location of the top scoring motif relative to the peakMax (x=0) on the x-axis and the motif score on the y-axis; the right plot presents a 2bp resolution density of motif distances to the peakMax (black) and 5bp resolution for motifs with a motif score equal to or greater than 85 (green). All plots display some degree of   34 enrichment at the peakMax and a lower limit on motif score enrichment. (A) C/EBPB motifs in a C/EBPB ChIP-seq dataset. (B) C-MYC motifs are enriched at the peakMax of the C-MYC ChIP-seq dataset, but many top-scoring motifs are randomly dispersed. (C) NFYA motifs in a NFYA ChIP-seq dataset exhibit enrichment around the peakMax, with high scoring motifs distinct from the majority of scores in the background regions (D) ZNF143 motifs in a ZNF143 ChIP-seq dataset have low enrichment but some high scoring motifs are distinct from the background. (E) and (F) present JUN motif enrichment in two JUN datasets from different cell types with distinct background motif densities. (G) The REST motif is strongly enriched in a REST ChIP-seq dataset across a large motif score range at the peakMax with a low density of background motifs. (H) MYOD motif enrichment at the peakMax of MYOD ChIP-seq data. (I) HNF4A motif enriched proximal to the peakMax of HNF4A ChIP-seq data. (J), (K) and (L) present motif enrichment for a TF that was not the ChIPped target: (J) CTCF motifs are slightly enriched offset from the peakMax of H3k4me3 ChIP-seq. (K) CTCF motifs are enriched offset from the peakMax in RAD21::cohesin ChIP-seq data. (L) ELK4 motifs in a NELFE ChIP-seq dataset show an enrichment offset from the peakMax.   2.2.3.2 Using the HADB algorithm to distinguish direct binding and for the selection of PWM motif score thresholds The TFBS-landscape plots present a finite zone of non-random TFBS enrichment around the peakMax for the ChIPped TF. Quantitative analysis of the motif densities for a given TF can delineate the width of sequence proximal to the peakMax that is enriched for the TF’s motif above chance expectation. The enrichment zone provides a region of confidence for predicted TFBSs and thus provides a rational focus for downstream analyses. As shown in Figure 2.4A-B, the HADB procedure (see Chapter 2 Methods) determines the distance from the peakMax to the point where the frequency of the top scoring motif approaches the distal flank motif frequency (distal = 250-500 bp from the peakMax). Based on the pool of human ChIP-seq datasets and focusing on the top ranked motif of each sequence, the width of enrichment around the peakMax ranges from ~50-350 bp with an average of 209 ± 34 bp or 194 ± 10 bp for human and    35  Figure 2.4 Defining the TFBS zone of enrichment around the peakMax.  (A) A visual depiction of the enrichment zone determined with a heuristic procedure, as described in methods. The x-axis presents the upper limit of each 5 bp bin, and the bins are the absolute distance of a motif from the peakMax. The y-axis shows the proportion of peaks from the dataset with a motif in a 5 bp bin. The horizontal green line is fit to the distal background bins, and the horizontal line in blue is the allowance line (see Chapter 2 Methods). The blue vertical dashed line indicates where the proportion of peaks in a non-background bin approaches the allowance line without falling below it – this line is the heuristic distance threshold for motif enrichment around the peakMax. (B) The NFYB sequence logo and the TFBS-landscape view for NFYB motifs in NFYB ChIP-seq data. The heuristic enrichment zone is between the blue vertical dashed lines. The black vertical lines indicate the beginning of the distal background region (spanning 200-500 bp from the peakMax). (C) The width of the motif   36 enrichment zone (y-axis) for human ChIP-seq datasets (x-axis); multiple datasets for a TF were averaged to obtain one enrichment zone value per TF. Vertical bars are the average differences between all of the enrichment zone widths for a TF. The red horizontal line is the mean. (D) The proportion of peaks within the enrichment zone for a TF’s set of ChIP-seq datasets were averaged. The x-axis provides, for each of 85 TFs, the mean proportion of peaks with a motif scoring above the motif score threshold and located within the zone of enrichment (mean 0.60, median 0.61).   mouse respectively (mean width ± mean of width differences; Figure 2.4C and Appendix D, Figure D.6A). Assessing enrichment with the inclusion of lower ranked motifs did not change the width of the observed enrichment zone.  Deciding upon an appropriate minimum scoring threshold for PWM-based analysis is an ongoing challenge, as each TF has unique characteristics. Quantitative analysis of the peak TF motif densities relative to chance expectation can delineate a suitable threshold for motif scores for a given PWM. We set the motif score threshold as the lowest motif score for which the count of motifs within the heuristic enrichment zone consistently exceeded the count of motif scores in comparably sized background regions (see Chapter 2 Methods; Appendix D, Figure D.7A-7B). The motif score threshold does not vary greatly for an individual PWM across multiple datasets. The average relative score for each of the human and mouse datasets was 82 ± 1 (mean score ± mean difference between scores; Appendix D, Figure D.7C-7D). A motif score threshold specific to each TF PWM is important for downstream analyses such as the study of TFBS altering mutations. The PWM score thresholds for the studied TF binding profiles are provided (see Appendix F).    37 We used the heuristic boundaries of enrichment and motif score threshold for each dataset to estimate the proportion of peaks that contain at least one TFBS for the ChIPped TF within the bounds of the thresholds. We found that on average, for the datasets studied, ~61% of a ChIP-seq dataset contains the ChIPped TF’s canonical motif that is both within the peakMax enrichment zone and greater than the motif score enrichment threshold (Figure 2.4D – human mean 61%, and Appendix D, Figure D.6B – mouse mean 65%). The source of the remaining ~40% of peaks may be from such factors as indirect or non-specific binding, shearing properties of open chromatin, peak calling errors, or antibody properties.  2.2.3.3 Regions predicted by HADB to directly bind the ChIPped TF are more likely to replicate between experiments When available, replicate experiments are a useful validation of regions ChIPped by a TF. While many experiments lack replicates, ~100 ENCODE datasets included replicates which we used to evaluate whether the HADB method was differentiating between peaks that co-occur between replicates (peakMax within 500 bp) versus those peaks that occur in only one of the replicates (see Chapter 2 Methods). We found that the set of peaks with the ChIPped TFs motif central to the peakMax were significantly enriched for replicated peaks (92% of datasets produced a Fisher exact test one tailed p-value <0.001 (91% produced scores <1e-09))(see Appendix B.4 for similar results with different peakMax separation parameter settings)).  2.2.3.4 Regions predicted by HADB to directly bind the ChIPped TF are enriched for gene ontology terms related to the TF’s key biological process To assess the functional enrichment of peaks defined by the HADB method, we performed gene ontology (GO) enrichment analyses, using the GREAT software [87]. As GO enrichment analysis for TFBSs is somewhat problematic due to the diversity of processes a TF may   38 regulate and the proximity of TFBS to the genes they regulate, we selected two TFs known to be highly specific for a biological process: SRF, a master regulator of the actin cytoskeleton and contractile processes [88], and NFE2L2, a key regulator of oxidative stress response [89]. We submitted peaks from the whole ChIP-seq dataset, peaks from the subset of regions inferred by HABD to be directly bound by the ChIPped TF, and those not inferred to be directly bound. GREAT analyses reported an enrichment of terms for the expected biological processes for SRF and NFE2L2 only in the subset of peaks with inferred direct binding of the ChIPped TF (see Appendix B.5, and Appendix D, Figures D.8 and D.9). The results highlight how the use of the HADB method can enhance the interpretation of ChIP-seq data.  2.2.4 Complementary visualization methods for spatial ChIP-seq patterns: TFBS bi-motif view and Dinucleotide-environment view We present two visualization methods that further empower investigation of spatial patterns in ChIP-seq data that build upon the foundation of peakMax-proximal enrichment defined by HADB.  2.2.4.1 TFBS-bi-motif view The prior TFBS-Landscape view focused on the ChIPped TF in isolation. A substantial body of literature focuses on cooperativity between TFBS, both in homotypic and heterotypic forms [90-92]. In developing the HADB analyses we had observed that in some cases a second motif for the ChIPped TF was enriched proximal to the peakMax, which motivated the creation of a TFBS-bi-motif view (examples are presented in Figure 2.5). The view allows qualitative assessment of both spatial relationships between the two motifs and their distance from the peakMax. If the first motif is enriched around the peakMax there is a horizontal band of enrichment along y = 0. If the 2nd motif is enriched around the peakMax there is a negatively    39  Figure 2.5 TFBS-bi-motif view for visualization of motif spatial arrangements.  The left plot of a TFBS-bi-motif view presents the distance of the primary motif to the peakMax of each sequence on the y-axis, and the distance of a second motif relative to the primary motif   40 on the x-axis. A band of enrichment at y = 0 indicates enrichment near the peakMax for the primary motif, while a diagonal band of enrichment (with a negative slope) indicates enrichment near the peakMax for the second motif. The diagonal limits of the plot arise from the uniform length of the sequences (here 1001 bp). The right plot is a histogram of the distances between the two motifs. The gap in both plots results from the exclusion of overlapping motifs. (A) ESRRB motifs in an ESRRB ChIP-seq data set. The top-scoring ESRRB motif is the primary motif, and the second-best motif is the second motif. (B) NFYB motifs in a NFYB ChIP-seq data set. The top-scoring NFYB motif is the primary motif, and the second-best motif is the second motif. (C) SRF ChIP-seq dataset. The SRF top-scoring motif is the primary motif, and the ELK4 top-scoring motif is the second motif.   sloped band of enrichment on a diagonal. Where the origin displays increased enrichment relative to the horizontal band (y = 0), there is an enrichment of both motifs at the peakMax. The diagonal limits of the plot arise from the uniform length of the peak regions analyzed. The right plot in the TFBS-bi-motif view presents a histogram of the distances between the two motifs. The proximity of two motifs can indicate potential interactions or relationships.  Three examples of bi-motif views are presented in Figure 2.5. Homotypic clustering is observed for ESRRB and NFYB. The NFYB clustering is consistent with published findings for the NF-Y complex [93], while the ESRRB observation suggests that properties observed for ESRRG may extend to other members of the TF family [94]. A third plot for SRF and ELK4 (also called ‘SRF accessory protein’), using SRF ChIP-seq data, presents heterotypic binding site results, consistent with known interactions between these TFs [95].      41 2.2.4.2 Dinucleotide-environment view Given the position of the motif of interest within a region, it can be informative to visualize the nucleotide composition properties in the flanking sequence. Dinucleotide sequence properties flanking TFBSs can be evaluated by aligning sequences at the TFBS (or a feature near the TFBSs, such as transcription start sites) and scoring the dinucleotide frequencies in the adjacent flanks. As shown in the Dinucleotide-environment view for an IFN-γ induced STAT1 TF ChIP-seq experiment (Figure 2.6), there is increased nucleotide patterning in the sequences adjacent to the HADB-inferred directly bound motifs, compared to the peaks without peakMax-proximal motifs. Further analysis of the data reveals the presence of a common repeat sequence within a large subset of the peaks containing the highest scoring STAT1 motifs, consistent with literature which reports that STAT1 binding at MER41 repeat elements increases with IFN-y induction [96]. Further illustration of insights provided by the dinucleotide-environment view is provided in the case studies.    42  Figure 2.6 Dinucleotide-environment view. Dinucleotide-environment view plots the dinucleotide enrichment of the dataset around the motif of interest. The x-axis shows the dinucleotide offset from the centre of the motif, and the y-axis is the proportion of the dinucleotide in the ChIP-seq sequences. The STAT1 motif is the high frequency pattern in the centre of the plot. The sequence logo for STAT1 is above the high frequency pattern. (A) The subset of STAT1 ChIP-seq peaks containing a STAT1 motif in the enrichment zone with a motif score greater than or equal to 85. The magenta box highlights the   43 enrichment of dinucleotides in the flanking regions of STAT1 motifs proximal to the peakMax. (B) The subset of STAT1 peaks with a motif outside the enrichment zone and a motif score of 85 or greater. The magenta box highlights the lack of dinucleotide enrichment in the flanking regions of motifs found distal to the peakMax.   2.2.5 An applied case study of ChIP-seq analyses To demonstrate how over-representation analysis and the visualization methods are complementary, and enhance the analysis of ChIP-seq data, we applied the TFBS-landscape view and the Dinucleotide-environment view to two case studies. The first focuses on the TATA binding protein (TBP) in the MEL mouse cell line, while the second explores the sequence properties proximal to ZNF143 TFBSs.  Employing both a motif over-representation analysis and TFBS-landscape view, we noted that the TBP ChIP-seq experiment had a low occurrence of the canonical TBP TATA motif proximal to the peakMax (<12% of the dataset using the HADB method described above; Figure 2.7A). However, TFBS motif over-representation analysis also indicated a large number of secondary TFBSs are enriched. We investigated the reported enriched motifs using the TFBS-landscape view, and saw an unusual pattern emerge: spikes of lower scoring motif enrichment at distances from the peakMax specific to each tested TF (Appendix D, Figure D.10). As shown in Figure 2.7B, there are TF motif enrichment patterns up to 200 bp from the peakMax. A Dinucleotide-environment view, using the subset of peaks with a high scoring TATA motif, presented an enriched sequence pattern up to 100 bp either side of the TATA motif and an AA-rich sequence ~200 bp distant on the motif strand (Figure 2.7C). The pattern was subsequently identified to arise from enrichment of SINE elements (predominantly B2-B4). The percentage of peaks    44  Figure 2.7 Case study of TBP in the mouse MEL cell-line.  (A) TFBS-landscape view for TBP PWM on a TBP dataset. The top plot presents the top scoring motif distance to the peakMax (x-axis) and the motif relative score on the y-axis. The bottom plot presents a histogram of motif distances to the peakMax: black line – 2 bp resolution of the top scoring motif distance per peak; green line – 5 bp resolution of distances for the top scoring motifs with a score equal to or higher than 85. The sequence logo is for TBP. (B) TFBS-landscape view density plots for 15 PWM’s are overlaid on a single plot, for visualization purposes. The black line is the enrichment of the TBP motif, the coloured lines are NR2F1, MYC::MAX, CTCF, GABPA, TAL1::TCF3, FOSL2, FOXD3, NRF1, MEF2A, AP1, SPI1, ZNF143_b, E2F1, and NFYB motifs, as noted on the plot. (C) The Dinucleotide-environment view around the TBP motif. The x-axis is the location of the dinucleotide with respect to the TBP motif, and the y-axis is the fraction of sequences with the dinucleotide at a given position. The   45 coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The magenta box highlights the dinucleotide enrichment in the regions flanking the TBP motif.   containing SINE elements is 14.1% in the TBP peaks, while in GC content matched controls or DNase accessible regions it drops to 5.2% and 5.6% respectively. While SINE elements are known to contribute to the formation of promoter regions for genes [97], the 14.1% fraction of the TBP ChIP-seq sequences is striking.  The seven zinc finger ZNF143 TF has been reported to bind to an ~18-21 bp nucleotide sequence [98, 99]. The ZNF143 ChIP-seq dataset, like the TBP dataset, exhibits a low enrichment of the canonical ZNF143 motif around the peakMax (~5% of the dataset; Figure 2.3D). We extracted the subset of regions with a motif in the HADB enrichment zone, repeat-masked the sequences, and generated a Dinucleotide-environment view aligned on the top-scoring motif in each region (Figure 2.8A). This display reveals additional striking features not captured in the initial model, including a strong 5’ pattern outside the PWM-covered positions. This 5’ flank pattern is present in about 35% of the aligned sequences, and is not present in alignments of motifs outside of the HADB enrichment zone. A Dinucleotide-environment view of sequences with a high scoring (score >85) ZNF143 motif, reveals an increased enrichment of the 5’ flanking pattern from 35% to ~50% of the sequences (Figure 2.8B). A subsequent search of the literature revealed a recent study by Ngondo-Mbongo et. al. [100], showing that a 2nd protein, THAP11, binds to the extended flank seen in the Dinucleotide-environment view and half of the ZNF143 motif in a manner mutually exclusive to ZNF143.   46 Figure 2.8 Case study of ZNF143 DNA binding preferences.  The sequence logo presents the binding site characteristics of ZNF143. (A) Dinucleotide-environment view of ZNF143 ChIP-seq repeat-masked regions aligned on motifs with a motif score of 85 or greater. The x-axis is the nucleotide position and the y-axis is the frequency of the dinucleotide. The coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The vertical magenta lines frame the positions of the sequence logo. (B) Dinucleotide-environment view of ZNF143 canonical motifs with a motif score of >85. The x-axis is the nucleotide position and the y-axis is the frequency of the dinucleotide. The coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The orange horizontal line above the plot indicates the overlapping THAP11 binding profile.   2.3 Discussion We have introduced methods that allow researchers confronted with ChIP-seq data for TF binding to extract more insights into TFBS. The first challenge in such studies is often the identification of contributing TFs based on known motif over-representation analysis. Such methods can both support the quality of data based on the identification of the ChIPped TF’s   47 motif, as well as highlight additional TFs that may be cooperatively acting with the former. In this report we present two key components related to over-representation analysis. First, Composition-bias (CB) plots are introduced to display skew in the GC-content of the enriched motifs, a common occurrence that reflects the non-random composition properties of the ChIP-seq regions recovered. Second, the BiasAway software tool is introduced to generate composition-matched background sequence sets that correct for the skew. Once relevant TF binding profiles are identified, the challenge shifts to the identification of reliable TFBS within the broader ChIP-seq regions. Every ChIP-seq dataset is a mixture of directly bound segments and regions that may reflect alternative influences. We introduce TFBS-landscape plots as a convenient form for the visualization of the topological distribution of TFBS within ChIP-seq segments. Based on the topology, we present the HADB method for the selection of PWM thresholds for the selection of candidate TFBS consistent with direct binding events. The HADB approach provides a quantitative method for selecting PWM thresholds – an enduring challenge in regulatory sequence analysis. Systematic analysis over ~340 ChIP-seq datasets indicates that an average of 61% of peaks are classified as directly bound. The biochemical, experimental, and/or computational influences that account for the remaining 39% of regions remain to be resolved in future investigation. In the third stage of analysis, spatial properties relative to the defined TFBS and peakMax are analyzed. Bi-motif plots allow the study of inter-TFBS spacing concurrent with HADB assessment. To gain insight into additional properties at the edges of reliable TFBS, Di-nucleotide-environment views are introduced through which compositional enrichment in the flanking regions can be revealed. In combination, the software and methods allow for researchers to launch deeper explorations of TFBS within ChIP-seq data.  The work builds on a substantial foundation of bioinformatics approaches to regulatory sequence analysis. Over-representation analysis of known TF motifs complements de novo   48 motif discovery methods such as MEME. The later methods often incorporate better background representation to account for the non-random properties of sequences. The CB-plots highlight the problem for over-representation methods, providing a useful means for researchers to rapidly assess the quality of results. We introduced the BiasAway tool to correct for the bias. The HOMER 2 package includes an unpublished option for background correction of over-representation analysis, which does not perform as well as the GC composition matched option in BiasAway (HOMER was originally described in [56]). The open-access BiasAway provides a general purpose background generation capacity, which can be used as a component in other tools going forward. Both BiasAway and the CB plots are being incorporated into the online oPOSSUM-3 motif over-representation analysis tool [55].  The spatial analysis of TFBS positions within ChIP-seq data has been a central focus in the development of several bioinformatics methods. The MEME Suite includes the CentriMo software which can evaluate both known motif collections and de novo pattern discovery derived motif models based on the centrality of predicted TFBS within ChIP-seq peak regions [57]. The SpaMo component of the same package evaluates the statistical significance of spacing between predicted pairs of TFBS across ChIP-seq regions [59]. The GEM system performs de novo pattern discovery of spatially correlated motif pairs [101]. While these approaches touch on the same theme, they are not directly comparable to the work presented here. Our methods are complementary to the published work, allowing researchers to more fully explore the properties of ChIP-seq data.  We introduced the HADB procedure for the selection of appropriate PWM score thresholds for the prediction of TFBS within ChIP-seq data. The threshold parameter for TFBS calling with PWMs is treated in diverse ways in the research literature. In some cases thresholds are   49 individually selected for each matrix based on the determination of empirical p-values based on distributions of scores observed for a sequence pool [102]. This has been extended to generating empirical p-value thresholds for composition-matched pools of sequences [103]. In other cases, percentile-based scores are applied uniformly across diverse matrices [55]. Biophysical approaches based on the calculation of binding energies have also been introduced [104]. The HADB procedure provides an experimental basis for the selection of the threshold parameter for a TF that can be applied to the matrix in multiple data sets, as the results show little variation in the value for different ChIP-seq studies for the same TF.  The prevalence of directly bound regions in ChIP-seq data has received increasing attention. The ChIP-seq protocol is undeniably successful at retrieving TF bound regions, however, the data from these studies of sequence-specific TFs is invariably shaped by a mixture of biological, experimental and computational influences. From our perspective, TF ChIP-seq data is effectively bi-partite: a subset of ChIPped regions have a direct relationship to the ChIPped TF made evident by the presence of the TF’s binding motif near the peak local maximum, while the remaining subset of regions are lacking the motif. We found that on average, for the ~340 datasets studied, a third of peak regions lack canonical motifs proximal to the peakMax. Some of these peaks may result from indirect interactions (in which the ChIPped TF interacts with a different DNA bound protein) or result from a change to the ChIPped TF’s binding properties due to an interaction partner. Methods have been proposed to infer indirect binding by the enrichment of secondary TFs motifs, such as [58], but the presence of secondary motifs do not reliably confirm the presence of the ChIPped TF at the regions. We have observed that the motifs of some TFs are enriched at the peakMax for more than 10 other unique TFs ChIP-seq datasets across diverse cell lines [83], suggesting that some peaks in ChIP-seq datasets may result from a mechanism that is not specific to the ChIPped TF. Some additional potential   50 sources of non-direct ChIP-seq regions, unrelated to indirect binding, include chromatin structure properties, antibody properties, stochastic noise, or peak calling software. Two recent papers have identified highly transcribed regions as enriched in the non-direct subset of ChIP-seq data in yeast [23, 105]. We do not suggest that any peaks be dismissed, as all are potentially informative. However, we believe that it is appropriate to use the methods introduced in this report to segregate the direct-binding subset that can be explained by the ChIPped TFs motif within the limited range of confidence around the peakMax, allowing the two subsets to be individually analyzed.  We highlighted the investigative potential of the methods and visualization approaches in two case studies. First, a study of TBP (TATA binding protein) ChIP-seq data, where an enrichment of SINE elements and many secondary TFs’ motifs suggests that the adaptive use of repetitive sequences as promoters may be more frequent than widely thought. Second, a study of ZNF143 that revealed the ZNF143 canonical motif is part of a wider pattern than presented by the existing PWM, which has been shown by [100] to be the binding site for THAP11, a protein that works antagonistically with ZNF143.  This broad survey of ChIP-seq data from diverse studies provides greater clarity about the properties of the data that impact the quality of interpretation. Using the methods presented here, the applied researcher can visualize the distribution of predicted TFBSs and calculate thresholds for both the maximum motif enrichment distance from the peakMax, and the PWM scores, for classifying TFBS-containing peaks. In addition to ChIP-seq peak classification, the PWM scores thresholds set a lower limit on the range of motif scores expected for functional sites, which will be useful for downstream analyses such as binding site mutation analysis. These methods and approaches will improve TFBS enrichment analyses and the applied   51 analysis of ChIP-seq data, particularly for the annotation of reliable TFBSs within ChIP-seq peaks.  2.4 Methods 2.4.1 Datasets We downloaded ChIP-seq datasets from the GEO database: 1) GSE11431 - thirteen mouse ES cell datasets [106]; 2) GSE25532 – mouse NFYA data in ES cells [107]; 3) GSE17917 and GSE18292 – human KLF4, POU5F1, C-MYC, NANOG, and SOX2 data [108]; and 4) GSE22078 – human and mouse C/EBPA and HNF4A [51]. A dataset for mouse FOXA2 was downloaded from http://www.bcgsc.ca/data/histone-modification/histone-modification-data/ [109]. We also used ENCODE ChIP-seq datasets (human and mouse), and human ChIP-seq controls [29] downloaded from the UCSC ENCODE database [110]. The ENCODE broadPeak datasets frequently occurred in replicate; we selected the larger replicate for our analyses. Where only the mapped sequence data was available (5 datasets), we called peaks using FindPeaks 4.0 [26] with the following parameter options: −dist_type 1 200 -subpeaks 0.6 -trim 0.2 -duplicatefilter.  As the downloaded ChIP-seq peaks were reported with a multitude of lengths, ranging from 1 bp to >5,000 bp, we trimmed or extended all peaks to a constant length (201 bp, 401 bp, or 1001 bp) centered at the local peak maximum, or at the peak centre for datasets which do not have peakMax positions provided.  A control set of “mappable” regions was generated from the CRG Alignability (36mer) data [29] downloaded from the UCSC ENCODE database [110]. Unique, contiguous CRG Alignability 36mer regions were merged, and the resulting larger regions were then split into multiple non-  52 overlapping regions of length 201 bp or 401 bp. This yielded two datasets of mappable regions, to be used as a source of background sequences in later analyses.  The DNase accessible control datasets used in over-representation analyses were generated from DNase I hypersensitivity datasets (UW DNase: University of Washington) [29] downloaded from the UCSC ENCODE database [110]. To obtain a dataset that reflected a broad range of DNase accessibility, DNase I hypersensitive segments from multiple cell lines were concatenated, and contiguous or overlapping regions merged. The resultant sequences were split into one of the two lengths, 201 bp and 401 bp, to generate two large datasets of DNase I accessible sequences to be used as a source of background sequences in later analyses.  The Ensembl Perl API was employed to retrieve sequences from GRCh37/hg19 and NCBI37/mm9 assemblies. Where the genome coordinates first needed conversion to GRCh37/hg19 or NCBI37/mm9, we used a locally installed version of the UCSC lift-over tool [111].  Position frequency matrices (PFMs) were obtained from the JASPAR [42] development 4.0_alpha database of transcription factor models (April 2013). As the development set has been subsequently revised for a curated release, we provide the entire set of matrices as used in this chapter (see Appendix F). The PFMs were converted to position weight matrices (PWMs) using the TFBS Perl modules [112], with default background values for a uniform nucleotide background.    53 2.4.2 Motif prediction Motif prediction was performed with C code adapted from the TFBS Perl Modules [112], which scans sequences for TFBS instances and reports both the motif location and a PWM relative motif score. Where multiple motifs per sequence per PWM were predicted, the reported motifs were not permitted to overlap by more than one-fifth the PWM length (e.g. a 7 bp motif could only overlap a neighbouring motif by 1 bp).  2.4.3 Nucleotide composition The mononucleotide GC content of sequences was determined from a count of each nucleotide type in the sequence, based on single stranded DNA. The composition of the TF profiles was determined similarly; from a count per each nucleotide base in the position frequency matrix (PFM), and subsequently obtaining the ratio of GC nucleotides.  2.4.4 Motif over-representation analyses Motif over-representation analyses were performed on 201 bp and 401 bp sequences with a locally installed version of oPOSSUM 3.0 [55] and the online ASAP tool [113]. Within oPOSSUM we used the sequence-based analysis default settings, aside from supplying our own set of 166 PFMs. The oPOSSUM software converts the PFMs to PWMs using the default setting of the TFBS Perl modules [112]. Over-representation scores of “infinite” value were set to the greater of either 500 or to the maximum non-infinite enrichment score plus 100. For ASAP analyses, we chose parameter values similar to those used for the oPOSSUM analyses. The ASAP tool limited the number of nucleotides submitted for each analysis, therefore our input was limited to a randomly selected subset of 10,000 sequences on each submission.    54 The backgrounds used for the over-representation analyses were individually generated for each set of target sequences to be analyzed, and matched to the length of the target sequences. The backgrounds were derived from either the target sequences (e.g. dinucleotide shuffle methods, or a Markov model), a dataset of uniquely mappable sequences, or DNase accessibility data (see Datasets, above).  2.4.5 Naïve backgrounds For evaluation purposes we retained naïve backgrounds in our assessment, which were composed of randomly selected sequences from either the uniquely mappable portion of the human genome, or DNase accessible regions.  2.4.6 3rd order Markov model background A 3rd order Markov model background was generated using a combination of the oligo-analysis and random-seq programs from a local installation of the RSAT package [84]. The target sequences of interest were submitted to RSAT::oligo-analysis with the parameters: −l 4 -1str (word length 4 bp, and single strand), and the results file was in turn submitted to RSAT::random-seq with the parameter: −ol 4. The RSAT::random-seq program then used a 3rd order Markov model trained on 4 bp oligo-nucleotides to generate sequences that matched the GC content and length of the target sequences. While Markov models have been previously introduced to generate background sequences, the ideal order of the models is not clear. If the order is too high, the model will simply recapitulate the frequency of the TFBSs in the target dataset, which is why we restricted the model to 3rd order.    55 2.4.7 HOMER 2 GC background HOMER 2 [56] was downloaded from http://homer.salk.edu/homer/ and installed on a cluster. The findMotifsGenome.pl tool was used to perform a known motif over-representation analysis on each of the 43 datasets using the following options: −N number_of_seqs –nomotif –dumpFasta –size length_of_sequence. The number_of_sequences was the number of sequences in each dataset, and the length_of_sequence was either 201 bp or 401 bp, dependent on the analysis. The –dumpFasta option was set in order to retrieve the backgrounds after the analysis was finished running. We provided the hg19 human genome (from UCSC) and the coordinates of the peaks for each dataset.  2.4.8 BiasAway background generating tool Six background models were implemented as a background sequence generator: BiasAway. The BiasAway tool has been implemented in Python using BioPython modules [114], and is available as open source code from GitHub https://github.com/wassermanlab/BiasAway/archive/noRPY.zip. The tool provides a user with six approaches for generating a background useful to over-representation analyses: 1) mononucleotide shuffled sequences, 2) dinucleotide shuffled target sequence to preserve the dinucleotide composition of the target sequences, 3) genomic sequences matched to the mononucleotide GC content of each target sequence to preserve the non-random association of nucleotides, 4) sliding windows of mononucleotide shuffled sequence, 5) sliding windows of dinucleotide shuffled target sequence, and 6) genomic sequences matched in windows of internal mononucleotide GC content for each target sequence. The latter two backgrounds (BiasAway 5–6) are variants of the former two backgrounds (BiasAway 2–3), in which we utilized a sliding window over the ChIP-seq sequences to determine a distribution for local regions of composition. The background sequence set is then generated (dinucleotide shuffle)   56 or selected from a pool of genomic sequences (genomic composition match) to match the distribution of window compositions for each target sequence. These latter backgrounds were considered because due to evolutionary changes such as insertion of repetitive sequences, local rearrangements, or biochemical missteps, the target sequences may have sub-regions of distinct nucleotide composition. See Appendix C (Supplemental Methods) for greater detail.  2.4.9 Measures to evaluate over-representation analysis results To summarize the impact that the choice of background has on the reported over-representation results, we assessed the over-representation results by four measures: 1) the skew of the over-representation scores, 2) the count of datasets with the ChIPped TF’s binding profile reported in the top 5 over-representation results, 3) the mean of the over-representation scores (excluding outlying scores), and 4) the variability of the over-representation scores (excluding outlying scores). The first measure, the skew of the over-representation scores, is the negative of the slope of the line fitted to the over-representation scores (y-value) and the associated PFM GC content values (x-value; see Figure 2.2A for reference). The skew informs us of the degree to which a dataset is biased towards extreme composition motifs i.e. GC-rich or AT-rich; the more negative the skew, the greater the bias. The second measure calculates the proportion of datasets for which the over-representation analysis (using a given background type) reports the ChIPped TF’s motif in the top 5 results. The skew and the proportion of datasets are plotted together (Figure 2.2A). Background types that consistently yield a low skew value and return the ChIPped TF in the top results, are ideal. We used the third and fourth measures to assess the tendency for the over-representation analysis to report the majority of TF’s motifs as having a low over-representation score. Therefore we set a threshold of the mean plus one standard deviation (SD), to remove the highest over-representation scores from further analysis. We then obtained the mean and SD of the remaining scores. Ideally the results would display both a low   57 mean and a low SD, as this indicates that the majority of TF’s motifs are close to an enrichment score of 0.  2.4.10 Enrichment visualization plots and HADB boundaries of enrichment For the evaluation of ChIP-seq datasets with the HADB method, we restricted to those datasets for which we had a PWM for the ChIPped TF.  2.4.10.1 Composition-bias plots Composition-bias (CB) plots are generated to detect whether GC-rich or AT-rich PWMs are over-represented in a TFBS motif over-representation analysis. The GC content of the PFM’s submitted to the over-representation analysis are presented on the x-axis, and the over-representation score of the predicted TFBSs on the y-axis.  2.4.10.2 TFBS-landscape view TFBS-landscape views were generated to visualize central enrichment of a TF’s motifs within a ChIP-seq dataset. TFBS prediction was done on 1001 bp regions, to extend sequences sufficiently far into the flanks to present the background rate of TFBS prediction. The view’s left plot presents the motif score of the top scoring predicted motif in each peak for the given TF PWM on the y-axis, and the motif’s signed distance from the peakMax on the x-axis, where the peakMax is x = 0. The right plot presents two densities depicting: 1) the distances of the top scoring motifs to the peakMax for all peaks, and 2) the distances to the peakMax for the subset of peaks for which the top scoring motif was equal to or greater than 85 to capture cases with low enrichment of strong scoring motifs. The value of 85 was based on the default parameters of the oPOSSUM software, for which 85 was chosen to best discriminate known TFBS from background. The x-axis of the right plot is the distance of the motifs to the peakMax (x = 0), and   58 the y-axis is the probability density, which is a reflection of peak frequency for given distances. The default peak length was 1001 bp. For the TBP case study, we overlaid multiple smoothed densities of motif-to-peakMax distance enrichments on one plot (Figure 2.7B).  2.4.10.3 HADB motif enrichment threshold The data used for the TFBS-landscape plots (TFBSs and 1001 bp sequences), was also used by the HADB method. To calculate a heuristic enrichment threshold we identified the distance from the peakMax at which the density of the top scoring motifs (one motif per sequence) falls below the density of motifs in the background regions, such as we see in the TFBS-landscape plots (Figure 2.4B). The background is the region spanning 200–500 bp from the peakMax. We created 5 bp bins of the absolute distance from the peakMax, and counted the number of peaks with a top scoring motif within each bin. The count in each bin was converted to the proportion of peaks in the dataset and a regression line was fitted to the background bins, where x was the upper limit of the 5 bp bin, and y was the proportion of peaks in the bin. To adjust for the variation in the y-values, we created an ‘allowance line’ by shifting the regression line along the y-axis by 2-fold the 3rd largest residual value (we omitted the two most extreme residuals as they were generally outliers). We then selected the largest non-background bin (X) that had a y-value greater than the allowance line’s predicted y-value at bin X (Figure 2.4A). The heuristic threshold for motif enrichment was set at the greatest distance represented by bin X (e.g. bin X is 100bp-105bp, therefore the threshold is 105 bp).  2.4.10.4 HADB motif score lower threshold A similar procedure was used to determine a lower limit threshold for a PWM’s motif scores (Appendix D, Figure D.7). We generated bins of the top motif scores (one per peak) and compared the bins in the enrichment zone to bins in the background zone. Bins were populated   59 with the count of peaks corresponding to a bin’s motif score range. Each bin’s range was 1 score point e.g. for bin81, the range was 80 < X < =81. ‘Central bins’ were within the enrichment zone as defined using the heuristic enrichment thresholds described above. An equal number of ‘control bins’ were generated using background regions outside of the enrichment zone, of the same width as the central enrichment zone (e.g. if the central enrichment zone was 180 bp, then 180 bp of background motif scores were also sampled and binned). The two sets of bins were compared to identify all the central score bins that exceeded the number of peaks in their matching control bin by greater than 20% of the maximum control bin; the maximum was selected from the set of control bins equal to or of higher value than the bin of interest (i.e. to the right of the bin of interest on the x-axis of Appendix D, Figure D.7A). By using the set of control bins of higher value than the bin of interest, we limited cases where low scoring bins with few peaks in the bin were selected as a threshold. The array of bins that passed the aforementioned 20% threshold was then used to determine the lower boundary of motif score enrichment. Proceeding from the upper score bins to the lower score bins, we identified the lowest set of 5 consecutive central bins (centered on the bin of interest) where at least 4 of the bins, including the bin of interest, exceeded the control bins. The latter requirement was to reduce the chance of the algorithm selecting a central bin that was an outlier in the surrounding neighbourhood of bins. The central bin preceding the flagged bin was selected as the threshold. We provide two score thresholds per PWM (see Appendix F): a threshold based on signal 20% above the background, as outlined above, and a threshold based on signal 5% above the background; the median absolute difference between the two different thresholds is 2 points (e.g. a motif score of 80 versus 82). The parameter of >20% signal was a heuristic limit selected to reduce stochastic noise in the HADB results.    60 The diversity of enrichment zones and relative motif score thresholds from the datasets were summarized in Figure 2.4C and Appendix D, Figures D.6A and D.7C-7D. The enrichment zone thresholds for multiple datasets of a given TF were averaged to select one threshold value per TF. To assess the similarity of the thresholds for a given TF, we calculated the average of the differences between every dataset’s threshold for the given TF, and plotted the average difference as bars above and below the average threshold. We used the same approach for reporting the average motif score and the similarity of motif scores per TF.  2.4.10.5 TFBS-bi-motif view For each given sequence, the left plot reports the location of the first motif relative to the peakMax on the y-axis, and the location of the second motif relative to the first motif on the x-axis. A horizontal band of enrichment around y = 0 shows that the first motif is enriched at the peakMax, while a band of enrichment on the diagonal is the second motif’s enrichment at the peakMax. A gap at x = 0 is due to restricting the overlap of motifs. The diagonal limits of the plot arise from the sequence length limit, here 1001 bp. Increased density at the origin indicates that a number of peaks have both motif 1 and motif 2 enriched at the peakMax. The x-axis on the left plot was set as the distance between motifs in order to mirror the right plot, The right plot is a 10 bp resolution histogram of the distances between the two motifs.  2.4.10.6 Dinucleotide-environment view To create a Dinucleotide-environment view, we wrote a Perl script to align a set of sequences on the motif of interest with the motif oriented in the same direction, and assess the frequency of dinucleotides at every position of the aligned sequences. The file of dinucleotide frequencies is submitted to an R script (R version 2.14.1). The R script calculates the proportion of sequences   61 with a particular dinucleotide at each position of the alignment, and plots the position on the x-axis and the proportion of sequences on the y-axis.  Visualization was performed using the statistical package R (version 2.14.1) [85]. The R functions for generating the four visualization plots and the Perl code for generating dinucleotide frequencies from motif-anchored sequence alignments are available at GitHub https://github.com/wassermanlab/TFBS_Visualization/archive/master.zip. Plots for visualizing the heuristic threshold decisions, as seen in Figure 2.4A and Appendix D, Figure D.7A, are included in the provided R code.  2.4.11 Analysis of broadPeak replicate experiments The ENCODE broadPeak datasets are available as replicates, which allowed us to determine whether the peaks predicted to be directly bound by the ChIPped TF due to the presence of the TF’s motif were enriched for occurrence in both replicate experiments. Replicate experiments were pooled together, with neighbouring peaks (two peak maximums within 500 bp or 1000 bp) flagged as a single instance of a region. The 500 bp distance was selected to be inclusive of the ~400 bp median width of the broadPeak regions; 1000 bp was selected to explore the sensitivity to longer settings. Using a Fisher exact test, we then compared the proportion of replicated regions versus regions unique to a single experiment for the set of peaks with the ChIPped TFs motif within the heuristic enrichment zone, and for the set of peaks without the ChIPped TFs motif. Peaks that had been flagged as having the ChIPped TFs motif in one experiment but without the ChIPped TFs motif in the other experiment were rare (median 0.28% of the pooled replicates for an experiment), and we chose to omit these from the analysis.    62 2.4.12 Gene ontology term enrichment analysis We used the default settings of the web-based version of GREAT [87] http://bejerano.stanford.edu/great/public/html/splash.php for GO enrichment analysis. We submitted those peaks predicted to contain the ChIPped TF’s motif by the HADB method, the peaks without the HADB predicted ChIPped TF’s motif and peaks from the whole dataset. We used the Binomial Region Set Coverage and the number of GO terms related to actin and oxidative stress, to assess differences between the three sets of peaks.  2.4.13 TBP case study – repeat elements analysis Repeat elements in the TBP peaks were assessed using data downloaded from the UCSC mm9 Repeat Masker track via the UCSC Table Browser [115].    63 Chapter 3: Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets  3.1 Background The mapping of the regulatory sequences in the human genome is proceeding rapidly. Large-scale chromatin immunoprecipitation coupled to high-throughput sequencing (ChIP-seq) experiments have been a central component of the mapping efforts, including both transcription factor (TF) target and histone target derivatives [29]. These mapping efforts are providing key insights into the properties of regulatory sequences, the interactions between TFs, and the mechanisms contributing to selective patterns of gene transcription. With the compilation of large and diverse ChIP-seq data collections, an opportunity has emerged to study the common characteristics of TF-bound regions revealed by ChIP-seq.  The characteristics of ChIP-seq data are shaped by both biological and technical influences [30, 116-118]. As with every high-throughput technology, the community learns progressively more about the nuances of the data as they accumulate. Much effort has focused on the development of peak finding methods, which allow for the quantitative determination of TF-bound regions within the sequences recovered in a ChIP-seq experiment. In general, most methods take into account a background rate of sequence recovery and use this background to evaluate the significance of an observed number of mapped reads in the foreground ChIP experiment [116]. Most commonly background sequence data sources are generated from sheared input DNA or mock immunoprecipitation (mock-IP) using a non-specific antibody (for example, IgG). The comparison of the foreground against the background by peak finding software is often the basis for specifying the TF-bound regions, usually delineated with a start, stop, and local maximum read density position (that is, ‘peakMax’).   64  It is clear that the ChIP-seq procedure is working well for detecting regions bound by sequence-specific TFs. Analysis of ChIP-seq datasets reveals an enrichment of the expected TF binding site (TFBS) pattern close to the peakMax or, where no peakMax is determined, peak centre positions (hereafter also referred to as ‘peakMax’) [57, 86]. Ab initio pattern discovery software applied to ChIP-seq data routinely recover the known TFBS pattern [119], and pattern enrichment methods confirm highly significant enrichment of the TFBS pattern of the ChIPped TF [55, 56]. Additionally, a sufficient number of replicates have been performed to demonstrate general consistency between ChIP-seq datasets using the same cells and antibodies [120].  The properties of DNA in the nucleus have a strong influence on the results of diverse methods, including ChIP-seq and DNase I hypersensitivity mapping data [121]. Both input DNA and diverse ChIPped DNA reveal a strong tendency for the recovery of sequences from promoter regions [118, 120], indicating that the DNA shearing process favors regions of open or less compact DNA. These open regions have been demonstrated to be enriched for TF binding and other indicators of accessible DNA such as key histone modifications [22].  One of the open questions about ChIP-seq results is the not infrequent recovery of peaks under which the target motif of the ChIPped TF is absent. Such observations might be attributable to an inadequate understanding of the TF binding specificity, the potential indirect tethering of a TF to a region through protein-protein interactions, or non-specific antibody pull-down. Based on this background, we sought to understand the properties of ChIP-seq TF binding data, with an emphasis on the identification of mechanisms to account for the past observations of peaks lacking the motif of the targeted TF. Based on our research, we report a striking property of TFBS enrichment around the peakMax for CTCF-like, JUN-like, ETS-like, and THAP11 motifs   65 across a broad set of TF ChIP-seq data. The broadly enriched TFBS classes, which we term ‘zingers’ for their startling enrichment, can account for a substantial portion of TFBS ChIP-seq data. The zinger regions are observed to recur across ChIP-seq data from multiple cell lines and for multiple TFs. These recurring regions tend to be proximal to structural features defined by cohesin and polycomb group proteins. A model to account for the observed properties of zingers is introduced and discussed.  3.2 Results 3.2.1 Zingers are TF binding motifs enriched across multiple TF ChIP-seq datasets A subset of TF ChIP-seq data has been reported to lack motifs for the ChIPped TF, suggesting that there may be additional proteins interacting in a sequence specific manner with these regions. Drawing together diverse TF-ChIP-seq data, we sought to determine if characterized TFs might account for a portion of the discrepancy. To measure the enrichment of TF motifs across the compiled TF ChIP-seq datasets we performed motif over-representation analyses, using the oPOSSUM 3.0 software [55]. We tested 165 position weight matrices (PWMs) selectively curated from the JASPAR development database (see Chapter 3 Methods), on 285 human datasets (33 cell-lines) for 101 TFs (ENCODE and other resources; see Chapter 3 Methods). A parallel analysis of mouse data was performed for 81 datasets (12 cell-lines) encompassing 43 TFs (ENCODE and other resources; see Chapter 3 Methods). For each oPOSSUM analysis we provided a set of background sequences of similar length and nucleotide composition relative to the ChIP-seq dataset (all peaks were constrained to 401 bp length). As there were two or more ChIP-seq datasets for many TFs, generated from different cell lines or conditions, we averaged the oPOSSUM enrichment scores across all datasets for a given ChIPped TF. The details of the statistical measures and assessed thresholds are presented in the methods. Briefly, two oPOSSUM enrichment scores were used to evaluate the   66 datasets: a Fisher-log score (to assess enrichment of motifs across many ChIP-seq peaks) and a Kolmogorov-Smirnov (KS) centrality score (to assess enrichment of motifs in proximity to the peakMax position).  Of 165 TF motifs analyzed, CTCF, ETS-like (for example, GABPA and ELK4), and JUN-like motifs were found to be both the most enriched and most proximal to the peakMax across the greatest number of both human (Figure 3.1A and binding site logos in 1B) and mouse (Appendix G, Figure G.1A and binding site logos in G.1B) TFs’ datasets. We refer to such broadly enriched TF motifs as ‘zingers’, reflecting their potential to confound the analysis and interpretation of TF ChIP-seq results.  To assess if zinger enrichment is independent of the ChIPped TFs’ motifs (that is, not over-lapping the expected motif), we performed a second enrichment analysis on human ChIP-seq sequences in which the ChIPped TF motifs were masked (thus restricting the analysis to the subset of ChIP-seq datasets for which a TF binding profile is available). We again consider the two metrics of Fisher-log enrichment score and KS centrality score. The CTCF, ETS-like, and JUN-like zinger motifs remained enriched (Appendix G, Figure G.2A).  Short patterns, such as those found by PWMs, can occur by chance in the genome. To confirm the findings of zinger-specific enrichment, we shuffled the zinger PFMs and determined the likelihood of achieving the frequency of enriched datasets observed for the original profile (see Chapter 3 Methods). In all cases, the comparison against the frequency of enriched datasets obtained with the shuffled matrices confirmed that the true zinger motifs’ enrichment was extremely unlikely to occur by chance (P values are: 2.5e-44 for CTCF, 2.8e-09 for GABPA, and 3.7e-08 for JUN).   67  Figure 3.1 Zinger binding motifs are enriched across multiple human ChIP-seq datasets.  (A) The histogram displays the results of TFBS motif enrichment analysis on 281 human ChIP-seq datasets generated with the oPOSSUM 3.0 software. Along the x-axis is the fraction of datasets that displayed enrichment near the peakMax for a TF profile. The y-axis is the number of TF profiles that were found enriched for a given fraction of datasets. The profiles most frequently observed to be enriched are labeled on the histogram. The likelihood (P values) of a PWM with the same width, information content, and GC composition as the CTCF, GABPA, or JUN PWMs to attain the enrichment frequency observed in the histogram follow: 2.5e-44 for CTCF, 2.8e-09 for GABPA, and 3.7e-08 for JUN. (B) The binding site logos of the 10 TF binding   68 models with enriched motifs across the greatest number of datasets, manually grouped by motif similarity. Each logo depicts position along the x-axis and information content (that is, pattern strength) along the y-axis. (C) Motifs detected consistently by ab initio motif discovery across five datasets of 5,000 random sequences. The upper motif is similar to the CTCF logo in section B, while the lower motif is similar to the motif for the THAP11 TF.   3.2.2 Ab initio motif discovery of zinger profiles We sought to determine if ab initio pattern discovery could recover either novel profiles or known TFBS profiles in pooled data, a process requiring a greater signal-to-noise ratio than the more noise-tolerant oPOSSUM motif enrichment testing above. Across all of the ChIP-seq data, we masked the motif of the ChIPped TF and repeat-masked the sequences (see Chapter 3 Methods), then drew five sets of 5,000 sequences from the ChIP-seq pool and subjected each set to pattern discovery analysis using the MEME system [119]. From the five replicate pools, MEME returned profiles for wide and high information content patterns. In all cases MEME detected a pattern consistent with the CTCF binding profile in the top six results (Figure 3.1C, top logo) and a profile unknown to MEME Suite’s TOMTOM pattern similarity scoring system [122] (Figure 3.1C, bottom logo). A report from Ngondo-Mbongo et al. [100] identified that THAP11 binds to a motif that matches the unknown profile, so we will hereafter refer to the MEME derived profile as the THAP11 profile. We reviewed oPOSSUM results for the enrichment of the THAP11 motif, and found that it is consistent with the zingers for the Fisher-log score enrichment frequency, but the motif is not frequently observed to be centrally positioned based on the oPOSSUM KS-score (although it is proximal to the peakMax by the heuristic motif enrichment method presented below in this report). Given the strength of evidence, we elected to classify THAP11 motif as an additional zinger.    69 3.2.3 Zinger motif enrichment observed within open chromatin and genomic datasets Using the oPOSSUM enrichment analysis procedure, we sought to determine if the zingers showed enrichment in other genomic data collections. ChIP-seq data are recognized to be highly enriched with open chromatin regions, and in particular ChIP-seq data for CTCF, one of the zinger TFs, are known to strongly overlap with DNase I hypersensitive sites [123, 124]. We therefore analyzed ENCODE DNaseI-seq and Faire-seq data to assess the enrichment of the zinger motifs. Each region (average 150 bp) was extended to 401 bp for enrichment analysis using the oPOSSUM 3.0 software. oPOSSUM enrichment results revealed the zinger profiles to be the most frequently enriched within DNaseI-seq and Faire-seq datasets, showing enrichment near the region centre in 50% to 100% of the DNaseI-seq datasets, and 20% to 92% of Faire-seq datasets (Appendix G, Figure G.3). We further assessed the ratio of zinger motifs in DNase and Faire regions compared to flanking regions, providing an indication of the portion of each dataset that could be attributed to zingers: mean values of 47% for DNaseI-seq and 13% for Faire-seq were obtained (see Appendix H).  We have observed enrichment of zingers in other open chromatin associated data such as ChIP-seq data for helicase-related proteins or histone modifiers (Appendix G, Figure G.4, and Appendix H), and ChIP-seq control data (Appendix G, Figure G.5, and Appendix H). Thus zinger motifs are observed in multiple classes of genomic datasets.  3.2.4 Visualizing the pattern of motif enrichment We first used visualization approaches to examine the distribution of both the motif scores and peakMax proximity for the CTCF, JUN, GABPA, and THAP11 zinger motifs for several datasets using TFBS-landscape plots [82]. To visually assess the topological pattern of enrichment of zinger motifs using TFBS-landscape plots, we extended all analyzed sequences to 1,001 bp    70  Figure 3.2 Zinger motifs are enriched at the peakMax of non-zinger ChIP-seq datasets.  The enrichment plots display the location of the top scoring motif for each peak relative to the peakMax (the peakMax is at 0) on the x-axis, while the score of the motif is plotted on the y-axis. The adjacent line plots display the fraction of motifs observed in 5 bp increments. The logo reflecting the binding specificity for each zinger appears above the related enrichment plot. (A) CTCF motif predictions from NRF1 ChIP-seq (GM12878 cells). (B) JUN motif predictions from TCF7L2 ChIP-seq (Hct116 cells). (C) GABPA motif predictions from NFKB ChIP-seq (GM19099 cells). (D) THAP11 motif predictions from IRF1 ChIP-seq (K562 cells).   (peakMax position at 501 bp), and plotted the motif position relative to the peakMax (x-axis; upstream and downstream of peakMax) and the motif score (y-axis) of the top scoring zinger motif for each peak. As seen in Figure 3.2, the motif predictions of zinger PWMs are in general concentrated in motif score ranges across all positions relative to the peakMax, for example,   71 motif scores 70 to 85 for CTCF (Figure 3.2A), or 80 to 87 for JUN (Figure 3.2B). However, proximal to the peakMax, there is a distinctive enrichment for the zinger motif, most strikingly seen for CTCF and THAP11 where almost all high scoring motifs (>85) are located proximal to the peakMax. The enrichment of JUN and particularly GABPA zinger motifs are less distinctive visually, due to the peakMax proximal enrichment overlapping the same score range as the background motifs. In control datasets and with shuffled matrices we do not see the distinct high scoring population of motif scores; we instead see a uniform distribution along the total 1,001 bp of sequence, which conveys, visually, the background rate of motif prediction for the PWM (Appendix G, Figure G.6). The distinctive zinger motif enrichment allowed for the selection of subsets of peaks that were enriched for the motif of a TF that was not specifically targeted by the ChIP-seq experiment.  3.2.5 Defining a set of zinger motif containing peaks Based on the visualization analysis we used a procedure for determining the range of motif enrichment relative to peakMax proximity and motif score enrichment [82]. The outer limits of these ranges of enrichment were then applied as thresholds that defined ‘enrichment zones’ for quantitative analysis of ChIP-seq dataset motif composition (Figure 3.3; see Chapter 3 Methods).  For ease of reference, we will hereafter use ‘zinger motifs’ to refer to the collection of CTCF, JUN-like, ETS-like, and THAP11 motifs within the enrichment zones and ‘zinger motif peaks’ to refer to those peaks within a dataset that have a zinger motif but not the ChIPped TF’s motif. Motif predictions outside the enrichment zones will be referred to as ‘distal-zinger’ motifs. As anticipated, peaks with the ChIPped TF’s motif proximal to the peakMax comprised the majority of most datasets (up to 99% in the best case). After accounting for background   72 ChIPped TF motif rates, the mean observed portion was 55% (the median was 59% with a median absolute deviation (MAD) of 27 pp). There are, however, extreme cases in which the ChIPped TF’s canonical binding motif is present in less than 10% of the peaks (Appendix G, Figure G.7, and Appendix H).   Figure 3.3 ChIP-seq datasets contain multiple classes of peaks. The fraction of zinger motif peaks and ChIPped TF motif peaks varies across ChIP-seq datasets. The pie charts present a random selection of 50 datasets for multiple TFs and cell-lines with zinger motifs present (>1% zinger). The charts are ordered by greatest zinger motif peak enrichment to the least. Black is the portion of peaks with the ChIPped TF’s motif, red is the portion of zinger motif peaks, and brown is the remaining portion of peaks that do not contain either the ChIPped TF nor zinger motifs.   After accounting for background, and excluding two outliers, up to 45% of a ChIP-seq dataset are zinger motif peaks with a mean of 12% (median of 9% with a MAD of 3 pp) (Appendix G, Figure G.8A). The zinger motif peaks account for up to 69% of the set of peaks unexplained by   73 the ChIPped TF’s motif, with a mean of 27% (median of 27% with MAD of 14 pp), in datasets with at least 1% zinger motif peak content (Appendix G, Figure G.8B); the zinger motif peak enrichment is visually depicted in a heat map format (Appendix G, Figure G.8C). For clarity, the portion of zinger motif peaks are anti-correlated with the portion of ChIPped TF motif peaks (Appendix G, Figure G.8D).  3.2.6 No strong dependencies detected for zinger motif occurrence As zinger motifs are present in peaks without the ChIPped TF’s motif we wanted to determine if there were any characteristics specific to or in common among this set of peaks. We found that neither the presence nor proportion of zinger motif peaks within a ChIP-seq dataset is dependent on cell type, as seen in Appendix G, Figure G.9A, for the five most abundant cell lines. Neither, is the proportion of zinger motif-containing peaks consistent across multiple datasets for the same TF (Appendix G, Figure G.9B).  Next we asked if the zinger motifs have a strong tendency to co-occur in the same zinger motif peaks. We found that at most 11% of datasets show a positive association with a significant P value (Fisher exact P values <0.001 and log odds ratios >0) for any pairwise co-occurrence of two different zingers within a single peak (the most frequent pair of zingers being GABPA and THAP11). A few datasets (17%) show a negative association for zinger motif co-occurrence with a significant P value (Appendix G, Figure G.10A). Thus, the zinger motifs are not inter-dependent. We next evaluated the pairwise tendency for zinger motif peak enrichment within the same ChIP-seq datasets, finding unremarkable correlation values (correlation coefficients -0.0233 to 0.3803) (Appendix G, Figure G.10B).    74 Lastly we determined whether zinger motif peaks were consistently located near a feature in the genome. We evaluated the proximity of zinger-associated regions to genomic features such as transcription start sites (TSS), CpG islands, conserved regions, and repeat sequence regions. Comparing the set of zinger motif peaks to peaks with the ChIPped TF’s motif, we did not detect consistent enrichment tendencies that distinguished between the two sets of regions (Appendix H).  3.2.7 Peaks containing a zinger motif but lacking the ChIPped TF motif have low scores As zinger motifs are an unexpected presence across datasets we assessed the quality of the peaks they occur in, asking if the zinger motifs tended to be in the lower scoring peaks of the dataset. We compared the peak calling scores of peaks containing the ChIPped TF’s motif against peaks with a zingers’ motif. The peak scores for the zinger containing peaks are significantly poorer than for those peaks with the ChIPped TF’s canonical motif (Wilcoxon one-tailed test P values <5.0e-05).  3.2.8 Peaks with a zinger motif may be bona fide targets of the zinger TF Prediction of TFBSs can suffer from poor specificity, and as the enriched zinger motifs’ peaks were unexpectedly found in datasets for non-zinger TFs, we asked if the zinger motif peaks were actual binding locations for the zinger TF or not. Therefore we investigated the degree of agreement (co-occurrence within 100 bp) between zinger motif peaks with a strong motif score (score >85) and ChIP-seq data ChIPped for the zingers TF in the same cell type (Figure 3.4). On average 75% of zinger CTCF motif peaks overlapped CTCF ChIP-seq peaks (median 79% with a MAD of 15 pp); 38% of zinger JUN motif peaks overlapped JUN ChIP-seq peaks (median 38% with a MAD of 17 pp); and 28% of zinger GABPA motif peaks overlapped GABPA ChIP-seq peaks (median 27% with a MAD of 13 pp). In all cases the agreement was significant    75  Figure 3.4 Zinger motif peaks overlap ChIP-seq data for the given zinger TF. For each plot, a selection of TF ChIP-seq datasets is alphabetically ordered by TF name horizontally. The y-axis represents the fraction of peaks that overlap with the zinger TF’s ChIP-seq peak in experiments performed with the same cell type. Two populations of peaks are plotted per dataset: solid circles represent the subset of peaks with a peakMax-proximal zinger motif, and open triangles represent the subset of peaks with a distal-zinger motif. (A) CTCF, (B) JUN, or (C) GABPA. The horizontal dashed line at 0.13 is a qualitatively selected visual aide.     76 (Wilcoxon P values <3.4e-20) with respect to the distal-zinger control (see Appendix H), and indicated that many of the zinger regions may be bona fide binding regions for the zinger TF.  A comparison of the peak scores for the ChIP-seq peaks that overlapped the set of zinger motif peaks versus the set of distal-zinger peaks revealed a significant difference between the two groups (Wilcoxon one-tailed test significance threshold P value <0.001). The zinger motif peaks associated with stronger scoring ChIP-seq peaks than did the distal-zinger peaks for the majority of datasets (that is, 81%, 67%, and 79% of CTCF, JUN, and GABPA ChIP-seq datasets, respectively).  3.2.9 Zinger motif peak regions recur across multiple TF datasets As zinger motif peaks are enriched in numerous datasets for which the zinger is not the targeted TF, we asked whether the same zinger regions were occurring repeatedly across multiple datasets, that is, are the same zinger regions being ChIPped by many TFs. We pooled the zinger motif peaks, which by definition lacked the motif of the ChIPped TF, from across datasets (33 cell lines; 823,574 peaks), requiring that the zinger motif have a strong motif score of 85 or greater to reduce false positives. We assigned peaks whose peakMax were within 50 bp of each other into neighbourhoods (see Chapter 3 Methods), and then assessed the recurrence of each neighbourhood, that is, the number of unique TFs whose datasets contributed a zinger motif peak to the neighbourhood.  We obtained 257,631 zinger neighbourhoods of which 92,244 neighbourhoods derived from regions ChIPped by two or more unique TFs. The neighbourhoods ChIPped by two or more TFs are on average 167 bp in width (maximum 607 bp), and 77% derive from two or more cell lines. This amounts to approximately 15.4 Mbp of recurrently detected zinger motif associated   77 sequence that was ChIPped by 2 to 41 non-zinger TFs in up to 21 cell lines. Figure 3.5 exemplifies the number of TFs that ChIPped zinger neighbourhoods across chromosomes 1 and 3 (see Appendix I regarding zinger neighbourhood coordinates).   Figure 3.5 Zinger motif peaks recur across datasets for multiple TFs.  The plots present two distinct neighbourhood sets (as defined in the text): one set derived from zinger motif peaks (red) and the other from ChIPped TF motif peaks without zinger motifs (black). The x-axis gives the neighbourhood position on a chromosome: (A) chromosome 1, (B) chromosome 3. The y-axis is the number of unique TFs that ChIPped a peak in a neighbourhood. A horizontal dotted line at y = 5 is given for visualization purposes, to highlight that there are many zinger neighbourhood locations (red) that were ChIPped by multiple unique TFs.    78  We similarly generated neighbourhoods from those regions with neither the ChIPped TF nor zinger motifs (unidentified motif neighbourhoods - 536,546), and from the regions found to have a high scoring motif (score >85) for the ChIPped TF and no zinger motif (ChIPped TF neighbourhoods - 408,677) (see Chapter 3 Methods). The zinger neighbourhoods were found to be ChIPped by significantly more unique TFs than are the other two sets of neighbourhoods (Wilcoxon one-tailed test P value = 0).  The recurrence of the zinger motif peaks across datasets prompted us to consider the motif content of HOT regions. HOT (high occupancy of transcription-related proteins) regions, as defined by Yip et al. [61], are ChIP-seq regions that within a single cell line (GM12878, HeLa, H1-hESC, HepG2, or K562) demonstrate binding co-occurrence among chromatin-related factors, general TFs, and sequence-specific TFs. Yip et al. noted that a substantial portion of a cell-line’s HOT regions are motif-less for the ChIPped factor, and associate with strong DNaseI  signals. The HOT regions are present in two or more cell lines in 25% of cases according to Yip et al., while zinger neighbourhoods were noted above to be 77% of cases. oPOSSUM enrichment analysis on the combined set of HOT regions found the zinger motifs to be 13 out of the 20 most enriched patterns, consistent with what was observed above for the DNaseI-seq/Faire-seq open chromatin datasets (Appendix J, Table J.1).  3.2.10 Zinger neighbourhoods tend to occur close to regions occupied by cohesin Recurring open chromatin enrichment across datasets suggested that structural properties of chromatin might contribute to zinger motif recovery across ChIP experiments [121]. Cohesin is a protein noted for both its role in gene regulation and DNA structure [125, 126]. It is a multi-subunit complex, which is believed to form a ring like structure around DNA, and has been well   79 documented in its role of sister chromatid interaction during the mitotic metaphase. Cohesin has also been implicated in promoting interaction between enhancers and core promoters of active genes in embryonic stem cells [126] and in chromosomal looping [127]. Chromosomal looping may be a structural element that is conducive to DNA shearing under the stress of sonication. Additionally, cohesin or associated proteins may function as a ‘loading station’ by bringing together proteins bound to remote regulatory elements and promoter regions that will in turn regulate transcription within the looped region [128].  We evaluated the proximity of the zinger neighbourhoods to cohesin-interacting regions. Zinger neighbourhoods are enriched for proximity (that is, within 500 bp) to cohesin regions (via RAD21 and SMC3 ChIP-seq) compared to the ChIPped TF neighbourhoods or unidentified motif neighbourhoods (Fisher exact one-tailed test P value of 0 for both comparisons; 77% of the zinger neighbourhoods observed for multiple TFs are proximal, while 46% of the unidentified motif and 13% of the ChIPped TF neighbourhoods are so positioned). The neighbourhoods for unidentified motif peaks were also significantly more proximal to cohesin than neighbourhoods from ChIPped TF peaks (Fisher exact one-tailed test P value of 0). As some of the neighbourhoods contain CTCF zinger attributed regions, and cohesin is known to interact with CTCF [129, 130], we removed neighbourhoods within 500 bp of a CTCF ChIP-seq region and repeated the analysis. Regardless of the depletion of CTCF associated neighbourhoods, the zinger neighbourhoods remained significantly closer to cohesin (Fisher exact one-tailed test P value of 0 for all comparisons).  Another system noted to impact chromatin structure are the polycomb group proteins (including polycomb repressive complex 1 (PRC1) and polycomb repressive complex 1 (PRC2) forms), which are implicated in the remodeling of chromatin. In drosophila, PRC1 has been noted to   80 interact with cohesin to co-regulate active genes [131]. We used ChIP-seq data for the constituent proteins CBX and EZH2 proteins to identify regions bound by the PRC1 and PRC2 complexes, respectively. We found that the zinger neighbourhoods were significantly closer to CBX peaks and EZH2 peaks than are the neighbourhoods derived from either ChIPped TF motif peaks, or from unidentified motif peaks (Fisher exact one-tailed test P value of 0). We observed that the PRC1 and PRC2 peaks proximal to the zinger neighbourhoods, tend to be those that are also within 500 bp of cohesin (Fisher exact one-tailed test P value <7.6e-160 for PRC1, and P = 0 for PRC2). The unidentified motif neighbourhoods are, in turn, significantly closer to PRC regions than the neighbourhoods derived from peaks with the motif for the ChIP-seq experiment’s targeted TF.  Thus, the zinger neighbourhoods, and to a lesser degree the unidentified motif neighbourhoods, are associated with cohesin and polycomb repressive complex regions. This suggests that these diverse regions, which were initially identified as not containing the motif of the ChIPped TF, and yet in many cases enriched for an alternative motif (zingers), may be part of a structure involving cohesin. Such a structure could influence the tendency for these regions to be detected recurrently across diverse ChIP-seq data.  3.3 Discussion ChIP-seq experiments are increasingly used to investigate how sequence-specific DNA binding TFs regulate gene expression. In this report, we introduce ‘zingers’: four classes of TFBSs that display significant binding site enrichment, unexpectedly proximal to the peakMax, across ChIP-seq binding experiments for other TFs. Within individual TF ChIP-seq experiments, up to 45% of peaks are observed that lack the canonical TF binding motif and contain a zinger motif, with a mean of 12% (median 9%). While biased to the lower scoring peaks in other TF ChIP-seq data,   81 the same zinger-associated regions tend to be high scoring peaks within datasets ChIPped for the zinger TF; indicating these regions are likely bound by the zinger TF. The zinger motif peaks derive from 257,631 regions (neighbourhoods) in the genome, 36% of which are observed recurrently across datasets for diverse TFs, in sharp contrast to neighbourhoods containing only the ChIPped TF’s motif, which recur relatively infrequently. Some regions lacking both the ChIPped TF’s motif and a zinger motif, are also recurrently observed. Both zinger motif and unidentified motif neighbourhoods are positioned proximal to structural regions defined by the presence of cohesin and polycomb group complexes. Accounting for the contribution of zinger-associated regions to global studies of regulatory sequences will be a consideration for future analysis of ChIP-seq data.  Understanding the underlying biochemical mechanism by which the zinger-associated regions are observed across such diverse datasets remains to be resolved. However, based on the findings in this investigation, we present a ‘loading station’ model consistent with our state of understanding (Figure 3.6). Cohesin/polycomb and zinger proteins are proposed to participate in demarcation and stabilization of inter-segment interactions of DNA at which TFs bind. At these ‘stations’, the ChIPped TF may be present via direct (Figure 3.6B) or indirect (Figure 3.6C) interactions with the DNA, and either in cis- or trans- arrangements with a zinger TFBS. In a ChIP experiment, assuming covalent linking of the ChIPped TF and the cohesin-paired DNA, the patterns of motif enrichment observed in this report could emerge, including the presence or absence of motifs for both the ChIPped TF and a zinger. Alternatively, or possibly in combination, there may exist zinger-containing regions (Figure 3.6D) at which many proteins are present (at a cell population level). Such regions may contain a diverse range of epitopes and therefore be more likely to be recovered in ChIP-seq experiments, especially with polyclonal antibodies. Within this model, TFs may ‘visit’ cohesin and zinger marked regions,    82  Figure 3.6 A model to account for zinger motif enrichment across ChIP-seq datasets.  A TF loading station model is presented that is compatible with the observed enrichment of zinger motifs across diverse TF ChIP-seq data and cell lines. The dark blue oval represents the ChIPped TF, the magenta oval represents the zingers, the remaining coloured ovals represent TFs or other proteins or complexes that engage with the DNA, and the red loop represents cohesin and polycomb group proteins. The grey strands are chromatin. (A) Overview of a loading station. Multiple proteins may interact within a local region, from which TFs may disperse to search for other regulatory regions. Zingers and structural components such as cohesin and polycomb group proteins are key features. Panels (B), (C), and (D) present specific scenarios under which DNA loading station segments might be recovered in a ChIP-seq experiment. (B) Direct binding. The ChIPped TF directly binds to a TFBS, while a zinger motif is present in trans (upper) or in cis (lower). (C) Indirect binding. The ChIPped TF is present due to an indirect interaction, involving a mediating protein. The zinger motif is again present in trans   83 (upper) or in cis (lower). (D) Non-specific events. Numerous proteins are present at the loading station, providing an abundance of epitopes, thus increasing the probability of being recovered in a ChIP-seq experiment.   resulting in a low but consistent recovery of reads in a ChIP-seq experiment. The model accounts for recurring detection of zinger motif peaks, the proximity of the peaks to cohesion interacting regions, and why the zinger motifs may be present in the sequence even when the ChIPped TF’s motif is absent.  From a broader mechanistic perspective, a loading station mechanism is consistent with the ‘hop-skip-jump’ theory for how TFs efficiently search the nucleus to arrive at their TFBSs [132]. The proposed loading station model is supported in recent literature. Faure et al. [128] propose a role for cohesin in stabilizing large protein-DNA complexes. While this manuscript was under  review, Yan et al. [60] published a study using the LoVo cell line suggesting that cohesin participates in holding chromatin open during cell division to facilitate TFs relocating back to those regions once division is complete.  The zinger content of every ChIP-seq dataset should be evaluated, consistent with a growing effort to critically evaluate such data [23, 121, 133]. For instance, the STAT1 (GM12878) ChIP-seq dataset exceeds 30% of peaks with zinger motifs proximal to the peakMax, while STAT1 motifs occur only at the background frequency. We propose a general approach for the study of zinger content. For each ChIP-seq dataset, the peak regions should be scanned for the presence of the ChIPped TF motif in proximity to the peakMax. The peaks lacking a ChIPped TF motif should be compared to the recurring zinger neighbourhoods (Appendix I). The portion   84 of the dataset overlapping the neighbourhoods gives insight into the overall specificity of the experiment.  We have identified zinger motifs that are frequently enriched across a portion of TF ChIP-seq data, including CTCF-like, ETS-like, and JUN-like motif families, and THAP11. As high-throughput ChIP-seq data informs genome annotation, research into gene regulation may be impacted by zinger motif derived annotations. Moving forward it will be important to determine the prevalence of zinger-like motifs in ChIP-seq data in diverse organisms, probe the structural properties of the zinger regions, and develop computational approaches to systematically identify recurring zinger regions in large-scale genome annotation. Ultimately, understanding the biophysical processes that result in the zinger motif enrichment in ChIP-seq data may provide broader insight into the mechanisms of transcription regulation.  3.4 Methods 3.4.1 Datasets For our analyses, we used ENCODE ChIP-seq datasets (human and mouse), ENCODE DNaseI-seq and Faire-seq data, and human ChIP-seq controls [29] downloaded from the UCSC ENCODE database [110]. We also incorporated non-ENCODE ChIP datasets downloaded from GEO: (1) GSE11431 - 13 mouse ESC datasets [106]; (2) GSE25532 - mouse NFYA data in ES cells [107]; (3) GSE17917 and GSE18292 - human KLF4, POU5F1, cMYC, NANOG, and SOX2 data [108]; and (4) GSE22078 - human and mouse CEBPA and HNF4A [51]. Where only the mapped data were available, we used FindPeaks 4.0 [26] to call peaks using the following parameter options: dist_type 1 200 -subpeaks 0.6 -trim 0.2 -duplicatefilter. The ENCODE broadPeak datasets frequently occurred in replicate; to avoid duplication, only the replicate with the most peaks of a pair was used for analyses.   85  Where coordinates were provided as NCBI36/hg18 or NCBI36/mm8, they were first converted to GRCh37/hg19 or NCBI37/mm9, using a locally installed version of the UCSC lift-over tool [111]. We then used the Ensembl API to retrieve sequences from GRCh37/hg19 and NCBI37/mm9 assemblies.  The ENCODE ChIP-seq data are in one of two formats, narrowPeak and broadPeak. Both formats contain two columns pertaining to statistical significance of the peaks (also known as peak scores): one is a P value, the other a q value (Bonferroni corrected). We used the q value field when it was assigned, and otherwise used the P value field.  As peaks are reported in a multitude of lengths, in the range of 1 bp to greater than 5,000 bp, we trimmed or extended all peaks to a constant length centered at the peak maximum for narrowPeak format datasets, or at the peak centre for broadPeak format and DNaseI-seq/Faire-seq datasets. For enrichment visualization and determining heuristic boundaries of enrichment we used 1,001 bp sequences, oPOSSUM TFBS enrichment analysis input was 401 bp sequences, and ab initio motif detection input was 201 bp sequences.  Position frequency matrices (PFMs) were obtained from the JASPAR [42] development 4.0_alpha database of transcription factor models (prior to 2013, April) (see Appendix F). Where the JASPAR PFM did not agree with the consensus in the literature we performed an ab initio analysis on the top 500 peaks (selected by peak score) of two or more ChIP-seq datasets for the given TF, using a locally installed version of the MEME software [119]. MEME results were then checked against the literature and for enrichment in a different ChIP-seq dataset for the given TF. MEME position specific probability matrices (PSPM) were converted to PFMs by   86 transposing the PSPM and multiplying each letter (A, G, C, T) frequency in the matrix by the number of sites found by MEME. The PFMs were subsequently converted to position weight matrices (PWMs), using the TFBS Perl Module [112], only PWMs based on PFMs with information content (IC) greater than 8 bits were retained.   For those analyses using datasets of shuffled matrices, the datasets were generated by random permutation of all columns of the originating PFMs, excluding the lower information content columns on the edges (2 columns on each side for all cases, except for the wider CTCF PWM for which 3 columns on each side were held constant).  3.4.2 Motif over-representation analysis Motif over-representation analyses were performed with a locally installed version of oPOSSUM 3.0 [55]. We used the sequence-based analysis option with default settings, except for specifying the use of the JASPAR development PFM matrices (Appendix F). We trimmed or extended all peaks to 401 bp. Backgrounds for the over-representation analyses came from the mappable portion of the genome, and were chosen to match the sequence length and mononucleotide GC composition distribution of each dataset.  The oPOSSUM Fisher-log enrichment score is derived from a one-tailed Fisher exact probability test, based on the hypergeometric distribution, which compares the number of sequences that contain a motif for the TF of interest in the target and background datasets. The negative natural logarithm of the Fisher test probabilities is the reported Fisher-log score. Thus a Fisher-log score of 6.91 or higher is equivalent to a P value of 0.001 or lower. Fisher-log enrichment scores of ‘infinite’ value were set to either 500 or to 100 past the maximum non-infinite Fisher-log score.   87  The oPOSSUM KS centrality score is the negative logarithm of the probabilities from a Kolmogorov-Smirnov test. Thus a KS score of 6.91 or higher is equivalent to a P value of 0.001 or lower. The Kolmogorov-Smirnov tests whether a TF’s motifs are positionally enriched at the center of the target sequences relative to the motifs in the background set of sequences. KS ‘infinite’ enrichment scores were set to 100.  To calculate the number of datasets enriched for a motif we first obtained the average Fisher-log score and KS log score for datasets ChIPped for the same TF. Once we had a set of scores for each TF, we used a binary count of 1 or 0 to indicate whether both of the oPOSSUM enrichment scores passed a threshold based on the standard deviation (SD) of the scores or not (two SD for Fisher-log scores and one SD for KS log scores). This yielded the number of datasets with enrichment around the sequence midpoint for each of the 165 TFs. We then applied a further correction to compensate for the bias created by multiple datasets for families of TFs that recognize the same motif (for example, JUN, JUND, JUNB, AP1, FOS, FOSL1, FOSL2, and BATF PWMs all recognize a TGA(g/c)TCA consensus). The number of motif-family members, minus one, was subtracted from the count of datasets for each of the member TFs, for example, if JUNB were enriched in 20 TF datasets, and 9 of those datasets were ChIPped for a TF that recognizes the JUN-motif family consensus, then a count of 8 would be subtracted from 20. The 165 TFs were then ranked according to this final number of associated datasets.  3.4.3 Motif over-representation analysis with shuffled matrices To assess the probability of a PWM’s predictions being enriched within as many datasets as observed with the zinger PWMs, we shuffled the PFMs of the zingers and fit a distribution to the results. We generated 100 shuffled matrices as described above. We performed oPOSSUM   88 enrichment analyses with the shuffled PWMs, on the same human datasets as used to generate Figure 1. The oPOSSUM results were evaluated as outlined above. However, we applied the enrichment score thresholds for each dataset as was set for the original PWMs. We then counted the number of datasets within which each shuffled profile was enriched, and fit a zero-adjusted logarithmic distribution (ZALG) to the counts. The distribution was selected using the fitDist() function in the R statistical package GAMLSS 4.1-5 [85], and the parameters describing the distribution were obtained with gamlss family ZALG and the gamlss() function. We tested for goodness-of-fit of the distribution to the data by generating datasets from the random generation function, rZALG, and assaying the similarity of the generated distributions to our data using a chi-squared test. The fitted distribution function was then used to determine the probability of the shuffled PWMs obtaining a result as extreme as the original PWM. The probability was calculated with the density function for the zero-adjusted logarithmic distribution (dZALG).  3.4.4 Motif prediction Motif prediction was performed with C-code adapted from the TFBS Perl Module [112], reporting relative motif scores. Motifs predicted by a PFM are not permitted to overlap by more than one-fifth the PFM length (this setting is intended to equate to the low information content flanks of a PWM), for example, a 7 bp motif could only overlap a neighbouring motif by 1 bp.  For post-oPOSSUM analyses, we predicted the presence of zinger motifs using one PWM per zinger TF motif family as proxy, to prevent redundancies. CTCF-like motifs were predicted with the CTCF PWM, ETS-like motifs with the GABPA PWM, JUN-like motifs with the JUN PWM, and THAP11 motifs with a THAP11 PWM.    89 3.4.5 MEME suite tools MEME [119] analyses were run using the following options: -dna -nmotifs 10 -minw 6 -maxw 15 -maxsize 2000000 -mod zoops -revcomp. TOMTOM [122] analyses were run with default values, aside from increasing the E-value threshold to 20, from the web server.  3.4.6 Repeat-masking Masking of repeat elements was performed using a local installation of RepeatMasker (RMBlast) [134] and RepBase [135], using default settings.  3.4.7 Data processing and statistical analyses Data processing and statistical analyses were done with a combination of in-house Unix and R scripts (R version 2.14.1) [85]. Throughout the manuscript we report the combination of median and the median absolute deviation (MAD), a measure of dispersion around the median. For a normal distribution the median and MAD are the same values as the mean and SD.  3.4.8 TFBS-landscape visualization plots To visualize peakMax proximal enrichment of TF motifs within ChIP-seq datasets, the top scoring predicted motif in each region for the given TF PWM, was plotted relative to its signed distance from the peakMax (using the R basic statistical package). The dense horizontal ranges of motif scores spanning all positions relative to the peakMax, such as seen in the Figure 3.2 plots, are observed for the combination of most PWMs and ChIP-seq datasets, and are likely a mixture of both false and true TFBS predictions. Those motif matches that are distal to the peakMax are anticipated to be less reliable, as the observed frequency is consistent with background rates of motif prediction. If we take enrichment proximal to the peakMax as a measure of confidence for the predictions we can determine a distance threshold and motif   90 score threshold (see next section) at a point where motif frequency proximal to the peakMax is greater than the flanking distal motif frequency. Using this threshold, we can select a sub-population of peaks that are less likely to have arisen by chance.  3.4.9 Heuristic boundaries of enrichment We assessed the enrichment of motif distance to the peakMax and motif score, using a heuristic method for topological motif enrichment [82], which we outline in brief here. To determine whether a motif was proximal to the peakMax, we used heuristic distance boundaries derived from the density of the top scoring motif for each 1,001 bp region. We identified the location, relative to the 501st bp, at which the density of motifs exceeds that of the distal region (approximately 175 to 500 bp distant from the peakMax). This change in density is observed in the TFBS-landscape plots of Figure 3.2, where there is a constant density of motif scores in the distal regions and an increase in the density of motif scores within approximately 100 bp of the peakMax. The heuristic distance boundaries were set at the transition point. A similar procedure was applied to determine a threshold for the motif score, where the motif score threshold was set at the point where the motif enrichment proximal to the peakMax was at least 20% higher than the flanking enrichment. The region defined by the distance boundaries and the motif score threshold, was termed the ‘enrichment zone’. The enrichment zone was subsequently used to specify peakMax enriched proximal motifs. On average, an enrichment boundary was ±90 bp from the peakMax, and the motif relative score threshold was 82.  The heuristic analysis of motif enrichment across datasets reports that on average a CTCF zinger motif is enriched above a motif score threshold of 79, while for JUN the average was 86, for GABPA it was 83, and for THAP11 it was 84. CTCF and THAP11 in particular consistently have enrichment above a motif score threshold of 85 that is strongly distinct from the flanking   91 regions of similar score range, as seen in Figure 3.2A and 3.2D. The regions that flank the peakMax proximal enrichment in Figure 3.2 are representative of the background expectation of a PWM’s motif prediction. Thus, to reduce the presence of false positive predictions in subsets of peaks we analyzed, we selected, where noted in the main text, peaks with a motif scoring above the motif score threshold of 85. The use of a single threshold permits the processing of data as a single unit. A motif score of 85 is also the default threshold score in the oPOSSUM software.  3.4.10 Background expectation of motif predictions To estimate the proportion of regions in a given dataset in which motifs may result from background motif prediction, we compared the count of regions with motifs in the enrichment zone relative to the count of regions with motifs at least 50 bp outside the enrichment zone. The distal ‘zone’ from which counts were determined, was set to be the same length of sequence as the enrichment zone, that is, if the enrichment zone was 200 bp wide, then the distal zone was also 200 bp wide (100 bp from 5′ and 100 bp from 3′ of the region center). To estimate the proportion of regions in the enrichment zone with false positives, we divided the number of regions with motifs in the distal zone by the number of regions with motifs in the enrichment zone. See Appendix H for the estimated overall background expectation of ChIPped TF and zinger motif prediction.  Calculating the background corrected estimates of ChIPped TF and zinger motif proportions within a dataset was done by subtracting the distal zone count from the enrichment zone count for the ChIPped TF or each zinger. For the ChIPped TF, the corrected count was divided by the size of the dataset. For the four zingers, the four corrected counts were first summed, and then divided by the size of the dataset.   92  3.4.11 Heatmaps and correlation between zinger motifs Heatmaps were created with the heatmap.2() function from the R statistical package: gplots, with the distance measure as ‘manhattan’ and the ‘ward’ agglomeration method for clustering. The heatmap of zinger motif peak log2 fold enrichment was generated using the log2 fold enrichment of zinger motif peaks with motif score 85 or greater, relative to distal-zinger peaks of similar score range. Where the fold enrichment was below 1.5 we assigned a minimum value, represented as a grey colour in the heatmap, to facilitate visualization.  A heatmap of zinger motif inter-dependency within datasets was generated using the set of zinger motif peaks with motif scores equal to or greater than 85, and a 2x2 confusion matrix for each pair of zinger motifs. A Fisher exact P value <0.001 was taken to indicate significance and the sign of the log odds ratio to indicate whether a positive or negative association existed. The values used to generate the heatmap were 1-pvalue for positive associations, -1*(1-pvalue) for the negative associations, and 0 for the non-significant P values.  The pairwise correlation of zinger motif peaks for the different zingers, across datasets, was assessed using the log2 fold enrichment values generated for the above heatmap. The correlations were evaluated with both Pearson correlation and Spearman’s rank order correlation (R basic statistical package: cor() function).  3.4.12 ChIP-seq controls We obtained controls from a range of cell types and ENCODE consortium groups, and processed the mapped reads with FindPeaks. We used the peak height to rank the control peaks, and then selected the top 70,000 peaks. The number of peaks was chosen to match the   93 average size of the ChIP-seq datasets. The peaks were then scored with the zinger PWMs and the enrichment of the motifs with respect to the peakMax was evaluated.  3.4.13 Evaluating proximity of zinger motif peaks to genomic features We compared the genomic feature proximity of zinger motif peaks, with those peaks containing the ChIPped TF’s motif and lacking zinger motifs. We measured the distance between the peakMax and the middle of the feature, which in the case of transcription start sites (TSSs) was simply the starting coordinate of the transcript. We used only those datasets for which we had at least 200 zinger motif peaks. The number of peaks that were within 500 bp, 1 kb or 5 kb of the TSS, or within 500 bp of CpG islands, conserved regions or repeat-masked regions were compared between the zinger motif peaks and the ChIPped TF peaks using a Fisher exact test. For the results of a zinger to be considered striking we required that at least 60% of the datasets with zinger motifs show statistical significance in one direction, that is, either 60% of datasets tend to be proximal to a feature, or 60% of datasets tend to be distal to a feature.  3.4.14 Comparing zinger regions from non-zinger ChIP-seq datasets to peaks ChIPped by the zinger TF We assessed the proximity of the zinger motif peaks with a high scoring zinger motif (score >85) to ChIP-seq peaks ChIPped by the zinger’s TF to determine whether the zinger motif peaks found in datasets for which the zinger is not the targeted TF, are potential bona fide binding regions for the zinger TF. In all cases we required that the zinger motif peaks and zinger TF’s ChIP-seq data be from the same cell line. To call a zinger motif peak in agreement with the zinger TF’s ChIP-seq data we required that the peakMax of the zinger motif peak be within 100 bp of a peakMax in the zinger TF’s dataset. This 100 bp distance reflects the average range of enrichment for a TF’s motif relative to the peakMax. The assessment of the distal-zinger peaks,   94 that is, those peaks with motifs not proximal to the peakMax, relative to the zinger TF’s ChIP-seq dataset was performed in the same manner.  3.4.15 Generation of ChIP-seq peak neighbourhoods To determine the degree of recurrence for a zinger motif peak region across multiple datasets we pooled all zinger motif peaks that had a high scoring (score >85) zinger motif from all datasets. We then calculated the inter-zinger distances between each zinger motif peak and its nearest neighbour in the 3′ direction on the plus strand. Consecutive peaks that were within 50 bp of their nearest neighbour were merged into a ‘zinger neighbourhood’. The distance of 50 bp was chosen as a stringent measure of proximity between zinger motif peaks. For each neighbourhood, we counted the number of unique TFs that ChIPped the zinger motif peaks and the number of unique cell lines. See Appendix I regarding coordinates for the zinger neighbourhoods.  We generated neighbourhoods from the remaining two groups of peaks in a similar manner: those with the ChIPped TF motifs and lacking zinger motifs (‘ChIPped TF neighbourhoods’), and those without either motif (‘unidentified motif neighbourhoods’). For the ChIPped TF neighbourhoods we required that there be a high scoring motif (score >85) for the ChIPped TF. Neighbourhood widths were <150 bp on average. As stated in the main text, zinger motif peaks may be bona fide binding regions for the zinger TF. Thus, after generating the neighbourhood sets, we removed from the ChIPped TF neighbourhoods those regions that were within 300 bp (measured centre to centre) of the zinger neighbourhoods to ensure that comparisons were made between distinct neighbourhood sets. We also removed from the ChIPped TF neighbourhoods those regions that overlapped the unidentified motif neighbourhoods in the same manner.   95  3.4.16 Neighbourhood proximity to cohesin and polycomb repressive complex To assess whether a neighbourhood is proximal to a region occupied by cohesin or the polycomb repressive complex (PRC) 1 or 2, we generated three datasets by combining the ENCODE ChIP-seq data for the cohesin proteins, RAD21 and SMC3, into a one dataset; combining the ENCODE ChIP-seq data for CBX to form a dataset for PRC1 occupancy, and lastly combining EZH2 ChIP-seq data into a dataset for PRC2 occupancy. We then assessed how many zinger neighbourhoods were situated within 500 bp of one of the three protein complexes, measuring from the center of a neighbourhood to the ChIP-seq peakMax, and compared this to the two other neighbourhoods.    96 Chapter 4: Analysis of regulatory sequence TF occupancy using a biophysical simulation   4.1 Background Patterns of gene expression change over the course of tissue development or in response to physiological or environmental conditions. Central to the regulation of gene transcription, and thus gene expression, are the DNA-binding proteins called transcription factors (TFs). The existing biophysical interpretation of transcriptional regulation by sequence specific TFs can be considered to be comprised of three main components: i) the TF search through nuclear and DNA space for a target binding site; ii) DNA sequence-specific binding by a TF; and iii) regulatory control of a gene or group of genes either through direct or indirect cooperative interaction with other TFs. The identification of a TF’s binding sites (TFBSs) across the genome has, in particular, been the subject of extensive experimental and computational research. This work has resulted in models of TF binding based on collections of diverse sequence patterns that a TF specifically recognizes. However, TFBS prediction with these models suffers from poor specificity as the sequence patterns are often short and can arise in the genome by random chance.  The most common model of TF binding patterns is the position weight matrix (PWM) or derivatives thereof. This model summarizes across all known binding sites for a given TF, the frequency with which a particular nucleotide occurs at a specific position in the aligned set of binding sites, and then weights the frequency by the expected frequency of nucleotides in the sequences of interest. An extension of this model identifies and incorporates dependencies between nucleotides at different positions of the known binding sites. Regardless of which model is employed, the sensitivity is good but the specificity is poor, thus computational   97 biologists have found it necessary to incorporate additional information to constrain the genomic search space.  Phylogenetic conservation, and TFBS clustering are two common measures used to improve the specificity of PWM predictions, however, as for any purifying constraint, there is a necessary decrease in sensitivity. Most recently, TFBS prediction models have engaged epigenetic and open chromatin data as filters to identify cell specific regions that TFs will “read” as accessible for direct binding. This has resulted in an increased specificity for cell-specific binding site predictions, with DNase I hypersensitivity data identified as most impactful for these improvements [136, 137]. While ideally we want to identify all binding sites actively bound by a TF in the genome, the identification of binding site usage under specific conditions is important to understanding regulation. What we learn under cell specific conditions may advance our understanding of TF-DNA interactions such that eventually our predictive capacity need not be bound by a specific cellular context.   Additional work on understanding the specificity of TFs for particular patterns of DNA has evaluated TF-DNA interactions in terms of a ‘free energy of binding’ analogy to statistical-mechanical energy [73]. In this analogy, Berg and von Hippel demonstrate how the statistics of base-pair frequencies in the set of known binding sites contribute to an estimation of the interaction energy of a given sequence and TF. Berg and von Hippel’s free energy of binding model has subsequently been adapted by several groups to establish physically based approaches to determine the probability of TF occupancy at a genomic region under the assumption of thermodynamic equilibrium: QPMEME [77], GOMER [78], MatrixREDUCE [79], and TRAP [80]. These approaches have improved specificity to a constrained degree in their model systems of E. coli and S.cerevisiae, and have introduced the concept of modeling both TF binding energies and the tendency of a region to associate with a TF rather than a specific   98 single site.  GOMER’s biophysical model also evaluated the orientation and distance of binding sites at S.cerevisiae promoters, for cooperative TFs.    A non-physical component of these physically based approaches is the decision to incorporate the energetic contribution of all positions in a DNA region determined from an affinity model without discriminating between bound and unbound energy contributions. This is in contrast to the postulation of Berg and von Hippel, in the same work mentioned above [73], that TFs have two modes of DNA interaction: specific and non-specific. Non-specific interaction with the DNA-phosphate backbone permits a TF to explore a local sequence region by ‘sliding’ along the DNA for a finite distance. When a binding site is encountered a specific interaction between the TF’s DNA binding domain and the nucleotide bases of the DNA occurs. The two modes of interaction are part of a ‘facilitated diffusion’ model describing TF search mechanics. The model proposes that TFs diffuse both in 3D, through nuclear space, and in 1D, non-specifically sliding along the DNA for a limited distance before translocating (3D) to a local or distal region of DNA (presumed to be local in 3D space). This combination of 3D and 1D diffusion is proposed as an explanation for how low concentration TFs can efficiently locate to their binding sites despite a large genomic search space. A graphical animation movie hosted at: http://www.stromastudios.com/portfolio/tfs.html presents the concept of ‘facilitated diffusion’ (with artistic license). Experimental support for the two discrete modes of TF-DNA interaction and the facilitated diffusion model have been accumulating [65-67, 138].   Sequence level information alone is insufficient for the prediction of TF binding regions, particularly for the identification of those binding sites that are active only in specific cell types or conditions. TFBS predictive models that are more physically based and reflect the TF’s search environment may improve TF specific predictions and provide insight into factors that influence   99 a TF’s occupancy of a region. In this report we introduce a dynamic physical model of TF occupancy that incorporates: 1) binding energies, 2) non-specific and specific TF-DNA interactions, 3) cooperative and competitive TF-TF interactions, 4) TF concentration, 5) ‘facilitated diffusion’ TF search mechanics, and 6) sequence accessibility for TF binding. The model is run as a simulation in which TFs engage in facilitated diffusion and assess the potential binding energy and accessibility of each sequence position that the TFs locate to – thus we refer to the model as a ‘TF occupancy simulator’ (tfOS).    Working with experimental data for five well-characterized TFs, we have evaluated tfOS on 10kb long sequences containing ChIP-seq peaks for studied TFs. Each 10kb sequence contained at least one of the regions highlighted in Chapter 3: regions that were ChIPped by multiple unique TFs (‘recurrent regions’), or regions that were more specific to a given TF (‘non-recurrent’ regions). We evaluated tfOS predictions relative to ChIP-seq regions for the given TF, as a reference collection. The sensitivity (true positive rate) and false positive rates showed that open chromatin data (DNaseI-seq and Faire-seq) had the greatest impact on improving predictions at recurrent regions, while non-recurrent regions tend not to benefit from either open chromatin or histone modification data. tfOS has been designed as a biophysically based conceptual framework that can be expanded in the future to incorporate additional aspects of the TF “view” of the genome.   4.2 The transcription factor occupancy simulator (tfOS) model For the presented simulations of TF occupancy, a set of parameters were identified based on the underlying biochemical and biophysical properties of the process.  The value of each parameter should be derived from our understanding of the process, with uncertainties clearly   100 specified.  The benefit of the simulation approach is that the model can be progressively improved as our understanding develops.   The parameters for tfOS are listed in Table 4.1; most can be set by the user. Currently, the  model simulates on one sequence, with a default maximum length of 10kb. Each simulation  engages the user designated TFs over N iterations (default N=60,000); where an iteration is equivalent to a single 3D translocation, a 1D translocation of 1bp along the DNA if the TF is non-specifically interacting with the sequence, or a maintained status at a position if engaged in sequence-specific binding (Figure 4.1). By default two copies of each TF are present in a simulation, however the user can alter this parameter. This TF concentration setting, while somewhat arbitrary, was selected based on the ratio of 10kb to the amount of accessible DNA in the nucleus (~42-107 million base pairs per DNaseI-seq data), and the known range of DNA-binding TF concentrations (250 to 300,000 copies per cell [18]); given the range of concentrations observed across TFs and the potential for local regional concentration enrichment within the nucleus [139], this is a value that can be further explored in the future. Each TF in the simulation must have a user provided position frequency matrix (PFM) to define its binding properties. Each PFM is transformed into a matrix of binding energies [46] (see Chapter 4 Methods), which is used to estimate the binding energy of each sequence position for the given TF. Several groups have hypothesized that binding energies are related to the information content of binding site profiles, which has been supported by work from Liu and Clarke [140], and Maerkl and Quake [47]. While not explicitly stated within Maerkl and Quake’s work, there is support for Berg and von Hippel’s hypothesis of a binding energy threshold, where given sufficient mutations to a consensus binding site, the binding energy of the TF of interest is at a constant state, and is interpretable as a non-specific energy of interaction with the DNA phosphate backbone (Maerkl and Quake’s Figure 2 is re-interpreted in Figure 4.2).  A    101 Table 4.1 tfOS parameters Parameter Default value Impact Set by Sequence maximum length 10kb global fixed Number of sequences in one simulation 1 global fixed Protein collision fall off DNA, P=1 global fixed Bound state ends fall off DNA, P=1 global fixed Iterations 60,000 global user Binding affinity profile Gibbs free energy matrix (GfEM) per TF user provides TF profile binding energy threshold SD = 3 global user Protein fall-off distribution 1=uniform 2=Gaussian per TF user TF copy number 2 per TF user Sliding distance 42 per TF user Maximum distance to hop 45 per TF user Hop chance P=0.70 per TF user Jump chance P=1-(Hop chance) per TF - Pioneer TF false per TF user Palindromic false per TF user Sequence accessibility data >0 (indicates accessibility) global user provides data file Cooperative weight - per TF pair user Cooperative max distance - per TF pair user Cooperative state - per TF pair user The parameters with a global influence are used generally, for all TFs submitted to a simulation, while parameters with a ‘per TF’ influence are specific to a designated TF.     102  Figure 4.1 A graphical depiction of a tfOS simulation.   A TF can be in one of three states: free, tethered, and bound. (A) In every simulation the TF is initiated into a free state, from where it will be randomly assigned a position along the sequence. (B) If a free protein tries to engage with the DNA at a location that already has a protein present, then the TF remains free. (C) If the DNA is open when the free TF translocates (hop or jump), then the TF enters the tethered state. From a tethered state, a TF can (D) slide non-specifically along the DNA retaining its tethered state, (E) probabilistically disassociate from the DNA or disassociate due to a collision with another protein, or (F) encounter a binding site with sufficient energy to bind the TF. If a TF becomes bound it (G) remains bound in a process probabilistically dependent on the amplitude of the binding energy, however eventually the decision (H) to break the bound state occurs, and with the release of the binding energy the bound protein disassociates from the DNA and enters the free state. (I) After 60,000 iterations of the preceding processes, the simulation ends.   binding threshold is set in the simulation at a standard deviation (SD) of 3 from the mean of all interaction energies across the sequence. The SD value was set by comparing the distribution of energies of the known binding sites of 10 TFs relative to the distribution of energies for ten 10kb, background sequences; a SD of 3 is the minimal point of overlap between the binding site and background distributions for 8 of the 10 TFs (see Chapter 4 Methods and Appendix K, Figure K.1). In future versions of tfOS, we will incorporate TF specific thresholds based on the   103 work in Chapter 2. As tfOS runs, it records the number of iterations for which each base pair is bound. We anticipated that regions of high TF occupancy in the simulation (e.g. 200bp windows) would agree with a reference collection of TF::DNA interactions, such as ChIP-seq data, for the modeled TF.   Figure 4.2 A non-specific energy is evident after mutating the MAX consensus binding site. A non-specific constant energy of interaction was apparent, yet not explicitly noted, by Maerkl and Quake, who measured and predicted the change in interaction energies for MAX, given all possible mutations of the MAX consensus binding site (6bp wide). The x-axis is the predicted change in energy (ΔΔG) from a TF binding profile. The left y-axis is the experimentally measured change in energy. The right y-axis shows the approximate number of mutations in the corresponding energy range. After two mutations (1/3 of consensus site) any successive mutations did not alter the energy of interaction (red oval), which is consistent with a non-specific energy mode of interaction with the DNA phosphate backbone. This figure has been redrawn from Maerkl and Quake’s study [47].   At the start of the simulation, the first iteration, each protein is randomly associated to a position along the sequence. A random direction for 1D non-specific sliding along the DNA is selected and the protein slides in that direction for a default maximum of 42bp, with a selected probability of falling off before the maximum distance. This maximum distance, of 42bp is consistent with estimates of 40-50 bp in experimental work by Gowers et al. [65] and Hammar et al. [67], with the specific value being arbitrarily selected based on unrelated literature. A protein non-  104 specifically interacting with the DNA is called ‘tethered’. If a TF’s next move is to a location at which another protein is already present, and the two proteins were not designated by the user as cooperative, then the TF-TF interaction is considered competitive and each tethered protein will fall off the DNA (this setting should be a focus of future investigations). A protein that has been disengaged from the DNA then undergoes a 3D translocation to either a local region of DNA linearly proximal to the previous position – called a ‘hop’, or it translocates to a randomly selected location along the sequence that is proposed to be proximal spatially although distal linearly due to chromatin folding – this is called a ‘jump’ (Figure 4.3) [141]. By default a hop takes place in 70% of the 3D translocation decisions and will not exceed 45bp from the point of  disassociation (both parameters can be set by the user). The hop distance and probability are arbitrarily chosen values but are intended to model the tendency of a TF to spend time within a local region [67].  While a TF is engaged in 1D translocation along the DNA, it may encounter a site that is both in the same orientation as the protein’s DNA binding domain and has sufficient affinity (equal to or greater than the binding energy threshold) for the TF’s binding domain to engage the DNA in a specific interaction i.e. the TF is ‘bound’. For some TFs, orientation of the protein with respect to a binding site is not a concern as the binding pattern is palindromic; for these TFs a palindromic flag can be set to disregard orientation. The number of iterations for which the TF remains bound probabilistically depends on the strength of the binding energy at that position. If cooperative interaction data is included in a simulation then an encounter between two     105  Figure 4.3 tfOS search mechanics.   A TF encounters the DNA and slides along the double helix axis in a non-specific interaction with the phosphate backbone (tethered).  If the TF encounters a binding site it engages in a specific interaction (bound) and probabilistically remains at that position based on the strength of binding energy. If no binding site is reached the TF continues sliding until it probabilistically disassociates from the DNA after a given distance. A ‘free’ TF diffuses through nuclear space in either a hop or a jump until it re-engages on the DNA.  When a protein encounters another protein along the DNA, and they are not cooperative proteins, a collision ensues that results in tethered proteins disassociating from the DNA.   cooperating proteins can result in a further stabilization of those two proteins on the DNA, as has been shown by experimental work with SOX2 and POU5F1 [142]. Currently, the user provided interaction weight results in a simple fold increase in the probability of binding, as there is a dearth of quantitative values for the stabilization contribution of interacting proteins. tfOS allows the user to choose whether both proteins must be at binding sites above the given binding threshold for the interaction to increase stability of the DNA interaction, or whether only one of the proteins must be at a potential binding site. The former option is suitable not only for direct interactions between neighbouring TFs but also for cooperating TFs that may not be in   106 direct contact. Cooperating TFs both participate in transcription regulation of the same gene(s) and thus may bind the DNA individually yet both contribute together to the stability of the regulating machinery. The latter option makes allowance for indirect interactions, or low affinity binding sites that normally would not bind the TF but due to the presence of an interacting protein the affinity of the TF for the DNA is increased, such as was demonstrated for the ETS-1 TF during interaction with PAX5 [143].   Inclusion of sequence accessibility data into a simulation is optional, and when included acts to constrain TFs from binding within sequence regions unless the sequence is flagged as accessible; although the TFs remain able to tether to regions that are not accessible. Multiple data types for active histone modifications or open chromatin data are compressed into a single score representing the accessibility of each base pair in the sequence. Similarly for repressive histone modifications or nucleosome position data. If both active region and repressive region data is provided, only the active data is incorporated. If a protein is a pioneer TF, a sub-class of TFs that can act to convert inaccessible regions to accessible, a parameter flag can be set to indicate that this particular TF will ignore sequence accessibility data.  4.3 Results 4.3.1 A single sequence example using FOXA2 occupancy predictions demonstrates specificity improvements with H3K4me1 data or HNF4a co-regulation As an initial evaluation of simulator performance, we ran five simulations with the TF FOXA2 (comparing to reference data from HepG2 cells), using default settings, on a 10kb sequence containing at least one FOXA2 ChIP-seq region that did not overlap recurrent regions (regions repeatedly ChIPped across multiple ChIP-seq datasets). The bound occupancy results of the five simulations were combined into a single set of bound counts. The five simulations were run   107 for each of 10 scenarios to evaluate the contributions of various data types (default simulator settings) to the TF’s ‘view’ of the genome: 1) only binding energies, 2) cooperativity with HNF4a, 3) H3K4me1 properties (based on ChIP-seq data), 4) H3K4me3 properties (based on ChIP-seq data), 5) combining both H3K4me1 and H3K4me3 properties, 6) co-regulation with HNF4a, H3K4me1 and H3K4me3 properties, 7) DNaseI-seq accessibility properties, 8) Faire-seq accessibility properties, 9) combining both DNaseI-seq and Faire-seq (open chromatin), and 10) cooperativity with HNF4a, DNaseI-seq and Faire-seq data. As a baseline, we also performed classical PWM analyses on the same sequences used in the simulations, thresholding the PWM binding site predictions of each TF at a stringent relative motif score of 85 or alternatively using the TF-specific thresholds determined in Chapter 2 (see Chapter 4 Methods for the specific thresholds), and then calculating windows of occupancy from the results.  We used a receiver operating characteristic (ROC) analysis of sensitivity (true positive rate) and false positive rate to compare the 200bp occupancy of FOXA2 relative to FOXA2 (HepG2) ChIP-seq identified regions. The area under the curve (AUC) of a ROC curve indicates predictive performance, where 1 is optimal and 0.5 is considered random. Figure 4.4 presents both the bound occupancy counts (y-axis) of FOXA2 and a ROC curve, for scenarios 1, 2 and 6, on a single genomic region. The AUC for classical PWM-derived occupancy (Figure 4.4A and Appendix K, Figure K.2) are in close agreement with tfOS results for binding energies alone (0.674 for the former and 0.697 for the latter). The AUC for scenarios 1 (Figure 4.4B, binding energies alone) and 7-9 (inclusion of DNaseI-seq or Faire-seq) was less than 0.697, with DNaseI-seq detracting from the binding energy AUC. Scenario 2, which included co-regulation with HNF4A, increased the AUC to 0.757 (Figure 4.4C). Scenario’s 3, 4, and 5, which included H3K4me1 and H3K4me3 data, increased the AUC to greater than 0.960 with H3K4me1 contributing the most to the AUC (Figure 4.4D both H3K4me1 and H3K4me3). Thus, both   108    109 Figure 4.4  FOXA2 occupancy counts on a single sequence. In each panel, the left plot presents 200bp window bound occupancy counts (y-axis) greater than 0 (blue ovals), and the coordinate location of the region simulated on (x-axis). The vertical red bars highlight the windows that overlap FOXA2 ChIP-seq regions. The right-side plots are ROC curves, with true positive rate on the y-axis and false positive rate on the x-axis. The bound occupancy counts of FOXA2 on a single 10kb non-recurrent sequence (chr10:91152983-91162983) are presented for (A) classical PWM predictions, and (B-C) three simulations, each simulation incorporating different data parameters: (B) No sequence accessibility data (binding energies alone). (C) Co-operative association with HNF4A when both proteins were within 200bp of each other (no sequence accessibility data). (D) Combined H3K4me1 and H3K4me3 sequence accessibility data.    co-regulation and active H3K4me1 modification information individually improve the sensitivity and false positive rate of FOXA2 simulated occupancy predictions.   4.3.2 Open chromatin data improves the specificity of occupancy predictions for a subset of regulatory regions The work described in Chapter 2 and 3 proposes that ChIP-seq datasets are a mixture of regions that are either directly or indirectly bound by the ChIPped TF, and that a subset of the ChIPped regions are present in multiple other TFs datasets. tfOS presented an opportunity to test whether those regions that are ChIPped by multiple TFs versus regions, which are more specific to the immunoprecipitated TF, have different properties.    The first set of sequences we simulated on were one hundred 10kb regions centred on ChIP-seq peaks ChIPped by more than four TFs (we hereafter refer to such regions as ‘recurrent regions’). These regions were randomly selected for FOXA2 and four other TFs, SRF (GM12878), REST (H1-hESC and HepG2), POU5F1 (H1-hESC), and JUND (K562 and   110 HepG2).  Five simulations were run on each sequence, with the same set of parameters and data types as the FOXA2 simulation above. We also performed classical PWM analyses using the two thresholding approaches (a single setting of 85 for all, or a TF specific threshold as described in Chapter 4 Methods) on the same sequence sets, as a comparative measure to simulations using binding energies alone (Table 4.2, and Appendix K, Figure K4.3, Appendix L, Table L.1). The PWM results are in agreement with the tfOS results (Wilcoxon test significance threshold P >0.001). The exceptions are SRF, where the classical PWM performs better than the simulations (Wilcoxon test P=0.0007), and REST, where the simulations and classical PWM   Table 4.2 Recurrent regions analyses’ mean AUC values for 100 regions per TF Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) classical PWM TF specific threshold 0.505 0.601 0.649 0.561 0.509 0.549 0.563 classical PWM  single stringent threshold 0.517 0.531 0.566 0.536 0.499 0.551 0.554 no accessibility data 0.419 0.629 0.674 0.529 0.478 0.548 0.557 DNaseI-seq 0.576 0.716 0.680 0.592 0.597 0.694 0.757 Faire-seq 0.540 0.570 0.580 0.537 0.552 0.714 0.749 DNaseI-seq and Faire-seq 0.598 0.724 0.696 0.611 0.608 0.772 0.796 H3K4me1 0.479 0.587 0.596 0.563 0.524 0.596 0.638 H3K4me3 0.572 0.704 0.662 0.568 0.574 0.677 0.689 H3K4me1 and H3K4me3 0.545 0.697 0.673 0.590 0.560 0.666 0.640 Values in bold indicate the greatest mean AUC value in each TF column. Cells highlighted in grey indicate that the simulations’ AUC results are greater (Wilcoxon one-tailed test significance threshold P <0.001) than results from simulations lacking chromatin accessibility data.    111  Figure 4.5 Recurrent regions analyses’ AUC values for FOXA2 and JUND The plot displays the AUC values for simulations on ten 10kb recurrent regions for (A) FOXA2 and (B) JUND using binding energies alone (black), or sequence accessibility data from: combined DNaseI-seq and Faire-seq (dark blue), DNaseI-seq only (bright blue), Faire-seq only (light blue), combined H3K4me1 and H3K4me3 (dark green), H3K4me1 alone (bright green), or H3K4me3 alone (olive green). The height of each line is the AUC (y-axis). The sequence regions are ordered by the AUC from simulations without sequence accessibility data (black).    TF specific threshold achieve greater AUC values than the classical PWM with the stringent threshold (motif score >=85) (Appendix L, Table L.1).  The simulation results were evaluated relative to the ChIP-seq reference datasets using ROC analyses. Open chromatin data (DNaseI-seq and Faire-seq) contributes the most to improving   112 sensitivity and specificity of occupancy predictions relative to simulations using binding energy alone (Wilcoxon one-tailed test significance threshold P <0.001) as reported by the AUC of the recurrent regions (exemplified in Figure 4.5 with a random subset of 10 sequences for FOXA2 and JUND (HepG2)). The mean AUC from the set of 100 sequences for each TF is listed in Table 4.2, and Wilcoxon one-tailed test P values reporting the significance of the improved sensitivity are in Appendix L, Table L.2.  4.3.3 Occupancy predictions of non-recurring ChIP-seq regions are varied in response to histone modification data or open chromatin data We repeated the previous simulations on regions with a central ChIP-seq peak that did not overlap the recurrent regions (“non-recurrent regions”) and are thus likely to be specific to a given TF. We required that the ChIP-seq peak central to each region be at least 1000bp distant from recurrent regions. Again, 100 randomly chosen regions were extended to 10kb and five simulations were run per sequence.  In contrast to the recurrent regions, simulations with open chromatin data frequently resulted in a reduced mean AUC relative to the ‘binding energy alone’ simulations; only POU5F1 demonstrated an increase in mean AUC values (Wilcoxon one-tailed test significance threshold P <0.001) with the inclusion of open chromatin data (Table 4.3, and Appendix L, Table L.3). Also in contrast, the non-recurrent regions ‘binding energy alone’ results tend to have a higher mean AUC than the simulations on the recurrent regions, and three of the TFs (SRF, REST, and JUND (HepG2)) achieved their greatest mean AUC simulating with binding energy alone (significantly greater than simulations with open chromatin data; Wilcoxon one-tailed test significance threshold P <0.001). The remaining TF, FOXA2, demonstrated an increase in AUC values (Wilcoxon one-tailed test P value <2.036e-09) with the inclusion of H3K4me1 data.   113 Figure 4.6 presents the individual FOXA2 (HepG2) and JUND (HepG2) AUC values for a subset of 10 sequences.    We again performed classical PWM analyses on each of the sequences used in the simulations, and calculated PWM occupancy from the results. The PWM AUC values tend to be in close agreement with simulation results from binding energies alone (Wilcoxon test significance threshold P >0.001; Table 4.3, and Appendix K, Figure K.4). The exception is REST, where the simulations and classical PWM TF specific threshold achieve greater AUC values than the classical PWM with the stringent threshold (motif score >=85) (Appendix L, Table L.4).  Table 4.3 Non-recurrent regions analyses’ mean AUC values for 100 regions per TF Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) classical PWM TF specific threshold 0.629 0.605 0.651 0.608 0.554 0.629 0.623 classical PWM single stringent threshold  0.634 0.535 0.607 0.600 0.549 0.634 0.617 no accessibility data 0.650 0.614 0.685 0.604 0.563 0.629 0.633 DNaseI-seq 0.510 0.534 0.556 0.507 0.536 0.717 0.573 Faire-seq 0.507 0.522 0.521 0.503 0.521 0.741 0.586 DNaseI-seq and Faire-seq 0.509 0.544 0.565 0.512 0.539 0.795 0.610 H3K4me1 0.522 0.545 0.534 0.527 0.522 0.706 0.738 H3K4me3 0.511 0.514 0.508 0.502 0.500 0.607 0.579 H3K4me1 and H3K4me3 0.528 0.545 0.535 0.523 0.522 0.710 0.748 Values in bold indicate the greatest mean AUC value in each TF column. Cells highlighted in grey indicate that the simulations’ AUC results are greater (Wilcoxon one-tailed test significance threshold P <0.001) than results from simulations lacking chromatin accessibility data.    114   Figure 4.6 Non-recurrent regions analyses’ AUC values for FOXA2 and JUND The plot displays the AUC values for simulations on ten 10kb non-recurrent regions for (A) FOXA2 and (B) JUND using binding energies alone (black), or sequence accessibility data from: combined DNaseI-seq and Faire-seq (dark blue), DNaseI-seq only (bright blue), Faire-seq only (light blue), combined H3K4me1 and H3K4me3 (dark green), H3K4me1 alone (bright green), or H3K4me3 alone (olive green). The height of each line is the AUC (y-axis). The regions are ordered by the AUC from simulations without sequence accessibility data (black).    4.3.4 Co-regulation driven stabilization of binding improves simulations of FOXA2 occupancy but not POU5F1 occupancy   TFs do not function as independent units, but often interact with other proteins either directly or indirectly as part of the greater regulatory machinery. These associations have been identified   115 as a stabilizing force on the interaction between TF and DNA [142, 143], and contribute to the specificity of TFs for select binding sites. As such, we wondered if cooperative activity would contribute to the predictive power of tfOS.  We selected FOXA2 and HNF4A as one interaction set, and POU5F1 and SOX2 as another.  We modeled FOXA2 and HNF4A as capable of either direct or indirect interaction (indirect through the regulating machinery), by permitting stabilizing activity between bound proteins up to 200bp apart.  With regard to POU5F1, recent work has shown that POU5F1’s binding preference in H1-hESC cells is to bind at the same locations where SOX2 is already present [142]. The current version of tfOS cannot specify that TF-A can bind only in the presence of TF-B, instead we required that both POU5F1 and SOX2 proteins be present at binding sites within 24bp of each other.   For the sake of continuity, we simulated on the recurrent and non-recurrent 10kb sequence sets previously used for FOXA2 and POU5F1. The simulations were run to evaluate the contribution of cooperative interactions 1) without additional data, 2) with open chromatin data, and 3) with H3K4me1/H3K4me3 chromatin data.  Two copies of each TF pair were present in each simulation.  Neither FOXA2 simulations with HNF4a, nor POU5F1 simulations with SOX2 resulted in an over-all notable increase in AUC values from the co-operative interactions (Table 4.4; Wilcoxon one-tailed test significant threshold P  <0.001). However, on an individual sequence basis, as seen in Figure 4.4D, cooperative interactions in tfOS may contribute to improved specificity – approximately 15-20% of the simulations achieved an increase in the AUC of at least 0.10.    116 Table 4.4 The mean AUC values from simulations of cooperative interactions on 100 sequences per TF  FOXA2 (<200 bp HNF4A) POU5F1 (<24 bp SOX2) Data Recurrent regions Non-recurrent regions Recurrent regions Non-recurrent regions no accessibility data 0.593 (0.557) 0.662 (0.633) 0.550 (0.546) 0.655 (0.629) DNaseI-seq and Faire-seq 0.806 (0.796) 0.609 (0.610) 0.771 (0.772) 0.801 (0.795) H3K4me1 and H3K4me3 0.665 (0.640) 0.762 (0.747) 0.678 (0.666) 0.713 (0.710) For comparison, values in brackets are the AUC values without simulated interactions. Grey cells highlight the mean AUC values from simulations using co-operative interactions but not sequence accessibility data.   4.4 Discussion In the pursuit of understanding the regulation of gene expression, computational identification of regulatory regions bound by a specific TF remains an area of extensive research. Predictions based on the known sequence patterns a TF binds, are subject to an unfortunate false positive rate, particularly when the binding profile is short and/or has low information content. It is clear that TF recognition of a binding site does not rely on sequence level information alone. Advanced motif and biochemically-inspired approaches incorporate genomic information beyond the sequence level to improve the false positive rate of predictions. However these methods fail to account for the specific and non-specific protein-DNA interactions of TFs, and are not readily adaptable for the influence of local chromatin and epigenetic properties. We have thus introduced a new approach to the identification of cis-regulatory regions that incorporates these properties and has an open, flexible framework for the testing of future biophysical models.    117  We used tfOS to assess a subset of the regions that we identified in Chapter 3. The first set of 10kb sequences simulated upon sequences containing regions we had identified as being recurrently ChIPped by 4 or more different TFs. Those regions not flagged as recurrent provided the pool for the second set of 10kb sequences, which are likely to be specific to a given TF. Despite 10kb spanning more nucleotides than the ChIP-seq regions (~60-fold greater) we found that the recurrent regions generally have a lower AUC value from the ROC analyses than do the non-recurrent regions (consistent with greater contribution of the ChIPped TF’s binding site on the non-recurrent regions). Simulations that included DNaseI-seq and Faire-seq data showed improvement in the AUC for the recurrent regions. This is in accordance with our observation in Chapter 3 that DNaseI-seq and Faire-seq data are enriched for zinger motifs, which are in turn a component of recurrent regions. On the other hand, importantly, the non-recurrent regions tend to have a reduced AUC when open chromatin data is introduced. H3K4me1 data results in a TF-specific improvement in AUC for FOXA2, but little improvement for the remaining TFs regions. Thus, the tfOS simulations within the scope of the five TFs tested support our previous observations that those regions that recur across multiple ChIP-seq datasets have different properties than those regions which are specific to a TF.   Prior work, using Bayesian mixture models (CENTIPEDE) [136] or probabilistic models (FIMO) [137], have proposed that DNaseI-seq data is the most beneficial for obtaining TF occupancy predictions that are in agreement with ChIP-seq data, while smaller gains are derived from histone modification data.  While the recurrent regions we simulated on are in agreement with this observation, the non-recurrent regions did not gain in predictive value from either DNaseI-seq or Faire-seq data. Restated, we observe that the recurrent regions, which can compose a large subset of a ChIP-seq dataset, are better detected with open chromatin information, but   118 TF-specific regions are not. We should note that the interpretation of DNA accessibility data within the aforementioned static models differs from tfOS, as the alternative static models use the continuous signal values while tfOS converts signal values into binary categories. Incorporation of the continuous signal values of sequence accessibility datasets may contribute to an improvement in AUC values across all simulations, and will be incorporated into a future version of tfOS. However, we do not expect that use of the continuous signal values will change the key finding: recurrent and non-recurrent regions have different properties.   Despite the improvements introduced by considering sequence accessibility, the AUC tends to remain below 0.80, suggesting that there are further determinants for a TF’s choice of binding site that were not represented in the simulations. We therefore modeled cooperative behavior between two TFs. An improvement in the AUC results is apparent for a small subset of the sequences tested (14-20%). This is an area that remains to be developed further within the simulator, such as providing parameters to specify an orientation for direct TF::TF interactions, or allowing for binding dependencies e.g. TF A cannot bind unless TF B is present. We can additionally explore the impact of increased TF copy number within tfOS on TF interaction results. There is also a lack of data in the literature informing of the stabilization contribution of interacting TF proteins. However, a study this year using single-molecule imaging reported an approximate 1.2-fold increase in dwell time when the two TFs, SOX1 and POU5F1, are both present with DNA binding domains and interaction domains intact. As availability of single-molecule experiments increases, the impact of TF protein interactions on TF::DNA binding dynamics will be better represented in tfOS.    The tfOS results highlight that the “one size fits all” approach of many computational approaches is best set aside if we hope to understand how a given TF contributes to   119 transcriptional regulation. While the search for a global element that impacts the most factors being studied is appealing, it is likely that global trends are limited in their capacity to understand the local mechanistic details of a gene’s regulation. This transition in research focus from global studies of TFs back to a specific biochemical focus on well-defined proteins makes such methods as GOMER, CENTIPEDE, FIMO [78, 136, 137] and tfOS ideal as they evaluate the contributions of different aspects of the genomic environment at a TF and sequence specific level. The simulator in particular, provides a flexible framework within which not only is sequence accessibility modeled, but the contribution of TF concentration (and thus molecular crowding), cooperative interactions, and search mechanics may be explored. In addition, new biophysically inspired parameters can be fit into the modular framework.  There remains room to improve the tfOS predictions, which may be accomplished by integrating additional information on the genomic environment. One avenue might be to include multiple sequences that have been shown to interact via chromatin interaction data. This framework would conceptually allow a model of a transcription factory, where co-regulated genes and the regulating TFs may congregate in a discrete nuclear space. Another approach would be to incorporate the chromatin shape, as at least some TFs are known to use distinct major or minor groove geometries [144-146].   tfOS is a flexible biophysically based model, that offers a rich opportunity for exploring the contribution of numerous biophysical parameters to the capacity of a TF to find the regulatory regions it interacts with.    120 4.5 Methods 4.5.1 Pre-processing of data for simulation Pre-processing of data for simulation is implemented in Perl. Data that are pre-processed are the potential binding energy of every position of the sequence with respect to each TF in the simulation, the binding energy threshold for each TF, and the optional sequence accessibility data if provided by the user. The user submitted matrix of TF binding properties (a PFM), which provides the frequency of each nucleotide at a given position in an alignment of known binding sites, is used to generate a binding energy model (Gibbs free energy model: GfEM) for each TF; this is accomplished through a combination of in-house scripts and the TFBS Perl module [112]. A GfEM matrix for scoring the binding energy potential of a sequence is generated as follows:   Each TF binding profile is first converted to a weighted frequency matrix (see also Appendix A, Figure A.1): Wx,i = , where                                                        Eq.1 Wx,I  is the background weighted frequency for nucleotide x, at position i of the aligned known binding sites of the given TF.   is the weighted frequency of nucleotide x, at position i   is the frequency of nucleotide x, from the genomic background (default is 42% GC, 58% AT)  is the count of nucleotide x at position i, in the set of aligned known binding sites (given by the PFM)    is the number of known binding sites that contributed to the matrix; it acts with  dually as a confidence weight and as a pseudocount to prevent log2 0.  fx,ibx€ fx,i=cx,i+ bxnn + n€ fx,i€ bx€ cx,i€ n€ bx  121 The weighted frequency matrix (W, of length i) is then converted to a GfEM using the standard state Gibbs free energy equation (ΔG=RTlnKA):  (kJ/mol), where is Wx,I                                       Eq.2  is the Gibbs free energy of reaction contributed by each nucleotide x, at position i.  is a constant in the Gibbs standard-state free energy equation (R=8.314472/1000 kJ/mol*K; T=300 K)  is the equilibrium association constant, with the value, Wx,), from  the weighted frequency matrix  The GfEM is then used with the TFBS Perl module to score the sequence, obtaining the potential binding energy of each position. This is done for each unique TF.  The binding energy threshold was determined as the number of SDs from the mean of the genomic background scores (ten 10kb sequences) at which the left tail of the scores of known binding sites for the majority of TFs (8 of 10) were located. The majority of TFs’ known binding sites scores are in the right tail of the genomic background distribution. We selected a SD of 3 to be applied to all TFs, acknowledging that this could be restrictive for a minority of TFs. During the pre-processing of binding energies for each TF, the energy of the binding threshold is determined by calculating the binding energy at 3 SDs to the right of the mean of all binding energies for the given TF. An interaction energy below the binding threshold is indicative that the location will non-specifically interact with the given TF.   Sequence accessibility information, such as open chromatin or histone modification data, is optionally provided by the user. By default the entire sequence being simulated on is accessible ΔGx,i =RT lnKAx ,i KAx ,iΔGx,iRTKAx ,i  122 for binding should a site with sufficient binding energy be present. If data for accessible or repressed chromatin regions is provided, then TF binding can only occur in regions that are accessible or not repressed. If both open regions and repressed regions are provided, the accessible chromatin data will over-ride the repressive data. The sequence accessibility input data is required to have four fields: chromosome, start coordinate, end coordinate, and signal value. The Perl code extracts the signal value for regions that overlap the coordinates of the sequence being simulated on. If more than one dataset is provided, the highest signal is kept for regions where more than one data set has a value above zero. The value of the signal is not currently used in the simulation but in the future will be a weight on the accessibility of a region for binding. Currently, data that is repressive will be translated into ‘accessible’ regions by recording a signal of ‘1’ for those regions that are not represented in the repressive dataset.   The remainder of the user specified parameters are formatted into a single file with the sequence, binding energy and accessibility data, to subsequently be submitted to tfOS.  4.5.2 Model framework of the TF occupancy simulator tfOS is implemented in C++, and takes as input a single file generated by the pre-processing Perl code. The starting position of each TF in a simulation is randomly selected from across the length of the sequence and the TF is registered as tethered. The position that a TF is tethered (or bound) at is recorded as the length of the TF’s binding energy profile.  In the next iteration, each TF randomly selects a direction for 1D translocation, and if not a palindromic TF then an orientation is also randomly selected. The TF then moves 1bp along the DNA in a non-specific interaction with the DNA phosphate backbone.     123 4.5.2.1 TF binding decision If another protein is not already at the new location, then the binding potential of the new location is evaluated. The local region around the TF is also assessed for potential cooperating proteins that are within a user specified maximum distance; information pertinent to potential protein cooperation becomes part of the binding decision process. The probability of the sequence being bound is evaluated when: 1) the TF-DNA interaction energy is greater than the binding threshold, 2) the binding site is in the same orientation as the TF (or the TF is palindromic), and 3) the sequence is marked as accessible at that position or the TF is flagged as a pioneer. If one of these three fails, then the protein will remain tethered.   The probability of occupancy (Pocc) is determined using a similar approach to GOMER [78], TRAP [80], and MatrixREDUCE [79], and derives from the works of Berg and von Hippel [73] and Stormo and Fields [46]. The probability of a location being bound, Pocc, is as follows:       Pocc = [D ::T ][D]+[D ::T ]                                                   Eq.3 where [D ::T ]  is the concentration of DNA bound by a TF, and  is unbound DNA, i.e. the subset of DNA with sufficient affinity for the TF that it can specifically interact with the TF.  Given the association constant, KA = [D ::T ][T ][D]  , then [D ::T ]= KA[T ][D] , where [T] is the concentration of unbound TF.  Thus, Pocc can be written as,            KA[T ]KA[T ]+1   = [T ][T ]+ 1KA    = 11+ KD[T ] , where KA−1  = KD                     Eq.4 If we assume that the concentration of disassociated TF, [T], is equivalent to the disassociation constant of a low affinity binding site, KD*  (low affinity sites are the most common type of binding [D]  124 site for a TF), then given the standard state Gibbs free energy equation, ΔG = RTlnKA, Pocc becomes:  Pocc = 11+ KDKD*  = 11+ e−ΔG−ΔG*RT                                          Eq.5  whereΔG −ΔG *  is the change in interaction energy between the site of interest and the binding threshold (a similar assumption is made in GOMER, TRAP, and MatrixReduce).  Accordingly, where a site in the sequence has the same energy as the threshold, the probability of binding is 0.5, thus, effectively, the TF will only pause at this position for at most a few iterations before transitioning to a new location.   If cooperative associations between two TFs (TA, TB ) have been included in the simulation then these associations are evaluated as part of the probability of binding. The binding energy of the cooperating TF pair is the maximum of the two individual TFs’ binding energies, max(ΔGTA ,ΔGTB ). The probability of binding, which is determined from the maximum energy, is then weighted by the user provided stabilization factor (i.e. if the weight is 1.3 then the probability is  P *1.3). Experimental data on stabilization is difficult to find within the literature, we arbitrarily chose a stabilization factor of 1.3.   If a binding probability is equal to or greater than 1 (interaction weights can result in P>1), then the probability is set to the upper limit of 0.99 to allow a TF the chance of disassociating from the DNA. Once a binding probability is determined it is compared to a random number selected between 0 and 1, and if the probability value is higher than the random number the protein(s) is registered as bound, otherwise the protein(s) is registered as tethered.     125 4.5.2.2 Competition If the new location that the TF was intended to move to is already occupied by another protein, then a competitive interaction ensues. TFs that are merely tethered fall off the DNA in response to such an interaction, while bound TFs remain bound.  4.5.2.3 3D translocation A TF that has fallen off the DNA is translocated to a new location on the sequence in the next iteration through either a hop or a jump. A hop is a short local 3D translocation that places the TF at a maximum of 45bp from the position at which it fell off. A jump is a similarly short translocation, however it models chromatin folding which brings a distal portion of the sequence into close proximity in 3D space, and thus the protein jumps a short distance to a linearly distant region along the DNA. The decision to hop occurs for ~70% of the 3D translocation decisions, while the distance of a hop from the previous location is a random number of base pairs between 1 and 45. The TF obtains a new random direction and orientation on the DNA after a jump or a hop.  4.5.2.4 1D translocation A TF that is non-specifically sliding along the DNA has a maximum limit of 42bp to slide [65, 67]. The point at which a TF transitions from tethered to falling off of the DNA is determined probabilistically, using either a uniform or Gaussian probability density function. In effect, the mean sliding distance of all TFs is approximately 20bp, although with a uniform process the number of TFs that slide 10bp or 30bp is the same as 20bp, whereas with the Gaussian process, fewer TFs slide to 10bp or 30bp.  Uniform function:  P = 1maxSlide− (currentSlide−1)    126 maxSlide, is the maximum distance the given protein can slide before disassociating currentSlide, is the sliding distance already achieved  Gaussian function:  P = 1σ 2π e−(x−µ )22σ 2  σ , is 0.15*maxSlide µ , is 0.60*maxSlide x , is the sliding distance already achieved  As with the binding probabilities, a random number is generated and if the probability of continuing a non-specific slide is greater than the random number, the TF remains tethered, otherwise it disengages from the DNA and is recorded as “free”.   4.5.3 Data sets TF ChIP-seq data (FOXA2 (HepG2), SRF (GM12878), REST (H1-hESC and HepG2), POU5F1 (H1-hESC), and JUND (K562 and HepG2)), histone modification ChIP-seq (H3K4me1 and H3K4me3), and DNaseI-seq, and Faire-seq data in HepG2, H1-hESC, K562 and GM12878 cell-lines were generated by the ENCODE consortium [29] and downloaded from the UCSC databases [110].   PFM matrices were obtained from JASPAR 2014 [147]. The matrix used for POU5F1 analyses, was positions 8-15 of the composite POU5F1::SOX2 PFM in JASPAR 2014.    127 4.5.4 Recurrent and non-recurrent regions Prior work in this thesis established a set of ChIP-seq recovered regions that were recurrently identified in multiple ChIP-seq datasets (see Chapter 3). One such class was zinger neighbourhoods, that were enriched for non-targeted TF motifs at the peak maximum, and the other class was not enriched for any particular motif (unidentified motif neighbourhoods). In both cases, they did not contain the ChIPped TFs motif near the peak maximum in the datasets they were identified in, however that does not preclude them from containing the ChIPped TFs motif in one or two other datasets. The point of these regions was that they were recurrently recovered by multiple diverse TFs.     We combined the zinger and unidentified neighbourhoods into a single ‘recurrent region’ file. For each of the TF ChIP-seq datasets of interest, we assessed which peaks shared at least 1bp with a recurrent region or not. From each ChIP-seq dataset we extracted those regions that overlapped a recurrent region that had been ChIPped by at least four unique TFs (not considering TF motif families) and then randomly selected 10 peaks for simulation. From those peaks not overlapping a recurrent region, we extracted those regions that were at least 1000bp distant (measure from centre to centre) from the nearest recurrent region, and then randomly selected 10 peaks for simulation.  4.5.5 Simulations We used default parameters, as listed in Table 4.1. The fall-off distribution was Gaussian. The cooperative interactions parameters for FOXA2 and HFN4A were: weight=1.3, maxDistance=200bp, and state=bothbound (both TFs were required to be at a site that exceeded the binding threshold). The cooperative interactions parameters for POU5F1 and   128 SOX2 were: weight=1.5, maxDistance=36bp, and state=bothbound (both TFs were required to be at a site that exceeded the binding threshold).  4.5.6 Processing tfOS results tfOS reports a count of the iterations a particular position in the sequence was bound. We combined the results of the two copies of a TF in each simulation across all five simulations, by summing the bound counts at each sequence position. These results were then processed as windows of TF occupancy by summing the counts within 200bp wide sliding windows with a step of 100bp. Those windows which overlapped a ChIP-seq peak trimmed to 400bp around the peak summit for the given TF::cell type pair were flagged ‘1’ and all others were flagged ‘-1’. The ChIP-seq peak widths were ‘normalized’ to 400bp based on previously published work (Chapter 2) that showed that the ChIPped TFs motifs are enriched close to the peak maximum, with the maximum enrichment ‘zone’ being 360bp wide. To call a window as overlapping a ChIP-seq peak the two regions were required to share at least 1bp. This file was processed in R 3.1.1 [85] using the ROCR package [148] to establish true positive and false positive rates, and obtain the area under the curve (AUC).  4.5.7 Classical PWM analyses To obtain classical PWM predictions of TF binding sites, we used the same sequence set and PFM set that was used in the tfOS analyses. PFM matrices were converted to PWM matrices using the TFBS perl modules [112]. The 10kb sequences were scored with the PWM using C code adapted from the TFBS Perl Modules. The locations with a score greater than the threshold (either a stringent motif score of 85, or a motif score threshold specific to each TF as determined in Chapter 2: 82 for FOXA2 (HepG2), 80 for SRF (GM12878), 78 for POU5F1 (H1 hESC), 66 for REST (H1 hESC), 67 for REST (HepG2), and 87 for JUND (K562 and HepG2))   129 were considered ‘bound’ locations and each nucleotide position in the bound location was given the value ‘1’. The processing of the PWM results going forward from this point was the same as that described above for processing the tfOS predicted bound regions.   4.5.8 Software accessibility The software is available from GitHub https://github.com/wassermanlab/tfOS   130 Chapter 5: Conclusion  The regulation of genes is a complex process that has evolved over billions of years. Deciphering the system that has resulted from the pressures of natural selection is not a small task, however in recent years advancing technologies have given us the capacity to study organisms at the cellular, nuclear and genomic level in an unprecedented manner. Genome sequencing costs have dropped to a level that is now reasonable for most labs to take advantage of, and high-throughput protocols to analyse gene expression, protein::DNA interactions, DNA modifications, and chromatin structure at a genome-wide level have been developed. The demand for computational methods to study the large datasets generated by high-throughput studies has likewise increased. From the union of experimental and computational methods, the field of transcriptional regulation has gained insights into the location of TF binding sites for many TFs in a cell-type and condition specific manner. Insight has also been gained into changes at the chromatin level that may impact the availability of binding sites to the TF search and recognition mechanism.   The development of computational methods that best interpret and integrate experimentally derived datasets is an ongoing process, as with each iterative step in method development we learn more about the data and the problems that are inherent to each technology. Suffice it to say, there is a continuous need for new analytical approaches and interpretation of high-throughput data, just as we still have much to learn about how TFs govern gene expression. This thesis introduces novel approaches and observations that both assist and improve the interpretation of TF ChIP-seq data and detection of TFBSs, and a computational method that integrates multiple levels of experimental data in a dynamic system and provides a scaffold for future work towards better modelling and understanding gene regulation.    131  5.1 Interpretation of ChIP-seq data and TFBSs Furthering our understanding of how genes are regulated requires insight into the location and timing of sequence specific TF interaction with the DNA. It is known that TFs recognize short, degenerate 6-20bp patterns in the DNA, and that due to random chance non-functional motifs similar to these patterns arise. Therefore experimental data, such as ChIP-seq, may be used to reduce the computational search space for prediction of those sites to which a TF specifically and functionally binds. ChIP-seq experiments, with their greater than ChIP-chip resolution, have opened the door to regulatory element identification in an unparalleled way. However, like all experimental data, there is inevitable experimental noise or bias.  Deciphering which experimentally determined regions are functionally bound by the ChIPped TF is not simple, however the first step is to determine which ChIP-seq regions are supported by the presence of the ChIPped TF’s motif.   The initial assessment of a ChIP-seq dataset is to evaluate which motifs are over-represented. The selection of a background is critical to avoid biases in such analyses, and yet TFBS over-representation software, for TFs with known binding profiles, has not generally alerted the user to this nor provided the user with recommendations for generating an appropriate background. We addressed this problem (Chapter 2) in complement to the expansion of oPOSSUM 3.0 TFBS over-representation software to analyze ChIP-seq data [82]. We created a visualization method (Composition-bias view) to alert users to bias in the TFBS over-representation score results; this has subsequently been incorporated into oPOSSUM 3.0. We then tested multiple background types and identified the most suitable background as one whose GC composition distribution matches that of the target set of sequences. The set of background sequences may be derived from either the portion of the whole genome that is mappable (non-repetitive) or from   132 open chromatin data, however open chromatin data is rich in TF binding motifs, which will impact over-representation results. It is up to the individual researcher to decide which source of background sequences best addresses their question. To enable researchers to generate backgrounds matched to their sequence datasets, we made available an open source tool for generating different backgrounds (BiasAway), including a GC composition matched background (Chapter 2). The Composition-bias view has provided useful guidance to users of oPOSSUM 3.0, and generated inquiries on how to create a suitable background for TFBS over-representation analyses (personal communication). We anticipate that BiasAway and the Composition-bias view will be of broad utility to researchers investigating TF regulation of a set of sequences, particularly those new to the nuances of TFBS detection.  Due to the non-random nature of ChIP-seq experiments, determining the location of TF binding sites in ChIP-seq data should take into account topological enrichment of motifs. The experimental expectation of the ChIPped TF’s motif near the peakMax has been confirmed and demonstrated on a limited number of datasets in previous work [57, 86]. We extended this work with several novel visualization approaches and a heuristic method (HADB – heuristic algorithm for direct binding) to identify the subset of TF ChIP-seq peaks that present evidence of direct binding by a TF through topological motif and motif score enrichment. Results from HADB provide a motif “enrichment zone” around the peakMax, defined by the maximum enriched distance of motifs relative to the peakMax and the minimum enriched motif score. Our evaluation of >300 ChIP-seq datasets, using HADB, revealed that on average ~60% of a dataset contains evidence of direct binding with the ChIPped TF (Chapter 2). The ChIP-seq datasets with a low topological enrichment of the expected motif inform us either that we don’t understand in full the TF’s binding specificity (direct or indirect), or that the experiment or antibody efficiency were less than ideal. If TFs in the same family have similar low motif   133 recovery, this may indicate that it is the TFs interaction with the DNA that we don't understand; the TF may not bind directly but work cooperatively with other proteins to control transcription. In either case future work is needed to understand why ChIP-seq experiments are recovering these regions and how the regions relate to TF regulation.  It is a common tendency amongst researchers to assume that those peaks without the ChIPped TF’s motif are a product of indirect binding between the ChIPped TF and another protein. However, as we work towards a better understanding of gene regulation, it would be beneficial to identify the relationship of these peaks to the ChIPped TF rather than make such assumptions. Using TFBS over-representation analyses and the HADB method, we identified a number of TFs binding patterns (zingers motifs) enriched in a non-random spatial relationship with the peakMax in up to 69% (mean 27%) of the subset of peaks without the ChIPped TFs motif. Many of these genomic regions were found recurrently across different ChIP-seq datasets, suggesting a non-TF-specific mechanism of sequence recovery. The proposed TF “loading zone” model to account for the observation hypothesizes that a chromatin structure, such as driven by cohesin, is a hub of TF interaction, which may expedite TFs’ facilitated diffusion method for searching the genomic space for a regulation target. Such a multi-TF zone would permit multiple ChIP-seq experiments to recover the same associated regions. An evaluation of the association between zinger peak neighbourhoods and Hi-C chromatin interaction data might shed some light both on the mechanism by which these regions are recurrently recovered across multiple ChIP-seq datasets, and should be further extended to explore the properties of the HOT regions identified by Yip et al. [61].    134 5.2 The TF view of the genome In our endeavour to understand how TFs contribute to the regulation of genes, we must consider that TFs do not function in isolation nor is the primary DNA sequence the sole means of determining TF specific interaction with the DNA. ChIP-seq data is a powerful method for identifying regulatory regions, but it is constrained to those TFs for which we have antibodies and to available cell types. Computational models have the potential to cross these boundaries.  With this in mind, we constructed a dynamic computational model – a simulator of TF occupancy. We found that the incorporation of only two biophysical components, the protocols for how a TF searches the DNA for a binding site and the energetics of the TFs interaction with the DNA, performed similarly to the classical PWM model. The low specificity may result due to the fact that we use cell specific ChIP-seq data as the reference collection for evaluating TF occupancy results, but we incorporate no cell specific information within the simulator. The introduction of such information into the model, such as DNaseI-seq or H3K4me1 ChIP-seq, improved predictions, as was expected from work by [53, 136, 137]. Analysis of the sub-classes introduced in Chapter 3, supported our previous observations that recurrently and non-recurrently bound regions have distinct properties. Predictive performance is improved by the inclusion of open chromatin data for recurrent regions, consistent with observations from static models [136, 137], however in disagreement with the same model, the open chromatin data fails to contribute to improved predictive capacity on non-recurrent regions. Further investigation is needed to understand how these two classes of regions relate to TF driven regulation, why different ChIP-seq experiments have different ratios of the two classes, and what the different observed properties are indicative of.    With a modular computational model, such as tfOS, we can test our current understanding of events leading to TF driven transcriptional regulation, and incorporate new genomic information   135 as it comes available. For instance, a global map of TF::TF interactions would provide a means for weighing TF::TF interactions within the simulator. Knowing the concentration of TFs in the nucleus would constrain one of the variable parameters in tfOS. Chromatin interaction data would provide a constrained biologically-based search space to model the chance of a set of regulating TFs for an interacting enhancer and promoter being present within the same window of time as the interaction occurs. The future development of the simulator will need to allow for a more complex set of molecular dynamic properties, as new classes of data become available.    From the work presented in this thesis we have better insight into how nucleotide composition impacts motif analyses, and we can account and correct for it in TFBS over-representation analyses. We have identified systemic characteristics that influence how we view and interpret ChIP-seq data, and provided approaches that can be used to evaluate these ChIP-based datasets. Recognizing that a TF does not work in isolation with primary DNA sequences we have developed a dynamic computational model of TF interactions with chromatin and other TFs, from which we confirmed that chromatin accessibility impacts detection of a subset of ChIP-seq defined regulatory regions, and that TF::TF interactions, as they are currently modeled, contribute minor improvements for a subset of TFs and regions. This work contributes to our understanding of TF interaction within the genomic space, and suggests future avenues of research, thus we move closer to understanding how gene expression is regulated.  5.3 Future work toward understanding TF regulation of gene expression We have many questions to answer with respect to sequence-specific TFs and their binding sites contribution to gene regulation. TFs do not work alone and both their binding sites and immediate environment (other proteins, chromatin shape, chromatin epigenetic signatures, TF concentration) determines their action on a target gene. It is necessary to know not only where   136 TFs bind in the genome, but when and with whom they interact. We also need to know which regulatory regions regulate which genes, and how binding site affinity impacts TF dynamics, which in turn contributes to the dynamics of gene regulation. To gain insights into these areas, and others, we will need to draw upon a combination of new experimental approaches and improved computational methods.     As mentioned in the thesis, sequence alone is not sufficient to distinguish functional from non-functional TFBSs and we need to incorporate layers of information to model how a TF “views” the nuclear and genomic space. One of the layers of information that is likely to improve identification of bona fide binding sites for some TFs, if not all, is the 3D conformation of the chromatin. Some TFs are known to have geometrical preferences with regards to the major and minor grooves [144-146], and there are likely other chromatin structural preferences that contribute to TFBS specificity. How chromatin 3D shape contributes to TFBS recognition by TFs has been investigated in recent work [149] but has yet to be incorporated into TFBS prediction methods. Leveraging chromatin structure information to improve computational predictions, may prove to be a much looked for advance in increasing the specificity of TFBS predictions, and could be introduced into future versions of tfOS.  In addition to TF::chromatin interactions, there are combinatoric dependencies between TFs that contribute to the dynamics of TF::chromatin interactions. The combinatorics of TFs are an important means by which genes are differentially regulated between tissues, conditions, and developmental stages. For instance, a recent study using single cell technologies, confirmed that SOX2 and OCT4 cooperatively work together in ES cells by co-stabilizing their interactions with the DNA, and highlighted a hierarchical dependency i.e. the OCT4 bound fraction relies on the presence of SOX2 but not the reverse [142]. Such a hierarchical binding and single direction   137 of dependency reduces the complexity of events that must be balanced between two cooperating TFs and thus seems a favourable system for a cell to generally employ. A global map of TF::TF interactions and an exploration of interaction hierarchies and dependencies would assist our understanding of how gene regulation is coordinated. In some cases TF::TF interactions may stabilize due to a conformation change in the chromatin; thus a study of TF::TF interactions might integrate well with the evaluation of the aforementioned chromatin structure at TFBSs to provide insight into TF specificity for limited regions in the genome.  One of the largest limitations we still face is identifying which TF::TFBSs pairs target which genes. The FANTOM5 project has highlighted a genome-wide approach for identifying active enhancers based on the presence of proximal bi-directional transcripts [32]. While we won’t know precisely the TFs engaged at these enhancers, it is possible that the activity of the enhancers, as measured by the transcript abundance, could be correlated with the transcriptional activity of genes to begin proposing regulatory associations between enhancers and genes. This has started to be explored by groups from the FANTOM consortium. However, this can only be employed for ‘expressed’ enhancer regions, and it is an association rather than a confirmed direct interaction. Regardless, it is a considerable improvement over distance measures, which quickly become inappropriate as distances get large; particularly as regulatory regions may skip the most proximal promoter to regulate a more distal gene, and promoters may act to co-regulate other promoters [150]. However, with the advent of chromatin interaction protocols over the last few years, the potential to identify physical interactions between distal regions of chromatin, such as regulatory regions and promoters, has been realized.   There are three methods that are well suited for identifying genome-wide regulatory region interactions: 1) 4C – detection of all contacts with a single genomic region of interest [151], 2)   138 Hi-C – detection of all chromatin interactions genome-wide [152], and 3) ChIA-PET – detection of all chromatin interactions with regions that are bound (directly or indirectly) by a given protein [153]. Methods for deciphering true signal from noise in these protocols remain under development, as does developing techniques that reduce the requirement for a large population of cells, but as with most technologies this will improve over time. A large-scale project, such as ENCODE [29], to determine the interaction map of regulatory regions and improve methods for dealing with background signal, would be an enormous asset towards understanding TF-based gene regulation; as would determining how the translocation of a distal regulatory region to its target gene is accomplished  Beyond establishing which TFBSs contribute to the regulation of which genes, there is the question of how the affinity of a TFBS for its partner TF impacts the expression of its target genes. From the point of view of the TF, different TFBS affinities provide a means to differentially regulate individual target genes. One of the least studied, but definitely speculated upon, modes of TF-based regulation is the contribution of low affinity binding sites to gene regulation. Opinion is divided on whether such binding sites are functionally irrelevant, or whether they have roles to play, and because they are hard to identify research naturally sets them aside for the more easily identified strong affinity sites. However, there is some supporting evidence that low affinity binding sites can be important determinants in development. An example of this is the PREP1 control of the Pax6 gene during mouse eye lens development, where the low affinity of two PREP1 binding sites in an enhancer is essential to determining the developmental timing of Pax6 expression [75]. There are multiple similar mechanisms in the development of Drosophila, such as the required low binding affinity binding sites of the TF, Ci, in an enhancer of a wing development gene [154]. These low affinity binding sites are conserved with respect to affinity although not necessarily to sequence alignment. This   139 suggests that low affinity sites may be an evolutionary approach to a robust regulatory response during development, as low affinity sites are likely easier to achieve within the regulatory region than are high affinity sites.   The computational identification of low affinity TFBSs, even from the constrained space of experimental data is troublesome as low affinity sites are plentiful in a genome. However, the degree to which low affinity binding may participate in the regulation of developmental systems is intriguing and has not been evaluated much in vertebrates. A starting point might be the prediction of low affinity binding sites for TFs that 1) are known to be important to developmental programs, 2) occur as modules of at least two binding sites, 3) are not associated with repeat elements, and 4) are conserved either at the sequence or affinity level. The subset of binding sites also near developmentally important genes, or overlapping ChIP-seq/exo data or active epigenetic signals, would be the first to evaluate for functionality. If evidence in vertebrates of functionally important low affinity TFBSs, such as the PREP1 and Pax6 system, were specific to developmental regulation it would be interesting to investigate whether low affinity sites are the evolutionary answer to regulation for precision control, such as in the case of morphogen gradients. The study of low-affinity TFBSs might be particularly well suited to future versions of the simulator developed in this thesis.  In the last decade, our understanding of gene regulation on a genome-wide scale has dramatically advanced as modern automated technologies and new experimental methods have allowed us to gather genome-wide data of gene transcription events, and computational methods have enhanced our ability to process and interpret the data. As the cost of sequencing comes down, and new high-throughput technologies are refined to increase the output of experimental methods and decrease the required biological material, we will find that individual   140 labs will start producing data that it has previously taken large multi-lab consortiums to generate. This is a process that we have already witnessed with whole genome sequencing  In addition, the single cell technologies that are currently developing will open the door to the dynamics of gene regulation that are currently masked by the averaging of cell population studies. Important to this anticipated increase in data will be computational methods that help convert the data into knowledge and integrate it into a nuclear view of how regulating proteins and chromatin interact. It is reasonable to expect that our understanding of gene regulation, and thus diseases, will develop more rapidly in the next decade than it has in the last, bringing us closer to an era where we can repair damaged tissues or organs, and thus personalize treatment of diseases.       141 Bibliography  1. Funnell AP, Crossley M: Hemophilia B Leyden and once mysterious cis-regulatory mutations. Trends Genet 2014, 30:18-23. 2. Lomberk G, Grzenda A, Mathison A, Escande C, Zhang JS, Calvo E, Miller LJ, Iovanna J, Chini EN, Fernandez-Zapico ME, Urrutia R: Kruppel-like factor 11 regulates the expression of metabolic genes via an evolutionarily conserved protein interaction domain functionally disrupted in maturity onset diabetes of the young. J Biol Chem 2013, 288:17745-17758. 3. Weisschuh N, Wissinger B, Gramer E: A splice site mutation in the PAX6 gene which induces exon skipping causes autosomal dominant inherited aniridia. Mol Vis 2012, 18:751-757. 4. Bhatia S, Bengani H, Fish M, Brown A, Divizia MT, de Marco R, Damante G, Grainger R, van Heyningen V, Kleinjan DA: Disruption of autoregulatory feedback by a mutation in a remote, ultraconserved PAX6 enhancer causes aniridia. Am J Hum Genet 2013, 93:1126-1134. 5. De Gobbi M, Viprakasit V, Hughes JR, Fisher C, Buckle VJ, Ayyub H, Gibbons RJ, Vernimmen D, Yoshinaga Y, de Jong P, et al: A regulatory SNP causes a human genetic disease by creating a new transcriptional promoter. Science 2006, 312:1215-1217. 6. Okou DT, Mondal K, Faubion WA, Kobrynski LJ, Denson LA, Mulle JG, Ramachandran D, Xiong Y, Svingen P, Patel V, et al: Exome sequencing identifies a novel FOXP3 mutation in a 2-generation family with inflammatory bowel disease. J Pediatr Gastroenterol Nutr 2014, 58:561-568. 7. Weedon MN, Cebola I, Patch AM, Flanagan SE, De Franco E, Caswell R, Rodriguez-Segui SA, Shaw-Smith C, Cho CH, Lango Allen H, et al: Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic agenesis. Nat Genet 2014, 46:61-64. 8. Pohl E, Aykut A, Beleggia F, Karaca E, Durmaz B, Keupp K, Arslan E, Palamar M, Yigit G, Ozkinay F, Wollnik B: A hypofunctional PAX1 mutation causes autosomal recessively inherited otofaciocervical syndrome. Hum Genet 2013, 132:1311-1320. 9. Bowman T, Garcia R, Turkson J, Jove R: STATs in oncogenesis. Oncogene 2000, 19:2474-2488. 10. Darnell JE, Jr.: Transcription factors as targets for cancer therapy. Nat Rev Cancer 2002, 2:740-749. 11. Yuan GC, Liu YJ, Dion MF, Slack MD, Wu LF, Altschuler SJ, Rando OJ: Genome-scale identification of nucleosome positions in S. cerevisiae. Science 2005, 309:626-630. 12. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, et al: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007, 39:311-318. 13. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129:823-837. 14. Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM: Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res 2006, 16:1-10.   142 15. Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, Hume DA: Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet 2007, 8:424-436. 16. Lettice LA, Heaney SJ, Purdie LA, Li L, de Beer P, Oostra BA, Goode D, Elgar G, Hill RE, de Graaff E: A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet 2003, 12:1725-1735. 17. Worsley-Hunt R, Bernard V, Wasserman WW: Identification of cis-regulatory sequence variations in individual genome sequences. Genome Med 2011, 3:65. 18. Simicevic J, Schmid AW, Gilardoni PA, Zoller B, Raghav SK, Krier I, Gubelmann C, Lisacek F, Naef F, Moniatte M, Deplancke B: Absolute quantification of transcription factors during cellular differentiation using multiplexed targeted proteomics. Nat Methods 2013, 10:570-576. 19. Dekker J, Marti-Renom MA, Mirny LA: Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet 2013, 14:390-403. 20. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 2007. 21. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science 2007, 316:1497-1502. 22. Auerbach RK, Euskirchen G, Rozowsky J, Lamarre-Vincent N, Moqtaderi Z, Lefrancois P, Struhl K, Gerstein M, Snyder M: Mapping accessible chromatin regions using Sono-Seq. Proc Natl Acad Sci U S A 2009, 106:14926-14931. 23. Teytelman L, Thurtle DM, Rine J, van Oudenaarden A: Highly expressed loci are vulnerable to misleading ChIP localization of multiple unrelated proteins. Proc Natl Acad Sci U S A 2013, 110:18602-18607. 24. Felsani A, Gudmundsson B, Nanni S, Brini E, Moles A, Thormar HG, Estibeiro P, Gaetano C, Capogrossi M, Farsetti A, et al: Impact of different ChIP-Seq protocols on DNA integrity and quality of bioinformatics analysis results. Brief Funct Genomics 2014. 25. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008, 9:R137. 26. Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 2008, 24:1729-1730. 27. Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26:1351-1359. 28. Rhee HS, Pugh BF: Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011, 147:1408-1419. 29. ENCODE Project Consortium, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M: An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489:57-74. 30. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, et al: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 2012, 22:1813-1831. 31. FANTOM Consortium and the RIKEN PMI and CLST (DGT): A promoter-level mammalian expression atlas. Nature 2014, 507:462-470 (A. Mathelier and W.W.   143 Wasserman are members of the FANTOM465 Consortium of over 200 international scientists)   32. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al: An atlas of active enhancers across human cell types and tissues. Nature 2014, 507:455-461 (A. Mathelier and W.W. Wasserman are members of the FANTOM455 Consortium of over 200 international scientists). 33. Machanick P, Bailey TL: MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 2011, 27:1696-1697. 34. Georgiev S, Boyle AP, Jayasurya K, Ding X, Mukherjee S, Ohler U: Evidence-ranked motif identification. Genome Biol 2010, 11:R19. 35. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ: Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 2010, 26:2622-2623. 36. Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J: RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 2012, 40:e31. 37. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16-23. 38. Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res 2002, 30:4442-4451. 39. Stormo GD, Schneider TD, Gold L: Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res 1986, 14:6661-6679. 40. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005, 21:2657-2666. 41. Mathelier A, Wasserman WW: The next generation of transcription factor binding site prediction. PLoS Comput Biol 2013, 9:e1003214. 42. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A: JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 2010, 38:D105-110. 43. Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ: HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res 2013, 41:D195-202. 44. Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24:238-241. 45. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004, 5:276-287. 46. Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 1998, 23:109-113. 47. Maerkl SJ, Quake SR: A systems approach to measuring the binding energy landscapes of transcription factors. Science 2007, 315:233-237. 48. Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 2001:127-138. 49. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33:3154-3164. 50. Pennacchio LA, Olivier M, Hubacek JA, Cohen JC, Cox DR, Fruchart JC, Krauss RM, Rubin EM: An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science 2001, 294:169-173.   144 51. Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, et al: Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 2010, 328:1036-1040. 52. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A 2002, 99:757-762. 53. Whitington T, Perkins AC, Bailey TL: High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites. Nucleic Acids Res 2009, 37:14-25. 54. Boeva V, Surdez D, Guillon N, Tirode F, Fejes AP, Delattre O, Barillot E: De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 2010, 38:e126. 55. Kwon AT, Arenillas DJ, Hunt RW, Wasserman WW: oPOSSUM-3: Advanced Analysis of Regulatory Motif Over-Representation Across Genes or ChIP-Seq Datasets. G3 2012, 2:987-1002. 56. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010, 38:576-589. 57. Bailey TL, Machanick P: Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res 2012, 40:e128. 58. Gordan R, Hartemink AJ, Bulyk ML: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Res 2009, 19:2090-2100. 59. Whitington T, Frith MC, Johnson J, Bailey TL: Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res 2011, 39:e98. 60. Yan J, Enge M, Whitington T, Dave K, Liu J, Sur I, Schmierer B, Jolma A, Kivioja T, Taipale M, Taipale J: Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 2013, 154:801-813. 61. Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, Gerstein M: Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol 2012, 13:R48. 62. Berg OG, Winter RB, von Hippel PH: Diffusion-driven mechanisms of protein translocation on nucleic acids. 1. Models and theory. Biochemistry (Mosc) 1981, 20:6929-6948. 63. Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. II. The binding specificity of cyclic AMP receptor protein to recognition sites. J Mol Biol 1988, 200:709-723. 64. Phair RD, Scaffidi P, Elbi C, Vecerova J, Dey A, Ozato K, Brown DT, Hager G, Bustin M, Misteli T: Global nature of dynamic protein-chromatin interactions in vivo: three-dimensional genome scanning and dynamic interaction networks of chromatin proteins. Mol Cell Biol 2004, 24:6393-6402. 65. Gowers DM, Wilson GG, Halford SE: Measurement of the contributions of 1D and 3D pathways to the translocation of a protein along DNA. Proc Natl Acad Sci U S A 2005, 102:15883-15888. 66. Elf J, Li GW, Xie XS: Probing transcription factor dynamics at the single-molecule level in a living cell. Science 2007, 316:1191-1194.   145 67. Hammar P, Leroy P, Mahmutovic A, Marklund EG, Berg OG, Elf J: The lac repressor displays facilitated diffusion in living cells. Science 2012, 336:1595-1598. 68. Misteli T: Beyond the sequence: cellular organization of genome function. Cell 2007, 128:787-800. 69. Janga SC, Salgado H, Martinez-Antonio A: Transcriptional regulation shapes the organization of genes on bacterial chromosomes. Nucleic Acids Res 2009, 37:3680-3688. 70. Li S, Heermann DW: Transcriptional regulatory network shapes the genome structure of Saccharomyces cerevisiae. Nucleus 2013, 4:216-228. 71. Marenduzzo D, Finan K, Cook PR: The depletion attraction: an underappreciated force driving cellular organization. J Cell Biol 2006, 175:681-686. 72. von Hippel PH, Berg OG: On the specificity of DNA-protein interactions. Proc Natl Acad Sci U S A 1986, 83:1608-1612. 73. Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol 1987, 193:723-750. 74. Gertz J, Siggia ED, Cohen BA: Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 2009, 457:215-218. 75. Rowan S, Siggers T, Lachke SA, Yue Y, Bulyk ML, Maas RL: Precise temporal control of the eye regulatory gene Pax6 via enhancer-binding site affinity. 0890-9369 2010, 24:980-985. 76. Zhang C, Xuan Z, Otto S, Hover JR, McCorkle SR, Mandel G, Zhang MQ: A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Res 2006, 34:2238-2246. 77. Djordjevic M, Sengupta AM, Shraiman BI: A biophysical approach to transcription factor binding site discovery. Genome Res 2003, 13:2381-2390. 78. Granek JA, Clarke ND: Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol 2005, 6:R87. 79. Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 2006, 22:e141-149. 80. Roider HG, Kanhere A, Manke T, Vingron M: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 2007, 23:134-141. 81. Manke T, Roider HG, Vingron M: Statistical modeling of transcription factor binding affinities predicts regulatory interactions. PLoS Comput Biol 2008, 4:e1000039. 82. Worsley Hunt R, Mathelier A, Del Peso L, Wasserman WW: Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment. BMC Genomics 2014, 15:472. 83. Worsley Hunt R, Wasserman WW: Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol 2014, 15:412. 84. Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C, Thieffry D, van Helden J: RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res 2011, 39:W86-91. 85. R Core Team: R: A Language and Environment for Statistical Computing. 2.14.1 edition: R Foundation for Statistical Computing; 2014. 86. Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seq peak detection. PLoS ONE 2010, 5:e11471. 87. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G: GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010, 28:495-501.   146 88. Miano JM, Long X, Fujiwara K: Serum response factor: master regulator of the actin cytoskeleton and contractile apparatus. Am J Physiol Cell Physiol 2007, 292:C70-81. 89. Singh S, Vrishni S, Singh BK, Rahman I, Kakkar P: Nrf2-ARE stress response mechanism: a control point in oxidative stress-mediated dysfunctions and chronic inflammatory diseases. Free Radic Res 2010, 44:1267-1288. 90. Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA, Ovcharenko I: Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res 2010, 20:565-577. 91. Giorgetti L, Siggers T, Tiana G, Caprara G, Notarbartolo S, Corona T, Pasparakis M, Milani P, Bulyk ML, Natoli G: Noncooperative interactions between transcription factors and clustered DNA binding sites enable graded transcriptional responses to environmental inputs. Mol Cell 2010, 37:418-428. 92. Ji Z, Donaldson IJ, Liu J, Hayes A, Zeef LA, Sharrocks AD: The forkhead transcription factor FOXK2 promotes AP-1-mediated transcriptional regulation. Mol Cell Biol 2012, 32:385-398. 93. Yu X, Zhu X, Pi W, Ling J, Ko L, Takeda Y, Tuan D: The long terminal repeat (LTR) of ERV-9 human endogenous retrovirus binds to NF-Y in the assembly of an active LTR enhancer complex NF-Y/MZF1/GATA-2. J Biol Chem 2005, 280:35184-35194. 94. Razzaque MA, Masuda N, Maeda Y, Endo Y, Tsukamoto T, Osumi T: Estrogen receptor-related receptor gamma has an exceptionally broad specificity of DNA sequence recognition. Gene 2004, 340:275-282. 95. Watson DK, Robinson L, Hodge DR, Kola I, Papas TS, Seth A: FLI1 and EWS-FLI1 function as ternary complex factors and ELK1 and SAP1a function as ternary and quaternary complex factors on the Egr1 promoter serum response elements. Oncogene 1997, 14:213-221. 96. Schmid CD, Bucher P: MER41 repeat sequences contain inducible STAT1 binding sites. PLoS ONE 2010, 5:e11425. 97. Ferrigno O, Virolle T, Djabari Z, Ortonne JP, White RJ, Aberdam D: Transposable B2 SINE elements can provide mobile RNA polymerase II promoters. Nat Genet 2001, 28:77-81. 98. Schaub M, Myslinski E, Schuster C, Krol A, Carbon P: Staf, a promiscuous activator for enhanced transcription by RNA polymerases II and III. EMBO J 1997, 16:173-181. 99. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, et al: DNA-binding specificities of human transcription factors. Cell 2013, 152:327-339. 100. Ngondo-Mbongo RP, Myslinski E, Aster JC, Carbon P: Modulation of gene expression via overlapping binding sites exerted by ZNF143, Notch1 and THAP11. Nucleic Acids Res 2013, 41:4000-4014. 101. Guo Y, Mahony S, Gifford DK: High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol 2012, 8:e1002638. 102. Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H, Collado-Vides J, van Helden J: Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Res 2011, 39:808-824. 103. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19 Suppl 1:i169-i176.   147 104. Zhao Y, Ruan S, Pandey M, Stormo GD: Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 2012, 191:781-790. 105. Park D, Lee Y, Bhupindersingh G, Iyer VR: Widespread misinterpretable ChIP-seq bias in yeast. PLoS ONE 2013, 8:e83506. 106. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, et al: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008, 133:1106-1117. 107. Tiwari VK, Stadler MB, Wirbelauer C, Paro R, Schubeler D, Beisel C: A chromatin-modifying function of JNK during stem cell differentiation. Nat Genet 2012, 44:94-100. 108. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009, 462:315-322. 109. Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S, Soukhatcheva G, Li L, Wederell ED, Thiessen N, et al: Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate the functionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets and liver. Genome Res 2010, 20:1037-1051. 110. Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, et al: ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res 2013, 41:D56-63. 111. Kuhn RM, Haussler D, Kent WJ: The UCSC genome browser and associated tools. Brief Bioinform 2013, 14:144-161. 112. Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics 2002, 18:1135-1136. 113. Marstrand TT, Frellsen J, Moltke I, Thiim M, Valen E, Retelska D, Krogh A: Asap: a framework for over-representation statistics for transcription factor binding sites. PLoS ONE 2008, 3:e1623. 114. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25:1422-1423. 115. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004, 32:D493-496. 116. Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009, 6:S22-32. 117. Kidder BL, Hu G, Zhao K: ChIP-Seq: technical considerations for obtaining high-quality data. 1529-2908 2011, 12:918-922. 118. Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, et al: Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 2012, 9:609-614. 119. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proc Int Conf Intell Syst Mol Biol. 1994: 28-36. 120. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27:66-75. 121. Teytelman L, Ozaydin B, Zill O, Lefrancois P, Snyder M, Rine J, Eisen MB: Impact of chromatin structures on DNA processing for genomic analyses. PLoS ONE 2009, 4:e6700.   148 122. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS: Quantifying similarity between motifs. Genome Biol 2007, 8:R24. 123. Boyle AP, Song L, Lee BK, London D, Keefe D, Birney E, Iyer VR, Crawford GE, Furey TS: High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res 2011, 21:456-464. 124. Shu W, Chen H, Bo X, Wang S: Genome-wide analysis of the relationships between DNaseI HS, histone modifications and gene expression reveals distinct modes of chromatin domains. Nucleic Acids Res 2011, 39:7428-7443. 125. Schmidt D, Schwalie PC, Ross-Innes CS, Hurtado A, Brown GD, Carroll JS, Flicek P, Odom DT: A CTCF-independent role for cohesin in tissue-specific transcription. Genome Res 2010, 20:578-588. 126. Kagey MH, Newman JJ, Bilodeau S, Zhan Y, Orlando DA, van Berkum NL, Ebmeier CC, Goossens J, Rahl PB, Levine SS, et al: Mediator and cohesin connect gene expression and chromatin architecture. Nature 2010, 467:430-435. 127. Kim YJ, Cecchini KR, Kim TH: Conserved, developmentally regulated mechanism couples chromosomal looping and heterochromatin barrier activity at the homeobox gene A locus. Proc Natl Acad Sci U S A 2011, 108:7391-7396. 128. Faure AJ, Schmidt D, Watt S, Schwalie PC, Wilson MD, Xu H, Ramsay RG, Odom DT, Flicek P: Cohesin regulates tissue-specific expression by stabilizing highly occupied cis-regulatory modules. Genome Res 2012, 22:2163-2175. 129. Parelho V, Hadjur S, Spivakov M, Leleu M, Sauer S, Gregson HC, Jarmuz A, Canzonetta C, Webster Z, Nesterova T, et al: Cohesins functionally associate with CTCF on mammalian chromosome arms. Cell 2008, 132:422-433. 130. Wendt KS, Yoshida K, Itoh T, Bando M, Koch B, Schirghuber E, Tsutsumi S, Nagae G, Ishihara K, Mishiro T, et al: Cohesin mediates transcriptional insulation by CCCTC-binding factor. Nature 2008, 451:796-801. 131. Schaaf CA, Misulovin Z, Gause M, Koenig A, Gohara DW, Watson A, Dorsett D: Cohesin and polycomb proteins functionally interact to control transcription at silenced and active genes. PLoS Genet 2013, 9:e1003560. 132. Loverdo C, Benichou O, Voituriez R, Biebricher A, Bonnet I, Desbiolles P: Quantifying hopping and jumping in facilitated diffusion of DNA-binding proteins. Phys Rev Lett 2009, 102:188101. 133. Cheung MS, Down TA, Latorre I, Ahringer J: Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res 2011, 39:e103. 134. Smit A, Hubley R, Green P: RepeatMasker 4.0.3 edition. Seattle: Institute for Systems Biology; 2013. 135. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. 1424-8581 2005, 110:462-467. 136. Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK: Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res 2011, 21:447-455. 137. Cuellar-Partida G, Buske FA, McLeay RC, Whitington T, Noble WS, Bailey TL: Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 2012, 28:56-62. 138. Biebricher A, Wende W, Escude C, Pingoud A, Desbiolles P: Tracking of single quantum dot labeled EcoRV sliding along DNA manipulated by double optical tweezers. Biophys J 2009, 96:L50-52. 139. Pombo A, Cuello P, Schul W, Yoon JB, Roeder RG, Cook PR, Murphy S: Regional and temporal specialization in the nucleus: a transcriptionally-active nuclear domain   149 rich in PTF, Oct1 and PIKA antigens associates with specific chromosomes early in the cell cycle. EMBO J 1998, 17:1768-1778. 140. Liu X, Clarke ND: Rationalization of gene regulation by a eukaryotic transcription factor: calculation of regulatory region occupancy from predicted binding affinities. J Mol Biol 2002, 323:1-8. 141. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic Acids Res 2004, 32:3040-3052. 142. Chen J, Zhang Z, Li L, Chen BC, Revyakin A, Hajj B, Legant W, Dahan M, Lionnet T, Betzig E, et al: Single-molecule dynamics of enhanceosome assembly in embryonic stem cells. Cell 2014, 156:1274-1285. 143. Garvie CW, Hagman J, Wolberger C: Structural studies of Ets-1/Pax5 complex formation on DNA. Mol Cell 2001, 8:1267-1276. 144. Bewley CA, Gronenborn AM, Clore GM: Minor groove-binding architectural proteins: structure, function, and DNA recognition. Annu Rev Biophys Biomol Struct 1998, 27:105-131. 145. Remenyi A, Lins K, Nissen LJ, Reinbold R, Scholer HR, Wilmanns M: Crystal structure of a POU/HMG/DNA ternary complex suggests differential assembly of Oct4 and Sox2 on two enhancers. Genes Dev 2003, 17:2048-2059. 146. Cirillo LA, Zaret KS: Specific interactions of the wing domains of FOXA1 transcription factor with DNA. J Mol Biol 2007, 366:720-724. 147. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A, Ienasescu H, et al: JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res 2014, 42:D142-147. 148. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21:3940-3941. 149. Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordan R, Rohs R: TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2014, 42:D148-155. 150. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J, Zhang J, et al: Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 2012, 148:84-98. 151. Zhao Z, Tavoosidana G, Sjolinder M, Gondor A, Mariano P, Wang S, Kanduri C, Lezcano M, Sandhu KS, Singh U, et al: Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet 2006, 38:1341-1347. 152. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, et al: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326:289-293. 153. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A, Mei PH, et al: An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 2009, 462:58-64. 154. Ramos AI, Barolo S: Low-affinity transcription factor binding sites shape morphogen responses and enhancer evolution. Philos Trans R Soc Lond B Biol Sci 2013, 368:20130018. 155. Fu Y, Sinha M, Peterson CL, Weng Z: The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 2008, 4:e1000138.   150 156. Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2:526-538. 157. Clote P, Ferre F, Kranakis E, Krizanc D: Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 2005, 11:578-591. 158. Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 2008, 36:e105. 159. Yang J, Stark GR: Roles of unphosphorylated STATs in signaling. Cell Res 2008, 18:443-451.      151 Appendices  Appendix A   This appendix contains the Supplemental Figure A.1 for Chapter 1.   Figure A.1 The PWM TF binding profile model. Generation of the classical TF binding profile model, the PWM, using SPI1 as an example. A set of known binding sites for a TF are aligned, and the counts of each nucleotide in each column form the position frequency matrix (F ). The counts ( cx,i ) of each nucleotide ( x ) at each position ( i ) in F  are weighted by the background frequency of the nucleotide (bx ) and the number of aligned binding sites (n ). The values of the position weight matrix (Wx,i ) are then the log-odds scores for the weighted frequency of each base at each position ( fx,i ), relative to the expected frequency of that base in the genomic background ( bx ).  The binding site logo is a graphical representation of a TF’s binding preferences, with the height of each letter representing the preference at each position.    152 Appendix B   This appendix contains the Supplemental analyses related to Chapter 2.  B.1 Sequence composition impact on TFBS prediction Given that ChIP-seq data displays a wide range of GC composition (Appendix D, Figure D.1), we reviewed the impact of low and high GC sequence composition on the occurrence of TFBS motifs. All peaks were trimmed to 201bp prior to TFBSs prediction within the peaks. The sequences were binned by the multiplicity of the predicted TFBS over a given motif score threshold of 85. The GC composition of the bins was evaluated.  For a peak region bound in a sequence-specific manner, we expect at least one TFBS for the ChIPped TF to be present in a perfect experiment. Additional TFBS patterns may reflect function, but could arise at a frequency consistent with chance. In considering this alternative, for a multiplicity of 1 through 5 of predicted TFBSs we plotted the relationship between peak count and mean peak GC composition. We observed a strong correspondance between control multiplicity X and peak multiplicity X+1.  In other words, the ChIP-seq peaks appear to have only one extra TFBS relative to chance expectation. The Pearson correlation between ChIP-seq peaks with multiplicity X+1 and control peaks with multiplicity X was >0.99 for the peak count, and >0.99 for the mean GC composition (Appendix D, Figure D.2).  B.2 TFBS motif over-representation analyses The impact of background composition selection on TFBS over-representation results As indicated in the main text, we prepared several backgrounds to be evaluated with respect to their impact on TFBS motif over-representation analyses. The first background was a naïve background model of random genomic sequences (derived from both DNase regions and from genomic mappable regions). A second background was generated by a 3rd order Markov model   153 (RSAT package [84]). Six additional background models were implemented as a background sequence generator, BiasAway: 1) mononucleotide shuffled target sequence, 2) dinucleotide shuffled target sequence, 3) genomic sequences matched to the nucleotide composition of the target data, 4) sliding windows of mononucleotide shuffled target sequence, 5) sliding windows of dinucleotide shuffled target sequence, or 6) genomic sequences matched in sliding windows of internal composition for each target sequence. One set of backgrounds was generated by HOMER 2 [56], a tool-kit for next generation sequencing analysis. HOMER 2 is the only software of which we are aware that uses GC composition matched backgrounds for TFBS over-representation analysis.  The backgrounds were evaluated against a set of 43 human TF ChIP-seq datasets from ENCODE (one dataset per TF, randomly chosen), and 165 PWMs (from JASPAR 4.0_alpha development database) [42], to assess the impact on over-representation scores for peakMax-centered sequences. We trimmed peaks to 201 bp, a length selection based on analyses described below, and then repeated the analyses using sequences of double the length (401 bp).  We chose two over-representation software tools that accept user-defined backgrounds: oPOSSUM 3.0 [55], and ASAP [113]. All backgrounds were evaluated with the 43 datasets on oPOSSUM, and the results of the background evaluation are summarized in Figure 2.2 and Appendix D, Figure D.4 (described below). The rank of the ChIPped TF from each of 430 analyses are provided in Appendix E, Table E.1. CB-plots for the enrichment results of E2F1 ChIP-seq data and six backgrounds are presented in Appendix D, Figure D.5.  Four backgrounds were re-evaluated on three datasets for both bias and bias correction with the ASAP tool (Appendix D, Figure D.3).     154 We expect certain characteristics for ideal ChIP-seq over-representation analysis results: 1) no extreme bias towards any one range of TF profile composition, which we term “skew”, 2) the ChIPped TF profile will appear amongst the top few results, 3) the majority of over-representation scores are neither over- nor under-represented (i.e. over-representation score mean is close to zero), and 4) the variance of the over-representation scores is low (excluding a few TFs that are outliers relative to the distribution of over-representation scores). A low variance is desired as a large variance could potentially inhibit the discovery of secondary TFs of interest. We summarize each of these ideals with respect to the six evaluated background types in Figure 2.2 (201 bp) and Appendix D, Figure D.4 (401 bp). Figure 2.2A and Appendix D, Figure D.4A present the skew of the data (the negative slope of a line fitted to the over-representation scores) and presence of the ChIPped TF in the top 5 results on the y- and x-axis respectively. The skew is a measure of bias in the over-representation scores, while the rank of the ChIPped TF can be considered as a measure of confidence of the over-representation results. Figure 2.2B-C and Appendix D, Figure D.4B-4C, present the average mean of the low over-representation scores and the variance of the low over-representation scores on the x- and y-axis respectively (where “low” indicates the exclusion of profiles with scores greater than one SD above the mean of the entire profile set; see Chapter 2 Methods).   Not surprisingly, use of a naïve random background for motif over-representation analysis, from either the mappable portion of the genome or DNase accessible regions, displays a systematic bias in the results towards particular groups of TFs, in almost all cases those TF’s with GC-rich binding sites (e.g. Appendix D, Figure D.5A). Despite the bias, the ChIPped TF is frequently one of the top-ranked TFs in both the 201bp and 401bp sequence sets; however the ranking of the remaining high scoring TFs often arises due to the dissimilarity of the composition between the target and background datasets. The remaining backgrounds reduced the bias towards any one   155 range of TF profiles (see y- axis of Figure 2.2A and Appendix D, Figure D.4A). As expected for TFBS over-representation analyses, few TFs are outliers relative to the majority of scores. However, it is the GC composition matched background results (BiasAway 3 and 6, and HOMER) that most consistently returned the ChIPped TF in the top 5 results (see x-axis of Figure 2.2A, and Appendix D, Figure D.4A), and resulted in the lowest values for both the mean and variation of the low over-representation scores (Figure 2.2B-C and Appendix D, Figure D.4B-4C). While Bias-Away and HOMER GC matched backgrounds resulted in good agreement for all measurements, Bias-Away predicted the ChIPped TF in the top 5 results in 11 percentage points (pp) greater cases for 201 bp sequences (7 pp for 401 bp sequences) compared to HOMER generated backgrounds. Relative to the GC composition matched background, both the dinucleotide shuffled background (BiasAway 2 and 5) and 3rd order Markov model background results reported the ChIPped TF 11pp and ~27pp less frequently (respectively), and had higher means and variances of the low over-representation scores. We speculate that the 3rd order Markov model may reflect the regeneration of TFBS motifs in the background sequences. There may be cases in which datasets with extreme nucleotide composition benefit from the sliding window approach (BiasAway 3 and 6), but in our tests neither instance of the sliding window backgrounds consistently performed notably better than the simpler GC composition matched background results.   TFBS over-representation analysis with ASAP We performed an over-representation analysis on three ChIP-seq datasets (E2F1, JUND, and C/EBPB) with the ASAP tool to test whether over-representation score bias is platform independent. The three datasets were selected from among those that had demonstrated both an over-representation score bias and over-representation of the ChIPped TF using oPOSSUM. The ASAP online tool limits the number of sequences to be analyzed, therefore we selected   156 random subsets of sequences from the datasets for testing. The backgrounds for ASAP analyses were: 1) a random genomic background, 2) sequences generated by a 3rd order Markov model, 3) di-nucleotide shuffled sequences, and 4) a GC composition matched background. As is seen in Appendix D, Figure D.3, the random genomic background results in an over-representation score bias towards GC rich TFBSs, while the remaining backgrounds correct the bias. Of the non-random backgrounds, the GC matched background results reported the expected motif among the top 5 over-representation scores for 2 of the 3 datasets, while the dinucleotide shuffle results reported the expected motif for 1 of the 3 datasets, and the 3rd order Markov model results did not report the expect motif in the top 5. The ASAP results are consistent with the oPOSSUM results for both the occurrence of over-representation score bias, and the success of the corrective procedures.  B.3 TFBS-landscape plot biological interpretation The main text describes the properties of ChIP-seq datasets and TFBSs that are used to generate a TFBS-landscape view, which presents the peakMax proximity and motif scores of predicted TFBSs. Figure 2.3, presents a number of TFBS-landscape plots for which the differences in shape are notable and worthy of comment (Figure 2.3A-I present motif enrichment of the ChIPped TF). The overlying commonality between all plots is the enrichment of a motif within close proximity to the peakMax (x-axis) over a range of scores (y-axis), and the presence of a band with uniform distribution of motif scores across all 1001bp. A contributing influence to the distinct patterns observed may relate to the complexity of the motif to which the TF binds. A wide TF binding pattern, such as observed for zinc-finger proteins (Figure 2.3D – ZNF143, Figure 2.3G – NRSF/REST, or Figure 2.3K – CTCF), with strong similarity to the majority of the known binding sites for the TF (i.e. high scoring motifs), will not be found often by chance in the genome. This is seen in the reduced number of motifs, outside the peakMax   157 enrichment “zone”, for the higher scoring range (y-axis) compared against the lower scoring range. While the lower scoring motifs may be enriched near the peakMax they also occur more frequently in the genome by chance; thus they are seen uniformly distributed across the 1001bp sequence space (aside from enrichment at the peakMax), which results in a horizontal “band” in a number of plots (Figure 2.3C-D).  In the three last plots, Figure 2.3J-L the observed enrichment is offset from the peakMax.  Figure 2.3J, is the CTCF PWM on H3K4me3 ChIP-seq data; CTCF is known to be enriched in proximity to nucleosomes [155] and thus the gap likely reflects a nucleosome at the peakMax, with CTCF offset to the side. Based on the nucleosome observation, it seems reasonable for CTCF motifs in RAD21 ChIP-seq (Figure 2.3K), and ELK4 motifs in NELFE ChIP-seq (Figure 2.3L), that here too the gap arises from the ChIPped protein bound to the sequence at the peakMax and a TF’s binding site enriched in proximity to that protein. While the cohesin complex, which RAD21 is part of, is well known to interact with CTCF [129, 130], it is not known whether there is an interaction between ELK4 and NELFE (a subunit of the NELF complex that binds to RNA pol II).  The non-random enrichment of lower scoring motifs in the TFBS-landscape plots, as mentioned above and seen in Figure 2.3, suggests that while low scoring motifs may occur by chance within genomic sequence, some will be bound by the ChIPped TF.  One of the unresolved questions in TFBS prediction is at what motif score should a threshold be set for selecting a set of motifs. This is especially of interest in the computational prediction of TFBS gain/loss of function due to mutations. The TFBS landscape plots suggest a lower limit for a motif score threshold that is specific to each predictive PWM and supported by the biology of ChIP-seq experiments.     158 B.4 Regions predicted by HADB to directly bind the ChIPped TF are enriched for peaks that co-occur between replicate experiments The HADB approach predicts regions that are bound directly by the TF, and we reasoned that if these peaks are indeed directly bound, then they should be more consistently replicated compared to peaks without the TF’s direct interaction. We used those datasets for which we had replicates to assess which peaks are consistently ChIPped. In the main text we applied a 500bp distance window between peakMax positions to classify peak regions as replicating. Here we present the same analysis using 1000bp windows. The original value was selected based on the median peak size.  We selected 1000bp to ascertain if the procedure was sensitive to this parameter. We observe little difference. Using a Fisher exact test on each dataset, we determined that for the majority of datasets the peaks predicted to bind the ChIPped TF are significantly more likely (93% of datasets produced a Fisher exact test one tailed p-value <0.001 (92% produced p-values <1e-09)) to be found in both replicate experiments than are the peaks without the ChIPped TFs motif.   B.5 Gene ontology term analyses on HADB identified ChIP-seq peaks We performed gene ontology (GO) enrichment analyses, using the GREAT software [87], on the subset of peaks inferred by the HADB method to be directly bound by a TF.  GO enrichment analysis for TFBSs is somewhat problematic due to the diversity of processes a TF may regulate and the proximity of TFBS to the genes regulated. We therefore chose datasets for TFs known to be highly specific for a biological process: SRF, a master regulator of the actin cytoskeleton and contractile processes [88] (5,632 peaks), and NFE2L2, a key regulator of oxidative stress response [89] (1,256 peaks). We submitted three sets of peaks: 1) the whole ChIP-seq dataset, 2) peaks from the subset of regions inferred to be directly bound by the ChIPped TF, and 3) those peaks not inferred to be directly bound. The SRF whole dataset   159 returned GREAT results without actin cytoskeleton related terms. The subset of SRF flagged by the HADB method as containing a motif for SRF returned with the first 5 terms related to actin. The remaining peaks, without the SRF motif proximal to the peakMax, did not return actin-related terms. GREAT results for SRF are provided in Appendix D, Figure D.8.  The GREAT analysis with the NFE2L2 whole dataset returned two terms related to oxidative stress terms at ranks 5 and 7 in the list (“response to reactive oxygen species” and “response to oxidative stress”). The subset of HADB selected peaks returned with the same two terms at the top of the list, but with greater Binomial Region Set Coverage values for both terms. The 3rd term in the list was “response to hydrogen peroxide”, which is related to oxidative stress. GREAT results for NFE2L2 are provided in Appendix D, Figure D.9.     160 Appendix C   This appendix contains the Supplemental methods for the BiasAway background generating tool of Chapter 2.  Background 1 – Mononucleotide shuffling generator BiasAway permutes the mononucleotides of the target sequences, thus keeping the nucleotide composition of the original sequence.   Background 2 – Dinucleotide shuffling generator BiasAway permutes the dinucleotides of the target sequences, thus keeping the nucleotide composition of the original sequence. The dinucleotide shuffle is performed by the Altschul-Erickson permutation algorithm [156], for which we adapted the Python implementation of the algorithm written by Dr. Peter Clote [157].  Background 3 –  GC Mononucleotide composition matched background selector This option of BiasAway requires as input both the target sequences and a large set of potential background sequences. The GC nucleotide composition of each target sequence is computed and sequences are assigned to bins in steps of 1% GC. The same procedure is applied to the background pool of sequences, using a separate set of 1% GC bins. Then for each target sequence in a given GC bin, BiasAway randomly selects a background sequence from the equivalent background 1% GC bin.  Background 4 – Mononucleotide shuffling within a sliding window generator For each sequence in the input set of target sequences, BiasAway generates a background sequence by shuffling the mononucleotides within a sliding window. Formally, it slides a window   161 of length W (default 100bp) with a step S (default 1bp) along the sequence and, within each window, shuffling the sub-sequence within the window.   Background 5 – Dinucleotide shuffling within a sliding window generator For each sequence in the input set of target sequences, BiasAway generates a background sequence by shuffling the dinucleotides within a sliding window. Formally, it slides a window of length W (default 100bp) with a step S (default 1bp) along the sequence and, within each window, uses the Altshul-Erickson algorithm, mentioned above, to shuffle the sub-sequence within the window.   Background 6 –  GC Mononucleotide composition matched background within a sliding window background selector For this background method, BiasAway again requires as input both the target sequences and a large set of potential background sequences. For each sequence in the set of target sequences, BiasAway slides a window of length W (default 100bp) with a step S (default 1bp) to compute the corresponding %GC composition of the sequence within each window position. From the distribution of all %GC computed from the windows, the minimum NT, the maximum XT, the standard deviation DT, and the coefficient of variation VT, are computed for each target sequence. Each sequence is assigned to a GC bin (bins are in increments of 1% GC) along with the four meta values (NT, XT, DT, and VT) of each sequence. Then the average bA, and standard deviation bD of the distribution of each of the meta values, NT, XT, DT, and VT, are calculated independently for each bin of target sequences. The bAN,X,D,V and bDN,X,D,V values are associated with the given bin. The pool of background sequences are then sorted into background 1% GC bins by evaluating the four background sequence meta values (NB, XB, DB,   162 and VB) against the bAN,X,D,V and bDN,X,D,V values of the target GC bins; if a background sequence does not satisfy the following criteria relative to one of the target bins it is removed:  bAJ  - cov * bDJ  <= val <= bAJ + cov * bDJ   where J={N , X, D, V}, cov is a coefficient of variation given as a parameter (default 2.6), val corresponds to a background sequence meta value {NB, XB, DB, VB} for the background sequence being analyzed, and bAJ is the average (respecting standard deviation, bDJ) of the corresponding target meta value, for the given GC bin of the target sequences (e.g. to evaluate val=NB  of a background sequence, we use bAN and bDN of the target bins). Finally, for each target sequence within a given GC bin B (where B is the % GC name of the bin), BiasAway randomly selects a background sequence from the equivalent background GC bin (e.g. if Y target sequences are in the 60% GC bin, then BiasAway randomly selects Y sequences from the 60% GC background bin).     163 Appendix D   This appendix contains the Supplemental Figures D.1 to D.10 for Chapter 2.   Figure D.1 Nucleotide composition is variable between ChIP-seq datasets. (A) The y-axis presents the GC content of ChIP-seq datasets (x-axis) generated from the K562 cell line; one dataset per TF. (B) The GC content (y-axis) of datasets from multiple samples (source cell lines indicated along the x-axis) for the JUN-D TF.   164  Figure D.2 The multiplicity of predicted TFBS motifs in ChIPped sequences corresponds to the multiplicity +1 of control sequences. The number of peaks with a given multiplicity are plotted on the x-axis and the mean GC composition of the peaks is on the y-axis. ‘X’ is the motif multiplicity of the controls, and ‘X+1’ is the motif multiplicity of the ChIP-seq peaks. (A) C/EBPB ChIP-seq sequences (black) and control sequences matching the average GC composition of the ChIP-seq sequences (red). (B) AP2γ ChIP-seq sequences (black) and control sequences matching the average GC composition of the ChIP-seq sequences (red).     165  Figure D.3 Binding site over-representation results using the ASAP tool. The CB-plots present the PFM GC composition on the x-axis and the ASAP over-representation score on the y-axis. The top 5 over-represented TF profiles’ names are written on the plot; the name of the ChIPped TF or related TF is highlighted in magenta, and the logos are shown in (A) E2F1 and E2F4, (F) JUN-family (JUN, JUN-D, AP1, and FOSL2), and (K) C/EBPA and CEBP/B. (B)-(E) E2F1 ChIP-seq. (G)-(J) JUN-B ChIP-seq. (L)-(O) C/EBPB ChIP-seq. The first CB-plot for each of the 3 TFs (B), (G), (L) are results using a random background selected from a pool of uniquely mappable sequences. The second CB-plot for each of the 3 TFs (C), (H), (M) are results using a background generated by a 3rd order Markov model. The third CB-plot for each of the 3 TFs (D), (I), (N) are results using dinucleotide shuffled target sequences as background. The last   166 CB-plot for each of the 3 TFs (E), (J), (O) are results using background sequences from the mappable dataset, matched to the GC composition distribution of the target sequences.     167  Figure D.4 Background impact on over-representation analyses for 400bp datasets. (A) For each background, the fraction of the 43 analyses that reported the ChIPped TF in the top 5 over-represented PWMs from a particular background (x-axis) is plotted against the average skew of the over-representation results for each background’s 43 analyses. Skew is the negative slope of the line fitted to the over-representation scores versus PFM GC content (i.e. values visualized by Figure 1a axes). The ideal is to have a large x-axis value (vertical dashed line) and an average skew of zero (horizontal dashed line). (B) and (C) summarize the standard deviation (y-axis) and mean (x-axis) of the ‘non-outlier’ oPOSSUM over-representation scores for 10 backgrounds against each of 43 ChIP-seq datasets, where panel (B) displays the average value for each background across the 43 datasets and panel (C) displays the individual value of 430 analyses. The ideal result would be situated at the origin (the intersection of the dashed lines. For all panels, each of the 10 backgrounds tested is denoted as a single colour: Light green circle – randomly chosen background from the dataset of mappable sequences, dark green cross – randomly chosen background from the dataset of DNase accessible sequences, orange circle – mononucleotide shuffled background, brown cross – mononucleotide shuffled background within a sliding window, black circle – dinucleotide shuffled background, gray cross – dinucleotide shuffled background within a sliding window, magenta triangle – 3rd order Markov model generated background sequences, blue circle – background selected from the mappable sequences dataset to match the GC composition of the target   168 sequences, light blue cross – background selected from the mappable sequences dataset to match the distribution of GC composition in sliding windows of the target sequences, and red  triangle – GC background from HOMER 2.  169    170 Figure D.5 Background selection can correct the over-representation score bias towards GC-rich or AT-rich TFBSs in motif over-representation analyses. The results of over-representation analyses for an E2F1 ChIP-seq dataset using six distinct backgrounds (one background per plot). The names of the 5 top ranked TF PWMs are written on the plot. The horizontal line is set at over-representation score 100 as a visual reference point. Points corresponding to E2F1 and E2F4 motifs are highlighted in pink. The dotted line at over-representation score 100 is for visual reference. The sequence logos are E2F1 and E2F4 respectively. (A) Randomly chosen background from a pool of DNase accessible sequences. (B) Randomly generated background sequences based a 3rd order Markov model. (C) A background of dinucleotide shuffled target sequences. (D) Selected regions from the mappable sequence dataset matching the GC composition distribution of the target sequence set. (E) Sliding windows of dinucleotide shuffled target sequence. (F) Genomic sequences matched in windows of internal GC composition for each target sequence.   171  Figure D.6 Zones of motif enrichment defined around the peakMax of mouse ChIP-seq datasets vary per TF. (A) Zones of PWM motif enrichment defined by a heuristic enrichment threshold for mouse datasets. The average width of the motif enrichment zone around the peakMax for TF’s datasets are plotted on the y-axis; the differences between all widths, for all of a TF’s datasets, were averaged and plotted on the y-axis as vertical bars. The datasets are along the x-axis. The red horizontal line is the mean width of enrichment. (B) The proportion of peaks within the motif enrichment zone for a TF’s set of ChIP-seq datasets were averaged. The x-axis provides, for each of 39 TFs, the mean proportion of peaks with a motif scoring above the motif score threshold and located within the zone of enrichment (mean 0.65, median 0.72).     172  Figure D.7 Defining a PWM’s lower bound of motif score enrichment within the heuristic enrichment zone. (A) A visual depiction of the motif score lower threshold determined with our heuristic procedure. The x-axis indicates the upper bound of bins of PWM motif scores for an NFYB ChIP-seq dataset. The bins of motif scores are in steps of 1 score point (e.g. bin80 is 79<scores<=80). The y-axis is the count of peaks in each bin. The black dotted line depicts scores for peaks with top-scoring motifs within the enrichment zone around the peakMax, while the red dotted line depicts scores for peaks with top-scoring motifs from distal positions. The vertical blue line depicts the threshold for motif score enrichment relative to the background. (B) The TFBS-landscape view for the NFYB dataset. The x-axis is the distance of the top scoring motif to the peakMax and the y-axis is the motif score. The blue line is the calculated motif score threshold. (C) and (D) The mean motif score threshold of multiple datasets. The motif score   173 thresholds for a TF’s multiple datasets were averaged and plotted on the y-axis, with vertical lines showing the average differences between the thresholds. The x-axis is the TF. The red horizontal line is the mean. Left (C) is human data, and right (D) is mouse data     174  Figure D.8 GREAT analysis results on SRF ChIP-seq data. GREAT results from the analyses of three sets of SRF peaks. (A) All peaks in the SRF dataset. (B) The subset of peaks identified   175 by the HADB method to have an SRF motif proximal to the peakMax. The red box highlights the actin related GO term. (C) The subset of peaks that do not have an SRF motif proximal to the peakMax.    176  Figure D.9. GREAT analysis results on NFE2L2 ChIP-seq data. GREAT results from the analyses of three sets of NFE2L2 peaks. (A) All peaks in the NFE2L2 dataset. (B) The subset of peaks identified by the HADB method to have an NFE2L2 motif proximal to the peakMax. The red box highlights the oxidative stress related GO terms. (C) The subset of peaks that do not have an NFE2L2 motif proximal to the peakMax.   177    178 Figure D.10 TFBS-landscape view of four PWMs in a TBP ChIP-seq dataset from mouse MEL cell-line. TFBS-landscape views are shown for four PWM’s on a TBP dataset. The left-side plots present the top scoring motif distance to the peakMax on the x-axis, and the motif score on the y-axis. The right-side plots present a histogram of motif distances: black – 2bp resolution of the top scoring motif distance per peak, green – 5bp resolution of the distances for the top scoring motifs with a score equal to or higher than 85. Sequence logos indicate the profile used to scan the TBP peaks: (A) ZNF143_b PWM. (B) NF2F1 PWM. (C) CTCF PWM. (D) FOXD3 PWM. The data represent a subset of TF profiles depicted in Figure 7b.    179 Appendix E   This appendix is Table E.1 of oPOSSUM over-representation ranks for the ChIPped TFs’ motif in Chapter 2.  Table E.1  Rankings of the ChIPped TFs’ binding profile, from over-representation analyses with 10 different backgrounds  Sequence backgrounds tested in TFBS over-representation analyses TF Random mappable genome Random DNase  Mono-nucleotide shuffle Mono-nucleotide shuffled window  Di-nucleotide shuffle Di-nucleotide shuffled window 3rd order Markov model GC matched GC matched window HOMER GC matched Average skew 617 340 118 119 41 43 -15 -18 8 13 AP2A 2 2 5 2 2 2 3 3 2 3 AP2G 1 1 2 4 1 2 1 1 1 2 ATF3 5 3 10 9 9 10 12 4 7 5 BHLHE40 6 3 10 8 2 3 18 1 1 2 BRCA1 4 7 70 67 18 21 22 3 3 3 CEBPB 1 1 1 1 1 1 1 1 1 1 cFOS 11 13 13 13 4 8 6 7 6 13 cMYC 20 19 39 45 15 15 27 14 14 10 CTCF 2 2 2 2 2 2 2 2 2 2 E2F1 2 1 13 11 13 12 83 4 4 6 E2F4 6 5 11 9 7 6 11 6 3 6 E2F6 10 6 3 5 6 6 14 16 15 19 EbF1 1 1 1 1 2 2 3 1 1 1 ELK4 2 1 2 7 2 2 5 1 1 2 ERRA 10 5 8 6 11 5 9 8 8 8 FOS 1 1 5 5 1 1 1 1 1 5 GATA1 4 6 7 7 8 6 9 4 4 6 GATA2 5 3 7 8 4 4 9 3 4 5 GATA3 10 9 6 7 7 9 21 10 13 11   180  Sequence backgrounds tested in TFBS over-representation analyses TF Random mappable genome Random DNase  Mono-nucleotide shuffle Mono-nucleotide shuffled window  Di-nucleotide shuffle Di-nucleotide shuffled window 3rd order Markov model GC matched GC matched window HOMER GC matched Average skew 617 340 118 119 41 43 -15 -18 8 13 HNF4A 2 2 4 3 3 4 4 2 2 2 SFf1 4 3 3 4 4 4 3 3 4 3 IRF1 112 119 140 151 159 161 105 124 141 114 IRF3 57 51 2 26 36 23 11 11 25 19 JUNB 1 1 5 5 1 1 1 1 1 5 JUND 3 1 7 8 1 1 4 1 1 7 JUN 5 3 9 8 3 4 5 3 2 6 MAX 9 5 9 10 9 9 18 9 11 10 MXI1 22 22 36 36 18 17 44 17 17 18 NFE2 11 5 6 6 11 13 16 13 11 12 NFKB 3 1 4 2 1 2 1 1 1 2 NFYA 4 3 4 5 5 3 8 2 2 3 NFYB 1 1 4 2 1 1 1 1 1 1 NRF1 3 1 3 4 5 5 4 1 3 3 PRDM1 1 1 3 2 1 3 2 1 1 4 RFX5 14 12 4 4 3 5 8 2 4 4 STAT1 2 1 4 5 2 1 3 1 2 1 STAT2 9 4 8 4 5 6 8 4 7 5 STAT3 120 58 22 25 48 39 161 78 105 31 TAL1 2 2 2 2 2 2 4 2 2 2 TBP 145 123 138 148 32 31 133 35 40 52 USF2 2 1 1 1 1 1 2 1 1 2 YY1 2 2 4 5 2 4 5 3 2 4 ZNF143 34 16 15 15 8 9 10 5 6 6 ZNF263 1 1 1 1 1 1 11 1 1 1 ZNF274 3 5 5 6 6 5 8 4 5 5    181 Listed are the ranks of the ChIPped TFs profile from 430 over-representation analyses for 43 datasets and 10 backgrounds. The tendency for some backgrounds to have a large bias towards TFs with GC-rich binding profiles is presented as the average skew for each background. A background with a large skew factor (>100) will favour TFs with GC-rich profiles.   182 Appendix F    F.1 Experimentally determined PWM score thresholds The motif relative score thresholds for position weight matrices (PWMs) obtained from human and mouse ChIP-seq datasets, as presented in Chapter 2, are provided as supplemental material (ubc_2015_february_hunt_christine_Appendix_F_F1.xls) in UBC’s cIRcle repository http://hdl.handle.net/2429/51447.  F.2 Position frequency matrices The position frequency matrices (PFMs) used in Chapter 2 and Chapter 3 are provided as supplemental material (ubc_2015_february_hunt_christine_Appendix_F_F2.txt) in UBC’s cIRcle repository http://hdl.handle.net/2429/51447.    183 Appendix G   This appendix contains Supplemental Figures G.1 to G.10 for Chapter 3.   Figure G.1. Zinger motifs are enriched across multiple mouse ChIP-seq datasets. (A) The histogram displays the results of TFBS motif enrichment analysis on 81 mouse ChIP-seq datasets generated with the oPOSSUM 3.0 software. Along the x-axis is the fraction of datasets that displayed enrichment near the peakMax for a TF profile. The y-axis is the number of TF profiles that were found enriched for a given fraction of datasets. The profiles most frequently   184 observed to be enriched are labeled on the histogram. (B) The binding site logos of the nine TF binding models with enriched motifs across the greatest number of datasets, manually grouped by motif similarity. Each logo depicts position along the x-axis and information content (that is, pattern strength) along the y-axis.    185  Figure G.2. Zinger motifs are enriched across multiple human datasets after masking the ChIPped TF’s motif. (A) The histogram displays the results of TFBS motif enrichment analysis on 281 human ChIP-seq datasets in which the ChIPped TFs motifs were masked. Results were generated with the oPOSSUM 3.0 software. Along the x-axis is the fraction of datasets that displayed enrichment for a TF profile. The y-axis is the number of TF profiles that were found enriched near the peakMax for a given fraction of datasets. The profiles most frequently observed to be enriched are labeled on the histogram. (B) The binding site logos of the TF binding models with enriched motifs across the greatest number of datasets, manually grouped by motif similarity. Each logo depicts position along the x-axis and information content (that is, pattern strength) along the y-axis.   186  Figure G.3. DNaseI-seq and Faire-seq datasets are enriched for zinger motifs. The histograms display the results of TFBS motif enrichment analysis on (A) DNaseI-seq datasets and (B)   187 Faire-seq datasets. Results were generated with the oPOSSUM 3.0 software. Along the x-axis is the fraction of datasets that displayed enrichment for a TF profile. The y-axis is the number of TF profiles that were found enriched for a given fraction of datasets. The profiles most frequently observed to be enriched are labeled on the histogram. (C) The binding site logos of the TF binding models with enriched motifs across the greatest number of either DNaseI-seq or Faire-seq datasets. The logos are manually grouped by motif similarity, except for the bottom row. Each logo depicts position along the x-axis and information content (that is, pattern strength) along the y-axis.   188  Figure G.4. ChIP-seq datasets for non-sequence-specific proteins are enriched for zinger motifs. The enrichment plots display the location of the top scoring motif for each peak relative to the peakMax (the peakMax is at 0) on the x-axis, while the score of the motif is plotted on the y-axis. The adjacent line plots display the fraction of motifs observed in 5 bp increments. The logo reflecting the binding specificity for each zinger appears above the related enrichment plot. (A) CTCF motif predictions on ChIP-seq data for WHIP, a helicase interacting protein. (B) JUN motif predictions on ChIP-seq data for p300, a histone acetyltransferase. (C) GABPA motif predictions on ChIP-seq data for CCNT2, a cyclin regulator of CDK9 kinase. (D) THAP11 motif predictions on ChIP-seq data for CHD2, a chromodomain helicase.    189  Figure G.5. Input and mock-IP control data are enriched for zinger motifs. The enrichment plots display the location of the top scoring CTCF motif for each peak relative to the peakMax (the peakMax is at 0) on the x-axis, while the score of the motif is plotted on the y-axis. The adjacent line plots display the fraction of CTCF motifs observed in 5 bp increments. The logo reflecting the binding specificity for CTCF appears above the related enrichment plot. (A) Input regions from the HUVEC cell line. (B) IgG rabbit mock-IP regions from GM12878 cells.   190  Figure G.6. Shuffled zinger PWMs are not enriched proximal to the peakMax. The enrichment plots display the location of the top scoring motif for each peak relative to the peakMax (the peakMax is at 0) on the x-axis, while the score of the motif is plotted on the y-axis. The adjacent line plots display the fraction of motifs observed in 5 bp increments. The logo reflecting the binding specificity for each zinger appears above the related enrichment plot. (A) Enrichment of CTCF motifs on the NRF1 (GM12878) dataset. (B) Enrichment of shuffled-CTCF motifs on the same NRF1 (GM12878) dataset. (C) Enrichment of JUN motifs on the TCF7L2 (Hct116) dataset. (D) Enrichment of a shuffled-JUN motif on the same TCF7L2 (Hct116) dataset.   191  Figure G.7. Untreated STAT1 ChIP-seq data show strong zinger motif enrichment and not STAT1 motif enrichment. The enrichment plots display the location of the top scoring motif for each peak relative to the peakMax (the peakMax is at 0) on the x-axis, while the score of the motif is plotted on the y-axis. The adjacent line plots display the fraction of motifs observed in 5 bp increments. The logo reflecting the binding specificity for each zinger appears above the related enrichment plot. (A) STAT1 motif predictions on STAT1 ChIP-seq from untreated GM12878 cells. No STAT1 motif. (B) CTCF motif predictions on STAT1 ChIP-seq from untreated GM12878 cells. (C) STAT1 motif predictions on STAT1 ChIP-seq from IFNγ treated HeLa cells. STAT1 motif is present. (D) CTCF motif predictions on STAT1 ChIP-seq form IFNγ treated HeLa cells.   192  Figure G.8. The distribution of zinger motif content varies across ChIP-seq datasets. (A, B) For those datasets with at least a 1% zinger component, the histograms present the distribution of observed zinger motif peak content. The x-axis reports the proportion of zinger motif peaks within an analyzed dataset, and the y-axis the frequency of such observations. The black   193 vertical dashed line represents the mean, the blue vertical dashed line represents the median, and the red vertical dashed line represents the point where two-thirds of the datasets are to the right of the line. The asterisk indicates the maximum zinger proportion, excluding outliers. (A) Analysis performed on entire ChIP-seq datasets. (B) Analysis on the set of peaks unaccounted for by the ChIPped TF motif. (C) A heatmap of the individual zingers’ motif peaks log2 fold enrichment in the set of peaks unaccounted for by the ChIPped TF and with a strong motif score (score 85 or greater). Fold enrichment less than 1.5 is grey. The rows are individual datasets, the columns are the zingers. (D) A scatterplot of the proportions of zinger motif peaks (y-axis) and ChIPped TF motif peaks (x-axis) in each dataset.   194  Figure G.9. The proportion of a dataset with zinger motifs is not dependent on cell-line nor the ChIPped TF. (A) The x-axis is the proportion of datasets composed of zinger motif peaks. The y-axis is a density value reflecting the fraction of datasets with zinger motifs. The five cell lines are K562 (black), GM12878 (blue), HeLa (red), H1-hESC (green), and HepG2 (magenta). There are no significant differences between the distributions per Wilcoxon test P values. (B) The TFs analyzed are listed on the horizontal access. The y-axis is the maximum difference of zinger proportions observed between two ChIP-seq datasets for the same TF.   195  Figure G.10. Zinger motifs and zinger motif peaks are not strongly correlated. (A) A heatmap of significance for inter-dependence between pairs of zinger motifs in zinger motif peaks. Positive associations with a significant Fisher exact P value (P value <0.001) are yellow, negative associations with a significant Fisher exact P value are red, and non-significant P values are grey. The colour density reflects P value significance, with the densest colours being P values closest to 0. The columns are individual datasets; the rows are the six possible zinger pairs. (B) A correlation matrix presenting both Spearman’s rank (lower diagonal) and Pearson (upper diagonal) correlation coefficients for the pairwise association of zinger motif peak enrichment within the same ChIP-seq dataset  196 Appendix H   This appendix contains Supplemental analyses for the zinger motifs and zinger motif peaks of Chapter 3.  H.1 Open chromatin data and non-TF ChIP-seq data have zinger enrichment As stated in the main text we found that zinger motifs are enriched in DNaseI-seq and Faire-seq datasets. We extended the DNaseI-seq and Faire-seq regions from 150bp (on average) to 1001 bp to assess the proportion of regions within a dataset containing a zinger motif near the ~150bp region. After accounting for background rates of motif prediction, up to 67% of DNaseI-seq regions contain a zinger, with a mean of 47% (median 48% with a median absolute deviation (MAD) of 6 pp (percent points)). The Faire-seq regions had less zinger representation with a maximum of 32% of peaks and mean 13% (median 10% with a MAD of 7 pp).  As zinger motifs were enriched across numerous TF ChIP-seq datasets, and zinger motifs also have a strong presence in open chromatin regions, we wondered to what extent zinger motifs are present in any ChIP-seq based data. We assessed 31 datasets (18 proteins) for non-sequence specific proteins (histone modifiers, general TFs, elongation factors, co-activators etc.) and found zinger motifs enriched in half of these datasets.  WHIP, a helicase interacting protein, CCNT2, a cyclin-T2 protein, CHD2, a helicase, and p300, a histone acetyl transferase, are examples of datasets displaying zinger motif enrichment proximal to the peakMax (Appendix G, Figure G.4).   Control data such as sheared DNA or mock immunoprecipitated DNA (mock-IP) for ChIP-seq is known to be enriched for open chromatin regions [22, 121, 158]. We assessed the enrichment of zinger motifs in the top ranking 70,000 peaks of 40 ChIP-seq controls, both Input and mock-  197 IP.  While all zingers were enriched in a subset of controls (GABPA in 26%, THAP11 in 16%, and JUN in 11%), only CTCF showed a strong and consistent pattern of enrichment (enriched in 66% of controls) (Appendix G, Figure G.5).  H.2 Extreme cases of datasets lacking the ChIPped TF’s motif      There are extreme cases in which the ChIPped TF’s binding motif occurs in less than 10% of the dataset and more than 30% of the dataset contains a zinger motif e.g. BRCA1 (HepG2), ZNF143 (GM12878, HeLa, K562), and unstimulated STAT1 (GM12878). The STAT1 (GM12878) data is notable. According to the literature, unstimulated STAT1 can act as a constitutive TF that may interact weakly with DNA or other proteins [159]. There were 49,261 peaks reported in the STAT1 unstimulated dataset, but the dataset was not enriched for weak sites of either the full or half canonical motif. At least 30% (after adjusting for background) of the peaks in the dataset are zinger motif peaks, primarily the CTCF motif (Appendix G, Figure G.7A-7B). In contrast, as a positive control, when STAT1 (HeLa) is stimulated with IFNγ, we observe clear enrichment for the STAT1 motif near the peakMax and little of the zinger motifs (Appendix G, Figure G.7C-7D).   H.3 Average background expectation of motif predictions Determining the overall background expectation for the frequency of peaks with central motifs for either the ChIPped TF or the zingers can be assessed by considering the motif peak frequencies in the flanking regions of sequences, distal to the enrichment around the peakMax. The analysis was performed for all datasets displaying central enrichment for the assessed motif. For the ChIPped TF motif peaks, we estimate that, on average, 10% (median 7% with a median absolute deviation (MAD) of 9 percent points (pp)) of the cases arise by chance (~6% of the average ChIP-seq experiment).  Likewise, 36% (median 40% with a MAD of 11 pp) of the   198 enriched zinger motif peaks are expected to be present by chance (~7% of the average ChIP-seq experiment). Within the main text, each background expectation was determined specific to the datasets analyzed prior to reporting background corrected estimates of enrichment.  As indicated in the main text, we selected the subset of peaks with motifs that had a strong match to the PWM (motif score of 85 or greater). The background expectation using such high scoring motifs for the ChIPped TF dropped to a mean of 8% (median 5% with a MAD of 8 pp) of the predictions arising by chance. For the zinger motif peaks the mean dropped to 25% (median 21% with a MAD of 15 pp) of the predictions potentially arising by chance.   H.4 Zinger motif attributed peaks’ proximity to genomic features  Using the zinger motif peaks, we assessed their proximity to promoter-related features such as the transcription start site (TSS), CpG islands, and conserved regions, in addition to repeat sequence regions. We evaluated the proportion of peaks within 500bp, 1kb, 2kb, or 5kb of the TSS, or within 500bp for the other listed features, relative to the proportion of peaks more distant and counted those datasets meeting a significance threshold (Fisher’s exact test p-values <0.001). For the majority of features, the number of datasets with the zinger motif peaks statistically further from the genomic feature than the ChIPped TF motif peaks was not striking. However, there were a few exceptions:  68% and 75% of datasets enriched for GABPA or THAP11 zinger motifs, respectively, displayed a tendency for the zinger motif peaks to be closer to the TSS (or CpG islands) than ChIPped TF motif peaks (Fisher test p-value <4e-05), while the reverse was true for JUN zinger motif enriched datasets (69% of datasets; Fisher test p-value <3e-05). When we examined GABPA and JUN ChIP-seq data we found that a greater proportion of GABPA ChIP-seq peaks are proximal to a TSS than are JUN ChIP-seq peaks (64% versus 17% relatively). Thus the zinger motif peaks show positional enrichment consistent   199 with the enrichment observed for the ChIP-seq experiments specifically targeting the zinger TFs.  This observation is supportive of the zinger motif peaks being bona fide targets of the zinger TFs.   H.5 Zinger motif peaks may be bona fide targets of the zinger TF As indicated in the main text, we found good agreement between the zinger CTCF peaks with a strong motif score (score > 85) and CTCF ChIP-seq peaks from the same cell line (Figure 3.4A) – on average 75% of zinger CTCF peaks overlap CTCF ChIP-seq peaks (median 79%; MAD 15 pp). As control, we assessed those peaks with the CTCF motif distal to the peakMax (see Chapter 3 Methods). On average, only 10% of the peakMax for the distal CTCF motifs were within 100bp of a CTCF ChIP-seq peakMax (median 9%; MAD 4 pp). The zinger CTCF motif peaks are significantly more likely (Wilcoxon one-tailed test p-values 5.5<e-50) to overlap CTCF ChIP-seq than distal-zinger CTCF peaks.   We evaluated JUN zinger motifs (Figure 3.4B) and GABPA zinger motifs (Figure 3.4C), in the same manner, using these two zinger motifs as proxy for the two families of motifs. Neither JUN nor GABPA ChIP-seq datasets and PWMs produce the same sharp delineation between high scoring peaks that are peakMax proximal compared to distal as occurs with CTCF (Figure 3.2A-C).  We therefore expected a higher false positive rate and reduced agreement between the zinger motif peaks and the JUN or GABPA ChIP-seq datasets.   On average 38% of zinger JUN peaks are within 100bp of a JUN ChIP-seq peak within the same cell type (median 38%; MAD 17 pp), while only 6% of distal-zinger JUN peaks (distal to the peakMax) are in such agreement with JUN ChIP-seq peaks (median 4%; MAD 4 pp) (Figure 3.4B). The proportion of peaks with a JUN zinger motif proximal to the peakMax that agrees   200 with a JUN ChIP-seq peak is significantly more than the distal-zinger JUN peaks (Wilcoxon one-tailed test p-value 3.4<e-20).   Although GABPA zinger peaks have higher background motif rates compared to CTCF or JUN, the overlap of GABPA zinger motif containing peaks with the GABPA ChIP-seq peaks is still significantly higher than the overlap observed for the distal-zinger motif containing regions (Wilcoxon one-tailed test p-value <1.1e-30) (Figure 3.4C). On average, 28% of GABPA zinger peaks in a dataset agree with GABPA ChIP-seq peaks from the same cell type (median 27%; MAD 13 pp), while there is only 8% average agreement for the non-zinger peaks (median 5%; MAD 5 pp).   In all datasets the zinger motif peak proportion of agreement with a ChIP-seq dataset was always greater than the distal-zinger proportion of agreement.   201 Appendix I   The coordinates of the zinger neighbourhoods from Chapter 3 are provided as supplemental material (ubc_2015_february_hunt_christine_Appendix_I.txt) in UBC’s cIRcle repository http://hdl.handle.net/2429/51447.    202 Appendix J   This appendix is  Table J.1 presenting the top twenty motifs from an over-representation analysis on HOT regions, for Chapter 3.  Table J.1 The top 20 motifs over-represented in HOT regions. Enriched TF motif TF matrix Information content Fisher log score Fisher p-value Zinger motif CTCF 17.205 -Inf 0 CTCF-like CTCFL 15.983 478.563 1.4e-208 CTCF-like CTCF::SMC3 15.886 456.433 5.9e-199 CTCF-like CTCF::RAD21 14.398 239.965 6.1e-105 CTCF-like HLF 11.147 171.034 5.3e-75  CEBPB 12.334 142.261 1.6e-62 JUN-like JUND 14.840 128.454 1.6e-56 JUN-like ELK4 13.433 122.643 5.5e-54 ETS-like IRF1-1 16.008 120.295 5.7e-53  GABPA 13.821 111.216 5.0e-49 ETS-like PRRX2 9.063 105.924 9.9e-47  HIF1A::ARNT 9.740 102.951 1.9e-45  ARNT 10.992 101.559 7.8e-45  JUN 13.308 101.034 1.3e-44 JUN-like E2F4 11.346 95.011 5.5e-42  MIZF 13.197 89.491 1.4e-39  THAP11::SIX5 23.317 81.275 5.0e-36 THAP11 BATF 13.028 79.320 3.6e-35 JUN-like FOSL2 13.999 78.750 6.3e-35 JUN-like CREB1 10.139 73.909 8.0e-33 JUN-like      203 Appendix K   This appendix contains the Supplemental figures for Chapter 4.   Figure K.1. The determination of a biologically-based threshold for specific vs. non-specific TF-DNA interaction. The standardized distribution of binding energies of twelve TFs on multiple background sequences located proximal to transcription start sites are plotted (gray); in the right tail of the backgrounds are the distributions of binding energies for twelve TFs’ known binding sites (multiple colours). A) An example of one TF, MEF2A, it’s known binding sites, and one background sequence. The distribution of binding energies for sequences known to be bound by MEF2A (black) is in the right tail of the energy distribution for a randomly chosen promoter   204 sequence (gray). B) Energy distributions of twelve TFs known binding sites (multiple colours) are all in the right tail of the background energy distributions (gray). The vertical dashed line is at Z-score=3 which was chosen as representative for the left tail of most TFs energy distributions (i.e. the left tail are binding sites with a low similarity to the consensus binding site).      205  Figure K.2. FOXA2 occupancy counts on a single 10kb sequence (chr10:91152983-91162983) using the classical PWM approach for binding site prediction and thresholding on a TF-specific motif score of 82. The left plot presents 200bp window bound occupancy counts (y-axis) greater than 0 (blue ovals), and the coordinate location of the region simulated on (x-axis). The vertical red bars highlight the windows that overlap FOXA2 ChIP-seq regions. The right-side plot is a ROC curve, with true positive rate on the y-axis and false positive rate on the x-axis.     206  Figure K.3. The AUC of simulations without chromatin accessibility data for recurrent regions is in close agreement with the AUC determined from PWM-derived occupancy predictions.  Each   207 plot presents the AUC from the classical PWM model thresholded at a motif score specific to each TF (Chapter 4 Methods) on the x-axis, and the AUC from the simulations on the y-axis. The number in the top left of each plot is the Pearson correlation coefficient (r), and the number in the bottom right corner is the Wilcoxon test P value. (A) SRF (GM12878). (B) FOXA2 (HepG2). (C) REST (H1 hESC). (D) REST (HepG2). (E) JUND (HepG2). (F) JUND (K562). (G) POU5F1 (H1 hESC).     208  Figure K.4. The AUC of simulations without chromatin accessibility data for non-recurrent regions is in close agreement with the AUC determined from PWM-derived occupancy   209 predictions.  Each plot presents the AUC from the classical PWM model thresholded at a motif score specific to each TF (Chapter 4 Methods) on the x-axis, and the AUC from the simulations on the y-axis. The number in the top left of each plot is the Pearson correlation coefficient (r), and the number in the bottom right corner is the Wilcoxon test P value. (A) SRF (GM12878). (B) FOXA2 (HepG2). (C) REST (H1 hESC). (D) REST (HepG2). (E) JUND (HepG2). (F) JUND (K562). (G) POU5F1 (H1 hESC).   210 Appendix L   This appendix contains the Supplemental tables for Chapter 4.  Table L.1. Wilcoxon test P values from comparison of the AUC values for simulations using only bindng energies on recurrent regions relative to the classical PWM AUC values Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) classical PWM specific threshold 0.00065 0.188 0.179 0.265 0.066 0.854 0.819 classical PWM single stringent threshold 3.850e-05 1.08e-10 9.058e-09 0.843 0.287 0.346 0.777 Cells highlighted in grey indicate that the simulations’ AUC values are not significantly different than the AUC values using the classical PWM method with a TF-specific threshold or a single threshold for all TFs (Wilcoxon test significance threshold P <0.001). The remaining cells are where the simulator AUC values tend to be higher than classical PWM AUC values.    211 Table L.2. Wilcoxon one-tailed test P values from comparison of the AUC values for simulations on recurrent regions using accessibility data relative to simulations without accessibility data.  Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) DNaseI-seq 2.334e-08 3.653e-05 0.384 01.48e-03 1.760e-09 4.524e-09 1.061e-12 Faire-seq 4.743e-06 0.999 1 0.374 1.674e-05 2.996e-10 6.039e-12 DNaseI-seq and Faire-seq 3.516e-09 4.975e-06 0.083 3.653e-05 9.769e-10 3.803e-14 5.518e-23 H3K4me1 7.107e-05 0.989 1 0.037 0.017 0.014 9.46e-09 H3K4me3 2.971e-05 3.229e-04 0.734 0.049 8.039e-06 3.158e-05 2.734e-09 H3K4me1 and H3K4me3 5.060e-07 1.757e-04 0.312 3.49e-03 2.290e-05 6.781e-08 1.930e-10 Cells highlighted in grey indicate that the simulations’ AUC results are greater (Wilcoxon one-tailed significance threshold P <0.001) than results from simulations lacking chromatin accessibility data. Values in bold indicate which data yields the greatest mean AUC value in each TF column and coincidently the lowest P value.     212 Table L.3. Wilcoxon one-tailed test P values from comparison of the AUC values from simulations on non-recurrent regions using accessibility data relative to simulations without accessibility data. Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) DNaseI-seq 1 1 1 1 0.914 1.070e-05 1 Faire-seq 1 1 1 1 0.995 3.582e-06 1 DNaseI-seq and Faire-seq 1 1 1 1 0.881 6.181e-11 0.967 H3K4me1 1 1 1 1 0.981 2.853e-04 2.036e-09 H3K4me3 1 1 1 1 1 0.913 0.998 H3K4me1 and H3K4me3 1 1 1 1 0.984 7.457e-05 7.922e-11 Cells highlighted in grey indicate which chromatin accessibility data improves the AUC results for a given TF (Wilcoxon one-tailed significance threshold P <0.001) relative to simulations lacking chromatin accessibility data. Values in bold indicate which data yields the greatest mean AUC value in each TF column and coincidently the lowest P value.    213 Table L.4. Wilcoxon test P values from comparison of the AUC values for simulations using only bindng energies on non-recurrent regions relative to the classical PWM AUC values. Data SRF (GM12878) REST (H1-hESC) REST (HepG2) JUND (HepG2) JUND (K562) POU5F1 (H1-hESC) FOXA2 (HepG2) classical PWM specific threshold 0.423 0.606 0.052 0.838 0.897 0.0.590 0.429 classical PWM single stringent threshold 0.010 4.246e-06 1.836e-06 0.770 0.620 0.622 0.125 Cells highlighted in grey indicate that the simulations’ AUC values are not significantly different than the AUC values using the classical PWM method with a TF-specific threshold or a single threshold for all TFs (Wilcoxon test significance threshold P <0.001). The remaining cells are where the simulator AUC values tend to be higher than classical PWM AUC values.  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0166083/manifest

Comment

Related Items