Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Mixture models for analysing high throughput sequencing data Zhang, Xuekui 2011

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2011_fall_zhang_xuekui.pdf [ 4.78MB ]
Metadata
JSON: 24-1.0105109.json
JSON-LD: 24-1.0105109-ld.json
RDF/XML (Pretty): 24-1.0105109-rdf.xml
RDF/JSON: 24-1.0105109-rdf.json
Turtle: 24-1.0105109-turtle.txt
N-Triples: 24-1.0105109-rdf-ntriples.txt
Original Record: 24-1.0105109-source.json
Full Text
24-1.0105109-fulltext.txt
Citation
24-1.0105109.ris

Full Text

Mixture Models for Analysing High Throughput Sequencing Data by Xuekui Zhang B. Mathematics, Nankai University, 2000 M. Mathematics, Nankai University, 2003 M. Statistics, The University of British Columbia, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics) The University Of British Columbia (Vancouver) July 2011 © Xuekui Zhang, 2011 Abstract The goal of my thesis is to develop methods and software for analysing high- throughput sequencing data, emphasizing sonicated ChIP-seq. For this goal, we developed a few variants of mixture models for genome-wide profiling of tran- scription factor binding sites and nucleosome positions. Our methods have been implemented into Bioconductor packages, which are freely available to other re- searchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it into a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS predicted binding sites were more consistent with computationally predicted binding motifs than the alternative methods. For motif discovery using transcription binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were con- sistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It com- pares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate functional and non-functional tran- ii scription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above uses single-end sequencing data. At the end of the thesis, we briefly the discuss the issue of processing paired-end data, which we are currently investigating. iii Preface This thesis was completed under the supervision of Dr. Raphael Gottardo, Dr. Kevin Murphy, and Dr. Gordon Robertson. Chapter 2 of this thesis is based on the paper “PICS: Probabilistic inference for ChIP-seq.” Biometrics, X. Zhang, G. Robertson, M. Krzywinski, K. Ning, A. Droit, S. Jones, and R. Gottardo, 2010 [131]. This research problem was identi- fied and designed by my thesis advisor Dr. Raphael Gottardo and by Dr. Gordon Robertson. As the first author, I developed the methodology, performed the analy- sis. I drafted the manuscript, Gordon Robertson and other coauthors helped me to revise it. Chapter 3 of this thesis is based on the paper “An integrated pipeline for the genome-wide analysis of transcription factor binding sites from chip-seq.” PLoS ONE , E. Mercier, A. Droit, L. Li, G. Robertson, X. Zhang and Raphael Gottardo, 2011 [89]. The paper introduced an analysis pipeline consist of three Bioconductor packages. As a coauthor, my major contribution was in developing one of the software packages for the pipeline. I was also involved as a helper in other related works such as revising the manuscript and conducting some data analysis. Chapter 4 of this thesis is based on paper “Probabilistic inference for nucle- osome positioning with mnase-based or sonicated short-read data.” under review for Genome Biology, X. Zhang, G. Robertson, S. Woo, B. G. Hoffman, and R. Gottardo, 2011 [132]. As the first author, with help of Raphael Gottardo, Gordon Robertson, and other coauthors, I identified and designed this research problem, developed the methodology, performed the analysis, and prepared the manuscript. iv Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Sonicated ChIP-seq Experiments . . . . . . . . . . . . . . 2 1.1.2 Other Experiment Protocols . . . . . . . . . . . . . . . . 5 1.1.3 Paired-end Sequencing Experiments . . . . . . . . . . . . 6 1.2 Existing Analysis Methods in the Literature . . . . . . . . . . . . 7 1.2.1 Analysis of ChIP-seq Data of Transcription Factor (TF)s . 7 1.2.2 Analysis Pipeline . . . . . . . . . . . . . . . . . . . . . . 9 1.2.3 Methods for Nucleosome Positioning from ChIP-seq or MNase-seq Data . . . . . . . . . . . . . . . . . . . . . . 12 1.2.4 Methods for Paired-End Sequencing Data . . . . . . . . . 13 1.3 Summary of Key Results in This Thesis . . . . . . . . . . . . . . 14 v 2 PICS: Probabilistic Inference for ChIP-seq Transcription Factor Data 15 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Data, Preprocessing, and Notation . . . . . . . . . . . . . . . . . 16 2.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.1 Modeling a Single Binding Event . . . . . . . . . . . . . 17 2.3.2 Modeling Multiple Binding Events . . . . . . . . . . . . 18 2.3.3 Prior Distributions . . . . . . . . . . . . . . . . . . . . . 20 2.3.4 Summary of Model and Priors . . . . . . . . . . . . . . . 20 2.3.5 Accounting for Missing Reads . . . . . . . . . . . . . . . 22 2.4 Estimation and Inference . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 Parameter Estimation Using the EM Algorithm . . . . . . 23 2.4.2 Inference and Detection of Binding Sites . . . . . . . . . 24 2.5 Derivation of the EM Algorithm . . . . . . . . . . . . . . . . . . 27 2.5.1 No Missing Reads . . . . . . . . . . . . . . . . . . . . . 27 2.5.2 Accounting for Missing Reads . . . . . . . . . . . . . . . 29 2.6 Application to Experimental Data . . . . . . . . . . . . . . . . . 31 2.7 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.7.1 Simulation Scheme . . . . . . . . . . . . . . . . . . . . . 37 2.7.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . 38 2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3 An Integrated Pipeline for the Genome-Wide Analysis of Transcrip- tion Factor Binding Sites from Chip-Seq . . . . . . . . . . . . . . . 43 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.1 Identification of Primary Motifs . . . . . . . . . . . . . . 46 3.2.2 Identification of Secondary Motifs . . . . . . . . . . . . . 47 3.2.3 Functional Annotation of Motifs and Modules . . . . . . . 49 3.2.4 Biological Significance of Modules . . . . . . . . . . . . 50 3.2.5 Comparison with Other Methods . . . . . . . . . . . . . . 52 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . 54 3.4.1 Peak Calling: PICS . . . . . . . . . . . . . . . . . . . . 55 vi 3.4.2 De novo Motif Discovery: rGADEM . . . . . . . . . . . . 56 3.4.3 Post-processing and Motif Interpretation: MotIV . . . . . 58 3.4.4 Software Availability and Architecture . . . . . . . . . . . 59 3.4.5 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4.6 Comparison to Other Methods . . . . . . . . . . . . . . . 60 4 Probabilistic Inference for Nucleosome Positioning with MNase or Sonicated Short-read Data . . . . . . . . . . . . . . . . . . . . . . . 63 4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.1 PING Model . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.2 Methods Comparison . . . . . . . . . . . . . . . . . . . . 67 4.2.3 Inferring Nucleosome Positioning with Sonicated Chip-Seq Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.4 Identifying Transcription Factor-Nucleosome Interactions in Mouse Islet Data . . . . . . . . . . . . . . . . . . . . . 71 4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.2 Filtering Duplicated Reads . . . . . . . . . . . . . . . . . 75 4.3.3 Segmentation of Candidate Regions . . . . . . . . . . . . 76 4.3.4 Parameter Estimation and Model Selection . . . . . . . . 76 4.3.5 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3.6 Choosing the Number of Nucleosomes in Each Region . . 77 4.3.7 Scores of Predicted Nucleosomes and False Discovery Rates 78 4.3.8 Calculating AUC Values . . . . . . . . . . . . . . . . . . 79 4.3.9 Model-based Nucleosome Positioning Profile . . . . . . . 79 4.3.10 Classification of Transcription Factor Binding Regions . . 80 4.3.11 Removing Nucleosomes that have Low Read Densities . . 80 4.3.12 Multiple Transcription Factor Binding Regions Associated with the Same Gene . . . . . . . . . . . . . . . . . . . . 81 4.3.13 Distribution of Pdx1 and Foxa2 Motifs around Monomodal Sites Identified Using Liver Nucleosome Positions . . . . 81 4.4 Difference between PING and PICS . . . . . . . . . . . . . . . . 82 vii 4.4.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 82 4.4.2 Post-process . . . . . . . . . . . . . . . . . . . . . . . . 82 4.4.3 GMRF Prior . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5 Derivation of the EM Algorithm . . . . . . . . . . . . . . . . . . 85 5 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . 95 5.1 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 95 5.2 Work In Progress - Paired-End Sequencing Data . . . . . . . . . . 97 5.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.3 The Model without Ambiguous Regions . . . . . . . . . . 98 5.2.4 Conditional Expectation of the Penalized Log Complete Data Likelihood . . . . . . . . . . . . . . . . . . . . . . 100 5.2.5 The Model with Ambiguous Regions . . . . . . . . . . . 103 5.2.6 Consider DNA Fragments with Only One End Aligned . . 107 5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.3.1 PE or SE? . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.3.2 DNA Methylation Data . . . . . . . . . . . . . . . . . . . 110 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 A Supplementary Material for Chapter Two . . . . . . . . . . . . . . 128 A.1 Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.2 Parameter Recalculation when Merging Binding Events . . . . . . 129 A.3 Application to Experimental Datasets . . . . . . . . . . . . . . . 129 A.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 129 B Supplementary Material for Chapter Three . . . . . . . . . . . . . 135 B.1 False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . . 135 B.2 Sequence Logo and Distributions for All Motifs . . . . . . . . . . 135 B.3 Pairwise Distances . . . . . . . . . . . . . . . . . . . . . . . . . 135 B.4 rGADEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 B.4.1 Choice of Expected Number of Occurrences MAXP . . . 140 B.4.2 Seeded / Unseeded Analysis . . . . . . . . . . . . . . . . 143 viii B.4.3 Motif Spatial Prior Probability . . . . . . . . . . . . . . . 147 B.4.4 Background Model . . . . . . . . . . . . . . . . . . . . . 149 B.4.5 Fold Enrichment Analysis . . . . . . . . . . . . . . . . . 149 B.5 Gene Ontology Analysis . . . . . . . . . . . . . . . . . . . . . . 150 B.6 Agreement between Motif Finders . . . . . . . . . . . . . . . . . 152 C Supplementary Material for Chapter Four . . . . . . . . . . . . . . 158 C.1 Post-processing PING Results . . . . . . . . . . . . . . . . . . . 158 C.1.1 Mismatched Peaks . . . . . . . . . . . . . . . . . . . . . 158 C.1.2 Missing Nucleosomes . . . . . . . . . . . . . . . . . . . 159 C.1.3 Duplicate Predictions . . . . . . . . . . . . . . . . . . . . 159 C.2 The Supplementary Figures and Table . . . . . . . . . . . . . . . 160 ix List of Tables Table 2.1 Number of 5000 top-ranked PICS predictions from GABP and FOXA1 data in candidate regions that had 1, 2 or at least 3 pre- dicted binding events (i.e. mixture components). The first row gives the number of binding events in each category, while the second row gives the percentage of predicted events that could be associated with an expected motif. For example, 80 per- cent of the 940 binding events in two-component GABP regions could be associated with a predicted GABP site. . . . . . . . . 34 Table 2.2 Number of proximal binding events found in the 5000 top-ranked regions identified by each method in GABP and FOXA1 data, as a function of the motif ‘proximity’ distance d. The numbers in paratheses give the percentage of predicted binding events that could be associated with at least one predicted motif site. For example, in the GABP data, PICS identified 128 binding event locations that were within 250 bp of another location, and 86% of these predicted events could be associated with a motif. 35 Table 3.1 Motifs identified by all compared methods . . . . . . . . . . . 51 Table 4.1 The number matched predicted nucleosomes between two dif- ferent strengths of GMRF priors. . . . . . . . . . . . . . . . . 84 Table A.1 Estimated number of binding sites against true number of bind- ing sites for simulation scenario 1. Bold text marks the best results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 x Table A.2 Estimated number of binding sites against true number of bind- ing sites for simulation scenario 2. Bold text marks the best results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Table A.3 Estimated number of binding sites against true number of bind- ing sites for simulation scenario 3. Bold text marks the best results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Table C.1 The number (in thousands) of predicted nucleosomes for (A) NOCL4 data (Kaplan 2009), 5000 selected regions of (B) 1hr H3K4me1 data (Heinz 2010), 5000 selected regions (C) islet data (Hoffman 2010) and their random subsets of reads using PING, TpF and NPS. . . . . . . . . . . . . . . . . . . . . . . 161 xi List of Figures Figure 1.1 Details of a sonicated ChIP-seq experiment for a transcription factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 2.1 Predicted binding events in two candidate regions in the GABP data. PICS detected one binding event in the region in (a) and two binding events in the region in (b). The observed data of forward and reverse strand aligned reads are shown by > and < arrowheads, respectively. Mappability profiles are shown as black/white lines, in which the white intervals show non- mappable regions. In (a) the distribution of reverse readx posi- tions was biased by a region with low mappability. . . . . . . 19 Figure 2.2 Directed acyclic graph representation of the PICS model. Reads aligned to forward and reverse strands are denoted by f and r. Dashed rectangles represent fixed hyper-parameters. Grey- filled circles indicate missing data that are incorporated into the model to facilitate inference via the EM algorithm while plain circles represent unknown variables that need to be esti- mated. Plain squares represent the data (aligned forward and reverse read positions). Finally, thick arrows indicate deter- ministic relationships, while thin arrows indicate stochastic re- lationships. The index k refers to the mixture component k. . . 21 Figure 2.3 Motif occurrence rate and spatial error (see text) for GABP and FOXA1 data, as a function of region enrichment rank, for the 5000 top-ranked regions for each method. . . . . . . . . . . . 33 xii Figure 2.4 Fraction of predicted binding events with a GABP (a) or FOXA1 (b) motif site within c ·SE(µ̂) of the predicted event location, µ̂ , as a function of region enrichment rank. . . . . . . . . . . 36 Figure 2.5 Partial receiver operating characteristic curves and nominal ver- sus true FDR for the three simulation scenarios. . . . . . . . . 40 Figure 3.1 Primary motifs identified by rGADEM and visualized with MotIV. The motif matches and associated similarity E-values are based on the JASPAR database included in MotIV. . . . . . . . . . 47 Figure 3.2 Distance distribution between the rGADEM motif occurrences and the PICS predictions for the STAT1, CTCF, FOXA1 and CTCT motifs identified from datasets. . . . . . . . . . . . . . 48 Figure 3.3 Pairwise distance distributions between the STAT1, AP-1 and CTCF motifs identified by rGADEM from the STAT1 data. . . 49 Figure 3.4 PICS peak calling. The example shows a FOXA1-enriched region in which PICS discriminates two closely adjacent bind- ing events, each of which contains a rGADEM de novo FOXA- like motif (black squares); these are separated by less than 300bps. In contrast, MACS outputs a single enriched regions. For clarity, the aligned reads (blue/red bars) and the combined forward/reverse PICS density profiles are also shown. . . . . 55 Figure 3.5 The ChIP-Seq processing pipeline. Short sequence reads are first mapped onto a reference genome, and the mapping results are loaded into R. The pipeline core consists of the three dark blue rectangles. Enriched regions are identified by PICS and passed to rGADEM for de novo motif discovery, and motifs and motif occurrences are passed to MotIV for postprocessing. . . 61 xiii Figure 3.6 PICS read mappability correction in a FOXA1 binding re- gion with missing reads due to genome repetitiveness. An ambiguous region (i.e. a region into which short reads can- not be uniquely mapped) is shown as a grey rectangle. For- ward and reverse aligned reads are respectively shown as black and red arrowheads. Forward and reverse PICS read density profiles are respectively shown in black and red, with solid/- dashed lines representing t distributions with/without the map- pability correction. The rGADEM-estimated FOXA1 binding site is shown by a vertical black line. When PICS corrects for read mappability, the de novo motif is within the confidence interval of the site location that it predicts, but it is outside of the interval when the correction is not used. The spatial error, i.e. the distance between binding site location and the PICS prediction, is 15bps with the correction and 47bps without the correction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 4.1 AUC statistics as a function of number of reads in random sub- sets for PING, TemplateFilter (TPF) and NPS. Datasets are (a) MNase-Seq data from budding yeast [57], (b) sonicated H3K4me1 ChIP-Seq data from the mouse PUER cell line [41], and (c) sonicated H3K4me1 ChIP-Seq data from mouse adult islet tissue [44]. . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.2 The model-based nucleosome positioning within±500 bp from the top 5000 detected in vivo (a) SPI1 binding sites and (b) CEBPB binding sites from Heinz’s sonicated H3K4me1 ChIP- Seq data for 0 hour (blue) and 1 hour (red) after stimulation. The heatmaps show nucleosome prediction profiles for each region as pairs of red/blue horizontal lines, and the curves on the bottom are average occupancy profiles across all regions. 71 xiv Figure 4.3 Modality and nucleosome occupancy profiles for in vivo bind- ing sites for the transcription factors Foxa2 (left) and Pdx1 (right) mouse adult islet tissue. a) The number of binding sites in bimodal, monomodal and NoNuc groups. b) Aver- age model-based nucleosome positioning profiles for the three classes of binding sites. c) Average model-based nucleosome positioning profiles in mouse adult liver relative to binding sites in islets. . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Figure 4.4 Boxplots of gene expression levels from RNA-seq data in each group. The horizontal dashed line shows the overall median. Islet results are shown on the top row and liver results in the bottom row. . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Figure 4.5 The profile of center distance between the closest motifs to the centre of a predicted nucleosome position for detected “monomodal” Pdx1 and (Foxa2) binding sites in islets. Dashed curves show profiles in liver. Profiles are truncated at ±100 bp, and vertical dashed lines show ±75 bp from the estimated centres of the nucleosome-associated DNA. . . . . . . . . . . . . . . . . . 91 Figure 4.6 The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where multiple nucleosomes are described using single mixture components. The red/black arrowheads shows the location of forward/reverse reads. The character ‘+’ shows predicted position of nucleosomes, whose confidence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mixture models. The vertical green lines shows the predicted nucleosome positions after post-process. . . . . . . . . . . . . . . . . . . . . . . . 92 xv Figure 4.7 The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where multiple nucleosomes are described using single mixture components. The red/black arrowheads shows the location of forward/reverse reads. The character ‘+’ shows predicted position of nucleosomes, whose confidence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mixture models. The vertical green lines shows the predicted nucleosome positions after post-process. . . . . . . . . . . . . . . . . . . . . . . . 93 Figure 4.8 The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where ‘None’ and ‘Weak’ do not match. The red/black arrowheads shows the location of forward/re- verse reads. The character ‘+’ shows predicted position of nucleosomes, whose confidence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mixture models. . . . . . . . . . . . . . . . . . . . . . 94 Figure 5.1 AUC statistics as a function of proportion of reads in random subsets for PING applied to SE data and PING-PE applied to SE data. Each boxplot in this figure is generated from AUC of 100 simulated data sets. Datasets are chromosome I of yeast data (with∼ 360k DNA fragments) generated using total DNA MNase-seq protocol. The green star indicates AUC of PE and SE are significantly different. . . . . . . . . . . . . . . . . . . 109 Figure 5.2 AUC statistics as a function of proportion of DNA fragments in random subsets for PING applied to SE data and PING-PE ap- plied to SE data. Each boxplot in this figure is generated from AUC of 100 simulated data sets. Datasets are chromosome I of yeast data (with ∼ 360k DNA fragments) generated using total DNA MNase-seq protocol. Green star indicate AUC of PE and SE are significantly different. . . . . . . . . . . . . . 110 xvi Figure A.1 Distribution of estimated average DNA fragment lengths, δ , in GABP (a) and FOXA1 (b) data, before and after filtering. For clarity, only results for the top 10000 regions are shown. . . . 130 Figure A.2 Number of detected peaks as a function of False Discovery Rate (FDR) levels for the four analysis methods, for the GABP data (a) and the FOXA1 data (b). For the FOXA1 data the cisGenome FDR includes values up to ∼ 20%, but we have limited the y-axis for better visualization of other FDRs. Note that actual FDR values are given in %, and one needs to divide by 100 to get the actual FDR level. For consistency with motif analysis, we show results for only the top 5000 PICS regions. 130 Figure A.3 Spatial error as a function of predicted region rank for the five methods compared under the three simulation scenarios. . . . 132 Figure A.4 Coverage of 95% approximate confidence interval for binding site positions for the three simulation scenarios. . . . . . . . . 133 Figure A.5 Receiver operating characteristic curves for the three simula- tion scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Figure B.1 Estimated FDR as a function of the enrichment score for the ER data. The number of enriched regions for the correspond- ing score is given at the top. . . . . . . . . . . . . . . . . . . 136 Figure B.2 Estimated FDR as a function of the enrichment score for the FOXA1 data. The number of enriched regions for the corre- sponding score is given at the top. . . . . . . . . . . . . . . . 137 Figure B.3 Estimated FDR as a function of the enrichment score for the STAT1 data. The number of enriched regions for the corre- sponding score is given at the top. . . . . . . . . . . . . . . . 138 Figure B.4 Motifs identified by rGADEM and visualized with MotIV from the CTCF data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. . . . . . 139 xvii Figure B.5 Motifs identified by rGADEM and visualized with MotIV from the ER data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. . . . . . 140 Figure B.6 Motifs identified by rGADEM and visualized with MotIV from the FOXA1 data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. . 141 Figure B.7 Motifs identified by rGADEM and visualized with MotIV from the STAT1 data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. . 142 Figure B.8 Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the CTCF data. For clarity only motifs with E-value less than 10−4 are retained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Figure B.9 Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the ER data. For clarity only motifs with E-value less than 10−4 are retained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 Figure B.10 Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the FOXA1 data. For clarity only motifs with E-value less than 10−4 are retained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Figure B.11 Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the STAT1 data. For clarity only motifs with E-value less than 10−4 are retained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Figure B.12 Pairwise distance distributions between the CTCF, Myf motifs identified from the CTCF data. . . . . . . . . . . . . . . . . . 147 Figure B.13 Pairwise distance distributions between the ER, FOXA1 and AP-1 motifs identified from the ER data. . . . . . . . . . . . 148 xviii Figure B.14 Pairwise distance distributions between the FOXA1 and AP-1 motifs identified from the ER data. . . . . . . . . . . . . . . . 149 Figure B.15 ER motif identified by rGADEM and visualized with MotIV from the FOXA1 data. The motif matches and associated E- values are based on the JASPAR database included in MotIV. 153 Figure B.16 Venn diagram for the number of overlapped occurrences of FOXA1 primary motifs. . . . . . . . . . . . . . . . . . . . . 154 Figure B.17 Venn diagram for the number of overlapped occurrences of ER primary motifs. . . . . . . . . . . . . . . . . . . . . . . . . . 155 Figure B.18 Venn diagram for the number of overlapped occurrences of CTCF primary motifs. . . . . . . . . . . . . . . . . . . . . . 156 Figure B.19 Venn diagram for the number of overlapped occurrences of STAT1 primary motifs. . . . . . . . . . . . . . . . . . . . . . 157 Figure B.20 Example of a MotIV alignment output based on the FOXA1 data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Figure C.1 Density of Gaussian Markov random field (GMRF) used in PING, with horizontal lines showing 10 simulations or realiza- tions of 3 adjacent nucleosomes from this prior. . . . . . . . . 160 Figure C.2 Prior density of δ with selected values for hyperparameters (left) and posterior density of δ in three real data sets (right). . 161 Figure C.3 Nucleosome prediction scores of PING, TpF, and NPS for MNase “NOCL R4” data (Kaplan 2009). The vertical lines marks the top-ranked 10000 predicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. . . . . . . . 162 Figure C.4 Nucleosome prediction scores of PING, TpF, and NPS for mouse PUER cell line sonicated H3K4me1 ChIP-seq data, 1 hour af- ter stimulation (Heinz 2010). The vertical line marks the top- ranked 2000 predicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. . . . . . . . . . . 163 xix Figure C.5 Nucleosome prediction scores of PING, TpF, and NPS for mouse islet sonicated H3K4me1 ChIP-seq data (Hoffman 2010). Ver- tical lines mark the top-ranked 1000, 2000 and 4000 predicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. . . . . . . . . . . . . . . . . . . . . . . . . 164 Figure C.6 The nucleosome positioning in ±500 bp from top 5000 de- tected in vivo (a) SPI1 biding sites and (b) CEBPB biding sites from Heinz’s sonicated H3K4me1 ChIP-Seq data for 0 hour (blue) and 1 hour (red) after stimulation. This profile is gen- erated using a method consistent with that outlined by Heinz et al 2010. The SPI1 profile in the data for 0 hour (the blue curve in the top figure) seams to be just to a large and wide peak on each side of binding sites, which provide almost no information about nucleosome positions. . . . . . . . . . . . 165 Figure C.7 Nucleosome positioning profiles in Heinz’s data. Noting that the distance for proximal nucleosomes shifts varied between SPI1 binding regions. To get clearer profiles, we align the data based on the locations of the flanking (position 1) nucleo- somes after SPI1 binding at 0hr before tamoxinfen addition. It was not clear how to assign a strand orientation for SPI1 and CEBPB enhancers, and using absolute distances allowed us to align to +1 and -1 nucleosomes at the same time. This figure confirmed the SPI1 nucleosome shift away from binding sites. 166 Figure C.8 The adaptive threshold of local enrichment based on the num- ber of reads and the peak width of both nucleosomes, obtained from the negative binomial model. The parameter “prob” in NB model is assumed to be 0.5 in this figure. To calculate the adaptive threshold the significance level used is 0.1 (i.e. 90% of quantile in NB distribution). . . . . . . . . . . . . . . . . . 167 xx Figure C.9 The sorted PING scores in loge scale for all nucleosomes pre- dicted in mouse islet (black) and liver (red) from sonicated ChIP-seq data (Hoffman 2010). The blue vertical dotted line shows top 50000 (the vertical blue line) nucleosomes is a rea- sonable elbow point of the score curves. . . . . . . . . . . . . 168 xxi Glossary TF Transcription Factor TFBS Transcription Factor Binding Site TPF TemplateFilter PICS Probabilistic Inference for ChIP-Seq transcription factor data PING Probabilistic Inference for Nucleosome positioninG PE Paired-End SE Single-End XSET 174-bp eXtended Single-End Tags GMRF Gaussian Markov Random Field xxii Acknowledgments I wish to express deep gratitude to my supervisors, Raphael Gottardo and Kevin Murphy, for their inspiration, guidance and continuous support throughout this journey. I would also like to thank my committee member, Gordon Robertson of the Michael Smith Genome Science Centre, for his invaluable advice on my research and career. I feel honoured to have been able to study at the UBC Statistics Department, where I had the good fortune to have been mentored by Jiahua Chen, Paul Gustafson, and Harry Joe and others. I learned so much from their insight, advice and tremen- dous knowledge of statistics. I wish also to acknowledge: • My wife Li Xing for her ongoing support in my family, • Brad Hoffman for kindly offering data and providing assistance in interpret- ing the data, • Junyi Guo and Nancy Heckman for enriching my research experience with their excellent supervision during my Master’s at Nankai University and at University of British Columbia (UBC); • Peggy Ng, Elaine Salameh and Andrea Sollberger in the UBC Statistics de- partment office, who always gave efficient assistance with sincere care rather than merely carrying out duties. The research constituting this thesis has received funding support from NSERC- CGS fellowship, CGS-MSFS Supplement, BCIC Innovation Scholarship, and Uni- versity 4-year Fellowships of UBC. xxiii Chapter 1 Introduction The goal of this thesis is to develop statistical models and software to profile Tran- scription Factor Binding Site (TFBS)s and nucleosome positions using genome- wide high-throughput sequencing data. We focus on widely available sonicated ChIP-seq data. We address the “noise” in sonicated data by using statistical mod- els that incorporate more detailed information about data-generating mechanisms compared to prior work. We show that the result of this more careful modelling is better detection sensitivity and specificity, and show how to leverage this improved profiling to discover new scientific facts of interest to biologists. Our main technique is based on using finite mixture models to describe the density of aligned short reads generated by sequencing (see Section 1.1 for more details on the experiment and data collection). We can then infer the locations of the object generating the reads from the model parameters. In more detail, our contributions are as follows: In Chapter 2, we show how to profile TFBSs by creating a mixture of Student distributions in which each mixture component describe two peaks, correspond- ing to the two ends of immunoprecipitated DNA fragments. Unfortunately, some reads are discarded while being aligned to a reference genome because they have no unique alignment position. To correct estimation bias from such discarded reads, we extend some previous work [86] on fitting mixture models to truncated or cen- sored data using the EM algorithm. We call this method “PICS”. We show that it out-performs existing approaches to this problem. 1 In Chapter 3, we discuss how PICS has been integrated into a larger analy- sis pipeline, which is the first pipeline available on Bioconductor for analysing ChIP-seq data. We demonstrate the effectiveness of our pipeline on published hu- man ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, where it detected co- occurring motifs that were consistent with the literature but not detected by other methods. In Chapter 4, we modify PICS to profile nucleosome positions instead of TFBSs. The main difference is that we impose a 1D Gaussian Markov Random Field (GMRF) prior on the positions of the means of the Student distributions; this en- codes our prior knowledge about the spatial relationship of adjacent nucleosomes. We call this method “PING”. We show that it out-performs existing approaches to this problem. Paired-End (PE) sequencing is an alternative experiment protocol to single- end (SE) sequencing. In Chapter 5 we describe our current work on developing statistical methods for analyzing PE data. 1.1 Biological Background 1.1.1 Sonicated ChIP-seq Experiments ChIP-seq, which combines chromatin immumoprecipitation with massively paral- lel short-read sequencing, offers high specificity, sensitivity and spatial resolution in profiling in vivo protein-DNA association; histones, histone variants and modi- fied histones; nucleosome positioning; polymerases and transcriptional machinery complexes; and DNA methylation [45, 99]. While sequencing overcomes certain limitations of profiling with microarrays (ChIP-chip), it raises statistical and com- putational challenges, some of which are related to those for ChIP-chip, and others that are novel. The details of a sonicated ChIP-seq experiment are shown in Figure 1.1. A DNA-binding protein is cross-linked to its in vivo genomic DNA targets, and the chromatin (a complex of DNA and protein) is isolated (1). The DNA with the bound proteins is extracted from the cells, and is sheared by sonication into frag- ments of average length 500-1000 bps (2). DNA fragments that are cross-linked to 2 Cross link Proteins to DNA Sonication to shear the DNA Protein of interest is enriched by immunoprecipitation Reverse cross links Gel size selection (e.g. 100-300bps) Add Antibody ~1kb 100-300 bps Sequencing Millions of short reads (e.g. 36 bps) 36 bps 100-300 bps 1 2 3 45 6 7 Forward read Reverse read DNA fragment DNA binding protein binding 8 5' 3' 3' 5' Genome Figure 1.1: Details of a sonicated ChIP-seq experiment for a transcription factor. 3 the protein of interest are enriched by immunoprecipitation with an antibody that specifically binds that protein (3-4). After the immunoprecipitation step, the DNA is separated from the protein (5), the resulting suspension of IP-enriched DNA is size selected on a gel, and only smaller fragments (e.g. 100-300 bps) are retained (6). Then, the size-selected, IP-enriched DNA is sequenced to generate millions of short reads, each of which represents either a fragment start or end (7-8). (In an al- ternative ‘paired end’ experiment, a read is generated from each end of each DNA fragment.) After read sequences are aligned to a reference genome, read positions can be used to infer binding site positions. (8) shows two binding sites, with the right-hand site more enriched in DNA fragments. Fragments that do not align with a binding site reflect biases like nonspecific immunoprecipitation, misalignment, etc. A typical ChIP-seq data set consists of millions or tens of millions of sequence reads that are generated from ends of immunoprecipitated DNA fragments. The quality of called bases varies along and between reads, and, as the sequencing technology evolves, read lengths and quality are increasing, as are the number of reads generated in a machine run. Current read lengths are generally between 36 and 75 bp. While pairs of end reads can be generated from each DNA fragment, current ChIP-seq data typically consist of single-end reads, in which each DNA fragment contributes a directional read from only one randomly selected fragment end ((7-8) in Figure 1.1). After read sequences have been aligned to a reference genome ( e.g. [73], the aligned read data are transformed into a form that reflects local densities of im- munoprecipitated DNA fragments, and, in the work described here, into estimates of locations where Transcription Factor (TF)s were associated with DNA in the experimental cellular system. The analysis is complicated by biases and artifacts in local read densities that can be introduced in sequencing and aligning, and by chromatin structure and genome copy number variations [43, 55, 61, 111]. As well, repetitive sequences can prevent aligning (mapping) reads to unique genomic loca- tions [106, 111], and reads that cannot be uniquely aligned are typically removed from the analysis. In typical mammalian experiments, 15 to 40 percent of reads may be discarded, but higher rates can be encountered in particular experiments. Because of ChIP-seq’s cost-effectiveness, such global losses are usually not an im- 4 portant practical consideration; however, few analysis methods correct for the local biases in aligned read densities that are caused by repetitive regions. Certain types of biases in read density profiles can be estimated by sequencing a control sample in addition to the immunoprecipitated ‘treatment’ sample, and then using an analysis method that considers the treatment profile relative to the control profile [61, 97, 111]. Control data can help identify false positives, as- sess numerical background models, and estimate a threshold for segmenting a read density or enrichment profile in order to identify a subset of significantly enriched regions. Analysis methods are described as ‘two-sample’ when a control data set is available and as ‘single sample’ when only treatment data are available. As with ChIP-chip, there are various ways to generate control samples, and we refer the reader to [14] for an overview. In summary, once reads have been aligned to a reference genome, there are at least four central analysis issues: interpreting the information in local densities of directional reads; identifying local read densities that represent false positives; addressing biases in read densities that arise from local variations in the efficiency with which reads can be aligned to unique genomic locations; and segmenting the enrichment profile in order to identify a statistically and biologically meaningful subset of enriched regions. 1.1.2 Other Experiment Protocols For the purpose of generating profiles for nucleosome positioning other experi- mental protocols can be used. The following two protocols can offer better better spatial distributions for aligned reads than sonicated ChIP-seq. MNase ChIP-seq Experiments MNase is an endo-exonuclease that can be used to digest the linkers between adja- cent nucleosomes. In an MNase ChIP-seq experiment, the second step (sonication step) in figure Figure 1.1 is replace by MNase digestion. Compared to sonicated ChIP-seq experiments, the ends of DNA sequences obtained from MNase ChIP-seq experiments are closer to the boundaries of nucleosomes, hence MNase ChIP-seq experiments offer better spatial distributions for nucleosome positioning. 5 MNase-seq Experiments In some cases, researchers are interested nucleosome positions over the whole genome, and not for regions related to certain types of histones or histone mod- ifications. MNase-seq experiments are used in this situation. Compared to MNase ChIP-seq experiments, MNase-seq experiments do not require adding an antibody, i.e. step 3 to step 5 are removed in Figure 1.1. Discussion Currently, nucleosome-based data are typically generated by high-throughput short- read sequencing following either MNase digestion (MNase-Seq), MNase diges- tion with chromatin immunoprecipitation (ChIP-Seq) or sonication with ChIP-seq. MNase digests linker DNA with relatively high specificity [19], and this specificity is reflected in the spatial distribution of aligned reads. However, classes of func- tional genomic regions are being identified by integrated analysis of diverse sets of short-read sequence data [7, 28, 92], many of which are nucleosome-based data types that are generated with ChIP-Seq protocols that use sonication. In summary, MNase data are not always available for the goal of nucleosome positioning, while large amounts of sonicated ChIP-seq data have been generated and are being generated. Hence a statistical analysis method for nucleosome posi- tioning is desired to be able to handle the noisier sonicated ChIP-seq data. 1.1.3 Paired-end Sequencing Experiments In all experiments discussed above, the DNA fragments are sequenced for a small proportion from a single randomly selected end. Alternatively, each fragment can be sequenced from both ends. With same of resource (money), single-end ex- periments can sequence about twice the number of DNA fragments as paired-end experiments can. However the length of each DNA fragment is useful information that only available from paired-end experiments. We are currently adapting our model to handle paired end data; see Chapter 5 for details. 6 1.2 Existing Analysis Methods in the Literature 1.2.1 Analysis of ChIP-seq Data of TFs In this section we describe the peak finder methods for analyzing TFs. These meth- ods are designed for detecting localized and unstructured enriched regions. ‘Un- structured’ is a term relative to nucleosome data, whose positions (the enriched regions) have some known structure, as outlined above. Methods Based on XSET Profiles Many methods in the literature call peaks based on the 174-bp eXtended Single- End Tags (XSET) profile [108]. These methods usually extend each read by a cer- tain length (174 bps) to the 3’ direction, so that each read is changed from a point into a segments. The counts of segments at each bin (1 base pair or multiple bps), sometimes called as height of XSET of the bin, can be used for thresholding an enrichment profile. Most earlier methods call peaks (i.e. TFBSs) by thresholding the XSET profile (i.e. counts of observed segments). Such as ‘ChipSeq Peak Finder’ [54] uses the ratio of the counts from the immunoprecipitated sample to the control in order to identify (or call) regions where large numbers of fragments overlap as “peaks”. The ‘eRange’ [93] is similar to ‘ChipSeq Peak Finder’, but it allows the use of reads that map to multiple locations in the genome. Those reads are mapped to the most likely positions. ‘FindPeaks’ [30] calls peaks according to some minimum height criteria either with or without including a negative control sample in the analysis (typically this is input DNA). More complex methods model the counts in a small genome region with cer- tain probability models and call peaks by thresholding the p-values. The most popular peak finders in the literature is ‘MACS’ [133] for its fast speed and good performance. ‘MACS’ uses both forward and reverse read profiles to empirically model the ‘shift size’ of ChIP-seq reads, and uses this to improve the spatial ac- curacy of the predicted binding sites. MACS uses a parametric model based on a dynamic Poisson distribution to identify and quantify locations where the protein of interest binds. The dynamic Poisson distribution allows different bins to have 7 different values for the model parameter λ . The estimated FDR is calculated by switching the IP and control data. ‘CisGenome’ [50] is a pipeline for the analysis of ChIP-chip or ChIP-seq data. Their method is also based on a Negative bino- mial background model, and includes the functionalities not offered by MACS and QuEST (e.g. filtering atypical regions, and different types of FDR estimates). The negative binomial distribution can be presented as infinite Poisson mixture distri- bution or a Poisson-Gamma compound distribution, which is used to address the over-dispersion problem of modelling the counts by the Poisson distribution. The ‘USeq’ [97] is method using the window-level binomial p-values. An FDR correc- tion is applied to call the individual binding events. The ‘PeakSeq’ [111] allows for this read-mappability effect, starts with a normalization step comparing the control to the background component of the ChIP sample and then, it first thresholds peaks by counts of reads, and calls peaks by comparing the counts in IP relative to control regions. The p value is calculated using the Binomial distribution. Site Identifica- tion from Short Sequence Reads (SISSR) [56] estimates the Poisson probabilities of high read counts, and calls regions where the peaks shift from the forward to the reverse strand. Method Based on Aligned Positions of DNA Segments The density estimation-based methods model the bi-directional read locations di- rectly, because this offers more information than only the counts of reads in each base pair. To our best knowledge, there are two such methods in the literature. The ‘QuEST’ [125] is a method based on the enrichment score calculated from the Gaussian kernel density estimates of the forward and reverse read counts, which al- lows estimating DNA fragment lengths. The forward and reverse profiles are then combined to estimate the binding site locations and to quantify the local enrich- ments. Given the control sample data, QuEST can estimate a false discovery rate (FDR). Kharchenko et al. [61] also proposed a peak calling method, whose score is based on Gaussian kernel density estimates of forward and reverse read counts. 8 Challenges and Our Method While above methods have established statistical approaches for ChIP-seq analysis, model-based and Bayesian approaches are in earlier stages of development. Four problems need to be addressed. First, methods based on XSET profiles assume each DNA fragment has a fixed length, but the length could vary between 80 bp and 300 bp. So this assumption might result in bias. Second, up to 40% of DNA fragments typically cannot be uniquely mapped to a genome location, and such missing information should be addressed in the model. Third, the real data contain DNA fragments not contributed from TFs, so a robust method is desired to reduce effects of such nonspecific DNA fragments. Finally, the methods not based on XSET profiles model the forward reads and reverse reads separately using different density estimations. However forward and reverse reads from the same TF are related to each other when we know the range of lengths of fragments. So a joint model is desired to borrow strength between forward and reverse reads. In the work described here, we introduce PICS, a method for probabilistic infer- ence of ChIP-seq data that is based on a Bayesian hierarchical truncated t-mixture model. PICS will address all four issues. We will apply PICS to two published human ChIP-seq datasets and compare its results to those from QuEST, MACS, CisGenome, and USeq. We will also presents the results of a simulation study that evaluates the robustness of our model to mis-specification and compares the performance of PICS and the four other methods. 1.2.2 Analysis Pipeline Peak-calling method discussed above returns a list of enriched genomic regions in which the protein of interest is expected to be directly or indirectly associated with DNA. There are two downstream analyses to be done. (1) To identify po- tential DNA binding sites within these regions, and summarize these sets of short sequences as motifs, typically as position weight matrices (PWMs) or families of PWMs [59, 122]. (2) Once de novo motifs have been identified, it is desirable to annotate and assess these in order to retain motifs that are likely to be biologically relevant, while removing artifactual and background motifs. 9 Motif Discovery There are two main types of algorithms for de novo motif discovery: enumera- tive and probabilistic. Enumerative methods identify and rank all m-letter pat- terns in a set of sequences. Probabilistic methods use stochastic sequence models along with Expectation-Maximization (EM) or Gibbs sampling techniques to infer PWMs [4, 70, 78, 110], and can be computationally impractical for large datasets. Established tools like Weeder [101], Gibbs sampler [70] or MEME [5] were de- veloped to address relatively small sets of input sequences, and scale poorly to the much larger sets of enriched sequences that whole-genome ChIP-Seq data can re- turn. Pipelines developed for ChIP-Seq analysis, e.g. CisGenome [51] and MICSA [10], are based on these algorithms or variants of them, and face similar constraints. Other tools like HMS [46] and ChiPMunk [66] were developed for motif discovery from ChIP-Seq data, and so are more scalable, but can identify only a single-motif at a time, and would need to be modified to discover motif combinations. GADEM [75] is a good compromise between fully probabilistic and enumer- ative approaches, can process large sets of ChIP-Seq regions, handles both dimer and monomer motifs, automatically identifies multiple motifs, and automatically adjusts motif widths. To build our pipeline, we have ported GADEM to R, as a package called rGADEM. To address very large sets of enriched regions, we have extended the original C code to take advantage of multithreading, without requiring user configuration, via Grand Central Dispatch on OS X, and openMP (openmp.org), which supports shared-memory parallel programming on all archi- tectures, including Unix and Windows. Compared to probabilistic approaches, this provides a simple, fast and efficient de novo framework. Annotate and Assess Motifs To annotate and assess motifs discovered in the last step, we have designed a new tool, MotIV (Motif Identification and Validation), which is based on STAMP [84]. Like STAMP, MotIV provides queries to the JASPAR database [13], and users can flexibly input other sets of reference PWMs (e.g. TRANSFAC [129], UniProbe [96], DBTBS [119], or RegulonDB [34]). As outlined below, MotIV provides vi- sualization and postprocessing options that are unavailable in STAMP, TOMTOM 10 [38] and MACO [123]. It provides summary statistics on motif occurrences, re- ports joint motif occurrences and plots distance and pairwise-distance distributions. It can also refine motifs and motif occurrences based on a set of filters provided by the user. Because gene regulation typically involves combinatorial action of multiple TFs, functional binding sites tend to occur as groups that are often referred to as cis-regulatory modules (CRMs) [12]. Identifying CRMs can improve the accuracy of predicting functional binding sites. However, results from computational meth- ods for determining CRMs (e.g. Cluster-Buster [33] and CisModule [136]) are rarely reported for ChIP-Seq data, because they are too computationally intensive or return long lists of candidate modules that are challenging to assess. MotIV offers an alternative way to identify biologically relevant combinations of motifs. Pipeline It is desired to conduct all above three analysis in a single pipeline software, since using different softwares for each individual step need tedious work on data for- mat conversion. Note the size of data we handle is usually large, so data format converting might be a heavy work load and can increase chances of error. Bio- conductor, which contains hundreds of packages, has become a popular platform for analyzing genomic data. So a pipeline made from Bioconductor packages are desired so that the pipeline output can be easily used in other packages to conduct other analysis. For this goal, we implemented the PICS method as a Bioconductor package, and extend it into an analysis pipeline by adding in two other Bioconductor pack- ages (i.e. rGADEM identifies de novo motifs; and MotIV visualizes and annotates motifs, and identifies motif combinations that have nonrandom spatial relation- ships.) This is the first complete Bioconductor pipeline for analyzing transcrip- tion factor ChIP-Seq data. The pipeline is computationally efficient, supporting processing datasets that consist of tens of thousands of peaks. We illustrate the pipeline by analyzing published Illumina datasets for genome-wide binding in hu- man of FOXA1, ER, STAT1, and CTCF. We compare the performance of our ap- proach to previously described methods for motif and module discovery, and show 11 that the pipeline supports detecting biologically relevant motif modules that are not easily discovered by other methods. 1.2.3 Methods for Nucleosome Positioning from ChIP-seq or MNase-seq Data NPS [134] and TemplateFilter (TpF) [127] are two popular methods for inferring nucleosome positions from short-read data, and their software is available. So we will compare these two methods with our new method. NPS first uses a wavelet transformation to reduce the noise in the count data. Second, NPS applies LoG edge detection to the de-noised read counts, which is Gaussian smoothing, fol- lowed by applying a Laplace operator on the smoothed result. The location where Laplace transformation results cross zero is the boundary of the peak. Third, NPS applies filters to the detected reads to reduce the false positive rate. Finally they assign the nucleosomes to each histone mark using a Possion p-value. Weiner et al [127] proposed to fit the data to several given possible shapes of peaks (templates), and they call the peaks when a good fit is obtained. Results from these methods have been reported only for data generated with protocols that use MNase-Seq, or MNase with ChIP-Seq (e.g. [39]), and the effectiveness of these methods with the more widely available data generated using sonicated DNA and ChIP-seq is not known. Recently we described PICS, a probabilistic peak-caller for transcription fac- tor ChIP-Seq data [131]. PICS models bi-directional read densities, uses mixture models to resolve adjacent binding events, and imputes reads that are not mapped due to repetitive genome sequences. The problem and data structure of profiling TF and profiling nucleosome positions are similar, so we anticipated that PICS model-based framework should be extensible to address both MNase-digested and sonicated nucleosome-based short-read data. We were interested in assessing how effectively the model could be adapted to the two data types, how robust the new al- gorithm would be to lower read densities, and the types of biological inferences that it would support from sonicated data. To address these issues, we developed PING, a method for probabilistic inference of nucleosome positioning from nucleosome- based sequence data. Like PICS, PING models bi-directional read densities, uses mixture models, and imputes missing reads; however, it uses a new prior speci- 12 fication for the spatial positioning of nucleosomes, has different model selection criteria, model parameters, and different post-processing for estimated parameters. In addition, PING also includes novel statistical methods to identify nucleosomes whose read densities are lower than those of neighboring nucleosomes. In addition to the methods we compared with, there are other methods in the literature. Kuan et al proposed a HMM-based method [64] for nucleosome posi- tioning. The genome is split into bins, and Gaussian kernel is used to smooth the count of reads in each bin for removing local trend. Instead of the smoothed counts, difference of counts in adjacent bins (the derivative) is used for further analysis. An HMM model is fitted to the derivatives. The HMM model has a Gaussian emission probability and two hidden states nucleosome/linker, and it has numbers of sub- states to control the length of linkerDNA. The multi-layer method (MLM) [36] is a less statistical method. It smooths the counts by convolution operator with kernel window [1/4, 1/2, 1/4], and then obtains nested enriched intervals by using dif- ferent thresholds on counts (a lower threshold result in larger enriched intervals). Then it extracts the pattern from the intervals. So for a peak, the number of in- creasing thresholds, m, corresponds to the peak height. The peaks are selected by a cutoff value of m, and the selected peaks are classified as well positioned, fuzzy, and de-localized nucleosomes, depending to the shape of the peaks. 1.2.4 Methods for Paired-End Sequencing Data Most methods discussed above are based on extended single-end reads. So to ap- ply these methods to PE data, the only change is that extended reads come from observed starts and ends instead of extending single-end reads by fixed length. To extend PICS/PING to handle PE data is not as simple. When the reads are paired, we need to use such information in our model. For example, in a mixture model, mixture responsibilities of paired reads are forced to be tied. Such changes require modifying formulae for the E-step, M-step, standard error calculation and model selection. 13 1.3 Summary of Key Results in This Thesis The rest of this report is organized as follows. In Chapter 2, we introduced Prob- abilistic Inference for ChIP-Seq transcription factor data (PICS), a probabilistic inference method to detect the locations where TF proteins bind on the whole genome, from ChIP-seq data. Using both real data and simulated data, we com- pared PICS with other popular methods such as MACS, cisGenome, QuEST, and USEQ. Using published real data, we showed PICS predictions agree best with dis- covered motifs Figure 2.3, and PICS has best sensitivity/specificity to detect prox- imal binding events Table 2.2. Using simulated data, we showed PICS is robust to mis-specification of model distributions and prior distributions and we showed PICS predictions agree best with simulation truth Figure 2.5. In Chapter 3, we introduce an Bioconductor pipeline for analysis of ChIP-seq data. For this pipeline, we reimplemented PICS method in C language and build it into a Bioconductor package, and build two other packages for follow up analysis of motif discovery/assessment. Next we applied our pipeline to published data sets and showed we can detect co-occurring motifs that were consistent with the literature but not detected by other methods. In Chapter 4, we introduce Probabilistic Inference for Nucleosome positioninG (PING), a statistical method for detecting nucleosome positions in whole genome high-throughput sequencing data. This method is modified version of PICS. We compared PING with other methods using a sampling approach, and showed PING is more robust to low sequence depth Figure 4.1. Next we applied PING to two published data and show that its results are better, see Figure 4.2 and Figure 4.4. In Chapter 5, we summarize the results of the thesis and introduce a model for paired-end (PE) sequencing data. PE sequencing is a different experiment protocol than the popular protocol used in practice (i.e. SE), which offer more information of a single DNA fragment, but can sequence fewer DNA fragments for a fixed sequencing cost. As a future research direction, we plan to apply our method to PE data and address the question of whether or not it is worthwhile to use PE data for predicting nucleosome positions. 14 Chapter 2 PICS: Probabilistic Inference for ChIP-seq Transcription Factor Data 2.1 Introduction ChIP-seq combines chromatin immunoprecipitation with massively parallel short- read sequencing. While it can profile genome-wide in vivo transcription factor- DNA association with higher sensitivity, specificity and spatial resolution than ChIP-chip, it poses new challenges for statistical analysis that derive from the com- plexity of the biological systems characterized and from variability and biases in its sequence data. In Chapter 1, we have summarized methods in the literature, and discussed a few problems not well addressed by existing methods. In the work described here, we introduce PICS, a method for probabilistic inference of ChIP-seq data that is based on a Bayesian hierarchical truncated t-mixture model. PICS inte- grates four important components. First, it jointly models local concentrations of directional reads. Second, it uses mixture models to distinguish closely adjacent binding events. Third, it incorporates prior information for the length distribution of immunoprecipitated DNA fragments to help resolve closely adjacent binding 15 events and to identify events that have atypical fragment lengths. Fourth, it uses pre-calculated whole-genome read mappability profiles to adjust local read den- sities for reads that are missing due to genome repetitiveness. For each binding event, PICS returns an enrichment score that is relative to a control sample when such a sample is available, and it can use a control sample to estimate a false discov- ery rate. Finally, because it is based on a probabilistic model, PICS can compute measures of uncertainty for binding site estimates; these can be used to estimate binding site locations and to filter out low-confidence regions. The chapter is organized as follows. In Section 2 we introduce the data struc- ture and some notation. In Section 3 we present our Bayesian hierarchical truncated t-mixture model. Section 4 discusses parameter estimation, inference and detec- tion of binding sites. In Section 5 we apply PICS to two published human ChIP-seq datasets and compare its results to those from QuEST, MACS, CisGenome, and USeq. Section 6 presents the results of a simulation study that evaluates the ro- bustness of our model to mis-specification and compares the performance of PICS and the four other methods. In Section 7 we briefly discuss our results and possible extensions. 2.2 Data, Preprocessing, and Notation We used two ChIP-seq data sets that have been analyzed by other groups. [133], us- ing MACS, identified binding sites of FOXA1 (hepatocyte nuclear factor 3α) in hu- man MCF7 breast cancer cells. [125], using QuEST, identified binding sites of the growth associated binding protein (GABP) in human Jurkat T cells. Each data set consists of single-end reads for a treatment (ChIP) and an input DNA control sam- ple. The FOXA1 data consist of 3,909,507 treatment reads and 5,233,322 control reads, while the GABP data consist of 7,830,602 treatment reads and 17,028,066 control reads. Because ChIP-seq aligned-read data are usually sparse, consisting largely of regions in which few or no reads are observed, we first preprocess the read data by segmenting the genome into regions, each of which has a minimum number of reads that aligned to forward and reverse strands. We detect such regions using a w-bp sliding window with an s-bp step size, counting the number of forward and 16 reverse strand reads in the left and right half-windows respectively, and we retain windows that contain at least one forward read and one reverse read. For each chromosome, after merging overlapping windows and removing merged regions that have fewer than two forward or reverse reads, we obtain a disjoint set of can- didate regions, each of which we analyze separately. Because DNA fragments are often between 100 and 300 bp long after gel size selection, for the work described here we chose w = 100 bp, and set s = 10 bp for computational convenience. We tested other values for w and s and obtained essentially the same candidate regions. From this point, we will use the following terminology. A candidate region is a region obtained by our sliding window approach, i.e. one with enough for- ward/reverse reads to be processed by PICS. A binding event refers to a location where the protein of interest is associated with DNA. At a binding event the in vivo protein was either interacting with the DNA directly at a binding site, i.e. a DNA motif that the protein recognizes and binds to [23], or indirectly via membership in a DNA-associated protein complex. We will define a binding site’s midpoint as its position or location. Note that a candidate region can result from more than one binding event, and so can contain more than one binding site. 2.3 Model In this section, we use G a(α,β ) to denote a Gamma distribution with shape param- eter α and scale parameter β . Similarly, N(µ,σ2) denotes a Normal distribution with mean µ and variance σ2, while t4(µ,σ2) denotes a t distribution with 4 de- grees of freedom, mean µ and variance parameter σ2. The exact density of the distributions used are given in the Appendix A. 2.3.1 Modeling a Single Binding Event Having segmented the read data into candidate regions (Section 2.2), we first as- sume that each region contains a single transcription factor binding site. An ex- tension to the case of multiple binding sites is treated below. Let us denote by fi and r j the i− th forward and j− th reverse read positions in a given region, with i = 1, . . . ,n f and j = 1, . . . ,nr. Note that the number of forward reads, n f , and reverse reads, nr, will typically vary between candidate regions. We jointly model 17 the forward and reverse read positions as: fi ∼ t4 ( µ−δ/2,σ2f ) and r j ∼ t4 ( µ+δ/2,σ2r ) (2.1) where µ represents the binding site position, δ is the distance between the max- ima of the forward and reverse read position densities, which corresponds to the average DNA fragment length for the binding event in question, and σ f and σr measure the corresponding variability in DNA fragment end positions. Note that this approach differs from that typically used in analyzing sequencing data, in that we do not model the sequence counts, but rather the distributions of the fragment end positions. Figure 2.1a shows a candidate region with one binding event, along with the corresponding PICS parameter estimates. 2.3.2 Modeling Multiple Binding Events We use mixture models to address the possibility that the sets of forward and re- verse reads within a candidate region were generated by multiple closely-spaced binding events. We model the forward and reverse read positions using t-mixture distributions: fi ∼ K ∑ k=1 wkt4 ( µ f k,σ2f k ) d=g f ( fi|w,µ,δ ,σ f ) r j ∼ K ∑ k=1 wkt4 ( µrk,σ2rk ) d=gr(r j|w,µ,δ ,σ r) (2.2) where µ f k = µk− δk/2 and µrk = µk + δk/2 and µk, δk, σ f k, σrk are defined as in (2.1), but have an index k that corresponds to the binding event k, while wk is the mixture weight of component k, which represents the relative proportion of reads coming from the k− th binding event. For simplicity we denote by g f and gr the resulting p.d.f. of the forward and reverse mixture distributions. Figure 2.1b shows a candidate region that has two binding events, along with the corresponding PICS parameter estimates. As described in (2.1-2.2), PICS uses t distributions with 4 degrees of freedom to model local distributions of the forward and reverse read positions. While the t4 distribution is similar in shape to the Gaussian distribution, its heavier tails make it 18 12064300 12064400 12064500 12064600 12064700 12064800 0. 00 0 0. 00 5 0. 01 0 a) One binding event Re ad s D en sit y > > >> > >> >> >> >> > >> > >> > >> >> >>> >> > << << << < << << < << << < << << < < <<< ∝ 1 σf ∝ 1 σr δ μ PICS Kernel density 005569451000569451005469451 −0 .0 01 0. 00 0 0. 00 1 0. 00 2 0. 00 3 0. 00 4 b) Two binding events Re ad s D en sit y >> >>> > >>> > >> > >> > >> >> >> >>> >> >>> > >> >> > > >>> >> > >> > >>>> >> >> > >>> > >> >> >> >> > >>>> >> >> >>> > >> > > >> >>> >> > << << < < << <<< < << << < << < << < << <<< << << << << << < < <<< <<<< << <<< < < <<<< < < << << < < <<< << < << < <<< < <<< << << < ∝ w1 σf1 ∝ w1 σr1 δ1 μ1 ∝ w2 σf2 ∝ w2 σr2 δ2 μ2 Figure 2.1: Predicted binding events in two candidate regions in the GABP data. PICS detected one binding event in the region in (a) and two bind- ing events in the region in (b). The observed data of forward and reverse strand aligned reads are shown by > and < arrowheads, respectively. Mappability profiles are shown as black/white lines, in which the white intervals show non-mappable regions. In (a) the distribution of reverse readx positions was biased by a region with low mappability. 19 more robust to outliers [68, 79]. Since a DNA fragment should contribute a forward read or a reverse read with equal probability, we use the same mixture weight wk for both forward and reverse distributions. Finally, to accommodate possible biases that result in asymmetric forward and reverse peaks, we use different forward and reverse variance parameters σ2f k and σ 2 rk. 2.3.3 Prior Distributions Typically the library construction process makes prior information available for the length of the DNA fragments. We can use a Bayesian approach to take advantage of this information by allowing the δk’s for all binding sites to derive from a common prior fragment length distribution. Similarly, we can also put a common prior distribution on σ2f k and σ 2 rk, which allows us to incorporate prior information about the variability of the DNA fragment length across a set of binding events, and to regularize variance estimates when few reads are available. For computational convenience, we use a Normal-Gamma conjugate prior, given by (σ−2f k +σ −2 rk )∼ G a(α,β ) and (δk|σ2f k, σ2rk)∼ N(ξ ,ρ−1/(σ−2f k +σ−2rk )) (2.3) where ξ represents our best prior guess about the mean fragment length across all binding sites, and ρ controls the spread around this guess. Based on preliminary analysis of several experimental data, we chose α = 20, β = 40000, ξ = 175, and ρ = 1. These values were based on knowing that DNA fragments should be on the order of 50-250 bp after gel size selection for both datasets considered [125, 133], and resulted in a fairly diffuse prior for the DNA fragment length, with a mean of 175 bp and a standard deviation of approximately 50 bp. Since the common values of hyper-parameters are used in all candidate regions, parameter estimation in different candidate regions can borrow strength from each other. 2.3.4 Summary of Model and Priors In this section, we summarize our model and priors using a graphical representa- tion (Figure 2.2) and the compact set of equations given by (2.4-2.9). In the PICS 20 Zfik Yfi ufik ν (α,β) (ξ,ρ) Zrjk Yrj urjk i=1,2,...,nf j=1,2,...,nrk=1,2,...,K μfk σfk μrk σrk δk wk μk Figure 2.2: Directed acyclic graph representation of the PICS model. Reads aligned to forward and reverse strands are denoted by f and r. Dashed rectangles represent fixed hyper-parameters. Grey-filled circles indi- cate missing data that are incorporated into the model to facilitate in- ference via the EM algorithm while plain circles represent unknown variables that need to be estimated. Plain squares represent the data (aligned forward and reverse read positions). Finally, thick arrows in- dicate deterministic relationships, while thin arrows indicate stochastic relationships. The index k refers to the mixture component k. model, the forward and reverse read positions are linked by the shared parameters, which are denoted by the nodes in the figure’s center column. The left and right sides of the figure respectively show the information related to the forward and reverse strand reads. Given the average fragment length δk and the binding site positions µk, the centers of the forward and reverse read position density distri- butions can be calculated by (2.4). Conditional on the binding event memberships 21 (the zd’s), the read positions follow a Normal-Gamma compound distribution given by (2.5-2.6). In other words, conditionally on the z’s, the read positions follow a t distribution with ν degrees of freedom. The unknown membership indicator z f i (resp. zr j) indicates which binding event a forward read i (resp. reverse read j) comes from; both z f and zr share the same multinomial distribution given by (2.7). As a consequence, the marginal distribution of the read positions is a mixture of t distributions whose weights are given by the parameters of the multinomial dis- tribution in (2.7). The introduction of the missing data (the z’s and the u’s) allows us to perform parameter estimation via an EM algorithm as described in Section 2.4. Finally, the average fragment length parameter δk and peak variances σdk, d = { f ,r} are assigned a Normal-Gamma conjugate prior, given by (2.8-2.9). µ f k = µk−δk/2, µrk = µk +δk/2 (2.4) (di|Udik = udik,zdik = 1,µk,δk) ∼ N ( µdk, σ2dk udik ) (2.5) (Udik|zdik = 1) ∼ G a(v/2,v/2) (2.6) z f i,zr j ∼ Multinomial(w) (2.7) (σ−2f k +σ −2 rk ) ∼ G a(α,β ) (2.8) (δk|σ2f k,σ2rk) ∼ N(ξ ,ρ−1/(σ−2f k +σ−2rk )) (2.9) 2.3.5 Accounting for Missing Reads Building on (2.1-2.2), we now consider the case where some reads are missing due to one or more non-mappable regions intersecting a candidate region. For clarity, we again focus on a single candidate region, whose genomic extent is denoted by I. For each chromosome, a mappability profile for a specific read length (e.g. 36 bp) consists of a vector that lists an estimated read mappability ‘score’ for each base pair in the chromosome [107]. A score of one at a genomic position means that we should be able to uniquely align a read that overlaps that position, while a score of zero indicates that no read of that length should be uniquely alignable at that position. As noted above, typically only reads that map to unique genomic lo- cations are retained for analysis. For convenience, and because transitions between mappable and ambiguous regions are often much shorter than the regions, we com- 22 pactly summarize each chromosome’s mappability profile as a disjoint union of ambiguous intervals that specify only zero-valued profile regions (Figure 2.1). Let us assume that a candidate region intersects one or more of these inter- vals. We can write I = ⋃L l=0 Il , where Il = [al,bl] denotes the l− th ambiguous interval, with l = 1, . . . ,L; and I0 denotes the union of intervals that have high map- pability, and so should have no missing reads. In Il , the fli (i = 1, . . . ,n f l) and rl j ( j = 1, . . . ,nrl) denote n f l independent forward read positions and nrl indepen- dent reverse read positions. Note that only the quantities with l = 0 are observed, while all others are unobserved random variables. Also, note that n f 0, nr0, and L will vary across candidate regions. Based on (2.2), fli and rl j, l = 1, . . . ,L, follow a truncated t-mixture model, which is given by g f and gr truncated on Il . The only information carried in the mappability profile is the locations and lengths of intervals Il; these affect the es- timation of the model parameters shared between the observed and unobserved reads, i.e. w, µ , δ , σ f , and σ r as described in Section 2.4. 2.4 Estimation and Inference 2.4.1 Parameter Estimation Using the EM Algorithm Given the conjugacy of the prior chosen, an Expectation-Maximization (EM) al- gorithm [22] can be derived to find the maximum a posteriori estimates (MAP) of the unknown parameter vector Θ = (θ 1, . . . ,θK) where θ k = (wk,µk,δk,σ2f k,σ 2 rk). Our algorithm is similar to those used in t mixture models and Bayesian regular- ization for mixture models [32, 103]. In the presence of missing reads, we use an algorithm similar to that developed by [86] for grouped and truncated data. The difference is that in each mixture component, we joint model forward and reverse reads. The algorithm is described in detail in the Section 2.5. We initialize EM using results of K-means algorithm. This could signifi- cantly reduce the number of iterations required for convergence, and can reduce the chance of converged to a local maximum. Alternatively, using multiple ran- dom initial values is a popular approach used in the literature, and it might provide better results by avoiding a local maximum, but it makes our method much slower. 23 Given the sample size of the data is usually large, we did not choose multiple ran- dom initial approach. For the speed issue, we force the EM algorithm to stop, if it does not converge within 100 iterations. This result in an approximation of MAP is given in some regions, when EM algorithm cannot converge fast enough. The threshold 100 is chosen based on our empirical data analysis. We found EM algorithm converge within 100 iterations in most regions of the datasets we have analyzed. 2.4.2 Inference and Detection of Binding Sites Choosing the number of binding events in each region. In practice, K, the num- ber of binding events, is unknown and needs to be estimated. For each candidate region, we fit our PICS model with K taking values from 1 to 15, and select the value of K that has the largest BIC [114], which in our case is given by BIC = 2Q(Θ= Θ̂|Θ̂)− (5K−1) ln(n f 0+nr0), (2.10) where Θ̂ is the final estimate for the parameters Θ, and Q is the log-likelihood as defined in the Appendix A. Uncertainty of parameter estimates: It is useful to extend the point estimates for the parameters of interest, µ and δ , by deriving measures of uncertainty for them. Within our framework of mixture models with truncated data, we use the approach described in [87] to derive an approximation of the observed information matrix for the parameters. From the observed information matrix, we can then obtain approximate standard errors for both µ̂ and δ̂ . We can use these standard errors to define the starts and ends of binding event neighborhoods, filter out noisy predictions, and estimate confidence intervals for binding site point locations. Binding event neighborhoods: Because PICS models local concentrations of bidirectional read positions, we can define ‘high confidence’ neighbourhoods whose extents are given by the maxima of forward and reverse read position densities. Us- ing our PICS parameters, and taking into consideration the standard errors of the estimates, for a given binding event this neighborhood is defined as the interval µ̂± δ̂/2, extended by three standard errors on each side (i.e. SE(µ̂− δ̂/2) for the left limit and SE(µ̂ + δ̂/2) for the right limit). These high confidence neighbor- 24 hoods can define ‘enriched’ regions in a file that can be visualized in a genome browser [65]. Peak merging and filtering: We use BIC to estimate the number of binding events within each candidate region. While BIC is well suited for selecting the number of mixture components required to estimate an underlying probability density, it sometimes overestimates the number of components [6]. In our case, this can oc- cur when a candidate region contains hundreds of reads. To address this, we merge peaks that have overlapping binding event neighborhoods. The parameters of the merged peaks are obtained by moment-matching conditions (see Appendix A). Since the combined parameter estimates µ̂ and δ̂ are linear combination of the original ones, the original information matrix can be used to recompute the stan- dard errors. For the GABP and FOXA1 data described below, this approach merged less than 1% of the binding events. In addition to merging overlapping events, we also filter out binding events that have noisy (i.e. poorly estimated) or atypical parameter estimates, as these could affect downstream analysis. Specifically, we remove binding events that fail to satisfy any of the following three criteria: (i) SE(µ̂)< 50; (ii) 50 < δ̂ < 200 and (iii) σ̂d < 150,d = { f ,r}. Essentially, (i) filters events that have poor binding site position estimates, while (ii) and (iii) filter events that have atypical DNA fragment distributions, given the fragment size selection applied in library construction, and so are likely to be artifactual regions (e.g. events that have high fractional overlaps with simple tandem repeats [55]). Scoring and ranking binding events. In order to identify and rank a statistically meaningful subset of binding events, we define an enrichment score for each bind- ing event. For a given event, we define FChIP and RChIP, the number of observed forward/reverse ChIP (‘treatment’) read positions that fall within the 90% con- tours of the forward/reverse read position densities, i.e. within µ̂d ± c · σ̂d where d = { f ,r} and c≈ 2.13 is the 95% quantile of the t4 distribution. For each binding event, we define the enrichment score as O = FChIP +RChIP, which is an estimate of the observed number of DNA enriched fragments falling within the binding events after removing outliers. When a control sample is available, we also de- fine Ocont = Fcont +Rcont , by computing the number of observed forward/reverse reads in the control sample that fall within the 90% contour of the forward/re- 25 verse read position densities estimated from the ChIP sample. Using this infor- mation, we define an enrichment score for the treatment relative to the control as S=(Ncontrol/NChIP) ·O/(Ocont+1), where the addition of the constant one prevents a division by zero, and Ncontrol (resp. NChIP) denote the total read count in control sample (resp. IP sample). The scaling of the enrichment score by Ncontrol/NChIP accounts for the control and ChIP samples having different numbers of reads (se- quence depth). False discovery rate. Given control data, we can estimate the false discovery rate as a function of the enrichment score. We do this by repeating the analysis after swapping the control sample for the ChIP sample and recomputing our enrichment scores (i.e. fit PICS to the control data, and define control-over-IP enrichment scores), which we call ‘null’ enrichment scores and denote by S0. Then the FDR, as a function of the threshold value q, can be computed as: FDR(q) = {#S0 : S0 > q} {#S : S > q} , that is, as the ratio of the number of null enrichment scores greater than q, divided by the number of observed enrichment scores greater than q. Summary Detection of locations of transcription factor binding sites is the objec- tive of our analysis. These locations are obtained from the estimated and post- processed model parameter µ’s. The number of binding sites is decided by the following four steps. (1) In the model fitting step, the number of µ’s in each re- gions is chosen by BIC criteria. (2) In the Peak-merging step, we merge mixture components that are used to describe a single peak, which reduces the number of µ’s. (3) In the Peak-filtering step, we filter out the estimated binding locations where model parameters have atypical estimated values, which further reduces the number of µ’s. (4) Finally all binding sites are ranked by their scores, only the top N binding sites are reported as the final results of PICS. Note N could be chosen using the estimated FDR, but we do not recommend that since it is known that FDR of binding sites are hard to estimate well by all available methods. Alter- natively, users could choose N by the requirement of downstream analysis or the amble point on the score curve. 26 2.5 Derivation of the EM Algorithm From this point, for ease of notation, we use the letter d to denote strand of read (i.e. either d = f or d = r). 2.5.1 No Missing Reads We first consider the case in which no reads are missing due to locally low read mappability. Let us denote the complete data by c = ( f ,r,z f ,zr,u f ,ur). Then the complete data likelihood can be written as: l(Θ|c) = n f ∑ i=1 K ∑ k=1 z f ik { log [ wkN ( fi ∣∣∣∣ µk− δk2 , σ 2 f k u f ik )] + ν 2 ( logu f ik−u f ik )− logu f ik } + nr ∑ j=1 K ∑ k=1 zr jk { log [ wkN ( r j ∣∣∣∣ µk + δk2 , σ2rkur jk )] + ν 2 ( logur jk−ur jk )− logur jk} + (n f +nr) [ν 2 log (ν 2 ) − logΓ (ν 2 )] . Similarly, we define the prior penalty lprior as: lprior = ∑ k logpi(δk,σ2rk,σ 2 f k|ρ,ξ ,α,β ) = ∑ k { 1 2 log(σ−2rk +σ −2 f k )− ρ 2 (σ−2f k +σ −2 rk )(δ −ξ )2 } +∑ k { (α−1) log(σ−2f k )−βσ−2f k +(α−1) log(σ−2rk )−βσ−2rk } + cst = −1 2∑k { (σ−2f k +σ −2 rk )[ρ(δk−ξ )2+2β ]− (2α−1)[logσ−2f k + logσ−2rk ] } + cst where cst is a constant. Our goal is to maximize the penalized likelihood l∗(Θ|y)≡ l(Θ|y)+ lprior over Θ to obtain the maximum a posteriori (MAP) estimates. E-Step: Given the current estimate Θ− for Θ, the conditional expectation of the 27 penalized log complete data likelihood is given as: Q(Θ|Θ−) def= E[l(Θ|y)|Θ−]+ lprior = ∑ d∈{ f ,r} nd ∑ i=1 K ∑ k=1 z̃dik { logwk− logσdk− ũdik2 ( di−µdk σdk )2} +lprior+ constant (2.11) Given this, the E-step [103] consists of computing the following quantities: z̃dik def= E(Zdki|di,Θ−) = w−k t4(di|µ−dk,σ−dk) ∑k w−k t4(di|µ−dk,σ−dk) , (2.12) ũdik def= E(Udik|di,zik = 1,Θ−) = (ν+1)/(ν+ ( di−µ−dk σ−dk )2 ). (2.13) M-Step: During the M-step, the goal is to maximize Q(Θ|Θ−) with respect to Θ, which requires solving ∂Q(Θ|Θ−)/∂Θ = 0. We solve the following equation system to obtain updated parameter estimates: 0 = ∂Q∗ ∂wk = ∑ d∈{ f ,r} nd ∑ i=1 K ∑ k=1 z̃dikw−1k , ∑ k wk = 1 (2.14) 0 = ∂Q∗ ∂µk = ∑ j z̃r jkũr jk (r j−µrk) σ2rk +∑ i z̃ f ikũ f ik ( fi−µ f k) σ2f k (2.15) 0 = ∂Q∗ ∂δk = ∑ j z̃r jkũr jk (r j−µrk) 2σ2rk −∑ i z̃ f ikũ f ik ( fi−µ f k) 2σ2f k −ρ(σ−2f k +σ−2rk )(δk−ξ ) (2.16) 0 = ∂Q∗ ∂σ−2f k = ∑ i z̃ f ik 2 [ σ2f k− ( fi−µ f k)2ũ f ik ] −ρ(δk−ξ ) 2+2β 2 + 2α−1 2 σ2f k (2.17) 0 = ∂Q∗ ∂σ−2rk = ∑ j z̃r jk 2 [ σ2rk− (r j−µrk)2ũr jk ] −ρ(δk−ξ ) 2+2β 2 + 2α−1 2 σ2rk. (2.18) 28 Because there is no simple closed form solution forΘ, we adopted a conditional approach in which we first maximize over (w,µ,δ ), conditional on (σ f ,σ r), and then maximize over (σ f ,σ r), conditional on the previously updated (w,µ,δ ), re- sulting in an Expectation/Conditional Maximization (ECM) algorithm [88]. Con- ditionally on current estimations σ−f k and σ − rk , we need to solve the linear system given by (2.14 - 2.16). We obtained the following closed form new estimates: ŵk ← χ̃ f k + χ̃rkN f +Nr , µ̂k ← s̃ f k + s̃rkm̃ f k + m̃rk + m̃ f k− m̃rk 2(m̃ f k + m̃rk) δ̂k, δ̂k ← s̃rkm̃−1rk − s̃ f km̃−1f k +ρ((σ−f k)−2+(σ−rk)−2)ξ (m̃−1f k + m̃−1rk ) 1+ρ((σ−f k)−2+(σ − rk)−2)(m̃ −1 f k + m̃ −1 rk ) , where χ̃dk = nd ∑ i=1 z̃dik, s̃dk = (σ−dk) −2 nd ∑ i=1 diz̃dikũdik, m̃dk = (σ−dk) −2 nd ∑ i=1 z̃dikũdik. Similarly, conditionally on these new estimates ŵk, µ̂k, δ̂k, we can then solve a lin- ear system given by (2.17 - 2.18). The new estimate of σ−2dk is give as σ̂−2dk ← χ̃dk +2α−1 η̃dk +ρ(δ̂k−ξ )2+2β 2.5.2 Accounting for Missing Reads In the presence of missing reads, we decompose the log complete data likelihood as l(Θ|y) = ∑Ll=0 ll(Θ|y), where ll(Θ|y) is the complete-data log-likelihood in par- tition Il , given by: ll(Θ|y) = ∑ d∈{ f ,r} ndl ∑ i=1 K ∑ k=1 zdlik { logwk− logσdk− udlik2 ( dli−µdk σdk )2} + cst. We now have additional missing data, ndl and dli, corresponding to the number of missing reads and their positions. To accommodate this, we simply add two 29 steps to our E-step. Because the unknown counts ndl , l = 1, . . . ,L, follow a negative multinomial distribution, we simply replace them with their conditional expectations, which are given by: ñdl def= E(ndl|yd0i,Θ−) = nd0Pdl(Θ−)/Pd0(Θ−), (2.19) where Pdl(Θ−) def= Pr(X ∈ Il) = ∫ Il gd(x|Θ−)dx and Pd0(Θ−) def= Pr(X ∈ S0) = 1− ∑Ll=1 Pdl(Θ −) are the probability measures of the partitions Il and I0. Then, conditionally on the ñdl , we replace the following quantities with the corresponding expectations: χ̃dk ← χ̃d0k + L ∑ l=1 ñdlEdl[z̃dlk], s̃dk ← s̃d0k + L ∑ l=1 ñdlEdl[z̃dlkũdlk], m̃dk ← m̃d0k + L ∑ l=1 ñdlEdl[dl z̃dlkũdlk], η̃dk ← η̃d0k + L ∑ l=1 ñdlEdl[(dl− µ̂k)2z̃dlkũdlk], where χ̃d0k, s̃d0k, m̃d0k, and η̃d0k are the original quantities as defined in the M-step in the case of no missing reads, and Edl are the expectations with respect to the unobserved read positions (dli, l > 0), conditional on observed read positions d0i and on previous estimated parameters Θ−. We now need to calculate the following expectations with respect to the double truncated t-mixture density of the positions of the unobserved reads: Edl[z̃dlk] = w−k P −1 dl (Θ −)H3,dlk Edl[z̃dlkũdlk] = w−k P −1 dl (Θ −)H0,dlk Edl[dl z̃dlkũdlk] = w−k P −1 dl (Θ −)[2σ−k H1,dlk +µ − k H0,dlk] Edl[(dl− µ̂k)2z̃dlkũdlk] = w−k P−1dl (Θ−)[4(σ−k )2H2,dlk +(µ−k − µ̂k)2H0,dlk] +w−k P −1 dl (Θ −)4(µ−k − µ̂k)σ−k H1,dlk. 30 The H quantities can be calculated as: H j,dlk = h j ( bl−µ−dk 2σ−dk ) −h j ( al−µ−dk 2σ−dk ) , H3,dlk = T4(bl|µ−dk,σ−dk)−T4(al|µ−dk,σ−dk) for j = 0,1,2, where B = Γ(3.5)/(Γ(3) √ pi), and the functions h j’s are defined as: h0(x) = B ( 1 5 [h4(x)]5− 23 [h4(x)] 3+h4(x) ) h1(x) = B ( 1 3 [h4(x)]3− 15 [h4(x)] 5 ) h2(x) = −B 5 ( 1+ x2 )−2.5 h4(x) = sin(arctan(x)). 2.6 Application to Experimental Data We applied PICS to the two experimental data sets described in Section 2.2, obtain- ing 75,451 candidate regions and 86,262 binding events for the GABP data, and 51,843 candidate regions and 53,740 binding events for FOXA1 data. Figure A.1 shows histograms of estimated average DNA fragment lengths for both datasets. For the FOXA1 data the estimated average fragment size was approximately 150 bp, consistent with [133]; it was somewhat smaller for the GABP data. Figure A.1 also shows that most of the regions had DNA fragment lengths between 50 and 200 bp, which supports our filtering atypical regions by this parameter. Noting that some of the algorithms responded very differently in terms of esti- mated FDR (see Figure A.2), we compared the methods by identifying conserved DNA sequence motifs in the 5000 top-ranked predictions from each method, using 200-bp wide regions that were centered on each method’s binding site estimates. For motif analysis we used GADEM [75], which can process large sets of ChIP- seq regions on a single CPU, identifies multiple motifs and adjusts motif widths, and returns motifs similar to those from algorithms that are more computationally demanding. We assessed the de novo motifs using STAMP [85], and retained only 31 expected and biologically relevant motifs. As expected, for all five methods, GA- DEM identified GABP and Forkhead (FKHR) motifs as dominant in GABP and FOXA1 datasets respectively. For the FOXA1 data, for all methods, many regions also contained the binding motif for the FOS proto-oncogene protein. The FOS gene family encodes leucine zipper proteins that can dimerize with proteins of the JUN family to form the AP-1 complex [90]. The AP-1 complex is overexpressed in ER-positive cells (e.g. MCF7) and can interact directly with the ER transcription factor [17, 90]. Similarly, the FOXA1 protein is known to play an important role in ER regulation and to interact with ER [26, 82]. The FOS motif that we identified was consistent with AP-1 enriched motifs reported for ChIP-chip FOXA1 regions [82] and may reflect interactions, possibly indirect, between the FOS and FOXA1 proteins. For the work described here, we used GABP motif occurrences for eval- uating GABP results, and both FKHR and FOS motif occurrences for evaluating FOXA1 results. We evaluated the methods using two criteria: 1) the motif occurrence rate, i.e. the fraction of predicted binding events that contained a biologically expected mo- tif, for which a larger value indicates better performance; and 2) the spatial error, i.e. the distance between a binding site point estimate and a motif location, for which a smaller value indicates better performance. Because a motif can occur more than once in a sequence, we used only the motif instance closest to the pre- dicted binding event when computing spatial error. Figures 2.3 show the motif occurrence rate and spatial error as a function of the region rank for the 5000 top-ranked predictions for the GABP and FOXA1 data. Overall, in terms of occurrence rate, PICS performed better than the other methods. MACS and USeq were relatively close to PICS, while QuEST and cisGenome performed comparably, but were less effective than PICS, MACS and USeq. For spatial error, PICS and MACS were most effective, followed closely by USeq, and then QuEST and cisGenome. As stated in Section 2.3, we use mixture models to address the possibility that the sets of forward and reverse reads within a candidate region were generated by multiple binding events. In order to assess how PICS’s mixture model can detect multiple binding sites within a candidate region, we used our predicted transcrip- tion factor binding motifs for the top-ranked 5000 PICS predictions for the GABP 32 (a) GABP: occurrence rate (b) GABP: spatial error 1000 2000 3000 4000 5000 0. 70 0. 75 0. 80 0. 85 0. 90 (Top) n M ot if o cc ur re nc e ra te  (% ) PICS MACS QUEST CIS USEQ 0 1000 2000 3000 4000 5000 0 10 20 30 40 (Top) n Sp at ial  e rro r ( bp ) PICS MACS QUEST CIS USEQ 1000 2000 3000 4000 5000 0. 70 0. 75 0. 80 0. 85 0. 90 0. 95 (Top) n M ot if o cc ur re nc e ra te  (% ) PICS MACS QUEST CIS USEQ 0 1000 2000 3000 4000 5000 0 10 20 30 40 (Top) n Sp at ial  e rro r ( bp ) PICS MACS QUEST CIS USEQ (c) FOXA1: occurrence rate (d) FOXA1: spatial error Figure 2.3: Motif occurrence rate and spatial error (see text) for GABP and FOXA1 data, as a function of region enrichment rank, for the 5000 top- ranked regions for each method. and FOXA1 data. In each case, we determined the percentage of binding events from single- and multiple-component candidate regions (i.e. regions with multiple predicted events) that could be associated with at least one motif site. Table 2.1 shows these results as a function of the number of PICS predictions (i.e. mixture components) in a candidate region. Four times more GABP regions than FOXA1 regions had two components (940 vs. 208), and seven times more had at least three components (236 vs. 33), even though candidate regions had comparable sizes. Because shorter DNA fragments should support detecting more adjacent binding sites, these differences may be related to the smaller average fragment size esti- mated for the GABP data. For both data sets, the percentage of binding events 33 Table 2.1: Number of 5000 top-ranked PICS predictions from GABP and FOXA1 data in candidate regions that had 1, 2 or at least 3 predicted bind- ing events (i.e. mixture components). The first row gives the number of binding events in each category, while the second row gives the percent- age of predicted events that could be associated with an expected motif. For example, 80 percent of the 940 binding events in two-component GABP regions could be associated with a predicted GABP site. GABP FOXA1 # of components in region 1 2 3+ 1 2 3+ # of events 3824 940 236 4759 208 33 % of motifs 81 80 77 86 88 94 that was associated with a predicted binding motif was relatively insensitive to the number of predictions in a region. Because cells can use multiple closely-spaced transcription factor binding sites to establish progressive expression responses to cellular signals, we also assessed how effectively PICS can detect closely adjacent binding sites. To evaluate PICS and all other methods compared here, we gen- erated another table, but this time considered predicted binding events that had at least one other event within a fixed distance d. Table 2.2 summarizes the results for d = 250,500 and 1000 bp. For these data, PICS, cisGenome, and USeq were the most effective at identifying proximal binding events, and large fractions of these events were associated with a predicted motif site. While cisGenome and USeq also predicted large numbers of proximal binding events, a larger fraction of those reported by PICS were associated with predicted binding motifs. In comparison, MACS returned the lowest number of proximal binding events, which suggests that MACS is less effective in discriminating proximal binding events, even though it performs relatively well in terms of the overall number of predictions that are as- sociated with a motif (Figure 2.3). These results suggest that our mixture model was effective in distinguishing biologically meaningful proximal binding events. Results from the simulation study in Section 2.7 were consistent with this. As described in Section 2.4, PICS can compute approximate standard errors for its model parameter estimates. In particular, we can derive an approximate 34 Table 2.2: Number of proximal binding events found in the 5000 top-ranked regions identified by each method in GABP and FOXA1 data, as a func- tion of the motif ‘proximity’ distance d. The numbers in paratheses give the percentage of predicted binding events that could be associated with at least one predicted motif site. For example, in the GABP data, PICS identified 128 binding event locations that were within 250 bp of another location, and 86% of these predicted events could be associated with a motif. d (bp) PICS QuEST MACS cisGen USeq 250 128(86) 0 0 0 6(90) GABP 500 437(84) 2(50) 18(83) 16(57) 106(77) 1000 517(81) 405(63) 54(81) 76(63) 269(72) 250 2(100) 0 0 4(75) 4(83) FOXA1 500 62(85) 18(72) 18(100) 34(82) 70(74) 1000 108(87) 74(76) 54(88) 105(85) 149(77) confidence interval for a given predicted binding event location as µ̂ ± c ·SE(µ̂), where c is a constant to be chosen as a function of the motif coverage desired. For example, assuming that µ̂ is approximately Normal, c = 1.96 should give us an approximate 95% confidence interval for our binding site position. Using the set of motifs identified by GADEM, we evaluated the actual cov- erage of our confidence intervals for different values of c. Figure 2.4 shows the occurrence frequency of GABP motifs (top) and FOXA1 motifs (bottom) within c · SE(µ̂) of peaks centers. Using 3 standard errors, the coverage was approxi- mately 65% and 85% for the GABP and FOXA1 data. While these results indicate that the current version of PICS provides a capable modelling framework, they also suggest that certain biases remain unaddressed. For example, PICS models the binding site as the mid-point between the F/R peak summits, but fragmentation biases during sonication due to local chromatin structure could result in a binding site being asymmetrically positioned with respect to its associated DNA fragment densities. Locations where motifs fall outside event confidence intervals identify a subset of regions on which to focus ongoing work. Finally, we evaluated the effect of the mappability profiles on the parameter 35 (a) GABP (b) FOXA1 0 1000 2000 3000 4000 5000 0. 5 0. 6 0. 7 0. 8 0. 9 (Top) n M ot if o cc ur re nc e ra te  (% ) c=2 c=1 c=3 c=5 0 1000 2000 3000 4000 5000 0. 4 0. 5 0. 6 0. 7 0. 8 (Top) n M ot if o cc ur re nc e ra te  (% ) c=2 c=1 c=3 c=5 Figure 2.4: Fraction of predicted binding events with a GABP (a) or FOXA1 (b) motif site within c · SE(µ̂) of the predicted event location, µ̂ , as a function of region enrichment rank. 36 estimates. We re-did the analysis while ignoring mappability, and compared the spatial error, i.e. the distance to the closest computationally verified binding site, with and without the mappability correction. A paired t-test showed that our map- pability correction significantly (p< 0.05) improved spatial error (data not shown). 2.7 Simulation Study We used a series of simulations to characterize the performance of PICS under var- ious model (mis) specifications and to compare it to the four methods described above. Our simulation study is based on chromosome 1 GABP data and the pa- rameters estimated by PICS on these data. 2.7.1 Simulation Scheme We considered three simulation scenarios, first generating read positions from our model, then mis-specifying the prior, and finally mis-specifying the likelihood. We started by randomly removing a third of the control reads, which we used to inject noise into the IP data, while retaining the remaining two thirds as the new control. From our analysis of the GABP data, we obtained about 2000 candidate regions on chromosome 1. From these, we randomly selected 500 regions as spike- ins. In each spike-in region, we generated F/R signal read positions from a mixture of t distributions, with a prior given by our model and hyperparameters set to the values used previously. In order to simulate our data, we also needed to know how many reads to simulate and where to place the binding events. For the binding site locations, we used the parameter values for µ estimated from the GABP data for the corresponding spiked-in regions. Then we set the number of reads to be simulated to the number of reads in the control region multiplied by the enrichment score estimated by PICS. After simulation we obtained 551 binding events across 500 spike-in regions. Note that the number of binding events was greater than the number of spiked-in regions, since PICS identified some of the regions in the GABP data as having multiple binding events. We also added the random third of control reads to the simulated reads to form the IP sample to emulate background reads. While we explicitly mis-specified models for scenarios 2 and 3, the addition of the control reads to the IP sample makes even scenario 1’s model slightly mis- 37 specified. Finally, in order to account for non-unique alignments, we removed all reads that fell into ambiguous regions. In the second simulation scenario we mis-specified the prior by doubling the shape parameter of the Gamma distribution and shifting the mean fragment distri- bution from 175 to 125, as in (2.3). Finally, in the third simulation scenario we mis-specified the likelihood by re- placing the t distributions by Gamma distributions. The parameters of the F/R Gamma distributions were set so that the mean and variance were approximately equal to those of the t4 distributions. For a given binding event (mixture compo- nent), the reverse Gamma distribution has support over (µk,∞), while the forward read position distribution is defined as the mirror image of the same Gamma about µk. 2.7.2 Simulation Results We analyzed each data set with PICS, MACS, cisGenome, QuEST, and USeq. In each case, we summarized the results with a receiver operating characteristic (ROC) curve, which shows the relationship of sensitivity to specificity when vary- ing the cutoff for each method, and a plot of the nominal FDR against the true FDR. Figure 2.5 shows that the results for the simulation scenarios were similar to those for experimental data. In this figure we only show the ROC curves in the region where 1-specificity is between 0 and 0.2, since that is the region of most interest. The full ROC curves (from 0 to 1) are shown in Figure A.5. PICS was slightly better than MACS and USeq under scenarios 1 and 3, whereas under sce- nario 2 USeq was slightly better than PICS, though the differences were small. FDR results show larger differences; PICS was closest to the expected y = x line, while MACS, QUEST and USeq tended to underestimate the FDR and cisGenome tended to overestimate it. Note that in the results presented here we only used a single simulation because some of the software (e.g. USeq, cisGenome) require manual user input at different analysis stages and thus could not be run in a fully automated way. This said, the result for PICS and MACS were consistent over a few repeats (data not shown). Finally, even though PICS’ estimation of the FDR is good, it is still slightly overestimated at low values, which could be due to the 38 slight model mis-specification caused by injecting control reads into the IP sam- ple. Overall this suggests that FDR estimates should be treated carefully for all methods. As with the FOXA1 and GABP data, we also compared the methods in terms of their spatial error (see Figure A.3). Again PICS was the most accurate, followed closely by USeq, MACS and QuEST. In addition, we also assessed the perfor- mance of our measures of uncertainty by calculating the actual coverage rate of our approximate 95% confidence interval calculated as ±1.96 · se(µ̂). Figure A.4 shows that the coverage rate as a function of peak rank. The coverage rate was be- tween 90-95% even under mis-specified models, which suggests that our approach is reasonably accurate and robust to mis-specification. Finally, we re-assessed the ability of each method to detect closely adjacent binding sites. Since our simulation scheme included closely adjacent binding sites that were generated in spike-in regions, we compared the number of sites predicted by each method to the true number of sites in each region. Tables A.1-A.3 show the results for each method and simulation scenario. As expected, PICS performed best at discriminating closely adjacent binding sites, and did this even under mis- specification. This suggests that BIC is an effective metric for estimating the num- ber of binding events. 2.8 Discussion We have developed PICS, a probabilistic framework for detecting transcription fac- tor binding events from ChIP-seq experiments. The approach integrates a number of important factors in interpreting aligned read data, including using prior infor- mation for input DNA fragment lengths and correcting for reads that are missing due to genome repetitiveness. Working with two published ChIP-seq data sets from human cell lines, we compared PICS to four alternative analysis methods. While additional methods are available (e.g. [29, 60, 111]), the four methods that we used have been shown to perform well, and so offer reasonable performance baselines from a range of algorithms. For both experimental data sets, the binding events predicted by PICS were the most consistent with computationally identified motif sites. Our simulation study confirmed PICS’ effectiveness and also showed that its 39 0.0 0.1 0.2 0.3 0.4 0.5 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 Nominal FDR Tr ue  F DR PICS MACS cisGenome QuEST USEQ 0.00 0.05 0.10 0.15 0.20 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ 0.0 0.1 0.2 0.3 0.4 0.5 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 Nominal FDR Tr ue  F DR PICS MACS cisGenome QuEST USEQ 0.00 0.05 0.10 0.15 0.20 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ 0.0 0.1 0.2 0.3 0.4 0.5 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 Nominal FDR Tr ue  F DR PICS MACS cisGenome QuEST USEQ 0.00 0.05 0.10 0.15 0.20 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ (b) Scenario 1 FDR(a) Scenario 1 ROC (d) Scenario 2 FDR(c) Scenario 2 ROC (e) Scenario 3: ROC (f) Scenario 3: FDR Figure 2.5: Partial receiver operating characteristic curves and nominal ver- sus true FDR for the three simulation scenarios. 40 model is robust to mis-specifications. To address read ‘outliers’ due to biological and technical biases, we used a t distribution as a robust alternative to the Gaussian distribution. We fixed the degrees of freedom parameter at four because this value has been shown to work well in many applications [68]. To confirm that this was an appropriate choice, we estimated the degrees of freedom for each candidate region and found that it was less than ten in about 20% of all regions. This suggests that many of the regions contain read outliers and would be poorly modeled by a Gaussian distribution. In addition, many of the candidate regions have very few reads, and it is possible that the number of reads was not large enough to properly estimate the degrees of freedom. We believe that with more sequence reads, the overall percentage of candidate regions with low degrees of freedom would have been even larger. For that reason, and because estimating the degrees of freedom is computationally intensive, we used a fixed four degrees of freedom. We showed that PICS’ mixture model addresses multiple adjacent enrichment events, and can fit a different DNA fragment length value for each binding event in a mixture. Our estimation of the number of binding events in a region is based on the Bayesian information criteria. Although regularity conditions required for consistency of the BIC estimate do not hold for mixture models, there is consid- erable theoretical and practical support for its use in this context [31, 109]. Our simulation study confirmed this, and suggests that our estimation of the number of binding events performs relatively well. In our BIC estimation, we allowed the mixture model to detect up to 15 components per candidate region, but this limit can be readily adjusted. Because it is based on mixture models and accounts for missing reads, PICS is computationally intensive. The results reported here were generated with an implementation of PICS that was written in the R programming language [47]. Processing a ∼ 10M read data set required an average computing time of 30 min- utes on a machine with dual quad-core 3GHz CPUs. PICS will be made freely available via Bioconductor [35]. Paired-end (PE) data should resolve a subset of read alignments that would be non-unique in SE data, and offer direct information on DNA fragment lengths that is not available for single-end (SE) reads. However, PE experiments require 41 more input DNA and more complex library preparation, and double the sequencing machine time; to our knowledge, all published short read ChIP-seq data are SE rather than PE. We anticipate that PICS’ probabilistic approach will remain useful for PE data, where the defined fragment lengths should simplify the modelling framework. As a first step in implementing a probabilistic approach for ChIP-seq data, we have shown how to incorporate prior information about the DNA fragment lengths using a Bayesian approach. We can extend the PICS framework to include more types of prior information. For example, we could place a prior distribution on µ , the binding site position, and could include in this information about nucleosome occupancy and computationally derived motifs. Such extensions should allow us to improve the detection of biologically relevant binding sites. More generally, we an- ticipate that extended probabilistic methods for ChIP-seq will contribute to mech- anistic biological insights by offering principled ways for addressing backgrounds, noise and biases, and for integrating diverse types of biological information. For example, histone modifications are an important part of cellular biology, and their aligned reads present a wider range of nucleosome-based density distributions than the localized transcription factors that PICS currently addresses. As part of ongo- ing work in this area, we are extending PICS to infer nucleosome positioning from histone modification data. 42 Chapter 3 An Integrated Pipeline for the Genome-Wide Analysis of Transcription Factor Binding Sites from Chip-Seq 3.1 Introduction Transcription factors (TFs) play critical roles in regulating gene expression. Deter- mining transcription factor binding sites (TFBSs) is challenging because the DNA segments recognized by TFs are often short and dispersed in the genome, and the target loci of a TF vary between tissues, developmental stages and physiological conditions. Genome-wide protein-DNA interactions are now typically profiled using ChIP- Seq, i.e. chromatin immunoprecipitation (ChIP) with massively parallel short-read sequencing [104]. A typical ChIP-Seq experiment generates millions of short (35- 75 bp) directional DNA sequence reads that represent ends of ∼200bp immuno- precipitated DNA fragments. The read sequences are mapped onto a reference genome. Then, for experiments with transcription factors, there are three central analysis issues: peak-calling, binding motif identification, and motif interpreta- 43 tion. Here, we report an R/Bioconductor-based pipeline that offers an efficient, integrated set of analysis tools for such experiments. The aligned read data are first transformed into a form that reflects local densi- ties of immunoprecipitated DNA fragments, and regions with high read densities, typically referred to as peaks, are identified by a peak-calling algorithm (Reviewed in [67, 124]). Peak-calling returns a list of enriched genomic regions in which the protein of interest is expected to be directly or indirectly associated with DNA. Analysis then identifies potential DNA binding sites within these regions, and sum- marizes these sets of short sequences as motifs, typically as position weight ma- trices (PWMs) or families of PWMs [59, 122]. Once de novo motifs have been identified, it is desirable to compare, annotate and assess these in order to retain motifs that are likely to be biologically relevant, while removing artifactual and background motifs. To develop this pipeline, we first re-implement PICS in C language to improve its speed, and then maded it into a Bioconductor package also called PICS. Then we extend PICS into a pipeline by adding two additional Bioconductor packages for follow up analysis. They are rGADEM for motif discovery and MotIV for motif annotation and assessment. To address huge data size and intensive computing, we implemented PICS method in C language. Thanks to the segmentation step, we can split the whole genome into multiple regions and process them in parallel. For this goal we al- low users to use multiple CPUs with two different protocols based on R packages ‘snowfall’ and ‘multicore’. The bioconductor package rGADEM is developed by porting the software GA- DEM [75] into R. Our pipeline uses GADEM because it is a good compromise between fully probabilistic and enumerative approaches, can process large sets of ChIP-Seq regions, handles both dimer and monomer motifs, automatically identi- fies multiple motifs, and automatically adjusts motif widths. To address very large sets of enriched regions, the original C code was modified to take advantage of multithreading, without requiring user configuration, via Grand Central Dispatch on OS X, and openMP (openmp.org), which supports shared-memory parallel pro- gramming on all architectures, including Unix and Windows. Compared to proba- bilistic approaches, this provides a simple, fast and efficient de novo framework. 44 The bioconductor package MotIV (Motif Identification and Validation) is de- veloped based on STAMP [84]. Like STAMP, MotIV provides queries to the JAS- PAR database [13], and users can flexibly input other sets of reference PWMs (e.g. TRANSFAC [129], UniProbe [96], DBTBS [119], or RegulonDB [34]). As outlined below, MotIV provides visualization and postprocessing options that are unavailable in STAMP, TOMTOM [38] and MACO [123]. It provides summary statistics on motif occurrences, reports joint motif occurrences and plots distance and pairwise-distance distributions. It can also refine motifs and motif occurrences based on a set of filters provided by the user. Below, we describe the pipeline in more detail. Its core consists of three Bio- conductor packages: PICS calls enriched regions; rGADEM identifies de novo mo- tifs; and MotIV visualizes and annotates motifs, and identifies motif combinations that have nonrandom spatial relationships. This is the first complete Bioconductor pipeline for analyzing transcription factor ChIP-Seq data. The pipeline is compu- tationally efficient, supporting processing datasets that consist of tens of thousands of peaks. We illustrate the pipeline by analyzing published Illumina datasets for genome-wide binding in human of FOXA1, ER, STAT1, and CTCF. We compare the performance of our approach to previously described methods for motif and module discovery, and show that the pipeline supports detecting biologically rele- vant motif modules that are not easily discovered by other methods. 3.2 Results We applied the pipeline to the four ChIP-Seq datasets mentioned above and de- scribed in the Methods section. We first used PICS to select the top 15000 enriched regions for the CTCF, STAT1 and the FOXA1 data. For the STAT1 and FOXA1 data, this corresponded roughly to a 5-10% FDR. For the ER data, PICS detected 8000 enriched regions at a similar FDR level (Figures B.1-B.3). For CTCF, be- cause we had no control data, we used the top 15,000 regions for consistency with STAT1 and FOXA1. In each case, we used PICS to export the top-ranked 400-bp wide enriched regions around predicted binding sites (peak centers). In R, this cre- ates a RangedData object, containing the chromosome, start and end positions of each sequence, which can be input directly into rGADEM. We post-processed the 45 resulting rGADEM object using MotIV. 3.2.1 Identification of Primary Motifs rGADEM respectively identified 68, 23, 25 and 78 motifs in the CTCF, STAT1, FOXA1 and ER datasets. To interpret the detected motifs, we used MotIV to com- pare the identified PWMs to JASPAR PWMs [13]. For each input motif, MotIV returns a user-defined number of best-matching PWMs from the user-specified ref- erence database. The output consists of the name and sequence logo of the highly- ranked database hits, along with the pairwise alignments (in consensus sequence format) and the alignment E-values (see Figures B.4-B.7, B.20). When displaying PWM matches, the user can choose to set filters that re- tain only certain motifs, e.g. all matches with an E-value less than 10−4, or all matches containing the name ‘STAT’. Here, we retained only the ‘expected’ motif for each data set (Figure 3.1) by filtering on the names STAT1, CTCF, FOXA1 and ESR1 in the JASPAR database and applying an E-value cutoff of 10−4 . Fig- ures B.4-B.7 show that rGADEM can sometime identify variants of the same motif (e.g. FOXA1). A user may chose to combine the motif occurrences of these vari- ants and treat them as occurrences of the same motif. This can easily be done via MotIV’s combine method, which regroups multiple motifs based on a set of filters. Using this approach, we combined all variants of primary motifs, as follows: FOXA1={m5,m10,m25}, ER={m4,m22,m33,m48}, STAT1={m1} and CTCF={m1}. Note that such combining is ‘virtual’, in that the PWMs of the se- lected motifs are not actually combined nor modified, but are simply assigned the same label. We find the combining process particularly useful for plotting distribu- tions and exporting motif occurrences, and the interactive R environment readily supports iteratively exploring such operations. As a secondary check, we used the distance distribution plots provided by MotIV. Given the specificity of ChIP-Seq and the accuracy of PICS, a de novo motif that reflects a DNA-binding interac- tion should be located close to a PICS site prediction. Using both the output of rGADEM and the RangedData object returned by PICS (i.e. the input of rGADEM) MotIV can plot the frequency distributions of the distance between motif occur- rences and peak centers. Note that such distance distribution plots do not depend 46 on database matches, and so can be used with novel motifs and motif variants. Fig- ures B.8-B.11 shows that the selected motifs are concentrated around peak centers, as expected. Our combined primary motifs resulted in a total of 10059, 7105, 8711 and 3947 binding site occurrences for CTCF, STAT1, FOXA1 and ER respectively. Figure 3.2 shows the distribution for the combined primary motifs. Overall, the spatial error between PICS binding site predictions and actual motif occurrences is relatively small. Figure 3.1: Primary motifs identified by rGADEM and visualized with MotIV. The motif matches and associated similarity E-values are based on the JASPAR database included in MotIV. 3.2.2 Identification of Secondary Motifs Once expected motifs have been identified, we now look for other motifs that may be biologically relevant. Because we may not know which secondary motifs to expect, further computational assessment may be required to discriminate artifac- tual motifs. A simple but elegant approach involves using distributions of distances 47 Figure 3.2: Distance distribution between the rGADEM motif occurrences and the PICS predictions for the STAT1, CTCF, FOXA1 and CTCT motifs identified from datasets. between rGADEMmotif occurrences and PICS predicted binding sites. If the iden- tified motif corresponds to a protein that has a short-range interaction with the im- munoprecipitated protein, we would expect the motif site to be close to the PICS site prediction. A quick look at the distribution plots, sequence logos and E-values reveals three interesting motifs for the STAT1 data: STAT1, AP-1 and CTCF (Fig- ure 3.3 and Figures B.7-B.11). A similar approach suggested ER, FOXA1 and AP-1 motifs for the ER data, FOXA1 and AP-1 for the FOXA1 data, and CTCF and Myf for the CTCF data. As noted above, we identified 68 and 75 motifs for CTCF and ER respectively. MotIV let us quickly filter and visualize these (Fig- ures B.4-B.11), and suggested that many of these were either variants of the same motif or artifactual motifs due to sequence repeats. MotIV also provides a way to characterize how frequently two motifs occur 48 Pairwise motifs distance distance ST AT 1 STAT1 AP1 890 5.9% 5255 35% 1899 12.7% STAT1 CTCF 900 6% 5245 35% 1952 13% d( STAT1 ï AP1 ) ï200 0 200 AP 1 AP1 CTCF 317 2.1% 2472 16.5% 2535 16.9% d( STAT1 ï CTCF ) ï200 0 200 d( AP1 ï CTCF ) ï200 0 200 CT CF Figure 3.3: Pairwise distance distributions between the STAT1, AP-1 and CTCF motifs identified by rGADEM from the STAT1 data. on the same input sequence, as well as distance distributions between occurrences of any two motifs. Figure 3.3 and Figures B.12-B.14 show that there were fewer secondary motifs than primary, and that relatively large fractions of a secondary motif’s sites can co-occur with a primary motif. The distance distributions show that most distances between a primary motif and its secondary ones are relatively short (∼50-100bps), suggesting that the DNA-associated proteins may interact. 3.2.3 Functional Annotation of Motifs and Modules To complement our analysis, we can combine our results with Gene Ontology (GO) annotations [2], using R’s ChIPpeakAnno, to provide general insights into the functions of proteins targeted by ChIP-Seq experiments. For primary motifs, we identified several over-represented terms for associated genes, as determined by the nearest transcriptional start site (TSS) (Tables B.1-B.4). In general, the cate- gories for the primary motifs listed in the tables were consistent with the known 49 biological role of ER/FOXA1/AP1 (see Appendix B). Applying the same analysis looking at genes that were close to motif pairs formed by the primary motif and one secondary motif (Tables B.1-B.4) returned terms that, in some cases, were not returned when working with primary motifs only, which suggested that motif pairs may be functionally more discriminatory. 3.2.4 Biological Significance of Modules Given that PICS, rGADEM and MotIV support efficiently identifying candidate factor-cofactor relationships in ChIP-Seq data, we assessed whether the literature suggested that the relationships identified were biologically meaningful. FOXA1, which is regulated in response to estrogen treatment, has been shown to be cru- cial for ER to bind to chromatin and activate target gene transcription [15, 26]. This supports the FOXA1 motif detected by rGADEM in ER-enriched regions, and supports an interaction between the two proteins. Fos and Jun family proteins usually function as dimeric transcription factor that bind to AP-1 regulatory elements [TGA(C/G)TCA] [16, 117]. The AP-1 complex has been shown to be over-expressed in ER positive cells (e.g. MCF7) and can interact directly with the ER transcription factor [17, 91]. This supports the AP-1 motif identified by rGADEM in the ER enriched regions, and the AP-1 motif that we identified in FOXA1-enriched regions, which may reflect interactions, possibly indirect, between the AP-1 and FOXA1 proteins via ER. Given that we identified the FOXA1 motif in the ER-enriched regions, we ex- pected to identify the ER motif in the FOXA1-enriched regions. We noted that a previous attempt to discover the ER motif in this dataset had been unsuccessful [133]. A seeded analysis with rGADEM (see Methods), using the ESR1 motif from JASPAR, identified an ER motif (Figures B.15) with only 723 sites. These results suggest that ER requires FOXA1, but that the converse is not true, which is consis- tent with the above literature. Additionally, only 7% of the ER sites identified in the ER-enriched regions overlapped with a FOXA1-enriched region. For this cal- culation we used MotIV to export the ER sites as a RangedData object and used the countOverlaps function of the IRanges package to count the number of such sites that overlapped a FOXA1 enriched regions. 50 We examined the predicted interaction between STAT1 and AP-1 (Table 3.1 and Figure 3.3). Cytokine stimulation induces members of the STAT transcrip- tion factor family, Stat1 and Stat3, to ‘dock’ onto receptor phosphotyrosines, en- abling their own tyrosine phosphorylation [11, 83, 121]. Subsequently, STAT proteins translocate to the nucleus and bind to conserved genomic regulatory se- quences to rapidly activate gene transcription [21, 48]. The cytokines also activate components of other intracellular signaling pathways, including Ras, mitogen- activated protein kinase (MAPK), and the Fos-Jun (AP-1) transcription factors [81, 113, 115], and activate direct interaction between STAT1 and AP-1 [130]. This supports our AP-1 motif detected by rGADEM in the STAT1 enriched regions. Table 3.1: Motifs identified by all compared methods CTCF ER FOXA1 STAT1 rGADEM CTCF (0) ER (0) FOXA1 (2e-12) STAT1 (3e-13) Myf (4e-8) FOXA1 (5e-12) AP1 (6e-10) CTCF (0) ETS-like (1e-8) ETS-like (9e-7) AP1 (3e-7) AP1 (6e-10) cisFinder CTCF (0) ER (0) FOXA1 (4e-13) STAT1 (2e-10) ETS-like (9e-8) AP1 (9e-8) AP1 (8e-3) Flexmodule CTCF (0) ER (0) FOXA1 (3e-11) STAT1 (4e-11) FOXA1 (1e-13) AP1 (4e-8) SRF (1e-8) AP1 (3e-8) Weeder CTCF (2e-11) ER (1e-14) FOXA1 (1e-12) STAT1 (2e-11) AP1 (1e-10) ETS-like (2e-8) MEME CTCF (0) ER (0) FOXA1 (2e-15) STAT1 (5e-9) AP1 (3e-4) ETS-like (1e-5) AP1 (4e-4) Motifs identified by all compared methods in the selected PICS enriched regions. The number given between parenthesis is the E-value match to the corresponding JASPAR motif. We analyzed the predicted interaction between CTCF and Myf (Table 3.1). Wilson and al. [128] suggested that CTCF binding is required for MyoD-induced IGF-2 gene activity in muscle. Moreover, Myf and CTCF can co-localize in the same cellular fraction during cellular process [1]. In this case the literature is not 51 as supportive but does suggest a potential co-operation between CTCF and Myf. Finally, we found no strong evidence in the literature for an interaction between CTCF and STAT1. Given this, we again used MotIV to export the CTCF sites as a RangedData object and used the countOverlaps function of the IRanges package determine that 28% of such sites overlapped a CTCF enriched region, even though the two experiments used different cellular systems. A similar analysis showed that 31% of the FOXA1 sites identified in the ER-enriched regions overlapped with a FOXA1-enriched region. Note that such intersections of genomic intervals can easily be carried out using the RangedData class and methods provided by the IRanges package, illustrating how Bioconductor and R can be used to extend our pipeline. 3.2.5 Comparison with Other Methods In order to assess the performance of our pipeline, we compared other motif/CRM identification tools on the above four datasets, using PICS for peak calling and MotIV for validation. CisFinder and Cluster-Buster took less than a minute on 15000 sequences, while Weeder and FlexModule took several days. Using 8-core multithreading, rGADEM completed these runs in a few hours. MEME’s computational require- ments allowed us to process only the top 5000 sequences for all datasets, even when using the parallel version running on 24 CPUs. HMS and ChIPMunk return a single motif from a run, and so are less directly applicable for work involving combinations of motifs. As well, while they are scalable, they are slower than rGADEM; for motif discovery on 15000 400-bp sequences, HMS (100 iterations) and ChIPMunk took approximately 24h on a 16 x 2.4 Ghz server. The number of motifs identified varied greatly between the de novo motif anal- ysis tool (Table 3.1). As expected, each method returned the primary or expected motif from each dataset, and the methods compared agreed relatively well for these motifs (Figures B.16-B.19). The de novo tools differed in the secondary motifs and modules identified (Table 3.1). Weeder and CisFinder systematically returned the lowest number of motifs, while rGADEM and MEME tended to identify larger num- 52 bers of secondary motifs. rGADEM identified the most secondary motifs that could all be supported from the literature. Cluster-Buster identified, in average, 1587 clusters containing 12 motifs for ER, 4558 clusters containing 15 motifs for CTCF, 1484 clusters containing 12 mo- tifs for STAT1 and finally 1501 clusters containing 16 motifs for FOXA1. Such large numbers of motifs and clusters are difficult to interpret, and complicates comparison with other methods. While Cluster-Buster identified the same motif combinations as our pipeline in some of its clusters, these were mixed with tens of other motifs in thousands of clusters, validation of which would clearly be diffi- cult. Additionally, cisFinder and Cluster-Buster used the input PWMs to scan for motif occurrences, and so assume that these motifs are sufficiently representative. In contrast, MotIV uses PWMs only for ‘labeling’ motifs. 3.3 Discussion We have developed a pipeline for analyzing ChIP-Seq data for transcription factors, the core of which consists of three complementary R packages: PICS, rGADEM and MotIV. Using four published human datasets, we showed that the pipeline compares favorably to other de novo motif tools and CRM clustering tools. For example, it identified co-occurring pairs of motifs that were consistent with the literature and were not detected by other methods. Other integrated pipelines for ChIP-Seq data are available, for example, MICSA [10], CEAS [118], and Sole-Search [9]. Issues that should be considered in assess- ing such systems are reviewed by [100]. MICSA [10] was largely designed to improve ChIP-Seq data analysis by prioritizing enriched sequences that contained a motif logo for the expected motif. In MICSA, the authors use MEME on the top few hundred sequences to detect de novo motifs, and then scan the remaining sequences with the identified logos. While this can improve the speed of motif discovery, its biased subsampling of input sequences may compromise detecting secondary motifs. CEAS [118] and SoleSearch [9] are largely annotation systems that offer less functionality and are less flexible than our pipeline. We briefly tried to compare CEAS to our pipeline, but, as with Cluster-Buster, found this difficult because of the lack of control over the output. 53 The R pipeline described here offers functionality that is not available in CEAS, cisGenome, MICSA and Sole-Search, e.g. distance distribution plots, pairwise dis- tance plots and motif filtering. Filtering is efficient in removing artifactual and background motifs based on combinations of E-values and distance distributions. For this reason, for our pipeline it is unnecessary to mask sequence repeats, which is recommended for CEAS and MICSA. As such masking could remove infor- mative motifs, an unmasked approach may be preferable. Other approaches (e.g. cisGenome [49, 51]) use relative enrichment computed using control regions to discriminate relevant motifs from irrelevant ones. rGADEM reports a fold enrich- ment for each motif, and the pipeline complements this metric with information on distance distributions and pairwise separation distributions. Although the methodology behind PICS, and an earlier command line ver- sion of GADEM have been published and demonstrated elsewhere, MotIV was developed for the pipeline, and the PICS and rGADEM R packages are new and implement improved versions of the respective algorithms. Finally, we emphasize that our pipeline can leverage other Bioconductor pack- ages so that a user can develop, repeat and share advanced analyses. We have described some of these packages, but there are many more libraries that could be used with our pipeline. For example, Figure 3.4 makes use of the rtracklayer package [71] to interact with the UCSC genome browser. Other packages that can be used include: SeqLogo for visualization of PWM, GenomeGraphs [24] for further graphics functionality, BiomaRt for retrieving annotations, IRanges and GenomicRanges for interval manipulations, Biostrings for sequence manipulations, etc. Many other relevant packages are listed on the Bioconductor website. We anticipate that the characteristics of the R environment, including its extensibility, will help to make the pipeline useful for a wide range of ChIP-Seq datasets. 3.4 Materials and Methods The analysis pipeline consists of three main steps (see Figure 3.5): peak calling, motif discovery, and motif postprocessing and validation. These steps are handled by three R packages:PICS, rGADEM and MotIV, which have been designed to 54 Scale chr10: FOXA1 RefSeq Genes Human mRNAs Spliced ESTs Rhesus Mouse Dog Elephant Opossum Platypus Chicken Lizard X_tropicalis Stickleback SNPs (130) RepeatMasker 200 bases 72769700 72769800 72769900 72770000 72770100 72770200 FOXA1 FOXA1 enriched regions by MACS FOXA1 enriched regions by PICS FOXA1 treatment tags PICS scores UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics RefSeq Genes Human mRNAs from GenBank Human ESTs That Have Been Spliced Vertebrate Multiz Alignment & Conservation (44 Species) Placental Mammal Basewise Conservation by PhyloP Multiz Alignments of 44 Vertebrates Simple Nucleotide Polymorphisms (dbSNP build 130) Repeating Elements by RepeatMasker SLC29A3 SLC29A3 SLC29A3 SLC29A3 riched regions by MACS nriched regions by PICS PICS scores 0.000910647 _ 6.59183e-06 _ Mammal Cons 3 _ -0.5 _ Figure 3.4: PICS peak calling. The example shows a FOXA1-enriched re- gion in which PICS discriminates two closely adjacent binding events, each of which contains a rGADEM de novo FOXA-like motif (black squares); these are separated by less than 300bps. In contrast, MACS outputs a single enriched regions. For clarity, the aligned reads (blue/red bars) and the combined forward/reverse PICS density profiles are also shown. work together and interact with other Bioconductor packages. 3.4.1 Peak Calling: PICS The first step consists of identifying, from the aligned ChIP-Seq reads, regions that represent protein-DNA association. For this step, we rely on our method, PICS [131]. PICS is based on a Bayesian hierarchical truncated t-mixture model, and integrates four important components. It jointly models local concentrations of directional reads. It uses mixture models to distinguish closely-spaced adjacent 55 binding events. It incorporates prior information for the length distribution of im- munoprecipitated DNA to help resolve closely adjacent binding events (see Fig. 3.4), and identifies enriched regions that have atypical fragment lengths. Finally, it uses pre-calculated whole-genome read “mappability” profiles to adjust local read densities that are missing due to genome repetitiveness (see Fig. 3.6 and “Avail- ability”, below). When a negative control sample is available (e.g. input DNA), PICS returns an enrichment score that is relative to the control sample for each binding event. Given a control sample, PICS can also estimate a false discovery rate (FDR) as a function of the enrichment score, which can be used to select a threshold score for segmenting (calling) enriched regions. Because PICS is based on a formal statistical model that requires an EM algorithm for estimating the un- known parameters, we have designed the R package PICS to be computationally efficient enough to process large sets of ChIP-Seq reads. The core of the algorithm is coded in C, and a user can easily take advantage of parallel processing via R’s snowfall [63] and multicore packages. Figure 3.6 illustrates the read mappability correction in a genomic region from the FOXA1 data. With the correction, the estimated PICS binding site was within the PICS 95% approximate confidence interval for the FOXA1 binding site loca- tion identified by rGADEM; when no correction was done, the de novo motif was outside of this interval. Figure 3.4 also shows that PICS can discriminate closely adjacent binding events. Two binding sites are separated into two disjoint enriched regions by PICS, whereas MACS [133] combined these two sites into a single re- gion. Such features make PICS particularly attractive for subsequent motif-based analyses. 3.4.2 De novo Motif Discovery: rGADEM From the list of enriched regions returned by PICS, the next step involves dis- covering over-represented DNA motifs. Probability model-based de novo motif finding algorithms like MEME can be sensitive [69, 77], but may be too slow when thousands to tens of thousands of enriched regions need to be analyzed. We have developed an open-source R package rGADEM, based on the GADEM software [75]. GADEM is an efficient and scalable de novo motif discovery tool 56 that combines spaced dyads and an expectation-maximization (EM) algorithm. A genetic algorithm (GA) guides the formation of a “population” of spaced dyads. Each spaced dyad is converted into a letter probability matrix, which is optimized by an EM algorithm. The optimized PWM is then used to scan for binding sites in the data. A subsequence of the length of the PWM is declared a binding site when the p-value of its PWM score is less than or equal to a preset threshold value. The logarithm of the E-value [3, 42, 95] is used as the fitness score for the spaced dyad from which the motif is derived. The resulting unique motifs with fitness values less than or equal to a pre-specified cutoff are reported, and corresponding binding sites in the original sequences are masked. This procedure is repeated until no further motifs can be found that satisfy the run parameters. rGADEM is an R package containing an extended version of the original GA- DEM C code. For ChIP-Seq data, a key improvement is that, on multicore com- puters, it can take advantage of multithreading via Grand Central Dispatch on Mac OS X 10.6 and above, and openMP on other Unix platforms, to sharply reduce run times. A second important extension, shared by both R and the current command line versions, is an optional ‘seeded’ analysis run mode. In this mode, rGADEM does not generate the starting PWMs through spaced dyads, but instead initializes the optimization with a user-specified PWM. This PWM guides motif discovery, but is used only for initialization and not during the EM-based PWM updating. A seeded analysis has two important advantages. It is approximately ten times faster than an standard run. Further, the prior knowledge helps address both signal-to-noise issues [112] and problematic (e.g. short) motifs. In our experience, seeded runs are also useful for ChIP-chip data, where the signal is less clear and expected motifs can be more difficult to recover. rGADEM can also prioritize sequences with large ChIP enrichments and in- cludes novel prior distributions that prioritize for motif occurrences that are nearer to sequence (peak) centers. Such prior settings can potentially improve the detec- tion of primary motifs at the cost of missing secondary motifs that can be present at low enrichment and/or further away for the center. For these reasons, we prefer to use the default uniform prior and use our post processing tools to detect biolog- ically relevant motif combinations. 57 All novel computational aspects of rGADEM are described in Appendix B. Be- cause the C code has been wrapped in R, the overall interface is accessible and the package contains functions to ease manipulation and visualization of the input and output. 3.4.3 Post-processing and Motif Interpretation: MotIV To identify a subset of potentially biologically relevant de novo motifs, we have developed a simple, efficient post-processing tool, MotIV. Based on STAMP [84], it compares and annotates motifs, and supports identifying candidate motif mod- ules. MotIV accepts as input an R object returned by rGADEM, a PWM output file from the command-line version of GADEM, or a PWM in TRANSFAC format [129]. MotIV can be used to compare a list of input motifs against a reference mo- tif database. It contains the JASPAR 2010 database, with pre-computed stimulated profiles that are used to determine the likelihood or E-value of a motif similarity score (see [84] for details). User-supplied PWM databases and can easily be used, and scores computed. Because MotIV uses the STAMP source code, it provides a range of options for alignment calculations (see the documentation for the R pack- age and/or STAMP). MotIV also provides several new visualization functionalities for sequence lo- gos, motif occurrence distributions and pairwise distance distributions, which are available in grid layouts (Figures B.4-B.15). The first type of plot displays the alignment as logos with motif similarity E-values for the top 5 matches (this num- ber can be changed). Because sequence repeats in artifactual enriched regions (e.g. regions that have high fractional overlaps with simple tandem repeats [53]) can lead to the detection of motifs with good E-value matches, MotIV provides several options for identifying and filtering such artifactual motifs. For example, MotIV allows one to plot the distribution of the motif occurrences within our en- riched regions. A biologically relevant motif should have a distribution that is peaked around the center of the region; conversely, the spatial distribution for a less relevant motif will typically be flatter. Finally, in order to identify co-occurring combinations of motifs, MotIV can display motif pairwise distance distributions. In such a plot, one can quickly quan- 58 tify both co-occurring motif pairs and assess the distribution of the inter-motif dis- tances. To our knowledge, no other method provides such functionality. Once interesting motifs have been identified, motifs and motif occurrences can easily be filtered and exported for further analysis. Note that for motif occurrence and pair- wise distributions, the use of a database is not required, and novel motifs can be discovered based on their spatial distributions alone. 3.4.4 Software Availability and Architecture In the three packages, the source code is written in C for speed, and wrapped in R code for accessibility. All packages use object-oriented programming with classes and methods, which supports usability as well as integration with other R/Biocon- ductor packages [35], making it straightforward for a user to construct advanced analyses. For example, PICS and MotIV support exporting enriched regions and MotIV occurrences as RangedData objects which can directly be used by other packages such as ChIPpeakAnno [137], BSgenome and rtracklayer [71]. PICS, rGADEM and MotIV are available from the Bioconductor web site at (http://bioconductor.org). They run on Linux, OS X ad MS-Windows. The packages are distributed under the terms of the Artistic License 2.0. Each con- tains a detailed manual and vignette with examples. Frequently asked questions, additional tutorials, and further installation instructions can be found at (http:// wiki.rglab.org). In addition, we offer pre-generated mappability profiles for com- mon genomes and read lengths, as well as a “proMap” pipeline that can be in- stalled locally for generating such profiles (http://wiki.rglab.org/index.php?title= Public:Mappability Profile). The profiles are based on aligning read-length seg- ments of a reference genome back to that reference genome, using the same aligner (BWA, [74]) and parameters that we use for ChIP-seq data. 3.4.5 Data Sets To demonstrate the power and resolution of analyses supported by our pipeline we used four recently published ChIP-Seq data for human transcription factors: CTCF (CCCTC-binding factor) in CD4+ T cells [133], STAT1 in interferon stimulated (IFN-gamma) HeLa S3 cells [108], and FOXA1 [133] and Estogen Receptor in 59 the MCF-7 breast cancer cell line [46]. The CTCF data contains 2.95M reads, the STAT1 data contains 26.7M treatment reads and 23.4M input control reads, the FOXA1 data consists of 3.9M treatment reads and 5.2M input control reads, and finally the ER data contains 3.6M treatment reads and 5.2M input control reads. 3.4.6 Comparison to Other Methods Because we have already shown that PICS compares favourably to other peak find- ers [131], we considered only steps 2 and 3 for comparing to other de novo motif tools. Because STAMP is widely used for motif postprocessing and MotIV ex- tends STAMP, we used MotIV for step 3. Essentially, then, we were largely com- paring rGADEM with other de novo discovery tools, for which we used MEME, cisFinder [116], FlexModule [51] and Weeder [102], which are widely used and perform well. For module discovery we compared our pipeline to Cluster-Buster [33]. Each application was used with its default parameters, according to the in- structions given in the manuals. All computations were performed on a Mac Pro with dual 3.2 Ghz Quad-Core CPU processors and 16 GB RAM. 60 Genome alignment Enriched regions Sequencing platform Visualisation (seqLogo, MotIV) Motif discovery rGADEM Motif Identication MotIV Gene set analysis (ChIPpeakAnno) Differential analysis (IRanges) Peak calling: PICS Relationship to gene structure (ChIPpeakAnno) Data export to BED, WIG, etc (rtracklayer) Data export (rtracklayer) Data import from other peak calling software (BED, FASTA, etc) R environment Figure 3.5: The ChIP-Seq processing pipeline. Short sequence reads are first mapped onto a reference genome, and the mapping results are loaded into R. The pipeline core consists of the three dark blue rectangles. En- riched regions are identified by PICS and passed to rGADEM for de novo motif discovery, and motifs and motif occurrences are passed to MotIV for postprocessing. 61 de ns ity PICS mappability correction 0. 00 0 0. 00 2 0. 00 4 0. 00 6 0. 00 8 165600 165700 165800 165900 166000 166100 > >>>> >> >>>>Aligned Reads < < <<<< Mappability Correction +[ ] No Correction +[ ] Binding site Figure 3.6: PICS read mappability correction in a FOXA1 binding region with missing reads due to genome repetitiveness. An ambiguous re- gion (i.e. a region into which short reads cannot be uniquely mapped) is shown as a grey rectangle. Forward and reverse aligned reads are respectively shown as black and red arrowheads. Forward and reverse PICS read density profiles are respectively shown in black and red, with solid/dashed lines representing t distributions with/without the mappa- bility correction. The rGADEM-estimated FOXA1 binding site is shown by a vertical black line. When PICS corrects for read mappability, the de novo motif is within the confidence interval of the site location that it predicts, but it is outside of the interval when the correction is not used. The spatial error, i.e. the distance between binding site location and the PICS prediction, is 15bps with the correction and 47bps without the correction. 62 Chapter 4 Probabilistic Inference for Nucleosome Positioning with MNase or Sonicated Short-read Data 4.1 Background The structural unit for chromatin packaging is the nucleosome, which is com- posed of approximately 147 bps of DNA wrapped around a core histone octamer. Nucleosome-associated DNA is less accessible to regulatory proteins like tran- scription factors; nucleosome positioning, as well as histone modifications and histone variants (e.g. H2A.Z, H3.3), can influence cellular processes that depend on chromatin accessibility [20, 27, 37, 40, 52, 105]. Because nucleosome positions depend on cellular processes as well as intrinsic factors (e.g. DNA sequence), un- derstanding how these positions influence cell state requires determining nucleo- some positions within individual genomic regions [135]. Currently, nucleosome-based data are typically generated by high-throughput short-read sequencing following either MNase digestion (MNase-Seq), MNase di- gestion with chromatin immunoprecipitation (ChIP-Seq) or sonication with ChIP- 63 seq. MNase digests linker DNA with relatively high specificity [19], and this speci- ficity is reflected in the spatial distribution of aligned reads. However, classes of functional genomic regions are being identified by integrated analysis of diverse sets of short-read sequence data [7, 28, 92], many of which are nucleosome-based data types that are generated with ChIP-Seq protocols that use sonication. Two methods are available for inferring nucleosome positions from short-read data: NPS [134] and TemplateFilter (TPF) [127]. Results from these methods have been reported only for data generated with protocols that use MNase-Seq, or MNase with ChIP-Seq (e.g. [39]), and the effectiveness of these methods with the more widely available data generated using sonicated DNA and ChIP-seq is not known. Recently we described PICS, a probabilistic peak-caller for transcription factor ChIP-Seq data [131]. PICS models bi-directional read densities, uses mixture mod- els to resolve adjacent binding events, and imputes reads that are not mapped due to repetitive genome sequences. We anticipated that its model-based framework should be extensible to address both MNase-digested and sonicated nucleosome- based short-read data. We were interested in assessing how effectively the model could be adapted to the two data types, how robust the new algorithm would be to lower read densities, and the types of biological inferences that it would sup- port from sonicated data. To address these issues, we developed PING, a method for probabilistic inference of nucleosome positioning from nucleosome-based se- quence data. Like PICS, PING models bi-directional read densities, uses mix- ture models, and imputes missing reads; however, it uses a new prior specification for the spatial positioning of nucleosomes, has different model selection criteria, model parameters, and different post-processing for estimated parameters. In addi- tion, PING also includes novel statistical methods to identify nucleosomes whose read densities are lower than those of neighboring nucleosomes. In the work described here, we apply the new algorithm to three published short-read nucleosome-based data sets. We focus on regions around transcriptional start sites and in vivo transcription factor binding sites, which have well-defined nucleosome distributions [40, 44] . We demonstrate that PING is effective with both MNase-Seq data in yeast and sonicated H3K4me1 ChIP-Seq data in mouse, and that it compares favorably to NPS and TPF in robustness to lower read densi- 64 ties. Then, using published data from the mouse PUER cell line [41], we consider global changes in nucleosome positioning relative to in vivo binding sites for for SPI1 (also known as PU.1) and CEBPB, and show that PING predictions from son- icated H3K4me1 ChIP-Seq data are consistent with published results from MNase- Seq data. Next, we apply PING to sonicated ChIP-Seq H3K4me1 data from mouse pancreas islet tissue [44]. We distinguish in vivo Foxa2 and Pdx1 binding sites that are between flanking H3K4me1-marked nucleosomes from sites that are within nucleosomal DNA. We show that genes associated with flanked TF-bound loci are more abundantly expressed than those associated with nucleosomal loci, consistent with flanked sites being active enhancer elements. Finally, we compare spatial dis- tributions on nucleosomal DNA of sites for Pdx1 and for the pioneer transcription factor Foxa2. 4.2 Results and Discussion In this section, we first describe PING’s probabilistic model for inferring nucle- osome positions from short-read sequencing data. Then, we compare the perfor- mance of PING, NPS [134] and TPF [127], using three published datasets that have different experimental protocols and genome sizes: MNase-Seq data from budding yeast [57], sonicated ChIP-Seq data from a mouse PUER cell line [41], and son- icated ChIP-Seq data from mouse pancreatic islets and liver tissue [44]. Finally, focusing on the data from islet tissue, we demonstrate and assess several types of inferences from sonicated ChIP-Seq data. 4.2.1 PING Model As in our previous work with transcription factor data [131], we first pre-process the read data by segmenting the genome into candidate regions, each of which has a minimum number of reads that aligned to forward and reverse strands. As in PICS, in each candidate region, conditional on the number of nucleosomes (K) in the region, we model the aligned read positions as fi ∼∑ k wkt4 ( µk−δk/2,σ2f k ) , r j ∼∑ k wkt4 ( µk +δk/2,σ2rk ) , (4.1) 65 where fi and r j the i-th forward and j-th reverse read positions in the region, with i = 1, . . . ,n f and j = 1, . . . ,nr and k = 1, . . . ,K refers to the k-th nucleosome in the candidate region. The function t4 is the probability density function of a Student’s t-distribution with four degrees of freedom. For the k-th nucleosome, µk represents the position of its center, while δk is the distance between the maxima of the for- ward and reverse read position densities, which corresponds to the average DNA fragment length in this bidirectional read cluster. Note that this length can differ from 146 bp, as we discuss below for prior distributions. Because a DNA fragment should contribute a forward read or a reverse read with equal probability, we use a common mixture weight wk for both forward and reverse distributions. The pa- rameters σ f k and σrk measure the corresponding variability in DNA fragment end positions. To accommodate possible biases related to sequencing and read mappa- bility [111, 131], that result in asymmetric forward and reverse peaks, we do not assume or require that the forward and reverse variances of reads associated with a nucleosome are equal. Since it models aligned reads as PICS does (4.1), PING inherits PICS’ advan- tages, including robustness to outlier reads and imputation of missing reads (see Materials and Methods). PING’s main novelty is in the modelling of nucleosome positions and their downstream inference, as explained below. In PING, the nucleosome positions (the µk’s) are assumed, a priori, to be drawn from a one dimensional Gaussian Markov random field (GMRF) distribution [8]. GMRF distributions are well suited to modelling the linear arrays that are typical of nucleosomes. The prior distribution of µk’s is defined conditionally on neighboring nucleosomes as µk+1|µk ∝ exp (−λ (µk+1−µk−200)2) (4.2) where λ > 0 is a fixed parameter. This prior states that consecutive nucleosome centers should be separated by approximately 200 bp. A larger λ value will con- strain distances to be closer to 200 bp, while a smaller value will allow a wider range of values. After characterizing the effect of λ on the prior (Supplementary Figure C.1), we chose a relatively weak prior by setting λ = 6×10−4, which cor- responds to a distance between adjacent nucleosomes of between 25 and 375 bp. 66 The lower bound allows nucleosome positioning to vary between sub-populations of cells [19], while the upper bound accommodates short nucleosome-free regions. Note that segmentation into candidate regions excludes genomic regions with low read densities that are longer than 375 bp from candidate regions. The remaining parameters σ f k, σrk and δk summarize our prior knowledge about the DNA fragment size distribution. For computational convenience we use conjugate priors defined by σ−2f k ,σ −2 rk ∼ G a(α,β ) (4.3) (δk|σ2f k,σ2rk) ∼ N ( ξ ,ρ−1/(σ−2f k +σ −2 rk ) ) (4.4) where α , β , ρ , and ξ are fixed hyper-parameters. For data generated by an MNase protocol, we set α = 20, β = 20000, ρ = 3, and ξ = 150, which result in δ values between 100 and 200 bp (see Supplementary Figure C.2). For data generated by a sonication protocol, where we expect DNA fragment lengths to be more variable, we used α = 10 and ρ = 1.2, which result in δ values between 50 and 250 bp. (See Supplementary Figure C.2). These param- eters can be adjusted by a user, given, for example, fragment length information from library construction. 4.2.2 Methods Comparison Because the cost of sequencing experiments can constrain work involving large genomes and experimental designs, we evaluated the performance of PING, NPS [134] and TPF [127] over a range of sequencing depths, using the three data sets noted above. Two considerations led us to generate test datasets by subsampling rather than simulation. First, because both the position and number of reads for a nucleosome may depend on neighbouring nucleosomes, it was not clear how to simulate such data. Second, the MNase-Seq yeast data were deeply sequenced, given the compact genome. For these data, many nucleosomes had strong, well- defined read signals, and many appeared to be both well positioned and accurately predicted by all three methods (Supplementary file “Supp example.pdf”). This suggested that at least this dataset would give good ‘reference’ nucleosome posi- 67 tions for the subsampling comparisons. For Kaplan’s MNase-Seq yeast data we used the most deeply sequenced sam- ple, NOCL4. For the two mammalian datasets, we considered a subset of regions that were associated with transcription factor binding sites, and so should have relatively well-positioned nucleosomes [39, 44, 106]. For the mouse PUER cell line data [41] we selected the 62.6 thousand MNase-Seq reads that were within 1 kb of centers of top-ranked 5000 CEBPB enriched regions detected by PICS for 1 hour after treatment with tamoxifen. For mouse islet data we selected the 32.8 thousand H3K4me1 sonicated ChIP-Seq reads that were within 1 kb of centers of the top-ranked 5000 Pdx1 peak summits [44]. As reference nucleosomes we se- lected 10000 nucleosomes in the yeast MNase-Seq data and 2000 nucleosomes in the mouse sonicated ChIP-Seq data (Supplementary Figures C.3, C.4, C.5). For each of the three data sets, we generated 14 random subsets of reads that contained from 30% to 95% (with step size 5%) of the original number of reads. For each data subset we calculated area-under-the-curve (AUC) statistics for the three methods (see Materials and Methods). A larger AUC value for a subset of reads indicates that reference nucleosome positions were detected more accurately and frequently. Figure 4.1 shows AUC profiles as a function of the number of reads in random subsets for three methods. AUCs for PING were consistently larger than for the other two methods, suggesting that PING can predict nucleosome positions more accurately and may require less deeply sequenced data than these methods. TPF showed comparable performance only in Kaplan’s MNase-Seq data, for which its templates should be appropriate (Supplementary file “Supp example.pdf”). NPS only predicts nucleosomes that have relatively high read counts. While our com- parison method is favourable to NPS, this method’s performance was lower with the larger sets of reference nucleosomes, for which returned maximum sensitivities less than one. Tests with alternative settings showed that results were robust to the number of nucleosomes in reference sets. For example, we tried reference sets with 5000 or 20000 nucleosomes in Kaplan’s data, and reference sets with 1000 or 5000 nu- cleosomes in Heinz’s and Hoffman’s data. In these assessments, PICS generally returned larger AUCs than the other methods (data not shown). 68 30 40 50 60 70 80 90 100 0. 00 0. 05 0. 10 0. 15 0. 20 Percent of data A U C − tr un ca te d PING TpF NPS 30 40 50 60 70 80 90 100 0. 10 0. 12 0. 14 0. 16 0. 18 0. 20 Percent of data A U C − tr un ca te d PING TpF NPS 30 40 50 60 70 80 90 100 0. 00 0. 05 0. 10 0. 15 0. 20 Percent of data A U C − tr un ca te d PING TpF NPS (a) Kaplan (b) Heinz (c) Hoffman Figure 4.1: AUC statistics as a function of number of reads in random sub- sets for PING, TPF and NPS. Datasets are (a) MNase-Seq data from budding yeast [57], (b) sonicated H3K4me1 ChIP-Seq data from the mouse PUER cell line [41], and (c) sonicated H3K4me1 ChIP-Seq data from mouse adult islet tissue [44]. Prior to sequencing, given the biology that an experimental design will address, it is desirable to be able to estimate how deeply a sample should be sequenced; given sequencing data, it is desirable to be able to estimate whether sufficient se- quence data has been generated. The AUC approach shown here may be appro- priate way to address the second issue; for an experiment in which the slope of the curve is low as it approaches 100% of the reads available, additional reads are unlikely to improve the results. 4.2.3 Inferring Nucleosome Positioning with Sonicated Chip-Seq Data While nucleosome-positioning predictions can be generated using an MNase-based protocol, much histone modification data is available from protocols in which the DNA has been fragmented by sonication [7]. In this section, using data for a mouse cell line [41], we assess nucleosome-level results generated by PING from soni- cated data, and compare these with published occupancy profiles from MNase-Seq data. We considered four states: before vs. 1 hour after tamoxifen stimulation, and regions around SPI1 vs. CEBPB binding sites. Using MNase-Seq data from PUER cells, in which SPI1 becomes localized to 69 the nucleus and can bind DNA only after tamoxifen treatment, Heinz et al. [41] showed that, globally, positions of nucleosomes flanking SPI1 binding sites are more distant from SPI1 sites after SPI1 binding. To determine whether we could use PING to generate similar results using sonicated ChIP-Seq data, we predicted nucleosome positions genome-wide from sonicated ChIP-Seq H3K4me1 samples for both 0 hour and 1 hour after tamoxifen stimulation, and used PICS to predict binding sites for SPI1 and for CEBPB. Figure 4.2 shows model-based nucleosome profiles in ±500-bp regions around the top-ranked 5000 binding sites for both fac- tors. The heatmaps show individual regions as pairs of red/blue horizontal lines (denoting 0 and 1 hr respectively), with darker colors indicating higher scoring, i.e. better positioned, nucleosomes, while the profiles show the average nucleo- some occupancy across all TF binding regions. For both transcription factors and time points, the heatmaps show that the distance between −1 and +1 nucleosomes varies between regions, suggesting caution in interpreting average profiles alone. Despite this, from the heatmaps and profiles it is evident that the closest three nu- cleosomes have shifted away from SPI1 binding sites by ≈ 50-bp at the 1 hr time point, consistent with MNase-Seq data ([41], Figure 4D). We note that while the published MNase-Seq profile more clearly indicates a global ≈ 50-bp shift, the relatively low MNase-seq read density did not support model-based nucleosome predictions on individual regions. Also, because SPI1 is not localized to the nu- cleus at 0 hr [13], its profiles at this time point are poorly defined; in contrast, upon SPI1 binding, at 1 hr, the nucleosome profiles are better defined, suggesting that binding stabilizes flanking nucleosome positions. Both heatmaps and profiles sug- gest that nucleosome positioning is well defined for CEBPB at both time points. While a small positional shift was evident for this factor, CEBPB is localized in the nucleus and so is expected to be associated with DNA at both time points, and this result is of uncertain biological significance. Together, these data from a mouse cell line indicate that PING can be effective in inferring nucleosome positions with sonicated ChIP-Seq data, but that the de- gree of positioning, and so the inferences possible, can depend on the transcription factor and on the biological state. 70 Figure 4.2: The model-based nucleosome positioning within ±500 bp from the top 5000 detected in vivo (a) SPI1 binding sites and (b) CEBPB binding sites from Heinz’s sonicated H3K4me1 ChIP-Seq data for 0 hour (blue) and 1 hour (red) after stimulation. The heatmaps show nu- cleosome prediction profiles for each region as pairs of red/blue hori- zontal lines, and the curves on the bottom are average occupancy pro- files across all regions. 4.2.4 Identifying Transcription Factor-Nucleosome Interactions in Mouse Islet Data Transcription factor binding sites typically occur within nucleosome-free regions flanked within≈ 250−450 bp by H3K4me1-marked nucleosomes (‘bimodal’ sites) ([39, 44, 106]). Hoffman et al. 2010 used XSET enrichment profiles [108] to 71 show that both Pdx1 and Foxa2 can also bind motifs within regions enriched for H3K4me1 (‘monomodal’ sites). Such a pattern of association is characteris- tic of ‘pioneer’ transcription factors (TFs) like Foxa2 [18]. Comparing the func- tional properties of in vivo Pdx1 and Foxa2 binding sites that were in bimodal vs. monomodal regions indicated that only bimodal Pdx1- and Foxa2-bound loci were functional in regulating gene expression. To determine whether PING-based nucleosome predictions could be used to distinguish transcription factor binding sites flanked by paired H3K4me1-marked nucleosomes from sites within nucleosomal DNA, we applied PING to the soni- cated H3K4me1 ChIP-Seq data from mouse adult islets and liver (Materials and Methods). We used the resulting predicted nucleosomes in islets to classify in vivo binding sites of Pdx1 and Foxa2 in islets, using nucleosomes that we predicted in H3K4me1 data from mouse liver as a negative control. From the spatial relation- ship between a TF binding site (taken as the summit of an enrichment profile peak) and nearby predicted nucleosomes in islets, we classified binding sites into three subgroups: those flanked by paired H3K4me1-marked nucleosomes (‘bimodal’), those within H3K4me1-marked nucleosomal-DNA (‘monomodal’), and those with no H3K4me1-marked nucleosomes within 1Kb (‘NoNuc’). Figure 4.3 (a) shows the number of regions in each group, and shows that, consistent with the data of Hoffman et al, the majority of transcription factor binding sites (62-81%) are bi- modal; however, between 5 and 9% are within nucleosomal DNA. Figure 4.3 (b) shows the average read density profiles for sites in the three subgroups, and the average model-based nucleosome positioning profiles for sites flanked by paired nucleosomes versus those bound to nucleosomal DNA. In the center of the regions, at the Pdx1 and Foxa2 peak summit locations, profiles show a deep valley for sites identified as flanked by paired H3K4me1-marked nucleo- somes, a sharp peak in the read density profile for sites identified as bound within nucleosomal DNA, and a flat unenriched profile for the “NoNuc” group. As a negative control, we show the same profiles generated from mouse adult liver (Fig- ure 4.3 (c)), in which Pdx1 is not expressed, and in which Foxa2 binds ≈ 25% of the sites identified in islets [44]. In liver, the sites identified as flanked by paired nucleosomes show a slightly lower nucleosome density at Pdx1 peak summit lo- cations, suggesting that some of these loci are bound by other factors in liver. In 72 contrast, the sites identified as bound within nucleosomal-DNA for Pdx1 have no distinct profile. For Foxa2 some reduction in nucleosome density is noted at the peak summit location for sites identified as bound within nucleosomal-DNA in islets, probably because some of these sites are bound in liver. This is consistent with previous results indicating that Foxa2 loci that are bound in both islets and liver, and are monomodal in one tissue, are often bimodal in the other [44]. Note compared to Heinz’s data, these data were generated from tissue rather than a cell line; hence, we may expect more biological heterogeneity and variability, and so potentially more variability in nucleosome positions. To assess our classification results using independent data, following Hoffman (2010), we compared the gene expression levels between subgroups of binding regions, using published islet RNA-seq data [62] (Figure 4.4). We assessed ex- pression levels differences between groups using a Kruskal-Wallis test and a null hypothesis that there is no difference among gene expression levels of three groups vs. the alternative hypothesis that at least two groups are different. P-values were less than 10−16 for all combinations of two transcription factors and two tissues. We then conducted a post-hoc multiple pairwise comparison [94] for each combi- nation of group pairs. Genes associated with loci that lacked H3K4me1-marked nucleosomes were always significantly less expressed than regions in other groups (p < 10−4). In contrast, genes associated with loci within nucleosomal DNA were significantly less expressed than genes associated with loci flanked by paired nu- cleosomes (p = 1.3×10−5 for Pdx1 and p = 1.2×10−2 for Foxa2). This is con- sistent with sites flanked by paired nucleosomes being more functionally active. As expected, we saw no difference between these site types using liver H3K4me1- based nucleosome calls using the same islet RNA-seq data (p=0.99 for Pdx1 and p=0.21 for Foxa2). These results show that using nucleosome positions predicted by PING to define the ‘modality’ of transcription binding sites generates more ef- fective bi/monomodal classification results than those originally generated from XSET profiles. Unlike Foxa2, Pdx1 is not known to be a pioneer factor, i.e. a factor that can bind motifs within nucleosomal DNA [44]. Given this, we compared spatial dis- tributions of Foxa2- and Pdx1-bound sites at predicted nucleosome locations to assess where these transcription factors were predicted to bind within the core nu- 73 cleosome. By profiling the density of the de novo Foxa2 and Pdx1 binding sites that were closest to their respective peak summit locations, we found that Foxa2-bound sites were enriched near the nucleosome centers and showed a periodicity of ≈ 20 bp (≈ 2 helix turns) (Figure 4.5). This profile is consistent with Foxa2 binding at locations where the major groove faces away from the histone octamer, as expected for its helix-turn-helix domain [18, 25, 72, 98]. In contrast, Pdx1-bound sites were enriched at two locations that were ≈ 25 bp from the nucleosome center. While Pdx1’s homeo domain also associates with the major groove [80], the site profile for this factor was only partially consistent with the locations in which the major groove is accessible. To confirm that these patterns of enrichment were a result of constraints placed on Pdx1 and Foxa2 binding, we found no comparable spatial enrichment of Pdx1 and Foxa2 motifs around monomodal sites identified using nu- cleosome positions predicted from liver H3K4me1 data (Methods). These results indicate that both Pdx1 and Foxa2 can bind within nucleosomal DNA, but have dif- ferent binding preferences for the locations of their motifs within the nucleosome. As well, the results confirm that, for appropriate biological states, nucleosome po- sitioning can be defined with high spatial resolution from sonicated data. 4.3 Methods 4.3.1 Data Sets The ‘Kaplan’ MNase data are from S. cerevisceae (GEO data set GSM351492 [57]). They were generated using MNase-Seq, i.e. digesting linker DNA with MNaseI, size selecting mononucleosome DNA fragments, and single-end sequenc- ing the ends of these fragments. There are six biological replicates. Four have no formaldehyde cross-linking and have between 3.3 and 5 million aligned reads; two were cross-linked and have between 2.4 and 3.5 million aligned reads. These sam- ples were deeply sequenced, given the compact 12.1 Mb genome. Many nucleo- somes had strong, well-defined aligned read signals, and many appeared to be both well positioned and accurately predicted by all three methods (Supplementary file “Supp example.pdf”). The ‘Heinz’ data are sonicated ChIP-Seq data generated from the mouse PUER 74 cell line [41]. We used two H3K4me1 sonicated samples from GEO data set GSE21512, which corresponded two biological states: 0 hour, i.e. before stimula- tion (GSM538012) and 1 hour after tamoxifen stimulation (GSM538013). Single- ended 25-bp reads were generated after sonicating chromatin to 200−300 bp, then immunoprecipitating with an antibody against H3K4me1 (Abcam ab8895). These two samples contained 8.1 and 7.2 million aligned reads respectively, and, given the≈ 2.5 Gb mouse genome, they were much less deeply sequenced than Kaplan’s data. The ‘Hoffman’ data are sonicated ChIP-Seq data generated from mouse adult pancreas islet and liver [44]. The data contains 11.8 and 14.9 million aligned reads in islet and liver respectively. For Heinz data, to obtain the in vivo binding sites of transcription factors of SPI1 and CEBPB, we used PICS [131] to analyze the ChIP-Seq data GSM538000 (SPI1, 0 hr), GSM538001 (SPI1, 1 hr), GSM538006 (CEBPB, 0 hr), and GSM538007 (CEBPB, 1 hr). The in vivo binding sites for the transcription factors of Pdx1 and Foxa2 were obtained from [44]. 4.3.2 Filtering Duplicated Reads A relatively high number of duplicate reads (i.e. single-end reads with identical 5’ alignment coordinates) may be the result of biases in library construction and PCR amplification. Since the spatial distributions of locations of fragment starts should be more concentrated near ends of wrapped DNA for MNase than for son- icated data, we expect that we could see more repeated reads that are not process artifacts in MNase data. To control potential artifacts, while accommodating differ- ences between MNase vs. sonication protocols, we removed reads beyond an upper bound that is set as a quantile for the number of duplicates found while processing a data set. In practice, we chose the 99.5% quantile for MNase data, and the 95% quantile for sonicated data. These threshold quantiles can be set by a user. And, since PING models read densities, highly ranked nucleosome predictions should be rather insensitive to the value set for the upper bound on duplicates. 75 4.3.3 Segmentation of Candidate Regions To process large data sets, particularly when multiple CPU cores are available, it is preferable to split the aligned read data into smaller disjoint subsets and process each subset separately. We first segment the genomic aligned read data into ‘can- didate’ regions, each of which has a minimum number of reads that were mapped to forward and reverse strands. The segmentation step is similar to that used for PICS [131]. To suit nucleosome-based data, we adjusted the parameters and added an additional recursive splitting step to avoid candidate regions being too long. We detect such regions using a w-bp sliding window with an s-bp step size, counting the number of forward and reverse strand reads in the left and right half- windows respectively (reads within 25 bps from the center of the window are not counted), and we retain windows that contain at least n forward reads and n reverse reads. Here we used, w = 300 bp, s = 2 bp and n = 5 and merged overlapping windows from left to right to obtain a disjoint set of candidate regions. Depending on the density of nucleosomes expected across the genome for a given experiment (e.g. MNase-Seq), segmentation could result in genomic regions that are long enough that applying a mixture model to infer nucleosomes requires extended computing times. To avoid this, we recursively divide candidate regions that are longer than 1200 bps at points of low read density, until no regions are longer than this. 4.3.4 Parameter Estimation and Model Selection Given the conjugate prior chosen, an Expectation-Maximization (EM) algorithm can be derived to find the maximum a posteriori (MAP) estimates for the unknown parameter vectorΘ= (θ1, . . . ,θK), where θk = (wk,µk,δk,σ2f k,σ 2 rk). Our algorithm is similar to that used in PICS [131], and it is described in detail in the Appendix C. The main difference comes in the M-step where we developed a novel procedure to incorporate the spatial prior for the µ’s. 4.3.5 Model Fitting After segmenting the whole genome into candidate regions using a sliding window, we fit our PING model in each candidate region. In practice, K, the number of 76 mixture components in each region, is unknown and needs to be estimated. In our previous work with ChIP-Seq data for transcription factors (TFs), we used the Bayesian Information Criteria (BIC) to estimate the number of components, by trying K = 1, . . . ,15 and selecting the value of K with the largest BIC. For nucleosome-based sequence data, candidate regions are often longer than is typical for TF data, and we expect to routinely encounter much larger values of K than for TF data. To reduce computing time, we try only (integer) values of K in the interval [Nnuc/3,Nnuc×1.5] where Nnuc is the expected number of nucleosomes in a region, given the region’s length, and is calculated as Nnuc = region length/200. Note that this range will vary from region to region and is dynamically set during a run. 4.3.6 Choosing the Number of Nucleosomes in Each Region After having fit a model for each value of K in the above range, we need to select a single best value in order to make inferences about the nucleosome positions. While BIC is well suited for selecting the number of components in mixture mod- els, it does not effectively use the information contained in our spatial prior (Eq. 4.2). Instead, we use a log likelihood penalized by our prior for µ . We select the value of K with the largest penalized log likelihood as follows, l(K|Θ̂)−λ K−1 ∑ k=1 (µk+1−µk−200)2, (4.5) where Θ is the final estimate for the parameters Θ, and l is the log-likelihood as defined in the Appendix C. Even though our model selection procedure gives satis- factory results for most regions, we noted a few cases in which the results were not optimal because of noise in read distributions. As with our PICS model, we have derived approaches to check for noisy estimates and wrongly estimated values of K, and to correct for these if needed. See Appendix C for details. Note our model selection criteria is somewhat adhoc. Following is the rational of why we use this criteria. If BIC is used for model selection, it may cause two problems in some regions, (1) Use one mixture component to describe multiple weak (less reads) nucleosomes to achieve parsimony and (2) use multiple mixture 77 components to describe single strong (more reads) nucleosome. As a result, peak merging is proposed to address the second problem, and we want a model selection criteria with less penalty on complex model to address the first problem. We know BIC is equivalent to log-posterior probability, so we decide to partial log-prior to reduce the penalty on complex model. To highlight the spacial structure of nucleosomes, we only keep the prior related to µ’s. 4.3.7 Scores of Predicted Nucleosomes and False Discovery Rates In order to identify and rank a statistically meaningful subset of nucleosomes, we define an enrichment score for each nucleosome. For a given nucleosome, we define FChIP (RChIP), the number of observed forward (reverse) ChIP read positions that fall within the 80% contours of the forward (reverse) read posi- tion densities, i.e. within µ f ± c · σ f (µr ± c · σr) where c = 1.5 (approximately the 90% quantile of the t4 distribution). We then define the enrichment score as O = (FChIP + RChIP)/(2cσ f + 2cσr), which is an estimate of the observed den- sity of DNA enriched fragments contributing to this nucleosome, after removing outliers. When a control sample is available, we also define Ocont = (Fcont + Rcont)/(2c ·σ f + 2c ·σr), by computing the number of observed forward/reverse reads in the control sample that fall within the 80% contour of the forward/re- verse read position densities estimated from the ChIP sample. Using this infor- mation, we define an enrichment score for the treatment relative to the control as S = (Ncontrol/NChIP)×O/(Ocont + 1), where the addition of the constant one pre- vents a division by zero, and Ncontrol (resp. NChIP ) denotes the total read count in control sample (resp. IP sample). The scaling of the enrichment score by Ncontrol/NChIP accounts for the control and ChIP samples having different num- bers of reads (sequence depth). Note that the score introduced here is slightly dif- ferent from the one used in PICS [131]. We made improvements by normalizing the scores by their peak widths (sigmas) which produces more stable nucleosome based scores. When control sample is available, false discovery rates can be calcu- lated from the scores of nucleosomes using the approach proposed in PICS [131]. 78 4.3.8 Calculating AUC Values We calculate AUC values in four steps. First, we predicted nucleosomes using PING, NPS and TPF; i.e. we generated three method-specific sets of reference pre- dictions. (The number of predicted nucleosomes in each random subset is given in Supplementary Table C.1.) Second, we generate the receiver operating characteris- tic (ROC) curves for each method using the predicted and reference nucleosomes. Third, we truncated each ROC curve at a specificity of 0.8, since sensitivity is of little value without a reasonable specificity. Finally, we calculated AUC statistics as the area under these truncated ROC curves. To generate an ROC curve, we needed to define a threshold distance, so that a reference nucleosome is called ‘detected’ if the distance between the centre of the reference nucleosome and a nucleosome predicted from a subset of reads is less than the threshold. The threshold distances were chosen as 10 bp in Kaplan’s MNase data, 30 bp in Heinz’s and Hoffman’s sonication data. These values re- sulted in the areas under most full ROC curves being larger than 0.5, where 0.5 is expected value for binary random guesses. When all methods detected reference nucleosomes less accurately than binary random guess, their performances were meaningless and we increased the minimum distance. To assess how robust our results were with respect to the threshold distance, we tried alternative settings and noted similar results. For example, we tried threshold distances of 5 bp in Kaplan’s MNase data, 20 bp and 40 bp in two sonication data as well as AUC statistics calculated from the full ROC curves instead of the truncated ones. In all of these assessments, PING generally performed better than the two other methods. 4.3.9 Model-based Nucleosome Positioning Profile Kaplan et al [58] defined ‘absolute nucleosome positioning’ and ‘nucleosome oc- cupancy’ proposed to estimate it using a method based on aligned read counts. Here we propose a model-based method for predicting nucleosome positions based on aligned read densities (via t distributions). For each estimated nucleosome, we shift its estimated forward/reverse peak to the center and weight them by the number of reads falling the 80% contour of the density. Finally we truncate the peaks at ±36 bp from the center to highlight the 79 nucleosome cores as supported by [134, 135]. Compared to profiles generated from read counts, our model-based profile has three advantages; it is generated from the estimated density, which: a) down- weights potential noise reads by using long-tailed distribution, b) borrows strength from forward-and-reverse peak-pairs by using a hierarchical model, and c) adjusts for the bias from reads missing due to mappability by introducing latent variables. 4.3.10 Classification of Transcription Factor Binding Regions We classify binding regions according to the distances between a TF binding site to the nearest called nucleosome. After removing weak nucleosome calls (see the following subsection for details), we classify regions as follows: 1. A binding region without any H3K4me1-marked nucleosome detected within 1 kb of its peak summit is called as “NoNuc”. 2. A binding region with at least one H3K4me1 nucleosome detected within 50 bp of its peak summit is called “monomodal”. 3. Other binding regions are called “bimodal”. 4.3.11 Removing Nucleosomes that have Low Read Densities Because nucleosomes with relatively low read densities are more likely to be falsely called as present, we want to detect and remove them from PING predictions. For this, we compare each predicted nucleosome to other nucleosomes in its neigh- bourhood as follows. For each predicted nucleosome, referred to as the ‘reference nucleosome’, we select other predicted nucleosomes within 500 bp. We ignore any nucleosomes that are separated from the ‘reference nucleosome’ by a nucleosome-free region longer than 300 bp (filter boundary of PING delta). We refer to these selected nucleo- somes as ‘neighborhood’ nucleosomes. We compare the reference nucleosome to each of its neighborhood nucleosomes, and consider the reference nucleosome as ‘falsely-called’ if its read count is significantly lower than that of any neighbor- hood nucleosome. In these comparisons, a read count ratio for two nucleosomes is significantly different if it is higher than a threshold. We calculate the threshold 80 adaptively using a negative binomial model that takes into account the widths (σ ) of forward/reverse read density distributions of the nucleosomes as follows. In a neighborhood, given the reads count (N0) of a reference nucleosome, the reads count (N1) of another nucleosome in its neighbourhood follows negative bi- nomial distribution as follows N1 ∼ NB(size = N0,prob = (σr1+σ f 1)/(σr0+σ f 0+σr1+σ f 1)) Where σr0 and σ f 0 describe the width of forward/reverse peak of reference nu- cleosome, and σr1 and σ f 1 describe the width of forward/reverse peak of the nu- cleosome to be compared with. An example threshold curve of N1/N0 is given in Supplementary Figure C.8. 4.3.12 Multiple Transcription Factor Binding Regions Associated with the Same Gene Multiple transcription factor binding regions can be associated with the same gene. TFBS sites that are flanked by H3K4me1-enriched regions are functional, while sites within H3K4me1-enriched regions, or in regions without H3K4me1, are non- functional in regulating gene expression [44]. Given this, when we identify a monomodal region for a gene, we ignore NoNuc regions for the same gene, and when we identify bimodal regions, we ignore both monomodal and NoNuc regions for that gene. 4.3.13 Distribution of Pdx1 and Foxa2 Motifs around Monomodal Sites Identified Using Liver Nucleosome Positions While we were interested in the results in islets, we also generated the same re- sults from liver data as a negative control, to show that the results obtained in islet data were unlikely to have occurred by chance. To generate the liver results, we needed to identify nucleosome predictions that overlapped with a TF peak summit in liver. For this, we use predicted nucleosomes from liver H3K4me1 data to clas- sify islet transcription factor binding sites and obtained liver “monomodal” sites and corresponding liver core nucleosomes. In each “monomodal” region, we determined the TF site closest to the peak 81 summit of the transcription factor binding region, and considered the center dis- tance of this motif to the center of the central nucleosome. We considered all “monomodal” regions, as well as a subset of them chosen from the regions whose central nucleosomes had corresponding PING score in the top 50000 among all predicted nucleosomes whole genome, again using the elbow point of score distri- butions of all whole-genome predicted nucleosomes (Supplementary Figure C.9). 4.4 Difference between PING and PICS In previous sections, we introduced features of PING as a method modified from PICS. In this section, I emphasize their differences and why we made such changes. 4.4.1 Segmentation Using PICS segmentation on nucleosome data, we obtained many candidate re- gions with length over tens of kb. In such regions there could be up to hundreds of nucleosomes. In each of such regions, we have to fit many mixture models with different number of mixture components, and then select the best model. This significantly increased the computing workload. So we used a recursive cutting algorithm to limit the length of candidate regions within 1200 bp. Such a change ensure only a few mixture models to be fitted in each candidate region. 4.4.2 Post-process Nucleosome data are more complex than transcription factor data. A candidate region of nucleosome data could contains several nucleosomes with very different enrichment levels. The mixture models might describe a highly enriched nucle- osome with multiple mixture components to achieve better likelihood, whereas describe multiple low enriched nucleosomes with a single mixture component for parsimonious. This can cause mis-match of forward/reverse clusters, which results in wrongly called nucleosomes from center of such forward/reverse clusters. This can also cause missing useful nucleosomes around a highly enriched nucleosome. In addition, we also encountered numerical problem in many candidate regions. To solve this problem, we developed a post-process approach for PING. Mis- match of forward/reverse clusters usually associated atypical estimations on δ (i.e. 82 estimated distance between peaks of forward and reverse cluster). Missing multiple nucleosomes usually associated with atypical estimations of σ f or σr, since a quite wide peak is needed to describe density of reads from multiple nucleosomes. So we screen the estimated parameters from original fitted models, and refit the model with much stronger prior on δ or σ . We do the same for the candidate regions where warning message of ‘numerical problem’ is given. Figure 4.6 shows a region with large sigma problem, i.e. single mixture com- ponents used to describe multiple nucleosomes. The vertical green lines show pre- dictions after postprocess, whereas black ‘+’ characters show predictions before postprocess. We expected to see each nucleosome predicted in the middle of a pair of forward/reverse (black/red) reads clusters. The predictions before post-process (black ‘+’) missed two nucleosomes on the left. Post-process of atypical estima- tions of sigma detected and solved the problem and successfully detected all three nucleosomes (green vertical lines). Figure 4.7 shows a region with mismatched forward/reverse clusters. The ver- tical green lines show predictions after post-process, whereas black ‘+’ characters show predictions before post-process. We expected to see a nucleosome predicted in the middle of each pair of forward/reverse (black/red) reads clusters. From the figure we found the second reverse reads cluster (from the left) is described using to two components in mixture models, since it contains more reads than other clus- ters and has a asymmetric shape density. However the matched forward cluster is only described by one mixture component, and this caused mismatch problem in the third reads clusters (from left). As a result of mismatch, the location of third nucleosome (from the left) is not correctly predicted by mixture model (black ‘+’). Post-process of atypical estimations of delta detected and solved the problem and successfully detected the third nucleosome (green vertical line). Note stronger prior on δ and σ could help to solve problems in some regions. However we do not suggestion using strong prior globally, because we know these values can vary a lot in different situations. Using global strong prior may results in bias in estimated parameters when their true value is not very close to the prior modal. 83 4.4.3 GMRF Prior To incorporate the know spacial relationship of nucleosomes, we assigned the GMRF prior on locations of nucleosomes. We suggested to use a weak prior, since nucleosomes can be moved around by external stimulations. Users of PING could choose a stronger prior depends on the biological data the generated. Using the NOCL4 data (Kaplan 2009), we compared data analysis results from different strengths of prior (including no prior case, i.e. the original PICS model). The lambda values are chosen as lambda = 0,−0.000096,−0.0008,−0.02, which corresponding to sd(dµ) = ∞,125,50,10. We call the four settings as ‘None’, ‘Weak’, ‘Median’, and ‘Strong’. Note ‘None’ is equivalent to the original PICS model, and ‘Weak’ is the setting we suggested. Table 4.1 shows the number of matched predictions between two different set- tings. Note the numbers of comparing a setting to itself show total number of predicted nucleosomes of that setting. Note ‘match’ is defined as distance within 5 bp. We found the strength of prior affects the results in terms of the number of predictions and the agreement of predictions from different prior strengths. The larger difference in prior strength the larger difference in their results. Table 4.1: The number matched predicted nucleosomes between two differ- ent strengths of GMRF priors. Before post-process None Weak Median Strong None 59524 55843 7398 8800 Weak 55843 59543 7487 8865 Median 7398 7487 27928 10441 Strong 8800 8865 10441 34685 After post-process None Weak Median Strong None 59775 56127 24066 9691 Weak 56127 59733 24387 9765 Median 24066 24387 50795 10589 Strong 9691 9765 10588 36794 Although there are only∼ 1% of unmatched predictions between settings ‘None’ 84 and ‘Weak’ (i.e. PICS and PING), these missing predictions could be important if biologists are interested in localized small regions instead of whole genome. Figure 4.8 show the data and results of different prior settings in an example candidate region where predictions of ‘None’ and ‘Weak’ do not agree. We ex- pected to see a nucleosome predicted in the middle of each pair of forward/reverse (black/red) reads clusters. The predictions based on PICS model (no GMRF prior) only detected the third, fifth, and sixth nucleosomes (from the left), whereas pre- dictions based on weak GMRF prior can detect three more nucleosomes (i.e. the second, fourth and seventh). The fourth nucleosome has clear signal which should not be missed by a good method. 4.5 Derivation of the EM Algorithm Introduce Latent Variables To use the EM algorithm to fit the model, we introduce the latent variables z’s and u’s, and rewrite our hierarchical t-mixture models as z f i,zr j ∼ Multinomial(w), (4.6) ( fi|U f ik = u f ik,z f ik = 1,µk,δk) ∼ N ( µk +δk/2,σ2f k/u f ik ) (4.7) (r j|Ur jk = ur jk,zr jk = 1,µk,δk) ∼ N ( µk−δk/2,σ2rk/ur jk ) (4.8) (U f ik|z f ik = 1),(Ur jk|zr jk = 1) ∼ G a(v/2,v/2). (4.9) The unknown membership indicator z f i (resp. zr j) indicates which nucleosome a forward read i (resp. reverse read j) is associated with; both z f and zr share the same multinomial distribution given by (4.6). Conditional on the memberships (the z’s), the read positions follow a t-distribution. We introduce another latent variable u, and rewrite the t-distribution as a Normal-Gamma compound distribution given by (4.7 - 4.9). 85 Conditional Expectation of the Penalized Log Complete Data Likelihood From this point, for ease of notation, we use the letter d to denote either f or r. As in Section 2.5 we denote the complete data by c. Then the complete data log-likelihood can be written as l(Θ|c) = n f ∑ i=1 K ∑ k=1 z f ik { log [ wkN ( fi ∣∣∣∣ µ f k, σ2f ku f ik )] + ν 2 ( logu f ik−u f ik )− logu f ik } + nr ∑ j=1 K ∑ k=1 zr jk { log [ wkN ( r j ∣∣∣∣ µrk, σ2rkur jk )] + ν 2 ( logur jk−ur jk )− logur jk} + (n f +nr) [ν 2 log (ν 2 ) − logΓ (ν 2 )] . Similarly, we define the prior penalty lprior as lprior = K ∑ k=1 logpi(δk,σ2rk,σ 2 f k,µk|ρ,ξ ,α,β ,λ ) = K ∑ k=1 { 1 2 log(σ−2rk +σ −2 f k )− ρ 2 (σ−2f k +σ −2 rk )(δ −ξ )2 } + K ∑ k=1 { (α−1) log(σ−2f k )−βσ−2f k +(α−1) log(σ−2rk )−βσ−2rk } −λ K−1 ∑ k=1 (µk+1−µk−200)2+ cst = −1 2 K ∑ k=1 { (σ−2f k +σ −2 rk )[ρ(δk−ξ )2+2β ]− (2α−1) log(σ−2f k +σ−2rk ) } −λ K−1 ∑ k=1 (µk+1−µk−200)2+ cst where cst is a constant. Given the current estimate Θ− for Θ, the conditional expectation of the penal- 86 ized log complete data likelihood is given as: Q(Θ|Θ−) def= E[l(Θ|c)|Θ−]+ lprior = ∑ d∈{ f ,r} nd ∑ i=1 K ∑ k=1 z̃dik { logwk− logσdk− ũdik2 ( di−µdk σdk )2} +lprior+ constant (4.10) E-Step The E-step [103] consists of computing the following quantities: z̃dik def= E(Zdki|di,Θ−) = w−k t4(di|µ−dk,σ−dk) ∑Kk=1 w − k t4(di|µ−dk,σ−dk) , (4.11) ũdik def= E(Udik|di,zdik = 1,Θ−) = (ν+1)/(ν+ ( di−µ−dk σ−dk )2 ). (4.12) M-Step During the M-step, the goal is to maximize Q(Θ|Θ−) with respect to Θ, which requires solving ∂Q(Θ|Θ−)/∂Θ = 0. We solve the following equation system to 87 obtain updated parameter estimates 0 = ∂Q∗ ∂wk = ∑ d∈{ f ,r} nd ∑ i=1 K ∑ k=1 z̃dikw−1k , K ∑ k=1 wk = 1 (4.13) 0 = ∂Q∗ ∂µk = ∑ j z̃r jkũr jk (r j−µrk) σ2rk +∑ i z̃ f ikũ f ik ( fi−µ f k) σ2f k +qk (4.14) 0 = ∂Q∗ ∂δk = ∑ j z̃r jkũr jk (r j−µrk) 2σ2rk −∑ i z̃ f ikũ f ik ( fi−µ f k) 2σ2f k −ρ(σ−2f k +σ−2rk )(δk−ξ ) (4.15) 0 = ∂Q∗ ∂σ−2f k = ∑ i z̃ f ik 2 [ σ2f k− ( fi−µ f k)2ũ f ik ] (4.16) −ρ(δk−ξ ) 2+2β 2 + 2α−1 2 σ2f k (4.17) 0 = ∂Q∗ ∂σ−2rk = ∑ j z̃r jk 2 [ σ2rk− (r j−µrk)2ũr jk ] −ρ(δk−ξ ) 2+2β 2 + 2α−1 2 σ2rk (4.18) where qk comes from the Gaussian Markov Random Field prior assigned to µk, and is given by qk =  −2λ (2µk− (µk−1+µk+1)), if k 6= 1,K, −2λ (µ1−µ2+200), if k = 1, −2λ (µK−µK−1−200), if k = K. Because there is no simple closed form solution forΘ, we adopted a conditional approach in which we first maximize over (w,µ,δ ), conditional on (σ f ,σ r), and then maximize over (σ f ,σ r), conditional on the previously updated (w,µ,δ ), re- sulting in an Expectation/Conditional Maximization (ECM) algorithm [88]. Con- ditionally on σ f k and σrk, the new estimations of parameters (ŵk, µ̂k, δ̂k) are up- dated by the unique solution of the linear system given by (4.13 - 4.15). Similarly, conditionally on these new estimates ŵk, µ̂k, δ̂k, we can then solve a linear system given by (4.17 and 4.18). 88 1000 500 0 500 1000 2e 04 5e 04 8e 04 P IN G  p ro fil e in  li ve r 1000 500 0 500 1000 2e 04 6e 04 1e 03 (c) 1000 500 0 500 1000 0. 00 00 0. 00 10 P IN G  p ro fil e in  is le t Bi Mono No 1000 500 0 500 1000 0. 00 00 0. 00 10 Distance to Foxa2 site (bp) Distance to Pdx1 site (bp) (b) Distance to Foxa2 site (bp) Distance to Pdx1 site (bp) (a) Foxa2 Pdx1 Bi 2004 No 920 Mono 283 Bi 2004 No 730 Mono 283 Figure 4.3: Modality and nucleosome occupancy profiles for in vivo bind- ing sites for the transcription factors Foxa2 (left) and Pdx1 (right) mouse adult islet tissue. a) The number of binding sites in bimodal, monomodal and NoNuc groups. b) Average model-based nucleosome positioning profiles for the three classes of binding sites. c) Average model-based nucleosome positioning profiles in mouse adult liver rela- tive to binding sites in islets. 89 Bim od al Mo no mo da l No  N uc s 0.01 1 3 10 32 102 104 106 Bim od al Mo no mo da l No  N uc s 0.01 1 3 10 32 102 104 106 Bim od al Mo no mo da l No  N uc s 0.01 1 3 10 32 102 104 106 Bim od al Mo no mo da l No  N uc s 0.01 1 3 10 32 102 104 106 Pdx1 Foxa2(a) (b) (c) (d) Ta g C ou nt  (I sl et  R N A -s eq ) Ta g C ou nt  (L iv er  T ag -s eq ) Ta g C ou nt  (I sl et  R N A -s eq ) Ta g C ou nt  (L iv er  T ag -s eq ) Pdx1 Foxa2 n= 43 90 n= 27 9 n= 73 0 n= 20 04 n= 28 3 n= 92 0 n= 31 64 n= 14 9 n= 15 80 n= 12 18 n= 14 6 n= 15 69 Figure 4.4: Boxplots of gene expression levels from RNA-seq data in each group. The horizontal dashed line shows the overall median. Islet re- sults are shown on the top row and liver results in the bottom row. 90 Motif center to core nucleosome center (bp) Motif center to core nucleosome center (bp) (a) (b) Pdx1Foxa2 N um be r o f m ot ifs N um be r o f m ot ifs -100 -50 0 50 100 0 5 10 15 20 -100 -50 0 50 100 0 5 10 15 20 Figure 4.5: The profile of center distance between the closest motifs to the centre of a predicted nucleosome position for detected “monomodal” Pdx1 and (Foxa2) binding sites in islets. Dashed curves show profiles in liver. Profiles are truncated at ±100 bp, and vertical dashed lines show ±75 bp from the estimated centres of the nucleosome-associated DNA. 91 de ns ity A candidate region with large sigma problem. 0. 00 0 0. 00 4 0. 00 8 0. 01 2 253800 254000 254200 254400 254600 >>>>>>>>>>>> >>>>>>>>>> > >>>>>>>> IP < <<<<<<<<<<< < <<<<<<< < <<<<<<<<<< < + +[ [] ] Cont. Figure 4.6: The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where multiple nucleosomes are described using single mixture components. The red/black arrowheads shows the loca- tion of forward/reverse reads. The character ‘+’ shows predicted posi- tion of nucleosomes, whose confidence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mix- ture models. The vertical green lines shows the predicted nucleosome positions after post-process. 92 de ns ity A candidate region with mismatched clusters. 0. 00 0 0. 00 2 0. 00 4 0. 00 6 127200 127400 127600 127800 128000 128200 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> IP <<<<<<<<<<<<< <<<<<<<<<< <<<<<< <<<<<<<<<<<< + + + +[ [ [ [] ] ] ] Cont. Figure 4.7: The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where multiple nucleosomes are described using single mixture components. The red/black arrowheads shows the loca- tion of forward/reverse reads. The character ‘+’ shows predicted posi- tion of nucleosomes, whose confidence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mix- ture models. The vertical green lines shows the predicted nucleosome positions after post-process. 93 de ns ity 0. 00 0 0. 00 2 0. 00 4 0. 00 6 271000 271200 271400 271600 271800 272000 272200 >>>> >>>>> >>>>>>>>>> >>>>> >>>>>> >>>> >>> IP <<<<< <<<<<< <<<<<<< <<<<<<< <<<<<< <<<<<< < < + + +[ [ [] ] ] Cont. de ns ity 0. 00 0 0. 00 2 0. 00 4 0. 00 6 271000 271200 271400 271600 271800 272000 272200 >>>> >>>>> >>>>>>>>>> >>>>> >>>>>> >>>> >>> IP <<<<< <<<<<< <<<<<<< <<<<<<< <<<<<< <<<<<< < < + + + + + +[ [ [ [ [ [] ] ] ] ] ] Cont. Weak prior None prior Figure 4.8: The data and results of an example candidate region of NOCL4 data (Kaplan 2009), where ‘None’ and ‘Weak’ do not match. The red/black arrowheads shows the location of forward/reverse reads. The character ‘+’ shows predicted position of nucleosomes, whose confi- dence intervals are shown using ‘[’ and ‘]’. The black/red curves shows the components density of fitted mixture models. 94 Chapter 5 Conclusions and Future Directions 5.1 Summary and Discussion The intent of this thesis is to develop statistical methods and software packages to address issues arising from high-throughput sequencing data. In Chapter 2, we have described PICS, a probabilistic framework for detecting transcription factor binding events from ChIP-seq experiments. The approach in- tegrates a number of important factors in interpreting aligned read data, including using prior information for input DNA fragment lengths and correcting for reads that are missing due to genome repetitiveness. Working with two published ChIP- seq data sets from human cell lines, we compared PICS to four alternative analysis methods. While additional methods are available (e.g. [29, 60, 111]), the four methods that we used have been shown to perform well, and so offer reasonable performance baselines from a range of algorithms. For both experimental data sets, the binding events predicted by PICS were the most consistent with computation- ally identified motif sites. Our simulation study confirmed PICS’ effectiveness and also showed that its model is robust to mis-specifications. In Chapter 3, we have re-implemented PICS using the C language and made it into a Bioconductor package. We extended PICS into a pipeline for analyzing ChIP-Seq data for transcription factors by adding two more Bioconductor pack- 95 ages. The core of the pipeline consists of three complementary R packages: PICS, rGADEM and MotIV. Using four published human datasets, we showed that the pipeline compares favourably to other de novo motif tools and CRM clustering tools. For example, it identified co-occurring pairs of motifs that were consistent with the literature and were not detected by other methods. In Chapter 4, we propose PING, a model-based method for predicting nucleo- some positioning that can flexibly be applied to either MNase-based and sonicated ChIP-Seq data. Using a sampling-based ROC/AUC analysis, and three data sets with different characteristics, i.e. MNase-Seq data from budding yeast, and son- icated ChIP-Seq from a mouse PUER cell line and from mouse islet tissue, we showed that our method’s predictive accuracy and scalability compare favourably to two alternative short-read methods NPS and TemplateFilter. While additional methods are available (e.g. [64]), the two methods that we used in comparison have been shown to perform well, and offer reasonable performance baselines from a range of all algorithms. PING can readily be applied to data from mammalian genomes, and is relatively robust to low read densities. Given a method that can be applied to both MNase-based or sonicated data, we addressed the question of the spatial resolution available from sonicated data. Using published sonicated H3K4me1 ChIP-Seq read data in a mouse cell line, PING-based results for nucleosome displacement away from transcription binding sites after tamoxifen stimulation were consistent with results reported for MNase- Seq data. However, occupancy profiles for SPI1 and CEBPB indicated that the effectiveness of nucleosome positioning predictions can depend on the biological state. For sonicated H3K4me1 ChIP-Seq data from mouse islet tissue, we used nu- cleosome predictions to refine a classification of in vivo Foxa2 and Pdx1 binding sites into three groups, and showed that the between-group expression differences were more statistically significant for the updated groups. Characterizing the bind- ing profile of the pioneer transcription factor Foxa2 on nucleosomal DNA in islet tissue, we showed that, for appropriate biological states, sonicated data can sup- port positioning predictions that have high spatial resolution. These results, and the flexibility and scalability of the model-based method that we describe, suggest that it may be useful in work with short-read data to generate mechanistic insight within sets of individual genomic regions; for example, regions in which specific 96 combinations of epigenetic marks are associated with particular functional proper- ties. Note, to our knowledge, PING is the first method applied to infer nucleosome positions from sonicated data. From sonication data, our method did generate in- teresting biological results with good resolution, e.g. Figure 4.2 and Figure 4.5. 5.2 Work In Progress - Paired-End Sequencing Data 5.2.1 Introduction As discussed above, Paired-End (PE) sequencing is an alternative experiment pro- tocol to Single-End (SE) sequencing. Compared to SE, PE offers better information of each DNA fragment by giving positions of both ends, but for the same sequenc- ing cost only half the number of DNA fragments can be PE sequenced. In this research our objective is to help researchers think about the utility of ‘obtaining more information on each DNA fragment’ versus ‘sequencing more DNA frag- ments’ with a limited budget. For this objective, we developed a new method for analysing PE data, and in future work will compare SE method and PE method using real data. In this section, we sketch an extension to PING to handle PE data; we call this new model PING-PE. Compared with PING, the major change is that, for a given DNA fragment PING-PE forces its forward and reverse reads belong to the same mixture component. This changes requires to modify many formula in EM updates and calculating BIC and standard error of estimations. In PE data, one of the two mate pair reads may be unaligned and so missing for some DNA fragments, PING- PE can handle such situation by integrating out missing values. For ambiguous regions, PING-PE use the same approach as PING. In addition, we also changed the segmentation step to better use the information in paired-end sequencing data. 5.2.2 Segmentation We have criticized the XSET profile because it extends each DNA fragment for a fixed length, which is an approximation to the true length distribution. Now, with PE data, we know how long each DNA fragment is, and we can simply pile up them 97 to generate the actual fragment overlap profile. We can remove regions whose maximum height is lower than a given threshold, and what is left are candidate regions. When a candidate region is too long (say more than 1kb), we can cut it into two regions from the position with lowest profile height. To avoid effects of random noise we could smooth the profile into bins (say 5bp on each side of a given genome position). 5.2.3 The Model without Ambiguous Regions Let us denote by fi and ri the forward and reverse read positions for the i− th DNA fragment in a given candidate region, with i = 1, . . . ,n. Note that the number of DNA fragments, n, will typically vary between candidate regions. Unlike the single-end sequencing data, the forward and reverse reads are paired in PE data, hence we should force them to belong to the same nucleosome in our model. So we modified the PICS model, and jointly model the forward and reverse read positions as: ( fi,ri)∼ K ∑ k=1 wkt4 ( µ f k,σ2f k ) t4 ( µrk,σ2rk ) def= K∑ k=1 wkgk( fi,ri|θ k) def= g( fi,ri|Θ) (5.1) where k is the index of a binding event. In the binding event k, µk represents the binding site position, δk is the distance between the maxima of the forward and reverse read position densities, which corresponds to the average DNA fragment length for the binding event k, and σ f k and σrk measure the corresponding variabil- ity in DNA fragment end positions. µ f k = µk− δk/2 and µrk = µk + δk/2 are the centers of forward/reverse maxima, while wk is the mixture weight of component k, which represents the relative proportion of reads coming from the k− th bind- ing event. We denote the unknown parameter vector by Θ = (θ 1, . . . ,θK), where θ k = (wk,µk,δk,σ2f k,σ 2 rk). For simplicity we denote by g the resulting p.d.f. of the mixture distributions and denote gk the resulting p.d.f. of k-th mixture components. To perform parameter estimation via an EM algorithm, we introduce the miss- 98 ing data (the z’s and the u’s) and rewrite the model (5.1) as follows, µ f k = µk−δk/2, µrk = µk +δk/2 (5.2) ( fi,ri)|(Udik = udik,zik = 1,µk,δk) ∼ N ( µ f k, σ2f k u f ik ) ·N ( µrk, σ2rk urik ) (5.3) U f ik,Urik|zik = 1 ∼ G a(v/2,v/2) (5.4) zi ∼ Multinomial(w) (5.5) Given the average fragment length δk and the binding site positions µk, the centers of the forward and reverse read position density distributions can be calculated by (5.2). Conditional on the binding event memberships (the z’s), the read positions follow a Normal-Gamma compound distribution given by (5.3-5.4), where ν = 4 denotes the degree of freedom of the t-distribution. In other words, conditionally on the z’s, the read positions follow a t distribution with ν degrees of freedom. The unknown membership indicator zi indicates which binding event the i-th for- ward/reverse read pair come from; both zi follows the multinomial distribution given by (5.5). As a consequence, the marginal distribution of the read positions is a mixture of t distributions whose weights are given by the parameters of the multinomial distribution in (5.5). We assign the prior probability to model parameters as follows, σ−2f k ,σ −2 rk ∼ G a(α,β ) (5.6) δk|(σ2f k,σ2rk) ∼ N(ξ ,ρ−1/(σ−2f k +σ−2rk )) (5.7) µk+1|µk ∝ exp (−λ (µk+1−µk−200)2) . (5.8) The average fragment length parameter δk and peak variances (σ f k,σrk) are as- signed a Normal-Gamma conjugate prior, given by (5.6-5.7). The nucleosome positions (the µk’s) are assumed, a priori, to be drawn from a one dimensional Gaussian Markov random field (GMRF) distribution [8]. GMRF distributions are well suited to modelling the linear arrays that are typical of nucleosomes. The prior distribution of µk’s is defined conditionally on neighboring nucleosomes as (5.8), where λ > 0 is a fixed parameter. The GMRF prior (5.8) states that consecutive nu- cleosome centers should be separated by approximately 200 bp. A larger λ value 99 will constrain distances to be closer to 200 bp, while a smaller value will allow a wider range of values. After characterizing the effect of λ on the prior, we chose a relatively weak prior by setting λ = 6× 10−4, which corresponds to a distance between adjacent nucleosomes of between 25 and 375 bp. The lower bound allows nucleosome positioning to vary between sub-populations of cells [19], while the upper bound accommodates short nucleosome-free regions. Note that segmenta- tion into candidate regions excludes genomic regions with low read densities that are longer than 375 bp from candidate regions. 5.2.4 Conditional Expectation of the Penalized Log Complete Data Likelihood From this point, for ease of notation, we use the letter d to denote either f or r. Let us denote all observed data (i.e. the read pairs) by y = {( fi,ri)}i=1,...,n, and denote the complete data by c = (y,z,u f ,ur). Then the complete data log-likelihood can be written as l(Θ|c) = n ∑ i=1 K ∑ k=1 ziklik(Θ|c)+2n [ν 2 log (ν 2 ) − logΓ (ν 2 )] , (5.9) where lik(Θ|c) = log [ wkN ( fi ∣∣∣∣ µk− δk2 , σ 2 f k u f ik ) N ( ri ∣∣∣∣ µk + δk2 , σ2rkur jk )] +∑ d [ν 2 (logudik−udik)− logudik ] 100 Similarly, we define the prior probability lprior as lprior = K ∑ k=1 logpi(δk,σ2rk,σ 2 f k,µk|ρ,ξ ,α,β ,λ ) = K ∑ k=1 { 1 2 log(σ−2rk +σ −2 f k )− ρ 2 (σ−2f k +σ −2 rk )(δ −ξ )2 } + K ∑ k=1 { (α−1) log(σ−2f k )−βσ−2f k +(α−1) log(σ−2rk )−βσ−2rk } −λ K−1 ∑ k=1 (µk+1−µk−200)2+ cst = −1 2 K ∑ k=1 { (σ−2f k +σ −2 rk )[ρ(δk−ξ )2+2β ]− (2α−1) ( logσ−2f k + logσ −2 rk )} −λ K−1 ∑ k=1 (µk+1−µk−200)2+ cst where cst is a constant w.r.t. the model parameters and latent variables. Conditional on the current estimate Θ− for Θ, the expectation (w.r.t. latent variables) of the penalized log complete data likelihood is by first integral out u’s conditional on z’s, and integral out z’s from l(Θ|c). The formula is given as follows, Q(Θ|Θ−) def= E[l(Θ|c)|Θ−]+ lprior = Ez [ Eu|z(l(Θ|c)|Θ−)|Θ− ] + lprior = n ∑ i=1 K ∑ k=1 z̃ik { logwk− ∑ d∈{ f ,r} [ logσdk + ũdik 2 ( di−µdk σdk )2]} + lprior. E-Step Similar to [103], the E-step consists of computing expectations of Zik and Uik con- ditional on Θ−. The results are shown as follows, z̃ik def= E(Zik| fi,ri,Θ−) = w−k gk( fi,ri|θ−k ) g( fi,ri|Θ−) (5.10) ũdik def= E(Udik|di,zik = 1,Θ−) = (ν+1)/(ν+ ( di−µ−dk σ−dk )2 ). (5.11) 101 M-Step During the M-step, the goal is to maximize Q(Θ|Θ−) with respect to Θ, which requires solving ∂Q(Θ|Θ−)/∂Θ = 0. We solve the following equation system to obtain updated parameter estimates 0 = ∂Q ∂wk = n ∑ i=1 K ∑ k=1 z̃ikw−1k , K ∑ k=1 wk = 1 (5.12) 0 = ∂Q ∂µk = ∑ d∈{ f ,r} ∑ i z̃ikũdik di−µdk σ2dk +qk (5.13) 0 = ∂Q ∂δk = ∑ i z̃ik { ũrik ri−µrk 2σ2rk − ũ f ik fi−µ f k2σ2f k } −ρ(σ−2f k +σ−2rk )(δk−ξ ) (5.14) 0 = ∂Q ∂σ−2dk = 1 2∑i z̃ik [ σ2dk− (di−µdk)2ũdik ] + 1 2 [(2α−1)σ2f k− (ρ(δk−ξ )2+2β )] (5.15) where qk comes from the Gaussian Markov Random Field prior assigned to µk, and is given by qk =  −2λ (2µk− (µk−1+µk+1)), if k 6= 1,K, −2λ (µ1−µ2+200), if k = 1, −2λ (µK−µK−1−200), if k = K. Because there is no simple closed form solution for Θ, we adopted a condi- tional approach in which we first maximize over (w,µ,δ ), conditional on (σ f ,σ r), and then maximize over (σ f ,σ r), conditional on the previously updated (w,µ,δ ), resulting in an Expectation/Conditional Maximization (ECM) algorithm [88]. Conditionally on current estimation σ−f k and σ − rk , the new estimation wk are given as follows, ŵk ← n ∑ i=1 z̃ik/n, 102 and the parameters (µ̂k, δ̂k) are updated by the unique solution of the linear system given as (5.13 and 5.14). Similarly, conditionally on these new estimates ŵk, µ̂k, δ̂k, we can then solve equations (5.15). The new estimate of σ−2dk is the root, which is given by: σ̂−2dk ← ∑ni=1 z̃ik +2α−1 ∑ni=1(di− µ̂dk)2z̃ikũdik +ρ(δ̂k−ξ )2+2β . 5.2.5 The Model with Ambiguous Regions In the presence of L disjoint ambiguous regions, we split the whole genome into disjoint partitions ∪Ll=0Il , where {Il | l = 1, . . . ,L} denote the disjoint ambiguous regions, and I0 denote rest part of genome which has no mappability problem. We introduce two extra latent variables, in Il , we denote Nl as unknown number of ambiguous DNA fragments, and denote xl = { fl,rl} as the unknown positions of forward and reverse read of an unobserved DNA fragment. Note all of the missing (ambiguous) DNA fragment within one ambiguous region follows the same distri- bution, so we xl no longer contains index for DNA fragment. Similarly the latent variables defined in Section 5.2.3 changed into zl = {zlk}l and udl = {udlk}dl in partition Il . Finally we define cl = {Nl,xl,zl,u f l,url} as set of all latent variables in Il . We then decompose the log complete data likelihood as l(Θ|c,c1, . . . ,cL) = L ∑ l=i ll(Θ|cl)+ l0(Θ|c), where l0(Θ|c) is the likelihood of partition without mappability profile as given in Equation 5.9, and ll(Θ|cl) is the complete-data log-likelihood in ambiguous region Il , given by: ll(Θ|cl) = Nl K ∑ k=1 zlk { logwk− ∑ d∈{ f ,r} [ logσdk + udlk 2 ( dl−µdk σdk )2]} + cst. We assume the number of missing reads is independent to the underlining dis- 103 tribution, (i.e. other latent variables z’s, u’s), hence in the expected log-likelihood, Nl can be replaced by its conditional expectation. Because the unknown counts Nl , l = 1, . . . ,L, follow a negative multinomial distribution, so their conditional expectations are given by: ñl def= E(Nl|y,Θ−) = nPl(Θ−)/P0(Θ−), (5.16) where Pl(Θ) and P0(Θ) are the probability measures of the partitions Il and I0 given as, Pl(Θ) def= Pr(Il) def= ∫ Il g(xl|Θ)dxl =∑ k [ wk ∫ bl al t4( fl|µ f k,σ f k)d fl ∫ bl al t4(rl|µrk,σrk)drl ] P0(Θ) def= Pr(I0) def= 1− L ∑ l=1 Pl(Θ) When DNA fragments in partition Il is not alignable, the observed positions ( fi,ri) become unknown random variables xl in Equation 5.10 and Equation 5.11, and conditional on xl , the expectations of z’s and u’s are defined as function of xl as follows z̃lk(xl) def= E(Zlk|xl,Θ−) = w−k gk ( xl|θ−k ) g ( xl|Θ− ) , (5.17) ũdlk(dl) def= E(Udlk|Zlk = 1,dl,Θ−) = (ν+1)/(ν+ ( dl−µ−dk σ−dk )2 ). (5.18) Given the current estimate Θ− for Θ, the conditional expectation (w.r.t. latent variables, i.e. integral out z’s, u’s, and xl) of the log complete data likelihood for the ambiguous regions is given as: Ql(Θ|Θ−) def= E[ll(Θ|cl)|Θ−] = ExlEz|xl Eu|z,xl (ll(Θ|cl)|Θ−) = ñl K ∑ k=1 Exl [ z̃lk(xl) { logwk− ∑ d∈{ f ,r} [ logσdk + ũdlk(xl) 2 ( dl−µdk σdk )2]} |Θ− ] + cst 104 In the M-step, we solve the equation system ∂Q(Θ|Θ−) ∂Θ + L ∑ l=1 Ql(Θ|Θ−) ∂Θ = 0. (5.19) For the second term of Equation 5.19 we calculate the following quantities, ∂Ql ∂wk = ñl E1lk wk ∂Ql ∂µk = ñl [ E3r−µrkE2r σ2rk + E3 f −µrkE2 f σ2f k ] ∂Ql ∂δk = ñl 2 [ E3r−µrkE2r σ2rk − E3 f −µrkE2 f σ2f k ] ∂Ql ∂σ−2dk = ñl 2 [ σ2dkE1lk−E4dlk ] , where the quantities E’s are expectations w.r.t xl defined as below, E1lk def= Exl [ z̃lk(xl)|Θ− ] (5.20) E2dlk def= Exl [ z̃lk(xl)ũdlk(xl)|Θ− ] (5.21) E3dlk def= Exl [ z̃lk(xl)ũdlk(xl)dl|Θ− ] (5.22) E4dlk def= Exl [ z̃lk(xl)ũdlk(xl)(dl− µ̂dk)2|Θ− ] (5.23) The probability density function of xl can be considered as g(xl|Θ) truncated on Il , so the expectation of a generic function q(xl) w.r.t. xl can be calculated as Exl [q(xl)|Θ−] = ∫ Il q(xl) g(xl|Θ−) Pl(Θ−) dxl. (5.24) So, in the E-step, we can calculate the expectations E’s by substituting equations (5.16), (5.17), and (5.18) into equations (5.20)-(5.23), and then applying the inte- gral formula (5.24). The results are given as follows, 105 E1lk = ∫ Il w−k gk ( xl|θ−k ) g ( xl|Θ− ) g(xl|Θ−) Pl(Θ−) dxl = w−k Pl(Θ−) ∫ Il gk(xl|θ−k )dxl = w−k Pl(Θ−) ∫ bl al t4( fl|µ−f k,σ−f k)d fl ∫ bl al t4(rl|µ−rk,σ−rk)drl E2dlk = ∫ Il w−k gk ( xl|θ−k ) g ( xl|Θ− ) · 5 4+(dl−µ−dk)2/(σ−dk)2 · g(xl|Θ −) Pl(Θ−) dxl = w−k Pl(Θ−) ∫ Il 5 ·gk(xl|θ−k ) 4+(dl−µ−dk)2/(σ−dk)2 dxl = w−k Pl(Θ−) ∫ bl al 5 · t4(x|µ−dk,σ−dk) 4+(x−µ−dk)2/(σ−dk)2 dx ∫ bl al t4(x|µ−dck,σ−dck)dx E3dlk = ∫ Il w−k gk ( xl|θ−k ) g ( xl|Θ− ) · 5 4+(dl−µ−dk)2/(σ−dk)2 ·dl · g(xl|Θ −) Pl(Θ−) dxl = w−k Pl(Θ−) ∫ Il 5dl ·gk(xl|θ−k ) 4+(dl−µ−dk)2/(σ−dk)2 dxl = w−k Pl(Θ−) ∫ bl al 5x · t4(x|µ−dk,σ−dk) 4+(x−µ−dk)2/(σ−dk)2 dx ∫ bl al t4(x|µ−dck,σ−dck)dx E4dlk = ∫ Il w−k gk ( xl|θ−k ) g ( xl|Θ− ) · 5 4+(dl−µ−dk)2/(σ−dk)2 · (dl− µ̂dk)2 · g(xl|Θ −) Pl(Θ−) dxl = w−k Pl(Θ−) ∫ Il 5(dl− µ̂dk)2 ·gk(xl|θ−k ) 4+(dl−µ−dk)2/(σ−dk)2 dxl = w−k Pl(Θ−) ∫ bl al 5(x− µ̂dk) · t4(x|µ−dk,σ−dk) 4+(x−µ−dk)2/(σ−dk)2 dx ∫ bl al t4(x|µ−dck,σ−dck)dx, where we define dc = f when d = r and dc = r when d = f . Similar to calculation in PICS, we get E2dlk = w−k P −1 l (Θ −)H3,dclkH0,dlk E3dlk = w−k P −1 l (Θ −)H3,dclk[2σ−dkH1,dlk +µ − dkH0,dlk] E4dlk = w−k P −1 l (Θ −)H3,dclk[4(σ−dk) 2H2,dlk +(µ−dk− µ̂dk)2H0,dlk] +w−k P −1 l (Θ −)H3,dclk4(µ−dk− µ̂dk)σ−dkH1,dlk. 106 The H quantities can be calculated as: H3,dlk def= ∫ bl al t4(x|µ−dk,σ−dk)dx, H j,dlk def= h j ( bl−µ−dk 2σ−dk ) −h j ( al−µ−dk 2σ−dk ) , for j = 0,1,2, where B = Γ(3.5)/(Γ(3) √ pi), and the functions h j’s are defined as: h0(x) = B ( 1 5 [h4(x)]5− 23 [h4(x)] 3+h4(x) ) h1(x) = B ( 1 3 [h4(x)]3− 15 [h4(x)] 5 ) h2(x) = −B 5 ( 1+ x2 )−2.5 h4(x) = sin(arctan(x)). Note, compared with PICS, the two changes of calculating these expectations are 1. Pl(Θ) is calculated differently. 2. There is an additional term H3,dclk multiplied to these quantities. 5.2.6 Consider DNA Fragments with Only One End Aligned In some case, only one end-read of a DNA fragment can be aligned to the reference genome. In such case we have missing data from one end of the DNA fragment. Note by integring out the missing end from the complete-data log-likelihood Equa- tion 5.9, we get a result exactly same as defined for SE data. So we treat a DNA fragment with a missing end as SE data and add its complete data log-likelihood to Equation 5.9, the find values of parameters that maximize the sum. 107 5.3 Future Work 5.3.1 PE or SE? We recently obtained yeast nucleosome PE data containing two deeply sequenced samples. One sample is generated from total DNA MNase-seq and the other is from sonicated ChIP-seq with an antibody for H4 histone. We applied our method to those data and compared the utilities of PE data versus SE data as a function of sequencing depth and experimental protocol. For this goal, we will analyze the original data using PING-PE, and treat the high rank predictions as ‘truth’. To generate SE data, we will remove one read from a randomly selected end of each DNA fragment in PE data. Next we will sample subsets of reads from original data, say 50% subset to 5% subset with a step of 2.5%. Note 50% is the largest proportion that can be obtained in SE data due to the way how we generate the SE data. We will analyze the PE subsets using PING-PE and analyze SE subsets using PING, then compare these predictions to the ‘truth’ by calculating the area under ROC curves. The key result of this project will be a curve of AUC as a function of proportion of reads in the data subsets. This curve will show how predictions degrade when sequence depth decreases. The relative utility of PE data vs SE data can be inferred from such curves. To estimate the reliability of results, we repeat this comparison 100 times, generating 100 AUC curves for PE data and 100 for SE data. The 200 curves can be summarized with boxplots. We can also perform Wilcoxon test to compare AUC statistics of PE vs SE. FDR adjustment is used to address the multiple test issue. Figure 5.1 shows comparison results using chromosome I of yeast data (with SE or PE sequence data from ∼ 360k DNA fragments) generated using total DNA MNase-seq protocol. From the figure, we can see that the SE protocol provides significantly better results on data sets with less than 25% of original reads, and PE protocol provides significantly better results on data sets with at least 30% of original reads. What we found suggest that for low cost, 2x single end reads is better than x paired end reads because provides more coverage of DNA fragments. Once coverage is good enough, PE is better because provides extra information 108 (lengths of DNA fragments) for better localization. 10 20 30 40 50 0. 5 0. 6 0. 7 0. 8 0. 9 % of original reads in subsample Ar ea  u nd er  R O C cu rv e s 5 0. 5 0. 6 0. 7 0. 8 0. 9 7.5 0. 5 0. 6 0. 7 0. 8 0. 9 l ll l 0. 5 0. 6 0. 7 0. 8 0. 9 l l 12.5 0. 5 0. 6 0. 7 0. 8 0. 9 15 0. 5 0. 6 0. 7 0. 8 0. 9 l 17.5 0. 5 0. 6 0. 7 0. 8 0. 9 l 0. 5 0. 6 0. 7 0. 8 0. 9 l 22.5 0. 5 0. 6 0. 7 0. 8 0. 9 l 25 0. 5 0. 6 0. 7 0. 8 0. 9 l 27.5 0. 5 0. 6 0. 7 0. 8 0. 9 l 0. 5 0. 6 0. 7 0. 8 0. 9 l l 32.5 0. 5 0. 6 0. 7 0. 8 0. 9 l l 35 0. 5 0. 6 0. 7 0. 8 0. 9 ll 37.5 0. 5 0. 6 0. 7 0. 8 0. 9 l l l l 0. 5 0. 6 0. 7 0. 8 0. 9 42.5 0. 5 0. 6 0. 7 0. 8 0. 9 l ll l 45 0. 5 0. 6 0. 7 0. 8 0. 9 47.5 0. 5 0. 6 0. 7 0. 8 0. 9 l 0. 5 0. 6 0. 7 0. 8 0. 9 * * * * * * * * * * * * * * * * * * Paired−End Data Single−End Data p−value < 0.05 Figure 5.1: AUC statistics as a function of proportion of reads in random subsets for PING applied to SE data and PING-PE applied to SE data. Each boxplot in this figure is generated from AUC of 100 simulated data sets. Datasets are chromosome I of yeast data (with ∼ 360k DNA fragments) generated using total DNA MNase-seq protocol. The green star indicates AUC of PE and SE are significantly different. Based on our preliminary results, our suggestion of PE vs SE is as follows. Given fixed amount of grant, researchers first calculate number of reads can be generated. Then they look at our results to find the cutoff amount of reads where PE works better to make decision. Once data become available, we will generate more results (e.g. mouse, human) to give researchers better guide for choice of PE vs SE. To show PE protocol is better than SE protocol when cost is not a issue, we compare the PE data to SE data with the same number of DNA fragments. In this case, the sequencing cost of generating PE data is twice of sequencing cost for generating SE data, and we expect to see PE data provide better performance. To confirm this, we performed another analysis and generated a similar figure as Figure 5.1, but in this the new figure we plot AUC as a function of number of DNA fragments, each of which yields either one read (for SE) or two reads (for PE). in Figure 5.2. In this figure we can see PE data always significantly works better than 109 SE data as we expected. 20 40 60 80 100 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 % of original DNA fragments in subsample Ar ea  u nd er  R O C cu rv e s l ll l 10 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 15 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l 25 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 ll 30 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l 35 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l l 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 45 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 50 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 55 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 65 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l l 70 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 75 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 85 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l l l l l ll l 90 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 ll 95 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 l 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 * * * * * * * * * * * * * * * * * * * * Paired−End Data Single−End Data p−value < 0.05 Figure 5.2: AUC statistics as a function of proportion of DNA fragments in random subsets for PING applied to SE data and PING-PE applied to SE data. Each boxplot in this figure is generated from AUC of 100 simulated data sets. Datasets are chromosome I of yeast data (with ∼ 360k DNA fragments) generated using total DNA MNase-seq protocol. Green star indicate AUC of PE and SE are significantly different. Note PE protocol could be used to generate high-throughput sequencing data for profiling transcription factors. For that data we could extend PICS into PICS- PE using similar approach as described above. We do not suggest using PE data for that goal, since we expect very small improvement in using PE data. 5.3.2 DNA Methylation Data DNA methylation is a modification of DNA. It is a chemical modification that attach a methyl group to the DNA, but does not change the DNA sequence itself. DNA methylation usually happen in the CpG sites (DNA sequence of ‘CG’) of genome [76]. This modification can be inherited through cell division. So DNA methylation serves as a landmark for the location where the gene expression pattern should be altered. DNA methylation is also an important factor in the development of nearly all types of cancer, since it can silences many tumor suppressor genes. 110 The data structure of MeDIP-seq DNA methylation data is similar to that for ChIP- seq for TFs; hence, methods for TFs could be directly applied these data to estimate DNA methylation levels and the locations of specific methylation sites. However to get best estimation, sequence information should also be included in the model, since the methylation only happens on CpG sites. We will obtain such data in the near future and will modify our PICS method to analyze them. 111 Bibliography [1] A. Alexeyenko and E. L. L. Sonnhammer. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res, 19(6):1107–16, Jun 2009. → pages 51 [2] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1): 25–9, May 2000. → pages 49 [3] T. Bailey and M. Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998. → pages 57 [4] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings / International Conference on Intelligent Systems for Molecular Biology ; ISMB International Conference on Intelligent Systems for Molecular Biology, 2:28–36, Jan 1994. → pages 10, 140, 147 [5] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li. Meme: discovering and analyzing dna and protein sequence motifs. Nucleic Acids Res, 34(Web Server issue):W369–73, Jul 2006. → pages 10 [6] J. Baudry, A. Raftery, G. Celeux, K. Lo, and R. Gottardo. Combining mixture components for clustering. Department of Statistics Technical Report, University of Washington., Jan 2008. → pages 25 [7] B. E. Bernstein, J. A. Stamatoyannopoulos, J. F. Costello, B. Ren, A. Milosavljevic, A. Meissner, M. Kellis, M. A. Marra, A. L. Beaudet, J. R. Ecker, P. J. Farnham, M. Hirst, E. S. Lander, T. S. Mikkelsen, and J. A. 112 Thomson. The nih roadmap epigenomics mapping consortium. Nat Biotechnol, 28(10):1045–8, Oct 2010. → pages 6, 64, 69 [8] J. Besag and C. Kooperberg. On conditional and intrinsic autoregressions. Biometrika, 82(4):733–746, Jan 1995. → pages 66, 99 [9] K. R. Blahnik, L. Dou, H. O’Geen, T. McPhillips, X. Xu, A. R. Cao, S. Iyengar, C. M. Nicolet, B. Ludäscher, I. Korf, and P. J. Farnham. Sole-search: an integrated analysis program for peak detection and functional annotation using chip-seq data. Nucleic Acids Res, 38(3):e13, Jan 2010. → pages 53 [10] V. Boeva, D. Surdez, N. Guillon, F. Tirode, A. P. Fejes, O. Delattre, and E. Barillot. De novo motif identification improves the accuracy of predicting transcription factor binding sites in chip-seq data analysis. Nucleic Acids Res, 38(11):e126, Jun 2010. → pages 10, 53 [11] A. Bonni, D. A. Frank, C. Schindler, and M. E. Greenberg. Characterization of a pathway for ciliary neurotrophic factor signaling to the nucleus. Science, 262(5139):1575–9, Dec 1993. → pages 51 [12] C. Brown, D. J. DS, and A. Sidow. Functional architecture and evolution of transcriptional elements that drive gene coexpression. Science, 317(5844): 1557–1560, 2007. → pages 11 [13] J. C. Bryne, E. Valen, M.-H. E. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh, B. Lenhard, and A. Sandelin. Jaspar, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res, 36(Database issue):D102–6, Jan 2008. → pages 10, 45, 46 [14] M. J. Buck and J. D. Lieb. Chip-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83(3):349–60, Mar 2004. → pages 5 [15] J. S. Carroll, C. A. Meyer, J. Song, W. Li, T. R. Geistlinger, J. Eeckhoute, A. S. Brodsky, E. K. Keeton, K. C. Fertuck, G. F. Hall, Q. Wang, S. Bekiranov, V. Sementchenko, E. A. Fox, P. A. Silver, T. R. Gingeras, X. S. Liu, and M. Brown. Genome-wide analysis of estrogen receptor binding sites. Nat Genet, 38(11):1289–97, Nov 2006. → pages 50 [16] Y. Chinenov and T. K. Kerppola. Close encounters of many kinds: Fos-jun interactions that mediate transcription regulatory specificity. Oncogene, 20 (19):2438–52, Apr 2001. → pages 50 113 [17] L. Cicatiello, C. Scafoglio, L. Altucci, M. Cancemi, G. Natoli, A. Facchiano, G. Iazzetti, R. Calogero, N. Biglia, M. D. Bortoli, C. Sfiligoi, P. Sismondi, F. Bresciani, and A. Weisz. A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. J Mol Endocrinol, 32(3):719–75, Jun 2004. → pages 32, 50 [18] L. A. Cirillo and K. S. Zaret. Specific interactions of the wing domains of foxa1 transcription factor with dna. Journal of Molecular Biology, 366(3): 720–4, Feb 2007. → pages 72, 74 [19] D. J. Clark. Nucleosome positioning, nucleosome spacing and the nucleosome code. J Biomol Struct Dyn, 27(6):781–93, Jun 2010. → pages 6, 64, 67, 100 [20] M. L. Conerly, S. S. Teves, D. Diolaiti, M. Ulrich, R. N. Eisenman, and S. Henikoff. Changes in h2a.z occupancy and dna methylation during b-cell lymphomagenesis. Genome Res, 20(10):1383–90, Oct 2010. → pages 63 [21] J. E. Darnell, I. M. Kerr, and G. R. Stark. Jak-stat pathways and transcriptional activation in response to ifns and other extracellular signaling proteins. Science, 264(5164):1415–21, Jun 1994. → pages 51 [22] A. DEMPSTER, N. Laird, and D. RUBIN. Maximum likelihood from incomplete data via em algorithm. J Roy Stat Soc B Met, 39(1):1–38, Jan 1977. → pages 23 [23] P. D’haeseleer. What are dna sequence motifs? Nat Biotechnol, 24(4): 423–5, Apr 2006. → pages 17 [24] S. Durinck, J. Bullard, P. Spellman, and S. Dudoit. Genomegraphs: integrated genomic data visualization with r. BMC Bioinformatics, 10:2, 2009. → pages 54 [25] R. S. Edayathumangalam, P. Weyermann, J. M. Gottesfeld, P. B. Dervan, and K. Luger. Molecular recognition of the nucleosomal ”supergroove”. Proc Natl Acad Sci USA, 101(18):6864–9, May 2004. → pages 74 [26] J. Eeckhoute, J. Carroll, T. Geistlinger, M. Torres-Arzayus, and M. Brown. A cell-type-specific transcriptional network required for estrogen regulation of cyclin d1 and cell cycle progression in breast cancer. Genes Dev, 20(18):2513–26, Sep 2006. → pages 32, 50 114 [27] S. J. Elsaesser, A. D. Goldberg, and C. D. Allis. New functions for an old variant: no substitute for histone h3.3. Current opinion in genetics & development, 20(2):110–7, Apr 2010. → pages 63 [28] J. Ernst and M. Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol, 28(8): 817–25, Aug 2010. → pages 6, 64 [29] A. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge, and S. Jones. Findpeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics, 24 (15):1729–30, Aug 2008. → pages 39, 95 [30] A. P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge, and S. J. M. Jones. Findpeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics, 24(15):1729–30, Aug 2008. → pages 7 [31] C. Fraley. . . . How many clusters? which clustering method? answers via model-based cluster analysis. The Computer Journal, Jan 1998. → pages 41 [32] C. Fraley and A. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2): 155–181, 2007. → pages 23 [33] M. C. Frith, M. C. Li, and Z. Weng. Cluster-buster: Finding dense clusters of motifs in dna sequences. Nucleic Acids Res, 31(13):3666–8, Jul 2003. → pages 11, 60 [34] S. Gama-Castro, V. Jiménez-Jacinto, M. Peralta-Gil, A. Santos-Zavaleta, M. I. Peñaloza-Spinola, B. Contreras-Moreira, J. Segura-Salazar, L. Muñiz-Rascado, I. Martı́nez-Flores, H. Salgado, C. Bonavides-Martı́nez, C. Abreu-Goodger, C. Rodrı́guez-Penagos, J. Miranda-Rı́os, E. Morett, E. Merino, A. M. Huerta, L. Treviño-Quintanilla, and J. Collado-Vides. Regulondb (version 6.0): gene regulation model of escherichia coli k-12 beyond transcription, active (experimental) annotated promoters and textpresso navigation. Nucleic Acids Res, 36(Database issue):D120–4, Jan 2008. → pages 10, 45 [35] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, 115 G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 5(10):R80, Jan 2004. → pages 41, 59 [36] V. D. Gesù, G. L. Bosco, L. Pinello, G.-C. Yuan, and D. F. V. Corona. A multi-layer method to study genome-scale positions of nucleosomes. Genomics, 93(2):140–5, Feb 2009. → pages 13 [37] A. D. Goldberg, L. A. Banaszynski, K.-M. Noh, P. W. Lewis, S. J. Elsaesser, S. Stadler, S. Dewell, M. Law, X. Guo, X. Li, D. Wen, A. Chapgier, R. C. DeKelver, J. C. Miller, Y.-L. Lee, E. A. Boydston, M. C. Holmes, P. D. Gregory, J. M. Greally, S. Rafii, C. Yang, P. J. Scambler, D. Garrick, R. J. Gibbons, D. R. Higgs, I. M. Cristea, F. D. Urnov, D. Zheng, and C. D. Allis. Distinct factors control histone variant h3.3 localization at specific genomic regions. Cell, 140(5):678–91, Mar 2010. → pages 63 [38] S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble. Quantifying similarity between motifs. Genome Biol, 8(2):R24, Jan 2007. → pages 11, 45 [39] H. H. He, C. A. Meyer, H. Shin, S. T. Bailey, G. Wei, Q. Wang, Y. Zhang, K. Xu, M. Ni, M. Lupien, P. Mieczkowski, J. D. Lieb, K. Zhao, M. Brown, and X. S. Liu. Nucleosome dynamics define transcriptional enhancers. Nature genetics, 42(4):343–7, Apr 2010. → pages 12, 64, 68, 71 [40] N. D. Heintzman, G. C. Hon, R. D. Hawkins, P. Kheradpour, A. Stark, L. F. Harp, Z. Ye, L. K. Lee, R. K. Stuart, C. W. Ching, K. A. Ching, J. E. Antosiewicz-Bourget, H. Liu, X. Zhang, R. D. Green, V. V. Lobanenkov, R. Stewart, J. A. Thomson, G. E. Crawford, M. Kellis, and B. Ren. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature, 459(7243):108–12, May 2009. → pages 63, 64 [41] S. Heinz, C. Benner, N. Spann, E. Bertolino, Y. C. Lin, P. Laslo, J. X. Cheng, C. Murre, H. Singh, and C. K. Glass. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. Mol Cell, 38(4):576–589, Jan 2010. → pages xiv, 65, 68, 69, 70, 75 [42] G. Hertz and G. Stormo. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7-8):563–77, Jul 1999. → pages 57 116 [43] B. G. Hoffman and S. J. M. Jones. Genome-wide identification of dna-protein interactions using chromatin immunoprecipitation coupled with flow cell sequencing. J Endocrinol, 201(1):1–13, Apr 2009. → pages 4 [44] B. G. Hoffman, G. Robertson, B. Zavaglia, M. Beach, R. Cullum, S. Lee, G. Soukhatcheva, L. Li, E. D. Wederell, N. Thiessen, M. Bilenky, T. Cezard, A. Tam, B. Kamoh, I. Birol, D. Dai, Y. Zhao, M. Hirst, C. B. Verchere, C. D. Helgason, M. A. Marra, S. J. M. Jones, and P. A. Hoodless. Locus co-occupancy, nucleosome positioning, and h3k4me1 regulate the functionality of foxa2-, hnf4a-, and pdx1-bound loci in islets and liver. Genome Res, 20(8):1037–51, Aug 2010. → pages xiv, 64, 65, 68, 69, 71, 72, 73, 75, 81 [45] R. A. Holt and S. J. M. Jones. The new paradigm of flow cell sequencing. Genome Res, 18(6):839–46, Jun 2008. → pages 2 [46] M. Hu, J. Yu, J. Taylor, A. Chinnaiyan, and Z. Qin. On the detection and refinement of transcription factor binding sites using chip-seq data. Nucleic Acids Res, 38(7):2154–67, Apr 2010. → pages 10, 60 [47] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of computational and graphical statistics, Jan 1996. → pages 41 [48] J. N. Ihle. Cytokine receptor signalling. Nature, 377(6550):591–4, Oct 1995. → pages 51 [49] H. Ji, S. A. Vokes, and W. H. Wong. A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors. Nucleic Acids Res, 34(21):e146, 2006. → pages 54 [50] H. Ji, H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, and W. H. Wong. An integrated software system for analyzing chip-chip and chip-seq data. Nat Biotechnol, 26(11):1293–300, Nov 2008. → pages 8 [51] H. Ji, H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, and W. H. Wong. An integrated software system for analyzing chip-chip and chip-seq data. Nat Biotechnol, 26(11):1293–300, Nov 2008. → pages 10, 54, 60 [52] C. Jiang and B. Pugh. A compiled and systematic reference map of nucleosome positions across the saccharomyces cerevisiae genome. Genome Biology, 10(10):R109, Oct 2009. → pages 63 117 [53] D. Johnson, S. Corthals, C. Ramos, A. Hoering, K. Cocks, N. Dickens, J. Haessler, H. Goldschmidt, J. Child, S. Bell, G. Jackson, D. Baris, S. Rajkumar, F. Davies, B. Durie, J. Crowley, P. Sonneveld, B. V. Ness, and G. Morgan. Genetic associations with thalidomide mediated venous thrombotic events in myeloma identified using targeted genotyping. Blood, 112(13):4924–34, Dec 2008. → pages 58 [54] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interactions. Science, 316(5830):1497–502, Jun 2007. → pages 7 [55] D. S. Johnson, W. Li, D. B. Gordon, A. Bhattacharjee, B. Curry, J. Ghosh, L. Brizuela, J. S. Carroll, M. Brown, P. Flicek, C. M. Koch, I. Dunham, M. Bieda, X. Xu, P. J. Farnham, P. Kapranov, D. A. Nix, T. R. Gingeras, X. Zhang, H. Holster, N. Jiang, R. D. Green, J. S. Song, S. A. McCuine, E. Anton, L. Nguyen, N. D. Trinklein, Z. Ye, K. Ching, D. Hawkins, B. Ren, P. C. Scacheri, J. Rozowsky, A. Karpikov, G. Euskirchen, S. Weissman, M. Gerstein, M. Snyder, A. Yang, Z. Moqtaderi, H. Hirsch, H. P. Shulha, Y. Fu, Z. Weng, K. Struhl, R. M. Myers, J. D. Lieb, and X. S. Liu. Systematic evaluation of variability in chip-chip experiments using predefined dna targets. Genome Res, 18(3):393–403, Mar 2008. → pages 4, 25 [56] R. Jothi, S. Cuddapah, A. Barski, K. Cui, and K. Zhao. Genome-wide identification of in vivo protein-dna binding sites from chip-seq data. Nucleic Acids Research, 36(16):5221–31, Sep 2008. → pages 8 [57] N. Kaplan, I. K. Moore, Y. Fondufe-Mittendorf, A. J. Gossett, D. Tillo, Y. Field, E. M. LeProust, T. R. Hughes, J. D. Lieb, J. Widom, and E. Segal. The dna-encoded nucleosome organization of a eukaryotic genome. Nature, 458(7236):362–6, Mar 2009. → pages xiv, 65, 69, 74 [58] N. Kaplan, T. R. Hughes, J. D. Lieb, J. Widom, and E. Segal. Contribution of histone sequence preferences to nucleosome organization: proposed definitions and methodology. Genome biology, 11(11):140, Nov 2010. → pages 79 [59] S. Keleş, C. Warren, C. Carlson, and A. Ansari. Csi-tree: a regression tree approach for modeling binding properties of dna-binding molecules based on cognate site identification (csi) data. Nucleic Acids Research, 36(10): 3171–3184, 2008. → pages 9, 44 118 [60] P. Kharchenko, M. Tolstorukov, and P. Park. Design and analysis of chip-seq experiments for dna-binding proteins. Nat Biotechnol, 26(12): 1351–9, Dec 2008. → pages 39, 95 [61] P. V. Kharchenko, M. Y. Tolstorukov, and P. J. Park. Design and analysis of chip-seq experiments for dna-binding proteins. Nat Biotechnol, 26(12): 1351–9, Dec 2008. → pages 4, 5, 8 [62] H. Kim, Y. Toyofuku, F. C. Lynn, E. Chak, T. Uchida, H. Mizukami, Y. Fujitani, R. Kawamori, T. Miyatsuka, Y. Kosaka, K. Yang, G. Honig, M. van der Hart, N. Kishimoto, J. Wang, S. Yagihashi, L. H. Tecott, H. Watada, and M. S. German. Serotonin regulates pancreatic beta cell mass during pregnancy. Nat Med, 16(7):804–8, Jul 2010. → pages 73 [63] C. P. J. Knaus and G. Schwarzer. Easier parallel computing in r with snowfall and sfcluster. R Journal, 1(1):54, Jan 2008. → pages 56 [64] P. F. Kuan, D. Huebert, A. Gasch, and S. Keles. A non-homogeneous hidden-state model on first order differences for automatic detection of nucleosome positions. Statistical Applications in Genetics and Molecular Biology, 8(1):Article29, Jan 2009. → pages 13, 96 [65] R. M. Kuhn, D. Karolchik, A. S. Zweig, T. Wang, K. E. Smith, K. R. Rosenbloom, B. Rhead, B. J. Raney, A. Pohl, M. Pheasant, L. Meyer, F. Hsu, A. S. Hinrichs, R. A. Harte, B. Giardine, P. Fujita, M. Diekhans, T. Dreszer, H. Clawson, G. P. Barber, D. Haussler, and W. J. Kent. The ucsc genome browser database: update 2009. Nucleic Acids Research, 37 (Database issue):D755–61, Jan 2009. → pages 25 [66] I. V. Kulakovskiy, V. A. Boeva, A. V. Favorov, and V. J. Makeev. Deep and wide digging for binding motifs in chip-seq data. Bioinformatics, 26(20): 2622–3, Oct 2010. → pages 10 [67] T. D. Laajala, S. Raghav, S. Tuomela, R. Lahesmaa, T. Aittokallio, and L. L. Elo. A practical comparison of methods for detecting transcription factor binding sites in chip-seq experiments. BMC Genomics, 10:618, Jan 2009. → pages 44 [68] K. LANGE, R. LITTLE, and J. TAYLOR. Robust statistical modeling using the t-distribution. J Am Stat Assoc, 84(408):881–896, Jan 1989. → pages 20, 41 119 [69] C. Lawrence and A. Reilly. An expectation maximization (em) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7(1):41–51, 1990. → pages 56 [70] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, 262(5131):208–14, Oct 1993. → pages 10 [71] M. Lawrence, R. Gentleman, and V. Carey. rtracklayer: an r package for interfacing with genome browsers. Bioinformatics, 25(14):1841–2, Jul 2009. → pages 54, 59 [72] G. Li and J. Widom. Nucleosomes facilitate their own invasion. Nature Structural & Molecular Biology, 11(8):763–9, Aug 2004. → pages 74 [73] H. Li and R. Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14):1754–60, Jul 2009. → pages 4 [74] H. Li and R. Durbin. Fast and accurate long-read alignment with burrows-wheeler transform. Bioinformatics, 26(5):589–95, Mar 2010. → pages 59 [75] L. Li. Gadem: a genetic algorithm guided formation of spaced dyads coupled with an em algorithm for motif discovery. J Comput Biol, 16(2): 317–29, Feb 2009. → pages 10, 31, 44, 56, 145, 146, 148 [76] R. Lister and J. R. Ecker. Finding the fifth base: genome-wide sequencing of cytosine methylation. Genome Research, 19(6):959–66, Jun 2009. → pages 110 [77] J. Liu. The collapsed gibbs sampler in bayesian computations with applications to a gene regulation problem. American Statistical Association, 89:8, 1994. → pages 56 [78] X. Liu, D. Brutlag, and J. Liu. Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, pages 127–38, 2001. → pages 10 [79] K. Lo, R. R. Brinkman, and R. Gottardo. Automated gating of flow cytometry data via robust model-based clustering. Cytometry A, 73(4): 321–32, Apr 2008. → pages 20 120 [80] A. Longo, G. P. Guanga, and R. B. Rose. Structural basis for induced fit mechanisms in dna recognition by the pdx1 homeodomain. Biochemistry, 46(11):2948–57, Mar 2007. → pages 74 [81] K. A. Lord, A. Abdollahi, S. M. Thomas, M. DeMarco, J. S. Brugge, B. Hoffman-Liebermann, and D. A. Liebermann. Leukemia inhibitory factor and interleukin-6 trigger the same immediate early response, including tyrosine phosphorylation, upon induction of myeloid leukemia differentiation. Mol Cell Biol, 11(9):4371–9, Sep 1991. → pages 51 [82] M. Lupien, J. Eeckhoute, C. Meyer, Q. Wang, Y. Zhang, W. Li, J. Carroll, X. Liu, and M. Brown. Foxa1 translates epigenetic signatures into enhancer-driven lineage-specific transcription. Cell, 132(6):958–70, Mar 2008. → pages 32 [83] C. Lütticken, U. M. Wegenka, J. Yuan, J. Buschmann, C. Schindler, A. Ziemiecki, A. G. Harpur, A. F. Wilks, K. Yasukawa, and T. Taga. Association of transcription factor aprf and protein kinase jak1 with the interleukin-6 signal transducer gp130. Science, 263(5143):89–92, Jan 1994. → pages 51 [84] S. Mahony and P. V. Benos. Stamp: a web tool for exploring dna-binding motif similarities. Nucleic Acids Res, 35(Web Server issue):W253–8, Jul 2007. → pages 10, 45, 58 [85] S. Mahony, P. Auron, and P. Benos. Dna familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol, 3(3):e61, Mar 2007. → pages 31 [86] G. McLachlan and P. Jones. Fitting mixture models to grouped and truncated data via the em algorithm. Biometrics, 44(2):571–578, 1988. → pages 1, 23 [87] G. J. McLachlan and T. Krishnan. The em algorithm and extensions. page 359, Jan 2008. → pages 24 [88] X. MENG and D. RUBIN. Maximum-likelihood-estimation via the ecm algorithm - a general framework. Biometrika, 80(2):267–278, Jan 1993. → pages 29, 88, 102 [89] E. Mercier, A. Droit, L. Li, G. Robertson, X. Zhang, and R. Gottardo. An integrated pipeline for the genome-wide analysis of transcription factor binding sites from chip-seq. PLoS ONE, Jan 2011. → pages iv 121 [90] K. Milde-Langosch. The fos family of transcription factors and their role in tumourigenesis. Eur J Cancer, 41(16):2449–61, Nov 2005. → pages 32 [91] K. Milde-Langosch, S. Janke, I. Wagner, C. Schröder, T. Streichert, A.-M. Bamberger, F. Jänicke, and T. Löning. Role of fra-2 in breast cancer: influence on tumor cell invasion and motility. Breast Cancer Res Treat, 107 (3):337–47, Feb 2008. → pages 50 [92] modENCODE Consortium, S. Roy, J. Ernst, P. V. Kharchenko, P. Kheradpour, N. Negre, M. L. Eaton, J. M. Landolin, C. A. Bristow, L. Ma, M. F. Lin, S. Washietl, B. I. Arshinoff, F. Ay, P. E. Meyer, N. Robine, N. L. Washington, L. D. Stefano, E. Berezikov, C. D. Brown, R. Candeias, J. W. Carlson, A. Carr, I. Jungreis, D. Marbach, R. Sealfon, M. Y. Tolstorukov, S. Will, A. A. Alekseyenko, C. Artieri, B. W. Booth, A. N. Brooks, Q. Dai, C. A. Davis, M. O. Duff, X. Feng, A. A. Gorchakov, T. Gu, J. G. Henikoff, P. Kapranov, R. Li, H. K. MacAlpine, J. Malone, A. Minoda, J. Nordman, K. Okamura, M. Perry, S. K. Powell, N. C. Riddle, A. Sakai, A. Samsonova, J. E. Sandler, Y. B. Schwartz, N. Sher, R. Spokony, D. Sturgill, M. van Baren, K. H. Wan, L. Yang, C. Yu, E. Feingold, P. Good, M. Guyer, R. Lowdon, K. Ahmad, J. Andrews, B. Berger, S. E. Brenner, M. R. Brent, L. Cherbas, S. C. R. Elgin, T. R. Gingeras, R. Grossman, R. A. Hoskins, T. C. Kaufman, W. Kent, M. I. Kuroda, T. Orr-Weaver, N. Perrimon, V. Pirrotta, J. W. Posakony, B. Ren, S. Russell, P. Cherbas, B. R. Graveley, S. Lewis, G. Micklem, B. Oliver, P. J. Park, S. E. Celniker, S. Henikoff, G. H. Karpen, E. C. Lai, D. M. MacAlpine, L. D. Stein, K. P. White, and M. Kellis. Identification of functional elements and regulatory circuits by drosophila modencode. Science, 330(6012):1787–97, Dec 2010. → pages 6, 64 [93] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature methods, 5(7):621–8, Jul 2008. → pages 7 [94] U. Munzel. . . . A unified approach to simultaneous rank test procedures in the unbalanced one-way layout. Biometrical Journal, Jan 2001. → pages 73 [95] N. Nagarajan, N. Jones, and U. Keich. Computing the p-value of the information content from an alignment of multiple sequences. Bioinformatics, 21 Suppl 1:i311–8, Jun 2005. → pages 57 122 [96] D. Newburger and M. Bulyk. Uniprobe: an online database of protein binding microarray data on protein–dna interactions. Nucleic Acids Research, 37(Database issue):D77–D82, 2009. → pages 10, 45 [97] D. A. Nix, S. J. Courdy, and K. M. Boucher. Empirical methods for controlling false positives and estimating confidence in chip-seq peaks. BMC Bioinformatics, 9:523, Jan 2008. → pages 5, 8 [98] M. S. Ong, T. J. Richmond, and C. A. Davey. Dna stretching and extreme kinking in the nucleosome core. Journal of Molecular Biology, 368(4): 1067–74, May 2007. → pages 74 [99] P. J. Park. Chip-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–80, Oct 2009. Notes: Good review for ChIP-seq technology. → pages 2 [100] J. Parkhill, E. Birney, and P. Kersey. Genomic information infrastructure after the deluge. Genome Biology, 11:402, 2010. → pages 53 [101] G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in dna sequences. Bioinformatics, 17 Suppl 1:S207–14, Jan 2001. → pages 10 [102] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res, 32(Web Server issue):W199–203, Jul 2004. → pages 60 [103] D. Peel and G. McLachlan. Robust mixture modelling using the t distribution. Statistics and Computing, 10(4):339–348, 2000. → pages 23, 28, 87, 101 [104] S. Pepke, B. Wold, and A. Mortazavi. Computation for chip-seq and rna-seq studies. Nat Methods, 6(11 Suppl):S22–32, Nov 2009. → pages 43 [105] M. Radman-Livaja and O. J. Rando. Nucleosome positioning: how is it established, and why does it matter? Dev Biol, 339(2):258–66, Mar 2010. → pages 63 [106] A. G. Robertson, M. Bilenky, A. Tam, Y. Zhao, T. Zeng, N. Thiessen, T. Cezard, A. P. Fejes, E. D. Wederell, R. Cullum, G. Euskirchen, M. Krzywinski, I. Birol, M. Snyder, P. A. Hoodless, M. Hirst, M. A. Marra, and S. J. M. Jones. Genome-wide relationship between histone h3 lysine 4 123 mono- and tri-methylation and transcription factor binding. Genome Res, 18(12):1906–17, Dec 2008. → pages 4, 68, 71 [107] A. G. Robertson, M. Bilenky, A. Tam, Y. Zhao, T. Zeng, N. Thiessen, T. Cezard, A. P. Fejes, E. D. Wederell, R. Cullum, G. Euskirchen, M. Krzywinski, I. Birol, M. Snyder, P. A. Hoodless, M. Hirst, M. A. Marra, and S. J. M. Jones. Genome-wide relationship between histone h3 lysine 4 mono- and tri-methylation and transcription factor binding. Genome Res, 18(12):1906–17, Dec 2008. → pages 22 [108] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B. Bernier, R. Varhol, A. Delaney, N. Thiessen, O. Griffith, A. He, M. Marra, M. Snyder, and S. Jones. Genome-wide profiles of stat1 dna association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 4(8):651–7, Aug 2007. → pages 7, 59, 71 [109] K. Roeder and L. Wasserman. Practical bayesian density estimation using mixtures of normals. J Am Stat Assoc, 92(439):894–902, Jan 1997. → pages 41 [110] F. Roth, J. Hughes, P. Estep, and G. Church. Finding dna regulatory motifs within unaligned noncoding sequences clustered by whole-genome mrna quantitation. Nat Biotechnol, 16(10):939–45, Oct 1998. → pages 10 [111] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjornson, N. Carriero, M. Snyder, and M. B. Gerstein. Peakseq enables systematic scoring of chip-seq experiments relative to controls. Nat Biotechnol, 27(1):66–75, Jan 2009. → pages 4, 5, 8, 39, 66, 95 [112] A. Sandelin and W. Wasserman. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol, 338(2):207–215, 2004. → pages 57 [113] W. P. Schiemann and N. M. Nathanson. Involvement of protein kinase c during activation of the mitogen-activated protein kinase cascade by leukemia inhibitory factor. evidence for participation of multiple signaling pathways. J Biol Chem, 269(9):6376–82, Mar 1994. → pages 51 [114] G. Schwarz. Estimating the dimension of a model. The annals of statistics, Jan 1978. → pages 24 124 [115] M. A. Schwarzschild and R. E. Zigmond. Effects of peptides of the secretin-glucagon family and cyclic nucleotides on tyrosine hydroxylase activity in sympathetic nerve endings. J Neurochem, 56(2):400–6, Feb 1991. → pages 51 [116] A. A. Sharov and M. S. H. Ko. Exhaustive search for over-represented dna sequence motifs with cisfinder. DNA Res, 16(5):261–73, Oct 2009. → pages 60 [117] E. Shaulian and M. Karin. Ap-1 in cell proliferation and survival. Oncogene, 20(19):2390–400, Apr 2001. → pages 50 [118] H. Shin, T. Liu, A. K. Manrai, and X. S. Liu. Ceas: cis-regulatory element annotation system. Bioinformatics, 25(19):2605–6, Oct 2009. → pages 53 [119] N. Sierro, Y. Makita, M. de Hoon, and K. Nakai. Dbtbs: a database of transcriptional regulation in bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res, 36(Database issue):D93–6, Jan 2008. → pages 10, 45 [120] R. Staden. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci, 5(2):89–96, Apr 1989. → pages 149 [121] N. Stahl, T. J. Farruggella, T. G. Boulton, Z. Zhong, J. E. Darnell, and G. D. Yancopoulos. Choice of stats and other substrates specified by modular tyrosine-based motifs in cytokine receptors. Science, 267(5202): 1349–53, Mar 1995. → pages 51 [122] G. D. Stormo. Dna binding sites: representation and discovery. Bioinformatics, 16(1):16–23, Jan 2000. → pages 9, 44 [123] G. Su, B. Mao, and J. Wang. Maco: a gapped-alignment scoring tool for comparing transcription factor binding sites. In Silico Biol (Gedrukt), 6(4): 307–10, Jan 2006. → pages 11, 45 [124] A. M. Szalkowski and C. D. Schmid. Rapid innovation in chip-seq peak-calling algorithms is outdistancing benchmarking efforts. Briefings in Bioinformatics, 2010. → pages 44 [125] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. M. Myers, and A. Sidow. Genome-wide analysis of transcription factor binding sites based on chip-seq data. Nature methods, 5 (9):829–34, Sep 2008. Notes The paper for QuEst. → pages 8, 16, 20 125 [126] J. van Helden, A. F. Rios, and J. Collado-Vides. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res, 28(8):1808–18, Apr 2000. → pages 145 [127] A. Weiner, A. Hughes, M. Yassour, O. J. Rando, and N. Friedman. High-resolution nucleosome mapping reveals transcription-dependent promoter packaging. Genome Res, 20(1):90–100, Jan 2010. → pages 12, 64, 65, 67 [128] E. M. Wilson, M. M. Hsieh, and P. Rotwein. Autocrine growth factor signaling by insulin-like growth factor-ii mediates myod-stimulated myocyte maturation. J Biol Chem, 278(42):41109–13, Oct 2003. → pages 51 [129] E. Wingender. The transfac project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinformatics, 9(4):326–32, Jul 2008. → pages 10, 45, 58 [130] W. Xu, S. A. A. Comhair, S. Zheng, S. C. Chu, J. Marks-Konczalik, J. Moss, S. J. Haque, and S. C. Erzurum. Stat-1 and c-fos interaction in nitric oxide synthase-2 gene activation. Am J Physiol Lung Cell Mol Physiol, 285(1):L137–48, Jul 2003. → pages 51 [131] X. Zhang, G. Robertson, M. Krzywinski, K. Ning, A. Droit, S. Jones, and R. Gottardo. Pics: Probabilistic inference for chip-seq. Biometrics, Jun 2010. → pages iv, 12, 55, 60, 64, 65, 66, 75, 76, 78 [132] X. Zhang, G. Robertson, S. Woo, B. G. Hoffman, and R. Gottardo. Probabilistic inference for nucleosome positioning with mnase-based or sonicated short-read data. Under Review, pages 1–20, 2011. → pages iv [133] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nussbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. Model-based analysis of chip-seq (macs). Genome Biology, 9(9):R137, Jan 2008. → pages 7, 16, 20, 31, 50, 56, 59, 129 [134] Y. Zhang, H. Shin, J. S. Song, Y. Lei, and X. S. Liu. Identifying positioned nucleosomes with epigenetic marks in human from chip-seq. BMC Genomics, 9:537, Jan 2008. → pages 12, 64, 65, 67, 80 [135] Z. Zhang and B. F. Pugh. High-resolution genome-wide mapping of the primary structure of chromatin. Cell, 144(2):175–86, Jan 2011. → pages 63, 80 126 [136] Q. Zhou and W. H. Wong. Cismodule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA, 101 (33):12114–9, Aug 2004. → pages 11 [137] L. J. Zhu, C. Gazin, N. D. Lawson, H. Pages, S. M. Lin, D. S. Lapointe, and M. R. Green. Chippeakanno: a bioconductor package to annotate chip-seq and chip-chip data. BMC Bioinformatics, 11(1):237, May 2010. → pages 59 127 Appendix A Supplementary Material for Chapter Two A.1 Densities In this appendix, we use G a(α,β ) to denote a Gamma distribution with shape parameter α and scale parameter β , with density given by f (x|α,β ) = β α Γ(α) xα−1 exp(−βx). Similarly, N(µ,σ2) denotes a Normal distribution with mean µ and variance σ2. Finally, tν(µ,σ2) denotes a t distribution with ν degrees of freedom, mean µ and variance parameter σ2, with density given by f (x|µ,σ2,ν) = Γ((ν+1)/2) Γ(ν/2)σ √ νpi ( 1+ (x−µ)2 σ2ν )−(ν+1)/2 . The tν(µ,σ2) distribution can also be written as a Normal-Gamma distribution (i.e. X = σZ/ √ U +µ , where Z ∼ N(0,1) independent of U ∼ G a(ν/2,ν/2)), and we use this definition in our EM algorithm. 128 A.2 Parameter Recalculation when Merging Binding Events When merging binding events, we recompute parameter estimates by matching the first and second moments as follows: µ = ∑k µkwk ∑k wk δ = ∑k δkwk ∑k wk ν ν−2σ 2 f − (µ− δ 2 )2 = ∑ k [ wk( ν ν−2σ 2 f k− (µk− δk 2 )2) ] ν ν−2σ 2 r − (µ+ δ 2 )2 = ∑ k [ wk( ν ν−2σ 2 rk− (µk + δk 2 )2) ] . A.3 Application to Experimental Datasets Web Figure A.1 shows histograms of estimated average DNA fragment lengths δ for the top-ranked 10000 filtered and unfiltered predicted binding events. For the FOXA1 data the estimated average fragment size was approximately 150 bps, consistent with [133]; it was somewhat smaller for the GABP data. Web Figure A.1 also shows that most of the regions had DNA fragments between 50 and 200 bps, which supports our filtering atypical regions by this parameter. Web Figure A.2 shows the relationships between the region rank and FDR for the 5000 top-ranked regions for each method. Except for cisGenome, all methods returned fairly low FDRs on these regions for both datasets. Overall, MACS’ re- sponse was similar to PICS’. Nevertheless, the figure shows that estimated FDRs can greatly vary between methods and that a comparison based on a fixed number of regions is likely to be more meaningful. A.4 Simulation Study We analyzed each data set using the five methods previously compared, namely PICS, MACS, cisGenome, QuEST and Useq. We assessed the ability of each method to detect adjacent binding sites in the 500 spike-in regions. Web Tables 129 (a) Delta (bp) Fr eq ue nc y 50 100 150 200 250 0 20 0 40 0 60 0 80 0 unfiltered filtered (b) Delta (bp) Fr eq ue nc y 50 100 150 200 250 0 20 0 40 0 60 0 unfiltered filtered (a) GABP (b) FOXA1 Figure A.1: Distribution of estimated average DNA fragment lengths, δ , in GABP (a) and FOXA1 (b) data, before and after filtering. For clarity, only results for the top 10000 regions are shown. 0 1000 2000 3000 4000 5000 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (Top) n Es tim at ed  F DR  (% ) PICS MACS QUEST CIS USEQ 0 1000 2000 3000 4000 5000 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (Top) n Es tim at ed  F DR  (% ) PICS MACS QUEST CIS USEQ (a) GABP (b) FOXA1 Figure A.2: Number of detected peaks as a function of False Discovery Rate (FDR) levels for the four analysis methods, for the GABP data (a) and the FOXA1 data (b). For the FOXA1 data the cisGenome FDR includes values up to ∼ 20%, but we have limited the y-axis for better visualiza- tion of other FDRs. Note that actual FDR values are given in %, and one needs to divide by 100 to get the actual FDR level. For consis- tency with motif analysis, we show results for only the top 5000 PICS regions. 130 A.1-A.3 show the number of binding events estimated by each method versus the number of true binding events in each spike-in region. Given that the number of true binding events was 551, for these tables we used only the top 551 reported binding events; an analysis based on an FDR threshold gave similar results. Over- all, PICS performed better than other methods when there were multiple adjacent binding sites. Finally, we also assessed the coverage rate of our PICS confidence intervals for the µ’s, as a function of the number of binding sites called. The coverage of our approximate 95% confidence intervals was between 90-100%, independent of the region rank and the simulation scenario. Web Figure A.5 shows the ROC curves of each method. Table A.1: Estimated number of binding sites against true number of binding sites for simulation scenario 1. Bold text marks the best results. PICS MACS cisGenome QuEST USEQ Truth 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 1 29 421 1 0 12 439 0 39 412 0 125 326 0 125 326 0 2 0 11 36 0 0 45 2 0 44 3 15 31 1 15 31 1 3 0 0 0 2 0 2 0 0 2 0 0 2 0 0 2 0 Table A.2: Estimated number of binding sites against true number of binding sites for simulation scenario 2. Bold text marks the best results. PICS MACS cisGenome QuEST USEQ Truth 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 1 19 427 5 0 8 443 0 31 420 0 101 349 1 9 442 0 2 1 13 33 0 0 44 3 1 43 3 13 32 2 0 43 4 3 0 1 0 1 0 2 0 0 2 0 0 2 0 0 2 0 131 Table A.3: Estimated number of binding sites against true number of binding sites for simulation scenario 3. Bold text marks the best results. PICS MACS cisGenome QuEST USEQ Truth 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 1 26 424 1 0 9 442 0 37 414 0 142 306 3 7 444 0 2 1 12 34 0 0 44 3 0 42 5 13 31 3 0 45 2 3 0 0 0 2 0 2 0 0 2 0 1 1 0 0 2 0 100 300 500 700 0 5 10 20 30 Peak Ranks Sp at ia l E rro r ( bp ) PICS MACS cisGenome QuEST USEQ 100 300 500 700 0 5 10 20 30 Peak Ranks Sp at ia l E rro r ( bp ) PICS MACS cisGenome QuEST USEQ (a) PICS model (b) Mis-specified prior 100 300 500 700 0 5 10 20 30 Peak Ranks Sp at ia l E rro r ( bp ) PICS MACS cisGenome QuEST USEQ (c) Mis-specified likelihood Figure A.3: Spatial error as a function of predicted region rank for the five methods compared under the three simulation scenarios. 132 (a) PICS Model (b) Mis-specified Prior (c) Mis-specified Likelihood 0 200 400 600 0 20 40 60 80 10 0 Peak RanksRa te  o f 9 5%  C I o f d et ec te d pe ak s c ov er  tr ut h (% ) PICS 0 200 400 600 0 20 40 60 80 10 0 Peak RanksRa te  o f 9 5%  C I o f d et ec te d pe ak s c ov er  tr ut h (% ) PICS 0 200 400 600 0 20 40 60 80 10 0 Peak RanksRa te  o f 9 5%  C I o f d et ec te d pe ak s c ov er  tr ut h (% ) PICS Figure A.4: Coverage of 95% approximate confidence interval for binding site positions for the three simulation scenarios. 133 0.0 0.2 0.4 0.6 0.8 1.0 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ 0.0 0.2 0.4 0.6 0.8 1.0 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ (a) Scenario 1: ROC (b) Scenario 2: ROC 0.0 0.2 0.4 0.6 0.8 1.0 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 1ïSpecificity Se ns itiv ity PICS MACS cisGenome QuEST USEQ (c) Scenario 3: ROC Figure A.5: Receiver operating characteristic curves for the three simulation scenarios. 134 Appendix B Supplementary Material for Chapter Three B.1 False Discovery Rate The False Discovery Rate (FDR) can be used to derive a threshold used to define the number of sequences to analyze. Figures B.1–B.3 show the evolution of the FDR as a function of the score threshold for the ER, FOXA1, and STAT1 data. Here, we used an FDR of 10% represented by a grey horizontal line. B.2 Sequence Logo and Distributions for All Motifs The analysis of the ER, FOXA1, STAT1 and CTCF data by rGADEM revealed receptively 78, 25, 23 and 68 motifs. We used MotIV to compare the resulting PWMs to the JASPAR database in order to identify the transcription factors asso- ciated to this binding sites. For clarity, here, we only selected motifs with E-value less than 10 · e−4, see Figures B.4–B.11. B.3 Pairwise Distances Pairwise distance plots for selected biologically relevant motifs (see main text) for the CTCF, ER and FOXA1 data are shown in Figures B.12–B.14. 135 Figure B.1: Estimated FDR as a function of the enrichment score for the ER data. The number of enriched regions for the corresponding score is given at the top. 136 Figure B.2: Estimated FDR as a function of the enrichment score for the FOXA1 data. The number of enriched regions for the corresponding score is given at the top. 137 Figure B.3: Estimated FDR as a function of the enrichment score for the STAT1 data. The number of enriched regions for the corresponding score is given at the top. 138 Figure B.4: Motifs identified by rGADEM and visualized with MotIV from the CTCF data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. B.4 rGADEM We describe in this section an updated version of GADEM (v.1.3) included in rGADEM. The new version retains the key practical strengths of automatically de- termining motif width and the number of motifs. Extending the functionality for ChIP-seq data, we added: 1) an optional seeded analysis, 2) an improved method for choosing the maximum number of site occurrences (MAXP), 3) options for us- ing spatially non-uniform motif priors, and 4) an enrichment assessment for each motif. We briefly describe each of the changes below, and give technical details in the supplementary material. Finally, we made the code more efficient and im- 139 Figure B.5: Motifs identified by rGADEM and visualized with MotIV from the ER data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. proved the documentation. B.4.1 Choice of Expected Number of Occurrences MAXP In MEME [4], MAXP is a user-specified parameter. In GADEM (v1.2) (Li:2009p1061), MAXP is assumed to be the number of sequences in the data. In both cases, the same MAXP is used for all motifs, regardless of potential differences in preva- lence between motifs. Although Bailey and Elkan (1995) showed that the choice 140 Figure B.6: Motifs identified by rGADEM and visualized with MotIV from the FOXA1 data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. for MAXP is not critical, it may be preferable to allow each discovered motif to have a separately optimized MAXP value. In the updated version (v1.3), GADEM assumes that MAXP is one of the ten equally spaced numbers in the set (0.1N, 0.2N,..,1.0N), where N is the number of sequences in the data. It is feasible to systematically evaluate each of the ten choices for one or a few PWMs, but not for tens of thousands of PWMs (i.e. the product of the number of genetic algorithm (GA) generations, the GA population size and 10 MAXP choices). Thus, in the updated version (v1.3), we implemented two approaches for choosing a MAXP, depending on the number of PWMs to be evaluated. When a starting PWM is pro- vided, GADEM systematically examines each of the ten choices for MAXP and 141 Figure B.7: Motifs identified by rGADEM and visualized with MotIV from the STAT1 data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. For clarity only motifs with E-value less than 10−4 are retained. 142 Figure B.8: Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the CTCF data. For clarity only motifs with E-value less than 10−4 are retained. uses the one that results in the lowest entropy E-value. Otherwise, For an unseeded run (see below), GADEM uses a GA to stochastically select one of the ten choices. B.4.2 Seeded / Unseeded Analysis Seeded Analysis When a PWM is provided as a seed, GADEM uses it as the starting PWM for all motifs that it reports. As in unseeded runs, once a motif has been found that satis- fies user-specified thresholds for E-value and the minimal number of sites (MINN), all of its instances are masked, and the same seed is used again, iteratively, until satisfactory motifs can no longer be found. Typically, such a run reports a diverse set of motifs, some of which may be variants that are closely related to the start- 143 Figure B.9: Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the ER data. For clarity only motifs with E-value less than 10−4 are retained. ing PWM. Seeded analyses are not only fast but support asking focused biological questions. Unseeded Analysis For an unseeded analysis, GADEM stochastically selects a MAXP from one of the ten choices (see above) using a modified GA. A brief description of the genetic algorithm is given below: • Initialization: For each spaced dyad, a MAXP is randomly chosen from one of the ten choices with equal probability. • Mutation: GADEM first decides with equal probability whether one of the three components in a spaced dyad or MAXP is going to be mutated. If 144 Figure B.10: Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the FOXA1 data. For clarity only motifs with E-value less than 10−4 are retained. MAXP is chosen, GADEM replaces its current MAXP by a randomly cho- sen different MAXP. Otherwise, the mutation operation is carried out as de- scribed in [75]. • Crossover: A spaced dyad consists of two words ( α1 and α2) and a spacer (ηx ) [126]. Each spaced dyad has an associated MAXP. The spaced dyad and its associated MAXP constitute a GA chromosome. Thus, a spaced dyad and its corresponding MAXP have three potential “crossover” points (α1-ηx , ηx-α2, and α2-MAXP, respectively). A GA crossover operation will result in either a new spaced dyad with the same MAXP or the same spaced dyad with a different MAXP. 145 Figure B.11: Distance distribution between the rGADEM motif occurrences and the PICS predictions for all motifs identified in the STAT1 data. For clarity only motifs with E-value less than 10−4 are retained. Seeded Versus Unseeded Methodology The major difference between a seeded and unseeded analysis is that in a seeded analysis GADEM does not generate the starting PWMs through spaced dyads and optimize them through a GA as in Li [75]. This makes seeded runs much faster than unseeded. In a seeded analysis, GADEM internally sets the number of GA generations to 1 and the GA population size to 10 (corresponding to the 10 MAXP choices), over- riding any user-provided values for the two parameters. Other settings/parameters 146 Figure B.12: Pairwise distance distributions between the CTCF, Myf motifs identified from the CTCF data. such as motif spatial prior, background model and EM (number of iterations and convergence threshold) apply equally to both seeded and unseeded runs. B.4.3 Motif Spatial Prior Probability For sequence data that should have centrally enriched motifs (e.g. ChIP-seq), we have added three non-uniform priors to the original uniform spatial motif prior probability. In these, the central part of each sequence (e.g. 50 bp around the location of maximal enrichment in an enrichment profile, for discovery in a set of ∼±200-bp peak “cores”) is given a greater weight in EM motif optimization than the regions flanking the center. This “window” width is a user-specified parameter. To summarize, GADEM now contains four spatial prior options: • Prior 0: Uniform distribution as in Bailey and Elkan [4] and GADEM (v1.2) 147 Figure B.13: Pairwise distance distributions between the ER, FOXA1 and AP-1 motifs identified from the ER data. [75]. • Prior 1: Rectangle at the sequence center. This prior gives the center 50 bp a non-zero uniform weight and a zero weight elsewhere. • Prior 2: Triangular. This prior gives a weight that is largest at the center of an input sequence and decreases linearly to zero at each end of the sequence. • Prior 3: Gaussian. As with a seeded run, non-uniform priors can improve the identification of mo- tifs for noisy and/or short motifs. However, for a typical de novo analysis followed by MotIV, we prefer to use the uniform prior to avoid biasing motif occurrence distributions used in the post-processing step. 148 Figure B.14: Pairwise distance distributions between the FOXA1 and AP-1 motifs identified from the ER data. B.4.4 Background Model It is convenient to assume that the background model is iid, as the exact distribution of the log likelihood ratio (llr) score can be efficiently approximated using the probability generating function method Staden [120]. B.4.5 Fold Enrichment Analysis GADEM estimates motif enrichment in the input data by comparing the total num- ber of sites and the number of sequences containing at least one binding site in the input data and in the background/random sequences. GADEM simulates several sets (e.g. 10) of background/random sequences (independent from those used in null llr score distribution) of the same size as the input sequences, using the nu- cleotide frequencies from the input data. GADEM computes the same llr, on each subsequence of length w. A subsequence is declared a binding site if its llr score 149 is at or above the llr score that corresponds to the p-value cutoff. GADEM deter- mines the total number of binding sites in the background sequences , and reports Cinput Crandom and NinputNrandom as estimates of fold enrichment, where C and N are the total number of sites and the number of sequences containing at least one predicted site, respectively. B.5 Gene Ontology Analysis To complement our analysis, we analyzed Gene Ontology terms with the ChIP- peakAnno package from BioConductor. We used the getEnrichedGO function with the following parameters: • orgAnn: org.Hs.eg.db (Homo Sapiens) • maxP: 0.1 to 0.3 • minGOterm: 1 We report the most significantly Gene Ontology (GO) terms for the primary and secondary motifs. GO analyses reveal several functional categories in which regu- lated genes are overrepresented. 150 151 B.6 Agreement between Motif Finders We calculated the number overlapping primary motif occurrences for cisFinder, Weeder and rGADEM. Figures B.16–B.19 show that the overlap between the three methods is relatively large. This is particularly true for rGADEM and cisFinder. We don’t show results from MEME because of the number of sequences we used, capped at 5000 for computational reasons. 152 Figure B.15: ER motif identified by rGADEM and visualized with MotIV from the FOXA1 data. The motif matches and associated E-values are based on the JASPAR database included in MotIV. 153 Figure B.16: Venn diagram for the number of overlapped occurrences of FOXA1 primary motifs. 154 Figure B.17: Venn diagram for the number of overlapped occurrences of ER primary motifs. 155 Figure B.18: Venn diagram for the number of overlapped occurrences of CTCF primary motifs. 156 Figure B.19: Venn diagram for the number of overlapped occurrences of STAT1 primary motifs. Figure B.20: Example of a MotIV alignment output based on the FOXA1 data. 157 Appendix C Supplementary Material for Chapter Four C.1 Post-processing PING Results Given the range of cases that short-read data present to PING’s mixture modelling, we detect and post-process certain problematic events within the results that the mixture model returns. C.1.1 Mismatched Peaks When one of a nucleosome’s two XSET profile peaks is much more asymmetric and/or taller than the other, mixture models may use different number of compo- nents to describe the F peak and R peaks of the same nucleosome. In such a case, unmatched mixture components within a nucleosome will be matched to compo- nents of other nucleosomes, which will cause subsequently processed nucleosomes to have mismatched components and incorrectly estimated center locations. We de- tect such problems as atypical estimated δ outside of the range of (80,250), or as un-fittable models, and fit a new model to such a candidate region, using a stronger prior on δ . For such a new prior we set ρ = 8, which reduces the prior’s standard error to be in the range (7,16). While this strong prior helps address the mismatch problem, it is too strong to be used globally, as the range of true δ is much wider 158 than it permits. C.1.2 Missing Nucleosomes A nucleosome may have much higher F/R XSET profile peaks than its neighbour- ing nucleosomes in the same candidate region. In such a case, mixture models tend to use many components to describe the nucleosome with tall peaks, while using a single component to describe multiple low-peak nucleosomes, and PING can fail to detect some low rank nucleosomes. We can detect such problems by looking for predictions with forward and reverse peaks that are too wide (σ f ,σr > 50 for MNase, and σ f ,σr > 70 for sonication), then extracting reads within a distance of 2σ from the estimated peak centers, and refitting the model with these reads. This is typically effective, since reads from higher peak are excluded in the new fitted model, and new hyper-parameters α = 100 and β = 100000 are used to ensure that the 95% CI of prior σ is (20,25). C.1.3 Duplicate Predictions When we divide a long candidate region, some care is needed. The concern arises because, when a nucleosome is near the cutting point, enough reads could be present in each of the new candidate regions that the same nucleosome can be predicted in both of the new candidate regions. To address this, when we find pre- dicted nucleosomes that are in adjacent candidate regions and are within 100 bp of each other, we extract reads from and between these two nucleosomes and refit the model. We let the new fitted model decide whether or not to merge the duplicates into a single nucleosome prediction or to retain both predictions. 159 C.2 The Supplementary Figures and Table l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Figure C.1: Density of Gaussian Markov random field (GMRF) used in PING, with horizontal lines showing 10 simulations or realizations of 3 adjacent nucleosomes from this prior. 160 50 100 150 200 250 0. 00 0 0. 01 0 0. 02 0 0. 03 0 Prior delta de ns ity MNase Sonication 50 100 150 200 250 0. 00 0. 02 0. 04 0. 06 Posterior delta de ns ity MNase Kaplan Sonication Heinz Sonication Hoffman Figure C.2: Prior density of δ with selected values for hyperparameters (left) and posterior density of δ in three real data sets (right). Data % subsets 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 (A) PING 61 60 60 60 59 59 58 58 57 57 56 55 54 53 51 (A ) TpF 54 54 54 54 54 54 54 54 55 55 54 54 55 55 55 (A) NPS 28 28 27 27 26 26 25 24 23 22 21 21 20 18 15 (B) PING 36 35 34 33 32 31 30 28 26 24 22 20 17 14 11 (B) TpF 51 50 50 49 48 47 46 44 43 42 40 38 36 33 30 (B) NPS 27 26 26 25 24 24 23 22 21 20 19 18 17 15 13 (C) PING 25 24 23 21 20 19 17 16 14 12 10 8 6 4 2 (C) TpF 41 40 39 38 37 36 35 33 32 30 28 26 24 22 19 (C) NPS 20 20 19 18 18 17 16 15 14 13 12 11 9 7 6 Table C.1: The number (in thousands) of predicted nucleosomes for (A) NOCL4 data (Kaplan 2009), 5000 selected regions of (B) 1hr H3K4me1 data (Heinz 2010), 5000 selected regions (C) islet data (Hoffman 2010) and their random subsets of reads using PING, TpF and NPS. 161 Figure C.3: Nucleosome prediction scores of PING, TpF, and NPS for MNase “NOCL R4” data (Kaplan 2009). The vertical lines marks the top-ranked 10000 predicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. 162 Figure C.4: Nucleosome prediction scores of PING, TpF, and NPS for mouse PUER cell line sonicated H3K4me1 ChIP-seq data, 1 hour after stimu- lation (Heinz 2010). The vertical line marks the top-ranked 2000 pre- dicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. 163 Figure C.5: Nucleosome prediction scores of PING, TpF, and NPS for mouse islet sonicated H3K4me1 ChIP-seq data (Hoffman 2010). Vertical lines mark the top-ranked 1000, 2000 and 4000 predicted nucleosomes. Note the scores of each method have been rescaled to range in [0,1]. 164 −1000 −500 0 500 1000 1. 5 2. 5 3. 5 Distance to SPI1 binding sites R ea ds  D en sit y 0h 1h −1000 −500 0 500 1000 2 3 4 5 6 Distance to CEBPB binding sites R ea ds  D en sit y 0h 1h Figure C.6: The nucleosome positioning in ±500 bp from top 5000 detected in vivo (a) SPI1 biding sites and (b) CEBPB biding sites from Heinz’s sonicated H3K4me1 ChIP-Seq data for 0 hour (blue) and 1 hour (red) after stimulation. This profile is generated using a method consistent with that outlined by Heinz et al 2010. The SPI1 profile in the data for 0 hour (the blue curve in the top figure) seams to be just to a large and wide peak on each side of binding sites, which provide almost no information about nucleosome positions. 165 −200 0 100 300 0 10 0 20 0 30 0 40 0 50 0 SPI1 Distance to position 1 nucleosomes PI N G  p ro file 0hr 1hr −200 0 100 300 0 20 0 60 0 10 00 CEBPB Distance to position 1 nucleosomes PI N G  p ro file 0hr 1hr Figure C.7: Nucleosome positioning profiles in Heinz’s data. Noting that the distance for proximal nucleosomes shifts varied between SPI1 binding regions. To get clearer profiles, we align the data based on the loca- tions of the flanking (position 1) nucleosomes after SPI1 binding at 0hr before tamoxinfen addition. It was not clear how to assign a strand ori- entation for SPI1 and CEBPB enhancers, and using absolute distances allowed us to align to +1 and -1 nucleosomes at the same time. This figure confirmed the SPI1 nucleosome shift away from binding sites. 166 ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 0 10 20 30 40 50 1. 5 2. 0 2. 5 3. 0 thresholds of Negative Binomial test number of reads in reference nucleosome fo ld −c ha ng e cu to ff at  s ig ni fic an t l ev e l 0 .1 Figure C.8: The adaptive threshold of local enrichment based on the number of reads and the peak width of both nucleosomes, obtained from the negative binomial model. The parameter “prob” in NB model is as- sumed to be 0.5 in this figure. To calculate the adaptive threshold the significance level used is 0.1 (i.e. 90% of quantile in NB distribution). 167 Figure C.9: The sorted PING scores in loge scale for all nucleosomes pre- dicted in mouse islet (black) and liver (red) from sonicated ChIP-seq data (Hoffman 2010). The blue vertical dotted line shows top 50000 (the vertical blue line) nucleosomes is a reasonable elbow point of the score curves. 168

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0105109/manifest

Comment

Related Items