UBC Theses and Dissertations


Mixture models for analysing high throughput sequencing data

Zhang, Xuekui


The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, with an emphasis on sonicated ChIP-Seq. To this end, we developed several variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented in Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it in a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then showed that PICS-predicted binding sites were more consistent with computationally predicted binding motifs than those of the alternative methods. For motif discovery using transcription factor binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were consistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It compares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate between functional and non-functional transcription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above use single-end sequencing data. At the end of the thesis, we briefly discuss the processing of paired-end data, which we are currently investigating.


Attribution-NonCommercial-NoDerivatives 4.0 International