UBC Faculty Research and Publications

Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif… Worsley Hunt, Rebecca; Mathelier, Anthony; del Peso, Luis; Wasserman, Wyeth W Jun 13, 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12864_2013_Article_6188.pdf [ 1.6MB ]
JSON: 52383-1.0228397.json
JSON-LD: 52383-1.0228397-ld.json
RDF/XML (Pretty): 52383-1.0228397-rdf.xml
RDF/JSON: 52383-1.0228397-rdf.json
Turtle: 52383-1.0228397-turtle.txt
N-Triples: 52383-1.0228397-rdf-ntriples.txt
Original Record: 52383-1.0228397-source.json
Full Text

Full Text

RESEARCH ARTICLE Open AccessImproving analysis of transcription factor bindingsites within ChIP-Seq data based on topologicalmotif enrichmentRebecca Worsley Hunt1,2, Anthony Mathelier2, Luis del Peso3 and Wyeth W Wasserman2,4*AbstractBackground: Chromatin immunoprecipitation (ChIP) coupled to high-throughput sequencing (ChIP-Seq)techniques can reveal DNA regions bound by transcription factors (TF). Analysis of the ChIP-Seq regions is now acentral component in gene regulation studies. The need remains strong for methods to improve the interpretationof ChIP-Seq data and the study of specific TF binding sites (TFBS).Results: We introduce a set of methods to improve the interpretation of ChIP-Seq data, including the inference ofmediating TFs based on TFBS motif over-representation analysis and the subsequent study of spatial distribution ofTFBSs. TFBS over-representation analysis applied to ChIP-Seq data is used to detect which TFBSs arise morefrequently than expected by chance. Visualization of over-representation analysis results with new composition-biasplots reveals systematic bias in over-representation scores. We introduce the BiasAway background generatingsoftware to resolve the problem. A heuristic procedure based on topological motif enrichment relative to theChIP-Seq peaks’ local maximums highlights peaks likely to be directly bound by a TF of interest. The results suggestthat on average two-thirds of a ChIP-Seq dataset’s peaks are bound by the ChIP’d TF; the origin of the remainingpeaks remaining undetermined. Additional visualization methods allow for the study of both inter-TFBS spatialrelationships and motif-flanking sequence properties, as demonstrated in case studies for TBP and ZNF143/THAP11.Conclusions: Topological properties of TFBS within ChIP-Seq datasets can be harnessed to better interpret regulatorysequences. Using GC content corrected TFBS over-representation analysis, combined with visualization techniques andanalysis of the topological distribution of TFBS, we can distinguish peaks likely to be directly bound by a TF. The newmethods will empower researchers for exploration of gene regulation and TF binding.Keywords: Chromatin immunoprecipitation, ChIP-Seq, Motif prediction, Over-representation analysis, Regulation,Sequence analysis, Transcription factor, Transcription factor binding site, VisualizationBackgroundDelineating the specific cis-regulatory elements governinggene transcription has been at the forefront of genomeresearch. The landmark release of the ENCODE projectfindings supports the long-standing view that a greaterportion of the human genome contributes to the regula-tion of gene activity than the ~2% encoding proteins.High-throughput chromatin immunoprecipitation (ChIP)analysis of protein-DNA interactions has been trans-formative, highlighting the genomic regions that containregulatory elements, and thus reducing the computationalsearch space for transcription factor binding sites (TFBSs)many fold. The coupling of ChIP to high-throughputsequencing (ChIP-Seq) is empowering researchers toinvestigate DNA binding transcription factors (TFs)and their regulation of the genome. We present here aset of convenient, practical visualization and bioinformat-ics approaches, which facilitate interpretation of ChIP-SeqTF binding data.The bioinformatics foundations of generating ChIP-Seqdata have been well explored. Given mapped DNA* Correspondence: wyeth@cmmt.ubc.ca2Centre for Molecular Medicine and Therapeutics, Child and Family ResearchInstitute, Department of Medical Genetics, University of British Columbia,Vancouver, BC, Canada4Centre for Molecular Medicine and Therapeutics, 950 W.28th Avenue,Vancouver, BC V5Z 4H4, CanadaFull list of author information is available at the end of the article© 2014 Worsley Hunt et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of theCreative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons PublicDomain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in thisarticle, unless otherwise stated.Worsley Hunt et al. BMC Genomics 2014, 15:472http://www.biomedcentral.com/1471-2164/15/472sequence reads from a ChIP experiment, “peak calling”software is applied to quantitatively delineate regionswithin which a greater frequency of mapped reads areobserved than expected. The peak regions are reportedas a pair of coordinates, ranging from 1 bp to >5000 bpwide, often accompanied by a score and the position atwhich the read frequency local maximum is observed.Within the peak regions, the coordinates of specificTF-DNA interactions can be computationally inferred,using a TFBS profile for the indicated TF. Such profiles,most commonly in the form of Position Frequency Matri-ces (PFMs), are available for a subset of TFs from data-bases like JASPAR [1] or HoCoMoCo [2], or can becomputationally identified from the ChIP-Seq data usingde novo motif discovery tools [3-6]. Analysis with a PFMconverted to a weighted TFBS profile (Position WeightMatrix – PWM) yields a score that reflects the similarityof the sequence of interest to the modeled binding sites.Although ChIP-Seq data reduces the acknowledged speci-ficity problem of detecting short (6-15 bp), degeneratemotifs bound by a TF in the genome, the problem of TFBSprediction is not perfectly resolved as the ChIP-Seq peaksare often 20-fold or greater in length than the TFBS beingsearched for. As they become more widely used, higherresolution methods, such as ChIP-exo [7], are expected toreduce the difficulty.A proportion of ChIP-Seq peaks may not contain a ca-nonical TFBS for the ChIP’d TF above background expect-ation; a confounding property of the data that presumablyarises from a combination of biological, experimental, andcomputational influences. While these regions may resultfrom indirect interactions between the TF and the DNA,the multi-epitope specificity of polyclonal antibodies andthe tendency for chromatin to shear at promoter regions[8,9] may give rise to peaks not specific to the ChIP’d TF.The subset of peaks lacking the TF’s canonical motif iscommonly treated as equivalent to the subset with motifs.The segregation of a ChIP-Seq dataset into the two classescould lead to insights into individual TFs mechanisms ofregulation and reveal common properties of regions lack-ing TFBS. The analysis of specific TF bound regulatory re-gions and TFBSs from ChIP-Seq defined peak regions canbe refined, and such improvement will consequently in-form and improve our utilization of ChIP-Seq data acrossa spectrum of analyses.In this report, we introduce a set of visualizationmethods and bioinformatics approaches to improve thestudy of TFBSs within ChIP-Seq regions, and demonstratethe application of these methods for the generation ofnew insights into regulatory sequences. We focus on threekey challenges: known motif over-representation analysis,spatial visualization of TFBS positions, and determinationof parameters for TFBS analysis. For over-representationanalysis, we introduce the BiasAway tool to account forthe non-random properties of regulatory sequences; suchaccounting has strongly informed the design of de novomotif discovery methods, but has been inadequately ad-dressed for over-representation studies. We introduce aset of visualization approaches that reveal topological pat-terns of motif positions within ChIP-Seq data, helping todelineate the subset of peaks likely to be directly bound bythe ChIP’d TF. The visualization methods directly informthe selection of parameters for motif prediction, a long-standing challenge in regulatory sequence analysis. Appli-cation of the procedure reveals that on average 61% ofChIP-Seq peak regions contain the canonical motif for theChIP’d TF. The methods are applied to two cases relatedto TBP and ZNF143/THAP11. Access to the new methodsand visualization approaches will provide the researchcommunity with improved capacity to analyze and in-terpret TF ChIP-Seq data.ResultsComposition studies reveal the influence of non-randomproperties of the metazoan genome on the interpretationof ChIP-Seq dataThe non-random properties of genome nucleotide se-quence composition have been the subject of extensiveinvestigation over decades. Key to the study of regulatorysequences, are observations that promoters and other openchromatin tend to have higher GC content, and that theobserved composition has a wide distribution. We evalu-ated the relationship between mononucleotide compositionand TF ChIP-Seq data (see Additional file 1: Text S1).We found that each TF exhibits a range of nucleotidecomposition environments in which it binds. The GCcontent distributions differ between TFs profiled in thesame cells and between different cell types profiled forthe same TF (Additional file 2: Figure S1). There is aclear relationship between the GC content of the peakregions and the multiplicity of TFBSs (Additional file 1:Text S1, Additional file 3: Figure S2).Consequences of biased composition for TFBS motifover-representation analysesIt is a central practice in the analysis of regulatory regionsto evaluate the frequency of predicted TFBSs in the regionsof interest. This type of analysis depends on comparingpredicted motif frequencies against a reference background.Such analysis of TFBS over-representation could be influ-enced by the compositional properties of the peaks andthe background sequences. While accounting for back-ground composition properties has been explored for denovo pattern discovery, the best approaches for and im-pact of composition correction on over-representationanalysis of known motifs have not been resolved.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 2 of 19http://www.biomedcentral.com/1471-2164/15/472Visualizing composition bias in TFBS over-representationanalysesTo assess composition corrective measures for TFBSover-representation analysis, we generated a visualizationmethod, which we term Composition-Bias plots (CB-plots).The CB-plots are a method for alerting a researcher tothe impact of composition on the reported TFBS over-representation results, and are generally more informativethan a ranked list of TFs and motif over-representationscores. A CB-plot presents the average GC content of aTF’s binding profile (the PFM) on the x-axis and the over-representation score of the TF’s predicted binding sites onthe y-axis. If an unsuitable background has been used inthe over-representation analysis, the CB-plot distributionof motif over-representation scores will reveal a bias towardwhichever end of the nucleotide composition spectrum(GC-rich or AT-rich) that the target sequences dominaterelative to the background. Figure 1 illustrates the CB-plotsand provides examples of over-representation results froman E2F1 ChIP-Seq dataset. Figure 1a presents results thatneed to be corrected for over-representation score bias, asoutlined in the next section, while Figure 1b presents theend results after correction.The impact of background composition selection on TFBSover-representation resultsFor many datasets the calculation of meaningful over-representation scores requires correction of bias. We assessedapproaches to create and/or retrieve a background matchedto a set of target sequences’ compositional properties inorder to resolve the enrichment scoring bias engenderedby compositional extremes of some ChIP-Seq datasets.For evaluation purposes we retained the naïve backgroundmodel of random genomic sequences in our assessment.A second background was generated by a 3rd orderMarkov model (RSAT package [10]). We implemented abackground sequence generator, BiasAway, to generate6 additional background models; these backgrounds de-rived from mono- and di- nucleotide shuffled sequences,and genomic sequences matched to the GC content of thetarget data (see Additional file 1: Text S1 for details). Weincluded one last set of backgrounds generated by “knownmotif” over-representation analysis feature of HOMER 2[11]. HOMER 2 is the only software of which we areaware that uses GC composition matched backgroundsfor TFBS over-representation analysis. These backgroundswere then evaluated against 43 human TF ChIP-Seqdatasets, with 166 PWMs (JASPAR 4.0_alpha develop-ment database [1]) via the oPOSSUM 3.0 TFBS over-representation software [12]. Four backgrounds werere-evaluated for platform-independence of both biasand bias correction with the ASAP tool (see Additionalfile 1: Text S1 and Additional file 4: Figure S3).We established several measures, based on the idealexpectation of four properties pertaining to the resultsof TFBS over-representation analyses. In essence, we ex-pect to see few outlying over-representation scores forTF binding models, one of which should be for theFigure 1 Composition-bias plots reveal systematic TF PFM nucleotide content bias in motif over-representation analysis. Foregrounddata was obtained from an E2F1 ChIP-Seq study and processed using the oPOSSUM TFBS over-representation analysis software. Each plotpresents a motif over-representation score (y-axis) relative to the GC content of the PFMs (x-axis). The over-representation scores reflect thedifference between the frequency of motifs in the foreground compared to a background (the background differs between panels). The namesof the 5 top ranked PWMs are displayed on the plot. The dotted line at over-representation score 100 is an arbitrarily placed visual point ofreference. The sequence logos represent the binding models for E2F1 and E2F4 respectively. (a) Background composed of randomly selectedmappable genome sequences. (b) Background generated, using BiasAway, with a GC composition matched to the ChIP-Seq sequences anddrawn from the set of mappable sequences.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 3 of 19http://www.biomedcentral.com/1471-2164/15/472ChIP’d TF’s binding model, with the majority of bindingmodels scoring close to 0. We summarize the results foreach of the alternative background models using thesemeasures in Figure 2 and Additional file 5: Figure S4.We summarize the background evaluations on 201 bpsequences here, with further details available in Additionalfile 1: Text S1 for both 201 bp and 401 bp sequences.Additional file 6: Table S1 lists the rank of the ChIP’d TFfor each analysis, and Additional file 7: Figure S5 providesthe CB-plots of E2F1 for six background analyses. A naïvebackground of randomly selected sequences resulted ina systematic bias, or skew, of over-representation scorestowards GC-rich binding sites for some datasets. Themononucleotide backgrounds were not as favourable asthe dinucleotide shuffled backgrounds, with or withoutthe sliding window. While neither suffered from bias to-wards GC-rich binding sites, the dinucleotide shuffledbackgrounds produced results with over-representationscores closer to zero and predicted the ChIP’d TF’smotif in the top 5 results of more than 60% of the data-sets. The 3rd order Markov model performed as well asthe dinucleotide background for all measures exceptthat of capturing the ChIP’d TF’s motif in the top 5 re-sults. The best performing background was genomic se-quences selected to match the GC nucleotide distributionof the target sequences. The BiasAway GC nucleotidebackground and HOMER generated backgrounds per-formed comparably for most measures, with the exceptionthat the BiasAway background resulted in an 11 percent-age point improvement for the inclusion of the ChIP’d TFbinding model as one of the top 5 over-represented.The CB-plots have been incorporated into the oPOSSUMover-representation analysis software, and BiasAway is avail-able as open-source code posted on GitHub: https://github.com/wassermanlab/BiasAway/archive/noRPY.zip, and is be-ing incorporated into the oPOSSUM 3.0 web interface.Assessing TFBS predictions within ChIP-Seq regionsSubsequent to motif over-representation analysis, attentionturns to investigating the candidate TFBS binding profiles.We outline here a complementary set of enrichment as-sessment methods specifically oriented to providing suchFigure 2 Background sequence selection impacts motif over-representation analyses. (a) For each background, the fraction of the 43analyses that reported the ChIP’d TF in the top 5 enriched PWMs from a particular background (x-axis) is plotted against the average skew of theover-representation results for each background’s 43 analyses. Skew is the negative slope of the line fitted to the over-representation scores versusPFM GC content (i.e. values as displayed in Figure 1). The ideal is to have a large x-axis value (vertical dashed line) and an average skew of zero(horizontal dashed line). (b) and (c) summarize the standard deviation (y-axis) and mean (x-axis) of the ‘non-outlier’ oPOSSUM over-representationscores for 10 backgrounds against each of 43 ChIP-Seq datasets, where panel (b) displays the average value for each background across the 43datasets and panel (c) displays the individual value of 430 analyses. The ideal result would be situated at the origin (the intersection of thedashed lines). For all panels each of the 10 backgrounds tested is denoted as a single colour: Light green circle – randomly chosen backgroundfrom the dataset of mappable sequences, dark green cross – randomly chosen background from the dataset of DNase accessible sequences,orange circle – mononucleotide shuffled background, brown cross – mononucleotide shuffled background within a sliding window, blackcircle – dinucleotide shuffled background, gray cross – dinucleotide shuffled background within a sliding window, magenta triangle – 3rdorder Markov model generated background sequences, blue circle – background selected from the mappable sequences dataset to matchthe GC composition of the target sequences, light blue cross – background selected from the mappable sequences dataset to match thedistribution of GC composition in sliding windows of the target sequences, and red triangle – GC background from HOMER 2.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 4 of 19http://www.biomedcentral.com/1471-2164/15/472insights. We implement a heuristic algorithm for directbinding (HADB) to determine the subset of ChIP-Seqpeaks with high confidence binding sites, based on TFBSenrichment proximal to the peakMax, and to derive aPWM-specific scoring threshold. As with the previoussection, visualization is a key method for discerning theproperties of the data. Within the R environment [13] weimplemented a set of plotting methods (TFBS-landscapeview, TFBS-bi-motif view, and Dinucleotide-environmentview) for displaying the properties of the ChIP-Seq regionsrelative to the locations of predicted TFBS. The code anduser-guide for the visualization methods and calculatingthe HADB thresholds are posted on GitHub: https://github.com/wassermanlab/TFBS_Visualization/archive/master.zip.TFBS-Landscape viewThe TFBS-landscape view for ChIP-Seq data facilitatesqualitative assessment of the non-random relationship be-tween predicted motif scores (for the top scoring motif ineach sequence) and the peakMax position. The informa-tion in the TFBS-landscape view is translatable into aquantitative assessment of TFBSs, as presented by theHADB algorithm below. Motif proximity to the peakMaxhas been demonstrated on 14 ChIP-Seq datasets by [14]and for 3 datasets across 11 peak calling algorithms [15].The global tendency for ChIP-Seq data to yield motifs forthe ChIP’d TF proximal to the peakMax is systematicallyconfirmed here with a study of ~340 ChIP-Seq datasetsfor ~95 TFs (human and mouse).In a TFBS-landscape view, the left plot displays thedistance of the maximum scoring TFBS to the peakMaxfor each sequence on the x-axis (the peakMax is x = 0),and the PWM predicted TFBS score on the y-axis(Figure 3). This plot conveys both the quality of the mo-tifs observed, and allows users to detect enrichment in athreshold-independent manner. The right plot presentsthe density of top-scoring predicted motifs (y-axis) ver-sus the distance from the peakMax (x-axis), similar toplots presented by CentriMo (MEME suite [14]) andRSAT::peak-motifs tools [6]. The right side plots includetwo lines: a 2 bp resolution density of all peaks’ motifs,and a 5 bp resolution density of the subset of peaks con-taining a motif with score equal to or greater than 85 tocapture cases with low enrichment of strong scoringmotifs. As shown in Figure 3a for the TF C/EBPB, aChIP-Seq experiment can show a strong enrichment forthe ChIP’d TFs binding motifs. Plots representing thewide variety of enrichment characteristics are displayedin Figures 3a–l. The unique shapes displayed in Figure 3derive from a combination of properties for the pre-sented TF’s binding motifs such as width of the bindingsite, and the presence of other motifs enriched at thepeakMax (see Additional file 1: Text S1 for further de-tail and discussion).Using the HADB algorithm to distinguish direct bindingand for the selection of PWM motif score thresholdsThe TFBS-landscape plots present a finite zone of non-random TFBS enrichment around the peakMax for theChIP’d TF. Quantitative analysis of the motif densitiesfor a given TF can delineate the width of sequence prox-imal to the peakMax that is enriched for the TF’s motifabove chance expectation. The enrichment zone pro-vides a region of confidence for predicted TFBSs andthus provides a rational focus for downstream analyses.As shown in Figure 4a-b, the HADB procedure (seemethods) determines the distance from the peakMax tothe point where the frequency of the top scoring motifapproaches the distal flank motif frequency (distal =250-500 bp from the peakMax). Based on the pool ofhuman ChIP-Seq datasets and focusing on the topranked motif of each sequence, the width of enrichmentaround the peakMax ranges from ~50-350 bp with anaverage of 209 ± 34 bp or 194 ± 10 bp for human andmouse respectively (mean width ±mean of width differ-ences; Figure 4c and Additional file 8: Figure S6a). Asses-sing enrichment with the inclusion of lower ranked motifsdid not change the width of the observed enrichmentzone.Deciding upon an appropriate minimum scoring thresh-old for PWM-based analysis is an ongoing challenge, aseach TF has unique characteristics. Quantitative analysisof the peak TF motif densities relative to chance expect-ation can delineate a suitable threshold for motif scoresfor a given PWM. We set the motif score threshold as thelowest motif score for which the count of motifs withinthe heuristic enrichment zone consistently exceeded thecount of motif scores in comparably sized backgroundregions (see methods; Additional file 9: Figure S7a-b).The motif score threshold does not vary greatly for anindividual PWM across multiple datasets. The averagerelative score for each of the human and mouse datasetswas 82 ± 1 (mean score ±mean difference between scores;Additional file 9: Figure S7c-d). A motif score thresholdspecific to each TF PWM is important for downstreamanalyses such as the study of TFBS altering mutations.The PWM score thresholds for the studied TF bindingprofiles are provided in Additional file 10.We used the heuristic boundaries of enrichment andmotif score threshold for each dataset to estimate theproportion of peaks that contain at least one TFBS forthe ChIP’d TF within the bounds of the thresholds. Wefound that on average, for the datasets studied, ~61%of a ChIP-Seq dataset contains the ChIP’d TF’s canonicalmotif that is both within the peakMax enrichment zoneand greater than the motif score enrichment threshold(Figure 4d – human mean 61%, and Additional file 8:Figure S6b – mouse mean 65%). The source of theremaining ~40% of peaks may be from such factors asWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 5 of 19http://www.biomedcentral.com/1471-2164/15/472Figure 3 TFBS-landscape views inform of motif enrichment and are diverse in shape and density. A TFBS-landscape view consists of a plot(left) showing the location of the top scoring motif relative to the peakMax (x = 0) on the x-axis and the motif score on the y-axis; the right plotpresents a 2 bp resolution density of motif distances to the peakMax (black) and 5 bp resolution for motifs with a motif score equal to or greaterthan 85 (green). All plots display some degree of enrichment at the peakMax and a lower limit on motif score enrichment. (a) C/EBPB motifs in aC/EBPB ChIP-Seq dataset. (b) C-MYC motifs are enriched at the peakMax of the C-MYC ChIP-Seq dataset, but many top-scoring motifs arerandomly dispersed. (c) NFYA motifs in a NFYA ChIP-Seq dataset exhibit enrichment around the peakMax, with high scoring motifs distinct fromthe majority of scores in the background regions (d) ZNF143 motifs in a ZNF143 ChIP-Seq dataset have low enrichment but some high scoringmotifs are distinct from the background. (e) and (f) present JUN motif enrichment in two JUN datasets from different cell types with distinctbackground motif densities. (g) The REST motif is strongly enriched in a REST ChIP-Seq dataset across a large motif score range at the peakMaxwith a low density of background motifs. (h) MYOD motif enrichment at the peakMax of MYOD ChIP-Seq data. (i) HNF4A motif enriched proximalto the peakMax of HNF4A ChIP-Seq data. (j), (k) and (l) present motif enrichment for a TF that was not the ChIP’d target: (j) CTCF motifs areslightly enriched offset from the peakMax of H3k4me3 ChIP-Seq. (k) CTCF motifs are enriched offset from the peakMax in RAD21::cohesinChIP-Seq data. (l) ELK4 motifs in a NELFE ChIP-Seq dataset show an enrichment offset from the peakMax.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 6 of 19http://www.biomedcentral.com/1471-2164/15/472indirect binding, shearing properties of open chromatin,peak calling errors, or antibody properties.Regions predicted by HADB to directly bind the ChIP’d TFare more likely to replicate between experimentsWhen available, replicate experiments are a useful valid-ation of regions ChIP’d by a TF. While many experi-ments lack replicates, ~100 ENCODE datasets includedreplicates which we used to evaluate whether the HADBmethod was differentiating between peaks that co-occurbetween replicates (peakMax within 500 bp) versus thosepeaks that occur in only one of the replicates (seemethods). We found that the set of peaks with theChIP’d TFs motif central to the peakMax were signifi-cantly enriched for replicated peaks (92% of datasetsproduced a Fisher exact test one tailed p-value <0.001(91% produced scores <1e-09)) (see Additional file 1:Figure 4 Defining the TFBS zone of enrichment around the peakMax. (a) A visual depiction of the enrichment zone determined with aheuristic procedure, as described in methods. The x-axis presents the upper limit of each 5 bp bin, and the bins are the absolute distance of amotif from the peakMax. The y-axis shows the proportion of peaks from the dataset with a motif in a 5 bp bin. The horizontal green line is fit tothe distal background bins, and the horizontal line in blue is the allowance line (see methods). The blue vertical dashed line indicates where theproportion of peaks in a non-background bin approaches the allowance line without falling below it – this line is the heuristic distance thresholdfor motif enrichment around the peakMax. (b) The NFYB sequence logo and the TFBS-landscape view for NFYB motifs in NFYB ChIP-Seq data.The heuristic enrichment zone is between the blue vertical dashed lines. The black vertical lines indicate the beginning of the distal backgroundregion (spanning 200-500 bp from the peakMax). (c) The width of the motif enrichment zone (y-axis) for human ChIP-Seq datasets (x-axis);multiple datasets for a TF were averaged to obtain one enrichment zone value per TF. Vertical bars are the average differences between all of theenrichment zone widths for a TF. The red horizontal line is the mean. (d) The proportion of peaks within the enrichment zone for a TF’s set ofChIP-Seq datasets were averaged. The x-axis provides, for each of 85 TFs, the mean proportion of peaks with a motif scoring above the motifscore threshold and located within the zone of enrichment (mean 0.60, median 0.61).Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 7 of 19http://www.biomedcentral.com/1471-2164/15/472Text S1 for similar results with different peakMax separ-ation parameter settings)).Regions predicted by HADB to directly bind the ChIP’d TFare enriched for GO terms related to the TF’s keybiological processTo assess the functional enrichment of peaks defined bythe HADB method, we performed GO enrichment ana-lyses, using the GREAT software [16]. As GO enrichmentanalysis for TFBSs is somewhat problematic due to the di-versity of processes a TF may regulate and the proximityof TFBS to the genes they regulate, we selected two TFsknown to be highly specific for a biological process: SRF, amaster regulator of the actin cytoskeleton and contractileprocesses [17], and NFE2L2, a key regulator of oxidativestress response [18]. We submitted peaks from the wholeChIP-Seq dataset, peaks from the subset of regions in-ferred by HABD to be directly bound by the ChIP’d TF,and those not inferred to be directly bound. GREAT ana-lyses reported an enrichment of terms for the expectedbiological processes for SRF and NFE2L2 only in the sub-set of peaks with inferred direct binding of the ChIP’d TF(see Additional file 1: Text S1, Additional file 11: Figure S8and Additional file 12: Figure S9). The results highlighthow the use of the HADB method can enhance the inter-pretation of ChIP-Seq data.Complementary visualization methods for spatialChIP-Seq patterns: TFBS bi-motif view andDinucleotide-environment viewWe present two visualization methods that further em-power investigation of spatial patterns in ChIP-Seq datathat build upon the foundation of peakMax-proximalenrichment defined by HADB.TFBS-bi-motif viewThe prior TFBS-Landscape view focused on the ChIP’dTF in isolation. A substantial body of literature focuseson cooperativity between TFBS, both in homotypic andheterotypic forms [19-21]. In developing the HADB ana-lyses we had observed that in some cases a second motiffor the ChIP’d TF was enriched proximal to the peak-Max, which motivated the creation of a TFBS-bi-motifview (examples are presented in Figure 5). The view al-lows qualitative assessment of both spatial relationshipsbetween the two motifs and their distance from thepeakMax. If the first motif is enriched around the peak-Max there is a horizontal band of enrichment along y =0. If the 2nd motif is enriched around the peakMaxthere is a negatively sloped band of enrichment on a di-agonal. Where the origin displays increased enrichmentrelative to the horizontal band (y = 0), there is an enrich-ment of both motifs at the peakMax. The diagonal limitsof the plot arise from the uniform length of the peakregions analyzed. The right plot in the TFBS-bi-motifview presents a histogram of the distances between thetwo motifs. The proximity of two motifs can indicate po-tential interactions or relationships.Three examples of bi-motif views are presented inFigure 5. Homotypic clustering is observed for ESRRB andNFYB. The NFYB clustering is consistent with publishedfindings for the NF-Y complex [22], while the ESRRB ob-servation suggests that properties observed for ESRRGmay extend to other members of the TF family [23]. Athird plot for SRF and ELK4 (also called ‘SRF accessoryprotein’), using SRF ChIP-Seq data, presents heterotypicbinding site results, consistent with known interactionsbetween these TFs [24].Dinucleotide-environment viewGiven the position of the motif of interest within a re-gion, it can be informative to visualize the nucleotidecomposition properties in the flanking sequence. Di-nucleotide sequence properties flanking TFBSs can beevaluated by aligning sequences at the TFBS (or a fea-ture near the TFBSs, such as transcription start sites)and scoring the dinucleotide frequencies in the adjacentflanks. As shown in the Dinucleotide-environment viewfor an IFN-γ induced STAT1 TF ChIP-Seq experiment(Figure 6), there is increased nucleotide patterning inthe sequences adjacent to the HADB-inferred directlybound motifs, compared to the peaks without peakMax-proximal motifs. Further analysis of the data reveals thepresence of a common repeat sequence within a largesubset of the peaks containing the highest scoring STAT1motifs, consistent with literature which reports thatSTAT1 binding at MER41 repeat elements increases withIFN-y induction [25]. Further illustration of insights pro-vided by the dinucleotide-environment view is provided inthe case studies.An Applied Case Study of ChIP-Seq AnalysesTo demonstrate how over-representation analysis andthe visualization methods are complementary, and en-hance the analysis of ChIP-Seq data, we applied theTFBS-landscape view and the Dinucleotide-environmentview to two case studies. The first focuses on the TATAbinding protein (TBP) in the MEL mouse cell line, whilethe second explores the sequence properties proximal toZNF143 TFBSs.Employing both a motif over-representation analysis andTFBS-landscape view, we noted that the TBP ChIP-Seq ex-periment had a low occurrence of the canonical TBPTATA motif proximal to the peakMax (<12% of the data-set using the HADB method described above; Figure 7a).However, TFBS motif over-representation analysis also in-dicated a large number of secondary TFBSs are enriched.We investigated the reported enriched motifs using theWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 8 of 19http://www.biomedcentral.com/1471-2164/15/472Figure 5 TFBS-bi-motif view for visualization of motif spatial arrangements. The left plot of a TFBS-bi-motif view presents the distance ofthe primary motif to the peakMax of each sequence on the y-axis, and the distance of a second motif relative to the primary motif on the x-axis.A band of enrichment at y = 0 indicates enrichment near the peakMax for the primary motif, while a diagonal band of enrichment (with anegative slope) indicates enrichment near the peakMax for the second motif. The diagonal limits of the plot arise from the uniform length of thesequences (here 1001 bp). The right plot is a histogram of the distances between the two motifs. The gap in both plots results from the exclusionof overlapping motifs. (a) ESRRB motifs in an ESRRB ChIP-Seq data set. The top-scoring ESRRB motif is the primary motif, and the second-bestmotif is the second motif. (b) NFYB motifs in a NFYB ChIP-Seq data set. The top-scoring NFYB motif is the primary motif, and the second-bestmotif is the second motif. (c) SRF ChIP-Seq dataset. The SRF top-scoring motif is the primary motif, and the ELK4 top-scoring motif is thesecond motif.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 9 of 19http://www.biomedcentral.com/1471-2164/15/472TFBS-landscape view, and saw an unusual pattern emerge:spikes of lower scoring motif enrichment at distances fromthe peakMax specific to each tested TF (Additional file 13:Figure S10). As shown in Figure 7b, there are TF motif en-richment patterns up to 200 bp from the peakMax. ADinucleotide-environment view, using the subset of peakswith a high scoring TATA motif, presented an enriched se-quence pattern up to 100 bp either side of the TATA motifand an AA-rich sequence ~200 bp distant on the motifstrand (Figure 7c). The pattern was subsequently identifiedto arise from enrichment of SINE elements (predomin-antly B2-B4). The percentage of peaks containing SINEelements is 14.1% in the TBP peaks, while in GC contentmatched controls or DNase accessible regions it drops to5.2% and 5.6% respectively. While SINE elements areknown to contribute to the formation of promoter regionsfor genes [26], the 14.1% fraction of the TBP ChIP-Seq se-quences is striking.The seven zinc finger ZNF143 TF has been reportedto bind to an ~18-21 bp nucleotide sequence [27,28].The ZNF143 ChIP-Seq dataset, like the TBP dataset, ex-hibits a low enrichment of the canonical ZNF143 motifaround the peakMax (~5% of the dataset; Figure 3d).We extracted the subset of regions with a motif in theHADB enrichment zone, repeat-masked the sequences,and generated a Dinucleotide-environment view alignedon the top-scoring motif in each region (Figure 8a). Thisdisplay reveals additional striking features not capturedin the initial model, including a strong 5’ pattern outsidethe PWM-covered positions. This 5’ flank pattern ispresent in about 35% of the aligned sequences, and is notpresent in alignments of motifs outside of the HADB en-richment zone. A Dinucleotide-environment view of se-quences with a high scoring (score >85) ZNF143 motif,reveals an increased enrichment of the 5’ flanking patternfrom 35% to ~50% of the sequences (Figure 8b). A subse-quent search of the literature revealed a recent study byNgondo-Mbongo et. al. [29], showing that a 2nd protein,THAP11, binds to the extended flank seen in theDinucleotide-environment view and half of the ZNF143motif in a manner mutually exclusive to ZNF143.DiscussionWe have introduced methods that allow researchersconfronted with ChIP-seq data for TF binding to extractmore insights into TFBS. The first challenge in suchstudies is often the identification of contributing TFsbased on known motif over-representation analysis. Suchmethods can both support the quality of data based onthe identification of the ChIP’d TF’s motif, as well ashighlight additional TFs that may be cooperatively actingwith the former. In this report we present two key com-ponents related to over-representation analysis. First,Composition-bias (CB) plots are introduced to displayskew in the GC-content of the enriched motifs, a com-mon occurrence that reflects the non-random compos-ition properties of the ChIP-Seq regions recovered.Second, the BiasAway software tool is introduced togenerate composition-matched background sequencesets that correct for the skew. Once relevant TF bindingprofiles are identified, the challenge shifts to the identifi-cation of reliable TFBS within the broader ChIP-Seqregions. Every ChIP-Seq dataset is a mixture of directlybound segments and regions that may reflect alternativeinfluences. We introduce TFBS-landscape plots as aconvenient form for the visualization of the topologicalFigure 6 Dinucleotide-environment view plots the dinucleotideenrichment of the dataset around the motif of interest. Thex-axis shows the dinucleotide offset from the centre of the motif,and the y-axis is the proportion of the dinucleotide in the ChIP-Seqsequences. The STAT1 motif is the high frequency pattern in thecentre of the plot. The sequence logo for STAT1 is above the highfrequency pattern. (a) The subset of STAT1 ChIP-Seq peaks containinga STAT1 motif in the enrichment zone with a motif score greater thanor equal to 85. The magenta box highlights the enrichment ofdinucleotides in the flanking regions of STAT1 motifs proximal tothe peakMax. (b) The subset of STAT1 peaks with a motif outsidethe enrichment zone and a motif score of 85 or greater. Themagenta box highlights the lack of dinucleotide enrichment inthe flanking regions of motifs found distal to the peakMax.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 10 of 19http://www.biomedcentral.com/1471-2164/15/472distribution of TFBS within ChIP-Seq segments. Basedon the topology, we present the HADB method for theselection of PWM thresholds for the selection of can-didate TFBS consistent with direct binding events. TheHADB approach provides a quantitative method forselecting PWM thresholds – an enduring challenge inregulatory sequence analysis. Systematic analysis over ~340ChIP-Seq datasets indicates that an average of 61% ofpeaks are classified as directly bound. The biochemical,experimental, and/or computational influences that ac-count for the remaining 39% of regions remain to beresolved in future investigation. In the third stage ofanalysis, spatial properties relative to the defined TFBSand peakMax are analyzed. Bi-motif plots allow thestudy of inter-TFBS spacing concurrent with HADB as-sessment. To gain insight into additional properties atthe edges of reliable TFBS, Di-nucleotide-environmentviews are introduced through which compositional en-richment in the flanking regions can be revealed. Incombination, the software and methods allow forresearchers to launch deeper explorations of TFBSwithin ChIP-Seq data.The work builds on a substantial foundation of bio-informatics approaches to regulatory sequence analysis.Over-representation analysis of known TF motifs comple-ments de novo motif discovery methods such as MEME.The later methods often incorporate better backgroundrepresentation to account for the non-random propertiesof sequences. The CB-plots highlight the problem forover-representation methods, providing a useful meansfor researchers to rapidly assess the quality of results. Weintroduced the BiasAway tool to correct for the bias. TheHOMER 2 package includes an unpublished option forbackground correction of over-representation analysis,which does not perform as well as the GC compositionmatched option in BiasAway (HOMER was originally de-scribed in [11]). The open-access BiasAway provides ageneral purpose background generation capacity, whichcan be used as a component in other tools going forward.Both BiasAway and the CB plots are being incorporatedFigure 7 Case study of TBP in the mouse MEL cell-line. (a) TFBS-landscape view for TBP PWM on a TBP dataset. The top plot presents the topscoring motif distance to the peakMax (x-axis) and the motif relative score on the y-axis. The bottom plot presents a histogram of motif distancesto the peakMax: black line – 2 bp resolution of the top scoring motif distance per peak; green line – 5 bp resolution of distances for the topscoring motifs with a score equal to or higher than 85. The sequence logo is for TBP. (b) TFBS-landscape view density plots for 15 PWM’s areoverlaid on a single plot, for visualization purposes. The black line is the enrichment of the TBP motif, the coloured lines are NR2F1, MYC::MAX,CTCF, GABPA, TAL1::TCF3, FOSL2, FOXD3, NRF1, MEF2A, AP1, SPI1, ZNF143_b, E2F1, and NFYB motifs, as noted on the plot. (c) TheDinucleotide-environment view around the TBP motif. The x-axis is the location of the dinucleotide with respect to the TBP motif, and the y-axis is thefraction of sequences with the dinucleotide at a given position. The coloured lines each represent one of 16 dinucleotides, as specified in the plotlegend. The magenta box highlights the dinucleotide enrichment in the regions flanking the TBP motif.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 11 of 19http://www.biomedcentral.com/1471-2164/15/472into the online oPOSSUM-3 motif over-representationanalysis tool [12].The spatial analysis of TFBS positions within ChIP-Seqdata has been a central focus in the development of sev-eral bioinformatics methods. The MEME Suite includesthe CentriMo software which can evaluate both knownmotif collections and de novo pattern discovery derivedmotif models based on the centrality of predicted TFBSwithin ChIP-Seq peak regions [14]. The SpaMo compo-nent of the same package evaluates the statistical signifi-cance of spacing between predicted pairs of TFBS acrossChIP-Seq regions [30]. The GEM system performs denovo pattern discovery of spatially correlated motif pairs[31]. While these approaches touch on the same theme,they are not directly comparable to the work presentedhere. Our methods are complementary to the publishedwork, allowing researchers to more fully explore theproperties of ChIP-Seq data.We introduced the HADB procedure for the selectionof appropriate PWM score thresholds for the predictionof TFBS within ChIP-Seq data. The threshold parameterfor TFBS calling with PWMs is treated in diverse waysin the research literature. In some cases thresholds areindividually selected for each matrix based on the deter-mination of empirical p-values based on distributions ofscores observed for a sequence pool [32]. This has beenextended to generating empirical p-value thresholds forcomposition-matched pools of sequences [33]. In othercases, percentile-based scores are applied uniformly acrossdiverse matrices [12]. Biophysical approaches based onthe calculation of binding energies have also been intro-duced [34]. The HADB procedure provides an experimen-tal basis for the selection of the threshold parameter for aTF that can be applied to the matrix in multiple data sets,as the results show little variation in the value for differentChIP-Seq studies for the same TF.The prevalence of directly bound regions in ChIP-Seqdata has received increasing attention. The ChIP-Seqprotocol is undeniably successful at retrieving TF boundregions, however, the data from these studies of sequence-specific TFs is invariably shaped by a mixture of biological,experimental and computational influences. From ourperspective, TF ChIP-Seq data is effectively bi-partite: asubset of ChIP’d regions have a direct relationship to theChIP’d TF made evident by the presence of the TF’s bind-ing motif near the peak local maximum, while theremaining subset of regions are lacking the motif. Wefound that on average, for the ~340 datasets studied, athird of peak regions lack canonical motifs proximal tothe peakMax. Some of these peaks may result from indir-ect interactions (in which the ChIP’d TF interacts with adifferent DNA bound protein) or result from a change tothe ChIP’d TF’s binding properties due to an interactionpartner. Methods have been proposed to infer indirectbinding by the enrichment of secondary TFs motifs, suchas [35], but the presence of secondary motifs do not reli-ably confirm the presence of the ChIP’d TF at the regions.We have observed that the motifs of some TFs are enrichedat the peakMax for more than 10 other unique TFsChIP-Seq datasets across diverse cell lines (Worsley-Hunt,submitted), suggesting that some peaks in ChIP-Seqdatasets may result from a mechanism that is not specificto the ChIP’d TF. Some additional potential sources ofnon-direct ChIP-Seq regions, unrelated to indirect bind-ing, include chromatin structure properties, antibodyFigure 8 Case study of ZNF143 DNA binding preferences. Thesequence logo presents the binding site characteristics of ZNF143.(a) Dinucleotide-environment view of ZNF143 ChIP-Seq repeat-maskedregions aligned on motifs with a motif score of 85 or greater. Thex-axis is the nucleotide position and the y-axis is the frequency of thedinucleotide. The coloured lines each represent one of 16 dinucleotides,as specified in the plot legend. The vertical magenta lines frame thepositions of the sequence logo. (b) Dinucleotide-environment view ofZNF143 canonical motifs with a motif score of >85. The x-axis is thenucleotide position and the y-axis is the frequency of the dinucleotide.The coloured lines each represent one of 16 dinucleotides, as specifiedin the plot legend. The orange horizontal line above the plot indicatesthe overlapping THAP11 binding profile.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 12 of 19http://www.biomedcentral.com/1471-2164/15/472properties, stochastic noise, or peak calling software. Tworecent papers have identified highly transcribed regions asenriched in the non-direct subset of ChIP-Seq data inyeast [9,36]. We do not suggest that any peaks be dis-missed, as all are potentially informative. However, webelieve that it is appropriate to use the methods intro-duced in this report to segregate the direct-binding subsetthat can be explained by the ChIP’d TFs motif within thelimited range of confidence around the peakMax, allowingthe two subsets to be individually analyzed.We highlighted the investigative potential of the methodsand visualization approaches in two case studies. First, astudy of TBP (TATA binding protein) ChIP-Seq data,where an enrichment of SINE elements and many second-ary TFs’ motifs suggests that the adaptive use of repetitivesequences as promoters may be more frequent thanwidely thought. Second, a study of ZNF143 that revealedthe ZNF143 canonical motif is part of a wider pattern thanpresented by the existing PWM, which has been shown by[29] to be the binding site for THAP11, a protein thatworks antagonistically with ZNF143.ConclusionThis broad survey of ChIP-Seq data from diverse studiesprovides greater clarity about the properties of the datathat impact the quality of interpretation. Using the methodspresented here, the applied researcher can visualize thedistribution of predicted TFBSs and calculate thresholdsfor both the maximum motif enrichment distance fromthe peakMax, and the PWM scores, for classifying TFBS-containing peaks. In addition to ChIP-Seq peak classifica-tion, the PWM scores thresholds set a lower limit on therange of motif scores expected for functional sites, whichwill be useful for downstream analyses such as binding sitemutation analysis. These methods and approaches will im-prove TFBS enrichment analyses and the applied analysisof ChIP-Seq data, particularly for the annotation of reli-able TFBSs within ChIP-Seq peaks.MethodsDatasetsWe downloaded ChIP-Seq datasets from the GEO data-base: 1) GSE11431 - thirteen mouse ES cell datasets [37]; 2)GSE25532 – mouse NFYA data in ES cells [38]; 3)GSE17917 and GSE18292 – human KLF4, POU5F1,C-MYC, NANOG, and SOX2 data [39]; and 4) GSE22078 –human and mouse C/EBPA and HNF4A [40]. A dataset formouse FOXA2 was downloaded from http://www.bcgsc.ca/data/histone-modification/histone-modification-data [41].We also used ENCODE ChIP-Seq datasets (human andmouse), and human ChIP-Seq controls [42] downloadedfrom the UCSC ENCODE database [43]. The ENCODEbroadPeak datasets frequently occurred in replicate; we se-lected the larger replicate for our analyses. Where only themapped sequence data was available (5 datasets), we calledpeaks using FindPeaks 4.0 [44] with the following param-eter options: −dist_type 1 200 -subpeaks 0.6 -trim 0.2-duplicatefilter.As the downloaded ChIP-Seq peaks were reported witha multitude of lengths, ranging from 1 bp to >5,000 bp,we trimmed or extended all peaks to a constant length(201 bp, 401 bp, or 1001 bp) centered at the local peakmaximum, or at the peak centre for datasets which do nothave peakMax positions provided.A control set of “mappable” regions was generated fromthe CRG Alignability (36mer) data [42] downloaded fromthe UCSC ENCODE database [43]. Unique, contiguousCRG Alignability 36mer regions were merged, and theresulting larger regions were then split into multiplenon-overlapping regions of length 201 bp or 401 bp. Thisyielded two datasets of mappable regions, to be used as asource of background sequences in later analyses.The DNase accessible control datasets used in over-representation analyses were generated from DNase Ihypersensitivity datasets (UW DNase: University of Wash-ington) [42] downloaded from the UCSC ENCODE data-base [43]. To obtain a dataset that reflected a broad rangeof DNase accessibility, DNase I hypersensitive segmentsfrom multiple cell lines were concatenated, and contiguousor overlapping regions merged. The resultant sequenceswere split into one of the two lengths, 201 bp and 401 bp,to generate two large datasets of DNase accessible se-quences to be used as a source of background sequencesin later analyses.The Ensembl Perl API was employed to retrieve sequencesfrom GRCh37/hg19 and NCBI37/mm9 assemblies. Wherethe genome coordinates first needed conversion to GRCh37/hg19 or NCBI37/mm9, we used a locally installed versionof the UCSC lift-over tool [45].Position frequency matrices (PFMs) were obtained fromthe JASPAR [1] development 4.0_alpha database of tran-scription factor models (April 2013). As the developmentset has been subsequently revised for a curated release, weprovide the entire set of matrices as used in this report inAdditional file 14. The PFMs were converted to positionweight matrices (PWMs) using the TFBS Perl modules[46], with default background values for a uniform nucleo-tide background.Motif predictionMotif prediction was performed with C code adaptedfrom the TFBS Perl Modules [46], which scans sequencesfor TFBS instances and reports both the motif locationand a PWM relative motif score. Where multiple motifsper sequence per PWM were predicted, the reportedWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 13 of 19http://www.biomedcentral.com/1471-2164/15/472motifs were not permitted to overlap by more than one-fifth the PWM length (e.g. a 7 bp motif could only overlapa neighbouring motif by 1 bp).Nucleotide compositionThe mononucleotide GC content of sequences was deter-mined from a count of each nucleotide type in the se-quence, based on single stranded DNA. The compositionof the TF profiles was determined similarly; from a countper each nucleotide base in the position frequency matrix(PFM), and subsequently obtaining the ratio of GCnucleotides.Motif over-representation analysesMotif over-representation analyses were performed on201 bp and 401 bp sequences with a locally installed ver-sion of oPOSSUM 3.0 [12] and the online ASAP tool [47].Within oPOSSUM we used the sequence-based analysisdefault settings, aside from supplying our own set of 166PFMs. The oPOSSUM software converts the PFMs toPWMs using the default setting of the TFBS Perl modules[46]. Over-representation scores of “infinite” value were setto the greater of either 500 or to the maximum non-infiniteenrichment score plus 100. For ASAP analyses, we choseparameter values similar to those used for the oPOSSUManalyses. The ASAP tool limited the number of nucleotidessubmitted for each analysis, therefore our input was limitedto a randomly selected subset of 10,000 sequences on eachsubmission.The backgrounds used for the over-representationanalyses were individually generated for each set of targetsequences to be analyzed, and matched to the length ofthe target sequences. The backgrounds were derived fromeither the target sequences (e.g. dinucleotide shufflemethods, or a Markov model), a dataset of uniquely map-pable sequences, or DNase accessibility data (see Datasets,above).Naïve backgroundsFor evaluation purposes we retained naïve backgroundsin our assessment, which were composed of randomly se-lected sequences from either the uniquely mappable por-tion of the human genome, or DNase accessible regions.3rd order Markov model backgroundA 3rd order Markov model background was generatedusing a combination of the oligo-analysis and random-seqprograms from a local installation of the RSAT package[10]. The target sequences of interest were submitted toRSAT::oligo-analysis with the parameters: −l 4 -1str (wordlength 4 bp, and single strand), and the results file was inturn submitted to RSAT::random-seq with the parameter:−ol 4. The RSAT::random-seq program then used a 3rdorder Markov model trained on 4 bp oligo-nucleotides togenerate sequences that matched the GC content andlength of the target sequences. While Markov models havebeen previously introduced to generate background se-quences, the ideal order of the models is not clear. If theorder is too high, the model will simply recapitulate thefrequency of the TFBSs in the target dataset, which is whywe restricted the model to 3rd order.HOMER 2 GC backgroundHOMER 2 [11] was downloaded from http://homer.salk.edu/homer/ and installed on a cluster. The findMotifs-Genome.pl tool was used to perform a known motif over-representation analysis on each of the 43 datasets usingthe following options: −N number_of_seqs –nomotif –dumpFasta –size length_of_sequence. The number_of_se-quences was the number of sequences in each dataset,and the length_of_sequence was either 201 bp or 401 bp,dependent on the analysis. The –dumpFasta option wasset in order to retrieve the backgrounds after the analysiswas finished running. We provided the hg19 human gen-ome (from UCSC) and the coordinates of the peaks foreach dataset.BiasAway Background Generating ToolSix background models were implemented as a back-ground sequence generator: BiasAway. The BiasAway toolhas been implemented in Python using BioPython mod-ules [48], and is available as open source code fromGitHub https://github.com/wassermanlab/BiasAway/arch-ive/noRPY.zip. The tool provides a user with six ap-proaches for generating a background useful to over-representation analyses: 1) mononucleotide shuffled se-quences, 2) dinucleotide shuffled target sequence to pre-serve the dinucleotide composition of the targetsequences, 3) genomic sequences matched to the mono-nucleotide GC content of each target sequence to preservethe non-random association of nucleotides, 4) sliding win-dows of mononucleotide shuffled sequence, 5) sliding win-dows of dinucleotide shuffled target sequence, and 6)genomic sequences matched in windows of internalmononucleotide GC content for each target sequence.The latter two backgrounds (BiasAway 5–6) are variantsof the former two backgrounds (BiasAway 2–3), in whichwe utilized a sliding window over the ChIP-Seq sequencesto determine a distribution for local regions of com-position. The background sequence set is then generated(dinucleotide shuffle) or selected from a pool of genomicsequences (genomic composition match) to match thedistribution of window compositions for each target se-quence. These latter backgrounds were considered becausedue to evolutionary changes such as insertion of repetitivesequences, local rearrangements, or biochemical missteps,the target sequences may have sub-regions of distinctWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 14 of 19http://www.biomedcentral.com/1471-2164/15/472nucleotide composition. See Additional file 1: Text S1 Sup-plemental Methods for greater detail.Measures to evaluate over-representation analysis resultsTo summarize the impact that the choice of backgroundhas on the reported over-representation results, we assessedthe over-representation results by four measures: 1) theskew of the over-representation scores, 2) the count ofdatasets with the ChIP’d TF’s binding profile reported inthe top 5 over-representation results, 3) the mean of theover-representation scores (excluding outlying scores), and4) the variability of the over-representation scores (exclud-ing outlying scores). The first measure, the skew of theover-representation scores, is the negative of the slope ofthe line fitted to the over-representation scores (y-value)and the associated PFM GC content values (x-value; seeFigure 2a for reference). The skew informs us of the degreeto which a dataset is biased towards extreme compositionmotifs i.e. GC-rich or AT-rich; the more negative the skew,the greater the bias. The second measure calculates theproportion of datasets for which the over-representationanalysis (using a given background type) reports the ChIP’dTF’s motif in the top 5 results. The skew and the propor-tion of datasets are plotted together (Figure 2a). Back-ground types that consistently yield a low skew value andreturn the ChIP’d TF in the top results, are ideal. We usedthe third and fourth measures to assess the tendency forthe over-representation analysis to report the majority ofTF’s motifs as having a low over-representation score.Therefore we set a threshold of the mean plus one standarddeviation, to remove the highest over-representation scoresfrom further analysis. We then obtained the mean andstandard deviation of the remaining scores. Ideally theresults would display both a low mean and a low standarddeviation, as this indicates that the majority of TF’s motifsare close to an enrichment score of 0.Enrichment visualization plots and HADB boundaries ofenrichmentFor the evaluation of ChIP-Seq datasets with the HADBmethod, we restricted to those datasets for which we hada PWM for the ChIP’d TF.Composition-Bias plotsComposition-Bias (CB) plots are generated to detect\whether GC-rich or AT-rich PWMs are over-representedin a TFBS motif over-representation analysis. The GCcontent of the PFM’s submitted to the over-representationanalysis are presented on the x-axis, and the over-representation score of the predicted TFBSs on the y-axis.TFBS-landscape viewTFBS-landscape views were generated to visualize centralenrichment of a TF’s motifs within a ChIP-Seq dataset.TFBS prediction was done on 1001 bp regions, to extendsequences sufficiently far into the flanks to present thebackground rate of TFBS prediction. The view’s left plotpresents the motif score of the top scoring predicted motifin each peak for the given TF PWM on the y-axis, and themotif ’s signed distance from the peakMax on the x-axis,where the peakMax is x = 0. The right plot presents twodensities depicting: 1) the distances of the top scoring mo-tifs to the peakMax for all peaks, and 2) the distances tothe peakMax for the subset of peaks for which the topscoring motif was equal to or greater than 85 to capturecases with low enrichment of strong scoring motifs. Thevalue of 85 was based on the default parameters of theoPOSSUM software, for which 85 was chosen to best dis-criminate known TFBS from background. The x-axis ofthe right plot is the distance of the motifs to the peakMax(x = 0), and the y-axis is the probability density, which is areflection of peak frequency for given distances. The de-fault peak length was 1001 bp. For the TBP case study, weoverlaid multiple smoothed densities of motif-to-peakMaxdistance enrichments on one plot (Figure 7b).HADB motif enrichment thresholdThe data used for the TFBS-landscape plots (TFBSs and1001 bp sequences), was also used by the HADBmethod. To calculate a heuristic enrichment thresholdwe identified the distance from the peakMax at whichthe density of the top scoring motifs (one motif per se-quence) falls below the density of motifs in the back-ground regions, such as we see in the TFBS-landscapeplots (Figure 4b). The background is the region spanning200–500 bp from the peakMax. We created 5 bp bins ofthe absolute distance from the peakMax, and countedthe number of peaks with a top scoring motif withineach bin. The count in each bin was converted to theproportion of peaks in the dataset and a regression linewas fitted to the background bins, where x was the upperlimit of the 5 bp bin, and y was the proportion of peaks inthe bin. To adjust for the variation in the y-values, we cre-ated an ‘allowance line’ by shifting the regression linealong the y-axis by 2-fold the 3rd largest residual value(we omitted the two most extreme residuals as they weregenerally outliers). We then selected the largest non-background bin (X) that had a y-value greater than the al-lowance line’s predicted y-value at bin X (Figure 4a). Theheuristic threshold for motif enrichment was set at thegreatest distance represented by bin X (e.g. bin X is100 bp-105 bp, therefore the threshold is 105 bp).HADB motif score lower thresholdA similar procedure was used to determine a lower limitthreshold for a PWM’s motif scores (Additional file 9:Figure S7). We generated bins of the top motif scores (oneper peak) and compared the bins in the enrichment zone toWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 15 of 19http://www.biomedcentral.com/1471-2164/15/472bins in the background zone. Bins were populated with thecount of peaks corresponding to a bin’s motif score range.Each bin’s range was 1 score point e.g. for bin81, the rangewas 80 <X < =81. ‘Central bins’ were within the enrichmentzone as defined using the heuristic enrichment thresholdsdescribed above. An equal number of ‘control bins’ weregenerated using background regions outside of the enrich-ment zone, of the same width as the central enrichmentzone (e.g. if the central enrichment zone was 180 bp, then180 bp of background motif scores were also sampled andbinned). The two sets of bins were compared to identify allthe central score bins that exceeded the number of peaks intheir matching control bin by greater than 20% of the max-imum control bin; the maximum was selected from the setof control bins equal to or of higher value than the bin ofinterest (i.e. to the right of the bin of interest on the x-axisof Additional file 9: Figure S7a). By using the set of controlbins of higher value than the bin of interest, we limitedcases where low scoring bins with few peaks in the bin wereselected as a threshold. The array of bins that passed theaforementioned 20% threshold was then used to determinethe lower boundary of motif score enrichment. Proceedingfrom the upper score bins to the lower score bins, we iden-tified the lowest set of 5 consecutive central bins (centeredon the bin of interest) where at least 4 of the bins, includingthe bin of interest, exceeded the control bins. The latter re-quirement was to reduce the chance of the algorithmselecting a central bin that was an outlier in the sur-rounding neighbourhood of bins. The central bin pre-ceding the flagged bin was selected as the threshold. Weprovide two PWM thresholds in Additional file 10: athreshold based on signal 20% above the background, asoutlined above, and a threshold based on signal 5%above the background; the median absolute differencebetween the two different thresholds is 2 points (e.g. a motifscore of 80 versus 82). The parameter of >20% signal was aheuristic limit selected to reduce stochastic noise in theHADB results.The diversity of enrichment zones and relative motifscore thresholds from the datasets were summarized inFigure 4c, Additional file 8: Figure S6a and Additional file 9:Figure S7c-d. The enrichment zone thresholds for mul-tiple datasets of a given TF were averaged to select onethreshold value per TF. To assess the similarity of thethresholds for a given TF, we calculated the average of thedifferences between every dataset’s threshold for the givenTF, and plotted the average difference as bars above andbelow the average threshold. We used the same approachfor reporting the average motif score and the similarity ofmotif scores per TF.TFBS-bi-motif viewFor each given sequence, the left plot reports the locationof the first motif relative to the peakMax on the y-axis,and the location of the second motif relative to the firstmotif on the x-axis. A horizontal band of enrichmentaround y = 0 shows that the first motif is enriched at thepeakMax, while a band of enrichment on the diagonal isthe second motif ’s enrichment at the peakMax. A gap atx = 0 is due to restricting the overlap of motifs. The diag-onal limits of the plot arise from the sequence lengthlimit, here 1001 bp. Increased density at the origin indi-cates that a number of peaks have both motif 1 and motif2 enriched at the peakMax. The x-axis on the left plot wasset as the distance between motifs in order to mirror theright plot, The right plot is a 10 bp resolution histogramof the distances between the two motifs.Dinucleotide-environment viewTo create a Dinucleotide-environment view, we wrote aPerl script to align a set of sequences on the motif of inter-est with the motif oriented in the same direction, and as-sess the frequency of dinucleotides at every position of thealigned sequences. The file of dinucleotide frequencies issubmitted to an R script (R version 2.14.1). The R scriptcalculates the proportion of sequences with a particulardinucleotide at each position of the alignment, and plotsthe position on the x-axis and the proportion of sequenceson the y-axis.Visualization was performed using the statistical packageR (version 2.14.1) [13]. The R functions for generating thefour visualization plots and the perl code for generatingdinucleotide frequencies from motif-anchored sequencealignments are available at GitHub https://github.com/was-sermanlab/TFBS_Visualization/archive/master.zip. Plots forvisualizing the heuristic threshold decisions, as seen inFigure 4a and Additional file 9: Figure 7a, are included inthe provided R code.Analysis of broadPeak replicate experimentsThe ENCODE broadPeak datasets are available as repli-cates, which allowed us to determine whether the peakspredicted to be directly bound by the ChIP’d TF due tothe presence of the TF’s motif were enriched for occur-rence in both replicate experiments. Replicate experimentswere pooled together, with neighbouring peaks (two peakmaximums within 500 bp or 1000 bp) flagged as a singleinstance of a region. The 500 bp distance was selected tobe inclusive of the ~400 bp median width of the broad-Peak regions; 1000 bp was selected to explore the sensitiv-ity to longer settings. Using a Fisher exact test, we thencompared the proportion of replicated regions versus re-gions unique to a single experiment for the set of peakswith the ChIP’d TFs motif within the heuristic enrichmentzone, and for the set of peaks without the ChIP’d TFsmotif. Peaks that had been flagged as having the ChIP’dTFs motif in one experiment but without the ChIP’d TFsmotif in the other experiment were rare (median 0.28% ofWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 16 of 19http://www.biomedcentral.com/1471-2164/15/472the pooled replicates for an experiment), and we chose toomit these from the analysis.GO term enrichment analysisWe used the default settings of the web-based versionof GREAT [16] http://bejerano.stanford.edu/great/public/html/splash.php. We submitted those peaks predicted tocontain the ChIP’d TF’s motif by the HADB method, thepeaks without the HADB predicted ChIP’d TF’s motif andpeaks from the whole dataset. We used the Binomial Re-gion Set Coverage and the number of terms related toactin and oxidative stress, to assess differences betweenthe three sets of peaks.TBP case study – repeat elements analysisRepeat elements in the TBP peaks were assessed usingdata downloaded from the UCSC mm9 Repeat Maskertrack via the UCSC Table Browser [49].Additional filesAdditional file 1: Text S1. Supplemental Data and Methods.Additional file 2: Figure S1. Nucleotide composition is variablebetween ChIP-Seq datasets. (a) The y-axis presents the GC content ofChIP-Seq datasets (x-axis) generated from the K562 cell line; one datasetper TF. (b) The GC content (y-axis) of datasets from multiple samples(source cell lines indicated along the x-axis) for the JUN-D TF.Additional file 3: Figure S2. The multiplicity of predicted TFBS motifsin ChIP’d sequences corresponds to the multiplicity +1 of controlsequences. The number of peaks with a given multiplicity are plotted onthe x-axis and the mean GC composition of the peaks is on the y-axis. ‘X’is the motif multiplicity of the controls, and ‘X + 1’ is the motif multiplicityof the ChIP-Seq peaks. (a) C/EBPB ChIP-Seq sequences (black) and controlsequences matching the average GC composition of the ChIP-Seqsequences (red). (b) AP2γ ChIP-Seq sequences (black) and controlsequences matching the average GC composition of the ChIP-Seqsequences (red).Additional file 4: Figure S3. Binding site over-representation resultsusing the ASAP tool. The CB-plots present the PFM GC composition onthe x-axis and the ASAP over-representation score on the y-axis. The top5 over-represented TF profiles’ names are written on the plot; the nameof the ChIP’d TF or related TF is highlighted in magenta, and the logosare shown in (a) E2F1 and E2F4, (f) JUN-family (JUN, JUN-D, AP1, andFOSL2), and (k) C/EBPA and CEBP/B. (b)-(e) E2F1 ChIP-Seq. (g)-(j) JUN-BChIP-Seq. (l)-(o) C/EBPB ChIP-Seq. The first CB-plot for each of the 3 TFs(b), (g), (l) are results using a random background selected from a poolof uniquely mappable sequences. The second CB-plot for each of the 3TFs (c), (h), (m) are results using a background generated by a 3rd orderMarkov model. The third CB-plot for each of the 3 TFs (d), (i), (n) areresults using dinucleotide shuffled target sequences as background. Thelast CB-plot for each of the 3 TFs (e), (j), (o) are results using backgroundsequences from the mappable dataset, matched to the GC compositiondistribution of the target sequences.Additional file 5: Figure S4. Background impact on over-representationanalyses for 400 bp datasets. (a) For each background, the fraction of the 43analyses that reported the ChIP’d TF in the top 5 over-represented PWMsfrom a particular background (x-axis) is plotted against the average skew ofthe over-representation results for each background’s 43 analyses. Skew isthe negative slope of the line fitted to the over-representation scores versusPFM GC content (i.e. values visualized by Figure 1a axes). The ideal is to havea large x-axis value (vertical dashed line) and an average skew of zero(horizontal dashed line). (b) and (c) summarize the standard deviation (y-axis)and mean (x-axis) of the ‘non-outlier’ oPOSSUM over-representation scoresfor 10 backgrounds against each of 43 ChIP-Seq datasets, where panel (b)displays the average value for each background across the 43 datasets andpanel (c) displays the individual value of 430 analyses. The ideal result wouldbe situated at the origin (the intersection of the dashed lines. For all panels,each of the 10 backgrounds tested is denoted as a single colour: Lightgreen circle – randomly chosen background from the dataset of mappablesequences, dark green cross – randomly chosen background from thedataset of DNase accessible sequences, orange circle – mononucleotideshuffled background, brown cross – mononucleotide shuffled backgroundwithin a sliding window, black circle – dinucleotide shuffled background,gray cross – dinucleotide shuffled background within a sliding window,magenta triangle – 3rd order Markov model generated backgroundsequences, blue circle – background selected from the mappablesequences dataset to match the GC composition of the target sequences,light blue cross – background selected from the mappable sequencesdataset to match the distribution of GC composition in sliding windows ofthe target sequences, and red triangle – GC background from HOMER 2.Additional file 6: Table S1. Rankings of the ChIP’d TFs binding profilefrom over-representation analysis with 10 backgrounds. The table lists therank of the ChIP’d TFs profile from 430 over-representation analyses for43 datasets and 10 backgrounds. The tendency for some backgrounds tohave a large bias towards TFs with GC-rich binding profiles is presentedas the average skew for each background. A background with a largeskew factor (>100) will favour TFs with GC-rich profiles.Additional file 7: Figure S5. Background selection can correct theover-representation score bias towards GC-rich or AT-rich TFBSs in motifover-representation analyses. The results of over-representation analysesfor an E2F1 ChIP-Seq dataset using six distinct backgrounds (one backgroundper plot). The names of the 5 top ranked TF PWMs are written on theplot. The horizontal line is set at over-representation score 100 as avisual reference point. Points corresponding to E2F1 and E2F4 motifs arehighlighted in pink. The dotted line at over-representation score 100 isfor visual reference. The sequence logos are E2F1 and E2F4 respectively.(a) Randomly chosen background from a pool of DNase accessiblesequences. (b) Randomly generated background sequences based a 3rdorder Markov model. (c) A background of dinucleotide shuffled targetsequences. (d) Selected regions from the mappable sequence datasetmatching the GC composition distribution of the target sequence set.(e) Sliding windows of dinucleotide shuffled target sequence. (f) Genomicsequences matched in windows of internal GC composition for each targetsequence.Additional file 8: Figure S6. Zones of motif enrichment definedaround the peakMax of mouse ChIP-Seq datasets vary per TF. (a) Zonesof PWM motif enrichment defined by a heuristic enrichment thresholdfor mouse datasets. The average width of the motif enrichment zonearound the peakMax for TF’s datasets are plotted on the y-axis; the differ-ences between all widths, for all of a TF’s datasets, were averaged andplotted on the y-axis as vertical bars. The datasets are along the x-axis.The red horizontal line is the mean width of enrichment. (b) The propor-tion of peaks within the motif enrichment zone for a TF’s set of ChIP-Seqdatasets were averaged. The x-axis provides, for each of 39 TFs, the meanproportion of peaks with a motif scoring above the motif score thresholdand located within the zone of enrichment (mean 0.65, median 0.72).Additional file 9: Figure S7. Defining a PWM’s lower bound of motifscore enrichment within the heuristic enrichment zone. (a) A visualdepiction of the motif score lower threshold determined with ourheuristic procedure. The x-axis indicates the upper bound of bins of PWMmotif scores for an NFYB ChIP-Seq dataset. The bins of motif scores are insteps of 1 score point (e.g. bin80 is 79 < scores < =80). The y-axis is thecount of peaks in each bin. The black dotted line depicts scores for peakswith top-scoring motifs within the enrichment zone around the peakMax,while the red dotted line depicts scores for peaks with top-scoring motifsfrom distal positions. The vertical blue line depicts the threshold for motifscore enrichment relative to the background. (b) The TFBS-landscapeview for the NFYB dataset. The x-axis is the distance of the top scoringmotif to the peakMax and the y-axis is the motif score. The blue line isthe calculated motif score threshold. (c) and (d) The mean motif scorethreshold of multiple datasets. The motif score thresholds for a TF’sWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 17 of 19http://www.biomedcentral.com/1471-2164/15/472multiple datasets were averaged and plotted on the y-axis, with verticallines showing the average differences between the thresholds. The x-axisis the TF. The red horizontal line is the mean. Left (c) is human data, andright (d) is mouse data.Additional file 10: Motif relative score thresholds for positionweight matrices (PWMs).Additional file 11: Figure S8. GREAT analysis results on SRF ChIP-Seqdata. GREAT results from the analyses of three sets of SRF peaks. (a) Allpeaks in the SRF dataset. (b) The subset of peaks identified by the HADBmethod to have an SRF motif proximal to the peakMax. The red boxhighlights the actin related GO term. (c) The subset of peaks that do nothave an SRF motif proximal to the peakMax. (PDF 3905 kb)Additional file 12: Figure S9. GREAT analysis results on NFE2L2ChIP-Seq data. GREAT results from the analyses of three sets of NFE2L2peaks. (a) All peaks in the NFE2L2 dataset. (b) The subset of peaks identifiedby the HADB method to have an NFE2L2 motif proximal to the peakMax.The red box highlights the oxidative stress related GO terms. (c) The subsetof peaks that do not have an NFE2L2 motif proximal to the peakMax.Additional file 13: Figure S10. TFBS-landscape view of four PWMs in aTBP ChIP-Seq dataset from mouse MEL cell-line. TFBS-landscape views areshown for four PWM’s on a TBP dataset. The left-side plots present thetop scoring motif distance to the peakMax on the x-axis, and the motifscore on the y-axis. The right-side plots present a histogram of motifdistances: black – 2 bp resolution of the top scoring motif distance perpeak, green – 5 bp resolution of the distances for the top scoring motifswith a score equal to or higher than 85. Sequence logos indicate theprofile used to scan the TBP peaks: (a) ZNF143_b PWM. (b) NF2F1 PWM.(c) CTCF PWM. (d) FOXD3 PWM. The data represent a subset of TFprofiles depicted in Figure 7b.Additional file 14: Position frequency matrices (PFMs).AbbreviationsChIP-Seq: Chromatin immunoprecipitation and high-throughput sequencing;CB: Composition-bias; peakMax: Peak local maximum; PFM: Positionfrequency matrix; PWM: Position weight matrix; TF: Transcription factor;TFBS: Transcription factor binding site.Competing interestsThe authors declare that they have no competing interests.Authors’ contributionsRWH performed all analyses. RWH and WWW designed the study and wrotethe manuscript. RWH, AM, and LdP designed, implemented and tested theBiasAway tool, and AM wrote the final BiasAway package. All authors readand approved the final manuscript.AcknowledgmentsWe thank Miroslav Hatas for systems support and Dora Pak for managementsupport, and the members of the Wasserman lab for suggestions and feedback.We are indebted to the researchers around the globe who generated theChIP-Seq data. The work was supported by funding from the National Institutesof Health (USA) grant 1R01GM084875, the Canadian Institutes for Health Research,the National Science and Engineering Research Council (NSERC), and GenomeBCand GenomeCanada (ABC4DE Project). RWH was supported by fellowships fromthe Canadian Institutes of Health Research and the Michael Smith Foundationfor Health Research. LdP was supported by funding from a Fundación CajaMadrid mobility fellowship for visiting professors 2011–2012, and the SpanishMinistry of Science and Innovation, MICINN, grant number SAF2011_24225.Author details1Bioinformatics Graduate Program, University of British Columbia, Vancouver,BC, Canada. 2Centre for Molecular Medicine and Therapeutics, Child andFamily Research Institute, Department of Medical Genetics, University ofBritish Columbia, Vancouver, BC, Canada. 3Universidad Autónoma de Madrid,Biochemistry, Madrid 28029, Spain. 4Centre for Molecular Medicine andTherapeutics, 950 W.28th Avenue, Vancouver, BC V5Z 4H4, Canada.Received: 20 December 2013 Accepted: 20 May 2014Published: 13 June 2014References1. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E,Yusuf D, Lenhard B, Wasserman WW, Sandelin A: JASPAR 2010: the greatlyexpanded open-access database of transcription factor binding profiles.Nucleic Acids Res 2010, 38(Database issue):D105–D110.2. Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB,Makeev VJ: HOCOMOCO: a comprehensive collection of human transcriptionfactor binding sites models. Nucleic Acids Res 2013,41(Database issue):D195–D202.3. Machanick P, Bailey TL: MEME-ChIP: motif analysis of large DNA datasets.Bioinformatics 2011, 27(12):1696–1697.4. Georgiev S, Boyle AP, Jayasurya K, Ding X, Mukherjee S, Ohler U:Evidence-ranked motif identification. Genome Biol 2010, 11(2):R19.5. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ: Deep and wide diggingfor binding motifs in ChIP-Seq data. Bioinformatics 2010,26(20):2622–2623.6. Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, Van Helden J:RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic AcidsRes 2012, 40(4):e31.7. Rhee HS, Pugh BF: Comprehensive genome-wide protein-DNA interactionsdetected at single-nucleotide resolution. Cell 2011, 147(6):1408–1419.8. Auerbach RK, Euskirchen G, Rozowsky J, Lamarre-Vincent N, Moqtaderi Z,Lefrancois P, Struhl K, Gerstein M, Snyder M: Mapping accessible chromatinregions using Sono-Seq. Proc Natl Acad Sci USA 2009,106(35):14926–14931.9. Teytelman L, Thurtle DM, Rine J, Van Oudenaarden A: Highly expressed lociare vulnerable to misleading ChIP localization of multiple unrelatedproteins. Proc Natl Acad Sci USA 2013, 110(46):18602–18607.10. Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C,Thieffry D, Van Helden J: RSAT 2011: regulatory sequence analysis tools.Nucleic Acids Res 2011, 39:W86–W91.11. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C,Singh H, Glass CK: Simple combinations of lineage-determining transcriptionfactors prime cis-regulatory elements required for macrophage and B cellidentities. Mol Cell 2010, 38(4):576–589.12. Kwon AT, Arenillas DJ, Worsley Hunt R, Wasserman WW: oPOSSUM-3:advanced analysis of regulatory motif over-representation across genesor ChIP-Seq datasets. G3 2012, 2(9):987–1002.13. R Core Team: R: A Language and Environment for Statistical Computing.In R Foundation for Statistical Computing. 2013.14. Bailey TL, Machanick P: Inferring direct DNA binding from ChIP-seq.Nucleic Acids Res 2012, 40(17):e128.15. Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seqpeak detection. PLoS ONE 2010, 5(7):e11471.16. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM,Bejerano G: GREAT improves functional interpretation of cis-regulatoryregions. Nat Biotechnol 2010, 28(5):495–501.17. Miano JM, Long X, Fujiwara K: Serum response factor: master regulator ofthe actin cytoskeleton and contractile apparatus. Am J Physiol Cell Physiol2007, 292(1):C70–C81.18. Singh S, Vrishni S, Singh BK, Rahman I, Kakkar P: Nrf2-ARE stress responsemechanism: a control point in oxidative stress-mediated dysfunctionsand chronic inflammatory diseases. Free Radic Res 2010, 44(11):1267–1288.19. Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA, Ovcharenko I:Homotypic clusters of transcription factor binding sites are a key componentof human promoters and enhancers. Genome Res 2010, 20(5):565–577.20. Giorgetti L, Siggers T, Tiana G, Caprara G, Notarbartolo S, Corona T,Pasparakis M, Milani P, Bulyk ML, Natoli G: Noncooperative interactionsbetween transcription factors and clustered DNA binding sites enablegraded transcriptional responses to environmental inputs. Mol Cell 2010,37(3):418–428.21. Ji Z, Donaldson IJ, Liu J, Hayes A, Zeef LA, Sharrocks AD: The forkheadtranscription factor FOXK2 promotes AP-1-mediated transcriptionalregulation. Mol Cell Biol 2012, 32(2):385–398.22. Yu X, Zhu X, Pi W, Ling J, Ko L, Takeda Y, Tuan D: The long terminal repeat(LTR) of ERV-9 human endogenous retrovirus binds to NF-Y in theassembly of an active LTR enhancer complex NF-Y/MZF1/GATA-2. J BiolChem 2005, 280(42):35184–35194.23. Razzaque MA, Masuda N, Maeda Y, Endo Y, Tsukamoto T, Osumi T: Estrogenreceptor-related receptor gamma has an exceptionally broad specificityof DNA sequence recognition. Gene 2004, 340(2):275–282.Worsley Hunt et al. BMC Genomics 2014, 15:472 Page 18 of 19http://www.biomedcentral.com/1471-2164/15/47224. Watson DK, Robinson L, Hodge DR, Kola I, Papas TS, Seth A: FLI1 and EWS-FLI1function as ternary complex factors and ELK1 and SAP1a function asternary and quaternary complex factors on the Egr1 promoter serumresponse elements. Oncogene 1997, 14(2):213–221.25. Schmid CD, Bucher P: MER41 repeat sequences contain inducible STAT1binding sites. PLoS ONE 2010, 5(7):e11425.26. Ferrigno O, Virolle T, Djabari Z, Ortonne JP, White RJ, Aberdam D:Transposable B2 SINE elements can provide mobile RNA polymerase IIpromoters. Nat Genet 2001, 28(1):77–81.27. Schaub M, Myslinski E, Schuster C, Krol A, Carbon P: Staf, a promiscuousactivator for enhanced transcription by RNA polymerases II and III.EMBO J 1997, 16(1):173–181.28. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E,Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, LuscombeNM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J: DNA-bindingspecificities of human transcription factors. Cell 2013, 152(1–2):327–339.29. Ngondo-Mbongo RP, Myslinski E, Aster JC, Carbon P: Modulation of geneexpression via overlapping binding sites exerted by ZNF143, Notch1 andTHAP11. Nucleic Acids Res 2013, 41(7):4000–4014.30. Whitington T, Frith MC, Johnson J, Bailey TL: Inferring transcription factorcomplexes from ChIP-seq data. Nucleic Acids Res 2011, 39(15):e98.31. Guo Y, Mahony S, Gifford DK: High resolution genome wide bindingevent finding and motif discovery reveals transcription factor spatialbinding constraints. PLoS Comput Biol 2012, 8(8):e1002638.32. Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H,Collado-Vides J, Van Helden J: Theoretical and empirical quality assessmentof transcription factor-binding motifs. Nucleic Acids Res 2011, 39(3):808–824.33. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification offunctional clusters of transcription factor binding motifs in genomesequences: the MSCAN algorithm. Bioinformatics 2003, 19(Suppl 1):i169–i176.34. Zhao Y, Ruan S, Pandey M, Stormo GD: Improved models for transcriptionfactor binding site identification using nonindependent interactions.Genetics 2012, 191(3):781–790.35. Gordan R, Hartemink AJ, Bulyk ML: Distinguishing direct versus indirecttranscription factor-DNA interactions. Genome Res 2009, 19(11):2090–2100.36. Park D, Lee Y, Bhupindersingh G, Iyer VR: Widespread misinterpretableChIP-seq bias in yeast. PLoS ONE 2013, 8(12):e83506.37. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W,Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B,Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH:Integration of external signaling pathways with the core transcriptionalnetwork in embryonic stem cells. Cell 2008, 133(6):1106–1117.38. Tiwari VK, Stadler MB, Wirbelauer C, Paro R, Schubeler D, Beisel C: Achromatin-modifying function of JNK during stem cell differentiation.Nat Genet 2012, 44(1):94–100.39. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, LeeL, Ye Z, Ngo QM, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH,Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolutionshow widespread epigenomic differences. Nature 2009, 462(7271):315–322.40. Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A,Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, Talianidis I, Flicek P,Odom DT: Five-vertebrate ChIP-seq reveals the evolutionary dynamics oftranscription factor binding. Science 2010, 328(5981):1036–1040.41. Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S,Soukhatcheva G, Li L, Wederell ED, Thiessen N, Bilenky M, Cezard T, Tam A,Kamoh B, Birol I, Dai D, Zhao Y, Hirst M, Verchere CB, Helgason CD, MarraMA, Jones SJ, Hoodless PA: Locus co-occupancy, nucleosome positioning,and H3K4me1 regulate the functionality of FOXA2-, HNF4A-, andPDX1-bound loci in islets and liver. Genome Res 2010, 20(8):1037–1051.42. Consortium EP, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C,Snyder M: An integrated encyclopedia of DNA elements in the humangenome. Nature 2012, 489(7414):57–74.43. Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM,Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA,Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, HausslerD, Kent WJ: ENCODE data in the UCSC Genome Browser: year 5 update.Nucleic Acids Res 2013, 41(Database issue):D56–D63.44. Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ:FindPeaks 3.1: a tool for identifying areas of enrichment from massivelyparallel short-read sequencing technology. Bioinformatics 2008,24(15):1729–1730.45. Kuhn RM, Haussler D, Kent WJ: The UCSC genome browser and associatedtools. Brief Bioinform 2013, 14(2):144–161.46. Lenhard B, Wasserman WW: TFBS: Computational framework fortranscription factor binding site analysis. Bioinformatics 2002,18(8):1135–1136.47. Marstrand TT, Frellsen J, Moltke I, Thiim M, Valen E, Retelska D, Krogh A:Asap: a framework for over-representation statistics for transcriptionfactor binding sites. PLoS ONE 2008, 3(2):e1623.48. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely availablePython tools for computational molecular biology and bioinformatics.Bioinformatics 2009, 25(11):1422–1423.49. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ:The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004,32(Database issue):D493–D496.doi:10.1186/1471-2164-15-472Cite this article as: Worsley Hunt et al.: Improving analysis oftranscription factor binding sites within ChIP-Seq data based on topo-logical motif enrichment. BMC Genomics 2014 15:472.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistributionSubmit your manuscript at www.biomedcentral.com/submitWorsley Hunt et al. BMC Genomics 2014, 15:472 Page 19 of 19http://www.biomedcentral.com/1471-2164/15/472


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items