UBC Theses and Dissertations
Understanding gene regulatory mechanisms of mouse immune cells using a convolutional neural network Maslova, Alexandra 2020

Full Text

Understanding gene regulatory mechanisms of mouse immune cells using a convolutional neural network

by

Alexandra Maslova

B.Sc. Physics, The University of British Columbia, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

May 2020

© Alexandra Maslova 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Understanding gene regulatory mechanisms of mouse immune cells using a convolutional neural network

submitted by Alexandra Maslova in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Sara Mostafavi, Assistant Professor, Statistics and Medical Genetics, University of British Columbia (Supervisor)

Maxwell Libbrecht, Assistant Professor, Computing Science, Simon Fraser University (Supervisory Committee Member)

Elodie Portales-Casamar, Clinical Assistant Professor, Department of Pediatrics, Faculty of Medicine, University of British Columbia (Supervisory Committee Member)

Abstract

Cell differentiation is controlled via complex interactions of genomic regulatory sites such as promoters and enhancers that lead to precise cell type-specific patterns of gene expression. Local chromatin accessibility at these sites is a requirement of regulatory activity, and is therefore an important component of the gene regulation machinery. To understand how DNA sequence drives local chromatin accessibility within the context of immune cell differentiation, we examined a dataset of open chromatin regions (OCRs) derived with the ATAC-seq assay from 81 closely related mouse immune cell types.
We selected and optimized a convolutional neural network (CNN), which we named AI-TAC, that takes as input a 250bp DNA sequence of a potential OCR and predicts the relative chromatin activity at that OCR across the 81 different immune cell types in our dataset. Test dataset results showed that for many OCRs, AI-TAC is able to predict chromatin state with a high degree of accuracy, even differentiating between closely related cell types. Using CNN interpretability methods we identified sequence motifs used by the model to make predictions, many of which match closely to known transcription factor (TF) binding sites. The cell type-specific influence assigned to each motif by AI-TAC in many instances recapitulates prior biological knowledge about the role of these TFs in immune cell differentiation, lending credibility to our model and interpretation methods. Finally, we attempt to discern if the model detected any combinatorial activity between TFs that is predictive of chromatin accessibility.

Lay Summary

All cells in an organism contain an identical genetic blueprint (DNA) that encodes all the information necessary for the organism to develop and function. However, different cell types look and behave in highly variable ways because genes are selectively switched on or off in different cells to establish distinct forms and functions. How genes are activated and deactivated at the right time during the process of cell differentiation - the formation of many different specialized cell types - is not yet well understood.

This thesis aims to improve the understanding of the differentiation process of mouse immune cells by examining one particular regulatory "switch" of gene expression. Most of the DNA in each cell is tightly packaged, but regions required to activate genes need to be accessible to function.
In this work, we study how DNA accessibility in different immune cells is determined locally by the DNA sequence itself.

Preface

The overall selection and design of the analysis methods presented in this work was done by the author, Alexandra Maslova, and Dr. Sara Mostafavi. This work was done in collaboration with members of the Mostafavi lab and the Benoist lab at Harvard Medical School. In particular, the following parts of the analysis were not performed by me:

• The mouse ATAC-seq data generation and processing was performed by members of the ImmGen consortium
• Data normalization was performed by Dr. Sara Mostafavi
• Bayesian optimization of the AI-TAC model hyperparameters was performed by Ke Ma from the Mostafavi lab
• Processing of the human ATAC-seq data was performed by Caleb Lareau
• Clustering of filter PWMs (mentioned in section 3.4.1) was performed by Ricardo Ramirez from the Benoist lab

Parts of this thesis, in particular Chapters 2 and 3, appear in a preprint article available on bioRxiv as Alexandra Maslova, Ricardo N. Ramirez, Ke Ma, Hugo Schmutz, Chendi Wang, Curtis Fox, Bernard Ng, Christophe Benoist, Sara Mostafavi, and the Immunological Genome Project. (2019) Learning Immune Cell Differentiation [34]. In particular, Figures 2.7, 2.4, 2.5, 2.6, 3.5, and 3.8a appear in Maslova et al. (2019) with some minor modifications. Additionally, portions of the text in sections 2.3, 3.2, 3.3, and 3.4 were taken from Maslova et al. (2019).

Table of Contents

Abstract ........ iii
Lay Summary ........ iv
Preface ........ v
Table of Contents ........ vi
List of Tables ........ viii
List of Figures ........ ix
Glossary ........ x
Acknowledgements ........ xi
Dedication ........ xii

1 Introduction and Background ........ 1
1.1 Introduction ........ 1
1.2 Biological Background ........ 2
1.2.1 Immunological Genome Project ........ 2
1.2.2 Transcriptional Regulation ........ 2
1.2.3 Chromatin Accessibility ........ 4
1.2.4 Measuring Chromatin Accessibility ........ 5
1.3 Related Work ........ 6
1.3.1 Identification of Transcription Factor Binding Sites ........ 6
1.3.2 Machine Learning Approaches ........ 8
1.3.3 Deep Neural Networks in Genomics ........ 9
1.4 Thesis Contributions ........ 13

2 Part I: Model Architecture and Performance ........ 16
2.1 Data ........ 16
2.1.1 Data Generation ........ 16
2.1.2 Processing and Normalization ........ 17
2.2 The AI-TAC Model ........ 17
2.2.1 Model Architecture ........ 17
2.2.2 Model Training ........ 20
2.3 Model Performance ........ 20
2.3.1 Comparing to Randomized Null ........ 20
2.3.2 10x10 Cross-validation ........ 22
2.3.3 Chromosome Leave-out ........ 23
2.3.4 Predictions Vary by OCR Type ........ 24
2.3.5 Model Performance on Human Data ........ 24
2.4 Summary ........ 25

3 Part II: Model Interpretation ........ 28
3.1 Interpreting AI-TAC with First Layer Filters ........ 28
3.2 Filter Properties ........ 29
3.2.1 Information Content ........ 30
3.2.2 Reproducibility ........ 31
3.2.3 Influence ........ 31
3.3 Fine-tuning Model with Filter Subset ........ 35
3.4 Detecting TF Cooperativity with AI-TAC ........ 36
3.4.1 Second Layer Filters ........ 37
3.4.2 Filter Pair Influence ........ 39
3.5 Summary ........ 41

4 Conclusions ........ 43
4.1 Summary ........ 43
4.2 Discussion ........ 44
4.3 Future Work ........ 45

Bibliography ........ 46

Appendices

A Filter motif information ........ 53

List of Tables

2.1 Correspondence between mouse and human cell types ........ 27
3.1 Possible redundant filter pairs ........ 41
3.2 Possible cooperating filter pairs ........ 41

List of Figures

1.1 Deep neural network ........ 10
2.1 AI-TAC architecture ........ 18
2.2 One-hot encoding of input sequence ........ 18
2.3 AI-TAC test set performance ........ 21
2.4 Results of 10x10 cross-validation experiment ........ 22
2.5 Chromosome leave-out results ........ 23
2.6 OCR variance versus prediction accuracy ........ 24
2.7 AI-TAC performance on human data ........ 26
3.1 Examples of first layer filter PWMs ........ 29
3.2 Number of sequences comprising each first layer filter PWM ........ 30
3.3 Reproducibility of AI-TAC first layer filters ........ 32
3.4 Influence values of AI-TAC first layer filters ........ 33
3.5 Cell type-specific influence of AI-TAC first layer filters ........ 34
3.6 Fine-tuning AI-TAC with subsets of first layer filters ........ 35
3.7 AI-TAC predictions with 99 reproducible filters ........ 37
3.8 Second layer convolutional filter weights ........ 38
3.9 Influence values of first layer filter pairs ........ 40

Glossary

CNN: Convolutional neural network
DNN: Deep neural network
IC: Information content
OCR: Open chromatin region
PWM: Position weight matrix
TF: Transcription factor

Acknowledgements

I would like to thank my supervisor, Dr. Sara Mostafavi, for her extensive support over the course of my graduate studies. I would also like to thank the members of my supervisory committee, Dr. Maxwell Libbrecht and Dr. Elodie Portales-Casamar, for their guidance over the course of this project.

I would like to acknowledge all the members of the Mostafavi lab who collaborated on this project with me, in particular: Ke Ma, Curtis Fox, Chendi Wang, Bernard Ng, and Hugo Schmutz. A special thank you to Dr. Christophe Benoist for all his insights and support, and Benoist lab members Ricardo Ramirez and Caleb Lareau for their contributions to the project.

I would also like to thank all the members of the Mostafavi lab, past and present, for their thoughtful feedback and overall support during the course of this project.

Finally, a big thank you to my friends and family, and especially my partner Dylan Reviczky, for being there during the challenging times.

To my dad, Igor, who's the reason I got this far.

Chapter 1

Introduction and Background

1.1 Introduction

Although all cells in multicellular organisms share the same genetic blueprint, they exhibit widely varying morphologies and perform very different functions.
These differences are in large part owed to cell-specific patterns of gene expression, which are established during cell differentiation via complex epigenetic mechanisms. However, there are currently gaps in our understanding of the mechanisms by which regulatory sequences coordinate precise programs of gene expression[41].

It is established that regulatory regions can enhance or suppress transcription of individual genes via the activity of proteins called transcription factors (TFs) that bind DNA in a sequence-specific manner. Although the sequence preference of many individual TFs has been identified, overall enhancer activity is more challenging to predict because TFs act in a cooperative manner that has not been well characterized [41]. Additionally, the frequency of these TF binding events is mediated by many epigenetic mechanisms such as DNA accessibility, the expression levels of the TF itself and other factors, and DNA methylation within the binding sequence [45].

Our work aims to help elucidate the regulatory mechanisms of non-coding genomic regions by examining cell type-specific effects of sequence on local DNA accessibility, which serves as an important mediator of regulatory activity. We analyze a dataset of open chromatin regions (OCRs) from dozens of isolated mouse immune cell types across the hematopoietic differentiation trajectory, created by the ImmGen consortium [24]. We approach this task by building a model that can predict cell-specific chromatin accessibility at potential OCR sites based on their sequence alone and then deriving the sequence features weighted most heavily by the model when making its predictions.

The remainder of this chapter is primarily dedicated to an overview of relevant background information.
We first describe what is currently known about the role of chromatin state in gene regulation as well as the mechanisms by which chromatin accessibility is established and maintained. Next, we provide an overview of methods that have been previously used to understand chromatin accessibility as a function of DNA sequence. The final section is a summary of the contributions of this thesis.

1.2 Biological Background

1.2.1 Immunological Genome Project

The Immunological Genome Project (ImmGen) is a collaboration between immunology and computational biology research groups that aims to thoroughly characterize the gene regulatory networks of the entire mouse immune system. By generating a large number of genome-wide datasets across a wide range of immune cell types, the consortium intends to build a comprehensive understanding of cell type-specific and condition-specific regulatory mechanisms. The consortium has established rigorously standardized protocols for animal care, cell isolation and data generation for all participating laboratories in order to produce consistent, high quality datasets[23].

1.2.2 Transcriptional Regulation

Regulatory regions of eukaryotic genomes are broadly classified into transcription start site-proximal promoters and cis-regulatory elements such as enhancers, silencers and insulators[35]. The core promoter is located in the immediate vicinity of a gene's transcription start site (TSS). The promoter sequence is sufficient to recruit general transcription factors, RNA Polymerase II and other proteins that form the transcriptional machinery that copies the DNA sequence of the gene into RNA[2]. However, transcription of a gene is often weak without additional regulatory activity at distal enhancer sites[41].

Enhancer sequences are characterized by their ability to affect transcription rates at large distances to their target TSS and independently of their relative orientation[41].
Enhancers function by recruiting TFs that recognize and bind short (typically 6-10bp) sequence motifs[41]. These TFs can then increase transcription rates of the target gene by recruiting transcription complexes to the gene promoter, either directly or via their binding partners. These long-range interactions are enabled by DNA looping which brings active enhancers into spatial proximity of their target promoter[41]. Some TFs act instead to repress transcription rates by interfering with the recruitment of transcriptional machinery, thus allowing certain enhancers to act as silencers under specific conditions[35]. Transcription at a single TSS is regulated via the coordinated activity of multiple enhancer and silencer sites[45].

Because TF motifs are short and abundant throughout the genome, the binding affinity of individual TFs alone does not account for precise cellular programs such as differentiation. Instead, the activity at a single enhancer site is the cumulative outcome of the binding of many different TFs. In the simplest case, the observed effect of multiple TFs is additive - the activity at a given enhancer sequence is proportional to the concentrations of the individual TFs for which binding sites are present. However, more precise regulatory "switches" require cooperativity between TFs. In some cases, TF cooperativity results from physical protein-protein interaction between adjacently bound TFs that enhances their binding affinity to their respective binding sequences. Alternatively, two TFs bound to the same enhancer may be responsible for recruiting the same cofactor or different components of a multi-protein complex.
TFs may also facilitate the binding of other factors by triggering local DNA bending, or even by changing the sequence specificity of another TF through protein-protein interactions[45].

There are multiple models for how the sequence architecture of an enhancer enables its function, with evidence that each model may apply to some enhancers but not others. The enhanceosome model proposes that a specific, ordered protein interface is necessary for the full activation of an enhancer, requiring strict motif positioning within its sequence[45]. Most enhancers, however, do not seem to exhibit such strict motif grammar, instead containing variable motif subsets and spatial arrangements. The billboard model accounts for this flexible motif grammar by proposing that although some TFs bind cooperatively, others may act on an enhancer in an additive or independent way, thus allowing for the relative order of some motifs to change without significantly impacting enhancer activity[45]. Another proposed model is the "TF collective" that suggests that protein-protein interactions can recruit necessary TFs in cases where their motifs are not themselves present in the enhancer sequence. Observations show that different TF composition and ordering can lead to very similar enhancer activity patterns in some cases, but in other cases these activity patterns are sensitive to changes in motif positioning even for the same TF set, providing evidence that none of the described models apply universally to all enhancer sites[45].

Insulators are specific types of regulatory elements that are responsible for forming and maintaining higher order DNA structures that enable enhancer-promoter interactions. They are broadly categorized into two types: barrier and enhancer-blocking[16].
Barrier insulators maintain active chromatin regions by blocking the spread of heterochromatin formation, thus ensuring certain genomic regions are not transcriptionally silenced. Enhancer-blocking insulators were so named because of their ability to block enhancer-promoter interactions when placed between the two. They function by anchoring themselves to nuclear structures or to other insulators via TF interactions, thus establishing DNA loop domains. In vertebrates, CTCF is an especially prevalent TF at insulator sites that is able to form bonds with itself and other nuclear proteins. The formation of these loop domains brings some enhancers and promoters within spatial proximity of each other while blocking others from interacting[16].

1.2.3 Chromatin Accessibility

At any given time, the majority of eukaryotic DNA is packaged into a highly condensed chromatin structure that makes it inaccessible for binding by most TFs[31]. The basic unit of chromatin is called a nucleosome, and is composed of approximately 150bp of DNA wrapped around a protein octamer comprised of four types of core histones. Nucleosomes are formed along the DNA string like beads, allowing the DNA to be further condensed by linker histones that bind inter-nucleosomal DNA and interact with the core histones. For processes that require DNA-protein interactions, such as DNA transcription, replication and repair, the DNA must be made accessible[4]. Regulatory genomic regions that operate via TF binding likewise require chromatin accessibility to fulfill their regulatory functions. Unsurprisingly, active enhancers, promoters and insulators coincide with nucleosome-depleted regions of the genome, which are fully accessible stretches of DNA typically the length of one nucleosome (150-250bp)[31].

These open chromatin regions (OCRs) are established and maintained through several different mechanisms.
Targeted nucleosome repositioning can be accomplished by a special class of TFs called pioneer factors that are able to bind nucleosomal DNA in a sequence-specific manner, and then displace nucleosomes either independently or by recruiting active chromatin remodelers. Alternatively, some TFs may bind inter-nucleosomal DNA and then initiate processes that destabilize and displace neighboring nucleosomes. Another proposed mechanism is that TFs bound to an accessible enhancer may recruit chromatin remodelers that evict nucleosomes at distal regulatory sites. Finally, nucleosomes in regulatory regions have relatively high turn-over rates, and TFs present in high enough concentrations may be able to passively out-compete histone proteins for DNA binding, thus ensuring local chromatin accessibility[31].

There is additionally evidence that CpG islands, which are approximately 1kb long regions of the genome with high GC content, are predisposed to low nucleosome occupancy[11]. These regions have high overlap with a subset of gene promoters, and may be responsible for maintaining chromatin accessibility for this set of regulatory sites[11].

1.2.4 Measuring Chromatin Accessibility

Here we describe some of the more popular methods for assaying genome-wide chromatin accessibility. Although each of these assays has different kinds of sequence bias, the accessible chromatin regions identified via all these methods are generally well correlated[31].

The earliest method for detecting large-scale chromatin accessibility patterns, published in 2006, used DNase I proteins to preferentially cleave only accessible DNA, then hybridize the DNA fragments onto tiled microarrays. Because microarrays are limited in throughput, this method is inadequate for measuring accessibility genome-wide.
DNase I hypersensitive site sequencing (DNase-seq), developed several years later, overcomes this problem by using high-throughput short-read sequencing to characterize the accessible DNA fragments produced by DNase I cleavage[31].

Assay for transposase accessible chromatin with high-throughput sequencing (ATAC-seq) is a more recent method for profiling chromatin state genome-wide[6]. It uses hyperactive Tn5 transposase proteins that preferentially cut nucleosome-free DNA and simultaneously insert sequencing adaptors into the ligated DNA fragments. The tagged DNA fragments are then amplified and sequenced. After the sequencing reads are aligned to the reference genome, OCRs can be identified by finding areas with high numbers of aligned reads[6]. This protocol is simple and much faster than DNase-seq, taking hours rather than days to generate the sequencing libraries. Additionally it can profile samples with much smaller quantities of cells, on the order of thousands of cells rather than the hundreds of thousands required for DNase-seq[31].

MNase-seq and NOMe-seq are another two recently developed methods for characterizing chromatin state[31]. MNase-seq, or micrococcal nuclease sequencing, uses MNase endonuclease/exonuclease proteins to digest internucleosomal DNA. The remaining DNA fragments, which were protected by histones during digestion, are sequenced and mapped to the reference genome to identify DNA regions occupied by nucleosomes[31].

NOMe-seq stands for nucleosome occupancy and methylome sequencing and can be used to profile both methylation and chromatin state simultaneously. It utilizes a viral methyltransferase that methylates GC dinucleotides rather than the CG dinucleotides that are normally methylated in humans and mice. Because only accessible DNA is methylated, whole-genome bisulfite sequencing can then be used to identify OCRs as well as methylation at CpG sites.
A much higher number of reads is required for NOMe-seq than the other methods described above because it involves sequencing the whole genome rather than selected fragments. However, this process also eliminates enrichment bias which is present in the other methods, thus allowing for a better quantification of chromatin accessibility[31].

1.3 Related Work

Here we provide an overview of methods that have been used to understand the effect of DNA sequence on local chromatin accessibility. This is by no means an exhaustive list, but we attempt to mention all the main categories of methods used for this task. Notably, we do not detail any of the approaches that exploit multi-omics data or genome annotation (rather than sequence information alone) to understand the mechanisms of local chromatin accessibility.

We start by describing approaches that rely on identifying individual TF binding sites within open chromatin regions, either by using known motifs or discovering them de novo. Next, we delve into machine learning methods designed to classify the chromatin state at putative regulatory regions as either accessible or inaccessible based on their sequence. The features most heavily weighted by these models can then be used to understand which motifs within the sequence are biologically meaningful predictors of chromatin state. Finally, we provide an overview of deep learning models designed to predict chromatin state from DNA sequence and the interpretation methods that have been applied to these models to understand how the input features are weighted by the model when making predictions.

1.3.1 Identification of Transcription Factor Binding Sites

One way of understanding regulatory DNA sequences is by identifying the presence of TF binding sites. For example, the over-representation or under-representation of certain motifs in OCRs can be used to infer which TFs are important for the regulatory activity of a given cell type. TF binding sites
can be identified using their known motifs, or they can be discovered de novo via methods that find over-represented k-mers within the input data.

TFs bind short motifs, typically 6-10bp, that can vary by 1 or 2 bases from a consensus binding sequence[47]. It is hypothesized that this flexibility in binding preference, which leads to a range of affinities between a TF and its binding sites, enables precise control of transcription rates rather than acting as a binary on/off switch. To capture this variability in binding sites, TF motifs are typically represented as position weight matrices (PWMs) with 4 entries for every position along the length of the motif, one for each nucleotide. The entries commonly used are observed nucleotide counts at each position (these are also called position frequency matrices), the probabilities of observing each nucleotide at each position (also called position probability matrices), or log-odds scores for each nucleotide at the given position, defined as:

    score(i, j) = log2(p_ij / b_j)

where p_ij is the probability of observing nucleotide j at position i, and b_j is the probability of observing nucleotide j in the background model[19, 47].

There are a number of methods designed to identify the presence of known TF binding sites within a set of input sequences, for example sequences of OCRs. FIMO[18] (Find Individual Motif Occurrences) is one example of this method type that scans a database of sequences with a set of known TF PWMs, which can be obtained from databases such as CIS-BP[50] or JASPAR[14]. For each PWM, FIMO computes a similarity score at every position within the input sequence, then converts it to a p-value reflecting the probability of obtaining a similarity score at least as high on a random sequence. These p-values can then be used to filter for high-confidence motif occurrences.

To understand how these known motif occurrences relate to chromatin accessibility, additional statistical analysis is required.
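The log-odds scoring described above can be sketched in a few lines. The toy motif, background frequencies, and function names below are illustrative placeholders, not taken from FIMO or any of the cited tools:

```python
import math

# Toy position probability matrix for a hypothetical 4bp motif "AGCT".
# Each row gives the probability of A, C, G, T at that motif position.
PPM = [
    {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
# Assumed uniform background; real tools estimate this from the data.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_pwm(ppm, background):
    """Convert a position probability matrix into log2-odds scores."""
    return [{nt: math.log2(pos[nt] / background[nt]) for nt in "ACGT"}
            for pos in ppm]

def scan(sequence, pwm):
    """Score every offset of `sequence`; return (offset, score) pairs."""
    width = len(pwm)
    return [(i, sum(pwm[k][sequence[i + k]] for k in range(width)))
            for i in range(len(sequence) - width + 1)]

pwm = log_odds_pwm(PPM, BACKGROUND)
hits = scan("TTAGCTAGCT", pwm)
best_offset, best_score = max(hits, key=lambda h: h[1])
```

The best-scoring offset is where the sequence matches the motif's consensus; a tool like FIMO additionally converts such scores into p-values against a null model before reporting hits.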
ChromVAR[39] is a method designed to correlate motif instances identified with methods such as FIMO to sample-specific chromatin state. It takes as input a set of motif occurrences, along with aligned sequencing reads from the chromatin accessibility assay and locations of open chromatin peaks. For each motif a deviation score is computed, corresponding to the read counts of all OCRs containing that motif, minus the expected read counts at these OCRs (based on accessibility across all cells), and divided by the expected read counts. A high positive score indicates that the TF motif is highly correlated with accessible chromatin in a particular cell type, while a large negative score suggests a correlation with non-active regions. A deviation score can even be computed for pairs of motifs to detect TF cooperativity, although there are computational limitations on the number of TF pairs that can be tested.

Because PWM-scanning approaches are limited to identifying instances of known motifs only, it can be advantageous to instead detect over-represented sequences within a dataset of interest and assemble those into novel motifs. HOMER[22] is a software package that identifies k-mers that are enriched in a target sequence set compared to a background sequence set and assembles them into de novo motifs. It then optimizes the PWMs of these motifs to be maximally enriched in the target sequence database, using the cumulative hypergeometric distribution function to measure enrichment. This method can be applied to find motifs associated with chromatin accessibility by comparing OCR sequences to regions of non-accessible DNA.

1.3.2 Machine Learning Approaches

Machine learning models trained to classify accessible versus inaccessible genomic regions are able to simultaneously learn many motifs that are predictive of regulatory activity in the observed data. This class of methods typically requires the transformation of the input sequences into a k-mer feature representation.
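The basic k-mer featurization these models build on can be sketched as follows; the function name and the choice of exact k-mer counts (rather than the wildcard or gapped k-mers used by SeqGL and gkm-SVM) are simplifications for illustration:

```python
from itertools import product

def kmer_counts(sequence, k=3):
    """Count every k-mer over A/C/G/T in `sequence`, returning a
    fixed-length (4^k) feature vector suitable for a linear model."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(sequence) - k + 1):
        km = sequence[i:i + k]
        if km in index:  # skip windows containing N or other symbols
            counts[index[km]] += 1
    return counts

# A length-8 sequence yields 7 overlapping 2-mers spread over 16 features.
features = kmer_counts("ACGTACGT", k=2)
```

Every input region, regardless of length, maps to a vector of the same dimension, which is what lets logistic regression or an SVM operate on sequences directly.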
Here, we describe three such methods along with the advantages and disadvantages of each.

SeqGL[40] represents input sequences using k-mers with wildcard characters to train a logistic regression model for classifying regions as accessible or inaccessible. It uses a sparse group lasso constraint on k-mers clustered by similarity to impose similar weights on k-mers that are likely to belong to the same motif. This model is easily interpretable, as the weights corresponding to each k-mer represent the importance of that k-mer to the prediction task at hand. Similar k-mers can furthermore be assembled into novel motifs.

A popular method called gkm-SVM[17] constructs a similarity kernel between all the input sequences using gapped k-mers similar to the wildcard k-mers of SeqGL. The kernel then serves as input into a support vector machine that is trained to classify genomic regions as accessible or inaccessible. Although gkm-SVM has excellent classification performance, the kernel feature representation makes the interpretation stage more challenging.

The Synergistic Chromatin Model (SCM)[21], another k-mer based approach, is an L1-regularized generalized linear model that makes per-base pair predictions of chromatin accessibility based on the input sequence. It represents its inputs as matrices indicating the presence of all possible k-mers at every position in the sequence. Unlike gkm-SVM and SeqGL, which give binary predictions of chromatin accessibility for pre-defined sequences of interest, SCM outputs a quantitative prediction of chromatin state at single base-pair resolution. Because SCM retains k-mer positions within its input feature space, it can reveal the importance of motifs at specific orientations relative to the output position. SCM is additionally easily interpretable, since weights assigned to each k-mer can be directly examined and important k-mers can be assembled into PWMs.
However, due to the large size of the input, SCM is somewhat computationally intensive to train.

Although these approaches model additive effects of multiple motifs within the same sequence, they do not utilize information about the relative positioning of motifs to make predictions. Motif spacing and orientation are known to be important factors in cooperative TF interactions, and the importance of some motifs may be missed entirely if synergistic effects with other TF binding sites are not considered.

1.3.3 Deep Neural Networks in Genomics

Deep neural networks (DNNs) are a powerful class of predictive models that have become very widespread in genomics data analysis. DNNs consist of multiple layers of “neurons”, with the output of one layer serving as the input to the consecutive layer. Each neuron is a linear function of the inputs to that particular layer with a non-linear transformation (called an activation function) applied to the result (Figure 1.1). These activation functions enable DNNs to approximate complex non-linear functions rather than simply being linear transformations of the input.

DNNs are hugely flexible with regards to the output they can produce. They can be used as binary classifiers or predictors of continuous values, and the output can be of arbitrary length. All the internal parameters of a DNN are learned during a supervised training phase. Due to the typically large number of parameters in one network, a large set of labeled data is required to train deep models.

For tasks requiring predictions based on DNA sequence, convolutional neural networks (CNNs) are the most widely used deep learning model. CNNs are able to model complex, non-linear relationships between the features of the input sequence while accounting for their relative spacing. They have been successfully used to predict genomic data such as chromatin state[29, 53], TF binding[1], and gene expression[52] on the basis of DNA sequence alone.
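The neuron computation described above can be sketched in a few lines of NumPy. This is a toy network with one hidden layer; the layer sizes and random weights are arbitrary, chosen only for illustration:

```python
import numpy as np

def relu(x):
    """Rectified linear activation: zero out negative values."""
    return np.maximum(0.0, x)

def dense_layer(x, W, b):
    """One hidden layer: each neuron is a weighted sum of the layer's
    inputs plus a bias, passed through a non-linear activation."""
    return relu(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                              # toy input vector
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)     # hidden layer weights
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)      # output neuron weights
hidden = dense_layer(x, W1, b1)
out = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))     # sigmoid output neuron
```

In practice the weights `W1, W2` and biases `b1, b2` are not random but are learned from labeled data via gradient descent.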
Figure 1.1: Schematic of a basic DNN, with hidden fully connected layers using the ReLU activation function and an output neuron with a sigmoid activation function.

Convolutional Neural Networks for Predicting Chromatin State

CNNs are a special class of DNNs containing one or more layers of convolutional operators, typically at the start of the network. A convolutional operator (or filter) is a small matrix of weights that scans the input and computes the sum of element-wise multiplication between the input and the filter. During model training, these convolutional filters are optimized to recognize low-level features of the input that are informative for the particular classification task. In the case of CNNs trained on DNA sequence data, the first layer convolutional filters correspond to short sequence motifs.

A benefit of these models is the lack of pre-processing required on the input: a CNN can process a DNA sequence of any length with no manual input as to the relevant features. In contrast, the methods previously described require either a pre-specified set of features (in the case of PWM-scanning methods or “bag of k-mers” methods) or an assumption on the length of relevant features (in the case of de novo motif discovery methods). CNNs, on the other hand, can detect relevant features of arbitrary length within the input DNA sequence.

In this section, we describe several CNN model architectures that have been used to predict chromatin accessibility. The first three examples, DeepSea[53], OrbWeaver[3] and Basset[29], are basic CNNs consisting of only convolutional and fully connected layers. Then we describe two models, DanQ[38] and Basenji[28], that incorporate additional features into their architectures to improve on the performance of vanilla CNNs.

DeepSea[53], developed in 2015, was the first CNN model trained to predict local chromatin accessibility from DNA sequence.
DeepSea was trained on data from hundreds of different cell lines, containing assays of DNase I-hypersensitive sites, histone marks, and TF binding events. Training a model to predict multiple outputs at once, called multi-task learning, provides more statistical power for optimizing parameters corresponding to lower-level features (for example, TF motifs that are relevant for both TF binding and chromatin accessibility predictions). Each model input is a 1000bp sequence centered on a TF binding event, with corresponding binarized labels of TF binding, chromatin accessibility and histone marks across the different cell types. DeepSea consists of 3 convolutional layers, which encode sequence motifs and their interactions, followed by a hidden fully connected layer. The final output layer has a sigmoid function applied to it, which normalizes all the values to be between 0 and 1. These values can then be interpreted as probabilities - in this case, probabilities of TF binding events and open chromatin peaks in the given sequence.

The OrbWeaver[3] model, trained on ATAC-seq, DNA methylation, and mRNA expression data from induced pluripotent stem cells, also utilizes a very similar architecture to DeepSea to predict chromatin accessibility. However, rather than learning the first layer filters during training, the model is initialized with PWMs of 1320 known TF motifs. Each first layer filter has a direct correspondence to a TF, and the relative importance of each filter can be derived using one of the deep learning interpretation methods, which are described in more detail in the next section.

The Basset[29] model is very similar to DeepSea, but was trained to predict only DNA accessibility based on 600bp sequences at DNase I hypersensitive sites. The model was trained on 164 different human cell types simultaneously, also utilizing the multi-task learning approach.
Its architecture varies slightly from DeepSea, with different sized convolutional filters and an additional hidden fully connected layer before the final output.

The DanQ[38] model performs significantly better than DeepSea on the same dataset by integrating a CNN with a recurrent neural network. Recurrent neural networks (RNNs) process the elements of an input sequence in consecutive order while maintaining a memory of the previous elements, enabling RNNs to model spatial dependencies between sequence features. DanQ consists of one convolutional layer used for learning sequence motifs followed by a bi-directional long short-term memory network, a type of RNN that combines the output of two networks that process the sequence from opposite ends.

Although these methods are significantly better at making predictions than non-deep learning approaches, they are limited to short input sequences and therefore cannot account for long-range interactions between different enhancer and promoter sites. Basenji[28] can process input sequences of 131 kilobases to predict, along with other data types, the average DNase-seq read depth in 128bp bin segments across the entire input sequence. This model is thus able to incorporate information from a much larger sequence context than the CNNs described above and make predictions that are quantitative rather than binary. Basenji is able to process such long genomic regions due to the addition of dilated convolutions after the standard convolutional layers. Dilated convolutions are very similar to the standard convolution operator but they contain gaps, so rather than processing adjacent features they convolve features at pre-defined spaced intervals.
This enables them to share information across long distances within the sequence that would otherwise require a much deeper network.

Interpretation of Deep CNNs

Although CNNs are highly effective predictive models, they are typically considered “black boxes” because understanding how they utilize the input features to make predictions is nontrivial. However, there is a growing field dedicated to developing methods for interpreting deep neural networks. Here we describe some examples of methods that have been used for CNN interpretation in genomics. They are broadly split into three categories: perturbation-based, backpropagation-based and reference-based approaches[13].

Perturbation-based approaches are based on perturbing the model input and measuring the effect on the prediction. A common example of this is in-silico mutagenesis, where each base pair of the input sequence is changed one at a time and the effects on model predictions are quantified. This approach was used by the authors of DeepSea[53] and Basset[29] to identify the most important bases in the sequence and the most impactful nucleotide substitutions.

Another approach based on perturbation is the analysis of first layer filters of the Basset model[29]. Because each first layer filter of a CNN learns a weight matrix of a short motif, these filters can be directly interpreted as PWMs of TF binding sites that are informative to the model. To measure the relative importance of these PWMs, each filter is removed from the model and the change in predictions serves as a motif influence score[29].

These perturbation-based approaches, particularly in-silico mutagenesis, are very computationally intensive, since hundreds or thousands of model predictions need to be calculated to characterize each input sequence.
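The in-silico mutagenesis idea can be sketched as below, assuming some scoring function `model` that maps a one-hot sequence to a scalar prediction. The `toy_model` here is a hypothetical stand-in (it simply counts A's), not any published model's scoring function:

```python
import numpy as np

def in_silico_mutagenesis(model, onehot):
    """Score every single-base substitution of a one-hot (4, L) sequence.

    `model` is any callable returning a scalar. Note the cost: three
    forward passes per position, i.e. 3*L model evaluations per sequence.
    """
    base_score = model(onehot)
    n_nt, L = onehot.shape
    effects = np.zeros((n_nt, L))
    for pos in range(L):
        for nt in range(n_nt):
            if onehot[nt, pos] == 1:
                continue  # reference base: effect is zero by definition
            mutant = onehot.copy()
            mutant[:, pos] = 0
            mutant[nt, pos] = 1
            effects[nt, pos] = model(mutant) - base_score
    return effects

toy_model = lambda s: float(s[0].sum())  # hypothetical: counts A's (row 0)
seq = np.array([[1, 0],                  # A at position 0
                [0, 0],
                [0, 0],
                [0, 1]])                 # C at position 1 (rows: A, T, G, C)
fx = in_silico_mutagenesis(toy_model, seq)
```

Each column of `fx` shows how the prediction would change under the three possible substitutions at that position, which is exactly the quantity the quadratic cost above pays for.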
More importantly, these methods do not account for any redundancies in the features - for example, if a sequence contains two motifs that have the same predictive value for the model, neither will be assigned importance by these methods. The input features may also have saturated their contribution to the output, in which case the importance of those features may be underestimated.

Backpropagation-based approaches such as saliency maps[43] and guided backpropagation[46] essentially use the gradient of the model output with respect to the input features as a measure of importance. These methods are significantly faster than perturbation-based approaches, as the gradient for all inputs can be computed in a single pass through the network. However, they do not address the issue of redundant features or the feature saturation problem, as the gradient would be zero in this regime. Additionally, the ReLU operator makes it difficult to estimate negative feature contributions when using gradients.

DeepLIFT[42] and Integrated Gradients[48] attempt to solve these issues by computing the contribution of all the input elements with respect to some reference. Integrated Gradients computes the integral of the model gradients as the input features are scaled from some reference (for example 0) to their current values. DeepLIFT, on the other hand, computes how much the change from reference in each input feature contributes to the change in the output neurons.
These values can be calculated in a highly computationally efficient manner, similar to the backpropagation approaches. These reference-based methods are able to mitigate the issue of redundant contributions of input features and the neuron saturation problem, with the caveat that they require a somewhat arbitrary choice of reference.

1.4 Thesis Contributions

Because DNA accessibility is a pre-requisite for activity at regulatory genomic regions and transcription start sites, context-specific chromatin state plays an important role in the regulation of gene expression. Understanding how DNA sequence specifies local chromatin state can therefore provide insights into the mechanisms of sequence-directed transcriptional regulation.

We focused our efforts on a unique dataset of genome-wide chromatin accessibility profiles of 81 different mouse immune cell types generated using ATAC-seq by the ImmGen consortium. Unlike most previous efforts to profile chromatin state, this data was generated from isolated cell types rather than bulk tissue samples or cultured cell lines. It profiles a large number of different cell types spanning the differentiation trajectory of the immune system, allowing us to analyze subtle differences in chromatin accessibility between closely related cells.

To understand the link between DNA sequence and accessibility, we trained a model that predicts chromatin state from DNA sequence at putative OCRs. We then extracted the most predictive sequence features used by the model, which we expect to reflect the biologically important components of regulatory sequences. Due to their recent widespread success in the field of genomics, we chose to use a convolutional neural network (CNN) as the predictive model.
We trained a CNN (named AI-TAC) to predict, for a given OCR, the relative chromatin state for all 81 cell types using only the DNA sequence at that OCR.

To understand which relevant sequence features are learned by the model during training, we extracted the PWMs associated with each convolutional filter in the first layer of AI-TAC and computed several metrics to characterize the importance of each filter. We found that most of the informative filters matched closely to motifs of known TFs, many of which are known to have important roles in immune cell differentiation. Lastly, we tried to understand how combinations of first layer filters are used by the model to incorporate sequence context and motif cooperativity into its predictions.

The following is an outline of the content in the remaining chapters of this thesis:

• Chapter 2 details our selected model and the dataset used to train it. We first describe the steps of generating, processing and normalizing the ATAC-seq data used for training the model. We then outline, in detail, the CNN architecture we chose for predicting cell type-specific chromatin state. The final section shows a number of experiments validating the predictive performance of the model.

• Chapter 3 focuses on model interpretation with the goal of identifying predictive sequence features that might help shed light on regulatory mechanisms of chromatin accessibility. We first describe how we extract and characterize the first layer filter motifs of AI-TAC, and their relation to known TF motifs. Next, we show that, using our defined metrics of filter importance, we can replicate the model's performance with only 1/3 of the first layer filters.
Finally, we attempt to extract the combinatorial logic used by the model, which could provide insight into TF cooperativity in vivo.

• Chapter 4 discusses our findings and proposes some directions for future work that may improve on AI-TAC's predictive performance and interpretability.

Chapter 2

Part I: Model Architecture and Performance

In this chapter we describe AI-TAC, a CNN model trained on ATAC-seq data to predict cell type-specific local chromatin state from short DNA sequences. Section 2.1 provides an overview of the steps used to generate, process and normalize the dataset used to optimize AI-TAC. In section 2.2 we describe the details of the different components of AI-TAC, the overall model architecture, and the model optimization procedure. In section 2.3 we show the results of several experiments designed to test the predictive performance of AI-TAC. We compare AI-TAC's performance on real data versus several simulated “null” datasets. Additionally, we show the robustness of the model optimization step with respect to the choice of training and test set split. Finally, we test the ability of AI-TAC to make meaningful predictions on a human cell ATAC-seq dataset not used in training the model.

2.1 Data

The dataset consists of bulk ATAC-seq assays performed on 81 mouse immune cell types belonging to six different immune lineages: αβT, γδT, B, lymphoid, myeloid and stem cells.

2.1.1 Data Generation

Each cell type was isolated from genetically identical C57BL/6 mice from the Jackson Laboratory by one of the ImmGen consortium immunology groups with flow cytometry using standardized procedures. All library construction and sequencing was performed at the core ImmGen lab using Illumina NextSeq 500 instruments and paired-end reads. A more detailed description of the data generation process can be found in Yoshida et al, 2019[24].

2.1.2 Processing and Normalization

The ATAC-seq peaks for all 81 cell types were obtained from Yoshida et al, 2019[24].
The processing steps, as described in the paper, are as follows. The sequenced reads were mapped to the mm10 mouse genome using bowtie, and non-unique, ChrM-mapping and duplicate reads were filtered out using samtools and Picard Tools. Peak calling was performed with the MACS2 software using paired-end reads spanning less than 120 bp. Significant peaks were used to determine OCRs of length 250 bp centered on each peak summit. OCRs located in blacklisted genomic regions and ChrM homologous regions were then filtered out. The “peak height” of each OCR corresponds to the number of reads mapping to it.

The raw read counts were log2-transformed after adding a pseudocount of 2, and normalized by quantile normalization across the cell types by Dr. Sara Mostafavi. For the purposes of our analysis we additionally filtered out all OCRs on the X and Y chromosomes (because samples came from both male and female mice) and any OCR sequence with undetermined bases in the reference. This resulted in a total of 327,927 OCRs where a peak was identified in at least one of the 81 cell types.

2.2 The AI-TAC Model

2.2.1 Model Architecture

Bayesian optimization of model hyperparameters was performed by Mark Ma using the skopt package[44] to determine the optimal model architecture for our problem. We found that models with architectures similar to Basset[29] performed best, and chose a very similar model with three convolutional and two fully connected hidden layers. The details of AI-TAC's architecture are described below and are shown in Figure 2.1.

Each model input is a 251bp OCR sequence transformed via one-hot encoding into a 4x251 binary matrix such that each row corresponds to a nucleotide (A, T, G, C) and each column corresponds to a position along the sequence, as demonstrated in Figure 2.2. For example, if the sequence starts with a C, the first column of the one-hot-encoded sequence would contain a 1 in the 4th row and 0s in rows 1 through 3.
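The one-hot encoding just described can be sketched as follows, using the (A, T, G, C) row order given above:

```python
import numpy as np

# Row order (A, T, G, C), matching the description in the text.
NT_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}

def one_hot_encode(seq):
    """Convert a DNA string to a 4 x len(seq) binary matrix."""
    onehot = np.zeros((4, len(seq)), dtype=np.int8)
    for pos, nt in enumerate(seq):
        onehot[NT_INDEX[nt], pos] = 1
    return onehot

m = one_hot_encode("CAT")
# First column encodes 'C': a 1 in the 4th row, 0s elsewhere.
```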
We additionally pad the input sequence with 0's on either end to a total length of 271 to ensure every position is properly scanned by the first layer convolutional filters.

The first three hidden layers of the network are composed of convolutional filters. Each filter is a matrix that “scans” the length of the sequence to detect a specific pattern.

Figure 2.1: The architecture of the AI-TAC model.

Figure 2.2: Conversion of a sequence into one-hot encoded form.

More formally, for input X of length L with N input channels (an NxL matrix) and a convolutional filter W of width M (an NxM matrix), the output for that filter is a vector of length L−M+1 where the entry at position i is:

convolution(X)_i = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} W_{n,m} X_{n,i+m} + b        (2.1)

where b is a bias term that is optimized for each filter.

The first layer of AI-TAC consists of 300 convolutional filters of dimension 4x19 that scan the input for a particular sequence (i.e. motif). The second layer is 200 filters with dimensions 300x11 that detect relationships between the first layer filters, and the third layer is 200 filters with dimensions 200x7.

The output of each convolutional layer is passed through an activation function which introduces non-linearity to the model. AI-TAC utilizes the rectified linear activation function (ReLU), which thresholds values such that all inputs below 0 are set to 0:

ReLU(x) = x if x ≥ 0, and 0 if x < 0

The output of each convolutional layer is condensed using the maxpooling operation, which computes the maximum value within a contiguous window of the convolutional layer output. For a maxpool operator of width w and stride w (equal to its width) applied to a sequence X, the output at position i is computed as:

maxpool(X)_i = max{X_{wi}, X_{wi+1}, ..., X_{wi+w−1}}        (2.2)

The maxpool layer reduces the intermediate output of the layer, which reduces the number of parameters in the next layer, making the model optimization more robust.
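Equations 2.1 and 2.2 can be sketched numerically as follows. This is a toy two-channel example for illustration; AI-TAC's actual layers are implemented with PyTorch's optimized operators:

```python
import numpy as np

def convolve_1d(X, W, b=0.0):
    """Eq. 2.1: slide an (N, M) filter W across an (N, L) input X,
    summing the element-wise products at each offset."""
    N, L = X.shape
    _, M = W.shape
    return np.array([(W * X[:, i:i + M]).sum() + b
                     for i in range(L - M + 1)])

def maxpool_1d(x, w):
    """Eq. 2.2: maximum over non-overlapping windows of width w."""
    return np.array([x[i:i + w].max()
                     for i in range(0, len(x) - w + 1, w)])

X = np.array([[0, 1, 0, 1, 0, 0],
              [1, 0, 1, 0, 1, 1]], dtype=float)   # toy 2-channel input
W = np.array([[0, 1],
              [1, 0]], dtype=float)               # toy 2x2 filter
act = np.maximum(0.0, convolve_1d(X, W))          # convolution followed by ReLU
pooled = maxpool_1d(act, 2)                       # downsampled activations
```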
It also allows for more variability in the spacing of input features, thus accounting for biological variation in motif spacing. The convolutional layers of AI-TAC have maxpooling of width 3, 4, and 4, respectively, with stride equal to the width of the maxpool operator.

Next, we applied batch normalization, which enables faster model optimization and acts as an implicit regularizer of the model weights by normalizing the outputs of hidden layers to reduce their variance [27]. Each activation unit fed into the batch normalization layer is mapped to a corresponding output Y_i via the following equation:

Y_i = ((X_i − E[X_i]) / √(Var[X_i] + ε)) · γ + β        (2.3)

E[X_i] and Var[X_i] are the mean and variance, respectively, of activation X_i over a training mini-batch, and ε is a small constant added for numerical stability. During model evaluation (as opposed to model training) these values are substituted with the overall mean and variance computed over the training set. Parameters γ and β are learned during the training phase.

The output of the convolutional layers is then fed into a traditional fully connected network with 2 hidden layers. As the name implies, in a fully connected layer each input neuron is connected to each output neuron via a weight. The output vector Y of a fully connected layer consists of neurons Y_j computed in the following way:

Y_j = Σ_{i=1}^{I} w_{ij} X_i        (2.4)

where X is the layer input of length I. AI-TAC contains two hidden fully connected layers, each with 1000 output units. The second of these layers maps to the model's final 81-unit output (corresponding to each of the 81 cell types) via a linear transformation.

2.2.2 Model Training

The model was trained by minimizing the loss function below with respect to the model weights using the Adam optimizer [30]. We chose the cosine distance as the objective function:

f(Ŷ, Y) = 1 − (Ŷ · Y) / (|Ŷ||Y|)        (2.5)

This loss maximizes the Pearson correlation between the predicted and observed chromatin activity state across the 81 cell types for the training set OCRs.
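Equation 2.5 can be sketched as below (a NumPy sketch; the actual training used PyTorch's autograd-compatible operations). Note that the cosine distance equals one minus the Pearson correlation exactly when the two vectors are mean-centered:

```python
import numpy as np

def cosine_distance_loss(y_pred, y_true):
    """Eq. 2.5: one minus the cosine similarity between the predicted
    and observed accessibility profiles (toy vectors here;
    81-dimensional in AI-TAC)."""
    num = float(np.dot(y_pred, y_true))
    denom = float(np.linalg.norm(y_pred) * np.linalg.norm(y_true))
    return 1.0 - num / denom

same_direction = cosine_distance_loss(np.array([1.0, 2.0, 3.0]),
                                      np.array([2.0, 4.0, 6.0]))  # ~0: parallel profiles
opposite = cosine_distance_loss(np.array([1.0, 0.0]),
                                np.array([-1.0, 0.0]))            # ~2: anti-correlated
```

Because the loss depends only on the direction of the profile vectors, not their magnitudes, it rewards getting the relative pattern across cell types right.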
We chose this loss to emphasize accurate prediction of OCRs with differential activity profiles across the cell types, to aid in identifying sequence features correlated with differential activity during the model interpretation stage.

AI-TAC was implemented in PyTorch version 0.4.0. The training was performed using the Adam optimizer [30] for 10 epochs with a learning rate of 0.001 and a mini-batch size of 100. As an additional form of regularization we implement drop-out of 3% on the two hidden fully connected layers. Drop-out sets a random portion (in our case 3%) of the input neurons to zero for each training sample, which helps prevent over-fitting of the model to the training set [25].

2.3 Model Performance

2.3.1 Comparing to Randomized Null

We first trained AI-TAC on a randomly selected subset of OCRs consisting of 90% of the 327,927 OCRs in our dataset. We benchmarked the performance of AI-TAC on the remaining 10% of our dataset against simulated “null” datasets to ensure it had learned some meaningful predictive features. We compared AI-TAC to three different models, each trained on data randomized in one of the following ways:

• Randomly shuffled 251 base pair input sequences
• Randomly shuffled ATAC-seq activity vectors of length 81
• Randomly permuted order of input sequences and output vectors in the dataset

Figure 2.3: Performance of the model (measured by Pearson correlation) on real and shuffled mouse ATAC-seq data: (a) randomization by shuffling the sequences, (b) by permuting the chromatin accessibility profiles, and (c) by shuffling the assignment of each OCR to its accessibility profile.

Figure 2.3 shows the correlations between model predictions and ATAC-seq derived activity values for each of the test set samples, comparing the AI-TAC results to each of the three “null” models. We observed that the average correlation of the “null” model predictions is close to zero, as expected since there shouldn't be any detectable patterns in the randomized data. Notably, the shuffled sequence model has the highest average correlation compared to the other “null” models, reflecting the fact that GC content on its own is somewhat predictive of chromatin state because CpG islands are generally ubiquitously accessible[11]. In contrast, the average AI-TAC prediction correlation is 0.32, indicating that there is real signal and structure in the dataset that the model is able to exploit to make predictions.

2.3.2 10x10 Cross-validation

To ensure the results we obtained with AI-TAC were not highly sensitive to training set selection and model initialization, we performed a 10-fold cross-validation of the entire dataset a total of 10 times, thus obtaining 100 different models. In this way, each of the 327,927 OCRs was considered as a test OCR by ten different trained models.

We observed very stable test set distributions across all the models (Figure 2.4a).
Additionally, our results show that OCRs that were well predicted had the most stable predictions across the 10x10 cross-validation trials (Figure 2.4b), indicating that the model is able to consistently learn highly predictive features for that subset of OCRs regardless of training set selection.

Figure 2.4: (a) The prediction correlation for each OCR in the dataset when it was part of the test set, for all 10 cross-validation trials. (b) The mean test prediction correlation for each OCR across 10 independently trained models on the x-axis versus the range of correlation values on the y-axis.

2.3.3 Chromosome Leave-out

Enhancers regulating the activity of the same gene have been shown to have phenotypic redundancy[37]. To ensure our results were not overly optimistic due to the presence of highly similar OCR sequences (at functionally redundant enhancers) in both the training and test set, we performed a chromosome leave-out validation experiment. We tested the robustness of our model by leaving out each of the 19 mouse autosomes as a test set and training the model on the remainder of the data. The results in Figure 2.5 show that all 19 of these models have very similar test set prediction correlation distributions to AI-TAC, indicating that our model is not significantly impacted by the choice of training and test sets.

Figure 2.5: Boxplot showing the test performance of 19 separate models, trained by leaving each of the 19 autosomes out once.

2.3.4 Predictions Vary by OCR Type

We were further interested in characterizing the differences between OCRs that were predicted well versus those that were predicted poorly.
When we looked at AI-TAC's prediction accuracy versus the variance of the ATAC-seq signal across the 81 cell types at each OCR, we observed that the highly well-predicted peaks were most likely to also have high variance (Figure 2.6). This is unsurprising considering the Pearson correlation loss used to optimize AI-TAC emphasizes the accurate prediction of high-variance OCRs.

Figure 2.6: Variance of peak heights of each OCR on the x-axis versus AI-TAC prediction accuracy on the y-axis.

2.3.5 Model Performance on Human Data

There is a high degree of similarity between mouse and human regulatory sequences, and previous work has shown that deep learning models can generalize from mouse to human data[8]. We therefore decided to validate the performance of AI-TAC on an ATAC-seq dataset of human primary immune cell types from Corces et al, 2016[9].

Data processing and normalization steps identical to those described for the mouse data were performed by Caleb Lareau, providing us with 539,611 OCRs of 250bp each with chromatin activity values for 18 different cell types. For eight of the human immune cell types for which ATAC-seq data was available, closely matching equivalents were present in our mouse dataset (Table 2.1).
We compared the predictions of the AI-TAC model trained on mouse data to the observed chromatin activity in the corresponding human cell type, averaging over the prediction values when appropriate to obtain lineage-level predictions.

Figure 2.7 shows the results on real data versus three different types of randomized data, analogous to the test we performed on the mouse data:

• Randomly shuffled 251 base pair input sequences
• Randomly shuffled ATAC-seq activity vectors of length 81
• Randomly permuted order of input sequences and output vectors in the dataset

Despite never seeing human data in its training set, the average correlation between AI-TAC's predictions and the measured peak heights over the OCRs in the human dataset is 0.22, much higher than the model's performance on randomized data. The fact that AI-TAC generalizes enough to make meaningful predictions on another species indicates that it learned real biological signal in our dataset.

2.4 Summary

In summary, we trained a CNN with 3 convolutional and 2 fully connected layers to predict relative chromatin state at putative OCRs for 81 cell types simultaneously on the basis of 251bp DNA sequences. The model was trained using ATAC-seq data generated from mouse immune cells from six different lineages. The model was tested on a held-out set of OCRs from our mouse dataset, and compared to simulated “null” data. We also confirmed that the model performance is not sensitive to training set selection, by doing cross-validation and chromosome leave-out experiments. Finally, we showed that AI-TAC generalizes to human data by testing it on a human ATAC-seq dataset of closely matching immune cell types.
Figure 2.7: Performance of the model trained on mouse ATAC-seq data on human DNA sequences. Performance is measured by Pearson correlation between model predictions and ATAC-seq data for the closest corresponding cell type. Results on real data are compared to data randomized in three ways: (a) randomization by shuffling the sequences, (b) by permuting the chromatin accessibility profiles, and (c) by shuffling the assignment of each OCR to its accessibility profile.

Mouse cell types          Human cell types
B.Fo.Sp                   B
B.GC.CB.Sp                B
B.GC.CC.Sp                B
B.mem.Sp                  B
B.MZ.Sp                   B
B.PB.Sp                   B
B.Sp                      B
T.4.Nve.Fem.Sp            CD4
T.4.Nve.Sp                CD4
Treg.4.25hi.Sp            CD4
NKT.Sp                    CD4
T.8.Nve.Sp                CD8
T8.Tcm.LCMV.d180.Sp       CD8
T8.Tem.LCMV.d180.Sp       CD8
T8.TN.P14.Sp              CD8
LTHSC.34-.BM              HSC
LTHSC.34+.BM              HSC
DC.4+.Sp                  mDC
DC.8+.Sp                  mDC
Mo.6C-II-.Bl              Mono
Mo.6C+II-.Bl              Mono
NK.27-11b+.BM             NK
NK.27-11b+.Sp             NK
NK.27+11b-.BM             NK
NK.27+11b-.Sp             NK
NK.27+11b+.BM             NK
NK.27+11b+.Sp             NK
DC.pDC.Sp                 pDC

Table 2.1: The human immune cell types measured by Corces et al. [9], and their mouse counterparts. All mouse cell populations that map onto the same human cell type were averaged in comparative analyses.

Chapter 3

Part II: Model Interpretation

Given that we've shown that AI-TAC learned relevant sequence features that are predictive of chromatin state, we are interested in understanding what those features are.
Identifying important motifs, and combinations of motifs, that are correlated with chromatin state could provide important biological insights into the mechanisms of cell type-specific transcriptional regulation.

Section 3.1 describes how we extract the PWMs learned by the first layer convolutional filters of AI-TAC and compare them to known mouse TF motifs. In section 3.2 we provide the details of computing filter information content, reproducibility and influence values, which characterize the relative importance and complexity of the recovered PWMs. We then performed experiments quantifying the degradation in the model's performance when the first layer filters are limited to subsets of the most important motifs defined via the above metrics, the results of which are shown in section 3.3. Finally, in section 3.4 we describe our attempts to decipher the combinatorial logic used by AI-TAC to make its predictions in two ways: by examining the second layer convolutional filter weights, and by computing filter pair influence values.

3.1 Interpreting AI-TAC with First Layer Filters

To understand which sequence features are used by the model to make predictions, we examined the first convolutional layer. The first layer consists of 300 filters, each 19 units long, that scan the input sequence and learn to recognize specific motifs. These were analyzed by converting them to position weight matrices (PWMs). To do so, for each first layer filter, we first identified all 19 bp sequences that activate the filter by at least 1/2 of the maximum activation for that filter across all 51,732 well-predicted (Pearson correlation greater than 0.75) OCRs [29]. Next, we constructed a position frequency matrix based on the prevalence of each nucleotide along the 19 bp sequences, and finally we converted the position frequency matrix to a PWM using a background uniform nucleotide frequency of 0.25. This analysis yielded 300 PWMs, each capturing the motif that is detected by a
first layer filter.

Figure 3.1: PWMs corresponding to 4 different convolutional filters in the first layer of AI-TAC. The filter motifs are shown aligned to known transcription factor motifs (Pax5 and Sfpi1) in the CIS-BP database.

The number of sequences comprising each of the PWMs is shown in figure 3.2, plotted against the IC of each filter. The (log-scaled) number of sequences in each PWM is inversely proportional to the IC of the filter, which is unsurprising, as we expect low IC motifs to occur more frequently by chance within the DNA sequence.

These PWMs were then compared to known mouse transcription factor motifs in the CIS-BP Mus musculus database [50] using the Tomtom motif comparison tool [20]. About a third of the first layer filters have a significant match to known TF motifs. Figure 3.1 shows four examples of filters that closely match the known mouse TF binding motifs Pax5 and Sfpi1/PU.1. In the case of these TFs (and many others) the model learned both the forward and reverse complement of the motif.

3.2 Filter Properties

To help characterize and prioritize the first layer filters of AI-TAC we computed the following metrics: the information content of each filter motif, the reproducibility of each filter across multiple training iterations of the model, and the influence of each filter on the model predictions. Appendix
A lists these metrics for each of the 300 filters, and the details of how they are obtained are described in the remainder of this section.

Figure 3.2: The number of sequences comprising each first layer filter PWM (log-scaled) versus the IC of each PWM, color-coded by the log of the filter influence value.

3.2.1 Information Content

We computed the information content of each motif using the following formula:

IC = Σ_{i,j} p_{ij} log2(p_{ij}) − Σ_{i,j} b_j log2(b_j)    (3.1)

where p_{ij} is the probability of observing nucleotide j at position i based on the observed frequencies, and b_j is the background frequency of nucleotide j (set to 0.25 in our case).

3.2.2 Reproducibility

To understand how sensitive the convergence of the first layer filters is to the random training data split and model weight initializations, we created a reproducibility metric. We trained 10 additional models using different random 90% subsets of the ATAC-seq dataset, then extracted the 300 filter PWMs from all 10 models. We measured the similarity between the filters of AI-TAC and those from each of the other models using the Tomtom PWM comparison tool [20]. We then defined the reproducibility score of each AI-TAC filter motif as the number of models with at least one matching motif, using an FDR q-value cut-off of 0.05 on the Tomtom results.

About a third of the filters are highly consistent between different training iterations of the model (Figure 3.3). The highly reproducible filters are much more likely to match a known TF binding motif than less reproducible ones.
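As an illustration of the PWM construction and of Equation 3.1, the computation can be sketched in NumPy. The helper names below are our own, not from the AI-TAC source; with b_j = 0.25 the background term contributes 2 bits per position:

```python
import numpy as np

BACKGROUND = 0.25  # uniform background frequency b_j

def position_frequency_matrix(subseqs):
    """Counts of each nucleotide at each position, accumulated over the
    one-hot 19-bp subsequences (n x 4 x 19) that activated a filter."""
    return np.asarray(subseqs).sum(axis=0)

def pfm_to_probabilities(pfm, pseudocount=1e-6):
    """Column-normalize the counts into per-position probabilities p_ij."""
    pfm = pfm + pseudocount
    return pfm / pfm.sum(axis=0, keepdims=True)

def information_content(p, b=BACKGROUND):
    """IC = sum_ij p_ij log2(p_ij) - sum_ij b_j log2(b_j), as in Eq. 3.1."""
    n_positions = p.shape[1]
    entropy_term = float((p * np.log2(p)).sum())
    background_term = 4 * n_positions * b * np.log2(b)
    return entropy_term - background_term
```

Under this formula a uniform column scores 0 bits and a perfectly conserved column scores 2 bits, so a fully conserved 19 bp motif would top out at 38 bits.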
Additionally, the high information content filters are more likely to be reproducible, perhaps because obtaining a high IC motif is less likely during the optimization stage, and therefore these are learned only if they are highly predictive.

3.2.3 Influence

The influence of each filter was computed by effectively removing the filter from AI-TAC and quantifying the impact on the model's predictions. Specifically, we replaced all activation values for the given filter with its average activation value across all samples in the batch, then fed the output through the remaining layers of the model to obtain the altered prediction vector [29]. The overall influence value for a given filter was computed as the average (across OCRs) of the squared difference between the prediction-accessibility correlations (the loss) of the altered and unaltered models.

Figure 3.3: The reproducibility value of each AI-TAC first layer filter across 10 independent training iterations on the x-axis versus the information content of each filter on the y-axis. Each filter is color-coded by whether or not it matched a TF binding motif in the CIS-BP database with Tomtom q-value less than 0.05.

Figure 3.4 shows the importance of each filter as a function of the information content of its PWM. Notably, high IC filters tend to have a bigger impact on model predictions. This is likely for the same reason that reproducibility values correlate with IC: these complex motifs are more difficult to learn, and are therefore learned by the model only if they are highly predictive of chromatin state.

Additionally, there is a strong correlation between filter influence values and whether or not the filter motif matches a known mouse TF motif, indicating that for the most part motifs that are strongly correlated with regulatory activity have already been characterized. Reassuringly, many of the high influence filters correspond to motifs of known pioneer factors that are responsible for establishing chromatin accessibility, for example Sfpi1 (PU.1), Pax5 and Cebp. The model also recovered the high information content motif of Ctcf, which plays an important role in the function of insulators.

To understand the cell type-specific impact of each filter, we also computed an influence profile per cell population, as the average (across OCRs) of the squared difference between the original and modified prediction values for each output neuron (corresponding to a given cell population). We additionally computed a signed version of the influence profile, shown in Figure 3.5, by taking the difference between altered and unaltered predictions for each neuron, which shows whether the presence of each filter is predictive of higher or lower chromatin accessibility in each cell population.

Figure 3.4: Information content (x-axis) versus the log of influence (y-axis) for all 300 first layer filters. Filters with PWMs matching known TF motifs in the CIS-BP Mus musculus database are indicated in orange.

All influence values were computed using the 51,732 OCRs for which the AI-TAC prediction has greater than 0.75 correlation with the ground-truth chromatin accessibility. To ensure that including well-predicted OCRs from the training set would not bias our results, we compared the influence values obtained from well-predicted test and training set OCRs.
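The mean-ablation influence of section 3.2.3 can be sketched as follows; `rest_of_net` stands in for everything above AI-TAC's first convolutional layer, and the names and shapes are our own assumptions, not the original implementation:

```python
import numpy as np

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def filter_influence(acts, rest_of_net, truth, filt):
    """Mean-ablation influence of one first-layer filter.

    acts:        (n_ocrs, n_filters, n_positions) first-layer activations
    rest_of_net: callable mapping activations -> (n_ocrs, n_cells) predictions
    truth:       (n_ocrs, n_cells) observed accessibility profiles
    filt:        index of the filter to ablate
    """
    base = rest_of_net(acts)
    ablated_acts = acts.copy()
    # replace the filter's activations with its batch-average activation
    ablated_acts[:, filt, :] = acts[:, filt, :].mean(axis=0)
    ablated = rest_of_net(ablated_acts)
    # average (over OCRs) squared change in the per-OCR correlation loss
    return float(np.mean([
        (pearson(b, t) - pearson(a, t)) ** 2
        for b, a, t in zip(base, ablated, truth)
    ]))
```

By construction, a filter whose activations are constant across the batch is unchanged by the ablation and receives zero influence.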
The influence values computed on these two sets were strongly correlated. Additionally, the filter PWMs obtained from the same model using the test versus training set OCRs were virtually identical, with a Tomtom q-value of 2.9 × 10^-27 or less for all but one pair of test set and training set derived PWMs.

The per-cell type influence values shown in Figure 3.5 indicate that some TF motifs have highly cell type-specific predictive value, while others have more general importance across all lineages. There are many instances where the importance placed by AI-TAC on certain TF motifs recapitulates known biology. For example, filters corresponding to Pax5 and Ebf1, which are necessary for the commitment to and maintenance of the B-cell lineage [33, 36], have high influence values exclusively among the B cells. The influence values of filters matching the motif recognized by both the Eomes and Tbx21 (T-bet) proteins recapitulate their importance in establishing the NK and CD8+ T-cell lineages [26, 51]. The Runx and Ets TF families play an important role in many different stages of haematopoiesis and are expressed across the immune cell lineages [15, 49], which is reflected in the importance of their motifs in Figure 3.5. The Sfpi1/PU.1 motif has a high level of redundancy in the model and the highest influence overall, and has particularly high influence values for the stem, B-cell and myeloid lineages. Sfpi1 is responsible for determining immune cell fate, with high concentrations promoting myeloid differentiation while low concentrations promote B-cell differentiation [12].
Interestingly, the model produces high Sfpi1 influence values for both the B-cell and myeloid lineages, even though it is expected to be highly expressed only in the myeloid cells.

Figure 3.5: Cell
type-specific log2-scaled influence values for the 99 reproducible filters found in at least 8 out of 10 different models.

3.3 Fine-tuning Model with Filter Subset

Figure 3.6: Model performance on the test set using subsets of first layer motifs. In blue are experiments with randomly selected filter subsets, repeated 100 times for subsets of size 10 to 300 filters (in increments of 10). In orange are results for manually defined filter subsets selected based on filter properties: reproducible8, reproducible9 and reproducible10 correspond to all filters with a reproducibility metric of at least 8, 9, or 10, respectively; annotated corresponds to all filters with a CIS-BP TF motif match with q-value 0.05 or less; influence corresponds to the 98 highest influence filters.

Because we expect the reproducible filters to be the most critical for good model performance, we decided to test how well the model predictions can be recovered with only the 99 filters found in at least 8 out of 10 additional models. We performed the experiment by removing all first layer weights that do not belong to the set of 99 reproducible filters, as well as all second layer weights corresponding to the removed filters. The first layer weights are then frozen, and the model is fine-tuned (with an adjusted learning rate of 0.0001) on the training set OCRs. The fine-tuning was performed for 10 additional epochs, and the model with the best performance on the validation set was retained (typically found after 2-5 epochs).

The model used for these experiments was obtained by copying the AI-TAC first layer weights into a randomly initialized model, freezing the first layer filters, and training the model for 10 epochs. The best model, obtained after 3 epochs, was selected based on performance on a validation set.
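The truncation step of this experiment (before freezing and fine-tuning) can be sketched as a masking operation on the weight tensors; the shapes and names below are illustrative assumptions, not taken from the AI-TAC code:

```python
import numpy as np

def restrict_to_filter_subset(w1, w2, keep):
    """Zero out every first-layer filter outside `keep`, together with
    the second-layer weights that read from the removed filters.

    w1:   (n_filters, 4, 19)      first-layer convolution weights
    w2:   (n2, n_filters, width)  second-layer convolution weights
    keep: indices of the filters to retain (e.g. the 99 reproducible ones)
    """
    mask = np.zeros(w1.shape[0], dtype=bool)
    mask[list(keep)] = True
    w1_kept = np.where(mask[:, None, None], w1, 0.0)
    w2_kept = np.where(mask[None, :, None], w2, 0.0)
    return w1_kept, w2_kept
```

After masking, the first layer would be frozen and the remaining layers fine-tuned at the reduced learning rate of 0.0001 described above.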
This yielded a model with filters identical to AI-TAC's and very similar performance on the test set OCRs.

As a null control, we compared these results to models in which a random subset of first layer filters was retained. For subsets of size 10 to 300 filters (in increments of 10) we repeated the fine-tuning procedure 100 times with randomly selected subsets (results in Figure 3.6). Additionally, we tested the model performance when including only filters reproduced in 9 out of 10 trials (82 filters), 10 out of 10 trials (70 filters), all filters matching a CIS-BP TF motif with a Tomtom q-value less than 0.05 (61 filters), and filters with influence greater than 0.0001 (98 filters). We found that all of these subsets were sufficient to obtain average performance almost as high as that of the unaltered model, in contrast to the randomly selected filter sets. Subsets selected by reproducibility, influence and annotation status all have very similar performance, which is unsurprising since those metrics are highly correlated and produce very similar filter subsets.

Figure 3.7 shows the performance of the 99 reproducible filter model at the OCR level. The correlations between the truncated model predictions and the observed OCR activity profiles are virtually unchanged compared to the predictions of the full AI-TAC model.

The fact that AI-TAC's predictive power can be almost entirely recovered with only a third of the first layer filters indicates that the majority of these filters are not meaningful feature detectors. This is consistent with previous findings that the number of first layer filters required for a CNN to learn the entire set of relevant motifs for a particular classification problem is larger than the set itself [32].
This necessary over-parameterization is likely due to the difficulty of converging to a meaningful PWM from a random filter initialization during the model optimization stage.

Figure 3.7: Correlations between predictions and ground truth peak heights for test set OCRs, for the full AI-TAC model versus the model with only the 99 most reproducible filters.

3.4 Detecting TF Cooperativity with AI-TAC

The model predictions do not depend on the detection of individual motifs alone, but rather on patterns of motif combinations and their surrounding sequence context. The way AI-TAC models these relationships between sequence features should reflect, to some degree, the underlying biological mechanisms that drive cell type-specific chromatin accessibility patterns. We attempted to understand this combinatorial logic of the model in two ways: by analyzing the weights of the second layer convolutional filters and by computing combined influence values for select filter pairs.

3.4.1 Second Layer Filters

Because higher order relationships between the first layer motifs are encoded in the deeper layers of the network, an obvious first attempt at identifying important filter combinations is to look for combinations of motifs assembled by the second layer convolutional filters [5]. Due to the maxpooling applied to the first layer output, constructing clear motifs from second layer activations was not possible, and we instead examined the second layer filter weights directly.

Figure 3.8: (a) Maximum weights between all reproducible first layer filters belonging to a cluster and a subset of second layer filters. Second layer filters were selected based on a weight threshold of 0.7 for at least one first layer motif.
For a few examples, we visualized second layer filter weights for the most heavily weighted first layer motifs along with the corresponding PWMs. (b-c) Two instances of a second layer motif aggregating similar first layer motifs. (d-e) Two examples of a second layer filter recognizing reverse complements of first layer motifs.

We found that in a large number of cases the second layer filters recognized similar (Figures 3.8b, 3.8c) or reverse complement (Figures 3.8d, 3.8e) first layer motifs. This indicates that the second layer convolutional filters are assembling "cleaner" versions of first layer motifs rather than learning the combinatorial logic between them. Figure 3.8a shows the magnitude of the weights placed on each first layer convolutional filter by each second layer filter, with the first layer PWMs clustered by similarity. Clustering was performed by Ricardo Ramirez using RSAT [7]; more details can be found in Maslova et al., 2019 [34]. The figure shows, on a more global scale, the trend of second layer filters agglomerating similar motifs from the first layer. This is consistent with the findings of Koo and Eddy, 2019 [32] for convolutional filters with smaller maxpooling windows. Their study suggests that increasing the maxpooling size in the first layer can force the first layer weights to converge to more complete motifs during training. The second layer filters then correspond to the interactions between these motifs.

3.4.2 Filter Pair Influence

To see whether AI-TAC was able to detect any instances of cooperative activity between TFs, we looked for evidence of non-additive effects on the model predictions when particular motif pairs are present in the input sequence. To do this, we computed filter pair influence values in a similar way as for single filters, by removing both filters from the first layer of the model at once and quantifying the change in AI-TAC's predictions.
To make this task computationally tractable, we selected the 40 most important filters by influence and reproducibility metrics and computed pair influence values for all 1600 possible pairs.

Figure 3.9 shows these pair influence values against the sum of the individual influence values of the filters in each pair. For filter pairs that correspond to TFs that act independently, we expect the effect of removing both filters at once to equal the sum of their individual influence values (additive effects). However, if both filters are required to accurately predict the chromatin activity profile of an OCR, as would be expected for TFs that cooperate in vivo, removing either filter individually should produce a similar impact as removing both filters at once. In cases where two different TFs have redundant function, such that the presence of either motif alone is sufficient to accurately predict the OCR activity profile, the impact of removing both TF filters will be greater than the sum of their individual influence values.

Figure 3.9: Sum of the individual filter influence values of a given filter pair on the x-axis versus the influence value computed when removing both filters in the pair from the model simultaneously (y-axis). Points are colored by filter pair similarity.

As seen in Figure 3.9, filter pairs composed of highly similar motifs have, for the most part, higher than expected pair influence values. It is likely that these similar filters serve a redundant purpose in the model, recognizing the same motifs within the input sequence, and that their individual influence values are therefore underestimated. The more interesting cases are the highly dissimilar pairs that have unexpectedly high pair influence values, as these may be indicative of a biological redundancy rather than a technical one.
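This comparison against the additive expectation can be sketched as follows, with `influence_fn` standing in for the joint mean-ablation procedure described above (all names are our own, hypothetical ones):

```python
from itertools import combinations

def pair_influences(influence_fn, filters):
    """Joint ablation influence for every pair among the selected filters;
    influence_fn(set_of_indices) ablates all listed filters at once and
    returns the resulting influence value."""
    return {(a, b): influence_fn({a, b})
            for a, b in combinations(sorted(filters), 2)}

def deviation_from_additivity(pairs, singles):
    """Pair influence minus the sum of the two single-filter influences.
    Large positive deviations suggest redundant motifs; negative
    deviations suggest both motifs are needed jointly, a possible
    signature of TF cooperativity."""
    return {(a, b): v - (singles[a] + singles[b])
            for (a, b), v in pairs.items()}
```

For 40 selected filters this evaluates 780 unordered pairs, which is what makes the restriction to the top filters necessary in practice.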
There are six such filter pairs (corresponding to four different TF combinations) listed in Table 3.1, which are interesting candidates for further investigation.

There are also several filter pairs, listed in Table 3.2, that have lower than expected pair influence values, implying that the detection of both motifs is necessary for a correct model prediction, and perhaps indicative of cooperativity between the TFs. Our results indicate that the model is detecting possible cooperativity between the pioneer factors Sfpi1 and Cebp, and Tcf3, a TF critical for normal B and T-cell development [10].

filter174 - Cebp    filter255 - Sfpi1
filter133 - Sfpi1   filter174 - Cebp
filter133 - Sfpi1   filter165 - Irf1
filter10 - Runx     filter11 - Ets1
filter15 - Runx     filter11 - Ets1
filter11 - Ets1     filter78 - Tcf3/Id3

Table 3.1: Filter pairs with higher than expected pair influence values.

filter255 - Sfpi1   filter78 - Tcf3/Id3
filter255 - Sfpi1   filter112 - Tcf3
filter255 - Sfpi1   filter93 - Zeb1/Tcf3/Id3
filter174 - Cebp    filter78 - Tcf3/Id3

Table 3.2: Filter pairs with lower than expected pair influence values.

The efficacy of this method for determining combinatorial logic within the model is likely hampered by the redundancy present among the first layer motifs of AI-TAC. This approach may prove more fruitful if applied to a CNN with unique PWMs in the first layer, for example a model that is initialized with frozen PWMs of known TF motifs, such as OrbWeaver [3].

3.5 Summary

In conclusion, we extracted the PWMs of all 300 first layer filters of AI-TAC to understand which sequence features are informative of local chromatin accessibility. We compared these filters to a database of known mouse TF motifs, and computed metrics of IC, reproducibility and influence for each filter.
A table containing the above information for all 300 first layer filters can be found in Appendix A.

We found that about a third of the filter PWMs matched closely to known TF motifs, and that this subset also has high filter reproducibility and influence values. We then performed an experiment to see how much of the model accuracy would be retained when using only the top third of filters determined with our importance metrics. We found that the model performance remains almost unchanged, and is considerably higher than the performance of a randomly selected subset of filters.

Finally, we looked for evidence of TF cooperativity in our model. We examined the second convolutional layer of AI-TAC to determine which combinations of first layer filters are weighted highly when detected together. We found that the second layer filters tend to group similar or reverse complement first layer motifs. We then computed influence values for selected pairs of first layer filters, and found several pairs that may be of interest for further investigation.

Chapter 4

Conclusions

4.1 Summary

We trained a CNN model using ATAC-seq data to predict chromatin accessibility across a large set of related immune cell types from sequence alone. We used Pearson correlation as the loss function for training our model, which prioritized the accurate prediction of regions with variable activity across different cell populations. Based on comparisons of model predictions on held-out test examples and human DNA sequences to results on simulated "null" data, we concluded that the model learned some biologically meaningful features. We additionally showed that the predictive performance of AI-TAC is stable with respect to the choice of training and test set splits.

The second part of the thesis was dedicated to extracting predictive sequence features from the model.
We first identified the PWMs learned by the first layer convolutional filters of AI-TAC and found that many of them closely resemble the binding motifs of known TFs. We additionally assigned two metrics of importance to every filter: influence and reproducibility. Influence values correspond to the impact on the model predictions of removing the filter, while reproducibility is the number of independent training iterations in which the filter was recovered. We validated these metrics by testing the model while using just a third of the most important filters and showing that the performance is almost as good as that of the unaltered AI-TAC model.

In the last part of this thesis we attempted to understand the combinatorial logic within AI-TAC that may be representative of TF interactions in vivo. We tried two different approaches for this task: examining the second layer convolutional filter weights to understand how the model combines the first layer filters, and computing influence values for selected pairs of filters. The latter analysis yielded a handful of candidate motif pairs for further investigation. However, a full understanding of the combinatorial effects of motifs remains a challenge.

4.2 Discussion

Notably, the vast majority of the filters identified in this study with high influence and reproducibility values matched closely to known mouse TF motifs. In addition, the cell type-specific influence values of these filters were largely in agreement with the known roles of the corresponding TFs in immune cell differentiation. While it is comforting that our results coincide closely with prior biological knowledge, the question arises of why we are not seeing many novel motifs in our model. One possibility is that the motifs learned both by AI-TAC and through the many individual biological experiments aimed at understanding immune cell differentiation represent the lowest hanging fruit of relevant TFs - i.e.
the most prevalent and statistically significant motif instances within the data. Alternatively, it is possible that the vast majority of TFs that are relevant to chromatin state within immune cells have already been characterized by the immunology field. Fully explaining cell type-specific chromatin state would then require understanding the interactions between these TFs and the role of other epigenetic mechanisms in chromatin accessibility.

There are a number of epigenetic mechanisms that impact chromatin state and are not directly reflected in the DNA code, and these may limit how well the model can predict chromatin state from short, local DNA sequence alone. These mechanisms include:

• DNA methylation within regulatory sequences, which can affect the binding affinity between TFs and their motifs [45]
• Modifications of histones within nucleosomes, which can alter nucleosome stability [31]
• Indirect binding of important TFs at regulatory sites via protein-protein interactions with other TFs [45]
• Removal of nucleosomes via interaction with distal regulatory sites [31]

The results of repeated cross-validation experiments, depicted in Figure 2.4b, do validate the notion that for some OCRs the model fails to make consistently good predictions. Although a subset of OCRs is uniformly very well predicted across different training iterations of the model, many OCRs were on average predicted poorly. This can be partially explained in terms of the variance of chromatin state at each OCR: we can expect that high prediction correlation values are more difficult to obtain for very uniform activity vectors. We do, in fact, observe that peak variance is correlated with the prediction accuracy of AI-TAC (Figure 2.6).
However, it is also likely that some of the epigenetic mechanisms listed above impact chromatin accessibility in ways that are undetectable to AI-TAC.

Cooperative interaction between TFs may be the other key to understanding how chromatin accessibility is controlled with high precision during differentiation. We already noted some of the potential technical reasons why our approaches to understanding TF cooperativity did not yield many candidate TF interactions, namely that our model architecture allowed for large amounts of redundancy among the first layer filters as well as dispersed representations of motifs. However, there may also be little evidence of TF cooperativity in our dataset for biological reasons. It is possible that chromatin accessibility within this biological system can be largely predicted from the motifs of lineage-specific pioneer factors and the additive effects of differentially expressed TFs that contribute to accessibility via passive competition with nucleosomes for DNA binding. Improved methods for extracting the combinatorial logic used by the model would help clarify these questions.

4.3 Future Work

Understanding how various epigenetic mechanisms contribute to chromatin state can be explored by integrating additional genomic data within a CNN framework. For example, DNA methylation and histone modification assays can be added to the sequence information to provide better predictions of chromatin accessibility. The role of long-range interactions between regulatory elements can be examined by incorporating data from a DNA conformation assay such as Hi-C.

The analysis of TF cooperativity can be improved by adjusting the architecture of the model to make our approaches more informative. For example, the maxpooling after the first convolutional layer can be increased to make the second layer filters more interpretable.
Additionally, eliminating redundancies within the first convolutional layer would make the influence values for both individual filters and filter pairs more informative. Finally, increasing the resolution of the model output, for example by changing it to per-base pair predictions of ATAC-seq read depth, could provide better information about the impact of the spatial organization of motifs on chromatin state.

Bibliography

[1] Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.

[2] Robin Andersson and Albin Sandelin. Determinants of enhancer and promoter activities of regulatory elements. Nature Reviews Genetics, 21(February):71–87, 2020.

[3] Nicholas E. Banovich, Yang I. Li, Anil Raj, Michelle C. Ward, Peyton Greenside, Diego Calderon, Po Yuan Tung, Jonathan E. Burnett, Marsha Myrthil, Samantha M. Thomas, Courtney K. Burrows, Irene Gallego Romero, Bryan J. Pavlovic, Anshul Kundaje, Jonathan K. Pritchard, and Yoav Gilad. Impact of regulatory variation across human iPSCs and differentiated cells. Genome Research, 28(1):122–131, 2018.

[4] Oliver Bell, Vijay K. Tiwari, Nicolas H. Thomä, and Dirk Schübeler. Determinants and dynamics of genome accessibility. Nature Reviews Genetics, 2011.

[5] Nicholas Bogard, Johannes Linder, Alexander B. Rosenberg, and Georg Seelig. A deep neural network for predicting and engineering alternative polyadenylation. Cell, 178(1):91–106.e23, 2019.

[6] Jason D. Buenrostro, Beijing Wu, Howard Y. Chang, and William J. Greenleaf. ATAC-seq: A method for assaying chromatin accessibility genome-wide. Current Protocols in Molecular Biology, 109(1):21.29.1–21.29.9, 2015.

[7] Jaime Abraham Castro-Mondragon, Sébastien Jaeger, Denis Thieffry, Morgane Thomas-Chollier, and Jacques van Helden. RSAT matrix-clustering: Dynamic exploration and redundancy reduction of transcription factor binding motif collections.
Nucleic Acids Research,45(13):1–13, 2017.46Bibliography[8] Ling Chen, Alexandra E. Fish, and John A. Capra. Prediction of generegulatory enhancers across species reveals evolutionarily conserved se-quence properties. PLoS Computational Biology, 14(10), 2018.[9] M. Ryan Corces, Jason D. Buenrostro, Beijing Wu, Peyton G. Green-side, Steven M. Chan, Julie L. Koenig, Michael P. Snyder, Jonathan K.Pritchard, Anshul Kundaje, William J. Greenleaf, Ravindra Majeti,and Howard Y. Chang. Lineage-specific and single-cell chromatin ac-cessibility charts human hematopoiesis and leukemia evolution. NatureGenetics, 48(10):1193–1203, 2016.[10] Rene´e F. De Pooter and Barbara L. Kee. E proteins and the regulationof early lymphocyte development. Immunological Reviews, 238(1):93–109, 2010.[11] Aime´e M. Deaton and Adrian Bird. CpG islands and the regulation oftranscription. Genes and Development, 25(10):1010–1022, 2011.[12] Rodney P. DeKoter and Harinder Singh. Regulation of B lymphocyteand macrophage development by graded expression of PU.1. Science,288(5470):1439–1442, 2000.[13] Go¨kcen Eraslan, Zˇiga Avsec, Julien Gagneur, and Fabian J. Theis.Deep learning: new computational modelling techniques for genomics.Nature Reviews Genetics, 20(7):389–403, 2019.[14] Oriol Fornes, Jaime A. Castro-Mondragon, Aziz Khan, Robin van derLee, Xi Zhang, Phillip A. Richmond, Bhavi P. Modi, Solenne Correard,Marius Gheorghe, Damir Baranasˇic´, Walter Santana-Garcia, Ge Tan,Jeanne Che`neby, Benoit Ballester, Franc¸ois Parcy, Albin Sandelin,Boris Lenhard, Wyeth W. Wasserman, and Anthony Mathelier. JAS-PAR 2020: update of the open-access database of transcription factorbinding profiles. Nucleic acids research, 48(D1):D87–D92, 2020.[15] Lee Ann Garrett-Sinha. Review of Ets1 structure, function, and rolesin immunity. Cellular and Molecular Life Sciences, 70(18):3375–3390,2013.[16] Miklos Gaszner and Gary Felsenfeld. Insulators: Exploiting transcrip-tional and epigenetic mechanisms. 
Nature Reviews Genetics, 7(9):703–713, 2006.47Bibliography[17] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, andMichael A. Beer. Enhanced Regulatory Sequence Prediction UsingGapped k-mer Features. PLoS Computational Biology, 10(7), 2014.[18] Charles E. Grant, Timothy L. Bailey, and William Stafford Noble.FIMO: Scanning for occurrences of a given motif. Bioinformatics,27(7):1017–1018, 2011.[19] Michael Gribskov. Identification of sequence patterns, motifs and do-mains. In Encyclopedia of Bioinformatics and Computational Biology:ABC of Bioinformatics, volume 1-3, pages 332–340. 2018.[20] Shobhit Gupta, John A. Stamatoyannopoulos, Timothy L. Bailey, andWilliam Stafford Noble. Quantifying similarity between motifs. GenomeBiology, 8(2):R24, 2007.[21] Tatsunori Hashimoto, Richard I Sherwood, Daniel D Kang, Nisha Ra-jagopal, Amira A Barkal, Haoyang Zeng, Bart J M Emons, SharanyaSrinivasan, Tommi Jaakkola, and David K Gifford. A synergistic DNAlogic predicts genome-wide chromatin accessibility. Genome Research,26:1430–1440, 2016.[22] Sven Heinz, Christopher Benner, Nathanael Spann, Eric Bertolino,Yin C Lin, Peter Laslo, Jason X Cheng, Cornelis Murre, HarinderSingh, and Christopher K Glass. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Re-quired for Macrophage and B Cell Identities. Molecular Cell, 38:576–589, 2010.[23] Tracy S.P. Heng, Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, Fran-cis S. Kim, Amy J. Wagers, Natasha Asinovski, Scott Davis, Marlys Fas-sett, Markus Feuerer, Daniel H.D. Gray, Sokol Haxhinasto, Jonathan A.Hill, Gordon Hyatt, Catherine Laplace, Kristen Leatherbee, DianeMathis, Christophe Benoist, Radu Jianu, David H. Laidlaw, J. AdamBest, Jamie Knell, Ananda W. Goldrath, Jessica Jarjoura, Joseph C.Sun, Yanan Zhu, Lewis L. Lanier, Ayla Ergun, Zheng Li, James J.Collins, Susan A. Shinton, Richard R. 
Hardy, Randall Friedline, Kate-lyn Sylvia, and Joonsoo Kang. The immunological genome project:Networks of gene expression in immune cells. Nature Immunology,9(10):1091–1094, 2008.48Bibliography[24] Authors Hideyuki Yoshida, Caleb A Lareau, Ricardo N Ramirez, Ja-son D Buenrostro, Christophe Benoist, Hideyuki Yoshida, Samuel ARose, Barbara Maier, Aleksandra Wroblewska, Fiona Desland, AlekseyChudnovskiy, Arthur Mortha, Claudia Dominguez, Julie Tellier, EdyKim, Dan Dwyer, Susan Shinton, Tsukasa Nabekura, YiLin Qi, BingfeiYu, Michelle Robinette, Ki-Wook Kim, Amy Wagers, Andrew Rhoads,Stephen L Nutt, Brian D Brown, Sara Mostafavi, and The Immunolog-ical Genome Project. The cis-Regulatory Atlas of the Mouse ImmuneSystem. Cell, 176:897–912, 2019.[25] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever,and Ruslan R. Salakhutdinov. Improving neural networks by preventingco-adaptation of feature detectors, 2012. arXiv:1207.0580.[26] Andrew M. Intlekofer, Naofumi Takemoto, E. John Wherry, Sarah A.Longworth, John T. Northrup, Vikram R. Palanivel, Alan C. Mullen,Christopher R. Gasink, Susan M. Kaech, Joseph D. Miller, LaurentGapin, Kenneth Ryan, Andreas P. Russ, Tullia Lindsten, Jordan S. Or-ange, Ananda W. Goldrath, Rafi Ahmed, and Steven L. Reiner. Effectorand memory CD8+ T cell fate coupled by T-bet and eomesodermin.Nature Immunology, 6(12):1236–1244, 2005.[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In 32ndInternational Conference on Machine Learning, ICML 2015, volume 1,pages 448–456. International Machine Learning Society (IMLS), 2015.[28] David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger,Cory Y. McLean, and Jasper Snoek. Sequential regulatory activ-ity prediction across chromosomes with convolutional neural networks.Genome Research, 28(5):739–750, 2018.[29] David R. Kelley, Jasper Snoek, and John L. Rinn. 
Basset: Learningthe regulatory code of the accessible genome with deep convolutionalneural networks. Genome Research, 26(7):990–999, 2016.[30] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochas-tic optimization. In 3rd International Conference on Learning Repre-sentations, ICLR 2015 - Conference Track Proceedings. InternationalConference on Learning Representations, ICLR, 2015.49Bibliography[31] Sandy L. Klemm, Zohar Shipony, and William J. Greenleaf. Chromatinaccessibility and the regulatory epigenome. Nature Reviews Genetics,20(April):207–220, 2019.[32] Peter K. Koo and Sean R. Eddy. Representation learning of genomicsequence motifs with convolutional neural networks. PLOS Computa-tional Biology, 15(12):e1007560, 2019.[33] Elizabeth M. Mandel and Rudolf Grosschedl. Transcription control ofearly B cell differentiation. Current Opinion in Immunology, 22(2):161–167, 2010.[34] Alexandra Maslova, Ricardo N. Ramirez, Ke Ma, Hugo Schmutz,Chendi Wang, Curtis Fox, Bernard Ng, Christophe Benoist, SaraMostafavi, and the Immunological Genome Project. Learning immunecell differentiation. bioRxiv, page 2019.12.21.885814, dec 2019.[35] James P Noonan and Andrew S Mccallion. Genomics of Long-RangeRegulatory Elements. Annu. Rev. Genomics Hum. Genet, 11:1–23,2010.[36] Stephen L. Nutt, Barry Heavey, Antonius G. Rolink, and Meinrad Bus-slinger. Commitment to the B-lymphoid lineage depends on the tran-scription factor Pax5. Nature, 402(S6763):14–20, 1999.[37] Marco Osterwalder, Iros Barozzi, Virginie Tissie´res, Yoko Fukuda-Yuzawa, Brandon J. Mannion, Sarah Y. Afzal, Elizabeth A. Lee, YiwenZhu, Ingrid Plajzer-Frick, Catherine S. Pickle, Momoe Kato, Tyler H.Garvin, Quan T. Pham, Anne N. Harrington, Jennifer A. Akiyama,Veena Afzal, Javier Lopez-Rios, Diane E. Dickel, Axel Visel, and Len A.Pennacchio. Enhancer redundancy provides phenotypic robustness inmammalian development. Nature, 554(7691):239–243, 2018.[38] Daniel Quang and Xiaohui Xie. 
DanQ: A hybrid convolutional andrecurrent deep neural network for quantifying the function of DNAsequences. Nucleic Acids Research, 44(11), 2016.[39] Alicia N. Schep, Beijing Wu, Jason D. Buenrostro, and William J.Greenleaf. ChromVAR: Inferring transcription-factor-associated acces-sibility from single-cell epigenomic data. Nature Methods, 14(10):975–978, 2017.50Bibliography[40] Manu Setty and Christina S. Leslie. SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory ElementMaps. PLoS Computational Biology, 11(5):1–21, 2015.[41] Daria Shlyueva, Gerald Stampfel, and Alexander Stark. Transcrip-tional enhancers: From properties to genome-wide predictions. NatureReviews Genetics, 15:272–286, 2014.[42] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learningimportant features through propagating activation differences. In 34thInternational Conference on Machine Learning, ICML 2017, volume 7,pages 4844–4866, 2017.[43] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep InsideConvolutional Networks: Visualising Image Classification Models andSaliency Maps, 2014. arXiv:1312.6034v2.[44] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesianoptimization of machine learning algorithms. In Advances in NeuralInformation Processing Systems, volume 4, pages 2951–2959, 2012.[45] Franc¸ois Spitz and Eileen E M Furlong. Transcription factors : fromenhancer binding to developmental control. Nature Publishing Group,13:613–626, 2012.[46] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Mar-tin Riedmiller. Striving for Simplicity: The All Convolutional Net,2015. arXiv:1412.6806v3.[47] Gary D Stormo. DNA binding sites: representation and discovery.Bioinformatics, 16(1):16–23, 2000.[48] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribu-tion for Deep Networks, 2017. arXiv:1703.01365v2.[49] Dominic Chih Cheng Voon, Yit Teng Hor, and Yoshiaki Ito. The RUNXcomplex: Reaching beyond haematopoiesis into immunity. 
Immunol-ogy, 146(4):523–536, 2015.[50] Matthew T. Weirauch, Ally Yang, Mihai Albu, Atina G. Cote, Ale-jandro Montenegro-Montero, Philipp Drewe, Hamed S. Najafabadi,Samuel A. Lambert, Ishminder Mann, Kate Cook, Hong Zheng, Ale-jandra Goity, Harm van Bakel, Jean Claude Lozano, Mary Galli,51Mathew G. Lewsey, Eryong Huang, Tuhin Mukherjee, Xiaoting Chen,John S. Reece-Hoyes, Sridhar Govindarajan, Gad Shaulsky, Al-bertha J.M. Walhout, Franc¸ois Yves Bouget, Gunnar Ratsch, Luis F.Larrondo, Joseph R. Ecker, and Timothy R. Hughes. Determinationand inference of eukaryotic transcription factor sequence specificity.Cell, 158(6):1431–1443, 2014.[51] Jiang Zhang, Marie Marotel, Se´bastien Fauteux-Daniel, Anne LaureMathieu, Se´bastien Viel, Antoine Marc¸ais, and Thierry Walzer. Euro-pean journal of immunology. 48(5):738–750, 2018.[52] Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen,Aaron K. Wong, and Olga G. Troyanskaya. Deep learning sequence-based ab initio prediction of variant effects on expression and diseaserisk. Nature Genetics, 50(8):1171–1179, 2018.[53] Jian Zhou and Olga G. Troyanskaya. Predicting effects of noncodingvariants with deep learning-based sequence model. Nature Methods,12(10):931–934, 2015.52Appendix AFilter motif informationProperties of all 300 first layer filters of AI-TAC. The columns correspondto:• Influence - overall filter influence• IC - information content of the filter PWM• Reprod. - reproducibility, i.e. the number of runs in which this filterwas recovered• Top TF matches - TFs corresponding to best matching motif in theCIS-BP database,Filter Influence IC Reprod. 
Top TF Matches
filter255 0.0357675 12.81011043 10 Sfpi1/ Spib/ Spib/ Bcl11a/ Bcl11b/ Etv2/ Ets2/ Erg/ Erf/ ENSMUSG00000044690
filter34 0.010975358 9.823882667 10 Cebpb/ Cebpe/ Cebpa/ Nfil3/ Cebpg/ Cebpd/ Dbp/ Tef/ Hlf/ Atf5
filter11 0.009727289 12.43567913 10 Etv2/ Ets2/ Ets1/ Erf/ ENSMUSG00000044690/ Fev/ Etv2/ Ets2/ Erg/ Erf
filter133 0.008568606 13.17366462 10 Sfpi1/ Spib/ Spib/ Bcl11a/ Bcl11b/ Irf1/ Etv2/ Ets2/ Ets1/ Erf
filter174 0.008169892 11.09285091 10 Cebpe/ Cebpb/ Cebpa/ Cebpg/ Hlf/ Cebpd/ Nfil3
filter112 0.007724161 10.6590297 10 Myf5/ Myod1/ Ascl2/ Ascl1/ Myf5/ Myog/ Tcf12/ Tcf3/ Tal1/ Lyl1
filter15 0.005275883 11.93004972 10 Runx2/ Runx3/ Runx1/ Runx2/ Runx3/ Runx2/ Runx3
filter10 0.005030511 11.94800097 10 Runx1/ Runx2/ Runx3/ Runx2/ Runx3/ Runx2/ Runx3
filter78 0.004063911 10.53256981 10 Id3/ Id4/ Id1/ Snai2/ Zeb1/ Myf5/ Myod1/ Mesp2/ Mesp1/ Tcf4
filter260 0.003942707 13.54372185 10 nan
filter238 0.00301885 10.78727984 10 Rorc/ Rorb/ Rarb/ Esr1/ Esr2/ Rxrb/ Nr2f6/ Rxrg/ Nr2f2/ Esrrb
filter167 0.002844589 13.44976933 10 Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5
filter217 0.002654829 12.57158572 10 Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5
filter292 0.002608087 13.3200093 10 Smarcc2/ Smarcc1/ Fosb/ Fos/ Batf3/ Batf/ Fosb/ Fos/ Bach1/ Bach2
filter165 0.002315825 14.4805664 10 Irf1/ Bcl11a/ Bcl11b/ Stat2/ Prdm1
filter218 0.002303334 15.35124635 10 Tbpl2
filter190 0.002254576 14.3090253 10 Tbpl2
filter245 0.002075286 6.377445222 1
filter121 0.002055486 10.13494298 0
filter252 0.001935959 12.22508784 10 Etv2/ Ets2/ Ets1/ Erf/ ENSMUSG00000044690/ Fev/ Etv2/ Ets2/ Erg/ Erf
filter40 0.001902964 8.250938104 10
filter220 0.001194611 10.29211886 10 Esr1/ Esr2/ Rorc/ Rorb/ Rarb/ Rxrb/ Nr2f6/ Rxrg/ Nr2f2/ Nr2c2
filter295 0.001191121 3.963877444 2
filter271 0.001167049
filter166 0.001148039 13.52171258 10 Lef1/ Tcf7l2
filter288 0.001144482 10.343384 10 Tbx1/ Tbx10/ Tbx20/ Tbx21/ Eomes/ Mga/ Tbx5/ Tbx4/ Tbx21/ Tbr1
filter68 0.001073696 10.74340634 10 Nr1d1/ Nr1d2/ Rorc/ Rorc/ Rorb/ Pparg/ Ppard/ Ppara/ Pparg/ Ppard
filter57 0.00104311 5.444640633 1
filter242 0.000939029 14.6802471 10 Sp2/ Sp3/ Sp6/ Sp8/ Sp7/ Sp9/ Sp5/ Sp2/ Sp3/ Sp6
filter93 0.000936151 10.65612167 10 Id3/ Id4/ Id1/ Zeb1/ Snai2/ Mesp2/ Mesp1/ Tcf4/ Figla/ Atoh8
filter279 0.000878933 11.1171853 10 Klf6/ Klf5/ Klf3/ Klf1/ Klf2/ Mga/ Tbx21/ Eomes/ Tbx4/ Tbx1
filter89 0.000877186 13.9338102 10 Stat2/ Irf1/ Prdm1/ Bcl11a/ Bcl11b/ Irf2
filter106 0.000836286 11.28191316 10 Tbx20/ Tbx1/ Tbx10/ Mga/ Tbx21/ Eomes/ Tbx5/ Tbx21/ Tbr1/ Bcl11a
filter231 0.000818648 12.81160589 10 Nfkb1/ Rel/ Relb/ Rel/ Rela/ Hivep2/ Hivep1/ Hivep3/ Nfkb2/ Sp110
filter51 0.000818111 5.223610074 10 Mafk/ Mafg
filter23 0.000810822 17.29094559 10 Ctcf/ Ctcfl
filter97 0.000794982 11.39666997 10 Tbx4/ Tbx5/ Mga/ Tbx20/ Rhox8/ Tbx1/ Tbx10/ Tbx19/ T/ Tbx21
filter120 0.000742252 12.8251035 10 Etv2/ Ets2/ Ets1/ Erf/ ENSMUSG00000044690/ Fev/ Etv2/ Gabpa/ Ets2/ Erf
filter9 0.000713634 8.747762923 10
filter275 0.000683147 13.90513714 10 Ctcf/ Ctcfl
filter286 0.00062891 15.79696475 10 Spib/ Sfpi1/ Spib/ Irf1/ Bcl11a/ Bcl11b/ Stat2/ Prdm1/ Etv2/ Ets2
filter240 0.000619261 9.181616067 7 Sp2/ Sp3/ Sp6/ Sp8/ Sp7/ Sp9/ Sp5/ Plagl1/ Zbtb7a/ Zbtb7c
filter236 0.000573209 7.69064873 1
filter230 0.000548002 11.86735023 10 Sfpi1/ Spib/ Etv2/ Ets2/ Erg/ Erf/ ENSMUSG00000044690/ Fev/ Spib/ Etv2
filter195 0.00054669 8.849514119 9 Gata2/ Gata3/ Gata6/ Gata2/ Gata2
filter65 0.000523359 12.83675477 8 Nfkb1
filter35 0.000485219 14.09783964 9 Nfix/ Nfib/ Nfia/ Nfic/ Nfix/ Nfib/ Nfia/ Nfic/ Nfix/ Nfib
filter264 0.000475761 5.085780637 9 Mafg
filter8 0.00046047 9.781212087 10 Zeb1/ Snai2/ Id3/ Id4/ Id1/ Mesp2/ Mesp1/ Myf5/ Myod1/ Tcf4
filter173 0.000457876 8.719776746 10 Zfp711/ Zfa/ Zfy1/ Zfx
filter205 0.000437324 13.043296 10 Bcl11a/ Bcl11b/ Sfpi1/ Spib/ Etv2/ Ets2/ Erg/ Erf/ ENSMUSG00000044690/ Fev
filter179 0.00043018 4.040188655 0 Zkscan1/ Srf/ Atf1/ Sp100/ Sp140/ IRC900814/ Dnajc21
filter172 0.000413453 4.586823362 1 Prkrir
filter154 0.000363637 10.16701761 10 Sp2/ Sp3/ Sp6/ Sp8/ Sp7/ Sp9/ Sp5/ Sp2/ Sp3/ Sp6
filter274 0.000349809 4.486219114 0 Dmrt3/ Dmrta1
filter114 0.000346892 3.63658211 5
filter128 0.000342323 3.734457583 1
filter227 0.000327319 10.17649143 10
filter47 0.000306001 9.084441796 10 Runx2/ Runx3/ Runx1/ Runx2/ Runx3
filter219 0.000301505 3.602551329 8 Setbp1/ Atf1/ Mafk/ Dnajc21/ Hbp1/ Zkscan1/ Tcf7/ Homez/ Hhex/ Mecp2
filter201 0.000272047 7.825621662 3 ENSMUSG00000079994
filter211 0.000270309 10.97696139 0
filter85 0.000266754 2.814624733 10 Atf1/ Homez/ Pbx4/ Pbx2/ Pbx3/ Pbx1/ Setbp1/ Dnajc21/ Atf5/ Atf4
filter132 0.000253488 4.498443623 8
filter294 0.000249725 3.734606292 7 Mafk/ Setbp1/ Hmga1/ Hmga2/ Hmga1-rs1/ Atf1/ Zkscan1/ Hhex/ Prkrir/ Dnajc21
filter273 0.000242837 6.504206603 6
filter94 0.000233849 12.97110591 10 Irf1/ Stat2/ Spib/ Bcl11a/ Bcl11b/ Sfpi1/ Spib
filter209 0.000232634 9.095851542 10
filter276 0.000229846 13.49932331 10 Spib/ Sfpi1/ Spib/ Bcl11a/ Bcl11b/ Ehf/ Elf5/ Etv6/ Spic/ Spib
filter297 0.000225054 3.631780121 10 Mecp2
filter58 0.000221362 4.053184804 7 Setbp1
filter115 0.000220052 4.44280022 9 Setbp1/ Hbp1/ Tcf7/ Hhex/ Hmga1/ Hmga2/ Hmga1-rs1/ Tcf7l1/ Nanog/ Hoxb9
filter290 0.000207083 5.710840446 9 Mafg
filter102 0.000205864 8.741785902 10
filter148 0.000197608 10.45587484 4 Nfe2l2/ Nfe2
filter194 0.000195206 9.837238958 10
filter270 0.000193527 4.41762007 8
filter80 0.000190898 11.09343066 10 Tcf7l2/ Lef1
filter50 0.000164866 9.128977137 10 Tcf7l2/ Lef1
filter162 0.000163157 4.611822441 5
filter30 0.000157237 4.138480327 3 Homez
filter28 0.000154734 6.226353937 9 Mafg
filter73 0.000153573 8.882060357 7 nan/ Lhx5/ Lhx1/ Gsx1/ Vsx2/ Arx/ Prrxl1/ Vsx2/ Rax/ Prrxl1
filter224 0.000152219 13.49567136 10 Stat2/ Irf1/ Bcl11a/ Bcl11b
filter67 0.000151678 9.039821323 10 Mzf1
filter127 0.000143981 9.121384419 10
filter122 0.000138046 8.212007369 10 Etv6/ Elf5/ Ehf/ Elf1/ Elf2/ Etv2/ Ets2/ Erg/ Erf/ ENSMUSG00000044690
filter207 0.00013326 5.02135361 6
filter43 0.000124214 4.641292484 8 Pbx4/ Pbx2/ Pbx3/ Pbx1/ Mafk
filter147 0.00012326 9.240360208 6
filter191 0.000111593 7.212592448 1 Atf1
filter14 0.000109722 6.256898425 8
filter164 0.000109263 4.284922921 0
filter139 0.000108369 8.403904967 1 Six1
filter151 0.000102033 6.640858201 8
filter296 0.000101989 5.25963005 7
filter95 0.000101887 6.070796759 4
filter268 0.000100834 4.227078213 2
filter257 0.0000989 11.95876667 10 Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5/ Pax8/ Pax1/ Pax9/ Pax5
filter105 0.0000943 12.76764281 9
filter180 0.000094 4.4561539 4
filter183 0.0000937 9.758769455 10
filter153 0.0000915 9.197231963 6
filter247 0.0000896 10.84583255 10 Nfkb1/ Mzf1/ Rel/ Relb/ Nfkb2
filter291 0.0000877 4.662654179 2 Tcf7
filter70 0.0000868 9.921290335 9
filter135 0.0000867 11.07461293 9 Tbpl2
filter210 0.0000862 6.922007029 7
filter241 0.0000839 4.526449324 4
filter281 0.0000836 5.726584818 8
filter298 0.0000826 4.429468168 2
filter100 0.0000787 10.74515578 5
filter239 0.000078 7.510870349 5
filter141 0.0000775 8.087321777 1 Emx1/ Nfil3/ Nkx1-1/ Nkx1-1/ Meox2/ Tef/ nan/ Vsx2/ Dlx3/ Dlx6
filter203 0.000077 7.183626561 2
filter72 0.0000755 4.334675782 5 Setbp1/ Atf1/ Tcf7/ Homez/ Zkscan1/ Pbx4/ Pbx2/ Pbx3/ Pbx1/ Hhex
filter83 0.0000755 4.675820603 1 Dnajc21
filter3 0.0000742 4.643168512 5 Zkscan1/ Mecp2/ Dnajc21/ Mafk/ Atf1/ Foxo6/ Pou3f3/ Irx5/ Irx1/ Hbp1
filter117 0.0000725 8.853449517 4 Arnt2
filter213 0.0000722 8.108926847 4
filter284 0.0000708 6.326748192 8 Irx2/ Irx1/ Rhox11/ Irx5/ Irx1/ Rfx5/ Rfx6/ Rfx4/ Rfx8/ Dmrt1
filter234 0.0000704 5.756526898 3 Atf5/ Atf4/ Naif1/ Atf2/ Sp140/ IRC900814/ Atf1/ Tef/ Kdm2b
filter123 0.0000703 11.10732837 10 nan
filter253 0.0000698 6.241630823 5
filter101 0.0000688 8.482753328 0
filter31 0.000068 11.91286781 3
filter136 0.0000664 7.723492504 3
filter131 0.0000663 7.754124606 2
filter229 0.000066 6.09104818 6 Hhex/ Hoxd12/ NP_032300.2/ Hoxa10/ Hoxb9/ NP_032296.2/ Hoxc8/ Hoxb9/ Hoxd4/ NP_032296.2
filter75 0.0000658 10.68908648 2
filter69 0.0000657 13.12038514 10 Ctcf
filter49 0.0000654 8.650625448 7 Atf3/ Jdp2/ Atf2/ Atf7
filter198 0.0000652 9.886005642 8 Pax9/ Pax5/ Pax8/ Pax1
filter250 0.0000652 10.7869779 10 Irf1
filter143 0.000065 7.746473215 4
filter244 0.0000647 12.95999431 10 Etv2/ Ets2/ Ets1/ Erf/ ENSMUSG00000044690/ Fev/ Etv2/ Ets2/ Erg/ Erf
filter283 0.0000645 7.472977411 9 Nrl/ Mafa/ Maf
filter170 0.0000639 5.602994073 8
filter99 0.0000635 9.123393773 1
filter155 0.0000631 6.884448159 10
filter159 0.0000631 5.63674747 5 Zfp300/ Zscan20/ Atf1
filter214 0.0000619 5.968200513 10
filter91 0.0000612 7.265551141 3
filter55 0.000061 6.644188575 6
filter4 0.0000609 10.02617469 0
filter81 0.0000604 5.143625096 5 Hhex
filter124 0.0000602 8.219548597 9
filter176 0.0000599 8.022155786 4
filter188 0.0000592 8.910855997 0
filter249 0.0000586 7.252156283 0
filter259 0.0000582 12.02149639 2
filter146 0.0000579 6.651357662 1
filter287 0.0000566 5.23635501 1
filter12 0.0000556 10.19329729 3
filter24 0.0000554 6.276244049 1
filter6 0.0000552 10.77284828 10
filter144 0.0000552 7.10771801 1
filter41 0.0000548 5.887367592 1
filter160 0.0000546 8.384143074 3
filter98 0.0000545 9.05066866 4
filter184 0.0000542 5.644051942 3 Dnajc21/ Setbp1/ Atf1/ Pbx4/ Pbx2/ Pbx3/ Pbx1
filter20 0.0000541 7.857412384 2
filter182 0.0000541 7.857444219 6
filter39 0.0000538 9.34016601 0
filter82 0.0000529 8.952985455 3
filter60 0.0000527 8.437548134 4
filter149 0.0000526 7.123474357 0
filter269 0.0000512 6.968962403 7
filter299 0.0000511 8.056581304 3
filter171 0.0000509 8.807042867 0
filter2 0.0000508 6.602491545 0
filter111 0.0000504 6.551389637 4
filter140 0.0000503 6.940249007 0
filter130 0.0000501 8.686382883 3
filter66 0.0000499 9.306904407 8
filter145 0.0000499 7.835945138 0
filter152 0.0000496 10.96147533 0
filter272 0.0000495 10.36105545 1
filter233 0.0000494 8.142564571 4
filter42 0.0000491 8.278420741 4
filter1 0.000049 7.135086991 10 Atf3
filter142 0.000049 6.16214829 0
filter216 0.0000489 6.501634666 5
filter29 0.0000487 10.95965327 1
filter126 0.0000485 7.849950976 4
filter246 0.0000485 7.794905625 0
filter178 0.0000481 10.327217 8 Sp2/ Sp3/ Sp6/ Sp8/ Sp7/ Sp9/ Sp5/ Sp2/ Sp3/ Sp6
filter199 0.0000481 7.744841132 4 Hoxb1/ Hoxa1/ Hoxb2/ Hoxd4/ Hoxa6/ Hoxb2/ Pou6f1/ Pou2f3/ Meox1/ Hoxa2
filter33 0.0000477 5.360230505 5
filter262 0.0000475 10.74862289 2
filter103 0.0000474 7.118303554 0
filter192 0.0000474 8.358581497 3
filter17 0.0000473 6.694936865 2
filter138 0.0000462 8.530819708 2
filter109 0.0000461 8.93098452 1
filter222 0.0000461 6.379401097 0
filter26 0.000046 8.427250934 4
filter232 0.000046 6.942583904 2
filter197 0.0000457 9.417900687 0
filter248 0.0000455 7.397851979 1
filter256 0.0000454 10.26651299 6
filter265 0.0000454 5.77818146 1
filter74 0.0000451 7.79247293 2
filter175 0.0000447 9.308963928 4
filter18 0.0000445 10.03106411 2
filter36 0.0000445 8.023095847 3
filter228 0.0000442 8.955259152 4
filter215 0.000044 6.658615306 3
filter5 0.0000439 6.060014083 4
filter61 0.0000437 8.93558998 3
filter13 0.0000434 5.896416221 1
filter157 0.0000433 9.454012595 8
filter125 0.0000431 10.96646503 1
filter137 0.0000423 9.166575155 2
filter19 0.0000418 6.755846361 1
filter186 0.0000412 9.399774999 10
filter267 0.000041 7.152791477 3
filter45 0.0000406 10.50199055 2
filter88 0.0000405 5.916821437 4
filter63 0.0000403 9.971185496 0
filter92 0.0000402 10.05949031 6
filter263 0.0000401 8.316991382 5
filter285 0.00004 7.637366915 1
filter44 0.0000399 9.969803863 8 Tbpl2
filter280 0.0000398 9.85114045 4
filter21 0.0000396 7.816081845 1
filter200 0.0000391 10.97817101 3
filter261 0.0000388 9.110338959 2
filter254 0.0000385 10.28534523 6
filter278 0.0000385 7.162393957 6
filter48 0.0000382 7.192766847 4
filter54 0.0000379 10.32338372 0
filter113 0.0000379 7.770356573 0
filter84 0.0000374 5.666161006 7
filter118 0.0000371 7.609764226 1 Junb/ Zic4/ Zic3/ Zic1/ Zic4/ Tbx3
filter161 0.0000369 7.123738039 0
filter177 0.0000369 8.053713308 7
filter206 0.0000368 10.52473197 2
filter86 0.0000366 6.70425404 1
filter237 0.0000366 7.347821421 1
filter289 0.0000364 9.989819394 7
filter32 0.0000362 7.846011003 0
filter185 0.0000358 5.469848364 8 Mecp2/ Hhex/ Nr2e1/ Xbp1/ Setbp1/ Cphx
filter243 0.0000358 6.350843954 4 Cphx/ Pbx4/ Pbx2/ Pbx3/ Pbx1
filter107 0.0000358 10.03660388 2
filter59 0.0000353 10.09228773 2
filter79 0.0000352 8.447790033 1
filter16 0.0000351 6.795651699 4
filter96 0.0000346 9.658079637 1
filter150 0.0000346 10.65068912 2
filter119 0.0000344 10.44550581 2 Usf1/ Mitf/ Tcfe3/ Arntl/ Tcfec/ Arnt/ Bhlhe41/ Gmeb1/ Creb1/ Usf2
filter104 0.0000341 8.166199904 0
filter202 0.0000336 9.735935293 10
filter46 0.0000336 8.58816777 0
filter193 0.0000336 8.507109625 2
filter77 0.0000332 9.536742897 0
filter62 0.000033 7.164869314 2 Gm239/ Gm98/ Zkscan1
filter27 0.000033 7.756526575 1
filter189 0.000033 6.754216621 5
filter90 0.0000326 10.97267291 4
filter181 0.0000326 10.90853891 0
filter56 0.0000324 7.330371177 1
filter196 0.0000323 9.704030662 9
filter225 0.0000321 9.27022719 0
filter223 0.0000316 9.304710026 10
filter156 0.0000316 9.386572384 1
filter52 0.000031 8.134521812 1
filter226 0.0000309 12.10358919 0
filter212 0.0000308 6.822379603 7
filter64 0.0000306 6.770211089 2 Gata2/ Gata5/ Gata2/ Gata3
filter38 0.0000305 8.761957551 1
filter53 0.0000305 8.050241968 4
filter116 0.0000295 10.40140809 2
filter76 0.000029 12.22319326 2
filter110 0.0000289 9.901229468 1
filter266 0.0000286 8.827576915 5
filter22 0.0000279 7.788815142 1
filter282 0.0000278 6.17079111 0
filter293 0.0000275 7.235811417 4
filter204 0.0000274 6.518757601 2
filter7 0.0000265 8.805051502 1
filter221 0.0000259 8.620098067 0
filter277 0.0000256 6.159722079 3
filter108 0.0000255 8.508599291 5
filter71 0.0000252 6.780928854 4
filter134 0.0000248 11.87473453 1
filter251 0.0000243 10.79871508 3
filter37 0.0000242 10.9847853 2
filter129 0.0000235 8.840784607 8 Rarg/ Rara/ Nr2c2/ Nr2c1/ Pparg/ Ppard/ Ppara/ Esr1/ Esr2/ Thrb
filter0 0.000023 7.387973909 3
filter158 0.0000223
filter163 0.0000223 9.075127994 0
filter25 0.0000203 12.65092678 5
filter168 0.0000189 7.159924343 2
filter208 0.000017
filter169 0.0000108
filter235 0.0000107 14.87710862 5
filter187 0.00000766 9.047789275 0
filter87 0.00000421
filter258 0.00000324 10.78377334 0

