Open Collections

UBC Theses and Dissertations

Bioinformatics design of cis-regulatory elements controlling human gene expression Farkas, Rachelle 2017

Bioinformatics design of cis-regulatory elements controlling human gene expression

by

Rachelle Farkas

B.Comp., Queen’s University at Kingston, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

The University of British Columbia
(Vancouver)

October 2017

© Rachelle Farkas, 2017

Abstract

Gene therapy has the potential to not only treat, but cure individuals suffering from inherited diseases. Advances in understanding the human genome and the discovery of causal genes underlying diseases have heightened the need to solve the gene therapy challenge. Viral vectors are often used as a delivery tool for therapeutics, but their safety and efficacy are still being studied. To contribute to this goal, we have created 49 small promoters for viral vectors by bioinformatically annotating cis-regulatory regions, a subset of which are concatenated with the goal of driving cell-specific expression of a reporter gene. We have tested a subset of these promoters in mice in vivo. Regulatory region analysis can take a trained designer multiple weeks. To resolve this issue, we have created a semi-automated approach to regulatory region identification, named OnTarget. The OnTarget database accumulates thousands of cell and tissue-specific experiments in order to identify regions informative of regulatory properties. OnTarget is able to identify regulatory regions consistent with those identified by designers. In this capacity, we expect OnTarget to lead to better and faster identification of cis-regulatory regions for the design of promoters targeting specific sets of cells.

Lay Summary

Millions of people currently live with incurable genetic diseases. Although many treatments exist to ease symptoms of these diseases, they are often expensive and invasive, resulting in both financial and emotional burdens on patients, their families, and healthcare systems.
Gene therapy has the potential to not only treat, but cure genetic diseases. The concept is simple: replace a malfunctioning gene with a working version. Current gene therapies often do not discriminate in the delivery of these genes, which can lead to healthy cells receiving unnecessary genes, potentially causing unwanted side effects. To address this issue, we have designed a method to limit the replacement gene to be active in the right types of cells. We have created software to make this process available to other researchers.

Preface

This thesis contains original work as well as extensions to the MiniPromoter project led by the laboratory of Dr. Elizabeth M. Simpson (UBC). All work was performed at the UBC Centre for Molecular Medicine and Therapeutics at the BC Children’s Hospital Research Institute under the supervision of Dr. Wyeth Wasserman. No text is taken from previously published material.

The MiniPromoter design protocol was defined by myself and Dr. Oriol Fornes, building from reported approaches of past members of the lab. I established the OnTarget analysis steps, creating pseudocode and flowcharts, which David Arenillas programmed. All data were downloaded for free academic use from the FANTOM5 consortium, ENCODE project, Roadmap Project, the UCSC Genome Browser, and GEO archives. With the exception of the KRT12 Ple326 and Ple334 constructs (which were analyzed by myself within the Simpson laboratory), the mouse work was performed by our collaborators in the Simpson laboratory, including Jack Hickmott, Andrea Korecki, and Siu Ling Lam, and was covered under the UBC Animal Ethics Certificate A14-0295 and BioSafety Certificate B14-0131.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
Dedication
1 Introduction
    1.1 Gene Therapy
        1.1.1 Gene Therapy via Viral Vectors
        1.1.2 Adeno-associated Virus (AAV) Vectors
    1.2 Regulatory Elements
        1.2.1 Promoters
        1.2.2 Enhancers
    1.3 Profiling Methods for Annotating Regulatory Properties
        1.3.1 TSS and Enhancer Identification
        1.3.2 Transcription Factor Binding
        1.3.3 Histone Modifications
        1.3.4 Chromatin Accessibility
        1.3.5 Topologically Associating Domains
        1.3.6 Computational Predictions of Regulatory Elements
    1.4 Preceding Work: Compact Promoters for Gene Delivery
    1.5 Hypothesis
2 Methods
    2.1 Data
    2.2 Bespoke MiniPromoter Construct Design
        2.2.1 RR Selection
        2.2.2 MiniPromoter Assembly
    2.3 Experimental Validation of MiniPromoters
    2.4 Semi-automated RR Selection
        2.4.1 Promoter Selection
        2.4.2 Enhancer Selection
    2.5 Validation of Semi-automated Design Performance
3 Results
    3.1 Bespoke Designs
    3.2 Experimental Validation of Bespoke Designs
    3.3 Automated System Creation of OnTarget
    3.4 Assessing the Performance of OnTarget on Experimental Data
    3.5 Comparing the Designs Between Bespoke and Semi-automated Approaches
4 Conclusion & Future Work
Bibliography

List of Tables

Table 1  List of All Designed MiniPromoters
Table 2  MiniPromoters Tested in Mice in vivo
Table 3  Summary of OnTarget Regulatory Region Predictions for the TAD Containing the Gene ABCB4
Table 4  Summary of OnTarget Regulatory Region Predictions for the TAD Containing the Gene NOS1
Table 5  Regulatory Regions Identified for NEFM/NEFL Bespoke MiniPromoters in Comparison to Regulatory Predictions from OnTarget
Table 6  Summary of OnTarget Regulatory Region Predictions for Bespoke MiniPromoters

List of Figures

Figure 1  Overview of Popular Viral Vectors
Figure 2  Overview of Genome Regulation
Figure 3  Visual Representation of the OnTarget Enhancer Selection Module
Figure 4  Manual Bioinformatics Design of Ple326 and Ple334 Novel MiniPromoters from the KRT12 Gene
Figure 5  MiniPromoters Ple326 and Ple334 from the KRT12 Gene Drive Gene Expression in Layers of the Cornea
Figure 6  Manual Bioinformatics Design of Cutting Down the Original Promoter of Ple265 to Form the Ple341 MiniPromoter
Figure 7  MiniPromoter Ple341 from the PCP2 Gene Drives Gene Expression in Retinal Bipolar Cells
Figure 8  Manual Bioinformatics Design of Cutting Down the Original Promoter of Ple321 to Form the Ple344 MiniPromoter
Figure 9  MiniPromoter Ple344 from the TUBB3 Gene Drives Gene Expression in Retinal Ganglion Cells
Figure 10  Manual Bioinformatics Design of Ple345 and Ple346 Novel MiniPromoters from the NEFM and NEFL Genes
Figure 11  MiniPromoter Ple345 from the NEFM Gene Drives Gene Expression in Retinal Ganglion Cells
Figure 12  Manual Bioinformatics Design of the Ple347 Novel MiniPromoter based off the GNGT2 Gene
Figure 13  MiniPromoter Ple347 from the GNGT2 Gene Drives Gene Expression in Cone Cells
Figure 14  The Cumulative Distribution Charts of Individual Nucleotide Scores from Two TADs

List of Abbreviations

AAV  Adeno-associated virus
ATAC-seq  Assay for Transposase-Accessible Chromatin Sequencing
BAC  Bacterial artificial chromosome
CAGE  Cap Analysis of Gene Expression
CDS  Coding start site
ChIP-seq  Chromatin immunoprecipitation Sequencing
CNS  Central nervous system
CTCF  CCCTC-binding factor
DHS  DNase I hypersensitive sites
DNase-seq  DNase I hypersensitive sites Sequencing
EmGFP  Emerald green fluorescent protein
ENCODE  The Encyclopedia of DNA Elements
eRNA  Enhancer RNA
FAIRE-seq  Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing
FANTOM  Functional Annotation of the Mammalian Genome
FANTOM5  FANTOM consortium fifth project
GENSAT  Gene Expression Nervous System Atlas
GRO-seq  Global Run-On Sequencing
Hi-C  High-resolution chromosome conformation capture
LINE  Long interspersed nuclear element
LTR  Long terminal repeat
mRNA  Messenger RNA
RNAPII  RNA polymerase II
RNA-seq  RNA Sequencing
RR  cis-regulatory region
SINE  Short interspersed nuclear element
smCBA  Small chicken beta actin
TAD  Topologically Associating Domain
TF  Transcription factor
TSS  Transcription start site
UCSC  University of California, Santa Cruz

Acknowledgments

I would like to thank my supervisor Dr. Wyeth Wasserman for not only giving me the chance to work on an exciting project, but for all the support, encouragement, and guidance throughout my studies. An extended thank you to everyone in the Wasserman lab, including: Dr. Oriol Fornes for the endless hours of support and collaboration, Dora Pak for managing all schedules and overall support, David Arenillas for all things programming and computational discussion, and Phillip Richmond, Allen Zhang, Cynthia Ye, and Dr. Robin van der Lee for many helpful discussions. Additionally, thanks to members of the Simpson lab (Dr. Elizabeth Simpson, Andrea Korecki, Jack Hickmott, Siu Ling Lam, Zeinab Mohanna) for taking care of me throughout our collaborations and my directed studies work.
I would also like to thank Shams Bhuiyan and Louie Dinh for helpful discussion and providing me with food throughout this time. A special thanks to the members of my committee, Dr. Paul Pavlidis, Dr. Pamela Hoodless, and Dr. Cristina Conati, for their helpful suggestions and comments throughout my studies and this thesis.

Dedication

To my parents, sisters, partner, and cat for always giving me so much love and support.

Chapter 1

Introduction

As healthcare costs continue to rise, it is imperative not only to further the understanding of human diseases but also to provide new and effective treatments[63]. Unlike diseases contracted from foreign agents, a large portion of inherited diseases are currently incurable. Treatments exist to alleviate symptoms, but often provide no cure. Certain individuals suffering from inherited diseases must continue treatment for life, which produces both financial and emotional burdens on patients, their families, and the healthcare system as a whole.

At the conceptual level, the problem of genetic disorders seems simple: a malfunctioning gene can be replaced by a functioning version. At the implementation level, however, there are challenges in the identification of the gene(s) involved, in the delivery of the restorative gene to the appropriate cells in the body, in the maintenance of expression of the replacement gene, and in the prevention of unintended effects[54]. Efforts spanning more than 25 years[25] to produce gene therapies have confronted these issues, with mixed success[54, 81].

Substantial advances in the understanding of the human genome and the discovery of causal genes underlying diseases have heightened the need to solve the gene therapy challenge. Improvements in the delivery of nucleic acids[54, 75] have allowed for a new era of gene therapies, with hundreds of new clinical trials underway worldwide[25].
To realize the full potential of gene therapy, additional advances will be required, including improving delivery of therapeutic DNA to relevant cells and tissues[37]. In particular, the most popular method of in vivo delivery is through viral vectors, engineered to remove pathogenic properties[75].

One of the identified challenges in the field is the establishment of ‘promoter’ sequences capable of directing therapeutic gene expression in a targeted manner (the formal meaning of ‘promoter’ will be fully discussed below). Most existing vectors incorporate ubiquitous promoters, but calls have been made to find promoters capable of directing gene expression in the correct subset of cells, which also affects gene therapy safety and efficacy[80]. Promoter design and selection is a challenge, as the DNA sequence must be capable of utilizing a host cell’s transcriptional machinery[75]. It is possible to form promoter sequences by piecing together endogenous DNA known to promote gene expression in a desired pattern[22, 23, 40, 60]. Many of these sequences are part of the non-coding regions of the genome, which account for 98% of all genetic material in the human genome[27]. Recognizing these sequences is therefore an important goal, and several methodologies and technologies have been developed to aid in their identification.

Designing promoters for use in viral vectors is a key step to a future in which gene therapies are widely used to treat and, in the best cases, cure genetic disorders.

1.1 Gene Therapy

While the knowledge of gene transfer dates back to the 1947 discovery of bacterial conjugation, the launch of the modern field of gene therapy is marked by the first clinical trial in 1989[81]. By the mid-1990s, trials for the treatment of diverse disorders, such as adenosine deaminase deficiency[11] and cystic fibrosis[14], caused a boom within the field.
While results were often mixed, the field continued to grow rapidly until 1999, when Jesse Gelsinger died during a trial to correct the effects of ornithine transcarbamylase deficiency. The cause of his death was multi-organ failure due to a large immune response to a high dose of the administered adenovirus vector[42].

Renewed hope arose in 2000 when researchers cured patients with X-linked severe combined immunodeficiency (SCID-X1) through the use of a retrovirus[18]. However, a couple of years after publication, two of the treated patients developed a leukemia-like disease, due to insertion of the retroviral vector near the LMO2 oncogene[38] (the insertions were presumed to be activating).

The gravity of these failures weighed on the community and prompted reviews of the gene therapy field as a whole[69]. Emphasis was placed on the identification/creation of new viral vectors capable of safe delivery of therapeutic genes[75]. Advances in vector technologies have led to success in animal models, and since the early 2010s, the number of clinical trials for gene therapies has increased dramatically once again[54].

1.1.1 Gene Therapy via Viral Vectors

Viruses are currently the dominant tool for delivery of therapeutic DNA[25]. As viruses have the ability to transmit their genetic material into cells, they are highly relevant to gene therapy. The failures of gene therapy at the turn of the century highlighted a need for deeper understanding of viruses, and how they could be selected/modified to circumvent the known problems[43]. The subsequent research revealed advantages and drawbacks for specific viral vectors.
Proper vector selection for the scope of each trial therapy increased both safety and efficacy[54]. Individual vectors differ in the types of cells they are able to transduce, the length of DNA (or in some cases RNA) they can deliver, how long therapeutic expression will persist, and, in some aspects, expected host immune response post-injection[1] (see Figure 1).

Figure 1: Overview of Popular Viral Vectors. Vector choice depends on a variety of factors including immunogenic host response, entry into cells, genome integration, and packaging size. Larger icons indicate larger packaging capability. The adenovirus can package just over 8 kb of non-endogenous DNA and transduce all cells; however, its expression is transient and it produces a large host immune response. The γ-retrovirus can package around 8 kb of non-endogenous RNA and will achieve stable expression due to genome integration; however, it will produce a host immune response and can only transduce actively dividing cells. The lentivirus can package around 8 kb of non-endogenous RNA, transduces all cells, and usually does not promote a host immune response. Its expression is stable, but it inserts its viral genome into the host genome at random loci. The AAV generally does not promote a host immune response and readily transduces most cells. It is the smallest of the popular viral vectors, packaging under 5 kb of non-endogenous DNA, and expression is transient in quickly dividing cells.

The Journal of Gene Medicine maintains a database of past (since 1989) and current gene therapy clinical trials[25]. Approximately 70% (69.5%) of all trials used viral vectors (1989-2016). In 2016, viral vectors made up 84% of newly approved gene therapy trials. While diverse viruses have been used for past gene therapy trials, four prominent vectors are used in therapies today: adenovirus, retrovirus, lentivirus, and adeno-associated virus (AAV).

The adenovirus is historically the most used viral vector for gene therapies, accounting for 30% of all viral vector clinical trials[25]. As a medium-sized virus, the adenovirus genome contains ∼36 kb of double stranded DNA, although only about 8 kb can be used to package desired DNA, with the remaining space occupied by genes that are important for transcription and virus integrity[43]. Amongst the popular viral vectors, adenovirus carries the largest payload of DNA[75]. It can transduce both actively dividing and non-dividing (quiescent) cells[1]. The adenovirus does not integrate into its host genome, existing in the cell as a non-replicating episome, and therefore the expression of its therapeutic is transient in dividing cells (as the episome is diluted)[8]. The biggest disadvantage of the adenovirus is its highly immunogenic nature[75]. Adenoviral vectors have largely been replaced by other, less immunogenic vectors; however, they remain popular in cancer clinical trials[19, 26].

Retroviral vectors have accounted for 27% of all viral vector clinical trials[25]. The widely used γ-retrovirus vectors can package ∼8 kb of RNA. Retroviruses integrate into host genomes, therefore enabling stable expression of a transgene[10]. The insertion location is random, however, which may lead to oncogene activation[37]. Additionally, γ-retroviruses can only transduce actively dividing cells, which limits their utility for targeting cells or tissues that do not replicate often[43].

The lentivirus, whose vectors are often based on the HIV-1 virus[75], has become increasingly popular in current trials. While only about 9% of all historic viral gene therapies used a lentiviral vector, in 2016 it accounted for 24% of all recorded trials[25]. Sources differ on the payload capacity of the vector[70, 72], but a consensus is that robust packaging tends to occur when the RNA is less than 8 kb[2].
Although a subclass of retrovirus, lentiviral vectors can transduce both dividing and non-dividing cells[1]. Furthermore, these vectors do not produce a large immune response[75]. As lentivirus contents are inserted into the host genome, their stable expression comes at the expense of a risk of oncogene activation[70, 75].

Lastly, the AAV has also increased in usage as a vector over the years. While accounting for 10% of all historic viral gene therapies, AAV vectors were used in 23% of all 2016 trials[25]. The AAV vector, the focus of the research in this thesis, is further described in the following section.

1.1.2 Adeno-associated Virus (AAV) Vectors

AAVs have become an increasingly popular choice of vector in viral gene therapies, as their most desirable features include low human pathogenicity, the ability to transduce both dividing and non-dividing cells, and a non-replicative nature[32]. Additionally, engineered vectors ensure that the AAV will not integrate into the host genome, due to the removal of the viral rep genes[43]. There are two main drawbacks for AAVs. First, the AAV has a small payload capacity. At less than 5 kb per virus, the AAV is the smallest of all highly used viral vectors for gene therapies[70]. Second, AAV episomes are lost over cell divisions[1].

There are nine main serotypes of the AAV that can infect human cells[85], and each enters a particular subset of cells with greater specificity than others due to differences in capsid structure[82]. While the AAV2 serotype has been the most widely studied, it transduces cells more slowly and less efficiently than most other serotypes[85]. More recently, hybrid systems, usually made by combining viral capsid proteins to create mosaic capsids, allow for a greater range of specific cell types to be targeted[5]. Proper AAV serotype selection is important for the design of therapies.
For example, AAV9 is efficient at targeting neurons of the central nervous system (CNS)[67], while AAV2 is still the vector of choice for targeting cells in the kidney[82].

Beyond the therapeutic of interest, the vector offers little room for the inclusion of other genomic elements. An AAV vector genome must include inverted terminal repeats (ITRs) at its 5’ and 3’ ends, flanking a promoter, a transgene, and a polyadenylation sequence (e.g. simian virus 40 late)[32]. As AAV serotypes are similar in their packaging capacity, to allow larger transgenes most studies utilize small, ubiquitous promoters, such as the approximately 500 base pair CMV[33] and CAG[56] promoters. Thus these AAV vectors will also express in off-target cells, which may not be appropriate for all therapeutics.

To achieve the highest and most specific therapeutic effects, the designers of new therapies must therefore carefully consider both the capsid (serotype) and promoter properties. By optimizing the capsid properties of viruses, one can bias the uptake of the therapy to certain cell types, and much research is currently addressing this mechanism[5, 85]. However, there have been calls to incorporate more selective regulatory sequences controlling the transcription of the therapeutic gene. This thesis focuses on this opportunity to improve the delivery of gene therapy by designing these promoter sequences.

1.2 Regulatory Elements

Great progress has been made in understanding the mechanisms which regulate mammalian gene transcription. As a basic model, the RNA Polymerase II complex (RNAPII), which is required to transcribe gene DNA into messenger RNA (mRNA), must assemble on the DNA at the start of a gene. This region, which overlaps the transcription start site(s) (TSSs), is called a promoter region. Other elements that affect the rate of transcription enable recruitment of other factors necessary for the formation (or obstruction) of the RNAPII complex. Such regions have been labeled ‘enhancers’ (or ‘silencers’).
Both promoters and enhancers contain short elements to which DNA-binding proteins, called transcription factors (TFs), can bind in a sequence-specific manner. Characteristics of these regulatory features are further described below. For clarity, TFs are a broad category of proteins, of which only a subset exhibit sequence-specific DNA binding, but within this thesis TFs will refer specifically to this subset. An overview of regulatory elements and profiling methods is shown in Figure 2.

Figure 2: Overview of Transcriptional Regulation Data. Within this thesis, diverse types of experimental data are used to assist in the selection of cis-regulatory regions involved in the transcriptional regulation of gene expression. The figure highlights promoters (from which RNA production initiates) and enhancers (regions which modulate the activity of promoters). Types of experimental techniques used to collect data about the locations of cis-regulatory regions and the regions within which regulatory regions act are depicted.

1.2.1 Promoters

In eukaryotes, promoters are regulatory DNA sequences proximal to the 5’ end of genes and are important in the initiation of transcription from DNA to RNA. All gene promoters include one or more TSSs, where the DNA first starts to be transcribed by an RNA Polymerase complex[34]. Promoters contain TF binding sites necessary for the recruitment/assembly of RNA polymerase complexes. Certain genes are regulated by multiple promoters and TSSs, often in cell-type-specific or developmental-stage-specific contexts[31].

A subset of promoters (24%) include a TATA-box feature[83], to which a component of the RNAPII complex can bind. Many mammalian promoters (∼70%) overlap CpG islands[83] (regions in which CpG dinucleotides have been retained over evolution at levels consistent with C and G mononucleotide frequencies, reflecting a lack in promoter regions of the CpG methylation that otherwise promotes CpG elimination over evolution). Some promoters combine both TATA-box and CpG islands, while others have neither[74].

Over the past decade extensive profiling of the locations and activities of promoters has been performed. While original definitions of promoters highlighted a directionality to them, recent studies have shown that many promoters direct bidirectional transcript production (albeit most (90%) of these are still preferentially expressed in one direction)[76]. These bidirectional promoters usually overlap CpG islands, and are depleted of TATA-boxes[76]. Bidirectional promoters that do not produce functional mRNA products in both directions generally produce promoter upstream transcripts, directed away from the 5’ end of the gene[57]. These transcripts are short, and are generally sensitive to exosome-mediated decay[61].

The fact that promoters can produce bidirectional transcripts contributes to an emerging viewpoint in which promoters and enhancers (discussed below) are recognized as two ends of a continuous spectrum rather than as completely discrete categories.

1.2.2 Enhancers

Enhancers are DNA sequences that act upon promoters to modulate the pattern and magnitude of transcript production. Enhancer regions are composed of a mixture of TF binding sites[50]. Some of the bound TFs help recruit RNAPII proteins, or maintain chromatin (the material of which chromosomes are made, consisting mostly of DNA, RNA, and proteins) characteristics that are favorable or unfavorable for RNAPII recruitment, which ultimately influences the rate of transcriptional initiation[3]. Enhancer sequences can be found upstream or downstream of genes, or within their exons and introns[59]. Often, enhancers affect multiple genes, and most genes are affected by multiple enhancers[59].
Enhancers are often implicated in cell-specific transcription, although ubiquitous enhancers can also be extensive[84].

Until recently, enhancers were distinguished from promoters in two ways: first, promoters were locations at which RNA transcripts were initiated, and second, promoters were directionally dependent and enhancers were not. With further study, the distinction between enhancers and promoters has become increasingly blurry[3]. Although conceptually different, enhancers share many properties with promoters. They are capable of being transcribed by RNAPII, producing short enhancer RNA (eRNA) transcripts[59]. This transcription is performed in a bidirectional manner. Much like the promoter upstream transcripts, eRNAs are short-lived and highly sensitive to exosome-mediated decay[3]. To further support the view that promoters and enhancers are ends of a continuum, recent studies have shown that at least a subset of promoters can function as enhancers in enhancer activity assays[3].

In the context of this work, we classify regulatory elements as either promoters or enhancers, despite the emerging biochemical data. Here, promoters contain the TSS(s) for a gene of interest. Enhancers are defined as cis-regulatory regions (identified based on specific properties discussed below) that modulate the rate of transcription initiation from promoters.

1.3 Profiling Methods for Annotating Regulatory Properties

Since the completion of the human reference genome, research attention has focused on its annotation. As up to 98% of the genome appears to be primarily involved in the control of gene expression, the annotation of regulatory sequences (i.e. promoters and enhancers) and chromatin modification properties (discussed below) has been given particular attention.
Innovative high-throughput profiling technologies and new computational methods have proliferated, each providing insights into aspects of regulation.

1.3.1 TSS and Enhancer Identification

Promoter and enhancer localization is one of the main objectives of genome annotation efforts. With the completion of the human genome project and the reference genomes, locations of protein coding and non-coding RNA genes have been mapped, largely due to RNA sequencing (RNA-seq). However, the exact locations of transcript starts have long been ambiguous, as RNA-seq preferentially captures mature mRNAs. New technologies, such as Cap Analysis of Gene Expression[71] (CAGE) and Global Run-On Sequencing[21] (GRO-seq), have been developed in order to capture the capped 5’ ends of RNA transcripts. These capped RNAs relate not only to mRNAs, but also to eRNA products. The newer GRO-seq technique, although more sensitive to easily degraded transcripts such as many eRNAs, is an expensive and time-consuming procedure[66]. As only a small number of GRO-seq datasets are available, covering few cell lines, we focus on CAGE as the primary source of capped transcript identification.

First introduced in 2003 by Shiraki et al.[71], CAGE technology captures the 5’ end of mRNA transcripts (that is, the capped portion of the mRNA) at a given time-point. These trapped ends, called tags, are sequenced and mapped back to a reference genome, delineating the specific TSS from which each mRNA transcript was produced. Efforts largely through the Functional Annotation of the Mammalian Genome (FANTOM) consortium have been able to collect large amounts of CAGE data across every major human organ. In this capacity, it is possible to obtain a quantitative snapshot of the human transcriptome in cell and tissue-specific contexts.
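The tag-to-TSS mapping described above can be illustrated with a minimal sketch. It assumes the CAGE tags have already been aligned and reduced to the genomic coordinate of their 5’ end. The function name `cluster_cage_tags`, the `TagCluster` record, and the 20 bp `max_gap` threshold are hypothetical choices made for illustration; this simple distance-based grouping is not the FANTOM5 pipeline.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple

# A mapped tag is reduced to (chromosome, 5' end position, strand).
Tag = Tuple[str, int, str]


class TagCluster(NamedTuple):
    chrom: str
    strand: str
    start: int         # first tagged position in the cluster (inclusive)
    end: int           # last tagged position in the cluster (inclusive)
    dominant_tss: int  # position supported by the most tags
    tag_count: int     # total number of tags in the cluster


def _summarise(sites: List[Tag], counts: Counter) -> TagCluster:
    # Ties are resolved in favour of the first (lowest) position.
    dominant = max(sites, key=lambda s: counts[s])
    return TagCluster(
        chrom=sites[0][0],
        strand=sites[0][2],
        start=sites[0][1],
        end=sites[-1][1],
        dominant_tss=dominant[1],
        tag_count=sum(counts[s] for s in sites),
    )


def cluster_cage_tags(tags: Iterable[Tag], max_gap: int = 20) -> List[TagCluster]:
    """Group tag 5' ends on the same strand that lie within max_gap of each other."""
    counts = Counter(tags)  # tags per (chrom, position, strand)
    # Sort so that sites on the same chromosome and strand are adjacent.
    sites = sorted(counts, key=lambda s: (s[0], s[2], s[1]))
    clusters: List[TagCluster] = []
    current: List[Tag] = []
    for site in sites:
        if current:
            prev = current[-1]
            same_strand = (site[0], site[2]) == (prev[0], prev[2])
            if not same_strand or site[1] - prev[1] > max_gap:
                clusters.append(_summarise(current, counts))
                current = []
        current.append(site)
    if current:
        clusters.append(_summarise(current, counts))
    return clusters


# Toy data: two nearby '+' strand sites, one distant '+' site, one '-' site.
example_tags = (
    [("chr1", 100, "+")] * 5
    + [("chr1", 102, "+")] * 2
    + [("chr1", 500, "+")] * 3
    + [("chr1", 101, "-")]
)
for cluster in cluster_cage_tags(example_tags):
    print(cluster)
```

In this toy input, the tags at positions 100 and 102 on the plus strand form one cluster of seven tags whose dominant TSS is position 100, which is the sense in which tag counts give both a TSS location and a relative strength.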
At the time of publication of the consortium’s fifth project (FANTOM5)[45], samples from 573 primary human cells, 152 human post-mortem tissues, and 250 cancer cell lines had been used to generate CAGE data and describe gene TSSs and their strengths[31]. The FANTOM5 CAGE data provides TSS locations and relative strength for 91% of protein coding genes (or 94% using a more permissive threshold).

Furthermore, due to the nature of the CAGE protocol, it can also be used to capture eRNAs, as many are capped at their 5’ ends. The FANTOM5 project identified over 43,000 enhancers from 808 samples based on eRNA positions[4]. Many of these CAGE-identified enhancers showed expression in a cell type-specific manner, and a small portion were expressed in a ubiquitous fashion.

1.3.2 Transcription Factor Binding

TFs are DNA-binding proteins that are involved in regulation, either by promoting or repressing transcription of genes to RNA. Activator TFs are able to recruit the RNA polymerase complex (usually with the help of other coactivator proteins or other TFs), while repressor TFs work to block RNA polymerase from initiating transcription[35]. TFs bind to both promoter and enhancer regions. Some TFs are present in all cells and are required for basic transcription. These TFs are often present in promoter regions and at ubiquitous enhancers. The TATA-binding protein TF, for example, binds to TATA-box-like sequences on DNA, located upstream of gene TSSs in about a quarter of human genes[55]. Other TFs are only present in specific types of cells or are active only at certain developmental timepoints. The GATA binding protein 2 (GATA2), for example, plays a key role in regulating hematopoietic stem and progenitor cells[65], whereas the SRY-box 2 (SOX2) is essential for maintaining stem cells in the CNS[6].

The Encyclopedia of DNA Elements (ENCODE) project[28, 29] is a public repository amassing data informative of regulation.
A large part of the ENCODE project holds information on hundreds of TFs and where they bind to DNA in a variety of primary cells, tissue samples, and immortalized cell lines. Almost all of this data comes from ‘ChIP-seq’ experiments. Chromatin immunoprecipitation (ChIP) has become a standard technique to locate DNA-binding proteins within a cell of interest. As described by Mundade et al.[53], protein-DNA interactions are subjected to crosslinking; DNA is sheared and immunoprecipitation is performed with antibodies targeting TFs or other DNA-bound proteins. The recovered DNA can be sequenced to identify where in the genome the protein of interest preferentially binds; high-throughput DNA sequencing-based approaches are referred to as ChIP-seq[68]. As recovered DNA fragments are enriched at specific loci, peak-calling algorithms determine the general area in which the original protein was bound. Once a large set of DNA sequences bound by a TF is determined, computational models can be generated to detect the specific DNA sequence patterns to which the TF preferentially binds. Databases such as JASPAR[52] contain collections of these predictive TF binding models.

1.3.3 Histone Modifications

Histones are proteins around which DNA can be coiled in order to package large genomes into cell nuclei. A nucleosome is the core unit of chromatin, which contains 8 histone proteins and is looped twice by DNA[3]. Individual histones are subject to diverse post-translational modifications. The covalent attachment of different molecular groups to specific amino acids on specific histones can alter the structure of chromatin in the nucleus.
These modifications ultimately lead to the remodelling of chromatin, where chromatin that becomes more loosely packed becomes more accessible to DNA-binding proteins and ultimately favours gene transcription.

While histones may undergo numerous types of modifications (such as phosphorylation and ubiquitination), arguably histone methylation and acetylation have been the most extensively studied[7]. Similarly, while multiple amino acids present on the histones may be modified, lysine (K) residues have been the most informative of gene regulation[29]. The addition of one or more methyl groups can be a sign of transcriptional activation or repression. For example, the tri-methylation (Me3) at lysine 9 (K9) on histone H3 (together, labeled as H3K9Me3) is associated with repetitive elements and the formation of heterochromatin, while H3K4Me3 marks regions proximal to TSSs[7]. Acetylation (Ac) of lysine residues is traditionally indicative of active transcription. The H3K27Ac modification marks active (as opposed to poised) regulatory regions[29]. These patterns or trends of histone modifications are observed in certain functional regions, although functional regions can be found lacking such marks, and conversely such marks can be found in other regions of the genome.

Histone modifications can be detected by ChIP-seq[53]. Such experiments have been conducted in various cell lines and primary tissues, and are available in repositories from large projects such as ENCODE[28, 29] and Roadmap[64].

1.3.4 Chromatin Accessibility

In general, the more tightly chromatin is packed, the less likely the DNA is to be actively transcribed[35]. Chromatin remodelling proteins can unwind sections of DNA from the nucleosome complexes, allowing other DNA-binding proteins to access these regions[47].
Often, the presence of TFs in so-called ‘open-chromatin’ regions is indicative of regulatory activity.

There are three common laboratory methods in use for detecting open chromatin regions: DNase I hypersensitive sites Sequencing (DNase-seq), Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing (FAIRE-seq), and the Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-seq). First described in 2008[15], DNase-seq leverages the nuclease DNase I, which cuts double-stranded DNA. DNase I hypersensitive sites (DHSs) are nucleosome-free regions of DNA, which are accessible to the nuclease and are therefore cut by it. These fragments can be amplified, sequenced, and mapped back to a reference genome. FAIRE-seq[36] uses formaldehyde to crosslink proteins to DNA, and then DNA is sheared via sonication. Fragmented DNA is then suspended in a phenol-chloroform solution, which separates into an aqueous layer sitting atop an organic layer. DNA linked to proteins will sink to the organic layer, whereas nucleosome-free regions float into the aqueous layer. The sequencing step is similar to the DNase-seq method. ATAC-seq[16], developed to require fewer cells and significantly reduce experiment preparation time, uses a modified transposase to introduce adaptor elements into nucleosome-free regions of DNA. Tagged DNA fragments can be mapped back to a reference genome and are indicative of the transposase cut sites, which have a preference for open-chromatin regions. All three methods produce similar open-chromatin peak signals.

As the oldest method, DNase-seq is the most represented form of chromatin accessibility data in ENCODE; however, a large portion of the data were produced from immortalized cell line samples (as opposed to primary tissue).
There are far fewer ATAC-seq datasets; all of these were created between June 2016 and March 2017, and they are highly biased towards human tissues. FAIRE-seq datasets are limited within ENCODE, but contain a mix of immortalized cell line samples and primary cell samples.

1.3.5 Topologically Associating Domains

Another important consideration in regulation is the 3D structure of chromatin. Intuitively, DNA regions in close proximity will be more likely to interact with one another. In 2012, Dixon et al.[24] coined the term topological domains (later changed to topologically associating domains (TADs)), which are generally megabase-sized genomic regions of highly interacting regulatory elements. TADs have been found in both mice and humans covering similar genomic regions, and are therefore thought to be conserved among mammals. Similarly, the analysis of several tissues, primary cells, and cell lines shows that most TADs overlap the same genomic regions, indicating that TADs tend to be consistent across tissues. Studies have shown that abnormalities in TAD boundaries or the rearrangement of genes within them play a role in several disease phenotypes[46], potentially indicating a disruption of interactions between regulatory regions and the intended target genes. It has therefore been proposed that most regulatory regions and their target promoter(s) will be co-localized within the same TAD.

TAD discovery is mostly achieved using a technique called high-resolution chromosome conformation capture (Hi-C)[77]. Cell DNA is crosslinked, forming bonds between proximal chromatin regions. These linked regions are then ligated together and sheared, resulting in fragments of DNA that were originally linked. DNA ‘reads’ are sequenced and mapped, allowing for the determination of which genomic regions have been interacting. Boundaries of TAD regions are detected at positions where the number of interactions drops[24].
TAD boundaries have been shown to be enriched in binding sites for the CCCTC-binding factor (CTCF), which is known for being largely involved in chromatin looping[24].

1.3.6 Computational Predictions of Regulatory Elements

Although the previous types of data are indicative of regulatory regions, it is both time-consuming and expensive to conduct experiments across all cells and tissues of interest, and to confirm the functional roles of DNA segments. However, due to the vast amounts of data now compiled, computational methods have been developed for more comprehensive labelling of regulatory regions. Two of the most popular methods, ChromHMM and Segway, can now be used independently or in conjunction to provide unsupervised machine-learned genome-wide predictions informative of regulatory potential.

First published in 2012, ChromHMM[30] uses histone modifications and CTCF-bound regions from ENCODE as primary input, as these are known to be associated with different forms of regulation. Segmentation analysis is performed using a multivariate Hidden Markov Model, assigning each segment of the genome to one of ten states (including active and inactive promoter regions, enhancer regions, insulator regions, transcribed regions, and repressed regions). Segway, similarly published in 2012[41], uses a dynamic Bayesian network to generate genome states (also ten). It factors chromatin accessibility, certain TF binding peaks, and histone modifications into its predictions.

Both ChromHMM and Segway can be used to predict regulatory regions using any supplied genomes containing data from various histone modifications, chromatin accessibility, and TF ChIP-seq peaks.
Pre-computed predictions are available from the University of California, Santa Cruz (UCSC) Genome Browser[44] for six cell lines profiled extensively in the ENCODE project.

1.4 Preceding Work: Compact Promoters for Gene Delivery

It has long been known that external DNA can be introduced into cells in a specific manner. Transgenic mice, for example, can be generated with non-endogenous DNA by injecting the DNA of interest into embryos. This research has played a large part in human disease discovery and therapeutics. One such endeavour was the Gene Expression Nervous System Atlas (GENSAT) project[39], where researchers studied thousands of genes across the CNS through the insertion of bacterial artificial chromosomes (BACs). These BACs, often containing 100-200 kb of mouse DNA, can reproduce endogenous gene expression. However, within the gene of interest on the BAC, a reporter gene was placed after the target gene’s coding start (ATG) sequence. By visualizing the co-expression of the reporter protein and endogenous protein, these BACs ensured the gene and all of its necessary regulatory regions could recapitulate the expected expression pattern.

The transgenic approach with long DNA sequences allows recapitulation of endogenous gene expression patterns, but the use of such long sequences is not therapeutically relevant because delivery is not feasible. Delivery by a small particle, such as a virus, restricts the amount of DNA that can be included[37]. There has been success with using small, ubiquitous promoters in viral gene therapy. While there have been numerous transgenic studies in which shorter DNA segments drive specific patterns of gene transcription, the use of compact selective promoters in gene therapy is just starting[49].
Notably, a trial to treat individuals with Leber congenital amaurosis-2, a childhood eye disorder that leads to blindness, used a 1,400 bp sequence from the RPE65 gene within an AAV2 vector to drive selective expression of the RPE65 protein[20].

Our lab has been pursuing the development of sets of compact promoters suitable for selective patterns of gene delivery. The systematic design of MiniPromoters (human regulatory sequences of ∼4 kb or less, which promote cell type and tissue-specific expression) was first described by Portales-Casamar et al. in 2010[60] and in a follow-up study by de Leeuw et al. in 2014[22]. The goal was to identify human cis-regulatory regions (RRs) in genes targeting the CNS. This approach was based primarily on the identification of non-coding, highly conserved genomic regions close to a gene of interest. TF binding site predictions were generated across these conserved regions, and used to suggest functional roles for TFs relevant to CNS regions of interest. These regions of interest were fused with promoter regions of the same gene, and the resulting MiniPromoters were assessed in transgenic knockin mice using a procedure that placed the MiniPromoters and a reporter gene at a specific location on the X-chromosome.

This primary work was the basis for the 2016 paper by de Leeuw et al.[23], who began to use MiniPromoters packaged in recombinant AAV2/9 hybrid vectors. Many tested MiniPromoters were those found to express in the previous transgenic mouse initiatives. Newly designed MiniPromoters defined RRs similarly to the methods described above, but also included the use of DHSs, TF ChIP-seq peaks, regions of specific histone modifications, and regions of high conservation. Hickmott et al.[40] used the newer MiniPromoter design strategy to find RRs for the PAX6 gene.
This paper introduced the use of TADs to constrain the search space for RRs. It also introduced the use of CAGE data for identifying TSSs, which resolved the ambiguity of choosing an appropriate promoter RR in the case of genes having multiple transcripts.

1.5 Hypothesis

Compact cis-regulatory sequences can be computationally designed, based on annotated properties of the genome, such that they overlap designs generated by human experts in a painstaking and time-consuming process. Further, the use of annotated properties relevant to the tissue of desired expression will improve automated design success. Based on these hypotheses, I have taken the following approaches in this thesis to establish and assess a semi-automated bioinformatics procedure for the design of compact promoters for use in AAV-based viral vectors. I manually designed a set of MiniPromoters based on sets of regulatory features, with the goal of defining a reference set of designs against which a semi-automated procedure could be assessed. A subset of these were tested through in vivo experiments in mice. I then designed a semi-automated approach inspired by the manual design process, amalgamating thousands of experiments informative of genome regulation in cell-specific contexts in order to predict key RRs consistent with the qualitative assessments of a trained designer. From here, I validated the capacity of the semi-automated procedure to reproduce the bespoke designs.

Chapter 2

Methods

2.1 Data

All datasets used in the manual design of MiniPromoters as well as the automatic regulatory region identification were previously published and are publicly available. The CAGE datasets (TSS annotations and enhancer annotations) and ontology were obtained from the FANTOM5 consortium[4, 45]. Hi-C, TF ChIP-seq, histone modifications, DNase-seq, and FAIRE-seq data were obtained from the ENCODE project[28, 29]. Additional TF histone modification and DNase-seq data were obtained from Roadmap[64].
Additional Hi-C data was obtained from GEO (accession number: GSE87112). Gene annotations from RefSeq[58, 62], ChromHMM[30] and Segway[41] chromatin states, repeat regions identified with RepeatMasker[73], and PhastCons and PhyloP conservation scores were obtained from the UCSC Genome Browser[44] tables. All data were based on the hg19 human reference genome.

Eight bespoke MiniPromoters (Ple360, Ple366, Ple367, Ple368, Ple370, Ple371, Ple372, Ple373) were designed to include sequences contained within previously published reporter gene-containing BAC constructs (RP24-269I17, RP23-234I17, RP23-440L10, RP24-98L14, RP23-281A14, RP24-260F14, RP23-305H12, and RP24-285B17, respectively). This mouse BAC data was obtained from GENSAT[39] and Mouse Genome Informatics (MGI)[12]. The published reporter gene activity indicated that the BAC region contained sufficient and proper cis-regulatory elements to drive endogenous gene expression.

2.2 Bespoke MiniPromoter Construct Design

The bioinformatics for MiniPromoter and RR design has been described[23, 40]. Briefly, this process involves the selection of a gene of interest, specification of a promoter region, and in a subset of cases the selection of one or more enhancer regions. All genes selected for MiniPromoter design must include a TSS that is supported by experimental evidence indicating gene expression in a relevant cell or tissue. TSS identification is based on CAGE data. CAGE reads were extracted for each TSS of each gene using the Zenbu browser (for visual comparison) or the SSTAR view (for numerical comparison). TADs were used to delineate boundaries within which searches for RRs were constrained. A consensus TAD region was determined visually by taking the overlap between TAD data from the H1 human embryonic stem cell line and the IMR90 (lung fibroblast) cell line.
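The consensus TAD was determined visually in this work. As a rough computational analogue only, the overlap of the two TADs that contain a gene can be sketched as an interval intersection. The function name, the half-open interval convention, and the coordinates below are hypothetical and are not taken from the thesis data.

```python
def consensus_tad(tads_a, tads_b, position):
    """Return the overlap of the TADs from two cell lines that contain `position`.

    tads_a, tads_b: lists of (start, end) half-open intervals on one chromosome.
    Returns None if either cell line has no TAD covering the position.
    """
    def containing(tads):
        for start, end in tads:
            if start <= position < end:
                return start, end
        return None

    a, b = containing(tads_a), containing(tads_b)
    if a is None or b is None:
        return None
    # Both intervals contain `position`, so their intersection is non-empty.
    return max(a[0], b[0]), min(a[1], b[1])


# Hypothetical TAD calls for two cell lines (illustrative coordinates only).
h1_tads = [(1_000_000, 1_800_000), (1_800_000, 2_600_000)]
imr90_tads = [(900_000, 1_750_000), (1_750_000, 2_700_000)]
print(consensus_tad(h1_tads, imr90_tads, 1_200_000))
```

Taking the intersection rather than the union keeps the search space conservative: a candidate RR is only considered if both cell lines place it in the same domain as the gene.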
In certain cases, relevant mouse BACs were used to narrow this search space, if the reporter gene co-expressed with the endogenous protein in published studies. The BAC coordinates were then converted from mouse into human coordinates using the UCSC Genome Browser LiftOver tool.

RR identification was based on visual assessment of data (see above) displayed within the UCSC Genome Browser, including the following tracks: RefSeq genes, FANTOM5-identified enhancers, TF ChIP-seq peaks, DNase I hypersensitive clusters, histone modification marks, computational predictions from ChromHMM/Segway, multi-species conservation, and RepeatMasker. RefSeq genomic annotations were reviewed to ensure all RRs excluded known open reading frames or splice sites. A set of 32,693 FANTOM5 enhancers included within SlideBase (from the original 65,423 FANTOM5 set) was used. ChIP-seq experiments provided by ENCODE were limited to a set of 161 TFs that included Factorbook motifs[78, 79]. While the DHS experiments were performed on 125 cell lines (ENCODE V3), the data was used to predict which areas of the genome would be more likely to be open regardless of cellular context, as well as areas open in only specific types of cells. H3K4Me1 and H3K27Ac were used to identify both active and poised RRs. Combined ChromHMM/Segway predictions across six common cell lines were used to identify insulated (CTCF) regions of the genome, in order to constrain the RR search region.
Two types of conservation tracks were used: the 100-vertebrate base pair-conservation track by PhyloP score (to identify non-exonic genomic regions indicative of important genomic elements without introducing a large bias based on more closely-related primate species) and the Multiz Alignments[13] of the rhesus and mouse genomes (under a hypothesis that conserved sequence would increase the likelihood that designs using human sequence would be functional in subsequent in vivo analyses in mouse and rhesus). The RepeatMasker track was used to remove RRs that contained short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), and long terminal repeats (LTRs).

RR boundaries were chosen qualitatively based on the amount and types of overall evidence present in the search space. It was determined that regions that overlapped large numbers of TFs and DHS clusters and had high H3K27Ac activity were marks of general, ubiquitous enhancers. Many of the self-identified cell-specific enhancers were enriched in specific TFs known to be present in the cell type of interest, were regions of high conservation, contained a FANTOM5 enhancer that was linked to a TSS present in the promoter RR, or showed a combination of these features. Boundaries were chosen conservatively, constraining RRs to contain the most overlap of chosen features, in order to minimize the size of each region.

2.2.1 RR Selection

Identified candidate regions were presented to a team of scientists for consideration. Each presented RR had to contain one or more forms of evidence mentioned in the previous section. These regions were then ranked based on their perceived likelihood of being an enhancer, either ubiquitous or cell-specific. Other factors contributing to RR selection included region size, past description in published literature, or similarity of features to the selected promoter sequence.
In total, the selected promoter and any additional RRs could not be more than 2.7 kb in size due to the AAV payload restriction.

2.2.2 MiniPromoter Assembly

MiniPromoter RRs were assembled in the 5’ to 3’ direction. If the RR was located endogenously on the antisense strand, the reverse complement of the RR sequence was used. Promoter RRs were always placed at the most 3’ end of the MiniPromoter designs. Enhancer RRs were added such that the more distal upstream RRs (relative to the endogenous promoter RR) were placed closest to the 5’ end of the construct. Additionally, all RRs located endogenously downstream of the promoter were placed at the 5’ end, regardless of whether the region was more proximal than an upstream RR. Finally, two restriction enzyme sites were added to the 5’ (FseI recognition sequence) and 3’ (AscI recognition sequence) ends of the construct in order to properly clone the MiniPromoter sequence into a vector plasmid.

2.3 Experimental Validation of MiniPromoters

Virus production, injections into mice, mouse harvesting, immunostaining, and imaging methods have been previously described by de Leeuw et al., with the following amendments: only wild type mice were used for testing MiniPromoters, with the virus injected into the superficial temporal vein of mice at one of two time points (either postnatal day 0 or postnatal day 4). These dates were chosen based on optimal injection time point studies by Byrne et al. Control mice were injected with 3.3 × 10^12 vg/mL (viral genomes per milliliter). Mice were harvested 4 weeks post-injection (at time points P28 or P32). In addition to retaining the brain, eyes, spinal cord, and heart for image analysis, the liver and pancreas were also studied.
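The assembly rules of Section 2.2.2 and the 2.7 kb limit can be illustrated with a minimal sketch. This is not the lab's actual tooling: the `RR` class, its `offset` field, and the ordering of multiple downstream RRs (most distal first) are assumptions made for illustration, and the toy sequences are invented. The FseI (GGCCGGCC) and AscI (GGCGCGCC) sites are the standard recognition sequences for those enzymes.

```python
from dataclasses import dataclass

FSEI_SITE = "GGCCGGCC"   # FseI recognition sequence, added at the 5' end
ASCI_SITE = "GGCGCGCC"   # AscI recognition sequence, added at the 3' end
MAX_RR_BP = 2700         # 2.7 kb limit on promoter plus enhancer RRs

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class RR:
    name: str
    seq: str     # sequence as read from the reference strand
    offset: int  # hypothetical: signed distance from the promoter in gene
                 # orientation (negative = upstream, positive = downstream)


def assemble(promoter, enhancers, antisense=False):
    """Order RRs 5' to 3' and add cloning sites, following Section 2.2.2."""
    # Downstream RRs go to the 5' end (assumed order: most distal first).
    downstream = sorted((e for e in enhancers if e.offset > 0),
                        key=lambda e: e.offset, reverse=True)
    # Upstream RRs follow, most distal closest to the 5' end.
    upstream = sorted((e for e in enhancers if e.offset < 0),
                      key=lambda e: e.offset)
    ordered = downstream + upstream + [promoter]  # promoter is always 3'-most
    body = "".join(reverse_complement(r.seq) if antisense else r.seq
                   for r in ordered)
    if len(body) > MAX_RR_BP:
        raise ValueError(f"RRs total {len(body)} bp, over the {MAX_RR_BP} bp limit")
    return FSEI_SITE + body + ASCI_SITE


# Toy sequences, for illustration only.
promoter = RR("promoter", "TTTT", 0)
enhancers = [RR("up_near", "CCCA", -500),
             RR("down", "GGGT", 2000),
             RR("up_far", "AAAC", -5000)]
construct = assemble(promoter, enhancers)
print(construct)
```

The size check is applied to the RRs alone, before the restriction sites are added, since the limit in the text is stated for the promoter and additional RRs.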
All other methods (virus preparation, animal injections, and fluorescent imaging processes) were performed using the procedure outlined for emerald green fluorescent protein (EmGFP) constructs in de Leeuw et al.

For the study of two viruses containing MiniPromoters targeting the corneal epithelium (based on the KRT12 gene), a different protocol was followed. All viruses were of AAV9 serotype, and each contained one of three different promoters (outlined below) to express the EmGFP transgene. All injections were performed intrastromally on adult mice (aged 2-4 months). Injections were all 2 µL, and contained 5 × 10^12 vg/mL and a 1:20 dilution of stock FluoSpheres. Left eyes of nine mice were injected in this study; three eyes were injected for each type of virus created for this experiment. All uninjected (contralateral) eyes were used as negative controls.

Three eyes were injected intrastromally for each construct: the ubiquitous small chicken beta actin (smCBA) promoter, Ple326, and Ple334. Tissues were harvested 6 days post-injection. All eyes (injected left eyes and uninjected right eyes) were embedded in Tissue-Tek O.C.T. compound and sectioned at 20 µm on a Microm HM550 cryostat. A subset of these sections (generally 2-4 sections per experiment) were pressed, and rinsed in 0.1M phosphate-buffered saline (PBS) twice, for five minutes each. After rinsing the sections for five minutes in 0.1M PBST (PBS + Triton X-100), they were blocked for 30 minutes before being incubated overnight at room temperature in a primary EmGFP antibody stain (at a 1:500 dilution). The following day, the sections were rinsed again in 0.1M PBST three times for ten minutes. Sections were then additionally incubated and stained with a secondary antibody (Alexa448 conjugated antibody at 1:1000 dilution) and co-stained with Hoechst33342 dye (at 1:1000 dilution) for one hour at room temperature.
Finally, all sections were washed in a 0.1M phosphate buffer (PB) three times for ten minutes and in 0.01M PB for ten minutes, and were mounted with ProLong Gold Antifade Mountant. All stained sections were then imaged at three different colour channels (DAPI, blue, for Hoechst; TXRED, red, for FluoSpheres; FITC, green, for EmGFP) on an Olympus BX61 fluorescence microscope through the software cellSens at either 10x or 20x magnification. Each image was further processed into composite and single-colour TIFF images using the freeware program ImageJ and its Bio-Formats plugin.

2.4 Semi-automated RR Selection

2.4.1 Promoter Selection

Importantly, each defined promoter RR must contain at least one TSS. After receiving a valid HGNC gene name, OnTarget retrieves each identified FANTOM5 TSS stored in its underlying database (where each TSS is required to have at least one tag in at least one sample, out of 1,829 possible samples). Each TSS is then extended in both the upstream and downstream direction in order to achieve a minimal promoter length.

Downstream, each sequence is extended until one of the following conditions is met:

1. If a TSS is located before the annotated gene start and
   (a) if the annotated coding start site (CDS) is not in the first exon, the TSS will be extended through until the end of the first exon, minus a splice site offset (default of 10 bp);
   (b) if the annotated CDS is in the first exon, the TSS will be extended through until the CDS, minus a Kozak sequence offset (default of 10 bp);
2. If a TSS is located within an exon before the annotated CDS and
   (a) the CDS is not located in the same exon, the TSS will be extended through until the end of the exon, minus the splice site offset;
   (b) the annotated CDS is in the same exon, the TSS will be extended through until the CDS, minus a Kozak sequence offset;
3. If a TSS is located within a gene intron before the annotated CDS and
   (a) the annotated CDS is not in the following exon, the TSS will be extended through until the end of the following exon, minus the splice site offset;
   (b) the annotated CDS is in the following exon, the TSS will be extended to the CDS, minus the Kozak sequence offset;
4. If the TSS is in an intron downstream of the annotated CDS, the TSS will be extended through until the end of the following exon, minus the splice site offset;
5. If the TSS is in a coding exon, the TSS will be extended until the end of the exon, minus the splice site offset.

This downstream expansion is then tested for unwanted elements, such as ATG sequences (which could create possible ORFs) or other FANTOM5 annotated (unextended) TSSs. The extensions are trimmed to no longer include any of these unwanted elements. In the case of ATG sites, these are only trimmed if they fall within the annotated gene area.

Upstream, each TSS is extended until one of the following conditions is met:

1. If the TSS is located before the annotated gene or within the gene but before the annotated CDS in the first exon, and
   (a) there is another annotated FANTOM5 TSS (unextended) further upstream, the TSS will be extended until 1 bp before the closest upstream TSS;
   (b) there is no other FANTOM5 TSS further upstream, the TSS will be extended until the PhastCons conservation score falls below a threshold (default 60%);
2. If the TSS is located within an exon and
   (a) there is another annotated FANTOM5 TSS (unextended) further upstream within the same exon, the TSS will be extended until 1 bp before the closest upstream TSS;
   (b) there is no other TSS within the same exon, the TSS will be extended up until the start of the exon, excluding nucleotides within the splice site offset;
3. If the TSS is located within an intron and
   (a) there is another annotated FANTOM5 TSS (unextended) further upstream within the same intron, the TSS will be extended until 1 bp before the closest upstream TSS;
   (b) there is no other TSS within the same intron, the TSS will be extended up until the start of the intron, excluding nucleotides within the splice site offset.

Each minimal promoter is returned to the user, where it is possible to combine multiple promoters into an extended promoter RR. This can be done as long as each minimal promoter neighbours another desired minimal promoter without overlapping splice and Kozak sites.

2.4.2 Enhancer Selection

An enhancer RR is described as a region lacking any annotated FANTOM5 TSSs. For a given search space, RRs are selected based on the overlap of regulatory features. An underlying feature matrix and weighting vector define boundaries and provide each RR with a score. The higher scoring regions contain the most informative data indicative of regulation. The procedure with default settings (all of which can be adjusted) is described below.

The underlying Data Repository of OnTarget stores cell or tissue type-specific data for the following features:

1. Hi-C TAD datasets from 33 cell lines and tissue samples: As this is our smallest dataset, often all TADs are taken into consideration, and the consensus TAD is chosen for delineating a search space.
2. 1,284 TF ChIP-seq experiments that cover 145 primary cell/tissue types and cell lines: When creating ubiquitous RR profiles, each TF track is condensed into a single vector based on presence or absence, and can be cell type specific or agnostic. These vectors are then summed into one consensus TF vector.
3. Chromatin accessibility data based on DNase-seq and FAIRE-seq in 301 primary cell/tissue types and cell lines: Should a specific cell type be unavailable, accessibility data across all datasets are used as a consensus track.
4. Histone modification data for 33 histone modification signatures in 197 primary cell/tissue types and cell lines: When a specific cell type is unavailable, data across all datasets for a specific histone and modification are used as a consensus track. We primarily focus on H3K27Ac for regions indicative of active enhancer elements. H3K4Me2 marks are also considered; however, this repressor mark is negatively associated with expression and therefore negatively affects RR assignment.
5. FANTOM5 experimental enhancers: Certain enhancers show activity in a variety of cell and tissue types, while others have not been shown to associate to any particular location. These non-specific enhancers appear regardless of whether a certain cell or tissue type is selected.

Additionally, we include the following cell-type agnostic datasets in the RR identification pipeline:

1. Per-base conservation data across 100 vertebrates with PhastCons scores: Unlike the other RR features where data is stored in BED files, PhastCons nucleotide scores are implemented as a Wiggle track. This method is taken from the UCSC Genome Browser method, which has used the PHAST package described online.
2. Repeated elements from RepeatMasker: We consider 3 out of the possible 10 assignments from this data source. RRs overlapping SINE, LINE, and LTR elements are excluded from further analysis.

Once appropriate cell type data is selected, the RR method begins by defining a search space based on the TAD track. The chosen TAD is the one in which a gene of interest is located. While TAD boundaries are the default search space for each iteration of RR identification, this can be changed to a defined chromosomal range or the intergenic region between the gene of interest and its closest up and downstream annotated RefSeq genes.

A search space consists of a defined number of nucleotides n and default features f.
A [n × f] feature matrix M is created and initialized with zeros in all cells. As shown in Figure 3a, each cell of the matrix represents the presence (1) or absence (0) of a feature. M is then multiplied by a weight vector of size [1 × f], where each feature is assigned a default weight corresponding to its overall importance in RR identification, so that each feature column of M is scaled by its weight. Selection of values for the weight vector is discussed in the Results section. The weighted values in each row of the new M matrix are then summed across the feature columns, creating the sum vector of size [n × 1]. Each position in the sum vector is then multiplied by individual mask vectors (also of size [n × 1]). Mask vectors act as absolute features that must be present or absent in each identified RR. The two default mask vectors are the coding exon mask and the RepeatMasker mask. By default, coding exons are excluded from RR identification, and are represented by 0s in the mask. Similarly, SINE, LINE, and LTR elements are represented by 0s in the RepeatMasker mask, in order to exclude these regions from RR identification. Multiplying the sum vector by the mask vectors results in the score vector S, which contains the final score of each nucleotide in the search space (Figure 3b).

Segments of qualifying positions are reported when 10 or more contiguous nucleotide scores pass the threshold (Figure 3c). This threshold is calculated from the distribution of scores in each S vector: at each run, the score at the 99th percentile is chosen. Regions scoring at or above this threshold are reported as RRs.

2.5 Validation of Semi-automated Design Performance

We evaluate OnTarget based on two expectations. First, we expect RRs identified using an accumulation of all data to be different from those identified using cell or tissue-specific data. Second, OnTarget should detect RRs from successful bespoke MiniPromoter designs.

In my first experiment, I decided to use liver and hepatocyte datasets, as these are the most abundant datasets among ChIP-seq (for TFs and histone marks) and DHS experiments.
No cell line data (e.g. HepG2 cells) were used in the cell-specific analysis. I then chose two different TADs. One TAD (hg19:chr7:87,000,001-87,802,064; an ∼800 kb region) contained the gene ABCB4, known to be expressed in the liver and hepatocytes. This gene was chosen by searching the Human Protein Atlas for all genes whose RNA and protein were both expressed almost exclusively in the liver, across datasets from the Protein Atlas, the Genotype-Tissue Expression project (GTEx), and FANTOM5. The gene GYS2 was originally identified for analysis as the first gene to fit the criteria, although it was discarded because it was the only gene located within its TAD. A similar selection process was used for a separate TAD. The gene NOS1 was chosen using the same procedure, appearing in the set of genes listed as 'not expressed' in all liver samples across the same three datasets. This TAD (hg19:chr12:117,640,001-118,475,617; an ∼835 kb region) covers mostly brain-expressed and housekeeping genes.

Figure 3: Visual Representation of the OnTarget Selection Module. (a) The OnTarget 0-1 matrix. (b) Each nucleotide receives a score based on its features. (c) Contiguous high-scoring nucleotides are reported back as regulatory regions. The top UCSC data tracks allow for the visualisation of each feature corresponding to one nucleotide. A. At each nucleotide position, a feature is either present (represented by a 1) or absent (represented by a 0). B. After each feature is multiplied by its importance weighting, all features are summed, resulting in a final score for each individual nucleotide. C. Contiguous high-scoring regions are reported as a potential regulatory region.

My second experiment used a subset of eye, brain, and neuron datasets, as our bespoke MiniPromoters were used to target cells within the eye and brain.
I tested OnTarget for its capacity to predict the component regions of three MiniPromoter constructs: one successful design, one unsuccessful design, and one design awaiting testing.

Chapter 3
Results

3.1 Bespoke Designs

Bioinformatics analysis procedures were established for the delineation of RRs in human genes, which are described in the Methods. A total of 49 MiniPromoters were designed based on detailed analyses of 35 genes. Approximately 50 additional genes were partially analyzed but discontinued due to their endogenous expression pattern, lack of a homologous gene pair between human and mouse, or because an AAV-suitable design was already reported in the literature. Genes were selected based on expression data (predominantly CAGE and literature-derived data such as Drop-seq[48]) within a target cell or tissue of interest. Table 1 shows a list of all designed MiniPromoters. Most designs fit within the 2.7 kb limit, with the exception of Ple346 (NEFM gene base), which spanned 2,711 bp. All other MiniPromoters ranged in size from 331 bp to 2,700 bp. The average design size was 1.71 kb. A subset of 40 designs incorporated at least one enhancer RR in addition to a promoter RR. Two of the designs (Ple326 and Ple334) used a total of six RRs (promoter inclusive), which was the most RRs included in any design. As the project progressed, and the importance of extremely compact MiniPromoters emerged, there was a trend towards shorter designs.

Table 1: List of all MiniPromoters designed between January 2016 and July 2017. All tested designs are described in Table 2.

Design number | Gene | Target cell/tissue | MiniPromoter size (bp) | Tested | Number of regulatory regions
Ple326 | KRT12 | corneal epithelium | 2,313 | Y | 6
Ple328 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | 2,148 | Y | 3
Ple329 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | 2,513 | Y | 3
Ple330 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | 1,982 | Y | 2
Ple331 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | 1,982 | Y | 2
Ple332 | KCNJ8 | pericytes | 2,100 | Y | 3
Ple333 | ABCC9 | pericytes | 2,332 | Y | 3
Ple334 | KRT12 | corneal epithelium | 2,326 | Y | 6
Ple338 | CLDN5 | endothelial cells | 2,567 | Y | 4
Ple339 | CLDN5 | endothelial cells | 1,973 | Y | 2
Ple340 | CLDN5 | endothelial cells | 2,700 | Y | 4
Ple341 | PCP2 | bipolar cells | 784 | Y | 2
Ple342 | TUBB3 | retinal ganglion cells | 1,992 | Y | 2
Ple343 | TUBB3 | retinal ganglion cells | 2,669 | Y | 3
Ple344 | TUBB3 | retinal ganglion cells | 801 | Y | 2
Ple345 | NEFL | retinal ganglion cells | 2,693 | Y | 5
Ple346 | NEFM | retinal ganglion cells | 2,711 | Y | 5
Ple347 | GNGT2 | cones | 1,197 | Y | 2
Ple348 | PDE6H | cones | 2,025 | Y | 3
Ple349 | PDE6H | cones | 2,005 | Y | 4
Ple350 | AQP4 | Müller glia | 1,802 | N | 2
Ple351 | GPR37 | Müller glia | 1,890 | N | 2
Ple352 | TACR3 | bipolar OFF subtypes BC1A, BC1B, BC2 | 2,643 | N | 2
Ple353 | GRIK1 | bipolar OFF subtypes BC2, BC3A, BC3B, BC4 | 2,367 | N | 3
Ple354 | GRIK1 | bipolar OFF subtypes BC2, BC3A, BC3B, BC4 | 2,646 | N | 3
Ple355 | ADORA2A | striatum | 2,666 | N | 3
Ple356 | DBH | locus coeruleus | 2,479 | N | 4
Ple357 | DRD1 | striatum | 2,200 | N | 4
Ple358 | DRD2 | striatum | 1,659 | N | 2
Ple359 | DRD2 | striatum | 2,680 | N | 4
Ple360 | SLC6A3 | substantia nigra | 2,322 | N | 2
Ple361 | PTPN3 | thalamus | 2,092 | N | 3
Ple362 | RGS16 | thalamus | 2,027 | N | 5
Ple363 | PDGFRB | pericytes | 846 | N | 1
Ple364 | PDGFRB | pericytes | 1,396 | N | 2
Ple365 | PDGFRB | pericytes | 730 | N | 1
Ple366 | CCK | GABAergic neurons | 1,469 | N | 1
Ple367 | DLX1 | GABAergic neurons | 970 | N | 1
Ple368 | GAD2 | GABAergic neurons | 1,091 | N | 2
Ple369 | SST | GABAergic neurons | 681 | N | 2
Ple370 | CORT | GABAergic neurons | 399 | N | 2
Ple371 | DLX5 | GABAergic neurons | 595 | N | 2
Ple372 | PVALB | GABAergic neurons | 832 | N | 2
Ple373 | CX3CR1 | microglia | 372 | N | 1
Ple374 | P2RY12 | microglia | 505 | N | 1
Ple375 | P2RY12 | microglia | 943 | N | 1
Ple376 | TMEM119 | microglia | 651 | N | 1
Ple377 | TREM2 | microglia | 717 | N | 2
Ple378 | TYROBP | microglia | 331 | N | 1

3.2 Experimental Validation of Bespoke Designs

Twenty MiniPromoters have been tested in young mice at P0 and P4. Two of the 20 MiniPromoters were additionally tested in adult mice. Table 2 summarizes the results of the tested MiniPromoters. While the brain, eyes, spinal cord, heart, liver, and pancreas were all analyzed, only expression in targeted tissues will be discussed.

Table 2: MiniPromoters tested in mice in vivo. * Off-target expression observed along with expected expression. † Expected expression not observed, experiments still ongoing. ‡ A subset of expected expression observed. U: Unknown; a positive control could not be established for the target cell type, and therefore success or failure could not be accurately determined.

Design number | Gene | Target cell/tissue | Actual expression | Success
Ple326 | KRT12 | corneal epithelium | corneal stroma | U
Ple328 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | amacrine, horizontal, ganglion cells | Y‡
Ple329 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | amacrine, horizontal, ganglion cells | Y‡
Ple330 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | amacrine, horizontal, Müller glia, ganglion cells | Y
Ple331 | PAX6 | amacrine, horizontal, Müller glia, ganglion cells | amacrine, horizontal, Müller glia, ganglion cells | Y
Ple332 | KCNJ8 | ocular pericytes | N/A | N
Ple333 | ABCC9 | ocular pericytes | N/A | N
Ple334 | KRT12 | corneal epithelium | N/A | U
Ple338 | CLDN5 | endothelial cells | endothelial cells, horizontal cells | Y*
Ple339 | CLDN5 | endothelial cells | endothelial cells, horizontal cells | Y*
Ple340 | CLDN5 | endothelial cells | endothelial cells, amacrine cells | Y*
Ple341 | PCP2 | bipolar cells | bipolar cells | Y
Ple342 | TUBB3 | retinal ganglion cells | retinal ganglion cells, amacrine cells | Y*
Ple343 | TUBB3 | retinal ganglion cells | retinal ganglion cells, amacrine cells | Y*
Ple344 | TUBB3 | retinal ganglion cells | retinal ganglion cells | Y
Ple345 | NEFL | retinal ganglion cells | retinal ganglion cells | Y
Ple346 | NEFM | retinal ganglion cells | retinal ganglion cells | Y
Ple347 | GNGT2 | cones | cones (including cone bipolar cells) | Y
Ple348 | PDE6H | cones | retinal ganglion cells, amacrine cells | N†
Ple349 | PDE6H | cones | retinal ganglion cells, amacrine cells | N†

Ple326 and Ple334, based on the KRT12 gene, were tested for their capacity to direct reporter gene expression from AAV preparations by temporal vein injection of AAV and injection into the corneal stroma in adult mice. Both MiniPromoters contained the same five enhancer RRs, while their promoter RRs were based on two distinct FANTOM5 TSSs (see Figure 4). Corneas injected with Ple326 showed expression in the corneal stroma, at levels below that directed by the smCBA-EmGFP control virus. Mice injected with Ple334 showed no apparent EmGFP expression throughout the corneal stroma. As displayed in Figure 5, the bespoke MiniPromoters and positive control could not direct observable expression in the epithelial layer (where expression was anticipated for Ple326 and Ple334). This result, therefore, does not indicate a success or failure of Ple326 or Ple334, as we were unable to determine a baseline expression pattern with which to compare. These MiniPromoters are the only ones of the design set to have undetermined results. As expected, P0 and P4 mice showed no expression after harvest, as the corneal epithelium is not fully formed until P12-14, when mice first open their eyes[17].

Figure 4: Manual Bioinformatics Design of Ple326 and Ple334 Novel MiniPromoters from the KRT12 Gene. The blue highlights indicate the RRs selected for use in both Ple326 and Ple334. Each MiniPromoter uses the same RRs, but different promoters.
Promoter regions were based on two distinct FANTOM5-identified TSSs. Additional RRs were chosen based on the overlap of DHS, TF ChIP-seq, histone mark, and conservation data. One identified RR overlaps another FANTOM5 TSS; however, its expression was considered negligible upon further inspection.

Ple328, Ple329, Ple330 and Ple331 were based upon the PAX6 gene. It was predicted that the PAX6 promoter RR and additional enhancer RRs could restrict expression to four specific cell types in the retina (see Table 2): amacrine cells, horizontal cells, ganglion cells, and Müller glia. Ple328 and Ple329 shared 2 of their 3 RRs (one enhancer RR and the promoter RR). Ple330 and Ple331 contained 2 RRs each, and were exactly the same construct, except for an 8 bp change in a PAX6 TF binding site in the enhancer RR. Previous PAX6 MiniPromoters could only achieve expression in combinations of three out of four cell types. Ple330 and Ple331 were able to achieve expression in all four expected cell types, although expression levels of EmGFP were stronger in the latter. In Ple328, there was no obvious expression in Müller glia. Some injections of Ple329 covered all four cell types, although expression was not as clear as that seen in Ple331.

Ple332 and Ple333 were based on KCNJ8 and ABCC9 respectively, and were designed to target eye pericytes. The two genes are adjacent in both the human and mouse genomes, and encode components of the same potassium channel. For this reason, both MiniPromoters used the same two RRs, and promoter RRs were designed to incorporate the TSS of each gene. FANTOM5 expression levels of each gene suggested that both constructs would be expressed at very low levels within the eye. After imaging of mouse eyes at both timepoints, no clear expression was found for these two MiniPromoters.

Ple338, Ple339, and Ple340 were based on the CLDN5 gene, and were partial re-designs of an old MiniPromoter design (Ple32, data not shown) to target CNS endothelial cells.
All three MiniPromoters contained the same promoter RR, which was a cut-down version of Ple32, which contained only a promoter RR. Ple338 was additionally packaged with three enhancer RRs, Ple339 with one enhancer RR, and Ple340 with three different enhancer RRs. Each enhancer RR was different, and enhancer RRs were grouped by predicted linkage to the CLDN5 TSS, the inclusion of a FANTOM5-derived enhancer, and by regions not found (conserved) in the mouse genome, respectively. All MiniPromoters drove expression in endothelial cells. Ple340 was also found to express in off-target locations, including amacrine and bipolar cells in the eye. Ultimately, the original MiniPromoter (Ple32) had the strongest expression with the least amount of off-target expression, indicating that the promoter RR is enough to reproduce endothelial expression.

Figure 5: MiniPromoters Ple326 and Ple334 from the KRT12 Gene Drive Gene Expression in Layers of the Cornea. The smCBA (A) promoter was tested against Ple326 (B) and Ple334 (C). All intrastromal injections included the EmGFP reporter protein and FluoSpheres, injected into adult mice. All eyes were harvested six days post-injection. Cell nuclei are visible in blue with Hoechst33342. EmGFP antibodies are visible in green. FluoSphere locations are visible in red. A. EmGFP expression is seen strongly in the stroma and endothelium layers of the cornea. One line of antibody stain can be seen overlapping the epithelial layers; however, it could not be determined whether this was true EmGFP or an artifact, as this pattern was not seen anywhere else along the cornea surface over three replicates. B. There is some overlap in the stroma with EmGFP, although it is much weaker than with the smCBA promoter. No obvious EmGFP expression is seen in the epithelium or endothelium layers. C. No apparent expression of EmGFP in any layer of the cornea.

Ple341, based on the PCP2 gene (see Figure 6), was a cut-down version of an older MiniPromoter design (Ple265) to target bipolar ON cells. While Ple265 was composed of only one RR, Ple341 contained a smaller promoter RR, accompanied by a small enhancer RR, which was contained in the original Ple265 promoter. Ple341 (Figure 7) produced comparable expression to Ple265 using less DNA (784 bp compared to 986 bp). It is still undetermined whether the smaller promoter RR used in Ple341 is sufficient to reproduce expression in bipolar cells, or whether the additional enhancer element is required.

Figure 6: Manual Bioinformatics Design of Cutting Down the Original Promoter of Ple265 to Form the Ple341 MiniPromoter. The blue highlight indicates the original promoter sequence of Ple265, on which Ple341 was based. The new promoter region included the main FANTOM5-identified TSS and extended until the loss of conservation between the human and mouse DNA sequences. The new RR was based on the remaining conserved sequence from the original Ple265 design.

Figure 7: MiniPromoter Ple341 from the PCP2 Gene Drives Gene Expression in Retinal Bipolar Cells. Ple341 (PCP2 - 784 bp): The construct contains the Ple341 promoter driving the EmGFP reporter gene, injected into P4 mice and harvested after 28 days. Cell nuclei are visible in blue with Hoechst33342. EmGFP antibodies are visible in green. GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; ONL, outer nuclear layer. Image by Andrea Korecki.

Ple342, Ple343, and Ple344 were based on the TUBB3 gene and were partial re-designs of an old MiniPromoter (Ple321) designed to target retinal ganglion cells.
Ple342 and Ple343 contained a newly identified enhancer RR, chosen for its likelihood of being a ubiquitous enhancer that would increase the expression of the original MiniPromoter. Ple342 contained only this new enhancer RR and the original promoter RR from Ple321. Ple343 contained the new RR with the old enhancer RR of Ple321 and the original promoter RR. Ple344 contained a compact version (310 bp) of the original promoter (2,669 bp). Additionally, an enhancer-type region was identified within the original promoter and was subsequently defined as a new RR. This new enhancer RR was 491 bp, resulting in a compact MiniPromoter of 801 bp (see Figure 8). All three newly designed MiniPromoters produced expression in retinal ganglion cells and basal ganglia in the brain. Both Ple342 and Ple343 produced off-target expression in some amacrine cells, however, which was not observed in the original Ple321 analysis, suggesting that amacrine expression came from the addition of the ubiquitous enhancer RR. Ple344 (Figure 9) produced expression comparable to Ple321 using over 1 kb less space.

Figure 8: Manual Bioinformatics Design of Cutting Down the Original Promoter of Ple321 to Form the Ple344 MiniPromoter. The blue highlight indicates the original promoter sequence of Ple321, on which Ple344 was based. A new RR was chosen based upon both TF ChIP-seq data and histone mark data. The new promoter region included the main FANTOM5-identified TSS along with a large amount of overlapping TF ChIP-seq data.

Figure 9: MiniPromoter Ple344 from the TUBB3 Gene Drives Gene Expression in Retinal Ganglion Cells. Ple344 (TUBB3 - 801 bp): The construct contains the Ple344 promoter driving the EmGFP reporter gene, injected into P4 mice and harvested after 28 days. Cell nuclei are visible in blue with Hoechst33342. EmGFP antibodies are visible in green. GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; ONL, outer nuclear layer.
Image by Andrea Korecki.

Ple345 and Ple346 were based on the genes NEFL and NEFM respectively, and were designed to target retinal ganglion cells (see Figure 10). The two genes are adjacent in both the human and mouse genomes. Four common enhancer RRs were used in both MiniPromoters along with a separate promoter RR for each gene. Both MiniPromoters showed high expression levels of the reporter gene in the retinal ganglion cells, as well as very high expression in the basal ganglia in the brain. No significant off-target expression was observed. Ple345 (Figure 11) and Ple346 showed a much higher level of reporter expression than that seen in Ple344, at the expense of being much larger MiniPromoters (2,693 bp and 2,711 bp compared to 801 bp).

Figure 10: Manual Bioinformatics Design of Ple345 and Ple346 Novel MiniPromoters from the NEFM and NEFL Genes. The blue highlights indicate the RRs selected for use in both Ple345 and Ple346. Each MiniPromoter uses the same RRs, but different promoters. Promoter regions were based on FANTOM5-identified TSSs. Additional RRs were chosen based on the overlap of DHS, TF ChIP-seq, histone mark, and conservation data. Additionally, two RRs overlap FANTOM5-identified enhancers.

Figure 11: MiniPromoter Ple345 from the NEFL Gene Drives Gene Expression in Retinal Ganglion Cells. Ple345 (NEFL - 2,693 bp): The construct contains the Ple345 promoter driving the EmGFP reporter gene, injected into P4 mice and harvested after 28 days. Cell nuclei are visible in blue with Hoechst33342. EmGFP antibodies are visible in green. GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; ONL, outer nuclear layer. Image by Andrea Korecki.

Ple347, based on the gene GNGT2, and Ple348 and Ple349, both based on PDE6H, were designed to target cone photoreceptors. Ple347 (Figure 12) contained one enhancer RR and one promoter RR.
While Ple348 and Ple349 both contained the same two enhancer RRs, they differed in the length of the promoter (Ple348 contained 1,088 bp compared to 418 bp for Ple349), and 650 bp of the deleted sequence was included as an additional enhancer RR in Ple349. Ple347 expressed highly in cones. Surprisingly, more reporter activity was observed in cones transduced with Ple347 than with the ubiquitous smCBA promoter. Expression was also detected in cone bipolar cells (see Figure 13). Although cone bipolar cells were not originally a target, they share similar properties with cone photoreceptors, and therefore the observation is not unexpected. Ple348 and Ple349 did not show any cone photoreceptor expression, although off-target cone bipolar and amacrine cells seemed to be transduced. It should be noted, however, that at the time of writing, Ple347, Ple348 and Ple349 had only been analysed in P4 mice. Cone transduction would be the strongest in earlier stages (P0)[17], and therefore we cannot determine the true strength of Ple347 or whether Ple348 and Ple349 are truly negative.

Figure 12: Manual Bioinformatics Design of the Ple347 Novel MiniPromoter based on the GNGT2 Gene. The blue highlights indicate the RRs selected for use in Ple347. The GNGT2 promoter encompasses two of the six FANTOM5-identified TSSs. Other TSSs were not included due to their off-target potential and their overlap with the TSS of the nearby ABI3 gene. The additional RR was chosen based on proximity to a FANTOM5-identified enhancer and a binding site for CRX, a known photoreceptor-specific TF.

Figure 13: MiniPromoter Ple347 from the GNGT2 Gene Drives Gene Expression in Cone Cells. Ple347 (GNGT2 - 1,197 bp): The construct contains the Ple347 promoter driving the EmGFP reporter gene, injected into P4 mice and harvested after 28 days. Cell nuclei are visible in blue with Hoechst33342. EmGFP antibodies are visible in green. Expression is observed in cone photoreceptors (white arrow) and in cone bipolar ON cells (white chevron).
GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; ONL, outer nuclear layer; R&CL, rod and cone photoreceptor cell layer. Image by Andrea Korecki.

3.3 Automated System Creation of OnTarget

Based on the bespoke designs, a semi-automated procedure was implemented for compact promoter design. Distinct modules were created for the selection of promoter and enhancer regions. Minimal promoters were identified for every FANTOM5 TSS, resulting in 201,802 sequences. Enhancer RRs were calculated 'on-the-fly', as changes to default settings give rise to a different number and set of sequences. Three specific examples are described below.

We based our initial predictions on the following features, each with a corresponding weight between 0 and 1: FANTOM5 enhancers (1), TF ChIP-seq peaks (0.75), chromatin accessibility (0.5), H3K27Ac (0.25), and human-mouse conservation (0.5). These weights were motivated by the feature priority used qualitatively when creating bespoke MiniPromoters. Importantly, the weighting of each feature is the driving force of our Enhancer Identification step. A higher weight represents a stronger importance placed on a feature. While these weights can be changed by the user, we empirically selected a default set which we have used for our assessment of OnTarget. As described below, using these weights, we were able to reproduce most of our chosen RRs in constructs that produced positive results.

As described in the Methods section, after feature weighting, each nucleotide in the search space is given a score. Contiguous high-scoring regions are then returned as being potentially involved in regulation. In order to restrict the number of potential RRs, a threshold score is calculated based on the overall distribution of scores observed within each search space. The cumulative distribution charts for 2 example spaces are shown in Figure 14.
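As a rough illustration of the scoring and thresholding just described, the following Python sketch applies the default weights to a toy search space. It is not the OnTarget implementation: the function names, feature keys, toy coordinates, and the nearest-rank percentile convention are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Default weights quoted in the text; the dictionary keys are hypothetical names.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "fantom5_enhancer": 1.0,
    "tf_chip": 0.75,
    "accessibility": 0.5,
    "h3k27ac": 0.25,
    "conservation": 0.5,
}


def score_positions(features: Dict[str, List[int]],
                    masks: List[List[int]],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> List[float]:
    """Weighted sum of 0/1 feature tracks, set to 0 wherever any mask is 0."""
    n = len(masks[0]) if masks else len(next(iter(features.values())))
    scores = [0.0] * n
    for name, feature_track in features.items():
        for i, present in enumerate(feature_track):
            scores[i] += weights[name] * present
    for mask in masks:
        for i, keep in enumerate(mask):
            scores[i] *= keep
    return scores


def percentile_threshold(scores: List[float], pct: int = 99) -> float:
    """Score at the pct-th percentile (nearest-rank; an assumed convention)."""
    ordered = sorted(scores)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def call_regions(scores: List[float], threshold: float,
                 min_len: int = 10) -> List[Tuple[int, int]]:
    """Half-open (start, end) runs of at least min_len positions >= threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores + [float("-inf")]):  # sentinel ends a final run
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


def track(n: int, start: int, end: int, inside: int = 1) -> List[int]:
    """Toy 0/1 track: `inside` on [start, end), the opposite value elsewhere."""
    return [inside if start <= i < end else 1 - inside for i in range(n)]


# Toy search space of 1,000 nucleotides with made-up feature coordinates.
N = 1000
features = {
    "tf_chip": track(N, 100, 130),
    "accessibility": track(N, 110, 125),
    "h3k27ac": track(N, 500, 505),
}
repeat_mask = track(N, 120, 125, inside=0)  # 0 marks a masked repeat

scores = score_positions(features, [repeat_mask])
threshold = percentile_threshold(scores)
regions = call_regions(scores, threshold)
print(threshold, regions)
```

With these toy tracks the 99th-percentile score is 0.75 and a single region covering positions 100 to 119 is reported. The 5 bp stretch after the masked repeat also reaches the threshold but is shorter than the 10 bp minimum, and the isolated H3K27Ac signal scores below the threshold.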
Based on our analyses, and constrained by the limited number of known regulatory regions, we could not determine a universal threshold for RR identification, instead opting to only include contiguous regions scoring above the 99th percentile of nucleotide scores. Using this threshold, we were able to reproduce 12 out of 20 regions identified by MiniPromoter designers across six different TADs.

3.4 Assessing the Performance of OnTarget on Experimental Data

To test our proof-of-concept, RR identification using all available data was compared with RR identification using a liver and hepatocyte subset. As described in the Methods section, two TADs were analyzed: an ∼800 kb region on chromosome 7 containing at least one liver-specific gene (ABCB4; Table 3), and an ∼835 kb region on chromosome 12 containing mostly brain-specific (NOS1) or housekeeping genes (Table 4).

Figure 14: The cumulative distribution charts of individual nucleotide scores from two TADs. All nucleotides within the TADs containing the genes ABCB4 (top) and NEFM (bottom) have different discrete scores, although the overall distribution pattern remains similar.

In the more liver-specific TAD, 31 RRs were reported using the combination of all datasets as features and the heuristic algorithm from the Methods. In contrast, using only liver and hepatocyte samples as features resulted in the prediction of 64 RRs. Only 5 RRs overlapped between the two sets. While the all-feature RRs were spread across the TAD, 64% of the liver-specific RRs fell within 29% of the TAD. The covered portion of the TAD included the genes ABCB4, ABCB1, and the promoter region of RUNDC3B, all of which are expressed in the liver, according to RNA sequencing data from the Human Protein Atlas, GTEx, and FANTOM5.

Table 3: Summary of OnTarget regulatory region predictions for the TAD containing the gene ABCB4. Few identified liver-specific RRs overlap those predicted using all datasets. Furthermore, liver-specific RRs tend to cluster together, unlike those from all datasets, which are spread randomly throughout the TAD. In general, the fewer datasets used, the less strict OnTarget becomes with RR boundaries, as seen by the difference in RR size between datasets.

ABCB4 TAD | Liver datasets | All datasets
Median RR size (bp) | 232 | 73
Regions identified by OnTarget | 64 | 31
Overlapping regions | 8% | 16%
RR localization | clustered | sparse

In the second TAD, 54 RRs were reported when using all datasets as features, while 75 RRs were identified using liver-specific datasets, with an overlap of 25 RRs. Unlike the patterns seen in the TAD on chromosome 7, both ubiquitous and liver-specific RRs were spread throughout the region.

Table 4: Summary of OnTarget regulatory region predictions for the TAD containing the gene NOS1. There is greater overlap between liver-specific and all datasets within this TAD. All identified RRs were scattered throughout the TAD, not clustering around any particular gene, unlike the RRs proposed in the ABCB4 TAD. As in the previous TAD, lower dataset counts result in less stringent RR boundaries in a liver-specific context, while the median RR size remains almost constant using all datasets.

NOS1 TAD | Liver datasets | All datasets
Median RR size (bp) | 300 | 71
Regions identified by OnTarget | 75 | 54
Overlapping regions | 33% | 46%
RR localization | sparse | sparse

3.5 Comparing the Designs Between Bespoke and Semi-automated Approaches

To assess the reliability of the semi-automated approach in reproducing the designs generated by hand, I compared the resulting designs (see Table 6). For the analysis of our successful design, I chose Ple345 and Ple346 due to their success in targeting retinal ganglion cells, with no apparent off-target expression. Furthermore, these MiniPromoters each used five RRs (six distinct RRs in total, as both MiniPromoters use the same four enhancer RRs but a different promoter RR), which allowed for more regions to be compared.

Using all data available on UCSC at the time of bespoke design, we identified 5 enhancer RRs and 2 promoter RRs in a 224 kb region surrounding the NEFM and NEFL genes. A subset of 4 out of 5 enhancer RRs were included in the finalized designs.

OnTarget analyzed the full ∼640 kb TAD, which surrounded both genes (as well as two long non-coding RNA transcripts, one microRNA, and the partial 3' end of another protein-coding gene). Knowing that the expected expression should be seen in both retinal ganglion cells and basal ganglia in the brain, I analyzed feature data from the following primary cell and tissue datasets: neuronal stem cells, neuronal progenitor cells, neuron, brain, midbrain, eye, and retina. Within the original 640 kb TAD, 15 RRs were identified. A subset of 10 RRs (66% of all identified RRs) were located within the 224 kb region originally used for manual RR identification. Analyzing the same 640 kb TAD using all datasets, mimicking our bespoke MiniPromoter design, 17 RRs were identified. A subset of 9 RRs were found within the 224 kb search space. Out of the 7 manually identified RRs, 4 overlapped with those found by OnTarget using the brain and eye datasets (3 using all datasets). When the scoring threshold was relaxed to the 98th percentile, 5 RRs were recovered using the brain and eye datasets.
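The overlap counts reported in this comparison come down to simple interval arithmetic. The sketch below shows one way such counts could be computed; the coordinates are invented for illustration and do not correspond to the NEFM/NEFL locus, and the helper names are hypothetical rather than part of OnTarget.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on a single chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def recovered(manual: List[Interval], predicted: List[Interval]) -> List[Interval]:
    """Manually chosen RRs overlapped by at least one predicted RR."""
    return [m for m in manual if any(overlaps(m, p) for p in predicted)]


def within(regions: List[Interval], window: Interval) -> List[Interval]:
    """Regions lying entirely inside a search window."""
    return [r for r in regions if window[0] <= r[0] and r[1] <= window[1]]


# Invented coordinates: the smaller manual search space, three hand-picked RRs,
# and predictions made at a strict and at a relaxed score threshold.
manual_space = (200_000, 424_000)
manual_rrs = [(210_000, 210_400), (300_000, 300_900), (410_000, 410_250)]
strict_calls = [(50_000, 50_120), (210_100, 210_300), (600_000, 600_080)]
relaxed_calls = strict_calls + [(300_850, 301_000)]

n_inside = len(within(strict_calls, manual_space))      # predictions in manual space
n_strict = len(recovered(manual_rrs, strict_calls))     # manual RRs recovered (strict)
n_relaxed = len(recovered(manual_rrs, relaxed_calls))   # manual RRs recovered (relaxed)
print(n_inside, n_strict, n_relaxed)
```

In this toy case one prediction falls inside the manual search space, one hand-picked RR is recovered at the strict threshold, and a second is recovered once the relaxed threshold adds another prediction, which mirrors how lowering the percentile cut-off can recover additional manually identified regions.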
Reassuringly, the other two RRs could be identified at the 95th percentile score cut-off (Table 5).

Table 5: Regulatory Regions Identified for NEFM/NEFL Bespoke MiniPromoters in Comparison to Regulatory Predictions from OnTarget.

Metric | MiniPromoter Design (Manual, All Available Datasets) | OnTarget (All Available Datasets) | OnTarget (Brain/Eye Datasets)
Regulatory region search space (in kb) | 224 | 640 | 640
Number of identified regions | 7 | 17 (9 in 224 kb) | 15 (10 in 224 kb)

For an example of an unsuccessful design, I chose a subset of designs based on the gene DDC: constructs Ple56, Ple57, Ple58, and Ple59. These were early constructs designed for the paper by Portales-Casamar et al. in 2010[60], before the implementation of our current bespoke design process. No expression was seen in the brain for any of these MiniPromoters.

Across the four designs, three enhancer RRs were used in combination with one of two similar promoter RRs. The search space for RRs was limited to the flanking genes of DDC, spanning about 20 kb upstream and 5 kb downstream. OnTarget analyzed the full TAD, a region of ∼720 kb, using the same datasets described above. Out of the 18 brain and eye-specific RRs predicted by OnTarget, only one partially overlapped the tested MiniPromoter regions. Unsurprisingly, this partial overlap was across the promoter RR of the MiniPromoter, although the predicted RR is much more compact than the tested regions, potentially indicating that the original promoter RRs contained elements unimportant to, or unfavorable for, expression in the brain. Interestingly, there was a large overlap between the RRs predicted from the brain and eye-specific datasets and those predicted from all data. Out of the 15 RRs selected while using all datasets, only 2 did not overlap RRs from the tissue-specific set.

Based on the above results, I expect that regions predicted by OnTarget in cell-specific contexts should raise the likelihood of MiniPromoters producing expression of the reporter gene.
It is therefore of interest to predict which bespoke designs still awaiting testing are more likely to be successful. I chose Ple368, one of the newest MiniPromoter constructs, designed based on the GAD2 gene, with the previously described datasets. While OnTarget used a ∼560 kb TAD as its search space, our MiniPromoter analysis was done using a ∼260 kb region which was based on a successful mouse BAC targeting Gad2[9]. Out of 13 RRs predicted by OnTarget in the cell-specific datasets, 9 were found in the search space used for the bespoke design of Ple368. Inversely, only 1 RR was predicted within that smaller search space, while the other 8 OnTarget-predicted RRs fell within the larger TAD. Both RRs selected by hand for Ple368 completely overlapped OnTarget-predicted RRs using the brain and eye-specific datasets. It is therefore our prediction that this MiniPromoter will express our reporter gene in at least a subset of regions of the brain and eye.

Table 6: Summary of OnTarget Regulatory Region Predictions for Bespoke MiniPromoters.

Metric | Ple345 & Ple346 | Ple56 to Ple59 | Ple368
MiniPromoter result | Success | Fail | Untested
MiniPromoter search space (kb) | 224 | 125 | 260
OnTarget search space (kb) | 640 | 720 | 560
Number of original RRs identified | 7 | 5 | 2
OnTarget brain/eye RR overlap with MiniPromoters | 4 | 2 | 2
OnTarget all datasets RR overlap with MiniPromoters | 3 | 0 | 1

Chapter 4
Conclusion & Future Work

In the research activities of this thesis, we were able to create 49 MiniPromoters designed to drive expression of a reporter gene in cell-specific contexts, of which a subset of 20 constructs were tested in vivo to validate the design process. Based on an initial set of bespoke designs, we were able to recreate the design logic within a semi-automated pipeline. As expected, analyses using cell type-specific datasets result in different designs than those incorporating all possible datasets. We have performed initial tests of the semi-automated approach, supporting the use of cell type-specific information.
The bioinformatics approaches within the thesis are of importance in the field of gene therapy, as the use of small, specific promoters not only increases the therapeutic capacity, but also restricts the delivery of the therapeutic to relevant cells.

We recognize, however, the need for future developments of our tool. First, we must continue to test OnTarget as our validation set of tested MiniPromoters grows. In this sense, we must ensure that our current weighting scheme continues to preferentially detect RRs from successful MiniPromoter experiments. Furthermore, OnTarget should identify cis-regulatory sequences that were not selected by human designers in the unsuccessful experiments. We also plan to test OnTarget for prediction of enhancers described in the literature.

Next, we will continue to collect cell-specific datasets as they become available. Unsurprisingly, most experiments are performed on immortalized cell lines. Although these are good starting points, they may not accurately recapitulate expression in natural conditions. There is a dearth of tissue and cell-specific TF experiments.

Finally, we hope to implement an important additional feature in future releases of the OnTarget tool. Specifically, we hope to include the ability to modify each cis-regulatory sequence in order to modulate the strength of TF binding sites. Previous research, as well as our own observations from Ple331, has shown that small base pair changes in TF binding sites can dramatically affect expression levels. OnTarget will scan identified RRs and suggest minimal alterations of the endogenous sequence that create a high-affinity binding site for a desired TF. This system is based on previous lab expertise in the creation of the MANTA database[51] (DNA alterations impacting TF binding) and the upkeep of JASPAR[52] (TF binding profile database).
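To make the proposed feature concrete, the following is a minimal sketch of how a single-base suggestion could be derived from a position weight matrix. It is an illustration under stated assumptions, not the planned OnTarget feature: the matrix values are invented (they are not a JASPAR profile), the function names are hypothetical, only the forward strand is scanned, and the input is assumed to contain only uppercase A, C, G, and T.

```python
from typing import Dict, List, Optional, Tuple

# Toy position weight matrix: one dict of per-base scores per motif position.
# Illustrative values only; NOT taken from JASPAR. The consensus is "ACGT".
TOY_PWM: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]


def window_score(window: str, pwm: List[Dict[str, float]]) -> float:
    """Sum the per-position scores of one motif-width window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_score(seq: str, pwm: List[Dict[str, float]]) -> float:
    """Highest window score over all forward-strand positions."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        window_score(seq[i:i + width], pwm)
        for i in range(len(seq) - width + 1)
    )


def suggest_substitution(
    seq: str, pwm: List[Dict[str, float]]
) -> Optional[Tuple[int, str, float]]:
    """Find the single-base change that most increases the best site score.

    Returns (0-based position, new base, new best score), or None when no
    single substitution improves on the unmodified sequence.
    """
    current = best_score(seq, pwm)
    best: Optional[Tuple[int, str, float]] = None
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            score = best_score(mutated, pwm)
            if score > current:
                current = score
                best = (pos, alt, score)
    return best


print(best_score("TTACGATT", TOY_PWM))            # 2.0
print(suggest_substitution("TTACGATT", TOY_PWM))  # (5, 'T', 4.0)
```

The sketch evaluates one TF profile in isolation. A usable version would also rescan each candidate edit against the profiles of other TFs, so that a change which strengthens the desired site is not accepted if it removes or introduces another site.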
Importantly, this feature must also ensure that sequence modification does not destroy other important binding sites, or create undesired sites which may lead to unexpected off-target expression after MiniPromoter delivery.

I have shown preliminary work for successfully identifying cis-regulatory sequences and creating functional MiniPromoters for therapeutic delivery in small viral vectors. This approach led to the creation of our semi-automated tool, OnTarget, to perform the same task. While there is still work to be done, implementing OnTarget as outlined in this thesis should ultimately lead to better and faster identification of cis-regulatory sequences and design of MiniPromoters.

Bibliography

[1] Viral Plasmids and Resources. → pages 3, 5, 6
[2] N. al Yacoub, M. Romanowska, N. Haritonova, and J. Foerster. Optimized production and concentration of lentiviral vectors containing large inserts. The Journal of Gene Medicine, 9(7):579–584, jul 2007. ISSN 1099498X. doi:10.1002/jgm.1052. URL → pages 5
[3] R. Andersson. Promoter or enhancer, what’s the difference? Deconstruction of established distinctions and presentation of a unifying model. BioEssays, 37(3):314–323, mar 2015. ISSN 02659247. doi:10.1002/bies.201400162. URL → pages 9, 10, 12
[4] R. Andersson, C. Gebhard, I. Miguel-Escalada, I. Hoof, J. Bornholdt, M. Boyd, Y. Chen, X. Zhao, C. Schmidl, T. Suzuki, E. Ntini, E. Arner, E. Valen, K. Li, L. Schwarzfischer, D. Glatz, J. Raithel, B. Lilje, N. Rapin, F. O. Bagger, M. Jørgensen, P. R. Andersen, N. Bertin, O. Rackham, A. M. Burroughs, J. K. Baillie, Y. Ishizu, Y. Shimizu, E. Furuhata, S. Maeda, Y. Negishi, C. J. Mungall, T. F. Meehan, T. Lassmann, M. Itoh, H. Kawaji, N. Kondo, J. Kawai, A. Lennartsson, C. O. Daub, P. Heutink, D. A. Hume, T. H. Jensen, H. Suzuki, Y. Hayashizaki, F. Müller, T. F. Consortium, A. R. R. Forrest, P. Carninci, M. Rehli, and A. Sandelin. An atlas of active enhancers across human cell types and tissues. Nature, 507(7493):455–461, mar 2014.
ISSN 0028-0836. doi:10.1038/nature12787. URL → pages 11, 19[5] A. Asokan, D. V. Schaffer, and R. Jude Samulski. The AAV Vector Toolkit:Poised at the Clinical Crossroads. Molecular Therapy, 20(4):699–708, apr2012. ISSN 15250016. doi:10.1038/mt.2011.287. URL → pages 660[6] A. A. Avilion. Multipotent cell lineages in early mouse development dependon SOX2 function. Genes & Development, 17(1):126–140, jan 2003. ISSN08909369. doi:10.1101/gad.224503. URL → pages 12[7] A. J. Bannister and T. Kouzarides. Regulation of chromatin by histonemodifications. Cell Research, 21(3):381–395, mar 2011. ISSN 1001-0602.doi:10.1038/cr.2011.22. URL → pages 13[8] K. Benihoud. Adenovirus vectors for gene delivery. Current Opinion inBiotechnology, 10(5):440–447, oct 1999. ISSN 09581669.doi:10.1016/S0958-1669(99)00007-5. URL → pages 5[9] S. Besser, M. Sicker, G. Marx, U. Winkler, V. Eulenburg, S. Hu¨lsmann, andJ. Hirrlinger. A transgenic mouse line expressing the red fluorescent proteintdtomato in gabaergic neurons. PloS one, 10(6):e0129934, 2015. → pages57[10] L. Biasco, A. Ambrosi, D. Pellin, C. Bartholomae, I. Brigida, M. G.Roncarolo, C. Di Serio, C. von Kalle, M. Schmidt, and A. Aiuti. Integrationprofile of retroviral vector in gene therapy treated patients is cell-specificaccording to gene expression and chromatin conformation of target cell.EMBO Molecular Medicine, 3(2):89–101, feb 2011. ISSN 17574676.doi:10.1002/emmm.201000108. URL →pages 5[11] R. M. Blaese, K. W. Culver, A. D. Miller, C. S. Carter, T. Fleisher,M. Clerici, G. Shearer, L. Chang, Y. Chiang, P. Tolstoshev, J. J. Greenblatt,S. A. Rosenberg, H. Klein, M. Berger, C. A. Mullen, W. J. Ramsey, L. Muul,R. A. Morgan, and W. F. Anderson. T lymphocyte-directed gene therapy forADA- SCID: initial trial results after 4 years. Science (New York, N.Y.), 270(5235):475–80, oct 1995. ISSN 0036-8075. URL → pages 2[12] J. A. Blake, J. T. Eppig, J. A. Kadin, J. E. Richardson, C. L. Smith, and C. J.Bult. 
Mouse Genome Database (MGD)-2017: community knowledgeresource for the laboratory mouse. Nucleic Acids Research, 45(D1):D723–D729, jan 2017. ISSN 0305-1048. doi:10.1093/nar/gkw1040. URL61 →pages 19[13] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin,R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, et al. Aligningmultiple genomic sequences with the threaded blockset aligner. Genomeresearch, 14(4):708–715, 2004. → pages 21[14] R. C. Boucher, M. R. Knowles, L. G. Johnson, J. C. Olsen, R. Pickles, J. M.Wilson, J. Engelhardt, Y. Yang, and M. Grossman. Gene Therapy for CysticFibrosis Using E1-Deleted Adenovirus: A Phase I Trial in the Nasal Cavity.University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.Human Gene Therapy, 5(5):615–639, may 1994. ISSN 1043-0342.doi:10.1089/hum.1994.5.5-615. URL → pages2[15] A. P. Boyle, S. Davis, H. P. Shulha, P. Meltzer, E. H. Margulies, Z. Weng,T. S. Furey, and G. E. Crawford. High-resolution mapping andcharacterization of open chromatin across the genome. Cell, 132(2):311–322, 2008. → pages 13[16] J. D. Buenrostro, B. Wu, H. Y. Chang, and W. J. Greenleaf. ATAC-seq: AMethod for Assaying Chromatin Accessibility Genome-Wide. In CurrentProtocols in Molecular Biology, pages 21.29.1–21.29.9. John Wiley & Sons,Inc., Hoboken, NJ, USA, jan 2015. doi:10.1002/0471142727.mb2129s109.URL → pages 14[17] L. C. Byrne, Y. J. Lin, T. Lee, D. V. Schaffer, and J. G. Flannery. Theexpression pattern of systemically injected AAV9 in the developing mouseretina is determined by age. Molecular therapy : the journal of theAmerican Society of Gene Therapy, 23(2):290–296, 2015. ISSN 1525-0024(Electronic). doi:10.1038/mt.2014.181. → pages 36, 49[18] M. Cavazzana-Calvo. Gene Therapy of Human Severe CombinedImmunodeficiency (SCID)-X1 Disease. Science, 288(5466):669–672, apr2000. ISSN 00368075. doi:10.1126/science.288.5466.669. URL →pages 2[19] S. Chira, C. S. Jackson, I. Oprea, F. Ozturk, M. S. Pepper, I. Diaconu,C. 
Braicu, L.-Z. Raduly, G. A. Calin, and I. Berindan-Neagoe. Progresses62towards safe and efficient gene therapy vectors. Oncotarget, 6(31):30675–30703, oct 2015. ISSN 1949-2553. doi:10.18632/oncotarget.5169.URL → pages 5[20] A. V. Cideciyan, T. S. Aleman, S. L. Boye, S. B. Schwartz, S. Kaushal, A. J.Roman, J.-j. Pang, A. Sumaroka, E. A. Windsor, J. M. Wilson, et al. Humangene therapy for rpe65 isomerase deficiency activates the retinoid cycle ofvision but with slow rod kinetics. Proceedings of the National Academy ofSciences, 105(39):15112–15117, 2008. → pages 16[21] L. J. Core, J. J. Waterfall, and J. T. Lis. Nascent RNA Sequencing RevealsWidespread Pausing and Divergent Initiation at Human Promoters. Science,322(5909):1845–1848, dec 2008. ISSN 0036-8075.doi:10.1126/science.1162228. URL → pages 10[22] C. N. de Leeuw, F. M. Dyka, S. L. Boye, S. Laprise, M. Zhou, A. Y. Chou,L. Borretta, S. C. McInerny, K. G. Banks, E. Portales-Casamar, M. I.Swanson, C. A. D’Souza, S. E. Boye, S. J. Jones, R. A. Holt, D. Goldowitz,W. W. Hauswirth, W. W. Wasserman, and E. M. Simpson. Targeted CNSdelivery using human MiniPromoters and demonstrated compatibility withadeno-associated viral vectors. Molecular Therapy - Methods & ClinicalDevelopment, 1:5, 2014. ISSN 23290501. doi:10.1038/mtm.2013.5. URL → pages 2,17[23] C. N. de Leeuw, A. J. Korecki, G. E. Berry, J. W. Hickmott, S. L. Lam, T. C.Lengyell, R. J. Bonaguro, L. J. Borretta, V. Chopra, A. Y. Chou, C. A.D’Souza, O. Kaspieva, S. Laprise, S. C. McInerny, E. Portales-Casamar,M. I. Swanson-Newman, K. Wong, G. S. Yang, M. Zhou, S. J. M. Jones,R. A. Holt, A. Asokan, D. Goldowitz, W. W. Wasserman, and E. M.Simpson. rAAV-compatible MiniPromoters for restricted expression in thebrain and eye. Molecular Brain, 9(1):52, 2016. ISSN 1756-6606.doi:10.1186/s13041-016-0232-4. URL{%}5Cn→ pages 2, 17, 20[24] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, andB. Ren. 
Topological domains in mammalian genomes identified by analysisof chromatin interactions. Nature, 485(7398):376–380, apr 2012. ISSN630028-0836. doi:10.1038/nature11082. URL → pages 14, 15[25] M. Edelstein. Gene Therapy Clinical Trials Worldwide, 2017. URL → pages 1, 3, 5[26] A. El-Aneed. An overview of current delivery systems in cancer genetherapy. Journal of Controlled Release, 94(1):1–14, jan 2004. ISSN01683659. doi:10.1016/j.jconrel.2003.09.013. URL → pages 5[27] G. Elgar and T. Vavouri. Tuning in to the signals: noncoding sequenceconservation in vertebrate genomes. Trends in genetics, 24(7):344–352,2008. → pages 2[28] ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNAElements) Project. Science (New York, N.Y.), 306(5696):636–40, oct 2004.ISSN 1095-9203. doi:10.1126/science.1105136. URL → pages 12, 13, 19[29] ENCODE Project Consortium. An integrated encyclopedia of DNAelements in the human genome. Nature, 489(7414):57–74, sep 2012. ISSN1476-4687. doi:10.1038/nature11247. URL →pages 12, 13, 19[30] J. Ernst and M. Kellis. ChromHMM: automating chromatin-state discoveryand characterization. Nature Methods, 9(3):215–216, feb 2012. ISSN1548-7091. doi:10.1038/nmeth.1906. URL → pages 15, 19[31] FANTOM Consortium and the RIKEN PMI and CLST (DGT), A. R. R.Forrest, H. Kawaji, M. Rehli, J. K. Baillie, M. J. L. de Hoon, V. Haberle,T. Lassmann, I. V. Kulakovskiy, M. Lizio, M. Itoh, R. Andersson, C. J.Mungall, T. F. Meehan, S. Schmeier, N. Bertin, M. Jørgensen, E. Dimont,E. Arner, C. Schmidl, U. Schaefer, Y. A. Medvedeva, C. Plessy, M. Vitezic,J. Severin, C. A. Semple, Y. Ishizu, R. S. Young, M. Francescatto, I. Alam,D. Albanese, G. M. Altschuler, T. Arakawa, J. A. C. Archer, P. Arner,M. Babina, S. Rennie, P. J. Balwierz, A. G. Beckhouse, S. Pradhan-Bhatt,J. A. Blake, A. Blumenthal, B. Bodega, A. Bonetti, J. Briggs, F. Brombacher,A. M. Burroughs, A. Califano, C. V. Cannistraci, D. Carbajo, Y. Chen,64M. Chierici, Y. Ciani, H. C. Clevers, E. Dalla, C. A. Davis, M. 
Detmar, A. D.Diehl, T. Dohi, F. Drabløs, A. S. B. Edge, M. Edinger, K. Ekwall, M. Endoh,H. Enomoto, M. Fagiolini, L. Fairbairn, H. Fang, M. C. Farach-Carson, G. J.Faulkner, A. V. Favorov, M. E. Fisher, M. C. Frith, R. Fujita, S. Fukuda,C. Furlanello, M. Furino, J.-i. Furusawa, T. B. Geijtenbeek, A. P. Gibson,T. Gingeras, D. Goldowitz, J. Gough, S. Guhl, R. Guler, S. Gustincich, T. J.Ha, M. Hamaguchi, M. Hara, M. Harbers, J. Harshbarger, A. Hasegawa,Y. Hasegawa, T. Hashimoto, M. Herlyn, K. J. Hitchens, S. J. Ho Sui, O. M.Hofmann, I. Hoof, F. Hori, L. Huminiecki, K. Iida, T. Ikawa, B. R. Jankovic,H. Jia, A. Joshi, G. Jurman, B. Kaczkowski, C. Kai, K. Kaida, A. Kaiho,K. Kajiyama, M. Kanamori-Katayama, A. S. Kasianov, T. Kasukawa,S. Katayama, S. Kato, S. Kawaguchi, H. Kawamoto, Y. I. Kawamura,T. Kawashima, J. S. Kempfle, T. J. Kenna, J. Kere, L. M. Khachigian,T. Kitamura, S. P. Klinken, A. J. Knox, M. Kojima, S. Kojima, N. Kondo,H. Koseki, S. Koyasu, S. Krampitz, A. Kubosaki, A. T. Kwon, J. F. J. Laros,W. Lee, A. Lennartsson, K. Li, B. Lilje, L. Lipovich, A. Mackay-Sim, R.-i.Manabe, J. C. Mar, B. Marchand, A. Mathelier, N. Mejhert, A. Meynert,Y. Mizuno, D. A. de Lima Morais, H. Morikawa, M. Morimoto, K. Moro,E. Motakis, H. Motohashi, C. L. Mummery, M. Murata, S. Nagao-Sato,Y. Nakachi, F. Nakahara, T. Nakamura, Y. Nakamura, K. Nakazato, E. vanNimwegen, N. Ninomiya, H. Nishiyori, S. Noma, S. Noma, T. Noazaki,S. Ogishima, N. Ohkura, H. Ohimiya, H. Ohno, M. Ohshima,M. Okada-Hatakeyama, Y. Okazaki, V. Orlando, D. A. Ovchinnikov,A. Pain, R. Passier, M. Patrikakis, H. Persson, S. Piazza, J. G. D.Prendergast, O. J. L. Rackham, J. A. Ramilowski, M. Rashid, T. Ravasi,P. Rizzu, M. Roncador, S. Roy, M. B. Rye, E. Saijyo, A. Sajantila, A. Saka,S. Sakaguchi, M. Sakai, H. Sato, S. Savvi, A. Saxena, C. Schneider, E. A.Schultes, G. G. Schulze-Tanzil, A. Schwegmann, T. Sengstag, G. Sheng,H. Shimoji, Y. Shimoni, J. W. Shin, C. Simon, D. Sugiyama, T. Sugiyama,M. Suzuki, N. Suzuki, R. K. 
Swoboda, P. A. C. ’t Hoen, M. Tagami,N. Takahashi, J. Takai, H. Tanaka, H. Tatsukawa, Z. Tatum, M. Thompson,H. Toyodo, T. Toyoda, E. Valen, M. van de Wetering, L. M. van den Berg,R. Verado, D. Vijayan, I. E. Vorontsov, W. W. Wasserman, S. Watanabe,C. A. Wells, L. N. Winteringham, E. Wolvetang, E. J. Wood, Y. Yamaguchi,M. Yamamoto, M. Yoneda, Y. Yonekura, S. Yoshida, S. E. Zabierowski,P. G. Zhang, X. Zhao, S. Zucchelli, K. M. Summers, H. Suzuki, C. O. Daub,J. Kawai, P. Heutink, W. Hide, T. C. Freeman, B. Lenhard, V. B. Bajic, M. S.Taylor, V. J. Makeev, A. Sandelin, D. A. Hume, P. Carninci, andY. Hayashizaki. A promoter-level mammalian expression atlas. Nature, 507(7493):462–70, mar 2014. ISSN 1476-4687. doi:10.1038/nature13182.65URL →pages 7, 11[32] T. R. Flotte. Gene Therapy Progress and Prospects: Recombinantadeno-associated virus (rAAV) vectors. Gene Therapy, 11(10):805–810,may 2004. ISSN 0969-7128. doi:10.1038/ URL → pages 6[33] M. K. Foecking and H. Hofstetter. Powerful and versatile enhancer-promoterunit for mammalian expression vectors. Gene, 45(1):101–5, 1986. ISSN0378-1119. URL → pages 6[34] S. F. Gilbert. Developmental Biology. Sinaur Associates, Sutherland, MA, 6edition, 2000. ISBN 0-87893-243-7. → pages 7[35] G. Gill. Regulation of the initiation of eukaryotic transcription. Essays InBiochemistry, 37:33–43, may 2001. ISSN 0071-1365.doi:10.1042/bse0370033. URL → pages11, 13[36] P. G. Giresi, J. Kim, R. M. McDaniell, V. R. Iyer, and J. D. Lieb. FAIRE(Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin. Genome Research, 17(6):877–885, jun 2007. ISSN 1088-9051. doi:10.1101/gr.5533506. URL → pages 14[37] S. Goverdhana, M. Puntel, W. Xiong, J. Zirger, C. Barcia, J. Curtin,E. Soffer, S. Mondkar, G. King, J. Hu, S. Sciascia, M. Candolfi,D. Greengold, P. Lowenstein, and M. Castro. 
Regulatable gene expressionsystems for gene therapy applications: progress and future challenges.Molecular Therapy, 12(2):189–211, aug 2005. ISSN 15250016.doi:10.1016/j.ymthe.2005.03.022. URL → pages 1,5, 16[38] S. Hacein-Bey-Abina. LMO2-Associated Clonal T Cell Proliferation in TwoPatients after Gene Therapy for SCID-X1. Science, 302(5644):415–419, oct2003. ISSN 0036-8075. doi:10.1126/science.1088547. URL → pages 366[39] N. Heintz. Gene Expression Nervous System Atlas (GENSAT). NatureNeuroscience, 7(5):483–483, may 2004. ISSN 1097-6256.doi:10.1038/nn0504-483. URL → pages 16, 19[40] J. W. Hickmott, C.-y. Chen, D. J. Arenillas, A. J. Korecki, S. L. Lam, L. L.Molday, R. J. Bonaguro, M. Zhou, A. Y. Chou, A. Mathelier, S. L. Boye,W. W. Hauswirth, R. S. Molday, W. W. Wasserman, and E. M. Simpson.PAX6 MiniPromoters drive restricted expression from rAAV in the adultmouse retina. Molecular Therapy, 3(June):16051, 2016. ISSN 2329-0501.doi:doi:10.1038/mtm.2016.51. → pages 2, 17, 20[41] M. M. Hoffman, O. J. Buske, J. Wang, Z. Weng, J. A. Bilmes, and W. S.Noble. Unsupervised pattern discovery in human chromatin structurethrough genomic segmentation. Nature Methods, 9(5):473–476, mar 2012.ISSN 1548-7091. doi:10.1038/nmeth.1937. URL → pages 15, 19[42] T. Hollon. Researchers and regulators reflect on first gene therapy death.Nature Medicine, 6(1):6–6, jan 2000. ISSN 1078-8956. doi:10.1038/71545.URL → pages 2[43] M. A. Kay, J. C. Glorioso, and L. Naldini. Viral vectors for gene therapy:the art of turning infectious agents into vehicles of therapeutics.Kay, M. A.,Glorioso, J. C., & Naldini, L. (2001). Viral vectors for gene therapy: the artof turning infectious agents into vehicles of therapeutics. Natur. Naturemedicine, 7(1):33–40, jan 2001. ISSN 1078-8956. doi:10.1038/83324. URL → pages 3, 5, 6[44] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M.Zahler, and a. D. Haussler. The Human Genome Browser at UCSC. GenomeResearch, 12(6):996–1006, may 2002. 
ISSN 1088-9051.doi:10.1101/gr.229102. URL → pages 16, 19[45] M. Lizio, J. Harshbarger, H. Shimoji, J. Severin, T. Kasukawa, S. Sahin,I. Abugessaisa, S. Fukuda, F. Hori, S. Ishikawa-Kato, C. J. Mungall,E. Arner, J. Baillie, N. Bertin, H. Bono, M. de Hoon, A. D. Diehl,E. Dimont, T. C. Freeman, K. Fujieda, W. Hide, R. Kaliyaperumal,T. Katayama, T. Lassmann, T. F. Meehan, K. Nishikata, H. Ono, M. Rehli,A. Sandelin, E. A. Schultes, P. t Hoen, Z. Tatum, M. Thompson, T. Toyoda,67D. W. Wright, C. O. Daub, M. Itoh, P. Carninci, Y. Hayashizaki, A. Forrest,and H. Kawaji. Gateways to the FANTOM5 promoter level mammalianexpression atlas. Genome Biology, 16(1):22, 2015. ISSN 1465-6906.doi:10.1186/s13059-014-0560-6. URL → pages 11, 19[46] D. G. Lupia´n˜ez, M. Spielmann, and S. Mundlos. Breaking TADs: HowAlterations of Chromatin Domains Result in Disease. Trends in Genetics, 32(4):225–237, apr 2016. ISSN 01689525. doi:10.1016/j.tig.2016.01.003.URL →pages 15[47] A. Lusser and J. T. Kadonaga. Chromatin remodeling by ATP-dependentmolecular machines. BioEssays, 25(12):1192–1200, dec 2003. ISSN0265-9247. doi:10.1002/bies.10359. URL → pages 13[48] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman,I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, et al. Highlyparallel genome-wide expression profiling of individual cells using nanoliterdroplets. Cell, 161(5):1202–1214, 2015. → pages 31[49] C. A. Maguire, S. H. Ramirez, S. F. Merkel, M. Sena-Esteves, and X. O.Breakefield. Gene therapy for the nervous system: challenges and newstrategies. Neurotherapeutics, 11(4):817–839, 2014. → pages 16[50] G. A. Maston, S. K. Evans, and M. R. Green. Transcriptional RegulatoryElements in the Human Genome. Annual Review of Genomics and HumanGenetics, 7(1):29–59, sep 2006. ISSN 1527-8204.doi:10.1146/annurev.genom.7.080505.115623. URL →pages 9[51] A. Mathelier, C. Lefebvre, A. W. Zhang, D. J. Arenillas, J. Ding, W. W.Wasserman, and S. P. Shah. 
Cis-regulatory somatic mutations andgene-expression alteration in B-cell lymphomas. Genome biology, 16:84,apr 2015. ISSN 1474-760X. doi:10.1186/s13059-015-0648-7. URL →pages 59[52] A. Mathelier, O. Fornes, D. J. Arenillas, C.-y. Chen, G. Denay, J. Lee,W. Shi, C. Shyr, G. Tan, R. Worsley-Hunt, A. W. Zhang, F. Parcy,68B. Lenhard, A. Sandelin, and W. W. Wasserman. JASPAR 2016: a majorexpansion and update of the open-access database of transcription factorbinding profiles. Nucleic Acids Research, 44(D1):D110–D115, jan 2016.ISSN 0305-1048. doi:10.1093/nar/gkv1176. URL →pages 12, 59[53] R. Mundade, H. G. Ozer, H. Wei, L. Prabhu, and T. Lu. Role of ChIP-seq inthe discovery of transcription factor binding sites, differential generegulation mechanism, epigenetic marks and beyond. Cell Cycle, 13(18):2847–2852, sep 2014. ISSN 1538-4101.doi:10.4161/15384101.2014.949201. URL →pages 12, 13[54] L. Naldini. Gene therapy returns to centre stage. Nature, 526(7573):351–60,oct 2015. ISSN 1476-4687. doi:10.1038/nature15818. URL → pages 1, 3[55] D. B. Nikolov and S. K. Burley. RNA polymerase II transcription initiation:a structural view. Proceedings of the National Academy of Sciences of theUnited States of America, 94(1):15–22, jan 1997. ISSN 0027-8424. URL → pages12[56] H. Niwa, K. Yamamura, and J. Miyazaki. Efficient selection forhigh-expression transfectants with a novel eukaryotic vector. Gene, 108(2):193–9, dec 1991. ISSN 0378-1119. URL → pages 6[57] E. Ntini, A. I. Ja¨rvelin, J. Bornholdt, Y. Chen, M. Boyd, M. Jørgensen,R. Andersson, I. Hoof, A. Schein, P. R. Andersen, P. K. Andersen, P. Preker,E. Valen, X. Zhao, V. Pelechano, L. M. Steinmetz, A. Sandelin, and T. H.Jensen. Polyadenylation siteinduced decay of upstream transcripts enforcespromoter directionality. Nature Structural & Molecular Biology, 20(8):923–928, jul 2013. ISSN 1545-9993. doi:10.1038/nsmb.2640. URL → pages 9[58] N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad,R. McVeigh, B. Rajput, B. 
Robbertse, B. Smith-White, D. Ako-Adjei,A. Astashyn, A. Badretdin, Y. Bao, O. Blinkova, V. Brover, V. Chetvernin,69J. Choi, E. Cox, O. Ermolaeva, C. M. Farrell, T. Goldfarb, T. Gupta, D. Haft,E. Hatcher, W. Hlavina, V. S. Joardar, V. K. Kodali, W. Li, D. Maglott,P. Masterson, K. M. McGarvey, M. R. Murphy, K. O’Neill, S. Pujar, S. H.Rangwala, D. Rausch, L. D. Riddick, C. Schoch, A. Shkeda, S. S. Storz,H. Sun, F. Thibaud-Nissen, I. Tolstoy, R. E. Tully, A. R. Vatsan, C. Wallin,D. Webb, W. Wu, M. J. Landrum, A. Kimchi, T. Tatusova, M. DiCuccio,P. Kitts, T. D. Murphy, and K. D. Pruitt. Reference sequence (RefSeq)database at NCBI: current status, taxonomic expansion, and functionalannotation. Nucleic Acids Research, 44(D1):D733–D745, jan 2016. ISSN0305-1048. doi:10.1093/nar/gkv1189. URL →pages 19[59] L. A. Pennacchio, W. Bickmore, A. Dean, M. A. Nobrega, and G. Bejerano.Enhancers: five essential questions. Nature Reviews Genetics, 14(4):288–295, mar 2013. ISSN 1471-0056. doi:10.1038/nrg3458. URL → pages 9[60] E. Portales-Casamar, D. J. Swanson, L. Liu, C. N. de Leeuw, K. G. Banks,S. J. Ho Sui, D. L. Fulton, J. Ali, M. Amirabbasi, D. J. Arenillas, N. Babyak,S. F. Black, R. J. Bonaguro, E. Brauer, T. R. Candido, M. Castellarin,J. Chen, Y. Chen, J. C. Cheng, V. Chopra, T. R. Docking, L. Dreolini, C. A.D’Souza, E. K. Flynn, R. Glenn, K. Hatakka, T. G. Hearty, B. Imanian,S. Jiang, S. Khorasan-zadeh, I. Komljenovic, S. Laprise, N. Y. Liao, J. S.Lim, S. Lithwick, F. Liu, J. Liu, M. Lu, M. McConechy, A. J. McLeod,M. Milisavljevic, J. Mis, K. O’Connor, B. Palma, D. L. Palmquist, J. F.Schmouth, M. I. Swanson, B. Tam, A. Ticoll, J. L. Turner, R. Varhol,J. Vermeulen, R. F. Watkins, G. Wilson, B. K. Wong, S. H. Wong, T. Y.Wong, G. S. Yang, A. R. Ypsilanti, S. J. Jones, R. A. Holt, D. Goldowitz,W. W. Wasserman, and E. M. Simpson. A regulatory toolbox ofMiniPromoters to drive selective expression in the brain. Proc Natl Acad SciU S A, 107(38):16589–16594, 2010. 
ISSN 1091-6490.doi:10.1073/pnas.1009158107. URL → pages 2, 17, 56[61] P. Preker, J. Nielsen, S. Kammler, S. Lykke-Andersen, M. S. Christensen,C. K. Mapendano, M. H. Schierup, and T. H. Jensen. RNA ExosomeDepletion Reveals Transcription Upstream of Active Human Promoters.Science, 322(5909):1851–1854, dec 2008. ISSN 0036-8075.doi:10.1126/science.1164096. URL → pages 970[62] K. D. Pruitt, T. Tatusova, and D. R. Maglott. NCBI reference sequences(RefSeq): a curated non-redundant sequence database of genomes,transcripts and proteins. Nucleic Acids Research, 35(Database):D61–D65,jan 2007. ISSN 0305-1048. doi:10.1093/nar/gkl842. URL →pages 19[63] CIHR: New Emerging Team for Rare Diseases, 2016.URL → pages 1[64] Roadmap Epigenomics Consortium, A. Kundaje, W. Meuleman, J. Ernst,M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang,M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar,G. Quon, R. S. Sandstrom, M. L. Eaton, Y.-C. Wu, A. R. Pfenning, X. Wang,M. Claussnitzer, Y. Liu, C. Coarfa, R. A. Harris, N. Shoresh, C. B. Epstein,E. Gjoneska, D. Leung, W. Xie, R. D. Hawkins, R. Lister, C. Hong,P. Gascard, A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield,R. S. Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, J. R. Dixon,K.-H. Farh, S. Feizi, R. Karlic, A.-R. Kim, A. Kulkarni, D. Li, R. Lowdon,G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal,P. Ray, R. C. Sallari, K. T. Siebenthall, N. A. Sinnott-Armstrong,M. Stevens, R. E. Thurman, J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A.Boyer, P. L. De Jager, P. J. Farnham, S. J. Fisher, D. Haussler, S. J. M. Jones,W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. Thomson, T. D.Tlsty, L.-H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick,B. E. Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner,A. Milosavljevic, B. Ren, J. A. Stamatoyannopoulos, T. Wang, andM. Kellis. Integrative analysis of 111 reference human epigenomes. 
Nature,518(7539):317–30, feb 2015. ISSN 1476-4687. doi:10.1038/nature14248.URL →pages 13, 19[65] N. P. Rodrigues, A. J. Tipping, Z. Wang, and T. Enver. GATA-2 mediatedregulation of normal hematopoietic stem/progenitor cell function,myelodysplasia and myeloid leukemia. The International Journal ofBiochemistry & Cell Biology, 44(3):457–460, mar 2012. ISSN 13572725.doi:10.1016/j.biocel.2011.12.004. URL → pages 12[66] N. V. Rozhkov. Global Run-On Sequencing (GRO-seq) Library Preparation71from Drosophila Ovaries. Methods in molecular biology (Clifton, N.J.),1328:217–30, 2015. ISSN 1940-6029. doi:10.1007/978-1-4939-2851-4 16.URL → pages 11[67] L. Samaranch, E. A. Salegio, W. San Sebastian, A. P. Kells, K. D. Foust,J. R. Bringas, C. Lamarre, J. Forsayeth, B. K. Kaspar, and K. S. Bankiewicz.Adeno-Associated Virus Serotype 9 Transduction in the Central NervousSystem of Nonhuman Primates. Human Gene Therapy, 23(4):382–389, apr2012. ISSN 1043-0342. doi:10.1089/hum.2011.200. URL → pages 6[68] D. Schmidt, M. D. Wilson, C. Spyrou, G. D. Brown, J. Hadfield, and D. T.Odom. ChIP-seq: Using high-throughput sequencing to discoverproteinDNA interactions. Methods, 48(3):240–248, jul 2009. ISSN10462023. doi:10.1016/j.ymeth.2009.03.001. URL → pages 12[69] R. SCOLLAY. Gene Therapy. A Brief Overview of the Past, Present, andFuture. Annals of the New York Academy of Sciences, 953a(1 NEW VISTASIN):26–30, dec 2001. ISSN 0077-8923.doi:10.1111/j.1749-6632.2001.tb11357.x. URL → pages 3[70] C. Sheridan. Gene therapy finds its niche. Nature Biotechnology, 29(2):121–128, feb 2011. ISSN 1087-0156. doi:10.1038/nbt.1769. URL → pages 5, 6[71] T. Shiraki, S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji,R. Kodzius, A. Watahiki, M. Nakamura, T. Arakawa, S. Fukuda, D. Sasaki,A. Podhajska, M. Harbers, J. Kawai, P. Carninci, and Y. Hayashizaki. Capanalysis gene expression for high-throughput analysis of transcriptionalstarting point and identification of promoter usage. 
Proceedings of theNational Academy of Sciences, 100(26):15776–15781, dec 2003. ISSN0027-8424. doi:10.1073/pnas.2136655100. URL → pages 10, 11[72] P. L. Sinn, S. L. Sauter, and P. B. McCray. Gene Therapy Progress andProspects: Development of improved lentiviral and retroviral vectorsdesign, biosafety, and production. Gene Therapy, 12(14):1089–1098, jul2005. ISSN 0969-7128. doi:10.1038/ URL → pages 572[73] A. Smit, R. Hubley, and P. Green. RepeatMasker Open-4.0. 2013-2015 .,2013. URL → pages 19[74] J. A. Stamatoyannopoulos. Illuminating eukaryotic transcription start sites.Nature Methods, 7(7):501–503, jul 2010. ISSN 1548-7091.doi:10.1038/nmeth0710-501. URL → pages 7[75] C. E. Thomas, A. Ehrhardt, and M. A. Kay. Progress and problems with theuse of viral vectors for gene therapy. Nature Reviews Genetics, 4(5):346–358, may 2003. ISSN 14710056. doi:10.1038/nrg1066. URL → pages 1, 2, 3, 5[76] N. D. Trinklein, S. F. Aldred, S. J. Hartman, D. I. Schroeder, R. P. Otillar,and R. M. Myers. An abundance of bidirectional promoters in the humangenome. Genome research, 14(1):62–6, jan 2004. ISSN 1088-9051.doi:10.1101/gr.1982804. URL →pages 9[77] N. L. van Berkum, E. Lieberman-Aiden, L. Williams, M. Imakaev,A. Gnirke, L. A. Mirny, J. Dekker, and E. S. Lander. Hi-C: A Method toStudy the Three-dimensional Architecture of Genomes. Journal ofVisualized Experiments, (39), may 2010. ISSN 1940-087X.doi:10.3791/1869. URL →pages 15[78] J. Wang, J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G.Pierce, X. Dong, A. Kundaje, Y. Cheng, O. J. Rando, E. Birney, R. M.Myers, W. S. Noble, M. Snyder, and Z. Weng. Sequence features andchromatin structure around the genomic regions bound by 119 humantranscription factors. Genome research, 22(9):1798–812, sep 2012. ISSN1549-5469. doi:10.1101/gr.139105.112. URL →pages 20[79] J. Wang, J. Zhuang, S. Iyer, X.-Y. Lin, M. C. Greven, B.-H. Kim, J. Moore,B. G. Pierce, X. Dong, D. Virgil, E. Birney, J.-H. Hung, and Z. 
a Wiki-based database for transcription factor-binding datagenerated by the ENCODE consortium. Nucleic acids research, 41(Database73issue):D171–6, jan 2013. ISSN 1362-4962. doi:10.1093/nar/gks1221. URL →pages 20[80] P. Washbourne and A. McAllister. Techniques for gene transfer intoneurons. Current Opinion in Neurobiology, 12(5):566–573, oct 2002. ISSN09594388. doi:10.1016/S0959-4388(02)00365-3. URL → pages 2[81] T. Wirth, N. Parker, and S. Yla¨-Herttuala. History of gene therapy. Gene,525(2):162–169, aug 2013. ISSN 03781119.doi:10.1016/j.gene.2013.03.137. URL → pages 1,2[82] Z. Wu, A. Asokan, and R. J. Samulski. Adeno-associated Virus Serotypes:Vector Toolkit for Human Gene Therapy. Molecular Therapy, 14(3):316–327, sep 2006. ISSN 15250016. doi:10.1016/j.ymthe.2006.05.009.URL →pages 6[83] C. Yang, E. Bolotin, T. Jiang, F. M. Sladek, and E. Martinez. Prevalence ofthe initiator over the TATA box in human and yeast genes and identificationof DNA motifs enriched in human TATA-less core promoters. Gene, 389(1):52–65, mar 2007. ISSN 03781119. doi:10.1016/j.gene.2006.09.029. URL → pages 7[84] M. A. Zabidi, C. D. Arnold, K. Schernhuber, M. Pagani, M. Rath, O. Frank,and A. Stark. Enhancer-core-promoter specificity separates developmentaland housekeeping gene regulation. Nature, 518(7540):556, 2015. → pages 9[85] C. Zincarelli, S. Soltys, G. Rengo, and J. E. Rabinowitz. Analysis of AAVserotypes 1-9 mediated gene expression and tropism in mice after systemicinjection. Molecular therapy : the journal of the American Society of GeneTherapy, 16(6):1073–1080, jun 2008. ISSN 1525-0016.doi:10.1038/mt.2008.76. URL → pages 674

