Open Collections

UBC Theses and Dissertations

High-throughput characterization of mutations in antioxidant responsive elements Chou, Alice 2008

Full Text

HIGH-THROUGHPUT CHARACTERIZATION OF MUTATIONS IN ANTIOXIDANT RESPONSIVE ELEMENTS

by

ALICE CHOU

B.Sc., University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA

January 2008

© Alice Chou, 2008

Abstract

Understanding the binding specificity of transcription factors is an important step towards accurate computational prediction of regulatory sequences governing gene expression. Higher-throughput binding site characterization methods have long been available in the laboratory for the study of protein-DNA interactions in solution or upon a surface. In this thesis a new method is introduced for characterization of inducible regulatory sequences in living cells based on construction and analysis of promoter-reporter gene plasmids. Spiked oligonucleotides are used to generate libraries of regulatory sequences with subtle variations from a known regulatory element. Screening the library in cell culture for the capacity of the mutated sequences to mediate expression provides a diverse collection of responsive and non-responsive sequences, which aids in understanding the sequence requirements for an inducible transcription factor binding site. We apply the methodology to the study of antioxidant responsive elements (AREs), the target sites of the Nfe2l2 transcription factor. These target sequences, commonly found in the promoters of detoxification genes, modulate gene expression in response to a diverse array of chemicals. The variants serve as a primary screen for future targeted mutational analysis to further characterize context-specific sequence requirements in the ARE and/or interdependence between positions. Moreover, a transcription factor binding profile can be generated from functional ARE variants in the library screen.
Such an ARE profile performs as well as standard profiles based on bona fide ARE sequences drawn from the scientific literature.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments

Chapter 1 : Introduction
1.1 Cellular Detoxification
1.2 Chemopreventive Compounds
1.3 Nfe2l2 Regulation
1.4 Antioxidant Responsive Element
1.5 Current Technologies to Study Regulatory Sequences
1.6 Computational Predictions of Regulatory Sequences
1.7 Position-Specific Variation in Transcription Factor Binding Sites
1.8 Overview of Project

Chapter 2 : Materials and Methods
2.1 Laboratory-based Methods
2.1.1 Chemicals and Reagents
2.1.2 Cell Culture
2.1.3 Mutagenic Oligonucleotide Synthesis and Primer Extension
2.1.4 Plasmids and Cloning
2.1.5 High-throughput Screening of Clone Libraries
2.1.6 DNA Concentration Measurement and Normalization
2.1.7 Transfection and Reporter Gene Assays
2.2 Computer-Based Methods
2.2.1 Model for Design of Spiked Oligonucleotide Libraries
2.2.2 Probabilistic Model for Reporter Gene Analysis of ARE Variants
2.2.3 Combinatorics Test for Single-base Mutations under Basal Condition
2.2.4 Position Frequency Matrices
2.2.5 Evaluating Performances of the ARE Models
2.2.6 Screening of Genomic Sequence

Chapter 3 : Spiked Oligonucleotides
3.1 Theoretical Model of the ARE Spiked Oligonucleotide Design
3.2 Clone Recovery Rate and Mutation Rate Analysis
3.3 Function Verification of Known ARE and ARE-like Sequences
3.4 Expression of Wild-Type ARE Sequences
3.5 Mutational Analysis of ARE Variants Under Basal Condition
3.6 Mutational Analysis of ARE Core Variants Under Induced Condition
3.7 Mutational Analysis of ARE Variants
3.8 Variants of Inactive ARE-like Sequences

Chapter 4 : New Inducible ARE Model
4.1 A New Predictive Model for Inducible Active ARE Sequences
4.2 Comparing Performances of ARE Models in TFBS Detection
4.3 Identifying Putative ARE Sites in the Human Genome

Chapter 5 : Discussion
5.1 The "Spiked" Library Screening Method
5.1.1 Spiked Oligonucleotide Synthesis
5.1.2 Higher-Throughput Reporter Gene Assay
5.2 Observations Related to Constitutive Expression via AREs
5.3 Observations on Induction Mediated By the Antioxidant Response Element
5.4 A New Definition of ARE
5.5 Computational Biology and the Statistical Model
5.6 Future Directions

Bibliography

Appendices
Appendix A: ARE variants
Appendix B: ARE matrices
Appendix C: Decision tree analysis for ARE and ARE-like sequences
Appendix D: Experimental Protocol
Appendix E: High-throughput Transfection

List of Tables

TABLE 1.1 LIST OF 25 BP REGIONS FROM PUBLISHED ACTIVE ARES
TABLE 1.2 LIST OF 21 BP REGIONS OF INACTIVE ARE-LIKE SEQUENCES
TABLE 2.1 OLIGONUCLEOTIDES USED IN THE STUDY
TABLE 3.1 CLONE RECOVERY RATE OF ARE VARIANT CLONES
TABLE 3.2 MUTATION RATES IN THE SPIKED OLIGOS
TABLE 3.3 NUCLEOTIDE SUBSTITUTION RATE IN SPIKED OLIGONUCLEOTIDE SYNTHESIS REPLACING WILD-TYPE BASE WITH MUTATIONS IS UNDER-REPRESENTED
TABLE 3.4 COEFFICIENTS OF EMPTY VECTORS
TABLE 3.5 COEFFICIENTS OF THE WILD-TYPE ARES
TABLE 3.6 THE COMBINATORICS PROBABILITY TEST FOR THE EFFECT OF SINGLE-BASE MUTATIONS UNDER BASAL CONDITION
TABLE 3.7 AP-1 MATRIX SCORES OF THE ARE VARIANT SEQUENCES VS RESPECTIVE WILD-TYPE SEQUENCES
TABLE 3.8 ARE VARIANTS WITH MUTATIONS AT THE CORE ARE REGIONS
TABLE 3.9 ARE VARIANTS
TABLE 3.10 VARIANT SEQUENCES OF INACTIVE ARE-LIKE SEQUENCES WITH MUTATIONS IN THE CORE REGIONS
TABLE 4.1 RANK LIST OF WANG'S SET OF ARE SEQUENCES USING THE NEW INDUCIBLE MODEL OF ARE

List of Figures

FIGURE 1.1 PROPOSED PATHWAY FOR THE INDUCTION OF PHASE II DETOXIFYING GENES VIA NFE2L2:ARE BINDING
FIGURE 2.1 OLIGONUCLEOTIDE SYNTHESIS WITH BASE SUBSTITUTION
FIGURE 2.2 DESIGN OF THE SPIKED ARE SEQUENCE
FIGURE 3.1 BINOMIAL PREDICTIONS OF THE PERCENTAGE OF SEQUENCES CONTAINING 1-4 BP CHANGES WITH DIFFERENT NUMBERS OF TOTAL VARIABLE POSITIONS AT A GIVEN SUBSTITUTION RATE OF 15%
FIGURE 3.2 OVERALL MUTATION RATE IN SPIKED OLIGONUCLEOTIDE SYNTHESIS FOR POSITIONS 1 TO 25 IN THE ARE SEQUENCE
FIGURE 3.3 EFFECT OF TBHQ ON LUCIFERASE PRODUCTION IN LIVER CELL LINES TRANSFECTED WITH LUCIFERASE PLASMIDS CONTAINING ACTIVE ARES AND INACTIVE ARE-LIKE SEQUENCES
FIGURE 3.4 CATEGORIZATION OF SINGLE-BASE ARE VARIANTS UNDER BASAL CONDITION
FIGURE 4.1 LOGO REPRESENTATION OF THE KNOWN ARE MODELS
FIGURE 4.2 LOGO REPRESENTATION OF ARE MODELS BUILT FROM MUTATION COUNTS FOR EACH SET OF ARE VARIANTS
FIGURE 4.3 LOGO REPRESENTATIONS OF COMBINED ARE MUTANT PROFILES
FIGURE 4.4 COMPARISON OF THE PERFORMANCE OF KNOWN ARE MODELS WITH LEAVE-ONE-OUT METHOD IN ARE PREDICTION
FIGURE 4.5 COMPARISON OF THE NEW ARE MODEL AND THE LEAVE-ONE-OUT METHOD MATRIX FOR PREDICTION OF INDUCIBLE ARES

Acknowledgments

First and foremost, I would like to express sincere gratitude towards my supervisor Dr. Wyeth Wasserman for giving me the opportunity to do this research project and for his patience in guiding me through the various phases of my studies. I would like to thank Elodie Portales-Casamar and Magdalena Swanson for their help in the execution of the high-throughput transfection and reporter gene assays. Also, I would like to thank Jochen Brumm, with whom I cooperated in the statistical modeling of the luciferase data. I would like to thank Dora Pak for her encouragement and help with administrative issues.
I would also like to thank past and present members of the Wasserman lab who were directly involved in different stages of the project: Dimas Yusuf, David Martin, Peter Sudmant, and David Arenillas. To everyone else in the Wasserman group, your insightful suggestions, encouragement, and support were all very much appreciated. Also, I would like to thank Dr. Catherine Pallen for giving me the opportunity to work in her lab. Thank you to past and present members of the Pallen lab for providing a friendly and supportive environment. I would also like to thank the Leavitt lab, Kobor lab, and the sequencing facility. I'm also grateful to members of my supervisory committee, Dr. Peter Pare and Dr. Angela Brooks-Wilson, for their mentorship. Lastly, I am especially grateful to my parents and my siblings for their love and support.

Chapter 1 : Introduction

1.1 Cellular Detoxification

Cells are constantly exposed to oxidizing agents that are capable of damaging DNA integrity (1, 2). These carcinogens require metabolic processing to reduce their capacity to damage DNA. Such biotransformation of xenobiotics has been postulated to include a series of sequential reaction processes: phase I, in which molecules are activated; phase II, in which activated molecules are converted to an inactive or greatly reduced activity state; and, in some accounts, phase III, in which the molecules are removed from the cell (3, 4). Phase I enzymes consist primarily of the cytochrome P450 superfamily of microsomal enzymes, which modify compounds by oxidation, reduction or hydrolysis into chemically reactive metabolites that covalently bind to DNA to initiate a carcinogenic response. Phase II enzymes inactivate metabolites formed by phase I enzymes by conjugating them with endogenous ligands and converting them into less toxic, water-soluble forms.
Moreover, phase II enzymes can catalyze reactions independent of phase I activity, acting upon reactive molecules directly (5). Phase III pumps, such as the multidrug resistance protein (MRP), catalyze the removal of the molecules from the cytoplasm (6, 7).

1.2 Chemopreventive Compounds

A variety of natural and synthetic chemopreventive compounds can activate the detoxification pathways that defend cells from oxidative stress or reactive carcinogenic intermediates. Two general categories of inducers have been identified: bifunctional inducers that upregulate both phase I and phase II enzymes, and monofunctional inducers that primarily act on the expression of phase II enzymes. A wide range of structurally diverse monofunctional inducers share the common ability to react with sulfhydryl groups (8-10). Bifunctional inducers interact directly with the aryl hydrocarbon receptor and gain monofunctional inducer properties following oxidative metabolism. Natural compounds found to have chemopreventive properties include isothiocyanates in cruciferous vegetables, polyphenols in green tea, and curcumin in the curry spice turmeric (11). Synthetic compounds used as model inducers include butylated hydroxyanisole (BHA), a synthetic phenolic antioxidant commonly found in food preservatives and dermatological creams, and its metabolite tert-butylhydroquinone (tBHQ).

1.3 Nfe2l2 Regulation

In studies of phase II detoxification induced by model inducers, the upregulation of chemoprotective response genes occurs via nuclear translocation of the transcription factor NF-E2-Related Factor 2 (Nfe2l2, also commonly known as Nrf2) and the subsequent binding of Nfe2l2 to the cis-regulatory sequence known as the antioxidant response element (ARE) or the electrophile response element (EpRE). The ARE-mediated upregulation of genes enhances cellular detoxification and maintenance of redox balance.
Nfe2l2 is a cap'n'collar (CNC) family protein with a conserved basic-leucine zipper domain [bZIP, (12)]. Nfe2l2 is ubiquitously expressed but is more abundant in metabolic and detoxification organs, such as liver, kidney and intestine, as well as in organs continuously exposed to the environment, such as skin, lung, and the digestive tract. Nfe2l2 has been identified as the central transcription factor involved in regulating the expression of antioxidant and phase II enzymes in the event of oxidative stress (13, 14). The cellular location of Nfe2l2 is regulated by a repressor named Kelch-like ECH associating protein 1 [Keap1, (15)]. The prevailing model states that under basal conditions, Keap1 binds to Nfe2l2, anchors it within the cytoplasm, and targets it for ubiquitination and proteasomal degradation (Figure 1.1). Under basal conditions, the ARE sequence has been demonstrated to be bound by heterodimers consisting of Activating Protein 1 (AP-1) and a bZIP protein (16, 17). Embedded AP-1 sites, given by the consensus sequence TGA(G/C)TCA, in the γ-glutamylcysteine synthetase (GCSH) and heme oxygenase (HO-1) AREs have been demonstrated to contribute to constitutive gene expression (18, 19). Under conditions of oxidative stress, electrophile-induced modifications of cysteine residues lead to a conformational change in Keap1 that halts its ubiquitination of Nfe2l2 (15). It has also been proposed that Keap1 shuttles between the nucleus and cytoplasm to mediate Nfe2l2 activity (20). In addition to Keap1, certain protein kinases, in particular protein kinase C (PKC) and PKR-like endoplasmic reticulum kinase (PERK), have been demonstrated to contribute to the release of Nfe2l2 from Keap1, and both can phosphorylate Nfe2l2 directly in vitro and in vivo (21-23). Upon liberation from Keap1, Nfe2l2 translocates to the nucleus and heterodimerizes with one of the small Maf proteins (MafF, MafG, MafK) for ARE binding and transactivation (24).
Genetic studies using compound mutant mice have demonstrated that small Maf proteins are obligatory partner molecules for Nfe2l2 in vivo (25, 26). Polyamine-modulated factor-1 (PMF-1), activating transcription factor 4 (ATF4), and Jun proteins have also been shown to form heterodimers with Nfe2l2 in vitro, and the latter two factors have been further shown to transactivate ARE-containing reporter genes (27-29).

Figure 1.1 Proposed pathway for the induction of Phase II detoxifying genes via Nfe2l2:ARE binding. (A) Under basal condition, Keap1 targets Nfe2l2 for ubiquitination and proteasomal degradation. The antioxidant response element (ARE) is bound by heterodimers of AP-1 and a bZIP protein. (B) Upon exposure to oxidative stress, modification of sulfhydryl groups in Keap1 and phosphorylation of Nfe2l2 by several protein kinases can promote the release of Nfe2l2 from Keap1. Nfe2l2 then translocates to the nucleus, heterodimerizes with a bZIP protein (e.g. small Maf, PMF-1, ATF4, Jun), and binds to the ARE found in many Phase II detoxifying gene promoters.

[Figure 1.1 diagram labels: Activation; Ubiquitination; Modification of sulfhydryl groups; Phosphorylation; Dissociation and nuclear translocation; Degradation]

1.4 Antioxidant Responsive Element

The core sequence required for ARE binding was first determined by Rushmore et al., in their study of an ARE in the rat GST-YA gene, to be 5'-TGACnnnGC-3'. Further characterization of the flanking positions of the GST-YA ARE led to an extended consensus 5'-TMAnnnRTGAYnnnGCRwwww-3', where M = A or C, R = A or G, Y = C or T, and W = A or T (30). Furthermore, Erickson et al. suggested that the generalized ARE consensus is 5'-RTKAYnnnGCR-3', where K = G or T and Y = C or T, as a result of finding a functional ARE in the human GCLM promoter (31).
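The consensus definitions above use IUPAC degenerate base codes, which translate directly into a pattern search. The following is a minimal sketch, not part of the original study, assuming a forward-strand scan only; the function names `consensus_to_regex` and `find_sites` are hypothetical, and the two test sequences are the mouse GST-YA and NQO1 elements listed in Table 1.1.

```python
import re

# IUPAC codes appearing in the consensus definitions
# (N = any base; the table is limited to the codes used in the text).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]", "W": "[AT]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex."""
    body = "".join(IUPAC[ch] for ch in consensus.upper())
    # Wrapping the pattern in a lookahead reports overlapping matches too.
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, consensus="RTKAYnnnGCR"):
    """Return (0-based start, matched text) for each forward-strand match."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


gst_ya = "TGCTAATGGTGACAAAGCAACTTTC"  # mGST-YA ARE, Table 1.1
nqo1 = "GAGTCACAGTGAGTCGGCAAAATTT"    # mNQO1 ARE, Table 1.1

print(find_sites(gst_ya))
print(find_sites(nqo1))
```

Under these assumptions the GST-YA element matches the generalized consensus once, whereas the NQO1 element yields no forward-strand match even though it is a published active ARE, which illustrates why a strict consensus is an incomplete description of functional elements.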
AREs have been identified in the promoters of genes encoding drug metabolizing enzymes and antioxidant proteins, such as glutathione S-transferases (GST), NAD(P)H:quinone oxidoreductase (NQO1), heme oxygenase (HO-1), aldo-keto reductases (AKR), and glutamate-cysteine ligase (GCL) (Table 1.1). Based on previous definitions of the ARE consensus, Nerland and Wang et al. have collected 32 and 57 putative ARE sequences, respectively, in their reviews of the ARE (32, 33). More recent studies have uncovered greater variability in ARE sequences, and the sequence requirements of elements from different genes are distinct: there exist inactive ARE sequences that follow the current ARE model as well as active sequences that deviate from the core requirement. Requirements for the element may be distinct for different genes due to nucleotide tolerance in the core positions and/or the contribution of flanking positions to a functional enhancer, perhaps due to the specific context in which certain nucleotides are found or due to variation in the regulating transcription factors that are recruited to the individual enhancers. Introduction of bases present in certain wild-type AREs into the equivalent positions of other functional AREs abolished reporter gene activity in cell-based studies (18, 31, 34, 35). The presence of a G at the Y position of the RTKAY core has been documented in AREs of the mouse NQO1, mouse GSTP1, and rat AKR7A3 genes (17, 34, 36). Sequences defining AREs are more plastic than previously believed, and ARE sequence requirements are not sufficiently understood. For this thesis, an active ARE (Table 1.1) is defined as a sequence that contains an ARE core-like segment and is able to drive increased reporter gene expression under inducing conditions, whereas an inactive ARE sequence (Table 1.2) also has a core-like sequence but is unable to mediate significant reporter gene expression (relative to a control plasmid).
Table 1.1 List of 25 bp regions from published active AREs

Gene Symbol (1)   Sequence (5'→3')             Reference
mGSTA3            ACTCAGGCATGACATTGCATTTTTC    (37)
hGCLM             GGAAGACAATGACTAAGCAGAAATC    (31, 38)
hGCLC             GCCTCCCCGTGACTCAGCGCTTTGT    (18, 35)
mGST-YA           TGCTAATGGTGACAAAGCAACTTTC    (30)
mNQO1             GAGTCACAGTGAGTCGGCAAAATTT    (34)
hNQO2             ACAAGGTGGTGATGTTGCATCACAC    (39)
hAKR1C2           CAGTCAGGGTGACTCAGCAGCTTGC    (40)
mAKR1B3           ACTGGAGCATGACCCAGCAGAAGGA    (41)
hTRX              CGGTCACCGTTACTCAGCACTTTGT    (42)
hMRP              CCTTCTGTGTGACTCAGCTTTGGAG    (43)
rGSTP1            CAGTCACTATGATTCAGCAACAAAA    (17, 36)
hGPX2             TCCCCAGGATGACTTAGCAAAAACA    (44)
mHO1              TCCCAACCATGACACAGCATAAAAG    (45)
hFTH1             TCTCCTCCATGACAAAGCACTTTTT    (46)
mMAFG             GAGTCATGCTGACTCAGCGGATCGC    (47)
hTXRND1           GAGTCAGAATGACAAAGCAGAAATC    (48)
hGNAI2            CCAAGCCTGTGACTGGGCCGGGGCG    (49)
hETS1             GAGAGCGGGTGACCAAGCCCTCAAG    (50)

(1) h = human, m = mouse, r = rat

Table 1.2 List of 21 bp regions of inactive ARE-like sequences

Gene Symbol (1)   Sequence (5'→3')         Reference
hGCSH-AREL3       GCCACGTGACTGCGCGGGCCC    (35)
hGPX2-AREL1       TACATGTGAGAGGGCAGGGTC    (44)
hNQO2-AREL1       GGCGGGTGAGTGGGCGGGGCC    (39)
hNQO2-AREL2       GCCAAATGAGGTGGCAGAAGC    (39)
hNQO2-AREL5       CCTGGATGACAGAGCGAGACC    (39)
mGST-mu           CTTCGGTGACATAGCCTCCAT    (30)
xCT-EpREL3        GCTGCATGATGATGCAAATTC    (51)

1.5 Current Technologies to Study Regulatory Sequences

Several methods have been developed for the experimental delineation of transcription factor binding sites (TFBS). These methods can be divided into in vitro binding studies and cellular studies. In vitro binding methods include gel shift assay/EMSA, DNase footprinting, and southwestern blots. High-throughput TF-DNA binding assays, such as Systematic Evolution of Ligands by Exponential enrichment (SELEX) or double-stranded DNA arrays, have greatly accelerated the process of determining the sequence requirement for protein binding.
The binding of a protein to a DNA sequence in an in vitro binding assay does not definitively indicate that it behaves similarly in a living cell, so many researchers prefer to test activity in cells. In cell culture studies, DNA segments that contain TFBS can be identified by a reporter gene assay. To assay transcriptional activity, a sequence of interest is cloned upstream of a reporter gene in a plasmid, which is transiently transfected into cells, and the activity of the reporter gene is measured (52). The activity of the reporter gene can be correlated with the function of the regulatory sequence in the cell. It is now also feasible to conduct larger-scale experiments to investigate the functional properties of panels of candidate enhancers and promoters within cells. Cooper et al. performed a large screen of promoter activity in 16 cell lines on all predicted promoters in the 1% of the human genome targeted for in-depth annotation by the ENCODE Project (53). Similarly, relatively large-scale in vivo enhancer studies have been performed using highly conserved (human to fish) sequences driving reporter gene expression in transgenic mouse embryos, leading to the identification of 75 forebrain-specific enhancers (54). The development of higher-throughput approaches to verify enhancer and promoter function has been a focus of recent efforts to annotate vertebrate genomes.

1.6 Computational Predictions of Regulatory Sequences

Obtaining data through experimental methods can be laborious, time consuming and costly. The development of methods for accurate computational prediction of TFBS and regulatory regions has been a key goal in computational biology. A set of sequences for experimentally identified TFBS can be aligned as the first step in building a TFBS motif model. The nucleotide frequency observed at each position of the alignment is recorded in a matrix,
commonly known as a position frequency matrix (PFM). The frequency values are converted to a log scale, resulting in a final profile known as a position weight matrix [PWM, (55, 56)], which is used to score a sequence of interest. To apply the motif model to TFBS prediction, a user-selected threshold value is used to classify whether or not a sequence scored with the PWM is predicted to be a binding site. The complete set of predictions produced by the PWM model includes many false predictions. That false prediction rates are high for all computational TFBS models is well known: functional binding sites are presumably influenced greatly by the context of adjoining binding sites and chromatin states (30). PWM models in the best cases exhibit a high success rate for predicting artificial TF-TFBS interactions in assays such as EMSAs or simple minimal promoter reporter gene assays (57). Given the artificial nature of the plasmid-based minimal promoter constructs that are commonly used to validate predicted TFBSs, one would expect a binding site model to accurately predict the function of a TFBS in a reporter gene assay in which only a short segment containing a putative sequence drives a minimal promoter. PFMs and PWMs have been compiled for the ARE (32, 33). It is noteworthy that about 40% of the known ARE sequences contain TCA at the nnn region within the core (33), leading to matrix models with some preference for these nucleotides.

1.7 Position-Specific Variation in Transcription Factor Binding Sites

A transcription factor binding motif is usually 6 to 30 base pairs in length, and some nucleotides tend to occur more often than others at specific positions for a given TF. It has been demonstrated that there is a relationship between the pattern of TFBS degeneracy and positions directly contacted by the binding protein: positions with little or no contact observed in protein:DNA structures show greater variability among orthologous binding sites (58).
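The matrix models of Section 1.6, and the position-specific preferences noted here, can be made concrete with a small example. The following is a minimal sketch rather than the procedure used in this thesis: it assumes a uniform background of 0.25 per base and a simple pseudocount (both illustrative choices), the function names are hypothetical, and the four training sites are the 11 bp core windows of the mGSTA3, hGCLM, hGCLC and mGST-YA elements from Table 1.1.

```python
import math

BASES = "ACGT"


def build_pfm(sites):
    """Count each base in each column of equal-length, pre-aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [{b: sum(s[i] == b for s in sites) for b in BASES}
            for i in range(length)]


def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Convert counts to log2 odds against a uniform background.

    The pseudocount (an assumption of this sketch) keeps bases that were
    never observed in a column from producing log(0).
    """
    pwm = []
    for column in pfm:
        total = sum(column.values())
        pwm.append({
            b: math.log2(
                (column[b] + pseudocount * background)
                / (total + pseudocount) / background)
            for b in BASES
        })
    return pwm


def score(pwm, site):
    """Sum the column weights for the bases found in a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))


def is_predicted_site(pwm, site, threshold):
    """Apply a user-selected threshold to the PWM score."""
    return score(pwm, site) >= threshold


# 11 bp windows (RTKAYnnnGCR region) from four Table 1.1 elements.
sites = ["ATGACATTGCA", "ATGACTAAGCA", "GTGACTCAGCG", "GTGACAAAGCA"]
pfm = build_pfm(sites)
pwm = pfm_to_pwm(pfm)

print(score(pwm, "GTGACAAAGCA"))  # a training site
print(score(pwm, "GTGAGAAAGCA"))  # same site with C->G in the core
```

With so few training sites the weights are dominated by the pseudocount, so the sketch only shows the mechanics: a substitution at a fully conserved column lowers the score far more than one at a variable column.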
The degenerate positions are expected to show less constrained evolution because they are less important for the formation of the protein:DNA complex. On the other hand, changes at positions in the DNA motif that disrupt transcription factor binding would be deleterious. On a genomic level, SNPs in TFBS may play a biological role by affecting the binding of transcription factors, leading to differences in gene expression and phenotypes. Using a PWM of known functional AREs combined with Nfe2l2 expression data, Wang et al. (2007) identified functional polymorphisms of AREs in the human genome (33).

1.8 Overview of Project

To further study the sequence requirements for the antioxidant response element, in this thesis project we present a spiked oligonucleotide mutation procedure for library construction, combined with higher-throughput transient transfection reporter assays, to test panels of variants of known functional ARE and non-functional ARE-like sequences. The existing TFBS profile derived from known antioxidant responsive elements confirms a core target sequence of RTKAYnnnGCR, with an emphasis on TCA nucleotides at the three variable positions. The current model is based on nearly 20 years of research literature covering dozens of genes. While many AREs have been identified, the constraints for function remain unclear, and a greater abundance of functional AREs and non-functional mutations could lead to improved computational models. The newly generated ARE data are used to construct a predictive model for identification of novel ARE binding sites in the genome.

Chapter 2 : Materials and Methods

2.1 Laboratory-based Methods

2.1.1 Chemicals and Reagents

Tert-butylhydroquinone (tBHQ) was obtained from Sigma Chemical Co. (Oakville, ON, Canada). Oligonucleotides were supplied by Operon Technologies (Huntsville, AL, USA). Restriction endonucleases were obtained from New England Biolabs (Mississauga, ON, Canada).
2.1.2 Cell Culture

Mouse hepatoma Hepa-1c1c7 cells (ATCC CRL-2026; American Type Culture Collection; Manassas, VA, USA) were maintained in alpha minimum essential medium supplemented with 10% (v/v) heat-inactivated fetal bovine serum, 100 U/ml penicillin, and 100 µg/ml streptomycin. HepG2 cells were maintained in Dulbecco's modified Eagle's medium supplemented with 10% (v/v) heat-inactivated fetal bovine serum, 100 U/ml penicillin, and 100 µg/ml streptomycin. The cultures were grown at 37 °C and 5% CO2. The media and reagents for cell culture were obtained from Gibco-Invitrogen (GIBCO-Invitrogen Canada, Canadian Life Technologies, Burlington, ON, Canada).

2.1.3 Mutagenic Oligonucleotide Synthesis and Primer Extension

Current methods for incorporating mutations in a DNA sequence include site-directed mutagenesis by specific synthetic oligonucleotides (58), PCR mutagenesis, and random mutagenesis with degenerate oligonucleotides (59). The latter method allows researchers to generate a library of DNA sequences differing by one or more mutations from an original sequence. Such a library can be screened to determine which positions are functionally important. Oligonucleotides are synthesized from 3' to 5', with the 3' end of the oligonucleotide bound to a solid support column on which the reactions take place. The phosphoramidite bases are added sequentially according to the template design. In each step, the solution of nucleotides for the next reaction is pumped through the column and washed out before the next nucleotide is added. To synthesize degenerate oligonucleotides, a process also known as "spiking", the solution of nucleotides added to the elongating DNA molecules contains a mixture of different bases, which leads to base substitutions at specified positions (Figure 2.1).

Figure 2.1 Oligonucleotide synthesis with base substitution.
In normal oligonucleotide synthesis, a pool of phosphoramidite bases is added to the elongating DNA sequence to give the same nucleotide at the given position. In spiked oligonucleotide synthesis, a mix of all four bases is added so there is variation at a given position. The end result is a pool of oligos with random substitutions at specified degenerate positions.

[Figure 2.1 schematic: in normal synthesis every elongating strand (5'-GATCT...-3') receives the same base (A); in spiked synthesis most strands receive the wild-type base (C) while a minority receive a substituted base (A or T). Example 50-mer products of spiked synthesis, each carrying a few substitutions relative to the wild-type sequence, are shown beneath.]

Three active ARE and two inactive ARE-like sequences from published reporter assay results were chosen to serve as the backbone for the generation of spiked ARE oligonucleotide libraries (Table 2.1).

Table 2.1 Oligonucleotides used in the study.

Name        Description          Sequence (5'→3')
GSTA3       ARE                  ACTCAGGCATGACATTGCATTTTTC
GST-YA      ARE                  TGCTAATGGTGACAAAGCAACTTTC
NQO1        ARE                  GAGTCACAGTGAGTCGGCAAAATTT
NQO2-AL5    Inactive ARE-like    CTGCCTGGATGACAGAGCGAGACCC
GCSH-AL1    Inactive ARE-like    AATATGTGTTGACAGAGCAATGACC
Primer      Primer               GTCGATACGAAGATCT

The spiked ARE oligonucleotide consists of a 3 bp clamp and a Mlu I restriction enzyme cut site at the 5' end. The 3' end contains the Bgl II restriction enzyme cut site and a 10 bp universal primer binding site for reverse-strand synthesis via primer extension reaction
The 50-mer oligonucleotides contain 7 by invariant core positions, 18 by of spiked positions, and a universal primer binding site. Figure 2.2 Design of the spiked ARE sequence. The bolded text represents the ARE invariant core. Nucleotides coloured in black represent invariant positions, consisting of the core, restriction enzyme sites, and the primer binding site. Nucleotides coloured in gray represent variable positions. Star represents base substitution.  I  ^ ^ ^ II III^IV^V^VI VII  clamp + Mlu I  Flanking  Core  nnn  Core  Flanking  Bgl II + primer binding site  GCTACGCGT  "I( K.' FAAT(i  GTGAC  AA \  GC  A ACITTC  AGATCTTCGTATCGAC  Mlu I  1  Spiked oligonucleotide synthesis (15% mutation rate) Bgl II + primer ^►  A Y  -AV  A-  -  primer 4^  ^► ►  ^• ^►  V V^V  Primer extension with Klenow fragment to produce double-stranded products 4^  A  -A-  •  V  •  -AV  •^  • ^►  V V^V  Clone and screen mutant populations  by reporter gene assays  Due to the random nature of base substitution in spiked oligos, the dsDNA was created via primer extension so that the creation of dsDNA does not rely upon annealing a complex mix of heterogeneous complementary sequences which avoids the need to handle each variant  13  sequence separately. The exact complement of each template strand including any degenerate nucleotides synthesized into the first strand, was made by the Klenow fragment of DNA polymerase I (3' 4 5' exo-; NEB, Mississauga, ON, Canada). Extension reaction was carried out at 37 ° C for 4 hours. Extension reactions include 1X NEBuffer 2, 6 p1 of 100gM ssDNA template, 3 pi of 101.tM primer, 3 pl of 10mM dNTP mix, 120 pl dddH2O, and 3 pi Klenow fragment (3'- 5' exo). The double-stranded ARE oligos were purified by column purification using the QlAquick Nucleotide Removal Kit (Qiagen Inc. Mississauga, ON, Canada).  
2.1.4 Plasmids and Cloning

The double-stranded ARE oligos were subcloned into the pGL3-promoter luciferase vector (Promega; Fisher Scientific, Nepean, ON, Canada) using the Mlu I and Bgl II restriction sites. The restriction digest was carried out overnight at 37 °C. Post-digestion, the plasmid was dephosphorylated with alkaline phosphatase (NEB, Mississauga, ON, Canada) and then gel purified. The ARE oligos were column purified using the QIAquick Nucleotide Removal Kit (Qiagen Inc., Mississauga, ON, Canada) and ligated to digested vectors. Constructs were transformed into subcloning-efficiency DH5α chemically competent E. coli cells (GIBCO-Invitrogen Canada, Canadian Life Technologies, Burlington, ON, Canada) via heat shock at 42 °C and plated on LB agar plates containing 100 µg/ml of ampicillin for preliminary bacterial colony screening. Colonies were picked and grown overnight in 3 ml LB broth with ampicillin. Plasmids were miniprepped using the QIAprep Spin Miniprep Kit (Qiagen Inc., Mississauga, ON, Canada). Preliminary sequence checks were done in-house at the CMMT/CFRI DNA Sequencing Core Facility.

2.1.5 High-throughput Screening of Clone Libraries

Ligations were sent to the BC Genome Sciences Centre (GSC) for large-scale transformation, colony picking, miniprep, and sequencing reactions. 1 µl of ligation mix was transformed by electroporation into E. coli DH10B T1-resistant cells (Invitrogen). Transformed cells were recovered using 1 ml of SOC medium and plated onto 22 cm x 22 cm agar plates (Genetix) containing 100 µg/µl ampicillin. Bacterial colonies were picked from the agar plates and arrayed into 384-well microtiter plates (Genetix) using the Genetix QPIX automated colony
DNA sequencing reactions were set up using a Biomek FX workstation (Beckman-Coulter) and performed using BigDye 3.1 (Applied Biosystems). Analysis of the resulting sequence files was performed with AlignX from the Vector NTI software (Invitrogen, Carlsbad, CA).

2.1.6 DNA Concentration Measurement and Normalization

The concentration of the plasmids was measured using the NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies Inc., Wilmington, DE). All plasmid prep samples were normalized to 100 ng/µl per well.

2.1.7 Transfection and Reporter Gene Assays

For preliminary validation of known ARE sequences, HepG2 and Hepa-1c1c6 cells were seeded in 96-well plates at a density of 1.5 x 10^4 and 5 x 10^3 cells per well, respectively. For large-scale screening of ARE variants, only Hepa-1c1c6 cells were used. Twenty-four hours later the cells were transfected with 200 ng of pGL3-promoter firefly luciferase plasmid containing insert DNA and 20 ng of renilla phRL-TK luciferase plasmid as an internal control, using Lipofectamine 2000 according to the manufacturer's protocol (GIBCO-Invitrogen Canada, Canadian Life Technologies, Burlington, ON, Canada). At 24 h post-transfection, cells were treated with 100 µM tBHQ or DMSO vehicle for 18 h. The cells were harvested and luciferase activity was measured using the Dual-Luciferase Reporter Assay System (Promega, Madison, WI) and a Wallac Victor3 V 1420 Multilabel Counter (Perkin-Elmer, Shelton, CT). Reporter gene activity of the empty pGL3-promoter plasmid served as the control. Results were based on three transfections performed in duplicate.

2.2 Computer-Based Methods

2.2.1 Model for Design of Spiked Oligonucleotide Libraries

A binomial distribution gives the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
In our case, the distribution was used in a model of spiked mutagenesis to select the number of variable positions to include in the oligonucleotides, assuming a mutation rate of 15% from wild-type (15% is the minimum mutation rate supported by the instruments commonly used in large-scale oligonucleotide synthesis). The proportion of sequences containing one to four base-pair mutations was predicted for a range of variable positions using the following equation:

$$f(k; n, p) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

where n is the number of variable positions, p is the mutation rate (i.e. 15%) and k is the number of changes observed.

2.2.2 Probabilistic Model for Reporter Gene Analysis of ARE Variants

For background, a brief summary is presented of the most common analysis procedure for reporter gene studies. Reporter gene activity is commonly normalized by dividing the firefly relative light units (RLU) for each construct by the renilla RLU produced by the control reporter plasmid, thus controlling for transfection efficiency. For treatment with inducing agents, an induction ratio can be calculated by dividing the control-normalized reporter activity of the ARE-containing plasmid treated with the inducing compound tBHQ by the reporter activity observed under the control DMSO treatment. For each mutation, one would divide the mutant ARE variant induction ratio by the wild-type ARE induction ratio to determine the effect on expression mediated by the mutation(s). In this study we chose a slightly different approach to delineate whether, and how much, the changes in AREs impact constitutive and/or inducible expression. We elected to pursue this approach because it allows us to combine the effects of different influences into a single statistical test. The expression resulting from the influences of the vector, the wild-type ARE and the construct-specific mutations are combined under the assumption that these effects are multiplicative.
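The conventional ratio-based procedure summarized above, and the way multiplicative effects become additive on a log scale, can be illustrated with a minimal Python sketch. The RLU readings and function names here are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
from math import isclose, log2


def normalized_activity(firefly_rlu, renilla_rlu):
    """Firefly signal divided by the co-transfected renilla control signal."""
    return firefly_rlu / renilla_rlu


def induction_ratio(treated_activity, vehicle_activity):
    """Normalized activity under tBHQ divided by that under DMSO."""
    return treated_activity / vehicle_activity


# Hypothetical (firefly, renilla) readings; illustrative values only.
wt_dmso = normalized_activity(4000.0, 1000.0)   # 4.0
wt_tbhq = normalized_activity(9600.0, 1200.0)   # 8.0
mut_dmso = normalized_activity(3000.0, 1000.0)  # 3.0
mut_tbhq = normalized_activity(3600.0, 800.0)   # 4.5

wt_induction = induction_ratio(wt_tbhq, wt_dmso)     # 2.0
mut_induction = induction_ratio(mut_tbhq, mut_dmso)  # 1.5
relative_induction = mut_induction / wt_induction    # 0.75
basal_effect = mut_dmso / wt_dmso                    # 0.75

# Multiplicative on the raw scale ...
product = wt_dmso * wt_induction * basal_effect * relative_induction
# ... which is additive on the log2 scale.
log_sum = (log2(wt_dmso) + log2(wt_induction)
           + log2(basal_effect) + log2(relative_induction))

print(isclose(product, mut_tbhq), isclose(log_sum, log2(mut_tbhq)))
```

Working in logarithms is what lets each influence appear as a separate additive term in a linear model.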
This assumption reflects a simple-to-compute yet realistic representation of enhancer contribution to transcriptional activity: the expression level produced through basal mechanisms (fold change over empty vector) and through the treatment (fold change of treated over untreated) can be multiplied to give the overall fold change. What follows is a series of representative equations based on the multiplicative assumption. We work on log scales, so in the equations addition reflects a multiplicative relationship. Under the basal condition, the logarithm of the observed expression $y_{vj}$ (i.e. firefly RLU divided by renilla RLU) of each replicate tested for the empty vector is modeled by a linear formula:

$$y_{vj} = \mu + \varepsilon_{vj}$$

where $\mu$ is the effect of the empty vector, j indexes each replicate observation, and $\varepsilon$ is an error term.

For constructs containing a wild-type sequence, the observed expression, represented by $y_{wj}$, is expanded to the following form:

$$y_{wj} = \mu + \beta_w + \varepsilon_{wj}$$

where $\beta_w$ is the effect mediated by the wild-type ARE sequence. For the i-th mutant construct, the model takes this form:

$$y_{ij} = \mu + \beta_w + \beta_i + \varepsilon_{ij}$$

where $\beta_i$ is the effect on expression level mediated by the variation in the ARE.

In the analysis performed, all error terms are assumed to arise from the same distribution across all replicates of all variants of the same wild-type ARE; this distribution is assumed to be normal with a constant variance and a mean of 0. This assumption allows the analysis to be performed using a standard regression model within the R software package.

Under the induced condition, as above, the expression of each construct is fitted by a linear regression model with multiplicative effects of the tBHQ-treated empty vector, wild-type sequence, and the ARE variant, represented with the superscript t.
$$y^{t}_{vj} = \mu + \mu^{t} + \varepsilon^{t}_{vj}$$

$$y^{t}_{wj} = \mu + \mu^{t} + \beta_w + \beta^{t}_w + \varepsilon^{t}_{wj}$$

$$y^{t}_{ij} = \mu + \mu^{t} + \beta_w + \beta^{t}_w + \beta_i + \beta^{t}_i + \varepsilon^{t}_{ij}$$

Our goal is to find $\beta_i$ and $\beta^{t}_i$ for each construct for all five pools of ARE variants, which are modeled separately. In real-world experiments the data will not fit the model perfectly. A measure of the closeness of the observed data to the model is the residual variance, and the best-fitting values for the model minimize this variance. The values of $\beta_i$ and $\beta^{t}_i$ are "true values"; estimates of these terms, $\beta'_i$ and $\beta'^{t}_i$, are generated in the model-fitting process, and the best-fitting estimates are reported by the software. Restated, $\beta'_i$ and $\beta'^{t}_i$ are estimates, using all construct data for each ARE, of the impact of the sequence variations on the basal and induced expression respectively. To summarize, a linear regression procedure within the R statistics package was used to find the best-fitting estimates for the terms in the final equation in light of the experimental data. In the discussion we address future directions that would include alternative analysis methods.

2.2.3 Combinatorics Test for Single-base Mutations under Basal Condition

In order to assess the significance of regional modifications within the ARE, we performed a combinatorics test. As changes to constitutive expression were observed in ~50% of cases, we utilized a standard coin toss formulation. Coin toss probability models the probability of obtaining x heads in n flips:

$$\text{Probability} = \frac{\binom{n}{x}}{2^{n}} = \frac{n!}{2^{n}\, x!\, (n - x)!}$$

By analogy with the coin toss, x is the number of observed changes in a region that are associated with expression changes and n is the number of mutations tested in the region.

2.2.4 Position Frequency Matrices

18 sequences of Nfe2l2 binding sites were obtained from the literature, verified primarily by transactivation assays, and in most cases additional evidence was reported for in vitro or in vivo binding.
Given this set of experimentally defined AREs, the known sequences were aligned and a position frequency matrix was built from the alignment. An inducible ARE model was built from 118 active ARE variants that were functional in this study, that is, those for which the inducible expression was not significantly different from the expression of the wild-type ARE sequence. The profiles constructed from the ARE variants were based on the discrete set of observed nucleotides at each position: equal weight was given to all nucleotides that were observed at a position in one or more functional variants of an ARE. Each column in a matrix sums to 4. If all four nucleotides were present at a given position, a score of 1 was assigned to each nucleotide. In core positions, the wild-type nucleotide was assigned a score of 4 while the other nucleotides were assigned 0. Combined mutant profiles were constructed by summing mutant profiles together (Appendix B). The profiles were used to evaluate possible binding sites in an input sequence, as reviewed previously (55, 56).

2.2.5 Evaluating Performances of the ARE Models

For the training set of 18 known ARE sequences, true positives (TP) refer to those bona fide ARE sites identified by the model and false negatives (FN) refer to those not detected. A true negative (TN) is recorded when no ARE sequence is predicted in a sequence from the control set, whereas a false positive (FP) is recorded when an ARE site is predicted there. Sensitivity is computed as TP/(TP + FN) and specificity as TN/(TN + FP), so that 1 - specificity equals FP/(FP + TN). A receiver operating characteristic-like (ROC) curve was created by plotting sensitivity against 1 - specificity while varying the PWM scoring threshold from 60% to 90% for ARE site detection.
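As an illustration of the evaluation just described, the following Python sketch computes sensitivity, specificity and ROC-like points across scoring thresholds. The counts are hypothetical (18 known sites and an assumed control set of 100 sequences), and the function names are invented for this example rather than taken from the software used in the study.

```python
def sensitivity(tp, fn):
    """Fraction of bona fide sites that the model detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of control sequences in which no site is predicted."""
    return tn / (tn + fp)


def roc_points(counts_by_threshold):
    """Return (threshold, 1 - specificity, sensitivity) for each threshold.

    counts_by_threshold maps a scoring threshold to a (tp, fn, tn, fp) tuple.
    """
    points = []
    for threshold in sorted(counts_by_threshold):
        tp, fn, tn, fp = counts_by_threshold[threshold]
        points.append((threshold,
                       1 - specificity(tn, fp),
                       sensitivity(tp, fn)))
    return points


# Hypothetical counts for illustration only; not results from the thesis.
counts = {
    0.60: (18, 0, 60, 40),
    0.75: (16, 2, 90, 10),
    0.90: (9, 9, 99, 1),
}

for threshold, fpr, sens in roc_points(counts):
    print(f"threshold={threshold:.2f}  1-specificity={fpr:.2f}  sensitivity={sens:.2f}")
```

Raising the threshold moves along the curve toward fewer false positives at the cost of missed sites, which is the tradeoff the ROC-like plot is meant to expose.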
2.2.6 Screening of Genomic Sequence

The conserved upstream and downstream 10 kb regions from the transcriptional start sites (TSS) of 15150 human-mouse gene sequence alignments were analyzed to identify regions of at least 60% nucleotide identity over 100 bp [data obtained from the oPOSSUM system (60)]. The conserved sequences were screened with the newly defined inducible ARE model at a scoring threshold of 75% to predict novel ARE sites. Both strands of the sequences were scanned.

Chapter 3: Spiked Oligonucleotides

3.1 Theoretical Model of the ARE Spiked Oligonucleotide Design

Previous studies have shown that mutations in the seven positions of the ARE core sequence RTKAYnnnGCR abrogated induction completely. Thus, the ARE core was held invariant in this study to focus on the characterization of the core-flanking nucleotides and the degenerate nnn spacer between the two core segments. In order to generate panels of variations for each ARE and ARE-like sequence, spiked oligonucleotide synthesis was employed. Spiking refers to the introduction of alternative (mutant) nucleotides in the synthesis reaction for a given position to compete for incorporation against the dominant nucleotide. Restated, given a wild-type sequence specification of a G, in a spiked reaction there is a chance of obtaining an A, C or T nucleotide at the position, in proportion to the composition of the synthesis mixture. To further characterize the ARE, we aimed to elucidate the effect of mutations on the function of an ARE via reporter gene assays. For the sequence-variation analysis, known ARE sequences with seven invariant positions were designed to incorporate single and multi-base mutations. We anticipated that 1-4 bp of mutations per construct would be informative in the data analysis phase of the project.
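The binomial model of Section 2.2.1 can be tabulated directly to see how the expected share of informative constructs (1-4 substitutions) depends on the number of spiked positions. The Python sketch below is illustrative only; the function names are invented for this example, and it assumes an identical, independent substitution probability at every spiked position.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k substitutions among n spiked positions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def fraction_with_mutations(n, p, k_min=1, k_max=4):
    """Expected fraction of oligos carrying between k_min and k_max substitutions."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, min(k_max, n) + 1))


if __name__ == "__main__":
    rate = 0.15  # per-position substitution rate assumed in the design model
    for n_positions in range(1, 20, 2):
        share = fraction_with_mutations(n_positions, rate)
        print(f"{n_positions:2d} variable positions: {share:.3f}")
```

For 18 variable positions at a 15% substitution rate this evaluates to roughly 0.83, and re-running it with a different rate shows how sensitive the design is to the actual spiking efficiency.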
The spiking frequency in the oligonucleotide synthesis reaction was set to 85% for the wild-type base and 5% for each of the other three bases, reflecting the lowest possible mutation rate for the oligonucleotide synthesis instruments used by our suppliers. A binomial simulation was performed to determine the theoretical expected percentage of sequences containing one to four base pairs of mutations for a range of spiked positions. Figure 3.1 demonstrates the tradeoff between the number of variable positions in the spiked ARE oligonucleotide and the number of mutations incorporated per sequence. The size limit of the spiked oligonucleotides is 50 base pairs; in order to accommodate priming sequences, restriction enzyme sites and 7 base pairs of invariant core, 18 variable positions were designated. Assuming a binomial distribution, we projected that 83% of clones would contain 1-4 base mutations at the specified 15% mutation rate.

Figure 3.1 Binomial predictions of the percentage of sequences containing 1-4 bp changes with different numbers of total variable positions at a given substitution rate of 15%.

[Figure 3.1: plot; x-axis: total number of variable positions (1 to 19); y-axis scaled from 0 to 0.9.]

3.2 Clone Recovery Rate and Mutation Rate Analysis

We selected three active ARE and two inactive ARE-like sequences for high-throughput reporter gene assays. The functional AREs selected came from the following mouse genes: glutathione S-transferase YA subunit (mGST-YA; situated at -1145bp to -1121bp relative to the transcription start site), A3 subunit (mGSTA3; -5064bp to -5040bp) and NAD(P)H:quinone oxidoreductase 1 (mNQO1; -360 to -384bp) (30, 34, 37). The GST-YA ARE sequence was chosen based on our experience with it in a previous study.
The remaining ARE-like sequences were selected because they are inactive and their sequences deviate from the known consensus. Inactive ARE-like sequences were selected from the following human genes: NAD(P)H:quinone oxidoreductase 2 (hNQO2; -2596bp to -2575bp) and γ-glutamylcysteine synthetase heavy subunit (hGCSH; -921 to -945) (38, 39). In both cases, the inactive sequences were identified as part of larger screens of multiple candidate ARE-like sequences which ultimately revealed active AREs at other locations. 96 variant clones were screened for each of the five wild-type ARE sequences. Of a total of 480 sequenced clones, 281 clones contained ARE variant sequences with the correct size (Table 3.1), a cloning success rate of 59%, with a range from 36% to 80% across the five panels.

The recovery yield for plasmids containing ARE or ARE-like sequences across the five clone libraries is shown in Table 3.1. Out of the 281 ARE-containing sequences, 222 (79%) had unique mutations, 31 (11%) were wild-type ARE sequences, and 28 (10%) were duplicate sequences. The GCSH and NQO1 AREs had fewer variants due to the presence of empty plasmids and truncated ARE sequences. The lower yield may reflect suboptimal oligonucleotide synthesis or reduced success in the cloning steps for the generation of the clone libraries.

Table 3.1 Clone recovery rate of ARE variant clones. Bracketed numbers represent the number of unique clones within the set.
| Number of mutations | GST-YA | GSTA3 | NQO1 | NQO2 | GCSH |
|---|---|---|---|---|---|
| 0 | 1 | 10 | 5 | 6 | 10 |
| 1 | 21 (11) | 24 (18) | 7 (7) | 31 (24) | 11 (10) |
| 2 | 25 (23) | 7 (7) | 13 (12) | 20 (20) | 8 (8) |
| 3 | 16 (16) | 11 (11) | 6 (6) | 14 (14) | 8 (8) |
| 4 | 6 (6) | 1 (1) | 4 (4) | 3 (3) | 7 (7) |
| >4 | 2 (2) | 0 (0) | 0 (0) | 3 (3) | 1 (1) |
| Total | 71 (58) | 53 (37) | 35 (29) | 77 (64) | 45 (34) |

To further delineate the overall mutation rates in spiked oligos, all sequences from the five pools were combined to determine the proportion of sequences containing zero, one, two, three, four or more than four mutations. Compared to the model, there were fewer mutations per clone than expected (Table 3.2). Returning to the theoretical calculations used for the design specification, the effective rate of nucleotide substitution from the DNA synthesizer was determined to be close to 10% for the spiked positions. Based on the observed 10% nucleotide substitution rate, we re-calculated the mutation frequency in the oligos using the binomial distribution and confirmed the results in the empirical fit.

Table 3.2 Mutation rates in the spiked oligos. The predicted rates come from a binomial prediction with an overall mutation rate of 15% from the wild-type nucleotide. The observed rate is the percentage of sequences containing the given number of mutations. The empirical fit is based on a binomial prediction with an overall mutation rate of 10%.

| Number of mutations | Predicted (%) | Observed (%) | Empirical Fit (%) |
|---|---|---|---|
| 0 | 5 | 11 | 15 |
| 1 | 17 | 32 | 30 |
| 2 | 26 | 25 | 29 |
| 3 | 24 | 18 | 17 |
| 4 | 16 | 9 | 7 |
| >4 | 12 | 5 | 2 |

The individual nucleotide substitution rates in the spiked oligonucleotide synthesis were lower than the defined 5% (Table 3.3). Only the G nucleotide met the theoretical mutation rate of 5%, whereas the other three bases were apparently under-represented in the nucleotide mix during spiked oligonucleotide synthesis.

Table 3.3 Nucleotide substitution rates in spiked oligonucleotide synthesis. Replacement of the wild-type (WT) base with mutant (MT) bases is under-represented.
WT base A C G T  MT base A  4% 2% 3%  C  G  T  3%  5% 5%  5% 4% 1%  2% 2%  6%

The mutation rates varied across all 18 variable positions (Figure 3.2). Moreover, four of the seven invariant core positions had mutations incorporated in them at an observed rate of 0.36 to 1.81%.

Figure 3.2 Overall mutation rate in spiked oligonucleotide synthesis for positions 1 to 25 in the ARE sequence. Positions 9 to 13 and 17 to 18 are core regions which were held invariant during oligonucleotide synthesis (indicated by shaded background).

[Figure 3.2: bar chart of mutation rate by position in the ARE sequence (positions 1 to 25).]

3.3 Function Verification of Known ARE and ARE-like Sequences

As introduced above, reporter gene-based assays have been used to test candidate sequences to identify sequence requirements of the ARE. In order to further characterize the ARE, the ARE and ARE-like sequences cloned into pGL3-promoter plasmids were transfected into hepatoma cell lines. The active ARE sequences led to a substantial increase in constitutive luciferase activity compared to the empty vector expression level when the liver cells were treated with control DMSO, albeit with a lower expression level for the GST-YA ARE. The reporter activity of the inactive ARE-like sequences was not significantly different from that of the pGL3-promoter empty vector. Under the induced condition, when the cells were treated with the chemoprotective response inducer tBHQ (100 µM), the reporter activity of the active AREs was further enhanced from the basal expression level, whereas the expression levels of the inactive ARE-like sequences were not significantly changed. Variability in expression was observed between different active ARE sequences and in different cell lines (Figure 3.3A and B). Induction ratios were obtained by dividing the luciferase activity of tBHQ-treated cells by the luciferase activity of DMSO-treated cells.
The active AREs mediated 2.0 to 2.8 fold induction in Hepa-1c1c7 cells and 1.5 to 2.8 fold induction in HepG2 cells in response to treatment with tBHQ (Figure 3.3C). For the ARE-like sequences, the observed treated-to-untreated ratios were not significantly different from that of the control pGL3-promoter empty vector. The empty vector control exhibits a slight elevation in response to inducer treatment (1.2 ± 0.2 fold), which is readily distinguished statistically from the induction mediated by active AREs. We have confirmed that the three active AREs and two inactive ARE-like sequences behave as expected in Hepa-1c1c7 hepatoma cells.

Figure 3.3 Effect of tBHQ on luciferase production in liver cell lines transfected with luciferase plasmids containing active AREs and inactive ARE-like sequences. The sequences are presented in Table 2.1. Liver cell lines were transfected with pGL3-promoter firefly luciferase constructs and phRL-TK renilla plasmids (internal control). After 24 h, cells were treated with 100 µM tBHQ or solvent (DMSO). Lysates were assayed for firefly and renilla luciferase activities. Values presented are averages from two independent experiments performed in triplicate ± S.E. (A) Reporter activity of DMSO control-treated Hepa-1c1c7 cells for the basal condition and tBHQ-treated cells for the induced condition. (B) Reporter activity of DMSO-treated HepG2 cells for the basal condition and tBHQ-treated cells for the induced condition. (C) Induction ratios for both HepG2 and Hepa-1c1c7 liver cells, obtained by dividing the reporter activity of tBHQ-treated cells by that of DMSO control-treated cells.

[Figure 3.3: bar charts. Panel A: Hepa-1c1c7; panel B: HepG2; panel C: induction ratios for Hepa-1c1c7 and HepG2. Constructs on the x-axis: pGL3-promoter, NQO1, GST-YA, GSTA3, NQO2-AL5, GCSH-AL1.]

3.4 Expression of Wild-Type ARE Sequences

All plasmids were transfected into Hepa-1c1c7 cells in duplicate in three independent experiments, for a total of six replicates for each construct. In order to identify differences between constructs and to differentiate between the effects of mutations and noise impacting inducibility, a linear analysis was applied (see Materials and Methods). To recall, the analysis method returns coefficient values, the $\mu$ and $\beta$ terms, indicating an impact on expression. The coefficients under the induced condition have been adjusted for the basal condition, and thus can be represented as an induction ratio on the log2 scale. The effect mediated by the empty vector is represented by $\mu$ for the basal condition and $\mu^{t}$ for the induced condition. The empty vector mediated an effect on expression, with high variability in coefficient values under the basal condition, from -1.84 to 1.85 (Table 3.4). This reflects the diversity of transfection results and limits our capacity to identify significant differences between the constitutive effects of different regulatory sequences. Under the induced condition, the coefficient values of the empty vector were in the narrower range of 0.72 to 1.20 on the log2 scale, which corresponds to a 1.65 to 2.30 fold change, as the effect of plasmid concentration is normalized. This reflects the greater capacity to compare induction levels between experiments. In our statistical model, we elected to normalize for the empty vector effect in order to obtain a more accurate estimate of the effect mediated by the insert.
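To make the coefficients concrete, the Python sketch below estimates the model terms from synthetic log2 expression values and converts log2 coefficients to fold changes. It is not the R regression used in this study: it relies on the observation that, when every construct-by-treatment cell has its own free term, the least-squares point estimates reduce to differences of cell means, and it does not compute standard errors or p-values. All names and data values are hypothetical.

```python
from statistics import mean


def fit_by_cell_means(log2_expr):
    """Estimate model terms from replicate log2(firefly/renilla) values.

    log2_expr maps (construct, condition) -> list of replicates, where
    construct is "vector", "wt" or a mutant name, and condition is
    "basal" or "induced".
    """
    m = {cell: mean(values) for cell, values in log2_expr.items()}
    mu = m[("vector", "basal")]                       # empty vector, basal
    mu_t = m[("vector", "induced")] - mu              # vector response to tBHQ
    beta_w = m[("wt", "basal")] - mu                  # wild-type ARE, basal
    wt_induction = m[("wt", "induced")] - m[("wt", "basal")]
    beta_w_t = wt_induction - mu_t                    # wild-type ARE, induced
    estimates = {"mu": mu, "mu_t": mu_t, "beta_w": beta_w, "beta_w_t": beta_w_t}
    mutants = {construct for construct, _ in m if construct not in ("vector", "wt")}
    for name in sorted(mutants):
        beta_i = m[(name, "basal")] - m[("wt", "basal")]
        beta_i_t = (m[(name, "induced")] - m[(name, "basal")]) - wt_induction
        estimates[name] = (beta_i, beta_i_t)
    return estimates


def fold_change(log2_coefficient):
    """Convert a log2-scale coefficient into a fold change."""
    return 2 ** log2_coefficient


# Synthetic replicates for illustration only; not data from the thesis.
data = {
    ("vector", "basal"): [-1.1, -0.9],
    ("vector", "induced"): [-0.1, 0.1],
    ("wt", "basal"): [-0.6, -0.4],
    ("wt", "induced"): [1.4, 1.6],
    ("mut1", "basal"): [-0.8, -0.6],
    ("mut1", "induced"): [0.3, 0.5],
}
fit = fit_by_cell_means(data)
print(fit)
print(round(fold_change(0.72), 2), round(fold_change(1.20), 2))
```

The last line reproduces the conversion quoted in the text: log2 coefficients of 0.72 and 1.20 correspond to fold changes of about 1.65 and 2.30.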
Table 3.4 Coefficients of empty vectors

| ARE variant pool | $\mu$ | $\mu^{t}$ |
|---|---|---|
| Empty Vector (GST-YA) | -1.07 ± 0.13 | 0.99 ± 0.04 |
| Empty Vector (GSTA3) | -1.84 ± 0.13 | 1.12 ± 0.18 |
| Empty Vector (NQO1) | 1.86 ± 0.56 | 1.20 ± 0.70 |
| Empty Vector (NQO2-AL5) | -0.77 ± 0.36 | 0.72 ± 0.50 |
| Empty Vector (GCSH-AL1) | -0.65 ± 0.36 | 1.20 ± 0.50 |

Under the basal condition, the coefficients for the 25-bp wild-type ARE sequences were variable, as in the case of the empty vectors, ranging from -0.24 to 2.22. Under the induced condition, the coefficients for the wild-type ARE sequences ranged from 0.30 to 0.99, which corresponds to a 1.25 to 2 fold change in expression beyond the empty vector background (Table 3.5). Consistent with expectations, the induction coefficients for the inactive ARE-like sequences were near 0, ranging from -0.27 to 0.11. The active wild-type ARE constructs have positive induced expression coefficients that are substantially larger than those of the nonfunctional ARE-like sequences. However, high variability between replicates was observed for the NQO1, NQO2-AL5, and GCSH-AL1 sets, which is reflected in large standard error values. Further analysis of the effects of specific ARE variants will focus on the GST-YA and GSTA3 ARE sequences.

Table 3.5 Coefficients of the wild-type AREs

| Sequence Name | $\beta'_w$ | $\beta'^{t}_w$ |
|---|---|---|
| GST-YA | 0.50 ± 0.21 | 0.99 ± 0.30 |
| GSTA3 | 2.23 ± 0.14 | 0.56 ± 0.20 |
| NQO1 | -0.24 ± 0.56 | 0.30 ± 0.79 |
| NQO2-AL5 | 0.07 ± 0.54 | 0.11 ± 0.58 |
| GCSH-AL1 | -0.04 ± 0.39 | -0.27 ± 0.55 |

3.5 Mutational Analysis of ARE Variants Under Basal Condition

Potential differences were detected between the constitutive expression levels of mutant and wild-type constructs. An important caveat for the study of the contribution of ARE mutations to constitutive expression is the potential for concentration-dependent effects of plasmid preparations.
While concentrations were carefully assessed prior to transfection, the subsequent observations should be viewed as preliminary indicators for further studies, a limitation associated with many high-throughput screens. Previous studies suggested that mutations in the 3 bp variable region within the GST-YA and NQO1 AREs had minimal effect on basal luciferase activity (30, 34). However, in the current study, some single-base and multi-base changes led to a significant change in basal expression compared to the wild-type ARE sequence. We elected to focus on constructs that exhibit at least 2-fold variation from wild-type levels (either up or down), based on the assumption that plasmid concentration effects would be less likely to cause such dramatic changes. At this threshold, the constitutive expression of 41% (92/222) of the ARE variants was judged to be different from wild-type (Appendix A). To analyze the single-base mutations, the 25-bp ARE and ARE-like sequences were divided into 7 blocks (Figure 3.4): the two invariant cores RTKAY and GC, the nnn region between the cores (positions 14-16), the three immediate flanking nucleotides on each side (positions 6-8 and 19-21), and the far flanking nucleotides (positions 1-5 and 22-25). For the variants of the active AREs, it is noteworthy that all four mutations in region 4, which corresponds to the nnn region between the two invariant half sites, led to significant changes in basal expression from the wild-type sequence, whereas the mutations in the other regions were associated with constitutive changes in approximately half of the cases (Figure 3.4A). This was not observed for the inactive ARE-like sequences, as the mutations in the nnn region were not biased toward altered constitutive expression (Figure 3.4B). A test of combinatorics was applied to model the probability of a single-base mutation mediating a change or no change in the level of expression from the wild-type sequence.
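The coin toss formulation of Section 2.2.3 is short enough to evaluate directly. The Python sketch below is a minimal illustration with an invented function name; it assumes, as the formulation does, that each single-base mutation independently has a 50% chance of being associated with an expression change.

```python
from math import comb


def coin_toss_probability(x, n):
    """Probability of exactly x 'changed' outcomes among n fair coin tosses."""
    return comb(n, x) / 2**n


# Four of four mutations in the nnn block associated with a change:
print(coin_toss_probability(4, 4))
```

For four changes out of four tested mutations this gives 1/16 = 0.0625. Note that the formula gives the probability of exactly x outcomes, not the cumulative tail used in a conventional binomial test; the two coincide only when x equals n (or, for the lower tail, when x is 0).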
Our test indicates that the probability of all four mutations in the block 4 nnn region of active ARE variants conferring a change in expression by chance is only 0.06 (Table 3.6). Consistent with some early reports, the nnn region of the ARE core appears to have an important role in determining the magnitude of constitutive expression mediated via the ARE (38).

Table 3.6 The combinatorics probability test for the effect of single-base mutations under basal condition

ARE Block 1 2 4  Active AREs 0.18 0.25  Inactive AREs 0.16 0.38  0.06 0.31  0.38  6  7  0.38  0.50 0.31

Figure 3.4 Categorization of single-base ARE variants under basal condition. Sites of single nucleotide mutations in the AREs are indicated by a letter that represents each wild-type ARE within each of 7 regions. The upper half indicates changes associated with altered basal expression while the lower half indicates changes associated with no change from wild-type. (A) Active ARE single-base variants. (B) Inactive ARE single-base variants.

[Figure 3.4: single-base mutation sites plotted along positions 1 to 25 of the ARE (RTGAY core at positions 9 to 13, GC core at positions 17 to 18), grouped into blocks 1 to 7 and split into "Change" and "No Change" halves. Panel A, active AREs: v = GST-YA, w = GSTA3, x = NQO1. Panel B, inactive AREs: y = NQO2, z = GCSH.]

Previously, it has been shown that AP-1 contributes to the constitutive expression of the ARE. The consensus sequence of the AP-1 binding site is TGA(C/G)TCA, which has the same nucleotide requirement as the ARE for the first four positions but requires the trinucleotide TCA in the nnn region. ConSite, a web-based tool for finding cis-regulatory elements, was used to score each variant sequence and the respective wild-type sequence with a position frequency matrix of AP-1 from the study of Wasserman et al. [Table 3.7, (61, 62)].
Mutations that make the site more closely resemble an AP-1 consensus sequence had an enhancing effect on expression, whereas mutations that make it less similar to the AP-1 consensus had deleterious effects on expression.

Table 3.7 AP-1 matrix scores of the ARE variant sequences vs respective wild-type sequences. The variants contain mutation(s) in the nnn region. Core regions are indicated by bolded text. AP-1 consensus: TGA(G/C)TCA.

| Gene | Sequence |  | p-value | Variant score | WT score | Change |
|---|---|---|---|---|---|---|
| NQO1 | GAGTCACAGTGAGTgGGCAAAATTT | -1.07 | 0.05 | 4.33 | 8.06 | -3.73 |
| GST-YA | TGCTAATGGTGACAAcGCAACTTTC | -1.18 | <0.01 | 2.84 | 7.02 | -4.18 |
| GST-YA | TGCTAATGGTGACAcAGCAACTTTC | 1.18 | <0.01 | 8.24 | 7.02 | 1.22 |
| GSTA3 | ACTCAGGCATGACATaGCATTTTTC | 0.76 | <0.01 | 3.50 | -0.67 | 4.18 |
| NQO2 | CTGCCTGGATGACtGAGCGAGACCC | 1.71 | <0.01 | 6.64 | 3.50 | 3.14 |
| NQO2 | CTGCCTGGATGACcGAGCGAGACCC | -2.72 | <0.01 | 0.45 | 3.50 | -3.05 |
| NQO2 | CTGCCTGGATGACAGgGCGAGACCC | -0.16 | 0.66 | -0.67 | 3.50 | -4.17 |
| NQO2 | CTGCCTGGATGACgGAGCGAGACCC | 1.22 | 0.02 | 1.41 | 3.50 | -2.09 |
| NQO2 | CTGCCTGGATGACgGcGCGAGACCC | 1.81 | <0.01 | -1.64 | 3.50 | -1.86 |
| NQO2 | CTGCCTGGATGACgGtGCGAGACCC | 1.71 | <0.01 | -1.64 | 3.50 | -1.86 |
| GCSH | AATATGTGTTGACtGAGCAATGACC | 0.98 | 0.06 | 4.12 | 0.45 | 3.67 |

3.6 Mutational Analysis of ARE Core Variants Under Induced Condition

As there was high variability between replicates for the NQO1 set, and the ARE-like sequences were never observed to be inducible, analysis of the specific induction-related effects of individual ARE variants will focus on the GST-YA and GSTA3 ARE sequences. Although spiked oligonucleotide synthesis was not designed for the invariant core regions, a few mutations were observed at these positions. Five GST-YA ARE variants had mutations in the core regions along with other mutations in the variable positions. These sequences mediated the lowest $\beta'^{t}_i$ values, ranging from -0.72 to -1.02, which correspond to a 1.6 to 2 fold decrease from wild-type ARE expression, with p-value < 0.05 (Table 3.8).
Concordant with the ARE consensus and previous observations, mutations in the G nucleotide of the GC core repressed tBHQ induction by about two fold compared to the wild-type GST-YA ARE (Table 3.8, GST-YA 1 and 4). A substitution of T for G in the R position of the RTGAC core also led to a decrease in expression from the wild-type sequence (Table 3.8, GST-YA 2). A mutation from C to T in the GC core also reduced the inducibility of the GST-YA ARE (Table 3.8, GST-YA 3 and 5). However, the same mutation in GSTA3 did not mediate the same effect, as the inducible expression mediated by the sequence was not significantly different from that of the wild-type sequence (Table 3.8, GSTA3 1). The difference observed could be attributed to a sequence-specific effect or to other mutations present in the constructs. A scan of the sequences did not reveal other potential transcription factor binding sites, nor was a C→T mutation in the GC core documented in the reviewed literature. Confirmatory experiments will need to be performed to validate this novel observation that a T may be tolerated instead of C in the GC core. A mutation from C to T in the ATGAC portion of the core did not cause a significant difference in expression from the wild-type GSTA3 sequence (Table 3.8, GSTA3 2). The functional AREs in the human NQO2, rat AKR7A3 and rat GSTP1 genes also have a T at this position (17, 36, 39).

Table 3.8 ARE variants with mutations at the core ARE regions. Bolded text represents invariant core positions and lower case indicates a substitution.
| Gene | Sequence (RTKAYnnnGC) | Impact of Mutation(s) (log2) | P-value |
|---|---|---|---|
| GST-YA 1 | TGCTAATGGTGACAA4PAAggTTC | -1.02 ± 0.35 | 0.004 |
| GST-YA 2 | cGgTAATCWGACAAAGAAtCTTTC | -0.99 ± 0.35 | 0.004 |
| GST-YA 3 | TGCTAATGGTGACAAAAtTTTC | -0.88 ± 0.35 | 0.012 |
| GST-YA 4 | gaCTAATGGTGACgAOCAACTTca | -0.82 ± 0.35 | 0.018 |
| GST-YA 5 | TtCTAATGGTGACAAACiktCTTTC | -0.72 ± 0.35 | 0.041 |
| GSTA3 1 | AgTtAGGCATGACATTODITTTTTC | -0.06 ± 0.27 | 0.821 |
| GSTA3 2 | ACTCAGGCATGeiTTGCATTTTTC | -0.07 ± 0.27 | 0.806 |

3.7 Mutational Analysis of ARE Variants

We recovered single-base mutations at 9 of the 18 variable positions for the GST-YA ARE and at 11 of 18 for the GSTA3 ARE. None of the individual mutations produced statistically significant alterations to induction by tBHQ (i.e. the induction coefficient was not significantly different from that of the wild-type sequence). Out of all ARE variants, there was one GST-YA ARE variant, with mutations at positions 7, 16, 22, and 23, which led to a significant decrease in inducibility from the wild-type sequence (Table 3.9, GST-YA 6). However, mutations at these same positions individually did not mediate significant alterations from the wild-type ARE (Table 3.9, GST-YA 7-9). Mutations in position 16 of GSTA3 sequences further demonstrate the redundancy of this position under the induced condition (Table 3.9, GSTA3 3-4). The deleterious effect on expression mediated by the combined mutations at positions 7, 22 and 23 may suggest cooperativity between the 5' and 3' flanking regions.

Table 3.9 ARE variants. Bolded text represents the invariant core positions and lower case indicates a substitution. A full list of all variants is presented in Appendix B.
Gene       Sequence (RTKAYnnnGCR), positions 1-25   Impact of Mutation(s) (log2)   P-value
                                                    (± 0.35)
GST-YA 6   TGCTAI^GTGAC A O CAA g TC                -0.95                          0.01
GST-YA 7   TGC11AKT-GZIFGAC A S^ AAC^T-Cr -'        -0.61                          0.08
GST-YA 8   TGCT GTGACAAAGCAACTaTC                   -0.51                          0.14
GST-YA 9   TGgTAA GGTGACAAAGCAACOTC                 0.47                           0.18
GSTA3 3    ACTCAGGCATGACA^ATTTTTC                   0.08                           0.77
GSTA3 4    tCTCAGGCATGACA^CAaTTTTC                  0.05                           0.86

3.8 Variants of Inactive ARE-like Sequences

The 64 NQO2 ARE variants exhibited induction coefficients ranging from -1.37 to 0.56, and the 34 GCSH ARE variants had induction coefficients from -0.33 to 0.20. According to our threshold cutoff of 0.7, 17 NQO2 sequences showed significantly decreased expression relative to the wild-type sequence; none of the mutations converted the inactive ARE-like sequence to an active form. Unlike in the active ARE variants, core mutations in the variants of inactive ARE-like sequences did not cause a significant change in expression relative to the wild-type sequence (Table 3.10).

Table 3.10 Variant sequences of inactive ARE-like sequences with mutations in the core regions

Gene     Sequence (RTKAYnnnGCR), positions 1-25   Impact of Mutation(s) (log2)   P-value
                                                  (± 0.74)
NQO2 1   CTGCCTGGATGACtGcGCGAGgCCC                0.24                           0.76
NQO2 2   CcGCCTGGATGAtAGAGCGgGgCCC                0.17                           0.83
GCSH 1   AcTATGTGTTGACAGAaaAATGACC                0.05                           0.95
GCSH 2   tATATGTGTTGACAGAGtAATGACC                0.04                           0.95
GCSH 3   AATAgtgGgTGAacGAGCAtTGACg                0.18                           0.81

Chapter 4: New Inducible ARE Model

4.1 A New Predictive Model for Inducible Active ARE Sequences

An ARE model was recently published by Wang et al., who documented 57 AREs and candidate AREs from the literature (33). However, the published ARE model appears to be biased by the inclusion of orthologous ARE sequences and of ARE-like flanking sequences adjacent to bona fide AREs. Using more conservative criteria, we produced an ARE model defined by 18 experimentally verified binding sites based on a review of the scientific literature.
Known active ARE sequences were aligned and a position frequency matrix was generated from the number of occurrences of each nucleotide at each position. Graphical representations of the ARE sequence alignments were constructed using WebLogo (63). Each logo consists of a stack of nucleotides at each position of the sequence. The overall height of the stack indicates the information content (i.e. the strength of the pattern at that position). If all the nucleotides at a position are the same, for example A, the information content is 2 bits. If the numbers of As, Cs, Gs, and Ts are equal, the information content is 0. The height of each nucleotide within the stack represents its relative frequency at that position. The two known ARE profiles reflect the consensus RTKAYnnnGCR, except that the nnn region is biased toward the trinucleotide TCA (Figure 4.1).

Figure 4.1 Logo representation of the known ARE models. (A) Current collection of 18 experimentally verified ARE sequences. (B) Wang's published ARE model. The information content and the relative nucleotide frequency at each position are measured in bits (vertical axis of the logo representation) to represent the overall nucleotide conservation at a position.

[Figure 4.1 logos. Panel A: 18 experimentally verified AREs (current). Panel B: Wang's model. Vertical axis: bits, 0 to 2.]

From the panels of ARE mutations, we tested 124 ARE variants with three different active ARE backbones, 118 of which are functional given the expression threshold set by observed mutations in the ARE core region. Given this more diverse collection of functional ARE sites, a new predictive model was built for comparison with the current ARE model. Because the ARE variants are designed to preferentially incorporate the wild-type nucleotide in most constructs, we compiled frequency matrices based on the discrete set of observed nucleotides at each position.
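As a rough illustration of the two ideas in this passage, the sketch below computes the information content of each aligned position (2 bits minus the Shannon entropy, which gives 2 for a fully conserved column and 0 for an even mix) and builds a presence/absence matrix that records only which nucleotides were observed at a position. The function names and the toy sequences are hypothetical and are not taken from the thesis; the calculation also omits any small-sample correction that a logo tool may apply.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_counts(sequences):
    """Count each base at each aligned position (a position frequency matrix)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # Lower-case letters (used here to flag substitutions) are counted as bases.
    return [Counter(s[i].upper() for s in sequences) for i in range(length)]


def information_content(counts):
    """Information content in bits: 2 minus the Shannon entropy of one column."""
    total = sum(counts[b] for b in BASES)
    entropy = 0.0
    for b in BASES:
        p = counts[b] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def presence_matrix(sequences):
    """Record 1 for every base seen at a position, however often it was seen."""
    return [
        {b: (1 if col[b] > 0 else 0) for b in BASES}
        for col in column_counts(sequences)
    ]


if __name__ == "__main__":
    # Hypothetical toy alignment, not data from the study.
    toy = ["TGACA", "TGACt", "TGAgC", "TGATG"]
    for position, col in enumerate(column_counts(toy), start=1):
        print(position, dict(col), round(information_content(col), 3))
    print(presence_matrix(toy))
```

Because the presence/absence matrix discards how often a base was seen, a wild-type nucleotide present in most constructs carries no more weight than a substitution recovered once, which is the point of counting the discrete set of observed nucleotides.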
Under this scheme, equal weights were assigned to all nucleotides observed in functional elements, no matter how often each nucleotide was observed among the variants of a particular wild-type ARE sequence. Because no single ARE variant set covered all possible mutations, the profiles were biased toward the wild-type nucleotide. This was especially evident in the NQO1 set, which had only 29 variants compared to 37 in the GSTA3 set and 52 in the GST-YA set (Figure 4.2). To address the lack of clone coverage, the mutant matrices were joined in combinations of two and of all three sets (Figure 4.3). Pooling the variants in pairs of sets resulted in 98 sequences for GST-YA and GSTA3, 85 for GST-YA and NQO1, and 67 for GSTA3 and NQO1; we observed that the information content at variable positions decreases as the number of sequences in a set increases. Comparing the GST-YA set, which has 58 variant sequences, with the smallest joint set, consisting of GSTA3 and NQO1 ARE variants, we see that even though the numbers of variants are quite close, two different ARE backbones yield a better model than a pool of variants from one wild-type ARE sequence.

Figure 4.2 Logo representation of ARE models built from mutation counts for each set of ARE variants.

[Figure 4.2 logos. Panels: gstya, nqo1, gsta3. Vertical axis: bits, 0 to 2.]

Figure 4.3 Logo representations of combined ARE mutant profiles. (A) Combined ARE profile built from addition of two different ARE mutant profiles. (B) Combined ARE profile built from all three ARE mutant profiles.

[Figure 4.3 logos. Panel A: gstya+nqo1, gstya+gsta3, nqo1+gsta3. Panel B: gstya+gsta3+nqo1. Vertical axis: bits, 0 to 2.]

With all three pools of ARE variants combined, we observed a pattern that is similar to the current model of the ARE. We used STAMP, a web tool for exploring DNA-binding motif similarities, to compare the different ARE models. The new model of the inducible ARE has a similarity p-value score of 1.208e-07 to the current ARE model, and 5.211e-05 to the Wang model (64). The new inducible ARE model built from the combined mutant profiles resembles known models but shows lower information content at positions 14-16. This is an important observation, as it affirms previous suggestions that these positions contribute to constitutive ARE-mediated gene expression but not to inducible expression (65).

4.2 Comparing Performances of ARE Models in TFBS Detection

The published Wang matrix, our more conservative "current" matrix and the new inducible matrix compiled from the three panels of ARE mutations were assessed for their capacity to predict known AREs and for their generation of false positive predictions on a control set of sequences. The positive set consists of sequences that span 75 bp upstream and downstream of the core ARE nucleotides. The negative set contains 1000 randomly generated oligos with the same base composition as the positive set. These sequences were searched with the ARE profiles at scoring thresholds from 60% to 90% (where the percentage refers to the percent identity to the PWM; i.e. 0% being equivalent to the smallest possible assigned score and 100% the maximal possible). To evaluate the performance of the models, we generated a receiver operating characteristic-like curve, which is a representation of the trade-offs between sensitivity and specificity.
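The threshold sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring code used in the study: the log-odds conversion with a pseudocount of 1 and a uniform background, the single-strand scan, and all function names are assumptions introduced here. Only the idea of a score expressed as a percentage between the lowest and highest attainable PWM score, and the 60% to 90% threshold range, come from the text.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts to log2 odds against a uniform background.

    The pseudocount and background values are assumptions for this sketch.
    """
    pwm = []
    for col in counts:
        total = sum(col.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((col.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def relative_score(pwm, site):
    """Score a site as a percentage between the minimum (0%) and maximum (100%) PWM score."""
    raw = sum(col[base] for col, base in zip(pwm, site.upper()))
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    return 100.0 * (raw - lowest) / (highest - lowest)


def best_hit(pwm, sequence):
    """Highest relative score over all windows on the given strand only."""
    width = len(pwm)
    return max(
        relative_score(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )


def roc_points(pwm, positives, negatives, thresholds=range(60, 91, 5)):
    """Return (threshold, 1 - specificity, sensitivity) for each threshold."""
    points = []
    for t in thresholds:
        true_pos = sum(best_hit(pwm, s) >= t for s in positives)
        false_pos = sum(best_hit(pwm, s) >= t for s in negatives)
        points.append((t, false_pos / len(negatives), true_pos / len(positives)))
    return points


if __name__ == "__main__":
    # Hypothetical toy motif and sequences, not data from the study.
    toy_counts = [{"T": 4}, {"G": 4}, {"A": 4}, {"C": 4}]
    toy_pwm = log_odds_matrix(toy_counts)
    for point in roc_points(toy_pwm, ["AATGACAA"], ["AAAAAAAA", "CCTGACCC"]):
        print(point)
```

A fuller scan would also score the reverse complement of each window, and the area under the resulting curve could be estimated from these points with the trapezoidal rule.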
A ROC curve of a good classifier model will be as close as possible to the upper-left corner of the chart, indicating a high number of true positives and a small number of false positives; a perfect result would therefore produce a curve under which the area is 1. We plotted a ROC-like curve using three different approaches: the current ARE model, the leave-one-out method of the current ARE model, and the new inducible ARE model (Figure 4.4). The areas under the curves were determined to be 0.97 for both the current ARE model and the Wang model, which demonstrates that both have strong discriminatory power, as the values are far from the random classification rate of 0.5. However, these values are artificially high due to the circular nature of the test: the sequences in the reference collection were used to construct the profiles. To see the effect of over-training on the positive set, a leave-one-out cross validation was performed on the current ARE model, in which one sequence was removed from the positive set at a time, producing 18 ARE profiles. Each withheld sequence was then scored by the corresponding matrix. The leave-one-out method is a fairer representation of the ARE model, with the area under the curve determined to be 0.86.

Figure 4.4 Comparison of the performance of known ARE models with the leave-one-out method in ARE prediction. The current ARE model, which consists of 18 bona fide ARE sequences, performs comparably to Wang's published model of the ARE. A leave-one-out procedure was performed on the current model.

[Figure 4.4 plot. Horizontal axis: 1 - Specificity, 0.0 to 1.0.]

The new inducible model of the ARE is the combined mutant profile from all three sets of AREs, each contributing equal weight, and its performance is comparable to that of the leave-one-out method, with the area under the curve determined to be 0.86.

Figure 4.5 Comparison of the new ARE model and the leave-one-out method matrix for prediction of inducible AREs.
The new inducible matrix model performed comparably to the leave-one-out method generated from 18 functional elements obtained from the literature.

[Figure 4.5 plot. Curves: NEW, Leave-one-out. Horizontal axis: 1 - Specificity, 0.0 to 1.0.]

The final inducible ARE matrix was applied to the Wang collection of putative AREs reported in the scientific literature in order to prioritize candidates for potential future analysis. As the Wang collection overlaps the reference collection of bona fide AREs compiled in this study, such sequences are indicated in the table.

Table 4.1 Rank list of Wang's set of ARE sequences using the new inducible model of the ARE. The score is the percent identity of the sequence to the PWM. The AREs in our own collection are marked by X.

Species  Gene      Sequence               Score
Rat      NQO1      TCACAGTGACTTGGCAAAATC  96.2
Mouse    Hmox1     AGAGGGTGACTCAGCAAAATC  95.9
Human    SPTA1     ACTGGGTGACTCAGCAGTTTT  95.5
Mouse    Gsta1     TAATGGTGACAAAGCAACTTT  95.1
Rat      Gsta2     TAATGGTGACAAAGCAACTTT  95.1
Human    NQO1      TCACAGTGACTCAGCAGAATC  93.4
Mouse    Ftl1      TCAGCGTGACTCAGCAGAACT  93
Mouse    Pparg     TCATTGTGACATAGCACTTAT  92.7
Mouse    Gclc      TCACCGTGACTCAGCACTCTG  92.2
Mouse    Fth1      CCACCGTGACTCAGCATTCTG  91.9
Human    AKR1C1    TCAGGGTGACTCAGCAGCTTG  91.8
Human    GCLC      TCCCCGTGACTCAGCGCTTTG  90.1
Human    GPX2      CCAGGATGACTTAGCAAAAAC  90
Human    TXNRD1    TCAGAATGACAAAGCAGAAAT  89.9
Mouse    Abcc2     CTGGGATGACATAGCATTCAT  89.7
Human    FTH1      CCTCCATGACAAAGCACTTTT  89.6
Mouse    Fth1      CCTCCATGACAAAGCACTTTT  89.6
Human    GSTP1     GCGCCGTGACTCAGCACTGGG  89.5
Human    FTL       TCAGCATGACTCAGCAGTCGC  88.6
Human    ABCC1     TCTGTGTGACTCAGCTTTGGA  88.2
Mouse    Hmox1     GAACCATGACTCAGCGAAAAC  88.2
Mouse    Hmox1     GGACCGTGACTCAGCGTCACA  88.1
Human    FTH1      CCACCGTGACTCAGCACTCCG  87.6
Human    HBB/HBE1  TCATCATGACTCAGCATTGCT  87.3
Human    HMBS      CTCCAGTGACTCAGCACAGGT  87.2
Human    GCLM      AGACAATGACTAAGCAGAAAT  86.8
Rat      Adra2b    GAGCGATGACTCAGCAGTTTA  86.6
Human    SAT       CCGCTATGACTAAGCGCTAGT  86.3
Mouse    Hmox1     CAACCATGACACAGCATAAAA  85.5
Mouse    Gclm      AGACAATGACTAAGCAGAAAC  85.4
Mouse    Mt1       GGCGCGTGACTATGCGTGGGC  85.4
Human    ETS1      AGCGGGTGACCAAGCCCTCAA  85
Mouse    Gsta1     TGGAAATGACATTGCTAATGG  84.9
Human    GCLM      TAACGGTTACGAAGCACTTTC  84.8
Human    TXN       TCACCGTTACTCAGCACTTTG  84.2
Mouse    Akr1b3    GGAGCATGACCCAGCAGAAGG  83.9
Human    TBXAS1    AAGGAATGAATCAGCAACTTT  83.7
Human    UGT1A6    TCTGTCTGACTTGGCAAAAAT  83.7
Rat      Ggt1      CCACAATGACACAGCAAGAAA  83.6
Rat      Tbxas1    AGGCAATGAATCAGCAACTTT  82.2
Human    UGT1A6    GAAAGCTGACACGGCCATAGT  82
Mouse    Mafg      TCATGCTGACTCAGCGGATCG  79.9
Mouse    Hmox1     GGACCTTGACTCAGCAGAAAA  79.8
Mouse    Gstp1     ACGTGTTGAGTCAGCATCCGG  79.6
Mouse    Slc7a11   CCAACATTACTCAGCTCTTTT  79.3
Human    S100A6    GACACGTGACTCGGCAAGGGG  79.1
Rat      Akr7a3    ATGCCCTGAGTGAGCGAGTGA  77.8
Rat      Gsta5     CACGGCTGACAGAGCGATGGA  76.1
Rat      Akr7a3    TGGAAATGATTCAGCAGTTTA  75.8
Rat      Gstp1     TCACTATGATTCAGCAACAAA  75.5
Human    GNAI2     AGCCTGTGACTGGGCCGGGGC  75.3
Mouse    Nfe2l2    CCCACCTGACTCCGCCATGCC  71.5
Human    TXNRD1    TCATTCTGACTCTGGCAGTTA  69.9

Current set: 12 of the first 45 entries and 2 of the last 8 entries are marked X; the rows to which the marks belong are not preserved in this text.

4.3 Identifying Putative ARE Sites in the Human Genome

A screen of the evolutionarily conserved, orthologous human and mouse promoter regions [+/- 10 kbp from a putative TSS as used in the oPOSSUM system, (60)] was performed with the inducible ARE model. A total of 144937 candidate regulatory elements were detected in the screen. According to the futility theorem, which states that TFBS prediction is confounded by a high rate of false positive predictions, only a small portion of these sites would be functional (57). In the future it may be possible to merge these predictions with the results of chromatin immunoprecipitation and expression profiling data to identify more functional AREs in the genome.

Chapter 5: Discussion

5.1 The "Spiked" Library Screening Method

Understanding the compositional nucleotide requirements for the binding sites of a transcription factor is essential for the construction of a model for the computational prediction of target sites in a genome sequence.
The "spiked" library screen introduced here represents one of many possible approaches to the characterization of the function of a DNA sequence. Both spiked oligonucleotide synthesis and reporter gene assays are well-established procedures, but they have not been used jointly to study transcription factor binding sites. The pipeline was used to characterize the target sequences of the Nfe2l2 transcription factor, regulatory sites commonly referred to as antioxidant response elements. Using a spiked oligonucleotide synthesis procedure to generate pools of DNA with subtle variations from an original sequence, we produced three libraries of reporter gene plasmids carrying mutations of known AREs. The novel TFBS library construction approach performed close to the theoretical projections, returning 222 ARE-related sequences with both single and multiple base mutations. The variant ARE constructs were transfected into cells in culture and tested for both constitutive and inducible activity of the encoded luciferase reporter. The data were used to construct ARE models, with the resultant model from the laboratory data performing as well as models based on 18 bona fide AREs discovered in the last 15 years.

5.1.1 Spiked Oligonucleotide Synthesis

For our experiments, there are a number of advantages to incorporating base changes via synthesis of degenerate oligonucleotides, with subsequent creation of dsDNA via primer extension (as compared to synthesis of complementary oligonucleotides with targeted base changes). Mutagenesis with degenerate oligonucleotides containing randomly incorporated substitutions requires fewer syntheses, saving both reagent costs and, more importantly, the labor required for the generation and management of individual clones. However, there are challenges associated with this stochastic (random) sampling procedure. First, the specific mutations that will be recovered cannot be predicted.
Second, there are variations in the mixture of degenerate bases produced by the machine, so the sampling of changes at each variable position is not guaranteed to be even. Third, oligonucleotide synthesis with the defined 5% substitution for each base requires hand-mixing of the bases and attention to detail, work that is performed outside the control of the investigator. Since the pipeline was not fully automated, human error could have contributed to the observed variability in the spiking process. The observation of mutations at some invariant positions, together with the dip in substitution rates near invariant positions, suggests that the tubes in the oligonucleotide synthesis machine may not have been completely cleaned prior to the addition of the next nucleotide in the progression of synthesis.

The design of the spiked oligonucleotide is a critical step in the process. We elected to keep the 9 base pairs of the ARE core invariant, based on extensive past studies that demonstrate a deleterious effect on expression when these positions are mutated. In light of the results, it might have been beneficial to include changes at a few of these positions, both as controls for the experiment and to explore the range of nucleotides allowed at positions of known heterogeneity. To apply the approach to less well-characterized TFBS, we recommend that the investigator spike all the TFBS positions in the DNA oligonucleotides. If the sequence is particularly long (as was the case for the ARE), it might be suitable to plan for two consecutive experiments in which some positions are held invariant in the oligonucleotide synthesized for the first round. The oligonucleotide for the second experiment could then be designed in light of the results obtained in the first experiment. Specific future experiments for the study of the ARE are discussed below.
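The unpredictability of the stochastic sampling discussed in this section can be made concrete with a simple binomial calculation. The sketch below is an illustration only: it assumes that each variable position is substituted independently and with the same probability, which the text itself notes is not guaranteed in practice. The figure of 18 variable positions comes from the results; whether the 5% refers to each alternative base (15% in total per position) or to the position as a whole is treated here as an open assumption, so both rates are shown. The function names are hypothetical.

```python
from math import comb


def mutation_count_distribution(n_positions, sub_rate):
    """Return P(k substitutions per oligo) for k = 0..n_positions.

    Assumes each variable position is substituted independently with the
    same probability sub_rate (a binomial model).
    """
    if not 0.0 <= sub_rate <= 1.0:
        raise ValueError("sub_rate must be between 0 and 1")
    return [
        comb(n_positions, k) * sub_rate ** k * (1.0 - sub_rate) ** (n_positions - k)
        for k in range(n_positions + 1)
    ]


def expected_mutations(n_positions, sub_rate):
    """Mean number of substitutions per oligo under the same binomial model."""
    return n_positions * sub_rate


if __name__ == "__main__":
    variable_positions = 18  # number of variable positions reported in the text
    for rate in (0.05, 0.15):  # two readings of the 5% spiking level (assumption)
        dist = mutation_count_distribution(variable_positions, rate)
        print(f"rate={rate}: mean={expected_mutations(variable_positions, rate):.2f}")
        for k in range(5):
            print(f"  P({k} substitutions) = {dist[k]:.3f}")
```

Comparing such a projection with the observed number of mutations per clone is one way to check how closely a library tracks its theoretical design.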
5.1.2 Higher Throughput Reporter Gene Assay

A number of technologies have been developed for the characterization of TFBS. SELEX and SELEX-SAGE, experimental procedures that extract the strongest binders for a given DNA-binding protein from an initially random pool of oligonucleotides, can give an accurate picture of binding specificity on a genomic level, but they are limited to well-described proteins that can be studied in a homogeneous form in a test tube. For well-studied TFs, binding profiles from SELEX compare favorably to profiles from sets of naturally occurring binding sites (66). Higher-throughput array-based methods for determining transcription factor-binding specificity have also been developed (67, 68). In such methods, specificity is determined by the quantitative study of binding patterns of labeled proteins on double-stranded DNA targets affixed to a surface. However, these in vitro TF:DNA binding methods are often performed on the assumption that experimental conditions are similar to natural ones, and this assumption has been shown not to hold fully (69). In vivo high-throughput ChIP studies with detection by array hybridization or multiplex sequencing can reveal the genomic locations at which a protein is situated, but the subsequent pattern discovery procedure to extract a binding profile is difficult. The challenges likely reflect a combination of the relatively low resolution afforded by ChIP techniques and the fact that ChIP can recover regions at which the target TF is present due to protein-protein interactions rather than protein-DNA interactions. These techniques do not allow high-resolution identification of specific nucleotide patterns active within a living cell. With the procedure introduced here, it is now feasible to screen panels of mutations for the many transcription factor binding sites associated with inducible expression in cell culture.
The challenge of controlling for differences between plasmids makes it difficult to draw conclusive observations about constitutive expression with this method. However, there are numerous biomedically important, inducible stress-related TFBS that could be studied, such as AP-1 binding sites, aryl hydrocarbon receptor binding sites, NF-κB binding sites and hypoxia-inducible factor elements.

The method can be used by most laboratories equipped for the study of reporter genes. The pipeline does not call for complex instrumentation: it is possible to conduct the experiment with the standard equipment required for 96-well luciferase reporter gene assays. It would be advantageous to have robotic plasmid preparation instruments, but it is possible to work manually with standard multi-channel pipets.

The choice of reporter gene plasmid matters. In the future, endogenous TFBS on the vector backbone should be assessed first and steps should be taken to eliminate such sites, as the background responsiveness of the plasmid complicates the analysis. One could mutate the binding site in the vector sequence or opt for the new pGL4 vectors offered by Promega, which carry a redesigned backbone with fewer cryptic regulatory sequences. We have identified a cryptic ARE in both the pGL4 and pGL3-Promoter vectors. A comparison by Wild et al. of pGL3-Basic and pGL4-Basic luciferase constructs containing individual γ-glutamylcysteine synthetase catalytic subunit gene regulatory sequences demonstrated that the expression values were not significantly different (18).

5.2 Observations Related to Constitutive Expression via AREs

The basal transcription mediated via the ARE has specific sequence requirements. One should always bear in mind a caveat for the study of constitutive expression in a reporter gene assay: differences between preparations of the same plasmid can be greatly influenced by concentration and other properties of the plasmid preparations.
In our study, all single-base mutations in the nnn region affected reporter gene expression under basal conditions for active ARE variants. This pattern of changes suggests a role for this region in constitutive expression. In some AREs, this nnn segment is known to be part of a TPA-response element (TRE), which has the consensus sequence 5'-TGA(C/G)TCA-3' for Activator protein-1 [AP-1, (70)]. AP-1 is a nuclear transcription factor, composed of Fos and Jun protein subunits, that plays roles in a variety of cellular processes, such as signal transduction and differentiation. The region may also function as part of a Maf recognition element (MARE). The MARE has a palindromic consensus sequence, TGCTGA(G/C)TCAGCA or TGCTGA(GC/CG)TCAGCA. These two sequences contain either an embedded TRE or a cAMP-responsive element (CRE). Yamamoto et al. examined how MafG homodimers (a transcriptional repressor) and MafG:Nfe2l2 heterodimers (a transcriptional activator) interact with MARE variants using SPR-microarray imaging and EMSA. Their results indicate that base changes in the MARE sequence are important determinants of differential binding between MafG homodimers and MafG:Nfe2l2 heterodimers. The binding preference of the MafG homodimers was found to depend primarily on the core +1 to +3 positions, which correspond to the nnn segment in the ARE.

Since single-base changes in the nnn region appeared to play a role in ARE function under basal conditions, we examined the combinatorial effect in constructs carrying multiple mutations in addition to those within the nnn region. With multi-base mutations in the ARE sequence, the effect of substitutions varied considerably, depending on the location or the type of substitution. Additive effects were observed for combinations of some positions but not for other combinations.
We attempted to train a decision tree classifier for the prediction of ARE mutants likely to exhibit altered constitutive expression. While there was qualitative emphasis on the changes within the nnn region (Appendix C), there were inadequate data to demonstrate statistical significance. To further characterize the base requirement for the nnn region under basal conditions, targeted mutational analysis could be performed in the future to explore all possible combinations. It may be interesting to explore how mutations with a strong effect on basal expression relate to the creation or elimination of consensus target sequences for well-studied transcription factors. In Figure 3.4, we observed a bias towards functional changes in active AREs at the nnn core region. In light of the capacity of inactive AREs to mediate basal expression, it would be interesting in future studies to explore more deeply the changes at these nucleotides in both ARE and ARE-like sequences.

5.3 Observations on Induction Mediated By the Antioxidant Response Element

In this study, we confirmed that mutations in the core ARE regions are deleterious for the function of the ARE under induced conditions, an observation which in turn was used as the threshold for determining a statistically significant change in expression from the wild-type. We also confirmed previous observations that single base mutations at various positions outside the ARE core region have minimal effect on induced expression. We identified 118 active ARE sequences using synthetically generated spiked oligos and built a new ARE model based on profiling the mutations at the variable positions. We elected to study the AREs based on our previous experience with their characterization. The new model resembles current models and further demonstrates the sequence flexibility of the nnn spacer region for inducible expression.
Pooling the ARE variants, we were able to generate a new model that demonstrated performance comparable to that of the current model in predicting ARE sites. We have demonstrated that pooling about 100 sequences containing 1 to 4 base pair mutations from three different wild-type backbones suffices to build a reliable TFBS model. If only one or two wild-type TFBS backbones were used, more variant sequences would have to be included in the analysis.

Due to the limited number of ARE variants screened in this study, we were unable to further characterize position interdependence in the ARE. Future experiments can focus on targeted multiple-base mutations in the 3' flanking region of the core, which is known to be AT-rich and has been demonstrated, as we also observed, to mediate expression. Longer ARE sequences may be used to capture shadow sites or cis-regulatory modules required for full inducibility. In addition, the result for the GSTA3 sequence with a C-to-T mutation in the GC core, which showed no significant change in expression from the wild-type sequence, needs to be replicated with an independent construct in the lab. Lastly, further study is required to determine how inactive ARE-like sequences can be made active. For NQO2-AL1, there is a high density of GCs 3' to the core sequence, which differs from the apparent need for A/T-rich sequences in this region. Future studies might swap the 3' region of functional AREs into the NQO2-AL1 sequence to determine whether such a change is sufficient to confer induction. For GCSH-AL5, the R of the RTKAY core is a non-consensus T, which could be modified in future studies; the only mutation observed at this position occurred in conjunction with a change at another critical position (TTGAC -> ATGAA).

5.4 A New Definition of ARE

At the outset of the study we took a broad perspective on what constituted an ARE-like sequence, reflecting the content of the scientific literature.
In retrospect, there is sufficient knowledge to refine the definition of what constitutes an ARE and what sequences should be referred to as ARE-like. It is important to realize that the distinction between the induction mediated by an active and an inactive ARE is not clear-cut. In reality the expression values lie on a continuum. We propose that an ARE be defined as a sequence that mediates elevated reporter gene expression (compared to a construct lacking the sequence) in response to an inducing compound, and that contains one or more copies of the core segment RTKABnnnGCR. Future confirmation of our findings could alter the core segment requirement to RTKABnnnGYR. An ARE-like sequence would contain the specific core segment, but would not mediate elevated reporter gene expression.

By this definition, a long sequence could be classified as an ARE. It would be expected that, over time and with further studies, such long sequences would be refined to identify the minimum functional sequence. In the future, it might be possible to extend the definition to include the size of the sequence once inter-nucleotide dependencies are determined.

5.5 Computational Biology and the Statistical Model

Matrix-based models for the prediction of TF binding sites require sets of experimentally verified naturally occurring sequences or binding site selection data. It is important to realize that weight matrices rely on assumptions that may oversimplify the DNA:protein interaction. First of all, a weight matrix can only describe fixed-length sequence motifs, while some transcription factors can bind to sites of variable length (71). Another implicit assumption is that individual base pairs contribute independently of each other to protein binding.
More importantly, the existing ARE profile places an overemphasis on TCA nucleotides at the three variable positions, due to the circularity of searching for novel ARE sequences that are most like the known ones. Thus, the method developed here can be applied to the study of other TFBSs under an inducible system, enabling a more accurate TFBS profile to be constructed within weeks.

It is important to realize that the analysis of data can greatly influence the interpretation of experimental results. As described, we made assumptions in order to fit a multiplicative model to the observed data with a linear regression procedure. We assumed that the error is the same across all ARE variants on each 96-well plate. However, we know that this cannot hold true experimentally. As introduced in the methods section, a traditional way of assessing the impact of mutations within a TFBS via luciferase assay for an inducible system involves a series of ratio calculations. It would be appropriate in future work to reanalyze the data with different approaches, including the ratio test approach or alternative models. Such continuing analysis may result in additional observations about the inducible ARE.

5.6 Future Directions

In this study, we have developed a higher throughput pipeline to further characterize the antioxidant response elements, built a new model for the prediction of inducible AREs, and scanned the genome for putative ARE sites. It is not known whether these in vitro-validated sequences are actually functional in vivo, as they may be influenced by additional properties, such as neighboring binding sites and accessibility regulated by chromatin effects (72). Based on previous observations [Sudmant and Wasserman, unpublished, (73)] about the positional correlation between stress-inducible regulatory elements and CpG islands, it would be interesting to explore whether predicted sites near CpG islands tend to be functional.
Such studies could be pursued by combining high-throughput Nfe212-ChIP data and inducible expression data to focus on Nfe212-regulated genes. Once we identify new ARE targets, they can be further tested by laboratory validation.  54  SS  Atidui2owag  1. Finkel T, Holbrook NJ. Oxidants, oxidative stress and the biology of ageing. Nature. 2000 Nov 9;408(6809):239-47. 2. Klaunig JE, Kamendulis LM. THE ROLE OF OXIDATIVE STRESS IN CARCINOGENESIS. Annu Rev Pharmacol Toxicol. 2004;44:239-67. 3. Belinsky MG, Dawson PA, Shchaveleva I, Bain LJ, Wang R, Ling V, et al. Analysis of the in vivo functions of Mrp3. Mol Pharmacol. 2005 Jul;68(1):160-8. 4. Keppler D. Export pumps for glutathione S-conjugates. Free Radic Biol Med. 1999 Nov;27(9-10):985-91. 5. Prochaska HJ, Talalay P. Regulatory mechanisms of monofunctional and bifunctional anticarcinogenic enzyme inducers in murine liver. Cancer Res. 1988 Sep 1;48(17):4776-82. 6. Haimeur A, Conseil G, Deeley RG, Cole SP. The MRP-related and BCRP/ABCG2 multidrug resistance proteins: Biology, substrate specificity and regulation. Curr Drug Metab. 2004 Feb;5(1):21-53. 7. van de Water FM, Masereeuw R, Russel FG. Function and regulation of multidrug resistance proteins (MRPs) in the renal elimination of organic anions. Drug Metab Rev. 2005;37(3):443-71. 8. Prestera T, Holtzclaw W, Zhang Y, Talalay P. Chemical and molecular regulation of enzymes that detoxify carcinogens. PNAS. 1993 April 1;90(7):2965-9. 9. Prestera T, Talalay P. Electrophile and antioxidant regulation of enzymes that detoxify carcinogens. Proc Natl Acad Sci U S A. 1995 Sep 12;92(19):8965-9. 10. Dinkova-Kostova AT, Massiah MA, Bozak RE, Hicks RJ, Talalay P. Potency of michael reaction acceptors as inducers of enzymes that protect against carcinogenesis depends on their reactivity with sulfhydryl groups. Proc Natl Acad Sci U S A. 2001 Mar 13;98(6):3404-9. 11. Potter JD, Steinmetz K. Vegetables, fruit and phytoestrogens as preventive agents. IARC Sci Publ. 
1996;(139):61-90.
12. Mohler J, Vani K, Leung S, Epstein A. Segmentally restricted, cephalic expression of a leucine zipper gene during Drosophila embryogenesis. Mech Dev. 1991 Mar;34(1):3-9.
13. Thimmulappa RK, Mai KH, Srisuma S, Kensler TW, Yamamoto M, Biswal S. Identification of Nrf2-regulated genes induced by the chemopreventive agent sulforaphane by oligonucleotide microarray. Cancer Res. 2002 Sep 15;62(18):5196-203.
14. Ramos-Gomez M, Kwak MK, Dolan PM, Itoh K, Yamamoto M, Talalay P, et al. Sensitivity to carcinogenesis is increased and chemoprotective efficacy of enzyme inducers is lost in nrf2 transcription factor-deficient mice. Proc Natl Acad Sci U S A. 2001 Mar 13;98(6):3410-5.
15. Chan K, Lu R, Chang JC, Kan YW. NRF2, a member of the NFE2 family of transcription factors, is not essential for murine erythropoiesis, growth, and development. Proc Natl Acad Sci U S A. 1996 November 26;93(24):13943-8.
16. Nguyen T, Rushmore T, Pickett C. Transcriptional regulation of a rat liver glutathione S-transferase Ya subunit gene. Analysis of the antioxidant response element and its activation by the phorbol ester 12-O-tetradecanoylphorbol-13-acetate. J Biol Chem. 1994 May 6;269(18):13656-62.
17. Favreau LV, Pickett CB. The rat quinone reductase antioxidant response element. Identification of the nucleotide sequence required for basal and inducible activity and detection of antioxidant response element-binding proteins in hepatoma and nonhepatoma cell lines. J Biol Chem. 1995 Oct 13;270(41):24468-74.
18. Wild AC, Gipp JJ, Mulcahy T. Overlapping antioxidant response element and PMA response element sequences mediate basal and beta-naphthoflavone-induced expression of the human gamma-glutamylcysteine synthetase catalytic subunit gene. Biochem J. 1998 Jun 1;332(Pt 2):373-81.
19. Prestera T, Talalay P, Alam J, Ahn YI, Lee PJ, Choi AM.
Parallel induction of heme oxygenase-1 and chemoprotective phase 2 enzymes by electrophiles and antioxidants: Regulation by upstream antioxidant-responsive elements (ARE). Mol Med. 1995 Nov;1(7):827-37.
20. Karapetian RN, Evstafieva AG, Abaeva IS, Chichkova NV, Filonov GS, Rubtsov YP, et al. Nuclear oncoprotein prothymosin alpha is a partner of Keap1: Implications for expression of oxidative stress-protecting genes. Mol Cell Biol. 2005 February 1;25(3):1089-99.
21. Cullinan SB, Zhang D, Hannink M, Arvisais E, Kaufman RJ, Diehl JA. Nrf2 is a direct PERK substrate and effector of PERK-dependent cell survival. Mol Cell Biol. 2003 October 15;23(20):7198-209.
22. Huang H-, Nguyen T, Pickett CB. Regulation of the antioxidant response element by protein kinase C-mediated phosphorylation of NF-E2-related factor 2. Proc Natl Acad Sci U S A. 2000 November 7;97(23):12475-80.
23. Huang H-, Nguyen T, Pickett CB. Phosphorylation of Nrf2 at ser-40 by protein kinase C regulates antioxidant response element-mediated transcription. J Biol Chem. 2002 November 1;277(45):42769-74.
24. Itoh K, Igarashi K, Hayashi N, Nishizawa M, Yamamoto M. Cloning and characterization of a novel erythroid cell-derived CNC family transcription factor heterodimerizing with the small maf family proteins. Mol Cell Biol. 1995 August 1;15(8):4184-93.
25. Motohashi H, Katsuoka F, Engel JD, Yamamoto M. Small maf proteins serve as transcriptional cofactors for keratinocyte differentiation in the Keap1-Nrf2 regulatory pathway. Proc Natl Acad Sci U S A. 2004 April 27;101(17):6379-84.
26. Katsuoka F, Motohashi H, Ishii T, Aburatani H, Engel JD, Yamamoto M. Genetic evidence that small maf proteins are essential for the activation of antioxidant response element-dependent genes. Mol Cell Biol. 2005 Sep;25(18):8044-51.
27. Wang Y, Devereux W, Stewart TM, Casero RA Jr.
Characterization of the interaction between the transcription factors human polyamine modulated factor (PMF-1) and NFE2-related factor 2 (nrf-2) in the transcriptional regulation of the spermidine/spermine N1-acetyltransferase (SSAT) gene. Biochem J. 2001 Apr 1;355(Pt 1):45-9.
28. He CH, Gong P, Hu B, Stewart D, Choi ME, Choi AMK, et al. Identification of activating transcription factor 4 (ATF4) as an Nrf2-interacting protein. Implication for heme oxygenase-1 gene regulation. J Biol Chem. 2001 June 8;276(24):20858-65.
29. Venugopal R, Jaiswal AK. Nrf2 and Nrf1 in association with jun proteins regulate antioxidant response element-mediated expression and coordinated induction of genes encoding detoxifying enzymes. Oncogene. 1998 Dec 17;17(24):3145-56.
30. Wasserman WW, Fahl WE. Functional antioxidant responsive elements. Proc Natl Acad Sci U S A. 1997 May 13;94(10):5361-6.
31. Erickson AM, Nevarea Z, Gipp JJ, Mulcahy RT. Identification of a variant antioxidant response element in the promoter of the human glutamate-cysteine ligase modifier subunit gene. Revision of the ARE consensus sequence. J Biol Chem. 2002 Aug 23;277(34):30730-7.
32. Nerland DE. The antioxidant/electrophile response element motif. Drug Metab Rev. 2007;39(1):235-48.
33. Wang X, Tomso DJ, Chorley BN, Cho HY, Cheung VG, Kleeberger SR, et al. Identification of polymorphic antioxidant response elements (AREs) in the human genome. Hum Mol Genet. 2007 Apr 4.
34. Nioi P, McMahon M, Itoh K, Yamamoto M, Hayes JD. Identification of a novel Nrf2-regulated antioxidant response element (ARE) in the mouse NAD(P)H:Quinone oxidoreductase 1 gene: Reassessment of the ARE consensus sequence. Biochem J. 2003 Sep 1;374(Pt 2):337-48.
35. Mulcahy RT, Wartman MA, Bailey HH, Gipp JJ. Constitutive and beta-naphthoflavone-induced expression of the human gamma-glutamylcysteine synthetase heavy subunit gene is regulated by a distal antioxidant response element/TRE sequence. J Biol Chem. 1997 Mar 14;272(11):7445-54.
36.
Ikeda H, Nishi S, Sakai M. Transcription factor Nrf2/MafK regulates rat placental glutathione S-transferase gene during hepatocarcinogenesis. Biochem J. 2004 Jun 1;380(Pt 2):515-21.
37. Jowsey IR, Jiang Q, Itoh K, Yamamoto M, Hayes JD. Expression of the aflatoxin B1-8,9-epoxide-metabolizing murine glutathione S-transferase A3 subunit is regulated by the Nrf2 transcription factor through an antioxidant response element. Mol Pharmacol. 2003 Nov;64(5):1018-28.
38. Moinova HR, Mulcahy RT. An electrophile responsive element (EpRE) regulates beta-naphthoflavone induction of the human gamma-glutamylcysteine synthetase regulatory subunit gene. Constitutive expression is mediated by an adjacent AP-1 site. J Biol Chem. 1998 Jun 12;273(24):14683-9.
39. Wang W, Jaiswal AK. Nuclear factor Nrf2 and antioxidant response element regulate NRH:Quinone oxidoreductase 2 (NQO2) gene expression and antioxidant induction. Free Radic Biol Med. 2006 Apr 1;40(7):1119-30.
40. Lou H, Du S, Ji Q, Stolz A. Induction of AKR1C2 by phase II inducers: Identification of a distal consensus antioxidant response element regulated by NRF2. Mol Pharmacol. 2006 May;69(5):1662-72.
41. Nishinaka T, Yabe-Nishimura C. Transcription factor Nrf2 regulates promoter activity of mouse aldose reductase (AKR1B3) gene. J Pharmacol Sci. 2005 Jan;97(1):43-51.
42. Kim YC, Yamaguchi Y, Kondo N, Masutani H, Yodoi J. Thioredoxin-dependent redox regulation of the antioxidant responsive element (ARE) in electrophile response. Oncogene. 2003 Mar 27;22(12):1860-5.
43. Kurz EU, Cole SP, Deeley RG. Identification of DNA-protein interactions in the 5' flanking and 5' untranslated regions of the human multidrug resistance protein (MRP1) gene: Evaluation of a putative antioxidant response element/AP-1 binding site. Biochem Biophys Res Commun. 2001 Jul 27;285(4):981-90.
44. Banning A, Deubel S, Kluth D, Zhou Z, Brigelius-Flohe R. The GI-GPx gene is a target for Nrf2. Mol Cell Biol. 2005 Jun;25(12):4914-23.
45. Alam J, Cai J, Smith A.
Isolation and characterization of the mouse heme oxygenase-1 gene. Distal 5' sequences are required for induction by heme or heavy metals. J Biol Chem. 1994 Jan 14;269(2):1001-9.
46. Tsuji Y. JunD activates transcription of the human ferritin H gene through an antioxidant response element during oxidative stress. Oncogene. 2005 Nov 17;24(51):7567-78.
47. Katsuoka F, Motohashi H, Engel JD, Yamamoto M. Nrf2 transcriptionally activates the mafG gene through an antioxidant response element. J Biol Chem. 2005 Feb 11;280(6):4483-90.
48. Hintze KJ, Wald KA, Zeng H, Jeffery EH, Finley JW. Thioredoxin reductase in human hepatoma cells is transcriptionally regulated by sulforaphane and other electrophiles via an antioxidant response element. J Nutr. 2003 Sep;133(9):2721-7.
49. Arinze IJ, Kawai Y. Transcriptional activation of the human Galphai2 gene promoter through nuclear factor-kappaB and antioxidant response elements. J Biol Chem. 2005 Mar 18;280(11):9786-95.
50. Wilson LA, Gemin A, Espiritu R, Singh G. Ets-1 is transcriptionally up-regulated by H2O2 via an antioxidant response element. FASEB J. 2005 Dec;19(14):2085-7.
51. Sasaki H, Sato H, Kuriyama-Matsumura K, Sato K, Maebara K, Wang H, et al. Electrophile response element-mediated induction of the cystine/glutamate exchange transporter gene expression. J Biol Chem. 2002 Nov 22;277(47):44765-71.
52. Smith CL, Hager GL. Transcriptional regulation of mammalian genes in vivo. A tale of two templates. J Biol Chem. 1997 Oct 31;272(44):27493-6.
53. Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006 Jan;16(1):1-10.
54. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006 Nov 23;444(7118):499-502.
55. Stormo GD. DNA binding sites: Representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23.
56. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004 Apr;5(4):276-87.
57. Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontoglio M. Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. J Mol Biol. 1997 Feb 21;266(2):231-45.
58. Mirny LA, Gelfand MS. Structural analysis of conserved base pairs in protein-DNA complexes. Nucleic Acids Res. 2002 Apr 1;30(7):1704-11.
59. Dube DK, Loeb LA. Mutants generated by the insertion of random oligonucleotides into the active site of the beta-lactamase gene. Biochemistry. 1989 Jul 11;28(14):5703-7.
60. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. oPOSSUM: Integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res. 2007 Jul 1;35(Web Server issue):W245-52.
61. Sandelin A, Wasserman WW, Lenhard B. ConSite: Web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W249-52.
62. Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998 Apr 24;278(1):167-81.
63. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Res. 2004 Jun;14(6):1188-90.
64. Mahony S, Benos PV. STAMP: A web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007 Jul 1;35(Web Server issue):W253-8.
65. Xie T, Belinsky M, Xu Y, Jaiswal AK. ARE- and TRE-mediated regulation of gene expression. Response to xenobiotics and antioxidants. J Biol Chem. 1995 Mar 24;270(12):6894-900.
66. Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P. High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol. 2002 Aug;20(8):831-5.
67. Bulyk ML, Huang X, Choo Y, Church GM. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays.
Proc Natl Acad Sci U S A. 2001 Jun 19;98(13):7158-63.
68. Kyo M, Yamamoto T, Motohashi H, Kamiya T, Kuroita T, Tanaka T, et al. Evaluation of MafG interaction with maf recognition element arrays by surface plasmon resonance imaging technique. Genes Cells. 2004 Feb;9(2):153-64.
69. Shultzaberger RK, Schneider TD. Using sequence logos and information analysis of lrp DNA binding sites to investigate discrepancies between natural selection and SELEX. Nucleic Acids Res. 1999 Feb 1;27(3):882-7.
70. Friling R, Bergelson S, Daniel V. Two adjacent AP-1-like binding sites form the electrophile-responsive element of the murine glutathione S-transferase Ya subunit gene. Proc Natl Acad Sci U S A. 1992 January 15;89(2):668-72.
71. Seidel HM, Milocco LH, Lamb P, Darnell JE Jr, Stein RB, Rosen J. Spacing of palindromic half sites as a determinant of selective STAT (signal transducers and activators of transcription) DNA binding and transcriptional activity. Proc Natl Acad Sci U S A. 1995 Mar 28;92(7):3041-5.
72. Beato M, Eisfeld K. Transcription factor access to chromatin. Nucleic Acids Res. 1997 Sep 15;25(18):3559-63.
73. Yamashita R, Suzuki Y, Sugano S, Nakai K. Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene. 2005 May 9;350(2):129-36.
Appendices

Appendix A: ARE variants

Table S.1 Coefficients for GST-YA ARE variants
GST-YA ARE variants        Coef. 1  P-value  Coef. 2  P-value
TGCTAATGGTGACAAAaCAAggTTC    -1.32     0.00    -1.02     0.00
cGgTAATGtTGACAAAGCAtCTTTC    -1.69     0.00    -0.99     0.00
TGCTAAgGGTGACAAtGCAACgcTC    -1.08     0.00    -0.95     0.01
TGCTAATGGTGACAAAGtAAtTTTC    -0.61     0.01    -0.88     0.01
gaCTAATGGTGACgAAcCAACTTca    -0.86     0.00    -0.82     0.02
TtCTAATGGTGACAAAGtAtCTTTC    -2.53     0.00    -0.72     0.04
TGaTAATGGTGACcAAGCAACcTTC    -1.85     0.00    -0.70     0.05
TcCTgATGGTGACAtAGCgACTTTC     0.11     0.64    -0.67     0.05
TGCTAATGGTGACAAcGCAACTTTC    -1.18     0.00    -0.61     0.08
TGgTAgTGGTGACAAAGCAcCTTgC    -0.56     0.02    -0.58     0.10
TGCTAAgGGTGACAAAGCAACTaTC    -1.24     0.00    -0.51     0.14
TGCgAATGGTGACAAgGCtACTTTC    -0.40     0.10    -0.50     0.15
TGCTAtgGGTGACAAAGCAACTTTC    -0.43     0.08    -0.49     0.16
gGCaAATGGTGACAgAGCAACTTTC    -0.68     0.01    -0.48     0.17
TGCTAATGGTGACAAAGCAAagTTC    -0.06     0.79    -0.48     0.17
TGCTgATGGTGACAAAGCAACTTgC    -2.34     0.00    -0.48     0.17
TGgTAATGGTGACAAAGCAACcgTC    -2.13     0.00    -0.47     0.18
TGaTAATGGTGACAAcGCAACTTTC    -1.46     0.00    -0.46     0.19
TGCTAATGGTGACAAAGCgACTTTC    -1.67     0.00    -0.45     0.20
TGCTAATGGTGACAAAGCtAgTTTC    -0.63     0.01    -0.43     0.22
gGCTAATGGTGACAAgGCAACTTgC    -0.58     0.02    -0.38     0.27
gGgTAATGGTGACAAAGCgAtTaTC    -1.77     0.00    -0.31     0.38
TGCTAATGGTGACtAtGCAAtTTTC    -0.87     0.00    -0.29     0.41
TGtTAATaGTGACAAAGCAACTTgC    -1.56     0.00    -0.28     0.42
TGgTAATGGTGACAAAGCtACTTTC    -1.99     0.00    -0.24     0.49
TtCTAAgGGTGACAAAGCAACTTTC    -0.02     0.95    -0.23     0.51
gGCTAATGGTGACcAAGCAACTTTC    -0.50     0.04    -0.20     0.56
TtCTAATGGTGACAAgGCAtCTTTC    -0.88     0.00    -0.19     0.58
TGgTAATGGTGACAAAGCAACTgTC    -0.71     0.00    -0.19     0.59
gGgTAATGGTGACAAAGCAACTTTt     1.40     0.00    -0.18     0.60
gGCTcATGGTGACAAAGCAACTTTC    -0.27     0.27    -0.17     0.62
TGgTAgTGGTGACAAAGCAACTTTC     0.09     0.70    -0.16     0.65
TaaTAATGGTGACAAAGCAtCTTcC    -0.16     0.52    -0.15     0.67
gGCTAATGGTGACAAAGCgAtTTTC    -0.35     0.16    -0.14     0.69
TGgTAATGGTGACAAAGCgACTTTa    -3.22     0.00    -0.13     0.71
gGCTgATGGTGACAAtGCAAtTTTC    -1.62     0.00    -0.12     0.74
cGCTAATaGTGACAAAGCAACTTTC    -2.60     0.00    -0.11     0.74
gaCTAATGGTGACAAAGCAACTTTC     0.12     0.62    -0.11     0.74
TGCTAATGGTGACAcAGCAACTTTC     1.18     0.00    -0.10     0.78
cGCTAATGGTGACAAAGCAACTTTC     0.22     0.31     0.10     0.75
TGCTAATGGTGACccAGCcACTTTC    -0.74     0.00    -0.07     0.84
TGgTAATGGTGACAAAGCAACTTcC     1.05     0.00    -0.06     0.86
gGCcAATGGTGACAAAGCAACTTTC     0.10     0.70    -0.03     0.92
TtCTAATGGTGACAAAGCAACgTTC    -0.23     0.34    -0.03     0.92
gGCTAATGGTGACAAAGCAACTTTC     0.08     0.69    -0.02     0.93
TGCaAATGGTGACAAAGCAACTTTC     0.20     0.43    -0.01     0.97
TGCTAATGGTGACAAAGCAAaTTTt    -0.30     0.23     0.05     0.89
TGtTAATGGTGACAAAGCAACTTTC     0.97     0.00     0.08     0.83
gGCTAATGGTGACAAAGCAACTTTa    -0.16     0.44     0.10     0.74
TGgTAATGGTGACAAAGCAACTTTC     0.45     0.02     0.12     0.67
TGgTAATGGTGACgAAGCAtCTTTC    -1.41     0.00     0.12     0.73
TGCTAATGGTGACAAAGCAtaTTTC    -0.60     0.02     0.13     0.72
TGCTAATGGTGACAAAGCAAaTTTC    -0.95     0.00     0.17     0.57
aGCTAATGGTGACAtAGCAACTTTC    -1.24     0.00     0.18     0.60
TGCTcATGGTGACAAAGCAACTTTC    -1.28     0.00     0.24     0.49
TGCTAATcGTGACAAAGCAACTTTC     0.34     0.11     0.25     0.41
TGtTAATGGTGACcAAGCAACTTTC    -0.13     0.59     0.30     0.40
aGCTcATGGTGACAtAGCAACTTTC     0.18     0.46     0.37     0.29

Table S.2 Coefficients for GSTA3 ARE variants
GSTA3 ARE variants         Coef. 1  P-value  Coef. 2  P-value
ACTCAGGgATGACATTGCAaTTgTC    -1.12     0.00    -0.46     0.08
ACTCAGGaATGACgTTGCtTTTTTC    -0.53     0.01    -0.31     0.24
ACTCAGGCATGACATTGCATTTgTC    -1.55     0.00    -0.24     0.37
ACTggGGCATGACgTaGCATTTTTC    -3.43     0.00    -0.22     0.42
ACTCAGGtATGACATTGCATTTTTC     0.87     0.00    -0.21     0.43
tCTCAGGCATGACATTGCtTTTTTC    -2.76     0.00    -0.19     0.48
gaTCAGGgATGACATTGCATTTTTC     0.26     0.17    -0.17     0.52
ACTCAGGaATGACtTTGCATTTTTC     0.58     0.00    -0.17     0.52
ACTCAGGCATGACATTGCATTTTTa     0.07     0.73    -0.16     0.56
ACTCgGGtATGACtTTGCATTTTTC    -0.31     0.10    -0.12     0.64
ACTgAGGCATGACATTGCATTTTTC     0.44     0.02    -0.12     0.66
ACTCAGGgATGACAaTGCATTTTcC    -1.67     0.00    -0.10     0.71
ACTCAGGCATGACATaGCATTTTTC     0.76     0.00    -0.08     0.77
gCTCAGGCATGACATTGCATTTTTC     0.18     0.14    -0.07     0.66
ACTCAGGgATGACATTGCATTTTTt     0.36     0.06    -0.07     0.79
ACTCAGGCATGAtATTGCATTTTTC    -1.59     0.00    -0.07     0.81
AgTCAGGCATGACATTGCATTTTTC    -0.68     0.00    -0.06     0.82
AgTtAGGCATGACATTGtATTTTTC    -1.73     0.00    -0.06     0.82
tCTCAGGCATGACATgGCAaTTTTC    -1.10     0.00    -0.05     0.86
ACTCAGGCATGACATTGCgTTTTTC    -1.11     0.00    -0.05     0.87
AgTaAGGCATGACATTGCATTTTTC     0.12     0.54    -0.03     0.91
ACTCAGGCATGACATTGCtTTTTTC    -0.88     0.00    -0.03     0.86
gCTCAGGgATGACATTGCATTTTTC     0.63     0.00    -0.03     0.92
ACTCgaGaATGACATTGCATTTTTC    -3.44     0.00    -0.01     0.96
ACTCAGGCATGACATTGCAgTTTTC    -2.00     0.00     0.00     1.00
cCTCAGGCATGACATTGCATTTTTC     0.07     0.73     0.00     0.99
AaTCAGGCATGACATTGCATTTTTC     0.66     0.00     0.01     0.97
ACTCAGGtATGACAaTGCATTTTTC     0.23     0.23     0.01     0.96
AaaCAGGCATGACATTGCATcTTTC    -0.65     0.00     0.02     0.95
ACTgtGGCATGACATTGCATTTTTt    -0.53     0.01     0.05     0.86
ACTCgGGCATGACATTGCATTTTTC    -1.48     0.00     0.06     0.83
ACgCAGGCATGACATTGCATTTTTC    -1.20     0.00     0.06     0.81
AtTCAGGCATGACATTGCATTTTTC    -0.63     0.00     0.06     0.74
ACTCAGGCATGACATgGCAaTTTTa    -3.04     0.00     0.08     0.77
ACTCtGGCATGACATTGCATTTTTC     0.20     0.29     0.09     0.75
ACTtAGGCATGACATTGCATTTTTC     0.34     0.01     0.12     0.53
ACTCAaGCATGACcTTGCATTTTTC    -0.88     0.00     0.14     0.61

Table S.3 Coefficients for NQO1 ARE variants
NQO1 ARE variants          Coef. 1  P-value  Coef. 2  P-value
GAGTtAaAGTGAGaCGGCAAtATTT    -1.09     0.05    -0.80     0.31
GAGTCgCAGTGAGTCGGCcAAATTT    -1.08     0.05    -0.76     0.34
GAGTCgCAGTGAGgCGGCAAAATTT    -1.13     0.01    -0.64     0.29
GgGTtACAGTGAGTCGGCtAAATTa    -1.25     0.02    -0.61     0.44
GAGTCcCAGTGAGTCGGCgAAATTT    -0.77     0.17    -0.58     0.47
GAGTCACAGTGAGTCtGCcAAATTT    -0.54     0.33    -0.57     0.47
GAGTCgCAGTGAGTaGGCAAAATTT    -0.38     0.49    -0.54     0.50
GAGTtACAGTGAGTCcGCAAAATTT    -1.17     0.04    -0.51     0.51
GAGTCACAGTGAGTgGGCgcAATTT    -0.82     0.14    -0.45     0.57
GAGgCACAGTGAGTtGGCAtAATTT    -1.01     0.07    -0.41     0.60
GAGTCACAGTGAGTaGGCAtAATTT     0.60     0.28    -0.39     0.62
GAGTCAgAGTGAGTCGGCAAAtTTT     0.39     0.48    -0.37     0.64
GAGgCgCAGTGAGTCaGCAAAAcTT    -0.19     0.73    -0.37     0.64
GAGTCACAGTGAGTCGGCAcAATTT     0.74     0.18    -0.37     0.64
GgGTCACAGTGAGTCtGCAAAATTg     0.53     0.34    -0.34     0.67
GAGgCACAGTGAGTCGGCAAAATTT    -0.71     0.20    -0.34     0.67
GAGTCAaAGTGAGTCGGCAAAATTT    -0.10     0.86    -0.31     0.69
GAGaCgCgGTGAGTCGGCAAAATTT    -0.46     0.41    -0.21     0.79
GAGTCACcGTGAGaCGGCAAAATTT     0.86     0.12    -0.20     0.80
GAGTCACAGTGAGTCGGCAcAATaT     0.22     0.69    -0.14     0.86
GAGTCACAGTGAGTCGGCAgAATTT     1.23     0.03    -0.11     0.89
GAGTCtCAGTGAGTCGGCAAAATTc    -1.73     0.00    -0.06     0.94
GAGTCAtAGTGAGTCGGCAAtcaTT    -0.51     0.36     0.02     0.98
GAGgCAtAGTGAGTCGGCAAtATTT    -1.41     0.01     0.16     0.84
GAGTCACAGTGAGTCGGCAAAATTT    -1.78     0.00     0.17     0.83
GAGTCACAGTGAGTCGGCAtAtTTT    -0.23     0.68     0.24     0.76
GAGTCACAGTGAGTgGGCAAAATTT    -1.09     0.05     0.32     0.68
GtGTCACAGTGAGTCGGCAAAATTT     0.59     0.29     0.38     0.63
tAGTCACAGTGAGTCGGCAAAATTT    -0.51     0.36     0.40     0.61

Table S.4 Coefficients for NQO2 ARE variants
NQO2 ARE variants          Coef. 1  P-value  Coef. 2  P-value
CTGCCTGGATGACgGAGCGAGACCC     1.22     0.02    -1.37     0.07
tTGCCTGGATGACAGAGCGAGACCC     3.25     0.00    -1.26     0.10
CTGCCTGGATGACAGAGCGAGACCg     1.28     0.02    -1.25     0.10
CTGCCTGGATGACAGAGCGgGACtC    -4.31     0.00    -1.21     0.11
CcGCCTGGATGACAGAGCGAGACCC     2.29     0.00    -1.20     0.11
CTGCgTGGATGACAGAGCGAGACCC     1.50     0.00    -1.13     0.05
CTGCaTGGATGACAGAGCGAGACCC     2.72     0.00    -1.09     0.15
CTGCCTGGATGACAGAGCGAGACgC     1.75     0.00    -1.09     0.15
CTGCCgGGATGACAGAGCGAGACCC     2.00     0.00    -1.08     0.16
CaGCCTGGATGACAGAGCGAGACCC     3.00     0.00    -1.02     0.18
CTGCCTGGATGACAGAGCGAGgCCC     1.96     0.00    -0.88     0.12
CTGCCaGGATGACAGAGCGAGACCC     2.90     0.00    -0.85     0.26
CTGCCTGGATGACAGcGCGgGACCa     1.66     0.00    -0.83     0.28
CcGCCaGGATGACAGAGCGAGACCC     1.73     0.00    -0.80     0.29
CcGCCgGaATGACAGgGCGgGAgCC     1.95     0.00    -0.78     0.31
CTGCCTGGATGACAGAGCGgtACCC    -0.74     0.17    -0.77     0.31
CgGCCTcGATGACAGAGCGAGACCC     0.38     0.48    -0.71     0.35
CTGCCTGGATGACtGgGCGAacgCC     1.23     0.02    -0.68     0.37
CTGCCaGGATGACgGAGCGAGACCa     3.23     0.00    -0.63     0.41
gTGCCTGGATGACAGAGCGAGACCC     1.24     0.02    -0.63     0.41
CTGCCTGGATGACAGAGCGgGACCC     1.94     0.00    -0.62     0.42
CTGCCaGGATGACgGAGCGAGgCCC     0.05     0.93    -0.52     0.49
CTGCCTGGATGACcGAGCGcGACtC     0.71     0.19    -0.50     0.51
CTGCCTGGATGACgGAGCGAGACCg    -0.24     0.66    -0.50     0.51
CgGCCTGGATGACAGAGCGAGACCC     2.19     0.00    -0.42     0.47
CTGCCTGGATGACAGcGCGcGACCC     1.66     0.00    -0.42     0.58
CTGCCTGGATGACAGgGCGAGACCC    -0.16     0.66    -0.39     0.43
CTGCCTGGATaACAGAGCGAGACCC     0.85     0.12    -0.29     0.70
CcGCtTGGATGACAGgGCGAGACCC    -0.42     0.44    -0.25     0.75
CTaCCTGGATGACAGcGCGcGACCC     0.51     0.34    -0.25     0.75
CTGCgTGGATGACAGgGCGAGACgC     1.00     0.06    -0.24     0.75
CTGCCTGGATGACtGcGCGAGgCCC    -0.38     0.48    -0.24     0.76
CTGCCTGGATGACAGAGCGAGACCa    -0.54     0.32    -0.22     0.77
CgGCCgGGATGACcGAGCGAGACCC    -0.17     0.75    -0.17     0.83
CcGCCTGGATGAtAGAGCGgGgCCC    -0.30     0.58    -0.17     0.83
CTaCCTGGATGACAGAGCGAGAaCC     0.91     0.09    -0.11     0.88
CTGgCTGGATGACAGAGCGAGACCC     1.63     0.00    -0.11     0.88
CTGCCTGGATGACAGAGCGAGcgCC     1.81     0.00    -0.10     0.89
CTGaCTGGATGACAtAGCGcGcCCC    -0.36     0.51    -0.03     0.97
CcGCCTGGATGACAGcGCGAGAggC     0.25     0.65     0.01     0.99
CTGggTGGATGACgctGCGAGACCC     1.04     0.06     0.03     0.96
CTGCCTGGATGACgGtGCGAGACCC     1.71     0.00     0.04     0.96
CTGCCaGGATGACAGAGCGtGACCC    -0.55     0.31     0.05     0.95
CgGCCTGGATGACAGAGCGAGACaC     0.56     0.30     0.09     0.91
CTGCCgGGATGACAGAGCGAGACCa    -0.43     0.42     0.11     0.89
CTGCCTGGATGACAcAGCGAGgCCC     0.88     0.10     0.11     0.88
CTaCCTGGATGACAGAGCGAGACCC     1.71     0.00     0.11     0.88
CTGCCTGGATGACAGAGCGcGACCC    -1.14     0.01     0.17     0.77
CTcCCTGGATGACAGAGCGtGgCCC    -0.52     0.34     0.18     0.82
CTGCCTGGATGACAcAGCGAGACaC    -0.04     0.94     0.20     0.79
CTGCCTGGATGACAGAGCGAGACCt    -0.40     0.46     0.22     0.77
CTGCCaGGATGACAGtGCGAGACCC    -2.87     0.00     0.23     0.76
CgGgCTGGATGACAGAGCGAGgCCC     0.57     0.29     0.27     0.73
CTGCCTGGATGACAGAGCGAGcCgC     0.92     0.09     0.31     0.68
CTGCCTGGATGACgGcGCGAGACCC     1.81     0.00     0.32     0.68
CTGCCTGGATGACAGAGCGAGcCgC    -0.91     0.03     0.34     0.56
CgGCCTGGATGACgGAGCGAGACCC     1.88     0.00     0.35     0.64
CgGCCTGGATGACAGAGCGcGACCC    -2.04     0.00     0.36     0.63
CTaCCgGGATGACAGAGCGAcACCC    -6.08     0.00     0.39     0.61
CTGCCTGGATGACtGAGCGAGACCC     1.71     0.00     0.40     0.60
CgGCCaGGATGACAGtGCGAGACCC     1.32     0.01     0.41     0.59
CTGCCTGGATGACcGAGCGAGACCC    -2.71     0.00     0.45     0.56
CTGCCaGGATGACAGAGCGgGACCC    -2.57     0.00     0.49     0.52
CTGttTGtATGACAGAGCGAGACCC     0.09     0.87     0.56     0.46

Table S.5 Coefficients for GCSH ARE variants
GCSH ARE variants          Coef. 1  P-value  Coef. 2  P-value
AATtTGTGTTGACAcAGCAAcGACC    -0.22     0.68    -0.33     0.66
gATATGTGTTGACgGAGCAAgGtCC     0.20     0.70    -0.31     0.68
AATATGgGTTGACAGgGCAATGACC     0.51     0.33    -0.28     0.71
AtTATGTGTTGACAGAGCAATGACa     0.09     0.87    -0.28     0.71
AATtTGTGTTGACAGgGCAtTtACC     0.23     0.67    -0.25     0.74
AATtTtTGTTGACAGAGCAcTGACC    -0.66     0.21    -0.24     0.75
AATAgtTGTTGACtGAGCAATGAgC     0.01     0.99    -0.23     0.75
AAcAgGTGTTGACAGcGCAATGACC    -0.30     0.57    -0.22     0.76
AATATGTGTTGACAGAGCAATGAtC     0.20     0.71    -0.18     0.81
AATATGTGTTGACAGcGCgATcAaC    -0.06     0.91    -0.17     0.82
AATATaTGTTGACAGAGCAATGACC     0.19     0.72    -0.11     0.88
AATcTGgGTTGACAGAGCAAgGACC     0.28     0.60    -0.11     0.89
AATtTGTGTTGACtGcGCAgTGACC     0.27     0.61    -0.06     0.93
AAgATGTGTTGACAGAGCAATGACC    -0.07     0.89    -0.06     0.94
AcTATGTGTTGACAGAaaAATGACC     0.96     0.07    -0.05     0.95
AATATGTGTTGACAGAGCAATtgCC    -0.39     0.46    -0.05     0.95
AATATGTGTTGACgGAGCAATGACC     0.39     0.46    -0.05     0.95
AATATGTGTTGACAGAGCAgTGACC    -0.14     0.80    -0.04     0.95
gAaATGTGTTGACAGAGCAATGACC     1.41     0.01    -0.03     0.97
AATATGTGTTGACAGAGCcATGttC     1.12     0.03    -0.02     0.98
AATATGTGTTGACAtAGCAATGACC     1.30     0.01    -0.01     0.99
AATATGTGTTGACtGAGCAATGACC     0.98     0.06    -0.01     0.99
AAgATGTGTTGACAGAGCAATGAtC     0.51     0.33     0.02     0.98
tATATGTGTTGACAGtGCAATGACC     0.14     0.79     0.04     0.95
tATATGTGTTGACAGAGtAATGACC     0.35     0.50     0.04     0.95
AATATGgGTTGACAGAGCAATGACC    -1.24     0.00     0.06     0.91
AgTATGTGTTGACAGAGCAATGACC     0.68     0.20     0.08     0.91
AAgATGTGTTGACAcAGCAAgGACC     0.46     0.38     0.11     0.88
AATATGTGTTGACAGAGCAATGAgC    -1.45     0.01     0.14     0.85
AATATtTGTTGACAGAGCAtTGcgC     1.01     0.06     0.14     0.85
AtaATGTGTTGACgGAGCAATcACC    -0.53     0.31     0.15     0.84
AATATGTGTTGACAcAGCAtTGACC    -0.39     0.46     0.16     0.83
AATAgtgGgTGAacGAGCAtTGACg    -1.73     0.00     0.18     0.81
ttTATGaGTTGACAGAGCAATGACC     0.97     0.07     0.20     0.79
N iA^  4,^4, Id^61 4/ 4, W^4, 41 4/ 41^41 41^1-, 0 F, F, ^41^4, 6.1^4/ 61 4, 4/^61 41^• • 41 4, A 1-, N W^4, W 1,^NOON^r H 0 I-,  k.0 W 01 10 10 4, 01 lf) l0 4, 01 tC, NOON I-, 0 r H  •  10 41 N • •^ 4, 41 4, U./^H 0 H H^H IH 0 H^4, 4, LA, 4.1 4, 41^4/^4/^4, 4,^4, 4, ^61 4/^4/ 4/^4, 41^4, 41 F, ^4, W 4/^4, 4, 4,^H H r H  4/ 41^41 4/ 41 61 H 0^ N 0 0^H H H r^4, 4/ 4,^H 0 H  0 r 0 0^0 0 A 0^H H 0 F, ^HHHH  41 41^41 • Ul W^0 A 0 0^N) 0 N 0^0 0  H H r F, ^0 A 0 0^0 A 0 0  41 41^4, 4./^4./ 4.1 41^r r I-, I-, ^4) W^W^H F, 4.1 4,  N 41 4/^  41 41 NJ 4 ,  •  4,^4, 4./ 4/N1,1^ 41^W HHOH^4, 41 4, ^r 1-4 I-, 4, 4, ^4/ 41 4/^4,^4/ 4, ^41 w^w • W 4, 41 61^Lo.,^H F, 0 H^r H F,  o  0 CD CD I-,  o o o  I-, 0 0 0  0 F, 0 N^A 0 0 0^0 0 0 A^A 0 0 0 0  01 01 01 M 01^0 A 0 0^0 A 0 0^0 A 0 0 M 01 Cl  41 VI W r^W 1.4 W 41 41 4, 4, 4/^ 4.1 W^A000^W 41 W F, 61 414.  4/ 4, ^4, 4,^41 4/ 4/^4/  •  LO  W W^  N iA N^NONO^0 N 0 N^H F, 0  W W W W^W W W^ W 4, 4, 4, 41 4,^W 41 W^r r r r^4, 4, 4, • 6/^4,  H  • H^ 41 41^4, 4., 1,1^4, (D^0 A 0 0^41 41^4,^I-, I-, I-, H 4, 6)^4, 4, 4, ^4,  W  4./  41 W 41^ 41 61 1,1^ 4/ 41 w^H H F,  4.1^4, 4,  W^W 40 r^4, 4, 41^I-, 0 r H  VI I-, W N^NOON^r 0 H r^41 47 4,  01^01 01 01^01 al MI^01 01  Cl)  o Z^t-, 1--. 1--. -, 0^1-■ 1-. ■-■ H 0 Ft >  ^0^.-3^.3  A o ^  tv Z^o  W 14 [V 1,-, ^I-,^ > • •^• I,^H F, 0 H^H 1-• F-. l-. (..,^H I-, 1-■1-. wt..,^t.., 0^• •^ wt.,^6., a^L., L.,^41 4, 4, Wr^4,4,^41 1-1^4, 4, ^41^F., H 0 I-, ^1-, F, F, F, I, •  IQ 01 N.,  Cl 4, 4/ 01^4, 4/ 41^N 0 IV 0^r H r 0 CI1 W W 01 4, 4, 01^ 4/ W Lo  69  Appendix C: Decision tree analysis for ARE and ARE-like sequences An alternative approach to identify functionally important segments in the ARE can be pursued using decision trees. A decision classifies instances by sorting them down the tree from the root node to some leaf node to provide the classification of the instance. 
Each node in the tree specifies a test of some attribute of the instance (in this case, the nucleotide observed at a position in the candidate ARE sequence), and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the decision tree, assessing the property indicated for that node, then moving down the tree branch corresponding to the value of the attribute. This process is then repeated at the next node, and so on until a leaf node is reached. A simple example would be to predict whether it will rain. A first question might be "Are there clouds in the sky?" If not, you might predict no rain, while if so you could ask a second question such as "Is the humidity above 90%?" Such questions could follow until the best possible classifier is created.

Established software is available for the creation of trees. We used the J48 classifier, based on the C4.5 tree algorithm developed by Ross Quinlan in 1993, from the software package Waikato Environment for Knowledge Analysis (Weka). The C4.5 algorithm builds decision trees from a set of training data using the concept of information entropy:

$$I_E(i) = -\sum_{j=1}^{m} f(i,j)\,\log_2 f(i,j)$$

where f(i,j) is the proportion of samples at node i that belong to class j, and m is the number of classes. The training data is a set S = s1, s2, ... of already classified samples. Each sample si = x1, x2, ... is a vector where x1, x2, ... represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ... where c1, c2, ... represent the class that each sample belongs to. C4.5 examines the normalized information gain that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recursively classifies the remaining sublists.

The coefficient value associated with each ARE variant is categorized as a decrease or no decrease in expression for the AREs and ARE-like sequences.
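The entropy and normalized information gain described above can be sketched in a few lines of Python. This is a minimal illustration, not Weka's J48 implementation: the function names (`entropy`, `information_gain`, `gain_ratio`), the toy samples, and their class labels are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(samples, labels, attribute):
    """Entropy reduction obtained by splitting the samples on one attribute."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(samples, labels, attribute):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy([sample[attribute] for sample in samples])
    if split_info == 0:
        return 0.0
    return information_gain(samples, labels, attribute) / split_info


# Hypothetical toy data: each sample maps a sequence position to its base
# (lower case marks a mutated base); labels are made up for illustration.
samples = [
    {14: "A", 23: "T"},
    {14: "c", 23: "T"},
    {14: "A", 23: "g"},
    {14: "c", 23: "g"},
]
labels = ["no decrease", "decrease", "no decrease", "decrease"]

for position in (14, 23):
    print(position, gain_ratio(samples, labels, position))
```

In this toy set, position 14 separates the two classes perfectly while position 23 carries no information, so a C4.5-style learner would choose position 14 for the first split.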
Decision tree analysis was first performed on the five sets of ARE variants separately, as well as on the combined ARE variants. For regional analysis of the ARE sequence, the 25-base-pair ARE variant sequences were divided into seven blocks, and the two blocks containing the invariant ARE core with no mutations were masked from the analysis. Each block of sequence is a feature, and a block that contains one or more mutations is categorized as a mutant block. The decision tree algorithm predicts the effect of mutations on AREs from the identified base changes and whether they mediate a change in expression relative to wild type.

Decision tree analysis was performed on the five sets of ARE variants separately, as well as on the combined ARE variants for each group, using the 18 variable positions as attributes. The Weka software was used to analyze the ARE blocks with the core positions masked. 10-fold cross-validation experiments were performed using the J48 classification method on all sets. No consistent pattern that would lead to decreased gene expression was observed across the 18 positions for the active and inactive AREs (Table S.6).

Table S.6 Decision tree classification with 18 variable positions as attributes
ARE                    Instances  Correctly classified  Important positions
GSTYA                  58         62%                   23, 16, 14
GSTA3                  37         70%                   NA
NQO1                   27         81%                   5
All active variants    122        61%                   23, 16, 24
NQO2                   64         67%                   NA
GCSH                   34         91%                   NA
All inactive variants  98         76%                   NA

We then divided the 18 variable positions into 7 blocks to study the regional characteristics of the AREs and to mitigate the limited number of available mutations and sequences. If a block contains at least one mutation, we categorize it as mutant. No important blocks could be identified other than regions B6 and B7 in the GSTYA ARE variants (Table S.7). Moreover, the results for the pooled ARE variants were biased by the GSTYA set, which has more sequences than the other active AREs.
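The block encoding described above can be illustrated with a short Python sketch. It assumes, as the variant tables in Appendix A appear to do, that mutated bases are written in lower case. The block boundaries in `BLOCKS`, the choice of masked blocks, and the helper name `block_features` are hypothetical, since the exact block coordinates are not listed here.

```python
# Hypothetical 0-based, end-exclusive block boundaries for a 25-bp variant.
# B3 and B5 stand in for the two masked core blocks; the real coordinates
# used in the analysis may differ.
BLOCKS = {
    "B1": (0, 5),
    "B2": (5, 9),
    "B3": (9, 13),   # core block (masked)
    "B4": (13, 16),  # nnn region
    "B5": (16, 18),  # core block (masked)
    "B6": (18, 21),
    "B7": (21, 25),
}
MASKED = {"B3", "B5"}


def block_features(variant, min_mutations=1):
    """Label each unmasked block MT or WT from lower-case (mutated) bases."""
    features = {}
    for name, (start, end) in BLOCKS.items():
        if name in MASKED:
            continue
        mutations = sum(base.islower() for base in variant[start:end])
        features[name] = "MT" if mutations >= min_mutations else "WT"
    return features


variant = "TGCTAATGGTGACtAtGCAAtTTTC"  # a GST-YA variant from Table S.1
print(block_features(variant))
print(block_features(variant, min_mutations=2))
```

The `min_mutations` parameter covers both block definitions used in this appendix: a block counted as mutant with at least one mutation, or only with two or more.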
Table S.7 Decision tree classification with at least one mutation in block 4
ARE                    Instances  Correctly classified  Important positions
GSTYA                  58         44%                   B6, B7
GSTA3                  37         75%                   NA
NQO1                   27         63%                   NA
All active variants    122        61%                   B6, B1
NQO2                   64         64%                   NA
GCSH                   34         91%                   NA
All inactive variants  98         75%                   NA

Based on the assumption that a deleterious effect on protein:DNA binding may be observed at degenerate positions in a binding site because of interdependence between proximal positions, we then required two or more mutations within a block for it to be counted as mutant. The decision tree classified the mutant blocks as decrease or no decrease with 75% precision, 7% sensitivity, and 99% specificity for the active ARE variants. The sensitivity was low because, under this definition of a mutant block, many sequences became wild-type in all blocks, leaving very few sequences with mutations. This was compounded by the fact that there were on average 1 to 3 base mutations per sequence, so the likelihood of having multiple mutations in proximity to each other is low. Thus, to better characterize the effects of mutations, sequences containing only wild-type blocks were filtered out, which increased the sensitivity to 33% while the specificity remained high at 94%. From the decision tree analysis, block four, comprising the nnn region of the ARE consensus, was identified as the region with the greatest effect on basal expression when mutated (Figure S.1, Table S.8). The decision tree data support the single-base mutational analysis in that the nnn region is important under basal conditions. Block 7, which consists of nucleotides 3' of the GC dinucleotide core, was also shown to have an effect under basal conditions when mutated. However, because the GSTYA set biases the pooled ARE sequences, more targeted mutational experiments on these regions will be necessary to elucidate their importance.
No ARE block was found to be important for the inactive ARE-like sequences.

Figure S.1 Decision tree for classifying mutation blocks in the ARE sequence that contribute to decrease in expression
[Figure: decision tree with branches labelled WT and MT and leaves Decrease (3.0), Decrease (4.0/1.0), and No decrease (115.0/34.0)]

Table S.8 Decision tree classification with two or more mutations in block 4

ARE                     Instances   Correctly classified   Important positions
All active variants     122         68%                    B4, B7
All inactive variants   98          88%                    NA

A Fisher's exact test was performed to examine the significance of the association between a decrease in expression and the presence of at least one mutation in the Block 4 nnn region, comparing sequences that carry at least one mutation in this region with sequences that have none. The p-value was 0.111 for the active AREs (Table S.9) and 0.356 for the inactive AREs (Table S.10). Although the p-value for the active AREs did not meet the conventional significance level of 0.05, the trend concurred with the single-base mutational results as well as the decision tree analysis, providing motivation for future experiments that further mutate these regions in order to elucidate their function.

Table S.9 Fisher's exact test on the active ARE sequences classified on mutation in Block 4 and change in expression

                             Mutation in Block 4   No mutation in Block 4
Decrease in expression       18                    25
No decrease in expression    23                    56

Table S.10 Fisher's exact test on the inactive ARE sequences classified on mutation in Block 4 and change in expression

                             Mutation in Block 4   No mutation in Block 4
Decrease in expression       3                     6
No decrease in expression    41                    48

Appendix D: Experimental Protocol

Optimized Procedure for Spiked Oligonucleotide Synthesis
Aim: To generate double-stranded DNA from single-stranded template of spiked oligos with ARE primer 2
Method:
1.
Pipette the reagents in the order given into a 1.5 ml Eppendorf tube.

Reagent             Volume (µl)
10x NEB Buffer 2    15
100 µM ssDNA        6
10 µM primer        3
10 mM dNTP          3
dddH2O              120
Klenow (exo-)       3

2. Incubate at 37 °C for 4 h
3. Heat inactivation step is omitted to prevent strand separation of short double-stranded fragments

DNA purification
Aim: Purify double-stranded ARE oligos from unincorporated dNTPs, unbound primers, and enzyme
Materials: Qiagen Nucleotide Removal Kit
Method:
1. Add 1000 µl of Buffer PN to the mix containing double-stranded spiked oligos and mix
2. Transfer 575 µl into spin column
3. Centrifuge sample at room temperature (RT) at 6000 rpm for 1 min
4. Discard supernatant
5. Repeat steps 2-4
6. Add 750 µl Buffer PE to wash the column
7. Centrifuge spin column for 1 min
8. Discard PE buffer
9. Spin column for another minute to remove residual PE buffer
10. Elute oligos with 50 µl of dddH2O or Buffer EB. Let column stand at RT for 1 min
11. Centrifuge for 1 min at 13,000 rpm

Restriction Enzyme Digest
Aim: To generate compatible ends on the pGL3-promoter plasmid and oligos for cloning
Method:
1. Pipette reagents in the following order

Reagent (µl)        pGL3-promoter   pGL3-m25133 (positive control)   ARE
10x NEB buffer 3    3               3                                7
dddH2O              19              14                               9
DNA                 5               10                               50
Mlu I               1.5             1.5                              2
Bgl II              1.5             1.5                              2

2. Incubate reactions at 37 °C overnight
3. Post-RE digestion
a) pGL3-pro plasmid
   i. incubate at 65 °C for 10 min to inactivate restriction enzymes
   ii. add 1 µl phosphatase to dephosphorylate the 5' ends of the plasmid
   iii. incubate at 37 °C for 30 min
   iv. incubate at 65 °C for 10 min to inactivate enzyme
   v. gel purification
b) pGL3-m25133
   i. incubate at 65 °C for 10 min
   ii. gel purification
c) ARE oligos
   i. column purification using Qiagen Nucleotide Removal Kit
   ii. add 500 µl PN buffer to reaction
   iii. follow instructions from the kit
   iv. elute with 40 µl dddH2O

Ligation
Aim: To ligate ARE oligos into pGL3-promoter plasmid
Method:
1.
Add reagents in the given order into 1.5 ml Eppendorf tubes

Reagent (µl)         pGL3-promoter (negative control)   pGL3-m25133 (positive control)   ARE
10x ligase buffer    2                                  2                                9
Vector               2                                  2                                4
Insert DNA           0                                  4                                30
dddH2O               15                                 11                               0
T4 DNA ligase        1                                  1                                2

2. Incubate reactions on ice for 4 h

Transformation
Aim: To replicate the recombinant clone many times to provide material for the analysis
Method:
1. Add 10 µl of ligation sample to a 50 µl aliquot of DH5α E. coli
2. Incubate sample on ice for 30 min
3. Heat-shock sample at 42 °C for 45 sec
4. Add 440 µl LB
5. Shake sample at 37 °C for 30 min
6. Plate 200 µl of sample on Amp-LB plate

Appendix E: High-throughput Transfection

Aim: Transfecting variants of 5 known ARE and ARE-like sequences (96/ARE) in duplicate
Total # of wells = 4 x 96 x 2 x 2 = 1536 (16 plates)

Day 1: Seeding
Materials:
media: MEMalpha w/ 10% FBS and no antibiotics
p1000 & 1 box of tips
pre-warmed PBS, trypsin-EDTA, full media, full media w/o Ab
16 96-well tissue-culture plates
hemocytometer, sterile solution basins, 15 ml Falcon tubes, flask

5000 cells/well
1536 wells x 5000 cells/well = 7,680,000 cells
Prepare MM for 1650 x 5000 = 8,250,000 cells in 165 ml media
Plate labelling: (A) GSTYA-D1, D2, T1, T2; (B) GSTA3; (C) NQO2; (D) GCSH/NQO1
Dispense 100 µl cell suspension/well

Day 2: Transfection
Master Mix (MM)
Materials:
4 pre-plated plasmid (4.5x MM x 200 ng = 900 ng/well; 9 µl from 100 ng/µl stock)
96 x 4 x 4.5 = 1728; prepare 1850x MM
OPTI-MEM: 25 x 1850 = 46250 µl = 46.3 ml (x2)
Lipofectamine 2000: 0.25 x 1850 = 462.5 µl
phRL-TK: 20 ng / 490 ng/µl x 1850 = 75.5 µl in 46.3 ml OPTI-MEM
Tips: 8 boxes of p200, 1 box of p1000

Plasmid Dilution
Add 112.5 µl of phRL-containing OPTI-MEM to each well

Lipofectamine 2000 Dilution
Add 463 µl of Lipofectamine 2000 to 46.3 ml OPTI-MEM
Incubate solution for 5 min
Dispense 112.5 µl to diluted plasmid solution
Incubate solution for 20 min
Take 2 master mixes & 8 plates of cells
Add 50 µl of solution per plate (multi-dispense 4 x 50 µl into 4
plates)
Incubate for 24 h
Total of 4 replicates: 2 for DMSO and 2 for tBHQ treatment

Day 3: Treatment
Materials:
1 box of p1000 tips
pre-warmed full media
media: 170 ml of full media
DMSO: 80 µl in 85 ml media
100 µM tBHQ: 80 µl in 85 ml media

Remove media from transfected cells
Add 100 µl media containing DMSO to 8 plates
Add 100 µl media containing tBHQ to 8 plates
Do 4 plates from the same batch at a time (order: GSTYA, GSTA3, NQO2, GCSH/NQO1)
Incubate for 18 h

Day 4: Reporter gene assay
Materials:
Victor plate reader (set up injectors)
80 ml PBS
50 ml of 70% EtOH, and ddH2O
40 ml 1x PLB (Dilution: 8 ml of 5x PLB in 32 ml ddH2O)
LAR reagent (thaw to RT)
S&G reagent (thaw to RT)
Shaker

Aspirate media (all 16 plates)
Add 50 µl PBS & rinse (all 16 plates)
Aspirate PBS (do 4 plates from the same batch at a time)
Add 20 µl 1x PLB
Shake 15 min
LucAssay 96-well strip
Read BG
Add 50 µl LAR reagent
Read firefly signal
Add 50 µl S&G reagent
Read renilla signal
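The well, cell, and master-mix quantities listed in this appendix can be re-derived with a short calculation, which is convenient when the number of plates changes. The Python sketch below is only an illustration: the function names are hypothetical, the per-reaction amounts are the ones given in the protocol, and the reading of the factors in "4 x 96 x 2 x 2" as plates, wells, treatments, and replicates is an assumption.

```python
from typing import Dict, Tuple

WELLS_PER_PLATE = 96


def total_wells(variant_plates: int = 4, treatments: int = 2, replicates: int = 2) -> int:
    """Wells needed: variant plates x 96 wells x treatments x replicates."""
    return variant_plates * WELLS_PER_PLATE * treatments * replicates


def seeding(wells_to_prepare: int, cells_per_well: int = 5000,
            volume_per_well_ul: float = 100.0) -> Tuple[int, float]:
    """Return (total cells, total media in ml) for the cell suspension."""
    cells = wells_to_prepare * cells_per_well
    media_ml = wells_to_prepare * volume_per_well_ul / 1000.0
    return cells, media_ml


def master_mix(reactions: int, optimem_ul: float = 25.0, lipofectamine_ul: float = 0.25,
               phrl_ng: float = 20.0, phrl_stock_ng_per_ul: float = 490.0) -> Dict[str, float]:
    """Scale the per-reaction amounts to the requested number of reactions."""
    return {
        "optimem_ul": optimem_ul * reactions,
        "lipofectamine_ul": lipofectamine_ul * reactions,
        "phrl_tk_ul": phrl_ng / phrl_stock_ng_per_ul * reactions,
    }


wells = total_wells()
cells, media_ml = seeding(1650)   # 1650 wells prepared to cover 1536 with excess
mix = master_mix(1850)            # 1850 reactions prepared to cover 1728 with excess

print(wells, wells // WELLS_PER_PLATE)
print(cells, media_ml)
for reagent, volume in mix.items():
    print(reagent, round(volume, 1))
```

Keeping the excess (1650 wells rather than 1536, 1850 reactions rather than 1728) as explicit arguments makes it easy to adjust the margin for pipetting loss without recomputing each reagent by hand.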

