UBC Theses and Dissertations

Universal Sequence Tag Array (U-STAR) platform : strategies towards the development of a universal platform… So, Austin Pierre 2008

Universal Sequence Tag Array (U-STAR) platform
Strategies towards the development of a universal platform for the absolute quantification of gene expression on a global scale

by
Austin Pierre So
B.Sc., The University of Waterloo, 1992
M.Sc., The University of Toronto, 1995

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate Studies (Interdisciplinary Studies)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
October 2008

© Austin Pierre So, 2008

ABSTRACT
The advent of technologies specifically designed to capture glimpses of gene expression on a systems-wide scale has led to a revolution in our understanding of cellular dynamics, identifying the contributions and interactions of families of genes involved in cell development, dysfunction, and death. Broadly classified into count-based “digital” or signal-based “analogue” approaches, these technologies have permitted “portraits” of the transcriptome to be generated through comparative measurements of gene expression, enabling, for example, the generation of qualitative models of disease. However, truly predictive models of cellular function that can enhance our ability to discover new pharmaceuticals, detect and monitor disease, evaluate treatments, and ultimately, predict and prevent illness, require platforms that can provide detailed and accurate measurements of transcript abundances on an absolute scale. Unfortunately, inherent limitations preclude these technologies from providing this level of quantitative information. This thesis examines design issues associated with a popular digital approach to transcriptomics, serial analysis of gene expression (SAGE), that diminish its utility as a tool for absolute transcriptomics.
Careful analysis of the processing steps involved in converting the starting mRNA population into short sequence tags (SSTs) and subsequently into a format amenable to interrogation via sequencing technology reveals the introduction of strong biases and artifacts that limit reproducible abundance measurements in SAGE to transcripts present within the highest 2 orders of magnitude in the original sample. As a large number of steps are involved in formatting SSTs for analysis via sequencing, an alternative strategy is presented that utilizes a microarray-based analogue approach for the interrogation of SSTs. Termed the Universal Sequence Tag Array (U-STAR) platform, this platform is able to provide accurate quantitative measurements over a 3-decade range of transcript abundances by obviating sources of bias associated with both SAGE and DNA microarray technologies and eliminating the requirement for material amplification. The SAGE-like utilization of STARtags also endows this technology with the potential to be used as a universal platform, eliminating costs associated with custom microarray construction, and minimizing those associated with the preparation and large-scale sequencing of SAGE libraries. Considerations for future development are also discussed for the U-STAR platform.

TABLE OF CONTENTS
ABSTRACT  ii
TABLE OF CONTENTS  iii
LIST OF TABLES  vii
LIST OF FIGURES  viii
LIST OF ABBREVIATIONS  xv
ACKNOWLEDGEMENTS  xviii
DEDICATION  xix
CO-AUTHORSHIP STATEMENT

CHAPTER 1 Introduction, background information and thesis objectives  1
1.1. INTRODUCTION  2
1.2. DIGITAL BASED APPROACHES TO GENE EXPRESSION ANALYSIS  3
1.2.1. Overview of SAGE technology  4
1.2.2. Current limitations in SAGE technology  5
1.2.2.1. Sampling-based bias  5
1.2.2.2. Limits in transcriptome coverage  6
1.2.2.3. Sequence-based artifacts  7
1.2.3. Advances in digital-based transcriptomics technologies  8
1.2.3.1. Formation of spatially segregated clonal populations  9
1.2.3.2. Sequencing in real-time  11
1.3. ANALOGUE APPROACHES TO GENE EXPRESSION ANALYSIS  13
1.3.1. Overview of DNA microarray technology  14
1.3.1.1. Spotted microarrays  14
1.3.1.2. De novo synthesized microarrays  15
1.3.1.3. Sample preparation for microarray technologies  16
1.3.2. Current limitations of microarray technologies  17
1.3.2.1. Limitations introduced by target and probe chemistries  17
1.3.2.2. Limitations introduced by sample preparation  18
1.3.3. Advances in analogue-based transcriptomics technologies  20
1.3.3.1. Genome tiling arrays  20
1.3.3.2. Bead arrays  21
1.3.3.3. Universal k-mer library arrays  22
1.3.3.4. Locked nucleic acid (LNA) arrays  24
1.4. THESIS OBJECTIVES  26
1.5. U-STAR - THE UNIVERSAL SEQUENCE TAG ARRAY - A BRIEF CONCEPTUAL OVERVIEW  28
1.6. REFERENCES  32

CHAPTER 2 Increasing the efficiency of SAGE adaptor ligation by directed ligation chemistry  47
2.1. INTRODUCTION  48
2.2. MATERIALS AND METHODS  50
2.2.1. Enzymes and constructs  50
2.2.2. Preparation of 3’-end anchored cDNA  50
2.2.3. Adaptors  51
2.2.4. Standard ligation protocol used in SAGE  51
2.2.5. Directed ligation  51
2.2.5.1. Titration of T4 DNA ligase activity with NlaIII  51
2.2.5.2. Directed ligation protocol for SAGE  52
2.2.6. Analysis of anchored ligation products  52
2.2.7. Preparation and PCR amplification of ditags  52
2.3. RESULTS AND DISCUSSION  53
2.3.1. Self-ligation of the anchored 3’-end cDNA competes with ligation of the adaptor  54
2.3.2. Addition of macromolecular crowding agents increases the yield of adaptor modified anchored 3’-end cDNA  55
2.3.3. Product distribution can be directed through the introduction of a restriction enzyme into the ligation reaction - Directed ligation chemistry  57
2.3.4. Near-complete conversion of anchored 3’-end cDNA to adaptor modified products via directed ligation chemistry  58
2.4. CONCLUSION  60
2.5. REFERENCES  71

CHAPTER 3 Minimizing loss of sequence information in SAGE ditags by modulating the temperature dependent 3’→5’ exonuclease activity of DNA polymerases on 3’-terminal isoheptyl amino groups  73
3.1. INTRODUCTION  74
3.2. MATERIALS AND METHODS  75
3.2.1. Linkers and oligonucleotides  75
3.2.2. Preparation of SAGE ditags from Cryptococcus neoformans  75
3.2.3. Model adaptor:tag (A:T) construction  76
3.2.4. Standard fill-in reactions  76
3.2.5. Ligation to form ditags  77
3.2.6. PCR amplification of ditags  77
3.2.7. Simulation of ditag formation under various amounts of 3’-IHA block relief  77
3.2.8. End-point activity of DNAPs  78
3.2.9. Visualization of fill-in and ligation products  78
3.3. RESULTS AND DISCUSSION  79
3.3.1. Improved yield of released SSTs via directed ligation chemistry reveals formation of HMW ligation products  79
3.3.2. HMW ligation products consist of +54 bp multimers of released A:Ts  79
3.3.3. KF removes the 3’-IHA from model A:Ts  80
3.3.4. 3’-IHA removal from and subsequent ligation of A:Ts reduces the performance of SAGE technology  81
3.3.5. 3’-IHA removal activity of DNA polymerases can be modulated by temperature  83
3.3.5.1. KF DNA polymerase  83
3.3.5.2. T4 DNA polymerase  84
3.3.5.3. Vent® DNA polymerase  85
3.3.6. 3’-IHA removal is minimized by the presence of phosphorothioate (Rp/Sp) linkages  85
3.3.7. Specific formation and amplification of 102 bp ditags using redesigned adaptors  86
3.4. CONCLUSION  87
3.5. REFERENCES  94

CHAPTER 4 Eliminating amplification bias in competitive templates through the use of a proofreading DNA polymerase - application to the amplification of SAGE ditags  96
4.1. INTRODUCTION  97
4.2. METHODS AND MATERIALS  99
4.2.1. Oligonucleotides  99
4.2.2. Taqman-based qPCR assays  99
4.2.3. Secondary structure and abundance dependent bias: definitions  100
4.2.4. Secondary structure and abundance dependent bias: simulations  103
4.2.5. Real-time assays for proofreading DNAPs  103
4.2.6. Preparation and concatenation of 26 bp ditags  104
4.3. RESULTS  105
4.3.1. Low abundance ditags in a matrix of competing ditags are poorly amplified by Taq DNAP  105
4.3.2. Abundance-dependent inhibition is absent when co-amplifying non-homologous templates  107
4.3.3. Impact of secondary-structure and abundance bias during the PCR of SAGE ditags  108
4.3.4. Correcting the Problem: Application of Taqman assays to proofreading DNA polymerases  110
4.3.5. SAGE library construction using a minimal amount of amplified ditag material  112
4.4. DISCUSSION  113
4.5. REFERENCES  124

CHAPTER 5 Progress towards the development of a universal microarray for quantitative gene expression analysis on an absolute per cell basis  127
5.1. INTRODUCTION  128
5.2. U-STAR: CONCEPT AND CRITICAL DESIGN FEATURES  130
5.3. METHODS AND MATERIALS  134
5.3.1. Oligonucleotides, enzymes and reagents  134
5.3.2. UV spectrometry measurements of duplex melting temperature (Tm)  135
5.3.3. Array preparation  135
5.3.4. Scanner evaluation  136
5.3.5. Calibration curve and performance of U-STAR array  136
5.3.6. Generation of STARtags  137
5.3.6.1. Sources of RNA used for this study  137
5.3.6.2. Generation of anchored cDNA  137
5.3.6.3. STARlink adaptor ligation via directed ligation chemistry  138
5.3.6.4. STARtag release and final preparation  138
5.3.6.5. Diagnostic of stepwise yields for STAR protocol  139
5.4. RESULTS AND DISCUSSION  139
5.4.1. Design and characterization of LNA/DNA mix-mer probe set  139
5.4.2. Sensitivity and specificity of prototype U-STAR array  141
5.4.3. Evaluation of scanning and analysis properties for maximal dynamic range  142
5.4.4. Construction of a universal calibration curve for U-STAR  143
5.4.5. Performance of U-STAR in absolute abundance determination for STARtag mixtures  144
5.4.5.1. Quantitative preparation and purification of STARtags from mRNA  145
5.4.5.2. Quantitative recovery of anchored cDNA synthesis products  145
5.4.5.3. Directed ligation chemistry for creation and release of STARtags  146
5.4.5.4. Application of the STAR platform to a mixture of polyadenylated RNA  147
5.5. CONCLUSION  147
5.7. REFERENCES  162

CHAPTER 6 Summary of findings, future considerations and conclusions  167
6.1. SUMMARY OF FINDINGS  168
6.2. CONSIDERATIONS FOR FUTURE DEVELOPMENT  170
6.2.1. Enhancing mismatch discrimination  170
6.2.2. Increasing detection sensitivity  172
6.2.3. Enabling robust interrogation of the transcriptome  173
6.3. CONCLUSIONS  174
6.4. REFERENCES  177

APPENDIX  180
A.1. ditag_formation.pl  181
A.2. ditag_PCR_bias.pl  190
A.3. Tm_calculate.pl  198

LIST OF TABLES
Table 2.1. Outline of the enzymatic, purification and isolation steps involved in the SAGE and microSAGE protocols (http://www.sagenet.org/protocol/index.htm)  61
Table 2.2. List of oligonucleotides used in this study to form various SAGE adaptors. Oligonucleotides were obtained gel-purified and verified by mass spectrometry  62
Table 2.3. List of methyl sensitive Type II restriction enzymes that generate overhangs suitable for directed ligation chemistry. This list was extracted from REBASE version 05/2003 (http://rebase.neb.com) and corresponds to commercially available enzymes which have well-characterized methylation sensitivities  63
Table 5.1. Costing of materials and reagents specific to the U-STAR platform. Costs are based on current list pricing for reagents. Costs associated with LNA synthesis and purification are based on fully substituted probes, and represent the maximum possible cost of synthesis  149
Table 5.2. Model RNA transcripts used in this study. List of RNA control spikes indicating sites for NlaIII cleavage, and corresponding sequence tags derived from 3’-end of each transcript  150
Table 5.3. Properties of LNA/DNA mix-mer probe set. Probes were designed using NN models to have a melting temperature of 65 °C at 50 µM strand concentration at 0.5xSSC.
Thermodynamic properties were then verified by UV-M analysis with complementary DNA oligonucleotides  151
Table 5.4. Properties of mismatched LNA/DNA mix-mer probes. Probes were designed using NN models to have a melting temperature of 65 °C at 50 µM strand concentration at 0.5xSSC with their perfect complements. Thermodynamic properties were assessed via UV-M analysis with complementary DNA oligonucleotides at 50 µM and 5 µM total strand concentrations  152
Table 6.1. List of commercially available Type II enzymes  176

LIST OF FIGURES
Figure 1.1. Outline of the SAGE protocol. A. Processing steps leading to the formation and release of short sequence tags (SAGE tags) from a starting mRNA sample. B. Released SAGE tags are then converted through a series of steps into concatenates for cloning into a sequencing library for analysis  30
Figure 1.2. Diagram and application of the U-STAR adaptor. Features of the U-STAR adaptor permit the quantitative conversion of the anchored cDNA population into the adaptor modified product under directed ligation chemistry (DLC). Ligation leads to the covalent attachment to the (+) strand of the cDNA, while the presence of the 5’-OH on the adaptor prevents ligation to the (-) strand  31
Figure 2.1. Ligation of SAGE adaptor 1A to anchored 3’-end cDNA. 100 ng of in vitro transcribed polyadenylated product was processed under the microSAGE protocol and split in half. Lane 2 shows a control reaction in which T4 DNA ligase was not added to the ligation mix. Lane 3 shows the formation of a small amount of the hetero-ligation product indicated by the arrow as well as a high molecular weight band corresponding to twice the molecular weight of the unligated cDNA. Ligations were performed as described in Methods and Materials  65
Figure 2.2. Influence of increasing adaptor:target molar ratios on the formation of adaptor-target heterodimer versus target homodimer.
Increasing amounts of adaptor 1 (0-3.8 µM final) were introduced into standard ligation reactions containing 0.075 pmol anchored target in a final volume of 10 µl as described in Methods and Materials. In microSAGE, adaptors are introduced to a reaction mixture containing ~0.08 pmol anchored target at a final concentration of 0.08 µM in a total volume of 20 µl, corresponding to an adaptor:target ratio of approximately 20:1. The classic SAGE protocol introduces a final adaptor concentration of 0.8 µM to the ligation mixture containing 1.95 pmol anchored target in a total volume of 40 µl, corresponding to an adaptor:target ratio of approximately 16:1  66
Figure 2.3. Influence of supplemental PEG-8000 and incubation temperature on the formation of adaptor-target heterodimer versus target homodimer. The standard ligation reaction in the microSAGE protocol is performed in the presence of 5% PEG-8000 (w/v) at 16 °C for 2 hrs using a final adaptor concentration of 0.08 µM in a final volume of 20 µl. Ligation reactions shown were performed in a final volume of 10 µl as described in Methods and Materials using an adaptor concentration of 1 µM final in the presence or absence of PEG-8000 supplemented to a final concentration of 15% (w/v). Reactions were carried out for 2 hrs under the conditions indicated  67
Figure 2.4. Outline of directed ligation. A. Ligation of unmethylated adaptors (black) results in the formation of a mixture of adaptor homodimers, target homodimers, and the adaptor-target heterodimer. In the presence of NlaIII, ligated products are converted back to their respective monomers. The final product distribution is determined by the relative rates of ligation by T4 DNA ligase and digestion by NlaIII. B. In contrast, using an adaptor with a methylated base (N6-methyl-deoxy-adenosine) within the site of ligation blocks digestion of the adaptor-target heterodimer, and product distribution is favoured towards the formation of the adaptor-target heterodimer.
Titrations of T4 DNA ligase with increasing quantities of NlaIII were performed in the presence of 1 µM adaptor in a final volume of 10 µl for 2 hrs at 37 °C as described in Methods and Materials  68
Figure 2.5. Comparison of ligation under the SAGE protocol versus under directed ligation chemistry. Ligation reactions were performed in 5% PEG-8000 (w/v) in the presence or absence of NlaIII, using standard SAGE adaptors (adaptor 1) or modified SAGE adaptors with an N6-methyl-deoxy-adenosine base (adaptor 1m6A) at a final concentration of 1 µM. Reactions were performed at a final reaction volume of 10 µl and incubated as described in Methods and Materials  69
Figure 2.6. PCR amplification of ditags derived from adaptor-modified anchored 3’-end cDNA obtained using the microSAGE protocol or directed ligation chemistry. After ligation under the microSAGE protocol using ~10-fold greater amount of adaptors 1 and 2 under standard conditions (lanes 3-5, 9-11) or using adaptors 1m6A and 2m6A using directed ligation chemistry (lanes 6-8, 12-14), tags were released with BsmFI, blunt-ended with Klenow and ligated to form ditags as described under the microSAGE protocol version 1e in the presence (lanes 5, 8, 11, 14) or absence (lanes 4, 7, 10, 13) of added PEG-8000 (15% w/v final). Following ligation, mixtures were diluted 1:20 (lanes 3-8) or 1:200 in LoTE (lanes 9-14) and 1 µl was used as a template for PCR amplification as described in Methods and Materials  70
Figure 3.1. Schematic outline of the steps in the SAGE protocol following the release of SSTs and prior to concatemer formation. A. Fill-in with KF leads to blunt-ending of the (+) strand of the released SST. Upon ligation to form ditags, the presence of the 3’-amino group and the lack of a 5’-phosphate blocks participation of the adaptor end of the adaptor:tag (A:T) in the ligation reaction, resulting in the specific formation of a 102 bp ditag.
Subsequent PCR amplification therefore results in the specific amplification of the 102 bp product. Following NlaIII digestion, SAGE adaptors (i) and 26 bp products (ii) are recovered. B. Unwanted release of the 3’-amino group by KF reintroduces the adaptor end of the A:T as a potential site for ligation. The ligation reaction results in the formation of 102 + 54i bp multimers that participate in the PCR amplification reaction (see text for details). Subsequent digestion of these amplification products with NlaIII releases (iii) “tag-adaptor” constructs flanked by NlaIII sites in addition to (i) SAGE adaptors and (ii) 26 bp ditags that can participate in the concatenation reaction  88
Figure 3.2. HMW amplification products result from ligation products generated during ditag formation. A. PCR amplification products from serial dilutions of the ditag ligation reaction (lanes 1 & 6: no ligase [control], lanes 2 & 7: 1:20, lanes 3 & 8: 1:400, lanes 4 & 9: 1:8000, and lanes 5 & 10: 1:160000 dilutions in LoTE). Ditag ligation products were derived from released SSTs via the ligation of standard SAGE adaptors (SAGE) under the standard SAGE protocol (lanes 1-5) or of methylated SAGE adaptors (m6A) under directed ligation chemistry (lanes 6-10) to an anchored 3’-end cDNA library obtained from C. neoformans strain H99. B. Analysis of ditag ligation products prior to PCR amplification. Released SSTs obtained via standard microSAGE (lanes 1 & 2: SAGE) or directed ligation (lanes 3 & 4: m6A) protocols and ligated in the presence (lanes 1 & 3) or absence (lanes 2 & 4) of T4 DNA ligase after fill-in with KF. HMW bands corresponding to ligation products are indicated by arrows  89
Figure 3.3. Discrete HMW products are formed during ligation of synthetic tags and are a result of KF exonucleolytic activity. A.
Ligation products from model adaptor:tag constructs after fill-in with KF and prior to PCR amplification were analysed via densitometry to quantify the amounts of the mono-, di-, tri-, and tetrameric ligation products obtained in the presence of T4 DNA ligase (lane 2) relative to the amount of unligated starting material (lane 1). B. Truncated templates were labelled with [32P] on either the (+) strand or the (-) strand to monitor the activity of KF under standard fill-in conditions (30 min at 37 °C). Incubation in the presence of KF leads to a shift of a population (~55%) of the (-) strand by +3 bases (lane 3) compared to incubation in the absence of KF (lane 1), suggesting removal of the 3’-amino blocking group and subsequent fill-in of this strand. Concomitantly, the (+) strand (lane 2) is shifted +4 bases (lane 4) in the presence of KF  90
Figure 3.4. Recoverable sequence information within purified ditags under conditions that allow HMW product formation. Ad hoc canonical Monte Carlo simulations of the ligation reaction involving theoretical adaptor:tags were used to determine size distributions of ligation products and extract recoverable sequence information under varying values of f and P_lig (see text and Appendix A.1. for details). The amount of recoverable information within resulting ligation products as a fraction of the original pool of A:Ts is plotted as a function of both f and P_lig. Densitometric data obtained for the fill-in reaction (see figure 3.2.B) using our model tag system indicated that f = 0.57 under standard fill-in conditions. Under these conditions, 45% of the sequence information in released SSTs is predicted to be lost from the purified ditag population  91
Figure 3.5. Temperature dependence of DNA polymerase activity and impact on HMW product formation.
Comparison of the activity of DNA polymerases on model templates with 3’-terminal phosphodiester (panels A-C) or phosphorothioester (panels D-F) linkages in the (-) strand at various incubation temperatures. Incubation of truncated SAGE adaptor 1 with KF (A. & D.), T4 DNAP (B. & E.) or with Vent® DNAP (C. & F.) reveals that (i) the extension of the (+) strand (upper) and 3’-isoheptyl amine removal activity with subsequent extension of the (-) strand (lower) can be modulated by incubation temperature. Ligation of model A:Ts following fill-in with the various DNA pols at the examined incubation temperatures reveals that (ii) the extent of HMW product formation correlates with the extent of removal and extension of the (-) strand in the truncated template. Introduction of phosphorothioate linkages in the (-) strand of both templates dramatically inhibits the processing of the (-) strand by the various polymerases, demonstrating that inhibition of 3’-isoheptyl amine removal leads to a corresponding reduction in HMW product formation. Lane M: 20 bp marker, lane C: no polymerase, lane 1: ~4 °C (on ice), lane 2: 12 °C, lane 3: 25 °C, lane 4: 37 °C, lane 5: 50 °C  92
Figure 3.6. Comparison of ligation products obtained under standard SAGE versus modified SAGE applied to 10 µg C. neoformans total RNA. A. Reaction products of released SSTs obtained in the presence (lanes 2, 4, 6, & 8) or absence (lanes 1, 3, 5, & 7) of T4 DNA ligase after fill-in with T4 DNAP at 4 °C, KF DNAP at 50 °C and Vent® DNAP at 12 °C using methylated SAGE adaptors with two 3’-terminal phosphorothioate linkages on the (-) strand. For comparison, ligation products of released SSTs obtained under the standard SAGE protocol are also shown. B.
Comparison of PCR amplification products obtained under the SAGE protocol (lanes 1-5) versus those obtained using our modified protocol using Vent® DNAP for fill-in (lanes 6-10), indicating the specificity of 102 bp template formation in the modified protocol and an absence of non-specific amplification products. PCR products were obtained from 1 µl of serial dilutions (lanes 1 & 6: no ligase [control], lanes 2 & 7: 1/20, lanes 3 & 8: 1/400, lanes 4 & 9: 1/8000, and lanes 5 & 10: 1/160000 dilutions in LoTE). C. One-step purification and subsequent digestion with NlaIII of PCR products. PCR reactions were pooled (lane 1) and purified using a commercially available PCR purification kit, yielding a single 102 bp band (lane 2). Subsequent digestion with NlaIII yielded a 42-44 bp doublet corresponding to released SAGE adaptors, and a 26 bp product corresponding to released ditags (lane 3)  93
Figure 4.1. Amplification profiles of synthetic templates within a background of ditags derived from C. neoformans. Each synthetic ditag construct was designed to mimic the structure of a typical SAGE ditag where two different adaptor sites containing priming sites (P1 and P2) flank an internal sequence corresponding to an SST pair. Corresponding probes are shown (uppercase: DNA; lowercase: LNA). Synthetic templates were either amplified on their own (black lines), or introduced into a 1/400 dilution of ditags generated from C. neoformans and amplified (color lines) under standard SAGE protocols using Platinum Taq. Amplification sets represent 10-fold serial dilutions of the synthetic ditag template in the presence or absence of background C. neoformans ditags. A. 50GC_5050 B. 50GC_2080 C. 21GC D. 79GC_5050  117
Figure 4.2. Impact of competitive and non-competitive templates in PCR amplification with Platinum Taq. A.
Amplification profiles of serial dilutions of 50GC_8020 synthetic ditag (100 fM to 0.1 fM) in the presence (red) or absence (black) of a background of 50GC_5050 synthetic ditag (100 fM). When the ratio of interrogated template to background template falls below unity, suppression of amplification is observed. B. Amplification profiles of serial dilutions of 50GC_8020 (100 fM to 0.1 fM) in the presence (red) or absence (black) of a background of BCL2 (100 fM). In contrast to the presence of a homologous competing template, suppression of amplification is not observed. C. Relationship between the doubling rate D, (1 + E/100), and the fractional abundance (%) of 50GC_2080 in the presence of a background of 50GC_5050 template  118
Figure 4.3. Structural diversity of PCR templates generated from ditag species. Upon blunt-end ligation of two populations of released adaptor(A/B)-tag(1/2) constructs, ditags with a general adaptor-tag-tag-adaptor configuration are formed. These ditags are then introduced as a template for the PCR and, following the melting cycle, yield single-stranded species capable of forming numerous secondary structures during the annealing cycle. The potential for self-complementarity of the single-stranded templates derived from each ditag enables one to classify ditags into four distinct structural types, three of which (types 2 to 4) inhibit the amplification reaction through formation of single-stranded hairpins. SAGE analysis is therefore reliant on the selective amplification of type 1 templates, which can lead to amplification bias as described in the text  119
Figure 4.4. Impact of ditag formation and PCR on the outcome of SAGE analysis. Virtual transcriptomes were constructed from each of 32 human MPSS libraries, from which the ditag formation and PCR steps of standard SAGE were simulated to assess the impact of secondary-structure and template-abundance biases on the fidelity of SAGE analysis.
The simulation results for 4 libraries are highlighted (black: human bone marrow; red: human pancreas; green: human spleen; blue: human thymus). A. Post-PCR structure-dependent bias arising from the preferential segregation of highest abundance tags into non-amplifiable homo-ditag species, leading to an under-representation of high-abundance tags in the amplification product. B. The extent of secondary-structure bias is relatively minor (no more than 20% under-representation) and is restricted to the one or two highest abundance tags. C. Abundance-dependent bias arising from the formation of intrastrand annealing complexes during the PCR, leading to a strong inhibition in the amplification of low abundance ditags. D. Ditag abundance dependence bias results in a severe under-representation of lower abundance tags  120
Figure 4.5. Optimization of Pfu and PfuUltraII activity for minimizing abundance-based bias during amplification of ditags. Heat map showing 1/ΔCT values for the amplification of 50GC_2080 ditag in the presence (versus the absence) of 1000-fold excess of 50GC_5050 ditag as a function of primer and nucleotide concentrations. Darker values represent conditions where ΔCT is minimized, and thus 1/ΔCT is maximized. Amplification conditions that maximize 1/ΔCT lead to the full preservation of relative tag frequencies over 4 to 5 orders of magnitude of original template abundance, but proceed at an amplification efficiency E of less than 100%. A. Pfu. B. PfuUltraII  121
Figure 4.6. Optimization of PCR conditions for minimized abundance-dependent inhibition and maximized product yield. Ditag amplification products from PCR reactions utilizing PfuUltraII and various amounts of 50GC_2080 template within a background of 0.1 pM 50GC_2080 template were resolved on a 2.2% agarose gel to determine product yields and amplification performance as a function of nucleotide and primer concentrations.
Reaction conditions that preserve relative tag frequencies over 5 orders of magnitude do not yield sufficient ditag amplification product to be observable on the gel. Amplification product (20 cycles) is detectable at higher dNTP and primer concentrations, with an associated decrease in the range over which relative tag frequencies are preserved. Thus, increasing dNTP concentration (A→C, B→D) or primer concentration (A→B, C→D) improves yield but reduces performance  122
Figure 4.7. New protocol for purification and end-point concatenation of 26 bp ditags recovered from 102 bp amplification products. A. Starting material of purified 102 bp amplification products. B. Post-NlaIII digestion for 1 hr at 37 °C yields 26 bp product mixture and regenerated adaptors. C. Incubation of NlaIII digest with streptavidin leads to a shift in the mobility of biotinylated DNA, leading to a clear separation of 26 bp ditags from biotinylated adaptor moieties and partially digested biotinylated products. D. Filtration of digests in the presence of streptavidin through micropure EZ columns yields highly purified 26 bp ditags. E. End-point concatenation of recovered 26 bp products yields concatenates of ~1.5 kb in average length (0.6 to 2 kb range), which can be cloned directly into dephosphorylated vector  123
Figure 5.1. Outline of STARtag generation and application. A. General features of the STARlink adaptor permit the generation of STARtags through 4 processing steps for immediate interrogation on a U-STAR hybridization array. B. Combination of multiple STARtag sets generated from different anchoring restriction enzymes increases transcriptome coverage. Putative STARtag sets generated from the transcriptome of S. cerevisiae using either NlaIII (5’-CATG-3’) or TaiI (5’-ACGT-3’) are shown schematically (left panel). Independently, the set of STARtags generated from either restriction enzyme (shown in red and green) can only cover < 80% of the transcriptome of S. cerevisiae as indicated by the colour map (right panel). In contrast, combination of the sequence information (yellow) from both STARtag sets enables more complete coverage of the transcriptome (typically greater than 95%)  153
Figure 5.2. General features of the STARarray for interrogating STARtags. Hybridization array consists of a combinatorial library of nonamers, as well as an optional 3’-end four base sequence corresponding to the restriction enzyme used to anchor the STARtag. The use of DNA as probes creates an array with a wide range of thermal stabilities, which requires the use of a hybridization temperature (Th) below 20 °C. This renders the GC-rich probes non-specific, as mismatched targets are able to hybridize. Programmed incorporation of LNAs into the probe sequences permits an increase in the melting temperature of PM duplexes to each probe to a common, uniform Tm  154
Figure 5.3. UV-M studies on melting thermodynamics of STARprobes. Melting thermogram of STARprobes P1 to P8 with their oligonucleotide complements (total concentration 115 µM) illustrating the ability to tune melting thermodynamics of a set of probes to a uniform melting window  155
Figure 5.4. Dynamic range, detection limit, and mismatch binding properties of prototype U-STAR array. A. PM register intensities for a dilution series of JOE_T1 and JOE_T7 targets hybridized at a temperature of T = 42 °C indicating the ability of the platform to capture as little as 10 fM of target. B. Hybridization of mismatched probe sequences to 10 nM JOE_T1 and 1 pM JOE_T7 at various T indicates a decrease in the relative amount of target hybridized to mismatch sequences compared to perfect matches with an increase in T. Mismatches within the probe sequence are indicated via the probe identifier, where, for example, “P7_2M59” indicates a probe complementary to target T7, with 2 mismatched bases at positions 5 and 9 along the probe sequence  156
Figure 5.5. Evaluation of scanning and analysis parameters for the Genepix 4200 AL scanner. Parameters to establish scanner settings providing maximal dynamic range were established using a Cy3 calibration slide as described in Methods. After image acquisition, TIFF files were analyzed with Genepix Pro 6.1 using manually aligned feature settings. Median (A) or total (B) fluorescence intensities were recorded for each fluorophore concentration at increasing PMT settings (100 V to 900 V, 50 V increments) and plotted as a function of fluorophore density (left panel). Pairwise comparisons of fluorescence measurements were used to establish probability cutoffs for significance (two-tailed Student’s t-test) in the detection of 2-fold differences in fluorophore intensity as a function of PMT settings (right panel)  157
Figure 5.6. Calibration curves for the U-STAR platform. A. Two-fold serial dilutions of JOE_T1, JOE_T4, and JOE_T7 under hybridization conditions minimizing mismatch hybridization (final T = 55 °C) were scanned under a range of PMT settings (left panel) to establish optimal scan settings. Calibration curves utilized a PMT setting of 300 V (right panel). B. The high uniformity across the target dilution series enables the creation of a universal calibration curve for the entire probeset. However, scanner properties limit the measurable dynamic range to ~3.5 orders of magnitude, between 1.6 pM and 5 nM, for the detection of 2-fold changes in target concentration (p < 0.001, Student’s t-test)  158
Figure 5.7. Performance of U-STAR in quantifying absolute STARtag concentrations. Various mixtures of synthetic STARtags were interrogated via hybridization onto prototype STARarrays, and target concentrations were determined from the universal calibration curve (Figure 5.6). Measured concentrations of A. T1 and B. T7 are shown in pink versus the original concentrations shown in purple  159
Figure 5.8. Analysis of step-wise yields obtained under the STARtag production protocol.
100 nanograms of a polyadenylated in vitro transcript derived from the a-factor gene (X65948) were processed through the STARtag production protocol. A. SYBR Green I staining of cDNA synthesis products reveals that Superscript II/III yield spurious double-stranded products. B. Post-Nla III digestion of anchored cDNA. C. Application of DLC to the ligation of STARlink adaptors yields ~40% of the desired adaptor-modified product. D. Subsequent digestion with the Type IIS enzyme Acu I leads to near-complete digestion of the adaptor-modified cDNA and release of the STARtag.

Figure 5.9. Overall performance of the U-STAR platform in absolute quantification of transcript abundances. Various mixtures (~400 ng) of polyadenylated RNA (ArrayControl spikes) were processed through the STAR protocol, and the resulting STARtags were hybridized onto prototype arrays. Strong correlation between the calculated concentration and the original concentration was obtained with the platform. 95% confidence intervals are shown in dark blue, with 95% prediction lines in light blue.

LIST OF ABBREVIATIONS

3D  three dimensional
A  adenine
A:T  Adaptor-Tag
ARE  anchoring restriction enzyme
aRNA  amplified RNA
AT  adenine-thymidine
BCL2  B-cell CLL/lymphoma 2 gene
BSA  bovine serum albumin
C  cytosine
CCD  charge-coupled device
cDNA  copy DNA
cRNA  copy RNA
CT  threshold cycle
CTP  cytidine triphosphate
dCTP  deoxycytidine triphosphate
DEPC  diethylpyrocarbonate
DLC  directed ligation chemistry
DLP  dual labelled probe
DMSO  dimethyl sulfoxide
DNA  deoxyribonucleic acid
DNAP  DNA polymerase
DNAP(B)  B-family DNAP
dNTP  deoxynucleoside triphosphate
dsDNA  double-stranded DNA
DTPA  diethylene triamine pentaacetic acid
dUTP  deoxyuridine triphosphate
Ea  amplification efficiency
EDTA  ethylene diamino tetra-acetic acid
ePCR  emulsion PCR
EST  expressed sequence tag
G  guanine
GC  guanine-cytosine
HMW  high molecular weight
HPLC  high performance liquid chromatography
HT  high throughput
IRA  isoheptyl amine
IVT  in vitro transcription
kb  kilobase
KF  Klenow fragment of E. coli DNA polymerase I
LNA  locked nucleic acid
MM  mismatched
MPSS  massively parallel signature sequencing
mRNA  messenger RNA
NN  nearest-neighbour
NS  non-specific
OH  hydroxyl
ORF  open reading frame
PAGE  polyacrylamide gel electrophoresis
PCR  polymerase chain reaction
PEG  polyethylene glycol
PM  perfectly matched
PMT  photomultiplier tube
PPi  pyrophosphate
qPCR  quantitative PCR
RE  restriction enzyme
RFU  relative fluorescence unit
RNA  ribonucleic acid
RNAP  RNA polymerase
RPLC  reverse phase HPLC
SAGE  Serial Analysis of Gene Expression
SBH  sequencing by hybridization
SBL  sequencing by ligation
SBS  sequencing by synthesis
SDS  sodium dodecyl sulfate
SPA  solid-phase amplification
ssDNA  single-stranded DNA
SST  short sequence tag
T  thymidine
T4  bacteriophage T4
T7  bacteriophage T7
TAR  transcriptional active regions
Tm  melting temperature
U-STAR  Universal Sequence Tag Array Platform
UTP  uridine triphosphate
UV  ultraviolet-visible

ACKNOWLEDGEMENTS

There are many people to whom I am indebted for the completion of this work, from whom I have gained inspiration, insight, and hopefully, a little bit of wisdom. First of all, I would like to thank the Interdisciplinary Studies Graduate programme, Rhodri Windsor-Lipscombe, John Beatty, and of course, Janice Matuatia, for creating and providing a home for me at the outset of my work, for providing me the opportunity to pursue my ideas and ideals, and finally, for helping me at the end when I needed it most. I am also grateful to Michael Smith, whose assembly of people comprising the Michael Smith Laboratories (formerly the Biotechnology Laboratories) has made every facet of this work possible.
Particular thanks go out to Robin Turner, Andre Marziali, John Hobbs, and especially James Kronstad, for their immeasurable help and openness at various stages of this work. If not for the helpful advice from Chi Yip Ho and his tolerance of my random e-mails, I would never have had the nerve to follow my dreams and find the people who ultimately enabled this thesis to happen. Most of all, however, I am forever grateful to Charles Haynes, who gave me the freedom and opportunity to explore, to stumble, and ultimately, to succeed, and who, throughout this process, has been absolutely unfailing in his support. I have learned so much from him and his perspective as a scientist and an engineer. As a person whom I respect and admire as a mentor, as a colleague, and someone whom I can call friend, my sincerest hope is that this work honours him as much as he has honoured me. Finally, I have to thank my family, my mother and father, for their grudging support along the path that I chose and my sister, who never doubted. I love them dearly. However, there is no one more important, no one to whom I owe more for the success of this work, than my partner, friend, wife, and mother to my children. For her patience, her grace, her love, her care, and above all, her understanding against all the trials and tribulations I have inflicted on her, for all that this work is worth, it remains worthless compared to her.

To my children Hannah, Akxande, and Natasha
For their excitement and wonder of the world around them
And to my grandmother
For waiting

CO-AUTHORSHIP STATEMENT

For the work presented in Chapters 2 through 5 inclusive, experimental design, performance of experiments, as well as data analyses and manuscript preparation were performed by the author of this thesis. Dr. Jennifer Bryan provided expertise in statistical analysis for the calculations utilized in Chapter 4. Drs.
Charles Haynes and Robin Turner helped in the editing and review of the manuscripts for each of these four chapters.

CHAPTER 1
Introduction, background information and thesis objectives

1.1. INTRODUCTION

Many disease states are thought to arise from the contributions and interactions of multiple genes in response to external stimuli, such as exposure to carcinogens, or to viral or bacterial infections. Recent advances in molecular biology and biochemical techniques are allowing genes and gene products associated with these states to be elucidated, helping to define sequence and structural motifs that contribute to the genesis and progression of disease. Much progress has been made in identifying inherited gene mutations as well as genes that are present or absent in particular cells or cellular states, helping to categorize gross aspects of their function. Biochemical studies combined with gene over-expression and knockout experiments have provided further details on the roles of particular genes in cellular function by serially pinpointing upstream and downstream elements with which they interact. This latter approach embraces the idea that the functional significance of gene products is not only related to their quantity in the cell, but also to how they interact and are strung together to form genetic and biochemical networks. Numerous technologies have been developed over the past decade to specifically examine gene expression on a systems-wide scale, with the greatest attention being given either to count-based “digital” approaches that analyse the transcriptome (i.e. the set of all expressed genes weighted by transcript abundance) via high-throughput sequencing, or to signal-based “analogue” methods that operate on a massively parallel scale using hybridization array technology.
This ability to perform high-throughput analysis of transcriptomes has led to a revolution in the study of genomic information and has permitted the functional characterization of genes (in some cases previously unknown) involved in the transitions between cellular states for a number of organisms. In yeast, global gene expression profiles during various metabolic shifts, the cell cycle 6,7, and sporulation 8,9 have been obtained, revealing and identifying gene clusters associated with these different cellular states 9-12, as well as identifying putative upstream regulatory elements controlling gene expression 6,12-16. Expression profiles in response to pharmacological and environmental agents, as well as those that contribute to virulence of 22-24 and the host's response to pathogenic microbes, have also been monitored, allowing the identification of putative therapeutic targets. More recently, expression profiles have been determined for a number of disease-related states 29-31 to generate portraits of gene expression 32-38 that can be used to distinguish between, for example, invasive and non-invasive tumours.

Despite these successes in elucidating previously unknown relationships of genes and gene families, as well as identifying gene expression signatures that correlate with the adoption of a particular phenotype, neither approach has demonstrated an ability to provide accurate measurements of transcript abundances on an absolute scale. Furthermore, challenges remain in establishing a gold standard that properly allows cross-platform correlation amongst and within both digital and analogue transcriptomics technologies. This is particularly important given that the successful development of predictive models of cellular behaviour requires unambiguous systems-wide analytical data, including absolute message numbers over time.
Regrettably, inherent weaknesses of current technologies limit their ability to provide truly quantitative snapshots of gene expression on a global scale. The expense and technical demands of current technologies further limit their use as a general, widely available tool.

This thesis first examines design issues associated with one of the most popular digital approaches to transcriptomics, serial analysis of gene expression (SAGE), that diminish its utility as a tool for absolute transcriptomics. A new approach to high-throughput analysis of global gene expression profiling is then described which overcomes many of the limitations of current transcriptomics technologies. The strength of this approach, which improves upon and converts SAGE into a robust and cost-effective microarray platform, is based on the use of locked nucleic acids (LNAs) as probe molecules and is referred to herein as the universal sequence tag array platform (U-STAR platform).

1.2. DIGITAL BASED APPROACHES TO GENE EXPRESSION ANALYSIS

Digital approaches to gene expression analysis identify and tally the frequency of individual transcripts through the application of sequencing-based technologies. Such sampling-based approaches often involve the conversion of the mRNA population (100 ng, or ~1 x 10^11 mRNA molecules of average 2 kilobases in length) into its corresponding cDNA, which is then introduced into a high copy plasmid vector to create a library that can be used to transform host bacteria. As each bacterium will preferentially select a single high copy vector upon transformation, individual plasmids bearing unique cDNA inserts can be segregated, and the resultant clones spatially isolated through low-density plating of transformants. This library of archived clones can then be amplified and propagated as needed for large-scale harvesting of plasmids, permitting the recovery of sufficient amounts of material for identification via sequencing technology.
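The correspondence quoted above between mRNA mass and molecule number can be checked with a short calculation. The following Python sketch is illustrative only: the average nucleotide mass of 340 g/mol and the function name are assumptions introduced here, not values taken from this thesis.

```python
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 340.0    # assumed average mass of one RNA nucleotide (illustrative)


def mrna_molecules(mass_ng, mean_length_nt=2000):
    """Estimate the number of mRNA molecules in a sample of the given mass.

    mass_ng        -- sample mass in nanograms
    mean_length_nt -- assumed average transcript length in nucleotides
    """
    grams = mass_ng * 1e-9
    molar_mass = mean_length_nt * NT_MASS_G_PER_MOL  # g/mol for an average transcript
    return grams / molar_mass * AVOGADRO


if __name__ == "__main__":
    # 100 ng of 2 kb transcripts comes out on the order of 10^11 molecules.
    print(f"{mrna_molecules(100):.1e}")
```

Under these assumptions the estimate scales linearly with mass, so a microgram-scale sample corresponds to roughly 10^12 molecules.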
The number of times an identified cDNA sequence occurs out of the total population of sequenced clones therefore gives a measure of the relative frequency of the sequence and of the particular transcript from which it was derived within the original mRNA population. However, given that standard Sanger-based dideoxy chain terminating sequencing technology yields a readable sequence length of ~0.1-1 kilobases (kb) of DNA 40,41, and that transcript lengths can vary from < 0.1 kb to > 10 kb, throughput under this paradigm is low and cost is high. To increase sample throughput and the rate of transcript identification, shorter sequence reads (~400 bases) have been utilized to uniquely identify a given transcript via the creation of expressed sequence tags (ESTs). While this can increase the rate of transcript identification by a factor of up to 10-fold, the time and cost required to identify each EST is nonetheless restricted by the inability to sequence more than a single cDNA clone per sequencing run. This limitation was addressed in 1995 by Velculescu et al. 42, who introduced Serial Analysis of Gene Expression (SAGE) technology. This technology enabled a new paradigm for high-throughput identification and quantification of a transcriptome by stringing together in series short sequence tags (SSTs) to create clonable inserts 500-1500 bases in length and thereby provide a means to increase the number of transcripts that can be identified per sequencing run by a factor of up to 150-fold 39,42.

1.2.1. Overview of SAGE technology

SAGE technology 42 directly samples the transcriptome of an organism under a given cellular state through the generation of SSTs of 9-22 base pairs in length (figure 1.1).
As a 9-10 mer oligonucleotide can theoretically identify 4^9 (262,144) or 4^10 (1,048,576) unique sequences, the entire transcript population of any organism can potentially be represented within a 9-mer to 10-mer library of SSTs 39,42,43. In SAGE, an SST library is constructed by digesting a cDNA copy of the mRNA population with a restriction endonuclease (RE) that has a tetranucleotide recognition sequence (e.g. Nla III: 5'-CATG-3') that “anchors” the SST. Since a given tetranucleotide is theoretically present every 256 bases in a randomly sequenced oligonucleotide, the probability of cleaving within every transcript is high 42. The most 3' end restriction fragments of the cDNA population digested with this anchoring restriction enzyme (ARE) are then purified, and the sequences downstream of the ARE recognition sequence are isolated by ligating to the 5' end one of two short oligonucleotide linkers that each contains a 3' overhang sequence complementary to the overhang on the cDNA population generated by the ARE, a Type IIS RE recognition sequence, and a unique primer sequence. Through subsequent cleavage with a Type IIS RE (e.g. Bsm FI) 42, a family of REs capable of cleaving DNA outside of the RE's recognition sequence, short sequences of equal length are produced from the cDNA fragment population 42. The choice of Type IIS RE intrinsically determines the amount of sequence information that can be retrieved from a given cDNA and forms the basis of other SAGE-like technologies such as LongSAGE. The adaptor-tag (A:T) products are then dimerized and amplified via the polymerase chain reaction (PCR) to form an amplified pool of SAGE tag dimers (“ditags”), which are concatenated after removing the linkers.
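The counting arguments and the tag definition given above can be expressed as a brief in silico sketch. The Python below is a simplified illustration under stated assumptions: it treats a cDNA as a plain string, assumes independent, equiprobable bases when estimating how often an anchoring site occurs, and uses hypothetical function names that do not come from any SAGE software package.

```python
def tag_space(k):
    """Number of distinct sequences of length k over the 4-letter DNA alphabet."""
    return 4 ** k


def expected_site_spacing(site):
    """Expected spacing (in bases) between occurrences of a site in random sequence."""
    return 4 ** len(site)


def p_at_least_one_site(length, site="CATG"):
    """Approximate chance that a random sequence of this length contains the site.

    Assumes independent, equiprobable bases and ignores overlap between start positions.
    """
    p = 0.25 ** len(site)
    n_positions = max(length - len(site) + 1, 0)
    return 1.0 - (1.0 - p) ** n_positions


def extract_sage_tag(cdna, site="CATG", tag_len=10):
    """Return the tag_len bases immediately 3' of the 3'-most anchoring site.

    Returns None when the site is absent or too close to the 3' end to give a full tag.
    """
    pos = cdna.rfind(site)          # 3'-most occurrence of the anchoring site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None


if __name__ == "__main__":
    print(tag_space(9), tag_space(10))        # sizes of the 9-mer and 10-mer tag spaces
    print(expected_site_spacing("CATG"))      # one site per 256 bases on average
    print(round(p_at_least_one_site(2000), 4))
    example = "AAA" + "CATG" + "CCCC" + "CATG" + "ACGTACGTAC" + "TTTT"
    print(extract_sage_tag(example))
```

A transcript for which `extract_sage_tag` returns None has no tag for that anchoring enzyme, which is one way such a transcript would go uncounted in a tag library.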
Concatenates of ~0.7-1 kb are then purified and subcloned into a sequencing vector to create a SAGE library 42,43. Finally, the insert in each clone of the library is sequenced and the abundance of each unique SAGE tag is determined. As each SST is derived from a defined position within a particular cDNA, a given tag can be cross-referenced through organism- and/or tissue-specific genome databases to a particular gene to give a profile of global gene expression 50. An important feature of SAGE is that tags which arise out of the analysis that do not correspond to any known gene suggest the presence of novel or alternate splicing variants of transcripts and can therefore be used to aid in the annotation of genome databases 7,51-57.

1.2.2. Current limitations in SAGE technology

In principle, SAGE technology offers the potential to provide a comprehensive measurement of transcript abundances on an absolute scale, and indeed was introduced as a technology capable of providing these kinds of measurements. However, as it is a sampling-based technology that relies on numerous processing steps between starting material and data acquisition, a number of unique factors can impact the ability of SAGE to provide a complete and accurate portrait of the transcriptome.

1.2.2.1. Sampling-based bias

In a typical SAGE experiment, 0.1 to 5 µg of mRNA are utilized as starting material, corresponding to approximately 1 x 10^11 to 5 x 10^12 transcript molecules of 2 kb in length. Following conversion into cDNA and after a series of processing steps as described in section 1.2.1, amplified tag molecules are concatenated into [...] molecules of 1 kb in length. However, as a consequence of the additive costs and turn-around times associated with sequencing, complete sequencing of a SAGE library, which would entail the sequencing of [...] clone samples, is untenable.
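Because only part of a tag library can be sequenced, the chance of observing a given transcript depends on its relative abundance and on the number of tags read. The Python sketch below models this as independent draws with replacement (a binomial assumption introduced here for illustration); the abundance and tag counts in the example are hypothetical and the function names are not from any published SAGE tool.

```python
import math


def p_detect(fraction, n_tags):
    """Probability of seeing a transcript at least once among n_tags sampled tags.

    fraction -- the transcript's share of all tags in the library (0 < fraction < 1)
    n_tags   -- number of tags sequenced, modelled as independent draws
    """
    return 1.0 - (1.0 - fraction) ** n_tags


def tags_needed(fraction, confidence=0.95):
    """Smallest number of tags giving at least the requested detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


if __name__ == "__main__":
    rare = 1e-5  # hypothetical transcript making up 1 in 100,000 tags
    print(round(p_detect(rare, 45_000), 3))   # detection chance at a modest sequencing depth
    print(tags_needed(rare))                  # depth needed for 95% detection under this model
```

Under this simple model a rare transcript is more likely to be missed than seen at modest depth, while an abundant one is observed many times over.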
Instead, a small fraction of the total number of SAGE tags created (< 10^5) is typically sequenced, a fraction often determined as a multiple of an estimated number of unique transcripts expressed in a given cell type during a particular cell state 58, or by simply halting sequencing when unique SAGE tags are no longer uncovered. A useful example is provided through studies of the yeast S. cerevisiae under conditions where ~15,000 transcripts from the ~6000 theoretical open reading frames are thought to be expressed. With a dedicated state-of-the-art sequencer such as an ABI 3730xl capillary sequencer that can process 768 samples per day at a 99% confidence level per read, a 3-fold coverage of the yeast transcriptome (~45,000 tags) would require a full day of sequencing at a cost of ~$10,000 for sample preparation and consumables for sequencing alone, which clearly presents a significant cost barrier to increases in the depth of sequencing. However, sampling of the transcriptome through the sequencing of < 10^5 SAGE tags, which represents less than 0.000001% of the 10^13 tag molecules typically generated in a SAGE study, can introduce significant measurement error 61,62, particularly for less abundant transcripts, since greater than 50% of the total mass of mRNA is comprised of less than 20% of the unique transcripts in the mRNA library 58,63. In particular, the over 6 orders of magnitude dynamic range of transcript abundances in a typical transcriptome can lead to the introduction of a sampling bias towards highly expressed transcripts, causing some tags from transcripts present in lower abundance to be missed, and the number of tags identified for other transcripts to inaccurately reflect their true abundances 61,62. As will be shown in this thesis, the utilization of numerous processing steps to prepare a concatenated tag library further contributes to measurement error in SAGE, as moderate to substantial losses as well as
processing artifacts occur in each of the 12 enzymatic and 10 purification steps used to convert the starting mRNA sample into a SAGE library (figure 1.1). As a result, the use of SAGE is now largely directed toward the fundamentally less informative study of changes in transcript abundance relative to a selected reference state for the sample, and a number of algorithms are available to determine the statistical significance of such changes in tag abundances 65,66.

1.2.2.2. Limits in transcriptome coverage

Although the theoretical coverage of the transcriptome of complex genomes by SAGE is limited only by the depth of sequencing pursued, only 4665 of the 6305 (~75%) putative ORFs can be identified in S. cerevisiae after sequencing ~60,000 SAGE tags (approximately 4-fold coverage of the estimated size of the transcriptome) 7,62. Furthermore, SAGE tags that correspond to more than one gene are often observed and have been reported in the SAGE analysis of human CD15+ (34% of SAGE tags) 67 and mouse Gr-1+ (42%) myeloid progenitor cells, as well as P. falciparum (14-18%) 69,70. This phenomenon has been attributed to the use of short SAGE tags (9-10 bases). However, increasing SAGE tag lengths to 20 bases has not been clearly demonstrated to ameliorate these losses in information due to sequence degeneracy and probably does not warrant the associated increase in the cost of sequencing 72. Indeed, only full sequencing of 3' end cDNAs downstream of a given ARE cleavage site has been shown to provide unambiguous tag-to-gene assignments 68. Applications of SAGE in silico on the theoretical open reading frames (ORFs) as well as on curated transcript sequences from a number of organisms have revealed that limits in transcriptome coverage arise from the fact that not all ORFs, and therefore not all cDNAs, possess a cleavage site for a given ARE 76,77. Furthermore, the composition of sequences downstream of any given ARE is influenced by the choice of ARE used, causing degenerate sequences to be more prevalent downstream of some cleavage sites than others (e.g. 5'-AATT-3' vs. 5'-CCGG-3') 76. Thus, the ability of SAGE to provide a complete picture of the transcriptome is both genome-specific and determined by the ARE (typically Nla III) used. Although a SAGE analysis that combines tag information obtained from a transcriptome sample using two different AREs 78 can increase the coverage of the genome by effectively using two SAGE tags to identify a particular transcript, this unfortunately multiplies the already considerable sequencing (~$10,000) and sample preparation (~$1000) costs per library for SAGE technology.

1.2.2.3. Sequence-based artifacts

The success of SAGE in providing accurate information about the transcriptome also relies on the ability to relate a tag sequence to its putative transcript.
This depends on the accuracy of the sequencing results obtained, as errors in the tag sequence can lead to either a misattribution of tags to a transcript or a spurious assignment of tags to genomic sequence, serving to confound SAGE analysis. Tag errors can arise from the error rate (0.6-2% per base) inherent to automated high-throughput sequencing systems, and can cause up to 10% of the pool of tag sequences to be erroneous 83. Sequence artifacts can also be introduced during the processing steps involved in preparing a SAGE library. Indeed, the primary source of sequence artifacts is the amplification of SAGE ditags 85, where extensive application of the PCR is used both to offset losses occurring during sample preparation and to permit the recovery of sufficient amounts of material to generate a library for sequencing 87. Due to the error rate inherent to the DNA polymerase (DNAP) Taq, the incidence of sequence artifacts in the amplified ditag population increases with each amplification cycle and is estimated to contribute as much as ~30% erroneous tags to the final amplified pool 84,85. Moreover, sequence-related differences in amplification efficiencies among the population of individual ditags can alter ditag template-to-product ratios with increasing cycle number 67,88-90. Finally, while not strictly due to errors in sequence, SAGE databases have been found to exhibit a bias in GC-content 91,92 which has been attributed, in part, to the propensity of ditags rich in AT-content to melt at room temperature during sample preparation, potentially skewing tag and therefore transcript abundance measurements away from their original distribution. Thus, while SAGE may potentially uncover all unique transcripts expressed under a given cell state, it is possible (and even likely) that the real distribution of transcript abundances is incorrectly reflected in SAGE data sets.

1.2.3.
Advances in digital-based transcriptomics technologies

Recent advances in digital transcriptomics technologies have largely coincided with the development of novel highly parallel methods of sequencing that are capable of simultaneously sequencing > 10^6 samples in a single run 93-96. Read lengths obtained under these next-generation platforms, while currently limited to 25-250 bases, are sufficient for the analysis of SSTs. Thus, through adaptation to tag analysis, they can provide a 10- to 100-fold increase in the number of SSTs that can be sampled from a transcriptome in a single sequencing run, greatly reducing sampling biases that arise from insufficient depth of sequencing found in standard SAGE analysis. Furthermore, as SSTs themselves can be utilized as templates for sequencing, sample biases arising from the conversion of the released SST population into a sequencing library are partially mitigated, although PCR-based amplification is still required for generating sufficient amounts of material for sequencing. Indeed, following cDNA library generation and release of the A:T with a Type IIS enzyme (see section 1.2.1), dimerized A:Ts can be applied to these platforms for direct sequencing, or alternatively, a second adaptor can be immediately introduced to the 3'-end of the A:T to prepare the released SSTs for sequencing 97-100.

Underlying these sequencing technologies are two common and enabling principles: 1) each sequence is clonally isolated into clusters, either onto individual microbeads that are then spatially dispersed onto a surface, or within a highly localized area (< 10 µm diameter) on a surface, allowing a 1:1 correspondence of cluster number with transcript number, and 2) mass separation of labeled sequencing ladders is bypassed by directly sequencing each ladder in real-time using either a sequencing-by-ligation (SBL) or a sequencing-by-synthesis (SBS) approach.
However, a critical evaluation of the accuracy of abundance measurements obtained is not yet available, though, as detailed below, some fundamental features of these platforms are likely to compromise their ability to provide absolute measurements of transcript abundance.

1.2.3.1. Formation of spatially segregated clonal populations

Initial approaches to clonally isolate cDNA sequences for sequence analysis utilized a unique 32-base “barcode” tag cloned into each cDNA that could then be hybridized to individual beads bearing unique complementary sequences to this tag 102. Each barcode is created from a collection of 8 non-self-complementary tetranucleotide sequences that allow the formation of ~1.7 x 10^7 unique barcodes with identical GC-content and similar melting thermodynamics. This allows a small but significant fraction of each cDNA in the starting population (2.5 µg or 2.5 x 10^12 transcript molecules) to be tagged with a unique barcode. Successfully tagged cDNAs are selected through transformation of host E. coli, and following amplification via PCR with primers that flank the barcoded cDNA, the amplified cDNA is hybridized onto a population of ~5 µm dia. beads in which each bead displays an “anti-barcode tag” oligonucleotide complementary to a particular barcode sequence, effectively allowing each bead to be loaded with a unique cDNA 101-103. The requirement for extensive sample preparation and potential for sampling bias has however led to notable modifications in order to achieve clonal segregation of templates, including the development of emulsion-based PCR amplification protocols (ePCR) and solid-phase amplification (SPA) approaches.
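The size of the barcode repertoire described above follows from simple combinatorics. The Python sketch below assumes that each of the eight 4-base positions in a 32-base barcode is filled independently from the same set of 8 tetranucleotide words; that reading of the design, and the function name, are assumptions made here for illustration.

```python
def barcode_count(n_words=8, barcode_len=32, word_len=4):
    """Number of distinct barcodes built by concatenating fixed-length words.

    n_words     -- size of the tetranucleotide word set
    barcode_len -- total barcode length in bases
    word_len    -- length of each word in bases
    """
    if barcode_len % word_len != 0:
        raise ValueError("barcode length must be a multiple of the word length")
    positions = barcode_len // word_len   # 32 / 4 = 8 word positions
    return n_words ** positions


if __name__ == "__main__":
    total = barcode_count()
    print(total, f"{total:.1e}")   # 8^8 distinct barcodes
```

With 8 words at 8 positions the count is 8^8, in line with the roughly 1.7 x 10^7 barcodes quoted in the text.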
ePCR enables the amplification of individual templates within isolated microvessels formed by emulsification of water in oil 104-107. Through bulk agitation of an aqueous surfactant solution suspended in mineral oil 106,107, or through mechanical shear of the suspension through a microcapillary 108, thermally stable microvessels ranging in volume between 3-10 µm^3 can be created 104-108. The careful distribution of PCR reagents, templates, and primer-anchored beads across the population of microvessels formed within the emulsion then permits individual beads to be loaded with amplification products arising from a single template, allowing the creation of a clonal bead population 104,106-110. This requires that the number of microvessels formed in the emulsion vastly exceed the number of template molecules (100 ng mRNA or ~10^11 molecules) present to ensure the presence of a single template in a given microvessel. Following amplification, clonal beads can be harvested and dispersed either as a monolayer onto a surface to achieve a “feature” density of 5 x 10^7 beads/cm^2 (for ca. 1 µm dia. beads), or by deposition into individual microfabricated wells of a picotitre plate (1.6 x 10^6 wells/plate). Unfortunately, the distribution of templates and reagents within the emulsion follows a Poisson distribution, so that only a fraction of beads are loaded with a unique template or have sufficient quantities of reagents to permit efficient amplification. The effect of this low sampling frequency on the final distribution of sequenced tags is unknown.

In contrast, SPA involves the direct amplification of individual templates distributed onto a surface.
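For the emulsion approach, the Poisson loading of microvessels noted above can be made concrete with a short calculation. This Python sketch assumes templates distribute independently among equal-volume microvessels with a mean of `lam` templates per vessel; the value of 0.1 used in the example is hypothetical, and the function names are introduced here for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k templates in a microvessel when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def loading_fractions(lam):
    """Return the fractions of microvessels that are empty, singly and multiply loaded."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multiple = 1.0 - empty - single
    return empty, single, multiple


if __name__ == "__main__":
    empty, single, multiple = loading_fractions(0.1)   # hypothetical mean of 0.1 templates/vessel
    print(f"empty={empty:.3f} single={single:.3f} multiple={multiple:.4f}")
    # Share of occupied vessels that hold exactly one template (i.e. would be clonal):
    print(f"clonal share of occupied vessels={single / (single + multiple):.3f}")
```

Lowering the mean occupancy raises the clonal share of occupied vessels but leaves most vessels empty, which is the trade-off behind the low sampling frequency described in the text.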
The templates are either entangled in a three-dimensional (3D) matrix through infusion and co-polymerization with a high concentration acrylamide/bis-acrylamide solution containing methylacrydite-modified primers and then deposited as a thin layer (~10 µm depth) on a glass slide 111-115; or are captured through direct hybridization on a glass surface possessing covalently attached primers 109,116,117. In the initial round of SPA, the end-grafted primer, following hybridization with a solution-phase template, is extended by DNAP to yield a surface-anchored complementary strand, seeding the growth of a polymerase colony, or “polony”. During subsequent rounds of SPA, due to high local concentrations of surface-attached primers and/or the polyacrylamide matrix, the lateral diffusion of the parent strand(s) and subsequent amplification product is inhibited, leading to a local accumulation of amplicons into individual polonies, the size and density of which can be controlled by adjusting reaction conditions 112. The amplification efficiency of SPA is however very low and is dependent upon overcoming the steric and Coulombic interactions arising from the crowding of primers and amplicons, as well as length-dependent and conformational entropic effects that influence template:primer hybridization on a surface 118-124. Although the introduction of solution phase primers can partially mitigate this low amplification efficiency 116 and permit the growth of polony densities (1 x 10^[...] polonies/cm^2) exceeding the feature densities of standard spotted microarrays 111,112,116, the resulting increase in diffusible amplicons can increase the likelihood of polony “satellites” that compromise fidelity through misattribution of these polonies to a unique template 116. Detachment of ~50-60% of surface-bound components over the course of thermocycling has also been observed irrespective of the silyl attachment chemistry utilized for primer attachment, compounding this
loss in fidelity  116, 125  Although these problems may be  reduced when diffusionally restricted by 3D matrices, the integrity of the template, primer, and  10  polymerase is compromised during the free-radical mediated polymerization of acrylamide/bis acrylamide monomers  127,  and formation of polonies outside the focal plane of the imaging system  can impact signal coherence as well as detection limits 1.2.3.2. Sequencing in real-time Upon segregation of the template population into clonally and spatially discrete clusters on  a surface, the sequence of each template is simultaneously determined in real-time using one of two general fluorescence-based approaches: sequencing-by-ligation (SBL) and sequencing-bysynthesis (SBS). SBL involves the sequential decoding of a ssDNA/dsDNA template through the hybridization, ligation, interrogation, and removal of fluorescently labelled sequence-specific oligonucleotides. The original SBL method employed “query adaptors” possessing a 4-base 5’overhang sequence and a recognition sequence for the Type ITS RE Bbv I that cleaves 4 bases downstream of its recognition sequence. Following digestion of the bead-anchored dsDNA template with a RE, Bbv I adaptors are ligated to the plurality of templates. The sequencing run is then initiated by digesting with Bbv I to create a 5’-overhang that exposes the adjacent tetranudeotide sequence downstream of the initiating RE site. A series of labeled sequencespecific adaptors are then hybridized and ligated to decode the revealed tetranucleotide sequence. Successive iterations of digestion, adaptor hybridization and ligation therefore allow the sequence of each anchored cDNAs to be determined as blocks of consecutive tetranucleotides’°”° 3 Current SBL approaches attempt to increase the specificity of the decoding reaction by utilizing “query probes” consisting of a nucleotide signature within a degenerate nonamer. 
In this approach, the sequencing cycle is initiated by hybridizing a primer at a fixed distance upstream of a bead-anchored single-stranded template. A 3’-end fluorescently labeled query probe is then hybridized to the ssDNA and ligated to the adjacent upstream primer, allowing the specific identification of the n+4 base. After stripping ligation products, the cycle of hybridization and ligation is repeated for each set of initiating primers that bind at the n-1, n-2, n-3, and finally the n-4 positions 128. This entire series of reactions thus reveals the sequence of the template as pentameric blocks, although recent modifications (e.g. ABI’s SOLiD platform) currently enable read lengths of ~30 bases, which is sufficient for the analysis of SSTs. The signal and accuracy of each decoded sequence obtained with SBL are, however, dependent upon the ability of the query adaptor or probe to invade the template “brush” and hybridize with high specificity to either the 5’-overhang sequence or the sequence adjacent to the initiating primer 122, 129. System performance can therefore be reduced through the tolerance of ligase to mismatches 130 and gaps 130, 131 in the joining reaction, as well as inefficiencies in query probe hybridization resulting from the propensity to form secondary structures.

In contrast, SBS involves the real-time monitoring of specific bases incorporated during the primed synthesis of the strand complementary to the denatured ssDNA template sequence 132, 133. Sanger-based approaches to SBS utilize nucleotide analogues possessing: 1) a unique fluorophore attached through a photocleavable 134 or chemically-labile 135, 136 linker; and 2) a chemically labile 3’-O-linked allyl terminator that blocks extension by DNA polymerase 136. Each cycle of base-calling therefore involves incremental addition of the full set of reversible terminators to the nascent strand population, removal of unincorporated bases, imaging of the entire surface to identify bases incorporated by each bead/polony, and cleavage of the fluorophore and terminator group to permit addition of the next base. Although error rates are on par with traditional Sanger sequencing, current read-lengths are limited to < 30 bases, as incorporation rates of these terminators are much lower than even traditional dideoxy-terminators 136, 137 and are influenced by the reaction conditions employed 138 as well as the efficiency of terminator removal 136. Increased read lengths can be obtained through a technique termed “pyrosequencing”, which allows the use of unmodified deoxynucleotides as well as traditional DNA polymerases. In this approach, pyrophosphate (PPi), released upon nucleotide incorporation, is converted into a chemiluminescent signal through a cocktail of enzymes and substrates 132, 139-142. This requires, however, that each base is added separately and that the pyrosequencing reaction for each template is isolated in order to localize the chemiluminescent signal 143, 144. High-throughput pyrosequencing is therefore performed in picotitre plates containing ~1.6 x 10^6 wells, each containing an individual clonal bead for analysis. However, less than 20% of the wells (~3 x 10^5) are able to yield high quality sequence, although this comparably lower sample throughput is compensated by attainable reads of up to ~240 bases with an error rate of 0.5% per base 145. Homopolymer stretches are problematic though, and the performance of pyrosequencing is limited by the sequence-dependent relationship between the total number of photons released and the number of nucleotides incorporated 145-147. While this homopolymer problem may be mitigated through the use of 3’-O-allyl reversibly terminated nucleotides 147, the impact of these modified analogues on read length is unknown.

1.3. ANALOGUE APPROACHES TO GENE EXPRESSION ANALYSIS

Analogue approaches to transcriptome analysis exploit the unique structural and biophysical properties of polynucleotides to enable a signal-based method to measure the quantity of a transcript in a mixture. The base-pairing rules governing the interaction of individual bases within two polynucleotide strands, namely adenine (A) with thymine (T) or uracil (U), and guanine (G) with cytosine (C), permit synthetic sequences of DNA (“probes”) to be designed to bind a “target” DNA or RNA sequence. Quantification of the amount of hybridized target may then be achieved through the incorporation of a reporter group (radiolabelled, fluorescent, or biotinylated) within either the target or the probe. Original applications of this principle to gene expression analysis involved the size-based separation of mRNA via denaturing agarose-gel electrophoresis, followed by capillary transfer and fixation onto a nitrocellulose membrane. Referred to as the Northern blot 148 in deference to the Southern blot 149 upon which it is based, this approach introduces radio-labelled probes designed against specific target mRNAs into the immobilized transcript mixture, followed by high-stringency washes to remove any probe adsorbed non-specifically onto the nitrocellulose matrix. The signal intensity derived from each hybridized probe is determined by densitometry, providing a semi-quantitative measure of the abundance of each transcript. The ability to detect and quantify unique transcripts therefore lies largely in the ability to spatially segregate the population of transcripts, in this instance by size, and thus depends upon the resolving power of the agarose gel utilized. In particular, transcripts of similar length cannot be resolved and require either the use of separate blots generated from the same sample, or alternatively, the use of multiple cycles of probing and stripping, amounting to a tedious and time-consuming procedure.
These limitations were addressed through a paradigm shift in the design of analogue-based approaches to transcript analysis, resulting in the development of microarray technology, where probes were instead immobilized onto a surface. The development of DNA microarray technology enabled a vast increase in the number of transcripts that can be interrogated simultaneously from an mRNA population 151 and, along with SAGE technology, is largely responsible for the evolution of global analysis of gene expression that now serves as a cornerstone of the rapidly growing field of systems biology.

1.3.1. Overview of DNA microarray technology

Microarray technology entails the engineering of 1 x 10^5 150, 151 to over 1 x 10^6 152 unique features on a small surface area, in which each feature contains an end-grafted 151, 154, 155 or physisorbed polynucleotide probe that is complementary to a region of a particular gene. This engineered complementarity allows the hybridization of a target mRNA or its cDNA (e.g. radio- 157 or fluorescent-labeled 158, 159) and its subsequent identification and quantification 160. These features, or probes, can be generated on an appropriately modified surface in one of two ways 161: 1) isolated cDNAs or PCR products can be manually or automatically gridded onto a support (e.g. coated glass coverslips or nitrocellulose membranes) at a density of ~4,500 spots/cm^2 (75 µm diameter) to create “spotted arrays” of up to ~60,000 features per square inch 150; or 2) complementary oligomers (25-75 bases in length) corresponding to a given cDNA can be directly synthesized on the support at a density of 400,000 features/cm^2 (20 µm width), either through photolithographic methods 151, 152, 162 or inkjet-like dispensing methods 163, 164, to create “oligonucleotide arrays” with greater than 1,000,000 features per square inch 152, 153.

1.3.1.1. Spotted microarrays

Spotted microarrays come in many different configurations.
A variety of surfaces and surface-active groups are utilized for array preparation, both in-house and commercially, including amino- 165, thiol- 166, aldehyde- 167, isothiocyanyl-, succinimidyl- 168, and epoxy- 169 activated silylated glass, as well as nylon- 170, poly-L-lysine- 156, and gel-coated glass 111, 171. Each exhibits a unique level of non-specific adsorption of polynucleotides and a different physical environment for hybridization. Similarly, a variety of linkers are employed to separate the probe from the surface, and hybridization efficiencies are found to vary according to the length and chemistry of the linker used 123, 124, 167. The different substrate and probe attachment chemistries that can be employed 167, 172 also influence the amount of probe that can be attached to the surface and the rate of its desorption from the array surface. This variability is compounded by the use of cDNAs as probes, creating probe sets that range in length between 100 and 3000 bases 173, leading to large differences in hybridization thermodynamics and kinetics that make it difficult to relate hybridization signals between registers and across array formats 174. While minimal standards in the presentation of microarray experiments (e.g. MIAME) have aided in addressing data quality and cross-validation 175, cDNA-based microarrays are nevertheless strictly limited to measuring comparative changes relative to a designated reference sample due to these sources of variability and uncertainty.
Several commercial spotted array platforms attempt to overcome probe-length variability issues by employing fixed-length probes (typically a 30-70 mer oligonucleotide) to identify transcripts using a sequence tag identification approach 176-180. Their performance has been aided by a number of studies which have focused both on determination of the optimal probe density for uninhibited access of a given target molecule to its complement 173, 181-185 and on the most reliable probe attachment chemistry 125, 186, in order to identify conditions that can permit accurate determination of target abundances over a wide dynamic range. However, the successful design of such probe sets to target unique regions of each transcript requires highly-curated sequence databases. It also requires matching of probe melting thermodynamics and the elimination of secondary structures within the ssDNA 187, particularly those formed in either the target or the probe that are thermodynamically preferred over target:probe duplexation 189, 190. This has proven difficult to achieve, pointing to the need for new approaches to probe design.

1.3.1.2. De novo synthesized microarrays

Oligonucleotide arrays synthesized de novo by photolithographic or high-end inkjet dispensing methods 151, 152, 162-164 utilize well-defined chemistries to synthesize probes of fixed length. Their fabrication currently requires highly specialized facilities similar to those found in the semiconductor industry, making the overhead cost for construction of these devices well beyond the resources of most institutions. As a result, the arrays are currently offered through commercial sources 191, so that the availability and cost (~$200-$600) of a given array are determined largely by its market demand and commercial feasibility 191. Moreover, the synthesis of probes de novo on a surface is difficult.
Typically, greater than 99% coupling yields of individual bases in a nascent oligonucleotide can be obtained through standard solid-phase phosphoramidite chemistry. However, the coupling yield for each base onto a flat surface is less than 98% 192, and for photolithographic methods, coupling yields of less than 94% 192, 193 are often observed due to the use of less efficient photolabile protecting groups. These low coupling efficiencies limit the yield of full-length 25-mer oligonucleotides to less than 20-60%, and as surface-coupled oligonucleotides are not amenable to any purification strategies, failed oligonucleotides will compete with the full-length probe for hybridization to its target molecule. A number of statistical and technological strategies have been used to minimize this problem, most notably by exploiting the extremely high density of probes that can be placed onto these arrays. This enables the utilization of multiple probe registers for a given transcript that together serve to more accurately identify a given target as well as to identify mismatched false positive signals that might convolute data analysis 195-197.

1.3.1.3. Sample preparation for microarray technologies

Many methods of sample preparation are available to convert harvested mRNA starting material into a form that can be interrogated via either spotted or oligonucleotide microarrays. In general, the extent of sample preparation required to create the analyte is determined by the amount of analyte required (typically 2 to 20 µg) by a particular microarray configuration for signal detection, as well as the labeling chemistry utilized for detection 150, 198-201. As the mass of starting mRNA material is typically low, some form of material amplification is needed to generate the required amount of analyte.
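The full-length yield range quoted in section 1.3.1.2 follows from compounding the per-base coupling yield over the length of the probe. As a rough sketch, assuming every coupling step is independent and has the same stepwise yield, and that a 25-mer requires 25 couplings (both are illustrative simplifications, and the function name is hypothetical), the full-length fraction is the stepwise yield raised to the number of couplings:

```python
def full_length_fraction(stepwise_yield: float, n_couplings: int) -> float:
    """Fraction of strands that complete every coupling step.

    Assumes identical, independent coupling steps (an idealization).
    """
    if not 0.0 <= stepwise_yield <= 1.0:
        raise ValueError("stepwise_yield must lie between 0 and 1")
    if n_couplings < 0:
        raise ValueError("n_couplings must be non-negative")
    return stepwise_yield ** n_couplings


# Stepwise yields taken from the text; 25 couplings is an assumption.
for label, p in [
    ("solid-phase phosphoramidite", 0.99),
    ("flat surface", 0.98),
    ("photolithographic", 0.94),
]:
    print(f"{label}: {full_length_fraction(p, 25):.1%} full-length 25-mer")
```

Under these assumptions, stepwise yields of 98% and 94% give roughly 60% and 21% full-length product, respectively, which is consistent with the 20-60% range stated in the text.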
Prototypically, this amplification is achieved using a modification to the Eberwine protocol 202, where the mRNA population is converted into a cDNA product using an oligo(dT) primer that contains a priming site for RNA polymerase from the bacteriophage T7 (T7 RNAP). Through in vitro transcription (IVT) of the cDNA population with T7 RNAP, a linearly amplified RNA (aRNA) pool of (-) strand RNA is then created 199-201. This aRNA in turn is used as a template for a second round of random-primed cDNA synthesis followed by 2nd strand synthesis with the oligo(dT)/T7 primer. Alternatively, amplification can be achieved by introducing a second priming site and exploiting the template switching effect of reverse transcriptase 199, 203, or through tailing of the cDNA with polythymidine 199, 204, 205, allowing the subsequent application of the PCR for material amplification. The fluorescently labelled analyte can then be generated via IVT or through a single round of PCR to incorporate fluorescently-modified nucleotides (CTP/UTP or dCTP/dUTP, respectively), or through the introduction of the corresponding biotinylated or amino-allyl nucleotides 199-201, 203-205. In the latter case, labeling is achieved either using fluorescent streptavidin conjugates that specifically bind to the biotinylated nucleotides, or through covalent modification of the amino-allyl substituents with an amine-reactive fluorophore. However, although many options are available for preparing a sample for microarray analysis which, with due diligence, are capable of providing highly reproducible information on fold-changes in gene expression 206-208, the results obtained are often specific to the protocol adopted 209-211, highlighting an underlying bias that undermines attempts to infer transcript abundances on an absolute scale 212-214.

1.3.2. Current limitations of microarray technologies

The incompleteness of curated databases for many organisms limits the design of comprehensive probesets to fully interrogate the transcriptomes of those organisms, and generally restricts the applicability of current microarray platforms to the monitoring of expression profiles for known genes rather than to gene discovery per se. Nevertheless, analogue platforms for transcriptome analysis possess many advantages relative to their digital counterparts, including a higher speed of data acquisition and a lower cost per sample, since the cost of analyzing via sequencing is serially accrued and data acquisition times are measured in days. The combination of these features renders the replication of data sets economically feasible, improving data precision by allowing one to investigate and optimize the parameters that impact technology performance, including spotting density 182, sample preparation 209, 211, the method and incubation period for target hybridization and labeling 201, 213, 215-218, and finally, scanner properties 219, 220.

However, the inherent complexities in the chemistry and biophysics of microarray platforms have, to date, precluded the determination of absolute transcript abundances from array hybridization data. These issues are also thought to diminish the ability to correlate relative measurements of transcript abundances across different microarray platforms, despite attempts at standardization 226 to improve inter-platform reproducibility 177, 195, 215, 229-232. As detailed below, a number of other issues impact the potential use of DNA microarrays in absolute determination of transcript numbers.

1.3.2.1. Limitations introduced by target and probe chemistries

The biophysical properties that govern the interaction between a specific target and its corresponding probe anchored to the microarray surface establish the ability of microarray platforms to provide absolute measurements of transcript abundance. In this regard, nearest-neighbor (NN) thermodynamic models have proven quite useful both in the design of probe sequences 235-238 and in the normalization of hybridization signals for solution-based assays. However, as duplexation thermodynamics 240-243 are sensitive to the solution and environmental conditions in which the hybridization reaction takes place 244-247, empirical best-fit regressions are currently utilized in an attempt to correct for solution-dependent effects 245, 248. Unfortunately, few data are currently available to validate the true performance of these corrections beyond general claims that they can improve reproducibility across microarray datasets 235.

The ability to obtain accurate measurements of transcript abundance is further hampered by the multi-phase nature of the microarray platform, where the target population present in solution must interact with a high, localized concentration of surface- or matrix-bound complementary probe.
The well-known concentration dependence of the bimolecular duplexation reaction will be affected by hybridization thermodynamics at the array surface 121, 189; hybridization kinetics will also differ from those in solution owing to additional resistances to mass transfer to, and lateral diffusion along, the array surface 249, although this discrepancy is less prominent with probes attached to a 3D polymer matrix. Non-specific interactions of the target with the underlying brush-like configuration of the surface or matrix can also inhibit hybridization through steric effects and through unfavorable Coulombic interactions created by the negative charge density of the phosphate backbone of the probe 123, 250, 251. Significant differences in the sizes of labelled transcripts can also alter the kinetics of the interactions with the immobilized probes 185, as can secondary structures formed in the target strand that compete with probe hybridization 253. These effects can lead to large differences in hybridization signals for probes to the same gene 254-256, but are generally lessened by chemically or enzymatically shearing targets to a smaller and more uniform distribution of lengths 200, 201, which also tends to produce higher intensity hybridization signals 257-259. Finally, as the microarray assay involves the hybridization of a population of targets to a larger population of probes with related sequences, strong competitive effects are introduced that can delay equilibrium 261-263 and hinder the ability to discern mismatch from perfect match hybridization signals. This is particularly problematic for probe design, as the stability of partially complementary duplexes depends upon the position of the mismatched base within the probe 264, 265, on the chemistry of the mismatched base 266, and on its local sequence context 189.

1.3.2.2. Limitations introduced by sample preparation

While significant effort has been made to improve array design and chemistry, relatively little research has been conducted to determine the impact of sample preparation on the accuracy and overall quality of microarray data. The properties and composition of the starting transcript population pose significant (but to date largely unrecognized) challenges to the ability to prepare an analyte that faithfully preserves the abundance distribution of the components, and that can then be detected and analyzed in an unbiased manner to permit absolute quantification of message numbers. Fluorescent labeling of transcripts is known to be heterogeneous. In particular, transcript lengths can range from 30 to ~10^[?] nucleotides according to curated RefSeq datasets, causing the total number of fluorophores incorporated into each transcript to vary proportionately and labelled transcripts present within the same abundance class to exhibit a wide range of fluorescence intensities at the array surface. In addition, fluorophores are often incorporated through the introduction of either a modified (deoxy-)cytidine, (deoxy-)uridine, or (deoxy-)guanosine nucleotide. However, the sequence compositions among transcripts can range from 20% to 85% in their GC-content 267, leading to further variations in the number of fluorophores incorporated per transcript 173, 268-270. While this can be mitigated somewhat through the use of mixtures of modified dCTP and dUTP, the incorporation efficiency of dCTP analogues is generally lower than that of corresponding dUTP analogues 211, 272. Finally, uniform incorporation rates for these modified nucleotides require the use of polymerases with lower base selectivity and therefore a higher error rate 275, which can be detrimental to the overall fidelity of the platform, as the prevalence of errors in the labeled target sequence will increase the incidence of hybridization artifacts.
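The length and composition dependence of labeling described above can be illustrated with a rough sketch. It assumes, purely for illustration, that a fixed fraction of one base type (here C) is replaced by its fluorescent analogue regardless of sequence context; the function name, the two example sequences, and the 5% substitution fraction are hypothetical and are not values taken from the cited studies.

```python
def expected_fluorophores(
    seq: str,
    labeled_base: str = "C",
    substitution_fraction: float = 0.05,
) -> float:
    """Expected number of dye-carrying residues in one labeled strand.

    Assumes every occurrence of `labeled_base` is substituted with the same
    probability, independent of its neighbours (an idealization).
    """
    return seq.upper().count(labeled_base) * substitution_fraction


# Two hypothetical transcripts present at the same copy number.
short_c_poor = "ATTATAATCA" * 50   # 500 nt, 1 C per 10 nt
long_c_rich = "GCCGCAGCCT" * 300   # 3000 nt, 5 C per 10 nt

dyes_short = expected_fluorophores(short_c_poor)
dyes_long = expected_fluorophores(long_c_rich)
print(f"short, C-poor transcript: {dyes_short:.1f} dyes per molecule")
print(f"long, C-rich transcript:  {dyes_long:.1f} dyes per molecule")
print(f"signal ratio at equal abundance: {dyes_long / dyes_short:.0f}-fold")
```

Under these assumptions, two transcripts at identical abundance differ 30-fold in expected signal, which is the kind of length- and sequence-dependent bias that end-labeling strategies aim to remove.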
Proposed methods to lessen the impact of size- and sequence-dependent variability in labelling 271 include labelling the 3’-end of each cDNA or cRNA enzymatically through end labelling with RNA ligase from the bacteriophage T4 276, or hybridizing fluorescent dendrimers to a conserved sequence introduced into each cDNA via the oligo(dT) primer 226, 277.

The correlation of hybridization signal with transcript abundance can also be biased as a result of the amplification protocols used to prepare sufficient amounts of analyte for interrogation (see section 1.3.1.3), including linear amplification protocols utilizing T7 RNAP to generate aRNA products. For example, successive rounds of linear amplification lead to a decrease in the average length of, and an increase in the average number of sequence errors in, amplified products 283, biasing the composition of the analyte towards the 3’-end of the transcript population and reducing the information content of the analyte relative to the full transcriptome. Moreover, as lower abundance templates are more susceptible to stochastic variation in amplification, the reproducibility of low abundance transcript measurements is compromised 278.

While these issues do not preclude the application of microarray technology to the provision of relative measurements in transcript abundances, some or all will need to be addressed to establish a reliable array-based technology for global measurement of absolute transcript abundances.

1.3.3. Advances in analogue-based transcriptomics technologies

Although incremental advances have been made in various facets of analogue-based platforms, from sample preparation and labelling to array fabrication and target hybridization, three recent advances in the design and chemistry of these platforms have significantly improved their potential to provide absolute measurements of the transcriptome: 1) the development of technologies that enable dramatic increases in feature densities and therefore the depth and replication of transcriptome analysis, 2) the creation of “universal” format arrays that can be applied to any sample of interest, and 3) the mitigation of probe signal variance arising from sequence/length-dependent thermodynamics.

1.3.3.1. Genome tiling arrays

Photolithography-based methods have enabled the production of oligonucleotide arrays with feature densities of >10^6 probes per array. This increase in feature density in turn provides the ability to design sets of 25-50 mer probes covering the entire library of expressed genes for a higher-level organism, thereby allowing a single array to be used to analyze transcripts from all tissues within an organism. The approach is not strictly universal, and the successful design of such a comprehensive probeset requires a priori sequence information for the organism. It does, however, extend the range of applications for a given array.

An alternative approach involves the creation of probesets that span the entire genome of an organism, such that the entire genomic sequence is represented through a series of overlapping or contiguous oligonucleotide probes 285-292. Like digital methods such as SAGE, this format confers the ability to interrogate any potentially transcribed mRNA sequence, permitting the de novo identification of previously unidentified gene products. Moreover, since exons are generally >100 bases in size 293, they can be well represented within the probe set of the tiling array, allowing the detection of all potential splice variants that may exist for a particular gene product. Results from tiling arrays have therefore been used to characterize the breadth of transcriptomes by identifying every transcriptionally active region (TAR) of the genome, including antisense transcripts, pseudogenes, microRNAs, and other non-coding RNAs 285-292, 294-296.

In general, multiple tiling arrays are required for complete genome sequence coverage, dramatically increasing experimental costs relative to standard spotted arrays. Although longer probe lengths and reduced overlap can be utilized to reduce the number of arrays required, this approach can compromise both the resolution and the precision of the platform. Moreover, the potential application of tiling arrays to quantitate absolute transcript abundance is subject to the same weaknesses found in traditional arrays (section 1.3.2). For example, the impact of competition is expected to be more pronounced, as the hybridization signal will be further convoluted by target:probe competition among the overlapping probes that are directed to a particular exon or potentially shared among different exons, by competition amongst targets that share common exons, and by differences in the total number of probes directed to each TAR, since TARs are known to vary in length 299. As a result, as with traditional DNA microarrays, current uses of tiling arrays are restricted to comparative transcriptome studies.

1.3.3.2. Bead arrays

A promising alternative strategy to traditional microarrays involves creating a library of polystyrene beads (ca. 3 µm dia) in which each bead displays a unique oligonucleotide probe covalently attached to the surface.
Spatial dispersion of the beads then permits isolation of the hybridization signal arising from a unique target:probe complex 300-305. The probe sequence displayed on a given bead is identified through the introduction of a unique emission spectrum signature, encoded by impregnating the bead with various combinations of fluorophores 306, or through the use of a barcode oligonucleotide sequence that can be identified through the hybridization of fluorescent “decoder” oligonucleotides 307, 308. Spatial isolation, hybridization and detection can then be achieved through deposition of individual beads into nanowells within a fiber optic bundle. As each bead is associated with an individual optical fiber, this permits the simultaneous excitation and detection of fluorescent signals associated with a particular bead, with little or no signal degradation. Typically, bead mixtures are deposited onto etched fiber optic bundles (ca. 1.4 mm dia) capable of capturing and segregating 50,000 beads 300, 303, although this feature density can be increased by employing smaller beads and appropriately scaled fiber optic bundles. The fiber optic bundle containing the impregnated beads is then immersed into the labeled target solution 302, 303, and hybridization can either be imaged in real-time or imaged following washing and drying, with concentrations as low as 0.1 pM detected over a dynamic range of 3.2 orders of magnitude 303, matching the performance characteristics of traditional microarray platforms 310, 311.

State-of-the-art “bead arrays” utilize a population of 1536 distinguishable beads, each with a unique probe sequence (50 mer probe + 29 mer barcode), that are loaded onto a single fiber optic bundle capable of holding 50,000 beads 303, 307. The resulting 30-fold over-sampling of beads ensures that at least 5 copies of each bead are embedded within the bundle, providing a means to internally measure signal variability 307. The loading of each such bundle can be repeated, using beads with different probes but the same set of barcodes, creating an “array of arrays” comprising up to ~148,000 unique probe sequences 303 and therefore comparable to the feature densities achieved in standard spotted microarray technology. Modification or customization of a bead array does not require the wholesale redesign of the array, but instead requires the replacement of a single bead type with another. As such, this bead-based platform could provide a universal format for gene expression analysis, as cross-tissue and cross-species analysis can be achieved simply by introducing a different set of beads.

However, while bead array technology represents a potentially powerful analogue-based platform for transcriptome analysis, it relies on the hybridization of a labeled target to a surface-immobilized probe and therefore is subject to the same biophysical limitations found in traditional microarray technologies. In particular, the sample preparation method is identical to that commonly used in standard microarray platforms, making universal bead arrays subject to the same sequence- and length-dependent biases introduced in the amplification and labeling of the starting material. As a result, the technology is currently restricted to comparative transcriptome studies.

1.3.3.3. Universal k-mer library arrays

Truly universal arrays consisting of fully degenerate k-mer libraries of probes were among the first microarray platforms constructed, by virtue of the simplicity with which a set of short k-mer probes could be synthesized.
As the number of probe sequences within a k-mer library scales as 4^k, these arrays typically set k at 10 or less, and their performance was therefore limited by the low thermodynamic stabilities of the target:probe duplexes [312]. Universal arrays in which k < 9 have therefore found limited use in functional genomics, but studies utilizing k = 8 libraries of oligo-purines or oligo-pyrimidines (i.e., 2^8 = 256 probes) have provided valuable insights on the impact of varying hybridization conditions on target:probe duplex formation [155, 313]. Other studies have utilized k = 4 libraries presented on the 3’-end of an octamer sequence to examine the effect of terminal mismatches on target hybridization in thin gel matrices attached to surfaces. Finally, fully degenerate k = 6 libraries have been used to illustrate the range of thermal stabilities of immobilized probeset sequences [312], and have been used to analyze the sequence-dependent properties of the minor groove binding dye Hoechst 33258 [315], as well as the impact of various base analogues on duplex stabilities [316]. The increase in achievable feature densities afforded by photolithographic and other modern techniques [317] has made possible the creation of universal nonamer (k = 9) and decamer (k = 10) probe libraries, creating the potential for a sequencing by hybridization (SBH) approach to transcriptome analysis [318, 319]. The ability to fully deconvolute a sequence spectrum generated from the hybridization of a single target DNA for de novo sequencing via these universal nonamer or decamer arrays is theoretically limited to lengths of 400 and 800 bases, respectively [319]. Despite this inherent limitation, the application of universal k-mer arrays to analysis of the sequence content of a transcriptome has been explored theoretically, and results suggest that probes 10-16 bases in length might be sufficiently specific to identify individual transcripts, falling at or above the
upper range of feature densities (k = 12; 16.8 million features) currently possible. However, because a putative hybridization signal is a function not only of the abundance of the transcript, but also of the number of times a sequence is present within each transcript, deconvolution of the resulting panel of hybridization signals in order to uniquely identify and infer the abundance of each transcript in the transcriptome is unlikely. Alternative methods have therefore been explored to reduce the sequence complexity of the transcriptome so that it is amenable to analysis via universal k-mer libraries. By reducing the sequence representation of each transcript (essentially recapitulating the principle of SAGE), a one-to-one correspondence between transcript abundance and the signal at a probe register can in theory be created. In one such platform, termed Gencompass [321], which utilizes a universal hexamer array (4096 registers), SSTs are generated as in SAGE to yield adaptor-modified 10-base sequence tags that are then split into up to 256 aliquots. A second 25-bp adaptor, possessing an overhang sequence comprised of one of 256 possible tetramers, is then ligated to the pool of released SSTs in each aliquoted sample. Following their amplification via the PCR, as one primer is biotinylated, incubation with streptavidin beads allows the preparation of a single-stranded target. One of 256 unique fluorescent decoder oligonucleotides complementary to the second adaptor is then introduced and hybridized to the target, leaving a 6-mer tag unique to the transcript available for hybridization onto the universal array consisting of a k = 6 degenerate probeset. Specificity in hybridization signals is thought to be improved by the introduction of a ligation step to join the probes of the hexamer array to the detector oligonucleotide, using the target as a template.
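The way a 10-base SST is read out as one of 256 tetramers plus one of 4096 hexamer registers can be sketched as follows. This is a minimal illustration and not the published Gencompass protocol: which four bases of the tag are covered by the decoder is an assumption here (the first four), and all function names are hypothetical.

```python
BASES = "ACGT"


def kmer_index(seq):
    """Encode a DNA sequence as a base-4 integer (A=0, C=1, G=2, T=3)."""
    index = 0
    for base in seq:
        index = index * 4 + BASES.index(base)
    return index


def split_tag(tag, decoder_len=4):
    """Split a 10-mer SST into (decoder tetramer, array hexamer).

    Assumption: the decoder-covered bases are the first `decoder_len`
    bases of the tag; the source text does not specify the orientation.
    """
    if len(tag) != 10 or set(tag) - set(BASES):
        raise ValueError("expected a 10-base tag over A, C, G, T")
    return tag[:decoder_len], tag[decoder_len:]


def readout_address(tag):
    """Return (decoder number 0-255, array register 0-4095) for a tag."""
    decoder, hexamer = split_tag(tag)
    return kmer_index(decoder), kmer_index(hexamer)


if __name__ == "__main__":
    print(readout_address("ACGTTTGACA"))
```

With 256 decoders and 4096 registers, such a scheme addresses 256 × 4096 = 4^10 = 1,048,576 possible 10-mers, which is why the aggregate of the two signals can stand in for a full decamer array.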
This decoding-ligation step is repeated for each of 256 decoder oligonucleotides, and the aggregate signal (6-mer array + 4-mer decoder) corresponding to the 10-mer SST is used to give a measure of transcript abundance. While in principle a viable method to reduce the complexity of the transcriptome for interrogation via k-mer universal arrays, this approach has only been used to provide measurements of relative changes in transcript abundance. Given the potential for bias arising during material amplification [88], as well as the known tolerance of ligases to mismatches in joining overhang sequences [324], this particular approach is unlikely to permit absolute measurements of transcript abundance.

1.3.3.4. Locked nucleic acid (LNA) arrays

As noted above, the ability to obtain truly quantitative measurements of target abundance is hampered both by the sequence-dependent thermodynamics of the target:probe interaction and, in certain instances, the sequence-dependent labelling intensities arising from the target labelling strategies utilized [212-214, 310]. This is particularly problematic for probes of shorter length, as the range of thermal stabilities of target:probe duplexes is then large, and in extreme instances, hybridization of AT-rich sequences may not yield a detectable signal. Moreover, while mismatched probe sequences are less stable relative to their perfectly matched counterparts, hybridization conditions which permit the capture of AT-rich targets may render GC-rich probes non-specific. However, new classes of nucleotide analogues, classified by the moiety within the nucleotide that is modified (i.e., the nucleobase, the ribose sugar, and/or the phosphate backbone), can potentially be used to address these issues through their use in probesets.
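To give a feel for how wide the spread of duplex stabilities is across a short-probe library, the sketch below applies the Wallace rule of thumb (roughly 2 °C per A/T pair plus 4 °C per G/C pair). This rule is only a crude approximation for short oligonucleotides and is not the thermodynamic treatment used in this thesis; the function names are illustrative.

```python
from itertools import product


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in deg C: 2 per A/T plus 4 per G/C."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * gc


def tm_range(k):
    """Lowest and highest rule-of-thumb Tm over all k-mers."""
    tms = [wallace_tm("".join(p)) for p in product("ACGT", repeat=k)]
    return min(tms), max(tms)


if __name__ == "__main__":
    low, high = tm_range(6)
    print(f"hexamers: {low} to {high} deg C")
```

Under this approximation the most GC-rich probe in a library is assigned twice the melting temperature (in °C) of the most AT-rich one, which illustrates why no single hybridization temperature suits an unmodified short-probe library.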
Driven largely by the desire to create more stable anti-sense molecules as potential therapeutics [326-328], many such analogues have now been reported [329]. Among these, locked nucleic acids (LNAs) have shown the most promise for adaptation to microarray technology [330-332]. Locked nucleic acids (LNAs) are analogues of RNA in which the 2’-oxygen of the ribose sugar is covalently linked to the 4’-carbon through a methylene bridge, which serves to “lock” the sugar in the C3’-endo conformation. This subtle modification also removes the requirement for protection of the 2’-hydroxyl group while allowing the use of standard 2-cyanoethyl phosphoramidite coupling chemistry typically used for DNA synthesis, with coupling yields approaching those typically observed for DNA bases (>98%). LNAs confer unique properties relative to their unlocked counterparts by conformationally restricting the duplex to a more stable “A”-form, leading to a decrease in the dissociation rate for formed duplexes [337, 338], and allowing LNAs to “invade” DNA:DNA and DNA:RNA duplexes to form an intermolecular LNA:DNA duplex. Studies on the thermodynamic properties of LNAs have shown that the introduction of a single LNA base can increase the melting temperature (Tm) of a complementary duplex by 1-10 °C, and accurate models (±1.4 °C) for predicting the melting thermodynamics of duplexes containing a single LNA base have been developed from UV melt data. Significant progress in the prediction of melting thermodynamics for duplexes containing multiple LNA substitutions has also been achieved through the use of statistical prediction algorithms (±5 °C). An increase in stability is generally observed, but the degree is sequence dependent, and is also dependent on the nature of the base substitution as well as its position within the duplex. In rare instances, an LNA substitution can destabilize the duplex.
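The idea of using LNA substitutions to raise the stability of a weak duplex can be illustrated with a deliberately simple calculation. It assumes a fixed Tm gain per LNA base, which the text above makes clear does not hold in practice (the reported gain of 1-10 °C depends on sequence and position). The default of 3 °C per substitution and the function name are placeholders chosen for illustration only.

```python
import math


def lna_bases_needed(base_tm, target_tm, probe_len, gain_per_lna=3.0):
    """Estimate how many LNA substitutions lift a probe to target_tm.

    Assumes (unrealistically) a constant Tm gain per substitution.
    Returns None if the target cannot be reached within probe_len bases.
    """
    if gain_per_lna <= 0:
        raise ValueError("gain_per_lna must be positive")
    deficit = target_tm - base_tm
    if deficit <= 0:
        return 0  # probe already meets or exceeds the target
    needed = math.ceil(deficit / gain_per_lna)
    return needed if needed <= probe_len else None
```

A real design would replace the constant gain with a sequence- and position-dependent prediction model of the kind cited above; the sketch only shows why AT-rich probes would need more substitutions than GC-rich ones to reach a common Tm.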
These properties have enabled the use of LNAs in detecting very short transcribed RNA molecules called microRNAs (ranging between 21 and 25 bases in length) via Northern blotting [342] and in situ hybridization. Such detection is not possible using traditional DNA-based probes, highlighting the unique and valuable properties of LNAs. Alternating locked thymidine oligonucleotides have been used to directly isolate polyadenylated mRNA in the presence of strong denaturants. As well, their use as 5’-nuclease probes has been shown to increase the sensitivity of real-time quantitative PCR assays. However, while LNAs provide a general route for the creation of high-affinity reagents, their overriding benefit as probes for microarrays arises from the compatibility of their chemistry with standard protocols for the solid-phase synthesis of DNA. This unique property allows the creation of LNA+DNA “mix-mers”, where LNAs can be incorporated in any position of a synthetic mix-mer oligonucleotide. Probes of defined LNA content can therefore be synthesized to “tune” melting thermodynamics, allowing the creation of universal probesets with a uniform Tm. The tolerance of LNAs to mismatches is also significantly lower than that of DNA, though the magnitude of this effect is dependent on the mismatch site, the length of the oligomer, and the identity, relative position and number of LNA substitutions. Small microarrays have therefore recently been designed to interrogate microRNA populations with LNA probesets designed to melt at ca. Tm = 72 °C [332], as well as to analyze a cluster of transcripts targeted by the cytochrome P450 gene in C. elegans [330]. In both cases, the sensitivity of the LNA-based probes exceeded that of the corresponding DNA probes.

1.4. THESIS OBJECTIVES

Moving from characterization of transcript abundances relative to an arbitrary standard state to the absolute quantification of transcript numbers would have a profound impact on systems biology, as well as our ability to discover new pharmaceuticals, detect and monitor disease, evaluate treatments, and ultimately, predict and prevent illness. An inexpensive, universal platform that provides highly reproducible and quantitative transcriptome profiling on a per cell basis could therefore dramatically increase the use of microarray technology in health management and in the development of accurate models of disease and associated cellular responses to external/environmental stimuli (e.g., infectious agents, carcinogens). The objective of this thesis is to fuse the strengths of DNA microarrays and SAGE, and to exploit recent advances in nucleotide analogues to work toward the development of a powerful new universal microarray platform for SAGE-like analysis of transcriptomes that provides researchers and clinicians with the following:

• rapid and absolute quantification of gene expression patterns in both pure and mixed-cell populations
• identification and subsequent annotation of unknown genes

This new technology, herein referred to as the U-STAR (Universal Sequence Tag ARray) platform, will attempt to build on the existing strengths of SAGE to prepare a novel SST library suitable for microarray analysis that faithfully retains the abundance information of the original mRNA population. Significant improvements in current sample preparation protocols are required, and chapters 2 through 4 of this thesis therefore provide a careful study of errors and biases introduced in the current protocols used to prepare SAGE sequencing libraries, and then document a number of modifications to the protocol that work to preserve the information content of the original sample.
First, an investigation into the efficiency of the ligation step joining the SAGE adaptor to the 3’-end restricted library is presented in Chapter 2. As ligation of this adaptor enables the release of a sequence tag from the cDNA population, inefficiencies in this ligation step can lead to an under-representation or even complete absence of low abundance transcript representations within the SST population. Losses are observed, but can be overcome using a new method termed directed ligation chemistry (DLC), which ensures the quantitative conversion of the 3’-end cDNA to the adaptor-modified product, thereby enabling full and accurate representation of the transcriptome within the set of released SSTs. Second, the steps involved in the creation of an SST dimer, or “ditag”, are examined. These steps convert the SST population into a form that enables the application of the PCR so that sufficient amounts of material for downstream processing are obtained. Results in Chapter 3 show that spurious enzymatic products are formed during ditag formation that can bias the SST population upon the application of the PCR and subsequent purification. Methods for eliminating this problem are presented. Finally, as the preparation of a SAGE library requires a relatively large amount of material, the impact of material amplification as it pertains to SAGE is examined in Chapter 4. In particular, the ability of the PCR to preserve the original distribution of SSTs within the amplified pool used for SAGE library production is assessed, and amplification is shown to significantly alter the distribution of transcript representations in the analyte. A number of modifications to the ditag amplification protocol are then introduced to eliminate this source of bias.
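The concern about amplification can be made concrete with the standard exponential model of PCR, in which a template present at N0 copies grows to N0·(1 + E)^n after n cycles at a per-cycle efficiency E. The efficiencies and cycle number used below are invented for illustration and are not measurements from this thesis; the function names are likewise hypothetical.

```python
def amplified_copies(n0, efficiency, cycles):
    """Copies after `cycles` rounds of PCR at a constant per-cycle efficiency (0-1)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return n0 * (1.0 + efficiency) ** cycles


def ratio_distortion(eff_a, eff_b, cycles):
    """Factor by which the A:B abundance ratio changes after amplification."""
    return ((1.0 + eff_a) / (1.0 + eff_b)) ** cycles


if __name__ == "__main__":
    # Two tags that start at equal abundance but amplify at slightly
    # different (assumed) efficiencies for 25 cycles.
    print(f"{ratio_distortion(0.95, 0.90, 25):.2f}")
```

Under these assumed values, a five-point difference in per-cycle efficiency nearly doubles the apparent ratio of two tags that began at equal abundance, which is the kind of compounding distortion that Chapter 4 sets out to measure and remove.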
Together, these improvements in the protocol used to generate a SAGE sequencing library can be combined with improvements in sequencing technologies and depths of sequencing to establish a powerful universal digital platform for quantifying transcriptome profiles on an absolute scale. However, the economics and accessibility of such a universal transcriptomics technology could be improved by utilizing a microarray format to analyze the SST sample. Chapter 5 therefore reports on progress toward the development of U-STAR, which presents a combinatorial registered array of k-mer probes whose chemistry has been carefully engineered to overcome the many problems of traditional k-mer arrays documented in section 1.3.3.3, in an attempt to create a detection platform that accurately records the abundance information contained in the analyte. U-STAR seeks to eliminate differences in rates of mass transfer arising from targets of varying length by creating short SST representations of equal length. Long-standing and significant limitations to universal array formats related to differences in melting temperatures and hybridization thermodynamics arising from probe composition are also addressed through the introduction of LNAs into the short surface-anchored probes. Introduction of LNAs into the probeset allows the creation of an array with uniform melting thermodynamics, permitting stringent hybridization conditions to be set that maximize signal to noise while minimizing target hybridization to partially or non-complementary registers. As a result, hybridization signals will directly reflect the concentration of a particular SST within the target solution, and therefore allow transcript abundances to be determined on an absolute scale.

1.5. U-STAR - THE UNIVERSAL SEQUENCE TAG ARRAY - A BRIEF CONCEPTUAL OVERVIEW

Building on SAGE technology, the U-STAR concept for analysis of gene expression begins with the capture of an mRNA sample on poly(dT) Dynabeads, synthesis of an anchored cDNA copy of the mRNA population, and digestion with a type II anchoring endonuclease (ARE) to retain and purify the most 3’-end restriction fragments of the digested population. A novel U-STAR oligonucleotide adaptor containing a recognition sequence for a Type IIS RE is then ligated to the 5’ end of the (+) strand of the anchored cDNA. As shown in Figure 1.2, the U-STAR adaptor offers several unique features essential to the proposed technology: (1) a hairpin chemistry that increases the melting temperature of the adaptor self-duplex above 85 °C, (2) a 3’ overhang sequence complementary to the ARE site, (3) a 5-methyl-dC within the overhang, (4) a recognition sequence for the Type IIS RE Acu I positioned to create a CATG + 9- or 10-mer SST for each transcript, (5) a 5’-OH which prevents ligation of the 5’ end of the adaptor to the 3’ end of the (-) strand of the cDNA, and (6) a fluorescent reporter group conjugated to the distal hairpin thymidine in precise 1:1 stoichiometry. The SST is then released by digestion with Acu I, partially denatured to remove the 11-12 base (-) strand of the SST that remains unligated to the U-STAR adaptor, and gel purified, resulting in a hairpin molecule (STARtag) possessing a single-stranded CATG + 9- or 10-mer SST derived from the transcript that is available for hybridization to the universal array (Figure 1.2). Multiple sets of STARtags can be created for the sample by splitting the mRNA and processing each aliquot with a unique ARE, thereby ameliorating tag degeneracy issues to achieve complete coverage of the transcriptome.
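The tag-extraction logic described above can be sketched in silico. The sketch assumes NlaIII (recognition site CATG) as the anchoring enzyme and takes a sense-strand transcript sequence as input; the function name and the example sequence are made up for illustration and do not correspond to any real transcript.

```python
def extract_sst(transcript, site="CATG", tag_len=10):
    """Return the anchoring site plus the tag_len bases 3' of its 3'-most occurrence.

    Returns None if the site is absent, or if it lies too close to the
    3' end of the sequence to yield a full-length tag.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)  # 3'-most occurrence on the sense strand
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    if len(tag) < tag_len:
        return None
    return site + tag


if __name__ == "__main__":
    example = "GGCATGAAACCCGGGTTTCATGACGTACGTTAGCAAAAAAAA"
    print(extract_sst(example))              # CATG + 10-mer
    print(extract_sst(example, tag_len=9))   # CATG + 9-mer
```

Transcripts for which the function returns None are the ones a single anchoring enzyme cannot represent, which is one reason processing separate aliquots with different anchoring enzymes can improve coverage.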
To overcome low yields of anchored 3’-end cDNAs that are successfully ligated to added SAGE adaptors (arising from excessive self-ligation of the self-complementary sticky ends of the anchored cDNA fragments under reaction conditions currently used in the SAGE protocol), an alternative method of ligation, termed “directed ligation chemistry” (DLC), is used. DLC exploits the favorable mass-action condition created by the presence of NlaIII in the ligation reaction in conjunction with the U-STAR adaptor that contains a methylated base within the ligation site: as NlaIII cannot process duplexes containing the methylated base, complete conversion of the anchored 3’-end cDNA to the desired adaptor-modified product is achieved. This new protocol therefore ensures that an SST will be generated from every transcript, greatly enhancing the fidelity of U-STAR and obviating the need for SST amplification prior to interrogation on the array. The U-STAR array embodies the concept and associated advantages of using a single array to interrogate any organism(s) or tissue sample(s) of interest. The array is combinatorial in design, displaying all possible 9- or 10-mer polynucleotide sequences. As noted above, the performance of short-probe based microarrays, including the few previous attempts to create a universal array format [321], has been challenged by the inherently large differences in melting temperatures between AT-rich and GC-rich duplexes [312]. This problem is overcome in U-STAR through the introduction of locked nucleic acids (LNAs) into the probeset displayed on the universal registered array. Together, these proposed advances are intended to yield a U-STAR prototype capable of accurately identifying and quantifying sub-femtomole quantities of a given transcript, thereby paving the way for development of a powerful, optimized platform for universal absolute transcriptomics.
[Figure 1.1 flow chart, panels A and B. Steps shown include: RNA; affinity purification with oligo dT Dynabeads; first strand synthesis; second strand synthesis; NlaIII digest; adaptor ligation; blunt-ending with Klenow; ditag formation; PCR amplification; release of 26 bp ditags; concatemerization; insertion into sequencing vector; host transformation. Intermediate clean-up steps: PCI extraction, ethanol precipitation, PAGE purification, gel extraction.]

Figure 1.1. Outline of the SAGE protocol. A. Processing steps leading to the formation and release of short sequence tags (SAGE tags) from a starting mRNA sample. B. Released SAGE tags are then converted through a series of steps into concatemers for cloning into a sequencing library for analysis.

[Figure 1.2 diagram. Labels shown: JOE reporter; Acu I recognition sequence (CTGAAG); DLC.]

Figure 1.2. Diagram and application of the U-STAR adaptor. Features of the U-STAR adaptor permit the quantitative conversion of the anchored cDNA population into the adaptor-modified product under directed ligation chemistry (DLC). Ligation leads to the covalent attachment to the (+) strand of the cDNA strand, while the presence of the 5’-OH on the adaptor prevents ligation to the (-) strand.

1.6. REFERENCES

1. Kal, A.J. et al. Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Mol Biol Cell 10, 1859-1872 (1999).
2. Ogawa, N., DeRisi, J. & Brown, P.O. New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Mol Biol Cell 11, 4309-4321 (2000).
3. Gross, C., Kelleher, M., Iyer, V.R., Brown, P.O. & Winge, D.R.
Identification of the copper regulon in Saccharomyces cerevisiae by DNA microarrays. J Biol Chem 275, 32310-32316 (2000).
4. Jia, M.H. et al. Global expression profiling of yeast treated with an inhibitor of amino acid biosynthesis, sulfometuron methyl. Physiol Genomics 3, 83-92 (2000).
5. Natarajan, K. et al. Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol Cell Biol 21, 4347-4368 (2001).
6. Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-3297 (1998).
7. Velculescu, V.E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
8. Primig, M. et al. The core meiotic transcriptome in budding yeasts. Nat Genet 26, 415-423 (2000).
9. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-14868 (1998).
10. Altmann, C.R. et al. Microarray-based analysis of early development in Xenopus laevis. Dev Biol 236, 64-75 (2001).
11. Roberts, C.J. et al. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 287, 873-880 (2000).
12. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J. & Church, G.M. Systematic determination of genetic network architecture. Nat Genet 22, 281-285 (1999).
13. Jelinsky, S.A., Estep, P., Church, G.M. & Samson, L.D. Regulatory networks revealed by transcriptional profiling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes. Mol Cell Biol 20, 8157-8167 (2000).
14. Pilpel, Y., Sudarsanam, P. & Church, G.M. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29, 153-159 (2001).
15. Zhang, M.Q. Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 23, 233-250 (1999).
16.
Wolfsberg, T.G. et al. Candidate regulatory sequence elements for cell cycle-dependent transcription in Saccharomyces cerevisiae. Genome Res 9, 775-792 (1999).
17. Simpson, L. et al. PTEN expression causes feedback upregulation of insulin receptor substrate 2. Mol Cell Biol 21, 3947-3958 (2001).
18. Shen, X. et al. The activity of guanine exchange factor NET1 is essential for transforming growth factor-beta-mediated stress fiber formation. J Biol Chem 276, 15362-15368 (2001).
19. Roy, D., Calaf, G. & Hei, T.K. Profiling of differentially expressed genes induced by high linear energy transfer radiation in breast epithelial cells. Mol Carcinog 31, 192-203 (2001).
20. Seki, M. et al. Monitoring the expression pattern of 1300 Arabidopsis genes under drought and cold stresses by using a full-length cDNA microarray. Plant Cell 13, 61-72 (2001).
21. Riehle, M.M., Bennett, A.F. & Long, A.D. Genetic architecture of thermal adaptation in Escherichia coli. Proc Natl Acad Sci USA 98, 525-530 (2001).
22. Smoot, L.M. et al. Global differential gene expression in response to growth temperature alteration in group A Streptococcus. Proc Natl Acad Sci USA 98, 10416-10421 (2001).
23. Sherman, D.R. et al. Regulation of the Mycobacterium tuberculosis hypoxic response gene encoding alpha-crystallin. Proc Natl Acad Sci USA 98, 7534-7539 (2001).
24. Schoolnik, G.K. et al. Whole genome DNA microarray expression analysis of biofilm development by Vibrio cholerae O1 El Tor. Methods Enzymol 336, 3-18 (2001).
25. Simmen, K.A. et al. Global modulation of cellular transcription by human cytomegalovirus is initiated by viral glycoprotein B. Proc Natl Acad Sci USA 98, 7140-7145 (2001).
26. Shires, J., Theodoridis, E. & Hayday, A.C. Biological insights into TCRgammadelta+ and TCRalphabeta+ intraepithelial lymphocytes provided by serial analysis of gene expression (SAGE). Immunity 15, 419-434 (2001).
27. Shaffer, A.L. et al. Signatures of the immune response. Immunity 15, 375-385 (2001).
28. Ragno, S.
et al. Changes in gene expression in macrophages infected with Mycobacterium tuberculosis: a combined transcriptomic and proteomic approach. Immunology 104, 99-108 (2001).
29. Sampson, N.S. et al. Global gene expression analysis reveals a role for the alpha 1 integrin in renal pathogenesis. J Biol Chem 276, 34182-34188 (2001).
30. Rozzo, S.J. et al. Evidence for an interferon-inducible gene, Ifi202, in the susceptibility to systemic lupus. Immunity 15, 435-443 (2001).
31. Ramanathan, M. et al. In vivo gene expression revealed by cDNA arrays: the pattern in relapsing-remitting multiple sclerosis patients compared with normal subjects. J Neuroimmunol 116, 213-219 (2001).
32. Zohlnhofer, D. et al. Gene expression profiling of human stent-induced neointima by cDNA array analysis of microscopic specimens retrieved by helix cutter atherectomy: Detection of FK506-binding protein 12 upregulation. Circulation 103, 1396-1402 (2001).
33. Virtaneva, K. et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc Natl Acad Sci USA 98, 1124-1129 (2001).
34. Sorlie, T. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98, 10869-10874 (2001).
35. Siebert, R., Rosenwald, A., Staudt, L.M. & Morris, S.W. Molecular features of B-cell lymphoma. Curr Opin Oncol 13, 316-324 (2001).
36. Shridhar, V. et al. Genetic analysis of early- versus late-stage ovarian tumors. Cancer Res 61, 5895-5904 (2001).
37. Ryu, B., Jones, J., Hollingsworth, M.A., Hruban, R.H. & Kern, S.E. Invasion-specific genes in malignancy: serial analysis of gene expression comparisons of primary and passaged cancers. Cancer Res 61, 1833-1838 (2001).
38. Rickman, D.S. et al. Distinctive molecular profiles of high-grade and low-grade gliomas based on oligonucleotide microarray analysis. Cancer Res 61, 6885-6891 (2001).
39. Adams, M.D.
Serial analysis of gene expression: ESTs get smaller. Bioessays 18, 261-262 (1996).
40. Righetti, P.G., Gelfi, C. & D’Acunto, M.R. Recent progress in DNA analysis by capillary electrophoresis. Electrophoresis 23, 1361-1374 (2002).
41. Franca, L.T., Carrilho, E. & Kist, T.B. A review of DNA sequencing techniques. Q Rev Biophys 35, 169-200 (2002).
42. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484-487 (1995).
43. Hu, M. & Polyak, K. Serial analysis of gene expression. Nat Protoc 1, 1743-1760 (2006).
44. Szybalski, W., Kim, S.C., Hasan, N. & Podhajska, A.J. Class-IIS restriction enzymes--a review. Gene 100, 13-26 (1991).
45. Matsumura, H. et al. SuperSAGE. Cell Microbiol 7, 11-18 (2005).
46. Hashimoto, S. et al. 5’-end SAGE for the analysis of transcriptional start sites. Nat Biotechnol 22, 1146-1149 (2004).
47. Gowda, M., Jantasuriyarat, C., Dean, R.A. & Wang, G.L. Robust-LongSAGE (RL-SAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis. Plant Physiol 134, 890-897 (2004).
48. Wei, C.L. et al. 5’ Long serial analysis of gene expression (LongSAGE) and 3’ LongSAGE for transcriptome characterization and genome annotation. Proc Natl Acad Sci USA 101, 11701-11706 (2004).
49. Harbers, M. & Carninci, P. Tag-based approaches for transcriptome research and genome annotation. Nat Methods 2, 495-502 (2005).
50. Barrett, T. et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 33, D562-566 (2005).
51. Kuo, B.Y. et al. SAGE2Splice: unmapped SAGE tags reveal novel splice junctions. PLoS Comput Biol 2, e34 (2006).
52. Saha, S. et al. Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-512 (2002).
53. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289-1292 (2001).
54. Velculescu, V.E., Vogelstein, B. & Kinzler, K.W.
Analysing uncharted transcriptomes with SAGE. Trends Genet 16, 423-425 (2000).
55. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289-1292 (2001).
56. Boheler, K.R. & Stern, M.D. The new role of SAGE in gene discovery. Trends Biotechnol 21, 55-57; discussion 57-58 (2003).
57. Chen, J. et al. Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci USA 99, 12257-12262 (2002).
58. Lewin, B. Gene expression, Edn. 2d. (Wiley, New York; 1980).
59. Goffeau, A. et al. Life with 6000 genes. Science 274, 546, 563-567 (1996).
60. Hereford, L.M. & Rosbash, M. Number and distribution of polyadenylated RNA sequences in yeast. Cell 10, 453-462 (1977).
61. Stern, M.D., Anisimov, S.V. & Boheler, K.R. Can transcriptome size be estimated from SAGE catalogs? Bioinformatics 19, 443-448 (2003).
62. Stollberg, J., Urschitz, J., Urban, Z. & Boyd, C.D. A quantitative evaluation of SAGE. Genome Res 10, 1241-1248 (2000).
63. Lewin, B. Genes VII. (Oxford University Press, Oxford; 2000).
64. Holland, M.J. Transcript abundance in yeast varies over six orders of magnitude. J Biol Chem 277, 14363-14366 (2002).
65. Thygesen, H.H. & Zwinderman, A.H. Modeling Sage data with a truncated gamma-Poisson model. BMC Bioinformatics 7, 157 (2006).
66. Ruijter, J.M., Van Kampen, A.H. & Baas, F. Statistical evaluation of SAGE libraries: consequences for experimental design. Physiol Genomics 11, 37-44 (2002).
67. Lee, S., Chen, J., Zhou, G. & Wang, S.M. Generation of high-quantity and quality tag/ditag cDNAs for SAGE analysis. BioTechniques 31, 348-350, 352-354 (2001).
68. Chen, J. et al. The pattern of gene expression in mouse Gr-1(+) myeloid progenitor cells. Genomics 77, 149-162 (2001).
69. Patankar, S., Munasinghe, A., Shoaibi, A., Cummings, L.M. & Wirth, D.F.
Serial analysis of gene expression in Plasmodium falciparum reveals the global expression profile of erythrocytic stages and the presence of anti-sense transcripts in the malarial parasite. Mol Biol Cell 12, 3114-3125 (2001).
70. Munasinghe, A. et al. Serial analysis of gene expression (SAGE) in Plasmodium falciparum: application of the technique to A-T rich genomes. Mol Biochem Parasitol 113, 23-34 (2001).
71. Angelastro, J.M., Klimaschewski, L.P. & Vitolo, O.V. Improved NlaIII digestion of PAGE-purified 102 bp ditags by addition of a single purification step in both the SAGE and microSAGE protocols. Nucleic Acids Res 28, E62 (2000).
72. Li, Y.J. et al. A comparative analysis of the information content in long and short SAGE libraries. BMC Bioinformatics 7, 504 (2006).
73. Lee, S. et al. Correct identification of genes from serial analysis of gene expression tag sequences. Genomics 79, 598-602 (2002).
74. Chen, J.J., Rowley, J.D. & Wang, S.M. Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc Natl Acad Sci USA 97, 349-353 (2000).
75. Lee, S. et al. The pattern of gene expression in human CD15+ myeloid progenitor cells. Proc Natl Acad Sci USA 98, 3340-3345 (2001).
76. Pleasance, E.D., Marra, M.A. & Jones, S.J. Assessment of SAGE in transcript identification. Genome Res 13, 1203-1215 (2003).
77. Clark, T., Lee, S., Ridgway Scott, L. & Wang, S.M. Computational Analysis of Gene Identification with SAGE. J Comput Biol 9, 513-526 (2002).
78. Peters, B.A. et al. Large-scale identification of novel transcripts in the human genome. Genome Res 17, 287-292 (2007).
79. Anisimov, S.V. & Sharov, A.A. Incidence of “quasi-ditags” in catalogs generated by Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics 5, 152 (2004).
80. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8, 186-194 (1998).
81. Richterich, P.
Estimation of errors in “raw” DNA sequences: A validation study. Genome Research 8, 251-259 (1998). 82. Lawrence, C.B. & Solovyev, V.V. Assignment of position-specific error probability to primary DNA sequence data. NucleicAcidsRxs22, 1272-1280 (1994). 83. Beissbarth, T. et al. Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics 20 Suppi 1, 131-139 (2004). 84. Colinge, J. & Feger, G. Detecting the impact of sequencing errors on SAGE data. BioinJbrmatics 17,840-842 (2001). 85. Akmaev, V.R. & Wang, C.J. Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics 20, 1254-1263 (2004). 86. Datson, NA., van der Perk-de Jong, J., van den Berg, M.P., de Kloet, E.R. & Vreugdenhil, E. MicroSAGE: a modified procedure for serial analysis of gene expression in limited amounts of tissue. Nucleic Acids Res 27, 13001307 (1999). 87. Ye, S.Q., Zhang, L.Q., Zheng, F., Virgil, D. & Kwiterovich, P.O. miniSAGE: gene expression profiling using serial analysis of gene expression from I rnicrog total RNA.AnalBiochem 287, 144-152 (2000). 88. Polz, M.F. & Cavanaugh, C.M. Bias in template-to-product ratios in multitemplate PCR. Appi Environ Microbiol64, 3724-3730 (1998). 89. Arezi, B., Xing, W., Sorge, J.A. & Hogrefe, H.H. Amplification efficiency of thermostable DNA polymerases. AnalBiochem 321, 226-235 (2003). 90. Spinella, D.G. et al. Tandem arrayed ligation of expressed sequence tags (TALES’I): a new method for generating global gene expression profiles. NucleicAcid,c Rex 27, e22 (1999). 91. Margulies, E.H., Kardia, S.L. & Innis, J.W. Identification and prevention of a GC content bias in SAGE libraries. NucleicAcids Rex 29, E60-60 (2001). 92. Siddiqui, A.S. et al. Sequence biases in large scale gene expression profiling data. NuckicAcids Rex 34, e83 (2006). 93. Ng, P. et al. Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes. 
NuckicAdds Rex 34, e84 (2006). 94. Shendure,  (2005).  J.  et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728-1732  95. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380  (2005).  35  96. Shendure, j., Mitra, RD., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat Rev Genet5, 335-344 (2004). 97. Kim, J.B. et al. Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy.  Stiern 316, 1481 -1484 (2007). 98. Nielsen, K.L., Høgh, A.L. & Emmersen,  J.  DeepSAGE--digital transcriptomics with high sensitivity, simple  experimental protocol and multiplexing of samples. NuckicAcids Rex 34, e133 (2006). 99. Emrich, S.J., Barbazuk, W.B., Li, L & Schnable, P.S. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Rex 17, 69-73 (2007). 100. Gowda, M. et al. Robust analysis of 5’-transcript ends (5’-RATE): a novel technique for transcriptome analysis and genome annotation. NuckicAcids Rex 34, e126 (2006). 101. Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotethnoll8, 630-634 (2000). 102. Brenner, S. et al. In vitro doning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. Proc NatlAcad Sd USA 97, 1665-1 670 (2000). 103. Reinartz, J. et al. Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Briflngx infunctional,genomics e”p7vteomkc1, 95-104 (2002). 104. Kojima, T. et al. PCR amplification from single DNA molecules on magnetic beads in emulsion: application for high-throughput screening of transcription factor targets. NucleicAdds Rex 33, e150 (2005). 105. Nakano, M., Komatsu, J., Matsuura, S., Takashima, K. & al., e. Single-molecule PCR using water-in-oil emulsion. Journal of BiotechnoloD (2003). 
106. Williams, R. et al. Amplification of complex gene libraries by emulsion PCR. Nat Methods 3, 545-550 (2006). 107. Li, M., Diehl, F., Dressman, D., Vogelstein, B. & Kinzler, K.W. BEAMing up for detection and quantification of rare sequence variants. Nat Methods 3, 95-97 (2006). 108. Utada, A.S. et al. Monodisperse double emulsions generated from a microcapillary device. Science 308, 537-541 (2005). 109. Andreadis, J.D. & Chrisey, L.A. Use of immobilized PCR primers to generate covalently immobilized DNAs for in vitro transcription/translation reactions. NuckicAcids Rex 28, e5 (2000). 110. Diehl, F. et al. BEAMing: single-molecule PCR on microparticles in water-in-oil emulsions. NatMethodx 3, 551-559 (2006). 111. Rehman, F.N. et al. Immobilization of acrylamide-modified oligonucleotides by co-polymerization. Nuckic Acids Rex 27, 649-655 (1999). 112. Mitra, R.D. & Church, G.M. In situ localized amplification and contact replication of many individual DNA molecules. NuckicAcids Rex 27, e34 (1999). 113. Mikkilineni, V. et al. Digital quantitative measurements of gene expression. Biotechnol Bioeng 86, 117-124(2004).  J., Shendure, J., Mitra, RD. & Church, G.M. Single molecule profiling of alternative pre-mRNA splicing. Science 301, 836-838 (2003).  114. Zhu,  115. Mitra, R.D. et al. Digital genotyping and haplotyping with polymerase colonies. Proc NatlAcad Sci USA 100, 59265931 (2003). 116. Adessi, C. et al. Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. NucleicAcids Rex 28, E87 (2000). 117. Lizardi, P.M. et al. Mutation detection and single-molecule counting using isothermal rolling-circle amplification. Nat Genetl9, 225-232 (1998). 118. Del Giallo, M.L. et al. Steric factors controlling the surface hybridization of PCR amplified sequences. Anal Chem 77,6324-6330 (2005). 119. Xu, J. & Craig, S.L. Thermodynamics of DNA hybridization on gold nanoparticles. JAm Chem Soc 127, 1322713231 (2005).  36  120. 
Peterson, A.W., Wo1f L.K. & Georgiadis, R.M. Hybridization of mismatched or partially matched DNA at surfaces.JAm Chem Soc 124, 14601-14607 (2002). 121. Watterson,J.H., Piunno, P.A.E., Wust, C.C. & Krull, U.J. in Langmuir, Vol. 16 4984-49922000). 122. Halperin, A., Buhot, A. & Zhulina, E.B. Brush effects on DNA chips: thermodynamics, kinetics, and design guidelines. BiophjvsJ 89,796-811 (2005). 123. Vainrub, A. & Pettitt, B.M. Surface electrostatic effects in oligonucleotide microarrays: control and optimization of binding thermodynamics. Biopoymeic 68, 265-270 (2003). 124. Nicewarner Pefla, S.R., Raina, S., Goodrich, G.P., Fedoroff NV. & Keating, C.D. Hybridization and enzymatic extension of au nanoparticle-bound oligonucleotides. JAm Chem Soc 124,7314-7323 (2002). 125. Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. NuckicAcids Re: 34, e22 (2006). 126. Aach, J. & Church, G.M. Mathematical models of diffusion-constrained polymerase chain reactions: basis of highthroughput nucleic acid assays and simple self-organizing systems.J TheorBiol228, 31-46 (2004). 127. Hunkeler, D. in Macromolecules, Vol. 242160-21711991). 128. Zhang, K et al. Sequencing genomes from single cells by polymerase cloning. Nat Biotecbnol24, 680-686 (2006). 129. Halperin, A., Buhot, A. & Zhulina, E.B. Hybridization at a surface: the role of spacers in DNA microarrays. Lan,gmuir: the ACSjournal ofsurfaces and colloids 22, 11290-11304 (2006). 130. Goffin, C., Bailly, V. & Verly, W.G. Nicks 3’ or 5’ to A? sites or to mispaired bases, and one-nucleotide gaps can be sealed by T4 DNA ligase. NuckicAcid.r Re: 15, 8755-8771 (1987). 131. Nilsson, S.V. & Magnusson, G. Sealing of gaps in duplex DNA by T4 DNA ligase. Nucleic Acids Re: 10, 1425-1 437 (1982). 132. Hyman, ED. A new method of sequencing DNA. AnalBiochem 174,423-436 (1988). 133. Sanger, F., Nicklen, S. 
& Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc Nail Acad Sd USA 74, 5463-5467 (1977). 134. Li, Z. et ai. A photocleavable fluorescent nucleotide for DNA sequencing and analysis. Proc Nail Acad Sa USA 100,414-419 (2003). 135. Metzker, M.L. et al. Termination of DNA synthesis by novel 3’-modifled-deoxyribonucleoside 5’-triphosphates. NucleicAcids Re: 22,4259-4267 (1994). 136. Ju, J. et al. Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proc NatlAcad Sd USA 103, 19635-19640 (2006). 137. Astatke, M., Grindley, N.D. & Joyce, C.M. How E. coli DNA polymerase I (Klenow fragment) distinguishes between deoxy- and dideoxynucleotides.JMolBiol278, 147-165 (1998). 138. Ramanathan, A., Pape, L. & Schwartz, D.C. High-density polymerase-mediated incorporation of fluorochrome labeled nucleotides. AnalBiochem 337, 1-11 (2005). 139. Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlén, M. & Nyrén, P. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242, 84-89 (1996). 140. Nyren, P., Pettersson, B. & Uhlén, M. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal Biochem 208, 171-175 (1993). 141. Pourmand, N., Elahi, E., Davis, R.W. & Ronaghi, M. Multiplex Pyrosequencing. NucleicAcide Re: 30, e31 (2002). 142. Agaton, C. et al. Gene expression analysis by signature pyrosequencing. Gene 289,31-39 (2002). 143. Ronaghi, M. Pyrosequencing sheds light on DNA sequencing. Genome Re: 11, 3-11 (2001). 144. Nordstrom, T., Gharizadeh, B., Pourmand, N., Nyrén, P. & Ronaghi, M. Method enabling fast partial sequencing of cDNA clones. Anal Biochem 292,266-271(2001). 145. Mashayekhi, F. & Ronaghi, M. Analysis of read length limiting factors in Pyrosequencing chemistry. Anal Biochem 363,275-287 (2007).  37  146. Agah, A. et al. A multi-enzyme model for Pyrosequencing. NuckicAcids Res32, e166 (2004). 147. Wu, J. et al. 
3’-O-modified nucleotides as reversible terminators for pyrosequencing. Proc NatlAcad Sa USA 104, 16462-16467 (2007). 148. Aiwine, J.C., Kemp, D.J. & Stark, G.R. Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc NatlAcad Sd USA 74, 5350-5354 (1977).  149. Southern, E.M. Blotting at 25. Trends Biochem Sd 25, 585-588 (2000). 150. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270,467-470 (1995). 151. Pease, A.C. Ct al. Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc NatlAcad Sd USA 91, 5022-5026 (1994). 152. Lipshutz, R.J., Fodor, S.P., Gingeras, T.R. & Lockhart, D.J. High density synthetic oligonucleotide arrays. Nat Genet2l, 20-24 (1999). 153. Barone, A.D. et al. Photolithographic synthesis of high-density oligonucleotide probe arrays. Nuckosides Nucleotides NucleicAcids 20, 525-531 (2001). 154. Fodor, S.P. et al. Light-directed, spatially addressable parallel chemical synthesis. Science 251,767-773 (1991). 155. Maskos, U. & Southern, E.M. A study of oligonucleotide reassociation using large arrays of oligonucleotides synthesised on a glass support. NuckicAcid.c Res 21,4663-4669 (1993). 156. Guo, Z., Guilfoyle, LA., Thiel, A.J., Wang, B.. & Smith, L.M. Direct fluorescence analysis of genetic polymorphisms by hybridization with oligonucleotide arrays on glass supports. Nucleic Adds Res 22, 5456-5465 (1994). 157. Bertucci, F. et al. Sensitivity issues in DNA array-based expression measurements and performance of nylon microarrays for small samples. Hum Mol Genet 8, 1715-1722 (1999). 158. Southern, E.M. DNA microarrays. History and overview. Methods MolBioll7O, 1-15 (2001).  159. Shalon, D., Smith, S.J. & Brown, P.O. A DNA rnicroarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res 6, 639-645 (1996). 160. 
Southern, E.M. et at Arrays of complementary oligonucleotides for analysing the hybridisation behaviour of nucleic acids. NucleicAdds Res 22, 1368-1373 (1994). 161. Granjeaud, S., Bertucci, F. & Jordan, B.R. Expression profiling: DNA arrays in many guises. Bioessqys 21, 781-790 (1999). 162. Singh-Gasson, S. et at Masidess fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnoll7, 974-978 (1999). 163. Okamoto, T., Suzuki, T. & Yamamoto, N. Microarray fabrication with covalent attachment of DNA using bubble jet technology. Nat Biotechnoll8, 438-441 (2000). 164. Hughes, T.R. et at Expression profiling using microarrays fibricated by an ink-jet oligonucleotide synthesizer. Nat  Biotecbnoll9, 342-347 (2001). 165. Belosludtsev, Y. et at DNA microarrays based on noncovalent oligonucleotide attachment and hybridization in two dimensions. AnalBiochem 292,250-256 (2001). 166. Rogers, Y.H. et at Immobilization of oligonucleotides onto a glass support via disulfide bonds: A method for preparation of DNA microarrays. AnalBiothem 266, 23-30 (1999). 167. Zammatteo, N. et at Comparison between different strategies of covalent attachment of DNA to glass surfaces to build DNA microarrays. AnalBiochem 280, 143-150 (2000). 168. Healey, B.G., Matson, R.S. & Walt, D.R. Fiberoptic DNA sensor array capable of detecting point mutations.Ana/ Biochem 251, 270-279 (1997). 169. Lamture, J.B. et at Direct detection of nucleic acid hybridization on the surface of a charge coupled device. Nucleic Adds Rs 22,2121-2125 (1994).  38  170. Stillman, BA. & Tonkinson,J.L FAST slides: a novel surface for microarrays. Biotethniques29, 630-635 (2000). 171. Guschin, D. et al. Manual manufacturing of oligonucleotide, DNA, and protein microchips. Anal Biochem 250, 203-211 (1997). 172. Beier, M. & Hoheisel, J.D. Versatile derivatisation of solid support media for covalent bonding on DNA microchips. NackicAcids Re: 27, 1970-1977 (1999). 173. Yue, H. et a!. 
An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression. Nucleic Acids Re: 29, E41-41 (2001). 174. Stillman, BA. & Tonkinson, J.L. Expression microarray hybridization kinetics depend on length of the immobilized DNA but are independent of immobilization substrate. Anal Biocbem 295, 149-157 (2001). 175. Brazma, A. et ai. Minimum information about a microarray experiment (MIAME) microarray data. Nature Genetics 29, 365-371 (2001).  -  toward standards for  176. Mah, N. et al. A comparison of oligonucleotide and cDNA-based microarray systems. Piysiol Genomics 16, 361-370 (2004). 177. Woo, Y. et a!. A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of biomolecular techniques :JBT15, 276-284 (2004). 178. Petersen, D. et al. Three microarray platforms: an analysis of their concordance in profiling gene expression. BMC Genomics 6, 63 (2005). 179. de Reyniès, A. et al. Comparison of the latest commercial short and long oligonucleotide microarray technologies. BMC Genomics 7, 51(2006). 180. Lee, M., Xiang, C.C., Trent, J.M. & Bittner, M.L. Performance characteristics of 65-mer oligonucleotide microarrays. Anal Biochem 368, 70-78 (2007). 181. Dandy, D.S., Wu, P. & Grainger, D.W. Array feature size influences nucleic acid surface capture in DNA microarrays. Proc NatlAcad Sd USA (2007). 182. Gong, P., Harbers, G.M. & Grainger, D.W. Multi-technique comparison of immobilized and hybridized oligonucleotide surface density on commercial amine-reactive microarray slides. Anal Chem 78,2342-2351(2006). 183. Peterson, A.W., Heaton, R.J. & Georgiadis, R.M. The effect of surface probe density on DNA hybridization. NucleicAcids Re: 29, 5163-5168 (2001). 184. Shchepinov, M.S., Case-Green, S.C. & Southern, E.M. Steric factors influencing hybridisation of nucleic acids to oligonucleotide arrays. NuckicAcids Re: 25, 1155-1161 (1997). 185. Chan, V., Graves, D.J. & McKenzie, S.E. 
The biophysics of DNA hybridization with immobilized oligonucleotide probes. BiophjsJ 69,2243-2255 (1995). 186. Walsh, M.K., Wang, X. & Weimer, B.C. Optimizing the immobilization of single-stranded DNA onto glass beads. J Biochem Biophjs Methods 47, 221-231 (2001). 187. Li, X., He, Z. & Zhou,J. Selection of optimal oligonucleotide probes for microarrays using multiple criteria, global alignment and parameter estimation. NuckicAcids Rex 33, 6114-6123 (2005). 188. Wernersson, R. & Nielsen, H.B. OligoWiz 2.0--integrating sequence feature annotation into the design of microarray probes. NuckicAcids Re: 33, W61 1-6 15 (2005). 189. Fish, D.J. et al. DNA multiplex hybridization on microarrays and thermodynamic stability in solution: a direct  comparison. NacleicAcids Re: (2007). 190. Mir, K.U. & Southern, E.M. Determining the influence of structure on hybridization using oligonucleotide arrays. Nat Biotechnoll7, 788-792 (1999). 191. Beaucage, S.L. Strategies in the preparation of DNA oligonucleotide arrays for diagnostic applications. Ca,r Med Cheat 8, 1213-1244 (2001). 192. LeProust, E., Zhang, H., Yu, P., Zhou, X. & Gao, X. Characterization of oligodeoxyribonucleotide synthesis on glass plates. NackicAcids Rex 29,2171-2180 (2001).  39  193. Pirrung, M.C., Fallon, L. & McGa]l, G. Proofmg of photolithographic DNA synthesis with 3 ‘,5 dimethoxybenzoinyloxycarbonyl-protected deoxynucleoside phosphoramidites.J Ot Chem 63,241-246 (1998).  ‘-  194.Jobs, M., Fredriksson, S., Brookes, A.J. & Landegren, U. Effect of oligonucleotide truncation on single-nucleotide distinction by solid-phase hybridization. Anal Chem 74, 199-202 (2002). 195. Klebanov, L. & Yakovlev, A. How high is the level of technical noise in microarray data? Biol Direcl2, 9 (2007). 196. Stalteri, MA. & Harrison, A.P. Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips. BMC Bioinformafics 8, 13 (2007). 197. Royce, T.E., Rozowsky, J.S. & Gerstein, M.B. 
Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. NuckicAcids Res 35, e99 (2007). 198. DeRisi, J.L., Iyer, V.R. & Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686 (1997). 199. Nygaard, V. & Hovig, E. Options available for profiling small samples: a review of sample amplification technology when combined with microarray profiling. Nuc/eicAcids Res 34,996-1014 (2006). 200. Barczak, A. et al. Spotted long oligonucleotide arrays for human gene expression analysis. Genome Res 13, 17751785 (2003). 201. Ramakrishnan, R. et al. An assessment of Motorola CodeLink microarray performance for gene expression profiling applications. NucleicAcids Res 30, e30 (2002). 202. Van Gelder, R.N. et al. Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc Natl AcadSci USA 87, 1663-1667 (1990). 203. Petalidlis, L. et al. Global amplification of mRNA by template-switching PCR: linearity and application to microarray analysis. NuckicAdds Res 31, e142 (2003). 204. Jena, P.K., Liu, A.H., Smith, D.S. & Wysocki, L.J. Amplification of genes, single transcripts and cDNA libraries from one cell and direct sequence analysis of amplified products derived from one molecule. J Immunol Methods 190, 199-213 (1996). 205. Iscove, N.N. et al. Representation is faithfully preserved in global cDNA amplified exponentially from subpicogram quantities of mRNA. Nat Biotechnol20, 940-943 (2002). 206. Mitsuhashi, M., Tomozawa, S., Endo, K. & Shinagawa, A. Quantification of mRNA in whole blood by assessing recovery of RNA and efficiency of cDNA synthesis. Clin Chem 52, 634-642 (2006). 207. Raymond, F., Metairon, S., Borner, R., Hofmann, M. & Kussmann, M. Automated target preparation for microarray-based gene expression analysis. Anal Chem 78, 6299-6305 (2006). 208. Naderi, A. et al. 
Expression microarray reproducibility is improved by optimising purification steps in RNA amplification and labelling. BMC Genomics 5,9 (2004). 209. Wilson, C.L., Pepper, S.D., Hey, Y. & Miller, C.J. Amplification protocols introduce systematic but reproducible errors into gene expression studies. BioTechniques 36,498-506 (2004). 210. Lonergan, W., Whistler, T. & Vernon, S.D. Comparison of target labeling methods for use with Affymetrix GeneChips. BMC J3iotechnol7, 24 (2007). 211. Ma, C. et al. In vitro transcription amplification and labeling methods contribute to the variability of gene expression profiling with DNA microarrays. TheJournal ofmolecular diagnostics :JMD 8, 183-192 (2006). 212. Hekstra, D., Taussig, A.R., Magnasco, M. & Naef F. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. NuckicAcids Res3l, 1962-1968 (2003). 213. Yuen, T., Wurmbach, E., Pfeffer, R.L., Ebersole, B.J. & Sealfon, S.C. Accuracy and calibration of commercial oligonucleotide and custom eDNA microarrays. NucleicAcids Res 30, e48 (2002). 214. Dudley, A.M., Aach, J., Steffen, M.A. & Church, G.M. Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc NatlAcad Sd USA 99, 7554-7559 (2002). 215. Chen, J., Hsueh, H.M., Delongchamp, R., Lin, C.J. & Tsai, CA. Reproducibility of microarray data: a further analysis of microarray quality control (MQC)data. BMCBioinformatics 8,412 (2007).  40  216. Zakharkin, S.O. et al. Sources of variation in Affymetrix niicroarray experiments. BMC Bioinformatics 6,214 (2005). 217. Dumur, C.I. et al. Evaluation of quality-control criteria for microarray gene expression analysis. C/in Chem 50, 1994-2002 (2004). 218. Piper, M.D. et al. Reproducibility of oligonucleotide microarray transcriptome analyses. An interlaboratory comparison using chemostat cultures of Saccharomyces cerevisiae.J Biol Chem 277,37001-37008 (2002). 219. 
Peeva, V.K., Lynch, J.L., Desilva, C.J. & Swanson, N.R. Evaluation of automated and conventional microarray hybridization a question of data quality and best practice? BiotechnolApp/Biothem (2007). -  220. Bhanot, G., Louzoun, Y., Zhu, J. & DeLisi, C. The importance of thermodynamic equilibrium for high throughput gene expression arrays. Biop4ysJ 84, 124-135 (2003). 221. Degenkolbe, T. et al. A quality-controlled microarray method for gene expression pro filing. Anal Biochem 346, 217224 (2005). 222. Han, E.S. et al. Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high-density oligonucleotide array experiments.J GerontolA BiolSciMedSci59, 306-315 (2004). 223. Mo, X.Y. et aL The effects of different sample labelling methods on signal intensities of a 60-mer diagnostic microarray. J VirolMethods 134, 36-40 (2006). 224. Lyng, H. et al. Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction. BMC Genomics 5, 10 (2004). 225. Shi, L. et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics 6 Suppi 2, SlI (2005). 226. Rouse, R.J., Espinoza, C.R., Niedner, R.H. & Hardiman, G. Development of a microarray assay that measures hybridization stoichiometry in moles. BioTechniques 36,464-470 (2004). 227. Canales, R.D. et ai. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol24, 1115-1122 (2006). 228. Shippy, R. et al. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nature Biotechnology 24, 1123-1131(2006). 229. Consortium, M. et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechaol24, 1151-1161 (2006). 230. Larkin, J.E., Frank, B.C., Gavras, H., Sultana, R. & Quackenbush, microarray platforms. Nat Methods 2, 337-344 (2005).  J.  
Independence and reproducibility across  231. Shi, L. et al. Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMCBioinformatics6 Suppi 2, S12 (2005). 232. Wang, H., He, X., Band, M., Wilson, C. & Liu, L. A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics 6, 71(2005). 233. Breslauer, K.J., Frank, R., Blocker, H. & Marky, L.A. Predicting DNA duplex stability from the base sequence. ProcNatlAcadSa USA 83,3746-3750 (1986). 234. SantaLucia, J. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. PtvcNatlAcadSci USA 95,1460-1465 (1998). 235. Bruun, G.M., Wernersson, R., Juncker, A.S., Willenbrock, H. & Nielsen, H.B. Improving comparability between microarray probe signals by thermodynamic intensity correction. NuckicAcids Res 35, e48 (2007). 236. Tulpan, D. et al. Thermodynamically based DNA strand design. NucleicAcids Res 33,4951-4964 (2005). 237. Rouillard, J.M., Zuker, M. & Gulari, E. OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. NucleicAdds Res 31, 3057-3062 (2003). 238. Matveeva, O.V. et al. Thermodynamic calculations and statistical correlations for oligo-probes design. Nuckic Acids Res3l, 4211-4217 (2003). 239. Halperin, A., Buhot, A. & Zhulina, E.B. Sensitivity, specificity, and the hybridization isotherms of DNA chips. BiophysJ86, 718-730 (2004).  41  240. Holbrook, J.A., Capp, M.W., Saecker, R.M. & Record, M.T. Enthalpy and heat capacity changes for formation of an oligomeric DNA duplex: interpretation in terms of coupled processes of formation and association of singlestranded heices. Biothemishy38, 8409-8422 (1999). 241. Vesnaver, G. & Breslauer, K.J. The contribution of DNA single-stranded order to the thermodynamics of duplex formation. Prr,cNaiIAcadSci USA 88,3569-3573 (1991). 242. Owczarzy, R., Dunietz, I., Behike, MA., Klotz, I.M. 
& Walder, J.A. Thermodynamic treatment of oligonucleotide duplex-simplex equilibria. P,vcNa.elAcadSci USA 100, 14840-14845 (2003). 243. Tikhomirova, A., Taulier, N. & Chalildan, T.V. Energetics of nucleic acid stabili Chem Soc 126, 16387-16394 (2004).  .  the effect of DeltaCP. JAm  244. Wu, P., Nakano, S. & Sugimoto, N. Temperature dependence of thermodynamic properties for DNA/DNA and RNA/DNA duplex formation. EurJ Biochem 269,2821-2830 (2002). 245. von Ahsen, N., Wittwer, C.T. & Schütz, E. Oligonucleotide melting temperatures under PCR conditions: nearestneighbor corrections for Mg(2+), deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. C/in Chem 47, 1956-1961 (2001). 246. Dimitrov, R.A. & Zuker, M. Prediction of hybridization and melting for double-stranded nucleic acids. Biopbjs J 87,215-226 (2004). 247. Nakano, S., Fujimoto, M., Hara, H. & Sugimoto, N. Nucleic acid duplex stability: influence of base composition on cation effects. NuckicAcids Rex 27, 2957-2965 (1999). 248. Owczarzy, R. et al. Predicting sequence-dependent melting stability of short duplex DNA oligomers. Biopovmers 44,217-239 (1997). 249. Chan, V., Graves, D.J., Fortina, P. & McKenzie, S.E. in Langmuir, Vol. 13320-3291997). 250. Weckx, S., Canon, E., Vuyst, LI). & Hummelen, P.V. Thermodynamic Behavior of Short Oligonucleotides in Microarray Hybridizations Can Be Described Using Gibbs Free Energy in a Nearest-Neighbor Model. J Phjs Chem  B 111, 13583-13590 (2007).  251. Sorokin, N.y. et al. Discrimination between perfect and mismatched duplexes with oligonudeotide gel microchips: role of thermodynamic and kinetic effects during hybnidization.J BiomolSiruci Djn 22, 725-734 (2005). 252. Gao, Y., Wolf L.K. & Georgiadis, R.M. Secondary structure effects on DNA hybridization kinetics: a solution versus surface comparison. NuckicAcids Rex 34,3370-3377 (2006). 253. Chen, C., Wang, W., Wang, Z., Wei, F. & Zhao, X.S. 
Influence of secondary structure on kinetics and reaction mechanism of DNA hybridization. NuckicAcids Rex 35,2875-2884 (2007). 254. Wu, C., Carta, R. & Zhang, L. Sequence dependence of cross-hybridization on short oligo microarrays. Nuckic Acids Rex 33, e84 (2005). 255. Lee, M.L., Kuo, F.C., Whitmore, G.A. & Sklar,J. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Nail Acrid Sd U S A 97, 9834-9839 (2000). 256. Li, C. & Wong, W.H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Nat/Acrid Sd USA 98,31-36 (2001). 257. Sawada, A., Mizufune, S., Kaji, N., Tokeshi, M. & Baba, Y. Evaluation of amplified cRNA targets for oligonucleotide microarrays. Ana’ytical and bioana/ytical chemisirj 387,2645-2654 (2007). 258. Sakai, K., Higuchi, H., Matsubara, K. & Kato, K. Microarray hybridization with fractionated cDNA: enhanced identification of differentially expressed genes. Anal Biochem 287,32-37 (2000). 259. Park, P.J. et al. Current issues for DNA microarrays: platform comparison, double linear amplification, and universal RNA reference.J Biotechnolll2, 225-245 (2004). 260. Bishop,  J., Wilson, C., Chagovetz, A.M. & Blair, S. Competitive displacement of DNA during surface hybridization. BiophyJ 92, Li 0-12 (2007).  261. Home, M.T., Fish, D.J. & Benight, A.S. Statistical thermodynamics and kinetics of DNA multiplex hybridization  reactions. Biop4ysJ 91,4133-4153 (2006).  42  262. Bishop,  J., Blair,  S. & Chagovetz, A.M. A competitive kinetic model of nucleic acid surface hybridization in the  presence of point mutants. BiophycJ 90, 831-840 (2006).  -  263. Zhang, Y., Hammer, D.A. & Graves, D.J. Competitive hybridization kinetics reveals unexpected behavior patterns. BiopIy:J 89,2950-2959 (2005). 264. Weckx, S., Canon, E., Vuyst, LD. & Hummelen, P.V. 
CHAPTER 2
Increasing the efficiency of SAGE adaptor ligation by directed ligation chemistry*

* A version of this chapter is published in Nucleic Acids Research [Reference: So, A. P., Turner, R. B. F., and Haynes, C. A. 2004. Increasing the efficiency of SAGE adaptor ligation by directed ligation chemistry. Nucleic Acids Res. 32, e96].

2.1. INTRODUCTION

The development of technologies aimed towards monitoring gene expression on a global scale has revolutionized the study of biology from a systems perspective [1]. This perspective embraces the idea that the functional significance of gene products is related not only to their quantity in the cell, but also to how they interact and are strung together to form genetic and biochemical networks. Numerous technologies have been developed over the past decade, with the greatest attention being given to approaches based either on high-throughput sequencing or on massively parallel analysis of the transcriptome (i.e. the set of all expressed genes weighted by transcript abundance) using array hybridization technology.
The sequencing approach to monitoring gene expression on a global scale typically involves the creation of short representations of each transcript, such as expressed sequence tags (ESTs) or short sequence tags (SSTs) generated through Serial Analysis of Gene Expression (SAGE) technology [2,3]. DNA microarray technology attempts to resolve the transcriptome by selectively binding and quantifying each transcript at one or more complementary registers of a high-density array. These technologies are now routinely used to identify families of genes, in many cases incompletely characterized or with previously unidentified functionality, which act in concert to define a given cell fate or outcome, and have been used to identify upstream sequence elements involved in directing the expression of these gene families.

Although microarray technology offers an increasingly reliable and sensitive analysis of gene expression, its use depends on a priori knowledge of the genes that are expressed under a given cell state, currently restricting application of the technology to the identification and quantification of these subsets of genes. SAGE technology [2], in contrast, directly samples the entire transcriptome of an organism under a given cellular state through the generation of SSTs of 9-22 base pairs in length. As a 9-10 mer oligonucleotide can theoretically identify 4^9 (262,144) or 4^10 (1,048,576) unique sequences, the entire transcript population of any organism can potentially be represented [2]. First, a cDNA copy of the mRNA population is digested with a restriction endonuclease (RE; e.g. NlaIII) and the 3'-most restriction fragments of the digested population are purified. A short oligonucleotide adaptor that contains a unique primer sequence and a recognition sequence for a Type IIS RE is then ligated to the anchored cDNA.
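The tag-space figures quoted above follow directly from the four-letter DNA alphabet. The short sketch below simply reproduces that arithmetic; the function name `tag_space` is illustrative and does not come from the SAGE protocol.

```python
def tag_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of a given length over the DNA alphabet (A, C, G, T)."""
    return alphabet_size ** length


if __name__ == "__main__":
    # Tag lengths discussed in the text: 9-mers and 10-mers.
    for n in (9, 10):
        print(f"{n}-mer tags: {tag_space(n):,} possible sequences")
```

This is only an upper bound on the number of distinguishable tags; it says nothing about how evenly real transcripts populate that space.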
As Type IIS REs are capable of cleaving DNA outside their recognition sequence, subsequent cleavage with a Type IIS RE (e.g. BsmFI) releases SSTs of equal length [2,8]. A library of these SSTs is created through subsequent dimerization, amplification via the polymerase chain reaction (PCR), concatemerization, and insertion into an appropriate vector. Finally, a representative population of clones is serially sequenced to identify and tally each SST. As each SST is derived from a defined position within a particular cDNA, a given tag can be cross-referenced through organism- and/or tissue-specific genome databases to a particular gene to give a profile of global gene expression. An important advantage compared to microarray technology is that unreferenced SSTs that arise out of the SAGE analysis can be used to identify previously unknown genes and aid in the completion of genome annotations for the organism under study [2,3,9-17].

The ability of SAGE to provide an accurate measure of gene expression profiles depends on the extent to which the distribution of transcript abundances inferred from the sequenced set of amplified SSTs reflects the real distribution of the abundances of the associated transcripts in the original mRNA population. This fidelity depends on the accuracy of the sequencing method used to identify the SSTs [18] and on the depth of sequencing applied to the SAGE library [19,20]. Less appreciated, however, is the extent to which losses and processing artifacts in each of the 12 enzymatic and 10 purification steps (or 7 in the microSAGE protocol) used to convert the starting mRNA sample into a SAGE library (see Table 2.1) can skew the sequencing results away from the real distribution. To illustrate, if 5 µg of mRNA (~5×10^12 molecules of average length 2 kb) are used as starting material for the SAGE protocol, a 50% average yield in each processing step would result in an overall yield of 0.000024% (i.e. 0.5^22), such that the final sample (~1.2×10^6 molecules) would represent a minute fraction of the original. Such an overall yield would result in a form of sampling bias in SAGE analysis equivalent to the bias introduced by an insufficient depth of sequencing [19,20]. Although inclusion of PCR steps in the SAGE protocol is intended to recover these losses, amplification after processing can only recover those ditags derived from targets that have survived the numerous enzymatic and purification steps. Clearly then, efforts to maximize yields and minimize artifacts introduced in each processing step are required to ensure the fidelity of SAGE.

While a number of recent studies have resulted in the improvement of some of the purification steps in the SAGE protocol [13,15,21-28], little attention has been given to addressing the efficiencies of the enzymatic steps of the protocol. Given that the ability to generate a SAGE tag from a transcript is determined by the successful ligation of the SAGE adaptor to the anchored 3'-end cDNA population, the yield in this step is likely to contribute significantly to the overall fidelity of the SAGE protocol. Here we demonstrate, using adaptors 1A/B of the current SAGE protocol (version 1e; http://www.sagenet.org/protocol/index.htm), that the yield of this ligation step is generally low due to a strong propensity of the anchored 3'-end cDNA target to self-ligate. We then show that the addition of PEG-8000, traditionally used to favour the formation of linear ligation products [29-31], increases the yield of the desired adaptor-target heterodimer, but is unable to fully eliminate the formation of unwanted homodimer. Finally, we show that by using an alternative method of ligation, which we call "directed ligation", a significant improvement in the SAGE protocol is achieved, increasing the efficiency of adaptor ligation and eliminating the irreversible formation of unwanted ligation products.

2.2. MATERIALS AND METHODS

2.2.1. Enzymes and constructs

A 956 bp clone homologous to rat liver α transcription factor (GenBank ID: X65948) from rat brain with a polyadenylated 3'-end (58 bp), kindly provided by Dr. Terry Snutch (Biotechnology Laboratory) in pBluescript SK (Stratagene), was propagated in E. coli DH5α (Invitrogen). Plasmids were isolated using the boiling miniprep method [32] from 3 ml Terrific broth (Sigma Aldrich) cultures in the presence of 100 µg/ml ampicillin (Sigma Aldrich) when required. Plasmids (20 µg each) were linearized with EcoRV and further purified using the Qiagen Qiaquick purification kit according to the manufacturer's protocol (Qiagen). Orientation and identification of the insert were verified by sequencing 100 ng of the purified plasmid at the Nucleic Acids and Peptide Synthesis Unit, University of British Columbia. In vitro RNA transcripts in the sense orientation were generated from 1 µg of linearized plasmid using the T3 MEGAscript kit (Ambion) following the manufacturer's protocol and stored at -70 °C in diethyl pyrocarbonate (DEPC) treated H2O (Ambion). All reactions in this study were incubated using an Eppendorf Mixmaster programmed for 3 sec mixes at 1400 rpm every 15 min.

2.2.2. Preparation of 3'-end anchored cDNA

5 µg (~16 pmol) or 0.1 µg (~0.3 pmol) of in vitro transcribed RNA was processed according to the regular SAGE protocol or the microSAGE protocol version 1e, respectively. Alternatively, in vitro transcribed RNA (0.6 µg or ~1.9 pmol) was annealed to 3.0 mg oligo(dT)25 Dynabeads (Dynal Biotech) in the presence of 600 U of SUPERaseIn (Ambion).
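The molar amounts quoted for the in vitro transcript can be checked with a simple mass-to-mole conversion. The sketch below is illustrative only: it assumes an average mass of about 330 g/mol per nucleotide (a common round figure, not stated in the text) and takes the transcript length to be the 956 nt of the cloned insert, ignoring any vector-derived sequence.

```python
AVG_NT_MASS = 330.0  # g/mol per nucleotide; assumed round figure, not from the text


def rna_pmol(mass_ug: float, length_nt: int, nt_mass: float = AVG_NT_MASS) -> float:
    """Convert a mass of single-stranded RNA (in micrograms) to picomoles."""
    grams = mass_ug * 1e-6
    moles = grams / (length_nt * nt_mass)
    return moles * 1e12


if __name__ == "__main__":
    # Masses of in vitro transcript used in section 2.2.2 (956 nt assumed).
    for mass in (5.0, 0.1, 0.6):
        print(f"{mass} ug of a 956 nt transcript is about {rna_pmol(mass, 956):.2f} pmol")
```

Under these assumptions the three masses come out close to the approximately 16, 0.3 and 1.9 pmol given above. The same function with a length of 2000 nt gives values near the 7.8 pmol and 0.16 pmol quoted elsewhere in the chapter for 5 µg and 100 ng of mRNA of average length 2 kb.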
Annealed RNA was then processed according to the microSAGE protocol version 1e using components from a cDNA synthesis kit (Invitrogen) and scaled accordingly to a final volume of 600 µl, with the following exception: after first strand synthesis, the reaction was cooled on ice, magnetized, and 520 µl of the first strand reaction was replaced with 520 µl of a pre-chilled mix of second strand synthesis reaction components and incubated for 16 hrs at 16 °C. Anchored second strand products were then blunt ended, washed, and digested with NlaIII (New England Biolabs) as described. The resulting anchored 3'-end cDNAs (0.6 pmol per mg Dynabeads) were stored at -20 °C until ready for use.

2.2.3. Adaptors

Oligonucleotides corresponding to the adaptors and primers used in the SAGE and microSAGE protocols version 1e were obtained gel- or HPLC-purified (Qiagen) and are shown in Table 2.2. Stock solutions (5 µM) of the following adaptors were prepared in 1x NEB4 buffer (New England Biolabs) by mass dilutions: adaptor 1 (1A/1Bphos); adaptor 1m5C (1Am5C/1Bphos); adaptor 1m6A (1Am6A/1Bphos); adaptor 2 (2A/2Bphos); adaptor 2m5C (2Am5C/2Bphos); adaptor 2m6A (2Am6A/2Bphos). Adaptors were annealed according to the annealing schedule described in the current SAGE protocols.

2.2.4. Standard ligation protocol used in SAGE

Ligation reactions using adaptor 1 at a final concentration of 80 nM were performed according to microSAGE protocol version 1e. Additional ligation reactions, scaled to a final volume of 10 µl (~0.075 pmol cDNA per 125 µg Dynabeads) and containing varying amounts of adaptor 1 (0.038 pmol to 38 pmol), or supplemented with PEG-8000 (15% w/v final) using a final adaptor concentration of 1 µM, were also performed. All reaction samples were incubated for 2 hrs at 16 °C or 25 °C.

2.2.5. Directed ligation

2.2.5.1. Titration of T4 DNA ligase activity with NlaIII

Stock ligase mixes containing T4 DNA ligase (5 Weiss U/µl; Fermentas) were prepared with various amounts of NlaIII (120 U/µl; New England Biolabs) in a final buffer composition of 15 mM Tris-HCl (pH 7.5), 0.1 mM EDTA, 1 mM DTT, 200 mM KCl, 0.5 mg/ml bovine serum albumin (BSA), and 50% glycerol, and stored at -70 °C. Oligo(dT) Dynabeads (125 µg) with anchored 3'-end cDNA (0.075 pmol) were pre-incubated with adaptor 1, adaptor 1m5C, or adaptor 1m6A (1 µM final) for 5 min at 37 °C in 1x NEB4 buffer supplemented with 1 mM ATP and 100 ng/µl BSA in a volume of 9 µl. Reactions were initiated by adding 1 µl from one of the stock enzyme mixes described above, overlaid with mineral oil, and incubated for 2 hrs at 37 °C.

2.2.5.2. Directed ligation protocol for SAGE

A stock enzyme mix containing NlaIII (25 U/µl final) and T4 DNA ligase (2.5 Weiss U/µl final) was prepared as described above. Oligo(dT) Dynabeads (125 µg) with anchored 3'-end cDNA (0.075 pmol) were pre-incubated with 2.5 pmol of adaptor 1m6A for 5 min at 37 °C in 1x NEB4 buffer supplemented with 100 ng/µl BSA and 1 mM ATP. After initiation with 1 µl of the stock enzyme mix, reactions were spiked every 15 min with 2.5 pmol of adaptor 1m6A for a total incubation time of 1 hr and a total addition of 10 pmol adaptor.

2.2.6. Analysis of anchored ligation products

Reactions were heat inactivated for 20 min at 65 °C in 200 µl of 1x NEB4 supplemented with 100 ng/µl BSA, followed by two washes with the same buffer. Anchored ligation products were then cleaved off the Dynabead support with 10 U DraI (New England Biolabs) in 30 µl of 1x NEB4 supplemented with BSA. After incubation for 1 hr at 37 °C, products were resolved via polyacrylamide gel electrophoresis (6% PAGE; Owl Scientific) for 3 hrs at 12.5 V/cm.
SYBR Gold (Molecular Probes) stained gels were visualized using a CCD-based gel documentation system (Alpha Innotech) with a SYBR-green filter set (Molecular Probes) at a sub-saturating aperture setting and recorded as TIFF files. When required, densitometric analysis was performed using publicly available software (tnimage-3.3.7a; http://brneurosci.org/tnimage.html).

2.2.7. Preparation and PCR amplification of ditags

Adaptors 1 & 2 or adaptors 1m6A & 2m6A were ligated to anchored 3'-end cDNA derived from 100 ng of in vitro transcripts as described above using the standard microSAGE protocol version 1e or the directed ligation protocol. After ligation, the anchored products were processed according to microSAGE protocol version 1e to form ditags. Ditag ligation mixtures (3 µl) were brought up to a final volume of 20 µl with LoTE buffer (2 mM Tris-HCl, 0.2 mM EDTA, pH 8.0). 1 µl aliquots of 1:20 and 1:200 dilutions of the ligation mixture in LoTE were then used as a template for PCR amplification with Platinum Pfx thermophilic DNA polymerase (Invitrogen) supplemented with 0.5X PCR enhancer solution and 0.1 mM MgSO4 according to the manufacturer's protocol in a final volume of 50 µl. PCR amplification was performed in the presence or absence of template on an Eppendorf Mastercycler (Eppendorf) using primer 1 and primer 2 as described in the microSAGE protocol. After activation for 1 min at 95 °C, 26 cycles were performed according to the following schedule: 95 °C, 30 sec; 55 °C, 1 min; 72 °C, 1 min. Upon completion, a 10 µl aliquot was resolved via 6% PAGE for 1 hr at 12.5 V/cm and visualized as described above.

2.3. RESULTS AND DISCUSSION

The ability of SAGE to provide a truly quantitative picture of gene expression relies on the efficiency of each step required to generate the library of SSTs from the harvested mRNA starting material.
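To make the dependence on per-step efficiency concrete, the sketch below reproduces the arithmetic used in the introduction: 12 enzymatic plus 10 purification steps, a uniform yield per step, and roughly 5×10^12 starting mRNA molecules. The function names and the alternative per-step yields of 80% and 90% are illustrative assumptions, not measured values.

```python
def overall_yield(step_yield: float, n_steps: int) -> float:
    """Fraction of starting material left after n_steps, each with the same yield."""
    return step_yield ** n_steps


def surviving_molecules(start_molecules: float, step_yield: float, n_steps: int) -> float:
    """Expected number of molecules that survive every processing step."""
    return start_molecules * overall_yield(step_yield, n_steps)


if __name__ == "__main__":
    n_steps = 12 + 10          # enzymatic + purification steps of the SAGE protocol
    start = 5e12               # approximate molecules in 5 ug of 2 kb mRNA (from the text)
    for step_yield in (0.5, 0.8, 0.9):   # 0.5 is the case in the text; others are hypothetical
        frac = overall_yield(step_yield, n_steps)
        left = surviving_molecules(start, step_yield, n_steps)
        print(f"per-step yield {step_yield:.0%}: overall {frac * 100:.6f}%, about {left:.2e} molecules")
```

Because the overall yield is a product over all steps, even modest gains in a single low-yield step, such as adaptor ligation, translate into a proportional gain in the number of transcripts represented in the final library.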
Currently, two general approaches to generating SAGE libraries are used (see Table 2.1), each customized to the amount of starting material available to the researcher. The original SAGE protocol described by Velculescu et al. [2] uses 5 µg of mRNA (7.8 pmol mRNA of average length 2 kb) as starting material. After conversion to biotinylated cDNA, half of this sample is digested with the RE NlaIII, and the 3'-end fragments are affinity purified via streptavidin-linked Dynabeads (2 mg) to generate anchored 3'-end cDNA (3.9 pmol per mg Dynabeads). In contrast, the microSAGE protocol, a modification of the SADE (SAGE Analysis for Down-sized Extracts) protocol of Virlon et al. and commercially available as the I-SAGE™ kit from Invitrogen, is designed to process the RNA from 5×10^4 to 2×10^6 cells or up to 100 ng (~0.16 pmol mRNA of average length 2 kb) of starting mRNA. Oligo(dT)25 Dynabeads (0.5 mg) are used as an affinity support to directly harvest polyadenylated RNA from the sample. The anchored oligo(dT) on the support is used to prime cDNA synthesis, and the resulting cDNA is then digested with NlaIII to generate anchored 3'-end cDNA (0.31 pmol per mg Dynabeads).

When our in vitro RNA was used as the starting material, we found that the amount of anchored 3'-end cDNA recovered using the original SAGE protocol was similar to that obtained through the microSAGE protocol despite using 25-fold more starting material (data not shown). This observation is consistent with work by Virlon et al., where 200-fold less anchored 3'-end cDNA was recovered from micro-dissected renal tubules using the SAGE protocol compared to that recovered from their SADE protocol, which used half the amount of starting material and employed Sau3AI as the anchoring enzyme. While this loss of material is largely due to the presence of four additional extraction and precipitation steps in the original SAGE protocol prior to adaptor ligation (Table 2.1), additional losses may arise from the presence of excess biotinylated oligo(dT) primer used to prime first strand synthesis. Any such primer that survives the extraction and precipitation steps will compete with the biotinylated cDNA for binding to the streptavidin support. This primer contamination is likely small, however, as batch purification of biotinylated cDNAs using Qiaex II silica beads did not improve yields significantly. Following synthesis of the anchored 3'-end cDNA library on either streptavidin-linked Dynabeads (i.e. SAGE) or oligo(dT) Dynabeads (i.e. microSAGE), further processing towards generation of the SAGE library is essentially the same under the two protocols (Table 2.1).

2.3.1. Self-ligation of the anchored 3'-end cDNA competes with ligation of the adaptor

Under standard microSAGE reaction conditions, we observe that the ligation of SAGE adaptors to the cohesive end of the anchored 3'-end cDNA consistently produces two products. In the presence of T4 DNA ligase and the standard 80 nM concentration of adaptor 1, a relatively small fraction (less than 5%) of the anchored 3'-end cDNA was found to ligate to adaptor 1 to form the desired adaptor-target cDNA hetero-ligation product (Figure 2.1). The bulk of the anchored cDNA underwent an undesired reaction to form a high molecular weight product (lane 3).
Comparisons with the control reaction in which no T4 DNA ligase was added (lane 2), and with a ligation reaction performed in the presence of NlaIII, indicated that this high molecular weight product is a homodimer of the anchored 3'-end cDNA. Identical experiments were also carried out on streptavidin-anchored 3'-end cDNA samples prepared by the original SAGE protocol and gave essentially the same results. Lower loading densities of in vitro RNA onto oligo(dT) Dynabeads or of biotinylated cDNA onto streptavidin-linked Dynabeads only marginally inhibited formation of the homodimer, suggesting that homodimer formation depends both on the distance of separation between anchored 3'-end cDNA molecules on the surface of a given Dynabead (inter-molecular) and on that between molecules anchored on adjacent Dynabeads (intra-molecular). Formation of the homodimer was also observed when other in vitro RNA transcripts were used to generate anchored 3'-end cDNA targets ranging from 132 to 355 bp in length. Thus, under the ligation conditions described, most of the desired hetero-ligation product is lost in favour of self-ligation of two anchored cDNA fragments.

The yield of the desired hetero-ligation product was found to depend on the amount of SAGE adaptor introduced to the ligation mix, and increased with increasing adaptor concentration (Figure 2.2). However, even at very high concentrations of added adaptor (500:1, lane 10), formation of the unwanted cDNA self-ligation product remained significant, resulting in a loss of approximately half of the starting cDNA material. Under homogeneous reaction conditions (i.e. all reactants present in the solution phase), mass action should favour the formation of two products at these high concentrations of added adaptor: the desired adaptor-cDNA heterodimer and the adaptor-adaptor homodimer. However, tethering of the target cDNA to the polystyrene surface of Dynabeads creates a heterogeneous reaction environment.
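As a side calculation, the adaptor:cDNA molar ratios referred to here can be recovered from the amounts given in section 2.2.4 (0.075 pmol anchored cDNA in a 10 µl reaction, with 0.038 to 38 pmol of adaptor). The sketch below is a minimal illustration; the helper names are hypothetical, and the 1 µM case assumes the same 10 µl reaction volume.

```python
def adaptor_pmol(conc_nM: float, volume_ul: float) -> float:
    """Amount of adaptor in pmol; nM multiplied by microlitres gives fmol."""
    return conc_nM * volume_ul / 1000.0


def molar_ratio(adaptor_amount_pmol: float, cdna_pmol: float = 0.075) -> float:
    """Adaptor:cDNA molar ratio for a given amount of anchored cDNA."""
    return adaptor_amount_pmol / cdna_pmol


if __name__ == "__main__":
    # Lowest and highest adaptor amounts in the titration of section 2.2.4.
    for amount in (0.038, 38.0):
        print(f"{amount} pmol adaptor -> about {molar_ratio(amount):.1f}:1")
    # 1 uM adaptor in a 10 ul reaction (volume assumed to match the scaled reactions).
    one_uM = adaptor_pmol(1000.0, 10.0)
    print(f"1 uM in 10 ul = {one_uM:.1f} pmol -> about {molar_ratio(one_uM):.0f}:1")
```

The upper end of the titration works out to roughly 500:1, in line with the ratio quoted for lane 10 of Figure 2.2.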
The distribution of ligation products may therefore be controlled by mass transfer effects that limit the concentration of adaptor in the solid-liquid interfacial region where the target cDNA is anchored and the reaction must take place. Consequently, adaptor-adaptor and cDNA-cDNA homodimers are produced preferentially, even in the presence of a large excess of the added adaptor. Improving the yield of adaptor-cDNA heterodimer by increasing the adaptor concentration in the reaction mix is impractical for large-scale SAGE projects. In addition to the high associated costs of preparing the adaptor, excess adaptor may have deleterious effects on subsequent steps in SAGE. High concentrations of adaptor promote the formation of a large number of adaptor dimers, which can interfere with subsequent PCR amplification steps or necessitate excessive washing of the anchored ligation product to remove unreacted adaptor and adaptor dimers. For this reason, some groups  “  have attempted to limit adaptor-dimer  contamination of the clitag PCR reaction mixture by reducing the concentration of adaptor used in the adaptor ligation step. However, our results show that lowering the added SAGE adaptor concentration below the standard concentration of 80 nM (i.e. lanes 4 & 5 of figure 2.2) results in a significant reduction in the already low yield of the desired adaptor-cDNA hetero-ligation product. As the overall fidelity of SAGE to provide an accurate read of the distribution of transcript abundances will be affected by this sampling loss, there exists a need to develop cheaper and more effective methods to increase the yield of the desired hetero-ligation product by reducing or, better yet, eliminating the formation of self-ligation products. 2.3.2. 
Addition of macromolecular crowding agents increases the yield of adaptor modified anchored 3’-end cDNA Other changes in reaction conditions that alter the distribution of ligation products were therefore explored to improve the yield of the desired hetero-ligation product. For example, lowering the reaction temperature can be used to slow the ligation reaction to a point where the rate of mass transfer of the adaptor to the solid-liquid interface no longer limits the formation of  55  the hetero-ligation product. In this case however, a significantly increased incubation time is required, extending the already lengthy process involved in producing a SAGE library. Varying the rate of mixing during the reaction to decrease the hydrodynamic boundary layer and increase the surface concentration of the free adaptor was explored, but led to only a marginal improvement in the yield of the hetero-ligation product. Adding co-solutes that act as macromolecular crowding agents (i.e. compaction agents) has been shown to dramatically affect the thermodynamics of reaction mixtures, generally favouring the formation of products with compact conformations and for some proteins, linear rod-like aggregates  35,36  For ligation reactions, addition of 15% (w/v) of the neutral polymer polyethylene  glycol (PEG) has been shown to enhance by up to 100-fold the formation of intermolecular ligation products (i.e. linear concatamers) during the ligation of cohesive or blunt-ended DNA  fragments in the solution phase 30 31,37 The influence of increased concentrations of PEG-8000 on the formation of the desired hetero-ligation product was therefore examined (Figure 2.3). At the standard reaction temperature of 16 °C and a fixed adaptor concentration of 1 M (i.e. 
>10-fold higher than typically used in microSAGE), increasing the PEG-8000 concentration to 15% (w/v) (lane 3) significantly improved the yield over that obtained at 5% PEG-8000 (lane 2), such that the desired product represents slightly over half the total reaction product. This increase in hetero-ligation product yield in the presence of added PEG was also observed when the standard adaptor concentration (i.e. 80 nM) or a 10-fold higher concentration of anchored cDNA was used. Given that the activity of T4 DNA ligase is higher at 25 °C than at the standard 16 °C reaction temperature used in microSAGE, we examined whether increasing the rate of the ligation reaction by increasing the incubation temperature could further favour formation of the desired intermolecular ligation product. Moreover, as the effects of macromolecular crowding by added PEG are known to be temperature dependent, we also examined the effect of reaction temperature on the distribution of ligation products in the absence and presence of supplemental concentrations of PEG-8000. We observed that the desired product yield was not improved by increasing the reaction temperature to 25 °C (lanes 4 & 5) compared to the standard temperature of 16 °C (lanes 2 & 3), indicating that the rates of formation of the two reaction products show similar temperature dependencies. This is in contrast to some reports on the influence of PEG on ligations in solution, where cohesive-end ligations were shown to be enhanced by an increase in temperature 31,41,42,44. We conclude that PEG-8000 added in moderate concentrations to ligation reactions performed at 16 °C can improve hetero-ligation product yields. However, a large excess (50:1 or greater) of added adaptor is required to achieve better than 50% yield. More importantly, complete conversion to the desired hetero-ligation product is not observed at any realistic adaptor concentration.

2.3.3.
Product distribution can be directed through the introduction of a restriction enzyme into the ligation reaction: directed ligation chemistry

The inability to adjust ligation conditions such that the hetero-ligation product becomes the only significant reaction product suggests that surface-anchoring of the target cDNA presents kinetic or mass-transfer barriers that cannot be overcome by simple adjustments to the reaction conditions. As the primary problem lay in the inability to offset self-ligation of the target cDNA molecules on the surface, we sought a novel method to limit or prevent formation of this undesired ligation product. Although removal of the 5’-phosphate on the recessed 5’-ends of the anchored 3’-end cDNAs using an appropriate alkaline phosphatase could potentially eliminate self-ligation of the anchored 3’-end cDNAs, the efficiency of dephosphorylation by such phosphatases is often much lower for 5’-phosphates at these sites. This, combined with the background nuclease activity of the enzyme, which can catalyze digestion of 5’ overhangs, would significantly reduce the overall yield of defined ligation products. In addition, the ligation of SAGE adaptors to such modified targets would lead to the formation of a nicked adaptor-cDNA hetero-ligation product that is unsuitable for further SAGE processing without the introduction of an additional enzymatic step prior to PCR amplification. Another approach would be to use an adaptor with an unphosphorylated 5’-end, which would prevent adaptor dimer formation and thereby enhance reactivity by maintaining a large excess of adaptor relative to target cDNA. However, this approach would also result in a nicked strand that requires additional enzymatic steps for processing in SAGE.
As an alternative approach, we considered the effect of adding different amounts of NlaIII to the reaction mixture, with the aim of establishing a more favourable product distribution based on the relative rates of ligation-product formation catalyzed by T4 DNA ligase and ligation-product cleavage by NlaIII. In the presence of both enzymes and ATP, ligation would proceed until a steady state product profile is reached. We observed that the addition of various amounts of NlaIII to a standard ligation reaction containing the SAGE adaptor clearly influenced the distribution of hetero- versus homo-ligation products (Figure 2.4.A). Titration of a standard ligation reaction containing 0.25 U/µl T4 DNA ligase and 1 µM of adaptor 1 with increasing amounts of NlaIII in the absence of PEG-8000 resulted in a gradual decrease in the amount of the high molecular weight homodimer as well as the desired heterodimer, and a concomitant increase in the amount of unmodified target DNA. While we were unable to selectively enhance the formation of the hetero-ligation product relative to the undesired cDNA-cDNA homodimer, the results suggested that the competing actions of NlaIII and T4 DNA ligase could provide an efficient route to complete conversion of the anchored 3’-end cDNA fragments into the desired hetero-ligation product if restriction enzyme (RE)-catalyzed digestion of the desired adaptor-cDNA heterodimer could be specifically inhibited. As NlaIII is one of a number of REs sensitive to the presence of a methylated base within its recognition sequence (see Table 2.3), the introduction of a methylated base within the ligation site of the SAGE adaptor could potentially enable the selective inhibition of digestion of the desired ligation product. Through subsequent formation of a hemi-methylated site within the desired adaptor-3’-end cDNA hetero-ligation product, cleavage by NlaIII would be specifically inhibited.
In contrast, all cDNA homodimers remain susceptible to NlaIII-catalyzed digestion. As the irreversible formation of the adaptor-cDNA heterodimer, coupled with rapid digestion of any self-ligated cDNA back to its unmodified state, directs the reaction towards the desired product, we termed this technique “directed ligation”.

2.3.4. Near-complete conversion of anchored 3’-end cDNA to adaptor modified products via directed ligation chemistry

The principle of directed ligation chemistry was tested by designing SAGE adaptors with a methylated base (5-methyl-deoxy cytosine or N6-methyl-deoxy adenosine) within the ligation site. Each of these redesigned adaptors was then introduced into the ligation reaction performed in the presence of NlaIII. Substitution of the conventional SAGE adaptor with either modified adaptor had a dramatic effect on the overall distribution profile of ligation products (Figure 2.4.B). When the methylated adaptor 1m6A was used, increasing amounts of NlaIII in the reaction mixture selectively reduced the amount of high molecular weight homodimer corresponding to the target cDNA self-ligation product, such that its formation was vanishingly small at a 20:1 NlaIII to ligase ratio. In contrast to the results obtained with the unmodified SAGE adaptor, the formation of the heterodimer increased dramatically as more NlaIII was introduced into the ligation mix. As a result, very high yields of the desired adaptor-cDNA heterodimer were achieved when mixtures of 10:1 (lane 8) and 20:1 (lane 9) NlaIII to T4 DNA ligase were used. In both cases, the steady state product distribution profiles shown in figure 2.4.B were obtained within the first 15 min of reaction. At 5% PEG-8000, a 10:1 enzyme mix was sufficient to inhibit homodimer formation while promoting the formation of the desired heterodimer to the extent observed when PEG was excluded from the ligation mix.
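The competition between ligation and digestion described above can be illustrated with a toy kinetic sketch. The Python model below is not part of the original study: it assumes pseudo-first-order kinetics, and the rate constants k_self, k_het and k_cut (self-ligation, adaptor ligation and NlaIII cleavage) are hypothetical values chosen only for illustration. It tracks the fraction of anchored cDNA present as unligated target (T), cDNA-cDNA homodimer (TT) and adaptor-cDNA heterodimer (AT), with cleavage of the heterodimer switched off when the adaptor is methylated.

```python
def directed_ligation(k_self, k_het, k_cut, methylated, t_end=200.0, dt=0.01):
    """Euler integration of a toy ligation/digestion model.

    All rate constants are hypothetical, in arbitrary inverse-time units.
    Returns the final fractions of anchored cDNA as (T, TT, AT).
    """
    T, TT, AT = 1.0, 0.0, 0.0  # start with unligated anchored cDNA only
    # A methylated adaptor is assumed to block cleavage of the heterodimer.
    k_cut_het = 0.0 if methylated else k_cut
    for _ in range(round(t_end / dt)):
        d_tt = k_self * T - k_cut * TT      # homodimer: formed by ligase, cut by NlaIII
        d_at = k_het * T - k_cut_het * AT   # heterodimer: cut only if unmethylated
        T, TT, AT = T - (d_tt + d_at) * dt, TT + d_tt * dt, AT + d_at * dt
    return T, TT, AT


if __name__ == "__main__":
    for methylated in (False, True):
        for k_cut in (0.0, 1.0, 5.0):
            T, TT, AT = directed_ligation(1.0, 0.2, k_cut, methylated)
            print(f"methylated={methylated} k_cut={k_cut}: "
                  f"T={T:.3f} TT={TT:.3f} AT={AT:.3f}")
```

Under these assumptions the unmethylated case settles at a steady state in which raising k_cut returns more material to the unligated pool, whereas the methylated case drains almost all of the target into the heterodimer, which is the qualitative behaviour reported in Figure 2.4.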
While both the 5-methyl-deoxy cytosine and the N6-methyl-deoxy adenosine modified adaptors are extremely effective under directed ligation chemistry, the position of the 5-methyl-deoxy cytosine base (i.e. adaptor 1m5C) within the recognition sequence of BsmFI would block the activity of this Type IIS enzyme, preventing the release of a SAGE tag from the transcript. We therefore employed the N6-methyl-deoxy adenosine modified adaptors for use in the SAGE protocol. Direct comparison of the efficiency of our modified SAGE protocol employing the SAGE 1m6A adaptor to that of the original microSAGE protocol demonstrated a remarkable increase in the yield of adaptor-modified anchored 3’-end cDNA (figure 2.5). Under the directed ligation protocol, we achieved near complete conversion of the anchored 3’-end cDNA to the desired adaptor-target DNA hetero-ligation product (lane 9). This is in direct contrast to the less than 50% yield obtained in the microSAGE protocol when a >10-fold higher adaptor concentration was used (lane 2). We also verified that the N6-methyl-deoxy adenosine modified adaptors are compatible with downstream processing in SAGE by ligating both adaptor 1m6A and 2m6A to anchored 3’-end cDNA and subjecting the resulting ligation products to the remaining steps of the microSAGE protocol (Figure 2.6). PCR amplification of the derived ditags demonstrated that more amplifiable template was present when the directed ligation protocol was employed (lanes 6 & 12) in place of the SAGE protocol employing a >10-fold excess of standard SAGE adaptors (lanes 3 & 9). Thus, directed ligation chemistry appears to provide an extremely efficient route to near-complete conversion of anchored 3’-end cDNA to the desired adaptor modified product, and can thereby ameliorate the loss of sample due to self-ligation of the anchored cDNA.

2.4.
CONCLUSION

As the cost of sequencing continues to decrease, SAGE technology will continue to evolve into a powerful, more accessible alternative to microarray technology for the study of gene expression on a global scale. However, for SAGE to provide a truly quantitative picture of global gene expression, it is clear that many of its processing steps require optimization in order to ensure the fidelity of SAGE prior to analysis of the SAGE tag library via sequencing. We have demonstrated that the efficiency of one critical step in the SAGE protocol, namely the ligation of the SAGE adaptor that permits the use of a Type IIS RE to generate a SAGE tag, is compromised by the tendency of the anchored target to self-ligate. While optimization of reaction conditions to improve ligation efficiency resulted in a significant improvement in the yield of the desired ligation product, complete conversion could not be achieved. We therefore developed a simple approach, termed “directed ligation”, that provides near complete conversion to hetero-ligation products, thereby ensuring the fidelity of the transcriptome sample at this step in SAGE analysis. Finally, given that the ligation of specifically designed adaptors is a fundamental step in many genomic technologies, it is likely that this self-ligation problem is not unique to SAGE. Our directed ligation chemistry may therefore provide a means of improving a range of important functional genomics technologies.

Table 2.1. Outline of the enzymatic, purification and isolation steps involved in SAGE and microSAGE protocols (http://www.sagenet.org/protocol/index.htm).
Purification and isolation steps  Enzymatic steps  mRNA preparation  MicroSAGE (SADE)  SAGE  Precipitation, selection with biotinylated oligo(dT)  Phenol extraction, precipitation  Phenol extraction, precipitation  Affinity purification  cDNA synthesis  -  cleavage with anchoring enzyme (digest with NlaIII)  -  3’-end cDNA isolation  Affinity purification  -  ligating adaptors to bound 3’-end cDNA  -  release of cDNA tags (digest with BsmFI)  Phenol extraction, precipitation  blunt ending of cDNA tags  Phenol extraction, precipitation  ligating tags to form ditags  -  Phenol extraction, precipitation  PCR amplification of ditags  PAGE purification, gel extraction, precipitation  Adaptor removal (NlaIII digestion) and purification of ditags  Ligation of ditags to form concatamers  Phenol extraction, precipitation  PAGE purification, gel extraction, precipitation  PAGE purification, size selection, gel extraction, precipitation  insertion into vector  Phenol extraction, selection by host

Table 2.2. List of oligonucleotides used in this study to form various SAGE adaptors. Oligonucleotides were obtained gel-purified and verified by mass spectrometry.

Oligo ID   Sequence (5’→3’)                                    MW (g/mol)
1A         TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGACATG        13657.06
1Am6A      TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGAC[m6A]TG    13670.95
1Am5C      TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGA[m5C]ATG    13670.95
1Bphos     pTCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-NH2          11517.57
2A         TTTCTGCTCGAATTCAAGCTTCTAACGATGTACGGGGACATG          12919.55
2Am6A      TTTCTGCTCGAATTCAAGCTTCTAACGATGTACGGGGAC[m6A]TG      12933.58
2Am5C      TTTCTGCTCGAATTCAAGCTTCTAACGATGTACGGGGA[m5C]ATG      12933.58
2Bphos     pTCCCCGTACATCGTTAGAAGCTTGAATTCGAGCAG-NH2            11020.24

m6A: N6-methyl-deoxyadenosine; m5C: 5-methyl-deoxycytosine; NH2: 3’ C7 amino spacer.

Table 2.3. List of methyl sensitive Type II restriction enzymes that generate overhangs suitable for directed ligation chemistry.
This list was extracted from REBASE version 05/2003 (http://rebase.neb.com) and corresponds to commercially available enzymes which have well-characterized methylation sensitivities.

Enzyme name   Recognition sequence with cleavage site   Methylation site and type
AccI          GT↓MKAC                                   5 (6)
AclI          AA↓CGTT                                   3 (5)
AflIII        A↓CRYGT                                   2 (4)
ApaI          GGGCC↓C                                   4 (5)
BamHI         G↓GATCC                                   5 (4)
BanII         GRGCY↓C                                   4 (5)
BclI          T↓GATCA                                   3 (6)
BglII         A↓GATCT                                   5 (4)
BseCI         AT↓CGAT                                   5 (6)
Bsp106I       AT↓CGAT                                   5 (6)
BstNI         CC↓WGG                                    2 (4)
BstYI         R↓GATCY                                   5 (4)
Bsu15I        AT↓CGAT                                   5 (6)
Cfr10I        R↓CCGGY                                   2 (5)
Cfr13I        G↓GNCC                                    4 (5)
Cfr9I         C↓CCGGG                                   2 (4)
CfrI          Y↓GGCCR                                   4 (5)
ClaI          AT↓CGAT                                   5 (6)
DdeI          C↓TNAG                                    1 (5)
DraIII        CACNNN↓GTG                                2 (6)
EaeI          Y↓GGCCR                                   4 (5)
EcoO109I      RG↓GNCCY                                  5 (5)
EcoRI         G↓AATTC                                   3 (6)
EcoRII        ↓CCWGG                                    2 (5)
EcoT38I       GRGCY↓C                                   4 (5)
HapII         C↓CGG                                     2 (5)
HhaI          GCG↓C                                     2 (5)
HindIII       A↓AGCTT                                   1 (6)
HinfI         G↓ANTC                                    2 (6)
HinP1I        G↓CGC                                     2 (5)
HpaII         C↓CGG                                     2 (5)
Hpy188I       TCN↓GA                                    5 (6)
HpyCH4IV      A↓CGT                                     2 (5)
KpnI          GGTAC↓C                                   4 (6)
MseI          T↓TAA                                     4 (6)
MspI          C↓CGG                                     1 (5)
MunI          C↓AATTG                                   3 (6)
MvaI          CC↓WGG                                    2 (4)
NgoMIV        G↓CCGGC                                   2 (5)
NlaIII        CATG↓                                     2 (6)
NspI          RCATG↓Y                                   2 (5)
PaeR7I        C↓TCGAG                                   5 (6)
PspPI         G↓GNCC                                    4 (5)
PstI          CTGCA↓G                                   5 (6)
SacI          GAGCT↓C                                   4 (5)
SalI          G↓TCGAC                                   5 (6)
Sau3AI        ↓GATC                                     4 (5)
Sau96I        G↓GNCC                                    4 (5)
SinI          G↓GWCC                                    4 (5)
Sse9I         ↓AATT                                     2 (6)
TaqI          T↓CGA                                     4 (6)
TfiI          G↓AWTC                                    2 (6)
Tsp45I        ↓GTSAC                                    4 (6)
TspRI         CASTGNN↓                                  1 (5)
Tth111I       GACN↓NNGTC                                2 (6)
VspI          AT↓TAAT                                   5 (6)
XbaI          T↓CTAGA                                   6 (6)
XcmI          CCANNNNN↓NNNNTGG                          3 (6)
XhoI          C↓TCGAG                                   5 (6)
XhoII         R↓GATCY                                   5 (4)

Recognition sites are listed in 5’ to 3’ orientation with the site of cleavage denoted by ↓. Degenerate bases are indicated as follows: R = G or A; Y = C or T; M = A or C; K = G or T; S = G or C; W = A or T; B = not A (C or G or T); D = not C (A or G or T); H = not G (A or C or T); V = not T (A or C or G); N = A or C or G or T.
The methylation site gives the position of the methylated base in the recognition sequence, with the type of modification in parentheses as follows: 4 = N4-methyl-2’-deoxy-cytosine; 5 = 5-methyl-2’-deoxy-cytosine; 6 = N6-methyl-2’-deoxy-adenosine.
Figure 2.1. Ligation of SAGE adaptor 1A to anchored 3’-end cDNA. 100 ng of in vitro transcribed polyadenylated product was processed under the microSAGE protocol and split in half. Lane 2 shows a control reaction in which T4 DNA ligase was not added to the ligation mix. Lane 3 shows the formation of a small amount of the hetero-ligation product indicated by the arrow as well as a high molecular weight band corresponding to twice the molecular weight of the unligated cDNA. Ligations were performed as described in Methods and Materials.

Figure 2.2. Influence of increasing adaptor:target molar ratios on the formation of adaptor-target heterodimer versus target homodimer. Increasing amounts of adaptor 1 (0 - 3.8 µM final) were introduced into standard ligation reactions containing 0.075 pmol anchored target in a final volume of 10 µl as described in Methods and Materials. In microSAGE, adaptors are introduced to a reaction mixture containing ~0.08 pmol anchored target at a final concentration of 0.08 µM in a total volume of 20 µl, corresponding to an adaptor:target ratio of approximately 20:1. The classic SAGE protocol introduces a final adaptor concentration of 0.8 µM to the ligation mixture containing ~1.95 pmol anchored target in a total volume of 40 µl, corresponding to an adaptor:target ratio of approximately 16:1.

Figure 2.3. Influence of supplemental PEG-8000 and incubation temperature on the formation of adaptor-target heterodimer versus target homodimer. The standard ligation reaction in the microSAGE protocol is performed in the presence of 5% PEG-8000 (w/v) at 16 °C for 2 hrs using a final adaptor concentration of 0.08 µM in a final volume of 20 µl.
Ligation reactions shown were performed in a final volume of 10 µl as described in Methods and Materials using an adaptor concentration of 1 µM final in the presence or absence of PEG-8000 supplemented to a final concentration of 15% (w/v). Reactions were carried out for 2 hrs under the conditions indicated.

Figure 2.4. Outline of directed ligation. A. Ligation of unmethylated adaptors (black) results in the formation of a mixture of adaptor homodimers, target homodimers, and the adaptor-target heterodimer. In the presence of NlaIII, ligated products are converted back to their respective monomers. The final product distribution is determined by the relative rates of ligation by T4 DNA ligase and digestion by NlaIII. B. In contrast, using an adaptor with a methylated base (N6-methyl-deoxy adenosine) within the site of ligation blocks digestion of the adaptor-target heterodimer, and the product distribution is shifted towards the formation of the adaptor-target heterodimer. Titrations of T4 DNA ligase with increasing quantities of NlaIII were performed in the presence of 1 µM adaptor in a final volume of 10 µl for 2 hrs at 37 °C as described in Methods and Materials.

Figure 2.5. Comparison of ligation under the SAGE protocol versus under directed ligation chemistry.
Ligation reactions were performed in 5% PEG-8000 (w/v) in the presence or absence of NlaIII, using standard SAGE adaptors (adaptor 1) or modified SAGE adaptors with an N6-methyl-deoxy-adenosine base (adaptor 1m6A) at a final concentration of 1 µM. Reactions were performed at a final reaction volume of 10 µl and incubated as described in Methods and Materials.

Figure 2.6. PCR amplification of ditags derived from adaptor-modified anchored 3’-end cDNA obtained using the microSAGE protocol or directed ligation chemistry. After ligation under the microSAGE protocol using a ~10-fold greater amount of adaptors 1 and 2 under standard conditions (lanes 3-5, 9-11) or using adaptors 1m6A and 2m6A under directed ligation chemistry (lanes 6-8, 12-14), tags were released with BsmFI, blunt-ended with Klenow and ligated to form ditags as described under the microSAGE protocol version 1e in the presence (lanes 5, 8, 11, 14) or absence (lanes 4, 7, 10, 13) of added PEG-8000 (15% w/v final). Following ligation, mixtures were diluted 1:20 (lanes 3-8) or 1:200 (lanes 9-14) in LoTE and 1 µl was used as a template for PCR amplification as described in Methods and Materials.

2.5. REFERENCES

1. Ideker, T. et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929-934 (2001).
2. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484-487 (1995).
3. Adams, M.D. Serial analysis of gene expression: ESTs get smaller. Bioessays 18, 261-262 (1996).
4. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995).
5. Epstein, C.B. & Butow, R.A. Microarray technology - enhanced versatility, persistent challenge. Curr Opin Biotechnol 11, 36-41 (2000).
6.
Lipshutz, R.J., Fodor, S.P., Gingeras, T.R. & Lockhart, D.J. High density synthetic oligonucleotide arrays. Nat Genet 21, 20-24 (1999).
7. Lercher, M.J., Urrutia, A.O. & Hurst, L.D. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat Genet 31, 180-183 (2002).
8. Szybalski, W., Kim, S.C., Hasan, N. & Podhajska, A.J. Class-IIS restriction enzymes - a review. Gene 100, 13-26 (1991).
9. Saha, S. et al. Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-512 (2002).
10. van den Berg, A., van der Leij, J. & Poppema, S. Serial analysis of gene expression: rapid RT-PCR analysis of unknown SAGE tags. Nucleic Acids Res 27, e17 (1999).
11. Velculescu, V.E., Vogelstein, B. & Kinzler, K.W. Analysing uncharted transcriptomes with SAGE. Trends Genet 16, 423-425 (2000).
12. Velculescu, V.E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
13. Lee, S., Chen, J., Zhou, G. & Wang, S.M. Generation of high-quantity and quality tag/ditag cDNAs for SAGE analysis. Biotechniques 31, 348-350, 352-354 (2001).
14. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289-1292 (2001).
15. Munasinghe, A. et al. Serial analysis of gene expression (SAGE) in Plasmodium falciparum: application of the technique to A-T rich genomes. Mol Biochem Parasitol 113, 23-34 (2001).
16. Boheler, K.R. & Stern, M.D. The new role of SAGE in gene discovery. Trends Biotechnol 21, 55-57 (2003).
17. Chen, J. et al. Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci USA 99, 12257-12262 (2002).
18. Colinge, J. & Feger, G. Detecting the impact of sequencing errors on SAGE data. Bioinformatics 17, 840-842 (2001).
19. Stern, M.D., Anisimov, S.V. & Boheler, K.R. Can transcriptome size be estimated from SAGE catalogs? Bioinformatics 19, 443-448 (2003).
20. Stollberg, J., Urschitz, J., Urban, Z. & Boyd, C.D.
A quantitative evaluation of SAGE. Genome Res 10, 1241-1248 (2000).
21. Powell, J. Enhanced concatemer cloning - a modification to the SAGE (Serial Analysis of Gene Expression) technique. Nucleic Acids Res 26, 3445-3446 (1998).
22. Datson, N.A., van der Perk-de Jong, J., van den Berg, M.P., de Kloet, E.R. & Vreugdenhil, E. MicroSAGE: a modified procedure for serial analysis of gene expression in limited amounts of tissue. Nucleic Acids Res 27, 1300-1307 (1999).
23. Kenzelmann, M. & Muhlemann, K. Substantially enhanced cloning efficiency of SAGE (Serial Analysis of Gene Expression) by adding a heating step to the original protocol. Nucleic Acids Res 27, 917-918 (1999).
24. Angelastro, J.M., Klimaschewski, L.P. & Vitolo, O.V. Improved NlaIII digestion of PAGE-purified 102 bp ditags by addition of a single purification step in both the SAGE and microSAGE protocols. Nucleic Acids Res 28, E62 (2000).
25. Ye, S.Q., Zhang, L.Q., Zheng, F., Virgil, D. & Kwiterovich, P.O. miniSAGE: gene expression profiling using serial analysis of gene expression from 1 microg total RNA. Anal Biochem 287, 144-152 (2000).
26. Margulies, E.H., Kardia, S.L. & Innis, J.W. Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res 29, E60-60 (2001).
27. Mathupala, S.P. & Sloan, A.E. “In-gel” purified ditags direct synthesis of highly efficient SAGE Libraries. BMC Genomics 3, 20 (2002).
28. Damgaard Nielsen, M., Millichip, M. & Josefsen, K. High-performance liquid chromatography purification of 26-bp serial analysis of gene expression ditags results in higher yields, longer concatemers, and substantial time savings. Anal Biochem 313, 128-132 (2003).
29. Harrison, B. & Zimmerman, S.B. Polymer-stimulated ligation: enhanced ligation of oligo- and polynucleotides by T4 RNA ligase in polymer solutions. Nucleic Acids Res 12, 8235-8251 (1984).
30. Hayashi, K., Nakazawa, M., Ishizaki, Y., Hiraoka, N. & Obayashi, A.
Regulation of inter- and intramolecular ligation with T4 DNA ligase in the presence of polyethylene glycol. Nucleic Acids Res 14, 7617-7631 (1986).
31. Pheiffer, B.H. & Zimmerman, S.B. Polymer-stimulated ligation: enhanced blunt- or cohesive-end ligation of DNA or deoxyribooligonucleotides by T4 DNA ligase in polymer solutions. Nucleic Acids Res 11, 7853-7871 (1983).
32. Sambrook, J. & Russell, D.W. Molecular cloning: a laboratory manual, Edn. 3rd. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; 2001).
33. Virlon, B. et al. Serial microanalysis of renal transcriptomes. Proc Natl Acad Sci USA 96, 15286-15291 (1999).
34. Angelastro, J.M. et al. Identification of diverse nerve growth factor-regulated genes by serial analysis of gene expression (SAGE) profiling. Proc Natl Acad Sci USA 97, 10424-10429 (2000).
35. Hall, D. & Minton, A.P. Macromolecular crowding: qualitative and semiquantitative successes, quantitative challenges. Biochim Biophys Acta 1649, 127-139 (2003).
36. Zimmerman, S.B. & Minton, A.P. Macromolecular crowding: biochemical, biophysical, and physiological consequences. Annu Rev Biophys Biomol Struct 22, 27-65 (1993).
37. Hayashi, K., Nakazawa, M., Ishizaki, Y. & Obayashi, A. Influence of monovalent cations on the activity of T4 DNA ligase in the presence of polyethylene glycol. Nucleic Acids Res 13, 3261-3271 (1985).
38. Wu, D.Y. & Wallace, R.B. Specificity of the nick-closing activity of bacteriophage T4 DNA ligase. Gene 76, 245-254 (1989).
39. Pohl, F.M., Thomae, R. & Karst, A. Temperature dependence of the activity of DNA-modifying enzymes: endonucleases and DNA ligase. Eur J Biochem 123, 141-152 (1982).
40. Faulhammer, D., Lipton, R.J. & Landweber, L.F. Fidelity of enzymatic ligation for DNA computing. J Comput Biol 7, 839-848 (2000).
41. Louie, D. & Serwer, P. Effects of Temperature on Excluded Volume-Promoted Cyclization and Concatemerization of Cohesive-Ended DNA Longer Than 0.04-Mb. Nucleic Acids Research 19, 3047-3054 (1991).
42. Louie, D.
& Serwer, P. Quantification of the Effect of Excluded-Volume on Double-Stranded DNA. Journal of Molecular Biology 242, 547-558 (1994).
43. Murphy, L.D. & Zimmerman, S.B. Condensation and Cohesion of Lambda-DNA in Cell-Extracts and Other Media - Implications for the Structure and Function of DNA in Prokaryotes. Biophysical Chemistry 57, 71-92 (1995).
44. Hayashi, K., Nakazawa, M., Ishizaki, Y., Hiraoka, N. & Obayashi, A. Stimulation of intermolecular ligation with E. coli DNA ligase by high concentrations of monovalent cations in polyethylene glycol solutions. Nucleic Acids Res 13, 7979-7992 (1985).
45. Roberts, R.J., Vincze, T., Posfai, J. & Macelis, D. REBASE: restriction enzymes and methyltransferases. Nucleic Acids Res 31, 418-420 (2003).

CHAPTER 3

Minimizing loss of sequence information in SAGE ditags by modulating the temperature dependent 3’→5’ exonuclease activity of DNA polymerases on 3’-terminal isoheptyl amino groups*

*A version of this chapter is published in Biotechnology & Bioengineering. [Reference: So, A. P., Turner, R. B. F., and Haynes, C. A. 2006. Minimizing loss of sequence information in SAGE ditags by modulating the temperature dependent 3’→5’ exonuclease activity of DNA polymerases on 3’-terminal isoheptyl amino groups. Biotechnol. Bioeng. 94, 54-65.]

3.1. INTRODUCTION

Serial analysis of gene expression (or SAGE) technology is one of two widely used technologies that allow the monitoring of gene expression on a global scale. Utilizing a high-throughput sequencing approach, SAGE directly samples the entire transcriptome of an organism under a given cellular state through the generation of short sequence tags (SSTs) that are derived from an identifiable position within the 3’ region of each transcript of the transcriptome.
As a 9- or 10-mer can theoretically identify 4^9 (262,144) or 4^10 (1,048,576) unique sequences, respectively, the entire transcript population of any organism can potentially be represented by the set of generated SSTs 1,2. Furthermore, because each SAGE tag is derived from a defined position within a particular cDNA, a given tag can be cross-referenced through organism- and/or tissue-specific genome databases to a particular gene to give a profile of global gene expression. A particular strength of SAGE is the ability to use unreferenced SAGE tags that arise out of the analysis to identify previously unknown genes and aid in the annotation of genomic sequences. However, numerous steps are required to prepare a SAGE library from the starting mRNA material, so particular care must be taken to minimize sample loss and potential sources of bias. An essential step in the preparation of a SAGE library is the introduction of an adaptor containing a Type IIS restriction site at a defined site within a given cDNA. A Type IIS enzyme is capable of cleaving a number of bases outside of its recognition sequence 8. This enables the release of a sequence tag from the cDNA as an adaptor:tag (A:T) conjugate, and thus forms the basis of many tag-based technologies. The choice of Type IIS enzyme intrinsically determines the amount of sequence information that can be retrieved from a given cDNA. In “shortSAGE” protocols (e.g. SAGE, microSAGE, I-SAGE, SADE), A:Ts are released from the cDNA population using the Type IIS enzyme BsmFI 3,10-13, which cleaves 10/14 nucleotides downstream of its recognition sequence. As this particular enzyme leaves a recessed end at the 3’-end of the (+) strand of the A:T, a proofreading DNA polymerase (DNAP) is used to fill in the recessed end using the (−) strand as a template so that the sequence information content of individual tags is maximized (figure 3.1.A).
This also facilitates the formation of SAGE “ditags”, a key intermediate that enables the use of PCR to generate sufficient amounts of material for library construction. Ligation at the 3’-hydroxyl group on the adaptor end of the (−) strand of the A:T is in principle prevented by introducing a 3’-terminal isoheptyl amine (3’-IHA) group so that, in combination with the 5’-hydroxyl group on the (+) strand, ligation is restricted to the filled-tag end of the released A:T to form a symmetric 102 bp ditag (figure 3.1.A). Here we show that, despite the presence of the 3’-IHA blocking group, a complex ligation mixture is formed during ditag creation, containing higher molecular weight (HMW) byproducts that can act as templates for PCR amplification (Figure 3.2). These unwanted HMW ligation products result from the removal of the 3’-IHA during the standard fill-in reaction with the Klenow fragment (KF) of DNA polymerase I from E. coli. The reintroduced 3’-hydroxyl group on the (−) strand is then able to compete with the 3’-hydroxyl on the (+) strand for joining with the 5’-phosphate on the (−) strand of another A:T to form an asymmetric dimer (figure 3.1.B) that can be further ligated to form a mixture of multimeric ligation products. The result is that SSTs corresponding to rare transcripts can be lost to HMW ligation products during ditag formation, leading to their elimination from the SAGE library and the skewing of sequencing results away from the true distribution of transcripts in the original sample. We describe new reaction conditions that strongly inhibit the formation of these unwanted HMW ligation products, allowing one to maximize the yield of ditags and to bypass the need for gel purification via PAGE following PCR.
The modifications described here, combined with the novel chemistry described previously for adaptor ligation, ensure that the full sequence information content in SSTs derived from the transcriptome is preserved in the pool of amplified ditags prior to the creation of SAGE libraries.

3.2. MATERIALS AND METHODS

3.2.1. Linkers and oligonucleotides

Oligonucleotides (table 3.1) were obtained gel-purified (Qiagen) and stock concentrations (5 µM) were prepared in 1x NEB4 buffer (New England Biolabs) by mass dilutions. Oligonucleotide pairs corresponding to desired duplexes were annealed according to the following schedule: 1 min, 95 °C; 5 min, 65 °C; 15 min, 37 °C; 15 min, room temperature; 15 min, 4 °C. Duplexes were stored at -20 °C until ready for use.

3.2.2. Preparation of SAGE ditags from Cryptococcus neoformans

30 µg of total RNA from Cryptococcus neoformans strain H99 grown under low iron (gift of Dr. James Kronstad) was annealed to 1.5 mg of oligo(dT) dynabeads (Dynal Biotech) as described by the manufacturer. Annealed RNA was then processed according to the microSAGE protocol version 1e 12, but scaled to a final volume of 300 µl, with the following modifications. After first strand synthesis, the reaction was cooled on ice, magnetized, and 260 µl of the first strand reaction was replaced with 260 µl of a pre-chilled mix of second strand synthesis reaction components and incubated for 16 hrs at 16 °C. Anchored second strand products were then polished with 20 U of T4 DNAP (Invitrogen), washed and then digested with 120 U NlaIII (New England Biolabs). After washing, the resulting anchored 3’-end cDNAs were resuspended in 300 µl of 1x NEB4 buffer (New England Biolabs) supplemented with 100 ng/µl bovine serum albumin (BSA) and split into six 50 µl aliquots. SAGE adaptors 1 (adaptor 1A + 1B_phos) and 2 (2A + 2B_phos) were ligated to anchored 3’-end cDNA as described in the microSAGE protocol.
Modified SAGE adaptors m6A (m6A_1A + 1B_phos and m6A_2A + 2B_phos) or m6A_Sp (m6A_1A + 1S_phos and m6A_2A + 2S_phos) were ligated to anchored 3’-end cDNA according to our DLC protocol [14]. After adaptor ligation, A:Ts were released with BsmFI and precipitated as described.

3.2.3. Model adaptor:tag (A:T) construction

100 pmol of SAGE adaptors 1 & 2 (as above) or modified adaptors 1S & 2S (1A + 1S_phos and 2A + 2S_phos) were separately ligated to 200 pmol model clone fragments NM (NM1 + NM2) and BC (BC1 + BC2), respectively, using 10 U (Weiss) T4 DNA ligase (5 U/µl; Fermentas) in 1x NEB4 buffer supplemented with 100 µg/ml BSA and 1 mM ATP for 2 hrs at 16 °C in a final volume of 100 µl. After heat inactivation for 20 min at 65 °C, reactions were spiked with 20 U BsmFI in 10 µl of 1x NEB4 supplemented with BSA and incubated 1 hr at 65 °C. Reaction mixtures were then resolved via 12% polyacrylamide gel electrophoresis (PAGE) and the 54 bp band was excised, purified using the “crush and soak” method [15], and resuspended in 100 µl LoTE buffer (2 mM Tris-HCl, 0.2 mM EDTA, pH = 8.0).

3.2.4. Standard fill-in reactions

Model A:Ts were filled-in using KF (Invitrogen) as outlined in the I-SAGE™ protocol (version F, Invitrogen) in a final volume of 50 µl. Alternatively, A:Ts were filled-in using T4 DNAP (Fermentas) or Tli DNA polymerase (Vent® DNAP; New England Biolabs) according to the manufacturers’ protocols in a final volume of 50 µl. Fill-in reactions were pre-incubated for 5 min on ice (4 °C), or at 12 °C, 25 °C, 37 °C and 50 °C (Eppendorf Thermomixer; Eppendorf) and initiated with the addition of polymerase (1 U/µl final). After incubation for 30 min, reactions were quenched by the addition of 1/5 volume 0.5 M EDTA (pH = 7.5). Fill-in reaction products were extracted with phenol:chloroform:isoamyl alcohol (25:24:1; PCI) and the aqueous layer precipitated as described.

3.2.5.
Ligation to form ditags

Fill-in products were resuspended in 3 µl LoTE and ligations were performed as described in the I-SAGE™ protocol or supplemented to 15% (w/v) final polyethylene glycol 8000 (PEG8000; J. T. Baker) in the presence or absence of 2 U (Weiss) T4 DNA ligase (Fermentas) in a final volume of 3 µl. Reactions were incubated at 16 °C for 16 hrs then brought to a final volume of 20 µl with LoTE.

3.2.6. PCR amplification of ditags

PCR amplifications were performed on 1 µl of 1:20, 1:400, 1:8000, or 1:160000 dilutions in LoTE of the 20 µl ditag reaction mix using Platinum Taq DNAP (Invitrogen) as described under the I-SAGE™ protocol at a final volume of 50 µl. PCR amplifications were performed using primers P1 and P2 according to the following amplification schedule: 95 °C, 30 sec; 55 °C, 1 min; 70 °C, 1 min; 27 cycles. The amplification cycle was preceded by a pre-incubation for 2 min at 95 °C.

3.2.7. Simulation of ditag formation under various amounts of 3’-IHA block relief

Ad hoc Perl scripts (see Appendix A.1.) were used to create a canonical Monte Carlo simulation of the ligation reaction to form ditags, under the assumption that the reaction rate is solely determined by the collision of ends possessing the appropriate 5’P/3’OH donor-acceptor. A starting pool of A:T monomers (5’-HNXPH-3’) is introduced into the simulation and a fraction (f_fill-in) of the starting pool is modified (5’-HHXPH-3’) to correspond to the fraction of A:T monomers modified through fill-in with KF. Joining rules were established to represent the activity of T4 DNA ligase on this mixture of A:T monomers: monomers linked via a tag-end to tag-end ligation of a 5’P/3’OH donor-3’OH/5’P acceptor pair (denoted as PH::HP) and those linked via a tag-end to modified adaptor-end ligation of a 5’P/3’OH donor-3’OH/5’OH acceptor pair (denoted as PH::HH or HH::PH) were permissible outcomes of monomer ligation.
Each of these joining rules was then assigned a joining probability between 0 and 1 (P_PH::HP and P_PH::HH/HH::PH, respectively, where P_PH::HP = 2 × P_PH::HH/HH::PH) to reflect the statistical chance of ligation of a 5’P/3’OH donor-3’OH/5’P acceptor pair versus a 5’P/3’OH donor-3’OH/5’OH acceptor pair. During a simulation step, each member of the population was randomly flipped in orientation and ligated in a pair-wise fashion according to the joining rules and probabilities assigned. Each member of the resulting product population was then re-ligated according to the simulated joining probabilities, and this cycle was repeated until a fixed number of steps were completed or until an assigned cutoff point in the number of remaining monomers was reached. Each ligation product was then binned according to the number of A:Ts incorporated to determine the size distribution of ligation products obtained under varying conditions of f_fill-in and P_PH::HP. Each ligation product was also interrogated for the presence of region(s) which can give rise to a ditag PCR amplification product (i.e. ...XPH::HPX...) to determine the fractional recovery of information within purified ditags following amplification versus the total number of starting A:T monomers introduced into the simulation. Data obtained from the simulation were collected and plotted in Origin 7.1 (Microcal).

3.2.8. End-point activity of DNAPs

Truncated adaptors (1T_Bi + 1B, 1T + 1B_Bi, 1T_Bi + 1S, and 1T + 1S_Bi) were end-labeled with T4 polynucleotide kinase (New England Biolabs) using 50 µCi of [γ-32P]-dATP (6000 Ci/mmol; Perkin Elmer Life Sciences) according to the manufacturer’s protocol, but scaled to a final volume of 10 µl in the presence of 5% w/v (final) PEG-8000. After incubation for 30 min at 37 °C, labeling reactions were quenched with 10 µl of 50 mM EDTA (pH = 7.5) and heat inactivated for 20 min at 65 °C.
One microlitre of a 1:10 dilution of the quenched labeling reaction in DEPC-treated H2O was then used as a template for the fill-in reactions as described above. After quenching, fill-in products were precipitated immediately as described above. Fill-in products were resuspended in 10 µl of formamide loading buffer (80% formamide w/v, 10 mM EDTA, 1 mg/ml xylene cyanol, 1 mg/ml bromophenol blue), heat denatured for 2 min at 95 °C, and immediately placed on ice. Fill-in products (~15,000 cpm) were resolved via 12% denaturing PAGE (29:1 bis-acrylamide, 7 M urea) for 1.5 hrs at 1600 V (40 V/cm).

3.2.9. Visualization of fill-in and ligation products

Resolved unlabelled reaction products were stained with SYBR-Green I (Molecular Probes) and visualized using a CCD-based gel documentation system (Alpha Innotech) employing a SYBR-green filter set (Molecular Probes) and recorded as TIFF files. Resolved 32P-labelled reaction products were visualized via autoradiography on Kodak X-Omat film and scanned as TIFF files. When required, densitometric analysis of the generated TIFF files was performed using publicly available software (tnimage-3.3.12a: http://entropy.brni-jhu.org/tnimage.htm). TIFF images were cropped and levels were auto-equalized in Corel PhotoPaint 9.0 prior to the generation of figures in Corel Draw 9.0.

3.3. RESULTS AND DISCUSSION

3.3.1. Improved yield of released SSTs via directed ligation chemistry reveals formation of HMW ligation products

The ability to greatly improve upon the ligation efficiency of the adaptors utilized to release SSTs from an anchored cDNA population {So, 2004 #562} allowed a closer monitoring of the downstream processing steps involved in generating a SAGE library.
Application of our directed ligation chemistry (DLC) protocol to anchored 3’-end cDNA derived from C. neoformans (strain H99) and comparison to the amount of ditag amplification product obtained under the original microSAGE protocol indicated that an equivalent amount of the desired 102 bp amplification product could be generated using 20-fold less starting template (figure 3.2.A). However, a significant amount of amplification product of molecular weight higher than 102 bp was observed irrespective of the ligation protocol used (microSAGE or DLC). When the pre-amplification ligation products were resolved directly (figure 3.2.B), we found that ligation of the filled-in SSTs does not result in the formation of a single 102 bp band. Instead, a series of HMW products (lane 4, arrows) are consistently observed.

3.3.2. HMW ligation products consist of +54 bp multimers of released A:Ts

To more clearly define the products of the A:T dimerization step, 5 ng of synthetic A:Ts, an amount equivalent to the amount of A:Ts that would be released from 100 ng of starting mRNA material (0.156 pmol mRNA of average length 2 kb), were filled-in with KF and then incubated with T4 DNA ligase to generate ditag reaction product (figure 3.3.A). Consistent with the observations obtained using A:Ts derived from the C. neoformans mRNA sample, ligation reaction mixtures clearly and consistently showed the formation of discrete HMW products (~50-54 bp ladders) in addition to un-ligated A:T monomers and the expected ditag (figure 3.3.A). Comparison with a control reaction performed in the absence of T4 DNA ligase indicated that ligation occurred at ~80% efficiency, with ~20% of the A:Ts remaining unligated. However, only ~35% of the starting material was converted into the desired ditag. The remainder was incorporated into HMW products.
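The quantities quoted above can be checked with a short calculation. The sketch below converts the stated masses to molar amounts and works out the share of starting material left in HMW products. The average molar masses per nucleotide and per base pair are textbook approximations assumed here, and the function and variable names are illustrative only; none of them comes from the original protocol.

```python
# Approximate average molar masses. Both values are assumptions, not from the text.
RNA_NT_G_PER_MOL = 320.5   # per nucleotide of single-stranded RNA
DNA_BP_G_PER_MOL = 650.0   # per base pair of double-stranded DNA


def ng_to_pmol(mass_ng, length, g_per_mol_per_unit):
    """Convert a mass in ng to pmol for a polymer of the given length."""
    molar_mass = length * g_per_mol_per_unit        # g/mol
    return mass_ng * 1e-9 / molar_mass * 1e12       # ng -> g -> mol -> pmol


mrna_pmol = ng_to_pmol(100, 2000, RNA_NT_G_PER_MOL)  # 100 ng mRNA, 2 kb average
at_pmol = ng_to_pmol(5, 54, DNA_BP_G_PER_MOL)        # 5 ng of 54 bp A:T

ligated = 0.80            # fraction of A:Ts that were ligated
ditag = 0.35              # fraction converted into the 102 bp ditag
hmw = ligated - ditag     # remainder incorporated into HMW products

print(f"mRNA: {mrna_pmol:.3f} pmol, A:T: {at_pmol:.3f} pmol, HMW: {hmw:.2f}")
```

Under these assumed molar masses, 100 ng of 2 kb mRNA comes to about 0.156 pmol and 5 ng of a 54 bp A:T to about 0.14 pmol, so the two amounts are comparable, and roughly 45% of the starting A:Ts end up in HMW products.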
This product distribution was observed regardless of the commercial source of KF utilized (United States Biochemical, Invitrogen, or New England Biolabs) or when 4-fold less starting material was utilized.

3.3.3. KF removes the 3’-IHA from model A:Ts

The presence of HMW ligation products of discrete sizes in the pre-amplified ditag mixture suggests that the adaptor end of the A:T is being reintroduced as a site for ligation, likely through the loss of the 3’-IHA on the (−) strand. Although failed (−) strand synthesis products could also account for the presence of A:Ts lacking the 3’-IHA, the phosphoramidite chemistry utilized to synthesize the SAGE adaptors prevents the formation of any 3’-end failed sequences. Given that the fill-in step with KF follows A:T release and precedes the A:T dimerization step, we hypothesized that the 3’→5’ exonuclease activity of KF catalyzes the removal of the 3’-IHA during the fill-in reaction. To test this hypothesis, we constructed test templates corresponding to a truncated SAGE A:T that were radiolabelled on the 5’-end of the (+) strand or the (−) strand to allow us to directly monitor the activity of KF on either strand of the template (figure 3.3.B). As required for SAGE, we observed that incubation of these templates for 30 min at 37 °C with KF led to a +4 base shift in the (+) strand (lane 4) versus the control reaction incubated in the absence of KF (lane 2). In addition, however, a major fraction of the (−) strand (~57%) in the test template shifted +3 bases (lane 3) versus the control reaction (lane 1). This indicates that KF is able to insert nucleotides at the 3’-end of the (−) strand. Such extension of the (−) strand can occur only after removal of the 3’-IHA, indicating an unwanted exonucleolytic activity of KF during the fill-in reaction. This capacity of KF to remove the 3’-IHA is somewhat unexpected given that this group does not present itself as an obvious candidate for exonucleolytic removal.
The design of the SAGE adaptor should also preclude its removal, as end-fraying, which would favor partitioning of the nascent DNA into the exonuclease domain, should be inhibited by the combined presence of the terminal 5’-GG/CC-3’ clamp, the 5’-dangling end (5’-TTT) on the (+) strand and the basic charge of the 3’-amino group on the (−) strand. However, since KF is initially unable to extend the 3’-end of the (−) strand during its processive stage, the observed (−) strand extension reaction may arise because the polymerase resides longer on the “template” (+) strand, due in part to the presence of the 5’-overhang. As the activities of both the polymerase and exonuclease domains are catalyzed by a similar two-metal ion mechanism that stabilizes the transition state and primes the α-phosphate of the deoxynucleotide (dNTP) or the terminal linkage to nucleophilic attack by the 3’-OH of either the primer or water [18], extended association of the polymerase with the (+) strand may promote the catalytic hydrolysis of the 3’-IHA via a stochastic mechanism, either through partitioning of the 3’-IHA to the exonuclease domain or, as has been observed in exonuclease deficient DNA polymerases, through hydrolysis within the polymerase [19].

3.3.4. 3’-IHA removal from and subsequent ligation of A:Ts reduces the performance of SAGE technology

The loss of the 3’-IHA from released tags can have important ramifications in the performance of SAGE technology. When the resulting mixture of 102 + 54i (i = 1, ..., n) multimeric ligation products is used as a template for PCR, the SAGE PCR primers (20 bp; Tm ~60 °C) hybridize to their complementary sequences located along the entire length of a given multimer, resulting in amplification products of 102 + 54i (i = 1, ..., n) bp that contaminate the desired 102 bp product (see figure 3.2.A).
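The expected positions of the contaminating bands follow directly from the 102 + 54i rule. The sketch below lists the first few sizes so they can be compared against a gel ladder. The constant and function names are illustrative assumptions.

```python
DITAG_BP = 102   # symmetric ditag formed by tag-end to tag-end ligation
AT_BP = 54       # each further A:T joined through an unblocked adaptor end


def multimer_sizes(max_extra):
    """Return (number of A:Ts, size in bp) for 102 + 54*i, i = 0..max_extra."""
    return [(2 + i, DITAG_BP + AT_BP * i) for i in range(max_extra + 1)]


for n_units, size_bp in multimer_sizes(3):
    print(f"{n_units} A:Ts -> {size_bp} bp")
```

For i = 0 to 3 this gives 102, 156, 210 and 264 bp, which are the product sizes drawn in the schematic of figure 3.1.B.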
The problem is exacerbated by the fact that (+) strands (~50 bp) unligated at their tag end 3’-OH are present on the multimeric ligation products. They can therefore dissociate from the multimer during the PCR amplification cycle and then act as an alternate primer (Tm ~75 °C). This leads to the generation of additional non-specific amplification products that further contaminate the SAGE library. In addition, self-complementary sites along a given multimer can permit the formation of stable hairpin structures that can compete with the bimolecular hybridization kinetics of primers to the template molecule, contributing to amplification bias and an inaccurate evaluation of transcript abundance.

While selective purification of the 102 bp product can remove unwanted HMW amplification products, the sequence information contained in A:Ts that are not incorporated into an amplifiable 102 bp product will be lost from the resulting SAGE library. This brings into question the extent to which the 102 bp amplification product preserves the sequence information content in the released SSTs. To address this uncertainty, we performed a Monte Carlo simulation of the ligation reaction by assigning fractional fill-in and joining probabilities for a Poisson-distributed pool of A:T constructs, and then examined the distribution of products when the fraction of unligated A:Ts reached the amount found experimentally (figure 3.4). This allowed us to compute the fraction of sequence information recovered from the ligation products. The simulations reveal that recoverable sequence information is strongly dependent on the extent of fill-in at the adaptor end (i.e. the 3’ end of the (−) strand) of the A:T construct since ligation at that end of an A:T can remove the A:T from the final SAGE library (see figure 3.3.B).
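A minimal sketch of this kind of simulation is given below, following the joining rules set out in section 3.2.7. It is not the Perl code of Appendix A.1, and several choices are assumptions made for illustration: the end labels ("T" for a filled-in tag end, "A" for a blocked adaptor end, "U" for an adaptor end that has lost its 3’-IHA), the function names, the default parameter values, and the strictly pair-wise partner selection after shuffling. It also does not track tag identities, so it reports only the fraction of monomers that sit in a tag-end to tag-end junction, not which SSTs are lost.

```python
import random
from collections import Counter

# A monomer is (unblocked, tag_on_right): 'unblocked' marks loss of the 3'-IHA at
# the adaptor end and 'tag_on_right' gives its current orientation. A molecule is
# a list of monomers joined end to end.


def flip(molecule):
    """Reverse a molecule so that its left and right ends swap."""
    return [(u, not t) for (u, t) in reversed(molecule)]


def left_end(molecule):
    unblocked, tag_on_right = molecule[0]
    if not tag_on_right:
        return "T"
    return "U" if unblocked else "A"


def right_end(molecule):
    unblocked, tag_on_right = molecule[-1]
    if tag_on_right:
        return "T"
    return "U" if unblocked else "A"


def join_probability(end_a, end_b, p_tt, p_tu):
    """PH::HP (tag::tag) joins with p_tt, PH::HH (tag::unblocked adaptor) with p_tu."""
    ends = {end_a, end_b}
    if ends == {"T"}:
        return p_tt
    if ends == {"T", "U"}:
        return p_tu
    return 0.0   # blocked ends, or two adaptor ends, cannot be joined


def simulate(n=2000, f_fill_in=0.57, p_tt=1.0, p_tu=0.5,
             cutoff=0.20, max_steps=500, seed=1):
    """Return (size distribution, fraction of monomers in a tag::tag junction)."""
    rng = random.Random(seed)
    pool = [[(rng.random() < f_fill_in, True)] for _ in range(n)]
    for _ in range(max_steps):
        if sum(len(m) == 1 for m in pool) <= cutoff * n:
            break                                   # monomer cutoff reached
        rng.shuffle(pool)
        pool = [flip(m) if rng.random() < 0.5 else m for m in pool]
        joined = []
        for i in range(0, len(pool) - 1, 2):        # pair-wise ligation attempts
            a, b = pool[i], pool[i + 1]
            if rng.random() < join_probability(right_end(a), left_end(b), p_tt, p_tu):
                joined.append(a + b)
            else:
                joined.extend([a, b])
        if len(pool) % 2:
            joined.append(pool[-1])                 # odd molecule left unpaired
        pool = joined
    sizes = Counter(len(m) for m in pool)
    tt_junctions = sum(1 for m in pool for x, y in zip(m, m[1:]) if x[1] and not y[1])
    return sizes, 2 * tt_junctions / n


if __name__ == "__main__":
    for f in (0.0, 0.10, 0.57):
        sizes, recovered = simulate(f_fill_in=f)
        print(f, dict(sorted(sizes.items())), round(recovered, 3))
```

Because each A:T has a single tag end, every ligation product in this model carries at most one tag::tag junction, so every monomer beyond the first two in a multimer represents sequence information that cannot be recovered as a ditag. With no unblocked adaptor ends the sketch yields only monomers and dimers. Raising f_fill_in produces longer multimers and lowers the recovered fraction.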
If 3’-IHA removal through fill-in with KF occurs on 10% (f_fill-in of 0.10) of the population of released A:Ts, more than 20% of the sequence information contained within the released A:T population is lost upon ligation to form ditags. At an f_fill-in of 0.57, as observed in our model tag system (see figure 3.3.B), more than 45% of the total sequence information present in the original pool of SSTs is lost during ditag formation, including the complete removal of several low abundance SSTs from the population. These rare transcripts lost from the original transcript population may not be recovered in the subsequent amplification step. This points to a need to modify the current SAGE protocol either by eliminating the need for a (+) strand fill-in step, or by preventing removal of the 3’-IHA and subsequent fill-in of the recessed (−) strand. In this regard, the recently described longSAGE protocol, which was developed to allow a more accurate assignment of SSTs to their corresponding transcripts, employs the Type IIS enzyme MmeI to generate SSTs of up to 17 bp in which the 3’ end of the (+) strand of the tag is recessed by two bases [5,20]. Although it generally limits reliable sequence information in longSAGE SSTs to 15 bases, the ability of T4 DNA ligase to join mismatched and gapped ends [21] avoids the need in the longSAGE protocol to use a proof-reading DNAP to polish the 5’-overhang on the (−) strand of the released A:Ts. By virtue of avoiding this fill-in step, longSAGE inherently avoids the problems associated with the formation of HMW ligation and amplification products.

Despite the apparent advantage of longSAGE, the classic shortSAGE protocol has been shown to perform better in measuring changes in gene expression levels. This superior performance is due, at least in part, to the fact that shortSAGE is intrinsically less prone to sequence errors introduced during PCR and sequencing [24]. It is also more cost effective.
As a result, shortSAGE continues to be the more widely used SAGE protocol. We therefore sought modifications to the shortSAGE protocol to prevent extension of the 3’-end of the (−) strand. Three strategies were explored: (1) alteration of the reaction temperature, (2) selection of an optimal DNAP, and (3) introduction of more robust chemistry at the 3’-end of the (−) strand.

3.3.5. 3’-IHA removal activity of DNA polymerases can be modulated by temperature

3.3.5.1. KF DNA polymerase

The fill-in of SSTs is generally achieved in SAGE through a 30-minute reaction at 37 °C catalyzed by KF at a deoxynucleotide concentration ~400-fold higher than the Km for the DNAP. We observed that complete extension of the (+) strand is achieved under these reaction conditions (figure 3.2.B, lane 4). However, approximately 70% of the (−) strand is also extended, verifying significant but somewhat slower reaction kinetics for the removal of the 3’-IHA (figure 3.2.B, lane 3). We therefore examined the potential for eliminating the exonuclease activity of KF on our test templates by lowering the reaction temperature, or for preferentially increasing the polymerase activity of KF on our test templates by increasing temperature (figure 3.5.A). The fill-in reaction on our truncated template (figure 3.5.A.i) shows that the activity of KF strongly depends on temperature. At 4 °C, slow extension of the (+) strand results in incomplete fill-in (lane 1; upper panel). Proper fill-in to yield the desired 4-base extension product requires incubation temperatures above 12 °C (lanes 3-5). However, a gradual increase in the amount of +3 base extension of the (−) strand template (lower panel; lanes 1-3) is observed as the incubation temperature is increased, reaching a maximum at 37 °C (lane 4), the incubation temperature at standard fill-in conditions for SAGE technology.
At 50 °C (lane 5), extension of the (−) strand is significantly reduced while full 4-base extension of the (+) strand of the template is retained, indicating a narrower temperature range over which 3’-IHA removal activity is observed. In principle, fraying of the duplex ends should be more prevalent at this reaction temperature, which should increase the propensity for partitioning of the (−) strand into the exonuclease domain of KF. However, given that a population of KF would exist in a partially unfolded state at 50 °C (Tm = 45 °C at pH 7.5) [26], selective heat inactivation of the exonuclease domain responsible for KF-mediated removal of the 3’-IHA may be occurring while the polymerase domain remains intact [27].

The amount of HMW ligation product formed from our model A:Ts was also found to be strongly dependent upon the incubation temperature used for the fill-in reaction (figure 3.5.A.ii), and was proportional to the amount of (−) strand +3 extension product formed. At 4 °C, HMW product formation was not detected, and ligation appeared to give a single dimer product in addition to a significant amount of unreacted A:Ts (lane 1). Given the slow reaction kinetics, it is likely that this dimer population includes incomplete (+) strand extension products. Indeed, we have observed that in the absence of fill-in with KF, T4 DNA ligase is able to ligate our model A:Ts with reasonable efficiency (data not shown), consistent with the known joining activity of T4 DNA ligase on gapped and mismatched overhangs.

Increasing the incubation temperature used for fill-in of the model A:Ts leads to an increase in the number of 102 + 54i multimers formed during ligation, with maximal formation of HMW species occurring between 25 °C and 37 °C (lanes 3 & 4). When the fill-in reaction was performed at 50 °C, the formation of HMW species was diminished (lane 5), corresponding to the observed inhibition of 3’-IHA removal.

3.3.5.2.
T4 DNA polymerase

Given the temperature dependent behavior of KF and its influence on the distribution of HMW ligation products formed upon ligation, we decided to examine the temperature dependent activities of other candidate proofreading polymerases on our model templates to determine their applicability to the fill-in reaction. We first explored the use of T4 DNAP, which had been employed in the original SAGE protocol for the fill-in step, but was reported to give inconsistent results in SAGE analysis. Under conditions in which the dNTP concentration was ~100-fold higher than its Km to favour the formation of extension products, we observed that, in contrast to KF, T4 DNAP does not exhibit idling activity on the (+) strand of the test template under any of the incubation temperatures examined (figure 3.5.B.i, upper panel). As a result, full extension of the (+) strand of the test template was observed across all temperatures studied. Removal of the 3’-IHA from the (−) strand, however, was more pronounced than with KF, with peak exonuclease activity observed at 25 °C (lane 3). This is consistent with the greater 3’→5’ exonuclease activity and processivity of T4 DNAP [29], which increases the probability of removal of the 3’-IHA group. As with KF, increasing the reaction temperature to 50 °C (lane 5) results in significant inhibition of 3’-IHA removal activity without loss of extension activity.

When model A:Ts were filled-in with T4 DNAP, the formation of HMW products again correlated closely with the extent of removal of the 3’-IHA in the (−) strand of the truncated template (figure 3.5.B.ii). The stronger 3’-IHA removal activity of T4 DNAP would therefore appear to be the source of the “inconsistency” observed when applying T4 DNAP to the fill-in step at 25 °C to 37 °C [10], as its use at these temperatures is more likely to lead to the formation of HMW products during ditag formation and amplification.

3.3.5.3.
Vent® DNA polymerase

The temperature dependence profiles obtained for KF and T4 DNAP suggest that the exonucleolytic activity of these two polymerases exhibits a stronger temperature dependence than do their corresponding polymerase activities, permitting exonuclease activity to be specifically inhibited by lowering the incubation temperature. As similar maximum exonuclease activities were observed for KF and T4 DNAP, we decided to examine the performance of a thermophilic proofreading DNAP, Tli DNAP (Vent®), in the fill-in reaction (figure 3.5.C.i). At 4 °C, a small level of the (+) strand remains unfilled (upper panel, lane 1). When the incubation temperature is increased to 12 °C and higher, all (+) strands are converted to the full extension product. This is surprising given that the temperature optimum for Vent® activity is 75 °C at 300 nM min⁻¹ dNTPs [30]. Further increase in the incubation temperature to values closer to the optimum temperature for Vent® polymerase activity results in the formation of truncation products of the (+) strand (lane 5), indicating that the 3’→5’ exonuclease activity of Vent® increases significantly with reaction temperature.

Analogous results were observed on the (−) strand of the test template (lower panel). Increasing the incubation temperature from 4 °C to 50 °C results in an increase in the amount of (−) strand extension products and thus an increase in 3’-IHA removal activity. The polymerase and 3’→5’ exonuclease activities of the thermophilic proofreading DNAP Tli (Vent®) therefore exhibit temperature dependencies that make this DNAP a poor choice for the SAGE A:T fill-in reaction in the absence of further modification of the 3’-end of the (−) strand. The processing of model A:Ts with Vent® and subsequent ligation to form dimers resulted in a greater amount of HMW products, particularly when the incubation temperature for the fill-in reaction was increased to 50 °C (figure 3.5.C.ii).

3.3.6.
3’-IHA removal is minimized by the presence of phosphorothioate (Rp/Sp) linkages

While temperature was found to influence the 3’-IHA removal activity of all DNA polymerases studied, we were unable to identify conditions that allowed for complete extension of the (+) strand while preventing 3’-IHA removal. As the effectiveness of SAGE relies on the ability to efficiently process all SSTs into full length SAGE ditags, slight decreases in polymerase activity can compromise the analysis; truncations would be more prevalent, hampering tag-to-gene assignments by increasing the likelihood of degenerate SSTs [31, 32]. Given that the 3’→5’ exonuclease activity of many polymerases is inhibited by the presence of phosphorothioate linkages, we investigated the effect of redesigning the (−) strand of the SAGE adaptor to include two terminal phosphorothioate linkages at the 3’-end. Model A:Ts containing this modification were then incubated with KF, T4 and Vent® DNAPs as a function of reaction temperature (figures 3.5.D-F, respectively).

Templates incorporating the 3’-end phosphorothioate linkages exhibited a dramatic reduction in the extent of 3’-IHA removal, irrespective of the DNAP used (figures 3.5.D-F). For example, KF-mediated 3’-IHA removal was strongly inhibited by the presence of the phosphorothioate linkages, independent of reaction temperature. Minimal formation of (−) strand extension products was observed (figure 3.5.D), consistent with the sensitivity of the 3’→5’ exonuclease domain of KF to both Sp and Rp stereoisomers of the phosphorothioate linkage. Both T4 DNAP (figure 3.5.E) and Vent® (figure 3.5.F) 3’→5’ exonuclease activities were also inhibited by the presence of these modifications, although small amounts of (−) strand extension product could be observed.
Given that the (−) strand of the template included both the Rp and Sp stereoisomers of the phosphorothioate linkages, the small level of extension of the (−) strand by T4 and Vent® DNAPs likely reflects the greater specificity of their respective exonuclease domains for one of the two stereoisomers [36].

Inclusion of 3’-terminal phosphorothioate linkages into our model A:Ts also greatly inhibited the formation of HMW products during the dimerization reaction. Inhibition of (−) strand extension was observed at all incubation temperatures, indicating that the inclusion of 3’-terminal phosphorothioate linkages in the (−) strand effectively allows for the selective formation of the desired 102 bp ditag product.

3.3.7. Specific formation and amplification of 102 bp ditags using redesigned adaptors

Our efforts to identify conditions that inhibit DNAP-mediated removal of the 3’-IHA were tested by ligating our phosphorothioate modified adaptors via DLC to anchored 3’-end cDNA derived from C. neoformans H99. Upon release of A:Ts, we applied three candidate fill-in conditions, namely T4 DNAP at 4 °C, KF at 50 °C, and Vent® DNAP at 12 °C, and compared the ditag ligation products obtained to those obtained under the standard SAGE protocol (figure 3.6.A). Although good results were obtained with all three enzymes, processing with Vent® at 12 °C generally gave the highest yield of 102 bp product due to its complete utilization of the substrate. Comparison of PCR yields (figure 3.6.B) obtained under our modified protocol (lanes 6-10) with those obtained under standard SAGE (lanes 1-5) demonstrated specific amplification of the 102 bp ditag, requiring more than 20-fold less starting material for optimal amplification (lane 9 vs. lane 3).
More significantly, the specific formation of a 102 bp amplification product (figure 3.6.C: lane 1) enabled the use of a commercially available PCR purification kit to clean up the PCR reaction mixture with minimal loss of the ditag amplification product (figure 3.6.C: lane 2). Further processing of this purified ditag with NlaIII (figure 3.6.C: lane 3) generated the three desired products: a doublet corresponding to released adaptors; and a single band corresponding to the 26 bp product ditag.

3.4. CONCLUSION

In summary, the ability of SAGE to provide an accurate quantitative picture of gene expression depends on the extent to which the distribution of transcript abundances inferred through the set of sequenced tags fully reflects the true distribution of transcripts in the original mRNA population. Care must therefore be taken to ensure that all steps of the protocol are highly optimized so that potential sources of bias that may affect the performance of SAGE technology are limited. We have demonstrated that a seemingly innocuous step in the SAGE protocol, namely the fill-in of sequence tags released from the 3’-end cDNA library with a DNAP, can contribute to a loss of sequence information in the SAGE library. By utilizing Vent® at 12 °C in combination with the introduction of 3’-terminal phosphorothioate linkages into the adaptor, we were able to minimize the potential loss of sequence information at this step of the protocol while simplifying the purification and recovery of important intermediates. These modifications should enhance the performance of SAGE technology as an analytical technique. They should also advance the general use of proof-reading DNAPs in other tag-based technologies [37,38], including superSAGE, which uses the Type III enzyme EcoP15I to generate SSTs 22 bp in length.
EcoP15I cleaves 25/27 bp away from its recognition sequence, leaving a 3’-recessed end that requires a fill-in step to maximize sequence information in the released SST.

Figure 3.1. Schematic outline of the steps in the SAGE protocol following the release of SSTs and prior to concatemer formation. A. Fill-in with KF leads to blunt-ending of the (+) strand of the released SST. Upon ligation to form ditags, the presence of the 3’-amino group and the lack of a 5’-phosphate blocks participation of the adaptor end of the adaptor:tag (A:T) in the ligation reaction, resulting in the specific formation of a 102 bp ditag. Subsequent PCR amplification therefore results in the specific amplification of the 102 bp product. Following NlaIII digestion, SAGE adaptors (i) and 26 bp products (ii) are recovered. B. Unwanted release of the 3’-amino group by KF reintroduces the adaptor end of the A:T as a potential site for ligation. The ligation reaction results in the formation of 102 + 54i bp multimers that participate in the PCR amplification reaction (see text for details). Subsequent digestion of these amplification products with NlaIII releases (iii) “tag:adaptor” constructs flanked by NlaIII sites in addition to (i) SAGE adaptors and (ii) 26 bp ditags that can participate in the concatenation reaction.

Figure 3.2. HMW amplification products result from ligation products generated during ditag formation. A.
PCR amplification products from serial dilutions of the ditag ligation reaction (lanes 1 & 6: no ligase [control]; lanes 2 & 7: 1:20; lanes 3 & 8: 1:400; lanes 4 & 9: 1:8000; and lanes 5 & 10: 1:160000 dilutions in LoTE). Ditag ligation products were derived from released SSTs via the ligation of standard SAGE adaptors (SAGE) under the standard SAGE protocol (lanes 1-5) or of methylated SAGE adaptors (m6A) under directed ligation chemistry (lanes 6-10) to an anchored 3'-end cDNA library obtained from C. neoformans strain H99. B. Analysis of ditag ligation products prior to PCR amplification. Released SSTs obtained via standard microSAGE (lanes 1 & 2: SAGE) or directed ligation (lanes 3 & 4: m6A) protocols and ligated in the presence (lanes 1 & 3) or absence (lanes 2 & 4) of T4 DNA ligase after fill-in with KF. HMW bands corresponding to ligation products are indicated by arrows.

[Figure 3.3: gel images, panels A and B. Panel A: mol fraction of starting material and ligation products, with 100 bp and 200 bp markers. Panel B: A-strand and B-strand templates incubated without (−KF) and with (+KF) Klenow fragment, with 30 bp and 40 bp markers.]

Figure 3.3. Discrete HMW products are formed during ligation of synthetic tags and are a result of KF exonucleolytic activity. A. Ligation products from model adaptor:tag constructs after fill-in with KF and prior to PCR amplification were analysed via densitometry to quantify the amounts of the mono-, di-, tri-, and tetrameric ligation products obtained in the presence of T4 DNA ligase (lane 2) relative to the amount of unligated starting material (lane 1). B. Truncated templates were labelled with [32P] on either the (+) strand or the (−) strand to monitor the activity of KF under standard fill-in conditions (30 min at 37 °C). Incubation in the presence of KF leads to a shift of a population (55%) of the (−) strand by +3 bases (lane 3) compared to incubation in the absence of KF (lane 1), suggesting removal of the 3'-amino blocking group and subsequent fill-in of this strand. 
Concomitantly, the (+) strand (lane 2) is shifted +4 bases (lane 4) in the presence of KF.

[Figure 3.4: schematic of the ligation products formed from adaptor:tag constructs with filled-in and unfilled ends, and plot of recoverable sequence information.]

Figure 3.4. Recoverable sequence information within purified ditags under conditions that allow HMW product formation. Ad hoc canonical Monte Carlo simulations of the ligation reaction involving theoretical adaptor:tags were used to determine size distributions of ligation products and extract recoverable sequence information under varying values of f_fill-in and of the ligation parameters P(PH::HP) and P(PH::HH/HH::PH) (see text and Appendix A.1 for details). The amount of recoverable information within resulting ligation products as a fraction of the original pool of A:Ts is plotted as a function of both f_fill-in and the ligation parameters. Densitometric data obtained for the fill-in reaction (see figure 3.2.B) using our model tag system indicated that f_fill-in = 0.57 under standard fill-in conditions. Under these conditions, ~45% of the sequence information in released SSTs is predicted to be lost from the purified ditag population.

[Figure 3.5: gel images, panels A-F (KF, T4 DNAP, and Vent® DNAP), each with sub-panels (i) and (ii); lanes C, M, and 1-5.]

Figure 3.5. Temperature dependence of DNA polymerase activity and impact on HMW product formation. Comparison of the activity of DNA polymerases on model templates with 3'-terminal phosphodiester (panels A-C) or phosphorothioester (panels D-F) linkages in the (−) strand at various incubation temperatures. Incubation of truncated SAGE adaptor 1 with KF (A. & D.), T4 DNAP (B. & E.) or with Vent® DNAP (C. & F.) reveals that (i) the extension of the (+) strand (upper) and 3'-isoheptyl amine removal activity with subsequent extension of the (−) strand (lower) can be modulated by incubation temperature. 
Ligation of model A:Ts following fill-in with the various DNA polymerases at the examined incubation temperatures reveals that (ii) the extent of HMW product formation correlates with the extent of removal and extension of the (−) strand in the truncated template. Introduction of phosphorothioate linkages in the (−) strand of both templates dramatically inhibits the processing of the (−) strand by the various polymerases, demonstrating that inhibition of 3'-isoheptyl amine removal leads to a corresponding reduction in HMW product formation. Lane M: 20 bp marker, lane C: no polymerase, lane 1: ~4 °C (on ice), lane 2: 12 °C, lane 3: 25 °C, lane 4: 37 °C, lane 5: 50 °C.

[Figure 3.6: gel images, panels A-C. Bands at 102 bp and 54 bp are indicated, with 20 bp, 60 bp, and 100 bp markers; lanes are grouped by polymerase (T4, KF, Vent) and by protocol (SAGE, modified).]

Figure 3.6. Comparison of ligation products obtained under standard SAGE versus modified SAGE applied to 10 µg C. neoformans total RNA. A. Reaction products of released SSTs obtained in the presence (lanes 2, 4, 6, & 8) or absence (lanes 1, 3, 5, & 7) of T4 DNA ligase after fill-in with T4 DNAP at 4 °C, KF DNAP at 50 °C and Vent® DNAP at 12 °C using methylated SAGE adaptors with two 3'-terminal phosphorothioate linkages on the (−) strand. For comparison, ligation products of released SSTs obtained under the standard SAGE protocol are also shown. B. Comparison of PCR amplification products obtained under the SAGE protocol (lanes 1-5) versus those obtained using our modified protocol using Vent® DNAP for fill-in (lanes 6-10), indicating the specificity of 102 bp template formation in the modified protocol and an absence of non-specific amplification products. PCR products were obtained from 1 µl of serial dilutions (lanes 1 & 6: no ligase [control]; lanes 2 & 7: 1/20; lanes 3 & 8: 1/400; lanes 4 & 9: 1/8000; and lanes 5 & 10: 1/160000 dilutions in LoTE). C. One-step purification and subsequent digestion with Nla III of PCR products. 
PCR reactions were pooled (lane 1) and purified using a commercially available PCR purification kit, yielding a single 102 bp band (lane 2). Subsequent digestion with NlaIII yielded a 42-44 bp doublet corresponding to released SAGE adaptors, and a 26 bp product corresponding to released ditags (lane 3).

3.5. REFERENCES

1. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484-487 (1995).
2. Adams, M.D. Serial analysis of gene expression: ESTs get smaller. Bioessays 18, 261-262 (1996).
3. Velculescu, V.E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
4. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289-1292 (2001).
5. Saha, S. et al. Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-512 (2002).
6. Boheler, K.R. & Stern, M.D. The new role of SAGE in gene discovery. Trends Biotechnol 21, 55-57 (2003).
7. Chen, J. et al. Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci U S A 99, 12257-12262 (2002).
8. Szybalski, W., Kim, S.C., Hasan, N. & Podhajska, A.J. Class-IIS restriction enzymes--a review. Gene 100, 13-26 (1991).
9. Harbers, M. & Carninci, P. Tag-based approaches for transcriptome research and genome annotation. Nat Methods 2, 495-502 (2005).
10. Basrai, M.A. & Hieter, P. Transcriptome analysis of Saccharomyces cerevisiae using serial analysis of gene expression. Methods Enzymol 350, 414-444 (2002).
11. Ye, S.Q., Zhang, L.Q., Zheng, F., Virgil, D. & Kwiterovich, P.O. miniSAGE: gene expression profiling using serial analysis of gene expression from 1 microg total RNA. Anal Biochem 287, 144-152 (2000).
12. St Croix, B. et al. Genes expressed in human tumor endothelium. Science 289, 1197-1202 (2000).
13. Virlon, B. et al. Serial microanalysis of renal transcriptomes. 
Proc Natl Acad Sci U S A 96, 15286-15291 (1999).
14. So, A.P., Turner, R.F. & Haynes, C.A. Increasing the efficiency of SAGE adaptor ligation by directed ligation chemistry. Nucleic Acids Res 32, e96 (2004).
15. Sambrook, J. & Russell, D.W. Molecular cloning: a laboratory manual, Edn. 3rd. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; 2001).
16. Senior, M., Jones, R.A. & Breslauer, K.J. Influence of dangling thymidine residues on the stability and structure of two DNA duplexes. Biochemistry 27, 3879-3885 (1988).
17. Bommarito, S., Peyret, N. & SantaLucia, J., Jr. Thermodynamic parameters for DNA sequences with dangling ends. Nucleic Acids Res 28, 1929-1934 (2000).
18. Kunkel, T.A. & Bebenek, K. DNA replication fidelity. Annu Rev Biochem 69, 497-529 (2000).
19. Canard, B., Cardona, B. & Sarfati, R.S. Catalytic editing properties of DNA polymerases. Proc Natl Acad Sci U S A 92, 10859-10863 (1995).
20. Gowda, M., Jantasuriyarat, C., Dean, R.A. & Wang, G.L. Robust-LongSAGE (RL-SAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis. Plant Physiol 134, 890-897 (2004).
21. Nilsson, S.V. & Magnusson, G. Sealing of gaps in duplex DNA by T4 DNA ligase. Nucleic Acids Res 10, 1425-1437 (1982).
22. Wiaderkiewicz, R. & Ruiz-Carrillo, A. Mismatch and blunt to protruding-end joining by DNA ligases. Nucleic Acids Res 15, 7831-7848 (1987).
23. Lu, J., Lal, A., Merriman, B., Nelson, S. & Riggins, G. A comparison of gene expression profiles produced by SAGE, long SAGE, and oligonucleotide chips. Genomics 84, 631-636 (2004).
24. Akmaev, V.R. & Wang, C.J. Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics 20, 1254-1263 (2004).
25. Polesky, A.H., Steitz, T.A., Grindley, N.D. & Joyce, C.M. Identification of residues critical for the polymerase activity of the Klenow fragment of DNA polymerase I from Escherichia coli. J Biol Chem 265, 14579-14591 (1990).
26. 
Karantzeni, I., Ruiz, C., Liu, C.C. & Licata, V.J. Comparative thermal denaturation of Thermus aquaticus and Escherichia coli type I DNA polymerases. Biochem J 374, 785-792 (2003).
27. Bailey, M.F., van der Schans, E.J. & Millar, D.P. Thermodynamic dissection of the polymerizing and editing modes of a DNA polymerase. J Mol Biol 336, 673-693 (2004).
28. Yamamoto, M., Wakatsuki, T., Hada, A. & Ryo, A. Use of serial analysis of gene expression (SAGE) technology. J Immunol Methods 250, 45-66 (2001).
29. Gupta, A.P., Benkovic, P.A. & Benkovic, S.J. The effect of the 3',5' thiophosphoryl linkage on the exonuclease activities of T4 polymerase and the Klenow fragment. Nucleic Acids Res 12, 5897-5911 (1984).
30. Kong, H., Kucera, R.B. & Jack, W.E. Characterization of a DNA polymerase from the hyperthermophile archaea Thermococcus litoralis. Vent DNA polymerase, steady state kinetics, thermal stability, processivity, strand displacement, and exonuclease activities. J Biol Chem 268, 1965-1975 (1993).
31. Pleasance, E.D., Marra, M.A. & Jones, S.J. Assessment of SAGE in transcript identification. Genome Res 13, 1203-1215 (2003).
32. Clark, T., Lee, S., Ridgway Scott, L. & Wang, S.M. Computational Analysis of Gene Identification with SAGE. J Comput Biol 9, 513-526 (2002).
33. Skerra, A. Phosphorothioate primers improve the amplification of DNA sequences by DNA polymerases with proofreading activity. Nucleic Acids Res 20, 3551-3554 (1992).
34. Wang, C.X. et al. Pre-steady-state kinetics of RB69 DNA polymerase and its exo domain mutants: effect of pH and thiophosphoryl linkages on 3'-5' exonuclease activity. Biochemistry 43, 3853-3861 (2004).
35. Brautigam, C.A. & Steitz, T.A. Structural principles for the inhibition of the 3'-5' exonuclease activity of Escherichia coli DNA polymerase I by phosphorothioates. J Mol Biol 277, 363-377 (1998).
36. Di Giusto, D. & King, G.C. 
Single base extension (SBE) with proofreading polymerases and phosphorothioate primers: improved fidelity in single-substrate assays. Nucleic Acids Res 31, e7 (2003).
37. Matsumura, H. et al. SuperSAGE. Cell Microbiol 7, 11-18 (2005).
38. Matsumura, H. et al. Gene expression analysis of plant host-pathogen interactions by SuperSAGE. Proc Natl Acad Sci U S A 100, 15718-15723 (2003).

CHAPTER 4 Eliminating amplification bias in competitive templates through the use of a proofreading DNA polymerase: application to the amplification of SAGE ditags *

* A version of this chapter has been submitted for publication [Reference: So, A. P., Bryan, J., Turner, R. B. F., and Haynes, C. A. 2008. Amplification bias in competitive templates can be offset through the use of proofreading polymerases: application to the amplification of SAGE ditags. (submitted to Nucleic Acids Research)]

4.1. INTRODUCTION

Serial analysis of gene expression (SAGE) 1, a high-throughput sequencing approach used to identify and tally the abundances of short sequence tags (SSTs) derived from each transcript of a transcriptome, is widely used for the quantitative analysis of gene expression. Overhead and/or intellectual property costs associated with other global gene expression technologies such as massively-parallel signature sequencing 5,6 and microarray technologies are absent in SAGE, as data acquisition relies solely on the user's access to high-throughput sequencing instrumentation. In principle, SAGE-derived SSTs are sufficiently specific to permit comparative mapping of transcriptomes. For example, human SSTs can be mapped uniquely to UniGene clusters through customized databases such as SAGEmap 8, allowing ratios of transcript expression levels to be estimated from SAGE tag counts for two or more biological states of interest. A further advantage of SAGE is that, in contrast to microarray technology, a priori information about the genome under interrogation is not required. 
Therefore, SSTs arising out of the analysis that do not correspond to any known gene can be used to identify novel genes or alternate splicing variants of transcripts, aiding in the annotation of genome databases 2,9-12.

However, as SAGE is a sampling method, the number of SSTs accurately sequenced limits its ability to define the size and distribution of a transcriptome. Typically, the number of SAGE tags sequenced does not exceed 100,000 per sample, and thus some transcripts present in low abundance may be missed and the number of tags identified for other transcripts may not accurately reflect their true abundances 13. Moreover, significant losses of material can occur over the numerous steps involved in preparing and converting an mRNA sample into a concatenated set of SSTs suitable for analysis via high-throughput sequencing. Prototypically, following reverse transcription of the mRNA sample, an SST library is created by ligating the most 3'-end Nla III restricted fragments of a cDNA library to one of two upstream adaptors (A1 and A2) possessing primer-binding sites. Subsequent processing with a Type IIS restriction enzyme releases a corresponding set of adaptor-tag (A:T) constructs and, when required, the recessed end of each resulting A:T is filled in. Formation of a corresponding library of A:T dimers (or "ditags") is then achieved through ligation of the two tag moieties. To offset losses occurring over these steps 15,16, as well as to permit the recovery of sufficient amounts of material for creating a sequencing library, amplification of the ditag population by large-scale application of the polymerase chain reaction (PCR) is performed.
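The sampling limitation noted above can be made concrete with a short calculation. The sketch below assumes a simple binomial model in which each sequenced tag is an independent draw from the tag population; that model and the function names are illustrative assumptions, not part of the SAGE protocol or of this work.

```python
import math


def prob_tag_missed(rel_freq: float, n_tags: int) -> float:
    """Probability that a tag of relative frequency rel_freq is never seen
    among n_tags independently sampled tags (assumed binomial model)."""
    return (1.0 - rel_freq) ** n_tags


def tags_needed(rel_freq: float, detect_prob: float) -> int:
    """Smallest sample size that gives at least detect_prob chance of
    observing the tag one or more times under the same model."""
    return math.ceil(math.log(1.0 - detect_prob) / math.log(1.0 - rel_freq))


if __name__ == "__main__":
    n = 100_000  # typical upper bound on tags sequenced per sample
    for freq in (1e-4, 1e-5, 1e-6):
        print(f"relative frequency {freq:g}: P(missed) = {prob_tag_missed(freq, n):.3f}")
```

Under this model, a transcript whose tag makes up 1 in 100,000 of the population has roughly a one-in-three chance of going unobserved in a 100,000-tag library.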
In particular, the incidence of sequence artefacts in the amplified ditag population increases with each amplification cycle due to the error-rate inherent to the DNA polymerase (DNAP) Taq. This particular source of error can be mitigated somewhat through the application of an appropriate correction algorithm such as SAGEScreen 17, or by replacing Taq with a higher fidelity thermophilic DNAP such as Pfu, which has a 6-fold lower sequence error rate compared to Taq 18,19. Ditag template-to-product ratios may also be altered with increasing cycle number through sequence-related differences in amplification efficiencies among the population of individual templates, which are known to occur during the co-amplification of non-homologous templates having different GC-content or length 20,21. However, sequence-dependent amplification biases are thought to be small in SAGE as ditags (and their resulting amplicons) are of equal length and share considerable sequence homology (> 75 %) due to the presence of common adaptor moieties (http://www.invitrogen.com/SAGE).

Here, we document other more significant but less recognized sources of bias in the PCR step of SAGE by carefully evaluating the ability of the amplification reaction to preserve the relative abundance information contained within the ditag population derived from an mRNA sample. Taqman™-based exonuclease assays are used to monitor in real-time the amplification of individual SAGE ditags and the co-amplification profiles of ditag mixtures of known composition. Results from these quantitative PCR (qPCR) studies are combined with a set of molecular simulations to identify and establish the basis for two additional mechanisms by which ditag amplification can potentially bias the final distribution of SSTs, and therefore affect the fidelity of SAGE analysis. 
The first arises from the ability of certain classes of ditags comprised of self-complementary single-stranded templates to form hairpin structures that inhibit amplification. The second, and more significant, is linked to a tendency for templates from high-abundance SAGE ditags to preferentially inhibit Taq-mediated amplification of low-abundance templates, an effect that is absent when co-amplifying mixtures of two or more completely non-complementary templates. Since conditions could not be defined that could ameliorate the suppression of amplification of low abundance ditags when employing Taq DNAP, thermophilic members of the B-family DNAPs (DNAP(B)) possessing 3'→5' exonuclease activity (i.e. "proofreading" DNAPs) were tested for their ability to preserve ditag abundance ratios during co-amplification. After identifying conditions in which Taqman™ assays can be applied to Pfu family DNAPs, we find that, in contrast to Taq, the relative distribution of a starting mixture of synthetic ditags varying over 3 to 5 orders of magnitude in abundance can be preserved when using Pfu DNAPs under appropriately modified reaction conditions that enhance their exonuclease activity. However, this greatly improved preservation of ditag abundance ratios occurs at the expense of amplification efficiency and yield, which hinders the generation of a SAGE library. We therefore report conditions that provide an effective compromise between amplification fidelity and yield, which when combined with an improved method for ditag recovery and purification, offer an improved protocol for concatenate generation from a limited amount of amplification product.

4.2. METHODS AND MATERIALS

4.2.1. Oligonucleotides

Primers (P1: 5'-GGATTTGCTGGTGCAGTACA-3'; P2: 5'-CTGCTCGAATTCAAGCTTCT-3') were obtained desalted (Integrated DNA Technologies) and resuspended in LoTE buffer (2 mM Tris-HCl, 0.2 mM EDTA, pH = 7.5). 
Synthetic ditag templates of 100 bases in length (Table 1) as well as a template derived from the conserved area (position 755-851) of the human B-cell CLL/lymphoma 2 (BCL2) gene (Genbank accession NM_000657.2) were obtained PAGE purified (Integrated DNA Technologies) and resuspended in LoTE. All dilutions were performed by mass (± 0.2 µg) in LoTE and individually combined by mass to form ditag mixtures of known composition. Taqman™-like dual-labelled probes (DLPs) corresponding to each synthetic ditag template strand (Figure 4.4.1) were obtained HPLC-purified (Integrated DNA Technologies; Coralville, IA) and resuspended in LoTE.

4.2.2. Taqman™-based qPCR assays

A 1/400 dilution in LoTE of 102 bp ditags was prepared from mRNA isolated from Cryptococcus neoformans strain H99, grown under low iron, using directed ligation chemistry and filled-in with Vent® DNA polymerase as described previously. This ditag mixture served as the background template mixture for PCR amplification studies using Platinum Taq (0.25 U/µl final; Invitrogen) according to the I-SAGE protocol (version F) but scaled to a final volume of 20 µl. Individual synthetic ditags of known concentration were introduced into the background template mixture and their amplification profiles were monitored in the presence of 0.2 µM (final) of the corresponding DLP. Alternatively, various mixtures of synthetic ditags were co-amplified in the presence of 0.2 µM (final) of the DLP for the ditag under interrogation as above. Assays were performed using a MyIQ real-time thermocycler (BioRad Inc.; Mississauga, Ontario) according to the following three-step amplification schedule following preincubation of the mixture at 95 °C for 2 min: 55 °C, 1 min; 70 °C, 1 min; 95 °C, 30 sec. Individual qPCR traces were baseline corrected using MyIQ software version 2.0, and the corrected data imported to Microcal Origin 7.0.

4.2.3. 
Secondary structure and abundance dependent bias: definitions

Upon their release using NlaIII (New England Biolabs, Ltd.; Pickering, Ontario), two populations of adaptor-tag constructs are filled in and blunt-end ligated in a random fashion to form ditags, which have a general adaptor-tag-tag-adaptor configuration (Figure 4.3). When introduced as templates in the PCR reaction, these ditags, as a consequence of sequence complementarity that can arise under various ditag configurations, can be segregated into four structure types, three of which have the potential to form amplification-inhibiting secondary structures: linear templates that exhibit no self-complementarity (type 1), tag-tag hairpins (type 2), adaptor-adaptor hairpins (type 3), or full hairpins (type 4) (Figure 4.3). The distribution of ditags within these four types of structures is governed by well-defined combinatorial statistics, and can therefore be determined for a given pool of adaptor-tag constructs using the simple statistical algorithm described here.

The relative frequency of tags (i.e. SSTs) within a given transcriptome is represented as the vector $\mathbf{p} = (p_1, \ldots, p_n)$, where $p_i$ is the frequency of tag $i$ in the transcriptome. The expected relative frequency $f_{ij}$ of each ditag $ij$ in the amplifiable pool can therefore be determined by crossing the individual tag frequencies $\mathbf{p} \times \mathbf{p}$:

$$
\mathbf{p} \times \mathbf{p} =
\begin{pmatrix}
p_1^2 & p_1 p_2 & p_1 p_3 & \cdots & p_1 p_n \\
p_2 p_1 & p_2^2 & p_2 p_3 & \cdots & p_2 p_n \\
p_3 p_1 & p_3 p_2 & p_3^2 & \cdots & p_3 p_n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_n p_1 & p_n p_2 & p_n p_3 & \cdots & p_n^2
\end{pmatrix}
$$

where $f_{11} = p_1^2$, $f_{12} = p_1 p_2$, etc.

It is convenient to segregate the total ditag population into "homo-ditags", having relative frequency $f_{hom}$ and including all ditags in which the internal sequences are derived from the same SST and are therefore complementary, and "hetero-ditags", having relative frequency $f_{het}$ and including those ditags in which the internal sequences are derived from a pair of different noncomplementary SSTs. 
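The crossing of tag frequencies and the split into homo- and hetero-ditags can be sketched numerically. The following is a minimal illustration in plain Python, not the customized scripts used in this work; the function name `ditag_pool` and the three-tag example are hypothetical. The per-tag shares are obtained simply by counting tags within each sub-pool of the crossed matrix, and the sketch assumes at least two distinct tags so that the hetero-ditag pool is non-empty.

```python
def ditag_pool(tag_freqs):
    """Cross tag frequencies into expected ditag frequencies and split the
    pool into homo-ditags (same SST twice) and hetero-ditags (two different SSTs)."""
    total = sum(tag_freqs)
    p = [x / total for x in tag_freqs]             # normalise to relative frequencies
    f = [[pi * pj for pj in p] for pi in p]        # f[i][j] = p_i * p_j
    f_hom = sum(f[i][i] for i in range(len(p)))    # diagonal: both tags from the same SST
    f_het = 1.0 - f_hom                            # everything off the diagonal
    p_hom = [pi * pi / f_hom for pi in p]          # share of tag i among homo-ditag tags
    p_het = [pi * (1.0 - pi) / f_het for pi in p]  # share of tag i among hetero-ditag tags
    return f, f_hom, f_het, p_hom, p_het


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical three-tag transcriptome
    f, f_hom, f_het, p_hom, p_het = ditag_pool(p)
    print("f_hom =", round(f_hom, 3), "f_het =", round(f_het, 3))
    print("bias p_het/p =", [round(h / q, 3) for h, q in zip(p_het, p)])
```

In this toy example the most abundant tag is under-represented among hetero-ditags (ratio below 1) and the rarest tag is over-represented, because abundant tags are disproportionately consumed in homo-ditags.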
The expected frequency of any tag $i$ within each of these subpopulations can then be determined as follows:

$$ f_{hom} = \sum_{i=1}^{n} p_i^2 \tag{1} $$

$$ f_{het} = 1 - f_{hom} \tag{2} $$

$$ p_{i,hom} = \frac{p_i^2}{f_{hom}} \tag{3} $$

$$ p_{i,het} = \frac{p_i (1 - p_i)}{f_{het}} \tag{4} $$

where $p_{i,hom}$ is the expected frequency prior to amplification of tag $i$ among the pool of tags incorporated into homo-ditags (ditag types 2 and 4), and $p_{i,het}$ is the expected frequency of tag $i$ among the pool of tags incorporated into hetero-ditags (ditag types 1 and 3). Similarly, from symmetry arguments, the expected relative frequency $p_{i,1}$ of tag $i$ in the pool of type 1 (linear) ditags is given by

$$ p_{i,1} = p_{i,het} \tag{5} $$

Equations 1 to 5 therefore allow $p_{i,1}$ to be computed and compared to $p_i$ to assess bias (quantified as $p_{i,1}/p_i$) relative to the original tag frequency.

To assess the impact of abundance-dependent amplification bias on the outcome of SAGE analysis, we first defined the relationship between the relative abundance of a given ditag and its amplification efficiency. Amplification curves obtained from qPCR studies of the 50GC_2080 template within a background of 50GC_5050 (using Platinum Taq, see above) were used to determine the individual amplification efficiencies of the 50GC_2080 template at various relative abundances. Baseline corrected data obtained via MyIQ 2.0 software was imported into LinRegPCR to perform a linear regression of the log-transformed data for the initial exponential phase of the PCR 24, allowing the doubling rate $D_i^f$ for each ditag $i$ as a function of relative frequency $f_i$ to be defined (figure 4.2.C). This relationship can be described by the following equation:

$$ D_i^f = \alpha + \beta f_i^{\gamma}, \qquad 1 \le D_i^f \le 2 \tag{6} $$

where $D_i^f$ is the average doubling efficiency during the exponential phase of the PCR for the 50GC_2080 ditag at relative frequency $f_i$, and is related to the amplification efficiency $E_a$ by the relation $D_i^f = 1 + E_a/100$. 
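To show how an abundance-dependent doubling rate of the power-law form in equation 6 distorts a mixture over many cycles, a minimal endpoint sketch follows. It assumes that each template keeps the per-cycle rate set by its starting relative frequency and that the pool is renormalised once at the end; the function names and the parameter values in the usage example are illustrative placeholders, not fitted values.

```python
def doubling_rate(f, alpha, beta, gamma):
    """Per-cycle amplification factor alpha + beta * f**gamma, limited to [1, 2]."""
    return min(2.0, max(1.0, alpha + beta * f ** gamma))


def efficiency_percent(d):
    """Convert a doubling rate D into amplification efficiency E_a (D = 1 + E_a/100)."""
    return 100.0 * (d - 1.0)


def amplify(f0, cycles, alpha, beta, gamma):
    """Endpoint simulation: grow each template by its own fixed per-cycle factor
    (set by its starting relative frequency), then renormalise the pool."""
    grown = [f * doubling_rate(f, alpha, beta, gamma) ** cycles for f in f0]
    total = sum(grown)
    return [g / total for g in grown]


if __name__ == "__main__":
    start = [0.9, 0.09, 0.01]           # hypothetical three-template mixture
    params = dict(alpha=1.2, beta=1.2, gamma=0.6)  # placeholder values only
    end = amplify(start, cycles=20, **params)
    for f_start, f_end in zip(start, end):
        print(f"{f_start:.2f} -> {f_end:.2e}")
```

Because the low-abundance templates amplify with a smaller per-cycle factor, their share of the pool shrinks with every cycle, which is the qualitative behaviour the weighting function is meant to capture.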
Under the experimental conditions described above using Platinum Taq, the parameters $\alpha$, $\beta$, and $\gamma$ were evaluated (estimate ± standard error) as

$$ \alpha = 1.157 \pm 0.011, \qquad \beta = 1.175 \pm 0.032, \qquad \gamma = 0.586 \pm 0.029 \tag{7} $$

with a corresponding $R^2 = 0.99865$. The derived abundance-dependent weighting function described in equation 6, using the parameters shown in 7, can be incorporated into an endpoint simulation of the PCR for a mixture of ditag templates, where each component $ij$ has an initial relative frequency of $f_{ij}^0$. The relative frequency $f_{ij,k}$ of ditag $ij$ following $k$ cycles of amplification is given by:

$$ f_{ij,k} = \frac{f_{ij}^0 \times \left(D_{ij}^f\right)^k}{\sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}^0 \times \left(D_{ij}^f\right)^k} \tag{8} $$

Equations 7 and 8 were assumed to be valid for other SAGE ditags and used to compute changes in the distribution of ditags resulting from $k = 20$ amplification cycles in silico, after which the relative frequency of each tag was determined by:

$$ p_{i,amplified} = \frac{1}{2} \sum_{j=1}^{n} \left( f_{ij,k} + f_{ji,k} \right) \tag{9} $$

The ratio $p_{i,amplified}/p_i$ can then be computed as a measure of bias introduced by the amplification process.

4.2.4. Secondary structure and abundance dependent bias: simulations

Tag frequencies from 32 human MPSS libraries 6 were extracted from the Gene Expression Omnibus database (accession number GSE1747) and individually normalized to create virtual tag transcriptomes of the general form $\mathbf{p} = (p_1, \ldots, p_n)$. Each tag library was analyzed using customized Perl scripts (see Appendix A.2) according to the relations described above to simulate the impact of secondary structure and abundance-dependent bias on the ability to infer original tag frequencies from the assembled and amplified ditag pool. Data were imported into Microcal Origin 7.0 for graphical analysis.

4.2.5. 
Real-time assays for proofreading DNAPs

The performance of Taqman assays in monitoring amplification efficiencies of proofreading DNAPs was evaluated using 0.1 pM (final) template, 0.2 µM (final) of the corresponding DLP, and the manufacturer's supplied reagents in the presence of 3% ([v/v] final) dimethyl sulfoxide (DMSO; Sigma Aldrich). Initially, deoxynucleotide (dNTP) concentrations (0.01 mM to 1 mM each dNTP), annealing temperature (55 °C - 65 °C) and 2-step versus 3-step protocols were screened for their ability to provide stable amplification profiles of samples containing 1 fM to 0.1 pM (final) of template 50GC_8020. Based on these studies, a 2-step protocol was adopted for proof-reading DNAPs as follows: 2 min preincubation at 95 °C; 2 min at 55 °C; 30 s at 95 °C. The ability of a given DNAP to preserve relative ditag abundances was assessed by monitoring the amplification of template 50GC_2080 (0.1 pM) in the presence or absence of competing template 50GC_5050 (0.1 nM) and calculating the difference in the threshold cycles (ΔCT) obtained for 50GC_2080 in the presence or absence of the competing template. The impact of amplification conditions on ΔCT was examined by performing amplification reactions under various concentrations of dNTPs (0.01 mM to 1.0 mM) and primers (0.01 µM to 1.0 µM) to define optimal conditions that minimize ΔCT. The study included Platinum Taq, Pfu, and several other proofreading DNAPs (PfuUltra, PfuUltraII, Phusion™, Vent®, DeepVent®).

4.2.6. Preparation and concatenation of 26 bp ditags

Ditags (20 reactions, 1 ml) were amplified with PfuUltraII (0.1 U/µl) using 5'-dual-biotin labelled primers P1 and P2 (0.5 µM each) in the presence of 0.1 mM (final) dNTPs. After 20 cycles, recovered products were pooled on ice, quenched with 1/10th volume 0.5 M EDTA, and then extracted with 1 volume of a 25:24:1 mixture of phenol:chloroform:isoamyl alcohol (PCI; Invitrogen). 
The aqueous layer was filtered through a Microcon YM-30 microfiltration device (21 °C, 15 min @ 8,000 x g; Millipore Inc.), followed by buffer exchange thrice with 500 µl of 1x NEB4 buffer (20 mM Tris-acetate, 50 mM K-acetate, 10 mM Mg-acetate, 1 mM dithiothreitol, pH = 7.9), and brought to reduced volume. Recovered filtrates were digested with 50 U NlaIII (New England Biolabs) in 1x NEB4 buffer supplemented with 100 µg/ml bovine serum albumin at a final volume of 100 µl for 1 hr at 37 °C. Digested PCR products were then incubated with 10 µg (0.18 nmol) of streptavidin in LoTE (1 mg/ml; Invitrogen) for an additional 20 min to bind biotinylated products, and placed on ice. All remaining steps were performed in a 4 °C cold room. The processed digests were filtered through 4 micropureEZ columns (Millipore Inc.) to remove streptavidin bound products. After washing each filter with 50 µl of chilled 1x ligase buffer, the pooled filtrate was filtered through a Nanosep Omega 3K filter (4 °C, 80 min @ 5,000 x g; Pall Life Sciences), followed by buffer exchange once with 500 µl of 1x ligase buffer, and brought to reduced volume (~5 µl).

To optimize conditions for the concatenation of 26 bp ditags, starting mass quantity, enzyme concentration, incubation temperature, incubation period, ATP concentration, and PEG-8000 concentration were varied. Optimized conditions for generating concatenates averaging 1.5 kb were as follows: 25 ng of 26 bp ditag was concatenated for 30 min at 21 °C with 5 Weiss U of T4 DNA ligase (5 Weiss U/µl; Fermentas) in 1x ligase buffer supplemented with 0.1 mg/ml bovine serum albumin, 15% final (v/v) PEG-8000 (from a 50% [w/w] stock; Baker Analytical) and 3 µM final of ATP (from 10 mM stock in 25 mM Tris-HCl, pH = 7.5; Roche) at a final volume of 10 µl. Reactions were quenched with 6x loading buffer containing 100 mM NaEDTA (pH = 8.0) and resolved via 1.5% agarose gel electrophoresis.

4.3. RESULTS

4.3.1. 
Low abundance ditags in a matrix of competing ditags are poorly amplified by Taq DNAP

We first examined if and when changes in the relative abundances of SSTs occur during the amplification of SAGE ditags by utilizing synthetic template oligonucleotides with sequence characteristics matching those of 102 bp ditags obtained from eukaryotic mRNA via the standard SAGE protocol (Figure 4.1). To eliminate the formation of single-strand secondary structures that can inhibit amplification, the sense-strand of each synthetic ditag was designed with an internal non-self-complementary 26 bp sequence (the SST pair) flanked on the 5'- and 3'-ends by sequences corresponding to SAGE adaptors 1A and 2A, respectively, resulting in the creation of completely non-self-complementary single-stranded templates. These constructs were amplified either on their own or spiked into a mixture of ditags of fixed total concentration generated from mRNA isolated from C. neoformans grown under low iron conditions 15. Amplification was performed in the presence of corresponding Taqman™-like dual-labeled probes (DLPs) to monitor the PCR reaction in real time under the conditions utilized in the SAGE protocol.

Taq-based qPCR data for amplification of isolated non-self-complementary ditags confirm that any amplification bias due to differences in ditag sequence is small. High amplification efficiency Ea values (Ea > 95%) are observed when the synthetic SAGE ditag contains a random (50GC_5050; Figure 4.1.A, black lines) or a disparate distribution of GC content (50GC_2080; Figure 4.1.B, black lines) within the 26 bp SST pair region, or contains very low (21GC; Figure 4.1.C, black lines) or very high (79GC; Figure 4.1.D, black lines) GC content in the SST pair region. However, when 10-fold serial dilutions of a synthetic ditag (0.1 pM to 0.1 fM final) are introduced into a fixed concentration of ditags (~10-100 fM total) generated from C. 
neofbrmans; a progressive decrease in the ability to amplify the synthetic clitag is observed as its concentration decreases relative to that of the total ditag mixture, eventually leading to a near complete suppression of amplification (Figures 4.1.A to 4.1.D, color lines). The four data sets shown in Figure 4.1 indicate that this suppression of amplification is at most only weakly dependent on the GC content or its associated distribution within the 26 bp SST pair region of the ditag. As the observed suppression effect could be due to the presence of inhibitors within the SAGE ditag mixture prepared from C ne/brman4 we monitored the qPCR amplification profile for  105  each synthetic template when present in a more well-defined mixture containing only a second synthetic “background” template at equal or higher concentration. For example, qPCR profiles for 10-fold serial dilutions of 5OGC_2080 template (100 fM to 0.1 fM final) in a fixed background concentration of 0.1 pM 5OGC_5050 template are shown in Figure 4.2A. In the absence of any background template (black lines), qPCR reactions of 5OGC_2080 template display consistent amplification profiles across the dilution series, resulting in highly efficient amplification with an end-point fluorescence for 5OGC_2080 amplification of 1000 to 1200 relative fluorescence units (RFU). However, when the same dilution series of 5OGC_2080 template is amplified in the presence of a fixed concentration of competing 5OGC_5050 template, an abundance-dependent inhibition of amplification is again recorded. When the two templates are present at equimolar concentrations (red traces), a small decrease in the amplification efficiency of the 5OGC_2080 template is observed beginning at 77 RFU (22 cycles). When the 5OGC_.2080 template is introduced at  100 / 1 th  the molar concentration of the background template, detection of  5OGC_2080 amplification product (dark blue trace) is delayed significantly. 
Finally, amplification of the 50GC_2080 template is essentially completely suppressed when its molar concentration is 1/1000th that of the background template in the starting mixture (light blue trace). As a result, the difference (ΔCT) in threshold cycle values for a spiked versus isolated template increases rapidly with decreasing 50GC_2080 template concentration (Figure 4.2.A, right-hand graph). Together, the results in Figures 4.1 and 4.2 indicate that the relative abundance of ditags is preserved during amplification only for those ditags present at concentrations that are at most 1.5 orders of magnitude below the concentration of the highest abundance ditag in the starting mixture.

This suppression of amplification was observed regardless of which synthetic templates were used as target and background. Moreover, adjusting both primer (0.1 µM to 2 µM) and nucleotide (0.01 mM to 1.5 mM) concentrations did not lead to any apparent relief of the effect, which was observed irrespective of the commercial source of Taq (Applied Biosystems AmpliTaq Gold, Eppendorf Master Mix, Takara SpeedSTAR Taq, or Invitrogen Platinum Taq) and its associated reaction buffer and recommended reaction conditions. Finally, the effect could not be mitigated through adjustment of the annealing temperature, extension temperature, or any other reagent conditions utilized in the PCR, and thus appeared to be a general property of Taq DNA polymerase when used to amplify mixtures of SAGE ditags.

4.3.2. Abundance-dependent inhibition is absent when co-amplifying non-homologous templates.
In an effort to understand the mechanism for this undesirable suppression of the amplification efficiency for relatively low abundance ditags, we hypothesized that co-amplification of two or more relatively short templates sharing common priming sites creates competition that inhibits amplification of the lower abundance template [27-29]. Recent kinetic descriptions of PCR [30, 31], as well as earlier experimental observations [32], have suggested that competition between template-template annealing and primer-template annealing is in part responsible for triggering the observed plateau phase of the amplification reaction. These studies argue that, as amplified template accumulates during the PCR, the total template concentration reaches a point where template-template annealing becomes thermodynamically favourable relative to primer-template annealing, leading to a decrease in amplification efficiency of the template. In the context of SAGE, individual SAGE ditag species share extensive homology (~75-100%) with each other as a result of their common adaptor sequences. Consequently, end-clamped intra-strand annealing products can potentially form between any two partially homologous templates, inhibiting primer annealing to ditags in a manner that generates a bias that is then amplified as the PCR reaction proceeds [28, 29]. We tested the converse of this hypothesis by performing Taqman™-based qPCR assays on the amplification of 10-fold serial dilutions of the 50GC_2080 template (0.1 pM to 0.1 fM) in the presence of a fixed concentration of a completely non-homologous synthetic template derived from the BCL2 gene (0.1 pM), under amplification conditions identical to those used in the SAGE protocol (Figure 4.2.B).
We observed that, in contrast to what is observed in the presence of a competing partially homologous SAGE ditag template, co-amplification in the presence of the non-homologous BCL2 template does not result in any abundance-dependent suppression of amplification of the 50GC_2080 template. Indeed, the amplification efficiency of the 50GC_2080 template was indistinguishable from that observed in the absence of background BCL2 template, even when present at 1/1000th the concentration of the background template (Figure 4.2.B). This suggests that the abundance-dependent inhibitory effect found in the co-amplification of SAGE ditags using Taq DNAP is largely specific to the SAGE protocol and is due to the extensive homology shared between all ditag templates.

4.3.3. Impact of secondary-structure and abundance bias during the PCR of SAGE ditags

In SAGE, the ditag population subjected to the PCR is created through blunt-end ligation of randomly selected pairs of individual adaptor:tag constructs (A:Ts), which can lead to the formation of four distinct structural classes of ditag templates (Figure 4.3): single-stranded ditag templates that are completely or operationally non-self-complementary (type 1 ditags), and single-stranded ditag templates that are prone to form hairpin structures, either through the presence of an inverted dimer of the same SST in the ditag (type 2 ditags), or via the same adaptor at the two ends of the ditag (type 3 ditags), or finally, through a completely symmetric and therefore completely self-complementary A:T pair (type 4 ditags). The propensity of a given single-stranded template to form a hairpin structure that inhibits its amplification (which we collectively term "structure-dependent bias" to differentiate it from the "abundance-dependent bias" mechanism identified in the previous section) thus defines a second source of bias that arises out of the ditag amplification step.
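The four structural classes defined above can be expressed as a small decision rule. The following Python sketch is illustrative only: the function names, the single-letter adaptor labels, and the representation of a ditag as two half-sequences read 5' to 3' on one strand are assumptions made for the example, not part of the SAGE protocol or of this work's analysis code.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def ditag_type(adaptor_5p: str, half_1: str, half_2: str, adaptor_3p: str) -> int:
    """Classify a ditag into structural types 1 to 4.

    Hypothetical representation: half_1 and half_2 are the two tag-derived
    halves of the ditag insert, both read 5'->3' on the same strand.
    A ditag built from two copies of the same SST reads as an inverted
    repeat, i.e. half_2 equals the reverse complement of half_1.
    """
    same_adaptor = adaptor_5p == adaptor_3p
    inverted_repeat = half_2.upper() == reverse_complement(half_1)

    if same_adaptor and inverted_repeat:
        return 4  # fully symmetric, completely self-complementary
    if same_adaptor:
        return 3  # hairpin closed by identical adaptors at both ends
    if inverted_repeat:
        return 2  # hairpin formed by the inverted dimer of one SST
    return 1      # operationally non-self-complementary (amplifiable)


# Example with made-up 13-nt halves
tag = "GTCTGCGTGAAGT"
other = "CCGTGGGCTACAT"
print(ditag_type("A", tag, other, "B"))                    # type 1
print(ditag_type("A", tag, reverse_complement(tag), "B"))  # type 2
```

Only type 1 ditags are expected to amplify efficiently, so a classifier of this kind is the first step in any simulation of which tags survive the PCR.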
Type 2 and type 4 ditags are both inverted tag sequences that traditionally are filtered out in SAGE analysis because of their expected low amplification efficiencies. Similarly, type 3 ditags are thought to amplify very poorly as a consequence of the Tm of the adaptor hairpin (~84 °C as predicted by Mfold, http://www.idtdna.com/scitools), which is ca. 30 °C higher than that of the primer-template duplex (Tm ca. 55 °C). This leaves the final SAGE sequencing results effectively directed towards the determination of the distribution of type 1 (i.e. linear) ditags in the sample following PCR amplification. However, there is little evidence in the literature validating the implicit assumption that the normalized distribution of SSTs in the population of amplified type 1 ditags recapitulates the normalized tag distribution in the starting SST mixture.

As ditag formation is presumed to proceed to completion (i.e. every A:T is incorporated into a ditag species), the normalized frequency of a particular ditag sequence can be expressed as the product of the normalized frequencies of each tag comprising the ditag in the sample. This allows us to define the statistical relationships governing the number and distribution of type 1/3 (hetero-)ditags and type 2/4 (homo-)ditags formed from a given mRNA sample and associated ensemble of A:T constructs. We therefore simulated the partitioning of individual SSTs into hetero- and homo-ditags using tag abundance data obtained from each of a series of 32 human MPSS libraries as a starting sample (Figure 4.4.A & B) [6]. The results of this simulation show that tags present at the highest abundance in each MPSS library are preferentially incorporated into homo-ditags. As a result, the corresponding transcripts are under-represented within the amplified pool of ditags, leading to an under-representation of the highest abundance transcripts in the final SAGE data set (Figure 4.4.A: left panel).
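The product rule stated above implies that a tag with normalized frequency p pairs with itself with probability p, so a fraction p of its copies is lost to homo-ditags. The sketch below works through that arithmetic in Python. It is a minimal illustration under stated assumptions: random pairing, complete exclusion of homo-ditags from the amplified pool, and adaptor identity ignored. The toy abundance profile is hypothetical and is not taken from the MPSS libraries.

```python
def hetero_ditag_frequencies(tag_counts):
    """Normalized tag frequencies within the hetero-ditag pool.

    Assumes random pairing: a tag with frequency p ends up in a
    homo-ditag with probability p, so its hetero-ditag share is
    proportional to p * (1 - p).
    """
    total = float(sum(tag_counts))
    p = [count / total for count in tag_counts]
    hetero = [pi * (1.0 - pi) for pi in p]
    norm = sum(hetero)
    return p, [h / norm for h in hetero]


# Hypothetical transcriptome: one dominant tag at 17.5% and
# 1650 rare tags sharing the remaining 82.5% equally.
counts = [175_000] + [500] * 1650
original, amplifiable = hetero_ditag_frequencies(counts)

ratio = amplifiable[0] / original[0]
print(f"dominant tag: original {original[0]:.3f}, "
      f"hetero-ditag pool {amplifiable[0]:.3f}, ratio {ratio:.3f}")
```

For this toy profile the dominant tag comes out roughly 15% under-represented in the hetero-ditag pool, while each rare tag is slightly over-represented because the pool is renormalized.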
The degree of this secondary-structure bias varies with the nature and complexity of the transcriptome examined (Figure 4.4.A: right panel). When a single transcript dominates the transcriptome, such as that for β-hemoglobin (17.5%) in the bone marrow transcriptome, or the amylase variant transcript (13%) in the pancreatic transcriptome (http://mpss.1icr.org/SearchIPSS.php), the degree of bias is substantial. Each of these high-abundance transcripts is under-represented by 10-15% within the corresponding amplifiable type 1 ditag pool. In contrast, when a transcriptome is characterized by a larger number of high-abundance transcripts (a more common occurrence), such as the 30 to 40 unique transcripts that constitute the top 20% of the spleen or the thymus transcriptome, this bias is reduced, resulting in a very small under-representation of high-abundance transcripts (1-2%) in the amplifiable pool of ditags. Thus, in most instances, errors introduced by template secondary-structure effects are relatively small. In all 32 transcriptomes examined, secondary-structure-based bias dissipates within the first order of magnitude of abundance (Figure 4.4.B).

As demonstrated in Figures 4.1 and 4.2, suppression of the amplification of lower abundance ditags is potentially a more significant source of bias in SAGE analysis. However, the impact of this effect on the ability to infer true normalized tag frequencies, and thus transcript abundances, from the resulting amplified pool of ditags is not known. The degree of "amplification-suppression" bias introduced into each of the 32 human MPSS libraries was therefore examined through a simulation of the PCR step in SAGE (Figure 4.4.C & D).
Following formation of the amplifiable type 1 ditag pool, 20 cycles of PCR were applied to each potential ditag based on amplification efficiencies regressed from the qPCR data shown in Figure 4.2 (see equation 6 of the Methods section), and the normalized tag frequencies within the amplified pool were then extracted. The results of our simulations indicate that abundance-dependent suppression of ditag amplification can lead to an 8-fold to 10,000-fold under-representation of low abundance tags in the amplified ditag population, with a concomitant over-representation of higher abundance tags that tails off for the highest abundance tags due to the secondary-structure bias detailed above (Figure 4.4.C). As observed with the secondary-structure bias, the extent of ditag abundance-related bias depends on the manner in which tags are distributed across abundance classes in the original sample. However, the net effect is always a dramatic change in the relative distribution of tag abundances due to a large drop in the relative abundance of lower abundance tags in the original sample following amplification. As a result, the abundance distribution of tags in the amplified material does not reflect the true distribution of tags in the original sample (Figure 4.4.C), leading to unwanted changes in the amplified product to original tag frequency ratio (Figure 4.4.D), a result that is consistent with what has previously been observed for competitive PCR. Suppression in the amplification of low abundance ditags is likely the main contributor to the low concordance observed for low abundance SAGE tags across sample replicates. Regrettably, this bias depends upon the original complexity of the transcriptome being analyzed, and likely precludes the development of an appropriate correction algorithm, as information on the organization of the transcriptome would need to be established a priori.

4.3.4. Correcting the Problem: Application of Taqman™ assays to proofreading DNA polymerases

Our results, which show that PCR amplification of a SAGE ditag population using Taq can alter the distribution of tag abundances primarily through suppression in the amplification of lower abundance ditags (and thus low abundance SSTs), point to the need for an improved amplification strategy that greatly reduces or removes this source of bias. As our hypothesis was that inhibition of the priming and extension of low abundance ditag templates is related to the formation of intra-strand annealing products during the PCR, we sought to examine whether thermophilic DNAP(B)s could be used to reduce this effect by exploiting their 3'→5' exonuclease activity to edit through partially complementary duplexes.

Although their use in probe-based qPCR has not been reported, DNAP(B)s should in principle be able to process fluorogenic Taqman™-type probes, provided that the 3'→5' exonuclease domain of the DNAP(B) is capable of removing the 3'-quencher group. In previous work, we demonstrated that DNAP(B)s from the mesophiles T4 and E. coli, as well as from the thermophile T. litoralis (Vent®), are proficient in removing 3'-isoheptyl amine terminators [15], suggesting that non-nucleotidic moieties such as the quencher typically employed in the design of fluorogenic DLPs may be processed by DNAP(B)s. A variety of commercially available thermophilic DNAP(B)s were therefore screened under the manufacturer's recommended reaction conditions for their applicability to Taqman™-based qPCR assays. The amplification of the 50GC_2080 template (0.1 pM) was used as a test system. While found to be generally applicable to Taqman™ assays, DNAP(B)s displayed marked differences in their ability to utilize the Taqman™ probe to track the amplification reaction (data not shown). The Pfu family of DNAP(B)s (i.e. Pfu, PfuUltra, PfuUltraII) consistently provided the most stable and robust fluorescence signal for the Taqman™ assay, particularly when a 2-step cycling protocol was employed.

We therefore focused on this group of DNAP(B)s and tested various PCR conditions that could potentially eliminate the observed suppression of amplification of low abundance ditag templates. In particular, co-amplification of 50GC_2080 (0.1 fM) in a background of 1000-fold excess of 50GC_5050 (0.1 pM) was investigated as a function of primer (0.1-1.0 µM) and dNTP (0.01-0.5 mM each) concentration, and ΔCT was again used as an index of performance. We found that amplification conditions that minimized ΔCT were essentially the same for Pfu and PfuUltra, in both cases lying within the interval 0.05 to 0.2 mM for nucleotide and 0.1 to 0.2 µM for primer concentration (Figure 4.5.A). At these reaction conditions, the performance of Pfu (Figure 4.5.A: right panel) and PfuUltra (data not shown) on serial dilutions of 50GC_2080 in a background of 50GC_5050 shows that, unlike with Taq, preservation of relative tag abundances during amplification can be maintained over 3 to 4 orders of magnitude. Unfortunately, the amplification efficiencies (48%) observed under these reaction conditions are generally too low to economically generate sufficient ditag product for concatenate synthesis, even when relatively high template concentrations (0.1 nM) are utilized.

The performance of PfuUltraII was significantly better (Figure 4.5.B). ΔCT remained less than 0.5 at all 50GC_2080/50GC_5050 ratios for reaction conditions covering a wide range of primer (0.2-1.0 µM) and nucleotide (0.05-0.2 mM) concentrations.
Optimal performance was achieved at 0.5 µM primer and 0.05 mM nucleotide, where ΔCT is essentially always zero and co-amplification of templates that differ in abundance by at least 5 orders of magnitude can be achieved without any observed abundance-dependent inhibition (Figure 4.5.B: right panel). The amplification efficiency was again less than 100% (E = 68%) at these reaction conditions, but was considerably higher than observed for Pfu and PfuUltra. Nevertheless, sufficient amounts of ditag product could not be reliably obtained (Figure 4.6).

A more careful study of conditions maximizing product yields while minimizing the effects of abundance-dependent inhibition was therefore initiated for serial dilutions of 50GC_2080 in the presence or absence of competing template 50GC_5050. The greater coverage of reaction conditions allowed us to more clearly define the relationship between E, product yield and abundance-dependent amplification suppression (Figure 4.6). We find that increases in E, obtained through increases in either dNTP or primer concentration, occur at the expense of preserving relative ditag abundances during amplification. As a result, E can be tuned to be equal to its observed value of 95% for Taq-based PCR (see Table 1) when dNTP concentration is increased above 0.1 mM (i.e. under conditions approaching those recommended by the manufacturer). However, the amplification of low abundance ditags by PfuUltraII is then suppressed in a manner similar to that observed with Taq-based PCR, presumably due to an improper ratio of polymerase to 3'→5' exonuclease activity provided at these conditions. We therefore screened for an acceptable balance in preserving ditag abundance ratios while increasing E.
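The trade-off between efficiency and yield described above can be made concrete with the standard exponential-phase relation N = N0 (1 + E)^n. The Python sketch below is a back-of-the-envelope illustration, not the model used in this work: it assumes a constant per-cycle efficiency over all cycles, and the function names are hypothetical.

```python
def fold_amplification(efficiency: float, cycles: int) -> float:
    """Fold increase in product after `cycles` cycles, assuming a
    constant per-cycle efficiency E (0 to 1): (1 + E) ** cycles."""
    return (1.0 + efficiency) ** cycles


def fold_underrepresentation(delta_ct: float, efficiency: float) -> float:
    """Apparent fold loss of a template whose threshold cycle is delayed
    by delta_ct cycles relative to its isolated control, under the same
    constant-efficiency assumption."""
    return (1.0 + efficiency) ** delta_ct


# Efficiencies quoted in the text: 95% (Taq) and 68% (PfuUltraII
# at the bias-free optimum), compared over 20 cycles.
taq = fold_amplification(0.95, 20)
pfu_ultra_ii = fold_amplification(0.68, 20)
print(f"Taq: {taq:.2e}-fold, PfuUltraII: {pfu_ultra_ii:.2e}-fold, "
      f"yield ratio: {taq / pfu_ultra_ii:.1f}")

# A delay of 0.5 cycles at E = 68% corresponds to a modest apparent loss
print(f"{fold_underrepresentation(0.5, 0.68):.2f}-fold")
```

Under these assumptions, 20 cycles at E = 68% give roughly 20-fold less product than 20 cycles at E = 95%, which is consistent with the need to pool reactions or raise E to obtain enough ditag material.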
At reaction conditions employing 0.5 µM primer and 0.1 mM nucleotide, we found that the amplification reaction could preserve ditag abundance ratios over at least 3-4 orders of magnitude with product yields detectable on a 2.2% agarose gel when the total starting template concentration is 0.1 pM and the amplification is terminated after 20 cycles (Figure 4.6, Lane A). Using protocols described below, this allows the isolation of sufficient amplification product for concatenate synthesis through pooling of twenty 50 µL PCR reactions. A smaller number of amplification cycles or PCR reactions is required for concatenate synthesis if the total initial ditag concentration is increased.

4.3.5. SAGE library construction using a minimal amount of amplified ditag material

Although we were able to define PfuUltraII-catalyzed reaction conditions that preserve the abundance distribution of ditags in mixtures amplified via PCR, the yields obtained are lower than those typically obtained in the standard Taq-based SAGE protocol. To avoid the additional cost that would be incurred through more extensive use of PfuUltraII, we therefore devised a modified protocol for highly efficient 26 bp SST pair purification and concatenation from a more limited amount of PCR product (Figure 4.7). Following ditag amplicon clean-up with a YM-30 microfiltration column and digestion with NlaIII in the presence of bovine serum albumin (Figure 4.7, step B), free streptavidin is introduced in molar excess to bind all released adaptor moieties and unused primer, both of which are biotinylated (Figure 4.7, step C). All streptavidin-bound complexes are then removed from the digest mixture by capture on a Micropure-EZ filter (Millipore) that exhibits preferential binding of proteins over nucleic acids, yielding purified 26 bp SST pairs in the filtrate (Figure 4.7, step D) with undetectable levels of contaminating adaptor and primer.
This new method reliably provides yields of 80 to 90% of 26 bp SST pairs from the 102 bp PCR product. In contrast, we typically obtained a 30% recovery of both the ditag amplicon and the SST pair population from the PAGE purification and extraction process used in the standard SAGE protocol.

Recovered 26 bp ditags were then concatenated through an "end-point" ligation protocol that exploits the processive nature of T4 DNA ligase and the trans-adenylation reaction to generate readily clonable concatenates. In general, linear concatenates with lengths between ca. 700 bp and 2000 bp are desired in the cloning and sequencing steps. Selection of this range of concatenate lengths is typically achieved by purification on and selective excision from gels, a process that results in considerable product loss. This loss is further exacerbated by the formation of circularized concatenates that cannot be cloned into a vector for sequencing. While other groups have utilized NlaIII to restrict the formation of circular concatenates [41], this procedure does not allow control of concatenate length. In contrast, we have found that the use of limiting ATP concentrations (3 µM) in the presence of high concentrations (15% v/v) of PEG-8000 favours the formation of linear concatenates [42] of an average length of 1.5 kb (Figure 4.7, step E). This obviates the need for gel-purification of appropriately sized concatenates prior to cloning, greatly increasing yields of clonable concatenates.

4.4. DISCUSSION

The ability to obtain high quality, quantitative information about transcript abundances within a transcriptome (mRNA) sample requires knowledge of and accurate corrections for any biases introduced into the sample over the course of the processing steps used in the analysis. SAGE, as a sampling-based technology, enables the analysis of a transcriptome via the creation of short sequence tag (SST) representations of transcripts from fixed positions within their 3'-end.
The technique requires over 20 processing steps to present the SSTs in a format amenable to sequencing. As large amounts of starting material are not often available, amplification of ditags is required to facilitate further manipulation of the sample, as well as to offset cumulative losses that occur during processing. However, amplification, particularly via the PCR, can bias results, as differences in amplification efficiencies between any two templates can lead to unwanted accumulation of one template relative to the other. These differences in efficiency can in principle arise as a consequence of the differences in sequence composition among the different templates or, as illustrated in this work, through the formation of secondary structures within templates that inhibit template amplification by the Taq family of DNAPs.

We have discovered that, under the current PCR conditions employed for the large-scale amplification of ditags, a significant abundance-dependent amplification bias is observed that can directly impact the fidelity of SAGE analysis. This bias systematically results in an underestimation of, and an increased uncertainty in, the abundances inferred for SSTs corresponding to transcripts expressed in relatively low numbers, and appears to result from the formation of partially complementary template-template products that block primer hybridization during the rapid annealing phase of the PCR. In support of this hypothesis, we found that abundance-dependent inhibition in the amplification of lower abundance templates was absent when they were co-amplified in the presence of a high concentration of non-homologous template derived from the BCL2 gene.

Through the development and application of Taqman-like assays to DNAP(B)s, we found that suppression of low-abundance template amplification could be minimized when proofreading DNAPs such as Pfu were employed under conditions which enhance their exonuclease activity (Figures 4.5 and 4.6).
DNAP(B)s such as Pfu are then able to edit either strand of the intra-strand annealing products and synthesize the proper complementary sequence, reintroducing this population of ditag templates into the amplification cycle. As this exonuclease activity is absent from Taq, processing of templates incorporated into intra-strand annealing products is inhibited, leading to the observed abundance-dependent amplification bias.

It is tempting to speculate that secondary-structure formation during amplification, both from the formation of template hairpins as well as from intra-strand annealing products, may be partially mitigated by the inclusion in the PCR of DNA binding proteins, which have been shown to enhance product yield and specificity by increasing perfect primer-template association via RecA, or by reducing strand slippage across hairpins by enhancing the strand displacement activity of Taq via a single-strand binding protein from E. coli [46]. However, the extensive homology shared between any two non-identical ditag templates versus the primer-template is likely to hinder the ability of DNA binding proteins to minimize the formation of intra-strand annealing products.
Alternatively, as intra-strand annealing products (Tm ca. 65 °C) are thermodynamically less stable than the perfectly matched ditag templates (Tm 69-75 °C), reducing the annealing rate (95 °C to 55 °C) below 3.3 °C/s (MyIQ cycler, BioRad) may help limit the formation of intra-strand annealing products, though at the expense of primer-template annealing (Tm ca. 55 °C) and amplification efficiency. However, this approach may reduce amplification fidelity as the longer duration of high-temperature cycling increases the incidence of heat-induced base depurination and/or deamination. Furthermore, the activity of Taq may become compromised, as its half-life at 95 °C ranges from 40 min to 2.5 hrs depending on the enzyme source.

An alternative strategy to solving the amplification bias problem was therefore chosen. By exploiting the exonucleolytic properties of proofreading DNAPs of the Pfu family, and in particular PfuUltraII, we were able to recapitulate the relative abundances of competing ditag templates varying over 5 orders of magnitude in concentration. In principle, other DNAP(B)s with higher native exonuclease activity may perform equally well in the co-amplification of competitive templates. However, in our Taqman-based screen of commercially available DNAP(B)s, a variable ability to process the fluorogenic probe was observed among these high-exonuclease-activity DNAPs, indicating a differential capacity for their exonuclease domains to remove the quencher from the DLP. As a result, we were unable to clearly assess the performance of these DNAP(B)s in amplifying ditag mixtures without altering relative tag frequencies. For example, amplification utilizing a 2-step protocol with Pfx, a DNAP(B) from the thermophile Thermococcus kodakaraensis KOD1 that exhibits high processivity [50], did not show an increase in fluorescence despite the presence of an amplification product (data not shown).
In contrast, Phusion, a derivative of a Pfu-like DNAP with a DNA-binding motif akin to PfuUltraII, was able to utilize the fluorogenic probe, but reaction conditions could not be defined that did not alter relative tag frequencies during the co-amplification of competitive templates.

Our success in utilizing Pfu family DNAPs to overcome biasing of relative tag frequencies in the co-amplification of highly homologous templates has general applications in the use of real-time competitive PCR within a mixed population. In particular, utilization of DNAP(B)s such as PfuUltraII offers the potential to significantly increase the sensitivity and reliability of assays to detect variants at <1:100,000 within a population, without the requirement to develop a series of specific primers to segregate amplification products. This has broad implications in the ability to detect and quantify splice variants, gene deletions, or polymorphisms associated with various cancers, such as in the clinical detection of minimal residual disease in tissues following treatment [52]; it may also serve to improve the analyses of microbial flora in various environments through the detection of strain-specific 16S ribosomal sequences.

For SAGE analysis, however, such an increase in assay sensitivity comes at the price of lower analyte (amplified ditag) yields, hampering the ability to reliably generate a SAGE library for transcriptomic studies. Adjustment of nucleotide and primer concentrations can significantly improve yields, but generally leads to a reduction in the range over which relative transcript abundances are preserved. We therefore redesigned the processing steps following ditag amplification to minimize material losses during the purification and recovery of 26 bp ditags and to maximize the production of clonable concatenates.
By defining PCR conditions with PfuUltraII that would enable the preservation of relative ditag frequencies over 4 orders of magnitude, a sufficient pool of concatenates (10 ng) from 20 PCR reactions terminated at 20 cycles can be obtained (Figure 4.7). In summary, we have discovered that amplification bias in the PCR of competitive templates, a technique applied to numerous assays including SAGE technology, can be offset through the use of proof-reading DNAPs. Our ability to apply Taqman-like probes to monitor the efficiency of amplification using these DNAP(B)s enabled definition of reaction conditions that largely eliminate this bias. In particular, replacement of Taq with 1iU1traII under conditions that enhance its exonucleolytic activity permits amplification of SAGE ditag populations without alteration of relative tag frequencies. The strategy should prove useful for a wide range of related assays that rely on the use of common primers to interrogate a diverse population of templates. Finally, the demonstration of the applicability of Taqman-like probes to monitor DNAP(B) activity opens up new avenues in the development of probe-based assays that are able to utilize this biochemically diverse set of DNAPs.  116  I  Templates II  5o5Q5O: 5OGC 2080:  5’ -CATGGTCTGCGTGAAGTCTTAGCRTG-3’  2GC:  5’ -CATGTGTTCTATAATTTAATATTCaTG-3’ 5’ -CATOGCCCCccCGCGCCCCCACTcCkTG-3’  5’ -CATGGCCGTGGGCTACaTTCATCRTG-3’  TCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-3’  5’ -CTGCTCGAATTCAAGCTTCTAACGATGTACGGGGA  P2  P1 Probes  DLP 5OGC 5050: DL? 500C 2080: DL? 21GC:  5’ -FP,4-ACATGcTAAGACTTCAcGCTCAA.C-BHQ-3’  DL? 79GC:  5’ -FN-ATQGAGrGGGGGCGC-EHQ-3’  5’ -FM-ACRTGGCACGTGGGCTACT-BHQ-3’ 5 -FAM-ACTGaAtAtTtAAAtTAtAgAAaA-BHQ-3’  A.  B.  1200  1200 D  1000  1000  a)0 C  800 C)  2  600  0  0  Cu  400  a)  200 Cu ‘)  >  a,  0 3  20 30 Cycle Number  C.  40  40 Cycle Number  D. 
U,  1200  SC S  500 C  C  1000  D  800  SC  a)  600  a, 0 U, a,  0 Cu CC  0  400  Cu LL  200  200  a,>  100  Co  ‘a a,  79GC  400 300  Cu  0 10  0 20  3b  10  20  30  Cycle Number  Cycle Number  4.1. Amplification profiles of synthetic templates within a background of ditags derived from C. neoformans. Each synthetic ditag construct was designed to mimic the structure of a typical SAGE ditag where two different adaptor Sites containing priming sites (P1 and P2) flank an internal sequence corresponding to an SST pair. Corresponding probes are shown (uppercase: DNA; lowercase: LNA). Synthetic templates were either amplified on their own (black lines), or introduced into a 1/400 dilution of ditags generated from C. neoformans and amplified (color lines) under standard SAGE protocols using Platinum Taq. Amplification sets represent 10-fold serial dilutions of the synthetic ditag template in the presence or absence of background C. neoformans ditags. A. 5OGC_5050 B. 5OGC_2080 C. 21GC D. 79GC_5050  Figure  117  A. ,,  1400 4@  1200 1000  ii.  800  C  600 Z  35 3°  0  25  400  a  20  200  E  =  93.4% 0.997  =  ir )  20  30  I  50  40  10  100  Template Concentration (fM)  Cycle Number  B. 0  2500  40  D 0 0  a)0 (0  a, 0  D  2000  35  1500  30  C  0  1000  25  500  20  U.  a,  >  E=90.7% R=0.997  a,  10  20  C.  40 30 Cycle Number  10 100 1000 Template Concentration (tM)  50  2.8 2.6 2.4 2.2  a)  2.0 1.8 1.6 0 -  1.4 1.2 1.0 18-3  0.01  0.1  1  fractional abundance (f)  Figure 4.2. Impact of competitive and non-competitive templates in PCR amplification with Platinum Taq. A. Amplification profiles of serial dilutions of 5OGC_8020 synthetic clitag (100 fM to 0.1 fM) in the presence (red) or absence (black) of a background of 5OGC_5050 synthetic clitag (100 fM). When the ratio of interrogated template to background template falls below unity, suppression of amplification is observed. B. 
Amplification profiles of serial dilutions of 50GC_8020 (100 fM to 0.1 fM) in the presence (red) or absence (black) of a background of BCL2 (100 fM). In contrast to the presence of a homologous competing template, suppression of amplification is not observed. C. Relationship between the doubling rate D (= 1 + E/100) and the fractional abundance f of 50GC_2080 in the presence of a background of 50GC_5050 template.

[Figure 4.3 image: schematic of ditag template structures, types 1 to 4.]

Figure 4.3. Structural diversity of PCR templates generated from ditag species. Upon blunt-end ligation of two populations of released adaptor(A/B)-tag(1/2) constructs, ditags with a general adaptor-tag-tag-adaptor configuration are formed. These ditags are then introduced as templates for the PCR and, following the melting cycle, yield single-stranded species capable of forming numerous secondary structures during the annealing cycle. The potential for self-complementarity of the single-stranded templates derived from each ditag enables one to classify ditags into four distinct structural types, three of which (types 2 to 4) inhibit the amplification reaction through formation of single-stranded hairpins. SAGE analysis is therefore reliant on the selective amplification of type 1 templates, which can lead to amplification bias as described in the text.
[Figure 4.4 image: tag-count histograms for the bone marrow, pancreas, spleen and thymus libraries; panels A to D, x-axis: original tag frequency.]

Figure 4.4. Impact of ditag formation and PCR on the outcome of SAGE analysis. Virtual transcriptomes were constructed from each of 32 human MPSS libraries, from which the ditag formation and PCR steps of standard SAGE were simulated to assess the impact of secondary-structure and template-abundance biases on the fidelity of SAGE analysis. The simulation results for 4 libraries are highlighted (black: human bone marrow; red: human pancreas; green: human spleen; blue: human thymus). A. Post-PCR structure-dependent bias arising from the preferential segregation of the highest-abundance tags into non-amplifiable homo-ditag species, leading to an under-representation of high-abundance tags in the amplification product. B. The extent of secondary-structure bias is relatively minor (no more than 20% under-representation) and is restricted to the one or two highest-abundance tags. C. Abundance-dependent bias arising from the formation of intrastrand annealing complexes during the PCR, leading to a strong inhibition in the amplification of low-abundance ditags. D. Ditag abundance-dependent bias results in a severe under-representation of lower-abundance tags.

Figure 4.5. Optimization of Pfu and PfuUltraII activity for minimizing abundance-based bias during amplification of ditags. Heat map showing 1/ΔC_T values for the amplification of 50GC_2080 ditag in the presence (versus the absence) of a 1000-fold excess of 50GC_5050 ditag as a function of primer and nucleotide concentrations. Darker values represent conditions where ΔC_T is minimized, and thus 1/ΔC_T is maximized.
Amplification conditions that maximize 1/ΔC_T lead to the full preservation of relative tag frequencies over 4 to 5 orders of magnitude of original template abundance, but proceed at an amplification efficiency E of less than 100%. A. Pfu. B. PfuUltraII.

[Figure 4.6 image: agarose gel (50 bp and 100 bp markers) of products obtained at varying dNTP (mM) and primer (µM) concentrations; panels A to D, amplification profiles (x-axis: Cycle Number) with standard curves (x-axis: concentration (pM)), annotated with efficiencies of 81.1%, 98.3%, 93.2% and 97.1%.]

Figure 4.6. Optimization of PCR conditions for minimized abundance-dependent inhibition and maximized product yield. Ditag amplification products from PCR reactions utilizing PfuUltraII and various amounts of 50GC_2080 template within a background of 0.1 pM 50GC_2080 template were resolved on a 2.2% agarose gel to determine product yields and amplification performance as a function of nucleotide and primer concentrations. Reaction conditions that preserve relative tag frequencies over 5 orders of magnitude do not yield sufficient ditag amplification product to be observable on the gel. Amplification product (20 cycles) is detectable at higher dNTP and primer concentrations, with an associated decrease in the range over which relative tag frequencies are preserved. Thus, increasing dNTP concentration (A→C, B→D) or primer concentration (A→B, C→D) improves yield but reduces performance.

[Figure 4.7 image: gel lanes A to E. A: 102 bp ditag; B: NlaIII digestion (50 U); C: incubation with streptavidin (10 µg); D: PC extraction, Micropure-EZ filtration and Nanosep Omega 3K filtration, giving the 26 bp ditag; E: 5 U T4 DNA ligase with 15% PEG, giving 1-2 kb concatenates.]

Figure 4.7.
New protocol for purification and end-point concatenation of 26 bp ditags recovered from 102 bp amplification products. A. Starting material of purified 102 bp amplification products. B. NlaIII digestion for 1 hr at 37 °C yields a 26 bp product mixture and regenerated adaptors. C. Incubation of the NlaIII digest with streptavidin shifts the mobility of biotinylated DNA, leading to a clear separation of 26 bp ditags from biotinylated adaptor moieties and partially digested biotinylated products. D. Filtration of digests in the presence of streptavidin through Micropure-EZ columns yields highly purified 26 bp ditags. E. End-point concatenation of recovered 26 bp products yields concatenates of ~1.5 kb in average length (0.6 to 2 kb range), which can be cloned directly into dephosphorylated vector.

Chapter 5
Progress towards the development of a universal microarray for quantitative gene-expression analysis on an absolute per cell basis*

* A version of this chapter is being prepared for publication [Reference: So, A. P., Turner, R. B. F., and Haynes, C. A. 2008. Progress towards the development of a universal microarray for quantitative gene-expression analysis on an absolute per cell basis. (in preparation)]

5.1. INTRODUCTION

The interactions between genes and gene products are a fundamental aspect of cellular function. For example, networks of protein-DNA, protein-RNA and protein-protein interactions combine to regulate transcription, metabolic flux, and signal transduction.
A number of genetic, molecular-biological and biochemical techniques have therefore been developed to identify biological interactions and their resulting analytes. These technologies include powerful high-throughput (HT) methods that permit simultaneous monitoring of an organism's complete ensemble of expressed transcripts, also known as its transcriptome. HT platforms for transcriptome analysis can be broadly classified according to the manner in which they achieve the evaluation. The digital approach to monitoring gene expression, as embodied by Serial Analysis of Gene Expression (SAGE) technology [2, 3], typically involves the creation of short sequence representations of each transcript ("sequence tags"), which are then identified via high-throughput sequencing, enabling the abundance of each expressed transcript to be inferred from the presumed linear correspondence with tag counts. In contrast, the analogue approach to monitoring gene expression, as embodied by microarray technologies, typically entails labeling the transcripts of a cell or tissue sample with a fluorescent or radio-labeled reporter molecule. Separation of this labeled population through selective binding to complementary registers of a high-density array [7, 8] then permits identification and quantification of each transcript present via the signal recorded at a given register. Both platforms have been shown to deliver reliable measures of changes in gene expression relative to a chosen standard state [9, 10]. However, neither has been shown to provide accurate measures of transcriptome-wide abundance profiles on an absolute (per cell) scale [11-18]. The ability to reliably obtain such information would benefit both systems biology and health research, permitting, for example, the construction and testing of quantitative and predictive dynamic models of cell state and function.
In principle, digital approaches such as SAGE and massively parallel signature sequencing (MPSS) are capable of providing absolute measurements of transcript abundances, as they directly count the presence of each transcript molecule. Indeed, some of the earliest applications of SAGE [1, 2, 19] or MPSS [20, 21] attempted to use these techniques for this purpose. However, as both are sampling-based methods, their accuracy is directly linked to the number of tags sequenced relative to the total number of transcripts expressed in the cell sample [22]. Due in part to the relatively high cost of sequencing, a typical SAGE or MPSS study will comprise ca. 40,000 or 2 × 10⁶ sequenced tags, respectively, representing less than 0.002% of the ~1 × 10¹¹ transcripts present in a typical starting sample of 100 ng of purified mRNA (2 kb on average) [24]. Recent evidence gained from studies on the reproducibility of SAGE and MPSS suggests that this low level of sample coverage results in large measurement errors (R² ≈ 0.5-0.7), particularly for low-abundance transcripts [12, 13]. Additional biases and artifacts, some quite severe, are introduced during the numerous processing steps required to convert the starting mRNA population into a form amenable to analysis via a given sequencing platform [26-28]. As a result, reproducible measurement of transcript abundance is generally limited to transcripts within the highest 2 orders of magnitude in abundance in the original sample.

With careful engineering and sample preparation, the precision of analogue microarray technologies can be made higher than that of current digital approaches [12], with correlation coefficients (R²) approaching 0.99 for Affymetrix oligonucleotide arrays [10, 25, 29-34]. However, while these results confirm that duplicate runs of a given sample on a well-constructed microarray can be made reproducible, they do not address whether the data obtained from the array accurately reflect the distribution of transcript abundances in the original mRNA [18, 35, 36]. Data addressing this issue are sparse, but biases are known to be introduced through any required amplification step, and may also be generated through differences in transcript/cDNA sequences and lengths that result in unique hybridization thermodynamics for each sequence and its complementary probe on the array surface [45-50]. Non-uniformity in target labeling [42, 51-55] and the known non-linear behavior of commercial detectors currently used to read arrays [56, 57] further bias transcript-abundance measurements. To minimize the effects of these potential sources of bias introduced by sample processing and array reading, DNA microarrays and related analogue technologies are therefore almost exclusively used in a comparative mode, in which fractional up-regulation or down-regulation of each expressed transcript is measured relative to the corresponding expression level in a chosen standard condition [10, 29-34].

As current microarray technologies are also sample specific, in that the sequence composition of each anchored probe is designed from a priori knowledge of the genome of the organism(s) or tissue sample under study, they are generally restricted to monitoring the expression profiles of known genes and/or transcripts within well-defined genomic regions [60, 61]. There is therefore a need for a universal hybridization-based platform for transcriptome analysis that is also capable of globally quantifying gene expression on an absolute per cell scale.
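The sampling-coverage argument made earlier in this section for SAGE and MPSS can be checked with a few lines of arithmetic. The sketch below assumes an average mass of about 330 g/mol per RNA nucleotide, a common rule-of-thumb value that is not taken from this thesis, and the function names are illustrative. It is meant only to reproduce the order of magnitude of the transcript count and of the fraction sampled.

```python
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # assumed average mass of one RNA nucleotide


def transcripts_in_sample(mass_ng: float, mean_length_nt: float) -> float:
    """Estimate the number of mRNA molecules in a purified sample."""
    grams = mass_ng * 1e-9
    molar_mass = mean_length_nt * NT_MASS_G_PER_MOL
    return grams / molar_mass * AVOGADRO


def percent_sampled(tags_sequenced: float, total_transcripts: float) -> float:
    """Percentage of the transcript population represented by sequenced tags."""
    return 100.0 * tags_sequenced / total_transcripts


if __name__ == "__main__":
    total = transcripts_in_sample(mass_ng=100.0, mean_length_nt=2000.0)
    print(f"transcripts in 100 ng of 2 kb mRNA: {total:.1e}")
    for name, tags in (("SAGE", 4e4), ("MPSS", 2e6)):
        print(f"{name}: {percent_sampled(tags, total):.5f}% of transcripts sampled")
```

Because a hybridization array receives the whole labeled population rather than a sequenced subsample, this sampling limit does not apply to it in the same way.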
Here we describe recent progress in our efforts to develop such an analysis platform, termed the Universal Sequence Tag Array (U-STAR), which seeks to combine the inherent strengths of SAGE technology with the highly multiplexed analytical capabilities of microarray platforms to obtain absolute quantitative gene expression data from 100 ng of starting mRNA (1 × 10⁵ cells) without the requirement for material amplification. By the highly efficient conversion of each transcript into a readable short-sequence tag, termed a "STARtag", and through careful chemical modification of each anchored probe using locked nucleic acids (LNAs) to achieve uniform hybridization kinetics and thermodynamics (e.g., melting temperature Tm), we are able to directly correlate the hybridization signal at the U-STAR array surface to the amount of each transcript present in the original sample. The prototype U-STAR platform is thereby able to quantify absolute transcript abundances over a dynamic range of ca. 3 orders of magnitude (~1 pM to 1 nM), which is currently limited by the signal-to-noise characteristics of the reporter fluorophore and by the dynamic range of the 16-bit photomultiplier-based array scanner available to us [56, 57]. The SAGE-like utilization of STARtags endows this technology with the potential to be used as a universal platform for quantitative analysis of gene expression in any organism or tissue sample, eliminating the overhead costs associated with construction of customized microarrays, and minimizing the costs and obstacles associated with the preparation and large-scale sequencing of SAGE libraries.

5.2. U-STAR: CONCEPT AND CRITICAL DESIGN FEATURES

The power of SAGE technology lies in part in its clever use of a type II restriction endonuclease (anchoring enzyme) and a type IIS restriction endonuclease to create a library of short sequence tags (SSTs) from polyadenylated mRNA of a eukaryotic organism or tissue sample, thereby providing a simple and convenient operational means to describe the transcriptome of these systems [2, 62]. In the current SAGE protocol, however, over 20 processing steps, including enzymatic conversions, gel purifications, and precipitation procedures, are required to prepare the concatenated SST library for sequencing [62]. Each step results in sample loss, and certain steps, such as the SAGE adaptor ligation step, are characterized by product yields of less than 30% [27]. When applied in an absolute quantification mode, these losses further reduce the fidelity of a technology already compromised by sample bias introduced in the required PCR amplification step and by limitations in the number of tags that can be sampled in the final sequencing reaction.

U-STAR technology attempts to harness the power of SAGE while overcoming its limitations by preparing a library of SSTs (known in this case as a STARtag library) with minimal loss of material (and therefore information), in such a way that it can be analyzed using high-density hybridization arrays rather than digital high-throughput sequencing platforms. By utilizing a hybridization-based approach to segregate, identify, and quantify SSTs present in the analyte, U-STAR technology minimizes sources of measurement error arising from incomplete sampling, as the entire STARtag population can be directly applied onto the array.

Drawing from the strengths of SAGE, generation of a STARtag library (Figure 5.1.A) begins with the isolation and immobilization of mRNA onto paramagnetic polystyrene beads possessing a poly(dT) primer.
The captured mRNA is then reverse transcribed into an immobilized cDNA library that can be processed with a Type II restriction enzyme to anchor the sequence tags. Following this step, U-STAR technology diverges from the SAGE protocol. A STARlink adaptor (Figure 5.1.A) is introduced in excess and ligated to the 5' sticky end of each anchored sequence tag. Following digestion with Acu I, the released STARtag, composed of a hairpin molecule possessing a 14-base single-stranded region (the SST) of known orientation and derived from the sense strand of the cDNA, is directly applied onto the hybridization array for identification and quantification.

Many of the unique elements of U-STAR technology that permit its use in the absolute per cell quantification of transcripts are encoded in the STARlink adaptor. It is a novel hairpin adaptor that possesses 1) an overhang sequence complementary to the anchoring restriction enzyme (ARE) site to enable ligation to the ARE-processed cDNA library, 2) a 5'-OH group to prevent ligation to the antisense strand of the anchored cDNA, 3) a Type IIS recognition sequence for Acu I to enable release of an SST in the form of a STARtag, 4) a fluorescent reporter group bonded to the hairpin at a fixed stoichiometry of one reporter per STARlink adaptor to ensure uniform labeling of each SST and therefore a 1:1 correspondence between the hybridization signal and the concentration of the SST in the analyte, and finally 5) a 5-methyl-deoxycytosine base within the overhang sequence that promotes stoichiometric conversion of every anchored 3'-cDNA into a STARtag through the application of our previously described directed ligation chemistry (DLC) [27].

Like SAGE, U-STAR technology utilizes SSTs of uniform length derived from defined positions within each transcript to generate an operationally simple and convenient representation of the transcriptome.
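The tag-release step can be mimicked in silico. The sketch below is an illustration under stated assumptions, not software from this work: it assumes Nla III (recognition site CATG) as the anchoring enzyme, takes the 3'-most site in the sense-strand cDNA (as in SAGE, where only the 3'-most fragment remains bound to the bead), and reports the 14-base tag formed by the 4-base site plus the 10 bases downstream. The function name and the example sequence are hypothetical.

```python
from typing import Optional


def extract_tag(cdna_sense: str, site: str = "CATG", downstream: int = 10) -> Optional[str]:
    """Return the anchoring site plus `downstream` bases for the 3'-most site.

    `cdna_sense` is read 5'->3'. None is returned when the sequence has no
    site, or when fewer than `downstream` bases follow the 3'-most site.
    """
    seq = cdna_sense.upper()
    start = seq.rfind(site)          # 3'-most occurrence of the site
    if start == -1:
        return None
    end = start + len(site) + downstream
    if end > len(seq):
        return None
    return seq[start:end]


if __name__ == "__main__":
    # Hypothetical cDNA fragment with two CATG sites; only the 3'-most is used.
    example = "GGATCATGAAACCCGGGTTTACATGTTGACCTAGGCATTAAAAAA"
    print(extract_tag(example))
```

Using `rfind` rather than `find` is the design choice that encodes the 3'-most anchoring rule; a transcript with no site simply yields no tag for that enzyme.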
The conversion of a transcriptome into a corresponding library of STARtags therefore provides an inherent improvement to array-based platforms. The uniform length of STARtags removes variations in hybridization signals that can arise from differences in mass transfer and hybridization kinetics associated with differences in the molecular weights of the analyte population [37, 45, 63-69]. Furthermore, the presentation of the SST as a 3'-dangling end to the hairpin structure of the STARtag enhances the base-stacking interactions formed during hybridization of the surface-bound probe with the STARtag hairpin, and thereby improves the specificity and stability of the target:probe duplex, as well as the kinetics of hybridization [70-72]. Finally, short oligonucleotides, such as these STARtags, are known to hybridize according to two-state theory, that is, in an "all-or-none" fashion, which allows the thermodynamics of hybridization to be accurately predicted using standard nearest-neighbour (NN) thermodynamic models for oligonucleotide duplexation [74]. Therefore, the array signal at each register is a known function of the concentrations of the SSTs in the sample and the associated sequence-dependent thermodynamics of duplexation with the tethered probe, allowing one to compute the concentration of the corresponding transcript in the original mRNA sample.

The array used to interrogate released STARtags consists of a registered library of single-stranded antisense probes of either 13 or 14 bases in length. For the 13-mer probe array, each probe comprises the common 4-base recognition sequence of the ARE linked at its 3'-end to one sequence from the entire combinatorial library of 9-mer oligonucleotides (4⁹ = 262,144 features).
Similarly, the 14-mer U-STAR array used in these studies is based on a 4+10-mer combinatorial library (4¹⁰ = 1,048,576 features), a feature density that falls within the capability of currently available array fabrication techniques [8, 76]. The U-STAR array is therefore universal, as the full set of unique probe registers can be utilized to analyze every 4+9-mer or 4+10-mer STARtag generated from any organism or tissue sample of interest; and, when genomic data are available, registers can be pre-assigned organism-specific identities to permit more rapid interpretation of the results. In addition, in an approach analogous to SAGE analysis, probe registers corresponding to tag sequences that show significant changes but for which no transcript information is currently available can be used to design primers and/or probes that will permit isolation and further characterization of the putative transcript for de novo functional annotation. Therefore, the flexibility conferred by this combinatorial library of probes not only obviates the need for the construction of individual arrays for a particular cell type or organism, but also permits the mass production of a single universal array, significantly reducing production costs and the lot-to-lot variability that would otherwise arise from the production of customized arrays.

In general, statistical robustness in the quantification of transcript abundance using hybridization arrays is best achieved by probing each transcript at multiple registers and within different regions of the transcript [77, 78]. The U-STAR platform satisfies this condition by generating unique sets of STARtags, where each set is anchored with a different ARE. Subsequent analysis of multiple STARtag library sets for a given sample thereby confers greater statistical reliability on abundance measurements, as the signals obtained from multiple tags are used to identify and quantify each expressed transcript.
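The size of such a combinatorial probe set follows directly from the number of variable positions: with four bases per position, a complete library of n-mers has 4^n members. The short sketch below counts the registers for the 4+9-mer and 4+10-mer designs and lazily enumerates tag sequences together with their antisense probes. The anchoring site CATG and the helper names are assumptions made for illustration only.

```python
from itertools import islice, product
from typing import Iterator

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def library_size(variable_positions: int) -> int:
    """Number of distinct sequences with the given number of free positions."""
    return len(BASES) ** variable_positions


def enumerate_tags(site: str, variable_positions: int) -> Iterator[str]:
    """Yield every tag made of the fixed site followed by a variable n-mer."""
    for combo in product(BASES, repeat=variable_positions):
        yield site + "".join(combo)


def antisense_probe(tag: str) -> str:
    """Reverse complement of a tag, written 5'->3'."""
    return tag.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    for n in (9, 10):
        print(f"4+{n}-mer library: {library_size(n):,} registers")
    for tag in islice(enumerate_tags("CATG", 9), 3):
        print(tag, "->", antisense_probe(tag))
```

A generator is used so that the million-member 4+10-mer set never has to be held in memory when only a subset of registers is inspected.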
Analysis of multiple STARtag library sets also overcomes problems with information loss associated with tag degeneracy, which arises when two or more transcripts are represented by the same SST within a given STARtag library (Figure 5.1.B).

While the use of short probe lengths permits construction of a universal array, it can nonetheless result in very poor device performance. In particular, the Tm values of hybridization products formed with 13-mer or 14-mer probes synthesized using standard DNA chemistry are strongly dependent on GC content, such that AT-rich sequences melt at a Tm as low as 27 °C while GC-rich duplexes remain stable at temperatures up to 64 °C at strand concentrations of 1 µM in 0.1 M NaCl. This wide range of thermal stabilities for potential target:probe duplexes hampers the ability to define hybridization conditions where all perfectly matched (PM) targets hybridize and mismatched (MM) targets do not (Figure 5.2). To overcome this limitation, U-STAR utilizes chemically modified probes that incorporate nucleotide analogues called locked nucleic acids (LNAs). LNAs are ribonucleotide analogues in which the ribose ring is made bicyclic through the introduction of a methylene bridge linking the 2'-oxygen with the carbon at the 4'-position, locking the sugar in an endo conformation.
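The GC dependence of duplex stability described above for unmodified DNA probes can be illustrated with a two-state nearest-neighbour estimate. The sketch below uses the unified DNA/DNA parameters of SantaLucia (1998) together with his entropy-based sodium correction; it is not the LNA-aware Perl algorithm used in this work, it treats 1 µM as the total strand concentration, and the two example 14-mers are hypothetical.

```python
import math

R = 1.987  # gas constant, cal/(mol*K)

# Unified DNA/DNA nearest-neighbour parameters (SantaLucia, 1998), 1 M NaCl.
# Keys are 5'->3' dinucleotides; values are (dH in kcal/mol, dS in cal/(mol*K)).
NN = {
    "AA": (-7.9, -22.2), "TT": (-7.9, -22.2),
    "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "TG": (-8.5, -22.7),
    "GT": (-8.4, -22.4), "AC": (-8.4, -22.4),
    "CT": (-7.8, -21.0), "AG": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "TC": (-8.2, -22.2),
    "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9), "CC": (-8.0, -19.9),
}
INIT_GC = (0.1, -2.8)  # initiation term for a terminal G.C pair
INIT_AT = (2.3, 4.1)   # initiation term for a terminal A.T pair


def tm_celsius(seq: str, strand_conc_molar: float = 1e-6, na_molar: float = 0.1) -> float:
    """Two-state Tm (deg C) of a non-self-complementary DNA duplex."""
    seq = seq.upper()
    dh, ds = 0.0, 0.0
    for end in (seq[0], seq[-1]):            # one initiation term per duplex end
        h, s = INIT_GC if end in "GC" else INIT_AT
        dh += h
        ds += s
    for i in range(len(seq) - 1):            # sum over stacked neighbour pairs
        h, s = NN[seq[i:i + 2]]
        dh += h
        ds += s
    ds += 0.368 * (len(seq) - 1) * math.log(na_molar)   # sodium correction
    kelvin = dh * 1000.0 / (ds + R * math.log(strand_conc_molar / 4.0))
    return kelvin - 273.15


if __name__ == "__main__":
    for name, probe in (("AT-rich", "CATGTTATAATTAA"), ("GC-rich", "CATGGCCGCGGCCG")):
        print(f"{name} 14-mer {probe}: Tm = {tm_celsius(probe):.1f} deg C")
```

Under these assumptions the AT-rich example melts far below the GC-rich one, which is the spread that makes a single hybridization temperature hard to choose for an unmodified DNA array.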
Introducing LNAs into an oligonudeotide increases the  stability of the duplex formed with its complementary sequence both by reducing the conformational entropy of the single strand and by favoring the formation of a more thermodynamically stable A-form duplex  7981•  The extent to which the Tm of the duplex is  increased is determined by the number of bases that are locked  81, 82  As LNA chemistry is  compatible with the phosphoramidite chemistry used for solid-phase DNA synthesis  79,83  probe  sequences with any number of LNA substitutions can be synthesized, providing an effective means to tune the Tm (and associated thermodynamic stability) of each probe:target duplex within  133  the combinatorial U-STAR array to a common value useful for hybridization-based analyses (Figure 5.2). Moreover, as oligonucleotides containing LNA substitutions are less tolerant to MM hybridization than their unmodified DNA counterparts  ,  hybridization temperatures that favor  PM targets while avoiding MM hybridization can be more readily defined, creating a hybridization array of higher sensitivity and specificity. The use of LNA chemistry to produce a uniform melting array of LNA/DNA mix-mer probes therefore confers unique and essential performance characteristics to U-STAR technology, but, we note, at a synthess cost that is currently greater than that for DNA probes of the same length. While the commercialization of the U-STAR platform for quantitative transcriptomics may therefore be challenged by the current commercial costs of LNA synthesis, the universality of USTAR enables the distribution of these costs over the number of fabricated arrays (fable 5.1), such that the production cost of each array would be 7O % that of the current retail costs of commercial production microarrays such as the Affymetrix Genechip (4250 per array). Fortunately, the price of LNA/DNA mix-mers has also been steadily decreasing as the demand for LNA-containing reagents increases. 5.3. 
METHODS AND MATERIALS

5.3.1. Oligonucleotides, enzymes and reagents

LNA-DNA mix-mers (STARprobes) were synthesized with a 5’-hexyl amino group, obtained HPLC purified (Proligo LLC, Boulder CO), and resuspended at a final concentration of 50 µM in nuclease-free water. STARprobes were designed to have a complementary duplex melting temperature (Tm) of 65 ± 1 °C at 1 µM oligonucleotide concentration and 75 mM NaCl, by comparing the predicted Tm’s obtained using an algorithm written in Perl (Appendix A.3), based on the LNA-DNA nearest-neighbour parameters developed by McTigue et al., with those based on the algorithm of Tolstrup et al. (http://www.lnatools.com) [82, 85]. STARlink adaptors, fluorescently labeled at a C6 amine modified deoxythymidine within the hairpin loop with 6-carboxy-4’,5’-dichloro-2’,7’-dimethoxyfluorescein (JOE; λex = 529 nm, λem = 555 nm), were obtained HPLC purified (Integrated DNA Technologies, Coralville IA) and resuspended at a final concentration of 10 µM in LoTE (2 mM TrisHCl, 0.2 mM Na2EDTA, pH = 7.5).

E. coli DNA ligase (10 U/µl), E. coli RNase H (5 U/µl), Nla III (10 U/µl), T4 DNA ligase (2,000 cohesive end units/µl), Acu I (5 U/µl), and Bfa I (5 U/µl) were purchased from New England Biolabs (Pickering ON). E. coli DNA polymerase I (10 U/µl) was from Fermentas (Burlington ON). AffinityScript™ multiple temperature reverse transcriptase was from Stratagene (La Jolla CA). MultiScribe reverse transcriptase, mussel glycogen, and Superase-In RNase inhibitor were from Ambion/Applied Biosystems (Foster City CA). Superscript II and Superscript III were from Invitrogen (Burlington ON). Adenosine triphosphate was from Roche (Laval QC). Nuclease-free water and TE buffer (10 mM TrisHCl, pH = 7.5) were from Integrated DNA Technologies (Coralville IA). All dry chemicals were obtained at ACS grade, and assembled reaction buffers were filtered through a 0.22 µm filter (Pall Life Sciences, Mississauga ON) as required.

5.3.2.
UV spectrometry measurements of duplex melting temperature (Tm)

Complementary oligonucleotides to STARprobes were obtained HPLC purified (Integrated DNA Technologies, Coralville IA) and resuspended at a concentration of 100 µM in UVM buffer (10 mM Na phosphate, 0.1 mM Na2EDTA, 0.1 mM NaCl, pH = 7.0). STARprobes and their corresponding complementary sense strands were combined by mass (± 0.1 mg) and melting profiles were measured at λ = 260 nm by thermal-ramp UV spectrometry (Cary 1E UV-Vis Spectrophotometer, Varian Canada Inc., Mississauga ON) at a scan rate of 0.5 °C/min in sealed quartz cuvettes (1 mm or 1 cm path lengths; Varian Canada Inc., Mississauga ON) under N2 gas. Each raw thermogram was imported into Microcal Origin 7.0 and, following baseline correction of pre- and post-transition data, converted into a fractional dissociation curve and then fit according to the two-state van’t Hoff equation

\[ \frac{1}{T_m} = \frac{R}{\Delta H}\ln\left(\frac{C_T}{4}\right) + \frac{\Delta S}{\Delta H} \qquad (1) \]

to derive melting thermodynamics parameters.

5.3.3. Array preparation

HPLC purified oligonucleotide mix-mer probes were resuspended in spotting buffer (150 mM Na phosphate, pH = 9.5, 0.001% [v/v] Tween-20 final) at a final concentration of 20 µM, and then spotted at an ambient temperature of 21 °C onto Nexterion H MPX 16 slides (Schott North America, Louisville KY) using a BioRobotics Microgrid II arrayer (Genomic Solutions Inc., Ann Arbor MI) equipped with Stealth split pins (Telechem International, Sunnyvale CA) and according to the slide manufacturer’s recommendations at the Microarray Centre, Prostate Centre, Vancouver General Hospital, BC.

5.3.4.
Scanner evaluation

A scanner evaluation slide (Full Moon Biosystems, Sunnyvale CA) containing 2-fold serial dilutions of Cy3 spotted onto glass slides at defined concentrations was used to define settings of the Genepix 4200AL scanner (Molecular Devices, Sunnyvale CA) and analysis parameters that would maximize dynamic range and enable 2-fold differences in fluorescence intensities to be measured with statistical significance. Slides were sequentially scanned at a pixel resolution of 10 µm (λ = 532 nm) at 100% laser power under increasing PMT settings (100 V to 900 V at 50 V increments) and at a focal position of 0 µm from the auto-adjusted focal plane. Images were collected as 16-bit TIFF files, and analyzed using GenePix Pro 6.1 using manually assigned feature settings. Results files were parsed and analyzed in Microsoft Excel (Redmond WA) and the resulting files were displayed graphically in Microcal Origin 7.0 (Microcal LLC, Northampton MA).

5.3.5. Calibration curve and performance of U-STAR array

Fluorescent calibrant STARtags (STARtags T1, T4 and T7) were obtained dual HPLC purified (Integrated DNA Technologies) and resuspended at 50 µM in LoTE. A master stock (5 µM) in 2x SSC buffer (derived from a 20x stock of 0.3 M Na citrate, 3 M NaCl, pH = 7.0) was prepared by mass and pre-annealed by melting at 95 °C and slow cooling to room temperature. Various dilutions of calibrant STARtags in 2x SSC buffer were generated by mass and target solutions were created by combining 40 µl of calibrant dilution with 10 µl of 5x 5_HYB buffer (8 µg/ml BSA, 0.5 % SDS in 4x NEB2 buffer). Gasketed slides were pre-equilibrated at 65 °C in an Advalytix Slidebooster hybridization chamber equipped with an acoustic mixer (Olympus America Inc., Concord MA), and target solutions (40 µl) were introduced into each well.
After incubation for 1 hr at 65 °C, 1 hr at 60 °C, and 14 hrs at 55 °C, target solutions were removed, and wells were washed thrice with 100 µl of 2.0x SSC buffer containing 0.1 % (w/v) final of SDS (SSC/SDS buffer). After gasket removal, slides were washed again by submersion into 50 ml of SSC/SDS for 5 minutes with gentle agitation, and then rinsed twice by submersion into 50 ml of 2.0x SSC with gentle agitation. Slides were then spun dry (1 min at 100 x g) and imaged at λ = 532 nm (100% laser power, PMT setting of 400) using the Genepix 4200AL scanner at a pixel size of 5 µm (1 line average) and at a focal position of 0 µm from the auto-adjusted focal plane. Images were collected as 16-bit TIFF files and processed as above.

5.3.6. Generation of STARtags

5.3.6.1. Sources of RNA used for this study

Seven polyadenylated RNA standards (Table 5.2) of known sequence and concentration (ArrayControl™ spikes, Applied Biosystems, Mississauga ON) were serially diluted by mass in nuclease-free TE buffer (10 mM TrisHCl, 2 mM Na2EDTA, pH = 7.0) and combined by mass (± 0.1 mg) to generate 4 model transcriptomes of known composition, varying over 3 orders of magnitude in transcript concentrations. Alternatively, a rat brain clone homologous to rat liver a transcription factor (Genbank ID: X65948) was used to generate a polyadenylated in vitro transcribed product as described elsewhere [27].

5.3.6.2. Generation of anchored cDNA

Dynabeads™ Oligo(dT)25 (2 mg) were washed twice with 1000 µl of binding buffer (10 mM TrisHCl, 1 M LiCl, pH = 7.0, supplemented with 10 µg/ml mussel glycogen), and then resuspended in 450 µl binding buffer. Model transcriptome mixtures of mRNA (~100 ng) in a total volume of 50 µl were incubated for 5 min at 75 °C, and then added to and mixed well with 150 µl of the Dynabead slurry.
After slowly cooling to room temperature for 15 min, the Dynabead slurry was washed twice with 500 µl of Wash B (10 mM TrisHCl, 0.15 M LiCl, pH = 7.0, supplemented with 10 µg/ml glycogen), followed by four washes with 100 µl of ice-cold 1x FS buffer (50 mM Tris, 75 mM KCl, 3 mM MgCl2, pH = 8.3) supplemented with 10 µg/ml glycogen. A premix of first strand synthesis reaction components was then assembled for use with MultiScribe™ (MS-RT) or Superscript II (SSII) reverse transcriptases according to the manufacturer’s recommendations in the presence of Superase-In, but scaled to a final volume of 100 µl. After pre-incubation for 5 min at 25 °C, first strand synthesis was initiated with 250 U of MS-RT or SSII. Reactions were incubated for 10 min at 25 °C, followed by 20 min at 37 °C, and then 30 min at 42 °C. Reactions were then supplemented with first strand synthesis components for use with AffinityScript (AS-RT) or Superscript III (SSIII) reverse transcriptases and incubated for 30 min at 50 °C followed by 30 min at 55 °C. Following first strand synthesis, the reaction mix was cooled on ice, magnetized, and 180 µl of the first strand reaction product was recovered and replaced with 130 µl of a pre-chilled mix of second strand synthesis reaction components (final concentrations: 20 mM TrisHCl, pH = 6.9, 90 mM KCl, 4.6 mM MgCl2, 15 µM β-NAD, 10 mM (NH4)2SO4, 0.067 U/µl E. coli DNA ligase, 0.267 U/µl E. coli DNA polymerase I, and 0.013 U/µl E. coli RNase H) and incubated for 2 hrs at 16 °C. After quenching with 30 µl of 0.5 M EDTA, anchored second strand products were then washed twice with 1 ml of preheated BW buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA, 1.0 M NaCl, 75 °C) supplemented with 1 % (w/v) SDS and 10 ng/µl glycogen, followed by four washes with BW buffer supplemented with 100 ng/µl BSA and 10 ng/µl glycogen as described in the standard SAGE protocol to prepare the Dynabead-anchored cDNA library for directed ligation chemistry.

5.3.6.3.
STARlink adaptor ligation via directed ligation chemistry

Following washing thrice with 200 µl of 1x NEB4 buffer supplemented with 100 ng/µl BSA and 10 ng/µl glycogen (NEB4/BSA), STARlink adaptors were then ligated to the 3’-end anchored cDNA library according to a “one-pot” adaptation of the directed ligation chemistry (DLC) protocol of So et al. [27]. The methylated base on the STARlink overhang inhibits (though not completely) the activity of Nla III on STARlink-modified products while permitting its activity on unwanted cDNA-cDNA ligation products, thereby driving by mass action the conversion to STARlink-modified cDNA fragments. A total of 40 pmol of STARlink adaptor (1 µM final) was introduced to the anchored cDNA slurry in the presence of 100 cohesive end units of T4 DNA ligase and 10 U of Nla III in 1x NEB4 buffer supplemented with 100 ng/µl BSA. The DLC reaction was initiated by adding 1 mM final of ATP for a total reaction volume of 40 µl. After incubation for 1 hr at 37 °C, the STARlink-modified 3’-end cDNA was isolated by removing the solution phase and washing the anchored products as above.

5.3.6.4. STARtag release and final preparation

After washing thrice with 200 µl of 1x NEB2 buffer supplemented with 100 ng/µl BSA and 10 ng/µl glycogen (NEB2/BSA), STARtags were released from the Dynabead-anchored library by digestion with 20 U Acu I (New England Biolabs) in 1x NEB2 supplemented with 100 ng/µl BSA and 40 µM (final) S-adenosyl-methionine for 2 hrs at 37 °C at a final volume of 40 µl. Following heat inactivation at 65 °C for 20 min, the entire reaction mixture was transferred to a Costar Spin-X filtration tube (0.45 µm cellulose membrane; Corning Life Sciences) prewet with 5 µl LoTE. After centrifugation for 3 min at 14,000 x g, filtrates were collected and brought to 40 µl with 1x NEB2 buffer supplemented with 100 ng/µl BSA and 10 ng/µl glycogen.
Ten microlitres of 5x E_HYB buffer (0.5 % SDS in 8x SSC) was added to mass dilutions of the recovered STARtags in NEB2/BSA. STARtags were denatured for 2 min at 95 °C, and reannealed at room temperature prior to hybridization to prototype U-STAR arrays.

5.3.6.5. Diagnostic of stepwise yields for STAR protocol

Anchored products obtained at each step of the STAR protocol and derived from 100 ng of the a transcription factor transcript were washed thrice with 200 µl NEB4/BSA buffer and digested with 20 U of Bfa I in NEB4/BSA buffer for 1 hr at 37 °C at a final volume of 20 µl. The solution phase containing products released from the oligo(dT)25 Dynabeads was then analyzed via polyacrylamide gel electrophoresis (PAGE) to assess product recoveries at each processing step as described elsewhere [27].

5.4. RESULTS AND DISCUSSION

The prototype U-STAR platform fabricated in this study generates STARtags using the type II anchoring enzyme (Nla III) and type IIS restriction enzyme (Acu I) pair in a manner analogous to the standard short-SAGE protocol [62] to release SSTs 4+10 bases in length. The arrays therefore display a registered set of 14-mer LNA/DNA mix-mer probes designed to specifically capture Nla III/Acu I derived SSTs from seven polyadenylated mRNA of E. coli (ArrayControl spikes, Ambion), which together represent our synthetic transcriptome (Table 5.2). While they do not include the full combinatorial set of 4+10-mer mix-mer probes, the prototype U-STAR arrays used are sufficient in design to allow us to independently and rigorously assess the performance of key features of the U-STAR platform as described below.

5.4.1.
Design and characterization of LNA/DNA mix-mer probe set

The nearest-neighbour (NN) model of LNA/DNA mix-mer melting thermodynamics developed by Tolstrup et al. and McTigue et al. [82, 85] was utilized to design the set of LNA/DNA mix-mer probe sequences for the prototype array such that each PM duplex formed with STARtags from the synthetic transcriptome is predicted to have a Tm of 65 ± 1 °C at a probe concentration of 1 µM and a NaCl concentration of 100 mM (Table 5.3). Melting temperatures for the corresponding pure-DNA probes were first calculated using the NN model of SantaLucia. Each mix-mer probe was then designed to incorporate a minimum number of LNA substitutions to reach the target PM duplex Tm of 65 ± 1 °C. Given that the LNA/DNA NN models used to design STARprobes are new and not yet widely tested, we verified predicted thermodynamic properties by monitoring the thermal melting profiles of each mix-mer probe with its unlabelled DNA complement using UV spectroscopy (Figure 5.3). The regressed data for duplexes of STARprobes formed with their complementary STARtag sequences derived from the 7 polyadenylated mRNA of E. coli are shown in Table 5.3. In hybridization solutions containing 0.5x SSC (150 mM NaCl), an average melting temperature (Tm) of 66.3 ± 1.8 °C, slightly higher than but nevertheless in line with NN model predictions, was measured for the complete set of PM duplexes formed with the set of mix-mer probes (114.6-120.5 µM probe) presented on the prototype array. More importantly, however, programmed insertion of LNA bases into STARtag probes significantly narrows the 8 °C (51.7 °C to 59.4 °C) melting window predicted for the equivalent probe set comprised only of DNA bases, demonstrating our ability to tune mix-mer probes to a common Tm for all PM duplex products.
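As a rough illustration of why the strand concentration must be quoted alongside each measured or predicted Tm, the sketch below evaluates the standard two-state van’t Hoff relation for non-self-complementary strands, 1/Tm = (R/ΔH) ln(C_T/4) + ΔS/ΔH, which is assumed here to be the form used for equation 1. The ΔH and ΔS values are hypothetical placeholders of a magnitude plausible for a 14-mer duplex; they are not parameters measured in this work, and the function name is illustrative only.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def melting_temperature_c(delta_h, delta_s, total_strand_conc):
    """Tm in deg C for a two-state, non-self-complementary duplex.

    delta_h: enthalpy of duplex formation, kcal/mol (negative)
    delta_s: entropy of duplex formation, kcal/(mol K) (negative)
    total_strand_conc: total strand concentration C_T, mol/L
    """
    if total_strand_conc <= 0:
        raise ValueError("strand concentration must be positive")
    # 1/Tm = (R/dH) * ln(C_T/4) + dS/dH
    inv_tm = (R_KCAL * math.log(total_strand_conc / 4.0) + delta_s) / delta_h
    return 1.0 / inv_tm - 273.15


# Hypothetical duplex parameters (illustrative only, not measured values).
DELTA_H = -110.0  # kcal/mol
DELTA_S = -0.300  # kcal/(mol K)

tm_high = melting_temperature_c(DELTA_H, DELTA_S, 1e-5)  # 10 uM total strands
tm_low = melting_temperature_c(DELTA_H, DELTA_S, 1e-6)   # 1 uM total strands
print(f"Tm at 10 uM: {tm_high:.1f} C, Tm at 1 uM: {tm_low:.1f} C")
```

With these placeholder values, a ten-fold dilution lowers the computed Tm by roughly 4 to 5 °C, so predictions made at 1 µM and measurements made at higher probe concentrations are not expected to coincide exactly.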
In addition to their ability to raise Tm, LNA substitutions are known to increase Tm differences between PM and MM duplexes [87-91]. Unfortunately, current NN or other models for LNA-containing duplexes are not capable of accurately predicting melting thermodynamics for duplexes containing one or more base mismatches [84, 85]. To address probe specificity, the melting thermodynamics for a set of mismatch probes to a single target (T1) was therefore analyzed via UV-M spectroscopy. Mismatches within the P1 STARprobe sequence were generated against the T1 target sequence, and directed to various base positions along T1 (Table 5.4). Depending on its positioning, the introduction of a single mismatch decreased the Tm by 6.9 °C to as much as 18 °C, indicating the potential to identify hybridization conditions that promote formation of PM duplexes while disfavoring formation of unwanted MM duplexes. While the study was not large enough in scope to serve as a basis for establishing rules for LNA placement on each probe to maximize MM discrimination, it did yield one important observation. Introduction of a locked base (P1_MM1: ggt/ac or P1_MM3: ctt/ag, where bases in lower-case and capital letters are DNA and LNA, respectively) across from the mismatch site of the target led to a ΔTm of -10 °C, irrespective of MM position along the probe sequence. While less than the highest ΔTm observed (-18 °C), which occurred in a less predictable manner when an unlocked base was introduced across the target mismatch site and locked bases were present upstream of the mismatch site (i.e. P1_MM4: gcc/gcc), this generally observed MM destabilization effect argues in favor of maximizing the number of LNA substitutions in each probe to maximize the likelihood that a MM with a given STARtag occurs at a locked-base position on the probe.
In the end, then, the optimal LNA content of each anchored probe will reflect the proper balance between increasing cost and increasing discrimination resulting from increasing the number of LNA bases in a probe. The need to maximize differences in PM versus MM Tm’s is in part due to equation 1, which defines the weak concentration dependence of Tm for a PM duplex. As predicted by equation 1, reducing total strand concentration C_T for LNA:DNA duplexes by an order of magnitude uniformly lowers the average Tm from 65.8 (± 1.9) °C to 61.5 (± 1.4) °C. While MM probe Tm’s for targets at 50 µM fall below the average Tm for PM probes at 5 µM, the difference, particularly for probes containing only one LNA substitution, is not always large enough to ensure that binding of a target to its PM register is thermodynamically favored 100-fold over binding to the corresponding single-MM register (i.e. the ratio of signal intensities from the PM and MM registers is ≥ 100). Meeting this thermodynamic selectivity condition would simplify data analysis by ensuring that each probe strongly favors duplexation to its PM. Preliminary data for the P4 probe (Table 5.4), which contains four LNA substitutions, again suggest that incorporating a larger number of LNAs into a probe enhances the impact of mismatches on duplex stabilities, underlining the potential value of designing probesets to include a large number of LNAs.

5.4.2.
Sensitivity and specificity of prototype U-STAR array

The substrate used to construct the prototype U-STAR array consists of a glass slide upon which lies a hydrophilic three dimensional (3D) polymer matrix specifically designed to make the thermodynamics of target hybridization to the matrix-attached probe closely match that in solution. Covalent attachment of 5’-amine terminated mix-mer probes is achieved through the formation of a stable amide bond via the aminolysis of activated N-hydroxysuccinimidyl groups on the polymer matrix, resulting in a platform less susceptible to probe release and degradation. Initially, a set of PM and MM probes to targets JOE-T1 and JOE-T7 that correspond to STARtags T1 and T7 derived from transcripts 1 and 7, respectively, of the synthetic transcriptome (Table 5.2) were spotted in quadruplicate, along with 3 control probes with no significant complementarity to any of the 7 transcripts, to determine the basic performance characteristics of the U-STAR array for a 40 µl sample. The concentration ratio of JOE-T7 to JOE-T1 was fixed at 10,000:1, and the total concentration of STARtags was varied over 3 orders of magnitude. As the surface density and hybridization properties of the mix-mer probes were unknown, hybridization of target mixtures was initially performed under the slide manufacturer’s recommended conditions (42 °C, overnight). Scanning of these arrays revealed that changes in fluorescence signal at complementary registers can be observed for T7 concentrations as high as 10 to possibly 100 nM and for T1 concentrations as low as 10 fM (Figure 5.4.A). In this respect, it is relevant to note that in a standard SAGE experiment, 100 ng of starting mRNA material (0.16 pmol for 2 kb average length mRNA) is typically utilized to prepare the SAGE library.
This corresponds to a total target concentration of 4 nM in the 40 µl sample used for U-STAR hybridization, indicating the potential of the technology to detect transcript abundances over at least 4 orders of magnitude without the need for transcript or SST amplification, provided the processing steps used to generate the STARtag library are reasonably efficient. Significant cross-hybridization at non-complementary registers was observed, possibly due to the low annealing temperature (Ta) relative to the common Tm of the array for PM duplexes, and as noted before this could diminish the effectiveness of the U-STAR array. A thermal window from 45 °C to 60 °C was therefore screened to determine a Ta that could provide an optimal compromise between array sensitivity and specificity. A Ta of 55 °C was thereby identified for our 14-mer mix-mer probes characterized by a common Tm of 66 °C (Figure 5.4.B). At this Ta, cross-hybridization effects are significantly reduced. However, some mismatched signals remain, including that corresponding to a 5’-terminal mismatch (P1_MM1), which exhibits a thermal stability comparable to that of the perfectly matched probe, underlining the potential value of further refinements in probeset design through the establishment and use of a robust model for predicting Tm’s of duplexes containing one or more mismatched bases.

5.4.3. Evaluation of scanning and analysis properties for maximal dynamic range

While the surface fluorescence data in Figure 5.4.A indicate that changes in U-STAR register fluorescence intensity can be detected over 4 to 5 decades in transcript concentration, the response is not linear with solution concentration over this entire range.
To establish the operational limits of the U-STAR array for absolute quantitative transcriptomics, including the true dynamic range and detection limit of the platform, scanning and analysis parameters were assessed for the Genepix 4200AL scanner using an evaluation slide consisting of 2-fold serial dilutions of Cy3 in replicates of 12 adsorbed on the surface, scanned under a series of increasing PMT settings (Figure 5.5). Despite the 4.5 orders of magnitude range of signal intensities that can theoretically be recorded in a 16-bit (i.e., 0 to 65535) TIFF image, background fluorescence typically limits the range of distinguishable fluorescence intensities to ~3.5 orders of magnitude at PMT voltage settings of 200-250 V, with measured median intensities varying less than ± 12% for any given fluorophore density within this range (Figure 5.5.A: left panel). At higher PMT settings, the dynamic range is reduced due to the increase in median background signal at the lower end and the saturation of pixels at the upper end of the scale, making a PMT setting of 250 V optimal for this measurement mode. At this setting, pairwise comparisons of median intensities (Student’s t-test, two-tailed) indicate that, while fluorescence intensity depends linearly on fluorophore density over 3.0 orders of magnitude, changes in median fluorescence intensities arising from 2-fold differences in fluorophore concentrations can nonetheless be detected with very high significance (p < 0.001) across a ~4-order of magnitude range of fluorophore concentrations (Figure 5.5.A: right panel). At these PMT settings, the lower limit of detection for the Genepix scanner for reliable detection of 2-fold differences in concentration lies at a value of 10 fluorophores/µm².
Measurement of total intensities (Figure 5.5.B) instead of median intensities results in a modest improvement in dynamic range, primarily due to improved discrimination of low intensity signals at a PMT setting of 300 V or slightly higher (up to 550 V). Pairwise comparisons of total intensities (Student’s t-test, two-tailed) reveal that 2-fold differences in fluorescence intensities can be detected with high significance (p < 0.001) over slightly greater than 4 logs of fluorophore density, with a lower limit of detection of 1 fluorophore/µm².

5.4.4. Construction of a universal calibration curve for U-STAR

Using these scanner properties to guide U-STAR platform development, a calibration curve relating total fluorescence intensity to STARtag concentration was created by hybridizing 2-fold serial dilutions of synthetic STARtag calibration mixtures (T1, T4, and T7) at equimolar concentrations (0.6 pM to 5 µM) to the prototype array at 55 °C (Figure 5.6). The arrays were scanned across a range of PMT settings and pairwise comparisons of total fluorescence intensities at each STARtag concentration were used to create a plot of P-values (Figure 5.6.A: left panel). While a number of PMT settings are capable of providing a calibration curve spanning 2.5 orders of magnitude for each STARtag, a PMT setting of 300 V, which was also found to be optimal in our scanner characterization studies, resulted in a calibration curve spanning ~3.5 orders of magnitude in tag concentration that allows 2-fold differences in concentration arising from hybridized STARtags (p < 0.001, two-tailed Student’s t-test) to be distinguished from measured intensities. Array signals are statistically significant at tag concentrations of 1.6 pM and above, making the calibration curve valid for STARtag concentrations between 1.6 pM and 5 nM (Figure 5.6.A: right panel).
As a result of the uniform Tm of PM duplexes formed at the U-STAR array surface, the independently measured calibration curves for STARtags T1, T4 and T7 largely overlap, indicating that a universal calibration curve can be constructed and applied to all U-STAR array registers to regress absolute transcript abundances from U-STAR register fluorescence intensities (Figure 5.6.B). The upper limit on correlating fluorescence intensity readings to fluorophore concentration imposed by the scanner means that an increase in dynamic range must be achieved by lowering the detection limit. While the measured detection limit of 1.6 pM for the current U-STAR platform is comparable to that of other microarray platforms [14-18, 35, 36, 94], it can potentially be reduced through the use of one or more lanthanide reporters, which are known to offer considerably better signal-to-noise characteristics, or through the utilization of single molecule detection systems [96]. This would also reduce the amount of starting material required for analysis via the U-STAR platform. Alternatively, the dynamic range of U-STAR technology could be increased through the use of a scanner with a broader response curve. However, with respect to the basic technology validation studies reported here, the fluorescent reporter and the array scanner are of sufficient quality to permit studies with the prototype U-STAR platform that confirm the potential of the technology to quantify transcript abundances on an absolute basis.

5.4.5. Performance of U-STAR in absolute abundance determination for STARtag mixtures

The ability of the U-STAR array to measure absolute abundances of STARtags was assessed by preparing binary STARtag mixtures of known composition in which the concentrations of the two STARtags (T1 and T7) were randomly varied between 1 pM and 1 nM.
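One simple way to apply a universal calibration curve of this kind is to interpolate between calibration points on log-log axes and then compare the estimate with the known concentration as a fold error. The sketch below is a minimal illustration under that assumption; it is not the regression procedure used in this work, and the calibration points are hypothetical values chosen only to make the example run.

```python
import bisect
import math

# Hypothetical calibration points: (concentration in pM, total intensity in
# arbitrary units). Illustrative values only, not data from this study.
CALIBRATION = [(1.6, 120.0), (16.0, 1100.0), (160.0, 10500.0), (1600.0, 98000.0)]


def concentration_from_intensity(intensity, calibration=CALIBRATION):
    """Invert a monotonic calibration curve by log-log linear interpolation."""
    log_c = [math.log10(c) for c, _ in calibration]
    log_i = [math.log10(i) for _, i in calibration]
    x = math.log10(intensity)
    if not log_i[0] <= x <= log_i[-1]:
        raise ValueError("intensity lies outside the calibration window")
    # Index of the upper calibration point bracketing x.
    k = min(max(bisect.bisect_right(log_i, x), 1), len(log_i) - 1)
    frac = (x - log_i[k - 1]) / (log_i[k] - log_i[k - 1])
    return 10 ** (log_c[k - 1] + frac * (log_c[k] - log_c[k - 1]))


def fold_error(estimated, actual):
    """Symmetric fold difference between an estimate and the known value."""
    return max(estimated / actual, actual / estimated)


estimate_pm = concentration_from_intensity(1100.0)
print(f"estimated concentration: {estimate_pm:.1f} pM")
```

Intensities outside the calibrated window are rejected rather than extrapolated, which mirrors the restriction of quantitative statements to the validated concentration range.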
PM register intensities were analyzed with the universal calibration curve (Figure 5.6.B) to estimate the solution concentrations of synthetic STARtags in each “transcriptome” sample (Figure 5.7). The results show that within the calibration window, the U-STAR array is capable of determining absolute concentrations of STARtags to within 1.5 ± 0.4-fold of their predetermined concentrations in the set of original samples. Somewhat poorer accuracy is naturally observed for STARtag concentrations lying at the extreme upper and lower limits of the calibration curve.

5.4.5.1. Quantitative preparation and purification of STARtags from mRNA

Under the SAGE protocol, many processing steps are required to present SSTs in a format amenable to identification and quantification via sequencing. Recent studies in our laboratory [27, 28] indicate that a number of these steps are characterized by significant yield losses, which can cumulatively lead to a significant measurement error among, or in some cases a complete loss of, lower abundance transcripts. While PCR is utilized to offset these losses and thereby generate enough material to present SSTs for analysis via sequencing, it does not reverse the associated measurement errors, and can introduce significant bias into the analysis due to differences in amplification efficiency. The ability to apply SSTs to interrogation via hybridization arrays, in contrast, allows a significant reduction in the number of steps required for their presentation and permits the development of a highly optimized series of processing steps for quantitative conversion of the starting mRNA material into STARtags. Following our success in developing an assay platform for STARtags that can quantitatively measure target concentrations between 1.6 pM and 5 nM, we examined the efficiency of the steps of the protocol used to prepare the STARtag analyte.
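The sample-size reasoning used in this chapter rests on simple mass-to-concentration arithmetic, sketched below for a 100 ng mRNA sample of 2 kb average length analyzed in a 40 µl hybridization volume. The molecular weight approximation for single-stranded RNA (320.5 g/mol per nucleotide plus 159 g/mol) is a commonly used rule of thumb and an assumption here, as is the simplification that each transcript yields exactly one STARtag; the function names are illustrative.

```python
def mrna_picomoles(mass_ng, length_nt):
    """Approximate pmol of single-stranded RNA from its mass and length."""
    # Rule-of-thumb molecular weight of ssRNA (assumption): 320.5 g/mol per
    # nucleotide plus 159 g/mol for the terminal groups.
    molecular_weight = length_nt * 320.5 + 159.0
    return mass_ng * 1e-9 / molecular_weight * 1e12


def startag_nanomolar(mass_ng, length_nt, volume_ul, cumulative_yield=1.0):
    """STARtag concentration (nM), assuming one tag per transcript."""
    pmol = mrna_picomoles(mass_ng, length_nt) * cumulative_yield
    return pmol / volume_ul * 1e3  # pmol/ul is uM; multiply by 1000 for nM


full_yield_nm = startag_nanomolar(100.0, 2000, 40.0)
quarter_yield_nm = startag_nanomolar(100.0, 2000, 40.0, cumulative_yield=0.25)
print(f"100% yield: {full_yield_nm:.2f} nM, 25% yield: {quarter_yield_nm:.2f} nM")
```

Under these assumptions, 100 ng of 2 kb mRNA corresponds to about 0.16 pmol, or roughly 3.9 nM in a 40 µl sample, and a 25 % cumulative yield would leave just under 1 nM of STARtags available for hybridization.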
Given that 100 ng of mRNA (typical mRNA sample size for SAGE) can at most yield 4 nM of STARtags in the analyte applied to the STARarray (100% yield in each processing step), a cumulative yield in STARtag preparation of greater than 25% is required by the U-STAR array to analyze a 3-order of magnitude range of transcript abundances without material amplification. Utilizing a polyadenylated in vitro transcript corresponding to the rat brain a-transcription factor, we therefore monitored the stepwise yields obtained under the STARtag production protocol and examined conditions that would maximize yields at each step of the protocol (Figure 5.8).

5.4.5.2. Quantitative recovery of anchored cDNA synthesis products

In the steps involved in anchored cDNA synthesis, the amount of cDNA product in the solution phase was found to be determined primarily by the incubation temperature utilized in the first strand reaction, and was greater at 37 °C than at 42 °C or 50 °C irrespective of the reverse transcriptase utilized. The amounts of cDNA product formed were equivalent at 37 °C across all the MMLV RNase H⁻ RTs studied. However, the RTs SSII and SSIII yielded a number of spurious products during first strand synthesis that could be detected with SYBR Green II staining, while the RTs AS and MS produced only minor amounts of unwanted cDNA truncation products (Figure 5.8.A). Utilization of a series of incubation steps of increasing temperature with AS and MS RTs (25 °C, 10 min; 37 °C, 20 min; 42 °C, 30 min; 50 °C, 30 min; 55 °C, 30 min) was found to further reduce the formation of truncated cDNA products to undetectable levels, likely by minimizing secondary structure formation. As a result, near-complete conversion of polyadenylated RNA was achieved based on comparison of band intensities.
From the perspective of protocol development, it is important to note that similar yields can be obtained with the use of oligo(dT) Dynabeads to prime first strand synthesis (1.25 µM effective concentration, Tm = 65 °C), and that these yields of surface-bound 3’-end cDNA were consistently greater than those obtained for formation of biotinylated 3’-end cDNA bound to paramagnetic supports modified with streptavidin (streptavidin Dynabeads or streptavidin ferrofluids).

5.4.5.3. Directed ligation chemistry for creation and release of STARtags

Following cDNA synthesis (Figure 5.8.A), ligation of the STARlink adaptor to the anchored 3’-end cDNA is achieved by application of a “one-pot” adaptation of directed ligation chemistry (DLC) that results in near-complete conversion to and recovery of the desired STARlink-modified 3’-end cDNA product. In this “one-pot” adaptation of DLC, digestion of the anchored cDNA library with an ARE and ligation with the STARlink adaptor are performed simultaneously. In the absence of ligase, DLC results in the complete digestion of the anchored cDNA (Figure 5.8.B), but in the presence of ligase, all resulting internal fragments are shifted in molecular weight by ~34 bp (i.e. capped at both ends by the STARlink adaptor), while the most 5’- and 3’-end fragments are shifted in molecular weight by ~17 bp (Figure 5.8.C). Removal of the solution phase (Figure 5.8.C.i) and washing of the Dynabead support yields the STARlink-modified 3’-end cDNA (Figure 5.8.C.ii). Subsequent incubation with the Type IIS RE Acu I cleaves the STARtag from the anchored 3’-end library with high efficiency (Figure 5.8.D). Recoveries of STARtags from the solution phase were found to be highest and obtained with the greatest consistency when filtered through a filtration device with a 0.45 µm cellulose membrane to remove any debris that may cause background fluorescence.
While other methods were explored to permit the utilization of specific hybridization buffers for the recovered STARtags, including phenol extraction and ethanol precipitation, or buffer exchange through microfiltration devices with a low molecular weight cut-off (3000 daltons), these methods were found to be prone to human error and led to more inconsistent and invariably lower yields.

5.4.5.4. Application of the STAR platform to a mixture of polyadenylated RNA

Four mixtures of polyadenylated RNA spanning three orders of magnitude were created and processed in duplicate according to the U-STAR protocol to generate corresponding STARtags. These were then incubated with prototype arrays to assess the overall performance of the U-STAR platform in providing absolute quantitative measurements of transcript abundance (Figure 5.9). A "control" RNA, T4, was introduced into each of these mixtures as an internal calibrant (10 nM) to permit the normalization of hybridization signals across the array. Quantification of hybridization intensities against the calibration curve revealed a range of cumulative STARtag yields (3.7% to 49.2%) across all mixtures. Despite the observed variability, the magnitude of these yields represents a significant improvement in the overall efficiency of analyte preparation compared to that found in the corresponding steps of the SAGE protocol leading up to SAGE tag release, and is sufficient to obviate the need for an amplification step that could introduce bias in the platform.

Performance of the U-STAR platform in recapitulating the original target concentrations from the observed hybridization intensities indicates a strong correlation between the inferred and actual concentrations of target, with a 95% confidence interval for the slope of [0.99359, 1.011] (Figure 5.9).
However, the presence of mismatched probes for T1 in this prototype array led to a strong underestimation of the concentration of this target, underlining the need to design more stringent probes to minimize MM hybridization.

5.5. CONCLUSION

The accuracy of any analytical method for determining composition relies on the efficiency of the sample preparation process and on how successfully it preserves the distribution of the components in the original sample. It also requires a defined correlation between the output of the detector (digital or analogue) and the quantity or concentration of a particular component present within the analyte. Here, we have described significant steps towards the development of a transcriptomics platform that is capable of providing accurate absolute measurements of transcript abundance. This platform, termed the U-STAR platform, achieves this goal by utilizing a hybridization array displaying novel probe chemistry to interrogate SSTs (STARtags) generated from an original mRNA sample. Though still in development, the U-STAR array is capable of detecting absolute transcript numbers with high accuracy over a three-order-of-magnitude dynamic range. The results therefore point to the potential for high-throughput, system-wide analysis of a transcriptome on an absolute (per cell) scale, but further development is required in order to fully exploit the capabilities of the U-STAR array.

Table 5.1. Costing of materials and reagents specific to the U-STAR platform. Costs are based on current list pricing for reagents. Costs associated with LNA synthesis and purification are based on fully substituted probes, and represent the maximum possible cost of synthesis.

Component                    Details                               Initial cost      No. of assays   Cost per assay
U-STAR array                 LNA probes                            $153,000,000.00   980000          $156.12
                             Nexterion slide H slides              $481.25           25              $19.25
                             subtotal                                                                $175.37
cDNA synthesis               oligo dT paramagnetic beads (5 ml)    $200.00           50              $4.00
                             MultiScribe RT                        $78.00            10              $7.80
                             Superase In (10,000 U)                $312.00           100             $3.12
                             DNA polymerase I (2500 U)             $393.00           62.5            $6.29
                             E. coli DNA ligase (10000 U)          $212.00           100             $2.12
                             RNase H (1250 U)                      $244.00           625             $0.39
                             subtotal                                                                $23.72
Directed ligation chemistry  Nla III (2500 U)                      $244.00           125             $1.95
                             T4 DNA ligase (100,000 U)             $252.00           500             $0.50
                             STAR adaptor                          $561.03           953             $0.59
                             subtotal                                                                $3.04
STARtag release              Acu I (1000 U)                        $232.00           100             $2.32
                             COSTAR Spin-X filter tubes            $424.50           200             $2.12
                             subtotal                                                                $4.44
TOTAL                                                                                                $413.16

Table 5.2. Model RNA transcripts used in this study. List of RNA control spikes indicating sites for Nla III cleavage, and corresponding sequence tags derived from the 3'-end of each transcript. (The schematic of cleavage sites along each 5'-transcript-3' is not recoverable from the text; only the sequence tags are listed.)

RNA       Sequence tag
Spike 1   GCCTTTGGTG
Spike 2   (no tag shown)
Spike 3   ACCGGGATTT
Spike 4   CGTAGATAAC
Spike 5   CTGAGCCTGG
Spike 6   ATGTCGGCGT
Spike 7   CACTGAACCT
Spike 8   CTGGAAGAGA

Table 5.3. Properties of the LNA/DNA mix-mer probe set. Probes were designed using NN models to have a melting temperature of 65 °C at 50 µM strand concentration in 0.5x SSC. Thermodynamic properties were then verified by UV-M analysis with complementary DNA oligonucleotides.

Seq ID   Target / probe (a)                 Ct (µM) (b)   Tm predicted (°C) (c, d)   ΔH (kcal/mol)   ΔS (cal/mol K)   Tm measured (°C)
T1/P1    catggcctttggtg / caCcaaaggccatg    114.6         65/62.5                    91.3            249.4            64.7
T3/P3    catgaccgggattt / aaatcCcggTcatg    115.0         64/62.5 (58)               98.0            268.0            66.0
T4/P4    catgcgtagataac / gtTaTcTaCgcatg    117.1         64/62.3 (53)               90.5            245.1            67.2
T5/P5    catgctgagcctgg / ccaggcTcagcatg    120.5         66/63.2 (64)               92.1            250.4            66.7
T6/P6    catgatgtcggcgt / acgccgacaTcatg    117.9         65/65.6 (63)               97.9            266.0            68.2
T7/P7    catgcactgaacct / aGgttcaGtgcatg    118.8         66/65.3 (59)               101.7           280.0            65.1
T8/P8    catgctggaagaga / tCtcTtcCagcatg    118.8         65/66.2 (57)               89.1            242.0            66.1

(a) LNA substituents are indicated in uppercase, with probe sequences terminated at the 5'-end with an amino-hexyl linker.
(b) Total strand concentrations calculated from the post-transition absorbance deduced from UV-M scans.
(c) Predicted melting temperatures at 75 mM NaCl based on the neural-net algorithm of Tolstrup et al. (www.lnatools.com) (ref. 85) / the modified nearest-neighbor model of McTigue et al. (ref. 82).
(d) Values in parentheses indicate predicted melting temperatures for the equivalent unmodified DNA probe.

Table 5.4. Properties of mismatched LNA/DNA mix-mer probes. Probes were designed using NN models to have a melting temperature of 65 °C at 50 µM strand concentration in 0.5x SSC with their perfect complements. Thermodynamic properties were assessed via UV-M analysis with complementary DNA oligonucleotides at 50 µM and 5 µM total strand concentrations.

Seq ID    Sequence (a)      Tm (°C), 50 µM / 5 µM            ΔTm (°C), 50 µM / 5 µM
T1        catggcctttggtg    (target)
P1        caCcaaaggccatg    64.02 ± 0.01 / 60.04 ± 0.03
P1MM1     caGcaaaggccatg    53.59 ± 0.53 / 49.15 ± 0.03      10.43 / 10.89
P1MM2     caccTaaggCcatg    57.49 ± 0.59 / 53.19 ± 0.03      6.53 / 6.85
P1MM3     caccaaTggccatg    54.52 ± 0.56 / 50.25 ± 0.04      9.50 / 9.79
P1MM4     caCcaaagcccatg    46.72 ± 0.56 / 42.05 ± 0.03      17.30 / 17.99
T4        catgcgtagataac    (target)
P4        gtTaTcTaCgcatg    68.77 ± 0.61 / 63.10 ± 0.03
P4MM1     gtTaTctCcgcatg    54.22 ± 0.84 / 49.22 ± 1.52      14.55 / 13.88

(a) LNA substituents are indicated in uppercase, with underlined bases corresponding to mismatched bases.
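The thermodynamic parameters reported in Table 5.3 can be related to the measured melting temperatures through the standard two-state (van't Hoff) expression for a non-self-complementary duplex, Tm = ΔH / (ΔS + R ln(Ct/4)). The sketch below is illustrative rather than the analysis actually performed in this work. It assumes that the tabulated ΔH and ΔS are magnitudes of negative quantities (duplex formation), that Ct is the total strand concentration with both strands present in equal amounts, and that the final column of the table is the measured Tm; the function and variable names are hypothetical.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol K)


def two_state_tm_celsius(dh_kcal, ds_cal, ct_molar):
    """Melting temperature (deg C) of a non-self-complementary duplex.

    dh_kcal and ds_cal are taken as magnitudes of the (negative) enthalpy and
    entropy of duplex formation; ct_molar is total strand concentration (mol/L).
    """
    dh_cal = -abs(dh_kcal) * 1000.0
    ds = -abs(ds_cal)
    tm_kelvin = dh_cal / (ds + R_CAL * math.log(ct_molar / 4.0))
    return tm_kelvin - 273.15


# (probe, |dH| kcal/mol, |dS| cal/(mol K), Ct in micromolar, tabulated Tm in deg C)
TABLE_5_3 = [
    ("P1", 91.3, 249.4, 114.6, 64.7),
    ("P3", 98.0, 268.0, 115.0, 66.0),
    ("P4", 90.5, 245.1, 117.1, 67.2),
    ("P5", 92.1, 250.4, 120.5, 66.7),
    ("P6", 97.9, 266.0, 117.9, 68.2),
    ("P7", 101.7, 280.0, 118.8, 65.1),
    ("P8", 89.1, 242.0, 118.8, 66.1),
]

if __name__ == "__main__":
    for probe, dh, ds, ct_um, tm_table in TABLE_5_3:
        tm_calc = two_state_tm_celsius(dh, ds, ct_um * 1e-6)
        print(f"{probe}: calculated {tm_calc:.1f} C, tabulated {tm_table:.1f} C")
```

Under these assumptions the calculated values fall within a few tenths of a degree of the tabulated melting temperatures, which is consistent with the reading of the table columns adopted here.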
Figure 5.1. Outline of STARtag generation and application. A. General features of the STARlink adaptor permit the generation of STARtags through four processing steps (cDNA synthesis, sampling via restriction digest, adaptor ligation, and STARtag release) for immediate interrogation on a U-STAR hybridization array. B. Combination of multiple STARtag sets generated from different anchoring restriction enzymes increases transcriptome coverage. Putative STARtag sets generated from the transcriptome of S. cerevisiae using either Nla III (5'-CATG-3') or Tai I (5'-ACGT-3') are shown schematically (left panel). Independently, the set of STARtags generated from either restriction enzyme (shown in red and green) covers less than 80% of the transcriptome of S. cerevisiae, as indicated by the colour map (right panel). In contrast, combination of the sequence information (yellow) from both STARtag sets enables more complete coverage of the transcriptome (typically greater than 95%).

Figure 5.2. General features of the STARarray for interrogating STARtags. The hybridization array consists of a combinatorial library of nonamers, as well as an optional 3'-end four-base sequence corresponding to the restriction enzyme used to anchor the STARtag. The use of DNA as probes creates an array with a wide range of thermal stabilities, which requires the use of a hybridization temperature (Ta) below 20 °C. This renders the GC-rich probes non-specific, as mismatched targets are able to hybridize. Programmed incorporation of LNAs into the probe sequences permits an increase in the melting temperature of PM duplexes for each probe to a common, uniform Tm (ref. 80).

Figure 5.3. UV-M studies on melting thermodynamics of STARprobes. Melting thermogram (plotted against temperature in °C) of STARprobes P1 to P8 with their oligonucleotide complements (total concentration 115 µM), illustrating the ability to tune the melting thermodynamics of a set of probes to a uniform melting window.

Figure 5.4. Dynamic range, detection limit, and mismatch binding properties of the prototype U-STAR array. A. PM register intensities for a dilution series of JOE_T1 and JOE_T7 targets hybridized at a temperature of Ta = 42 °C, indicating the ability of the platform to capture as little as 10 fM of target. B. Hybridization of mismatched probe sequences to 10 nM JOE_T1 and 1 pM JOE_T7 at various Ta indicates a decrease in the relative amount of target hybridized to mismatch sequences compared to perfect matches with an increase in Ta. Mismatches within the probe sequence are indicated via the probe identifier, where, for example, "P7_2M59" indicates a probe complementary to target T7 with 2 mismatched bases, at positions 5 and 9 along the probe sequence.

Figure 5.5. Evaluation of scanning and analysis parameters for the Genepix 4200 AL scanner. Scanner settings providing maximal dynamic range were established using a Cy3 calibration slide as described in Methods. After image acquisition, TIFF files were analyzed with Genepix Pro 6.1 using manually aligned feature settings. Median (A) or total (B) fluorescence intensities were recorded for each fluorophore concentration at increasing PMT settings (100 V to 900 V, 50 V increments) and plotted as a function of fluorophore density (left panel). Pairwise comparisons of fluorescence measurements were used to establish probability cutoffs for significance (two-tailed Student's t-test) in the detection of 2-fold differences in fluorophore intensity as a function of PMT settings (right panel).

Figure 5.6. Calibration curves for the U-STAR platform. A. Two-fold serial dilutions of JOE_T1, JOE_T4, and JOE_T7 under hybridization conditions minimizing mismatch hybridization (final Ta = 55 °C) were scanned under a range of PMT settings (left panel) to establish optimal scan settings. Calibration curves utilized a PMT setting of 300 V (right panel). B. The high uniformity across the target dilution series enables the creation of a universal calibration curve for the entire probeset. However, scanner properties limit the measurable dynamic range to ~3.5 orders of magnitude, between 1.6 pM and 5 nM, for the detection of 2-fold changes in target concentration (p < 0.001, Student's t-test).

Figure 5.7. Performance of U-STAR in quantifying absolute STARtag concentrations. Various mixtures of synthetic STARtags were interrogated via hybridization onto prototype STARarrays, and target concentrations were determined from the universal calibration curve (Figure 5.6). Measured concentrations of A. T1 and B. T7 are shown in pink, versus the original concentrations shown in purple, for each mixture ID.

Figure 5.8. Analysis of step-wise yields obtained under the STARtag production protocol. 100 nanograms of a polyadenylated in vitro transcript derived from the a-factor gene (X65948) were processed through the STARtag production protocol. A. SYBR green I staining of cDNA synthesis products reveals that Superscript II/III yield spurious double-stranded products. B. Post-Nla III digestion of anchored cDNA yields [...] C. Application of DLC to the ligation of STARlink adaptors yields 40% of the desired adaptor-modified product. D. Subsequent digestion with the Type IIS enzyme Acu I leads to near-complete digestion of the adaptor-modified cDNA and release of the STARtag.

Figure 5.9. Overall performance of the U-STAR platform in absolute quantification of transcript abundances. Various mixtures (~100 ng) of polyadenylated RNA (ArrayControl spikes) were processed through the STAR protocol, and the resulting STARtags were hybridized onto prototype arrays. A strong correlation between the calculated concentration and the original concentration (pM) was obtained with the platform. 95% confidence intervals are shown in dark blue, with 95% prediction lines in light blue.
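The calibration-based quantification summarized in Figures 5.6 and 5.9 can be illustrated with a short sketch. It assumes, as a simplification not stated in the text, that the universal calibration curve is linear on log-log axes; the dynamic range limits (1.6 pM to 5 nM) are taken from Figure 5.6, while the function names and the synthetic dilution series are hypothetical and serve only to exercise the code. It is not the analysis pipeline used in this work.

```python
import math


def fit_loglog_calibration(concs_pm, intensities):
    """Least-squares fit of log10(intensity) = intercept + slope * log10(conc)."""
    xs = [math.log10(c) for c in concs_pm]
    ys = [math.log10(i) for i in intensities]
    n = len(xs)
    if n < 2:
        raise ValueError("at least two calibration points are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def infer_concentration_pm(intensity, intercept, slope, lo_pm=1.6, hi_pm=5000.0):
    """Invert the calibration curve; return None outside the calibrated range."""
    conc = 10.0 ** ((math.log10(intensity) - intercept) / slope)
    return conc if lo_pm <= conc <= hi_pm else None


def synthetic_signal(conc_pm):
    """Hypothetical detector response used only to generate example data."""
    return 2.0 * conc_pm ** 0.9


if __name__ == "__main__":
    # Two-fold serial dilution starting at 5 nM (5000 pM); synthetic intensities.
    concs = [5000.0 / 2 ** k for k in range(12)]
    a, b = fit_loglog_calibration(concs, [synthetic_signal(c) for c in concs])
    print(f"intercept = {a:.4f}, slope = {b:.4f}")
    print("inferred concentration (pM):",
          infer_concentration_pm(synthetic_signal(100.0), a, b))
```

Returning None for values outside the calibrated window mirrors the restriction of quantitative claims to the measurable dynamic range of the scanner.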
5.7. REFERENCES

1. Velculescu, V.E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
2. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484-487 (1995).
3. Adams, M.D. Serial analysis of gene expression: ESTs get smaller. Bioessays 18, 261-262 (1996).
4. Southern, E.M. DNA microarrays. History and overview. Methods Mol Biol 170, 1-15 (2001).
5. Shalon, D., Smith, S.J. & Brown, P.O. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res 6, 639-645 (1996).
6. Bertucci, F. et al. Sensitivity issues in DNA array-based expression measurements and performance of nylon microarrays for small samples. Hum Mol Genet 8, 1715-1722 (1999).
7. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995).
8. Lipshutz, R.J., Fodor, S.P., Gingeras, T.R. & Lockhart, D.J. High density synthetic oligonucleotide arrays. Nat Genet 21, 20-24 (1999).
9. Wang, S.M. Understanding SAGE data. Trends Genet 23, 42-50 (2007).
10. Consortium, M. et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151-1161 (2006).
11. Dinel, S. et al. Reproducibility, bioinformatic analysis and power of the SAGE method to evaluate changes in transcriptome. Nucleic Acids Res 33, e26 (2005).
12. van Ruissen, F. et al. Evaluation of the similarity of gene expression data estimated with SAGE and Affymetrix GeneChips. BMC Genomics 6, 91 (2005).
13. Haverty, P.M., Hsiao, L.L., Gullans, S.R., Hansen, U. & Weng, Z. Limited agreement among three global gene expression methods highlights the requirement for non-global validation. Bioinformatics 20, 3431-3441 (2004).
14. Hekstra, D., Taussig, A.R., Magnasco, M. & Naef, F.
Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res 31, 1962-1968 (2003).
15. Dudley, A.M., Aach, J., Steffen, M.A. & Church, G.M. Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci USA 99, 7554-7559 (2002).
16. Kanno, J. et al. "Per cell" normalization method for mRNA measurement by quantitative PCR and microarrays. BMC Genomics 7, 64 (2006).
17. Yuen, T., Wurmbach, E., Pfeffer, R.L., Ebersole, B.J. & Sealfon, S.C. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res 30, e48 (2002).
18. Chudin, E. et al. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol 3, RESEARCH0005 (2002).
19. Gygi, S.P., Rochon, Y., Franza, B.R. & Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-1730 (1999).
20. Jongeneel, C.V. et al. Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci USA 100, 4702-4705 (2003).
21. Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18, 630-634 (2000).
22. Stern, M.D., Anisimov, S.V. & Boheler, K.R. Can transcriptome size be estimated from SAGE catalogs? Bioinformatics 19, 443-448 (2003).
23. Stollberg, J., Urschitz, J., Urban, Z. & Boyd, C.D. A quantitative evaluation of SAGE. Genome Res 10, 1241-1248 (2000).
24. Lewin, B. Gene expression, Edn. 2d. (Wiley, New York; 1980).
25. Liu, F. et al. Comparison of hybridization-based and sequencing-based gene expression technologies on biological replicates. BMC Genomics 8, 153 (2007).
26. Stolovitzky, G.A. et al. Statistical analysis of MPSS measurements: application to the study of LPS-activated macrophage gene expression.
Proc Natl Acad Sci USA 102, 1402-1407 (2005).
27. So, A.P., Turner, R.F. & Haynes, C.A. Increasing the efficiency of SAGE adaptor ligation by directed ligation chemistry. Nucleic Acids Res 32, e96 (2004).
28. So, A.P., Turner, R.F. & Haynes, C.A. Minimizing loss of sequence information in SAGE ditags by modulating the temperature dependent 3' --> 5' exonuclease activity of DNA polymerases on 3'-terminal isoheptyl amino groups. Biotechnol Bioeng 94, 54-65 (2006).
29. Larkin, J.E., Frank, B.C., Gavras, H., Sultana, R. & Quackenbush, J. Independence and reproducibility across microarray platforms. Nat Methods 2, 337-344 (2005).
30. Shi, L. et al. Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics 6 Suppl 2, S12 (2005).
31. Woo, Y. et al. A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of biomolecular techniques: JBT 15, 276-284 (2004).
32. Wang, H., He, X., Band, M., Wilson, C. & Liu, L. A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics 6, 71 (2005).
33. Chen, J., Hsueh, H.M., Delongchamp, R., Lin, C.J. & Tsai, C.A. Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data. BMC Bioinformatics 8, 412 (2007).
34. Klebanov, L. & Yakovlev, A. How high is the level of technical noise in microarray data? Biol Direct 2, 9 (2007).
35. Rouse, R.J., Espinoza, C.R., Niedner, R.H. & Hardiman, G. Development of a microarray assay that measures hybridization stoichiometry in moles. BioTechniques 36, 464-470 (2004).
36. Shippy, R. et al. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nature Biotechnology 24, 1123-1131 (2006).
37. Sawada, A., Mizufune, S., Kai, N., Tokeshi, M. & Baba, Y. Evaluation of amplified cRNA targets for oligonucleotide microarrays.
Analytical and bioanalytical chemistry 387, 2645-2654 (2007).
38. Wagner, F. & Radelof, U. Performance of different small sample RNA amplification techniques for hybridization on Affymetrix GeneChips. J Biotechnol 129, 628-634 (2007).
39. Viale, A. et al. Big results from small samples: evaluation of amplification protocols for gene expression profiling. Journal of biomolecular techniques: JBT 18, 150-161 (2007).
40. Boelens, M.C. et al. Microarray amplification bias: loss of 30% differentially expressed genes due to long probe poly(A)-tail distances. BMC Genomics 8, 277 (2007).
41. Duftner, N., Larkins-Ford, J., Legendre, M. & Hofmann, H.A. Efficacy of RNA amplification is dependent on sequence characteristics: implications for gene expression profiling using a cDNA microarray. Genomics (2007).
42. Ma, C. et al. In vitro transcription amplification and labeling methods contribute to the variability of gene expression profiling with DNA microarrays. The Journal of molecular diagnostics: JMD 8, 183-192 (2006).
43. Subkhankulova, T. & Livesey, F.J. Comparative evaluation of linear and exponential amplification techniques for expression profiling at the single-cell level. Genome Biol 7, R18 (2006).
44. van Haaften, R.I. et al. Biologically relevant effects of mRNA amplification on gene expression profiles. BMC Bioinformatics 7, 200 (2006).
45. Chan, V., Graves, D.J. & McKenzie, S.E. The biophysics of DNA hybridization with immobilized oligonucleotide probes. Biophys J 69, 2243-2255 (1995).
46. Fotin, A.V., Drobyshev, A.L., Proudnikov, D.Y., Perov, A.N. & Mirzabekov, A.D. Parallel thermodynamic analysis of duplexes on oligodeoxyribonucleotide microchips. Nucleic Acids Res 26, 1515-1521 (1998).
47. Bruun, G.M., Wernersson, R., Juncker, A.S., Willenbrock, H. & Nielsen, H.B. Improving comparability between microarray probe signals by thermodynamic intensity correction. Nucleic Acids Res 35, e48 (2007).
48. Halperin, A., Buhot, A. & Zhulina, E.B.
Brush effects on DNA chips: thermodynamics, kinetics, and design guidelines. Biophys J 89, 796-811 (2005).
49. Gao, Y., Wolf, L.K. & Georgiadis, R.M. Secondary structure effects on DNA hybridization kinetics: a solution versus surface comparison. Nucleic Acids Res 34, 3370-3377 (2006).
50. Chen, C., Wang, W., Wang, Z., Wei, F. & Zhao, X.S. Influence of secondary structure on kinetics and reaction mechanism of DNA hybridization. Nucleic Acids Res 35, 2875-2884 (2007).
51. Lonergan, W., Whistler, T. & Vernon, S.D. Comparison of target labeling methods for use with Affymetrix GeneChips. BMC Biotechnol 7, 24 (2007).
52. Anderson, J.P., Angerer, B. & Loeb, L.A. Incorporation of reporter-labeled nucleotides by DNA polymerases. BioTechniques 38, 257-264 (2005).
53. Dorris, D.R. et al. A highly reproducible, linear, and automated sample preparation method for DNA microarrays. Genome Res 12, 976-984 (2002).
54. Kuwahara, M. et al. Systematic characterization of 2'-deoxynucleoside-5'-triphosphate analogs as substrates for DNA polymerases by polymerase chain reaction and kinetic studies on enzymatic production of modified DNA. Nucleic Acids Res 34, 5383-5394 (2006).
55. Tasara, T. et al. Incorporation of reporter molecule-labeled nucleotides by DNA polymerases. II. High-density labeling of natural DNA. Nucleic Acids Res 31, 2636-2646 (2003).
56. Lyng, H. et al. Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction. BMC Genomics 5, 10 (2004).
57. Shi, L. et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics 6 Suppl 2, S11 (2005).
58. Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 33, e175 (2005).
59. Li, X., He, Z. & Zhou, J. Selection of optimal oligonucleotide probes for microarrays using multiple criteria, global alignment and parameter estimation. Nucleic Acids Res 33, 6114-6123 (2005).
60.
Bertone, P. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242-2246 (2004).
61. Mockler, T.C. et al. Applications of DNA tiling arrays for whole-genome analysis. Genomics 85, 1-15 (2005).
62. Hu, M. & Polyak, K. Serial analysis of gene expression. Nature protocols 1, 1743-1760 (2006).
63. Wu, C., Carta, R. & Zhang, L. Sequence dependence of cross-hybridization on short oligo microarrays. Nucleic Acids Res 33, e84 (2005).
64. Lee, M.L., Kuo, F.C., Whitmore, G.A. & Sklar, J. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA 97, 9834-9839 (2000).
65. Li, C. & Wong, W.H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 98, 31-36 (2001).
66. Barczak, A. et al. Spotted long oligonucleotide arrays for human gene expression analysis. Genome Res 13, 1775-1785 (2003).
67. Ramakrishnan, R. et al. An assessment of Motorola CodeLink microarray performance for gene expression profiling applications. Nucleic Acids Res 30, e30 (2002).
68. Sakai, K., Higuchi, H., Matsubara, K. & Kato, K. Microarray hybridization with fractionated cDNA: enhanced identification of differentially expressed genes. Anal Biochem 287, 32-37 (2000).
69. Park, P.J. et al. Current issues for DNA microarrays: platform comparison, double linear amplification, and universal RNA reference. J Biotechnol 112, 225-245 (2004).
70. Lane, M.J. et al. The thermodynamic advantage of DNA oligonucleotide 'stacking hybridization' reactions: energetics of a DNA nick. Nucleic Acids Res 25, 611-617 (1997).
71. Riccelli, P.V. et al. Hybridization of single-stranded DNA targets to immobilized complementary DNA probes: comparison of hairpin versus linear capture probes. Nucleic Acids Res 29, 996-1004 (2001).
72. Broude, N.E., Woodward, K., Cavallo, R., Cantor, C.R. & Englert, D.
DNA microarrays with stem-loop DNA probes: preparation and applications. Nucleic Acids Res 29, E92 (2001).
73. Owczarzy, R. et al. Predicting sequence-dependent melting stability of short duplex DNA oligomers. Biopolymers 44, 217-239 (1997).
74. Breslauer, K.J., Frank, R., Blocker, H. & Marky, L.A. Predicting DNA duplex stability from the base sequence. Proc Natl Acad Sci USA 83, 3746-3750 (1986).
75. SantaLucia, J. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci USA 95, 1460-1465 (1998).
76. Barone, A.D. et al. Photolithographic synthesis of high-density oligonucleotide probe arrays. Nucleosides Nucleotides Nucleic Acids 20, 525-531 (2001).
77. Hughes, T.R. et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol 19, 342-347 (2001).
78. Chou, C.C., Chen, C.H., Lee, T.T. & Peck, K. Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression. Nucleic Acids Res 32, e99 (2004).
79. Wengel, J., Petersen, M., Frieden, M. & Koch, T. Chemistry of locked nucleic acids (LNA): Design, synthesis, and bio-physical properties. Letters in Peptide Science (2005).
80. Nielsen, K.E. et al. NMR studies of fully modified locked nucleic acid (LNA) hybrids: solution structure of an LNA:RNA hybrid and characterization of an LNA:DNA hybrid. Bioconjug Chem 15, 449-457 (2004).
81. Braasch, D.A. & Corey, D.R. Locked nucleic acid (LNA): fine-tuning the recognition of DNA and RNA. Chem Biol 8, 1-7 (2001).
82. McTigue, P.M., Peterson, R.J. & Kahn, J.D. Sequence-dependent thermodynamic parameters for locked nucleic acid (LNA)-DNA duplex formation. Biochemistry 43, 5388-5405 (2004).
83. Hansen, H.F., Olsen, O. & Koch, T. New standards in LNA synthesis. Nucleosides Nucleotides Nucleic Acids 22, 1273-1275 (2003).
84. You, Y., Moreira, B.G., Behlke, M.A. & Owczarzy, R. Design of LNA probes that improve mismatch discrimination.
Nucleic Acids Res 34, e60 (2006).
85. Tolstrup, N. et al. OligoDesign: Optimal design of LNA (locked nucleic acid) oligonucleotide capture probes for gene expression profiling. Nucleic Acids Res 31, 3758-3762 (2003).
86. Toegl, A., Kirchner, R., Grazer, C. & Wixforth, A. Enhancing results of microarray hybridizations through microagitation. Journal of biomolecular techniques: JBT 14, 197-204 (2003).
87. McKenzie, F., Faulds, K. & Graham, D. Sequence-specific DNA detection using high-affinity LNA functionalized gold nanoparticles. Small (Weinheim an der Bergstrasse, Germany) 3, 1866-1868 (2007).
88. Kierzek, E. et al. The influence of locked nucleic acid residues on the thermodynamic properties of 2'-O-methyl RNA/RNA heteroduplexes. Nucleic Acids Res 33, 5082-5093 (2005).
89. Vester, B. & Wengel, J. LNA (locked nucleic acid): high-affinity targeting of complementary RNA and DNA. Biochemistry 43, 13233-13241 (2004).
90. Rosenbohm, C. et al. LNA guanine and 2,6-diaminopurine. Synthesis, characterization and hybridization properties of LNA 2,6-diaminopurine containing oligonucleotides. Bioorg Med Chem 12, 2385-2396 (2004).
91. Elayadi, A.N., Braasch, D.A. & Corey, D.R. Implications of high-affinity hybridization by locked nucleic acid oligomers for inhibition of human telomerase. Biochemistry 41, 9973-9981 (2002).
92. Timofeev, E. & Mirzabekov, A. Binding specificity and stability of duplexes formed by modified oligonucleotides with a 4096-hexanucleotide microarray. Nucleic Acids Res 29, 2626-2634 (2001).
93. Sorokin, N.V. et al. Kinetics of hybridization on surface oligonucleotide microchips: theory, experiment, and comparison with hybridization on gel-based microchips. J Biomol Struct Dyn 24, 57-66 (2006).
94. Canales, R.D. et al. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol 24, 1115-1122 (2006).
95. Selvin, P.R. Principles and biophysical applications of lanthanide-based probes.
Annual review of biophysics and biomolecular structure 31, 275-302 (2002).
96. Hesse, J. et al. RNA expression profiling at the single molecule level. Genome Res 16, 1041-1045 (2006).
97. Huang, G. et al. Development of a confocal optical system design for molecular imaging applications of biochip. Int J Biomed Imaging 2007, 79710 (2007).
98. Hamilton, G. et al. A large field CCD system for quantitative imaging of microarrays. Nucleic Acids Res 34, e58 (2006).

CHAPTER 6
Summary of findings, future considerations and conclusions

6.1. SUMMARY OF FINDINGS

Obtaining a complete picture of the dynamics of the transcriptome, to enable further refinement of systems-wide models of cellular function, requires the development of technologies that are able to provide detailed and accurate measurements of transcript abundances on an absolute scale. Incremental improvements over the past decade in feature density and in the intelligent design of probesets, as well as the development of more consistent approaches to sample preparation and the utilization of less starting material, have enabled array-based platforms to reach a high level of reproducibility. However, despite these improvements, the ability to provide accurate quantitative measurements is fundamentally restricted by biases that arise from sequence- and length-dependent properties of the analyte (which directly influence the uniformity and intensity of labelling), from material amplification, and from probe-level interactions. This leaves the correspondence between hybridization signals and the actual quantities of transcripts within the original mRNA sample ill-defined, and is undoubtedly the major reason behind the low inter-platform correlations observed despite high intra-platform reproducibility.
For digital technologies, the serial nature of the costs accrued in producing technical replicates has generally limited analyses of their accuracy and precision to estimates based on sampling theory applied to a single replicate. Performance statistics have largely focused on establishing confidence windows for the tag identification process (i.e. base-calling via sequencing) within which sequence information can be considered accurate. Inherent to this, however, has been the tacit assumption that measurement errors arising from sequencing greatly exceed those introduced by sample preparation. Regrettably, the general assumption that errors and biases in the analyte are minimal prior to analysis via sequencing is not supported by experiment.

The work presented in this thesis provides a careful examination of potential sources of bias that may arise over the course of sample preparation in one widely used digital platform, namely SAGE technology. Through a careful analysis of the steps involved in creating and presenting an SST library, a number of steps within the protocol were found to generate inefficiencies that cumulatively lead to very significant losses of material and to a sampling bias characterized by an under-representation, or even complete absence, of low abundance transcripts in the final SST library. I have further shown that losses at various stages of the protocol can be overcome through the development of effective methods that provide complete conversion to, and recovery of, these important intermediates. In SAGE, however, a particularly significant source of bias is introduced during material amplification via the PCR. As a consequence, in part, of this amplification step, which is essential for generating sufficient amounts of material for interrogation of the SST representation via sequencing technology, the dynamic range of SAGE technology is restricted to two orders of magnitude and the correlation between replicates is poor.
This amplification bias, however, can be partially offset through the use of proofreading polymerases, as shown in this work, improving the chances of successfully using SAGE as a tool for absolute quantification of transcripts.

In essence, SAGE and other digital technologies involve two stages of sample preparation: first, the creation of an SST representation of the transcriptome, and second, the conversion of this SST population into a sequencing library for analysis. As the larger errors arise during the sequence of steps involved in converting the SST library into a format amenable to sequencing, the potential for applying an analogue-based approach to the analysis of SSTs was investigated. Lower cost and the potential for improved accessibility provide additional motivation for the development of an analogue technology. Termed the U-STAR platform, this adaptation of the principles of SAGE technology allows the entire SST population to be interrogated by hybridization onto an array such that sampling issues associated with digital approaches are mitigated. The fixed length of the SST targets not only permits the use of a single universal k-mer format for analysis via hybridization, but also unifies the mass transfer properties of each target in the analyte population. By eliminating the sequence-dependent hybridization thermodynamics through the introduction of LNAs into the probeset, and then eliminating the requirement for material amplification through improvements in analyte processing, a direct correlation between hybridization signal and SST concentration can be made within a 3-order of magnitude window between 1.6 pM and 5 nM, with the potential to detect as little as 1.2-fold differences in concentration (p < 0.001, Student's two-tailed t-test). The U-STAR platform thus represents the first description of a novel analytical platform that has the potential to provide absolute per cell measurements of the transcriptome on a global scale.

6.2. CONSIDERATIONS FOR FUTURE DEVELOPMENT

By leveraging the accurate and rapid data acquisition associated with array technology, the U-STAR platform has the potential to enable absolute transcriptomics analysis, including economical use as a clinical tool for personal health care. However, given the 6-order of magnitude range in transcript abundances typifying the eukaryotic transcriptome, a number of issues that limit its dynamic range must be addressed.

6.2.1. Enhancing mismatch discrimination

The number of samples that must be processed in a clinical setting generally demands the use of a high-throughput assay. Under the current timeline for sample analysis with the U-STAR platform, ~6 hrs are required to prepare an SST population for hybridization, followed by 20 hrs for SST hybridization to the array, and finally ~1 hr for post-hybridization processing and data analysis. While moderate improvements in sample processing speed can be achieved through optimization of the steps yielding STARtags, by far the more fruitful avenue to reduce data acquisition times is to reduce the time required for hybridization.
Unfortunately, the time for reaching equilibrium at microarray surfaces, particularly for low abundance targets, can exceed 20 hrs [1-4]. Time course studies have revealed that a rapid overshoot of non-specific (NS) or mismatch (MM) target hybridization occurs during the initial phase of the reaction, followed by a gradual increase in perfect match (PM) hybridization and a concomitant decrease in MM/NS hybridization [5]. However, multi-component analyses involving multiple targets and probes suggest that the time for 90% PM binding over MM binding may exceed thousands of hours, and that at high relative concentrations of MM versus PM targets, distinction between MM and PM target hybridization may be unlikely [6,7]. This is particularly relevant to the U-STAR platform as the complete set of potential MM probes to a given STARtag, encompassing 3^10 or ~60,000 probes, is present on the array. These multi-component binding dynamics can therefore define the attainable dynamic range of the hybridization assay, as the lowest concentration PM target must possess a free energy that exceeds that of the highest concentration MM target for a given register. Thus, while introduction of convective diffusion through peristaltic pumps [8], chaotic mixing [9,10], or acoustic micro-agitation [11] can decrease the incubation time required to reach equilibrium, these mixing methods do not necessarily lead to more specific signals. Recent simulations of multi-component hybridization events suggest that increases in hybridization temperature, while expected to decrease non-specific binding, in fact lead to a decrease in dynamic range as the differences between the dissociation rates of PM versus MM targets are reduced [6]. Instead, conditions that specifically enhance dissociation of low affinity complexes are more likely to lead to an overall increase in dynamic range [12]. In this regard, the U-STAR platform enjoys some potential advantages that need to be further assessed.
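The requirement that the lowest concentration PM target out-compete the highest concentration MM target can be made concrete with a rough equilibrium estimate. The sketch below assumes simple competitive Langmuir binding at a single probe, which is a simplification and not the kinetic models of refs 6 and 7: with association constants K = exp(-ΔG/RT), the PM signal matches the MM signal when K_PM·c_PM = K_MM·c_MM, so the free-energy gap must exceed RT·ln(c_MM/c_PM). The function names, the 25 °C default and the example concentrations are illustrative assumptions, not values from this work.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def occupancy(k_pm, c_pm, k_mm, c_mm):
    """Equilibrium occupancy of one probe by its PM and MM targets.

    Assumes competitive Langmuir binding: k_* are association constants
    (1/M) and c_* are free target concentrations (M).
    Returns (theta_pm, theta_mm).
    """
    denom = 1.0 + k_pm * c_pm + k_mm * c_mm
    return k_pm * c_pm / denom, k_mm * c_mm / denom


def required_ddg(conc_ratio, temp_kelvin=298.15):
    """Free-energy gap (kcal/mol) at which a PM target just matches the
    signal of an MM target that is conc_ratio times more concentrated."""
    return R_KCAL_PER_MOL_K * temp_kelvin * math.log(conc_ratio)


if __name__ == "__main__":
    # Gap needed to resolve abundance ranges of 2, 3 and 6 orders of magnitude.
    for orders in (2, 3, 6):
        gap = required_ddg(10.0 ** orders)
        print(f"{orders} orders of magnitude: gap of {gap:.2f} kcal/mol")
```

Under these assumptions the gap grows by about 1.4 kcal/mol per order of magnitude at 25 °C, which gives a sense of the mismatch penalty a probe design would need to supply to cover a given abundance range.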
Unlike hybridizations involving single-strand targets to probes, the hairpin-like structure of the STARtag can contribute to the specificity of the U-STAR platform through the introduction of a base-stacking energetic component to hybridization thermodynamics, involving the 5'-end of the STARtag hairpin and the 3'-end of the probe [13-15]. It may, however, also increase specificity by conferring a unique MM penalty for hybridization that arises from competing tendencies of the STARtag dangling end to adopt an induced B-like conformation versus an A-like duplex upon hybridization with the LNA probe [16]. Furthermore, the higher mismatch penalties for MM target hybridization associated with the utilization of LNAs in the U-STAR platform are likely to confer an inherently larger dynamic range for target discrimination. However, as this increase in MM penalty is context dependent [17], more robust prediction algorithms must be developed to permit the design of probes that maximize this MM penalty. Finally, other modified bases can be incorporated into the k-mer probe library, such as 2,6-diaminopurine [18] or their locked analogues [19,20], that increase PM stabilities through additional hydrogen bonding across the duplex. While these may also enhance MM penalties and therefore contribute to further expansion of the dynamic range of the platform, more thorough characterization of their thermodynamic contributions to PM stabilities and MM penalties must be performed [18,20].

The ability to engineer probes with higher mismatch penalties is but one strategy to extend dynamic range. Hybridization buffers of lower ionic content [21], high concentrations of denaturants such as formamide [22], or combinations of both [23] are known to increase MM discrimination of DNA probes in solution and on surfaces. Their impact on the hybridization of MM LNA probes of the U-STAR platform has yet to be characterized, but could prove fruitful.
A complementary approach is to exploit the electrostatic properties of DNA and induce electro-kinetic flow within the hybridization solution at fixed hybridization [24-26]. Introducing an electric field within the sample hybridization chamber by applying a current not only can greatly increase the rate of hybridization, but it is possible that an optimum current can be found to maximize mismatch discrimination [25,27].

6.2.2. Increasing detection sensitivity

The dynamic range of a probe-based platform is determined not only by mismatch hybridization patterns, but also by the combined properties of the fluorophore and the scanner used to record the fluorescence signal. For example, the digital encoding of hybridization signals into a 16-bit TIFF format rigidly limits the dynamic range to 4.5 orders of magnitude, a problem that can be overcome either through the utilization of compression functions, or the use of 32-bit TIFF formats. More significant, however, are the limits introduced by the biophysical properties of the chromophore used to detect hybridization and by the sensitivity of the recording device itself. The U-STAR platform, like many microarray-based platforms, is capable of measuring target concentrations near 1 nM with high precision and accuracy. However, above this target concentration, signals become less precise, requiring careful dilutions of STARtags to be performed to assay those STARtags at higher concentrations.
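The encoding limit mentioned at the start of this section can be checked with a short calculation. The sketch below assumes unsigned integer pixels and treats the lowest usable count as an adjustable floor; the function name and the `floor_counts` parameter are illustrative and do not come from any scanner software. With a floor of one count, a 16-bit pixel spans about 4.8 orders of magnitude. A floor of about two counts gives roughly the 4.5 orders quoted above, although that reading of the quoted figure is an assumption rather than something stated in the text.

```python
import math


def encoding_orders(bits, floor_counts=1.0):
    """Orders of magnitude between the largest value an unsigned integer
    pixel of the given bit depth can hold and the lowest usable count."""
    max_count = 2 ** bits - 1
    return math.log10(max_count / floor_counts)


if __name__ == "__main__":
    print(f"16-bit, floor of 1 count:  {encoding_orders(16):.2f} orders")
    print(f"16-bit, floor of 2 counts: {encoding_orders(16, 2.0):.2f} orders")
    print(f"32-bit, floor of 1 count:  {encoding_orders(32):.2f} orders")
```

On this estimate a 32-bit integer encoding would no longer be the limiting factor for a 6-order of magnitude transcriptome, leaving the fluorophore and detector as the practical constraints.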
The source of this response plateau is unclear, but may reflect the non-linear properties of the particular PMT utilized in the detector, or quenching of the fluorophore signal at high hybridization densities [29-31]. Indeed, there is some evidence that this may partially be dye-specific, as Cy5 is more prone to quenching at high label concentrations [31,32]. Thus, although algorithms designed to correct for this phenomenon can be utilized [33], the lack of a clear biophysical basis for this plateau in hybridization signal may introduce inaccurate estimates of high abundance transcripts.

Given an upper limit of 4 nM expected for any given STARtag generated from 100 ng of starting mRNA material, a more appropriate way to extend dynamic range is to lower the detection limit. Not only would this reduce the amount of starting material required, but it may also ease unwanted signal effects associated with PM/MM competition, as recent experimental evidence indicates that the impact of PM/MM competition on specificity is reduced at target concentrations below 400 pM [34]. Currently, the detection limit of the U-STAR platform is 1.6 pM. This is comparable to detection limits reported for a number of other microarray platforms. While some progress has been made in the development of sub-micron (200 nm) resolution scanners that facilitate the spatial discrimination of hybridization signals to accurately quantify sub-picomolar (~10 fM) target concentrations [35], the resulting images require specialized hardware and software for data storage and analysis. And although fibre-optic based bead-array formats provide a significant advantage in detection sensitivity (~10 fM), methods to encode and decode a k-mer library of beads without interference with the hybridization signal are limited [36,37]. More feasible approaches to increase signal-to-noise ratios for low abundance targets under current scanner resolutions (5 µm) are therefore desirable.
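The gap between the current concentration window and the range of the transcriptome can be quantified directly from the figures given above. The following is a minimal sketch using only the 4 nM ceiling, the 1.6 pM detection limit and the 6-order abundance range cited in this chapter; the constant and function names are illustrative.

```python
import math

UPPER_LIMIT_M = 4e-9         # 4 nM ceiling per STARtag from 100 ng of mRNA
DETECTION_LIMIT_M = 1.6e-12  # current 1.6 pM detection limit
TRANSCRIPTOME_ORDERS = 6     # abundance range of the eukaryotic transcriptome


def window_orders(upper_molar, lower_molar):
    """Width of a concentration window in orders of magnitude."""
    return math.log10(upper_molar / lower_molar)


def detection_limit_for(upper_molar, orders):
    """Detection limit (M) needed to span the given number of orders of
    magnitude below a fixed upper concentration limit."""
    return upper_molar / 10.0 ** orders


if __name__ == "__main__":
    current = window_orders(UPPER_LIMIT_M, DETECTION_LIMIT_M)
    needed = detection_limit_for(UPPER_LIMIT_M, TRANSCRIPTOME_ORDERS)
    print(f"current window: {current:.1f} orders of magnitude")
    print(f"limit needed for {TRANSCRIPTOME_ORDERS} orders: {needed * 1e15:.0f} fM")
```

With the ceiling fixed at 4 nM, covering six orders of magnitude would require a detection limit in the low femtomolar range, which is the same regime as the ~10 fM sensitivities quoted above for sub-micron scanners and bead arrays.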
Fortunately, a number of existing technologies can be readily adapted by the U-STAR platform to increase signal-to-noise ratios by either increasing signal intensity or reducing background noise. Generally, as a consequence of the combined configuration of the flat-surface array, the excitation laser source, and the PMT or CCD detector, only ~20% of the photons released by the fluorophore upon excitation can be recorded by the detector [38]. Simple redirection of the released photons through the use of reflective surfaces can improve the amount of captured photons to 80% [38], while coating of the array surface with silver or gold can increase the fluorescence signal by a factor of 4 to 8-fold by enhancing both the excitation and emission properties of the fluorophore [39,40]. In addition, background fluorescence can be reduced through the use of luminescent lanthanide chelates such as DTPA-Eu3+ [41] that can be readily introduced into the STARlink adaptor. These luminescent labels are generally characterized by a large Stokes shift (i.e. > 150 nm between peak excitation and emission wavelengths), tight emission spectra, and more significantly, long lifetime constants (1 µs to 1 ms) versus fluorescence (> 1 ns) that allow clear separation of the luminescence signal from background fluorescence through delayed time-gated acquisition [42-44].

6.2.3. Enabling robust interrogation of the transcriptome

The deconvolution of hybridization signals obtained from PM versus MM probes can also be aided by deriving multiple STARtags to uniquely identify, and therefore quantify, the abundance of a transcript. However, the restricted commercial availability of suitable Type II restriction enzymes creating a 3'-overhang (i.e. Nla III and Tai I) limits the breadth of STARtag sets that can be generated.
In contrast, enzymes creating 5'-overhangs are more numerous (Table 6.1), and therefore permit the creation of a more diverse and flexible set of STARtags that would enhance transcriptome coverage [45-47], not only for transcriptionally active regions (TARs) of a genome [48,49], but also for a wider range of organisms that vary in their global GC-content. This would of course require a redesign of the k-mer probeset, as partial or complete elimination of the 4-base common sequence in the probe corresponding to the ARE sequence would be required. However, this modification may improve the performance of the array, as this common sequence may introduce base-stacking interactions that stabilize the interaction of the STARtag with the probe despite the presence of mismatches at the 3'-end of the k-mer region of the probe [14,15].

6.3. CONCLUSIONS

The impact of high-throughput digital and analogue technologies on our understanding of the transcriptome has been enormous, providing rich insight into the intricacies governing cellular function. Much recent effort has focused on developing strategies to increase sample throughput so that data on the transcriptome can be garnered in less time, while developing more robust algorithms to mitigate potential sources of measurement error. In contrast, attempts to provide accurate measurements of transcript abundance on an absolute scale have been sparse, and have not led to any fruitful developments that have withstood careful scrutiny. Indeed, low inter-platform reproducibility is found among current state-of-the-art technologies despite higher intra-platform correlations, underlining that, despite the many improvements made to these technologies, the sources of their fundamental inefficiencies and biases have not yet been addressed.
Although comparative measurements of gene expression changes are useful in their own right, and have proven powerful in developing portraits of gene expression that correspond to disease states such as breast cancer, the evolutionary oversight in this particular aspect of transcriptomics technologies is striking given the potential impact that absolute transcript measurements will have on systems biology, and specifically, on the ability to develop accurate and predictive models of disease. Close examination of the fundamental chemistry underlying these platforms, however, suggests that their continued evolution into platforms capable of providing accurate absolute measurements of transcript abundance may be inherently capped, requiring the development of alternative strategies to overcome these deficiencies.

This thesis presents one such method by which some of the inefficiencies and biases found in both digital and analogue-based approaches to transcriptomics analysis can be remedied by combining the strengths of both approaches into a single platform, the U-STAR platform. The U-STAR platform is not only universal, but also has the potential to provide accurate quantitative information on transcript abundances for studies in systems biology. Only by providing such information will the true potential of these technologies, and of systems biology, be realized. The facility with which the target can be prepared, combined with the potential for rapid turnaround times for comprehensive transcriptomics, will enable its application within a clinical setting, and aid in the discovery of new pharmaceuticals and in the evaluation of treatment strategies, in the detection and monitoring of disease, and ultimately, in the prediction and prevention of illness, allowing this aspect of personal genomics to come of age.

Table 6.1. List of commercially available Type II enzymes.
Enzyme      5'-cleavage site (a)   Methylation sensitivity (b)   Common nt in probe
5'-overhangs:
Fat I       |CATG                  unk                           0
Tsp EI      |AATT                  unk                           0
Mbo I       |GATC                  2-m6A                         0
Cvi AII     C|ATG                  2-m6A                         1
HinP1 I     G|CGC                  2-m5C                         1
Hpa II      C|CGG                  1-m5C                         1
Mae II      A|CGT                  2-m5C                         1
Taq I       T|CGA                  4-m6A                         1
Cvi QI      G|TAC                  3-m6A                         1
Mae I       C|TAG                  unk                           1
Mse I       T|TAA                  2-m6A                         1
3'-overhangs:
Nla III     CATG|                  1-m5C, 2-m6A                  4
Tai I       ACGT|                  1-m6A, 2-m5C                  4
Hha I       GCG|C                  2-m5C                         3

6.4. REFERENCES

1. Dorris, D.R. et al. Oligodeoxyribonucleotide probe accessibility on a three-dimensional DNA microarray surface and the effect of hybridization time on the accuracy of expression ratios. BMC Biotechnol 3, 6 (2003).
2. Bhanot, G., Louzoun, Y., Zhu, J. & DeLisi, C. The importance of thermodynamic equilibrium for high throughput gene expression arrays. Biophys J 84, 124-135 (2003).
3. Dai, H., Meyer, M., Stepaniants, S., Ziman, M. & Stoughton, R. Use of hybridization kinetics for differentiating specific from non-specific binding to oligonucleotide microarrays. Nucleic Acids Res 30, e86 (2002).
4. Livshits, M.A. & Mirzabekov, A.D. Theoretical analysis of the kinetics of DNA hybridization with gel-immobilized oligonucleotides. Biophys J 71, 2795-2801 (1996).
5. Glazer, M. et al. Kinetics of oligonucleotide hybridization to photolithographically patterned DNA arrays. Anal Biochem 358, 225-238 (2006).
6. Bishop, J., Blair, S. & Chagovetz, A.M. A competitive kinetic model of nucleic acid surface hybridization in the presence of point mutants. Biophys J 90, 831-840 (2006).
7. Zhang, Y., Hammer, D.A. & Graves, D.J. Competitive hybridization kinetics reveals unexpected behavior patterns. Biophys J 89, 2950-2959 (2005).
8. Peeva, V.K., Lynch, J.L., Desilva, C.J. & Swanson, N.R. Evaluation of automated and conventional microarray hybridization - a question of data quality and best practice? Biotechnol Appl Biochem (2007).
9. McQuain, M.K. et al. Chaotic mixer improves microarray hybridization. Anal Biochem 325, 215-226 (2004).
10.
Liu, J., Williams, B.A., Gwirtz, R.M., Wold, B.J. & Quake, S. Enhanced signals and fast nucleic acid hybridization by microfluidic chaotic mixing. Angew Chem Int Ed Engl 45, 3618-3623 (2006).
11. Toegl, A., Kirchner, R., Gauer, C. & Wixforth, A. Enhancing results of microarray hybridizations through microagitation. Journal of biomolecular techniques: JBT 14, 197-204 (2003).
12. Bishop, J., Blair, S. & Chagovetz, A. Convective flow effects on DNA biosensors. Biosensors & bioelectronics 22, 2192-2198 (2007).
13. Lane, M.J. et al. The thermodynamic advantage of DNA oligonucleotide 'stacking hybridization' reactions: energetics of a DNA nick. Nucleic Acids Res 25, 611-617 (1997).
14. Riccelli, P.V. et al. Hybridization of single-stranded DNA targets to immobilized complementary DNA probes: comparison of hairpin versus linear capture probes. Nucleic Acids Res 29, 996-1004 (2001).
15. Broude, N.E., Woodward, K., Cavallo, R., Cantor, C.R. & Englert, D. DNA microarrays with stem-loop DNA probes: preparation and applications. Nucleic Acids Res 29, E92 (2001).
16. Bonnet, G., Tyagi, S., Libchaber, A. & Kramer, F.R. Thermodynamic basis of the enhanced specificity of structured DNA probes. Proc Natl Acad Sci USA 96, 6171-6176 (1999).
17. You, Y., Moreira, B.G., Behlke, M.A. & Owczarzy, R. Design of LNA probes that improve mismatch discrimination. Nucleic Acids Res 34, e60 (2006).
18. Timofeev, E. & Mirzabekov, A. Binding specificity and stability of duplexes formed by modified oligonucleotides with a 4096-hexanucleotide microarray. Nucleic Acids Res 29, 2626-2634 (2001).
19. Koshkin, A.A. Syntheses and base-pairing properties of locked nucleic acid nucleotides containing hypoxanthine, 2,6-diaminopurine, and 2-aminopurine nucleobases. J Org Chem 69, 3711-3718 (2004).
20. Rosenbohm, C. et al. LNA guanine and 2,6-diaminopurine. Synthesis, characterization and hybridization properties of LNA 2,6-diaminopurine containing oligonucleotides. Bioorg Med Chem 12, 2385-2396 (2004).
21.
Tikhomirova, A., Beletskaya, I.V. & Chalikian, T.V. Stability of DNA duplexes containing GG, CC, AA, and TT mismatches. Biochemistry 45, 10563-10571 (2006).
22. Blake, R.D. & Delcourt, S.G. Thermodynamic effects of formamide on DNA stability. Nucleic Acids Res 24, 2095-2103 (1996).
23. Ng, J.K., Feng, H. & Liu, W.T. Rapid discrimination of single-nucleotide mismatches using a microfluidic device with monolayered beads. Anal Chim Acta 582, 295-303 (2007).
24. Edman, C.F. et al. Electric field directed nucleic acid hybridization on microchips. Nucleic Acids Res 25, 4907-4914 (1997).
25. Heaton, R.J., Peterson, A.W. & Georgiadis, R.M. Electrostatic surface plasmon resonance: direct electric field-induced hybridization and denaturation in monolayer nucleic acid films and label-free discrimination of base mismatches. Proc Natl Acad Sci USA 98, 3701-3704 (2001).
26. Fixe, F. et al. Immobilization and hybridization by single sub-millisecond electric field pulses, for pixel-addressed DNA microarrays. Biosensors & bioelectronics 19, 1591-1597 (2004).
27. Sosnowski, R.G., Tu, E., Butler, W.F., O'Connell, J.P. & Heller, M.J. Rapid determination of single base mismatch mutations in DNA hybrids by direct electric field control. Proc Natl Acad Sci USA 94, 1119-1123 (1997).
28. Heller, M.J., Forster, A.H. & Tu, E. Active microelectronic chip devices which utilize controlled electrophoretic fields for multiplex DNA hybridization and other genomic applications. Electrophoresis 21, 157-164 (2000).
29. Bengtsson, H., Jonsson, G. & Vallon-Christersson, J. in BMC Bioinformatics, Vol. 5 (2004).
30. Lyng, H. et al. Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction. BMC Genomics 5, 10 (2004).
31. Shi, L. et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics 6 Suppl 2, S11 (2005).
32. Borden, J.R., Paredes, C.J. & Papoutsakis, E.T.
Diffusion, mixing, and associated dye effects in DNA-microarray hybridizations. Biophys J 89, 3277-3284 (2005).
33. Dudley, A.M., Aach, J., Steffen, M.A. & Church, G.M. Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci USA 99, 7554-7559 (2002).
34. Fish, D.J., Horne, M.T., Searles, R.P., Brewood, G.P. & Benight, A.S. Multiplex SNP discrimination. Biophys J 92, L89-91 (2007).
35. Hesse, J. et al. RNA expression profiling at the single molecule level. Genome Res 16, 1041-1045 (2006).
36. Han, M., Gao, X., Su, J.Z. & Nie, S. Quantum-dot-tagged microbeads for multiplexed optical coding of biomolecules. Nat Biotechnol 19, 631-635 (2001).
37. Nicewarner-Pena, S.R. et al. Submicrometer metallic barcodes. Science 294, 137-141 (2001).
38. Moal, E.L. et al. Enhanced fluorescence cell imaging with metal-coated slides. Biophys J 92, 2150-2161 (2007).
39. Redkar, R.J. et al. Signal and sensitivity enhancement through optical interference coating for DNA and protein microarray applications. Journal of biomolecular techniques: JBT 17, 122-130 (2006).
40. Sabanayagam, C.R. & Lakowicz, J.R. Increasing the sensitivity of DNA microarrays by metal-enhanced fluorescence using surface-bound silver nanoparticles. Nucleic Acids Res 35, e13 (2007).
41. Li, M. & Selvin, P.R. Amine-reactive forms of a luminescent diethylenetriaminepentaacetic acid chelate of terbium and europium: attachment to DNA and energy transfer measurements. Bioconjug Chem 8, 127-132 (1997).
42. Sammes, P.G. & Yahioglu, G. Modern bioassays using metal chelates as luminescent probes. Natural product reports 13, 1-28 (1996).
43. Vereb, G., Jares-Erijman, E., Selvin, P.R. & Jovin, T.M. Temporally and spectrally resolved imaging microscopy of lanthanide chelates. Biophys J 74, 2210-2222 (1998).
44. Selvin, P.R. Principles and biophysical applications of lanthanide-based probes.
Annual review of biophysics and biomolecular structure 31, 275-302 (2002).
45. Peters, B.A. et al. Large-scale identification of novel transcripts in the human genome. Genome Res 17, 287-292 (2007).
46. Pleasance, E.D., Marra, M.A. & Jones, S.J. Assessment of SAGE in transcript identification. Genome Res 13, 1203-1215 (2003).
47. Konu, O. & Li, M.D. Correlations between mRNA expression levels and GC contents of coding and untranslated regions of genes in rodents. J Mol Evol 54, 35-41 (2002).
48. David, L. et al. A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA 103, 5320-5325 (2006).
49. Kuo, B.Y. et al. SAGE2Splice: unmapped SAGE tags reveal novel splice junctions. PLoS Comput Biol 2, e34 (2006).

APPENDIX

A.1. ditag_formation.pl

#!/usr/bin/perl
# This program generates a ligation product profile of SAGE tags by assigning a
# probability of joining to each potential site of ligation. A fixed population of
# tags is combined randomly to completion or according to the parameters assigned
# by the user. Products are then interrogated to determine the number of products
# that are capable of generating a ditag amplification product of the appropriate
# size and orientation.
########################################################################### # # SUBROUTiNE & MODULE LIST # ######################################################################### use strict; use warnings; use Math::Random::MT qw(arand rand); sub ligate_population; sub join_fraga; sub ligation; sub ampliflable_ditags; sub random_reverse_elements; sub roundint; sub revcomp; sub complementseq; sub reverseseq; sub fisher_yates_shuffle; sub HASHdata_normalize; sub HASHdata_stats; sub printHASl-l; sub printHASH_table; sub fiizzify_vahse; sub printARRAY;  sub timestamp; sub datestamp; ########################################################################### # PREAMBLE # # ########################################################################### my $endtime  =  join ‘J, datestamp, timestamp;  # tag structures 5’3’---—3’5’ (1st position sense, 2nd position anti) 4 rag HNHP/PHNH 4 ditag HNHP.PHNH 4 unblocked tag HHHP/PHHH 4 allowable ligation rules: HP.PH, HP.HH, HH.PH 4 unblocked ditags: HHHP.PI-IHH, HNHP.PHHH, HHHP.HHHP, PHHH.PHHH, print STDOUT “Enter number of tags introduced into ligation reaction:\n”; my Stood = <STDIN>; chomp Stood; print STDOUT “Enter number of replicates desired:\n”; my Sreplicates = <STDIN>; chomp $replicates;  181  print STDOUT “Enter fractional ranges for flU\-in (start,end,increment):\n”; my $fraction = <STDIN>; chomp $fraction; my ($flhl_stan,$flhl_end,$flll_increment) split I,!, $fraction; print STDOUT “Enter fractional ranges for FIP::PH ligation (start,end,increment):\n”; $fraction = <STDIN>; chomp $fraction; my ($lig_start,$lig_end,$ligjncrement) = split I,!, $fractioti;  print STDOUT “Enter fractional ranges for HP::HH ligation (atart,cnd,increment):\n”; $fraction = <STDIN>; chomp $fraction; my ($lig2_start,$lig2_end,$lig2_increment) = split 7,!, $fraction; print STDOUT “Enter a cutoff value of fractional monomer to halt simulation:\n”; my $eutoffs <STDIN>; chomp $cutoffs; my ($eutofCvalue,$var) = split I,!, $cutoffs; print STDOUT 
“Enter filehandle to save results to:\n”; my $fh = <STD1N>; chomp $fh; my $flli = $611_start; my $conditions =  my $i = $611_start; while ($i < $611_end) my $j while  =  ($j  $lig_start; < $lig_end)  my $k = $lig2_atart; while ($k < $lig2_encl) push @$conclitiona, (join $k + $lig2_increment;  Si + $i  +  “::“,  $i/100,$j/100,$k/100);  Slig_increment;  $flli_increment  ########################################################################### # # MAIN PROGRAM # ########################################################################### #sol haven molecules of tags (say 10 ng 0.28 pmol 168.97 x 10e9) # start off with 1 tag, and continue to build until binding rules cannot be found, # then save the last build. —  —  # Some data of relevance: # One cohesive-end ligation unit is defined as the amount of enzyme required to it give 50% ligation of HindlU fragments of lambda DNA in 30nrin at lCC in 2( 1 of it the assay mixture: it 50mM Tris-HCI it (pH 7.5), 10mM MgCI2, 10mM Dfl, 1mM ATh, 25sg!m1 BSA and a 5’-DNA termini it concentration of 0.12 FIM (300gg/ml).  182  20s1 = 2.4 pmol > 1.2 pmol “sites” 5 # 5’-end termini = 0.12sM # 50% ligation > 0.6 pmol sites in 30 mm. #3.61 x 1(Y’11 molecules joined in 30 minor 1.44 x l0”12 molecules in 2 hrs by I # CLU #1 Weiss -‘300 CEU >5 Weiss 1500 CEU. 
# starting material is 0.15 pmol  my $condition; my ($global...multimer_list,$distribution,$recovered)  =  # The $condition variable reflects the input values for the simulation foreach $condition (@$conditions) my ($f_fllled,$p_HPPH,$p_HPHH)  =  split  I::!, Scondition;  print STDOUT “Applying (filled of $fJIlled and (p_I-IPPH,p_HPHH) of ($p_HPPH,Sp_HPHH):\n”; my ($tag..pool,$sizes) = my ($disrribution_list,$reeovered_list,$local_multimer_list) = my $n = 1; while ($n  <  {};  $replieates)  print STDOUT “Performing replicate $n\n”; # create an ARRAY containing frill population of monomers used for simulation # with a fraction modified according to the conditions set my $tag_pool = [(“HHXHP”) x eval{ roundint($f_filled*$total)  }, (“HNXFIP”) x eval{ roundint((I$f_fllled)*$total) }];  # set a cut-off value to terminate the ligation reaction based on the # densitometric data obtained for the amouot of monomers. Otherwise, the # number of steps could be fixed according to the activity of T4 DNA ligase in # cohesive-end units as a basis for the number of ligations in 2 hrs. This # would assume that the rate limiting step is the collision and not the # joining reaction. my $err = roundint(rand) ? rand(Svar) : -rand($var); my $eutoff $cutoff_value? ($eutofCvalue + $err)/l00 :0; # the full set of tags generated is then combined according to various joining # rules ($tag.pool,$siaes)  =  ligate...population($tag...pool,$p_HPPH,$p_HPHH$cutoff);  # Now, the lengths of all ligation products are determined to obtain the size # distributions of ligation products. my $size_distribution for $i (0..$#$sizes) $local_multimer_list-> {$aizes-> [Si]) ++; $global_multimer_list-> {$sizes->[$i]} ++; $size_distribution-> {$sizes->[$i]) ++;  # From the list of obtained ligation products, the number of products with # amplifiable ditags are determined.  
        my $amplifiable = [map { amplifiable_ditags($_) } @$tag_pool];
        my $recovered = 2*( scalar (grep {$_} @$amplifiable) )/$total;
        $distribution_list->{$n} = $size_distribution;

        open (OUTPUT,">> $fh\_$endtime\_recovered.out");
        select OUTPUT;
        print "$f_filled\t$p_HPPH\t$p_HPHH\t$recovered\n";
        close OUTPUT;
        $n++;
    }

    # Data HASHes are rearranged to HASH_table format for generating summary tables
    my $summary_table = {};
    my $multimers = [sort {$a <=> $b} keys %$local_multimer_list];
    my $multiple;
    foreach $multiple (@$multimers) {
        foreach $n (keys %$distribution_list) {
            push @{ $summary_table->{$multiple} }, $distribution_list->{$n}->{$multiple};
        }
    }
    $distribution->{$condition} = HASHdata_stats(HASHdata_normalize($summary_table));

    undef %$summary_table;
}
my $multimers = [sort {$a <=> $b} keys %$global_multimer_list];
undef %$global_multimer_list;

###########################################################################
#
# OUTPUT
#
###########################################################################
open (OUTPUT,"> $fh\_$endtime.out");
select OUTPUT;
print "Size distribution table:\n";
print "Multimer\t", join "\t", @$conditions, "\n";
my ($multimer,$entry);
foreach $multimer (@$multimers) {
    print "$multimer";
    foreach $condition (@$conditions) {
        if (exists $distribution->{$condition}->{$multimer}) {
            foreach $entry (@{ $distribution->{$condition}->{$multimer} }) {
                $entry ? (print "\t$entry") : (print "\t\-");
            }
        }
        else {
            print "\t\-";
        }
    }
    print "\n";
}

###########################################################################
#
# SUBROUTINE
#
###########################################################################
sub ligate_population {
    my ($tag_pool,$p_HPPH,$p_HPHH,$cutoff) = @_;
    my $i = 1;
    my $total = scalar @$tag_pool;
    #my $max_iterations = $total*100;
    my $max_iterations = 1000;
    my $sizes = [];
    while ($i <= $max_iterations) {
        #print STDOUT "iteration $i of $max_iterations\n";
        # randomize the orientation of the population and mix well
        $tag_pool = random_reverse_elements(fisher_yates_shuffle($tag_pool));

        # now perform ligations between adjacent entries
        my $ligation_pool = [];
        while (@$tag_pool) {
            my $build = join_frags( (shift @$tag_pool),(shift @$tag_pool),$p_HPPH,$p_HPHH );
            (push @$ligation_pool, @$build) if ($build);
        }

        # the number of monomers present in the array is counted to determine if
        # simulation should be halted assuming that number of steps is discounted.
        $tag_pool = $ligation_pool;
        $sizes = [map { length($_)/5 } @$tag_pool];
        my $monomers = (scalar (grep { $_ == 1 } @$sizes))/$total;
        last if ( $monomers < $cutoff );
        $i++;
    }
    print STDOUT "Done\n";
    return ($tag_pool,$sizes);
}

###########################################################################
sub join_frags {
    my ($a,$b,$c,$d) = @_;
    # $a refers to the first element of pair to join
    # $b refers to the second element of pair to join
    # $c refers to the probability of joining p_HPPH
    # $d refers to the probability of joining p_HHPH/HPHH
    return [$a] if (!$b);

    my $test = "$a.$b";
    my $ligation = ligation($test,$c,$d);
    my $products = $ligation ? [$ligation] : [$a,$b];

    return $products;
}

sub ligation {
    my ($a,$b,$c) = @_;
    # the first pattern is illegible in the source listing; HP.PH is inferred
    # from the p_HPPH probability that it is tested against
    if ($a =~ /(HP\.PH)/) {
        (rand() > $b) ? return 0 : return $a;
    }
    elsif ($a =~ /(HP\.HH)|(HH\.PH)/) {
        (rand() > $c) ? return 0 : return $a;
    }
    else {
        return 0;
    }
}

sub printHASH {
    my $a = $_[0];
    open (TEST,">> test.txt");
    select (TEST);
    print "\n";
    my $key;
    foreach $key (keys %$a) {
        print "$key\t$a->{$key}\n";
    }
    print "\n";
    close (TEST);
}

sub printARRAY {
    my $a = $_[0];
    open (TEST,">> test.txt");
    select (TEST);
    print "\n";
    my $i;
    for ($i=0; $i<=$#$a; $i++) {
        print "$a->[$i]\n";
    }
    print "\n";
    close (TEST);
}

sub timestamp {
    my ($sec,$min,$hr) = (localtime)[0..2];
    return (sprintf("%02d%02d%02d", $hr,$min,$sec));
}

sub datestamp {
    my ($day,$month,$year) = (localtime)[3..5];
    return (sprintf("%02d%02d%04d", $month+1,$day,$year+1900));
}

sub roundint {
    my $a = $_[0];
    return (sprintf("%.0f",$a));
}

sub reverseseq {
    my $a = $_[0];
    return (join '', (reverse(split //, $a)));
}

sub complementseq {
    my $a = $_[0];
    $a =~ tr/ACGT/TGCA/;
    return $a;
}

###########################################################################
sub revcomp {
    my $a = $_[0];
    return (reverseseq(complementseq($a)));
}

###########################################################################
sub random_reverse_elements {
    my $a = $_[0];
    my $i;
    for ($i=0; $i<=$#$a; $i++) {
        $a->[$i] = roundint(rand) ? revcomp( $a->[$i] ) : next;
    }
    return $a;
}

# fisher_yates_shuffle( \@array ):
# generate a random permutation of @array in place
sub fisher_yates_shuffle {
    #print STDOUT "Shuffling array ";
    my $array = shift;
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand ($i+1);
        @$array[$i,$j] = @$array[$j,$i];
    }
    #print STDOUT "Done\.\n";
    return $array;
}

sub amplifiable_ditags {
    my $a = $_[0];
    my $amplifiable = 0;
    $amplifiable++ if ($a =~ /(HPPH)/g);
    return ($amplifiable);
}

sub HASHdata_normalize {
    my $a = $_[0];
    my ($key,$j);
    my $col_totals = [];
    foreach $key (keys %$a) {
        for ($j=0; $j<=$#{$a->{$key}}; $j++) {
            ($col_totals->[$j] += $a->{$key}[$j]) if ($a->{$key}[$j]);
        }
    }
    foreach $key (keys %$a) {
        for ($j=0; $j<=$#{$a->{$key}}; $j++) {
            ($a->{$key}[$j] = $a->{$key}[$j]/$col_totals->[$j]) if ($a->{$key}[$j]);
        }
    }
    return $a;
}

sub HASHdata_stats {
    my $a = $_[0];
    my ($row_avgs,$SSE_avgs,$stats) = (0,0,{});
    my ($key,$j);

    foreach $key (keys %$a) {
        my $row_avgs = 0;
        for ($j=0; $j<=$#{$a->{$key}}; $j++) {
            $row_avgs += $a->{$key}[$j] ? $a->{$key}[$j]/( $#{$a->{$key}} + 1 ) : next;
        }
        my $SSE_avgs = 0;
        for ($j=0; $j<=$#{$a->{$key}}; $j++) {
            $SSE_avgs += $a->{$key}[$j] ? sqrt( ($a->{$key}[$j] - $row_avgs)**2/( $#{$a->{$key}} + 1 ) ) : next;
        }
        push @{ $stats->{$key} }, (join "\t", $row_avgs, $SSE_avgs);
    }
    return ($stats);
}

sub printHASH_table {
    my $a = $_[0];
    my ($key,$entry);
    foreach $key (keys %$a) {
        print "$key";
        foreach $entry (@{ $a->{$key} }) {
            print "\t$entry";
        }
        print "\n";
    }
}

sub fuzzify_value {
    my ($a,$b) = @_;
    # $a refers to the number to center deviates around
    # $b refers to the expected %error around $a
    my $error_magnitude = roundint($a*$b/100);
    my $direction = int(rand(2));
    my $noise = int( rand($error_magnitude) );
    return ($direction ? ($a+$noise) : ($a-$noise));
}

A.2.
ditag_PCR_bias.pl
#!/usr/bin/perl
# This program parses tag-based transcriptome data and applies a series of
# calculations to determine the impact of structural bias (segregation into homo
# versus hetero-ditags) and abundance bias (as inferred from the decrease in
# efficiencies observed through real-time PCR data). Data is output in tabular
# format to give relative abundances pre- and post-amplification

###########################################################################
#
# SUBROUTINE & MODULE LIST
#
###########################################################################
use PDL;
use Benchmark;
$PDL::BIGPDL = 1;

use strict;
use warnings;

sub recursive_dirfile;
sub savefile_handle;
sub max_array;
sub max_matrix;
sub sum_array;
sub revcomp;
sub complementseq;
sub reverseseq;
sub timestamp;
sub datestamp;
sub printMatrix;
sub printHASH_table;
sub create_ditags;
sub ditag_amplification_simple;
sub ditag_amplification_iterative;
sub tag_distribution;
sub inverse;

###########################################################################
#
# PREAMBLE
#
###########################################################################
print STDOUT "Enter directory containing datasets to parse:\n";
my $directory = <STDIN>;
chomp $directory;
print STDOUT "Enter text that uniquely identifies set of files to parse:\n";
my $extension = <STDIN>;
chomp $extension;
print STDOUT "Enter number of cycles to amplify:\n";
my $cycles = <STDIN>;
chomp $cycles;

my $files = recursive_dirfile($directory,$extension);
my $file;

my $timehandle = join '_', datestamp, timestamp;

###########################################################################
#
# MAIN PROGRAM
#
###########################################################################
foreach $file (@$files) {
    # First, convert tag count data or artificial transcriptome data into a HASH table
    my $line = 0;
    my $cols;
    my $data_table = {};
    my @headers = ();
    my $ID;

    print STDOUT "\nCollating data from $file ";
    open (TAGFILE, "< $file") || die "Cannot open results file: $!\n";
    while (<TAGFILE>) {
        chomp $_;
        if (!$line) {
            ($ID,@headers) = split /\t/, $_;
            $line = 1;
        }
        else {
            my ($tag,@data) = split /\t/, $_;
            $data_table->{$tag} = [@data];
            $cols = $#data+1;
        }
    }
    close (TAGFILE);
    print STDOUT "Done.\n\n";

    # Next rearrange data so that data is in columnar format and remove zero
    # entries
    print STDOUT "Parsing dataset by header ID ";
    my $tag_list = [sort keys %$data_table];
    my ($i,$tag);
    my $summary = {};

    for $i (0..$#headers) {
        # filter out datasets that I don't want
        next if ( $headers[$i] =~ /STDDEV/ );
        foreach $tag (@$tag_list) {
            $data_table->{$tag}->[$i] ? (push @{$summary->{$headers[$i]}}, $data_table->{$tag}->[$i]) : next;
        }
    }
    undef %$data_table;
    undef @$tag_list;
    print STDOUT "Done\n";

    # Now I can proceed with each dataset for seeing effects of amplification. To
    # facilitate the use of large data matrices, I am going to use the PDL library
    my $data;
    my $dataset = {};
    my $saveFile;
    foreach $data (sort keys %$summary) {
        print STDOUT "\nProcessing $file\n\n";
        print STDOUT "Creating fractional distribution table for $data ";
        my $tags = $summary->{$data};
        my $total = sum_array($tags);
        $tags = [map {$_ / $total} @$tags];
        $tags = [sort {$b <=> $a} @$tags];
        $tags = [grep {$_} @$tags];
        print STDOUT "Done\n";

        # Calculate the segregation of tags into heterotag vs. homotag structures and
        # print the results to a table. In this case, the amplification bias should be 1
        # since I have not applied any amplification here.
        $saveFile = savefile_handle($file,"ditag_formation\_$data\_$timehandle","out");

        open (OUTPUT, "> $saveFile");
        select OUTPUT;

        my ($homo_ditags,$hetero_ditags) = tag_distribution($tags,$#$tags);

        print "\n original\t original\t amplified\t amplification_bias\t tags_hetero\t hetero_bias\t tags_homo\t homo_bias\n";
        for $i (0..$#$tags) {
            my $total_bias = 1;
            my $hetero_bias = $hetero_ditags->[$i]/$tags->[$i];
            my $homo_bias = $homo_ditags->[$i]/$tags->[$i];

            print "$tags->[$i]\t $tags->[$i]\t $tags->[$i]\t $total_bias\t $hetero_ditags->[$i]\t $hetero_bias\t $homo_ditags->[$i]\t $homo_bias\n";
        }
        close OUTPUT;

        # Now I can take the created ditag mixture and amplify it based on a deduced
        # efficiency function. Here I make no assumptions beyond what the data has
        # given me i.e. no interactions and not iterative since it should be reflected
        # in the derived parameters
        print STDOUT "Amplifying ditag mixtures ";

        my $amplified = ditag_amplification_simple($tags,$cycles,$#$tags);
        print STDOUT "Done\n";

        # The tag types are segregated from this final pool of "amplified" material.
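To illustrate the efficiency function that ditag_amplification_simple applies at this step, the sketch below evaluates y = 1.15708 + 1.17516*x^0.58579 for a few relative ditag abundances x and raises it to the number of cycles. It is an illustration only: the helper name cycle_efficiency, the example values of x and the cycle count are hypothetical and do not appear in the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: per-cycle amplification factor for a ditag whose
# abundance, relative to the most abundant ditag, is $x (0 <= $x <= 1).
# The constants are the ones quoted in ditag_amplification_simple.
sub cycle_efficiency {
    my ($x) = @_;
    return 1.15708 + 1.17516 * ($x ** 0.58579);
}

my $cycles = 10;    # illustrative cycle count, not a value from the thesis
foreach my $x (0.001, 0.01, 0.1, 1) {
    my $y    = cycle_efficiency($x);
    my $fold = $y ** $cycles;      # fold amplification after $cycles cycles
    printf "x = %-5s  y = %.4f  fold after %d cycles = %.1f\n", $x, $y, $cycles, $fold;
}
```

Because the per-cycle factor grows with relative abundance, the gap between abundant and rare ditags widens with every additional cycle, which is the abundance bias the program tabulates.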
        $saveFile = savefile_handle($file,"PCRbias$cycles\_$data\_$timehandle","out");
        open (OUTPUT,"> $saveFile");
        select OUTPUT;
        ($homo_ditags,$hetero_ditags) = tag_distribution($amplified,$#$amplified);

        print "\n original\t original\t amplified\t amplification_bias\t tags_hetero\t hetero_bias\t tags_homo\t homo_bias\n";
        for $i (0..$#$tags) {
            my $total_bias = $amplified->[$i]/$tags->[$i];
            my $hetero_bias = $hetero_ditags->[$i]/$tags->[$i];
            my $homo_bias = $homo_ditags->[$i]/$tags->[$i];
            print "$tags->[$i]\t $tags->[$i]\t $amplified->[$i]\t $total_bias\t $hetero_ditags->[$i]\t $hetero_bias\t $homo_ditags->[$i]\t $homo_bias\n";
        }
        close OUTPUT;
        undef @$tags;
        undef @$amplified;
        undef @$homo_ditags;
        undef @$hetero_ditags;

        print STDOUT "Done\n";
    }
}

###########################################################################
#
# SUBROUTINES
#
###########################################################################
sub recursive_dirfile {
    my ($a,$b) = @_;
    use File::Find;
    my @filelist = ();

    # collect every file found below the starting directory
    find (sub { push @filelist, $File::Find::name; }, $a);

    my $files = [];
    foreach $_ (@filelist) {
        push @$files, $_ if (/$b/);
    }
    print STDOUT "Parsing the following files:\n";
    foreach $_ (@$files) {
        print STDOUT "$_\n";
    }
    return $files;
}

sub savefile_handle {
    my ($files,$handle,$extension) = @_;

    # strip the directory path and the file extension to obtain the base name
    my $aliasFiles = $files;
    $aliasFiles =~ s/(.*\/)//g;
    ($aliasFiles =~ s/\.\w+//ig) if ($aliasFiles =~ /\./);

    # keep only the directory portion of the path (this pattern is illegible in
    # the source listing and is reconstructed from its apparent intent)
    my $directory = $files;
    $directory =~ s/(.*\/).*$/$1/;

    return ($directory.$aliasFiles."\_".$handle."\.".$extension);
}

sub timestamp {
    my ($sec,$min,$hr) = (localtime)[0..2];
    return (sprintf("%02d%02d%02d", $hr,$min,$sec));
}

sub datestamp {
    my ($day,$month,$year) = (localtime)[3..5];
    return (sprintf("%02d%02d%04d", $month+1,$day,$year+1900));
}

sub reverseseq {
    my $a = $_[0];
    return (join '', (reverse(split //, $a)));
}

sub complementseq {
    my $a = $_[0];
    $a =~ tr/ACGT/TGCA/;
    return $a;
}

sub revcomp {
    my $a = $_[0];
    return (reverseseq(complementseq($a)));
}

sub sum_array {
    my $a = $_[0];
    my $sum = 0;
    my $i;
    for $i (0..$#$a) {
        $sum += $a->[$i];
    }
    return $sum;
}

sub printARRAY {
    my $a = $_[0];
    open (TEST,">> test.txt");
    select (TEST);
    print "\n";
    my $i;
    for ($i=0; $i<=$#$a; $i++) {
        print "$a->[$i]\n";
    }
    print "\n";
    close (TEST);
}

sub printMatrix {
    my $a = $_[0];
    my ($i,$j);
    for ($i=0; $i<=$#$a; $i++) {
        for ($j=0; $j<=$#{$a->[$i]}; $j++) {
            ($a->[$i][$j]) ? (print "$a->[$i][$j]\t") : (print "\-\t");
        }
        print "\n";
    }
}

sub printHASH_table {
    my $a = $_[0];
    my $key;
    foreach $key (keys %$a) {
        print "$key\t";
        print join "\t", @{$a->{$key}};
        print "\n";
    }
    print "\n";
}

sub create_ditags {
    my $a = $_[0];
    my $i;
    my $ditags = [];
    for $i (0..$#$a) {
        my $p = $a->[$i];
        $ditags->[$i] = [map { $_ * $p } @$a];
    }
    return $ditags;
}

sub ditag_amplification_simple {
    my ($tags,$cycles,$array_size) = @_;
    my $i;
    my $max = (max_array($tags))**2;
    my $amplified = [];
    # each tag type and its frequency is put through the amplification efficiency
    # function that was derived experimentally by applying
    # y = 1.15708 + 1.17516*x^0.58579
    for $i (0..$array_size) {
        my $p = $tags->[$i];
        $amplified->[$i] = sum_array([ map { ($_ != $p) ? (($p*$_)*((1.15708+1.17516*((($p*$_)/$max)**0.58579))**$cycles)) : 0 } @$tags ]);
    }
    my $total = sum_array($amplified);

    # Now I normalize the obtained abundances to obtain the relative frequencies of
    # tags obtained post-amplification
    for $i (0..$array_size) {
        $amplified->[$i] = $amplified->[$i] / $total;
    }
    return ($amplified);
}

###########################################################################
sub tag_distribution {
    my ($tags,$array_size) = @_;
    my ($homo_ditags,$hetero_ditags) = ([],[]);
    my $i;
    for $i (0..$array_size) {
        # the right-hand side of the homo-ditag term is missing from the source
        # listing; the squared tag frequency is assumed here because it is
        # consistent with the hetero-ditag and 1 - p_homo terms that follow
        $homo_ditags->[$i] = $tags->[$i]**2;
        $hetero_ditags->[$i] = $tags->[$i] - $homo_ditags->[$i];
    }
    my $p_homo = sum_array($homo_ditags);
    my $p_hetero = 1 - $p_homo;

    $hetero_ditags = [map {$_/$p_hetero} @$hetero_ditags];
    $homo_ditags = [map {$_/$p_homo} @$homo_ditags];
    return ($homo_ditags,$hetero_ditags);
}

sub max_array {
    my $a = $_[0];
    my $max = 0;
    my $i;
    for $i (0..$#$a) {
        my $maximum = $a->[$i];
        $max = ($maximum > $max) ? $maximum : $max;
    }
    return $max;
}

sub max_matrix {
    my $a = $_[0];
    my $max = 0;
    my $i;
    for $i (0..$#$a) {
        my $maximum = max_array($a->[$i]);
        $max = ($maximum > $max) ? $maximum : $max;
    }
    return $max;
}

A.3. Tm_calculate.pl
#!/usr/bin/perl
# This program calculates the predicted Tm's for a list of sequences with IDs based
# on the universal parameters of SantaLucia (1998) [PNAS 95, 1460-1465], and
# includes a salt correction factor based on Owczarzy (2004). This also uses the
# corrections of Giesen (1999) and McTigue (2004) for LNAs.
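As a rough illustration of the two-state melting temperature calculation this program performs in Tm_calc, the sketch below converts a duplex enthalpy and entropy into a Tm using Tm = 1000*dH/(dS + R*ln(Ct)) - 273.15, with the sodium correction to dS applied in the same form as Na_correction. The helper names, the dH and dS values, the sequence length and the concentrations are made-up placeholders, not nearest-neighbour sums from the cited parameter set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $R = 1.987;   # gas constant in cal/(K mol), as used in Tm_calc

# Hypothetical helper: melting temperature in degrees C from
# dH (kcal/mol), dS (cal/(K mol)) and effective strand concentration Ct (M).
sub tm_celsius {
    my ($dH, $dS, $Ct) = @_;
    return 1000 * $dH / ($dS + $R * log($Ct)) - 273.15;
}

# Entropy salt correction of the form used in Na_correction:
# dS + 0.368 * (N - 1) * ln[Na+] for a sequence of N bases.
sub na_corrected_dS {
    my ($dS, $length, $Na) = @_;
    return $dS + 0.368 * ($length - 1) * log($Na);
}

# Placeholder inputs, for illustration only
my ($dH, $dS) = (-150.0, -410.0);
my $length    = 20;      # bases
my $Ct        = 1e-6;    # effective strand concentration, M
my $Na        = 0.05;    # sodium concentration, M

my $tm_1M = tm_celsius($dH, $dS, $Ct);
my $tm_Na = tm_celsius($dH, na_corrected_dS($dS, $length, $Na), $Ct);
printf "Tm at 1 M Na+:    %.1f C\n", $tm_1M;
printf "Tm at %.2f M Na+: %.1f C\n", $Na, $tm_Na;
```

Lowering the sodium concentration below 1 M makes the corrected entropy more negative, so the predicted Tm drops, which is why the program applies the correction before computing the Tm.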
###########################################################################
#
# SUBROUTINE & MODULE LIST
#
###########################################################################
use strict;
#use warnings;

sub printHASH;
sub printARRAY;
sub savefile_handle;
sub revcomp;
sub reverseseq;
sub complementseq;
sub NN_parameters;
sub LNA_corrections;
sub PNA_corrections;
sub Tm_calc;
sub Na_correction;
sub NN_calculate;
sub NN_sequence;
sub NN_count;
sub vectorSize;
sub dimensionMatrix;
sub scalarMultiply;
sub normVector;
sub multiplyMatrix;
sub lengthVector;
sub dotProduct;
sub transposeMatrix;
sub pythag;
sub SIGN;
sub MAX;
sub MIN;

###########################################################################
#
# PREAMBLE
#
###########################################################################
print "Enter path and filename containing sequences:\n";
my $seqFile = <STDIN>;
chomp $seqFile;
print "Enter search pattern for desired files:\n";
my $extension = <STDIN>;
chomp $extension;

my @filelist;
use File::Find;
sub process_file { push @filelist, $File::Find::name; }
find (\&process_file, $seqFile);

my $files = [];
foreach $_ (@filelist) {
    push @$files, $_ if (/$extension/);
}
@filelist = ();
print STDOUT "Processing the following files:\n";
printARRAY ($files);
print STDOUT "Enter sequence type (0\=DNA,1\=LNA,2\=PNA):\n";
my $type = <STDIN>;
chomp $type;
print STDOUT "Enter Na concentration:\n";
my $Na = <STDIN>;
chomp $Na;
print "Enter strand concentrations in molarity (delimit with a comma)\n";
my $strandConc = <STDIN>;
chomp $strandConc;
print "Enter range (min/max) for Tm values to output (delimit with a comma):\n";
my $range = <STDIN>;
chomp $range;
my ($Tm_min,$Tm_max) = split /\,/, $range;
print "Extract sequences within $Tm_min and $Tm_max Tm's?\n";
my $extract = <STDIN>;
chomp $extract;

###########################################################################
#
# MAIN PROGRAM
#
###########################################################################
my $file;
foreach $file (@$files) {
    print STDOUT "Processing file $file ";

    my $saveFile = savefile_handle($file,"Tm","out");
    my $extracted = [];
    open (OUTPUT_FILE, "> $saveFile");
    select(OUTPUT_FILE);
    open (FILE,"< $file") || die "can't open seqfile: $!\n";
    print "seqID\t5'-sequence-3'\tTm(C)\tdH\tdS\tdG\n";

    my $ID = 0;
    while (<FILE>) {
        chomp $_;
        my $sequence = $_;
        next if ($sequence !~ /[ACGT]/g);
        $ID++;

        my ($TmC,$dH,$dS,$dG) = Tm_calc($sequence,$strandConc,$type,$Na);
        if ( ($TmC < $Tm_max) && ($TmC > $Tm_min) ) {
            print "$ID\t$sequence\t$TmC\t$dH\t$dS\t$dG\n";
            push @$extracted, (join "\t", $ID,$sequence);
        }
    }
    close FILE;
    close (OUTPUT_FILE);
    $saveFile = savefile_handle($file,"Tm_extracted","out");

    open (EXTRACTED,"> $saveFile");
    select (EXTRACTED);
    foreach $_ (@$extracted) {
        print "$_\n";
    }
    close (EXTRACTED);

    print STDOUT "Done\n";
}

###########################################################################
#
# SUBROUTINES
#
###########################################################################
sub Tm_calc {
    my ($a,$b,$c,$d) = @_;
    # $a refers to sequence
    # $b refers to strand concentration
    # $c refers to type 0: DNA 1: LNA 2: PNA
    # $d refers to Na concentration
    my $revcomp = revcomp($a);

    my $Ct;
    my ($Ca,$Cb) = split /\,/,$b;

    if ( $a eq revcomp($a) ) {
        $Ct = $Ca;
    }
    else {
        $Ct = ($Ca == $Cb) ? $Ca/4 : ( MAX($Ca,$Cb)-MIN($Ca,$Cb) )/2;
    }

    my ($dH,$dS) = NN_calculate($a);
    my $dG = $dH - 298.15*$dS/1000;

    #print STDOUT "$a\t$dH\t$dS\t$dG\t";

    $dS = Na_correction($dS,$a,$d) if ( $d != 1 );
    ($dH,$dS,$dG) = LNA_corrections($dH,$dS,$dG,$a) if ($c == 1);

    #print STDOUT "$a\t$dH\t$dS\t$dG\t";
    my $TmK = 1000*($dH)/( ($dS) + 1.987*log($Ct) );
    my $TmC = $TmK - 273.15;

    $TmC = PNA_corrections($a,$TmC) if ($c == 2);

    #print STDOUT "$TmC\n";

    return ($TmC,$dH,$dS,$dG);
}

###########################################################################
sub LNA_corrections {
    my ($dH,$dS,$dG,$c) = @_;
    while ($c =~ /(?=(Aa))/sg) { $dH = $dH + 0.707; $dS = $dS + 2.477;  $dG = $dG - 0.092; }
    while ($c =~ /(?=(Ac))/sg) { $dH = $dH + 1.131; $dS = $dS + 4.064;  $dG = $dG - 0.122; }
    while ($c =~ /(?=(Ag))/sg) { $dH = $dH + 0.264; $dS = $dS + 2.613;  $dG = $dG - 0.561; }
    while ($c =~ /(?=(At))/sg) { $dH = $dH + 2.282; $dS = $dS + 7.457;  $dG = $dG - 0.007; }
    while ($c =~ /(?=(Ca))/sg) { $dH = $dH + 1.049; $dS = $dS + 4.32;   $dG = $dG - 0.27; }
    while ($c =~ /(?=(Cc))/sg) { $dH = $dH + 2.096; $dS = $dS + 7.996;  $dG = $dG - 0.457; }
    while ($c =~ /(?=(Cg))/sg) { $dH = $dH + 0.785; $dS = $dS + 3.709;  $dG = $dG - 0.332; }
    while ($c =~ /(?=(Ct))/sg) { $dH = $dH + 0.708; $dS = $dS + 4.175;  $dG = $dG - 0.666; }
    while ($c =~ /(?=(Ga))/sg) { $dH = $dH + 3.162; $dS = $dS + 10.544; $dG = $dG - 0.072; }
    while ($c =~ /(?=(Gc))/sg) { $dH = $dH - 0.36;  $dS = $dS - 0.251;  $dG = $dG - 0.414; }
    while ($c =~ /(?=(Gg))/sg) { $dH = $dH - 2.844; $dS = $dS - 6.68;   $dG = $dG - 0.7; }
    while ($c =~ /(?=(Gt))/sg) { $dH = $dH - 0.212; $dS = $dS + 0.073;  $dG = $dG - 0.194; }
    while ($c =~ /(?=(Ta))/sg) { $dH = $dH - 0.046; $dS = $dS + 1.562;  $dG = $dG - 0.563; }
    while ($c =~ /(?=(Tc))/sg) { $dH = $dH + 1.893; $dS = $dS + 6.685;  $dG = $dG - 0.208; }
    while ($c =~ /(?=(Tg))/sg) { $dH = $dH - 1.54;  $dS = $dS - 3.044;  $dG = $dG - 0.548; }
    while ($c =~ /(?=(Tt))/sg) { $dH = $dH + 1.528; $dS = $dS + 5.298;  $dG = $dG - 0.13; }
    while ($c =~ /(?=(aA))/sg) { $dH = $dH + 0.992; $dS = $dS + 4.065;  $dG = $dG - 0.396; }
    while ($c =~ /(?=(aC))/sg) { $dH = $dH + 2.89;  $dS = $dS + 10.576; $dG = $dG - 0.39; }
    while ($c =~ /(?=(aG))/sg) { $dH = $dH - 1.2;   $dS = $dS - 1.826;  $dG = $dG - 0.603; }
    while ($c =~ /(?=(aT))/sg) { $dH = $dH + 1.816; $dS = $dS + 6.863;  $dG = $dG - 0.309; }
    while ($c =~ /(?=(cA))/sg) { $dH = $dH + 1.358; $dS = $dS + 4.367;  $dG = $dG + 0.046; }
    while ($c =~ /(?=(cC))/sg) { $dH = $dH + 2.063; $dS = $dS + 7.565;  $dG = $dG - 0.404; }
    while ($c =~ /(?=(cG))/sg) { $dH = $dH - 0.276; $dS = $dS - 0.718;  $dG = $dG - 0.003; }
    while ($c =~ /(?=(cT))/sg) { $dH = $dH - 1.671; $dS = $dS - 4.07;   $dG = $dG - 0.409; }
    while ($c =~ /(?=(gA))/sg) { $dH = $dH + 0.444; $dS = $dS + 2.898;  $dG = $dG - 0.437; }
    while ($c =~ /(?=(gC))/sg) { $dH = $dH - 0.925; $dS = $dS - 1.111;  $dG = $dG - 0.535; }
    while ($c =~ /(?=(gG))/sg) { $dH = $dH - 0.943; $dS = $dS - 0.933;  $dG = $dG - 0.666; }
    while ($c =~ /(?=(gT))/sg) { $dH = $dH - 0.635; $dS = $dS - 0.342;  $dG = $dG - 0.52; }
    while ($c =~ /(?=(tA))/sg) { $dH = $dH + 1.591; $dS = $dS + 5.281;  $dG = $dG + 0.004; }
    while ($c =~ /(?=(tC))/sg) { $dH = $dH + 0.609; $dS = $dS + 3.169;  $dG = $dG - 0.396; }
    while ($c =~ /(?=(tG))/sg) { $dH = $dH + 2.165; $dS = $dS + 7.163;  $dG = $dG - 0.106; }
    while ($c =~ /(?=(tT))/sg) { $dH = $dH + 2.326; $dS = $dS + 8.051;  $dG = $dG - 0.212; }

    return ($dH,$dS,$dG);
}

sub PNA_corrections {
    my ($sequence,$TmC) = @_;

    # fraction of pyrimidines in the sequence
    my $fracPyr = 0;
    $fracPyr++ while ($sequence =~ /C|T/sig);
    $fracPyr = $fracPyr/length($sequence);

    return (20.9+0.81*($TmC)-27.2*$fracPyr+0.52*length($sequence));
}

###########################################################################
sub Na_correction {
    my ($dS,$a,$Na) = @_;
    return ($dS + 0.368*( length($a)-1 )*log($Na));
}

###########################################################################
sub reverseseq {
    my $a = $_[0];
    return ( scalar reverse $a );
}

###########################################################################
sub complementseq {
    my $a = $_[0];
    $a =~ tr/ACGTacgt/TGCAtgca/;
    return $a;
}

sub revcomp {
    my $a = $_[0];
    return (reverseseq(complementseq($a)));
}

sub printHASH {
    my $a = $_[0];
    my $key;
    foreach $key (keys %$a) {
        print "$key\t$a->{$key}\n";
    }
}

sub Na_correction_2 {
    my ($Tm,$a,$Na) = @_;

    # fraction of G and C bases, as the salt correction expects
    my $CGC = 0;
    while ($a =~ /(g|c)/sig) {
        $CGC++;
    }
    $CGC = $CGC/length($a);

    my $Tm_corr = 1/$Tm + (4.29*$CGC-3.95)*(1e-5*log($Na/1)) + 9.4*1e-6*( (log($Na))**2 - (log(1))**2 );
    return (1/$Tm_corr);
}

sub NN_calculate {
    my $a = $_[0];
    my $NN